Big Data: Python Word Count with HDFS Distribution via -cacheArchive

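-cacheArchive also distributes from HDFS, but what it distributes is a compressed archive, which may contain several directory levels and several files. Hadoop unpacks the archive on the task node, so the task sees it as a directory.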

1. The file The_Man_of_Property.txt looks like this (upload it to HDFS first):

hadoop fs -put The_Man_of_Property.txt  /mapreduce
Preface
“The Forsyte Saga” was the title originally destined for that part of it which is called “The Man of Property”; and to adopt it for the collected chronicles of the Forsyte family has indulged the Forsytean tenacity that is in all of us. The word Saga might be objected to on the ground that it connotes the heroic and that there is little heroism in these pages. But it is used with a suitable irony; and, after all, this long tale, though it may deal with folk in frock coats, furbelows, and a gilt-edged period, is not devoid of the essential heat of conflict. Discounting for the gigantic stature and blood-thirstiness of old days, as they have come down to us in fairy-tale and legend, the folk of the old Sagas were Forsytes, assuredly, in their possessive instincts, and as little proof against the inroads of beauty and passion as Swithin, Soames, or even Young Jolyon. And if heroic figures, in days that never were, seem to startle out from their surroundings in fashion unbecoming to a Forsyte of the Victorian era, we may be sure that tribal instinct was even then the prime force, and that “family” and the sense of home and property counted as they do to this day, for all the recent efforts to “talk them out.”
So many people have written and claimed that their families were the originals of the Forsytes that one has been almost encouraged to believe in the typicality of an imagined species. Manners change and modes evolve, and “Timothy’s on the Bayswater Road” becomes a nest of the unbelievable in all except essentials; we shall not look upon its like again, nor perhaps on such a one as James or Old Jolyon. And yet the figures of Insurance Societies and the utterances of Judges reassure us daily that our earthly paradise is still a rich preserve, where the wild raiders, Beauty and Passion, come stealing in, filching security from beneath our noses. As surely as a dog will bark at a brass band, so will the essential Soames in human nature ever rise up uneasily against the dissolution which hovers round the folds of ownership.
“Let the dead Past bury its dead” would be a better saying if the Past ever died. The persistence of the Past is one of those tragi-comic blessings which each new age denies, coming cocksure on to the stage to mouth its claim to a perfect novelty.
But no Age is so new as that! Human Nature, under its changing pretensions and clothes, is and ever will be very much of a Forsyte, and might, after all, be a much worse animal.
Looking back on the Victorian era, whose ripeness, decline, and ‘fall-of’ is in some sort pictured in “The Forsyte Saga,” we see now that we have but jumped out of a frying-pan into a fire. It would be difficult to substantiate a claim that the case of England was better in 1913 than it was in 1886, when the Forsytes assembled at Old Jolyon’s to celebrate the engagement of June to Philip Bosinney. And in 1920, when again the clan gathered to bless the marriage of Fleur with Michael Mont, the state of England is as surely too molten and bankrupt as in the eighties it was too congealed and low-percented. If these chronicles had been a really scientific study of transition one would have dwelt probably on such factors as the invention of bicycle, motor-car, and flying-machine; the arrival of a cheap Press; the decline of country life and increase of the towns; the birth of the Cinema. Men are, in fact, quite unable to control their own inventions; they at best develop adaptability to the new conditions those inventions create.
But this long tale is no scientific study of a period; it is rather an intimate incarnation of the disturbance that Beauty effects in the lives of men.
The figure of Irene, never, as the reader may possibly have observed, present, except through the senses of other characters, is a concretion of disturbing Beauty impinging on a possessive world.
One has noticed that readers, as they wade on through the salt waters of the Saga, are inclined more and more to pity Soames, and to think that in doing so they are in revolt against the mood of his creator. Far from it! He, too, pities Soames, the tragedy of whose life is the very simple, uncontrollable tragedy of being unlovable, without quite a thick enough skin to be thoroughly unconscious of the fact. Not even Fleur loves Soames as he feels he ought to be loved. But in pitying Soames, readers incline, perhaps, to animus against Irene: After all, they think, he wasn’t a bad fellow, it wasn’t his fault; she ought to have forgiven him, and so on!

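2. white_list1 and white_list2 serve as the whitelist: count how many times each word in the whitelist files appears in The_Man_of_Property.txt. (Pack the two files into white.tar.gz beforehand and upload it to HDFS.)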

white_list1 contains:

suitable
against
recent
across

white_list2 contains:

Age
on

Pack the files and upload the archive to HDFS:

tar czvf white.tar.gz white_list1 white_list2
hadoop fs -put  white.tar.gz  /mapreduce
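
To preview what a task will find once the archive has been unpacked, the packaging step can be reproduced locally. The following is a minimal sketch in Python 3, not part of the original job: the helper name build_and_unpack and the temporary directory layout are assumptions made for illustration, and the local directory ABC only stands in for the directory Hadoop would provide.

```python
import os
import tarfile
import tempfile

def build_and_unpack(files, link_name="ABC"):
    """Pack `files` into white.tar.gz, unpack it into `link_name`,
    and return the relative paths of the files found there."""
    work = tempfile.mkdtemp()
    src = os.path.join(work, "src")
    os.makedirs(src)
    # Write one word per line, like the whitelist files in the post.
    for name, words in files.items():
        with open(os.path.join(src, name), "w") as f:
            f.write("\n".join(words) + "\n")
    # Local equivalent of: tar czvf white.tar.gz white_list1 white_list2
    archive = os.path.join(work, "white.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for name in files:
            tar.add(os.path.join(src, name), arcname=name)
    # Stand-in for the unpacked directory a task would see.
    target = os.path.join(work, link_name)
    os.makedirs(target)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    found = []
    for root, _dirs, names in os.walk(target):
        for n in names:
            found.append(os.path.relpath(os.path.join(root, n), target))
    return sorted(found)

unpacked = build_and_unpack({
    "white_list1": ["suitable", "against", "recent", "across"],
    "white_list2": ["Age", "on"],
})
print(unpacked)
```

Because the two files were added at the top level of the archive, they appear directly under the unpacked directory; an archive with nested directories would need the recursive walk used by the mapper.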

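3. The map code is as follows. Approach: (1) walk the directory to collect the paths of all files; (2) read the contents of the white_list files; (3) filter the input against them.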

#!/usr/bin/python
import sys
import os

def read_dir_file(file_dir, dir):
    # Recursively collect the paths of all regular files under dir.
    for f1 in os.listdir(dir):
        tmp_path = os.path.join(dir, f1)
        if not os.path.isdir(tmp_path):
            file_dir.append(tmp_path)
        else:
            read_dir_file(file_dir, tmp_path)
    return file_dir

def read_local_file(file_dir):
    # Load every non-empty line of every whitelist file into one set.
    word_set = set()
    for file in file_dir:
        with open(file, 'r') as file_in:
            for line in file_in:
                word = line.strip()
                if word:
                    word_set.add(word)
    return word_set

def mapper_func(dir):
    file_dir = read_dir_file([], dir)
    word_set = read_local_file(file_dir)
    for line in sys.stdin:
        # split() without arguments already drops whitespace and empty tokens.
        for word in line.strip().split():
            if word in word_set:
                # Parenthesized form prints the same under Python 2 and 3.
                print("%s\t%s" % (word, "1"))

if __name__ == "__main__":
    # Usage: python map.py mapper_func ABC
    func = getattr(sys.modules[__name__], sys.argv[1])
    args = sys.argv[2:]
    func(*args)
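
The filtering step can be checked without a cluster. The sketch below is a simplified stand-in for the loop in mapper_func, written for Python 3; the helper name filter_words and the sample lines (taken from the preface above) are chosen for illustration only. It also shows a property of this approach: matching is case-sensitive and punctuation is not stripped, so a token such as "on!" does not match the whitelist entry "on".

```python
def filter_words(lines, word_set):
    """Yield (word, 1) for every whitespace-separated token found in word_set."""
    for line in lines:
        for word in line.strip().split():
            if word in word_set:
                yield (word, 1)

whitelist = {"suitable", "against", "recent", "across", "Age", "on"}
sample = [
    "But it is used with a suitable irony; and, after all,",
    "But no Age is so new as that!",
    "coming cocksure on to the stage",
    "she ought to have forgiven him, and so on!",
]
pairs = list(filter_words(sample, whitelist))
print(pairs)
```

If punctuation-insensitive counts are wanted, the tokens would have to be normalized before the lookup; the job in this post does not do that.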

4. The reduce-side code is as follows:

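#!/usr/bin/python
import sys

def reducer_func():
    # Input arrives sorted by key, so equal words are adjacent.
    word = None
    total = 0
    for line in sys.stdin:
        ss = line.split()
        if len(ss) != 2:
            continue
        cur_word = ss[0]
        cnt = int(ss[1])
        if cur_word != word:
            # Key changed: emit the finished word and start a new count.
            if word is not None:
                print("%s\t%s" % (word, total))
            word = cur_word
            total = 0
        # Count every record, including the first one of each word.
        total += cnt
    # Emit the last word, unless there was no input at all.
    if word is not None:
        print("%s\t%s" % (word, total))

if __name__ == "__main__":
    # Usage: python red.py reducer_func
    func = getattr(sys.modules[__name__], sys.argv[1])
    args = sys.argv[2:]
    func(*args)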
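
The reducer depends on its input being sorted by key, which the shuffle phase guarantees. As an alternative way to express the same aggregation, here is a minimal Python 3 sketch built on itertools.groupby; the function name reduce_counts and the sample map output are assumptions for illustration, not part of the original job.

```python
from itertools import groupby

def reduce_counts(lines):
    """Sum the counts of adjacent records that share the same word."""
    pairs = (line.split() for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(int(cnt) for _word, cnt in group))

# Map output as it would look after sorting by key.
sorted_map_output = ["Age\t1", "on\t1", "on\t1", "on\t1", "suitable\t1"]
result = list(reduce_counts(sorted_map_output))
print(result)
```

groupby only merges adjacent equal keys, so it gives correct totals only on sorted input, the same precondition the hand-written loop has.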

5. The run script run.sh is as follows:

HADOOP="/usr/local/src/hadoop-1.2.1/bin/hadoop"
HADOOP_STREAMING="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_PATH="/mapreduce/The_Man_of_Property.txt"
OUTPUT_PATH="/mapreduce/out"
$HADOOP fs -rmr $OUTPUT_PATH
$HADOOP jar $HADOOP_STREAMING \
        -input "$INPUT_PATH" \
        -output "$OUTPUT_PATH" \
        -mapper "python map.py mapper_func ABC" \
        -reducer "python red.py reducer_func" \
        -file "./map.py" \
        -file "./red.py" \
        -cacheArchive "hdfs://master:9000/mapreduce/white.tar.gz#ABC"
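
In the -cacheArchive URI, the fragment after # is the name of the symlink that points to the unpacked archive in the task's working directory, and it must match the directory argument passed to the mapper. The Python 3 sketch below only illustrates that correspondence by parsing the two strings from run.sh; it does not talk to Hadoop, and the variable names are chosen for illustration.

```python
import shlex
from urllib.parse import urlsplit

cache_archive = "hdfs://master:9000/mapreduce/white.tar.gz#ABC"
mapper_cmd = "python map.py mapper_func ABC"

parts = urlsplit(cache_archive)
link_name = parts.fragment             # symlink name seen by the task
mapper_dir = shlex.split(mapper_cmd)[-1]  # directory argument of mapper_func

print(parts.path)   # location of the archive on HDFS
print(link_name)
print(link_name == mapper_dir)
```

If the two names differ, mapper_func would be asked to list a directory that does not exist in the task's working directory.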