Writing MapReduce Jobs in Python


    Writing the distributed program in Python makes development fast, debugging easy, and the example more practical. MapReduce is well suited to processing text files and to data-mining jobs:

      On every machine:
    su - hadoop
    wget http://www.python.org/ftp/python/3.0.1/Python-3.0.1.tar.bz2
    tar jxvf Python-3.0.1.tar.bz2
    cd Python-3.0.1
    ./configure --prefix=/home/hadoop/python;make;make install
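
      As a quick sanity check, confirm the interpreter landed where --prefix points (it should report Python 3.0.1):
    /home/hadoop/python/bin/python3.0 -V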

    vi /home/hadoop/mapper.py

    
    #!/home/hadoop/python/bin/python3.0
    
    import sys
    
    # Read lines from stdin, split each one into words, and emit a
    # tab-separated "word 1" pair for every word.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print("%s\t%s" % (word, 1))

    vi /home/hadoop/reduce.py

    
    #!/home/hadoop/python/bin/python3.0
    
    from operator import itemgetter
    import sys
    
    word2count = {}
    
    # Sum up the counts for every word emitted by the mapper.
    for line in sys.stdin:
        line = line.strip()
        # The mapper separates word and count with a tab.
        word, count = line.split('\t', 1)
        try:
            count = int(count)
            word2count[word] = word2count.get(word, 0) + count
        except ValueError:
            # Skip lines whose count isn't a number.
            pass
    
    # Sort by word and print the totals, again tab-separated.
    sorted_word2count = sorted(word2count.items(), key=itemgetter(0))
    
    for word, count in sorted_word2count:
        print("%s\t%s" % (word, count))

      Check that they work:
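      Both scripts have to be executable, since the pipeline (and Hadoop Streaming) runs them directly:
    chmod +x /home/hadoop/mapper.py /home/hadoop/reduce.py
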
    echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
    foo 1
    foo 1
    quux 1
    labs 1
    foo 1
    bar 1
    quux 1

    echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reduce.py
    bar 1
    foo 3
    labs 1
    quux 2

      Make sure these two files are in place on every node!
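
      Alternatively, Hadoop Streaming can ship the scripts with the job itself via its -file option, so you don't have to copy them to every node by hand. A sketch of that form of the command, using the same jar, input, and output names as the run below:
    $ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reduce.py -reducer reduce.py -input 111/* -output 111-output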

      On the master node, run:

    # Copy the conf directory into the HDFS filesystem
    $ cd /home/hadoop/hadoop-0.19.1
    $ bin/hadoop dfs -copyFromLocal conf 111

      # Check that it was copied
    $ bin/hadoop dfs -ls
    Found 1 items
    drwxr-xr-x - hadoop supergroup 0 2009-05-18 15:27 /user/hadoop/111

      # Run the distributed computation
    $ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reduce.py -input 111/* -output 111-output
    additionalConfSpec_:null
    null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
    packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar29198/] [] /tmp/streamjob29199.jar tmpDir=null
    [...] INFO mapred.FileInputFormat: Total input paths to process : 12
    [...] INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
    [...] INFO streaming.StreamJob: Running job: job_200905191453_0001
    [...] INFO streaming.StreamJob: To kill this job, run:
    ...
    [...]
    [...] INFO streaming.StreamJob: map 0% reduce 0%
    [...] INFO streaming.StreamJob: map 43% reduce 0%
    [...] INFO streaming.StreamJob: map 86% reduce 0%
    [...] INFO streaming.StreamJob: map 100% reduce 0%
    [...] INFO streaming.StreamJob: map 100% reduce 33%
    [...] INFO streaming.StreamJob: map 100% reduce 70%
    [...] INFO streaming.StreamJob: map 100% reduce 77%
    [...] INFO streaming.StreamJob: map 100% reduce 100%
    [...] INFO streaming.StreamJob: Job complete: job_200905191453_0001
    [...] INFO streaming.StreamJob: Output: 111-output

    $ bin/hadoop dfs -ls 111-output
    Found 2 items
    drwxr-xr-x - hadoop supergroup 0 2009-05-19 14:54 /user/hadoop/111-output/_logs
    -rw-r--r-- 2 hadoop supergroup 30504 2009-05-19 16:26 /user/hadoop/111-output/part-00000

    $ bin/hadoop dfs -cat 111-output/part-00000
    you 3
    you've 1
    your 1
    zero 3
    zero, 1
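
      To pull the result file down to the local filesystem, something like this should do (the local filename here is arbitrary):
    $ bin/hadoop dfs -get 111-output/part-00000 ./wordcount-result.txt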

    Done. You can extend this example and build your own applications from it.
