How to Use Python to Implement a Hadoop MapReduce Program


Many newcomers are unsure how to implement a Hadoop MapReduce program in Python, so this article walks through the problem and its solution. I hope it helps you solve it too.

The run on my machine looks as follows (the input data is the help manual of find, and, as expected, "the" is the most frequent word):

-- Here is the original post.

In this example, I will show you how to write a simple MapReduce program for Hadoop using Python.

Although the Hadoop framework is written in Java, Hadoop programs do not have to be: we can write them in languages like C++ or Python. The sample program on the official Hadoop website is written in Jython and packaged into a jar file, which is obviously inconvenient. It doesn't have to be done that way; we can program against Hadoop in plain Python. Take a look at the example in src/examples/python/WordCount.py and you'll see what I mean.

What do we want to do?

We will write a simple MapReduce program in CPython, instead of one written in Jython and packaged into a jar file.

Our example mimics WordCount, implemented in Python: it reads text files and counts how often each word occurs. The result is also output as text, one word per line followed by its count, separated by a tab.
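For instance, given the input line "foo foo quux labs foo bar quux" (the same sample used in the test section below), the job's final output would be:

bar	1
foo	3
labs	1
quux	2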

Prerequisites

Before writing this program, you should have a working Hadoop cluster, so that you don't flail around blindly later. If you haven't set one up yet, there are concise tutorials on building one on Ubuntu Linux (much of which also applies to other Linux and Unix systems):

How to use Hadoop Distributed File System (HDFS) to establish a single-node Hadoop cluster in Ubuntu Linux

How to use Hadoop Distributed File System (HDFS) to build a multi-node Hadoop cluster in Ubuntu Linux

MapReduce code for Python

The trick to writing MapReduce code in Python is to use Hadoop Streaming, which passes data between the Map and Reduce steps via STDIN (standard input) and STDOUT (standard output). We simply read from Python's sys.stdin and write to sys.stdout, and Hadoop Streaming takes care of everything else. It really is that simple!
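To make that contract concrete, here is a minimal sketch (my illustration, not part of the original post) of what every Hadoop Streaming script boils down to: read records from sys.stdin, write records to sys.stdout, one line per record, with a tab separating key and value.

#!/usr/bin/env python
# Identity mapper: the smallest possible Hadoop Streaming script.
# It copies every input record (one line = one record) to its
# output unchanged; Hadoop ships the lines in via STDIN and
# collects whatever we write to STDOUT.
import sys

for line in sys.stdin:
    sys.stdout.write(line)

Every real mapper and reducer below follows this same shape; only the per-line processing differs.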

Map: mapper.py

Save the following code in /home/hadoop/mapper.py. It reads data from STDIN, splits each line into words, and emits a list of (word, count) pairs mapping each word to its occurrences:

Note: make sure the script is executable (chmod +x /home/hadoop/mapper.py).

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

Note that this script does not compute the total count per word; it immediately emits a "word 1" pair for every occurrence, even when a word appears many times. The summing is left to the Reduce step.
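If all those repeated "1"s bother you, a common variation (my sketch, not from the original post) is to pre-aggregate counts inside the mapper before printing. This shrinks the data shuffled to the reducers, at the cost of holding a dictionary of the words seen by this mapper in memory:

#!/usr/bin/env python
# Mapper with in-mapper combining: counts words locally and
# emits each distinct word once per mapper run, in the same
# "word<TAB>count" format that reducer.py expects.
import sys

counts = {}
for line in sys.stdin:
    for word in line.strip().split():
        counts[word] = counts.get(word, 0) + 1

# emit the locally aggregated counts, tab-delimited
for word, count in counts.items():
    print('%s\t%s' % (word, count))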

Reduce: reducer.py

Save the following code in /home/hadoop/reducer.py. This script reads the output of mapper.py from STDIN, sums the occurrences of each word, and writes the results to STDOUT.

Again, check the script permissions: chmod +x /home/hadoop/reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
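Because Hadoop sorts the mapper output by key before it reaches the reducer, you don't actually need to keep the whole word2count dictionary in memory. A streaming-style variant (again my sketch, not from the original post) processes one word at a time in constant memory, relying on all lines for a given word arriving consecutively:

#!/usr/bin/env python
# Streaming-style reducer: assumes its input is sorted by key,
# as Hadoop guarantees (when testing locally, pipe the mapper
# output through sort first). Uses O(1) memory.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition('\t')
    try:
        count = int(count)
    except ValueError:
        continue  # count was not a number; skip this line
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

# don't forget to flush the last word
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))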

Test your code (cat data | map | sort | reduce)

I suggest testing your mapper.py and reducer.py scripts by hand before running the full MapReduce job; otherwise you might get no results back and have no idea why.

Here are some suggestions on how to test the functionality of your Map and Reduce:


# very basic test
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1

hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reducer.py
bar	1
foo	3
labs	1
quux	2

# using one of the ebooks as example input
# (see below on where to get the ebooks)
hadoop@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hadoop/mapper.py
The	1
Project	1
Gutenberg	1
EBook	1
of	1
[...]
(you get the idea)
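If you prefer to drive the same dry run from Python instead of the shell, here is a hedged sketch that wires the two scripts together with subprocess, mimicking the cat | map | sort | reduce pipeline above (it assumes the scripts live at the paths used in this tutorial and are executable):

#!/usr/bin/env python
# Local dry run of the streaming job: pipes sample text through
# mapper.py, an in-process sort, and reducer.py, then prints the
# reducer's output, just like the shell pipeline above.
import subprocess

SAMPLE = 'foo foo quux labs foo bar quux\n'

mapper = subprocess.Popen(['/home/hadoop/mapper.py'],
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE)
map_out, _ = mapper.communicate(SAMPLE.encode())

# emulate Hadoop's shuffle: sort mapper output lines by key
shuffled = b''.join(sorted(map_out.splitlines(True)))

reducer = subprocess.Popen(['/home/hadoop/reducer.py'],
                           stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE)
red_out, _ = reducer.communicate(shuffled)
print(red_out.decode())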

For this example, we will need three ebooks. Download them and store the extracted files, in us-ascii encoding, in a temporary directory such as /tmp/gutenberg.

hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop  674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$

Before we run the MapReduce job, we need to copy the local files into HDFS:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
/user/hadoop/gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
Found 3 items
/user/hadoop/gutenberg/20417-8.txt	674425
/user/hadoop/gutenberg/7ldvc10.txt	1423808
/user/hadoop/gutenberg/ulyss12.txt	1561677

Now everything is ready, and we can run the Python MapReduce job on the Hadoop cluster. As I said above, we rely on standard input and output to pass data between the Map and Reduce steps, through STDIN and STDOUT.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/* -output gutenberg-output

If you want to change some Hadoop settings for the job, such as increasing the number of Reduce tasks, you can pass additional options to the streaming jar (for example, "-jobconf mapred.reduce.tasks=16" in this Hadoop version).

An important note: this job reads everything in the HDFS directory gutenberg and writes its results to the HDFS directory gutenberg-output.

The output of the run looks like this:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/* -output gutenberg-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/usr/local/hadoop-datastore/hadoop-hadoop/hadoop-unjar54543/] [] /tmp/streamjob54544.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process: 7
[...] INFO streaming.StreamJob: getLocalDirs(): [/usr/local/hadoop-datastore/hadoop-hadoop/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200803031615_0021
[...]
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021
[...] INFO streaming.StreamJob: Output: gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$

As you can see in the output above, the job ran to completion. Hadoop also provides a basic web interface that displays statistics and job information; you can open it in a browser while the cluster is executing. (The original post shows a screenshot here.)

Check whether the results were written to the HDFS output directory:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
Found 1 items
/user/hadoop/gutenberg-output/part-00000	903193	2007-09-21 13:00
hadoop@ubuntu:/usr/local/hadoop$

You can print the result file with the dfs -cat command:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-00000
"(Lo)cra"	1
"1490	1
"1498,"	1
"35"	1
"40,"	1
"A	2
"AS-IS".	2
"A_	1
"Absoluti	1
[...]
hadoop@ubuntu:/usr/local/hadoop$

Note that the quote (") characters in the result above were not inserted by Hadoop; they come straight from the input text.
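Those quoting artifacts come from the mapper's naive whitespace split. If you want cleaner tokens, a small variation of the mapper (my tweak, not in the original post) strips surrounding punctuation and lowercases each token before counting:

#!/usr/bin/env python
# Mapper variant that normalizes tokens: lowercases them and
# strips leading/trailing punctuation, so tokens like '"A' and
# 'A' are less likely to be counted as distinct words.
import string
import sys

for line in sys.stdin:
    for token in line.strip().split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were pure punctuation
            print('%s\t%s' % (word, 1))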
After reading the above, have you mastered how to implement a Hadoop MapReduce program in Python? If you want to learn more skills or dig deeper into the topic, you are welcome to follow the industry information channel. Thank you for reading!
