I used to agonize over whether someone working in big data, and in a pre-sales role at that, really needs to learn to write code, or whether understanding the principles is enough. The more time I spent agonizing, the clearer it became that instead of debating whether or not to code, I should actually write the code and find a business scenario to practice on. To put it simply: instead of agonizing, just do it!
Frankly, studying MapReduce felt difficult at first, probably because I lacked a coding foundation. Although I had studied the fundamentals of big data for three years, I had never actually implemented the relevant code. But after typing out the code this time and seeing the final result displayed in the web UI, I felt more or less relieved. For any technology there are stages: knowing it, understanding it, and being able to apply it in real work.
Map stage
In fact, this step just counts the words in the source file one by one; I wrote it as a Python script, which is very easy to do. It is a simple mapping relationship and easy to understand.
Map is essentially a divide-and-conquer idea: when you have a lot of data, you first split it across different machines. In practice that means putting the large file onto the HDFS cluster, and the mapping operation then runs on each machine. Working with the code on Hadoop is basically the same as working with vim locally; the difference is whether the command is prefixed with hadoop. Paths matter here as well: they are configured through the Java and Hadoop environment variables, and once those are set, TAB completion works for the commands. A minimal sketch of the mapping idea follows.
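As a rough illustration (a toy sketch in Python, separate from the actual map.py script shown later), mapping one line of text just means splitting it into words and pairing each word with the count 1:
line = "instead of entangling it is better to do it" # a made-up sample line, not taken from the real file #
pairs = [(word, 1) for word in line.strip().split()] # the map output: one (word, 1) pair per word #
print(pairs) # [('instead', 1), ('of', 1), ..., ('it', 1)] #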
Reduce stage
This is a merging process: it merges and reduces the files produced by the map step. My exercise this time was wordcount, which amounts to counting every repeated word.
It involves a first-in, first-out traversal of the sorted stream, and it is equivalent to gathering the results from each worker machine on one master node. This phase has more code than the map phase: it uses a list of counts and an accumulation loop, and the relevant functions need to be understood. A small sketch of this merging step follows.
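As a sketch of this merging step (an alternative, assumed formulation rather than the red.py script used below), the reducer can also be written with itertools.groupby; it relies on the input already being sorted by key, which is exactly what the sort step, or Hadoop's shuffle, guarantees:
import sys
from itertools import groupby

def read_pairs(stream):
    # each input line looks like "word\t1"; sorted input keeps identical words adjacent #
    for line in stream:
        word, val = line.strip().split('\t')
        yield word, int(val)

for word, group in groupby(read_pairs(sys.stdin), key=lambda kv: kv[0]):
    print("%s\t%s" % (word, sum(val for _, val in group))) # accumulate the counts for each word #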
While learning MapReduce I found that my biggest problem was unfamiliarity with vim commands; some things you only really understand after you have operated them yourself. I only knew the principles before, but through this hands-on practice I found my actual operation still a little lacking, and I also deepened my understanding of MapReduce. Over the weekend I will switch to another dataset and get more familiar with the code I already know. Keep going!
ulimit -a # view the shell's resource limits #
cd /usr/local/src/ # go to the directory where hadoop is installed #
ls
ll # ll lists files with their attributes; ls only lists the names in a folder #
touch *.* # create a file (touch creates it if it does not exist, otherwise updates its timestamp) #
mkdir python_mr # create a folder #
cd /home/badou/python_mr/ # open the mapreduce working folder #
cd mapreduce_wordcount_python/ # enter the word-frequency exercise directory; the source file was copied here from the shared folder #
rm output result.data # delete the output of previous runs #
# view the source file locally #
cat The_Man_of_Property.txt
# view the whole file; if it is too long, press ctrl+c to stop #
cat The_Man_of_Property.txt | head -1
# view only the first line of the file #
cat The_Man_of_Property.txt | head -2 | tr ' ' '\n'
# convert all spaces to newline characters, so every word sits on its own line #
cat The_Man_of_Property.txt | head -2 | tr ' ' '\n' | sort -k1 -nr
# sort: -k1 sorts by the first column, -n numerically, -r in descending order #
cat The_Man_of_Property.txt | head -2 | tr ' ' '\n' | sort -k1 | uniq -c | head
# uniq -c counts how many times each identical line repeats #
cat The_Man_of_Property.txt | head -2 | tr ' ' '\n' | sort -k1 | uniq -c | awk '{print $2"\t"$1}' | head
# reorder the output as key-value pairs: word, then count #
cat The_Man_of_Property.txt | head -2 | tr ' ' '\n' | sort -k1 | uniq -c | awk '{print $2"\t"$1}' | sort -k2 -nr | head # sort the key-value output by count in descending order and show the first 10 lines #
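For a sanity check, the same counting can be done in plain Python; this is a rough local equivalent of the tr | sort | uniq -c | awk pipeline above, except that it reads the whole file rather than only the first two lines (it assumes The_Man_of_Property.txt is in the current directory):
from collections import Counter

with open("The_Man_of_Property.txt") as f:
    counts = Counter(word for line in f for word in line.split()) # count every word in the file #

for word, count in counts.most_common(10): # like sort -k2 -nr | head: the 10 most frequent words #
    print("%s\t%s" % (word, count))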
# upload the file to hadoop for processing #
vim ~/.bashrc # edit the environment variables for JAVA and hadoop #
export PATH=$PATH:$JAVA_HOME/bin:/usr/local/src/hadoop-1.2.1/bin # extend PATH so that the hadoop command can be found #
source ~/.bashrc # after saving and exiting, run this so the new environment variables take effect #
hadoop fs -ls / # list the files on hadoop #
hadoop fs -rmr /The_Man_of_Property.txt # delete a previously uploaded file #
hadoop fs -put The_Man_of_Property.txt / # upload the file; note: it must be in the current directory #
hadoop fs -cat /The_Man_of_Property.txt | head # view a file on hadoop; only suitable for plain-text files #
hadoop fs -text /The_Man_of_Property.txt | head # view a file on hadoop; also handles compressed and serialized files #
# map.py code #
import sys

for line in sys.stdin:                # read data line by line from standard input #
    ss = line.strip().split(' ')      # split the line on spaces; ss is the list of words #
    for s in ss:                      # process every word #
        if s.strip() != "":
            print "%s\t%s" % (s, 1)   # emit each non-empty word with a count of 1 #

cat The_Man_of_Property.txt | head | python map.py | head # test whether the map code works #
# red.py (reducer) code #
import sys

current_word = None
count_pool = []
sum = 0                                      # initialise the accumulator #

for line in sys.stdin:                       # read each line from standard input #
    word, val = line.strip().split('\t')
    if current_word == None:                 # first word seen: initialise current_word #
        current_word = word
    if current_word != word:                 # a new word has started: flush the previous one #
        for count in count_pool:
            sum += count
        print "%s\t%s" % (current_word, sum)
        current_word = word
        count_pool = []
        sum = 0
    count_pool.append(int(val))              # append this value to the pool for the current key #

for count in count_pool:
    sum += count                             # sum the occurrences of the last key #
print "%s\t%s" % (current_word, str(sum))    # output the final key-value pair #

cat The_Man_of_Property.txt | python map.py | sort -k1 | python red.py | sort -k2 -nr | head # verify map.py and red.py locally #
# run.sh: shell script that launches map.py and red.py as a hadoop streaming job #
HADOOP_CMD="/usr/local/src/hadoop-1.2.1/bin/hadoop" # path to the hadoop binary, for easy reference #
STREAM_JAR_PATH="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"
# path to the hadoop streaming jar that drives the job #
INPUT_FILE_PATH_1="/The_Man_of_Property.txt"
OUTPUT_PATH="/output"
# $HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH
# Step 1.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python red.py" \
    -file ./map.py \
    -file ./red.py # ship map.py and red.py to the cluster along with the job #

./run.sh # run the shell script; it calls hadoop to run the python mapper and reducer #
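Once the job finishes, the result can be pulled back from HDFS and inspected. A minimal sketch, assuming the reducer wrote its output under the /output directory configured in run.sh with Hadoop's default file name part-00000:
import subprocess

# show the first lines of the streaming job's output on HDFS #
subprocess.call(
    "/usr/local/src/hadoop-1.2.1/bin/hadoop fs -cat /output/part-00000 | head",
    shell=True)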