
How to implement Hadoop


This article is about how to implement Hadoop. The content is quite practical, so it is shared here as a reference; follow along for a closer look.

Getting started with Hadoop

Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that lets programs be automatically distributed across a large cluster of ordinary machines for concurrent execution. Just as Java programmers can largely ignore memory management, MapReduce's runtime system handles the distribution of the input data, schedules execution across the cluster, copes with machine failures, and manages communication between machines. This model lets programmers harness the resources of very large distributed systems without any experience in concurrent or distributed programming.

I. Introduction

As a Hadoop programmer, what you have to do is:

1. Define a Mapper, which processes the input Key-Value pairs and emits intermediate results.

2. Define a Reducer (optional), which folds the intermediate results and emits the final result.

3. Define an InputFormat and OutputFormat (optional). The InputFormat converts each line of the input file into a Java class for the Mapper function to consume; when it is not defined, each line defaults to a String.

4. Define the main function, which configures a Job inside it and runs the Job. (A minimal skeleton of these steps follows this list.)
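For a rough sense of the shape, the four steps above map onto the old (pre-0.20) org.apache.hadoop.mapred API roughly as in the minimal skeleton below. MyJob, MyMapper and MyReducer are made-up names for illustration only; the full Grep example at the end of this article shows a real case.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class MyJob {

        // Step 1: a Mapper turns input Key-Value pairs into intermediate pairs.
        public static class MyMapper extends MapReduceBase implements Mapper {
            public void map(WritableComparable key, Writable value,
                            OutputCollector output, Reporter reporter) throws IOException {
                output.collect(key, value);
            }
        }

        // Step 2 (optional): a Reducer folds intermediate pairs into the final result.
        public static class MyReducer extends MapReduceBase implements Reducer {
            public void reduce(WritableComparable key, Iterator values,
                               OutputCollector output, Reporter reporter) throws IOException {
                while (values.hasNext()) {
                    output.collect(key, (Writable) values.next());
                }
            }
        }

        // Step 4: main() configures a Job and submits it.
        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(MyJob.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            JobClient.runJob(job);   // blocks until the Job finishes
        }
    }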

Everything else is up to the system. The first step in getting started with Hadoop is to understand the basic concepts.

1. Basic concepts: Hadoop's HDFS implements Google's GFS file system. The NameNode, responsible for scheduling, runs on the master, and a DataNode runs on each machine to serve as the file system. Hadoop likewise implements Google's MapReduce: the JobTracker, the overall scheduler of MapReduce, runs on the master, and a TaskTracker runs on each machine to execute Tasks.

2. The main() function creates the JobConf, defines the Mapper, Reducer, Input/OutputFormat and the input/output directories, finally submits the Job to the JobTracker, and waits for the Job to finish.

3. The JobTracker creates an instance of the InputFormat and calls its getSplits() method, which splits the files in the input directory into FileSplits that serve as the input of the MapperTasks; the generated MapperTasks are added to the queue.
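In code, this splitting step is conceptually something like the sketch below (old mapred API; SplitSketch and planMapTasks are made-up names, and the real JobTracker logic is far more involved):

    import java.io.IOException;

    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitSketch {
        // Conceptual sketch only: ask the Job's InputFormat to split the input,
        // yielding one InputSplit (a FileSplit for file input) per MapperTask.
        static InputSplit[] planMapTasks(JobConf job, int numMapTasks) throws IOException {
            InputFormat inputFormat = job.getInputFormat();   // TextInputFormat by default
            return inputFormat.getSplits(job, numMapTasks);
        }
    }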

4. Each TaskTracker asks the JobTracker for the next Map/Reduce task.

A MapperTask first creates a RecordReader from the InputFormat, loops over the contents of its FileSplit to generate Keys and Values, and passes them to the Mapper function. Once processing finishes, the intermediate results are written out as a SequenceFile.
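The read loop inside a MapperTask is conceptually like the sketch below (old mapred API; MapLoopSketch and runMapLoop are made-up names, and error handling is omitted):

    import java.io.IOException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MapLoopSketch {
        // Conceptual sketch only: read one FileSplit record by record,
        // feeding each Key-Value pair to the user's Mapper.
        static void runMapLoop(InputFormat inputFormat, InputSplit split, JobConf job,
                               Mapper mapper, OutputCollector output, Reporter reporter)
                throws IOException {
            RecordReader reader = inputFormat.getRecordReader(split, job, reporter);
            WritableComparable key = (WritableComparable) reader.createKey();
            Writable value = (Writable) reader.createValue();
            while (reader.next(key, value)) {
                mapper.map(key, value, output, reporter);
            }
            reader.close();   // intermediate results are then written as a SequenceFile
        }
    }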

A ReducerTask fetches the intermediate results it needs over HTTP from the Jetty server of each TaskTracker that ran a Mapper (the copy phase, reported as 33% progress); after Sort/Merge (66%), it executes the Reducer function and finally writes the results into the output directory according to the OutputFormat.

The TaskTracker reports its status to the JobTracker every 10 seconds, and 10 seconds after a Task completes, it asks the JobTracker for the next Task.

All data processing in the Nutch project is built on top of Hadoop; see "Scalable Computing with Hadoop" for details. Now let's look at the code a programmer actually writes.

Code written by programmers

Let's do a simple distributed grep: it matches the input files line by line and, if a line matches, writes the line to the output file. Because the matching lines are output unchanged, we only write the Mapper function; we write no Reducer function and define no Input/OutputFormat.

    package demo.hadoop;

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class HadoopGrep {

        public static class RegMapper extends MapReduceBase implements Mapper {

            private Pattern pattern;

            public void configure(JobConf job) {
                // the search regex is passed in through the job configuration
                pattern = Pattern.compile(job.get("mapred.mapper.regex"));
            }

            public void map(WritableComparable key, Writable value,
                            OutputCollector output, Reporter reporter) throws IOException {
                String text = ((Text) value).toString();
                Matcher matcher = pattern.matcher(text);
                if (matcher.find()) {
                    output.collect(key, value);   // emit matching lines unchanged
                }
            }
        }

        private HadoopGrep() {
        } // singleton

        public static void main(String[] args) throws Exception {
            JobConf grepJob = new JobConf(HadoopGrep.class);
            grepJob.setJobName("grep-search");
            grepJob.set("mapred.mapper.regex", args[2]);
            grepJob.setInputPath(new Path(args[0]));
            grepJob.setOutputPath(new Path(args[1]));
            grepJob.setMapperClass(RegMapper.class);
            grepJob.setReducerClass(IdentityReducer.class);
            JobClient.runJob(grepJob);
        }
    }
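A hypothetical way to run it, assuming the class has been compiled and packed into a jar (the jar name and directories below are placeholders):

    bin/hadoop jar hadoop-grep.jar demo.hadoop.HadoopGrep <input-dir> <output-dir> <regex>

Here args[0] is the input directory, args[1] the output directory, and args[2] the regular expression, matching the order in which main() reads them.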

The configure() function of the RegMapper class receives the search string passed in by the main function; the map() function performs the regular-expression match, where key is the position of the line and value is the content of the line, and matching lines are put into the intermediate results.

The main() function defines the input/output directories and the match string taken from the command-line arguments, sets the Mapper function to the RegMapper class, and sets the Reduce function to IdentityReducer, which does nothing but pass the intermediate results straight through to the final output, then runs the Job. The whole code is very simple, without any of the details of distributed programming.

Thank you for reading! This concludes the article on "how to implement Hadoop". I hope the content above has been of some help and that you have learned something from it. If you found the article good, feel free to share it for more people to see!
