[Hadoop in Action] Chapter 1: Introduction to Hadoop


This chapter covers:

The basics of writing scalable, distributed, data-intensive programs

Understanding Hadoop and MapReduce

Writing and running a basic MapReduce program

1. What is Hadoop

Hadoop is an open source framework for writing and running distributed applications to process large-scale data.

What makes Hadoop different is the following:

Convenient - Hadoop runs on large clusters of commodity machines or on top of cloud computing services

Robust - because Hadoop is intended to run on commodity hardware, its architecture assumes and tolerates frequent hardware failures

Scalable - Hadoop scales linearly to handle larger datasets by adding nodes to the cluster

Simple - Hadoop allows users to quickly write efficient parallel code

2. Understand distributed systems and Hadoop

Understand the comparison between distributed systems (scaling out) and large single-machine servers (scaling up), taking into account the price/performance of current I/O technologies.

Understand the differences between Hadoop and other distributed architectures such as SETI@home:

Hadoop's design philosophy is to move code to the data, whereas SETI@home's design moves data to the code.

The program to be run is orders of magnitude smaller than the data and is easier to move; moreover, moving data across the network takes far longer than loading code onto a machine, so it is better to leave the data where it is and move the executable code to the machines that hold the data.

3. Compare SQL database with Hadoop

SQL (structured query language) is designed for structured data, while many of the initial applications of Hadoop are for unstructured data such as text. Let's compare Hadoop with a typical SQL database in more detail from a specific perspective:

Scale out instead of scale up - scaling up a commercial relational database to ever-larger servers quickly becomes prohibitively expensive, whereas Hadoop scales out by adding commodity nodes

Key/value pairs instead of relational tables - Hadoop uses key/value pairs as its basic data unit, which is flexible enough to handle less-structured data types

Functional programming (MapReduce) instead of declarative queries (SQL) - in MapReduce you specify the actual data processing steps yourself, much like an execution plan produced by a SQL engine

Offline batch processing instead of online transactions - Hadoop is designed for offline processing and analysis of large-scale data, not for online transaction processing that randomly reads and writes a few records at a time

4. Understand MapReduce

MapReduce is a data processing model; its greatest strength is that it scales easily across many computing nodes.

In the MapReduce model, the data processing primitives are called mappers and reducers.

Decomposing a data processing application into mappers and reducers is sometimes tedious, but once an application is written in MapReduce form, it can be scaled to hundreds, thousands, or even tens of thousands of machines in a cluster merely by changing the configuration.

[Scaling a simple program]

Processing a small number of documents: for each document, use a tokenization process to extract the words one by one; for each word, add 1 to its entry in the multiset wordCount; finally, a display() function prints out all the entries in wordCount.
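
As a rough illustration only (the class and method names SequentialWordCount and countWords below are assumptions, not the book's listing), a minimal Java sketch of this sequential approach might look like the following, with a HashMap standing in for the multiset wordCount:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class SequentialWordCount {

    // Count word occurrences across a small, in-memory set of documents.
    // The Map plays the role of the multiset wordCount described above.
    static Map<String, Integer> countWords(List<String> documents) {
        Map<String, Integer> wordCount = new HashMap<String, Integer>();
        for (String document : documents) {
            StringTokenizer tokens = new StringTokenizer(document); // split on whitespace
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                Integer current = wordCount.get(word);
                wordCount.put(word, current == null ? 1 : current + 1); // add 1 to the word's entry
            }
        }
        return wordCount;
    }

    // display(): print every entry in wordCount
    public static void main(String[] args) {
        Map<String, Integer> wordCount =
                countWords(Arrays.asList("see spot run", "run spot run"));
        for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}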

Processing a large number of documents: distribute the work across multiple machines, each processing a different portion of the documents; when all the machines have finished, a second processing phase merges their results.

Several details can keep this approach from working as expected. Reading all the documents from a central storage server makes that server's bandwidth the bottleneck, and the multiset wordCount may grow too large to fit in a single computer's memory. Moreover, if only one computer handles the wordCount merging in the second phase, it becomes a bottleneck, so the second phase should also run in a distributed way: the wordCount data must be partitioned after the first phase so that each computer in the second phase handles only one partition.

In order for it to work on a distributed computer cluster, the following features need to be added:

Store files on many computers (phase I)

Write a disk-based hash table so that processing is not limited by memory capacity

Partition the intermediate data (i.e., wordCount) from the first phase

Shuffle these partitions to the appropriate computers in the second phase (see the sketch after this list)
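
As a minimal sketch of the partitioning step only (not the book's code; the names WordCountPartitioner, partitionWordCount, and numPartitions are illustrative assumptions), a hash-modulo scheme ensures that every first-phase machine sends the counts for a given word to the same second-phase machine:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountPartitioner {

    // Split one machine's intermediate wordCount into numPartitions pieces.
    // Because the partition index depends only on the word itself, every
    // machine assigns a given word to the same partition, so all of its
    // counts end up on the same second-phase computer.
    static List<Map<String, Integer>> partitionWordCount(Map<String, Integer> wordCount,
                                                         int numPartitions) {
        List<Map<String, Integer>> partitions = new ArrayList<Map<String, Integer>>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<String, Integer>());
        }
        for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {
            int p = (entry.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions; // non-negative hash
            partitions.get(p).put(entry.getKey(), entry.getValue());
        }
        return partitions;
    }
}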

The execution of a MapReduce program is divided into two main phases, mapping and reducing. Each phase is defined as a data processing function, called the mapper and the reducer respectively. In the mapping phase, MapReduce takes the input data and feeds each data unit to the mapper; in the reducing phase, the reducer processes all the output from the mapper and produces the final result. In short, the mapper filters and transforms the input so that the reducer can complete the aggregation.

In addition, to scale the distributed word-counting program, we had to write the partitioning and shuffling functions ourselves.

Writing an application in the MapReduce framework is the process of customizing mapper and reducer. Here is the complete data flow:

The application's input must be organized as a list of key/value pairs, list(<k1, v1>)

The list of key/value pairs is split, and each individual pair <k1, v1> is processed by calling the map function of the mapper, which outputs a list of intermediate pairs, list(<k2, v2>)

The output of all the mappers is aggregated into one giant list of <k2, v2> pairs; all pairs sharing the same k2 are grouped into a new key/value pair, <k2, list(v2)>

Each reducer processes each aggregated pair <k2, list(v2)> separately and outputs a list of <k3, v3> pairs (a signature sketch follows this list)
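
To make the types in this flow concrete, here is a minimal sketch, assuming the Hadoop Java API used by the WordCount listing later in this chapter, of the generic shape of a mapper and a reducer; K1/V1, K2/V2, and K3/V3 correspond to the pairs above, and the class names MyMapper and MyReducer are placeholders:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// K1/V1 are the input pair types, K2/V2 the intermediate types, K3/V3 the output types.
class MyMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
    @Override
    protected void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // transform one <k1, v1> pair into zero or more <k2, v2> pairs
        // via context.write(k2, v2)
    }
}

class MyReducer<K2, V2, K3, V3> extends Reducer<K2, V2, K3, V3> {
    @Override
    protected void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // aggregate all values sharing the same k2 into one or more <k3, v3> pairs
        // via context.write(k3, v3)
    }
}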

5. Counting words with Hadoop - running the first program

Running the program requires:

A Linux operating system

JDK 1.6 or later

A working Hadoop installation

Usage: hadoop [--config confdir] COMMAND

where COMMAND is one of:

namenode -format                          format the DFS filesystem
secondarynamenode                         run the DFS secondary namenode
namenode                                  run the DFS namenode
datanode                                  run a DFS datanode
dfsadmin                                  run a DFS admin client
fsck                                      run a DFS filesystem checking utility
fs                                        run a generic filesystem user client
balancer                                  run a cluster balancing utility
jobtracker                                run the MapReduce jobtracker node
pipes                                     run a Pipes job
tasktracker                               run a MapReduce tasktracker node
job                                       manipulate MapReduce jobs
version                                   print the version
jar <jar>                                 run a jar file
distcp <srcurl> <desturl>                 copy files or directories recursively
archive -archiveName NAME <src>* <dest>   create a Hadoop archive
daemonlog                                 get/set the log level for each daemon
CLASSNAME                                 run the class named CLASSNAME

Most commands print a help message when invoked without parameters.

The command to run the word count example program has the following form:

hadoop jar hadoop-*-examples.jar wordcount [-m <maps>] [-r <reduces>] <input> <output>

The commands to compile and package the modified word count program are as follows:

javac -classpath hadoop-*-core.jar -d playground/classes playground/src/WordCount.java

jar -cvf playground/wordcount.jar -C playground/classes/ .

The command to run the modified word count program is as follows:

hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output
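
Assuming the output directory named on the command line, the resulting counts can then be inspected with the HDFS shell, for example (the exact part-file names inside the output directory depend on the Hadoop version):

hadoop fs -cat output/part-*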

Code listing WordCount.java

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString()); // (1) tokenize on whitespace
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken()); // (2) put the token into the Text object
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // (3) output the count for each token
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

At position (1), WordCount uses Java's StringTokenizer with its default configuration, which tokenizes only on whitespace. To ignore standard punctuation during tokenization, add the punctuation characters to the StringTokenizer's delimiter list:

StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");

Because we want the word count to ignore case, convert each word to lowercase before placing it into the Text object:

word.set(itr.nextToken().toLowerCase());

To display only words that appear more than 4 times:

if (sum > 4) context.write(key, result);

6. Hadoop history

Founder: Doug Cutting

Around 2004 - Google published two papers describing the Google File System (GFS) and the MapReduce framework.

January 2006 - Yahoo hired Doug Cutting to work with a dedicated team on developing Hadoop as an open source project.
