2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
The basics of writing scalable, distributed, data-intensive programs
Understanding Hadoop and MapReduce
Writing and running a basic MapReduce program
1. What is Hadoop
Hadoop is an open source framework for writing and running distributed applications to process large-scale data.
Hadoop is distinguished by the following properties:
Accessible - Hadoop runs on large clusters of commodity machines, or on cloud computing services
Robust - Hadoop is intended to run on commodity hardware, so its architecture assumes that hardware failures are frequent
Scalable - Hadoop scales linearly to handle larger datasets by adding nodes to the cluster
Simple - Hadoop lets users quickly write efficient parallel code
2. Understand distributed systems and Hadoop
Compare distributed systems (scale out) with large stand-alone servers (scale up), taking into account the price/performance ratio of current I/O technologies.
Understand how Hadoop differs from other distributed architectures such as SETI@home:
Hadoop's design philosophy is to move the code to the data, whereas SETI@home moves the data to the code.
The program to run is several orders of magnitude smaller than the data and is therefore easier to move. Transferring the data over the network also takes far longer than loading the code, so it is better to leave the data where it is and ship the executable code to the machines that hold the data.
3. Compare SQL database with Hadoop
SQL (structured query language) is designed for structured data, while many of the initial applications of Hadoop are for unstructured data such as text. Let's compare Hadoop with a typical SQL database in more detail from a specific perspective:
Scale out instead of scale up - scaling a commercial relational database up to a larger machine becomes increasingly expensive
Key/value pairs instead of relational tables - Hadoop uses key/value pairs as its basic data unit, which is flexible enough to handle less structured data types
Functional programming (MapReduce) instead of declarative queries (SQL) - in SQL you state the result you want and the engine decides how to compute it; in MapReduce you specify the actual processing steps yourself, much like writing the execution plan of a SQL engine
Offline processing instead of online processing - Hadoop is designed for offline processing and large-scale data analysis, and is not suited to online transaction processing workloads that randomly read and write a few records
4. Understand MapReduce
MapReduce is a data processing model whose greatest advantage is that it scales easily across multiple computing nodes.
In the MapReduce model, the data processing primitives are called the mapper and the reducer.
Decomposing a data processing application into mappers and reducers is sometimes tedious, but once an application is written in MapReduce form, it can be scaled to hundreds, thousands, or even tens of thousands of machines in a cluster by changing only the configuration.
[Scaling a simple program]
Processing a small number of documents: for each document, use a tokenizer to extract the words one by one; for each word, add 1 to the corresponding entry in the multiset wordcount; finally, a display() function prints all the entries in wordcount.
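The single-machine procedure described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name LocalWordCount and the use of a sorted map for the multiset are choices made here, not part of the original text; only wordcount and display() are named in the description.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-machine word count; names other than wordcount/display are illustrative.
class LocalWordCount {

    // Builds the multiset "wordcount" as a sorted map from word to number of occurrences.
    static Map<String, Integer> count(Iterable<String> documents) {
        Map<String, Integer> wordcount = new TreeMap<>();
        for (String document : documents) {
            StringTokenizer tokens = new StringTokenizer(document); // splits on whitespace
            while (tokens.hasMoreTokens()) {
                wordcount.merge(tokens.nextToken(), 1, Integer::sum); // add 1 to the word's entry
            }
        }
        return wordcount;
    }

    // Prints every entry, playing the role of display() in the description.
    static void display(Map<String, Integer> wordcount) {
        wordcount.forEach((word, n) -> System.out.println(word + "\t" + n));
    }

    public static void main(String[] args) {
        display(count(Arrays.asList("do as I say", "not as I do")));
    }
}
```

Everything here lives in one process and one in-memory map, which is exactly what stops working once the document set grows.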
Processing a large number of documents: distribute the work across multiple machines, each processing a different part of the document set; when all the machines have finished, a second processing phase merges their results.
Several details can prevent this program from working as expected. Reading all the documents from one place can saturate the bandwidth of the central storage server, and the multiset wordcount can hold too many entries to fit in a single computer's memory. In addition, if only one computer handles the wordcount merge in the second phase, that computer easily becomes a bottleneck. The second phase can itself be distributed, provided the work is divided so that each computer can run independently: wordcount must be partitioned after the first phase so that each computer in the second phase handles only one partition.
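As a rough sketch of the two-phase approach before partitioning is introduced, the following plain Java code simulates each machine with a method call. The class and method names (TwoPhaseWordCount, countShard, mergeAll) are hypothetical and chosen for this illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical simulation of the two phases; each call stands in for one machine.
class TwoPhaseWordCount {

    // Phase one: a machine counts the words in its own share of the documents.
    static Map<String, Integer> countShard(List<String> shard) {
        Map<String, Integer> partial = new HashMap<>();
        for (String document : shard) {
            StringTokenizer tokens = new StringTokenizer(document);
            while (tokens.hasMoreTokens()) {
                partial.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return partial;
    }

    // Phase two: one machine merges every partial count, which is the bottleneck noted above.
    static Map<String, Integer> mergeAll(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        partials.add(countShard(Arrays.asList("do as I say")));
        partials.add(countShard(Arrays.asList("not as I do")));
        System.out.println(mergeAll(partials));
    }
}
```

Because mergeAll receives every partial result, its input grows with the number of first-phase machines, which motivates partitioning the intermediate data.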
In order for it to work on a distributed computer cluster, the following features need to be added:
Store the files across many computers (first phase)
Write a disk-based hash table so that processing is not limited by memory capacity
Partition the intermediate data from the first phase (i.e., wordcount)
Shuffle these partitions to the appropriate computers in the second phase.
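The partitioning and shuffling steps in the list above might look like the following sketch. It assumes a simple hash-based rule for assigning words to partitions; the text does not say which rule to use, and the names PartitionShuffleSketch, partitionFor, and shuffle are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical partition-and-shuffle step between the first and second phases.
class PartitionShuffleSketch {

    // Picks the second-phase machine for a word; the mask keeps the hash non-negative.
    static int partitionFor(String word, int numPartitions) {
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Routes each first-phase entry to its partition and sums counts that meet there,
    // so partition i ends up holding the complete count for each of its words.
    static List<Map<String, Integer>> shuffle(List<Map<String, Integer>> partials,
                                              int numPartitions) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) ->
                partitions.get(partitionFor(word, numPartitions)).merge(word, n, Integer::sum));
        }
        return partitions;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("as", 1);
        first.put("do", 1);
        Map<String, Integer> second = new HashMap<>();
        second.put("as", 1);
        second.put("not", 1);
        System.out.println(shuffle(Arrays.asList(first, second), 3));
    }
}
```

Since the same word always hashes to the same partition, each second-phase machine can finish its words without consulting any other machine.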
The execution of a MapReduce program is divided into two main phases, mapping and reducing. Each phase is defined by a data processing function, called the mapper and the reducer respectively. In the mapping phase, MapReduce takes the input data and feeds each data unit to the mapper. In the reducing phase, the reducer processes all the output from the mappers and produces the final result. In short, the mapper filters and transforms the input so that the reducer can aggregate it.
In addition, to scale out the distributed word count program, we had to write partitioning and shuffling functions.
Writing an application in the MapReduce framework is the process of customizing mapper and reducer. Here is the complete data flow:
The input to the application must be organized as a list of key/value pairs, list(<k1, v1>)
The list of key/value pairs is split up, and each individual pair <k1, v1> is processed by calling the mapper's map function, which emits a list of <k2, v2> pairs
The output of all the mappers is aggregated into one large list of <k2, v2> pairs, and all pairs that share the same k2 are grouped into a new pair <k2, list(v2)>
The reducer processes each aggregated pair separately and outputs the result.
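The data flow above can be imitated in a single process to see how the pieces fit. The following is a minimal sketch, not Hadoop's implementation: the class MapReduceFlowSketch, its run method, and the use of BiFunction for the mapper and reducer are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical in-memory imitation of the map, group, and reduce steps.
class MapReduceFlowSketch {

    // Calls the mapper on each input pair, groups the emitted pairs by key,
    // then calls the reducer once per key with the list of grouped values.
    static <K1, V1, K2 extends Comparable<K2>, V2, R> Map<K2, R> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapper,
            BiFunction<K2, List<V2>, R> reducer) {
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> pair : input) {
            for (Map.Entry<K2, V2> out : mapper.apply(pair.getKey(), pair.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        Map<K2, R> result = new TreeMap<>();
        grouped.forEach((k2, values) -> result.put(k2, reducer.apply(k2, values)));
        return result;
    }

    // Word count on top of the sketch: the input key is a file name, the value its contents.
    static Map<String, Integer> wordCount(List<Map.Entry<String, String>> files) {
        return run(files,
            (String name, String text) -> {
                List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                for (String word : text.split("\\s+")) {
                    if (!word.isEmpty()) {
                        pairs.add(new SimpleEntry<>(word, 1)); // emit the pair (word, 1)
                    }
                }
                return pairs;
            },
            (String word, List<Integer> ones) -> {
                int sum = 0;
                for (int one : ones) {
                    sum += one;
                }
                return sum;
            });
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> files = new ArrayList<>();
        files.add(new SimpleEntry<>("a.txt", "do as I say"));
        files.add(new SimpleEntry<>("b.txt", "not as I do"));
        System.out.println(wordCount(files));
    }
}
```

In a real framework the grouping step is where partitioning and shuffling happen across machines; here it is a single sorted map.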
5. Count words with Hadoop: run the first program
Running the examples requires:
Linux operating system
JDK 1.6 or later
A working Hadoop installation
Usage: hadoop [--config configdir] COMMAND
Here COMMAND is one of the following:
namenode -format      format the DFS file system
secondarynamenode     run the DFS secondary namenode
namenode              run the DFS namenode
datanode              run a DFS datanode
dfsadmin              run a DFS admin client
fsck                  run a DFS file system checking tool
fs                    run a generic file system user client
balancer              run a cluster balancing tool
jobtracker            run the MapReduce jobtracker node
pipes                 run a pipes job
tasktracker           run a MapReduce tasktracker node
job                   manipulate MapReduce jobs
version               print the version
jar <jar>             run a jar file
distcp <srcurl> <desturl>    copy files or directories recursively
archive -archiveName NAME <src>* <dest>    create a Hadoop archive
daemonlog             get or set the log level of each daemon
or
CLASSNAME             run the class named CLASSNAME
Most commands print a help message when invoked without parameters.
The command to run the sample word count program has the following form:
hadoop jar hadoop-*-examples.jar wordcount [-m maps] [-r reduces] input output
The commands to compile and package the modified word count program are as follows:
javac -classpath hadoop-*-core.jar -d playground/classes playground/src/WordCount.java
jar -cvf playground/wordcount.jar -C playground/classes/ .
The command to run the modified word count program is as follows:
hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output
Code listing WordCount.java
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString()); // (1) split on whitespace
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken()); // (2) put the token into the Text object
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // (3) output the count for each token
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Every argument except the last is an input path; the last is the output path.
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

At position (1), WordCount uses Java's StringTokenizer with its default configuration, which splits on whitespace only. To ignore standard punctuation during tokenization, add the punctuation characters to the StringTokenizer's delimiter list:

StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");

Because the word count should ignore case, convert every word to lowercase before putting it into the Text object:

word.set(itr.nextToken().toLowerCase());

To output only the words that appear more than 4 times:

if (sum > 4) context.write(key, result);

Note that the listing also registers IntSumReducer as the combiner. A filter placed in the reducer would then run on partial counts as well, so when adding the filter, either remove the setCombinerClass call or use a separate combiner class that only sums.
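The effect of the three modifications (punctuation delimiters, lowercasing, and the minimum-count filter) can be checked without a cluster. The following plain Java sketch applies the same logic to a string; the class ModifiedWordCountSketch and its count method are hypothetical names, and the code deliberately uses no Hadoop classes.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone check of the modified tokenizing and filtering logic.
class ModifiedWordCountSketch {

    // Same delimiter list as the modified mapper: whitespace plus common punctuation.
    static final String DELIMITERS = " \t\n\r\f,.:;?![]'";

    // Counts lowercased tokens, then keeps only words seen more than minCount times.
    static Map<String, Integer> count(String text, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text, DELIMITERS);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken().toLowerCase(), 1, Integer::sum);
        }
        counts.values().removeIf(n -> n <= minCount); // the filter runs after all counting is done
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("The cat. the CAT! the dog?", 1));
    }
}
```

The filter is applied only after the totals are complete, which mirrors why it belongs in the final reduce step rather than in any step that sees partial counts.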
6. Hadoop history
Founder: Doug Cutting
Around 2004: Google published two papers describing the Google File System (GFS) and the MapReduce framework.
January 2006: Yahoo hired Doug Cutting to work with a dedicated team on improving Hadoop as an open source project.