

What are the core components of Apache Hadoop


This article describes the core components of Apache Hadoop and then walks through installing and configuring Hadoop on a single node and on a cluster. The editor finds it very practical and shares it here as a reference; follow along to have a look.

Apache Hadoop core components

Apache Hadoop includes the following modules:

Hadoop Common: common utilities to support other Hadoop modules.

Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.

Hadoop YARN: a framework for job scheduling and cluster resource management.

Hadoop MapReduce: a parallel processing system for large datasets based on YARN.

Other projects related to Apache Hadoop include:

Ambari: a Web-based tool for configuring, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example, heat maps) and for inspecting MapReduce, Pig, and Hive applications in a user-friendly way, making it easy to diagnose their performance.

Avro: data serialization system.

Cassandra: scalable, multi-master database without a single point of failure.

Chukwa: data acquisition system for managing large distributed systems.

HBase: a scalable, distributed database that supports structured data storage for large tables. (HBase is covered in later chapters.)

Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout: an extensible machine learning and data mining library.

Pig: a high-level data flow parallel computing language and execution framework.

Spark: a fast and general compute engine for Hadoop data. Spark provides a simple and powerful programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. (Spark is covered in later chapters.)

Tez: a generalized data-flow programming framework built on Hadoop YARN. It provides a powerful and flexible engine that executes arbitrary DAGs of tasks for both batch and interactive data processing. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, and also by commercial software (for example, ETL tools), to replace Hadoop MapReduce as the underlying execution engine.

ZooKeeper: a high-performance coordination service for distributed applications. (ZooKeeper is covered in later chapters.)

Installing and configuring Apache Hadoop on a single node

The following demonstrates how to quickly install and configure Hadoop on a single node so that you can get hands-on experience with HDFS and the MapReduce framework.

1. Prerequisites

Supported platforms:

GNU/Linux: Hadoop has been shown to support clusters of up to 2000 nodes on the GNU/Linux platform.

Windows: the examples in this article all run on GNU/Linux. To run Hadoop on Windows, see http://wiki.apache.org/hadoop/Hadoop2OnWindows.

Required software:

Java must be installed. Hadoop 2.7 and later requires Java 7, which can be either OpenJDK or the Oracle (HotSpot) JDK/JRE. For the JDK requirements of other Hadoop versions, see http://wiki.apache.org/hadoop/HadoopJavaVersions.

ssh must be installed and sshd must be kept running so that remote Hadoop daemons can be managed with the Hadoop scripts. The following is an example installation on Ubuntu:

$ sudo apt-get install ssh
$ sudo apt-get install rsync

2. Download

The download address is http://www.apache.org/dyn/closer.cgi/hadoop/common/.

3. Preparing to run a Hadoop cluster

Extract the downloaded Hadoop distribution. Edit the etc/hadoop/hadoop-env.sh file and define the following parameters:

# set the Java installation directory
export JAVA_HOME=/usr/java/latest

Try the following command:

$ bin/hadoop

The usage document for the hadoop script will be displayed.

Now you can start the Hadoop cluster in one of the three supported modes:

Local (stand-alone) mode

Pseudo-distributed mode

Fully distributed mode

4. Running in standalone (local) mode

By default, Hadoop is configured as a stand-alone Java process running in non-distributed mode. This is very helpful for debugging.

The following example copies the unpacked configuration files to use as input, then finds and displays every entry that matches the given regular expression. The output is written to the specified output directory.

$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*

5. Running in pseudo-distributed mode

Hadoop can run on a single node in a so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.

Configuration

Use the following:

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Password-free ssh setup

Now confirm whether you can log in to localhost with ssh without entering a password:

$ssh localhost

If you cannot log in to localhost with ssh without entering a password, execute the following command:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Execution

The following steps show how to run a MapReduce job locally.

(1) Format a new distributed file system:

$ bin/hdfs namenode -format

(2) Start the NameNode and DataNode daemons:

$ sbin/start-dfs.sh

Logs for the Hadoop daemons are written to the $HADOOP_LOG_DIR directory (the default is $HADOOP_HOME/logs).

(3) Browse the web interface of the NameNode; by default it is available at:

NameNode - http://localhost:50070/

(4) Create the HDFS directories required to run MapReduce jobs:

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>

(5) Copy the input files into the distributed file system:

$ bin/hdfs dfs -put etc/hadoop input

(6) Run the example program provided with the distribution:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

(7) View the output file

Copy the output file from the distributed file system to the local file system to view:

$ bin/hdfs dfs -get output output
$ cat output/*

Alternatively, view the output file on the distributed file system:

$ bin/hdfs dfs -cat output/*

(8) When you are done, stop the daemons:

$ sbin/stop-dfs.sh

Running YARN on a single node

You can run MapReduce jobs on YARN in pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager and NodeManager daemons.

Here are the steps to run.

(1) Configuration

etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

(2) Start the ResourceManager and NodeManager daemons:

$ sbin/start-yarn.sh

(3) Browse the web interface of the ResourceManager; by default it is available at:

ResourceManager - http://localhost:8088/

(4) Run a MapReduce job.

(5) When you are done, stop the daemons:

$ sbin/stop-yarn.sh

6. Running in fully distributed mode

For information on setting up fully distributed mode, see the "Installing and configuring an Apache Hadoop cluster" section below.

Installing and configuring an Apache Hadoop cluster

This section describes how to install, configure, and manage Hadoop clusters, ranging in size from a small cluster of several nodes to a very large cluster of thousands of nodes.

1. Prerequisites

Make sure all required software is installed on every node in your cluster. Installing a Hadoop cluster usually means unpacking the Hadoop software on all machines in the cluster. Refer to the previous section, "Installing and configuring Apache Hadoop on a single node".

Typically, one machine in the cluster is designated as the NameNode and another as the ResourceManager; these are the masters. Whether other services, such as the Web application proxy server and the MapReduce Job History server, run on dedicated hardware or on shared infrastructure depends on the load.

The remaining machines in the cluster act as both DataNode and NodeManager; these are the slaves.

2. Configuration in non-secure mode

Hadoop configuration is driven by two types of important configuration files:

Read-only defaults, including core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml.

Site-specific configuration, including etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml, and etc/hadoop/mapred-site.xml.

In addition, you can control the Hadoop scripts found in the bin directory of the distribution by setting site-specific values in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.

To configure a Hadoop cluster, you need to configure both the environment in which the Hadoop daemons run and the configuration parameters of the daemons themselves.

The HDFS daemons are NameNode, SecondaryNameNode, and DataNode. The YARN daemons are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is used, the MapReduce Job History Server also runs. In large clusters, these typically run on separate hosts.

Configuring the environment of the Hadoop daemons

Administrators should use the etc/hadoop/hadoop-env.sh, etc/hadoop/mapred-env.sh, and etc/hadoop/yarn-env.sh scripts to customize the environment of the Hadoop daemons.

At a minimum, JAVA_HOME should be set correctly on each remote node.

Administrators can use the configuration options in the following table to configure separate daemons:

Daemon - Environment variable
NameNode - HADOOP_NAMENODE_OPTS
DataNode - HADOOP_DATANODE_OPTS
SecondaryNameNode - HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager - YARN_RESOURCEMANAGER_OPTS
NodeManager - YARN_NODEMANAGER_OPTS
WebAppProxy - YARN_PROXYSERVER_OPTS
MapReduce Job History Server - HADOOP_JOB_HISTORYSERVER_OPTS

For example, to configure the NameNode to use parallel garbage collection (parallelGC), add the following line to etc/hadoop/hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"

Other common parameters that can be customized include:

HADOOP_PID_DIR - the directory where the daemons' process id files are stored.

HADOOP_LOG_DIR - the directory where the daemons' log files are stored. Log files are created automatically if the directory does not exist.

HADOOP_HEAPSIZE / YARN_HEAPSIZE - the maximum heap size to use, in MB. For example, a value of 1000 sets the daemon's heap to 1000 MB. The default is 1000. You can set this value separately for each daemon.

In most cases, you should point HADOOP_PID_DIR and HADOOP_LOG_DIR to directories that can only be written by the users that run the Hadoop daemons; otherwise there is the potential for a symlink attack.

It is also traditional to configure HADOOP_PREFIX in the system-wide shell environment. For example, with a simple script in /etc/profile.d:

HADOOP_PREFIX=/path/to/hadoop
export HADOOP_PREFIX

The heap size of each daemon can also be controlled with its own environment variable:

ResourceManager - YARN_RESOURCEMANAGER_HEAPSIZE
NodeManager - YARN_NODEMANAGER_HEAPSIZE
WebAppProxy - YARN_PROXYSERVER_HEAPSIZE
MapReduce Job History Server - HADOOP_JOB_HISTORYSERVER_HEAPSIZE

Configuring the Hadoop daemons

This section covers the important parameters to specify in the configuration files of a Hadoop cluster.

etc/hadoop/core-site.xml

Parameter - Value - Notes
fs.defaultFS - NameNode URI - hdfs://host:port/
io.file.buffer.size - 131072 - size of the read/write buffer used in SequenceFiles.
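For illustration only, here is a minimal sketch of how these two settings could appear in etc/hadoop/core-site.xml; the host name and port in the URI are placeholders for your own NameNode address:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:9000/</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>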

etc/hadoop/hdfs-site.xml

Used to configure the NameNode:

Parameter - Value - Notes
dfs.namenode.name.dir - local file system path(s) - where the NameNode persistently stores the namespace and transaction logs. If this is a comma-separated list of directories, the name table is replicated in all of them for redundancy.
dfs.hosts / dfs.hosts.exclude - list of permitted / excluded DataNodes - if necessary, use these files to control the list of allowed DataNodes.
dfs.blocksize - 268435456 - an HDFS block size of 256 MB for large file systems.
dfs.namenode.handler.count - 100 - more NameNode server threads to handle RPCs from a large number of DataNodes.

Used to configure DataNode:

Parameter - Value - Notes
dfs.datanode.data.dir - comma-separated list of local file system paths where the DataNode stores its block data - if this is a comma-separated list of directories, the data is stored in all of them, typically on different devices.
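As a rough sketch, the NameNode and DataNode parameters from the two tables above could be combined in a shared etc/hadoop/hdfs-site.xml as follows; the local directory paths are placeholders and should point at your own disks:

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>268435456</value>
    </property>
    <property>
        <name>dfs.namenode.handler.count</name>
        <value>100</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
    </property>
</configuration>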

etc/hadoop/yarn-site.xml

Used to configure ResourceManager and NodeManager:

Parameter - Value - Notes
yarn.acl.enable - true / false - whether to enable ACLs. The default is false.
yarn.admin.acl - Admin ACL - the ACL that defines the administrators of the cluster. ACLs are comma-separated; the default of * means anyone, and the special value of a single space means no one.
yarn.log-aggregation-enable - false - whether to enable log aggregation.
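Expressed as configuration, these shared settings might look like the following sketch in etc/hadoop/yarn-site.xml; the values shown simply restate the defaults from the table:

<configuration>
    <property>
        <name>yarn.acl.enable</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>false</value>
    </property>
</configuration>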

Used to configure ResourceManager:

Parameter - Value - Notes
yarn.resourcemanager.address - ResourceManager host:port - used by clients to submit jobs. If host:port is set, it overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.scheduler.address - ResourceManager host:port - used by ApplicationMasters to talk to the Scheduler to obtain resources. If host:port is set, it overrides the hostname in yarn.resourcemanager.hostname.
yarn.resourcemanager.resource-tracker.address - ResourceManager host:port - used by NodeManagers. If host:port is set, it overrides the hostname in yarn.resourcemanager.hostname.
yarn.resourcemanager.admin.address - ResourceManager host:port - used for administrative commands. If host:port is set, it overrides the hostname in yarn.resourcemanager.hostname.
yarn.resourcemanager.webapp.address - ResourceManager web UI host:port - used for web management. If host:port is set, it overrides the hostname in yarn.resourcemanager.hostname.
yarn.resourcemanager.scheduler.class - ResourceManager scheduler class - CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler.
yarn.scheduler.minimum-allocation-mb - in MB - minimum amount of memory allocated to each container request at the ResourceManager.
yarn.scheduler.maximum-allocation-mb - in MB - maximum amount of memory allocated to each container request at the ResourceManager.
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path - list of permitted / excluded NodeManagers - if necessary, use these files to control the list of allowed NodeManagers.
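A minimal ResourceManager sketch for etc/hadoop/yarn-site.xml is shown below; the host name, scheduler choice, and memory limits are example values, not recommendations from the table, and should be tuned for your cluster:

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>rm.example.com</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
    </property>
</configuration>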

Used to configure NodeManager:

Parameter - Value - Notes
yarn.nodemanager.resource.memory-mb - in MB - the physical memory available on the NodeManager; defines the total resources on the node that can be used to run containers.
yarn.nodemanager.vmem-pmem-ratio - ratio - the maximum ratio by which the virtual memory usage of a task may exceed its physical memory. The total virtual memory used by tasks on the NodeManager may exceed its physical memory by this ratio.
yarn.nodemanager.local-dirs - comma-separated list of local file system paths - where intermediate data is written. Multiple paths help spread disk I/O.
yarn.nodemanager.log-dirs - comma-separated list of local file system paths - where logs are written. Multiple paths help spread disk I/O.
yarn.nodemanager.log.retain-seconds - 10800 - default time (in seconds) to keep log files on the NodeManager; only applicable when log aggregation is disabled.
yarn.nodemanager.remote-app-log-dir - /logs - HDFS directory to which application logs are moved when an application completes. Appropriate permissions must be set. Only applicable when log aggregation is enabled.
yarn.nodemanager.remote-app-log-dir-suffix - logs - suffix appended to the remote log directory.
yarn.nodemanager.aux-services - mapreduce_shuffle - the shuffle service that must be set for MapReduce applications.
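For the NodeManager side, a sketch of etc/hadoop/yarn-site.xml under the same assumptions (the memory figure and the local paths are placeholders to adapt to your hardware) could be:

<configuration>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/data/1/yarn/local,/data/2/yarn/local</value>
    </property>
    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/data/1/yarn/logs,/data/2/yarn/logs</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>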

Used to configure the History Server (needs to be moved elsewhere):

Parameter - Value - Notes
yarn.log-aggregation.retain-seconds - -1 - how long to keep aggregated logs before deleting them; -1 means not enabled. Note that this value should not be set too small.
yarn.log-aggregation.retain-check-interval-seconds - -1 - interval between checks for aggregated log retention; -1 means not enabled. Note that this value should not be set too small.
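As configuration, the two retention settings above would look like this sketch in etc/hadoop/yarn-site.xml (-1 leaves both disabled, as in the table):

<configuration>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>-1</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-check-interval-seconds</name>
        <value>-1</value>
    </property>
</configuration>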

etc/hadoop/mapred-site.xml

Used to configure MapReduce applications:

Parameter - Value - Notes
mapreduce.framework.name - yarn - the execution framework, set to Hadoop YARN.
mapreduce.map.memory.mb - 1536 - resource limit for maps.
mapreduce.map.java.opts - -Xmx1024M - heap size for the child JVMs of maps.
mapreduce.reduce.memory.mb - 3072 - resource limit for reduces.
mapreduce.reduce.java.opts - -Xmx2560M - heap size for the child JVMs of reduces.
mapreduce.task.io.sort.mb - 512 - memory limit for the task-internal sort buffer.
mapreduce.task.io.sort.factor - 100 - number of streams merged at once while sorting files.
mapreduce.reduce.shuffle.parallelcopies - 50 - number of parallel copies run by reduces to fetch output from a large number of maps.
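A minimal sketch of etc/hadoop/mapred-site.xml using the example values from the table above; the memory and heap figures are illustrative and should be sized for your workload:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>3072</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx2560M</value>
    </property>
</configuration>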

Used to configure MapReduce JobHistory Server:

Parameter - Value - Notes
mapreduce.jobhistory.address - MapReduce JobHistory Server host:port - default port is 10020.
mapreduce.jobhistory.webapp.address - MapReduce JobHistory Server web UI host:port - default port is 19888.
mapreduce.jobhistory.intermediate-done-dir - /mr-history/tmp - directory where history files are written by MapReduce jobs.
mapreduce.jobhistory.done-dir - /mr-history/done - directory where history files are managed by the MR JobHistory Server.
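For illustration, the JobHistory Server settings above could be added to etc/hadoop/mapred-site.xml as in this sketch; the host name is a placeholder:

<configuration>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>jhs.example.com:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>jhs.example.com:19888</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/mr-history/tmp</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/mr-history/done</value>
    </property>
</configuration>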

3. Monitoring the health of NodeManagers

Hadoop provides a mechanism by which administrators can configure the NodeManager to run a supplied script periodically to determine whether a node is healthy.

The script can perform any checks the administrator chooses and, if it finds the node to be unhealthy, print a line beginning with the string ERROR to standard output. The NodeManager runs the script periodically and checks its output; if the output contains an ERROR line, as described above, the node is reported as unhealthy, the ResourceManager blacklists it, and no further tasks are assigned to it. The NodeManager keeps running the script, so if the node becomes healthy again it is automatically removed from the ResourceManager blacklist. The node's health, along with the script output if it is unhealthy, is available to administrators in the ResourceManager web interface.

The following parameters in etc/hadoop/yarn-site.xml control the node health check script:

Parameter - Value - Notes
yarn.nodemanager.health-checker.script.path - node health script - the script that checks the node's health status.
yarn.nodemanager.health-checker.script.opts - node health script options - options passed to the health check script.
yarn.nodemanager.health-checker.script.interval-ms - node health script interval - time interval for running the health script.
yarn.nodemanager.health-checker.script.timeout-ms - node health script timeout - timeout for execution of the health script.
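A sketch of how these parameters could be set in etc/hadoop/yarn-site.xml; the script path and the interval and timeout values are illustrative placeholders, not values taken from the article:

<configuration>
    <property>
        <name>yarn.nodemanager.health-checker.script.path</name>
        <value>/etc/hadoop/health_check.sh</value>
    </property>
    <property>
        <name>yarn.nodemanager.health-checker.script.interval-ms</name>
        <value>600000</value>
    </property>
    <property>
        <name>yarn.nodemanager.health-checker.script.timeout-ms</name>
        <value>60000</value>
    </property>
</configuration>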

The health check script is not expected to report ERROR just because a local disk has gone bad. The NodeManager can itself periodically check the health of the local disks (specifically the nodemanager-local-dirs and nodemanager-log-dirs directories), and once the threshold set by yarn.nodemanager.disk-health-checker.min-healthy-disks is reached, the whole node is marked unhealthy.

4. Slaves File

List all slave hostnames or IP addresses in the etc/hadoop/slaves file, one per line. The helper scripts use this file to run commands on many machines at once; it is not used for any of the Java-based Hadoop configuration. To use this functionality, ssh trust must be established for the account used to run Hadoop, so configure ssh login when installing Hadoop.

5. Hadoop rack awareness

Many Hadoop components benefit from rack awareness, which brings significant improvements in performance and safety. The Hadoop daemons obtain the rack information of the cluster slaves by invoking an administrator-configured module. For more on rack awareness, see http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/RackAwareness.html.

Rack awareness is highly recommended when using HDFS.

6. Logging

Hadoop uses Apache log4j as its logging framework; edit the etc/hadoop/log4j.properties file to customize the logging configuration.

7. Operating the Hadoop cluster

Once all the necessary configuration is complete, distribute the files in the HADOOP_CONF_DIR directory to all machines; the Hadoop installation directory should be the same path on every machine.

In general, it is recommended that HDFS and YARN run as separate users. In most installations, the HDFS processes run as "hdfs" and YARN uses the "yarn" account.

Starting Hadoop

To start the Hadoop cluster, you need to start the HDFS and YARN clusters.

The first time you use HDFS, it must be formatted. Format a new distributed file system as hdfs:

[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format

As hdfs, start the HDFS NameNode on the designated node with the following command:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

As hdfs, start an HDFS DataNode on each designated node with the following command:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

As hdfs, if etc/hadoop/slaves and ssh trusted access are configured, all HDFS processes can be started with a single utility script:

[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh

As yarn, start YARN, running the ResourceManager on the designated node, with the following command:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

As yarn, run the following script to start a NodeManager on each slave machine:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager

As yarn, start a standalone WebAppProxy server. If multiple servers are used for load balancing, the command should be run on each of them:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver

As yarn, if etc/hadoop/slaves and ssh trusted access are configured, all YARN processes can be started with a single utility script:

[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh

As mapred, start the MapReduce JobHistory Server with the following command:

[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver

Stopping Hadoop

As hdfs, stop the NameNode with the following command:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode

As hdfs, run the following script to stop the DataNode on each slave:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode

As hdfs, if etc/hadoop/slaves and ssh trusted access are configured, all HDFS processes can be stopped with a single utility script:

[hdfs]$ $HADOOP_PREFIX/sbin/stop-dfs.sh

As yarn, stop the ResourceManager with the following command:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager

As yarn, run the following script to stop the NodeManager on each slave machine:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR stop nodemanager

As yarn, if etc/hadoop/slaves and ssh trusted access are configured, all YARN processes can be stopped with a single utility script:

[yarn]$ $HADOOP_PREFIX/sbin/stop-yarn.sh

As yarn, stop the WebAppProxy server. If multiple servers are used for load balancing, the command should be run on each of them:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop proxyserver

As mapred, stop the MapReduce JobHistory Server with the following command:

[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver

8. Web interfaces

Once Hadoop is up, you can view the following web interfaces:

Daemon - Web interface - Notes
NameNode - http://nn_host:port/ - default HTTP port is 50070.
ResourceManager - http://rm_host:port/ - default HTTP port is 8088.
MapReduce JobHistory Server - http://jhs_host:port/ - default HTTP port is 19888.

Example: the WordCount word-frequency program

The following is the WordCount word-frequency example provided with Hadoop. Before running the program, make sure HDFS has been started.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;

public class WordCount2 {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    static enum CountersEnum { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();

    private Configuration conf;
    private BufferedReader fis;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
      conf = context.getConfiguration();
      caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
      if (conf.getBoolean("wordcount.skip.patterns", false)) {
        // Load the skip-pattern file(s) distributed via the DistributedCache.
        URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
        for (URI patternsURI : patternsURIs) {
          Path patternsPath = new Path(patternsURI.getPath());
          String patternsFileName = patternsPath.getName().toString();
          parseSkipFile(patternsFileName);
        }
      }
    }

    private void parseSkipFile(String fileName) {
      try {
        fis = new BufferedReader(new FileReader(fileName));
        String pattern = null;
        while ((pattern = fis.readLine()) != null) {
          patternsToSkip.add(pattern);
        }
      } catch (IOException ioe) {
        System.err.println("Caught exception while parsing the cached file '"
            + StringUtils.stringifyException(ioe));
      }
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = (caseSensitive) ?
          value.toString() : value.toString().toLowerCase();
      // Strip every configured skip pattern before tokenizing.
      for (String pattern : patternsToSkip) {
        line = line.replaceAll(pattern, "");
      }
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
        Counter counter = context.getCounter(CountersEnum.class.getName(),
            CountersEnum.INPUT_WORDS.toString());
        counter.increment(1);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
    String[] remainingArgs = optionParser.getRemainingArgs();
    if ((remainingArgs.length != 2) && (remainingArgs.length != 4)) {
      System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount2.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    List<String> otherArgs = new ArrayList<String>();
    for (int i = 0; i < remainingArgs.length; ++i) {
      if ("-skip".equals(remainingArgs[i])) {
        job.addCacheFile(new Path(remainingArgs[++i]).toUri());
        job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
      } else {
        otherArgs.add(remainingArgs[i]);
      }
    }
    FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The sample input files are as follows:

$ bin/hadoop fs -ls /user/joe/wordcount/input/
/user/joe/wordcount/input/file01
/user/joe/wordcount/input/file02

$ bin/hadoop fs -cat /user/joe/wordcount/input/file01
Hello World, Bye World!

$ bin/hadoop fs -cat /user/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.

Run the program:

$ bin/hadoop jar wc.jar WordCount2 /user/joe/wordcount/input /user/joe/wordcount/output

The output is as follows:

$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1

Set the word filtering policy through DistributedCache:

$ bin/hadoop fs -cat /user/joe/wordcount/patterns.txt
\.
\,
\!
to

Run it again, this time with more options:

$ bin/hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=true /user/joe/wordcount/input /user/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

The output is as follows:

$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 1
Hello 2
World 2
hadoop 1

Run it again, this time without case sensitivity:

$ bin/hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=false /user/joe/wordcount/input /user/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

The output is as follows:

$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2

Thank you for reading! This concludes the article on "What are the core components of Apache Hadoop?". I hope the above content has been helpful and that you have learned something from it. If you found the article useful, please share it so that more people can see it.
