
Getting started with Hadoop


1 Big data concepts

Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models in order to deliver stronger decision-making power, insight, and process-optimization capability.

It mainly addresses the storage of massive data and the analysis and computation of massive data.

1.1 characteristics of big data

Volume (massive scale)

Velocity (high speed)

Variety (diversity)

Value (low value density)

1.2 Application scenarios of big data

Logistics and warehousing: big data analysis systems help businesses refine their operations, increase sales, and reduce costs.

Retail: analyze users' consumption habits and make purchasing more convenient, thereby improving product sales.

Tourism: combine big data capabilities deeply with the needs of the tourism industry to build its future of intelligent management, intelligent service, and intelligent marketing.

Product recommendation: recommend products according to users' purchase records.

Insurance: mine massive data and predict risk to help the insurance industry with precision marketing and finer-grained pricing.

Finance: characterize users along multiple dimensions to help financial institutions identify high-quality customers and guard against fraud.

Real estate: big data supports the real estate industry across the board, enabling more accurate investment and marketing, selecting more suitable land, building more suitable properties, and selling them to more suitable buyers.

Artificial intelligence: built on top of big data.

2 Big data's ecosystem, discussed from the Hadoop framework 2.1 What is Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation.

It mainly solves two problems: the storage of massive data and the analysis and computation of massive data.

In a broad sense, Hadoop usually refers to a broader concept: the Hadoop ecosystem.

2.2 Hadoop distributions

Apache: the most original (basic) version, best for getting started.

Cloudera: widely used in large Internet enterprises.

Hortonworks: comes with good documentation.

2.3 Advantages of Hadoop

High reliability: Hadoop maintains multiple copies of data at the bottom layer, so the failure of a compute element or a piece of storage does not cause data loss.

High scalability: tasks and data are distributed across the cluster, which can easily be scaled out to thousands of nodes.

Efficiency: following the MapReduce idea, Hadoop works in parallel to speed up task processing.

High fault tolerance: failed tasks are automatically reassigned.

2.4 Hadoop composition

2.4.1 Overview of HDFS architecture

NameNode (nn): stores the metadata of a file, such as file name, file directory structure, file attributes (generation time, number of copies, file permissions), as well as the list of blocks of each file and the DataNode where the blocks are located.

DataNode (dn): stores block data on the local file system, as well as a checksum of the block data.

Secondary NameNode (2nn): a secondary daemon used to monitor the status of HDFS, taking snapshots of HDFS metadata at regular intervals.
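Once a cluster is running (see section 4), this division of labor is easy to observe with the standard fsck tool; a minimal sketch, assuming the test file that is uploaded to HDFS later in this article:

# ask the NameNode for the file's block list and the DataNodes holding each block
hdfs fsck /user/djm/input/wc.input -files -blocks -locations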

2.4.2 Overview of YARN architecture

2.4.3 Overview of MapReduce architecture

MapReduce divides the calculation process into two stages: Map and Reduce

The Map stage processes the input data in parallel

The Reduce stage summarizes the results of the Map stage
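For example, in a word count job the Map stage turns a line such as "hello world hello" into the pairs (hello, 1), (world, 1), (hello, 1) in parallel, and the Reduce stage then groups by key and sums the values, producing (hello, 2) and (world, 1).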

2.5 Big data's technology ecosystem

The technical terms involved in the figure are explained as follows:

1) Sqoop: Sqoop is an open-source tool mainly used to transfer data between Hadoop/Hive and traditional databases such as MySQL. It can import data from a relational database into Hadoop's HDFS, or export data from HDFS into a relational database.
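As a rough sketch of what such a transfer looks like on the command line (the connection string, table names, and directories are illustrative placeholders, not values from this article):

# import a MySQL table into HDFS
sqoop import --connect jdbc:mysql://hadoop101:3306/testdb --username root --password 123456 \
  --table orders --target-dir /user/djm/orders --num-mappers 1
# export an HDFS directory back into a MySQL table
sqoop export --connect jdbc:mysql://hadoop101:3306/testdb --username root --password 123456 \
  --table orders_result --export-dir /user/djm/output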

2) Flume: Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive logs, provided by Cloudera. Flume supports customizing various data senders in the log system to collect data; at the same time, it can do simple processing of the data and write it to various (customizable) data receivers.
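A minimal agent definition, adapted from Flume's own netcat-to-logger example, gives a feel for the source/channel/sink model (the file name example.conf and the agent name a1 are arbitrary):

# example.conf - one netcat source, one in-memory channel, one logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# start the agent
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console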

3) Kafka: Kafka is a high-throughput distributed publish/subscribe messaging system with the following features (a small command-line sketch follows the list):

(1) Message persistence is provided through an O(1) disk data structure, which maintains stable performance over long periods even with terabytes of stored messages.

(2) High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.

(3) Messages can be partitioned across Kafka servers and consumed in parallel by clusters of consumer machines.

(4) Data can be loaded into Hadoop in parallel.
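A minimal sketch with the scripts shipped in Kafka's bin directory (host names and the topic name are illustrative; older Kafka versions create topics via ZooKeeper as shown here):

# create a topic
kafka-topics.sh --create --zookeeper hadoop101:2181 --replication-factor 1 --partitions 1 --topic test
# write messages from the console
kafka-console-producer.sh --broker-list hadoop101:9092 --topic test
# read them back from the beginning
kafka-console-consumer.sh --bootstrap-server hadoop101:9092 --topic test --from-beginning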

4) Storm:Storm is used for "continuous computing" to continuously query the data stream and output the results to the user in the form of a stream during the calculation.

5) Spark: Spark is currently the most popular open-source in-memory computing framework for big data. It can run computations over the big data stored in Hadoop.

6) Oozie: Oozie is a workflow scheduling and management system for managing Hadoop jobs.

7) HBase: HBase is a distributed, column-oriented open-source database. Unlike a typical relational database, HBase is suited to storing unstructured data.
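A tiny session in the HBase shell illustrates the row-key/column-family model (the table and column names are made up for illustration):

hbase shell
create 'user', 'info'                        # table 'user' with one column family 'info'
put 'user', '1001', 'info:name', 'zhangsan'  # write one cell for row key 1001
get 'user', '1001'                           # read the row back
scan 'user'                                  # scan the whole table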

8) Hive: Hive is a data warehouse tool built on Hadoop. It can map structured data files onto database tables, provides a simple SQL query capability, and converts SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it very suitable for statistical analysis in a data warehouse.
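A small sketch of the SQL-to-MapReduce idea, assuming a hypothetical table doc(line STRING) loaded from the wc.input file used later in this article; the SELECT below is compiled into a MapReduce job:

hive -e "
CREATE TABLE IF NOT EXISTS doc (line STRING);
LOAD DATA INPATH '/user/djm/input/wc.input' INTO TABLE doc;
SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM doc) t
GROUP BY word;
"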

10) R language: R is a language and environment for statistical analysis and graphics. It is free, open-source software belonging to the GNU system, and an excellent tool for statistical computation and statistical plotting.

11) Mahout:Apache Mahout is an extensible machine learning and data mining library.

12) ZooKeeper: ZooKeeper is an open-source implementation of Google's Chubby. It is a reliable coordination system for large-scale distributed systems, providing configuration maintenance, naming services, distributed synchronization, group services, and more. The goal of ZooKeeper is to encapsulate complex and error-prone key services and to give users easy-to-use interfaces and a system that is efficient and stable.

2.6 recommendation system framework diagram

3 Building the Hadoop running environment 3.1 Virtual machine environment preparation

Turn off the firewall

# stop the firewall
systemctl stop firewalld
# disable the firewall on boot
systemctl disable firewalld

Create a user

# create the user
useradd djm
# set its password
passwd djm

Configure the user with root privileges by adding the following line to /etc/sudoers (edit it with visudo):

djm ALL=(ALL) NOPASSWD:ALL

Create two folders under the /opt directory

sudo mkdir /opt/software
sudo mkdir /opt/module

3.2 Install JDK

Uninstall an existing Java

rpm -qa | grep java | xargs sudo rpm -e --nodeps

Extract to / opt/module directory

tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/module/

Configure environment variables

sudo vim /etc/profile

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin

Refresh configuration

source /etc/profile

Test whether the installation is successful

java -version

3.3 Install Hadoop

Extract to / opt/module directory

tar -zxvf hadoop-2.7.2.tar.gz -C /opt/module/

Configure environment variables

sudo vim /etc/profile

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Refresh configuration

source /etc/profile

Test whether the installation is successful

hadoop version

3.4 Hadoop directory structure

bin directory: scripts for operating the Hadoop-related services (HDFS, YARN)

etc directory: Hadoop's configuration files

lib directory: Hadoop's native libraries (data compression and decompression)

sbin directory: scripts for starting and stopping the Hadoop-related services

share directory: Hadoop's dependency jars, documentation, and official examples

4 Hadoop operation mode 4.1 local operation mode

Create an input folder

[djm@hadoop101 hadoop-2.7.2]$ mkdir input

Copy the xml configuration file of Hadoop to input

[djm@hadoop101 hadoop-2.7.2]$ cp etc/hadoop/*.xml input

Execute the MapReduce program in the share directory

# output must be a folder that does not yet exist
[djm@hadoop101 hadoop-2.7.2]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'

View the output result

[djm@hadoop101 hadoop-2.7.2]$ cat output/*

4.2 Pseudo-distributed operation mode 4.2.1 Start HDFS and run the MapReduce program

Configure hadoop-env.sh

# modify JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144

Configure core-site.xml

<!-- NameNode address -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop101:9000</value>
</property>
<!-- Directory for files generated at runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>

Configure hdfs-site.xml

<!-- Number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

Start the cluster

# format the NameNode
[djm@hadoop101 hadoop-2.7.2]$ hdfs namenode -format
# start the NameNode
[djm@hadoop101 hadoop-2.7.2]$ hadoop-daemon.sh start namenode
# start the DataNode
[djm@hadoop101 hadoop-2.7.2]$ hadoop-daemon.sh start datanode

Check to see if the startup is successful

jps
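If both daemons came up, jps should list something like the following (the process IDs are only illustrative):

3456 NameNode
3570 DataNode
3655 Jps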

View the HDFS file system in the browser

http://hadoop101:50070/dfshealth.html#tab-overview

Operation cluster

# create an input directory on the HDFS file system
[djm@hadoop101 hadoop-2.7.2]$ hdfs dfs -mkdir -p /user/djm/input
# upload the test file to the file system
[djm@hadoop101 hadoop-2.7.2]$ hdfs dfs -put wcinput/wc.input /user/djm/input/
# run the MapReduce program
[djm@hadoop101 hadoop-2.7.2]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/djm/input/ /user/djm/output
# view the result
[djm@hadoop101 hadoop-2.7.2]$ hdfs dfs -cat /user/djm/output/*
# delete the result
[djm@hadoop101 hadoop-2.7.2]$ hdfs dfs -rm -r /user/djm/output

Why can't the NameNode be formatted over and over, and what should you pay attention to when re-formatting it?

When the file system is formatted, a dfs/data/current/VERSION file is saved in the NameNode data folder (that is, the local path configured for dfs.name.dir), recording the clusterID and datanodeUuid. Formatting the NameNode again generates a new clusterID, but the DataNode's VERSION file still records the clusterID saved during the first format, so the IDs of the DataNode and the NameNode no longer match. The solution is to delete the VERSION file (or the whole data directory) before re-formatting.
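A minimal sketch of a clean re-format, assuming the data directory configured in hadoop.tmp.dir above (note that this wipes everything stored in HDFS):

# stop the daemons first
hadoop-daemon.sh stop datanode
hadoop-daemon.sh stop namenode
# remove the old metadata (including the VERSION files) and the logs
rm -rf /opt/module/hadoop-2.7.2/data /opt/module/hadoop-2.7.2/logs
# format again
hdfs namenode -format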

4.2.2 start YARN and run the MapReduce program

Configure yarn-env.sh

# modify JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144

Configure yarn-site.xml

<!-- How the Reducer obtains data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Host name of YARN's ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop101</value>
</property>

Configure mapred-env.sh

# modify JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144

Configure mapred-site.xml

# rename mapred-site.xml.template to mapred-site.xml
[djm@hadoop101 hadoop-2.7.2]$ mv mapred-site.xml.template mapred-site.xml

<!-- Run MapReduce on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

Start the cluster

# start the NameNode
[djm@hadoop101 hadoop-2.7.2]$ hadoop-daemon.sh start namenode
# start the DataNode
[djm@hadoop101 hadoop-2.7.2]$ hadoop-daemon.sh start datanode
# start the ResourceManager
[djm@hadoop101 hadoop-2.7.2]$ sbin/yarn-daemon.sh start resourcemanager
# start the NodeManager
[djm@hadoop101 hadoop-2.7.2]$ sbin/yarn-daemon.sh start nodemanager

View YARN on web

http://hadoop101:8088/cluster

Cluster operation

# delete the existing output directory on the file system
[djm@hadoop101 hadoop-2.7.2]$ hdfs dfs -rm -R /user/djm/output
# execute the MapReduce program
[djm@hadoop101 hadoop-2.7.2]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/djm/input /user/djm/output

4.2.3 Configure the history server

Configure mapred-site.xml

<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop101:10020</value>
</property>
<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop101:19888</value>
</property>

Start the history server

mr-jobhistory-daemon.sh start historyserver

View JobHistory

http://hadoop101:19888/jobhistory

4.2.4 configure aggregation of logs

Log aggregation concept: after an application finishes running, its log information is uploaded to HDFS.

Benefit of log aggregation: you can easily view the details of a program run, which is convenient for development and debugging.

Configure yarn-site.xml

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Keep aggregated logs for 7 days (604800 seconds) -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Restart NodeManager, ResourceManager, and HistoryServer

[djm@hadoop101 hadoop-2.7.2]$ yarn-daemon.sh stop resourcemanager
[djm@hadoop101 hadoop-2.7.2]$ yarn-daemon.sh stop nodemanager
[djm@hadoop101 hadoop-2.7.2]$ mr-jobhistory-daemon.sh stop historyserver
[djm@hadoop101 hadoop-2.7.2]$ yarn-daemon.sh start resourcemanager
[djm@hadoop101 hadoop-2.7.2]$ yarn-daemon.sh start nodemanager
[djm@hadoop101 hadoop-2.7.2]$ mr-jobhistory-daemon.sh start historyserver

Delete output files that already exist on HDFS

[djm@hadoop101 hadoop-2.7.2]$ hdfs dfs -rm -R /user/djm/output

Execute the WordCount program

[djm@hadoop101 hadoop-2.7.2]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/djm/input /user/djm/output

View the log

http://hadoop101:19888/jobhistory

4.2.5 Configuration file description

Hadoop configuration files fall into two categories: default configuration files and custom configuration files. Only when you want to override a default value do you need to edit the corresponding custom configuration file and change the relevant property.

Custom configuration files:

The four configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under the $HADOOP_HOME/etc/hadoop path, and users can modify them according to the needs of the project.
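To check which value is actually in effect after editing these files, the getconf helper can be handy (the key below is just an example):

hdfs getconf -confKey fs.defaultFS   # prints hdfs://hadoop101:9000 with the core-site.xml above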

4.3 fully distributed operation mode 4.3.1 Writing cluster distribution script xsync

Create xsync

[djm@hadoop102 ~]$ mkdir bin
[djm@hadoop102 ~]$ cd bin/
[djm@hadoop102 bin]$ touch xsync
[djm@hadoop102 bin]$ vi xsync

Write the following code in this file

#!/bin/bash
#1 get the number of input parameters; if there are none, exit immediately
pcount=$#
if ((pcount==0)); then
    echo no args;
    exit;
fi

#2 get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname

#3 get the absolute path of the parent directory
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir

#4 get the current user name
user=`whoami`

#5 loop over the other hosts and rsync the file to each of them
# (the original text is truncated here; the host range below is an assumption - adjust it to your cluster)
for ((host=103; host<105; host++)); do
    echo ------------------- hadoop$host --------------
    rsync -rvl $pdir/$fname $user@hadoop$host:$pdir
done
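A typical way to use the script afterwards, assuming ~/bin is on the PATH (the directory being synced is only an example):

chmod +x ~/bin/xsync
xsync /home/djm/bin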
