
How to install and configure Hadoop

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article mainly introduces how to install and configure Hadoop. In daily operation, many people have doubts about this topic, so the editor has consulted various materials and put together a simple, easy-to-use set of steps. I hope it helps clear up your doubts about installing and configuring Hadoop. Now, please follow the editor and work through it!

4. Data acquisition module

4.1 Script to view all processes in the cluster

1) create the script xcall.sh under the /home/atguigu/bin directory

vim xcall.sh

2) write the following in the script

#!/bin/bash
# 1. Check that at least one parameter was entered
if [ $# -lt 1 ]
then
    echo "At least one parameter must be entered."
    exit
fi
# 2. Execute the given command on every host
for host in hadoop102 hadoop103 hadoop104
do
    echo "=============== $host ==============="
    ssh $host "$*"
done
# Example: xcall.sh mkdir -p /xx/xx

3) modify script execution permissions

chmod +x xcall.sh

4) start the script

xcall.sh jps
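The script relies on the nodes already trusting each other over SSH. If that is not yet in place, a minimal sketch of the usual passwordless-login setup (run once per node, with the hostnames used above) looks like this:

# Generate a key pair if one does not exist yet, then copy the public key to every node
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in hadoop102 hadoop103 hadoop104
do
    ssh-copy-id $host
done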

4.2 Hadoop installation

Install Hadoop

4.2.1 Project experience: HDFS multi-directory storage

1) Production-environment server disk layout

2) configure multiple directories in the hdfs-site.xml file, and pay attention to the access permission of the newly mounted disk.

The path where an HDFS DataNode stores its data is determined by the dfs.datanode.data.dir parameter, whose default value is file://${hadoop.tmp.dir}/dfs/data. If the server has multiple disks, this parameter must be modified. If the server's disks are mounted as shown in the figure above, the parameter should be changed to the following value.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>

Note: each server mounts different disks, so the multi-directory configuration of each node may differ; it can be configured separately on each node.
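A quick way to sanity-check this on a running cluster, as a sketch that assumes the xcall.sh script from section 4.1 and the standard Hadoop 3 daemon commands (the mount points are the example values above):

# Confirm which disks are actually mounted on every node before editing hdfs-site.xml
xcall.sh df -h
# After distributing the new hdfs-site.xml, restart the DataNode on each node so it picks up the new directories
hdfs --daemon stop datanode
hdfs --daemon start datanode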

Summary:
1. HDFS multi-directory storage
   1) Benefits: increases storage capacity and improves IO concurrency.
   2) How to configure: in hdfs-site.xml, set dfs.datanode.data.dir to a comma-separated list of file://<disk mount point>/<storage path> entries. In a real production environment the number of disks and mount points can differ from server to server, so this property usually has to be configured separately on each server.

4.2.2 Cluster data balancing

Data may accumulate on one or a few DataNode nodes, or, when a server has multiple disks, on a single disk. The affected nodes or disks then carry a disproportionately large load, so both node-level and disk-level data balancing are needed.

1. Node data balancing
   - Start: start-balancer.sh -threshold N (N means the difference in disk utilization between nodes must not exceed N%)
   - Stop: stop-balancer.sh
2. Disk data balancing
   - Generate an execution plan: hdfs diskbalancer -plan <hostname>
   - Execute the plan: hdfs diskbalancer -execute <hostname>.plan.json
   - Check progress: hdfs diskbalancer -query <hostname>
   - Cancel: hdfs diskbalancer -cancel <hostname>.plan.json

Whether it is node balancing or disk balancing, run it when the cluster is idle, because balancing consumes a large amount of disk IO and network IO. A concrete session is sketched below.
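As an illustration only (the threshold value and the hostname are assumptions, not values from the original), a balancing session might look like this:

# Node-level balancing: keep going until disk utilization differs by no more than 10% between nodes
sbin/start-balancer.sh -threshold 10
# Disk-level balancing on a single DataNode (hadoop102 used as an example)
hdfs diskbalancer -plan hadoop102
hdfs diskbalancer -execute hadoop102.plan.json
hdfs diskbalancer -query hadoop102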

4.2.3 Project experience: LZO compression configuration

Hadoop itself does not support LZO compression; to enable it, additional configuration is required.

1) put the compiled hadoop-lzo-0.4.20.jar into hadoop-3.1.3/share/hadoop/common/

[atguigu@hadoop102 common]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/common
[atguigu@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar

2) synchronize hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104

[atguigu@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar

3) add configuration to core-site.xml to enable LZO compression

<property>
    <name>io.compression.codecs</name>
    <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

com.hadoop.compression.lzo.LzoCodec: MR will not split LZO-compressed files when reading them later. com.hadoop.compression.lzo.LzopCodec: MR will split LZO-compressed files when reading them later (once they have been indexed; see 4.2.4).

4) synchronize core-site.xml to hadoop103 and hadoop104

[atguigu@hadoop102 hadoop]$ xsync core-site.xml

5) start and view the cluster

[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh

4.2.4 Project experience: LZO indexing

1) Create an index for the LZO file. The splittability of an LZO-compressed file depends on its index, so we need to create the index manually. Without an index, an LZO file yields only one split.
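For a quick test, an LZO file can be produced locally and uploaded to HDFS before indexing. This is a sketch that assumes the lzop command-line tool is installed and uses hypothetical file and directory names:

# Compress a sample log file; lzop writes app.log.lzo next to the original
lzop app.log
# Upload it to a test directory in HDFS
hadoop fs -mkdir -p /origin_data/test
hadoop fs -put app.log.lzo /origin_data/test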

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer <path of the .lzo file to index>

4.2.5 Project experience: benchmarking

1. HDFS throughput test
① Write throughput:
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -write -nrFiles N -size 128MB
// -nrFiles should be set to the total number of CPUs in the cluster minus 1
② Read throughput:
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -read -nrFiles N -size 128MB
Test results:
2020-04-16 13:41:24 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
2020-04-16 13:41:24 INFO fs.TestDFSIO:            Date & time: Thu Apr 16 13:41:24 CST 2020
2020-04-16 13:41:24 INFO fs.TestDFSIO:        Number of files: 10
2020-04-16 13:41:25 INFO fs.TestDFSIO: Total MBytes processed: 1280
2020-04-16 13:41:25 INFO fs.TestDFSIO:      Throughput mb/sec: 8.88
2020-04-16 13:41:25 INFO fs.TestDFSIO: Average IO rate mb/sec: 8.96
2020-04-16 13:41:25 INFO fs.TestDFSIO:     Test exec time sec: 67.61
HDFS read/write throughput = throughput of a single map task x nrFiles
③ Delete the data generated by the test:
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -clean
2. MR computing performance test
(1) Use RandomWriter to generate random data: each node runs 10 Map tasks and each Map generates a binary random file of about 1 GB.
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar randomwriter random-data
(2) Execute the Sort program.
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar sort random-data sorted-data
(3) Verify that the data is really sorted.
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar testmapredsort -sortInput random-data -sortOutput sorted-data

4.2.6 Project experience: Hadoop parameter tuning

1. Adjust the size of the NameNode thread pool.
Reason: the NameNode has a thread pool whose threads mainly serve DataNode heartbeats and client requests to read and write metadata, so when concurrency is high or there are many DataNodes the pool may run out of threads. The default value of dfs.namenode.handler.count usually needs to be increased.
How to adjust: configure dfs.namenode.handler.count in hdfs-site.xml; the default is 10, and the recommended value is 20 x ln(number of machines in the cluster). A quick worked calculation is sketched after this list.
2. Adjust the amount of memory the NodeManager may use.
Reason: not all of a server's resources should be handed to the NodeManager; by default it is given 8 GB of memory, which is usually not enough in real production.
How to adjust: configure yarn.nodemanager.resource.memory-mb in yarn-site.xml; it is generally set to 70%-80% of the server's total memory.
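As a worked example of the handler-count formula above (the cluster size of 3 is an assumption based on the three hadoop10x hosts used throughout this article; awk's log() is the natural logarithm):

# Recommended dfs.namenode.handler.count for a 3-node cluster: 20 * ln(3), about 21
awk 'BEGIN{ printf "%d\n", 20 * log(3) }'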

4.3 Zookeeper installation

4.3.1 Install ZK

Install Zookeeper
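The installation steps themselves are not spelled out here. Purely as a sketch, a typical 3-node configuration matching the /opt/module/zookeeper path used below might look like this (the directory names and server ids are assumptions, not the article's values):

# Minimal zoo.cfg for the three-node cluster (written on hadoop102, then distributed with xsync)
cat > /opt/module/zookeeper/conf/zoo.cfg <<EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/module/zookeeper/zkData
clientPort=2181
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888
EOF
# Each node also needs a myid file whose number matches its own server.X entry
echo 2 > /opt/module/zookeeper/zkData/myid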

4.3.2 ZK cluster start/stop script

#!/bin/bash
# 1. Check that a parameter was passed in
if [ $# -lt 1 ]
then
    echo "A parameter must be passed."
    exit
fi
# 2. Execute the corresponding logic according to the parameter
case $1 in
"start")
    for host in hadoop102 hadoop103 hadoop104
    do
        echo "=============== start $host zookeeper ==============="
        ssh $host "/opt/module/zookeeper/bin/zkServer.sh start"
    done
;;
"stop")
    for host in hadoop102 hadoop103 hadoop104
    do
        echo "=============== stop $host zookeeper ==============="
        ssh $host "/opt/module/zookeeper/bin/zkServer.sh stop"
    done
;;
"status")
    for host in hadoop102 hadoop103 hadoop104
    do
        echo "=============== status $host zookeeper ==============="
        ssh $host "/opt/module/zookeeper/bin/zkServer.sh status"
    done
;;
*)
    echo "Parameter input error."
;;
esac

Zookeeper common commands:
1. Create a node: create <path>
2. Save data: set <path> <data>
3. Get data: get <path>
4. View nodes: ls <path>
5. Delete a node:
   - Delete a non-empty node: deleteall <path>
   - Delete an empty node: delete <path>

4.4 Kafka installation

4.4.1 Kafka cluster installation

Install Kafka

4.4.2 Kafka cluster start/stop script

#!/bin/bash
# 1. Check that a parameter was passed in
if [ $# -lt 1 ]
then
    echo "A parameter must be passed."
    exit
fi
# 2. Match the logic according to the parameter
case $1 in
"start")
    for host in hadoop102 hadoop103 hadoop104
    do
        echo "=============== start $host kafka ==============="
        ssh $host "/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties"
    done
;;
"stop")
    for host in hadoop102 hadoop103 hadoop104
    do
        echo "=============== stop $host kafka ==============="
        ssh $host "/opt/module/kafka/bin/kafka-server-stop.sh"
    done
;;
"status")
    for host in hadoop102 hadoop103 hadoop104
    do
        echo "=============== status $host kafka ==============="
        pid=$(ssh $host "ps -ef | grep server.properties | grep -v grep")
        [ "$pid" ] && echo "kafka process is normal" || echo "kafka process does not exist"
    done
;;
*)
    echo "Parameter input error."
;;
esac

4.4.3 Kafka common commands

1. Topic related
   1) Create a topic:
      bin/kafka-topics.sh --create --topic <topic name> --bootstrap-server hadoop102:9092,hadoop103:9092 --partitions <number of partitions> --replication-factor <number of replicas>
   2) List all topics in the cluster:
      bin/kafka-topics.sh --list --bootstrap-server hadoop102:9092
   3) View topic details:
      bin/kafka-topics.sh --describe --topic <topic name> --bootstrap-server hadoop102:9092
   4) Modify a topic:
      bin/kafka-topics.sh --alter --topic <topic name> --bootstrap-server hadoop102:9092 --partitions <number of partitions>
   5) Delete a topic:
      bin/kafka-topics.sh --delete --topic <topic name> --bootstrap-server hadoop102:9092
2. Producer related:
   bin/kafka-console-producer.sh --topic <topic name> --broker-list hadoop102:9092
3. Consumer related:
   1) Consume data:
      bin/kafka-console-consumer.sh --topic <topic name> --bootstrap-server hadoop102:9092 [--group <consumer group id>] [--from-beginning]
   2) Check a consumer group's progress on a topic:
      bin/kafka-consumer-groups.sh --all-groups --all-topics --describe --bootstrap-server hadoop102:9092
4. Data related:
   bin/kafka-dump-log.sh --files <path of the file to view> --print-data-log

4.4.4 Project experience: Kafka stress test

1. Kafka throughput test
   1) Read throughput:
      bin/kafka-consumer-perf-test.sh --broker-list hadoop102:9092,hadoop103:9092,hadoop104:9092 --topic <topic name> --messages <number of messages to pull>
      Test results:
      start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
      2019-02-19 20:29:07:566, 2019-02-19 20:29:12:170, 9.5368, 2.0714, 100010, 21722.4153
      Meaning: between the start and end of the test, 9.5368 MB of data was consumed in total, a throughput of 2.0714 MB/s, 100010 messages in total, an average of 21722.4153 messages per second. MB.sec is the throughput test result.
   2) Write throughput:
      bin/kafka-producer-perf-test.sh --record-size <size of each record> --topic <topic name> --num-records <number of records to write> --producer-props bootstrap.servers=hadoop102:9092 --throughput <write rate, -1 means unlimited>
      100000 records sent, 95877.277085 records/sec (9.14 MB/sec), 187.68 ms avg latency, 424.00 ms max latency, 155 ms 50th, 411 ms 95th, 423 ms 99th, 424 ms 99.9th.
      9.14 MB/sec is the throughput test result.
      (A concrete invocation of both tools, with example values filled in, is sketched at the end of this Kafka section.)

4.4.5 Project experience: number of Kafka machines

Number of Kafka machines (empirical formula) = 2 x (peak production speed x number of replicas / 100) + 1
First obtain the peak production speed, then, given the configured number of replicas, you can estimate the number of Kafka machines to deploy.
For example, if our peak production speed is 50 MB/s and the number of replicas is 2:
Number of Kafka machines = 2 x (50 x 2 / 100) + 1 = 3

4.4.6 Project experience: number of Kafka partitions

1) Create a topic with only one partition.
2) Test the producer throughput and consumer throughput of this topic.
3) Suppose their values are Tp and Tc respectively, in MB/s.
4) If the total target throughput is Tt, then the number of partitions = Tt / min(Tp, Tc).
For example: producer throughput = 20 MB/s, consumer throughput = 50 MB/s, expected total throughput = 100 MB/s; number of partitions = 100 / 20 = 5.
The number of partitions is generally set to between 3 and 10.
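To make the placeholders in the stress-test commands concrete, here is an example run against a hypothetical topic named test (the topic name, record size and message counts are assumptions, not the article's values):

# Producer side: 100,000 records of 100 bytes each, unthrottled
bin/kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 100000 \
  --throughput -1 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
# Consumer side: pull 100,000 messages from the same topic
bin/kafka-consumer-perf-test.sh --broker-list hadoop102:9092,hadoop103:9092,hadoop104:9092 \
  --topic test --messages 100000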

4.5 Flume

4.5.1 Log-collection Flume installation

Install three Flume agents (log collection on hadoop102 and hadoop103; Kafka consumption on hadoop104).

4.5.2 Project experience: Flume component selection

TailDirSource, KafkaChannel

4.5.3 Log Collection Flume configuration

(1) create a file-flume-kafka.conf file under the /opt/module/flume/conf directory

[atguigu@hadoop102 conf]$ vim file-flume-kafka.conf

Configure the following in the file

# 1. Define the names of the agent, source and channel
a1.sources = r1
a1.channels = c1
# 2. Describe the source
a1.sources.r1.type = TAILDIR
# Specify the filegroup name
a1.sources.r1.filegroups = f1
# Specify the files monitored by the group
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
# Specify the file used for breakpoint resumption
a1.sources.r1.positionFile = /opt/module/flume/position.json
# Specify how much data to collect in one batch
a1.sources.r1.batchSize = 100
# 3. Describe the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.ETLInterceptor$Builder
# 4. Describe the channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
# Specify the Kafka cluster address
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
# Specify the topic the data is written to
a1.channels.c1.kafka.topic = applog
# Whether data is written to Kafka in event format: true = yes, false = no (only the body is written)
a1.channels.c1.parseAsFlumeEvent = false
# 5. Associate source -> channel
a1.sources.r1.channels = c1

4.5.4 Flume interceptor (the core intercept methods of com.atguigu.interceptor.ETLInterceptor)

@Override
public Event intercept(Event event) {
    // Drop events whose body is not valid JSON; JSONUtils is the project's JSON-validation helper
    byte[] body = event.getBody();
    String log = new String(body, StandardCharsets.UTF_8);
    if (JSONUtils.isJSONValidate(log)) {
        return event;
    } else {
        return null;
    }
}

@Override
public List<Event> intercept(List<Event> list) {
    // Remove the events that the single-event intercept() filtered out
    Iterator<Event> iterator = list.iterator();
    while (iterator.hasNext()) {
        Event next = iterator.next();
        if (intercept(next) == null) {
            iterator.remove();
        }
    }
    return list;
}

4.5.5 Log-collection Flume start/stop script

#!/bin/bash
# 1. Check that a parameter was passed in
if [ $# -lt 1 ]
then
    echo "A parameter must be passed...."
    exit
fi
# 2. Match the logic according to the parameter
case $1 in
"start")
    for host in hadoop102 hadoop103
    do
        echo "=============== start $host collection flume ==============="
        ssh $host "nohup /opt/module/flume-1.9.0/bin/flume-ng agent -n a1 -c /opt/module/flume-1.9.0/conf/ -f /opt/module/flume-1.9.0/jobs/file_to_kafka.conf -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume-1.9.0/logs 2>&1 &"
    done
;;
"stop")
    for host in hadoop102 hadoop103
    do
        echo "=============== stop $host collection flume ==============="
        ssh $host "ps -ef | grep file_to_kafka.conf | grep -v grep | awk '{print \$2}' | xargs kill -9"
    done
;;
*)
    echo "Parameter input error."
;;
esac

4.6 Flume for consuming Kafka data

4.6.1 Project experience: Flume component selection

KafkaSource, FileChannel/MemoryChannel, HDFSSink

4.6.2 Flume interceptor (the core intercept methods of the timestamp interceptor)

@Override
public Event intercept(Event event) {
    // Put the event time (the ts field in the log body) into the header as "timestamp"
    Map<String, String> headers = event.getHeaders();
    String log = new String(event.getBody(), StandardCharsets.UTF_8);
    JSONObject jsonObject = JSONObject.parseObject(log);
    String ts = jsonObject.getString("ts");
    headers.put("timestamp", ts);
    return event;
}

@Override
public List<Event> intercept(List<Event> list) {
    // events is a List<Event> member field of the interceptor that is reused across batches
    events.clear();
    for (Event event : list) {
        events.add(intercept(event));
    }
    return events;
}

4.6.3 Log-consumption Flume configuration

# 1. Define the names of the agent, channel, source and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# 2. Describe the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
# Specify the Kafka cluster address
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
# Specify the topic to read data from
a1.sources.r1.kafka.topics = applog
# Specify the consumer group id
a1.sources.r1.kafka.consumer.group.id =
# Specify how many messages the source pulls from Kafka in one batch (batchSize)
a1.sources.r1.batchSize =
# Associate the source and the sink with the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
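The configuration above names a channel and a sink but their sections do not appear in the source text. As a minimal sketch only, a typical FileChannel plus HDFS sink for kafka-flume-hdfs.conf might look like the following; the directories, path, prefix and roll settings are assumptions, not the article's values:

cat >> /opt/module/flume/conf/kafka-flume-hdfs.conf <<'EOF'
# 3. Describe the channel (FileChannel keeps events on disk between source and sink)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1
# 4. Describe the HDFS sink (one directory per day, using the "timestamp" header set by the interceptor)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/applog/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
EOF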

4.6.4 Log-consumption Flume start/stop script

1) create the script f2.sh under the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim f2.sh

Fill in the following in the script

#!/bin/bash
case $1 in
"start")
    for i in hadoop104
    do
        echo "--------- start consumption flume on $i ---------"
        ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/log2.txt 2>&1 &"
    done
;;
"stop")
    for i in hadoop104
    do
        echo "--------- stop consumption flume on $i ---------"
        ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep | awk '{print \$2}' | xargs -n1 kill"
    done
;;
esac

2) add execute permission to the script

[atguigu@hadoop102 bin]$ chmod +x f2.sh

3) f2 cluster startup script

[atguigu@hadoop102 module]$ f2.sh start

4) f2 cluster stop script

[atguigu@hadoop102 module]$ f2.sh stop

4.6.5 Project experience: Flume memory optimization

Flume's default maximum heap is 2000 MB; in this project it generally needs to be raised to about 4 GB. -Xms is the heap size at startup and -Xmx is the maximum heap size at runtime. It is best to set -Xms and -Xmx to the same value: if they differ, the smaller startup heap fills up quickly and has to be expanded, and the resulting GC hurts performance; with identical settings there is no heap expansion and the number of GC pauses is reduced. A sketch of the corresponding flume-env.sh setting follows.
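The memory setting normally lives in flume-env.sh; a minimal sketch, assuming the conf directory used throughout this article (the 4 GB figure is the article's recommendation):

# /opt/module/flume/conf/flume-env.sh
# Give the Flume JVM 4 GB and keep -Xms equal to -Xmx to avoid heap resizing and extra GC
export JAVA_OPTS="-Xms4096m -Xmx4096m"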

So far, the study of "how to install and configure Hadoop" is over. I hope it helps resolve your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow this website; the editor will keep working hard to bring you more practical articles!
