
How to Build a Hadoop Cluster and How to Operate It with Python


In this issue, the editor will show you how to build a Hadoop cluster and how to operate it with Python. The article is rich in content and written from a professional point of view. I hope you will get something out of it after reading.

Recently, our project has required storing and retrieving hundreds of billions of records: 10 TB of text data had to be parsed, stored in a database, and made available for real-time retrieval. File storage became the primary problem; we tried a variety of storage methods, and none met the requirements. We finally used the HDFS distributed file storage system, found that its efficiency and manageability were quite good, and studied how to build and use it, so we are recording the process in this document.

Environment

Modify the hostname of each machine according to the environment (one master and four slaves, as listed in the hosts file below):

# modify the hostname
vi /etc/hostname
# make it take effect without a reboot
hostname xxxx

Modify the hosts file:

vi /etc/hosts

192.168.143.130 master
192.168.143.131 slave1
192.168.143.132 slave2
192.168.143.133 slave3
192.168.143.134 slave4

Configure password-free SSH login:

ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub master
ssh-copy-id -i ~/.ssh/id_rsa.pub slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub slave2
ssh-copy-id -i ~/.ssh/id_rsa.pub slave3
ssh-copy-id -i ~/.ssh/id_rsa.pub slave4

Install the JDK (on every machine):

apt-get install -y openjdk-8-jre-headless openjdk-8-jdk

Configure environment variables

Add the following at the end of the /etc/profile file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/usr/hadoop-3.3.0/
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

Make the environment variables take effect:

source /etc/profile

Create directories (on every machine)

Before creating the directories, check the disk space on each machine with the df -h command and decide which disk will hold the data. Then create the following three directories, and adjust the corresponding directory settings in the hdfs-site.xml configuration file below.

mkdir -p /home/hadoop/dfs/name
mkdir -p /home/hadoop/dfs/data
mkdir -p /home/hadoop/temp

Install and configure Hadoop

Download the Hadoop installation package

http://archive.apache.org/dist/hadoop/core/stable/hadoop-3.3.0.tar.gz

tar -xzvf hadoop-3.3.0.tar.gz
# copy it to the /usr directory after decompression
mv hadoop-3.3.0 /usr

Configure Hadoop

The configuration files are in the /usr/hadoop-3.3.0/etc/hadoop directory.

hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>

hdfs-site.xml (multiple data storage directories can be configured, separated by commas)

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/dfs/data,/usr1/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

workers

This file lists the machines that store data as DataNodes. It is not recommended to use master for storage: if its disk fills up, the whole cluster becomes unusable.

slave1
slave2
slave3
slave4

Just copy /usr/hadoop-3.3.0 from master to the other machines.

scp -r /usr/hadoop-3.3.0 slave1:/usr
scp -r /usr/hadoop-3.3.0 slave2:/usr
scp -r /usr/hadoop-3.3.0 slave3:/usr
scp -r /usr/hadoop-3.3.0 slave4:/usr

Format the HDFS directory (on the master machine)

hdfs namenode -format

Start Hadoop

This only needs to be executed on the master machine. After execution, you can use the jps command to check the process status on all machines.

cd /usr/hadoop-3.3.0/sbin
./start-all.sh

View process status

Execute the jps command on master and on each slave to check whether everything started successfully.

Open the following web pages in a browser to see whether they can be accessed properly.

# Hadoop cluster information
http://192.168.143.130:8088/cluster

# HDFS address
http://192.168.143.130:9870/dfshealth.html

# DataNode address
http://192.168.143.130:9864/datanode.html

# NodeManager address
http://192.168.143.130:8042/node

# SecondaryNameNode address
http://192.168.143.130:9868/status.html
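If you would rather check these addresses from a script than a browser, here is a minimal Python sketch, assuming the requests library is installed (it is not part of this tutorial's setup) and using the cluster IP from this article:

import requests

# Web endpoints of the cluster set up above
endpoints = [
    "http://192.168.143.130:8088/cluster",          # Hadoop cluster information (YARN)
    "http://192.168.143.130:9870/dfshealth.html",   # HDFS NameNode
    "http://192.168.143.130:9864/datanode.html",    # DataNode
    "http://192.168.143.130:8042/node",             # NodeManager
    "http://192.168.143.130:9868/status.html",      # SecondaryNameNode
]

for url in endpoints:
    try:
        # A 200 status code means the page is reachable
        print(url, requests.get(url, timeout=5).status_code)
    except requests.RequestException as exc:
        print(url, "unreachable:", exc)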

Test file upload (on master)

hdfs dfs -mkdir /test
hdfs dfs -put start-dfs.sh /test

HDFS operation commands

Create a folder

hdfs dfs -mkdir /myTask

Create a multi-level directory

hdfs dfs -mkdir -p /myTask/input

Upload files

hdfs dfs -put /opt/wordcount.txt /myTask

View the files and folders in the root directory

hdfs dfs -ls /

View the contents of the wordcount.txt file in the myTask directory

hdfs dfs -cat /myTask/wordcount.txt

Delete a file or folder

hdfs dfs -rm -r /myTask/wordcount.txt

Download a file to the local machine

hdfs dfs -get /myTask/wordcount.txt /opt

Operating HDFS with Python

When operating HDFS from Python, uploading and downloading files requires the hosts file to be configured on the machine where the code runs, because HDFS records its NameNode and DataNodes by hostname after they register. If you do not configure it and run an upload or download directly, the operation is addressed by hostname, so the local machine needs the IP-to-hostname mappings of the HDFS cluster machines. For example, when I operate from this machine, I must configure it as follows:

C:\Windows\System32\drivers\etc\hosts

192.168.143.130 master
192.168.143.131 slave1
192.168.143.132 slave2
192.168.143.133 slave3
192.168.143.134 slave4

Install the library

pip install hdfs

Operations

Connect

from hdfs.client import Client

client = Client("http://192.168.143.130:9870")

Create a directory

client.makedirs(hdfs_path)

Delete a file

client.delete(hdfs_path)

Upload files

client.upload(hdfs_path, local_path)

Get the list of files under the directory

client.list(hdfs_path)
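Putting the calls above together, here is a minimal end-to-end sketch using the hdfs library's Client, assuming the cluster from this article, the hosts entries configured as described, and the example local file /opt/wordcount.txt (the HDFS paths are only illustrative):

from hdfs.client import Client

# Connect to the NameNode web address (port 9870, as configured above)
client = Client("http://192.168.143.130:9870")

# Create a working directory
client.makedirs("/myTask/input")

# Upload an example local file into that directory
client.upload("/myTask/input", "/opt/wordcount.txt")

# List the directory contents
print(client.list("/myTask/input"))

# Read the uploaded file back (returns bytes by default)
with client.read("/myTask/input/wordcount.txt") as reader:
    print(reader.read()[:200])

# Clean up the test file
client.delete("/myTask/input/wordcount.txt")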

The advantages of an HDFS file storage cluster are low configuration requirements, easy expansion, and high efficiency. It is very well suited to storing massive numbers of files, provides a web management page, and has very good third-party libraries. It is also a very good choice as a file and image repository in web development.
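As a rough illustration of that last point, here is a sketch of storing and retrieving an image with the same hdfs client; the file names and paths are only examples:

from hdfs.client import Client

client = Client("http://192.168.143.130:9870")

# Store an example image in HDFS (overwrite if it already exists)
with open("logo.png", "rb") as f:
    client.write("/images/logo.png", data=f.read(), overwrite=True)

# Read it back later, for example from a web request handler
with client.read("/images/logo.png") as reader:
    image_bytes = reader.read()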

The above is how to build a Hadoop cluster and how to operate it with Python. If you have similar questions, please refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.
