How to Build Hadoop in CentOS
This article explains in detail how to build Hadoop on CentOS. It is quite practical, so I share it here for your reference; I hope you will get something out of it after reading.
Software environment:
Virtual machine: VMware Pro14
Linux: CentOS-6.4 (download address; choose the DVD version)
JDK: OpenJDK 1.8.0 (it is strongly recommended not to use Oracle's Linux version of the JDK)
Hadoop: 2.6.5 (download address)
The installation of the virtual machine and of the Linux system is omitted here; you can follow an online tutorial, and there are generally no big problems. Just remember the user password you enter during installation, as it is needed later, as shown in the following figure.
[Figure: Set user password]
[Figure: User selection]
After installing the system using the virtual machine, you can see the login interface, as shown in the following figure.
Select Other, enter root in the Username input box, press Enter, and then enter the password in the Password input box. The root user is the superuser automatically created when CentOS is installed, and its password is the same as the password of the normal user you created when installing the system.
Normally, when using CentOS, the root user is not recommended, because it has the highest permissions on the entire system and careless use can lead to serious consequences unless you are familiar with Linux. On the other hand, when building a Hadoop big data platform as an ordinary user, many commands require sudo to obtain root permissions, which is troublesome, so here we simply use the root user directly.
Install SSH
Both cluster and single-node modes require SSH login (similar to remote login, where you can log in to a Linux host and run commands on it).
First of all, make sure your CentOS system can access the Internet normally. Check the network icon in the upper right corner of the desktop: if it shows a red cross, click it and select an available network. You can also open the Firefox browser in the upper left corner of the desktop and enter a URL to verify that the network connection works. If you still cannot access the Internet, check the virtual machine settings and choose NAT mode, or search online for a solution.
[Figure: Check the state of the network]
After confirming that the network connection is normal, open the CentOS terminal: right-click on the desktop and select Open In Terminal, as shown in the following figure.
[Figure: Open the terminal]
In general, SSH client and SSH server are installed by default in CentOS. You can open the terminal and execute the following command to verify:
rpm -qa | grep ssh
If the returned results are shown in the figure below, including SSH client and SSH server, you no longer need to install them.
[Figure: Check whether SSH is installed]
If you need to install them, you can do so through the yum package manager. (During installation you will be asked to confirm; enter y.)
Note: the two commands are executed one at a time, not pasted in together.
To paste into the terminal, right-click the mouse and select Paste, or use the [Shift + Insert] shortcut.
yum install openssh-clients
yum install openssh-server
After the SSH installation is complete, execute the following command to test whether SSH is available. (On the first SSH login you will be prompted with a yes/no message; enter yes, and then enter the root user's password as prompted to log in to the local machine, as shown in the following figure.)
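The command is simply an SSH login to the local machine (the same one used again below when configuring passwordless login):
ssh localhost   # log in to this machine over SSH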
[Figure: Log in to SSH for the first time]
However, this way you need to enter the password every time; we should configure SSH so that we can log in without a password.
First type exit to exit the ssh, then go back to our original terminal window, and then use ssh-keygen to generate the key and add the key to the authorization.
exit                                   # exit the previous ssh localhost
cd ~/.ssh/                             # if prompted that there is no such directory, execute ssh localhost first
ssh-keygen -t rsa                      # press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys  # add the key to the authorization list
chmod 600 ./authorized_keys            # modify the file permissions
At this point, use the ssh localhost command to log in without entering a password, as shown in the following figure.
[Figure: Log in to SSH again]
Install the Java environment
For the Java environment you can choose either Oracle's JDK or OpenJDK (which can be regarded as the open-source version of the JDK). Most current Linux systems have OpenJDK installed by default; the version installed here is OpenJDK 1.8.0.
Some CentOS 6.4 installations come with OpenJDK 1.7 by default. You can use the following commands to check the Java version and the value of the JAVA_HOME environment variable, much as you would on Windows.
java -version      # view the version of Java
javac -version     # view the version of the javac compiler
echo $JAVA_HOME    # view the value of the environment variable JAVA_HOME
If OpenJDK is not installed on the system, we can install it through the yum package manager. (During installation you will be asked to confirm; enter y.)
yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel   # install OpenJDK 1.8.0
Install OpenJDK with the above command; the default installation location is /usr/lib/jvm/java-1.8.0, which is used when configuring JAVA_HOME below.
Next, you need to configure the JAVA_HOME environment variable. For convenience, set it directly in ~/.bashrc, which is equivalent to configuring a per-user environment variable on Windows and only takes effect for the current user. The .bashrc file is read every time the user opens a shell terminal.
To modify the file, you can open the file directly using the vim editor, or you can use a gedit text editor similar to Windows notepad.
Choose either of the following commands.
vim ~/.bashrc     # open the .bashrc file with the vim editor
gedit ~/.bashrc   # open the .bashrc file with the gedit text editor
Add a single line at the end of the file, pointing to the location where the JDK is installed, and save it; the exact line is given after the figure below.
[Figure: Configure the JAVA_HOME environment variable]
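Based on the default installation location noted above, the line should be the following (adjust the path if your JDK was installed elsewhere):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0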
Then you need to make the environment variable take effect by executing the following command.
source ~/.bashrc   # make the variable settings take effect
Once set up, let's check to see if the setting is correct, as shown in the following figure.
echo $JAVA_HOME                 # verify the variable value
java -version
javac -version
$JAVA_HOME/bin/java -version    # same as executing java -version directly
[Figure: Check that the JAVA_HOME environment variable is configured correctly]
In this way, the Java runtime environment required for Hadoop is installed.
Install Hadoop
The download address of Hadoop 2.6.5 was given in the software environment section above; it can be downloaded directly with Firefox. The default download location is the Downloads folder in the user's home directory, as shown in the following figure.
[Figure: Download Hadoop]
After the download is complete, we extract Hadoop into /usr/local/.
tar -zxf ~/Downloads/hadoop-2.6.5.tar.gz -C /usr/local   # extract into the /usr/local directory
cd /usr/local/                                           # change the current directory to /usr/local
mv ./hadoop-2.6.5/ ./hadoop                              # rename the folder to hadoop
chown -R root:root ./hadoop                              # modify file ownership; root is the current user name
Hadoop is ready to use after extraction. Enter the following commands to check whether Hadoop is available; if successful, the Hadoop version information will be displayed.
cd /usr/local/hadoop    # switch the current directory to /usr/local/hadoop
./bin/hadoop version    # view the version information of Hadoop
Or you can enter the hadoop version command directly to view it.
hadoop version   # view the version information of Hadoop
[Figure: View Hadoop version information]
Hadoop can run in three modes: stand-alone mode, pseudo-distributed mode, and distributed mode.
Stand-alone mode: the default mode of Hadoop is the non-distributed (local) mode, which runs without additional configuration. Non-distributed means a single Java process, which is convenient for debugging.
Pseudo-distributed mode: Hadoop runs in a pseudo-distributed manner on a single node; the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.
Distributed mode: use multiple nodes to form a cluster environment to run Hadoop, which requires multiple hosts or virtual hosts.
Hadoop pseudo-distributed configuration
Now we can run some examples using Hadoop. Hadoop comes with many examples; running hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar with no further arguments lists all of them.
Here we run a grep example: it takes the input folder as input, filters the words that match the regular expression dfs[a-z.]+, counts the occurrences, and writes the results to the output folder.
cd /usr/local/hadoop              # switch the current directory to /usr/local/hadoop
mkdir ./input                     # create the input folder
cp ./etc/hadoop/*.xml ./input     # copy the Hadoop configuration files into the new input folder
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*                    # view the output
Viewing the result with the command cat ./output/*, the word dfsadmin, which matches the regular expression, appears once.
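In other words, the output of the cat command should look roughly like this:
1       dfsadmin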
[Figure: Run the test Hadoop example]
If an error occurs while running, such as the prompt in the following figure:
[Figure: Error running Hadoop example]
If the prompt "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform … using builtin-java classes where applicable" appears, the WARN prompt can be ignored and does not affect the normal operation of Hadoop.
Note: Hadoop does not overwrite the result file by default, so running the above example again will prompt an error. You need to delete the output folder first.
rm -rf ./output   # executed in the /usr/local/hadoop directory
Now that our Hadoop installation has been tested without problems, we can set the environment variables for Hadoop, which are also configured in the ~/.bashrc file.
gedit ~/.bashrc   # open the .bashrc file with the gedit text editor
Add the following at the end of the .bashrc file, and check that the location of HADOOP_HOME is correct; if everything followed the configuration above, this part can be copied as-is.
# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
[Figure: Configuration of Hadoop environment variables]
Remember to close the gedit program after saving; otherwise it will occupy the terminal and the following commands cannot be executed. You can press [Ctrl + C] to terminate the program.
After saving, don't forget to execute the following command to make the configuration effective.
source ~/.bashrc   # make the configuration take effect
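Since $HADOOP_HOME/bin is now on the PATH, a quick way to confirm the variables took effect is to run hadoop version from any directory:
cd ~              # any directory will do now
hadoop version    # should print the Hadoop version information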
The configuration files of Hadoop are located under /usr/local/hadoop/etc/hadoop/. For pseudo-distributed mode, two configuration files need to be modified: core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format; each configuration item is declared as a property with a name and a value.
Modify the configuration file core-site.xml (it is convenient to edit with gedit; enter the command gedit ./etc/hadoop/core-site.xml).
Insert the following code between the <configuration> and </configuration> tags.
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Abase for other temporary directories.</description>
</property>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
Similarly, modify the configuration file hdfs-site.xml (gedit ./etc/hadoop/hdfs-site.xml):
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
After the configuration is complete, format the NameNode. (This command is required the first time Hadoop starts.)
hdfs namenode -format
If successful, you will see the prompts "successfully formatted" and "Exitting with status 0"; "Exitting with status 1" indicates an error.
[Figure: NameNode format]
Next, start Hadoop.
start-dfs.sh   # start the NameNode and DataNode processes
If the following SSH prompt "Are you sure you want to continue connecting" appears, enter yes.
[Figure: Considerations for starting Hadoop]
After startup, you can use the jps command to determine whether it succeeded. If the following four processes appear: NameNode, DataNode, SecondaryNameNode, and Jps, then Hadoop started successfully.
jps   # check the processes to determine whether Hadoop started successfully
[Figure: Determine whether Hadoop started successfully]
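For reference, the jps output should look something like the following (the process IDs on the left will differ on your machine):
3235 NameNode
3367 DataNode
3575 SecondaryNameNode
3692 Jps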
After successful startup, you can also visit the web interface http://localhost:50070 to view NameNode and DataNode information, as well as view files in HDFS online.
[Figure: Hadoop Web interface after normal startup]
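If you prefer to check from the terminal instead of a browser, a quick sanity check (assuming curl is available, as it is on a default CentOS install) is:
curl -s http://localhost:50070 | head   # should return the HTML of the NameNode web page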
Start YARN
YARN was separated out of MapReduce and is responsible for resource management and task scheduling; MapReduce now runs on top of YARN, which provides high availability and high scalability. (In pseudo-distributed mode, not starting YARN generally does not affect program execution.)
The start-dfs.sh command above only starts the HDFS processes; MapReduce jobs then run in the local environment. We can additionally start YARN and let it take charge of resource management and task scheduling.
First modify the configuration file mapred-site.xml. You need to rename mapred-site.xml.template to mapred-site.xml.
mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml   # rename the file
gedit ./etc/hadoop/mapred-site.xml                                      # open it with the gedit text editor
Then add the following property between the <configuration> tags:
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
Then modify the configuration file yarn-site.xml.
gedit ./etc/hadoop/yarn-site.xml   # open it with the gedit text editor
Add the following property between the <configuration> tags:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
Then you can start YARN and execute the start-yarn.sh command.
Note: before starting YARN, make sure that Hadoop's dfs has been started, that is, that start-dfs.sh has already been executed.
start-yarn.sh                                 # start YARN
mr-jobhistory-daemon.sh start historyserver   # start the history server, so that tasks can be viewed in the web interface
Once started, jps shows two more processes, NodeManager and ResourceManager, as shown below.
[Figure: Start YARN]
After starting YARN, examples are run in exactly the same way; only resource management and task scheduling change. One advantage of starting YARN is that you can watch how tasks run through the web interface http://localhost:8088/cluster, as shown in the following figure.
[Figure: YARN Web interface]
YARN mainly provides better resource management and task scheduling for a cluster. If you do not want to start YARN, be sure to rename the configuration file mapred-site.xml back to mapred-site.xml.template, and change it back again when needed. Otherwise, if the configuration file exists but YARN is not started, running a program will prompt the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032", which is also why the file's initial name is mapred-site.xml.template.
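For reference, the rename-back command (executed in the /usr/local/hadoop directory) would be:
mv ./etc/hadoop/mapred-site.xml ./etc/hadoop/mapred-site.xml.template   # disable YARN again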
The commands to stop YARN are as follows: start turns it on, stop turns it off.
stop-yarn.sh
mr-jobhistory-daemon.sh stop historyserver
For everyday learning, pseudo-distributed mode is enough.
This is the end of the article on how to build Hadoop in CentOS. I hope the above content is of some help to you.