This article introduces the basics of "how to build a pseudo-distributed Hadoop 2 cluster". Many people run into trouble with this in real projects, so let the editor walk you through how to handle these situations. I hope you read it carefully and come away with something useful!
1. Hadoop ecosystem map
1. How did it all start? Huge amounts of data on the Web!
2. Use Nutch to fetch Web data.
3. To preserve the huge amount of data on the Web, HDFS came into being.
4. How do we use these huge data?
5. Use Java or any stream/pipeline language to build a MapReduce framework for coding and analysis.
6. How to get Web logs, clickstreams, Apache logs, server logs and other unstructured data: fuse, webdav, chukwa, flume, Scribe.
7. Hiho and Sqoop load data into HDFS; relational databases can also join the Hadoop team.
8. High-level interfaces for MapReduce programming: Pig, Hive, Jaql.
9. BI tools with advanced UI reporting: Intellicus.
10. Workflow tools and high-level languages for MapReduce processing.
11. Monitoring and managing Hadoop, running jobs/Hive, and viewing a high-level view of HDFS: Hue, Karmasphere, Eclipse plugin, Cacti, Ganglia.
12. Supporting frameworks: Avro (for serialization), ZooKeeper (for coordination).
13. More advanced interfaces: Mahout, Elastic MapReduce.
14. OLTP can also be handled: HBase.
2. Introduction to Hadoop
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, making full use of the power of a cluster for high-speed computation and storage. In a nutshell, Hadoop is a software platform that makes it easier to develop and run software that handles large-scale data.
Typical scenarios: how to process massive logs and massive web page data. HDFS solves the distributed storage of massive data (high reliability, easy to scale, high throughput); MapReduce solves the analysis and processing of massive data (strong versatility, easy development, robustness).
Common: a set of components and interfaces (serialization, Java RPC, and persistent data structures) for distributed file systems and general I/O.
MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: a distributed file system that runs on large clusters of commodity machines.
ZooKeeper: a distributed, highly available coordination service that provides primitives such as distributed locks for building distributed applications.
HBase: a distributed, column-oriented database that uses HDFS as its underlying storage and supports both batch computation with MapReduce and point queries (random reads).
Pig: a data flow language and runtime environment for exploring very large datasets, running on MapReduce and HDFS clusters.
Hive: a distributed data warehouse that manages data stored in HDFS and provides a SQL-based query language (translated into MapReduce jobs by the runtime) for querying the data.
Mahout: an extensible machine learning and data mining library (classification and clustering algorithms, among others) running on Hadoop.
Avro: a serialization system that supports efficient, cross-language RPC and persistent data storage.
Sqoop: a tool for efficiently transferring data between relational databases and HDFS.
Layering: the lowest platform layer is HDFS, YARN, MapReduce and Spark; the application layer includes HBase, Hive, Pig, Spark SQL and Nutch; tool-class components include ZooKeeper and Flume.
3. The bottleneck of centralized storage and computing
4. Technical differences between Virtualization and Hadoop in Cloud Computing
5. How to solve the problem of mass storage: the basic concept of HDFS
The difference between ordinary NFS and HDFS and their respective characteristics
1. Advantages of NFS: transparent and convenient to program against; you only need standard library calls such as open, close and fread.
2. Disadvantages of NFS: no data redundancy, all data sits on a single machine; copying data around may be limited by bandwidth.
HDFS is designed to overcome these shortcomings of NFS: reliable storage, easy reads, and integration with MapReduce. It is highly scalable and highly configurable (through a set of configuration files). It supports a web interface (http://namenode-name:50070/) for browsing the file system, as well as a shell interface for file operations.
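As a small, hedged illustration of that shell interface (these are standard Hadoop 2.x hdfs dfs commands; the paths are made up for the example and assume the pseudo-distributed cluster built later in this article):

# List the root of the distributed file system
hdfs dfs -ls /
# Create a directory and upload a local file into it
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put ./localfile.txt /user/hadoop/input
# Read the file back and check overall capacity usage
hdfs dfs -cat /user/hadoop/input/localfile.txt
hdfs dfs -df -h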
The overall architecture of HDFS: see the blog for more details.
The structure of HDFS in the official document
6. Building a Hadoop pseudo-distributed cluster
What is pseudo-distributed clustering:
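In a pseudo-distributed setup, all of the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager) run as separate processes on a single machine, so it behaves like a real cluster while only needing one host.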
Environment preparation: virtual machine: VMware 10; operating system: CentOS 6; JDK: 1.7; Hadoop: 2.4; client access tool: SecureCRT; user name: hadoop.
6.1 Prepare the network environment
- In VMware, change the virtual machine's network type to NAT mode (the IP of the virtual switch can be seen under Edit > Virtual Network Editor in VMware).
- Based on the switch (gateway) address, set the IP of the Windows 7 client's VMnet8 network adapter.
- Start the Linux host and modify the Linux system's IP address (through the graphical interface). After the modification, switch to the root user in a terminal and restart the network service so the new IP takes effect.
- As root, modify the hostname: vi /etc/sysconfig/network and change the hostname to yun-10-1.
- As root, add the hostname-to-IP mapping: vi /etc/hosts and add the line 192.168.2.100 yun-10-1.
- Add the hadoop user to sudoers: as root, vi /etc/sudoers, find the line root ALL=(ALL) ALL and add a matching line for hadoop below it.
- As root, stop the firewall service and disable it at boot with chkconfig iptables off.
- As root, run reboot to restart the machine, then use ping to check network connectivity between the Windows host and the Linux server.
- To stop booting into the graphical interface: as root, vi /etc/inittab and change the default runlevel line to id:3:initdefault:, then reboot; the machine will no longer start into the graphical interface. (To start the graphical interface later, run startx or init 5 at the command line; to leave it again, run init 3.)
6.2 Install the JDK
- Upload the JDK package using the FileZilla tool.
- Unpack the JDK into a dedicated installation directory /home/hadoop/app: in hadoop's home directory, run tar -zxvf jdk-7u65-linux-i586.tar.gz -C ./app
- Configure the Java environment variables: sudo vi /etc/profile and add at the end of the file:
  export JAVA_HOME=/home/hadoop/app/jdk1.7.0_65
  export PATH=$PATH:$JAVA_HOME/bin
- Make the configuration take effect: source /etc/profile
- Verify with javac and java -version.
6.3 Install Hadoop and configure it
a. Upload the Hadoop installation package with the SecureCRT tool.
b. Extract Hadoop into the app directory: tar -zxvf hadoop-2.4.1.tar.gz -C ./app/
c. Modify the five configuration files of Hadoop, located in the /home/hadoop/app/hadoop-2.4.1/etc/hadoop directory:
   - hadoop-env.sh: change JAVA_HOME to the path where we installed the JDK: JAVA_HOME=/home/hadoop/app/jdk1.7.0_65
   - core-site.xml: set fs.defaultFS to hdfs://yun-10-1:9000 and hadoop.tmp.dir to /home/hadoop/app/hadoop-2.4.1/tmp
   - hdfs-site.xml: set dfs.replication to 1
   - mapred-site.xml: rename the template first (mv mapred-site.xml.template mapred-site.xml), then set mapreduce.framework.name to yarn
   - yarn-site.xml: set yarn.resourcemanager.hostname to yun-10-1 and yarn.nodemanager.aux-services to mapreduce_shuffle
d. Configure Hadoop's environment variables: sudo vi /etc/profile (add the Hadoop installation directory to PATH in the same way as for the JDK).
Testing the Hadoop pseudo-distributed setup: format HDFS, start HDFS and YARN, check the processes, and browse the web UI; a sketch of the commands follows below. The hostname was changed to yun10-0 during the test. The result proves that the test was successful!
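To make step c and the final test more concrete, here is a minimal sketch, assuming the paths and hostname used above. The property names and values are the ones listed above; the surrounding XML syntax and the start-up commands follow the standard Hadoop 2.x layout, so treat the exact file contents as illustrative rather than a copy of the author's files.

HADOOP_HOME=/home/hadoop/app/hadoop-2.4.1

# Representative configuration file; the other four files follow the same <property> pattern
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://yun-10-1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/app/hadoop-2.4.1/tmp</value>
  </property>
</configuration>
EOF

# Format HDFS (only the first time), then start HDFS and YARN
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Process check: jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
jps

# Web browsing: the HDFS UI is at http://yun-10-1:50070/ and the YARN UI at http://yun-10-1:8088/

If all five daemons show up in jps and the two pages load, the pseudo-distributed cluster is up.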
7. Detailed explanation of the SSH protocol and how it is used
See another blog for details: http://my.oschina.net/codeWatching/blog/342253
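The linked post has the details; as a hedged reminder, the usual passwordless-SSH setup that start-dfs.sh relies on looks roughly like this (standard OpenSSH commands, not taken from the linked post):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # generate a key pair with an empty passphrase
ssh-copy-id hadoop@yun-10-1                # append the public key to authorized_keys for the hadoop user
ssh yun-10-1 hostname                      # should print the hostname without asking for a password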
8. Problems encountered while building the environment
When a virtual machine cloned in VMware prompts "No such device eth0":
Method 1: delete the configuration file directly; Linux will detect a new network card after the restart:
sudo rm /etc/udev/rules.d/70-persistent-net.rules
If the network card still does not work after the restart, delete the HWADDR="XX:XX:XX:XX:XX:XX" line from /etc/sysconfig/network-scripts/ifcfg-eth0, or change it to the new MAC address assigned by VMware, then restart the network card service: service network restart
Method 2: modify the configuration file: delete the original eth0 entry and change NAME="eth2" in the eth2 entry to NAME="eth0". After a reboot, Linux will use the new configuration file to set up the network card.
Before the modification, /etc/udev/rules.d/70-persistent-net.rules looks like this:
# PCI device 0x1022:0x2000 (pcnet32)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:50:XX:XX", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
# PCI device 0x1022:0x2000 (pcnet32)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:85:XX:XX", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
Delete the eth0 line and keep the remaining line, renamed to eth0:
# PCI device 0x1022:0x2000 (pcnet32)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:85:XX:XX", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
That concludes "how to build a pseudo-distributed Hadoop 2 cluster". Thank you for reading. If you want to learn more about the industry, you can follow the site; the editor will keep producing more high-quality practical articles for you!