
Getting started with Hadoop


When it comes to column-family databases, we have to mention Google's BigTable, whose open-source counterpart is HBase. BigTable is built on two other Google systems, GFS and Chubby, which together with the distributed computing programming model MapReduce form the foundation of Google's cloud computing; Chubby in particular provides the basis for automatic master-slave switchover. Next, Hadoop is introduced through a comparison table.

Google cloud computing | Hadoop | Responsibility
GFS (distributed file system) | HDFS | physical data storage
Chubby (distributed management service) | ZooKeeper | server management
MapReduce (distributed computing framework) | MapReduce | computation
BigTable (distributed database) | HBase | data access

Hadoop was developed by Doug Cutting, the author of Apache Lucene; its main components are described below.

HDFS (Hadoop Distributed File System)

NameNode: the brain of the entire file system, providing directory information for the entire system and managing various data servers.

DataNode: in the distributed file system, each file is cut into several data blocks, and the blocks are stored on different servers; these servers are the DataNodes (data servers).

Block: each piece of split file content is a data block, the basic unit of storage; the typical block size is 64 MB.

Tip: because hardware failure is the norm and HDFS is a collection of many servers, error detection and recovery are core functions. HDFS is oriented toward streaming reads, batch operations, and high-throughput data access.

HDFS adopts a master/slave architecture: an HDFS cluster consists of one NameNode and several DataNodes. The central server, the NameNode, manages the file system namespace and client access to files. There is usually one DataNode per node, and it manages the storage attached to that node. Internally, a file is divided into one or more blocks, and these blocks are stored across the set of DataNodes. Both NameNode and DataNode can run on inexpensive Linux machines. HDFS is implemented in Java and is therefore highly portable.

Replication: HDFS uses a rack-aware strategy to improve data reliability and network bandwidth utilization; the NameNode determines the rack id of each DataNode. In the common case of a replication factor of 3, one copy is placed on a node in the local rack, one on another node in the same rack, and the last on a node in a different rack; on reads, the nearest copy is chosen. When the NameNode starts it enters SafeMode; in this state it does not replicate data blocks but checks the replica counts reported by the DataNodes, and a block is considered safe once it meets the required number.

The NameNode stores the metadata, and any change is recorded in the EditLog. The communication protocols are layered on TCP/IP, and HDFS can be accessed through its Java API.

When installing and running Hadoop, note the following points.


In distributed mode, IP addresses cannot be used in the Hadoop configuration files; host names must be used. Hadoop must be installed with the same configuration and installation path on every node and started by the same user. HDFS and MapReduce can be started separately, and the NameNode and JobTracker can be deployed on different nodes, although in small clusters they usually live together; in that case pay attention to the safety of the metadata.

The common HDFS shell operations are listed below. In practice they are usually invoked through the API, so a general familiarity is enough.

cat: output a file's contents, e.g. hadoop fs -cat <uri>
chgrp: change the group a file belongs to
chmod: change a file's permissions
put / copyFromLocal: copy files from the local file system to HDFS
get / getmerge / copyToLocal: copy files from HDFS to the local file system, e.g. hadoop fs -get hdfs://host:port/user/Hadoop/file localfile
cp: copy files within HDFS
du / dus: show directory and file sizes
expunge: empty the trash
ls / lsr: list file information (lsr is recursive)
mv / moveFromLocal: move files
rm / rmr: delete files (rmr is recursive)
mkdir: create a directory
setrep: change a file's replication factor
stat: return statistics for a path, e.g. hadoop fs -stat <path>
Others: tail, touchz, test, text

An example of calling HDFS through the Java API is shown below; from the client's point of view, HDFS behaves like an ordinary file system.

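The original listing is collapsed in the source page, so the following is a minimal sketch of the FileSystem API instead, assuming a NameNode reachable at hdfs://namenode-host:9000 (a placeholder address) and an existing /user/hadoop directory. It writes a file, reads it back, and lists a directory, roughly mirroring the shell commands above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (placeholder host and port).
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (like "hadoop fs -put").
        Path file = new Path("/user/hadoop/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read it back (like "hadoop fs -cat").
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // List a directory (like "hadoop fs -ls").
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath() + " " + status.getLen());
        }

        fs.close();
    }
}
```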

Core concepts of MapReduce

Job: every calculation request of a user is a job

JobTracker: the server to which users submit jobs; it is also responsible for splitting each job into tasks and for managing all the task servers.

Task: a job has to be split up and handed to multiple servers to complete; each split unit of execution is a task.

TaskTracker: the workers that actually execute the individual tasks.

MapReduce computing model

In Hadoop, each MapReduce computation is initialized as a Job, and each Job is divided into two phases, a Map phase and a Reduce phase, represented by two functions. The Map function takes an input and produces intermediate key/value output; Hadoop then passes the set of values sharing the same intermediate key to the Reduce function, which processes them to produce the final output.

The job configuration and code for running a MapReduce job in Java are shown below.

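The listing is collapsed in the source, so as a stand-in here is the standard WordCount pattern written against the org.apache.hadoop.mapreduce API; it shows how the Map and Reduce phases described above are wired into a Job. Class and path names are examples, with input and output directories taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: Hadoop groups the intermediate values by key; we sum them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would be packaged into a jar and run with something like hadoop jar wordcount.jar WordCount /input /output (paths are placeholders); the output directory must not exist beforehand.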

Data flow and Control flow of MapReduce

ZooKeeper is mainly used to solve data-management problems that distributed applications commonly face, such as unified naming, state synchronization, cluster management, and management of distributed application configuration items. Typical ZooKeeper application scenarios include configuration file management, cluster management, synchronization locks, leader election, and queue management.

Installing ZooKeeper amounts to unpacking the release and editing its zoo.cfg configuration file (data directory, client port, and the list of ensemble servers).


The ZooKeeper data model is a hierarchical data structure, very similar to a standard file system.

As a distributed service framework, ZooKeeper is mainly used to solve consistency problems in distributed clusters. It provides data storage organized as a tree of directory nodes, much like a file system, and maintains and monitors state changes of that data. The common methods are as follows.

create: create a znode at the given path and set its data (returns String)
exists: check whether a path exists, optionally setting a watch on the node (returns Stat)
delete: delete the znode at the given path
setData / getData: set and get a node's data (setData returns Stat)
addAuthInfo: send the client's own authorization information to the server
setACL / getACL: set a node's access permissions and get its permission list (setACL returns Stat)

An example of calling ZooKeeper from the Java API is as follows.

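Since the original example is not reproduced, the sketch below simply exercises the methods from the table above (create, exists, getData/setData, getChildren, delete) against a ZooKeeper server assumed to be at zk-host:2181; the znode path /app and its data are made up for illustration.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // Wait until the session is actually connected before issuing requests.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // create: create a znode and set its data.
        zk.create("/app", "config-v1".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: check the path and leave a watch on it.
        Stat stat = zk.exists("/app", true);

        // getData / setData: read and update the node's data.
        byte[] data = zk.getData("/app", true, stat);
        zk.setData("/app", "config-v2".getBytes(), stat.getVersion());

        // getChildren: list child nodes of the root.
        List<String> children = zk.getChildren("/", false);
        System.out.println(new String(data) + " " + children);

        // delete: remove the node (-1 skips the version check).
        zk.delete("/app", -1);
        zk.close();
    }
}
```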

Typical Application scenarios of ZooKeeper

Unified naming Service (Name Service): distributed applications usually require a complete set of naming conventions, usually using tree naming, which is very similar to JNDI.

Configuration management: ZooKeeper manages configuration information centrally, storing it under the corresponding directory node; when it changes, the affected machines are notified (observer pattern).

Cluster management: ZooKeeper can maintain the membership and service status of the current cluster, and can also elect a leader (Leader Election) to avoid a single point of failure.

Shared lock (Locks): a shared lock is easy to implement within one process but hard to implement across different servers; with ZooKeeper it becomes straightforward. Each server that wants the lock creates an EPHEMERAL_SEQUENTIAL directory node, then calls getChildren to check whether its own node is the smallest in the current list of child nodes; if it is, it holds the lock. Otherwise it calls exists to watch for changes and waits until its own node is the smallest, at which point it has acquired the lock. Releasing the lock is cheap: just delete the directory node that was created (a code sketch follows below).

Queue management (Queue Management): two kinds of queues can be handled. One only becomes available once all members have arrived, otherwise it keeps waiting; this is a synchronized queue. The other enqueues and dequeues in FIFO order and can be used, for example, to implement a producer-consumer model.
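A minimal sketch of the shared-lock recipe above, assuming an already-created parent node (here /locks) and an open ZooKeeper session. For simplicity it watches the current smallest node rather than only its immediate predecessor, which is enough to illustrate the idea but would cause a herd effect in a large cluster.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
    private final ZooKeeper zk;
    private final String lockRoot;   // e.g. "/locks", must already exist
    private String myNode;           // full path of the node this client created

    public SimpleZkLock(ZooKeeper zk, String lockRoot) {
        this.zk = zk;
        this.lockRoot = lockRoot;
    }

    public void lock() throws Exception {
        // 1. Create an EPHEMERAL_SEQUENTIAL node under the lock directory.
        myNode = zk.create(lockRoot + "/lock-", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        while (true) {
            // 2. List all children and sort them.
            List<String> children = zk.getChildren(lockRoot, false);
            Collections.sort(children);
            String smallest = lockRoot + "/" + children.get(0);

            // 3. If our node is the smallest, we hold the lock.
            if (myNode.equals(smallest)) {
                return;
            }

            // 4. Otherwise wait until the current smallest node goes away, then retry.
            final CountDownLatch latch = new CountDownLatch(1);
            if (zk.exists(smallest, new Watcher() {
                    public void process(WatchedEvent event) { latch.countDown(); }
                }) == null) {
                continue; // it already disappeared, check again
            }
            latch.await();
        }
    }

    public void unlock() throws Exception {
        // Releasing the lock is just deleting the node we created.
        zk.delete(myNode, -1);
    }
}
```

Because the node is EPHEMERAL, it disappears with the client's session if the client crashes, so the lock cannot be leaked.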

HBase (the logical layer) is the open-source counterpart of BigTable and is built on HDFS (the physical layer). It is a highly reliable, high-performance, column-oriented, scalable database system with real-time reads and writes. It sits between NoSQL and an RDBMS: data can only be retrieved by row key or by a row-key range, only single-row transactions are supported (complex operations such as multi-table joins can be realized through Hive), and it is mainly used to store unstructured and semi-structured, loosely organized data. Like Hadoop, HBase relies mainly on scaling out to increase computing and storage capacity.

HBase tables have the following characteristics:

Large: a table can have hundreds of millions of rows

Column-oriented: storage and permission control for column families, and independent retrieval of column families.

Sparse: for empty columns, it does not take up space, so tables can be designed to be very sparse.

Logical view: HBase stores data as tables, which consist of rows and columns; the columns are grouped into several column families, as shown in the following table.

Row Key | Column-family1 (Column1) | Column-family2 (Column2)
Key1 | t2:abc, t1:bcd | t1:354
Key2 | t3:efy, t1:uio | t2:tyi, t1:456

Row Key: the primary key for retrieving data. Rows in HBase can be accessed through a single row key, through a row-key range, or by a full table scan. Row keys are sorted in dictionary order, so numeric keys need to be left-padded with zeros (see the sketch after these definitions).

Column families: every column in a table belongs to a column family. Column families are part of the table schema and must be defined before use, whereas the columns themselves need not be; this is a key point to understand. Column names are prefixed with their column family: for example, courses:history and courses:math both belong to the courses column family.

Timestamp: a storage cell is determined by row and column. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer, accurate to the millisecond, sorted in reverse chronological order. To avoid keeping too many versions, old ones are usually garbage-collected either by count or by age.

Cell: the unit uniquely determined by {row key, column (= column family + qualifier), version}. The data in a cell has no type and is stored as raw bytes.
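A tiny illustration of the zero-padding point above: row keys are compared in dictionary order, so numeric keys should be left-padded with zeros to make lexicographic order match numeric order (the 10-digit width here is an arbitrary choice).

```java
public class RowKeyPadding {
    public static void main(String[] args) {
        // Without padding, "9" sorts after "10" in dictionary order.
        System.out.println("9".compareTo("10") > 0);  // true

        // Left-padding with zeros restores the numeric order.
        String k1 = String.format("%010d", 9L);   // "0000000009"
        String k2 = String.format("%010d", 10L);  // "0000000010"
        System.out.println(k1.compareTo(k2) < 0); // true
    }
}
```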

Physical storage: refers to how to store large tables on multiple servers.

Features: all rows in a Table are sorted by row key. A Table is split in the row direction into multiple HRegions by size; each table starts with a single region, and as data is inserted the region grows and, once it exceeds a threshold, splits into new HRegions. The HRegion is the unit of distributed storage and load balancing in HBase, meaning different Regions can be distributed across different RegionServers. The HRegion is the smallest unit of distribution, but not the smallest unit of storage: a Region consists of multiple Stores, one Store per column family, and a Store consists of one MemStore and zero or more StoreFiles (a StoreFile is a file in HDFS and can be subdivided further, which is not covered here; a general impression is enough).

System architecture

Client: contains the interfaces for accessing HBase; the client also maintains some caches to speed up access, such as region location information.

Region Server: maintains the Regions assigned to it by the Master and handles IO requests for those Regions; it also splits Regions that grow too large during operation.

Tip: note that the client does not need the master to access HBase data; addressing goes through ZooKeeper and the Region Servers, and data reads and writes go to the Region Servers. The master only maintains the metadata of tables and Regions, so its load is low.

Key algorithms and processes

Region location: HBase uses a three-tier structure, similar to a B+ tree, to locate Regions. The first tier is a node stored in ZooKeeper that holds the location of the root Region; the second tier, the root Region, is the first Region of the .META. table and stores the locations of the other .META. Regions; the third tier, the .META. table, is a special table that stores the Region location information for all user tables in HBase.

Read and write process: HBase uses the MemStore and StoreFiles to store updates to a table. When data is updated, it is first written to the log and to the MemStore, where it is kept sorted. When the MemStore accumulates to a certain threshold, a new MemStore is created and the old one is added to a flush queue, from which a separate thread writes it to disk as a StoreFile; a redo point is then recorded in ZooKeeper to indicate that the update has been persisted. If the system fails, data written after the last checkpoint can be recovered from the log (the same idea as in a traditional database).

Region allocation: at any moment a Region can be assigned to only one Region Server. The master keeps track of the currently available servers and of the current Region assignments; when there is an unassigned Region and a server has free capacity, the master sends a load request to that server and assigns the Region to it.

Region Server online and offline: the master tracks Region Server state through ZooKeeper. When a server starts, it creates a file representing itself under ZooKeeper's server directory and takes an exclusive lock on that file. Because the master watches this directory for changes, it is notified when files are added or deleted. When a server goes offline, its session with ZooKeeper is closed and the exclusive lock is released; the master then finds and deletes the corresponding file and reassigns that server's Regions to other servers.

Master online and offline: the master obtains a unique master lock from ZooKeeper to prevent other nodes from acting as master, scans the server directory on ZooKeeper to get the list of Region Servers, communicates with each server to learn the current Region assignments, and scans the .META. Regions to compute the currently unassigned Regions, which are put into a to-be-assigned list.

Installation and configuration


Common operations

For example, to create and populate a table like this:

name (row key) | grad | course:math | course:art
xionger | 1 | 62 | 60
xiongda | 2 | 100 | 98

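The original commands are collapsed in the source; as a stand-in, here is a sketch using the HBase 1.x Java client API that creates the example table above (row key = name, column families grad and course) and inserts the two rows. The ZooKeeper quorum address and the table name "scores" are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host"); // placeholder quorum address
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            TableName name = TableName.valueOf("scores");

            // Create the table with the two column families from the example.
            try (Admin admin = connection.getAdmin()) {
                if (!admin.tableExists(name)) {
                    HTableDescriptor desc = new HTableDescriptor(name);
                    desc.addFamily(new HColumnDescriptor("grad"));
                    desc.addFamily(new HColumnDescriptor("course"));
                    admin.createTable(desc);
                }
            }

            try (Table table = connection.getTable(name)) {
                // Row "xionger": grad=1, course:math=62, course:art=60.
                Put p1 = new Put(Bytes.toBytes("xionger"));
                p1.addColumn(Bytes.toBytes("grad"), Bytes.toBytes(""), Bytes.toBytes("1"));
                p1.addColumn(Bytes.toBytes("course"), Bytes.toBytes("math"), Bytes.toBytes("62"));
                p1.addColumn(Bytes.toBytes("course"), Bytes.toBytes("art"), Bytes.toBytes("60"));
                table.put(p1);

                // Row "xiongda": grad=2, course:math=100, course:art=98.
                Put p2 = new Put(Bytes.toBytes("xiongda"));
                p2.addColumn(Bytes.toBytes("grad"), Bytes.toBytes(""), Bytes.toBytes("2"));
                p2.addColumn(Bytes.toBytes("course"), Bytes.toBytes("math"), Bytes.toBytes("100"));
                p2.addColumn(Bytes.toBytes("course"), Bytes.toBytes("art"), Bytes.toBytes("98"));
                table.put(p2);

                // Read one cell back.
                Result r = table.get(new Get(Bytes.toBytes("xionger")));
                System.out.println(Bytes.toString(
                        r.getValue(Bytes.toBytes("course"), Bytes.toBytes("math"))));
            }
        }
    }
}
```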


This part is finally finished. The next posts will focus on existing integration solutions, including deployment on Docker.

In addition, when time allows I plan to look at blockchain and to revisit data structures, which still do not feel fully solid; the B+ tree is one example.
