In this post the editor shares an example-based analysis of common Hadoop, HBase, and Hive knowledge points. Many readers are not familiar with all of them, so the article is offered as a reference; I hope you take away something useful after reading it.
1. Hadoop
Hadoop mainly consists of HDFS, YARN, and Common.
For HDFS, be familiar with the basic file operations, such as the shell commands hadoop fs -ls / and hadoop fs -copyFromLocal path2 path3.
For the Java API, be familiar with FileSystem (e.g., FileSystem hdfs = FileSystem.get(conf)) and FileStatus.
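A minimal sketch of this Java API, listing the HDFS root directory (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);                  // obtain the HDFS handle
        FileStatus[] statuses = hdfs.listStatus(new Path("/"));  // list the root directory
        for (FileStatus status : statuses) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
        hdfs.close();
    }
}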
Know the respective roles of HDFS's NameNode, SecondaryNameNode, and DataNode.
The NameNode manages the file system namespace. Before a file is read or written, the client asks the NameNode for the file's metadata, i.e., the file name and its data blocks. The NameNode also receives heartbeats from the DataNodes and the block location reports they send.
The SecondaryNameNode is not a hot standby of the NameNode. It merges the NameNode's fsimage and edit log into a new fsimage, keeping the edit log small and reducing the NameNode's memory and restart cost.
In HA mode there is an Active NameNode and a Standby NameNode. The two stay in sync by sharing the edit log in a timely manner, failover is supported, and both register with ZooKeeper.
Because HDFS keeps three replicas by default, the data is distributed sensibly across racks using rack awareness, and computation is placed on, or as close as possible to, the machine that holds the data, so as to reduce network transfer.
RPC is used for communication over the network. A protocol here is an architecture-level notion: a set of interfaces and the methods in them. As long as the client and the server implement these interface methods, they can talk to each other. Dynamic proxies and NIO are used inside.
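The "interface as protocol" idea can be illustrated with a toy java.lang.reflect.Proxy sketch; this is not Hadoop's actual RPC implementation, only an illustration of how a client-side stub can intercept and forward method calls (the interface and values are made up):

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class RpcProxyDemo {
    // The "protocol": an interface that both client and server agree on.
    interface ClientProtocol {
        long getBlockSize(String path);
    }

    public static void main(String[] args) {
        // Client-side stub: every method call is intercepted here and could be
        // serialized and sent over the network; we just fake a server reply.
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            System.out.println("would send RPC call: " + method.getName());
            return 128L * 1024 * 1024; // pretend the server answered 128 MB
        };
        ClientProtocol stub = (ClientProtocol) Proxy.newProxyInstance(
                ClientProtocol.class.getClassLoader(),
                new Class<?>[]{ClientProtocol.class},
                handler);
        System.out.println("block size: " + stub.getBlockSize("/tmp/file"));
    }
}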
The DataNode is responsible for the actual storage of the data. Files are split into blocks (the Block size is configurable); for a large file, the NameNode records how many blocks it is split into and where each block is stored.
Because there are multiple DataNodes and they communicate with one another, a DataNode not only manages its local data and talks to the client, but also talks to other DataNodes (for example in the replication pipeline).
When a file is read, the client asks the NameNode for the locations of the blocks, and then fetches the data from the DataNodes that hold them.
Then there is the question of file serialization. Hadoop implements its own serialization mechanism (Writable) rather than using standard Java serialization.
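A hedged sketch of a custom Writable (the type and field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type serialized with Hadoop's Writable mechanism
// instead of standard Java serialization.
public class BlockInfoWritable implements Writable {
    private long blockId;
    private long length;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(blockId);   // fields are written in a fixed order...
        out.writeLong(length);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        blockId = in.readLong();  // ...and read back in exactly the same order
        length = in.readLong();
    }
}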
Communication between the NameNode and the DataNodes relies on heartbeats: each DataNode reports its status to the NameNode at a fixed interval. When a DataNode goes down, the NameNode re-assigns its blocks to other machines; when a task on one machine runs slowly, the framework can start the same task on another machine at the same time and take whichever finishes first.
For the input and output formats of files, you can also define your own formats by extending FileInputFormat and FileOutputFormat.
HDFS tuning: the Block size, the number of data replicas, rack awareness, the heartbeat interval, and (in Hadoop 3.0) cleaning up stale information in the NameNode in time to reduce the pressure of NameNode startup.
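A hedged sketch of setting some of these knobs from a Java client (the property names follow the usual hdfs-site.xml keys; verify them against your Hadoop version):

import org.apache.hadoop.conf.Configuration;

public class HdfsTuningConf {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "268435456");    // 256 MB block size
        conf.set("dfs.replication", "3");          // number of data replicas
        conf.set("dfs.heartbeat.interval", "3");   // DataNode heartbeat interval in seconds
        return conf;
    }
}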
YARN separates resource allocation and management from task monitoring and execution, and is a genuine general-purpose computing platform.
The ResourceManager (Scheduler plus ApplicationsManager) is responsible for allocating the resources a job needs to run, in the form of containers (CPU and memory).
The NodeManager manages the containers on each node, while the ApplicationMaster requests resources for a specific application and monitors and runs its tasks.
YARN supports computing frameworks such as MapReduce, Spark, and Storm.
MapReduce execution flow: a map phase and a reduce phase. In the map phase, the input is read through an InputFormat, and the map output is written to an in-memory buffer and partitioned; there are as many partitions as there are reducers, and within each partition the data is sorted by key. If a combiner function is set, values with the same key are merged. When the buffer reaches about 80% of its capacity, the data is spilled to disk, and finally the spill files on disk are merged, at most 10 files at a time by default. This is the output of the map phase.
In the reduce phase there is first a copy step: multiple threads fetch the results of each map into the reduce machine's memory, overflowing to disk when memory fills. The copied files are then merged and sorted, and finally the reduce computation is performed.
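As a concrete illustration of the map, combine, and reduce steps above, here is a minimal WordCount sketch (class names and the tokenization are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) pairs; the framework partitions, sorts and spills them.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Used both as the combiner (merging values of the same key on the map side)
// and as the reducer (final aggregation after copy/merge/sort).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}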
MapReduce tuning: the number of map and reduce tasks; the minimum and maximum amount of data each map and reduce handles; whether a combiner function can be set (its input and output formats must match the map output); the memory ratio at which data is spilled to disk; whether map output is compressed; whether the JVM is reused; and the use of the distributed cache.
When the input is made up of many small files, pack multiple small files together or merge them into Avro format.
The number of parallel copy threads opened during the copy (shuffle) phase.
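A hedged sketch of a job driver wiring in the combiner from the sketch above and a few of these tuning knobs (the mapreduce.* property names follow Hadoop 2.x conventions; check them against your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.task.io.sort.mb", "200");              // map-side sort buffer size
        conf.set("mapreduce.map.sort.spill.percent", "0.80");      // spill threshold
        conf.set("mapreduce.map.output.compress", "true");         // compress map output
        conf.set("mapreduce.reduce.shuffle.parallelcopies", "10"); // parallel copy threads

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // combiner input/output must match map output
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}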
2. Hive
Hive is a data warehouse. Through HiveQL it is convenient to manage tables with SQL-like statements and to do offline (batch) data processing.
First of all, tables created in the hive shell are divided into internal (managed) tables and external tables. For an internal table, the data is stored in the actual warehouse directory, and dropping the table deletes both the metadata and the data. An external table is only a link to data outside the warehouse; dropping it removes the metadata but does not touch the actual data.
CREATE EXTERNAL TABLE name (...) PARTITIONED BY (...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 'hdfs_path';
Load data into Hive: LOAD DATA INPATH 'hdfs_path' INTO TABLE name; or from the local file system: LOAD DATA LOCAL INPATH 'local_path' INTO TABLE name;
Or create the table directly from a query: CREATE TABLE name AS SELECT aaa, bbb FROM t1;
When importing data into multiple tables from one source, use: FROM tableA INSERT INTO TABLE tableB SELECT id, name, sex;
Export data from one Hive table into another table: INSERT OVERWRITE TABLE a SELECT ss FROM nn;
There are three ways to export: to the local file system, INSERT OVERWRITE LOCAL DIRECTORY 'local_path' SELECT aa, bb FROM table1; to HDFS, INSERT OVERWRITE DIRECTORY 'hdfs_path' SELECT aa FROM table2;
and into another table (optionally a partition): INSERT INTO TABLE test1 PARTITION (...) SELECT a FROM table3;
Hive also has a Java API.
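One common way to use Hive from Java is the HiveServer2 JDBC driver; a minimal sketch (host, port, and table name are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, name FROM table1 LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}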
Under the hood, Hive compiles HiveQL queries into MapReduce jobs for computation.
Hive supports three ways of joining tables: map join, reduce join, and semi join.
A map join is used when the smaller (left) table fits in memory: it is loaded into memory, and the rows of the larger table are compared against it one by one during the map phase.
A reduce join operates on the map output; because there are many map outputs that must be shuffled to the reducers, both the computation and the network transfer are expensive.
A semi join filters the map output before the reduce join: the (deduplicated) join keys of the small table are extracted into memory, the other table is compared against them, and rows whose keys do not match are dropped early.
Hive optimization: deal with data skew, use partitions, and tune join and group by behavior, for example with the settings below.
Hive's strict mode (hive.mapred.mode) is in many cases disabled, i.e., set back to nonstrict.
CLUSTER BY is equivalent to the combination of DISTRIBUTE BY and SORT BY on the same columns.
Enable map-side aggregation: set hive.map.aggr=true; and set the number of rows checked for map-side aggregation: set hive.groupby.mapaggr.checkinterval=10000;
Whether to merge small map output files: set hive.merge.mapfiles=true;
Whether to merge the reduce output file, set hive.merge.mapredfiles=true
3. HBase
HBase is a column-oriented database. A table must have column families; the number of columns can grow on the fly, and real-time queries are supported, addressed by row key.
In the HBase shell: create 'table1','fam', then put, get, scan (Scanner), and delete.
HMaster is responsible for monitoring each HRegionServer and communicating with ZK
HRegionServer manages each HRegion and registers with ZK.
There are multiple HRegions in an HRegionServer; when running MapReduce over HBase, each HRegion roughly corresponds to one map task.
When writing data, HBase first appends to the write-ahead log, then writes to the MemStore; when the MemStore is full, it is flushed to disk as an HFile (HBase's file format). The BlockCache is used to cache blocks of frequently read files.
When reading a specific piece of data, the client first gets the location of -ROOT- from ZK, finds .META. through -ROOT-, and then locates the Region on the specific RegionServer; frequently read files are also cached.
HBase Java API: HTable for table access, HBaseAdmin for creating tables, together with HTableDescriptor and HColumnDescriptor.
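A minimal sketch using the classic HBase client classes named above (HBaseAdmin, HTableDescriptor, HColumnDescriptor are deprecated in newer HBase versions in favor of Admin/Table, so treat this as illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Create a table with one column family, as in the shell: create 'table1','fam'
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("table1"));
        desc.addFamily(new HColumnDescriptor("fam"));
        admin.createTable(desc);
        admin.close();

        // Put one cell: rowkey "row1", column fam:col, value "v1"
        HTable table = new HTable(conf, "table1");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("fam"), Bytes.toBytes("col"), Bytes.toBytes("v1"));
        table.put(put);
        table.close();
    }
}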
HBase supports Bloom filters (configured per column family) to quickly skip files that cannot contain the requested row or column.
The most important thing is rowkey design. To prevent hot spots, a salt or a hash is usually combined into the rowkey so that keys are distributed evenly across regions.
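A hedged sketch of the salting idea: prefix the original key with a small hash-derived bucket so that sequential keys spread across regions (the bucket count and key format are illustrative choices):

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKey {
    private static final int BUCKETS = 16;   // illustrative number of salt buckets

    // Prefix the original key with a two-digit bucket derived from its hash,
    // so that monotonically increasing keys do not all hit the same region.
    public static byte[] salt(String originalKey) {
        int bucket = (originalKey.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        return Bytes.toBytes(String.format("%02d_%s", bucket, originalKey));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(salt("user12345_20240131")));
    }
}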
That is all for this "Sample Analysis of Hadoop, HBase and Hive Knowledge Points". Thank you for reading; I hope it has given you a clearer understanding and is of some help.