
Saturday, 2019-2-16

A general introduction to big data technologies in the Hadoop field (the role of each component)

1. Introduction to big data technologies

The big data technology ecosystem:

Hadoop: the veteran distributed system for storing and processing massive data, best suited to offline data analysis

HBase: a Hadoop-based distributed database for massive data, covering both offline analysis and online business

Hive: a Hadoop-based data warehouse tool, easy to use and feature-rich, with a SQL-like query language

ZooKeeper: cluster coordination service

Sqoop: data import and export tool (between Hadoop and relational databases)

Flume: data collection framework // often combined with Kafka to form a data pipeline, or used to collect large volumes of logs into HDFS; for log collection and analysis most enterprises use ELK

Storm: real-time stream computing framework, a leading framework in the stream processing field

Spark: memory-based distributed computing framework, a one-stop, all-in-one platform; a newcomer with rapid momentum

SparkCore // general application development

SparkSQL // SQL operations, similar to Hive (see the sketch following this list)

SparkStreaming // stream processing, similar to Storm

Machine learning:

Mahout: machine learning algorithm library based on MapReduce

MLlib: machine learning algorithm library based on Spark
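To make the "SparkSQL is similar to Hive" point concrete, here is a minimal sketch (not from the article) of a Spark SQL job in Java; the HDFS path, the access_logs view name, and the url column are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        // The master URL is normally supplied by spark-submit.
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-sketch")
                .getOrCreate();

        // Load a (hypothetical) cleaned log file stored on HDFS.
        Dataset<Row> logs = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/clean/access_logs");

        // Register it as a temporary view and query it with SQL,
        // much as one would query a Hive table.
        logs.createOrReplaceTempView("access_logs");
        Dataset<Row> topUrls = spark.sql(
                "SELECT url, COUNT(*) AS pv FROM access_logs " +
                "GROUP BY url ORDER BY pv DESC LIMIT 10");

        topUrls.show();
        spark.stop();
    }
}
```

The same GROUP BY query could be run almost verbatim against a Hive table, which is why the two tools are so often compared.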

As the figure shows, the big data Hadoop ecosystem resembles a zoo, and the ZooKeeper component is like the keeper who manages these animals. // The ecosystem contains many more components than the ones mentioned above; the figure shows only the basic ones.
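As a rough illustration of what that "keeper" role looks like in practice, below is a minimal ZooKeeper client sketch in Java; the connection string localhost:2181, the /workers parent znode, and the worker name are assumptions made for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkRegisterSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the coordination service (30 s session timeout, no-op watcher).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // Make sure the persistent parent znode exists.
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register this worker as an ephemeral znode; it disappears automatically
        // if the worker's session dies, which is how the "keeper" notices that
        // an "animal" has gone missing.
        zk.create("/workers/worker-1", "host1:9000".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // List all currently registered workers.
        System.out.println(zk.getChildren("/workers", false));
        zk.close();
    }
}
```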

2. How deeply do you need to master these frameworks?

1. Understand the functions and applicable scenarios of each framework

2. Use it (installation and deployment, programming conventions, API)

3. Operating mechanism

4. Architecture principles

5. Source code

3. A basic introduction to Hadoop

(1) Hadoop is a technology platform for processing (computing on and analyzing) massive data, and it does so with a distributed cluster.

(2) The two major functions of Hadoop:

- provides storage services for massive data

- provides a programming framework and running platform for analyzing massive data

(3) Hadoop has three core components:

- HDFS: the Hadoop Distributed File System (a cluster service), for storing massive data

- MapReduce: a distributed computing framework (a programming framework: import the jars and write code against them), for computing on and analyzing massive data (alternatives: Storm, Spark, etc.); a minimal word-count sketch follows this list

- YARN: resource scheduling and management for the cluster (it can be understood as a distributed operating system that manages and allocates the cluster's hardware resources)
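As a concrete example of the "import the jars and write code against the framework" style mentioned above, here is a minimal word-count sketch in Java (a standard illustrative example, not taken from the article); input and output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);   // emit (word, 1) for each token
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));   // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The user only writes the Mapper and Reducer; the framework handles splitting the input, distributing the tasks, and shuffling intermediate data across the cluster.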

(4) Using Hadoop:

- On one hand, Hadoop can be understood as a programming framework (analogous to Struts, Spring, Hibernate/MyBatis), with its own API encapsulation and programming conventions; users call these APIs to implement their data processing logic. From another point of view, Hadoop can be understood as software that provides a service (analogous to database services such as Oracle/MySQL, the index service Solr, or the cache service Redis): user programs implement specific functions by requesting services from the Hadoop cluster as clients. A minimal client sketch follows.
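Here is a minimal sketch of that "request services from the cluster as a client" view, using the HDFS Java client API; the NameNode address hdfs://namenode:8020, the user name, and the file paths are assumptions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster (the address is an assumption).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "hadoop");

        // Ask the cluster to store a local log file in HDFS.
        fs.copyFromLocalFile(new Path("/tmp/access.log"),
                             new Path("/data/raw/access.log"));

        // Ask the cluster what is stored under /data/raw.
        for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

The pattern mirrors using a database or cache client: the application never touches the DataNodes directly, it simply issues requests to the service.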

(5) A brief history of Hadoop

It began with three landmark technical papers from Google: GFS, MapReduce, and BigTable.

(Why did Google need such technology?)

Later, after Doug Cutting's "shanzhai" (clone) reimplementation, Java versions of HDFS, MapReduce, and HBase appeared

and became the Apache top-level projects Hadoop and HBase.

After further evolution, Hadoop gained another component, YARN (Hadoop = MapReduce + YARN + HDFS).

Moreover, more and more tool components have grown up around Hadoop, forming a huge Hadoop ecosystem.

Why do we need Hadoop?

When the volume of data is very large, a single machine is no longer up to the task, so the data must be processed on a distributed cluster; but using a distributed cluster multiplies the complexity of the implementation. Given the demands of massive data processing, a general-purpose distributed data processing framework can greatly reduce the difficulty of application development and the amount of work involved.

The overall workflow of a Hadoop business application: see figure

Flume data collection -> MapReduce cleaning -> store in HBase or HDFS -> Hive statistical analysis -> save results to a Hive table -> Sqoop import/export -> MySQL database -> web display

Tip: when the data volume is very large, we can add a Kafka message queue at the Flume collection node to act as a buffer; in the data cleaning phase we can use an in-memory computing framework such as Spark, or a real-time streaming framework such as Storm or Flink (chosen according to the business scenario); the cleaned data is stored in HBase or in HDFS; in the data analysis phase we can use computing tools such as Hive or Impala; for web display we can use Kibana (the data visualization tool from the ELK stack) or Grafana.
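As a rough sketch of the "Hive statistical analysis" step in that pipeline, the Java snippet below runs a summary query over a cleaned table through HiveServer2's JDBC interface; the server address hive-server:10000, the user, and the access_logs / url_pv_daily table and column names are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveStatsSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // Daily page views per URL, written to a result table that
            // Sqoop could later export to MySQL for the web layer.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS url_pv_daily AS " +
                "SELECT dt, url, COUNT(*) AS pv " +
                "FROM access_logs GROUP BY dt, url");

            ResultSet rs = stmt.executeQuery(
                "SELECT url, pv FROM url_pv_daily ORDER BY pv DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("pv"));
            }
        }
    }
}
```

Hive translates these SQL statements into distributed jobs over the data in HDFS, which is exactly why it is described above as "easy to use" for statistical analysis.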
