
Detailed explanation of the Hadoop ecosystem


After a year of learning and using Hadoop, this article shares an overall understanding of Hadoop, introduces the related components by category, and finally suggests a learning route, in the hope of providing a reference for Hadoop beginners.

1. What are the core components of Hadoop, and what does Hadoop mean in a broad sense?

The core components are HDFS, YARN, and MapReduce.

In a broad sense, Hadoop refers to an ecosystem: the open-source components and products related to big data technology, such as HDFS, YARN, HBase, Hive, Spark, Pig, ZooKeeper, Kafka, Flume, Phoenix, and Sqoop.

2. What is the relationship between Spark and Hadoop?

Spark is also an ecosystem. It is developing very quickly, and its computation can be many times faster than MapReduce. It provides a simple yet rich programming model that supports a variety of applications, including ETL, machine learning, stream processing, and graph computation.

Hadoop and Spark overlap in some of their functionality, but the two work well together.

3. The components in detail, by category

To make them easier to understand, the components below are grouped by function, with the more widely used ones listed first in each group.

File systems:

HDFS, a widely used distributed file system, is the basic, general-purpose file storage component across big data application scenarios (see the sketch after this group).

S3 (Simple Storage Service), with better scalability, built-in durability, and a lower price.
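
To make the HDFS role concrete, here is a minimal sketch using the Hadoop Java FileSystem API to create a directory, upload a local file, and list the result. The NameNode address hdfs://namenode:8020 and the paths are placeholder assumptions, not values from this article.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Connect to the NameNode; the address is a placeholder, adjust it to your cluster
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Create a directory and upload a local file into it
        Path dir = new Path("/data/raw");
        fs.mkdirs(dir);
        fs.copyFromLocalFile(new Path("access.log"), new Path(dir, "access.log"));

        // List what is now stored under /data/raw
        for (FileStatus st : fs.listStatus(dir)) {
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }
        fs.close();
    }
}
```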

Resource scheduling

YARN, a distributed resource scheduler, receives computing tasks and assigns them to the nodes of the cluster for processing; it is effectively the operating system of the big data cluster, with good generality and ecosystem support.

Mesos, similar to YARN, leans more toward the abstraction and management of resources.

Computing frameworks:

The Spark family: streaming computing, graph computing, and machine learning.

Flink, which supports computation over continuously changing data, that is, incremental computation.

Storm, which specializes in stream processing and is very capable at it.

MapReduce, the basic framework of distributed computing; it is comparatively hard to program against and executes less efficiently.
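
As an illustration of why programming MapReduce directly is comparatively laborious, here is the classic word-count job written against the Hadoop MapReduce Java API; the input and output HDFS paths are taken from the command line and are assumptions.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();   // add up all counts for this word
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```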

Databases:

HBase, a NoSQL column-family database, supports storing and accessing data at the scale of billions of rows and millions of columns. Write performance is especially good, and reads are served in near real time. It provides its own API rather than SQL, and stores its data on HDFS (see the sketch after this group).

Cassandra, which offers the best support for very large tables and the Dynamo model.

Redis, which is extremely fast and can also be used as a distributed cache.
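
As a sketch of the HBase API mentioned above (no SQL, data stored on HDFS), the following writes and reads a single cell with the HBase Java client. The ZooKeeper quorum, the table name user_events, and the column family info are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");  // placeholder ZooKeeper quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_events"))) {

            // Write one row: row key = user id, column family "info"
            Put put = new Put(Bytes.toBytes("user-0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_login"), Bytes.toBytes("2020-06-01"));
            table.put(put);

            // Read the row back
            Result result = table.get(new Get(Bytes.toBytes("user-0001")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_login"));
            System.out.println("last_login = " + Bytes.toString(value));
        }
    }
}
```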

SQL support:

Spark SQL, which evolved from Shark and Hive, accesses data sources through SQL (such as HDFS, HBase, S3, Redis, and even relational databases; the same applies below).

Phoenix, a JDBC driver focused on SQL access to HBase; it supports most SQL syntax, secondary indexes, and transactions, with low latency.

Hive, which uses HQL (SQL-like) queries to analyze data and produce results, parsing HQL into tasks that can be executed on MapReduce (see the sketch after this group). A typical application scenario is integration with HBase.

Others such as Impala and Pig provide similar functionality: they remove the complexity of writing map/reduce code directly to analyze data, and lower the barrier for data analysts and developers to work with big data.
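
As a sketch of this SQL-style access, the following runs an HQL aggregation through the HiveServer2 JDBC driver; Hive then compiles the query into MapReduce (or Tez/Spark) tasks behind the scenes. The endpoint, credentials, and the table page_views are placeholder assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, database, and user are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks like SQL, but is compiled into distributed jobs by Hive
            String hql = "SELECT category, COUNT(*) AS cnt "
                       + "FROM page_views GROUP BY category ORDER BY cnt DESC LIMIT 10";
            try (ResultSet rs = stmt.executeQuery(hql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }
}
```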

Other tools:

ZooKeeper, a distributed coordination service, can be understood as a small, high-performance database that provides publish/subscribe functionality to many components in the ecosystem. It can also detect node failures (heartbeat detection). For example, HBase and Kafka use ZooKeeper to store master/slave node information.

Kafka, a distributed publish/subscribe messaging system, functions like a message queue. It receives data from producers (such as web services, files, HDFS, HBase, etc.), can buffer it, and then delivers it to consumers (likewise), providing buffering and adaptation between the two (see the sketch after this group).

Flume, a distributed system for collecting, aggregating, and transporting massive volumes of log data; its main role is data collection and transport, and it supports a large number of input and output data sources.

Sqoop, mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...); it can import data from a relational database (such as MySQL, Oracle, or Postgres) into Hadoop's HDFS, and export data from HDFS back into a relational database.
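
As a sketch of the producer side of Kafka described above, the following publishes a log line to a topic with the Kafka Java client. The broker list and the topic name raw-logs are placeholder assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");  // placeholder broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each collected log line is published to the "raw-logs" topic,
            // where downstream consumers (Spark/Storm/Flume sinks) can pick it up
            producer.send(new ProducerRecord<>("raw-logs", "host-01", "GET /index.html 200"));
            producer.flush();
        }
    }
}
```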

4. Typical combined usage scenarios

The components of the Hadoop and Spark ecosystems cooperate with each other; each has its own place to shine, and in combination they can satisfy a wide variety of business requirements. Here are two examples:

(1) Data collection, storage, and analysis

In this scenario, data is collected, stored, and analyzed end to end, and the results are output. The components fit together as follows:

Flume + Kafka (ZooKeeper) + HDFS + Spark/Storm/Hive + HBase (ZooKeeper, HDFS) / Redis

The explanation is as follows:

Flume collects data from various channels (e.g. http, exec, file, kafka, etc.) and sends it to Kafka (of course, it could also be written to HDFS, HBase, files, …).

Kafka buffers the data; like Flume, it supports input and output over a variety of protocols. Kafka relies on ZooKeeper for load balancing and HA, so ZooKeeper is required to support it.

For the computation step there are three options: Spark, Storm, or Hive, each with its own strengths. Comparatively, Hive is still widely used and the technology appeared earlier; Storm focuses on stream processing with very low latency; Spark is the most promising of the computing tools. Whichever is used, the goal is ultimately the same: clean the data, compute statistics, and output the results.

For storing the result data, you can use HBase (ZooKeeper) / Redis or MySQL, depending on the usage scenario (the amount of data and other factors). Because the processed result data is usually relatively small, it can be put directly into Redis, and then conventional techniques can be used to build reports or otherwise consume these computed results.
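
As a rough sketch of the computation step in this pipeline, the following Spark job (Java API) reads raw lines from HDFS, cleans them, counts occurrences per URL, and writes the small result set back to HDFS (it could equally go to Redis or HBase). The HDFS paths and the log line layout are assumptions.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LogCountExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read raw log lines that Flume/Kafka have landed in HDFS (path is a placeholder)
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/raw/");

            // Clean: keep non-empty lines, then count occurrences per URL
            // (the URL is assumed to be the second space-separated field)
            JavaPairRDD<String, Integer> counts = lines
                    .filter(line -> !line.trim().isEmpty())
                    .mapToPair(line -> new Tuple2<>(line.split(" ")[1], 1))
                    .reduceByKey((a, b) -> a + b);

            // Write the small result set back to HDFS; it could equally go to Redis or HBase
            counts.saveAsTextFile("hdfs://namenode:8020/data/result/url_counts");
        }
    }
}
```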

(2) Data storage and real-time access

This scenario is very similar to conventional application development: the big data cluster is accessed through Java's JDBC, and the components are combined as follows:

JDBC + Solr + Phoenix/Spark SQL + HBase (ZooKeeper) + HDFS

The explanation is as follows:

JDBC is the standard way for Java to work with databases, using SQL statements.

Solr provides full-text search, handling the site's tokenized (word-segmented) search functionality.

Phoenix/Spark SQL makes it convenient to access the HBase database through JDBC.

HDFS ultimately provides the physical storage of the data.
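
As a sketch of this access path, the following queries HBase through the Phoenix JDBC driver in the same way a conventional Java application would query a relational database. The ZooKeeper quorum, the table users, and its columns are placeholder assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the ZooKeeper quorum of the HBase cluster (placeholder hosts)
        String url = "jdbc:phoenix:zk1,zk2,zk3:2181";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name, last_login FROM users WHERE id = ?")) {
            ps.setString(1, "user-0001");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Standard JDBC result handling, even though the data lives in HBase/HDFS
                    System.out.println(rs.getString("name") + "\t" + rs.getString("last_login"));
                }
            }
        }
    }
}
```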

5. Recommended learning route

Based on personal experience, it is roughly divided into three stages, as follows:

