After a year of learning and using Hadoop, I share here my overall understanding of it: the related components are introduced by category, and a suggested learning route is given at the end, in the hope of providing a reference for Hadoop beginners.
1. What are the core components of Hadoop? What does Hadoop mean in a broad sense?
The core components are HDFS, YARN, and MapReduce.
In a broad sense, Hadoop refers to an ecosystem: the open-source components and products related to big data technology, such as HDFS, YARN, HBase, Hive, Spark, Pig, ZooKeeper, Kafka, Flume, Phoenix, and Sqoop.
2. What is the relationship between Spark and Hadoop?
Spark is also an ecosystem. It is developing very quickly and is many times faster than MapReduce for many computations. It provides a simple and rich programming model and supports a wide variety of applications, including ETL, machine learning, stream processing, and graph computing.
Hadoop and Spark overlap in some areas, but the two work well together; a minimal example of them working together is sketched below.
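As an illustration of Spark and Hadoop working together, here is a minimal sketch of a Spark job that reads a text file from HDFS and counts words. The namenode address and file path are hypothetical, and a working PySpark installation is assumed.

```python
# A minimal sketch, assuming pyspark is installed and an HDFS namenode is
# reachable at the (hypothetical) address namenode:9000.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()

# Read a text file stored in HDFS -- Spark uses Hadoop's input formats under the hood.
lines = spark.read.text("hdfs://namenode:9000/data/input.txt")

# Split lines into words, count each word, and show the most frequent ones.
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(col("count").desc()))
counts.show(10)

spark.stop()
```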
3. The components in detail, by category
To make things easier to understand, the components are grouped by function below, with the more popular ones listed first in each category:
File system
HDFS, a widely used distributed file system and the basic, general-purpose file storage component across big data application scenarios.
S3 (Simple Storage Service), which offers better scalability, built-in durability, and a lower price.
Resource scheduling
YARN, a distributed resource scheduler that receives computing tasks and assigns them to the nodes of the cluster; it is effectively the operating system of a big data cluster, with good generality and ecosystem support.
Mesos, similar to YARN, but leaning more toward the abstraction and management of resources.
Computing framework
Spark and its family of libraries: stream computing, graph computing, and machine learning.
Flink, which supports computation over continuously changing data, i.e. incremental computation.
Storm, which specializes in stream computing and is powerful in that role.
MapReduce, the basic framework of distributed computing; it is comparatively hard to program against and slow to execute (see the sketch after this list).
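To give a feel for the programming overhead mentioned above, here is a minimal word count written in the classic map/reduce style for Hadoop Streaming, which runs external scripts as the mapper and reducer. The file names and the word-count task itself are illustrative, not from the article.

```python
# --- mapper.py -- a minimal Hadoop Streaming mapper: emit "word<TAB>1" per word ---
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# --- reducer.py -- Hadoop Streaming reducer: input arrives sorted by key, so
# counts for the same word are adjacent and can be summed in one pass ---
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```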
Database
HBase, a NoSQL column-family database that supports storage and access of data with billions of rows and millions of columns. Write performance is particularly good and reads are close to real time. It provides its own API rather than SQL (see the sketch after this list), and its data is stored on HDFS.
Cassandra, a NoSQL database that combines the wide-table model of Bigtable with the distributed design of Dynamo.
Redis, which is extremely fast and is also commonly used as a distributed cache.
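To illustrate the "own API, no SQL" point about HBase above, here is a minimal sketch using the happybase Python client, which talks to HBase through its Thrift server. The host, table, and column names are hypothetical.

```python
# A minimal sketch of HBase's put/get style API, assuming an HBase Thrift
# server is running on the (hypothetical) host "hbase-thrift" and the
# happybase client library is installed (pip install happybase).
import happybase

connection = happybase.Connection("hbase-thrift", port=9090)
table = connection.table("web_metrics")  # hypothetical table with column family "cf"

# HBase exposes key/value operations rather than SQL: write a cell, read it back.
table.put(b"row-20240101", {b"cf:page_views": b"1024"})
row = table.row(b"row-20240101")
print(row[b"cf:page_views"])  # b'1024'

connection.close()
```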
SQL support
Spark SQL, which grew out of Shark and Hive, accesses data sources through SQL (such as HDFS, HBase, S3, Redis, and even relational databases; the same applies below). See the sketch after this list.
Phoenix, a JDBC driver focused on SQL access to HBase; it supports most SQL syntax, secondary indexes, and transactions, with low latency.
Hive, which uses HQL (an SQL-like language) to analyze data and produce query results; it parses HQL into tasks that run on MapReduce. A typical application scenario is integration with HBase.
Others, such as Impala and Pig, provide similar functionality: they remove the complexity of writing map/reduce jobs by hand and lower the barrier for analysts and developers to work with big data.
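As a small illustration of the SQL-support idea, the following sketch uses Spark SQL to run a plain SQL query over data stored in HDFS. The path, view name, and columns are hypothetical, and Hive metastore support is assumed to be available.

```python
# A minimal sketch of "SQL over big data" using Spark SQL; paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-on-hadoop")
         .enableHiveSupport()   # lets Spark also read tables registered in the Hive metastore
         .getOrCreate())

# Register a Parquet dataset in HDFS as a temporary view, then query it with plain SQL.
orders = spark.read.parquet("hdfs://namenode:9000/warehouse/orders")
orders.createOrReplaceTempView("orders")

top_products = spark.sql("""
    SELECT product_id, COUNT(*) AS cnt
    FROM orders
    GROUP BY product_id
    ORDER BY cnt DESC
    LIMIT 10
""")
top_products.show()
spark.stop()
```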
Other tools
ZooKeeper, a distributed coordination service. It can be understood as a small, high-performance database that provides publish/subscribe functionality to many components in the ecosystem, and it can also detect node failures (via heartbeats). For example, HBase and Kafka use ZooKeeper to store master/slave node information.
Kafka, a distributed publish/subscribe messaging system that plays the role of a message queue. It receives data from producers (such as web services, files, HDFS, HBase, etc.), caches it, and delivers it to consumers (likewise), providing buffering and adaptation between them (see the sketch after this list).
Flume, a distributed system for collecting, aggregating, and moving large volumes of log data; its main role is data collection and transport, and it supports a large number of input and output sources.
Sqoop, mainly used for transferring data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...); it can import data from a relational database (such as MySQL, Oracle, or Postgres) into HDFS, and export data from HDFS back into a relational database.
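To make Kafka's publish/subscribe role concrete, here is a minimal sketch using the kafka-python client: one side produces records to a topic, the other consumes them at its own pace. The broker address, topic name, and payload are hypothetical.

```python
# A minimal publish/subscribe sketch with the kafka-python client
# (pip install kafka-python); broker address and topic name are hypothetical.
from kafka import KafkaProducer, KafkaConsumer

# Producer side: an upstream component (e.g. a log shipper) pushes records to a topic.
producer = KafkaProducer(bootstrap_servers="kafka-broker:9092")
producer.send("access-logs", b'{"path": "/index.html", "status": 200}')
producer.flush()

# Consumer side: a downstream component pulls records at its own pace,
# which is the buffering/adaptation role described above.
consumer = KafkaConsumer(
    "access-logs",
    bootstrap_servers="kafka-broker:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no message arrives for 5 seconds
)
for message in consumer:
    print(message.value)
```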
4. Typical combined usage scenarios
The components of the Hadoop and Spark ecosystems cooperate with each other, each with its own place to shine; combined, they can meet a wide variety of business requirements. Here are two examples:
(1) data acquisition, storage and analysis scenarios
In this scenario, data is collected, stored, and analyzed, and the results are output. The components are combined as follows:
Flume + Kafka (ZooKeeper) + HDFS + Spark/Storm/Hive + HBase (ZooKeeper, HDFS) / Redis
The explanation is as follows:
Flume collects data from various channels (e.g. HTTP, exec, files, Kafka, etc.) and sends it to Kafka (of course, it could also write to HDFS, HBase, files, ...).
Kafka caches the data and, like Flume, supports many input and output protocols. Kafka relies on ZooKeeper for load balancing and high availability, so ZooKeeper must also be deployed.
For the computation itself there are three options, Spark/Storm/Hive, each with its own advantages: Hive is still widely used and appeared earliest; Storm focuses on stream processing with very low latency; Spark is the most promising of the computing tools. Whichever is used, the goal is ultimately to clean the data, compute statistics, and output the results.
For storing the result data, you can choose among HBase (ZooKeeper), Redis, or MySQL depending on the usage scenario (data volume and similar factors). Because the processed results are usually fairly small, they can be put directly into Redis, where conventional techniques can then be used to build reports or otherwise consume the computed results (see the sketch below).
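Below is a deliberately simplified, end-to-end sketch of scenario (1): events are read from Kafka, a small aggregate is computed, and the compact result is written to Redis for reporting. In a real deployment the computation step would be a Spark/Storm/Hive job over data landed in HDFS; the broker, topic, host names, and keys here are hypothetical.

```python
# A minimal sketch of scenario (1): Kafka -> simple aggregation -> Redis.
# Requires the kafka-python and redis client libraries; all names are hypothetical.
import json
from collections import Counter

from kafka import KafkaConsumer
import redis

consumer = KafkaConsumer(
    "access-logs",
    bootstrap_servers="kafka-broker:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
r = redis.Redis(host="redis-host", port=6379)

# "Computation" step: count page views per path (in practice this would be a
# Spark/Storm/Hive job over data landed in HDFS).
page_views = Counter()
for message in consumer:
    event = json.loads(message.value)
    page_views[event["path"]] += 1

# Result step: the aggregated data is small, so push it straight into Redis
# where a reporting front end can read it with conventional techniques.
for path, count in page_views.items():
    r.hset("report:page_views", path, count)
```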
(2) data storage and real-time access
This scenario is very similar to conventional application development: the big data cluster is accessed through Java's JDBC, with the components combined as follows:
JDBC + Solr + Phoenix/Spark SQL + HBase (ZooKeeper) + HDFS
The explanation is as follows:
JDBC is the standard way for Java to operate on databases, using SQL statements.
Solr is a full-text search engine, providing tokenized (word-segmented) search for the site.
Phoenix/Spark SQL makes it convenient to access the HBase database via JDBC (see the sketch below).
HDFS ultimately provides the physical storage of the data.
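The article describes this scenario in terms of Java and JDBC; as an analogous illustration in Python, the sketch below uses the phoenixdb client, which talks to the Phoenix Query Server over HTTP and exposes the same SQL-over-HBase access pattern. The server URL, table, and columns are hypothetical, and a running Phoenix Query Server is assumed.

```python
# A minimal sketch of scenario (2) using the phoenixdb client instead of Java JDBC.
# Phoenix maps SQL tables onto HBase tables, so these SQL statements end up as
# HBase reads/writes backed by HDFS. All names and the URL are hypothetical.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()

cursor.execute(
    "CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)"
)
cursor.execute("UPSERT INTO users VALUES (1, 'alice')")

# Real-time read path: the query is served by HBase through Phoenix.
cursor.execute("SELECT id, name FROM users WHERE id = 1")
print(cursor.fetchall())

conn.close()
```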
5. Recommended learning route
Based on personal experience, the learning route can be roughly divided into three stages.