2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
What are the layers of a big data platform? This article walks through them one by one, in the hope of giving readers who face this question a simple and workable overview.
A big data analysis and processing platform integrates the current mainstream big data processing and analysis frameworks and tools, each with a different focus, to support data mining and analysis. As big data technology has matured, enterprises have paid increasing attention to such platforms. Let's look at the architectural layers involved in building one.
1. Data transfer layer
Sqoop: supports bidirectional data migration between an RDBMS and HDFS. It is typically used to extract data from business databases (such as MySQL, SQL Server, or Oracle) into HDFS.
Canal: Alibaba's open source data synchronization tool, which implements incremental data subscription and near-real-time synchronization by listening to the MySQL binlog.
Flume: collects, aggregates, and transports massive volumes of logs, saving the resulting data to HDFS or HBase.
Flume + Kafka: supports real-time streaming log processing. A stream processing engine such as Spark Streaming can then consume the logs for real-time analysis and applications.
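To make the binlog-based incremental synchronization idea more concrete, here is a minimal sketch in plain Python of how a subscriber might replay row-level change events onto a replica. The event shape (`position`, `op`, `key`, `row`) and the function name `apply_change_events` are assumptions made for illustration; they are not Canal's actual protocol or client API.

```python
from typing import Any, Dict, List

# Hypothetical change-event shape, loosely modelled on row-level binlog
# entries; this is NOT Canal's actual wire format or client API.
ChangeEvent = Dict[str, Any]


def apply_change_events(replica: Dict[int, dict], events: List[ChangeEvent],
                        last_position: int = 0) -> int:
    """Replay events newer than last_position onto replica, in log order.

    Returns the new position so the next call can resume incrementally.
    """
    for event in sorted(events, key=lambda e: e["position"]):
        if event["position"] <= last_position:
            continue  # already applied in an earlier sync round
        if event["op"] in ("insert", "update"):
            replica[event["key"]] = event["row"]
        elif event["op"] == "delete":
            replica.pop(event["key"], None)
        last_position = event["position"]
    return last_position


events = [
    {"position": 1, "op": "insert", "key": 1, "row": {"name": "alice"}},
    {"position": 2, "op": "insert", "key": 2, "row": {"name": "bob"}},
    {"position": 3, "op": "update", "key": 1, "row": {"name": "alice2"}},
    {"position": 4, "op": "delete", "key": 2, "row": None},
]

replica: Dict[int, dict] = {}
# First round sees only the first two events; the second round resumes
# from the saved position and applies just the new changes.
position = apply_change_events(replica, events[:2])
position = apply_change_events(replica, events, last_position=position)
print(replica, position)
```

Tracking the last applied position is what makes the sync incremental: each round skips everything that was already replayed instead of copying the full table again.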
2. Data storage layer
HDFS: a distributed file system that underpins data storage management in distributed computing. It is an open source implementation of Google's GFS, can be deployed on inexpensive commodity machines, and offers high fault tolerance, high throughput, and high scalability.
HBase: a distributed, column-oriented NoSQL key-value database. It is an open source implementation of Google's Bigtable, uses HDFS as its file storage system, and suits real-time queries over big data (for example, instant messaging scenarios).
Kudu: a distributed big data storage engine that sits between HDFS and HBase. It supports both random reads and writes and OLAP analysis, addressing the pain point that HBase is poorly suited to batch analysis.
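As a rough illustration of how a distributed file system such as HDFS can achieve fault tolerance on commodity machines, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 match common HDFS defaults, but the round-robin placement and the function name `plan_blocks` are simplifying assumptions; real HDFS uses a rack-aware placement policy.

```python
import math
from typing import Dict, List


def plan_blocks(file_size: int, nodes: List[str],
                block_size: int = 128 * 1024 * 1024,
                replication: int = 3) -> List[Dict]:
    """Split a file of file_size bytes into blocks and assign replicas.

    Placement is simple round-robin (an assumption for illustration);
    real HDFS uses a rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    block_count = math.ceil(file_size / block_size)
    plan = []
    for index in range(block_count):
        start = index * block_size
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - start)
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        plan.append({"block": index, "length": length, "replicas": replicas})
    return plan


nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(300 * 1024 * 1024, nodes)
for block in plan:
    print(block)
```

With a 300 MB file this yields three blocks (two full and one of 44 MB), each stored on three distinct nodes, so losing any single node leaves every block readable.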
3. Resource management
YARN: Hadoop's resource manager. It handles unified management and scheduling of Hadoop cluster resources, provides server computing resources (CPU, memory) to computing programs such as MapReduce (MR) tasks, and supports MR, Spark, Flink, and other frameworks.
Kubernetes: a container orchestration engine for cloud platforms, open sourced by Google. It provides containerized management of applications and makes them portable across different clouds and operating system versions. Spark and Storm can already run on Kubernetes (K8s).
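The core job of a resource manager, matching requests for CPU and memory against what each machine has free, can be sketched as follows. This is a deliberately simple first-fit allocator; the `Node` class and `allocate` function are hypothetical names, and neither YARN's schedulers nor the Kubernetes scheduler work exactly this way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_vcores: int
    free_memory_mb: int


def allocate(nodes: List[Node], vcores: int, memory_mb: int) -> Optional[str]:
    """Place one container on the first node with enough free CPU and memory.

    Returns the node name, or None if the request has to wait.
    """
    for node in nodes:
        if node.free_vcores >= vcores and node.free_memory_mb >= memory_mb:
            node.free_vcores -= vcores
            node.free_memory_mb -= memory_mb
            return node.name
    return None


cluster = [Node("node1", 4, 8192), Node("node2", 8, 16384)]
requests = [(2, 4096), (4, 8192), (2, 4096), (8, 16384)]
placements = [allocate(cluster, vcores, memory) for vcores, memory in requests]
print(placements)
```

The last request comes back as `None` because no single node has 8 free vcores left. A real scheduler would queue such a request and layer on policies such as fairness, priorities, and data locality.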
4. Data computing layer
The computing engine determines computing efficiency and is the core of a big data platform. Engines have gone through roughly four generations of development and can be divided into offline computing frameworks and real-time computing frameworks.
5. Offline computing framework
MapReduce: a computing model, framework, and platform for parallel processing of big data. Its key design idea is to move computation close to the data, which reduces data transfer.
Hive: a data warehouse tool that manages data stored in HDFS, maps structured data files to database tables, and provides SQL query capabilities (in practice, Hive SQL is translated into MapReduce tasks). It is suited to offline, non-real-time data analysis.
Spark SQL: builds on a special data structure, the RDD (resilient distributed dataset), converting SQL into RDD computations and keeping intermediate results in memory. It therefore performs better than Hive and suits data analysis scenarios with tighter latency requirements.
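To show what "SQL translated into MapReduce tasks" can look like, the sketch below expresses a query such as `SELECT city, SUM(amount) FROM orders GROUP BY city` as map, shuffle, and reduce phases. It is a single-process simulation in plain Python; the `orders` table, its columns, and the function names are assumptions for illustration, and it does not reproduce the plan Hive actually generates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Rows of a hypothetical "orders" table: (city, amount).
ORDERS = [
    ("Beijing", 30), ("Shanghai", 20), ("Beijing", 50),
    ("Shenzhen", 10), ("Shanghai", 5),
]


def map_phase(rows: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Emit one (group key, value) pair per input row."""
    for city, amount in rows:
        yield city, amount


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Bring all values that share a key together, as the framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Apply the aggregate (SUM) to each group."""
    return {key: sum(values) for key, values in groups.items()}


totals = reduce_phase(shuffle_phase(map_phase(ORDERS)))
print(totals)
```

On a real cluster the map phase runs on the nodes that hold the data blocks and only the emitted key-value pairs cross the network during the shuffle. This is the "move computation close to the data" idea in practice.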
6. Real-time computing framework
Spark Streaming: a real-time stream processing framework that divides the stream into small batches by time slice, giving latency on the order of seconds. It can receive real-time input from Kafka, Flume, HDFS, and other sources, and after processing it saves results to HDFS, an RDBMS, HBase, Redis, dashboards, and other destinations.
Storm: a real-time stream processing framework with true streaming semantics: each record triggers computation as it arrives, giving low, millisecond-level latency.
Flink: a more advanced real-time stream processing framework that offers lower latency and higher throughput than Storm and supports out-of-order data and adjustable latency.
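The difference between micro-batching by time slice and record-at-a-time processing can be sketched in a few lines. The event tuples, the one-second slice, and the function names below are assumptions for illustration; this is not the API of Spark Streaming, Storm, or Flink.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, value)


def micro_batches(events: List[Event],
                  batch_seconds: int) -> List[Tuple[int, List[int]]]:
    """Group events into batches by time slice (micro-batch style)."""
    batches: Dict[int, List[int]] = defaultdict(list)
    for timestamp, value in events:
        slice_start = int(timestamp // batch_seconds) * batch_seconds
        batches[slice_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


def per_event(events: List[Event], handler: Callable[[int], int]) -> List[int]:
    """Invoke the handler once per event (record-at-a-time style)."""
    return [handler(value) for _, value in events]


events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5)]

# Micro-batch: one result per time slice, available once the slice closes.
batch_sums = [(start, sum(values)) for start, values in micro_batches(events, 1)]
print(batch_sums)

# Record-at-a-time: one result per event, available immediately.
doubled = per_event(events, lambda value: value * 2)
print(doubled)
```

In the micro-batch case a result cannot appear before its slice ends, which is why latency is bounded below by the batch interval. The per-event path has no such floor.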
7. Multi-dimensional analysis layer
Kylin: a distributed analysis engine that can query huge Hive tables with sub-second latency. It pre-computes the results for multi-dimensional combinations, saves them as Cubes, and stores them in HBase, trading space for time. When a user runs a SQL query, it is converted into a Cube query, which gives fast responses and high concurrency.
Druid: a highly fault-tolerant, high-performance open source distributed system for real-time data analysis. It can aggregate and analyze tables with billions of rows within seconds.
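The "trade space for time" idea behind cube pre-computation can be illustrated with a small sketch: aggregate the measure for every combination of dimensions up front, then answer queries with a dictionary lookup instead of scanning the fact rows. The fact table, dimension names, and functions are assumptions for illustration and do not reflect how Kylin actually builds or stores Cubes.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

DIMENSIONS = ("region", "product")
FACTS = [
    {"region": "north", "product": "a", "sales": 10},
    {"region": "north", "product": "b", "sales": 5},
    {"region": "south", "product": "a", "sales": 7},
]

CubeKey = Tuple[Tuple[str, ...], Tuple[str, ...]]


def build_cube(facts: List[dict], dimensions: Tuple[str, ...],
               measure: str) -> Dict[CubeKey, int]:
    """Pre-aggregate SUM(measure) for every combination of dimensions."""
    cube: Dict[CubeKey, int] = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in facts:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


def query(cube: Dict[CubeKey, int], **filters: str) -> int:
    """Answer an aggregate query from the cube alone, without scanning facts."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube.get((dims, tuple(filters[d] for d in dims)), 0)


cube = build_cube(FACTS, DIMENSIONS, "sales")
print(query(cube))                               # grand total
print(query(cube, region="north"))               # one dimension fixed
print(query(cube, region="north", product="a"))  # both dimensions fixed
```

The cube here holds 8 entries for 3 fact rows, and the number of dimension combinations doubles with each added dimension. That is the space cost paid for constant-time lookups at query time.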
That concludes this overview of the layers of a big data platform. I hope it is of some help. If you still have open questions, you can follow the industry information channel for more related material.