This article introduces how to build a data analysis platform on Hadoop. Many people run into difficulties with this in real projects, so this walkthrough covers how to handle those situations. I hope you read it carefully and get something out of it!
When an enterprise grows to a certain scale, it usually builds a separate BI platform for data analysis, that is, OLAP (Online Analytical Processing). Such platforms are generally built on database technology and are essentially single-machine products. Besides analyzing business data, Internet companies also analyze user behavior to mine its potential value. The data volume then grows dramatically, often to tens of millions or even hundreds of millions of records per day, which poses a great challenge to the storage, analysis and computation capabilities of a traditional database-based analysis platform.
To cope with data growth and keep data processing performance scalable, many enterprises have turned to the Hadoop platform to build their data analysis platform. Hadoop provides distributed storage and parallel computing, so storage nodes and compute nodes can be added easily to remove the performance bottleneck caused by data growth.
As more and more enterprises adopt Hadoop, many technologies have been added to the Hadoop ecosystem, such as Hive, Spark SQL and Kafka. This rich set of components makes it feasible to build a data analysis platform on Hadoop instead of on a traditional database.
1. Principles of the data analysis platform architecture
Conceptually, the data analysis platform can be divided into an access layer (Landing), an integration layer (Integration), a presentation layer (Presentation), a semantic layer (Semantic), end-user applications (End-user applications) and metadata (Metadata). The basic concepts and logical architecture of an analysis platform are the same whether it is built on Hadoop or on a database; only the technology choices differ:
Access layer (Landing): temporarily stores raw data in the same structure as the source system, sometimes referred to as the "source layer" or ODS
Integration layer (Integration): persistent storage of integrated enterprise data, modeled around enterprise information entities and business events; it represents the organization's "source of truth" and is sometimes called the "data warehouse"
Presentation layer (Presentation): provides consumable data that meets end users' needs, modeled for business intelligence and query performance; sometimes called "data marts"
Semantic layer (Semantic): provides data presentation and access control, such as a reporting tool
End-user applications (End-user applications): use semantic layer tools to finally present presentation layer data to users, including dashboards, reports, charts and other forms
Metadata (Metadata): records the definitions (Definitions), lineage (Lineage) and processing (Processing) of data items in each layer.
The "raw" data from different data sources (access layer), and the data models of the integration layer and presentation layer obtained after intermediate processing will be stored in the data lake for backup.
The data lake is usually implemented on the Hadoop ecosystem; the data may be stored directly on HDFS, in HBase or in Hive, and a relational database can also serve as the data lake storage.
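As a concrete illustration of the layered layout described above, the following is a minimal sketch that creates one Hive database per logical layer using Spark SQL. The database names (landing, integration, presentation) are made up for illustration; use whatever naming convention your platform follows.

```python
from pyspark.sql import SparkSession

# Create one Hive database per logical layer of the platform.
spark = (SparkSession.builder
         .appName("create-layer-databases")
         .enableHiveSupport()          # keep table metadata in the Hive metastore
         .getOrCreate())

# Hypothetical layer names: access layer, data warehouse, data mart.
for layer in ["landing", "integration", "presentation"]:
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {layer}")

spark.stop()
```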
The data processing flow of the data analysis platform is as follows:
The data analysis platform is basically a separate system that synchronizes data from other data sources (that is, external data) into its own storage (the data lake). The data normally enters the access layer first; this layer simply copies the external data onto the platform without any other processing, so a failed synchronization can simply be retried. Synchronization takes two forms, timed and streaming:
Timed synchronization: a synchronization job is triggered at a configured time.
Streaming synchronization: the external system publishes data-change notifications and their contents through Kafka or another MQ, and the data analysis platform applies the corresponding changes to its own copy of the data (see the sketch after this list).
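Below is a minimal sketch of streaming synchronization into the access layer with Spark Structured Streaming, assuming the source system publishes change events as JSON to a Kafka topic. The topic name, broker address and paths are illustrative only, and the spark-sql-kafka connector package must be available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("landing-stream-sync")
         .enableHiveSupport()
         .getOrCreate())

# Read raw change events from a hypothetical Kafka topic.
changes = (spark.readStream
           .format("kafka")   # requires the spark-sql-kafka connector package
           .option("kafka.bootstrap.servers", "kafka1:9092")
           .option("subscribe", "orders-changes")
           .load()
           .select(col("value").cast("string").alias("raw_event")))

# Append the raw events into the access layer as JSON text, without any
# transformation; downstream ETL decides how to interpret and apply them.
query = (changes.writeStream
         .format("json")
         .option("path", "/data/landing/orders_changes")
         .option("checkpointLocation", "/data/landing/_chk/orders_changes")
         .start())

query.awaitTermination()
```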
Data in the access layer goes through ETL before it enters the data warehouse; data analysts then do their analysis and computation on the warehouse data, so the warehouse can be regarded as the source for data analysis. ETL cleans, transforms and loads the access-layer data into the data warehouse, filtering out or repairing invalid and incomplete records and expressing the data with unified dimensions. Some systems already build data cubes at this layer and organize the dimensional information into a snowflake or star schema, while others only unify the data here and leave cube building to the data mart.
The data mart holds the further results obtained by computing and extracting the business-relevant information from the warehouse data; it is what business staff actually look at, the product of deeper calculation and analysis on the warehouse. Data cubes are usually built at this layer, and system developers typically build pages that present the data-mart data to users (a minimal aggregation sketch follows).
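The sketch below derives a data-mart table from a warehouse table with Spark SQL. The table and column names (integration.orders, presentation.daily_sales) are hypothetical and only illustrate the idea of aggregating warehouse data into business-facing results.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("build-daily-sales-mart")
         .enableHiveSupport()
         .getOrCreate())

# Rebuild the mart table from the warehouse data (idempotent: drop, then recreate).
spark.sql("DROP TABLE IF EXISTS presentation.daily_sales")
spark.sql("""
    CREATE TABLE presentation.daily_sales
    STORED AS PARQUET AS
    SELECT order_date,
           COUNT(*)                AS order_count,
           COUNT(DISTINCT user_id) AS buyer_count,
           SUM(amount)             AS total_amount
    FROM integration.orders
    WHERE status <> 'CANCELLED'
    GROUP BY order_date
""")

spark.stop()
```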
2. Build a data analysis platform based on Hadoop
The design theory and data processing flow of a Hadoop-based data analysis platform are the same as described above. The difference is that a traditional analysis platform is built from a database suite, whereas here we use components of the Hadoop platform.
The Hadoop components we chose are described in the following sections. Data flows from bottom to top through the platform, and the processing flow matches the description above.
Task scheduling chains the data processing steps together. I chose Oozie here, but there are many other options.
1. Data storage
The Hadoop-based data lake mainly uses HDFS, Hive and HBase. HDFS is the file storage system of the Hadoop platform, and manipulating raw files directly is cumbersome, so we can use the distributed databases Hive or HBase as the data lake to store the data of the access layer, the data warehouse and the data mart.
Hive and HBase each have their strengths: HBase is a NoSQL database with good random-query performance and scalability, while Hive is built on HDFS, storing table data as HDFS files (directories) and keeping metadata such as each table's storage location (its path on HDFS) and storage format. Hive supports SQL queries, which are translated into Map/Reduce jobs for execution, and this is friendlier to developers coming from traditional data analysis platforms.
Hive data can be stored in text or binary formats. Text formats include csv, json or custom-delimited files; binary formats include orc and parquet, both of which are columnar formats and perform better for queries. Partitioning can also be used, so that condition filters during queries further reduce the amount of data scanned. The access layer generally uses a text format such as csv or json without partitioning, to keep data synchronization as simple as possible, while the data warehouse uses orc or parquet to improve the performance of offline computation (see the table-definition sketch below).
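The following is a minimal sketch of these format choices expressed as Spark SQL DDL: an unpartitioned text table for the access layer and a partitioned ORC table for the warehouse. Table names and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("layer-table-formats")
         .enableHiveSupport()
         .getOrCreate())

# Access layer: plain text (one JSON event per line), no partitions,
# so synchronization stays as simple as possible.
spark.sql("""
    CREATE TABLE IF NOT EXISTS landing.orders_raw (raw_event STRING)
    STORED AS TEXTFILE
""")

# Data warehouse: ORC with a date partition, so offline queries can prune data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS integration.orders (
        order_id   BIGINT,
        user_id    BIGINT,
        amount     DOUBLE,
        status     STRING
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

spark.stop()
```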
Data in the data mart can either be exported back to a traditional relational database (RDBMS), or it can stay on the data analysis platform, with a NoSQL store serving queries or Apache Kylin building data cubes and exposing a SQL query interface (a sketch of exporting to an RDBMS follows).
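Here is a minimal sketch of pushing a data-mart table back into an RDBMS over JDBC with Spark. The JDBC URL, credentials and table names are placeholders, and the appropriate JDBC driver jar must be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("export-mart-to-rdbms")
         .enableHiveSupport()
         .getOrCreate())

# Read the hypothetical mart table and overwrite the reporting table in MySQL.
daily_sales = spark.table("presentation.daily_sales")

(daily_sales.write
    .mode("overwrite")
    .jdbc(url="jdbc:mysql://reporting-db:3306/bi",
          table="daily_sales",
          properties={"user": "bi_user",
                      "password": "***",
                      "driver": "com.mysql.jdbc.Driver"}))

spark.stop()
```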
2. Data synchronization
Data synchronization brings data into the access layer; we use Sqoop and Kafka for it. Synchronization can be full or incremental: full synchronization works for small tables but is too time-consuming for large ones, while incremental synchronization only copies changes to the data platform so that both sides end up consistent.
Full synchronization is done with Sqoop, and incremental synchronization can also be done with Sqoop if a timed schedule is acceptable. Alternatively, data changes can be streamed in through an MQ such as Kafka, provided the external data source publishes its changes to the MQ (a Sqoop sketch follows).
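Below is a minimal sketch of a full and a timed incremental Sqoop import, wrapped in Python for scheduling. The connection string, credentials, table names and paths are placeholders; only standard Sqoop options are used.

```python
import subprocess

# Full import of a source table into a Hive table in the access layer.
full_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://source-db:3306/shop",
    "--username", "sync_user", "--password", "***",
    "--table", "orders",
    "--hive-import", "--hive-table", "landing.orders",
    "--num-mappers", "4",
]
subprocess.run(full_import, check=True)

# Timed incremental import (Sqoop append mode): only rows whose order_id is
# greater than the last recorded value are copied to the landing directory.
incremental_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://source-db:3306/shop",
    "--username", "sync_user", "--password", "***",
    "--table", "orders",
    "--target-dir", "/data/landing/orders",
    "--incremental", "append",
    "--check-column", "order_id",
    "--last-value", "123456",
]
subprocess.run(incremental_import, check=True)
```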
3. ETL and offline calculation
We use Yarn to manage and schedule computing resources in a unified way. Compared with Map/Reduce, Spark SQL and Spark RDD are more developer-friendly, and in-memory computation is more efficient, so we chose Spark on Yarn as the compute engine of the analysis platform.
ETL can be implemented with Spark SQL or Hive SQL. Hive supports stored procedures from version 2.0 onwards, which is more convenient to use; for performance, Spark SQL is of course a good choice (a small ETL job sketch follows).
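The sketch below shows an offline ETL job in PySpark that cleans access-layer JSON events and loads them into the partitioned ORC warehouse table from the earlier sketch. All table names and JSON fields are hypothetical; a script like this would typically be submitted with `spark-submit --master yarn etl_orders.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = (SparkSession.builder
         .appName("etl-orders")
         .enableHiveSupport()
         .getOrCreate())

# Raw JSON events synchronized into the access layer.
raw = spark.table("landing.orders_raw")

# Parse, type and clean the events: drop illegal or incomplete records
# before they reach the warehouse.
orders = (raw
    .select(
        get_json_object(col("raw_event"), "$.order_id").cast("bigint").alias("order_id"),
        get_json_object(col("raw_event"), "$.user_id").cast("bigint").alias("user_id"),
        get_json_object(col("raw_event"), "$.amount").cast("double").alias("amount"),
        get_json_object(col("raw_event"), "$.status").alias("status"),
        get_json_object(col("raw_event"), "$.order_date").alias("order_date"))
    .filter(col("order_id").isNotNull() & (col("amount") >= 0)))

# Allow dynamic partition inserts into the partitioned warehouse table.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Column order matches the table definition, with the partition column last.
orders.write.insertInto("integration.orders", overwrite=False)

spark.stop()
```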
This is the end of "How to build a data analysis platform on Hadoop". Thank you for reading; if you want to learn more, you can follow this site for more practical articles.