Discussion on the Architecture of Data Warehouse / Data Platform in the Internet Industry under a Big Data Environment


Introduction:

Overall architecture

Data acquisition

Data storage and analysis

Data sharing

Data application

Real-time computing

Task scheduling and monitoring

Metadata management

Summary

I've been meaning to sort out this topic for a while. Since it's a ramble, I'll simply write down whatever comes to mind. I have always worked in the Internet industry, so everything below is from the perspective of the Internet industry.

First, a rough list of the uses of data warehouses and data platforms in the Internet industry:

Integrate all the company's business data and establish a unified data center

Provide all kinds of reports, some for senior management and some for various businesses.

Provide operational data support for website operators, that is, let operators learn about the operational performance of the website and its products in a timely manner through data.

Provide online or offline data support for various businesses, and become a unified data exchange and provision platform for the company.

Analyze user behavior data and, through data mining, reduce marketing spend while improving its effectiveness, for example with targeted, precise advertising and personalized user recommendations.

Develop data products to make money for the company directly or indirectly

Build an open data platform and open up company data

.

The uses listed above look similar to those of a traditional-industry data warehouse, and they demand good stability and reliability from the data warehouse / data platform. But in the Internet industry, beyond the sheer volume of data, more and more businesses demand timeliness, many even in real time. Moreover, business in the Internet industry changes so fast that it is impossible to build the warehouse top-down, once and for all, the way traditional industries do: new businesses must be integrated into the data warehouse quickly, and old, retired businesses must be easy to remove from it.

In fact, the Internet-industry data warehouse is the so-called agile data warehouse: it must respond rapidly not only to the data but also to the business.

To build an agile data warehouse, besides the architectural requirements on the technology, there is another very important aspect: data modeling. If you try to build one data model that accommodates all data and all businesses, you are back to traditional data warehouse construction, and it becomes hard to respond quickly to business changes. In practice, the core, long-lived businesses are modeled in depth (for example, a website statistical analysis model and a user browsing-trajectory model based on website logs, or a user model based on the company's core user data), while other businesses are generally modeled with dimensions plus wide tables. But that is a topic for later.
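To make the dimension + wide table idea concrete, here is a minimal HiveQL sketch, submitted through PySpark; all database, table, and column names are hypothetical. A fact table is joined with a couple of dimension tables into one wide table, so downstream queries need no further joins:

```python
from pyspark.sql import SparkSession

# Hypothetical databases, tables, and columns, for illustration only.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.order_wide
    STORED AS ORC AS
    SELECT o.order_id,
           o.amount,
           o.dt,
           u.age,            -- user dimension
           u.city,
           p.category        -- product dimension
    FROM   ods.orders o
    JOIN   dim.users    u ON o.user_id    = u.user_id
    JOIN   dim.products p ON o.product_id = p.product_id
""")
```

When the business changes, a new wide table like this can be added or dropped without touching the deeply modeled core.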

Overall architecture

Below is the overall architecture of the data platform we are currently using; most companies' platforms should be similar.

Logically, it generally comprises a data acquisition layer, a data storage and analysis layer, a data sharing layer, and a data application layer. The names may differ from company to company, but the roles are essentially the same.

Let's look at it from the bottom up.

Data acquisition

The task of the data acquisition layer is to collect data from the various data sources and land it in the data store; some simple cleaning may be done along the way.

There are many kinds of data sources:

Website logs:

In the Internet industry, website logs account for the largest share of data, and they are generated on multiple web log servers.

Generally, a Flume agent is deployed on each web log server to collect the logs in real time and store them on HDFS.

Business databases:

There is also a variety of business databases, such as MySQL, Oracle, SQL Server, and so on. What is needed here is a tool that can synchronize data from these databases to HDFS. Sqoop can do it, but Sqoop is too heavy: regardless of the data volume it launches a MapReduce job, and it requires every machine in the Hadoop cluster to be able to reach the business database. For this scenario, Taobao's open-source DataX is a good solution (see "A massive data Exchange tool for heterogeneous data sources-Taobao DataX download and use"). If you have the resources, secondary development based on DataX works very well; that is what our current DataHub is.
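To make the idea concrete, here is a toy Python sketch of what such a sync tool does at its core: read rows out of a business database and land them on HDFS. It assumes the pymysql and hdfs (WebHDFS client) packages, and the hosts, credentials, and table names are hypothetical; a real tool like DataX adds parallelism, throttling, and error handling on top of this:

```python
import csv
import io

import pymysql                      # MySQL client
from hdfs import InsecureClient     # WebHDFS client

# Hypothetical hosts, credentials, and table names.
db = pymysql.connect(host="mysql-host", user="reader",
                     password="secret", database="shop")
client = InsecureClient("http://namenode:50070", user="etl")

# Extract the source table into an in-memory CSV buffer.
buf = io.StringIO()
writer = csv.writer(buf)
with db.cursor() as cur:
    cur.execute("SELECT order_id, user_id, amount FROM orders")
    for row in cur.fetchall():
        writer.writerow(row)

# Land the extract in the warehouse's staging directory on HDFS.
client.write("/ods/orders/orders.csv", data=buf.getvalue(),
             encoding="utf-8", overwrite=True)
```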

Of course, Flume can also be configured and extended to synchronize data from a database to HDFS in real time.

Data sources from FTP/HTTP:

Some data provided by partners may need to be fetched regularly over FTP or HTTP; DataX can meet this need as well.

Other data sources:

For example, manually entered data can be handled by providing an entry interface or a small application.
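As an illustration only, here is a minimal sketch of such an entry interface using Flask; the endpoint, fields, and file path are hypothetical. Submitted records are appended to a local file that a regular acquisition task can later ship to HDFS:

```python
import json

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/manual-data", methods=["POST"])
def manual_data():
    record = request.get_json()              # e.g. {"metric": ..., "value": ...}
    with open("/data/manual/entries.log", "a") as f:
        f.write(json.dumps(record) + "\n")   # picked up by a collection task
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=8080)
```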

Data storage and analysis

In a big data environment, HDFS is without doubt the best data storage solution for a data warehouse / data platform.

For offline data analysis and computation, that is, the part with no strict real-time requirement, Hive is in my opinion still the first choice: rich data types and built-in functions; the ORC file format with its very high compression ratio; and very convenient SQL support. Statistical analysis of structured data in Hive is far more efficient to develop than MapReduce: what one SQL statement accomplishes may take hundreds of lines of MR code.
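As a hedged illustration of that gap, the kind of daily PV/UV statistic that would take a whole MapReduce program is a single statement in SQL (submitted here through PySpark with Hive support; the table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One statement: daily page views and unique visitors per page,
# from a (hypothetical) web log table in the warehouse.
df = spark.sql("""
    SELECT dt, url,
           COUNT(*)            AS pv,
           COUNT(DISTINCT uid) AS uv
    FROM   dw.web_logs
    GROUP  BY dt, url
""")
df.show()
```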

Of course, the Hadoop framework naturally provides the MapReduce interface as well; if you really enjoy developing in Java, or are not familiar with SQL, you can also use MapReduce for analysis and computation.

Spark has become very popular over the past two years; in practice its performance really is much better than MapReduce's, and its integration with Hive and Yarn keeps improving. So Spark and SparkSQL must be supported for analysis and computation. And because Hadoop Yarn is already in place, using Spark is actually very easy: there is no need to deploy a separate Spark cluster. For related articles on Spark on Yarn, see the "Spark On Yarn series of articles".
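A minimal sketch of what "no separate Spark cluster" means in practice: the Spark session is simply pointed at the existing Yarn resource manager, which allocates the executors. The application name and resource settings below are hypothetical:

```python
from pyspark.sql import SparkSession

# With Hadoop Yarn already running, a Spark job is just another Yarn
# application: set the master to "yarn" and let Yarn allocate the
# executors. No standalone Spark cluster is deployed.
spark = (SparkSession.builder
         .master("yarn")
         .appName("warehouse-batch-job")            # hypothetical name
         .config("spark.executor.instances", "4")   # hypothetical sizing
         .config("spark.executor.memory", "4g")
         .getOrCreate())
```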

The real-time computing part will be discussed separately later.

Data sharing

The data sharing layer here actually refers to where the results of the preceding analysis and computation are stored, namely relational databases and NoSQL databases.

The results computed with Hive, MR, Spark, and SparkSQL still sit on HDFS, but most businesses and applications cannot read data directly from HDFS, so a place is needed where data can be shared and from which the various businesses and products can obtain it easily.

Mirroring the acquisition layer's path into HDFS, a tool is needed here to synchronize data from HDFS to the target data stores; DataX can do this as well.

In addition, some real-time computation results may be written directly to the data sharing layer by the real-time computing module.

Data application

Business products

The data used by business products already exists in the data sharing layer, so they can read it from there directly.

Reports

Like business products, the data used by reports is generally already aggregated and stored in the data sharing layer.

Ad hoc queries

Many users run ad hoc queries: data developers, website and product operators, data analysts, and even department managers all need to query data on the fly.

Such ad hoc queries usually arise when the existing reports and the data sharing layer cannot meet a user's needs, so the data must be queried directly from the storage layer.

Ad hoc queries are generally issued in SQL, and the biggest difficulty is response speed: Hive is a bit slow for this. My solution is SparkSQL, whose response time is much faster than Hive's while remaining well compatible with Hive.
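One common way to expose SparkSQL for ad hoc use is the Spark Thrift server, which speaks the HiveServer2 protocol, so analysts can query it from Python with a HiveServer2 client such as PyHive. A sketch, with hypothetical host, date, and table names:

```python
from pyhive import hive

# Connect to a Spark Thrift server (HiveServer2-compatible endpoint).
conn = hive.connect(host="thrift-server-host", port=10000,
                    username="analyst")
cur = conn.cursor()

# An ad hoc question that no canned report answers.
cur.execute("""
    SELECT city, COUNT(DISTINCT uid) AS uv
    FROM   dw.web_logs
    WHERE  dt = '2015-06-01'
    GROUP  BY city
    ORDER  BY uv DESC
    LIMIT  10
""")
for city, uv in cur.fetchall():
    print(city, uv)
```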

Of course, you can also use Impala, if you don't mind having yet another framework in the platform.

OLAP

At present, many OLAP tools cannot read data directly from HDFS; they do OLAP by first synchronizing the needed data into a relational database. But when the data volume is huge, a relational database clearly won't do.

In that case, corresponding development is needed to fetch the data from HDFS or HBase and provide the OLAP functionality.

For example: based on the dimensions and metrics the user selects in the interface, fetch the data from HBase through a purpose-built interface and display it.
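A minimal sketch of such an interface, assuming the happybase HBase client and a hypothetical pre-aggregated table whose row keys encode the dimension values; the real interface would translate the user's dimension and metric choices into scans like this:

```python
import happybase

# Hypothetical HBase table: row key = "<dt>|<city>", one column
# family "m" holding pre-aggregated metrics such as pv and uv.
conn = happybase.Connection("hbase-thrift-host")
table = conn.table("olap_traffic")

def query(dt, metric):
    """Return the chosen metric for every city on a given day."""
    results = {}
    for key, cols in table.scan(row_prefix=dt.encode() + b"|"):
        city = key.split(b"|", 1)[1].decode()
        results[city] = int(cols[b"m:" + metric.encode()])
    return results

print(query("2015-06-01", "pv"))
```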

Other data interfaces

Such interfaces come in general-purpose and customized forms. For example, an interface that fetches user attributes from Redis is general-purpose: every business can call it to obtain user attributes.
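A sketch of such a general-purpose interface, assuming the redis-py client and a hypothetical key scheme in which each user's attributes live in a Redis hash named user:<uid>:

```python
import redis

# Hypothetical key scheme: one Redis hash per user, e.g.
#   user:10086 -> {age: "28", city: "Beijing", gender: "F"}
r = redis.Redis(host="redis-host", port=6379, decode_responses=True)

def get_user_attributes(uid):
    """General-purpose interface: any business can call this."""
    return r.hgetall("user:%s" % uid)

# A recommendation service and an ad service would both simply call:
print(get_user_attributes(10086))
```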

Real-time computing

Real-time requirements on the data warehouse keep growing: for example, knowing the website's overall traffic in real time, or an advertisement's exposure and clicks in real time. With massive data, traditional databases and traditional implementation methods cannot deliver this; what is needed is a distributed, high-throughput, low-latency, highly reliable real-time computing framework. Storm is the more mature option here, but I chose Spark Streaming, for a simple reason: I did not want to introduce yet another framework to the platform. Spark Streaming's latency is somewhat higher than Storm's, but for our needs that is negligible.

At present we use Spark Streaming for two functions: real-time website traffic statistics and real-time advertising effectiveness statistics.

The approach is very simple: Flume collects the website and advertising logs on the front-end log servers and sends them to Spark Streaming in real time; Spark Streaming computes the statistics and stores the results in Redis, and the business reads them in real time from Redis.
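A minimal sketch of that pipeline in PySpark Streaming, with hypothetical hosts and a hypothetical log format (one page view per line, URL in the first field). It counts page views per micro-batch and accumulates the totals in Redis, where the business reads them; the socket source stands in for the Flume-fed stream:

```python
import redis
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="realtime-traffic")   # hypothetical app name
ssc = StreamingContext(sc, batchDuration=5)     # 5-second micro-batches

# Stand-in for the Flume-fed stream: one log line per page view.
lines = ssc.socketTextStream("log-collector-host", 9999)

# Per-batch page-view counts, keyed by URL.
counts = lines.map(lambda line: (line.split()[0], 1)) \
              .reduceByKey(lambda a, b: a + b)

def push_to_redis(rdd):
    # Open one Redis connection per partition, not per record.
    def save(partition):
        r = redis.Redis(host="redis-host", port=6379)
        for url, pv in partition:
            r.hincrby("traffic:pv", url, pv)   # accumulate running totals
    rdd.foreachPartition(save)

counts.foreachRDD(push_to_redis)

ssc.start()
ssc.awaitTermination()
```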

Task scheduling and monitoring

In a data warehouse / data platform there is a wide variety of programs and tasks, such as data acquisition tasks, data synchronization tasks, data analysis tasks, and so on.

This calls for a robust task scheduling and monitoring system, which, as the hub of the data warehouse / data platform, is responsible for scheduling and monitoring the allocation and execution of all tasks.

I have written a separate article, "Task scheduling and Monitoring system of big data platform", so I won't belabor it here.

Metadata management

Doing this piece well is very complicated, and in my view the value is less than the cost, so we are not considering it for the time being. Currently we only keep metadata about the daily task runs.

Summary

In my opinion, an architecture is not better for using more or newer technology; as long as the requirements are met, the simpler and more stable the better. On our data platform today, developers focus more on the business than on the technology: they understand the business and the requirements, and for the most part they only need to do simple SQL development, configure the job in the scheduling system, and they will be alerted if a task fails. This way, more resources can be focused on the business itself.

Related readings:

Task scheduling and Monitoring system of big data platform

Spark On Yarn series of articles

A massive data Exchange tool for heterogeneous data sources-Taobao DataX download and use

All right, that's all for now; we'll talk more later. Writing all this up is tiring, so I hope you'll support my blog.

Source: lxw's big data blog, "Discussion on the Architecture of Data Warehouse / Data Platform in the Internet Industry under a Big Data Environment".
