Content source: Yixin Institute of Technology Phase 2 Technology Salon, online live broadcast | Construction Practice of the Yixin Agile Data Middle Platform
Speaker: Lu Shanwei, head of the Yixin data platform team
Introduction: In 2017, Yixin open-sourced a series of big data tools, including DBus, Wormhole, Moonbox, Davinci and others, which have received wide attention and praise in the technology community. How are these tools applied internally? What is their relationship to the data middle platform? How do they drive the many day-to-day data business scenarios?
This sharing answers these questions, focuses on the design, architecture and application scenarios of the Yixin agile data middle platform, and proposes a construction approach for an agile data middle platform for reference and discussion. The following is a transcript of the sharing.
Sharing outline:
I. Introduction
II. Top-level design of the Yixin agile data middle platform
III. From middleware tools to platform
IV. Typical case analysis
V. Summary
VI. Q&A
Video playback address: https://v.qq.com/x/page/r0874wlaomx.html
PPT download address: https://pan.baidu.com/s/1jRumFMj_vQG1rxUvepEMlg
I. Introduction
At present, the concept of the "middle platform" is very popular, including the data middle platform, AI middle platform, business middle platform, technology middle platform and so on. In the first phase of the Yixin Institute of Technology Salon, Dr. Jing Yuxin shared on the AI middle platform. In this phase of the Technology Salon, let me share with you "the construction practice of the agile data middle platform".
Why do we add "Agile" in front of the data middle platform? Friends who know us know that my team is Yixin's Agile Big Data team. We advocate making big data agile and accessible to everyone, integrate agile ideas into system construction, and have developed four open source platforms: DBus, Wormhole, Moonbox and Davinci. The Yixin data middle platform is developed and built by our Agile Big Data team on top of these four open source platforms, so we call it the "agile data middle platform".
This sharing is divided into three parts:
The top-level design of the Yixin agile data middle platform. The data middle platform is a company-level platform system, so it cannot be designed only at the technical level; it also requires top-level design covering processes, standardization and so on.
From middleware tools to platform, which introduces how Yixin designed and built the agile data middle platform.
Combined with typical cases, it introduces which data applications and practices the Yixin agile data middle platform supports.
II. Top-level design of the Yixin agile data middle platform
2.1 Features and requirements
With regard to the construction of a data middle platform, there is no standard solution at present, and no single data middle platform fits all companies. Each company should build a data middle platform suited to itself according to its own business scale and data needs.
Before introducing the top-level design of the Yixin agile data middle platform, let's look at its background:
Many business sectors and lines. Yixin's business can be divided into four major sectors: inclusive finance, wealth management, asset management and financial technology, with nearly 100 business lines and product lines.
Many technology stacks. Different business parties have different data requirements, and driven by these objective needs and subjective preferences, technology selection ends up with many different data components, including MySQL, Oracle, HBase, KUDU, Cassandra, Elasticsearch, MongoDB, Hive, Spark, Presto, Impala, ClickHouse and so on.
Diverse data requirements. So many business lines produce many kinds of data requirements, including reports, visualization, services, push, migration, synchronization, data applications and so on.
Changeable data requirements. To adapt to the rapid changes of the Internet business, the data needs of the business side also change constantly; data needs and data applications that must be delivered within a week are common.
Data governance considerations. Data meta-information must be searchable, data definitions and processes must be standardized, and data governance must be controllable.
Data security considerations. As a company with both Internet and financial attributes, Yixin has high requirements for data security and permissions, and we have done a lot of work on data security, including multi-level data security policies, data link traceability, and keeping sensitive data from leaking.
Data permission considerations. Work on data permissions includes table-level, column-level and row-level permissions, organizational structures, roles, and automated permission policies.
Data cost considerations. These include cluster cost, operation and maintenance cost, manpower cost, time cost, risk cost and so on.
2.2 Positioning
With regard to the positioning of the data middle platform, each company is different. Some companies are more business-focused, with only one line of business, so when building a data middle platform they may need a vertical, deep platform that reaches the front line directly to better support front-line operations.
As mentioned earlier, Yixin has many business lines, and there is no single dominant business among them, which means every business line matters. Against this background, we need a platform-style data middle platform to support the needs and operations of all business lines.
Figure 1 Positioning
As shown in the figure above, the green part is ADX, which we call the "Agile Data Platform"; the "A" stands for "Agile". We call it a "platform" because we want to build it into a platform system that serves all business lines and facilitates business development.
The agile data middle platform sits in the middle, with various data clusters at the bottom and the data teams of the business domains on top. By integrating and processing the data in the clusters, the middle platform provides self-service, real-time, unified, service-oriented, managed and traceable data services to the business-domain data teams.
The three blue sections on the right are the Data Management Committee, the Data Operations team and the Data Security team. As mentioned earlier, Yixin has very high requirements for data security, so a dedicated data security team was set up to plan the company's data security processes and policies; the Data Management Committee is responsible for data standardization and processes, compensating for the limits of a purely technology-driven middle platform and ensuring that data assets are effectively accumulated and presented.
Our positioning of the Yixin agile data middle platform is: moving from the reuse of data technology and computing power to the reuse of data assets and data services, the agile data middle platform directly empowers the business with greater value bandwidth, in a fast, accurate and economical way.
2.3 Value
The value of the Yixin agile data middle platform is concentrated in three words: fast, accurate and economical.
Figure 2 value
The "fast" customization requirements of agile data lead to repeated development platform, transparent encapsulation and reuse technology components in the package implementation team need to schedule self-service, simple configuration, month = > day Tread1 delay can not meet the real-time and fine operation of real-time, drive business growth, the problems of days = > points, agile data "accurate" data storage is different, data acquisition methods are different, cleaning logic is different. Unified data lake collection and export data isolated island did not get through the integrated management, metadata, data map, consanguinity demand-driven implementation, unable to precipitate the capitalization of data assets, model management makes data reliable, standardized model processing promotes the precipitation of data assets, agile data in the "save" time cost, demand scheduling and repeated development self-service, saving time is to save the cost of manpower. Repeated development and lack of reuse platform, high reuse hardware cost of mature technology components, waste and refinement caused by misuse of cluster resources, estimable and quantifiable 2.4 module architecture dimensions of cluster resources
Figure 3 Module Architecture Dimension
As shown in the figure, the construction of the Yixin agile data middle platform is also based on the consensus of "small front office, big middle platform". The whole middle part belongs to the agile data middle platform; the green part on the left is the data dimension, and the blue part on the right is the platform dimension.
Data dimension. All kinds of internal and external data are first collected into the data source layer, then stored in a unified, real-time, standardized and secure way to form a data lake, where the raw data is processed and systematized into data assets. The data asset layer includes the warehouse system, metric system, tag system, feature system, master data and so on. Finally, these reusable data assets are provided to the data application layer for BI, AI and data product applications.
Platform dimension. Each blue box represents a technical module, and the entire Yixin agile data middle platform is composed of these technical modules. Among them, the DataHub data hub helps users complete self-service data requests, publication, desensitization, cleaning and services; the DataWorks data workshop supports self-service query, jobs, visualization and other processing of data; and there are also the DataStar data model, DataTag data tag, DataMgt data management, ADXMgt middle platform management and other modules.
It is worth mentioning that these modules were not developed from scratch but are based on our existing open source tools. First, building on mature middleware tools saves development time and cost; second, the open source tools become engines that work together to support a larger one-stop platform.
2.5 Data capability dimension
Figure 4 data capability dimension
The architecture modules above can be re-divided along the capability dimension into several layers, each containing a number of capabilities. As shown in the figure, you can clearly see which data capabilities are needed to build a data middle platform, which functional modules those capabilities correspond to, and which problems each of them solves. I will not go into the details here.
III. From middleware tools to platform
3.1 ABD overview
Figure 5 Overview of ABD
Middleware tools refer to the four open source platforms DBus, Wormhole, Moonbox and Davinci, which are abstracted under the concept of Agile Big Data (ABD) to form the ABD platform stack, while the agile data platform is called ADX (Agile Data X Platform). In other words, we have gone through a process from ABD to ADX.
At the beginning, based on abstracting and summarizing what business requirements had in common, we incubated a number of general-purpose middleware tools to solve various problems. As more complex requirements arose, we tried combining these tools, and in practice we found that certain combinations were used frequently. At the same time, from the user's point of view, users prefer self-service and want to use things directly, rather than having to select and combine tools every time. Based on these two points, we encapsulated these open source tools.
3.1.1 ABD-DBus
DBus (data bus platform) is a DBaaS (Data Bus as a Service) platform solution.
DBus is dedicated to providing real-time data acquisition and distribution solutions for big data project development, management, operations and maintenance personnel. The platform adopts a highly available streaming computing framework and provides real-time transmission of massive data and reliable multi-channel message subscription and distribution. Through simple and flexible configuration, it taps into source data in a non-intrusive way, aggregates the data generated by each IT system during business processes, and uniformly processes and converts it into the JSON-based UMS format, which is provided to different downstream customers for subscription and consumption. DBus can serve as the data source for data warehouse platforms, big data analysis platforms, real-time reports, real-time marketing and other businesses.
Open source address: https://github.com/BriData
Figure 6 DBus function and location
As shown in the figure, DBus can non-intrusively connect data sources of various databases, extract incremental data in real time, do unified cleaning and processing, and store it in Kafka in UMS format.
The functions of DBus also include batch extraction, monitoring, distribution, multi-tenancy, and configurable cleaning rules, as shown in the figure.
The lower right corner of the image above shows a screenshot of DBus. Users can pull incremental data, configure logs and cleaning methods, and complete real-time data extraction on DBus.
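The UMS format itself is only mentioned, not specified, in this talk. As a rough illustration of the idea described above (a self-describing JSON envelope carrying a schema plus change tuples), here is a minimal Python sketch. The field names such as ums_id_, ums_ts_ and ums_op_ are an approximation of what the public DBus documentation describes and should not be read as the exact specification.

```python
import json

# A UMS-style message: a self-describing envelope with a schema section and
# a payload of change tuples. Field names and namespace here are illustrative only.
ums_message = {
    "protocol": {"type": "data_increment_data", "version": "1.3"},
    "schema": {
        "namespace": "mysql.orders_db.orders",  # hypothetical source namespace
        "fields": [
            {"name": "ums_id_", "type": "long", "nullable": False},      # monotonic change id
            {"name": "ums_ts_", "type": "datetime", "nullable": False},  # change timestamp
            {"name": "ums_op_", "type": "string", "nullable": False},    # i / u / d
            {"name": "order_id", "type": "long", "nullable": False},
            {"name": "amount", "type": "decimal", "nullable": True},
        ],
    },
    "payload": [
        {"tuple": [10001, "2019-06-03 10:00:00", "i", 42, "99.50"]},
        {"tuple": [10002, "2019-06-03 10:00:05", "u", 42, "89.50"]},
    ],
}

def rows(msg):
    """Re-attach field names to each tuple so downstream consumers can work with dicts."""
    names = [f["name"] for f in msg["schema"]["fields"]]
    return [dict(zip(names, t["tuple"])) for t in msg["payload"]]

if __name__ == "__main__":
    print(json.dumps(rows(ums_message), indent=2, ensure_ascii=False))
```

The point of the envelope is that any downstream consumer (Wormhole, a warehouse loader, a report) can interpret the change stream without out-of-band schema coordination.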
Figure 7 DBus architecture
From the architecture diagram above, you can see that DBus includes several different processing modules supporting different functions. (GitHub has a detailed introduction, so we will not expand on it here.)
3.1.2 ABD-Wormhole
Wormhole (streaming processing platform) is a SPaaS (Stream Processing as a Service) platform solution.
Wormhole is dedicated to providing data stream processing solutions for big data project development, management, operations and maintenance personnel. The platform focuses on simplifying and unifying the development and management process, provides a visual operation interface and configuration- and SQL-based business development, and shields the implementation details of the underlying technology, greatly lowering the development threshold and making the development and management of streaming processing projects more lightweight, agile, controllable and reliable.
Open source address: https://github.com/edp963/wormhole
Figure 8 Wormhole function and location
DBus stores real-time data in Kafka in UMS format. If we want to use these real-time streaming data, we need to use Wormhole as a tool.
Wormhole supports the configuration of streaming processing logic, and the processed data can be written to different data stores. Many features of Wormhole are shown in the figure above, and we are still developing more new features.
The bottom right corner of the image above is a screenshot of Wormhole at work. As a streaming platform, Wormhole does not implement its own streaming engine; it relies on two streaming computing engines, Spark Streaming and Flink. Users choose one of them, such as Spark, configure the streaming logic, decide how lookup libraries are used, and express this logic by writing SQL. If CEP is involved, Flink is of course used.
As you can see, the threshold for using Wormhole is configuration plus SQL. This is in line with the concept we have always adhered to: supporting users in working with big data by themselves, in an agile way.
Figure 9 Wormhole architecture
The figure above shows the architecture of Wormhole, which contains many functional modules. Let me introduce a few of them:
Wormhole supports idempotent writes to heterogeneous Sinks, which helps users solve data consistency problems.
Anyone who has used Spark Streaming knows that launching one Spark Streaming application usually does only one thing. Wormhole abstracts the concept of a "logical Flow" on top of the physical Spark Streaming computing pipeline: from where to where, and what to do in between, is one "logical Flow". With this decoupling and abstraction, Wormhole supports running multiple Flows with different business logic simultaneously in one physical Spark Streaming pipeline. So in theory, if there are 1,000 different source tables that need 1,000 different streaming processes to produce 1,000 different result tables, you can launch just one Spark Streaming application in Wormhole and run 1,000 logical Flows inside it. Of course, this may increase the latency of each Flow because they all share the same pipeline, but the configuration is very flexible: a VIP Flow can monopolize a Stream, and Flows with very small traffic or loose latency requirements can share one Stream. Flexibility is a major feature of Wormhole.
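To make the "many logical Flows in one physical pipeline" idea concrete, here is a minimal, engine-agnostic Python sketch. It does not use Spark Streaming; it only simulates how a single micro-batch loop can dispatch records to several independent Flow definitions (source namespace, transformation, sink). Wormhole's actual implementation is of course different; all names below are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Flow:
    """One 'logical Flow': where the data comes from, what to do, where it goes."""
    source_namespace: str
    transform: Callable[[dict], dict]
    sink: str

# Several Flows sharing ONE physical pipeline (in Wormhole: one Spark Streaming app).
flows: List[Flow] = [
    Flow("mysql.shop.orders", lambda r: {**r, "amount_cny": r["amount"] * 7.0}, "kudu.dw.orders_rt"),
    Flow("mysql.shop.users", lambda r: {k: r[k] for k in ("user_id", "city")}, "es.dw.users_rt"),
]

def run_micro_batch(batch: List[dict]) -> Dict[str, List[dict]]:
    """Dispatch every record of a micro-batch to the Flows whose namespace matches,
    apply each Flow's logic, and group the results by sink."""
    out: Dict[str, List[dict]] = {}
    for record in batch:
        for flow in flows:
            if record["namespace"] == flow.source_namespace:
                out.setdefault(flow.sink, []).append(flow.transform(record["data"]))
    return out

if __name__ == "__main__":
    micro_batch = [
        {"namespace": "mysql.shop.orders", "data": {"order_id": 1, "amount": 12.5}},
        {"namespace": "mysql.shop.users", "data": {"user_id": 7, "city": "Beijing", "phone": "x"}},
    ]
    for sink, out_rows in run_micro_batch(micro_batch).items():
        print(sink, out_rows)
```

The trade-off described in the talk falls directly out of this picture: Flows in one pipeline share its capacity and latency, so heavy or latency-sensitive Flows can be given a dedicated pipeline while small ones share.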
Wormhole has its own directive and feedback system: users can change the processing logic online without restarting or stopping the stream, and get job status and feedback results in real time.
3.1.3 ABD-Moonbox
Moonbox (Computing Service platform) is a DVtaaS (Data Virtualization as a Service) platform solution.
Moonbox is aimed at data warehouse engineers, data analysts, data scientists and so on. Based on the design idea of data virtualization, it is committed to providing batch computing service solutions. Moonbox shields the physical and usage details of the underlying data sources and gives users a virtual-database-like experience: users only need a unified SQL language to transparently perform federated computation and writes across heterogeneous data systems. In addition, Moonbox provides basic support for data services, data management, data tools and data development, enabling more agile and flexible data application architectures and logical data warehouse practices.
Open source address: https://github.com/edp963/moonbox
Figure 10 Moonbox function and location
Data comes from DBus and, after streaming processing by Wormhole, may land in different data stores. We then need to compute across these data, and Moonbox supports seamless federated computation over multi-source heterogeneous systems. The figure above shows the functional features of Moonbox.
Many so-called ad-hoc queries are not really ad hoc, because users first have to import the data into Hive and then run the computation, which is preparatory work. Moonbox does not need to gather the data into one place in advance, so it supports truly ad-hoc queries: the data can be scattered across different stores, and when a user writes a SQL statement, Moonbox automatically parses and splits the SQL, figures out which tables live where, plans the execution of the SQL, and finally produces the result.
Moonbox provides standard REST API, JDBC, ODBC and other interfaces, so it can also be regarded as a virtual database.
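As a toy illustration of the "one SQL over scattered tables" idea (not Moonbox's actual planner), the sketch below keeps a small catalog mapping logical table names to physical sources, pulls each side from its own store, and finishes the join and aggregation in a scratch engine. The table names, sources and data are made up.

```python
import sqlite3

# Hypothetical catalog: logical table name -> fetch function for its physical system.
# In Moonbox, this mapping plus SQL parsing decides which sub-plan goes to which source.
def fetch_orders_from_kudu():
    return [(1, "u01", 99.5), (2, "u02", 15.0)]   # stand-in for a Kudu scan

def fetch_users_from_oracle():
    return [("u01", "Beijing"), ("u02", "Shanghai")]  # stand-in for an Oracle query

CATALOG = {"orders": fetch_orders_from_kudu, "users": fetch_users_from_oracle}

def federated_query():
    """Materialize each source's slice into a scratch engine, then run the final join there."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders(order_id INT, user_id TEXT, amount REAL)")
    con.execute("CREATE TABLE users(user_id TEXT, city TEXT)")
    con.executemany("INSERT INTO orders VALUES (?,?,?)", CATALOG["orders"]())
    con.executemany("INSERT INTO users VALUES (?,?)", CATALOG["users"]())
    sql = """SELECT u.city, SUM(o.amount)
             FROM orders o JOIN users u ON o.user_id = u.user_id
             GROUP BY u.city"""
    return con.execute(sql).fetchall()

if __name__ == "__main__":
    print(federated_query())
```

The user only ever sees the final SQL; which fragments ran where is the engine's concern, which is what "virtual database" means in practice.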
Figure 11 Moonbox architecture
The figure above shows the architecture of Moonbox. You can see that the computing engine part of Moonbox is also based on the Spark engine rather than self-developed. Moonbox extends and optimizes Spark, adding many enterprise database capabilities, such as users, tenants, permissions, and stored-procedure-like features.
From the figure above, the entire server side of Moonbox is a distributed architecture, so it is also highly available.
3.1.4 ABD-Davinci
Davinci (Visual Application platform) is a DVaaS (Data Visualization as a Service) platform solution.
Davinci is dedicated to providing one-stop data visualization solutions for business people, data engineers, data analysts and data scientists. It can be deployed stand-alone in a public or private cloud, or integrated into a third-party system as a visualization plug-in. Simply by configuring on the visual UI, users can serve a variety of data visualization applications, with support for advanced interaction, industry analysis, pattern exploration, social intelligence and other visualization functions.
Open source address: https://github.com/edp963/davinci
Figure 12 Davinci function and location
Davinci is a visualization tool with functional features shown in the figure.
Figure 13 Davinci architecture
From a design point of view, Davinci has its own complete and consistent internal logic, including the concepts of Source, View and Widget, and supports a variety of data visualization applications.
Figure 14 Davinci rich client application
Davinci is a rich-client application, so its value mainly lies in front-end experience, richness and ease of use. Davinci supports both chart-driven and pivot-driven modes for editing a Widget. The image above shows a pivot-driven example: the horizontal and vertical axes are pivot dimensions that cut the whole canvas into cells, and each cell can show a different chart.
3.2 ABD architecture
Figure 15 ABD architecture
In the ABD era, we combined the four open source tools in a DIY way to support a variety of data application requirements. As shown in the figure above, the entire end-to-end process is strung together, and this architecture diagram reflects our idea of "opening up the whole link":
Converge. Functions such as collection, storage, flow, ingestion, computation, services and queries need to converge into one platform.
Open up. Facing a complex business environment, the variety of data sources cannot be unified, and it is hard for any single storage or data system to meet all needs, so we no longer force a single selection. Our practice here is therefore open: open source tools and components can be selected, adapted and made compatible.
3.3 ADX overview
When we developed to a certain stage, we needed a one-stop platform to encapsulate the basic components so that users could complete data-related work more easily on this platform, and so we entered the construction stage of the ADX data middle platform.
Figure 16 Overview of ADX
The picture above is an overview of ADX, which is equivalent to a first-level function menu. Users who log in to the platform can do the following:
Project dashboard: view the project dashboard, including health status and other statistics.
Project management: project-related management, including asset management, permission management, approval management and so on.
Data management: data governance work, such as viewing metadata and data lineage.
Data requests: once the project is configured and the data is understood, real work can begin; for security and permission reasons, not everyone can use data directly, so you first have to apply for it.
The blue modules on the right are the five functional modules of the ADX data middle platform highlighted in this sharing. Data requests and publication are mostly handled by the DataHub data hub, which supports self-service application, publication, standardization, cleaning, desensitization and so on. Ad-hoc queries, batch jobs and streaming jobs are realized on the DataWorks data workshop. Data models are implemented on DataStar, the model management platform. The application market includes data visualization (after data is processed, its final presentation can be configured as charts or dashboards, where Davinci may be used), common analysis methods such as tag-based profiling and behavior analysis, an intelligent toolkit (helping data scientists with data set analysis, mining and algorithm modeling), and intelligent services and intelligent conversation (such as intelligent chatbots).
3.3.1 ADX-DataHub data hub
Figure 17 DataHub workflow
The blue dotted box above shows the process architecture of DataHub. The orange boxes are our open source tools, where "tria" stands for Triangle, a job scheduling tool developed by another team at Yixin.
DataHub does not simply encapsulate the link; it lets users get better service at a higher level. For example, DataHub can serve users who need a snapshot, accurate to the second, of the data at some historical moment, or who want a real-time incremental data stream for streaming processing.
How does it do that? By turning the open source tools into engines and integrating them. For example, different data sources are extracted in real time through DBus and written by Wormhole into the HDFS Log data lake, where we store all the real-time incremental data; this means all historical change data can be reconstructed from it, and the data is synchronized in real time. Some logic is then defined on top through Moonbox, and when a user asks for a snapshot or incremental data at some historical moment, it can be computed and provided immediately. If you want a real-time report, you need to maintain a real-time snapshot of the data in some store; here we chose Kudu.
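The "second-accurate snapshot at any historical moment" service can be understood as replaying the change log up to a timestamp and keeping each key's latest surviving version. A minimal Python sketch of that reconstruction logic follows; the change-record layout (key, timestamp, operation, value) is an assumption for illustration, not the actual format of the log data lake.

```python
from typing import Dict, List, Tuple

# Each change record: (primary_key, change_ts, op, row_value); op is 'i', 'u' or 'd'.
ChangeRecord = Tuple[int, str, str, dict]

change_log: List[ChangeRecord] = [
    (42, "2019-06-01 09:00:00", "i", {"order_id": 42, "status": "created"}),
    (42, "2019-06-01 09:05:00", "u", {"order_id": 42, "status": "paid"}),
    (43, "2019-06-01 09:06:00", "i", {"order_id": 43, "status": "created"}),
    (43, "2019-06-01 09:30:00", "d", {}),
]

def snapshot_at(log: List[ChangeRecord], as_of_ts: str) -> Dict[int, dict]:
    """Replay changes up to as_of_ts; the last change per key wins, deletes remove the key."""
    state: Dict[int, dict] = {}
    for key, ts, op, value in sorted(log, key=lambda r: r[1]):
        if ts > as_of_ts:
            break
        if op == "d":
            state.pop(key, None)
        else:
            state[key] = value
    return state

if __name__ == "__main__":
    print(snapshot_at(change_log, "2019-06-01 09:10:00"))  # order 42 paid, 43 created
    print(snapshot_at(change_log, "2019-06-01 23:59:59"))  # only order 42 remains
```

Maintaining a continuously updated snapshot in Kudu for real-time reports is the same idea applied incrementally: each change upserts or deletes the corresponding row as it arrives.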
Streaming has many advantages but also shortcomings, such as higher operation and maintenance cost and weaker stability. Considering these problems, we set Sqoop as Plan B in DataHub. If something goes wrong with the real-time pipeline at night, it can automatically switch to Plan B and use traditional Sqoop batch extraction to support the next day's T+1 reports. Once we find and resolve the problem, Plan B switches back to a paused state.
Suppose users have their own data sources in Elasticsearch or Mongo and also want to publish and share them with others through DataHub. In that case, we should not physically copy the Elasticsearch or Mongo data somewhere else: first, the data is NoSQL and relatively large; second, users may want others to query the Elasticsearch data with fuzzy search, so it is better to keep the data in Elasticsearch. What we do in this case is a logical publication through Moonbox, and users are not aware of the process.
To sum up, it can be seen that DataHub organically integrates and encapsulates several modes commonly used in open source platforms internally to provide consistent and convenient data acquisition, release and other services. Its users can also be in a variety of different roles:
The data owner can handle data approvals here; the data engineer can apply for data and, once the application is approved, process the data here.
APP users can view Davinci reports
Data analysts can directly use their own tools to pick up the data from DataHub, and then do data analysis.
Data users may want to build a data product themselves, and DataHub can provide them with an interface.
Figure 18 DataHub architecture
As shown in the figure, opening up DataHub reveals its architecture design. In terms of functional modules, DataHub implements different functions on top of different open source components. It covers batch collection, streaming collection, desensitization, standardization and so on, and can also output subscriptions based on different protocols.
DataHub is also closely related to the other components: the data it outputs is used by DataWorks, and at the same time it is supported and constrained by the middle platform management (ADXMgt) and data management (DataMgt) modules.
3.3.2 ADX-DataLake real-time data lake
In a broad sense, a data lake means putting all the data together: store and collect it first, then provide different ways of using it for different data.
What we are talking about here is a narrow data lake, which only supports two types of data collection: structured data source and natural language text, and has a unified way of storage.
Figure 19 DataLake
In other words, our real-time data lake is deliberately narrow in scope: all of the company's structured data sources and natural language text are aggregated into UbiLog in real time, and ADX-DataHub provides unified external access. UbiLog can only be accessed and used through the capabilities provided by ADX, which ensures multi-tenancy, security and permission control.
3.3.3 ADX-DataWorks data Workshop
Most data processing is done self-service in DataWorks.
Figure 20 DataWorks workflow
Looking at the workflow of DataWorks in the figure: first, after data comes out of DataHub, DataWorks takes it over. DataWorks supports real-time reports; internally we use Kudu for this, and by solidifying this pattern, users no longer have to make their own storage selection and only need to write their own logic on top. For example, if a real-time DM or batch DM turns out to be a good data asset with reuse value, and we hope other businesses can reuse it, we can publish it through DataHub so that other businesses can apply for it.
Therefore, a data middle platform encapsulated by components such as DataHub and DataWorks can achieve data sharing and data operations. The middle platform contains Kudu, Kafka, Hive, MySQL and other storage components, but users do not need to make the selection themselves: we have made the choices and encapsulated them into a platform that can be used directly.
On the left side of the figure above, there is a role of data modeler, who does model management and development in DataStar, and is mainly responsible for logic and model creation in DataWorks; data engineers, needless to say, are the most common roles that use DataWorks; end users can use Davinci directly.
Figure 21 DataWorks architecture
As shown in the figure, open DataWorks to see its architecture, and DataWorks also supports a variety of different functions through different modules. There will be more articles and sharing about this section in the future, which will not be described in detail here.
3.3.4 ADX-DataStar data model
Figure 22 DataStar workflow
DataStar is related to data metrics models or data assets, and each company has its own internal data modeling process and tools. DataStar can be divided into two parts:
Model design, management and creation: management of the model life cycle and the consolidation of the modeling process.
From the DW (data warehouse) layer to the DM (data mart) layer, a configuration-based mode is supported, and the corresponding SQL logic is automatically generated underneath, so users do not need to write it themselves.
What DataStar ultimately consolidates is a star model composed of the fact and dimension tables of the DW layer. We believe that from the DW layer to the DM or APP layer there is no need to write SQL by hand: simply by selecting dimensions and configuring metrics, everything can be configured visually and generated automatically.
In this way, the requirement on users changes: a modeler or business person can do this work. Given a basic data layer, they can configure the metrics they need by themselves, as the sketch below illustrates. Throughout the process, data engineers only need to focus on the ODS-to-DW layers.
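The claim that the DW-to-DM/APP step can be configured rather than coded boils down to generating grouped-aggregation SQL from a declarative description of dimensions and metrics. The Python sketch below shows that idea with a made-up configuration shape; it is not DataStar's actual configuration format, and the table and metric names are hypothetical.

```python
# Hypothetical metric configuration a modeler might fill in on a UI.
metric_config = {
    "target_table": "dm_order_daily",
    "source_table": "dw_fact_order o JOIN dw_dim_user u ON o.user_id = u.user_id",
    "dimensions": ["o.order_date", "u.city"],
    "metrics": [
        {"name": "order_cnt", "expr": "COUNT(1)"},
        {"name": "total_amount", "expr": "SUM(o.amount)"},
    ],
    "filters": ["o.status = 'paid'"],
}

def generate_sql(cfg: dict) -> str:
    """Turn the declarative config into the INSERT ... SELECT the user would otherwise hand-write."""
    dims = ", ".join(cfg["dimensions"])
    metrics = ", ".join(f'{m["expr"]} AS {m["name"]}' for m in cfg["metrics"])
    where = f' WHERE {" AND ".join(cfg["filters"])}' if cfg.get("filters") else ""
    return (
        f'INSERT OVERWRITE TABLE {cfg["target_table"]}\n'
        f"SELECT {dims}, {metrics}\n"
        f'FROM {cfg["source_table"]}{where}\n'
        f"GROUP BY {dims}"
    )

if __name__ == "__main__":
    print(generate_sql(metric_config))
```

Because the generation is mechanical, the platform can also validate dimensions and metrics against the registered star model before running anything, which is where the governance benefit comes from.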
3.3.5 ADXMgt/DataMgt middle platform management / data management
Figure 23 ADXMgt/DataMgt
The middle platform management module (ADXMgt) mainly covers tenant management, project management, resource management, permission management, approval management and so on. The data management module (DataMgt) mainly focuses on data governance topics. From different dimensions, these two modules provide support and rule constraints for the three main components in the middle.
3.4 ADX architecture
Figure 24 ADX architecture
The association between several modules of the ADX data platform is shown in the figure. At the bottom are five open source tools, each module is an organic integration and encapsulation of these five open source tools. From the figure, we can see that the relationship between the components is very close, in which the black dotted line represents the dependency relationship, and the green line represents the data flow relationship.
IV. Typical case analysis
As mentioned above, we organically integrated and encapsulated the open source tools to create a more modern, self-service and complete one-stop data platform. So how does this platform actually serve the business? This section lists five typical cases.
4.1 Case 1 - Self-service real-time reports
[scene]
The business-domain data team needs to urgently produce a batch of reports; they do not want to wait for scheduling, hope to complete the reports themselves, and some reports need to be real-time.
[challenge]
The business group's data team has limited engineering capacity and can only write simple SQL. Previously, the work was either handed over to the BI team and waited in their schedule, or reports were made through tools directly connected to the business database, or through Excel.
Data sources may come from heterogeneous databases, and there was no good platform to support self-service data development.
The demand for the timeliness of data is very high, and it is necessary to do data processing logic on the stream.
[scheme]
Figure 25 Workflow of self-service real-time report
Use the ADX data middle platform to solve the self-service real-time report problem.
Data engineers log in to the platform, create new projects, and apply for data resources.
Data engineers look up and select tables through metadata, choose DataWorks as the tool, fill in other information, and apply for the tables they need. For example, I need to use 100 tables, of which 70 are used in T+1 mode and 30 in real time.
By default, the middle platform applies standardized desensitization and encryption policies, and after receiving these applications, the middle platform administrator approves them in turn according to policy.
After approval, the middle platform automatically prepares and delivers the requested data resources, and data engineers can use them for self-service query, development, configuration, SQL orchestration, batch or streaming processing, data visualization configuration and so on.
Finally, the self-service report or dashboard is delivered to the user.
[summary]
Each role interacts through the one-stop data platform with a unified process; all actions are recorded and can be queried.
The platform's fully self-service capability greatly speeds up digitally-driven business processes, with no waiting. After a short training, each person can complete a real-time report by themselves in 3-5 days, and real-time reports no longer require asking others for help.
Platform support staff do not need to be deeply involved and are no longer a bottleneck.
[ability]
This scenario requires many data capabilities, including: ad-hoc query capability, batch processing capability, real-time processing capability, report and dashboard capability, data permission capability, data security capability, data management capability, tenant management capability, project management capability, job management capability and resource management capability.
4.2 Case 2 - Collaborative model metrics
[scene]
Business lines need to create their own basic data marts that can be shared to other businesses or frontline systems.
[challenge]
How to effectively build and manage the data model.
How to support not only the construction of data models in their own domain, but also the sharing of data models.
How to solidify the process of data sharing and release, and realize the unified management and control of technical security.
How to operate the data so that data assets are effectively accumulated and managed.
[scheme]
Figure 26 collaboration model metrics workflow
Use the ADX data middle platform to solve the collaborative model metrics problem.
The data modeler logs in to the platform, creates a new project, and applies for resources; then looks up and selects tables, designs a DW model of fact and dimension tables, and pushes it to the DataWorks project.
The data engineer selects the required source tables, completes the ODS-to-DW ETL development based on the DataStar project, then submits the job and publishes it to DataHub to run.
The data modeler then visually configures, maintains and manages the DW/APP-layer metric sets, including dimension aggregation, calculations and so on.
[summary]
This is a typical case of data asset management and operation. Through unified, collaborative model and metric management, it ensures that models are maintainable, metrics are configurable and quality is traceable.
DataStar also supports conformed dimension sharing, data dictionary standardization, business-line sorting and so on, which can further and flexibly support the construction and accumulation of the company's unified data foundation layer.
[ability]
The capabilities required in this case include: data service capabilities, ad hoc query capabilities, batch processing capabilities, data permissions capabilities, data security capabilities, data management capabilities, data asset capabilities, tenant management capabilities, project management capabilities, job management capabilities, resource management capabilities.
4.3 Case 3 - Agile analysis and mining
[scene]
The business-domain data analysis team needs to do rapid data analysis and mining in a self-service way.
[challenge]
The analysis team uses different tools, such as SAS, R, Python, SQL, etc.
Analysis teams often need raw (non-desensitized) data and full historical data for analysis.
The analysis team wants to get the data they need quickly (often without knowing what data is needed) and focus on the data analysis itself with agility and efficiency.
[scheme]
Figure 27 Agile analysis mining workflow
Use the ADX data middle platform to solve the agile analysis and mining problem.
Data analysts log in to the platform, create new projects, and apply for resources. They look up and select tables according to their needs, choose their preferred tools, fill in other information, and apply for use.
The relevant parties approve the application according to policy.
After approval, the data analyst gets the resources and uses their tools for self-service analysis.
[summary]
Moonbox itself is a data virtualization solution, very suitable for ad-hoc reading and computation across heterogeneous data sources, and it saves data analysts a lot of data engineering work.
DataHub/DataLake provide a real-time synchronized, full and incremental data lake, as well as security policies such as configurable desensitization and encryption, providing secure, reliable and comprehensive data support for analysis scenarios.
Moonbox also provides a special mbpy (Moonbox Python) library to make it easier for Python users to perform fast and seamless data viewing, ad hoc computing and common algorithm operations under security control.
Figure 28 Agile analysis mining example
For example, a user opens Jupyter, imports the mbpy package, and logs in to Moonbox; they can then see the tables the administrator has authorized, and use those tables for analysis and computation without caring where the data actually lives, which is a seamless experience.
As shown in the figure above, there are two tables: one with more than 50 million rows stored in Kudu, the other with more than 6 million rows stored in Oracle. The data sits in heterogeneous systems, and Kudu itself does not support SQL. We expressed the logic through Moonbox as if the data were all in one virtual database, and it took only 1 minute and 40 seconds to compute the result.
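The exact mbpy API is not shown in this talk, so the following Jupyter-style sketch only mirrors the workflow described above. The module name mbpy comes from the talk, but the function and method names (connect, tables, sql) and their parameters are assumptions for illustration, not the library's documented interface.

```python
# Hypothetical sketch of the workflow described above; the real mbpy API may differ.
# import mbpy  # Moonbox Python client mentioned in the talk

def demo(mbpy):
    # Log in as an ordinary user; Moonbox enforces the permissions granted by the admin.
    session = mbpy.connect(host="moonbox-host", port=10010,
                           user="analyst", password="***")   # assumed API

    # The user only sees tables the administrator has authorized, wherever they physically live.
    print(session.tables())            # e.g. ['kudu_orders', 'oracle_users']  (assumed API)

    # One SQL over two heterogeneous stores (Kudu + Oracle); Moonbox plans and splits it.
    df = session.sql("""
        SELECT u.city, COUNT(1) AS orders, SUM(o.amount) AS amount
        FROM kudu_orders o JOIN oracle_users u ON o.user_id = u.user_id
        GROUP BY u.city
    """)                               # assumed API, returns a dataframe-like result
    return df
```

The point being illustrated is the user experience: no data export, no manual federation, and the security and permission controls stay on the server side.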
[ability]
The capabilities required in this case include: drill-down analysis capability, data service capability, algorithm model capability, ad-hoc query capability, multi-dimensional analysis capability, data permission capability, data security capability, data management capability, tenant management capability, project management capability and resource management capability.
4.4 Case 4 - Scenario-based multi-screen interaction
[scene]
To support all-round scenario-based and digitally-driven operations, interaction across large, medium, small and smart screens is sometimes required: the large screen is a projection or wall screen, the medium screen is a computer screen, the small screen is a mobile phone screen, and the smart screen is a chat client screen.
[challenge]
Because of differences in positioning, display size and interaction, the different screens require different degrees of visualization customization, which brings a certain amount of development work.
Multiple screens also need to be highly consistent at the data permission level.
Among them, the smart screen needs more intelligent capabilities such as NLP, chat robot and task robot, as well as the ability to generate charts dynamically.
[scheme]
Through Davinci's Display function, configuration alone can meet the customization requirements of large and small screens.
Through the unified data permission system of Davinci, the consistent data permission conditions can be maintained among multiple screens.
Through ConvoAI's Chatbot/NLP capability, intelligent micro-BI, that is, the smart screen, can be supported.
Figure 29 Display editing page of Davinci
The image above shows the Display editing page of Davinci. You can freely define the desired display style by selecting different components, adjusting transparency, arbitrary placement, adjusting foreground background, color scaling, and so on.
Figure 30 Davinci configuration with large screen
The image above shows an example of a large screen configured with Davinci. (The picture comes from a user in Davinci's open source community, and the data has been anonymized.) You can see that a large screen can be configured by yourself through Davinci, without development.
Figure 31 Davinci configuration small screen
The figure above shows an example of Davinci configuring a small screen. The picture comes from Yixin's annual meeting. The on-site staff check the real-time data through the mobile phone to understand the situation on the spot.
Figure 32 Smart screen
The picture above shows an example of a smart screen. Our company has a ConvoAI-based chat robot, which can interact with users through a chat window and return results according to users' needs, including charts and so on.
4.5 Case 5 - Data security and data governance
Figure 33 Workflow of data security management
This case is relatively simple. A complete data middle platform serves not only application customers but also management customers; typical management customers are the data security team and the data committee.
The data security team needs to manage security policies, scan for sensitive fields, approve data resource requests, and so on. The Yixin agile data middle platform provides an automatic scanning function and returns the scan results to the security team for timely confirmation. The security team can also define multiple levels of security policies, view audit logs, investigate data flow links, and so on, as sketched below.
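The automatic scanning of sensitive fields mentioned here can be pictured as sampling column values and matching them against patterns for phone numbers, ID numbers and so on, then handing the hits to the security team for confirmation. The following is only an illustrative Python sketch of that idea, not Yixin's actual scanner; the patterns and thresholds are simplified assumptions.

```python
import re

# Simplified patterns for a few sensitive data types (illustrative only).
SENSITIVE_PATTERNS = {
    "cn_mobile": re.compile(r"^1\d{10}$"),        # mainland China mobile number (simplified)
    "cn_id_card": re.compile(r"^\d{17}[\dXx]$"),  # 18-digit ID card number (simplified)
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def scan_table(table_name: str, sampled_columns: dict, threshold: float = 0.8):
    """Flag a column as sensitive if most sampled values match one pattern.
    Results would then go to the security team for confirmation."""
    findings = []
    for column, values in sampled_columns.items():
        if not values:
            continue
        for label, pattern in SENSITIVE_PATTERNS.items():
            hit_ratio = sum(bool(pattern.match(str(v))) for v in values) / len(values)
            if hit_ratio >= threshold:
                findings.append({"table": table_name, "column": column,
                                 "type": label, "hit_ratio": round(hit_ratio, 2)})
    return findings

if __name__ == "__main__":
    sample = {"mobile": ["13812345678", "13987654321", "n/a"],
              "remark": ["ok", "pending", "done"]}
    print(scan_table("ods_user", sample, threshold=0.6))
```

Confirmed findings would then feed the desensitization and encryption policies applied during data requests, which is how scanning and policy enforcement connect.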
The data committee needs to do data research, view the data map, perform lineage analysis, standardize and proceduralize cleaning rules, and so on. They can also log on to the data middle platform and complete this work there.
V. Summary
This sharing mainly introduced the top-level design and positioning, the internal module architecture and functions, and typical application scenarios and cases of the Yixin agile data middle platform. Based on current business needs and the development history of our data platform, we organically combined and encapsulated the five open source tools and, together with the Agile Big Data philosophy, created a one-stop agile data platform that suits our own business and can be applied and landed in both business and management. We hope it brings inspiration and reference to everyone.
VI. Q&A
Q: can enterprises rely solely on the open source tools of the open source community to build a data platform?
A: The data middle platform should be built in line with the enterprise's actual situation and goals. Some good open source tools are already mature and there is no need to reinvent the wheel, while some parts need customized development according to the enterprise's own environment and needs. So in general, building a data middle platform involves both the selection of open source tools and in-house development of common components according to the company's own situation.
Q: What detours and pitfalls should be avoided in building a data middle platform?
A: Compared with a pure technology platform, a data middle platform requires more capability building aimed at directly enabling the business, such as data asset accumulation, data service construction, abstraction of data processing workflows, enterprise data standardization and security management. None of these can be driven purely bottom-up by technology; they require consensus and support at the company and business levels, with the middle platform built iteratively, driven by actual business needs. Such a combination of top-down and bottom-up iteration can effectively avoid unnecessary short-sightedness and over-design.
Q: After the data middle platform has been built, how do you evaluate its maturity and effectiveness?
A: The value of the data middle platform is measured by the business goals it drives. Qualitatively, it is whether it really achieves the "fast, accurate and economical" effect; quantitatively, maturity can be evaluated through indicators such as platform component reuse, data asset reuse and data service reuse.
Q: how is the metadata of the platform managed?
A: Metadata is a big independent topic, from the classification of metadata, to how to collect and maintain all kinds of metadata, to how to build metadata applications on top of the metadata information; it could fill a whole separate sharing. For the metadata management of ADX we follow that line of thinking. First we sort out the categories of the full metadata landscape; then, very importantly, we let "business pain points drive the construction of the metadata system" and prioritize according to the company's most urgent metadata needs. At the technical level, we collect the basic technical metadata of the various data sources through Moonbox and generate execution lineage based on Moonbox's SQL parsing capability. Finally, according to actual business pain points, such as how a structural change in an upstream source table affects downstream data applications (lineage impact analysis), or how to trace an upstream data flow link (data quality diagnostic analysis), we iteratively develop the metadata application modules.
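As a toy illustration of "generating lineage from SQL parsing" (Moonbox's real parser works on SQL plans and is far more complete), the sketch below pulls source and target table names out of an INSERT ... SELECT statement with regular expressions and emits lineage edges. The statement and table names are made up.

```python
import re

def lineage_edges(sql: str):
    """Very rough lineage extraction: target <- sources, from one INSERT ... SELECT.
    Real systems parse the SQL into a logical plan instead of using regexes."""
    sql = re.sub(r"\s+", " ", sql)
    target = re.search(r"insert\s+(?:overwrite\s+table|into)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.I)
    if not target:
        return []
    return [(src, target.group(1)) for src in dict.fromkeys(sources)]

if __name__ == "__main__":
    sql = """
        INSERT OVERWRITE TABLE dm.order_daily
        SELECT u.city, count(1) FROM dw.fact_order o JOIN dw.dim_user u
        ON o.user_id = u.user_id GROUP BY u.city
    """
    for src, dst in lineage_edges(sql):
        print(f"{src} -> {dst}")  # dw.fact_order -> dm.order_daily, dw.dim_user -> dm.order_daily
```

Collecting such edges across all jobs is what makes impact analysis ("which downstream tables break if this source column changes?") a graph traversal rather than detective work.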
Q: what is the modeling methodology of the data modeler? What is the difference between dimensional modeling and warehouse modeling?
A: Our modeling methodology is guided by the well-known "The Data Warehouse Toolkit", and according to Yixin's actual situation we simplify, standardize and generalize Kimball's dimensional modeling. We also draw on the experience of Alibaba's OneData system, so there is not much originality of our own here. The more important goal of DataStar is to be easy to use and to effectively attract and help data modelers to unify, bring online and manage model construction as a process, while striving to reduce the burden on ETL developers by delegating the personalized metric work from DW to the DM/APP layer to non-data developers, who serve themselves through configuration. So the overall goals of DataStar are governance and efficiency.
Q: Is the Triangle task scheduling system open source?
A: Triangle is developed and maintained by another team. They have plans to open-source it, but we are not sure when.
Q: When will Davinci be officially released?
A: this is an eternal question. Thank you for your continuous attention and recognition of Davinci. We have plans to push Davinci to Apache incubation, so we hope you can continue to support Davinci and make Davinci the best choice for open source visualization tools.
Q: Does the data service control all data reads and writes? Ideally all business parties would access data through data services, so data governance, link management and data maps would be easier to do. The problem is that in many cases, if the business side knows the connection information, it can connect directly; how do you prevent the business side from bypassing the API and connecting directly?
A: Yes, the goal of DataHub is to unify data collection, data requests, data publication and data services, so that data security management, link management, standardization and so on become easier to achieve. Preventing the business side from bypassing DataHub and connecting directly to source databases probably has to be controlled through management processes. As for DataHub itself, because it encapsulates the real-time data lake, it offers capabilities that a direct connection to the business database does not; with the continuous improvement of DataHub's user experience and functionality, we believe the business side will be more willing to get data from DataHub.
Q: Does DBus support Postgres data sources?
A: DBus currently supports MySQL, Oracle, DB2, logs and Mongo data sources. Because of the characteristics of Mongo's log, DBus can only receive incomplete incremental logs (only updated columns are output), which puts high demands on strictly ordered consumption; internally there are not many scenarios where DBus is connected to Mongo. The community has raised the need for DBus to support PostgreSQL and SQL Server, and this is extensible in principle, but at present the team's effort is invested in the construction of the data middle platform, so support for more data source types has not been scheduled. If needed, you can contact our team to discuss it.
Q: The underlying layer of Moonbox implements this kind of federated computation with Spark SQL, which consumes a lot of resources. How is it optimized?
A: Moonbox's federated computing engine is based on Spark, and we have done some optimization work on Spark. The biggest optimization is supporting more computation pushdown. Spark itself has federated computation capabilities, but it only supports pushing down some operators, such as Projection and Predicate. Moonbox extends Spark in a bypass fashion and supports pushing down more operators such as Aggregation, Join and Union. When parsing the SQL, we push parts of the execution plan down strategically according to the computing characteristics of each data source, so that each data source does the computation it is best suited for, reducing the cost of the federated computation inside Spark.
Moonbox also supports whole-SQL pushdown: if the SQL itself contains no cross-source logic and the data source can execute the whole statement, Moonbox pushes the entire SQL down to the data source and bypasses Spark. In addition, Moonbox supports batch computing, distributed interactive computing and local interactive computing modes, each with different optimizations and strategies.
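To show what operator pushdown buys in this setting, the sketch below contrasts a naive federated plan (ship every raw row to the compute engine, then aggregate) with a pushed-down plan (each source pre-aggregates locally, the engine only merges partial results). It is a conceptual Python illustration with toy data, not Moonbox's optimizer.

```python
from collections import defaultdict

# Two "sources" with raw rows: (city, amount). Imagine millions of rows in reality.
source_a = [("Beijing", 10.0), ("Shanghai", 5.0), ("Beijing", 2.5)]
source_b = [("Beijing", 1.0), ("Shenzhen", 7.0)]

def no_pushdown(sources):
    """Naive plan: ship every raw row to the compute engine, aggregate there."""
    shipped = [row for src in sources for row in src]   # network cost ~ number of raw rows
    totals = defaultdict(float)
    for city, amount in shipped:
        totals[city] += amount
    return dict(totals), len(shipped)

def with_pushdown(sources):
    """Pushed-down plan: each source aggregates locally, the engine only merges partials."""
    partials = []
    for src in sources:
        local = defaultdict(float)
        for city, amount in src:                         # done "inside" the data source
            local[city] += amount
        partials.append(local)
    shipped = [item for p in partials for item in p.items()]  # network cost ~ number of groups
    totals = defaultdict(float)
    for city, amount in shipped:
        totals[city] += amount
    return dict(totals), len(shipped)

if __name__ == "__main__":
    print(no_pushdown([source_a, source_b]))    # same totals, 5 rows shipped
    print(with_pushdown([source_a, source_b]))  # same totals, 4 partial rows shipped
```

On toy data the saving is trivial, but when each source holds millions of rows and the group count is small, moving the aggregation to the source is where most of the resource saving comes from.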
Q: How do offline computing and real-time computing work together? Offline computing can use layered storage; how is layering achieved in real-time computing?
A: Layering in real-time computing is done through Kafka. Of course, if the latency requirement for the layered real-time data is not too strict (for example, minutes), you can also choose a real-time NoSQL store such as Kudu. As for how offline and real-time computing cooperate: with Moonbox, no matter where the results of batch and streaming computation are stored, they can be seamlessly combined through Moonbox. It is fair to say that Moonbox simplifies and smooths out much of the complexity of data flow architectures.
Q: What is the positioning of the data middle platform? Will it be just another buzzword? Within Yixin, what is the relationship between the data middle platform and the traditional back office?
A: The positioning of the Yixin data middle platform was covered at the beginning of the talk. To put it simply: unify and make transparent what is below, generalize, standardize and proceduralize what is in the middle, and make what is above self-service. As for buzzwords, they should be looked at from two sides: some waves leave more lessons, while others bring more progress. Regarding "the relationship between the data middle platform and the traditional back office", I understand the traditional back office here to mean the business back office. A good business back office cooperates with and supports the data middle platform well, while a poor one leaves more data-level challenges for the data middle platform to face and solve.
Q: With data stored heterogeneously across so many storage components, how do you ensure the efficiency of personalized queries?
A: This question presumably refers to the Moonbox architecture and how to ensure the efficiency of ad-hoc queries. A pure ad-hoc query (computing the result directly against the source data) will never be as fast as an in-memory MPP query engine. For us, Moonbox is mainly used as a unified batch computing entry, unified ad-hoc query entry, unified data service, unified metadata collection, unified data permissions, unified lineage generation, unified data toolbox and so on. If you are after millisecond or second query latency, you can use a pre-computation engine such as Kylin or Druid, or stores such as ES or ClickHouse, but the premise is that the base data is already in place. Therefore, our data pipeline supports physically writing DW/DM data to ES or ClickHouse after ETL and publishing it uniformly through DataHub, which ensures the efficiency of "personalized" queries to some extent. Purely from Moonbox's point of view, minute- or hour-level pre-computation over heterogeneous storage with the results written to ClickHouse supports minute/hour data latency and millisecond/second query latency.
Q: When new data enters the system, is the whole process from collection to storage controlled by developers, or does a dedicated data administrator assemble the component patterns through the interface?
A: If the new data source comes from a business database that DBus has already connected to, a dedicated middle platform administrator configures and publishes the new ODS on the middle platform management interface, for downstream users to apply for and use on DataHub. If the new data source comes from the business's own NoSQL store, business staff can initiate the data publication process on DataHub, and downstream users can then see it in the metadata and apply for and use it on DataHub.
The so-called "data acquisition to storage" is also divided into real-time acquisition, batch acquisition, logical acquisition and so on. These commonly used data source types, data docking methods and user usage methods are encapsulated and integrated by DataHub. Both data owners and data users are one-stop DataHub user interface. All data link Pattern, automation processes and best technology selection and practices are transparently encapsulated in DataHub. This is also the value of instrumentalization to platform.
Source: Yixin Institute of Technology