2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
What does building a data center on DataWorks look like in practice? Many people without hands-on experience are unsure where to start. This article summarizes the problem and its solutions, based on Hema's experience, in the hope that it helps you with your own work.
I. The business model of Hema
If you work with data, the first thing that matters is understanding the business. Someone asked me earlier whether building a data center is difficult. In our view, data is closely tied to the business, so before building a data platform we must first understand the business deeply. Hema is a new business that has emerged at Alibaba over the past two or three years. Some of you have probably experienced it: Hema Xiansheng stores now cover Beijing, Shanghai, and other first- and second-tier cities in China.
The picture above shows the architecture of Hema's business model. The business revolves around two poles, online and offline. Hema's business is called O2O, but it differs from early O2O in an interesting way. O2O used to mean Online to Offline. For Hema it means Offline to Online: the goal is to bring offline traffic online, using the offline experience to make users willing to buy online, while guaranteeing that online quality matches offline quality. There is no cheaper "e-commerce special edition" online that turns out to differ from what you would get in the store.
This O2O architecture gives Hema an interesting customer base: most customers are family units. In my family, I shop at Hema, and so do my daughter and my parents. I am an online customer who places orders in the app. The older generation may not shop with an app, so they go to the store, where they buy the same things I do. My daughter may not shop at all, but Hema serves food, and she likes to go there to eat seafood. This closed loop, passed along within families, sustains the growth of the business and its reputation.
With the business model in place, Hema needed to build its business architecture. What should it look like?
First, it integrates online and offline to deliver on the O2O goal.
Second, it is a fresh-food e-commerce business, which differs fundamentally from traditional e-commerce in standardized goods.
Third, the store is multi-functional, combining sales display, warehousing, sorting, online fulfilment, and other formats.
Fourth, delivery is time-limited: within three kilometers, in 30 minutes. This broke the same-day and next-day delivery standards that e-commerce platforms used to be proud of, and Hema's time-limited delivery still leads the industry.
Fifth, Hema offers prepared-food delivery. If you want to eat something but cannot cook it, Hema will cook it for you; if you can cook but cannot kill a fish or a chicken, Hema will do that part and then deliver it to you.
Finally, and very importantly, there is the role of the store. A Hema store is not a traditional shop; it is also set up as a warehouse. Offline, what you see is a store; for online orders, it is a warehouse.
II. Technical architecture and prototype of Hema
With the business model settled, we needed to design the technical architecture. Early on, Hema hesitated, because traditional software vendors already offer ready-made systems for retail, stores, and supermarkets, such as ERP and WMS. Should we just buy one? In the end Hema decided firmly that all product technology, including the business systems and the digital systems, had to be built in-house, because Hema needs to comprehensively digitize many traditional functions: transactions, stores, warehousing, transportation and delivery, procurement, supply chain, labor, and so on.
Traditional ERP and logistics software is digitized too, but there is an important difference. We do not digitize merely to structure the data; more importantly, the data must support the strategy layer above it. We provide intelligent support for traffic, logistics fulfilment, process optimization, and financial strategy. One example: we studied some large retailers and supermarket chains with offline stores. They also run online apps, but their online and offline inventory is isolated. If a store has 100 fish, it pre-allocates them: only 10 are sold online, and once those are gone the item is out of stock online. With our strategy model, data and goods are fully connected across online and offline.
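The difference between pre-allocated and shared inventory can be illustrated with a small Python sketch. This is only an illustration of the idea in the fish example, not Hema's actual inventory system; the class names and the 10/90 split are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PreAllocatedInventory:
    """Isolated stock: each channel can only sell its own fixed quota."""
    online: int
    offline: int

    def sell(self, channel: str, qty: int) -> bool:
        available = self.online if channel == "online" else self.offline
        if qty > available:
            return False  # sold out on this channel, even if the other has stock
        if channel == "online":
            self.online -= qty
        else:
            self.offline -= qty
        return True


@dataclass
class SharedInventory:
    """One pool: the store doubles as the warehouse for online orders."""
    total: int

    def sell(self, channel: str, qty: int) -> bool:
        if qty > self.total:
            return False
        self.total -= qty
        return True


# 100 fish in both models (hypothetical split of 10 online / 90 offline).
isolated = PreAllocatedInventory(online=10, offline=90)
shared = SharedInventory(total=100)

# Eleven online orders of one fish each against both models.
isolated_results = [isolated.sell("online", 1) for _ in range(11)]
shared_results = [shared.sell("online", 1) for _ in range(11)]
```

In the isolated model the eleventh online order fails although 90 fish remain in the store; in the shared model it succeeds because both channels draw from the same pool.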
Another important point: many of the functions just mentioned are handled by separate teams at Alibaba. Cainiao is responsible only for logistics, for example, and Taobao only for marketing and transactions, although the economy as a whole is moving toward integration. To close its own business loop, Hema built all of its systems itself, from transactions, stores, warehousing, and delivery to procurement, supply chain, and labor, and connects them through a collaboration layer. On top of that we have business planning, supply chain management, collaborative management, and omni-channel, multi-format operations, which together provide a closed-loop solution.
A key part of this closed loop is the data layer on the far right. Without a unified data center, it would be very difficult to support the business as a whole. This is the part I will focus on today.
At Alibaba, a data center (here meaning a shared data platform layer, not a physical facility) is not only a solution; it is also the function of a team. Hema has an independent data center team supporting the business. We treat data as an asset, as important as goods, members, and equipment. The members of the Hema data center team are the builders, managers, and operators of that asset, and they aim to use it to drive the intelligent upgrade of the whole retail supply chain. Most importantly, we collect, manage, and build the data so that the business can put it to better use.
The picture above shows the overall architecture of Hema's data platform. Some parts of it are specific to Hema, and some are general.
Let's start with the general parts. Like every department in Alibaba Group, we run on Alibaba Cloud infrastructure. At the bottom of the data layer is the source data, which comes mostly from business systems. The access layer is more complicated because, as mentioned, Hema is omni-channel. We have the app, wired equipment, couriers' electric vehicles, overhead hanging chains, IoT devices, internal apps, HR systems, and so on, so there is a great deal of structured and unstructured data. The data processing layer handles the unstructured data, and eventually a very important data asset layer is formed.
Once built, the data asset layer carries business meaning, and the business can use this data directly. On top of it we define a data service layer, which makes the data more convenient to use, essentially out of the box. Even the service layer may still be invisible to business users, though. As someone put it to me, business users should be able to use the data directly instead of looking things up across many tables. For this Hema has a data application layer, where we build many data products so that the business consumes data in product form. Hema also has many front ends: PCs, DingTalk, handheld terminals, and many IoT gadgets, some of which may be just a small black-and-white screen, and data is delivered through all of them. On the far right is a management system that makes operations and maintenance effective across the stack. This diagram is the business-oriented layered architecture of the data platform as Hema understands it.
Based on this layered business architecture we designed the technical architecture of the data center. Anyone who has worked with big data will find the data collection side familiar. Computation is both offline and real-time. Offline computation runs on MaxCompute, which holds almost all of Alibaba's offline data; during Double 11 in 2020, MaxCompute processed more than 1,000 PB per day, reaching the exabyte level. Real-time computation runs on Flink, whose performance is also very strong. Then there is storage. Hema relies heavily on online storage, such as Lindorm (a key-value store), MaxCompute interactive analysis (Hologres), and Elasticsearch for online search, and we expose these stores as data services one by one. The data services cover detailed metrics, features, tags, and so on. The data is delivered to commonly used devices, operations platforms, DingTalk mobile office, intelligent management, and so on, which is mostly the runtime level. At the operations level we have metadata, data quality, disaster recovery management, data governance, and so on. This diagram is closer to a technical requirements diagram: it shows what our technical team needs to deliver when building the data center.
III. Hema's data center solution based on DataWorks
Once the business model, the technical architecture of the business products, and the technical requirements of the data center were sorted out, we began technology selection for the data center: researching which products and systems could support the whole architecture. As mentioned, our business systems are self-developed, but for the data center we chose not to build our own, because Alibaba Cloud already has a mature product family for building one. For the big data compute engine we use MaxCompute, which the group has always used. We then evaluated data development and governance tools for building the data center and chose DataWorks. The following is the overall architecture diagram of DataWorks:
DataWorks provides data integration, covering batch, incremental, real-time, and whole-database synchronization, so it can handle many complex data sources. DataWorks data integration currently supports 50+ data sources offline and 10+ in real time. Whether the sources sit on the public network, in an IDC, in a VPC, or elsewhere, integration is secure, stable, flexible, and fast. DataWorks also offers unified metadata management and unified task scheduling, plus a rich set of one-stop data development tools covering the whole data development life cycle, which greatly improves our development efficiency. The upper layers include data governance, data services, and so on, along with a very important open platform. As mentioned, Hema is an independent, rich business; many of its systems are self-developed by their own R&D teams. We need to extend many functions through the DataWorks OpenAPI and integrate them with our self-developed systems and project systems. DataWorks currently provides more than 100 OpenAPI operations, which makes this straightforward.
Now let's go back to the data center technical requirements diagram and compare it with DataWorks. The data collection part corresponds to DataWorks data integration; essentially all the synchronization requirements on the left of the diagram can be met by DataWorks.
Then there is data development. In the data development layer, DataWorks supports streaming, batch, and real-time development through DataStudio, HoloStudio, and StreamStudio. It also provides data services and open interfaces, so it can be integrated with our existing systems and products through the OpenAPI. Crucially, DataWorks provides data map and data governance capabilities. These look like peripheral features, but they play a key role at Hema and across Alibaba, as I will explain later.
Everything so far is best seen as preparation for the data center: we understood the business, did the design, and selected the technology. At Alibaba it is very important to set a clear goal before doing anything. A goal is not necessarily a KPI; it can also be a mission or an original intention. What is the goal of Hema's data center? It is to build a middle layer with rich data, multi-dimensional full-link coverage, reliable quality (standard definitions and accurate results), stable operation, and timely, fault-free output. Many people will say this is just a data mart. That does not matter; so far it is only a middle layer. The other key point is that we must provide reliable data services, data products, and business applications to the business above. That requirement is what makes it more than a data warehouse or a data mart: it is a data center that the business uses continuously. If we merely synchronize data into MaxCompute, open source Hadoop, or a database, that is just a warehouse. A data center, by our definition, is something the business can use directly and that brings it business value.
With the goal defined, we break the work down step by step. What do we mainly do? First, design a metric system, because the business consumes fields, not tables. That needs the support of a data model design, which makes the data more standardized. Then we develop the data processing tasks. There are intelligent ways of building data warehouses today, but they are more a thing of the future; for now, data development still depends on manual work. Finally, we open the data to the business through data services, whose form is not limited to tables, APIs, and reports; a service can even be a product or something else.
The image above is probably the layered data model (or data mart) diagram seen most often on the Internet, the familiar ODS, DWD, DWS, and ADS. There are many concepts and ideas around these layers, and everyone understands them differently. Hema has its own strict and clear definition: each layer has its own characteristics and responsibilities.
ADS must be business-oriented, not development-oriented: the business should be able to understand the data, or even use it directly, in the shortest possible time. DWS must consist of metrics. It is the carrier of the metric system mentioned earlier, and DWS summaries are essentially what supports ADS.
DWD is what we usually call the detail layer. How do we build it? We use dimensional modeling, with dimension tables and fact tables. Dimension tables come in several kinds, such as hierarchical dimensions and enumerated dimensions, and fact tables include periodic snapshots. One important rule: DWD fields must be directly understandable and unambiguous. Any ambiguity causes problems when DWS uses the data, and those problems propagate to every upper-layer application.
As for ODS, the common understanding is that it stays consistent with the source: business data is synchronized as is. Some architectures have changed, and people now like to do preliminary ETL in ODS, which makes ODS data differ from the business data. Hema does not allow this. The reason is simple: ODS must stay consistent with the business database so that when something goes wrong we can quickly locate the cause. Once ETL is involved, a bug in the ETL process can make the two sides inconsistent.
So Hema strictly forbids any logical processing between the business database and ODS. If the two differ, the cause can only be middleware or storage, never business logic.
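One way to picture the "ODS must mirror the business database" rule is a reconciliation check that compares row counts and a content digest on both sides. The sketch below is a minimal illustration, assuming rows are available as Python dictionaries with an `id` key; the function names and sample data are hypothetical and are not part of DataWorks or MaxCompute.

```python
import hashlib
import json


def table_fingerprint(rows, key="id"):
    """Return (row_count, digest) for a list of dict rows, independent of row order."""
    ordered = sorted(rows, key=lambda r: r[key])
    payload = json.dumps(ordered, sort_keys=True, default=str).encode("utf-8")
    return len(ordered), hashlib.sha256(payload).hexdigest()


def ods_matches_source(source_rows, ods_rows, key="id"):
    """True only if ODS holds exactly the same rows and values as the source."""
    return table_fingerprint(source_rows, key) == table_fingerprint(ods_rows, key)


# Hypothetical sample: the business table stores raw status codes.
source = [{"id": 1, "status": "0"}, {"id": 2, "status": "2"}]

# A faithful copy (row order does not matter).
faithful_ods = [{"id": 2, "status": "2"}, {"id": 1, "status": "0"}]

# A copy where "preliminary ETL" already decoded the codes: no longer a mirror.
transformed_ods = [{"id": 1, "status": "created"}, {"id": 2, "status": "paid"}]
```

With a strict mirror, any mismatch reported by such a check points at the synchronization path rather than at transformation logic, which is the troubleshooting benefit described above.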
IV. Building the data center with DataWorks at Hema
So far I have talked about Hema's ideas, design, architecture, goals, and requirements for the data center. Next I will talk about how Hema uses DataWorks to build it, and share some experience with the DataWorks platform. DataWorks serves not only Hema but almost every business unit in Alibaba Group. Every day, tens of thousands of operations staff, product managers, data engineers, algorithm engineers, and developers within the group use it, and it also serves a large number of users on Alibaba Cloud. Many of its designs are therefore open, general, and flexible. In use, that can cause problems of its own: too much flexibility, or no standards. What follows shares some of Hema's experience in dealing with this.
Data synchronization is the first step in building a data center; if data cannot get into the warehouse, nothing can be built. Hema has several requirements here. All Hema business data is synchronized into a single project, exactly once; repeated synchronization is not allowed. This makes management easier, reduces cost, and keeps the data unambiguous. If the source is wrong, everything downstream is wrong, so we must make sure the source data is 100% correct. For backtracking and audit purposes, the data life cycle is set to permanent. Even if a business system archives or deletes data in an online database because of traffic pressure, historical data can be restored intact from the ODS layer when it is needed again.
The second part is data development, which is largely a test of individual skill; almost everyone writes SQL. Our experience, put simply: data processing is the implementation of business logic. We must ensure not only that the logic is correct, but also that the output is stable, timely, and reasonable. Besides good coding support, the DataWorks data development editor offers visual ways of working with the flow, which help with code review and even some verification. This is very useful day to day.
I also have a Java background, and we know every kind of programming has its paradigms. We have abstracted data development into several steps.
The first is transcoding. What is it for? As mentioned, business systems are designed to complete a business process, and they contain a lot of custom handling. Internet systems in particular, to solve performance or filtering problems, use JSON fields, map fields, delimiters, and so on. During development we transcode: enumerated values are turned into something people can actually understand. What does 0 mean? What does 2 mean? What does a mean?
Then there is format conversion. Business systems are hard to standardize. Take time: some store a timestamp, some a string, some yymm. They all represent time, but in different formats, and a data mart requires consistent formats. Format conversion turns non-standard formats into the standard one.
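Transcoding and format conversion can be sketched in a few lines of Python. In practice this logic would usually be written in SQL on MaxCompute; Python is used here only for illustration. The status code mapping and the choice of `YYYY-MM-DD HH:MM:SS` in UTC as the standard format are assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical enumeration: what the raw codes 0, 2 and a stand for.
ORDER_STATUS = {"0": "created", "2": "paid", "a": "cancelled"}


def decode_status(code):
    """Transcoding: turn an opaque enum code into a readable value."""
    return ORDER_STATUS.get(str(code), "unknown")


def normalize_time(value):
    """Format conversion: accept an epoch timestamp, a 'yymm' string, or a
    'YYYY-MM-DD HH:MM:SS' string and return one standard string format."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    elif len(value) == 4 and value.isdigit():
        dt = datetime.strptime(value, "%y%m")  # first day of that month
    else:
        dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return dt.strftime("%Y-%m-%d %H:%M:%S")


examples = [normalize_time(0), normalize_time("2103"), normalize_time("2021-03-05 12:30:00")]
```

Whatever the source representation, downstream tables only ever see the one standard format and the decoded values.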
Next is business judgment, which derives a business result from conditions. A business system will not have a field or piece of logic called "young people", for example. But if age is available, we can define people under 30 as young people when we organize the data. That is what we call business judgment.
Data joining is simple: table joins that supplement the data. Data aggregation is used heavily when we build DWS. Data filtering handles the invalid data we often encounter. Conditional selection is essentially CASE WHEN logic, somewhat similar to filtering.
The last step, and the one we use most, is business parsing. NoSQL stores and MySQL now support such fields, and some business teams even use Mongo, so a lot of business meaning ends up in one large field. When building DWD, we must parse these JSON or map fields into fixed columns, because, as noted, the content must be consistent and directly readable by users.
One lesson to share: close off business logic in the detail layer as far as possible, to keep data consistent and simplify downstream use. Changes at the source can be absorbed there through transcoding or format conversion, which keeps the detail layer's structure stable and avoids pushing changes downstream. A good model also needs cooperation from upstream business systems. First, the business system must be reasonably designed; second, changes must be noticed in time. In other words, building a data center is not the data team's job alone; it has to be done together with the business teams.
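The business judgment, business parsing, and aggregation steps described above can be sketched together. This is an illustrative Python version of logic that would normally live in SQL; the field names (`ext`, `channel`, `store_id`, `buyer_age`) and the sample orders are hypothetical.

```python
import json
from collections import defaultdict


def age_group(age):
    """Business judgment: derive a label the source system never stores."""
    return "young" if age < 30 else "other"


def parse_order(raw):
    """Business parsing: flatten the JSON 'ext' field into fixed DWD columns."""
    ext = json.loads(raw.get("ext") or "{}")
    return {
        "order_id": raw["order_id"],
        "amount": raw["amount"],
        "age_group": age_group(raw["buyer_age"]),
        "channel": ext.get("channel"),
        "store_id": ext.get("store_id"),
    }


def amount_by_channel(detail_rows):
    """Data aggregation: a DWS-style metric built from DWD detail rows."""
    totals = defaultdict(float)
    for row in detail_rows:
        totals[row["channel"]] += row["amount"]
    return dict(totals)


raw_orders = [
    {"order_id": 1, "amount": 30.0, "buyer_age": 25,
     "ext": '{"channel": "online", "store_id": "S1"}'},
    {"order_id": 2, "amount": 12.5, "buyer_age": 61,
     "ext": '{"channel": "offline", "store_id": "S1"}'},
    {"order_id": 3, "amount": 7.5, "buyer_age": 34,
     "ext": '{"channel": "online", "store_id": "S2"}'},
]

dwd_rows = [parse_order(r) for r in raw_orders]
dws_totals = amount_by_channel(dwd_rows)
```

Because the parsing and the "young people" rule are applied once in the detail layer, the aggregation step only deals with plain, unambiguous columns.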
What I have described so far is the development phase. If DataWorks did only this, we would consider it an IDE. But DataWorks is a one-stop big data development and governance platform, and a development platform must also guarantee execution. How do we make sure the code we develop actually runs? Through DataWorks task scheduling. Hema's business is complex: 30-minute delivery, next-day delivery, three-day delivery, pre-sales, pre-orders, and so on. A simple scheduling system may not be able to support this. DataWorks offers flexible scheduling cycles, such as monthly, weekly, and daily. Hema's business is a closed loop in which every part is related to the others, so its data tasks are interrelated too, and the overall scheduling chain is very complex.
Hema has tried many things along the way, made innovations, and fallen into plenty of pitfalls. One to share: data loss or errors can occur when a DataWorks task node does not run, or runs at the wrong time. Every problem with an online task must therefore be handled promptly, because each one turns into a data problem. A reasonable scheduling strategy ensures both correct and timely output. If a result is needed once a day, schedule it daily rather than hourly; if it is needed every three days, schedule it every three days.
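The dependency problem behind task scheduling can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not how DataWorks is implemented or configured; it only illustrates why a task must not run before its upstream tasks have finished. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on (hypothetical task names)
dependencies = {
    "ods_orders": set(),
    "ods_inventory": set(),
    "dwd_order_detail": {"ods_orders"},
    "dws_store_sales_1d": {"dwd_order_detail"},
    "ads_inventory_health": {"dws_store_sales_1d", "ods_inventory"},
}

# A valid execution order: every task appears after all of its upstream tasks.
run_order = list(TopologicalSorter(dependencies).static_order())


def runnable(task, finished):
    """A task may start only when all of its upstream tasks have finished."""
    return dependencies[task] <= finished
```

If `ads_inventory_health` were started while `ods_inventory` was still missing, it would compute on incomplete input, which is the kind of data error described above.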
Normally, once a project or requirement has gone through these steps, we might think the data engineer's job is done. Usually it is not. The data center is business-facing, so when it fails, the impact, in Alibaba's terms, is very large. Business lines distinguish core and non-core systems, department-level and group-level core systems, each with a different level of guarantee, and business teams classify faults as P1, P2, P3, and P4. Data services differ from ordinary business systems. We rely on DataWorks to keep all online big data tasks stable, and one very important module it provides is data quality monitoring. It lets us find problems in time, so that when the business is affected we are the first to know; the business often consumes data with some delay. The purpose of data quality monitoring is to ensure correct output, so coverage must be comprehensive: not just table size changes and fluctuations, but also conflicting field enumeration values, primary key conflicts, and illegal formats. Importantly, abnormal values should trigger an alarm or interrupt data processing, after which on-call staff should step in as soon as possible.
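The kinds of checks listed above (row count fluctuation, primary key conflicts, unexpected enumeration values) and the distinction between alerting and interrupting processing can be sketched as follows. This is a simplified illustration, not the DataWorks data quality rule engine; the 50% fluctuation threshold, field names, and allowed status values are assumptions.

```python
def check_row_count(today, yesterday, max_change=0.5):
    """Pass if the row count moved by at most max_change versus the previous run."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / yesterday <= max_change


def check_unique(rows, key):
    """Pass if the key column has no duplicate values."""
    values = [r[key] for r in rows]
    return len(values) == len(set(values))


def check_enum(rows, field, allowed):
    """Pass if every value of the field is in the allowed set."""
    return all(r[field] in allowed for r in rows)


def evaluate(rows, yesterday_count):
    """Return (action, failures). Blocking failures stop downstream tasks;
    warning failures only raise an alert."""
    blocking, warning = [], []
    if not check_unique(rows, "order_id"):
        blocking.append("duplicate primary key")
    if not check_enum(rows, "status", {"created", "paid", "cancelled"}):
        blocking.append("unexpected enum value")
    if not check_row_count(len(rows), yesterday_count):
        warning.append("row count fluctuation")
    if blocking:
        return "block", blocking + warning
    return ("alert" if warning else "pass"), warning


good = [{"order_id": 1, "status": "paid"}, {"order_id": 2, "status": "created"}]
duplicated = [{"order_id": 1, "status": "paid"}, {"order_id": 1, "status": "paid"}]
```

Splitting rules into blocking and warning groups mirrors the point in the text: some anomalies should stop processing outright, while others only need a person on call to look.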
That covers monitoring. But too much monitoring leads to a flood of alerts. DataWorks provides another capability for this: task baseline management. Just as business has levels, our online tasks are divided into important and less important ones, and we use baselines to separate them. Our experience is that a baseline guarantees the timely output of data assets, and its priority determines how much protection a task gets, both in system hardware resources and in on-call staff attention. The most important business must sit on a level 8 baseline, which ensures that the most important tasks are produced first. DataWorks also provides backfill tools: when a baseline goes wrong or is breached, you can quickly rerun the data. In addition, DataWorks intelligent monitoring uses baseline task status, historical run times, and similar signals to predict in advance whether a baseline will be breached. This monitoring and risk estimation is very useful.
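The idea of predicting a baseline breach from historical run times can be illustrated with a deliberately simple estimate: add the average past duration of each task on the chain to the current start time and compare the result with the deadline. This is an assumption-laden sketch, not the algorithm DataWorks uses; the task names, durations, and deadline are hypothetical.

```python
from statistics import mean


def predict_finish(start_minute, chain, history):
    """Estimated finish time (minutes after midnight) of the last task in a
    sequential chain, using the average of each task's past durations."""
    return start_minute + sum(mean(history[task]) for task in chain)


def will_break_baseline(start_minute, chain, history, deadline_minute):
    """True if the estimated finish time is later than the baseline deadline."""
    return predict_finish(start_minute, chain, history) > deadline_minute


# Past run durations in minutes (hypothetical).
history = {
    "ods_orders": [20, 22, 18],
    "dwd_order_detail": [30, 30, 30],
    "dws_store_sales_1d": [40, 50, 45],
}
chain = ["ods_orders", "dwd_order_detail", "dws_store_sales_1d"]

on_time = will_break_baseline(0, chain, history, deadline_minute=120)
delayed = will_break_baseline(60, chain, history, deadline_minute=120)
```

Starting on time, the chain is expected to finish at minute 95 and stay within the 120-minute deadline; starting an hour late, the estimate moves to minute 155 and a breach can be flagged before it happens.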
Data quality monitoring and baselines largely ensure that our big data tasks and business run stably. Another very important matter is data asset governance. Alibaba is a company that champions data. A major milestone in its transformation came when the hardware cost of data storage and computing exceeded the hardware cost of the business systems, which led Alibaba's CTO to make data asset governance a core task. DataWorks is the largest, arguably the only, data platform used across Alibaba Group, and it provides a data asset module called UDAP. UDAP shows current resource usage from the project level down to tables and even individuals, and it introduces the concept of a health score, which gives a comprehensive ranking of every individual in every business unit. The simplest way to do governance is to start with the worst offenders, and that is what Alibaba does: fix the lowest health scores first, and as those scores rise, overall consumption comes down. UDAP also provides many visualization tools so that you can quickly see the effect of governance. Some of Hema's experience: the main goals are to optimize storage and computing, reduce cost, and improve resource utilization. Technical teams create many project spaces, so governance has to be done together with them. The easier measures are taking unused applications offline, table life cycle management, eliminating repeated computation, and, most importantly, strictly prohibiting brute-force scans that waste computing resources. Some UDAP functions, such as detecting duplicate tables and duplicate data development and data integration tasks, are also available in the DataWorks resource optimization module.
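Two of the governance ideas above, table life cycle management and fixing the lowest health scores first, can be sketched as simple rankings. The health score here is just a number supplied as input; how UDAP actually computes it is not described in the text, so nothing about its formula is assumed. Table names, sizes, dates, and the 90-day idle threshold are hypothetical.

```python
from datetime import date


def idle_tables(tables, today, max_idle_days=90):
    """Tables not read within max_idle_days, largest first
    (candidates for a shorter life cycle or for removal)."""
    idle = [t for t in tables if (today - t["last_read"]).days > max_idle_days]
    return sorted(idle, key=lambda t: t["size_gb"], reverse=True)


def worst_first(health_scores):
    """Owners ordered from lowest to highest health score."""
    return sorted(health_scores, key=health_scores.get)


tables = [
    {"name": "tmp_export", "last_read": date(2020, 1, 5), "size_gb": 800},
    {"name": "dwd_order_detail", "last_read": date(2020, 11, 10), "size_gb": 5000},
    {"name": "old_campaign_log", "last_read": date(2020, 3, 1), "size_gb": 1200},
]
health_scores = {"team_a": 62, "team_b": 91, "team_c": 47}

candidates = [t["name"] for t in idle_tables(tables, today=date(2020, 11, 11))]
review_order = worst_first(health_scores)
```

Sorting idle tables by size and owners by score puts the largest savings and the weakest areas at the top of the list, which is the "start with the worst offenders" approach.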
With all of this done, most of what a data center should do is in place. The last important item is data security management. As the Internet has developed, China has issued a related law almost every year, such as the E-Commerce Law and the Cybersecurity Law, and a Data Security Law has recently been in drafting. Abiding by the law is essential for an enterprise. As the most unified entry and exit point for Alibaba's big data, DataWorks implements many data security controls. Access can be controlled at the engine level, at the project level, at the table level, and even at the field level. Each field has a sensitivity level; access to some fields requires approval from a department head or even a president. For some fields, such as ID card numbers and mobile phone numbers, we consider access risky even after approval. For these we apply data masking (desensitization): the masked data still supports statistics and analysis, but the raw values are not visible to you.
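Data masking can take several forms. The sketch below shows two common ones under stated assumptions: partial masking for display, and a salted hash that keeps values countable and joinable without revealing them. These functions are illustrative only and are not the masking rules built into DataWorks; the kept-digit positions, the salt, and the sample numbers are made up.

```python
import hashlib


def mask_phone(phone):
    """Keep the first 3 and last 4 digits of a mobile number, hide the rest."""
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def mask_id_number(id_number):
    """Keep only the first and last character of an ID number."""
    return id_number[:1] + "*" * (len(id_number) - 2) + id_number[-1:]


def pseudonymize(value, salt):
    """Stable token: the same input always maps to the same token, so
    distinct counts and joins still work on the masked column."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


masked_phone = mask_phone("13812345678")
masked_id = mask_id_number("110101199003071234")
token_a = pseudonymize("13812345678", salt="demo-salt")
token_b = pseudonymize("13812345678", salt="demo-salt")
```

The pseudonymized column is what makes "still usable for statistics, but not visible" possible: analysts can count distinct customers or join tables on the token without ever seeing the phone number.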
Hema's approach to data security governance is essentially the same as the group's. Alibaba Group has a unified data management approach tied to the organizational structure: when an employee leaves or changes roles, their permissions are revoked automatically. Personnel changes are frequent in any enterprise, Alibaba included. With this mechanism we keep data secure while still making good use of it.
V. The value of building a data center on DataWorks
Everything so far has been about building Hema's data center on DataWorks. I said at the start that a data center must serve the business, so let me describe how Hema's serves it. Shouyi and I were fortunate to witness Hema's rapid growth from 0 to 1 and then to N stores. An enterprise's use of data follows a similar path from shallow to deep. Like everyone, we began by simply looking at the data: what data do we have? Then we used data to spot problems and support manual decisions. But Hema expanded very fast, opening as many as 100 stores in a year. Once the business changed shape like that, simple reports and visualizations could no longer support it. So we also built many fine-grained controls, such as category diagnosis and inventory health, which tell the business what problems it has now, instead of leaving people to dig through reports to find them.