2025-02-07 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)12/24 Report--
Enterprise business is becoming more complex and diverse, and the demands on data processing capacity keep rising. Real-time analysis scenarios are increasingly common. Data technology changes rapidly, so it is important for enterprises to choose appropriate data technology to build their own real-time analysis capability.
In this issue, we are honored to invite Dr. Chang Lei, founder and CEO of Oushu Technology. He pointed out that the integration of the data lake and the data warehouse is a general trend and an urgent need, and that the era of the real-time lakehouse has arrived. He shared the development, construction paths, and methodology of the real-time lakehouse.
In addition, Chang Lei pointed out that this is the best era for technology entrepreneurship, but one with challenges. Technology has been developing for many years, and there are now fewer breakthroughs than before. With less and less breakthrough momentum, everyone is competing over the existing market. It is quite hard to win on the commercial level alone, so it still takes technical breakthroughs to break the pattern. He stressed the need to persist in innovation, because not advancing means falling behind. "Start from the demand. Don't always carry a hammer looking for nails; build the hammer based on the nails. Technology entrepreneurs are the most prone to this mistake, and overemphasizing the technology and the product is the biggest pitfall of technology entrepreneurship."
Follow-up question: why do we need a real-time lakehouse?
ITPUB: Dr. Chang Lei, it's nice to interview you. Please introduce yourself briefly.
Chang Lei: I graduated from Peking University with a Ph.D. in databases. After graduation I joined EMC, where I was a senior researcher and later director of the EMC/Pivotal R&D department.
When EMC acquired Greenplum in 2010, I led the research team at EMC in database kernel R&D and, building on Greenplum, developed a new product, HAWQ. HAWQ and GP (Greenplum) were both closed-source in the early days. In 2015 we open-sourced both products, and HAWQ went on to become an Apache top-level project.
Seeing the opportunity of the cloud era, in 2016 I left with a team to found Oushu Technology. At that time we wanted to build a new-generation cloud-native analytical database, which gradually evolved into today's real-time lakehouse. That is, structured, semi-structured, and unstructured data can all be integrated and processed together, on a cloud-native architecture that separates storage from compute. It combines the data lake and the data warehouse into a new generation of data platform.
ITPUB: In building a new-generation cloud-native analytical database, are you benchmarking against Snowflake?
Chang Lei: In fact, we benchmark against two foreign companies, Snowflake and Databricks. Snowflake started as a cloud-native data warehouse, that is, an analytical data warehouse. Databricks built on Spark; its early positioning was machine learning, followed by the Lakehouse. The former comes from the database side, while the latter comes from the Hadoop ecosystem and started from the data lake. Both are now moving toward the lakehouse, and we happen to benchmark against both. GP is an analytical database. HAWQ originated from GP; it is a SQL-on-Hadoop engine and part of the Hadoop ecosystem. We started from HAWQ, added cloud-native storage-compute separation, evolved it into OushuDB, and built the Skylab real-time data platform on top of OushuDB.
Lakes and warehouses used to be separate, but now the capabilities of the lake, such as transactional consistency and performance, are being enhanced. The warehouse could previously handle only structured data, but it is now gradually integrating capabilities such as stream processing.
We have expanded our product stack into a complete product matrix that gives enterprises a full data analysis stack and an end-to-end data analysis solution, like an aircraft carrier.
ITPUB: Your company's official website says you are ushering in the era of the real-time lakehouse. What kind of era do you see?
Chang Lei: After several generations of database development, the trajectory is fairly clear. The earliest were transactional databases such as Oracle and DB2. Transactional databases have not actually changed that much; the main change is from traditional centralized architectures to distributed ones.
The architecture of analytical databases has changed greatly, driven by changes in demand scenarios. In the early days an analytical database only produced statistical reports; later, as data volumes grew, it had to handle large amounts of data for BI. Then, in the big data era, different types of data appeared, volumes became very large, and processing became complex, which led to Hadoop big data platforms. In recent years, with the rise of cloud computing, lakes and warehouses have evolved into the cloud-native lakehouse.
The concept of lake-warehouse integration was first proposed in the United States, where it is called the Lakehouse. On its own, lake-warehouse integration means only that data silos are reduced. In the past, lakes and warehouses were separate and data had to be stored in both, which produced redundant copies rather than a single copy of the data and raised development and maintenance costs. The lakehouse did solve some of these problems and reduced customers' operating costs.
We think that talking only about lake-warehouse integration is not enough. From the perspective of application scenarios, the previous T+1 analysis can no longer meet the needs of many real-time scenarios, and there are more and more T+0 real-time scenarios. We not only have to integrate lakes and warehouses but also build a new technical architecture for real-time scenarios. Therefore, in the lakehouse era, we emphasize not only the technical architecture but also how the technology supports and fits application scenarios.
There is also the concept of the real-time data warehouse in the market. In Chinese it differs from "real-time lakehouse" by only one character, but the difference is significant. A real-time data warehouse handles structured data, whereas a real-time lakehouse is a product matrix that includes the real-time data warehouse; its scope is larger and it manages all kinds of enterprise data.
ITPUB: The core of both the real-time data warehouse and the real-time lakehouse is the real-time requirement. How do you think this requirement arose?
Chang Lei: More and more real-time scenarios are appearing, such as real-time dashboards, real-time reports, real-time metrics, real-time recommendations, anti-fraud, risk control, and IoT. For example, a user browsing products receives recommendations in real time.
From the business perspective, traditional T+1 processing can no longer support these demands, and the need for real time is already urgent.
The stronger a customer's IT capability, the greater its investment, and the better its business, the more attention it pays to real time. Some traditional enterprises are technically weaker and feel they do not need it for now because business is fine, but in fact their digital transformation has not been done well and there is still a lot of room for business improvement.
ITPUB: Maybe some companies really don't need it?
Chang Lei: It's not that they don't need it; they just don't think they need it. Once everyone else has done it, they follow. For any new scenario, new business, or new technology, there are a few innovative pioneers and many followers. Followers are the majority; innovators are only a small part.
As for real-time lakehouse cases, we have served many leading customers, with a leading customer in basically every industry. I think technology changes with business scenarios: often the business scenario comes before the technology, and sometimes a technology arrives first and opens up scenarios that could not be done before. For the real-time lakehouse, demand and technology seem to be advancing hand in hand. With demand on one side and technical progress on the other, we have arrived at the era of the real-time lakehouse.
ITPUB: What features make something a true real-time lakehouse?
Chang Lei: Based on the nature of a lakehouse platform, we have summarized its six characteristics as ANCHOR, where the six letters stand for: All Disparate Data (multi-source heterogeneous data), Native on Cloud (cloud native), Consistency (data consistency), High Concurrency (ultra-high concurrency), One Data in Open Format (a single copy of data in an open format), and Realtime (real-time T+0). Using the six ANCHOR characteristics, it is easy to judge whether a given system design truly qualifies as a lakehouse; they serve as the "anchor" that pins down what a lakehouse is.
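As a rough illustration of how the ANCHOR characteristics could be used as a checklist, here is a minimal Python sketch. The class name, field names, and functions are hypothetical and are not part of any Oushu product; the only thing taken from the interview is the list of six characteristics and the idea that a design should satisfy all of them.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AnchorChecklist:
    """Hypothetical yes/no assessment of a system design against ANCHOR."""
    all_disparate_data: bool       # A: multi-source heterogeneous data
    native_on_cloud: bool          # N: cloud native
    consistency: bool              # C: data consistency
    high_concurrency: bool         # H: ultra-high concurrency
    one_data_in_open_format: bool  # O: a single copy of data in an open format
    realtime: bool                 # R: real-time (T+0)


def missing_characteristics(check: AnchorChecklist) -> list[str]:
    """Return the names of the characteristics the design does not satisfy."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


def satisfies_anchor(check: AnchorChecklist) -> bool:
    """A design passes only if all six characteristics are satisfied."""
    return not missing_characteristics(check)
```

For example, a design that keeps separate copies of data in the lake and the warehouse and refreshes them nightly would fail on `one_data_in_open_format` and `realtime`.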
ITPUB: When it comes to real time, there are many technologies and concepts, such as the real-time data warehouse, the real-time lakehouse, unified stream and batch processing, and HTAP databases. What do you think enterprises need? Why does Oushu specifically emphasize the real-time lakehouse?
Chang Lei: Each of these concepts has its own application scenarios. Take HTAP: in a transactional database you sometimes need to run a few analytical queries, and in an analytical scenario you may need to handle a few transaction-oriented operations. Judging from current practice, however, databases are basically built separately by scenario. In banks, TP and AP are still separate and run by entirely different teams.
Scenarios generally have a focus: one focuses on analysis, another on transactions, and products are selected accordingly. We focus on analytical scenarios but also support some transactions. Other databases lean toward transactions and also support a little analysis. But when an enterprise actually purchases an analysis platform, no one looks for a transactional database; likewise, no one choosing a transactional database picks an analytical one. This is very clear in actual projects.
Why did Oushu choose the real-time lakehouse? I think the lakehouse is a necessity that everyone will adopt in the future, not icing on the cake. Analytical scenarios will move toward the real-time lakehouse platform, and now that enterprises are focused on reducing costs and increasing efficiency, the real-time lakehouse can bring great value.
ITPUB: With regard to real-time scenarios, many people talk about online, offline, and nearline scenarios. How do you understand real time?
Chang Lei: Gartner has a fairly clear definition of real time. By the timeliness of analysis, work can be divided into strategic decisions, tactical analysis, business operations, and automated processing, with timeliness requirements and analysis frequency increasing in that order. Strategic decisions, such as corporate acquisitions and overseas expansion, usually take several months to half a year of analysis. Tactical analysis, such as pricing strategies for market segments, usually takes weeks to a month. Automated processing, such as automatic credit card approval and quantitative stock trading, is usually completed in milliseconds. Business operations sit in the middle, ranging from 1 second to several days.
Therefore, for business operations we need a more specific definition of real time. Gartner considers anything within 15 minutes to be real-time or quasi-real-time. Based on our observation and practice, within 10 seconds can be regarded as strong real-time, and the range from 10 seconds to 15 minutes can be regarded as quasi-real-time. Many enterprises are upgrading traditional T+1 reports to minute-level quasi-real-time reports, which in my opinion could be turned into strong real-time interactive analysis.
Oushu's real-time lakehouse covers everything from offline to online and from quasi-real-time to strong real-time. We propose a concept called full real-time and on-demand real-time, which is supported by the Omega technology architecture.
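The latency tiers described above can be written down as a small classifier. This is only a sketch: the 10-second and 15-minute thresholds come from the interview, while the function name, the tier labels, and the choice to treat a latency of exactly 10 seconds as strong real-time are assumptions made for illustration.

```python
from datetime import timedelta

# Thresholds quoted in the interview for business-operation scenarios.
STRONG_REALTIME_MAX = timedelta(seconds=10)
QUASI_REALTIME_MAX = timedelta(minutes=15)


def classify_latency(latency: timedelta) -> str:
    """Map an end-to-end data latency to a tier (hypothetical labels)."""
    if latency < timedelta(0):
        raise ValueError("latency must be non-negative")
    if latency <= STRONG_REALTIME_MAX:   # assumption: boundary is inclusive
        return "strong real-time"
    if latency <= QUASI_REALTIME_MAX:
        return "quasi-real-time"
    return "offline"                     # e.g. a traditional T+1 report
```

Under these assumptions, a report refreshed every 5 minutes is quasi-real-time, and a T+1 report with roughly a day of latency is offline.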
ITPUB: What are the similarities and differences in what different enterprises need from a real-time lakehouse?
Chang Lei: Enterprises in the same industry have a lot in common, and their product needs are basically similar. However, the demand for innovation differs greatly among enterprises of different sizes. Large enterprises have more complex business scenarios and stronger technological innovation capabilities. For example, larger banks are much better at innovation than small and medium-sized banks. They are often the first to try new real-time scenarios, and small and medium-sized banks then follow.
Practice: how do enterprises build real-time lakehouses?
ITPUB: How do enterprises build real-time lakehouses?
Chang Lei: Depending on their current situation, enterprises take different approaches, which fall roughly into three categories.
First, the enterprise's earlier IT investment is relatively weak. It may have done almost no analytical scenarios, or it considers what it had too outdated: it built only a traditional ODS and has no modern big data platform. These enterprises usually build from scratch.
Second, the existing IT stack is relatively complete, possibly with lakes, warehouses, and data marts. The enterprise upgrades to a real-time lakehouse on top of its existing IT. For example, if your storage is HDFS, we can reuse that storage, use our compute layer, and add Oushu's real-time storage to transform the architecture into a real-time one.
Third, there is a traditional data warehouse but no Hadoop big data platform. In this case, the warehouse can be upgraded to a cloud-native architecture with storage-compute separation, starting with OushuDB. New components are then introduced for other new application scenarios, gradually forming a real-time lakehouse platform.
So there are basically three paths: building new, transforming from a lake to a real-time lakehouse, or transforming from a data warehouse to a real-time lakehouse.
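The three paths above amount to a simple decision based on what the enterprise already has. The following Python sketch is one possible reading, with hypothetical names throughout; in particular, treating "has a data lake or Hadoop platform" as taking precedence over "has a data warehouse" is an assumption drawn from how the second and third categories are described, not a rule stated in the interview.

```python
from enum import Enum


class BuildPath(Enum):
    NEW_BUILD = "build a new real-time lakehouse"
    FROM_LAKE = "transform from a data lake"
    FROM_WAREHOUSE = "transform from a data warehouse"


def choose_build_path(has_data_lake: bool, has_data_warehouse: bool) -> BuildPath:
    """Pick a construction path from the existing stack (illustrative only)."""
    if has_data_lake:
        # Category 2: reuse existing lake storage such as HDFS.
        return BuildPath.FROM_LAKE
    if has_data_warehouse:
        # Category 3: traditional warehouse, no Hadoop big data platform.
        return BuildPath.FROM_WAREHOUSE
    # Category 1: little prior investment, so build from scratch.
    return BuildPath.NEW_BUILD
```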
We mostly encounter new-build projects: a new platform is set up, hardware can be reused, and application scenarios are migrated gradually rather than all at once. For customers, a new build is relatively easy because it carries no major legacy burden. If a lot of business was already running, the transformation takes longer, several months to half a year, and we do our best to let the enterprise see value in the short term to build its confidence.
ITPUB: Can you share some methodology for project construction?
Chang Lei: Drawing on long-term exploration and experience from Oushu data platform projects, we have distilled the Oushu lakehouse construction methodology. It comprises three main sub-processes, Planning, Implementation, and Operation, which follow one another and form a closed loop. Strategy is an optional sub-process that generally applies when a lakehouse data platform is newly built or when industry customers have special requirements in a special construction context.
Figure: Logical view of the Oushu lakehouse construction methodology
The Oushu lakehouse construction methodology is compatible with the implementation methods of traditional data warehouses and avoids some of the drawbacks seen in past data lake rollouts. It takes into account the reality that many enterprises have been building data platforms for years, and it also absorbs the forward-looking trends from the rapid evolution of data technologies in recent years.
ITPUB: What do enterprises need to pay attention to when building a real-time lakehouse?
Chang Lei: Before a project is formally launched, we suggest that customers plan it in terms of industry implementation experience, the implementation cycle, and the overall cost of the platform, and that they design the whole and implement it step by step. Put simply: pick the right team, select the right product, and implement the project well. The Oushu methodology also offers several suggestions for avoiding pitfalls at the project initiation stage, as well as the key levers during implementation. For details, look out for our upcoming book on lakehouse construction methodology.
Prospect: real-time data technology and technology entrepreneurship in the era of AIGC
ITPUB: In the era of AIGC, what impact do AI technologies such as large models have on data technology?
Chang Lei: I think the rise of large models is a great benefit to us. Large models lower the threshold for using data: people can work with data in natural language, whereas in the past using data often required learning complex products and query syntax.
AIGC makes the data stack easier to use. For example, it can generate SQL automatically, and in the future model design and data governance could also be driven by natural language. So large models have a great impact on the industry, but for now they have not landed well in this vertical scenario.
Large models are still at a cutting-edge exploratory stage. They mostly handle general, basic scenarios; for vertical scenarios there is still a long way to go.
ITPUB: Many people say that this is a good time for technology entrepreneurs. As a technology entrepreneur, how do you meet the challenges and seize the opportunities?
Chang Lei: This is indeed the best time for technology entrepreneurs. If you really want to do something thoroughly and well, it is very difficult without technological innovation. But technology entrepreneurs also have limitations: knowing relatively little about business logic and requirements is a challenge too.
Technology has been developing for so many years that there are now fewer breakthrough technologies than before. With fewer and fewer breakthroughs, if everyone competes over the existing market, it is quite difficult on the commercial level, so it is still necessary to make technological breakthroughs to break this pattern. Technology entrepreneurship remains very important.
For example, when we talked about the real-time lakehouse three years ago, people were still hesitant and waiting to see; now there is basically a consensus. We hope enterprises can make good use of the real-time lakehouse and truly reduce costs and increase efficiency in their business.
ITPUB: Now that there are so many similar products on the market, what do you think of the competition in the industry?
Chang Lei: This is just like the "war of a hundred regiments" and "war of a thousand regiments" back then. After a new technology appears, a number of companies are bound to pursue it, which is very normal, and market competition is bound to grow fiercer. But who has the last laugh depends on strategy, technology, and products.
In fact, data technology develops very fast. Roughly every 10 to 15 years a new generation of platforms appears, and many vendors may fall behind through a moment of carelessness and be eliminated.
So you must always hold on to innovation and never let go of it. Do not assume that because the product is stable and meets demand there is no need to innovate. If you don't innovate, you will be eliminated. Some industries demand more innovation and change faster, others innovate more slowly, but all still need to seize opportunities and innovate. Transactional databases are somewhat simpler and develop more slowly, whereas development and change around big data are especially pronounced; it really changes with each passing day. I have lived through three generations of platforms, and we are now in the fourth generation: cloud-native storage-compute separation.
In a startup, not advancing means falling behind. We have kept innovating and evolving. At the beginning we built a cloud-native data warehouse, that is, an analytical database. Now we have become a real-time lakehouse, with the analytical database at the core of a product matrix. We have been iterating forward over the past few years.
ITPUB: Now everyone is talking about convergence, the great convergence of data technology, just as the feature phone, the MP3 player, and the camera all merged into the smartphone.
Chang Lei: Oracle has talked about convergence for a long time. Oracle supports all kinds of data scenarios, such as graph data and time series data, so convergence is not a new concept.
Now integration is talked about everywhere, and I think integrating some parts is possible. But integrating everything is bound to be a problem. If you ask one person to do everything, everything may get done, but certainly not everything will be done best. There should be a focus.
What an enterprise cares about is which problem you solve and how much value you bring, for example solving the problem of real-time scenarios. Take the lake and the warehouse: why should they be merged? You have to make the value clear first, and then discuss whether to integrate. The customer's perspective is solving problems and bringing value. The technologist's perspective may be "look, I can do anything, I am very skilled", and that perspective is not advisable.
Start from the demand. Don't always carry a hammer looking for nails; build the hammer based on the nails.
ITPUB: For practitioners, can you give some advice on how to keep up with the pace of technical iteration?
Chang Lei: For practitioners, I think it is important to follow new technologies and major trends closely. The new domestic trend is a new generation of database products and the real-time lakehouse. Do not rest on your laurels. Knowledge and technology are renewed very quickly nowadays, so we must keep equipping ourselves. For example, we are now offering some courses. I think traditional DBAs should take training and sharing sessions on these new technologies. Once others have mastered them, you will be in a dangerous position.