Author: Jankert (Rooney)

Preface

On October 7-9, 2019, just as the 70th National Day celebrations wrapped up, Flink Forward held its fifth conference, as usual in Berlin, Flink's birthplace. Although concrete numbers are not yet available, the fact that training tickets sold out before the event suggests Flink Forward continues to ride strong momentum: this edition again set new highs in attendance, submitted talks, and participating companies (excluding, of course, the numbers from last year's Flink Forward Beijing). Alibaba sent three speakers in total, the author included, who took part in four talks and two Q&A sessions. Based on the sessions I attended, I will give an overall introduction to the conference along with my personal impressions and thoughts, which I hope will be useful to interested readers.

Keynote

Let me start with the keynotes of the two days. The opening keynote on day one was, as usual, given by Stephan Ewen, the community's leading figure. He first summarized the current state of the Flink project:

- Flink passed 10,000 GitHub stars in August.
- Among all Apache projects, Flink ranks in the top 3 by mailing-list activity, a position that may well improve further.
- Version 1.9.0, released in August, is the largest Flink release so far in terms of both features shipped and code changed.

This picture nicely summarizes what Flink has focused on over the past half year:
As for Flink's possible future directions, Stephan again expressed his interest in the Application scenario, i.e., online services. He grouped what we usually call batch and stream computing under Data Processing, and message-driven applications, databases, and the like under Applications, with Stream Processing as the bridge between these two seemingly different worlds. When I first heard this I was rather confused; after chewing on it over the following days I formed some understanding of my own, which I will explain at the end of this article.

Speaking of Applications, one has to mention the currently popular FaaS (Function as a Service). Stephan's view is that in this space everyone has overlooked the importance of state. A typical Application scenario generally has these characteristics:

- The Application has one or more entry points, and its computing logic is driven by messages.
- The business logic is split into fairly fine-grained units, each executed by a Function; Functions call one another, and these calls are usually designed to be asynchronous.
- The logic of each Function may need state, for example with a database serving as the state store.
- When the complete computation finishes, the processed result is returned through a unified exit.

In this scenario we can see at least three requirements: the computing logic is message-driven and its call relationships must be flexible to organize; the computing logic needs state support; and in some cases the processing must guarantee exactly-once semantics, which is the hardest of the three. Imagine that our Application handles the order-placing flow of an e-commerce site, relying on a database as its state store. We have a dedicated inventory-management module and an order-creation module. A complete purchase first calls the inventory module to check whether the item is in stock and decrement the count in the database by 1; once that succeeds, the service calls the order module to create a new order in the database. When everything works, this logic is simple; failures make it messy. For example, if we have already decremented the inventory and then order creation fails, we have to find a way to roll the inventory back. Once there are a few more business units like this, the application code becomes extremely complex. This is a classic end-to-end exactly-once problem: we want a multi-step computation to either succeed completely or fail as if it had never happened. To solve it, building on Flink's existing foundations, Stephan announced a new project that combines Stateful Stream Processing with FaaS: statefun.io, or Stateful Functions, a brand-new way of writing stateful Applications.
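To make the model concrete, here is a minimal sketch of the inventory function written in the flavor of the Stateful Functions Java SDK. The SDK shipped after the conference and I am quoting its API from memory, so names may differ; ReserveItem, OutOfStock, and the function types are hypothetical, invented for the order example above:

```java
import org.apache.flink.statefun.sdk.Context;
import org.apache.flink.statefun.sdk.FunctionType;
import org.apache.flink.statefun.sdk.StatefulFunction;
import org.apache.flink.statefun.sdk.annotations.Persisted;
import org.apache.flink.statefun.sdk.state.PersistedValue;

public class InventoryFunction implements StatefulFunction {

  // Hypothetical function types and messages for the order example.
  public static final FunctionType TYPE = new FunctionType("shop", "inventory");
  public static final FunctionType ORDER_FN = new FunctionType("shop", "order");

  public static class ReserveItem {
    public final String orderId;
    public ReserveItem(String orderId) { this.orderId = orderId; }
  }
  public static class OutOfStock {}

  // Per-item stock count; the runtime checkpoints this together with
  // in-flight messages, which is what yields exactly-once consistency.
  @Persisted
  private final PersistedValue<Integer> stock =
      PersistedValue.of("stock", Integer.class);

  @Override
  public void invoke(Context context, Object message) {
    if (message instanceof ReserveItem) {
      int current = stock.getOrDefault(0);
      if (current > 0) {
        stock.set(current - 1);
        // Hand off to the order function. The state update above and this
        // message are committed atomically, so a failure in between cannot
        // leave the stock decremented without an order -- no manual rollback.
        context.send(ORDER_FN, ((ReserveItem) message).orderId, message);
      } else {
        // Tell the caller the purchase failed.
        context.reply(new OutOfStock());
      }
    }
  }
}
```

The point is that the runtime treats the state update and the outgoing message as one atomic unit, which is exactly the end-to-end exactly-once guarantee the order example needs.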
I will not go into the implementation details here; you can visit the official website to learn more. Stephan's first keynote was still fairly technical, which fits his personal style. After it, and across all the keynotes of the following day, the well-known big companies took the Flink stage one after another.

Cloudera

Cloudera kicked things off, saying that they have received more and more customer requests for Flink and therefore "bowed to popular demand" and added Flink support to their data platform. Winning a place in this commercial open-source vendor's portfolio is basically a sign that Flink has entered a relatively mature stage. Moreover, Cloudera is an old hand at open source, so it did not stop at simply shipping Flink: they announced that they have formed an engineering team led by two Flink PMC members and plan to invest more resources in the Flink community, injecting a fresh and powerful force into its growth.
AWS

AWS took the stage the next day, with the talk given by Rahul, the director responsible for EMR, Athena, DocumentDB, and blockchain. He first reviewed the evolution of AWS's stream-computing products:
As the picture shows, AWS added Flink to EMR as early as 2016, when Flink first came to prominence. Compared with Cloudera's late arrival, AWS was well ahead of the curve here. What impressed me is that AWS has followed a clear main line in its stream-computing products in recent years: launching products and solutions suited to customers of different sizes. They summarize nicely how product needs differ with customer scale (and presumably this holds beyond stream computing):
For example, they found that a large number of customers use a stream-computing framework merely to move data around, such as shipping data from Kinesis Data Streams (actually one of their message-queue services, despite the somewhat misleading name) to S3, or into Redshift or Elasticsearch. For this scenario they built a dedicated product, Kinesis Data Firehose, which lets users do all of this without writing any code. Customers with some development capability will write code or SQL to process and analyze their data, and for them AWS provides Kinesis Data Analytics. Another thing that struck me is how well AWS products work together (I later attended a Kinesis demo that chained many products together, which was impressive). Each product focuses on solving one slice of the problem, with essentially no functional overlap between products; each stays quite restrained in scope. Every real user scenario shared in the talk involved the collaboration of three to five products. The precise grasp of customer needs, and the way products are positioned to combine and solve user problems exactly, are both worth learning from.

But I digress; back to Flink. Rahul concluded that among systems consuming data from message queues, Flink is the fastest-growing they have seen, though still small in absolute volume. That matches Flink's current state: lots of heat and rapid growth, but a small absolute base, which also means plenty of room to grow.

Google

Google followed AWS, with a talk by Reuven and Sergei (the former is also one of the authors of Streaming Systems; I finally met him in person). Overall this talk had little to do with Flink itself; it shared the experience Google has accumulated over years of building stream-computing systems. The contrast with AWS was striking: AWS shared a summary of customer needs and products, while Google talked almost purely about engineering experience. I got a lot out of it, but for reasons of space I will not expand on it here. They even prepared a takeaway summary:
Main agenda

Since I could not be everywhere at once, I only picked sessions in areas I am interested in or know less about. For the completeness of this report, though, I will also briefly introduce topics I did not attend but am fairly familiar with. All videos and slides will be uploaded later for everyone to view. Below I group the talks into several categories according to my own understanding; if a topic interests you, do take a closer look at the video and slides.

Platform practice

Building a data platform on Flink remains one of the hottest topics. In recent years Alibaba's real-time computing team has spared no effort promoting the SQL-based approach to building data-processing platforms in the community, and by now most people seem to agree with this direction and have put it into production. Still, depending on the scenario, scale, and workload characteristics, some companies choose the lower-level, more flexible DataStream API to build their platforms, or use both. This matches our original judgment: SQL solves most problems, but not all of them; in scenarios that demand flexibility, DataStream solves users' problems more conveniently and efficiently.

Topic 1: "Writing an interactive SQL engine and interface for executing SQL against running streams using Flink". Eventador, a startup from the United States and one of the sponsors of this conference, spent most of the talk introducing the architecture and features of their product, which is broadly similar to the platform architecture used by us and other companies. Interestingly, they too found that platform users need both a higher-level API like SQL and the more flexible, lower-level DataStream API, with usage split roughly 8:2.
Another interesting feature is that they support JavaScript UDFs in SQL, which is very popular among their users. Continuously lowering the barrier to using SQL is indeed a reliable direction; our plan to provide Python UDF support comes from the same starting point.
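For readers unfamiliar with this extension point, here is a minimal sketch of a scalar UDF in Flink's Java Table API (circa 1.9). The JavaScript and Python UDFs mentioned above expose the same idea in other languages; the mask_email function and users table are made up for illustration:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;
import org.apache.flink.types.Row;

public class UdfExample {

  // A trivial scalar UDF: masks an email address before it is shown
  // to platform users.
  public static class MaskEmail extends ScalarFunction {
    public String eval(String email) {
      int at = email.indexOf('@');
      return at <= 1 ? "***" : email.charAt(0) + "***" + email.substring(at);
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // A toy "users" table; in a real platform this would come from a connector.
    DataStream<String> emails = env.fromElements("alice@example.com", "bob@example.com");
    tEnv.registerDataStream("users", emails, "email");

    // Register the function under a SQL-callable name, then call it
    // like any built-in function.
    tEnv.registerFunction("mask_email", new MaskEmail());
    Table result = tEnv.sqlQuery("SELECT mask_email(email) FROM users");

    tEnv.toAppendStream(result, Row.class).print();
    env.execute("udf-example");
  }
}
```

The design is the same in all three languages: the platform team registers the function once, and analysts can then call it from plain SQL.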
Topic 2: "Building a Self-Service Streaming Platform at Pinterest". Pinterest is a new face in the Flink community; this is the first time they have shared at a Flink conference. Their application scenarios center on advertising, using Flink to give advertisers real-time feedback on ad performance, a very classic Flink use case. As for why they adopted Flink so late, they explained up front: they spent a lot of time comparing Spark Streaming, Flink, and Kafka Streams, weighing the options repeatedly before settling on Flink, which counts as cautious and careful. Meanwhile, their legacy business mostly runs batch jobs on Spark, so after switching to streaming they also needed to show real results before a large-scale rollout inside the company.
They also shared two pits they filled during platform development. The first is log access: when all jobs run on YARN, viewing the logs of a finished job is a headache. The second is backfilling: when a new job goes online or job logic changes, they want to first chase through historical data stored on S3, then switch to a message queue such as Kafka to continue processing once the backlog is nearly caught up. This kind of backfilling is a most classic Flink streaming scenario and evidently a common hard requirement: if I remember correctly, three talks at this conference mentioned the problem along with their own solutions. Each solution has its merits, but if Flink supported the scenario directly in the engine, the experience would be much better (and this is indeed one of Flink's more important directions), as the sketch below suggests.
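Flink did later add engine-level support for exactly this pattern: the HybridSource API, which landed in Flink 1.13, well after this conference. A minimal sketch using the 1.15-era connector classes, with hypothetical S3 path, topic name, and switch timestamp:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackfillJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Hypothetical point where the archived history ends.
    long switchTimestamp = 1_600_000_000_000L;

    // Bounded history archived on S3.
    FileSource<String> history =
        FileSource.forRecordStreamFormat(
                new TextLineInputFormat(), new Path("s3://bucket/events/"))
            .build();

    // Live stream, starting where the history left off.
    KafkaSource<String> live =
        KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("events")
            .setStartingOffsets(OffsetsInitializer.timestamp(switchTimestamp))
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

    // One logical source: reads the files to completion, then switches to Kafka.
    HybridSource<String> source =
        HybridSource.builder(history).addSource(live).build();

    env.fromSource(source, WatermarkStrategy.noWatermarks(), "backfill-then-live")
        .print();
    env.execute("backfill-job");
  }
}
```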
Other recommended topics:

"Stream SQL with Flink @ Yelp": Yelp is a veteran Flink player; in this talk they summarize their current stream-computing scenarios and what their platform does. I missed it due to a schedule conflict, but feedback from others suggests they are quite adept players; I recommend watching once the video and slides are online.

"Flink for Everyone: Self-Service Data Analytics with StreamPipes": platform building is usually an internal company project and rarely open-sourced. FZI, a non-profit organization, stepped up as the good Samaritan here, offering a fully open-source platform implementation: StreamPipes. It has a complete drag-and-drop job-building workflow and the interface looks quite good; readers in need can take a look.

"Dynamically Generated Flink Jobs at Scale": a Flink-based platform practice shared by Goldman Sachs, supporting 120,000 job runs a day. IT teams in banking and finance may find it a useful reference.

Space is limited, so I will not list the other related talks one by one. Overall, building a data platform on Flink is by now a fairly mature practice, with successful cases across industries to draw on. For those who have not yet gotten on board: what are you waiting for?

Use cases

Beyond platform practice, using Flink to solve concrete problems in specific application scenarios was another hot direction at this conference. These users typically write a small number of jobs themselves to solve practical problems, or are platform users sharing how they use the platform to solve problems closer to the end user. This is where Flink creates real business value. I wanted to hear more of these, but schedule conflicts kept getting in the way.

Topic 1: "Making Sense of Streaming Sensor Data: How Uber Detects On-trip Car Crashes". A wildly creative application shared by Uber: they use Flink to determine in real time whether a passenger has been in a car accident. Like Pinterest, Uber migrated this business from Spark to Flink for latency reasons. They described how they rely on the two most important signals, GPS information and phone accelerometer data, and then apply machine-learning models to decide in real time whether passengers were involved in a crash.
They also mentioned wanting to share the data collected for this business, along with features derived from it, so that other teams can reuse them (why does it feel like this is drifting toward becoming a platform again -_-!).
Other recommended topics:

"Airbus makes more of the sky with Flink": Airbus describes how it uses Azure and Flink to analyze flight data in order to provide a better flight experience.

"Intelligent Log Analysis and Real-time Anomaly Detection @ Salesforce": Salesforce describes how they combine Flink with machine-learning models for real-time log analysis, detecting anomalies such as performance degradation of critical services.

"Large Scale Real Time Ad Invalid Traffic Detection with Flink": Criteo, a French advertising company, introduces real-time invalid-traffic detection in advertising scenarios.

"Enabling Machine Learning with Apache Flink": Lyft shares how they built a machine-learning platform on Flink to solve a wide variety of business problems.

To sum up, in application-oriented scenarios we are seeing more and more cases that combine Flink with machine learning. Slightly complex problems are hard to decide with rule logic or SQL alone, and that is where machine learning shines. For now, people tend to train models with other engines and have Flink load the models for real-time prediction. In that setup, though, the two engines may process samples with subtly different logic, which hurts the final results. This is an opportunity for Flink: if it can better support the more batch-oriented work of model training, users could run the whole pipeline of sample assembly, model training, and real-time prediction on a single engine. The development experience, and likely the online results, would improve greatly; let us wait and see what Flink does here.

Production practice

This category covers experience from running Flink in production. Regrettably I did not attend any of these sessions, so I will introduce them briefly based on their abstracts; interested readers can consult the materials themselves.

"Apache Flink Worst Practices": you have probably heard plenty of best practices; this talk, on the contrary, is devoted to the worst ways to use Flink, sharing all kinds of pits and landmines so the audience can avoid them.

"How to configure your streaming jobs like a pro": Cloudera distills their experience tuning hundreds of streaming jobs over the years, covering which parameters matter most for different types of jobs.

"Running Flink in production: The good, the bad and the in-between": Lyft shares their experience operating Flink, what Flink does well and where it falls short, giving a more complete picture of running Flink jobs in production.

"Introspection of the Flink in production": Criteo shares how to observe whether a Flink job is healthy, and how to locate the root cause as quickly as possible when something goes wrong.

"Kubernetes + Operator + PaaSTA = Flink @ Yelp": while most people still run Flink on YARN, Yelp, a deep player, is already ahead of the pack; this is the only talk I saw at this conference with Flink + K8S running online.
Although I did not attend a single one of these talks, I picked up production themes from other sessions, the most prominent being the combination of Flink and Kubernetes. K8S is so popular that everyone fears falling behind if they don't jump on the bandwagon, and many companies are willing to explore this direction; Yelp has moved fastest and already has the architecture online. Personally I find the Flink-on-K8S combination quite sound: it unlocks more Application- and online-service-related possibilities. Alibaba's real-time computing team is not behind here either; we have been working with Alibaba Cloud's K8S for quite a while and recently launched Ververica Platform, a new-generation real-time computing product based on K8S containerization.

Research projects

The topics above are mostly engineering practice, but several research projects at this conference also caught my interest. As the ecosystem flourishes, research-oriented projects are indispensable alongside the practice of big companies. I heard the conference received many research submissions but could accept only a few due to limited slots.

Topic 1: "Self-managed and automatically reconfigurable stream processing". A research project on automatic configuration of stream-computing jobs, presented by a postdoc at ETH Zurich. Their research focuses on making streaming jobs autonomous, automatically adjusting to their optimal state without human intervention, which coincides with Google's keynote hope that systems should have strong dynamic adjustment capabilities built in. The talk had two parts. The first proposes a new theory of performance-bottleneck analysis. When we want to optimize the throughput and latency of a streaming job, we traditionally look at CPU hotspots, find the most CPU-hungry part of the job, and optimize it. But we often overlook the various waits that affect latency or throughput, such as operators waiting for data to arrive. Optimizing a CPU hotspot in isolation may merely lengthen the waits elsewhere in the pipeline, yielding no real latency drop or throughput gain. So they propose a "critical path" theory that judges and measures performance bottlenecks at the granularity of whole paths: only by genuinely reducing the time spent along the critical path can job latency be effectively reduced.
The second part introduces a new automatic scaling mechanism for jobs, compared against Microsoft's Dhalion. Where similar systems always make scaling decisions for one operator at a time, their approach considers multiple operators simultaneously, scaling several operators in a single action so that fewer actions are needed to converge, as the sketch below illustrates.
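To illustrate the idea (this is a hypothetical sketch of "decide all operators jointly", not their actual algorithm): given each operator's observed per-slot throughput and a target pipeline rate, compute every operator's new parallelism in one pass, instead of adjusting one operator, waiting, and re-observing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HolisticScaler {

  /**
   * observedRatePerSlot: records/s one parallel instance of each operator sustains.
   * targetRate: records/s the whole pipeline must sustain (e.g., source ingest rate).
   * Returns a new parallelism for every operator, decided in a single step.
   */
  static Map<String, Integer> plan(Map<String, Double> observedRatePerSlot, double targetRate) {
    Map<String, Integer> plan = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : observedRatePerSlot.entrySet()) {
      // ceil(target / perSlotRate): the parallelism this operator needs.
      plan.put(e.getKey(), (int) Math.ceil(targetRate / e.getValue()));
    }
    return plan;
  }

  public static void main(String[] args) {
    Map<String, Double> rates = new LinkedHashMap<>();
    rates.put("source", 50_000.0);
    rates.put("join", 12_000.0);
    rates.put("sink", 30_000.0);
    // One rescaling action covers all three operators: {source=3, join=10, sink=4}.
    System.out.println(plan(rates, 120_000.0));
  }
}
```

A one-operator-at-a-time controller would instead scale the join, re-measure, discover the sink is now the bottleneck, scale again, and so on, taking several convergence rounds.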
The autonomy of stream-computing jobs is a direction I am very interested in. I have seen many research projects and papers on it, but no in-depth sharing from industry (AWS's Kinesis service can scale up and down dynamically, but with few details disclosed it is unclear whether the mechanism is general enough for more complex scenarios). Alibaba's real-time computing team started a similar project about a year ago to explore this direction; facing a large number of internal business scenarios and requirements, plus the current cutting-edge research, I believe a breakthrough can come in the near future.

Other recommended topics:

"Moving on from RocksDB to something FASTER": also from ETH Zurich, research on state storage, looking for solutions faster than RocksDB. Alibaba's real-time computing team is likewise investing in state backends; we are exploring a storage engine implemented entirely in Java.

"Scotty: Efficient Window Aggregation with General Stream Slicing": introduces a slicing-based approach to improving window-aggregation performance.

In-depth technical analysis

This category mainly covers the major features and refactorings Flink has undertaken over the past one or two versions. As a Flink developer I am familiar with this work, so I skipped these sessions; the two pictures from Stephan's keynote already summarize them well.
If you are interested in a particular technical point, you will generally find a corresponding talk; I will not go through them one by one.

Summary and reflections

In recent years, with Alibaba's continued heavy investment in Flink, the project's maturity and activity have made a qualitative leap. The community ecosystem grows ever more prosperous: Cloudera and AWS have begun actively embracing Flink with good results, and the big companies' talks have shifted from trying Flink out in the early years to sharing the results and lessons of large-scale production deployments. On this basis, Flink-based platform practice, business problem-solving combined with machine learning, and some rather novel exploratory research have all taken shape, making the ecosystem's development more complete and robust. Beyond that, Flink is actively exploring new hot directions such as the combination with K8S and with online-service scenarios, reflecting the strong vitality of this ecosystem.

But in the end, Flink is still a big-data computing engine whose purpose is to solve big-data computing problems. As I mentioned at the beginning, when I saw Flink pushing into Application and FaaS territory, one question kept nagging me: what kind of computing engine is Flink, and what problem is it trying to solve? Without a clear main line and a long-term view, an engine's development easily goes astray and eventually fails. Most people may still think of Flink as a mature real-time computing engine, yet Flink has been thinking about batch processing since day one; and even as it gradually fills in the gaps in batch, it is already exploring online-service scenarios such as Application. At first glance Flink seems to want to solve every problem and reach into every direction. Is that really so? After attending the whole conference with this doubt, and thinking it over for a few extra days, I arrived at some new ideas.

To answer what kind of computing engine Flink is and what problem it wants to solve, we have to start from the data itself, since data is what a computing engine operates on. First question: where does the data we process come from? For most companies, it comes from mobile apps, IoT devices, online service logs, user queries, and so on. The sources and types vary, but one feature is common in most cases: data is always generated continuously, in real time. We can therefore abstract the data we need to process with the concept of a Stream or a Log, which is by now a popular abstraction that Jay Kreps promoted tirelessly in his early years; interested readers can read his blog post "The Log: What every software engineer should know about real-time data's unifying abstraction".
Let me answer a few common questions here, since this abstraction may look different from the data you usually deal with.

"The data I work with lives in a database; isn't that different?" A database can be understood as the materialization of these streams, usually so that subsequent frequent access is faster. And in the implementation of most database systems, a log is in fact what stores all the inserts, deletes, and updates.

"The data I work with sits in a data warehouse, partitioned by day; isn't that different?" Think again about where that data comes from. When data is first generated, it does not land directly in your warehouse; it generally goes through an ETL process first. A data warehouse can be understood as a finite segment of a past stream, transferred into a more efficient format.

Once we abstract data this way, we can consider what kinds of computation we perform on it. Start with finite streams:

- Simple cleaning and processing of a chunk of past data: basically what most classic batch ETL jobs do.
- Somewhat more complex association and analysis of past data: analytical batch jobs, a step up from ETL.
- Deep mining of past data to produce deeper insight: the scenario of machine-learning model training.

For infinite streams, we keep consuming newly generated data, and the possible computation types are:

- ETL- and analytics-style data processing similar to batch, except the computation happens on the latest real-time data.
- Feature analysis and mining of newly generated data: the scenario of real-time machine-learning training.
- Sampling newly generated data and applying a machine-learning model to decide: the classic real-time prediction scenario.
- Triggering a series of background business logic based on newly generated data: the typical Application or online-service scenario.
It is especially worth noting that finite-stream and infinite-stream computations are not completely independent; sometimes we need to switch between the two, for example:

- Process all historical data first, then start consuming the latest data in real time. For instance, in a statistics scenario, when the metric definition changes we want to re-compute all the historical data first, then connect to the latest data for continued real-time statistics.
- Generate samples from historical data and train a model, then consume the latest data, convert it into samples, and start making real-time predictions and decisions. This is a typical machine-learning practice, where the key is to keep the sample logic of training consistent with that of real-time decisions.

In addition, we can roughly classify these computing modes along the axis of computational latency:
Having walked through so many examples and scenarios, the point should be clear: once we abstract all data as streams, the computing patterns over that data are quite diverse. As Stephan said at the start of his keynote, the traditional categories of Data Processing and message-driven Application are not enough to cover all of these computing modes. In essence they are all Stream Processing; sometimes we process finite data, and sometimes the latest real-time data. Flink's vision is to become a general-purpose Stream Processing engine and, on top of that paradigm, cover all the more specific computing scenarios it implies. That way, users with different computing needs no longer have to pick multiple systems (as in the classic lambda architecture, which requires one dedicated batch engine and one dedicated stream engine), and when we need to switch between computing modes (say, processing historical data first and then connecting to real-time data), using the same engine also helps guarantee consistency of behavior.