Shulou (Shulou.com) 06/01 report --
Since January 2019, Alibaba has been gradually merging its internally maintained Blink back into the Flink open source community and has contributed more than 1 million lines of code. Flink's users include Chinese companies such as Tencent, Baidu, and ByteDance, and international companies such as Uber, Lyft, and Netflix.
Flink 1.9.0, released in August this year, was the first release since Alibaba's internal Blink branch began merging into Flink. At today's Flink Forward 2019 conference, Alibaba presented a feature preview of Flink 1.10, whose official release is expected in January 2020.

Flink 1.10 preview: all of Blink's functionality enters Flink

According to the presentation, Flink 1.10 can be regarded as a fairly important milestone: with this release, all of Blink's functionality has entered Flink, including Blink's key designs and general optimizations. The main features and technical highlights expected in this release are:

1. A more powerful Blink Query Processor
(1) DDL enhancements from the Blink/Flink merge: support for defining computed columns and watermarks in CREATE TABLE statements (see the sketch after this list); at the batch level, full support for the TPC-H and TPC-DS test suites, with TPC-DS 10T performance 7 times that of Hive 3.0.
(2) A complete scheduler refactoring, supporting more flexible batch scheduling strategies.
(3) More complete, finer-grained, and more flexible resource management: the TaskExecutor memory model has been reworked, solving long-standing problems such as RocksDB memory being hard to configure and control, and memory accounting being inconsistent before and after TaskManager startup; memory calculation logic is simplified and configuration is easier; resource usage can be managed at the operator level, fixing performance and stability problems caused by operators over-consuming resources and improving resource utilization.

2. Hive compatibility, ready for production
(1) Metadata compatibility: Hive catalogs can be read directly, covering Hive versions 1.x, 2.x, and 3.x.
(2) Data format compatibility: Hive tables can be read directly, and writing in Hive table formats is also supported.
(3) UDF compatibility: Hive UDF, UDTF, and UDAF can be called directly within Flink SQL.

3. More powerful Python support
Support for native Python UDFs is added, so users can develop their business logic in Python, with dependency management for Python libraries: users can not only define their own Python UDFs but also integrate existing Python libraries. Architecturally, Flink and the Beam community introduce the Beam Portability Framework to build a convenient, high-performance Python UDF execution framework, while Flink's resource management framework manages and controls Python UDF resources.

4. Native Kubernetes integration
(1) Native resource management: TaskManagers can be requested dynamically according to a job's resource requirements, without relying on external systems or components.
(2) More convenient job submission: no tools such as kubectl need to be installed, giving an experience similar to YARN.

5. A new library of mainstream machine learning algorithms, including logistic regression, random forest, KMeans, and others.
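As a concrete illustration of the DDL enhancements in point 1, here is a minimal PyFlink sketch. It assumes a release where `TableEnvironment.create` and `execute_sql` are available (1.11+ style APIs; 1.10 itself still used `sql_update`), and uses the built-in datagen connector with illustrative table and column names so it stays self-contained.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming environment. In 1.10 the Blink planner had to be selected
# explicitly; in later releases it is the default.
t_env = TableEnvironment.create(
    EnvironmentSettings.new_instance().in_streaming_mode().build())

# DDL with a computed column (`total`) and a watermark, the two DDL
# enhancements described above.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id BIGINT,
        price    DOUBLE,
        quantity INT,
        total AS price * quantity,                    -- computed column
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  -- event-time watermark
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# Query the derived column; runs until cancelled, since the source is unbounded.
t_env.execute_sql("SELECT order_id, total, ts FROM orders").print()
```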
Q: In version 1.10, all of Blink's functionality has entered Flink, just three months after the 1.9 release, which was itself the first release to incorporate Blink, and only about a year after Alibaba announced it would open-source Blink. Why has the Blink merge progressed so fast? What problems came up along the way, and how were they solved?

Mo Wen: We invested a lot of resources, including dozens of engineers working with a high degree of parallelism, which is how we could contribute as many as 1.5 million lines of code in a relatively short period.

Q: Were there any thorny problems in the process?

Mo Wen: The community is an open and transparent place. Unlike your own project, where you can change things at will, everything has to go through a democratic process: community discussion, everyone's approval, safeguards on code quality, and so on. Making rapid progress while guaranteeing quality and fairness within the community was a real challenge for us.

Q: So how do you balance the two?

Mo Wen: The Flink community's collaboration model is quite efficient. The leaders of the different modules hold video conferences every week, sometimes community discussions spanning different countries; these are very efficient, and the project management is very good. With this mechanism as a guarantee, we could get code in quickly while keeping up the pace of iteration. It is also a major test of engineering efficiency. To put it bluntly, we assigned a lot of engineers to this, but it is not just about headcount. Many of the people we assigned are PMC members and Committers of Apache projects rather than ordinary engineers. They are familiar with how Apache projects work, and their efficiency and effectiveness cannot be counted as one person each. That is how communities are: it is not about having many people, it is about having the right people.

Q: You mentioned in your talk this morning that Flink is becoming a true Unified Engine. Interestingly, we have heard similar claims from other computing engines recently; for example, Spark's core idea is to become a "unified data analytics platform". Could you talk about Flink's design philosophy, and the similarities and differences between the two?

Mo Wen: We have stressed Flink's core idea many times: its essential computing model is stream processing. Everything is processed as a stream, and a batch can be seen as a bounded stream. The Stateful Functions work mentioned today is likewise event-driven: events keep triggering function computations, performing stateful online computation, returning results to users, and iterating again and again. Online services are effectively unbounded and never stop processing, because users keep arriving. Flink's core is a stream-based core that covers both offline and online, so it is not quite the same as Spark. Spark's view is that everything is a batch, and a stream is just many batches strung together. Even so, everyone's macro vision is similar: use one computing engine, one set of big data technology, to cover as many scenarios as possible, so that from the user's point of view the learning cost is lower, development is more efficient, and operations are cheaper. Our goals and ideas are the same; the paths we chose to reach them differ.
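To make "a batch is a bounded stream" concrete, here is a minimal PyFlink sketch (again assuming a 1.11+ style API): the identical Table program runs in batch mode over a finite input, and swapping `in_batch_mode()` for `in_streaming_mode()` would run the same query as a continuously updating streaming job.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch mode: the query below runs over a bounded (finite) stream.
# Swap in_batch_mode() for in_streaming_mode() to run the same query
# continuously over an unbounded stream instead.
env = TableEnvironment.create(
    EnvironmentSettings.new_instance().in_batch_mode().build())

t = env.from_elements([(1, "click"), (2, "click"), (3, "buy")],
                      ["user_id", "event"])
env.create_temporary_view("events", t)

env.execute_sql(
    "SELECT event, COUNT(*) AS cnt FROM events GROUP BY event").print()
```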
Q: We have put the following question to Databricks engineers before, and we would like to ask you as well: if you want to be the unified platform and they also want to be the unified platform, does it come down to who ultimately unifies whom?

Mo Wen: I don't think it is the case that whoever does something first wins and all is well. My personal view is that technology needs a degree of healthy competition so that everyone can learn from each other. At the same time, all roads lead to Rome; no single road may be absolutely right. Different scenarios may come with different preferences or regional needs, or suit different use cases. For solving similar problems, having two or three companies coexist is a fairly healthy state, just as the database field has MySQL, PostgreSQL, and so on; online services are similar, where at least two big players competing together is appropriate. In the end, who does better depends on whether you can push your own theory to the extreme. A theory is just a theory; yours and mine may sound different, but who wins comes down to the details, including the user experience: whether you are doing things the right way and whether the details are polished enough. If everyone sounds the same, the difference lies in the details and in how the community ecosystem develops and spreads.

Alink open-sourced: how is Flink machine learning progressing?

Flink's progress in machine learning has been a focus for many developers. This year Flink reached a small milestone: the machine learning algorithm platform Alink was open-sourced, marking Flink's formal entry into the AI field.

Alink open source project: https://github.com/alibaba/Alink

Alink is a new-generation machine learning algorithm platform that Alibaba's machine learning algorithm team has been building on the real-time computing engine Flink since 2017, providing a rich library of algorithm components and a convenient operating framework. With it, developers can build end-to-end algorithm model development covering data processing, feature engineering, model training, and model prediction. As the industry's first machine learning platform to support both batch and streaming algorithms, Alink provides a Python interface, so developers can build algorithm models without a Flink background (see the sketch below).

The name Alink comes from the shared parts of the related names (Alibaba, Algorithm, AI, Flink, Blink). Alink is reportedly already widely used in Alibaba's core real-time online businesses such as search, recommendation, and advertising. During the just-concluded Tmall Double 11, daily data throughput reached 970 PB with a peak of 2.5 billion records per second; Alink withstood this test of very-large-scale real-time data training and helped raise CTR (item click conversion rate) by 4%.
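The Python interface mentioned above looks roughly like the following sketch, modeled on Alink's public PyAlink quickstart; treat the class and parameter names (`CsvSourceBatchOp`, `VectorAssembler`, `KMeans`) and the file path as assumptions that may differ between releases.

```python
from pyalink.alink import *

useLocalEnv(2)  # run against a local Flink mini-cluster

# Read a CSV file as a batch source; the schema is declared as a string.
# The file path is a placeholder.
source = CsvSourceBatchOp() \
    .setFilePath("iris.csv") \
    .setSchemaStr("sepal_length double, sepal_width double, "
                  "petal_length double, petal_width double, category string")

# Assemble the feature columns into one vector column, then cluster with KMeans.
assembler = VectorAssembler() \
    .setSelectedCols(["sepal_length", "sepal_width",
                      "petal_length", "petal_width"]) \
    .setOutputCol("features")

kmeans = KMeans() \
    .setVectorCol("features") \
    .setK(3) \
    .setPredictionCol("cluster")

model = Pipeline().add(assembler).add(kmeans).fit(source)
model.transform(source).firstN(5).print()
```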
Q: Can you first give an overview of FlinkML and Alink, and the relationship between them?

Mo Wen: FlinkML is the existing machine learning algorithm library in the Flink community. It has been around for a long time and is updated slowly. Alink is based on the new generation of Flink; it was rewritten from scratch and shares no code with FlinkML. It was developed by Alibaba's machine learning algorithm team, was used inside Alibaba after it was built, and has now been officially open-sourced. Going forward, we hope Alink's algorithms will gradually replace FlinkML's, and Alink may become the new generation of FlinkML. Of course the replacement will take a long time: Alink contains many machine learning algorithms, and contributing or publishing them to Flink requires a lot of bandwidth. We worried the whole process would take too long, so we open-sourced Alink separately first, so people can use it right away if they need it. If the subsequent contributions go smoothly, Alink should be able to merge fully into FlinkML, that is, directly into the Flink mainline, which would be the best destination for Alink. At that point FlinkML would correspond fully to SparkML.

Q: Apart from Alink, what progress is Flink making in machine learning? Compared with other computing engines, how do you rate Flink's current work in machine learning and AI; is it competitive enough?

Mo Wen: We actually have a lot of work in progress. The core of machine learning is iterative computation: training iterates over the data to produce a model, which then goes online. On top of this core, Flink is designing a new iteration mechanism. Because Flink is based on streaming, its iterations can become mini-batch iterations: according to record counts or time spans, the stream can be cut into many fine-grained segments. Flink's advantage is that fine-grained segmentation on a stream poses no feasibility problem, since the stream is pure and can simply be truncated into segments. Spark's iteration works on a data set: iterate once, then iterate again; that data set is very hard to cut especially fine, because each cut becomes a task to run, so fine granularity is a bigger challenge. Since Flink can cut granularity very fine, rebuilding the original iterative computation is feasible. Flink's earliest iteration mechanism, like Spark's, was either whole-batch iteration or per-record iteration, two extremes. We want to make this an abstraction where the iteration batch size can be set by time or by size, similar to Flink's window concept, so we can then support nested iterations, incremental iterations, and so on. Once we have stream-based iteration at the engine level, machine learning training as a whole will be greatly accelerated; the algorithm's results may be the same, but the performance and speed of a run will not be. It also addresses online training: internet log streams and user behavior are generated continuously, and Flink's streaming iteration can process user-generated real-time data without interruption and update the model online iteratively. The model can be updated every 5 minutes or even every minute, so model deployment becomes a 7x24 rolling update. Such an online learning system will bring big changes for users, not merely a 30% improvement or an engineering optimization, but a change in how machine learning itself is used. This is what we are working on; it has already been discussed in the community and may be the focus of the first or second Flink version next year.
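Since the streaming iteration described here is still being designed, the following framework-free Python sketch only illustrates the idea: the stream is cut into fine-grained mini-batches, each of which refreshes the model, so serving always sees a recently updated model. All names and the learning rule are illustrative, not Flink APIs.

```python
import random

def event_stream(n):
    """Simulate an unbounded stream of (features, label) events."""
    for _ in range(n):
        x = [random.random(), random.random()]
        yield x, 1.0 if x[0] + x[1] > 1.0 else 0.0

w = [0.0, 0.0]              # the online model, refreshed per mini-batch
lr, batch_size, buffer = 0.1, 32, []

for x, y in event_stream(10_000):
    buffer.append((x, y))
    if len(buffer) == batch_size:       # one fine-grained segment of the stream
        for xi, yi in buffer:           # one gradient-style pass over the segment
            pred = 1.0 if sum(wi * v for wi, v in zip(w, xi)) > 0.5 else 0.0
            w = [wi + lr * (yi - pred) * v for wi, v in zip(w, xi)]
        buffer.clear()                  # serving can now pick up the newer model

print("weights after online training:", w)
```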
You can think of last year's Flink as the Unified Engine and this year's as Flink embracing AI. Much of our work in 2019 leaned toward SQL optimization; next year we will cut more toward AI, that is, toward FlinkML and AI scenarios.

Q: When did Alibaba decide to open-source Alink?

Mo Wen: When Blink was open-sourced last year, we considered whether to open-source Alink along with it. But the first open-source release had not yet been made, and we didn't dare take such a big step at once; we had to go step by step, and open-sourcing Blink itself required a lot of preparation. We had no way to open-source two big projects at the same time, so we did Blink properly first. After Blink was open-sourced, we wondered whether we should simply push Alink's algorithms into Flink. But contributing to a community turns out to be a complex process: pushing Blink already consumed a lot of bandwidth, and the community only has so much bandwidth; it cannot do many things at once and needs time to digest. So we decided to let the community digest Blink first, contribute it fully, and then gradually contribute Alink back. This is a process that cannot be skipped.

Open-sourcing is a very prudent undertaking; you cannot casually open up a project and then leave it unattended. If you put something out, you must have a long-term plan, and being responsible means sending everyone a very clear signal that this is a long-term commitment: releasing the source is not the end. Users will certainly ask whether you will maintain the project after publishing it. If we did not think these problems through, it would backfire: users would feel they had not been given a clear signal and would not dare to use it.

Q: Compared with SparkML, what are Alink's highlights? In what ways will it be more attractive to developers?

Mo Wen: First, Alink is built on the Flink computing engine layer. Second, within the Flink framework there are UDF operators, and Alink itself optimizes the algorithms heavily, including detailed optimization of the algorithm implementations, such as communication, data access, and the iterative data processing flow. These optimizations make the algorithms run more efficiently, and we have also built many supporting tools to make Alink easier to use. Alink also has a core technical strength: it implements many FTRL algorithms, which are naturally aimed at online learning. Online learning needs high-frequency, fast-updating iterative algorithms, and there Alink has a natural advantage; information-feed products such as Jinri Toutiao and Weibo often face exactly these online scenarios.
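FTRL is a per-coordinate online learner well suited to the high-frequency updates described here. The sketch below is a minimal, generic FTRL-Proximal logistic regression (after McMahan et al., 2013), shown only to illustrate the style of algorithm; it is not Alink's implementation.

```python
import math

class FTRLProximal:
    """Minimal FTRL-Proximal for logistic regression on sparse features.
    A generic illustration, not Alink's implementation."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim   # per-coordinate accumulated adjusted gradients
        self.n = [0.0] * dim   # per-coordinate sum of squared gradients

    def _weight(self, i):
        # Closed-form lazy weight; L1 makes small coordinates exactly zero.
        if abs(self.z[i]) <= self.l1:
            return 0.0
        sign = -1.0 if self.z[i] < 0 else 1.0
        return -(self.z[i] - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, x):
        # x: sparse features given as {index: value}
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, y):
        # One online step: predict, then fold the gradient into z and n.
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v                       # gradient of the log loss
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

# Usage: one pass over a (possibly endless) event stream.
model = FTRLProximal(dim=3)
for x, y in [({0: 1.0, 2: 0.5}, 1.0), ({1: 1.0}, 0.0)]:
    model.update(x, y)
print(model.predict({0: 1.0}))
```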
In offline learning, Alink and SparkML are basically comparable: as long as the engineering on both sides is good enough, offline learning cannot produce a generation gap. A real generation gap only comes from a different design philosophy; only when the design, the product form, and the technical form differ can there be a clear generational advantage. Compared with SparkML, our batch algorithms are essentially aligned in both functionality and performance. Alink supports all the algorithms commonly used by algorithm engineers, including clustering, classification, regression, data analysis, and feature engineering. Before open-sourcing we calibrated against all of SparkML's algorithms and achieved 100% alignment. Beyond that, Alink's biggest highlight is streaming algorithms and online learning, its own distinctive strengths: no drawbacks for users, and clear advantages at the same time.
Follow-up plans for the machine learning algorithms supported by Alink, and the outlook ahead

Q: How often will Flink release new versions from now on? Can you share what new features Flink will have to look forward to?

Mo Wen: Every 3 to 4 months, basically one version per quarter; for example, January 2020 will bring 1.10, and April will bring 1.11. It is not yet clear when to cut over to 2.0; 2.0 should be a very landmark version. The Flink community has many promising directions in view right now, not only AI and machine learning; the Stateful Functions work that Stephan Ewen presented in today's keynote is also very promising. There is a lot worth exploring in online scenarios, and Serverless (FaaS) is also a direction for Flink going forward. The Flink community is in very good shape: it has only just evolved to 1.x, there is still plenty of room to grow, its vitality and state are very good, and there are many ideas to pour into it.

Q: Which new technical directions or trends in the big data field do you consider most important for the future?

Mo Wen: The integration of big data and AI may be a good opportunity. Big data has by now been explored in every way, with projects emerging one after another, and AI is flourishing in all its variety, but what users want is more than AI alone. Where is the data? Without data, how can AI perform? Features and samples must be computed to train a good model, and a model only improves through continuous iterative feedback. Data processing and data analysis are essential in this loop. Without a complete feedback system, the link between big data and AI stays blocked; no matter how good the engine is, without a closed computation loop it cannot truly deliver production or business value. So what is needed most is to turn the whole big data + AI pipeline into a very easy-to-use, ready-to-use solution. Today the individual pieces may exist, and many have corresponding open source projects, but an overall platform is needed to string all the technologies together.

Q: Is this, to some extent, what Flink wants to do?

Mo Wen: Next year we will open-source a new project, AI Flow. It is not ready yet. We hope AI Flow can use a workflow to string the whole system together: data processing and preprocessing, model training, model management, model deployment, dynamic updates, gathering feedback after an update, and then using that feedback to optimize the process in turn. Each link can be implemented with a different engine: Flink is fine, Spark is fine, whichever works. For example, you could use Flink for big data processing, TensorFlow for deep learning training, and FlinkML for streaming training, connecting them all to give users an end-to-end solution. It is a very promising project.

Q: Is this similar to Databricks's MLflow?

Mo Wen: AI Flow is broader than MLflow, because MLflow only defines data formats. AI Flow may be more like Kubeflow: AI Flow leans toward workflow, while MLflow focuses on data formats and does not cover a particularly complete workflow, though we do not rule out MLflow growing bigger and bigger in the future. Why are we building this?
Because inside Alibaba we know very well how the core systems behind search, recommendation, and advertising work: how data is processed step by step into a brain that regulates all of the traffic, whether search traffic, recommendation traffic, or advertising traffic, across business flows and cash flows. This is the core system of the whole commercialization effort; it is built on a big data + AI architecture, and that architecture cannot do without workflow. It is a larger concept that also cannot do without data format definitions and the cooperation of different computing engines. We will invest more resources in this area next year, and we will join hands with other companies to do it.
This article is original content from the Yunqi community and may not be reproduced without permission.