
How Should We Think About Spark's Multi-Language Support?

2025-04-03 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

How should we think about Spark's multi-language support? This article analyzes the question in detail and offers some answers, in the hope of helping readers who want a simple, clear way to understand the problem.

There is no doubt about the excellence of Spark's design. It seized center stage in the Hadoop ecosystem as soon as it debuted, dominated the golden decade of open-source big data, and it is no surprise that it still keeps pace in the era of artificial intelligence. That said, its architecture and design have real shortcomings. For example, the scheduling model is tightly coupled to the MapReduce computing paradigm, and Spark recently had to introduce a barrier scheduling mode to support deep learning, a new kind of computation. Fortunately, changes like that stay within the framework and do not cut too deep. Other defects are not so contained: they affect the whole system and are by no means easy to fix. Here I mainly want to discuss the implementation language of the Spark framework and its multi-language support, its gains and losses, but mostly its problems and defects.

The core and framework of Spark are built in Scala, so the application-facing API is naturally Scala first, showing off that language's minimalist muscle. On the user-interface side, from Java, the first choice for enterprise applications, to the Python and R of data scientists, to the SQL of business intelligence (BI), the project officially supports them all, one by one. The coverage sounds comprehensive and complete, and every camp has been served, so what is there to complain about? What are the defects? Let's look at the existing problems, and I hope this does not come across as nitpicking.

First, the support for Python: PySpark is complex and inefficient, a bit of a hack, very much a product of Spark's early land grab. The support for R via SparkR is essentially the same, so once we understand PySpark clearly, we understand the support for the other languages too. From the user's point of view, the experience of writing programs in PySpark is very good: like Scala, it is concise and elegant, and feels like writing a single-machine program, with no need to think about the complexity of distributed processing. Under this beautiful coat, however, a Spark program written in PySpark runs slower than the Scala version and is prone to OOM.

Why? Let's find out. In general, a Spark program first declares its input, the data set to be processed; then expresses the logic and transformation rules to apply to each record in that data set; and finally prints the results or saves them to a file. The underlying framework translates this into a set of RDDs and the relationships between them: each RDD represents a data set together with the transformation logic to be performed on it. These RDDs form a DAG according to their dependencies, and the DAG is translated into stages and tasks. This work, including the scheduling and execution of stages and tasks, is done by the framework's Scala code on the driver side. The processing logic attached to each RDD corresponds to the user's Scala or Python code, and since the processing is distributed and concurrent, it mainly executes on the executors.

So, for a Python Spark program, there is Python code to execute on both the driver and the executors, and it must be run by a Python interpreter; meanwhile, the Spark computing framework itself is implemented in Scala/Java, so the driver and executors must run inside JVM processes. How, then, can the Python code representing user logic and the JVM code of the core engine both run on the driver and the executors, and interact and coordinate with each other? PySpark's approach is to launch a separate Python interpreter process alongside the JVM process on the driver and on each executor, and have the two sides communicate through sockets, files, and pipes.

This has to be called a very inefficient approach, because a Spark program is generally not an ordinary application: it has to process very large data sets. To process thousands upon thousands of rows of records, each executor must shuttle them back and forth to its Python process over a cross-process pipe, and on the driver the computation results may even have to be written to a disk file in order to hand them to the Python process. A huge volume of records must be serialized and deserialized between Python and Java, and the efficiency can be imagined. Spawning a separate Python process to execute user code may look like the easy way out, but beyond efficiency it is expensive once you consider process management, memory control, containerized environments, and data security. Spark had to do it this way mainly because of its lack of long-term thinking and overall design in language support, which is the most important issue I want to talk about.
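The cost described above can be sketched with nothing but the standard library. PySpark actually frames batches of pickled records over a local socket between the JVM and its Python workers; the plain pickle round trip below is only an illustration of the toll every record pays at that boundary:

```python
import pickle

# A simulated record batch that "crosses" the JVM <-> Python boundary.
records = [{"id": i, "word": f"w{i}"} for i in range(1000)]

# Executor side: serialize the whole batch before pushing it through the pipe/socket.
payload = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)

# Python worker side: deserialize before user code can touch a single record.
roundtripped = pickle.loads(payload)

assert roundtripped == records  # correctness survives; the CPU and memory cost does not
print(f"{len(payload)} bytes serialized for {len(records)} records")
```

Every transformation whose logic lives in Python repeats this round trip per batch, in both directions, which is exactly why the PySpark version trails the Scala version.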

In terms of language support, Spark focused from the beginning on the "sugar coating" rather than the "cannonball": it pursued concise, powerful expression of data-processing logic in the user experience, and on that front it outclassed the computing frameworks of the day. I remember when Hadoop was in its heyday and my big-data team at Intel was happily playing with it, Matei Zaharia, Spark's author, gave us a demo over a conference call. Spark was still very immature then, but seeing the famous MapReduce hello-world, the WordCount program, written in just two or three lines of Scala was genuinely startling. That Spark could steal Hadoop's thunder in such a short time owes a great deal to this, beyond its faster performance. However, it is hard to say how much of that conciseness and power should be credited to Scala. The essence lies in the functional-style API, the RDD abstraction itself, the rich set of operators, and the ingenious dependency derivation. All of that belongs to the engine level and could, generally speaking, be realized in other languages, for example C++, which is better suited to building system frameworks. As for whether the user's program is written in Scala or Python, any language that supports closures should work; there would not be much difference in brevity, more languages could be supported, and each user could choose by personal preference instead of everyone being forced to learn a new language. Unfortunately, the implementation of the Spark engine was evidently influenced by Scala's role as the primary user-interface language: it too uses Scala, with some Java, but it is essentially JVM code, which has the advantage of keeping everything in one pot. Advocates can list plenty of benefits of using Scala, such as fewer lines of code and powerful codegen, but these are local wins. As a whole, as a unified big data processing platform, Spark needed to think long term: on the user interface, supporting more user languages faster and better; on the computing engine, extreme performance optimization down to the hardware level; and across computing scenarios, integration with other computing engines.
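For comparison, the WordCount that impressed the team really does reduce to a couple of chained transformations. As a stand-in that needs no cluster, here is the same flatMap-then-count flow in plain Python (`lines` plays the role of the input RDD):

```python
from collections import Counter

lines = ["to be or not to be", "to see or not to see"]

# flatMap each line into words, then do a reduceByKey-style count per word.
counts = Counter(word for line in lines for word in line.split())

print(counts.most_common(1))  # [('to', 4)]
```

The point stands regardless of language: with closures and a handful of higher-order operators, the user-facing code stays this short whether the engine underneath is JVM or native.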

To put it simply, what is missing is C++ support at the framework level. Of course, such support was hard to plan for on day one; the focus was elsewhere, and Matei probably did not expect Spark to become so successful. Still, the lack of native support at the framework level, or in other words the fact that the framework core is not developed in a native language, has disadvantages for the long-term development of the framework that are all too obvious. A system language like C++ can talk directly to hardware and to every other development language, so supporting a variety of computing libraries and engines on top of it is not difficult. Spark positions itself as a big data processing platform; if the core common logic and operations were handed to C++, then supporting Python as a user-interface language would be a matter of calling directly through an FFI mechanism, simple and without loss of efficiency. Mainstream development languages basically all interoperate very well with C/C++: Python, not to mention Java, Go, Rust, and Julia. Adding support for a new language would reduce to adding a new binding, easy to maintain, with no worry that this feature is crippled or that support is incomplete. Nor would optimizing for new hardware be a problem, from the computing upstart GPU, to RDMA, to persistent memory such as Intel's AEP: free of the JVM's limitations, the hardware vendors could roll up their sleeves while Spark sat back and enjoyed the results. Why can't support for this new hardware get accepted by the community today? Personally, I think it is mainly the lack of framework-level support: everything looks like a hack, and getting it merged is a struggle. Just look at TensorFlow: CPU, TPU, and GPU all play well together.
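The FFI route argued for above is the one any C/C++ core gets for free. A minimal sketch using Python's stdlib ctypes, with libc's strlen standing in for a hypothetical native data-processing kernel (assumes a POSIX system where the C library's symbols are already loaded):

```python
import ctypes

# On POSIX, dlopen(NULL) exposes the symbols of the already-loaded C library;
# a native Spark core would be loaded the same way as a shared library.
libc = ctypes.CDLL(None)

# Declare the signature of strlen, our stand-in for a native kernel.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

n = libc.strlen(b"spark")
print(n)  # 5 -- no serialization, no helper process: a direct in-process call
```

Contrast this with PySpark's companion-interpreter design: here the "foreign" code runs in the same process, and the only cost is the function call itself.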
Another important impact of implementing the core framework in Scala/Java is the in-memory representation of data sets. Early Spark used Java objects directly to represent data records, so persistence or network transmission required serialization and deserialization. Spark quickly recognized the problem and addressed it through off-heap storage and Project Tungsten; fortunately, the JVM still left room for such tricks. If the core were implemented in C++, the in-memory layout of the data would naturally adopt a language-neutral representation, similar to Apache Arrow, convenient for every user-interface language to access and manipulate, just as deep learning frameworks share a common tensor representation in which efficient transfer is considered from the start. Finally, a Spark framework implemented in Scala/Java easily integrates the data sources common in the open-source big data world and in enterprise applications, but it is not good enough at integrating other computing frameworks. As with new hardware, look at Yahoo's support for Caffe and TensorFlow on Spark: it could only be done as a hack, was never integrated into Spark itself, and never became popular. Native support for deep learning, like machine learning before it, is possible, but you have to build it yourself: Spark has poured effort into MLlib, and Intel into BigDL, yet for lack of strong support from the core framework there is no way to directly integrate the ready-made libraries and engines in these fields. If the core framework were implemented in C++, how hard would it be to integrate PyTorch or TensorFlow? Just a matter of loading a dynamic module or building an extension.
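What a language-neutral in-memory representation buys can be sketched in a few lines (plain stdlib, not actual Arrow): a column of int32 values packed into one contiguous little-endian buffer that the JVM, Python, or C++ could all map and read in place:

```python
import struct

# Producer side: pack an int32 column into one contiguous little-endian buffer
# (little-endian is also Arrow's default buffer layout).
column = [3, 1, 4, 1, 5]
buf = struct.pack(f"<{len(column)}i", *column)

# Consumer side (any language): reinterpret the same bytes in place.
# Assumes a little-endian host, so the native 'i' view matches the buffer.
view = memoryview(buf).cast("i")  # zero-copy reinterpretation as int32[]
print(list(view))  # [3, 1, 4, 1, 5]
```

Because the bytes themselves are the contract, the consumer never deserializes: it just points at the buffer, which is exactly the property a Java-object representation cannot offer.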

This is the answer to the question of how to think about Spark's multi-language support. I hope the above content has been of some help. If you still have doubts, you can follow the industry information channel to learn more.





© 2024 shulou.com SLNews company. All rights reserved.
