2025-03-31 Update · From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/03 Report
The time before Spark (http://www.3if0.com)
To understand Spark's potential, it helps to recall the shape of big data ten years ago. In 2008-2009, the business concept of big data was often conflated with the Hadoop technology. Hadoop is an open-source framework for managing clusters (networks of multiple computers) that run MapReduce programming tasks. MapReduce is a programming model, popularized by Google in 2004, for building the collection and analysis of large data sets. Ten years ago, the archetypal big-data project was coded as MapReduce batch jobs applied to domain-specific data, then executed on a Hadoop-managed cluster. Big data and Hadoop became so identified with each other that, a few years later, parties unfamiliar with the technology (venture capitalists, public-relations firms, human-resources departments) notoriously confused the two in their advertising and other writing.
Hadoop's emphasis on batch processing makes it clumsy for iterative and interactive jobs. More important, Hadoop's interpretation of MapReduce assumes that data sets reside in the Hadoop Distributed File System (HDFS). Many, perhaps most, data sets fit that model uncomfortably. High-performance machine learning, for example, emphasizes in-memory processing and makes little use of file-system mass storage.
Spark, the "unified analytics engine for large-scale data processing" that began as a Berkeley project in 2009, emphasizes:
Compatibility with Hadoop, by reusing HDFS as a storage layer.
Interactive queries.
Support for machine learning.
Pipelining (that is, ease of connecting different execution units, so that a complex calculation can be implemented as a "bucket brigade" that passes data through successive computation stages).
Spark is also flexible in several other ways, including the range of programming languages it serves, the clouds it can be rented on, and the large databases it integrates with.
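The "bucket brigade" style of pipelining mentioned above can be sketched in plain Python with generators, each stage lazily consuming the previous stage's output. This is only an analogy for Spark's chained transformations (Spark distributes such stages across a cluster); all names here are illustrative, not Spark APIs:

```python
def read_lines(lines):
    # Stage 1: the source, yielding raw records one at a time.
    for line in lines:
        yield line

def split_words(lines):
    # Stage 2: explode each line into individual words.
    for line in lines:
        for word in line.split():
            yield word

def count_words(words):
    # Stage 3: reduce the word stream to per-word counts.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

data = ["to be or not", "to be"]
pipeline = count_words(split_words(read_lines(data)))
print(pipeline)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because each stage only pulls items on demand, data flows through the chain without any stage materializing the whole intermediate result, which is the essence of the pipelining Spark provides at cluster scale.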
Spark vs. Hadoop
Spark is usually faster than Hadoop: jobs that fit Spark's in-memory model well can run as much as 100 times faster. Spark is tuned for typical machine-learning tasks, such as Naive Bayes and k-means computations, and can save time and ease hardware constraints. However, early Spark projects were notorious for leaking memory, at least in the hands of beginners. In addition, long-running batch MapReduce jobs seem to be easier with Hadoop.
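K-means is a good illustration of why the in-memory model pays off: each pass re-reads the full data set to refine the cluster centers, so keeping the data in memory beats re-reading it from disk on every iteration. The following one-dimensional sketch in plain Python (not Spark's MLlib implementation; all names are illustrative) shows the iterative shape of the computation:

```python
def kmeans_1d(points, centers, iterations=10):
    # Each iteration: (1) assign every point to its nearest center,
    # then (2) move each center to the mean of its assigned points.
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans_1d(points, centers=[0.0, 5.0]))  # [1.5, 10.5]
```

Every one of those iterations touches all of the points; a framework that forces the intermediate state back to a distributed file system between passes, as classic MapReduce does, multiplies the cost of exactly this loop.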
Spark is also a more general programming framework, as noted above and as illustrated in more detail in the examples that follow. Hadoop conceives of big data as Java-coded MapReduce operations, which is quite inflexible; by contrast, Spark's learning curve is far less steep. Programmers accustomed to Python, Java, Scala, R, or even SQL can almost immediately start writing familiar programs on a conventional desktop while exploiting Spark's power. Spark's official website offers several memorable examples. Consider this word counter in Python:
import pyspark

source = "file://..."
result = "file://..."

with pyspark.SparkContext("local", "WordCount") as sc:
    text_file = sc.textFile(source)
    counts = (text_file.flatMap(lambda line: line.split(" "))
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(result)
Any Python programmer can read this. Although it runs on a low-powered development host, the same code runs unchanged on Docker-ized Spark, Spark on industrial clusters, experimental supercomputers, high-uptime mainframes, and so on. Moreover, it is easy to refine such an example with conventional Python techniques; a follow-up example might be:
import re

import pyspark

source = "file://..."
result = "file://..."

def better_word_splitter(line):
    """
    Use a negative lookbehind to split on all
    whitespace, but only once for each whitespace
    sequence.
    """
    return re.split(r"(?<!\s)\s", line.strip())

with pyspark.SparkContext("local", "WordCount") as sc:
    text_file = sc.textFile(source)
    counts = (text_file.flatMap(better_word_splitter)
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(result)