
Example Analysis of Spark-Sql


This article walks through an example analysis of Spark-Sql. It is quite detailed and should have some reference value; if you are interested, read it through to the end!

Spark SQL execution architecture

When Spark SQL processes a SQL statement, it first parses it (Parse) into a Tree. The subsequent steps, such as binding and optimization, are all operations on that Tree; the mechanism is Rules, which use pattern matching to apply different operations to different kinds of nodes.
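You can actually watch these phases from the outside: Dataset.explain(true) prints the parsed, analyzed, and optimized logical plans along with the physical plan. Here is a minimal sketch (the query and class name are made up for illustration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PlanDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PlanDemo")
                .master("local")
                .getOrCreate();

        // Any query will do; this one needs no input data.
        Dataset<Row> df = spark.sql("SELECT 1 + 1 AS two");

        // With the extended flag, explain() prints the parsed, analyzed,
        // and optimized logical plans plus the physical plan -- the Tree
        // after each of the phases described above.
        df.explain(true);

        spark.stop();
    }
}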

Spark SQL is Spark's module for working with structured data, and it is the natural entry point for getting started with Spark.

Learning a technology mostly comes down to understanding its API, but Spark makes this a bit hard, because its examples, and basically everything you can find online, are written in Scala. Here we use Java.

Getting started example

The first example of data processing is usually word count, which counts how many times each word appears in a file. Let's give it a try.

> There are many examples on the Internet, even from Spark itself; most are written in Scala, which I haven't used; only a few are written in Java.

Of the Java examples, most are implemented with the RDD API, and only a handful use DataSet. But even that handful of examples didn't get me all the way there.

So I tried to finish this example myself. Something that others wrote in three or five lines of Scala took me a whole day of barely making progress, piecing together Java snippets from around the Internet to get familiar with Spark.

Let's adapt our earlier example:

String logFile = "words";

SparkSession spark = SparkSession.builder()
        .appName("Simple Application")
        .master("local")
        .getOrCreate();

Dataset<String> logData = spark.read().textFile(logFile).cache();

System.out.println("number of lines: " + logData.count());

Instead of the README file from before, I created a words file containing a bunch of random words.

Run the program and the line count prints normally.

Next we need to split the lines into words and count the number of times each word appears.

> Some people may say: this is easy, I can handle it with a Java 8 Stream:

Process the row set with flatMap, split each row into individual words with split(" "), and collect with groupingBy into the terminal data structure, a Map.

Finally, the keys of the map are the words, and the size of each key's value gives the count, as in the sketch below.
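A minimal sketch of that plain-Java approach (assuming the file's lines are already loaded into a List<String>; the class and variable names are made up for illustration):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PlainWordCount {
    public static void main(String[] args) {
        // Assume the lines have already been read from the file.
        List<String> lines = Arrays.asList("hello spark", "hello java");

        Map<String, Long> counts = lines.stream()
                // split each line into words and flatten into one stream
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // group identical words; the value of each key is its count
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        counts.forEach((word, n) -> System.out.println(word + ": " + n));
    }
}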

Indeed, that is how you would implement it in plain Java. But Spark provides a set of tools similar in name and effect to Java's Stream API, except that Spark's API is distributed.

Let's handle it first with Spark's flatMap:

Dataset<String> words = logData.flatMap(
        (FlatMapFunction<String, String>) k -> Arrays.asList(k.split("\\s")).iterator(),
        Encoders.STRING());
System.out.println("number of words: " + words.count());
words.foreach((ForeachFunction<String>) k -> System.out.println("W: " + k));

Unlike a Java stream, the Dataset that Spark's flatMap returns can be inspected directly:

> Some people may have noticed that the signatures of Spark's functional methods differ quite a bit from Java's: the parameters are not the same, and there is an extra Encoder. I don't yet know why it is defined this way, but my impression is that encoders are also an important part of Spark 3's optimizations.

Using a Scala API from Java always feels a bit awkward: lambda expressions need a cast in front of them just to pin down the parameter type; otherwise you have to new an anonymous class.
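For example (a sketch reusing the words dataset from above; ForeachFunction comes from org.apache.spark.api.java.function):

// With a cast, the lambda picks the Java-friendly overload of foreach():
words.foreach((ForeachFunction<String>) w -> System.out.println(w));

// Without the cast you would have to new an anonymous class instead:
words.foreach(new ForeachFunction<String>() {
    @Override
    public void call(String w) {
        System.out.println(w);
    }
});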

It also took me a long time before I found a page called "org.apache.spark.sql.Dataset.flatMap java code examples | Tabnine".

And then I was confused:

KeyValueGroupedDataset<String, String> group = words.groupByKey(
        (MapFunction<String, String>) k -> k, Encoders.STRING());

Now I have a group, but the return type is not a Dataset. What is this return value good for, and how do I get at the contents? It took me a lot of trouble to sort out.

For example, I found that its count method does return a Dataset:
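In Java that count comes back as a Dataset of Scala tuples, something like this (a sketch continuing from the group variable above; Tuple2 comes from the scala package):

// count() on a KeyValueGroupedDataset yields one (key, count) pair per group.
Dataset<Tuple2<String, Object>> count = group.count();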

It seemed to be exactly what I wanted, but when I tried to print it, I actually got an error:

count.foreach(t -> System.out.println(t));

Never mind foreach; even just asking for the count (the way we looked at the number of lines in the file at the beginning) raises an error with the same message:

count.count();

After checking a lot of material, the gist is this: Spark's computations are distributed, the tasks need to communicate with each other, and that communication requires serialization to transmit the data. We could count the file's lines above because the element type was String, which carries the serialization flag; now the result holds tuples, which cannot be serialized here. I tried all kinds of things, even writing my own class to model the computation.

I searched for a long time, through pages like "Job aborted due to stage failure: Task not serializable | Databricks Spark Knowledge Base (gitbooks.io)", still with no solution. Then by chance I found an exciting site, "Spark Groupby Example with DataFrame - SparkByExamples", which finally solved my problem.

Use DataFrame

Although DataFrame is an important tool provided by Spark, there is no corresponding class in Java: a DataFrame is just a Dataset whose generic type is Row. Note that Row itself carries no generic type information, so which columns it contains is not known at compile time.

You can convert the Dataset to a DataFrame right from the beginning:
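The code for this step didn't survive in the original post, so here is a minimal sketch of what the conversion might look like, reusing the words dataset from above (the column names are made up):

// toDF() turns the typed Dataset<String> into a DataFrame (Dataset<Row>).
Dataset<Row> wordsDf = words.toDF("word");

// groupBy()/count() on the DataFrame yields another DataFrame
// with the columns "word" and "count".
Dataset<Row> counts = wordsDf.groupBy("word").count();
counts.show();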

But as you can see, pulling data out of a Row is troublesome. So for now, I only convert at the point where serialization is needed:
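Again the original snippet is missing; here is a sketch of the idea, under the assumption that only the final output step needs Row values (column names made up):

// Keep the typed Dataset API as long as possible and convert to a
// DataFrame only at the point where rows must be serialized for output.
Dataset<Row> countDf = group.count().toDF("word", "count");
countDf.foreach((ForeachFunction<Row>) row ->
        System.out.println(row.getString(0) + ": " + row.getLong(1)));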

That is the full content of "Example Analysis of Spark-Sql". Thank you for reading, and I hope it has been helpful!
