This article gives a concise, easy-to-follow analysis of Spark 2.0's new features. I hope you can gain something from the detailed introduction below.
Usage:
Offline computing: most data sources come from HDFS (Hive), so SQL is used very heavily; HiveContext or SQLContext appears in almost every offline computing job.
Real-time computing: the Spark Streaming module.
Graph computing is rarely used in enterprises, and demand is low.
The data-mining package MLlib is used somewhat more than graph computing, but its algorithms are not very easy to use, so demand is currently limited.
What's new:
1. SparkSession unifies HiveContext and SQLContext into a single entry point.
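A minimal PySpark sketch of the new entry point (the app name is a placeholder; enableHiveSupport() is only needed when Hive tables or HiveQL features are required):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark2-demo")      # placeholder app name
         .enableHiveSupport()         # optional; only for Hive metastore / HiveQL access
         .getOrCreate())

df = spark.sql("SELECT 1 AS id")      # SQL runs directly off the session, no separate SQLContext needed
df.show()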
2. Whole-stage code generation for SQL greatly improves computing performance: instead of interpreting the physical plan with one call into the physical operators per record, the plan is compiled into hard-coded loops, which sharply reduces the number of function calls during execution; the saving is large when there are many records, and the amount of data SQL can process per second increases roughly tenfold.
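One way to see whether whole-stage code generation applies is to look at the physical plan: operators covered by generated code are printed with a leading asterisk. A small sketch, assuming the spark session created above:

df = spark.range(0, 1000 * 1000)
agg = df.selectExpr("id % 10 AS k").groupBy("k").count()
agg.explain()   # operators wrapped by WholeStageCodegen appear with a leading '*'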
3. DataFrame and Dataset are unified. The Dataset introduced in 1.6 already included the functionality of DataFrame, so there was a lot of redundancy between the two; 2.0 unifies them, keeping the Dataset API and expressing DataFrame as Dataset[Row], i.e. a specialization of Dataset. A DataFrame is an abstraction over the RDD of SQL query result rows, roughly equivalent to a ResultSet in Java.
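PySpark has no separate Dataset type, but the same idea shows through: a DataFrame is a distributed collection of Row objects, matching the Dataset[Row] view above. A small sketch, assuming the spark session created earlier:

from pyspark.sql import Row

df = spark.createDataFrame([Row(user_id=1, time=3.5), Row(user_id=2, time=1.0)])
first = df.head()                     # rows of a DataFrame come back as Row objects
print(type(first), first.user_id)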
4. Structured streaming. Spark Streaming treats stream computing as a series of small offline computations and provides the DStream API; compared with other stream engines, its strengths are fault tolerance and throughput. Before 2.0, users who had both streaming and offline jobs needed two sets of APIs: the RDD API and the DStream API, and the DStream API is nowhere near as easy to use as SQL or DataFrame. To truly unify streaming and offline computing in one programming API, and to let streaming jobs enjoy the performance and usability benefits of DataFrame/Dataset, Structured Streaming was introduced. Both offline and streaming programs can now be developed against DataFrame/Dataset, which unifies offline and stream computing in both API and data flow. For example, a batch aggregation can be written as follows:
logs = ctx.read.format("json").load("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time)).write.format("jdbc").save("jdbc:mysql://...")
For stream computing, we only switch to the streaming entry points on the same DataFrame/Dataset code, as follows:
logs = ctx.readStream.format("json").load("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time)).writeStream.format("jdbc").start("jdbc:mysql://...")
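Filling the streaming example out into a runnable sketch (the schema and path are placeholders, and a console sink is used because Spark 2.0 ships no built-in JDBC streaming sink), assuming the spark session created earlier:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([StructField("user_id", LongType()),
                     StructField("time", DoubleType())])   # streaming file sources need a schema up front

stream_logs = spark.readStream.format("json").schema(schema).load("s3://logs")   # placeholder path
agg = stream_logs.groupBy("user_id").agg(F.sum("time").alias("total_time"))

query = (agg.writeStream
         .outputMode("complete")   # streaming aggregations require complete or update mode
         .format("console")        # illustration sink only
         .start())
query.awaitTermination()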
5. SQL improvements: many previously unsupported SQL statements are now supported.
6. A vectorized Parquet decoder is used to read Parquet data. Previously rows were read and processed one at a time; now 4096 rows are read per batch, so instead of calling Parquet's record getter once per row, one call handles a whole batch (SPARK-12854). Combined with the fact that Parquet is a columnar format, this optimization makes Parquet reads roughly three times faster.
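The vectorized reader is on by default in 2.0; the switch below is only useful for comparing against the old row-at-a-time path (the table path is a placeholder):

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")   # already the default in 2.0
df = spark.read.parquet("/path/to/table")                            # placeholder path
df.count()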
Radix sort is used to improve sort performance (SPARK-14724); in some cases sorting performance improves by 10 to 20 times.
A vectorized hash map replaces Java's HashMap to speed up groupBy execution.
Hive window functions are reimplemented with Spark's native window operator, which has advantages in memory management.
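For reference, this is what a window function looks like through the DataFrame API (the column names reuse the logs example above and are only illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy(F.col("time").desc())
ranked = logs.withColumn("rank", F.row_number().over(w))   # executed by Spark's native window operator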
Repeated computation of identical logical sub-expressions within a complex statement is avoided during execution.
The default compression codec is now lz4.
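The codec is a core Spark setting configured before the session starts; lz4 is shown explicitly below even though it is already the default:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.io.compression.codec", "lz4")   # lz4 is the 2.0 default
spark = SparkSession.builder.config(conf=conf).getOrCreate()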
MLlib computations move from the old RDD-based logic to the DataFrame-based API.
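A minimal sketch of the DataFrame-based ML API (spark.ml); the tiny inline dataset is made up for the example:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10)
model = lr.fit(train)                 # estimators consume DataFrames, not RDDs
model.transform(train).show()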
More algorithms are exposed through the R language API (SparkR).
Spark is now built and run with Scala 2.11.
For compatibility with older versions, Hive statement parsing and syntax have been moved into core, so even without a Hive metastore or Hive dependency packages we can use standard HiveQL statements just as in previous versions.
The above is a sample analysis of the new features of Spark 2.0. Did you learn anything from it? If you want to learn more skills or enrich your knowledge, you are welcome to follow the industry information channel.