Filtering Methods for Spark Datasets

This article introduces common ways to filter a Spark dataset by a field and compares their trade-offs.
In practice, filtering a Spark dataset based on a field is a very common scenario, for example:
A dataset A that stores company employee information has the following three fields:
id: Integer
name: String
age: Integer
Now we want to select the employees whose id belongs to a set B (B might be a hash table or another Spark dataset). The filtering logic is:
C = A.filter(A.id in B)
There are four ways to do this, which are:
Filter
Map
MapPartitions
Inner Join
Each is described in detail below.
Filter
Spark's filter transformation can filter a dataset by a Column condition expression, by a filter function that returns a Boolean, or by a condition string. The signatures are:
// 1. Condition expression
A1 = A.filter(Column condition)
// 2. Custom filter function
A1 = A.filter(FilterFunction func)
// 3. Condition string
A1 = A.filter(String condition)
The filter transformation is simple, and processing records one by one keeps it efficient regardless of the size of the dataset. It does, however, require that the filtering set B can be broadcast to every executor.
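As a minimal sketch of this approach (assuming a SparkSession named spark, a Dataset<Employee> A, and a java.util.Set<Integer> B holding the target ids; Employee, spark, A, B, and idsB are illustrative names, not from the original):

import java.io.Serializable;
import java.util.Set;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;

// Hypothetical bean mirroring the three fields of dataset A.
public class Employee implements Serializable {
    private Integer id;
    private String name;
    private Integer age;
    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
}

// Broadcast B once so every executor holds a read-only local copy.
Broadcast<Set<Integer>> idsB = JavaSparkContext
        .fromSparkContext(spark.sparkContext())
        .broadcast(B);

// Keep only the employees whose id appears in B.
Dataset<Employee> C = A.filter(
        (FilterFunction<Employee>) e -> idsB.value().contains(e.getId()));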
Map
The map transformation calls a function on each record of the dataset. The function may return null, or a new record of the same or a different type. The signature is:
// the encoder parameter specifies the output type
A2 = A.map(MapFunction func, Encoder encoder)
To implement filtering with the map transformation, return matching records unchanged and return null for the records that do not match; the nulls then have to be removed from the output.
The semantics here are similar to filter: records are still processed one by one. But map needs an extra Encoder, so it is not as simple and elegant as filter, and the extra pass to drop the null values makes it less efficient.
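A sketch of this map-based filtering, reusing the hypothetical Employee bean and broadcast set idsB from the filter example (Encoders.bean is an assumption; how a top-level null record is handled can vary by encoder and Spark version):

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Pass matching records through unchanged; map the rest to null.
Dataset<Employee> mapped = A.map(
        (MapFunction<Employee, Employee>) e ->
                idsB.value().contains(e.getId()) ? e : null,
        Encoders.bean(Employee.class));

// A second pass is needed to drop the null placeholders.
Dataset<Employee> C = mapped.filter((FilterFunction<Employee>) e -> e != null);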
MapPartitions
The mapPartitions transformation is similar to map, but the mapping function is called once per partition rather than once per record. The signature is:
// both the input and the output of func are of Iterator type
A3 = A.mapPartitions(MapPartitionsFunction func, Encoder encoder)
Because mapPartitions works at the partition level rather than the record level, it can amortize per-record overhead and be more efficient than filter and map. On the downside, like map it needs an extra Encoder, and if a partition requires more memory than the executor can provide, the task fails, so it is less reliable than map and filter.
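A sketch with mapPartitions, again reusing the hypothetical A and idsB from the earlier examples; note the per-partition buffer is exactly what can exhaust executor memory on very large partitions:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<Employee> C = A.mapPartitions(
        (MapPartitionsFunction<Employee, Employee>) it -> {
            // Buffer the matching records of this partition, then
            // return them as the partition's output iterator.
            List<Employee> kept = new ArrayList<>();
            while (it.hasNext()) {
                Employee e = it.next();
                if (idsB.value().contains(e.getId())) {
                    kept.add(e);
                }
            }
            return kept.iterator();
        },
        Encoders.bean(Employee.class));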
Inner Join
Filtering can also be implemented as an inner join on equality of the employee id, keeping only the columns of A afterwards. The signature is:
// the join expression can be, for example, A.col("id").equalTo(B.col("id"))
A4 = A.join(Dataset B, Column joinExprs)
Like filter, an inner join is both efficient and reliable, and it places no constraints on the type or size of B, since B does not need to be broadcast.
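A sketch of the join approach, assuming B is itself a Dataset with a single id column (for example, built from the id set); after the inner join only A's columns are kept:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Inner join on id equality; rows of A without a match in B are dropped.
Column cond = A.col("id").equalTo(B.col("id"));
Dataset<Row> joined = A.join(B, cond, "inner");

// Keep only A's columns and map back to the Employee type.
Dataset<Employee> C = joined
        .select(A.col("id"), A.col("name"), A.col("age"))
        .as(Encoders.bean(Employee.class));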
This concludes the look at filtering methods for Spark datasets. Theory works best when paired with practice, so try the four approaches out for yourself.