
Comparison of common data query operations among SQL, Pandas and Spark


This article compares the common data query operations of SQL, Pandas and Spark. These operations come up constantly in practical data work, so the aim here is to walk through the keywords of a standard SQL query one by one and show how each maps onto the corresponding Pandas and Spark APIs.

01 The standard SQL query

Any discussion of data inevitably leads to databases, and "database" usually means a relational database (RDBMS); the language for working with relational databases is SQL (Structured Query Language). SQL is, at heart, a programming language with a long history, yet its syntax has hardly changed over the years, which in a sense reflects how well the language was designed.

In the latest TIOBE ranking, SQL ranks 10th.

Generally speaking, a standard SQL query contains the following keywords, listed in the order in which they are written (a small runnable sketch follows the list):

Select: specifies the fields to return

Distinct: deduplicates the query results

From: specifies the database and table(s) being queried

Join on: specifies that the data comes from a multi-table join, together with the join conditions

Where: sets the filter criteria for the query

Group by: sets the fields used for grouping and aggregation

Having: further filters on fields produced by the aggregation

Order by: sets the sort order of the returned results

Limit: limits the number of returned results
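
As an illustration, here is a minimal runnable sketch using Python's built-in sqlite3 module and an invented scores table (join on and distinct are omitted for brevity); it strings the remaining keywords together in their written order:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER);
    INSERT INTO scores VALUES
        ('a', 'math', 90), ('a', 'english', 80),
        ('b', 'math', 70), ('b', 'english', 85),
        ('c', 'math', 60);
""")

# SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT
rows = conn.execute("""
    SELECT name, AVG(score) AS avg_score
    FROM scores
    WHERE score >= 60
    GROUP BY name
    HAVING AVG(score) >= 75
    ORDER BY avg_score DESC
    LIMIT 10
""").fetchall()
print(rows)  # [('a', 85.0), ('b', 77.5)]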

These are the main keywords a SQL query can involve. After the statement passes through the parser and optimizer, however, the actual execution order is quite different, namely:

From: first locate the table(s) to be queried

Join on: if more than one table is involved, join them according to the join conditions

Where: filter the records according to the query criteria

Group by: group and aggregate the filtered results

Having: apply a second round of filtering to the aggregated results

Select: extract the target fields from the results of that second filtering

Distinct: deduplicate the selected results as required

Order by: sort the deduplicated results

Limit: return only the specified number of sorted records

I once wondered why the written order of a SQL statement was not simply aligned with its execution order, which would make some of the underlying mechanics easier to understand, but I never found a satisfying answer and eventually let the question go.

Of course, the purpose of this article is not to explain how SQL queries are executed or optimized, but to take these query keywords one by one and focus on how the same operations are expressed in Pandas and Spark.

02 The corresponding operations in Pandas and Spark

The following walks through the SQL keywords in their execution order and shows how each is implemented in Pandas and Spark. Pandas is the data analysis toolkit for Python; Spark is a general-purpose distributed computing framework with APIs for Java, Scala, Python and R, and its examples here are written in Scala by default.

1) from. Because Python and Scala are both object-oriented languages, an explicit from is unnecessary in Pandas and Spark: calling df.xxx on a DataFrame already identifies the data source.

2) join on. Join on is one of the most important operations in multi-table SQL queries. The commonly used join types are inner join, left join, right join, outer join and cross join, and both Pandas and Spark provide corresponding keywords.

Pandas: Pandas implements the join operation mainly through two APIs, merge and join. merge is a top-level Pandas interface (it can be called directly as pd.merge) and is also available as a DataFrame method; it supports a rich set of parameters, the main ones being:

def merge(
    left,                       # left table
    right,                      # right table
    how: str = "inner",         # join type, default "inner"
    on=None,                    # SQL-style ON field, which must exist in both the left and right tables
    left_on=None,               # join field(s) of the left table
    right_on=None,              # join field(s) of the right table
    left_index: bool = False,   # use the left table's index as the join key
    right_index: bool = False,  # use the right table's index as the join key
    sort: bool = False,         # sort the join result
    suffixes=("_x", "_y"),      # suffixes applied when non-join columns share a name
    copy: bool = True,
    indicator: bool = False,
    validate=None,
) -> "DataFrame":

Among these parameters, there are three main ways to specify the join condition: if the join field is shared by both tables, it can be set directly with on; otherwise it is set separately through left_on and right_on; and when a table's join key is its index, left_index (or right_index) can be set to True.

join can be seen as a simplified version of merge: it joins on the index by default and is available only as a DataFrame method; there is no top-level pd.join function.

In addition, concat can also splice two tables horizontally by setting axis=1, although it is more commonly used for vertical union operations.
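
A minimal sketch of the three approaches, using two invented DataFrames (the column names id, name and score are purely for illustration):

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [90, 85, 70]})

# merge: top-level interface, joining on the shared column "id"
inner = pd.merge(df1, df2, on="id", how="inner")

# join: DataFrame method, index-based by default, so align both sides on "id" first
left = df1.set_index("id").join(df2.set_index("id"), how="left")

# concat with axis=1: horizontal splicing aligned on the index
side_by_side = pd.concat([df1.set_index("id"), df2.set_index("id")], axis=1)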

Spark: compared with the many ways of joining two DataFrames in Pandas, the interface in Spark is much more uniform: there is only the single keyword join, but it provides a number of overloads, of which the following three forms are the most common:

// 1. The two DataFrames share a common join column and there is a single join condition:
//    pass the column name directly
df1.join(df2, "col")
// 2. Multiple shared join columns: pass a Seq of column names
df1.join(df2, Seq("col1", "col2"))
// 3. The join columns are named differently in the two DataFrames: write the condition explicitly
df1.join(df2, df1("col1") === df2("col2"))
// Note: in the join condition above, equality is written === and inequality =!=

3) where. Data filtering is a core part of any data processing flow. SQL implements it with the where keyword, and both Pandas and Spark provide corresponding interfaces.

Pandas: there are many ways to filter data in Pandas; the three used most often are the following.

The loc accessor plus a boolean condition. loc is primarily an accessor for reading data, but because it also accepts boolean conditions, it doubles as a filtering tool and is the form used most often day to day.

The query interface. query in Pandas essentially implements SQL's where syntax as a string expression, which makes it very handy for chained filtering. (For more detail, see the earlier article "After a year with Pandas, these three functions are my favorites".)

The where method. Pandas is known for its rich API and naturally does not pass up the where keyword; unfortunately, where in Pandas, like where in NumPy, applies the same logical test element-wise across all columns and masks non-matching values, so it is less flexible for row filtering. A short sketch of the three approaches follows.
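
A brief sketch of the three Pandas approaches, on an invented DataFrame:

import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 60, 85]})

# 1. loc + boolean condition: the most common day-to-day form
high = df.loc[df["score"] >= 80]

# 2. query: a SQL-like string expression, convenient for chained filtering
high_q = df.query("score >= 80")

# 3. where: applies the same element-wise test to every cell and masks failures with NaN,
#    so it is less suited to row filtering; shown here on the numeric column only
masked = df[["score"]].where(df[["score"]] >= 80)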

Spark: the filtering interface in Spark is simpler, with the two keywords where and filter sharing the same underlying implementation, so there is really only one operation. In use, where supports two syntactic forms: one passes a SQL-like conditional expression as a string, similar to query in Pandas; the other applies a logical test to column objects to obtain a Boolean result, similar to the loc approach in Pandas.
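
In PySpark (a sketch; the Scala API is analogous, and the data is invented), the two forms look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 90), ("b", 60), ("c", 85)], ["name", "score"])

# 1. SQL-style string expression, similar to query in Pandas
df.where("score >= 80")

# 2. Column-expression form, similar to loc in Pandas; filter is an exact alias of where
df.filter(df["score"] >= 80)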

4) group by. The group by keyword performs grouped aggregation, which really consists of two stages: grouping, then aggregation. Because it is such a standard operation, Pandas and Spark both provide a keyword of the same name; what differs is the operator that follows the groupby.

Pandas: a groupby in Pandas can be followed by a variety of operators; the commonly used ones fall into the following four categories (a sketch follows the list):

Directly chain an aggregate function, such as sum or mean

Follow with agg and pass in multiple aggregate functions

Follow with transform and pass in an aggregate function without collapsing the result: N records before the aggregation remain N records after it, similar to a window function in SQL (see the earlier article "Do you know all these uses of groupby in Pandas?")

Follow with apply for more customized per-group logic (see the earlier article "These three functions in Pandas unexpectedly became the main force of my data processing")
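
A sketch of the four usages on a small invented DataFrame:

import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b", "b"], "score": [90, 80, 70, 85]})
g = df.groupby("name")["score"]

# 1. Directly chain an aggregate function
g.sum()

# 2. agg with multiple aggregate functions
g.agg(["sum", "mean"])

# 3. transform: aggregate without collapsing rows (N rows in, N rows out), like a SQL window function
df["name_avg"] = g.transform("mean")

# 4. apply: arbitrary custom logic per group
g.apply(lambda s: s.max() - s.min())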

Spark: the commonly used groupBy operations in Spark fall into the following three types (a sketch follows the list):

Directly chain an aggregate function, such as sum or avg

Follow with agg and pass in multiple aggregation expressions, similar to Pandas

Follow with pivot to produce a pivot-table-style result
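
A PySpark sketch of the three usages (column names and data invented; the Scala calls are analogous):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "math", 90), ("a", "english", 80), ("b", "math", 70)],
    ["name", "subject", "score"],
)

# 1. Directly chain an aggregate function
df.groupBy("name").sum("score")

# 2. agg with multiple aggregation expressions
df.groupBy("name").agg(F.avg("score").alias("avg_score"), F.max("score").alias("max_score"))

# 3. pivot: pivot-table-style reshaping, one output column per subject
df.groupBy("name").pivot("subject").agg(F.avg("score"))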

5) having. In SQL, having filters the results of an aggregation; the core difference from where is whether the filter condition refers to fields before or after aggregation. Pandas and Spark have no separate keyword for this, so it is expressed the same way as where, just applied after the aggregation.
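
In other words, a SQL having clause simply becomes an ordinary filter applied after the aggregation. A Pandas sketch (invented data):

import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b"], "score": [90, 80, 60]})

# SELECT name, AVG(score) FROM df GROUP BY name HAVING AVG(score) >= 75
avg = df.groupby("name", as_index=False)["score"].mean()
result = avg[avg["score"] >= 75]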

6) select. Selects specific columns from the result; for details see the earlier article "Pandas vs Spark: n ways to get the specified columns".

7) distinct. In SQL, distinct deduplicates the query results; in Pandas and Spark this is done with drop_duplicates and dropDuplicates respectively.
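
For example (a small sketch with invented data; sdf stands for a hypothetical Spark DataFrame):

import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b"], "score": [90, 90, 60]})
df.drop_duplicates()                 # Pandas: drop fully duplicated rows
df.drop_duplicates(subset=["name"])  # or deduplicate on selected columns only

# Spark equivalents:
# sdf.distinct()
# sdf.dropDuplicates(["name"])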

8) order by. Order by is used to sort by specified fields. The implementations in Pandas and Spark are as follows:

Pandas: sort_index and sort_values, where the former sorts by the index and the latter by the column names passed in; ascending or descending order is controlled with the ascending parameter.
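
For instance (a sketch with invented data):

import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 60, 85]})
df.sort_values("score", ascending=False)                    # sort by a column, descending
df.sort_values(["score", "name"], ascending=[False, True])  # multiple keys, mixed order
df.sort_index()                                             # sort by the index instead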

Spark: orderBy and sort, which share the same underlying implementation and are functionally identical. Both sort by the fields passed in and can be combined with the asc and desc functions to control ascending and descending order, for example:

// 1. Column expression plus the desc method
df.orderBy(df("col").desc)
// 2. The desc function applied to the column name
df.orderBy(desc("col"))

9) limit. The limit keyword restricts the number of returned results, a fairly simple operation. The two frameworks implement it as follows:

Pandas: a specified number of rows can be taken with the head method or the iloc accessor.

Spark: a limit operator is built in directly, and its usage is closest to the limit keyword in SQL.
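
Sketches of both (invented data; sdf stands for a hypothetical Spark DataFrame):

import pandas as pd

df = pd.DataFrame({"score": [90, 60, 85, 70]})
df.head(2)     # Pandas: first two rows
df.iloc[:2]    # equivalent via positional indexing

# Spark:
# sdf.limit(2)   # closest in spirit to SQL's LIMIT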

10) union. SQL has one more common query keyword, union, which also has counterparts in Pandas and Spark:

Pandas: concat and append. concat is a top-level Pandas function that vertically stacks two DataFrames, aligning them by column name, while append is a slimmed-down concat, similar to the append method of a Python list, that tacks one DataFrame onto the end of another (in recent Pandas versions append has been deprecated in favor of concat).

Spark: Spark borrows the SQL vocabulary directly, providing the two operators union and unionAll for vertically stitching two DataFrames. Note, however, that in Spark both behave like SQL's UNION ALL (duplicates are kept, and unionAll is just an older alias of union), so an explicit deduplication step is needed to reproduce SQL's UNION.
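
A sketch of the vertical stitching (invented data; sdf1 and sdf2 stand for hypothetical Spark DataFrames with matching schemas):

import pandas as pd

df1 = pd.DataFrame({"name": ["a"], "score": [90]})
df2 = pd.DataFrame({"name": ["b"], "score": [60]})

pd.concat([df1, df2], ignore_index=True)   # Pandas vertical union (UNION ALL semantics)

# Spark:
# sdf1.union(sdf2)                    # keeps duplicates, i.e. UNION ALL
# sdf1.union(sdf2).dropDuplicates()   # to reproduce SQL's deduplicating UNION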

03 Summary

This article has covered the keywords commonly used in a standard SQL query, focusing on the corresponding operations in Pandas and Spark. Broadly speaking, both frameworks can express every one of these SQL operations. Pandas offers richer interfaces with more flexible parameters, whereas Spark keeps a more unified interface but usually supports several overloaded forms of each method. Spark's operator naming also stays closer to SQL and its usage conventions are very similar, which makes it easier for anyone with a solid SQL background to pick up Spark quickly.

That concludes this comparison of common data query operations in SQL, Pandas and Spark. Thank you for reading.
