
The overall implementation Framework of sparkSQL


The purpose of this blog is to give readers who are new to the sparkSQL framework a general understanding of its overall structure, lower the barrier to entering the spark world, and keep them from feeling lost, not knowing what to learn or what to think about when they first come into contact with sparkSQL. It is also a summary of my own work, so that I can look back on it later. A series of more detailed introductions to sparkSQL will follow; take your time.

1. Module analysis of sql statements

When we write a query statement, it generally consists of three parts: the select part, the from data source part, and the where constraint part. The contents of these three parts have special names in sql:

As shown in the figure above, when we do the logical parsing of the sql we have written, it is divided into three modules: Projection, DataSource and Filter. When the execution part is generated, these are called the Result module, the DataSource module and the Operation module.
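To make this concrete, here is a minimal illustration (the people table and its columns are hypothetical) of which clause ends up in which module:

```scala
// SELECT -> Projection (Result module when executed)
// FROM   -> DataSource (DataSource module)
// WHERE  -> Filter     (Operation module)
val query =
  """SELECT name, age   -- Projection -> Result
    |FROM people        -- DataSource -> DataSource
    |WHERE age > 18     -- Filter     -> Operation
    |""".stripMargin
```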

So in a relational database, when we finish writing a query statement for execution, the process that occurs is shown in the following figure:

The whole execution process is: Query -> Parse -> Bind -> Optimize -> Execute

1. After the sql query statement is written, the query engine first parses it; this is the Parse process. Parsing splits the statement into the Projection, DataSource and Filter parts and builds a logical parse tree. During parsing the engine also checks the sql for errors, such as a referenced field that does not exist or a table that is not in the database. As soon as an error is found, parsing stops and an error is reported. When parsing completes successfully, we move on to the Bind process.

2. The Bind process. As the name suggests, this is a binding step. Why is it needed? Think about it from the perspective of implementing the software: if we had to build this sql query engine ourselves, what would we do? The strategy is to first split the sql query statement into its parts and parse them into a logical parse tree; after that, the engine needs to know which data table to read from, which fields are required, and what logic to execute. All of this information is kept in the database's data dictionary, so the Bind process is the process of binding the logical parse tree produced by Parse with the data dictionary. After binding, an execution tree is formed, and the program knows where the table is, which fields it needs, and so on.

3. After the Bind process, the database query engine produces several candidate query execution plans, along with statistics about each plan. Since several plans are available, they can be compared, and the database chooses the best one based on those statistics. This is the Optimize (optimization) process.

4. Once the optimal execution plan is chosen, only the last step remains: Execute. Note that the execution order differs from the parsing order, and knowing it is very helpful when writing and later optimizing sql. Execution first runs the where part, then reads the data table of the data source, and finally produces the select part, which is our final result. The order of execution is: Operation -> DataSource -> Result.

Although the above has nothing to do with sparkSQL itself, knowing it is very helpful for understanding sparkSQL.

2. The architecture of sparkSQL framework

To understand this framework clearly, we first need to figure out why we need sparkSQL at all. In general, for problems that plain sql can solve directly, I suggest not using sparkSQL; reaching for sparkSQL deliberately in those cases will not necessarily speed up development. The point of sparkSQL is to handle complex logic that sql alone cannot express, using the advantages of a programming language. The general flow of using sparkSQL is shown below:

As shown in the figure above, the flow is roughly: read the data into sparkSQL, process the data or implement the algorithm in sparkSQL, and then write the processed data out to the corresponding output source.

1. As before, let us think about this from the angle of what we would have to do and what we would need to consider if we were developing it ourselves.

a. The first question is: what data sources are there, and from which of them can we read data? sparkSQL now supports many data sources, such as hive data warehouses, json files, .txt files and orc files, and it can also fetch data from relational databases through jdbc. It is very powerful.
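As a minimal sketch, assuming the Spark 1.x SQLContext API that this article is written against (paths, the JDBC url and the table name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sources").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val jsonDF = sqlContext.read.json("/path/to/people.json")   // JSON files
val textDF = sqlContext.read.text("/path/to/log.txt")       // plain text, one "value" column
val jdbcDF = sqlContext.read.format("jdbc")                 // relational databases via JDBC
  .option("url", "jdbc:mysql://host:3306/db")
  .option("dbtable", "people")
  .option("user", "user")
  .option("password", "pass")
  .load()
// Hive tables and ORC files are read through HiveContext (see section 3).
```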

b. Another question to think about is how data types are mapped. When we read data from a database table, what is the mapping between the field types in the table structure and the data types of the programming language, for example the types in scala? sparkSQL has a mechanism for mapping the field types of data tables to programming-language data types. This will be introduced in detail later; for now just be aware that the problem exists.
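A quick way to see this mapping, reusing the hypothetical jsonDF from the sketch above; the schema in the comments is what such a people.json would typically produce:

```scala
// Inspect how file/table field types were mapped to Spark SQL types
jsonDF.printSchema()
// root
//  |-- age: long (nullable = true)     -> Scala Long
//  |-- name: string (nullable = true)  -> Scala String

// dtypes returns (columnName, sparkSqlTypeName) pairs
jsonDF.dtypes.foreach { case (col, dt) => println(s"$col -> $dt") }
```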

c. Once we have the data, how should sparkSQL organize it, what data structure is needed, and what operations can we perform on it? sparkSQL uses the DataFrame structure to organize the data it reads in. A DataFrame is in fact similar to a database table: the data is stored as rows, and there is also a schema, the equivalent of the table structure, which records which field each value in a row belongs to.
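A sketch of that idea, building a DataFrame by hand so that the Row data and the schema are both visible (it reuses the sc and sqlContext from earlier; the names and values are made up):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// The data is a collection of Row objects...
val rows = sc.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))

// ...and the schema plays the role of the table structure
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

val peopleDF = sqlContext.createDataFrame(rows, schema)
peopleDF.show()
```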

d. When the data has been processed, where do we put it, and in what format do we write it out? This is essentially the same question as a and b, seen from the output side.
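A sketch of the output side, mirroring the input sketch above; the formats, paths and connection details are placeholders:

```scala
// Write the processed DataFrame back out in different formats
peopleDF.write.mode("overwrite").parquet("/path/to/output/parquet")
peopleDF.write.mode("overwrite").json("/path/to/output/json")

// Or back to a relational database over JDBC
val props = new java.util.Properties()
props.setProperty("user", "user")
props.setProperty("password", "pass")
peopleDF.write.mode("append").jdbc("jdbc:mysql://host:3306/db", "people_out", props)
```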

2. sparkSQL's implementation logic for the above problems is also very clear from the figure above: it is mainly divided into two stages, and each stage is implemented by a specific class.

a. For the first stage, sparkSQL has two classes to solve these problems: HiveContext and SQLContext. hiveContext inherits all of SQLContext's methods and at the same time extends them, because there are some differences between hive and mysql queries. HiveContext is used only for reading data from the hive data warehouse, while SQLContext handles all the other data sources that sparkSQL supports. The granularity of these two classes is limited to reading and writing data and to table-level operations, such as reading in data, caching a table, releasing a cached table, registering a table, deleting a registered table, returning the structure of a table, and so on.
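A sketch of those table-level operations with SQLContext (HiveContext exposes the same methods, plus HiveQL support); it reuses the hypothetical peopleDF from earlier:

```scala
peopleDF.registerTempTable("people")        // register a table
sqlContext.cacheTable("people")             // cache the table
sqlContext.sql("SELECT name FROM people WHERE age > 18").show()
println(sqlContext.table("people").schema)  // return the structure of the table
sqlContext.uncacheTable("people")           // release the cached table
sqlContext.dropTempTable("people")          // delete the registered table
```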

b. sparkSQL processes the data it reads in using the methods provided by DataFrame, because once the data is read into sparkSQL it is of type DataFrame, with the data stored as Row objects. DataFrame provides many useful methods; I will talk about them later.
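A few of those DataFrame methods, continuing with the hypothetical peopleDF; the column names are assumptions carried over from the earlier sketches:

```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._

peopleDF
  .filter($"age" > 18)                  // the where part
  .select($"name", $"age")              // the select part
  .groupBy($"name")
  .agg(avg($"age").alias("avg_age"))    // an aggregation
  .show()
```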

c. Since spark 1.6, a data structure similar to DataFrame, the DataSet, has been added. It was introduced because DataFrame has a weak point: it can only handle data stored as Row, and only through the methods DataFrame provides, so only some of the operations available on an RDD can be used. The purpose of DataSet is to let us manipulate data in sparkSQL in the same way we manipulate an RDD.
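A sketch of the Spark 1.6 DataSet API, converting the hypothetical peopleDF into a typed Dataset so it can be manipulated with RDD-style lambdas:

```scala
// A case class gives typed access instead of untyped Row access
case class Person(name: String, age: Int)

import sqlContext.implicits._
val peopleDS = peopleDF.as[Person]            // DataFrame -> Dataset[Person]
val adults   = peopleDS.filter(_.age > 18)    // lambda, as on an RDD
adults.map(p => p.name.toUpperCase).show()
```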

d. There are a few other classes, but for now the three above are the most important in sparkSQL; the rest can be figured out gradually as you encounter them.

3. The operating principle of sparkSQL's hiveContext and SQLContext

The implementation principle of HiveContext and SQLContext is actually the same as the sql statement module parsing discussed in the first part; they follow the same logical process. There is already plenty of material online about this part, so I will simply summarize it here.

The overall process of sqlContext is shown in the following figure:

1. The SQL statement is parsed by SqlParse into an unresolved LogicalPlan.

2. The Analyzer binds it with the data dictionary (catalog) to generate a resolved LogicalPlan.

3. The Optimizer optimizes the resolved LogicalPlan to generate an optimized LogicalPlan.

4. SparkPlan is used to convert the LogicalPlan into a PhysicalPlan.

5. prepareForExecution() converts the PhysicalPlan into an executable physical plan.

6. execute() runs the executable physical plan.

7. A SchemaRDD is generated.

In the whole running process, several SparkSQL components are involved, such as SqlParse, analyzer, optimizer, SparkPlan and so on.
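These stages can be inspected on a real query through the DataFrame's queryExecution field in Spark 1.x; a sketch, reusing the hypothetical people data from earlier:

```scala
peopleDF.registerTempTable("people")
val q = sqlContext.sql("SELECT name FROM people WHERE age > 18")

println(q.queryExecution.logical)        // unresolved LogicalPlan, straight out of SqlParse
println(q.queryExecution.analyzed)       // resolved LogicalPlan, after binding with the catalog
println(q.queryExecution.optimizedPlan)  // optimized LogicalPlan
println(q.queryExecution.executedPlan)   // executable physical plan
q.explain(true)                          // prints all of the plans above in one call
```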

The overall process of hiveContext is shown in the following figure:

1. The SQL statement is parsed by HiveQl.parseSql into an unresolved LogicalPlan. During this parsing, getAst() is used to obtain the AST tree of the hiveql statement, which is then parsed.

2. The Analyzer binds it with the hive source metadata Metastore (the new catalog) to generate a resolved LogicalPlan.

3. The Optimizer optimizes the resolved LogicalPlan to generate an optimized LogicalPlan. Before optimization, ExtractPythonUdfs(catalog.PreInsertionCasts(catalog.CreateTables(analyzed))) is used for preprocessing.

4. hivePlanner is used to convert the LogicalPlan into a PhysicalPlan.

5. prepareForExecution() converts the PhysicalPlan into an executable physical plan.

6. execute() runs the executable physical plan.

7. After execution, the results are loaded into a SchemaRDD using map(_.copy).
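A brief sketch of the hiveContext path for comparison; src stands in for any table that already exists in the Hive metastore:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val hq = hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
hq.explain(true)               // shows the parsed, analyzed, optimized and physical plans
hq.collect().foreach(println)
```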
