This article explains how to extend Spark SQL parsing. The content is simple and clear and easy to follow; work through it to see what extending Spark SQL parsing involves.
Theoretical basis
ANTLR
ANTLR 4 is an open-source parser generator that produces a parser from a grammar rule file. Many popular applications and open-source projects rely on it; Hadoop, Hive, and Spark, for example, all use ANTLR for syntax analysis.
ANTLR grammar recognition is generally divided into two stages:
1. Lexical analysis stage
The corresponding program is called a lexer. It splits the input into symbols (tokens) and classifies them into symbol classes (token classes, or token types).
2. Parsing stage
From the token stream, the parser constructs a parse tree (also called a syntax tree).
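For example, with the expression grammar shown later in this article, the input a = 1 + 2 passes through the two stages roughly like this:

a = 1 + 2
  lexer : ID '=' INT '+' INT NEWLINE                  (token stream)
  parser: (stat a = (expr (expr 1) + (expr 2)) \n)    (parse tree, assign alternative)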
An ANTLR grammar file reads much like a circuit diagram: following a rule from entry to exit, each token is like a resistor, and the rule references are the wires connecting them.
Grammar file (*.g4)
The grammar snippet below defines two groups of rules: statements (printing an expression, assignment, and blank lines) and the expression definitions (arithmetic operations, literals, identifiers, and parenthesized expressions).
stat: expr NEWLINE                # printExpr
    | ID '=' expr NEWLINE         # assign
    | NEWLINE                     # blank
    ;

expr: expr op=('*'|'/') expr      # MulDiv
    | expr op=('+'|'-') expr      # AddSub
    | INT                         # int
    | ID                          # id
    | '(' expr ')'                # parens
    ;
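Note that because the MulDiv alternative is listed before AddSub, ANTLR gives * and / higher precedence when it eliminates the left recursion, so an input such as 1 + 2 * 3 parses as:

(expr (expr 1) + (expr (expr 2) * (expr 3)))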
Next, add the lexical definitions, and you have a complete grammar file.
Complete grammar file:
grammar LabeledExpr; // rename to distinguish from Expr.g4

prog: stat+ ;

stat: expr NEWLINE                # printExpr
    | ID '=' expr NEWLINE         # assign
    | NEWLINE                     # blank
    ;

expr: expr op=('*'|'/') expr      # MulDiv
    | expr op=('+'|'-') expr      # AddSub
    | INT                         # int
    | ID                          # id
    | '(' expr ')'                # parens
    ;

MUL : '*' ;            // assigns token name to '*' used above in grammar
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID  : [a-zA-Z]+ ;      // match identifiers
INT : [0-9]+ ;         // match integers
NEWLINE : '\r'? '\n' ; // return newlines to parser (end-of-statement signal)
WS  : [ \t]+ -> skip ; // toss out whitespace
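To try the grammar out, one possible workflow, assuming the antlr4 and grun shell aliases from the ANTLR book (wrappers around org.antlr.v4.Tool and org.antlr.v4.gui.TestRig):

antlr4 -visitor LabeledExpr.g4   # generate the lexer, parser, contexts, and a visitor interface
javac LabeledExpr*.java          # compile the generated sources
grun LabeledExpr prog -tree      # parse stdin and print the resulting parse tree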
SqlBase.g4
Spark's own grammar file is found in the catalyst module under sql (in Spark 2.x/3.x source trees: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4).
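The entry rule used by the test case at the end of this article is singleStatement; in recent Spark versions it reads roughly:

singleStatement
    : statement ';'* EOF
    ;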
Extended syntax definition
A normal SQL statement looks like SELECT t.id, t.name FROM t. Now let's add a JACKY expression after the select list, forming the statement:
SELECT t.id, t.name JACKY(2) FROM t
Let's first take a look at the normal grammar rules:
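The screenshot of those rules is not reproduced here; in Spark's SqlBase.g4 they look essentially like this:

namedExpression
    : expression (AS? (identifier | identifierList))?
    ;

namedExpressionSeq
    : namedExpression (',' namedExpression)*
    ;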
Now let's add a jackyExpression.
The jackyExpression rule itself matches the keyword JACKY followed by a number wrapped in parentheses.
Add JACKY as a token (for example, JACKY: 'JACKY'; alongside the other keyword tokens).
Modify the grammar file as follows:
jackyExpression
    : JACKY '(' number ')'    // the new expression
    ;

namedExpression
    : expression (AS? (identifier | identifierList))?
    ;

namedExpressionSeq
    : namedExpression (',' namedExpression | jackyExpression)*
    ;

Extended logical plan
After the above modification, you can test whether the grammar rules behave as expected. The parse tree (shown as a screenshot in the original post) confirms that jackyExpression is parsed normally. Note that the parser code must be regenerated after editing SqlBase.g4 (in Spark's build the antlr4-maven-plugin does this when the catalyst module is compiled) so that classes such as JackyExpressionContext exist.
Spark execution process
Here is the classic Spark SQL architecture diagram (not reproduced here); the Catalyst pipeline it depicts is sketched below.
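Roughly, as a plain-text reconstruction of the well-known Catalyst flow rather than the original image:

SQL text
  -- (ANTLR parser + AstBuilder) -->   Unresolved Logical Plan
  -- (Analyzer, using the Catalog) --> Resolved Logical Plan
  -- (Optimizer) -->                   Optimized Logical Plan
  -- (SparkPlanner) -->                Physical Plans -- (cost model) --> Selected Physical Plan
  -->                                  RDDs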
The SQL statement we enter is first parsed into an Unresolved Logical Plan; that step corresponds to the AstBuilder, which builds the plan by visiting the ANTLR parse tree.
Add a visitor method for the new rule to the AstBuilder:
override def visitJackyExpression(ctx: JackyExpressionContext): String = withOrigin(ctx) {
  println("this is astbuilder jacky = " + ctx.number().getText)
  this.jacky = ctx.number().getText.toInt
  ctx.number().getText
}
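For this to compile, the AstBuilder needs a jacky member to assign to. The original post does not show that declaration; a minimal sketch of what it presumably looks like:

// Presumed addition to AstBuilder: a backing field for the parsed JACKY value.
// Scala generates a jacky() accessor, which is what the Java test case below calls.
var jacky: Int = 0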
Then, where the named expression sequence is processed, add handling for jackyExpression:
// Expressions.
val expressions = Option(namedExpressionSeq).toSeq
  .flatMap(_.namedExpression.asScala)
  .map(typedVisit[Expression])

// jackyExpression handling
if (namedExpressionSeq().jackyExpression() != null &&
    namedExpressionSeq().jackyExpression().size() > 0) {
  visitJackyExpression(namedExpressionSeq().jackyExpression().get(0))
}
All right, that completes the logical-plan side. With the logical plan in place, you can add the corresponding handling to the subsequent physical plan (not yet explored here, Orz).
Test
Test case
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Case4 {
    public static void main(String[] args) {
        CharStream ca = CharStreams.fromString(
                "SELECT `b`.`id`, `b`.`class` JACKY(2) FROM `b` LIMIT 10");
        SqlBaseLexer lexer = new SqlBaseLexer(ca);
        SqlBaseParser sqlBaseParser = new SqlBaseParser(new CommonTokenStream(lexer));
        ParseTree parseTree = sqlBaseParser.singleStatement();
        AstBuilder astBuilder = new AstBuilder();
        astBuilder.visit(parseTree);
        System.out.println(parseTree.toStringTree(sqlBaseParser));
        System.out.println(astBuilder.jacky());
    }
}
Execution result (a screenshot in the original post): the parse tree string from toStringTree, followed by the extracted jacky value, 2.
Thank you for reading. That covers how to extend Spark SQL parsing; after working through this article you should have a deeper understanding of it, though the specifics still need to be verified in practice.