What is Antlr4?

Updated 2025-01-18. Source: SLTechnology News&Howtos (Shulou.com)

This article explains what Antlr4 is: how its grammar files are written and debugged, how parsers are generated from them, and how the resulting tree is traversed.

1. A brief introduction to Antlr4

Antlr4 (ANother Tool for Language Recognition) is an open-source parser generator written in Java. Given a grammar rule file, it generates the corresponding lexer and parser. It is widely used for building DSLs and for lexical and syntactic parsing of languages, and it appears in many popular tools. For example, CheckStyle builds its language-specific AST on top of Antlr to parse the Java syntax structure (the current Java Parser uses JavaCC to parse Java files, and reportedly there are plans to use Antlr in a later version), and the well-known Eclipse Xtext also uses Antlr.

Antlr can generate parsers for different targets (https://www.antlr.org/download.html), including Java, C++, JS, Python, C# and so on, which covers the development needs of different languages. At the time of writing, the latest stable version of Antlr was 4.9 (see the Antlr4 repository on GitHub). Grammars for dozens of languages are already available in the official grammars repository (https://github.com/antlr/grammars-v4). Although all of these grammars live in one repository, each language's grammar carries its own license, so check the license of the specific grammar before using it.

This article first introduces how Antlr4 grammars are defined (a brief look at the syntax, and how to debug with the IDEA Antlr4 plug-in), then shows how to generate the corresponding AST from an Antlr4 grammar, and finally introduces the two AST traversal modes of Antlr4: the visitor mode and the listener mode.

2. Antlr4 grammar rules

The following briefly introduces how to write part of an Antlr4 g4 (grammar) file, mainly with reference to Antlr4's official documentation (https://github.com/antlr/antlr4/blob/master/doc/index.md). The most effective way to learn Antlr4 grammar rules is to study existing grammars. Grammars already exist for dozens of languages, so if you need to define your own, you can start from the grammar of the language closest to yours.

2.1 Basic syntax and keywords of Antlr4 rules

If you have some background in C or Java, getting started with Antlr4 g4 grammars is quick. The main syntactic elements are:

Comments: identical to Java comments (and therefore also to C comments), including Javadoc-style comments.

Identifiers: follow the naming conventions of Java or C identifiers. Antlr distinguishes the two kinds of rule by the first letter: lexer rule (token) names start with an uppercase letter and are conventionally written in all uppercase, while parser rule names start with a lowercase letter, with camel case recommended.

Literals: there is no distinction between characters and strings; both are enclosed in single quotation marks. Although Antlr g4 supports Unicode (so Chinese characters can be used), it is recommended to stick to English where possible.

Actions: mainly @header and @members, used to define code that must be emitted into the generated target code. For example, @header can set the package declaration of the generated code, and @members can add extra fields and methods to the generated parser or lexer class.

In Antlr4 syntax, the supported keywords are: import, fragment, lexer, parser, grammar, returns, locals, throws, catch, finally, mode, options, tokens.

2.2 Introduction to Antlr4 grammar

2.2.1 The overall structure of a grammar file, with examples

The overall structure of an Antlr4 grammar file is as follows:

    /** Optional javadoc style comment */
    grammar Name;
    options {...}
    import ... ;

    tokens {...}
    channels {...} // lexer only
    @actionName {...}

    rule1 // parser and lexer rules, possibly intermingled
    ...
    ruleN

Generally, if the grammar is very complex, it is split into two files, one for the Lexer and one for the Parser (for example Java, see: https://github.com/antlr/grammars-v4/tree/master/java/java8). If the grammar is relatively simple, it can be written in a single file (for example Lua, see: https://github.com/antlr/grammars-v4/blob/master/lua/Lua.g4).

Let's look at how this works using some of the structures in Lua.g4. An Antlr4 grammar is shaped by the structure of the source code it describes: the grammar is built from the top down, following how a source file is written. For example, here is a portion of Lua.g4:

    chunk
        : block EOF
        ;

    block
        : stat* retstat?
        ;

    stat
        : ';'
        | varlist '=' explist
        | functioncall
        | label
        | 'break'
        | 'goto' NAME
        | 'do' block 'end'
        | 'while' exp 'do' block 'end'
        | 'repeat' block 'until' exp
        | 'if' exp 'then' block ('elseif' exp 'then' block)* ('else' block)? 'end'
        | 'for' NAME '=' exp ',' exp (',' exp)? 'do' block 'end'
        | 'for' namelist 'in' explist 'do' block 'end'
        | 'function' funcname funcbody
        | 'local' 'function' NAME funcbody
        | 'local' attnamelist ('=' explist)?
        ;

    attnamelist
        : NAME attrib (',' NAME attrib)*
        ;

In the grammar above, the whole file is a chunk, and a chunk is a block followed by the end-of-file marker (EOF). A block is a sequence of statements, and each statement has its own syntactic structure made up of expressions, keywords, variables, constants and so on. Expressions and variables are in turn defined recursively, and at the leaves of this recursion everything reduces to Lexer tokens.

The fragment above already shows how Antlr4 rules are written. Some of the more important rule features are described below.

2.2.2 Alternative labels

As shown in the code in section 2.2.1, stat covers many kinds of statement (variable definition, function definition, if, while, and so on), none of which is distinguished from the others. When processing the syntax tree this is unclear, because identifying a specific kind of statement requires inspecting many tokens. In this case the alternatives can be told apart with alternative labels, as shown in the following code:

    stat
        : ';'                                                   # emptyStat   // label added here: Antlr requires all alternatives of a rule to be labeled, or none
        | varlist '=' explist                                   # varListStat
        | functioncall                                          # functionCallStat
        | label                                                 # labelStat
        | 'break'                                               # breakStat
        | 'goto' NAME                                           # gotoStat
        | 'do' block 'end'                                      # doStat
        | 'while' exp 'do' block 'end'                          # whileStat
        | 'repeat' block 'until' exp                            # repeatStat
        | 'if' exp 'then' block ('elseif' exp 'then' block)* ('else' block)? 'end'   # ifStat
        | 'for' NAME '=' exp ',' exp (',' exp)? 'do' block 'end'                     # forStat
        | 'for' namelist 'in' explist 'do' block 'end'          # forInStat
        | 'function' funcname funcbody                          # functionDefStat
        | 'local' 'function' NAME funcbody                      # localFunctionDefStat
        | 'local' attnamelist ('=' explist)?                    # localVarListStat
        ;

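Adding a # label at the end of each alternative makes Antlr generate a separate context class and a separate visitor/listener method for it. For example, the alternative labeled whileStat gets its own WhileStatContext class and a visitWhileStat method in the generated visitor, so each kind of statement can be handled on its own.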

2.2.3 Operator precedence and associativity

By default, Antlr associates operators from left to right, but some operators, such as exponentiation, group from right to left. The assoc option lets you specify the associativity manually; in current Antlr4 versions it is placed at the start of the alternative. For example:

    expr: <assoc=right> expr '^' expr

Here ^ represents exponentiation, and <assoc=right> declares that the operator is right associative.

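Antlr4 also resolves the precedence of operators in a left-recursive rule for you, based on the order of the alternatives: an alternative listed earlier binds more tightly than one listed later. Common operators such as addition, subtraction, multiplication and division therefore need no special treatment beyond listing them in the intended order.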
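The effect of associativity is easiest to see on a concrete input. The following is a minimal sketch in plain Java, not code generated by Antlr: the class AssocDemo and its methods are hypothetical names introduced for illustration, and the input is assumed to consist only of non-negative integers separated by '^'. It evaluates the same expression once with left-to-right grouping and once with the right-to-left grouping that assoc=right requests.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; this is not code generated by Antlr.
class AssocDemo {

    // Split an input such as "2^3^2" into its integer operands.
    static List<Long> operands(String src) {
        List<Long> out = new ArrayList<>();
        for (String part : src.split("\\^")) {
            out.add(Long.parseLong(part.trim()));
        }
        return out;
    }

    // Integer power by repeated multiplication (exponent assumed >= 0).
    static long pow(long base, long exp) {
        long result = 1;
        for (long i = 0; i < exp; i++) {
            result *= base;
        }
        return result;
    }

    // Left-associative grouping: ((a ^ b) ^ c), the default for a binary alternative.
    static long evalLeft(String src) {
        List<Long> xs = operands(src);
        long acc = xs.get(0);
        for (int i = 1; i < xs.size(); i++) {
            acc = pow(acc, xs.get(i));
        }
        return acc;
    }

    // Right-associative grouping: (a ^ (b ^ c)), the grouping assoc=right asks for.
    static long evalRight(String src) {
        List<Long> xs = operands(src);
        long acc = xs.get(xs.size() - 1);
        for (int i = xs.size() - 2; i >= 0; i--) {
            acc = pow(xs.get(i), acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(evalLeft("2^3^2"));  // (2^3)^2 = 64
        System.out.println(evalRight("2^3^2")); // 2^(3^2) = 512
    }
}
```

For 2^3^2 the two groupings give 64 and 512 respectively, which is why the grammar has to state the associativity explicitly for operators like exponentiation.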

2.2.4 Hidden channel

Much information, such as comments and whitespace, is not needed to produce the parsing result, but discarding it outright is not always appropriate. A safe way to make the parser ignore comments and whitespace is to send those tokens to a "hidden channel": the parser only tunes in to a single channel, so anything we like can be passed along on the other channels. In Lua.g4, this is handled as follows:

    COMMENT
        : '--[' NESTED_STR ']' -> channel(HIDDEN)
        ;

    LINE_COMMENT
        : '--'
        (                                               // --
        | '[' '='*                                      // --[==
        | '[' '='* ~('='|'['|'\r'|'\n') ~('\r'|'\n')*   // --[==AA
        | ~('['|'\r'|'\n') ~('\r'|'\n')*                // --AAA
        ) ('\r\n'|'\r'|'\n'|EOF)
        -> channel(HIDDEN)
        ;

    WS
        : [ \t\u000C\r\n]+ -> skip
        ;

    SHEBANG
        : '#' '!' ~('\n'|'\r')* -> channel(HIDDEN)
        ;

Tokens sent to channel(HIDDEN) are not seen by the parser, but they stay in the token stream and can still be obtained by traversing the tokens. Tokens matched by a rule marked skip, by contrast, are discarded entirely.
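The idea of channels can be sketched without the Antlr runtime. The following plain-Java example is a hand-written stand-in, not the Antlr API: ChannelDemo, Tok, lex and onChannel are hypothetical names, and the "lexer" is deliberately simplistic (whitespace-separated words, with "--" starting a Lua-style line comment). It shows that a consumer reading only the default channel never sees the comment, while the comment is still available to anyone who walks the full token list.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for token channels; this is not the Antlr runtime API.
class ChannelDemo {
    static final int DEFAULT = 0; // channel the "parser" reads
    static final int HIDDEN = 1;  // channel for comments

    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Whitespace is dropped (like "-> skip"); a "--" line comment is kept,
    // but on the hidden channel (like "-> channel(HIDDEN)").
    static List<Tok> lex(String src) {
        List<Tok> toks = new ArrayList<>();
        for (String line : src.split("\n")) {
            int c = line.indexOf("--");
            String code = c >= 0 ? line.substring(0, c) : line;
            for (String word : code.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    toks.add(new Tok(word, DEFAULT));
                }
            }
            if (c >= 0) {
                toks.add(new Tok(line.substring(c), HIDDEN));
            }
        }
        return toks;
    }

    // Return the text of all tokens on one channel, in source order.
    static List<String> onChannel(List<Tok> toks, int channel) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            if (t.channel == channel) {
                out.add(t.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = lex("x = 1 -- set x\ny = 2");
        System.out.println(onChannel(toks, DEFAULT)); // [x, =, 1, y, =, 2]
        System.out.println(onChannel(toks, HIDDEN));  // [-- set x]
    }
}
```

A static analysis tool that needs comment contents would read the hidden channel in the same spirit, at the cost of an extra pass over the tokens.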

2.2.5 Common lexical structures

Antlr4 uses an EBNF-style notation: '|' separates alternatives, '*' matches the preceding element zero or more times, '+' matches it at least once, and '?' makes it optional. Here are several common lexical examples (all from the Lua.g4 file):

1) Comments

    COMMENT
        : '--[' NESTED_STR ']' -> channel(HIDDEN)
        ;

    LINE_COMMENT
        : '--'
        (                                               // --
        | '[' '='*                                      // --[==
        | '[' '='* ~('='|'['|'\r'|'\n') ~('\r'|'\n')*   // --[==AA
        | ~('['|'\r'|'\n') ~('\r'|'\n')*                // --AAA
        ) ('\r\n'|'\r'|'\n'|EOF)
        -> channel(HIDDEN)
        ;

2) Numbers

    INT   : Digit+ ;
    Digit : [0-9] ;

3) Identifiers (names)

    NAME : [a-zA-Z_][a-zA-Z_0-9]* ;

3. Debugging Antlr4 grammar rules in IDEA (grammar visualization)

To install the Antlr4 plug-in in IDEA, select File -> Settings -> Plugins and search for Antlr in the search box. Installing the latest version found is fine. The following figure shows the newly installed ANTLR v4 plug-in, version v1.15, which supports Antlr 4.9.

General steps to debug an Antlr4 grammar in IDEA:

1) Create a debugging project and a g4 file

Here I test with Java, so I created a Maven project and put the g4 file in the src/main/resources directory, named Test.g4.

2) Write a simple grammar

Here we write a grammar for expressions with addition, subtraction, multiplication and division, then right-click the rule to be tested and select the test entry:

As shown in the figure above, expr at this point only describes a multiplication, so we test it with a multiplication expression as follows:

However, if the input is changed to an addition, it cannot be recognized; only the first number is recognized.

In this case, extend the definition of expr with further alternatives so that it supports the other operations, as follows:

You can keep adding support for other constructs in the same way, completing the grammar of the whole language step by step. The complete grammar at this point is as follows (addition, subtraction, multiplication and division of integers):

    grammar Test;

    @header {
    package zmj.test.antlr4.parser;
    }

    stmt : expr ;

    expr
        : expr NUL expr   # Mul
        | expr ADD expr   # Add
        | expr DIV expr   # Div
        | expr MIN expr   # Min
        | INT             # Int
        ;

    NUL : '*' ;
    ADD : '+' ;
    DIV : '/' ;
    MIN : '-' ;

    INT   : Digit+ ;
    Digit : [0-9] ;

    WS      : [ \t\u000C\r\n]+ -> skip ;
    SHEBANG : '#' '!' ~('\n'|'\r')* -> channel(HIDDEN) ;

4. Generating and traversing the AST with Antlr4

4.1 Generating source code files

This step introduces two ways of generating the parsing code, for reference:

Automatic generation with the Antlr4 Maven plug-in (for Java projects; Gradle can be used in a similar way)

Configure the Antlr4 Maven plug-in in pom.xml, and the required code is generated automatically by running mvn generate-sources (see: https://www.antlr.org/api/maven-plugin/latest/antlr4-mojo.html). The main benefit is that the generated files do not have to be committed to the repository, which reduces redundancy: the repository contains only the code you wrote yourself, with no generated code, so the generated code never needs clean-code rework. Here is an example:

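    <plugin>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-maven-plugin</artifactId>
      <version>4.3</version>
      <executions>
        <execution>
          <id>antlr</id>
          <goals>
            <goal>antlr4</goal>
          </goals>
          <phase>generate-sources</phase>
        </execution>
      </executions>
      <configuration>
        <sourceDirectory>${basedir}/src/main/resources</sourceDirectory>
        <outputDirectory>${project.build.directory}/generated-sources/antlr4/zmj/test/antlr4/parser</outputDirectory>
        <!-- The names of the next three elements were lost in the source text;
             listener, visitor and treatWarningsAsErrors are assumed here. -->
        <listener>true</listener>
        <visitor>true</visitor>
        <treatWarningsAsErrors>true</treatWarningsAsErrors>
      </configuration>
    </plugin>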

After following the above settings, you only need to execute mvn generate-sources to automatically generate code in the maven project.

Command line mode

The main reference is https://www.antlr.org/download.html, which describes the setup for each target language. Here we download the complete Antlr4 jar:

After downloading antlr-4.9-complete.jar, the following command generates the required code:

    java -jar antlr-4.9-complete.jar -Dlanguage=Python3 -visitor Test.g4

This generates source code for the Python3 target. The supported targets are listed at the link above. If you do not want a Listener to be generated, add the -no-listener option.

4.2 Traversing the Antlr4 syntax tree with the visitor pattern

Antlr4 supports two design patterns for AST traversal: the visitor pattern and the listener pattern.

With the visitor pattern, we define our own visiting logic for the AST (for an introduction to the visitor design pattern, see https://xie.infoq.cn/article/5f80da3c014fd69f8dbe09b28). The following code shows the visitor pattern in Antlr4 directly (based on the example in Chapter 3):

    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;
    import zmj.test.antlr4.parser.TestBaseVisitor;
    import zmj.test.antlr4.parser.TestLexer;
    import zmj.test.antlr4.parser.TestParser;

    public class App {
        public static void main(String[] args) {
            // Input reconstructed from the printed output: an addition whose left operand is 12*2
            CharStream input = CharStreams.fromString("12*2+12");
            TestLexer lexer = new TestLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            TestParser parser = new TestParser(tokens);
            // Parse starting from the expr rule
            TestParser.ExprContext tree = parser.expr();
            TestVisitor tv = new TestVisitor();
            tv.visit(tree);
        }

        // Visitor that only reacts to the alternative labeled # Add
        static class TestVisitor extends TestBaseVisitor<Void> {
            @Override
            public Void visitAdd(TestParser.AddContext ctx) {
                System.out.println("= test add");
                System.out.println("first arg: " + ctx.expr(0).getText());
                System.out.println("second arg: " + ctx.expr(1).getText());
                // Continue into the children with the default behavior
                return super.visitAdd(ctx);
            }
        }
    }

In the main method above, the expression is parsed into its AST. The source also defines a visitor, TestVisitor, which overrides the visit method for AddContext and prints the two operand expressions on either side of the plus sign. The output of the example is as follows:

    = test add
    first arg: 12*2
    second arg: 12

4.3 Listener mode (observer mode)

In listener mode, a listener is attached to an object, and when a specific event occurs on that object the listener's behavior is triggered. For example, a monitor (the listener) watches a gate (the event object) and raises an alarm (the triggered behavior) when an intruder appears (the event source).

To use listener mode in Antlr4, you first write a listener that can react to the different events (such as entering and exiting a node) on each kind of AST node (expressions, statements, and so on). At run time, Antlr4 walks the generated AST with ParseTreeWalker, and whenever the walk reaches a node and performs one of these events, the corresponding listener method is called.

Listener methods do not return a value (that is, the return type is void). Therefore, an additional data structure (either through Map or stack) is needed to store the results of the current calculation for the next calculation call.
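The need for an extra data structure can be sketched in plain Java. The example below is a hand-written miniature of the listener idea, not Antlr-generated code: Node, Listener, walk and EvalListener are hypothetical names, and the tree is built by hand rather than by a parser. Because the callbacks return void, the listener keeps intermediate results on an explicit stack, and the walker alone decides the traversal order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hand-written miniature of the listener idea; these are not Antlr-generated classes.
class ListenerDemo {

    // A tiny tree: either an integer leaf (op == 0) or a binary node with '+' or '*'.
    static final class Node {
        final char op;
        final int value; // used only by leaves
        final Node left;
        final Node right;

        Node(int value) {
            this.op = 0;
            this.value = value;
            this.left = null;
            this.right = null;
        }

        Node(char op, Node left, Node right) {
            this.op = op;
            this.value = 0;
            this.left = left;
            this.right = right;
        }
    }

    interface Listener {
        void enter(Node n);
        void exit(Node n);
    }

    // The walker, not the listener, controls the order: depth-first, left to right.
    static void walk(Node n, Listener l) {
        l.enter(n);
        if (n.op != 0) {
            walk(n.left, l);
            walk(n.right, l);
        }
        l.exit(n);
    }

    // Callbacks return void, so results are passed between nodes through a stack.
    static final class EvalListener implements Listener {
        final Deque<Integer> stack = new ArrayDeque<>();

        public void enter(Node n) {
            // Nothing to do on entry in this sketch.
        }

        public void exit(Node n) {
            if (n.op == 0) {
                stack.push(n.value);
                return;
            }
            int right = stack.pop(); // the right operand was pushed last
            int left = stack.pop();
            stack.push(n.op == '+' ? left + right : left * right);
        }
    }

    static int eval(Node root) {
        EvalListener l = new EvalListener();
        walk(root, l);
        return l.stack.pop();
    }

    public static void main(String[] args) {
        // Tree for 12 * 2 + 12
        Node tree = new Node('+', new Node('*', new Node(12), new Node(2)), new Node(12));
        System.out.println(eval(tree)); // 36
    }
}
```

With a visitor, the same evaluation could simply return the value from each visit call, which is one reason the visitor mode is often more convenient when data has to flow between nodes.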

Generally speaking, for static analysis of programs we use the visitor mode and seldom the listener mode: with a listener we cannot actively control the order in which the AST is traversed, and passing data between different nodes is inconvenient. For that reason this article does not cover the listener pattern in detail; if you are interested, you can look it up and experiment yourself.

5. Lexical and grammatical parsing of Antlr4

This part is actually the most basic content of Antlr4. It is placed last for a specific purpose: to explore the boundary between lexical parsing and syntactic parsing, and to discuss how to work with the results Antlr4 produces.

5.1 Antlr4 execution phases

The grammar definitions seen earlier are divided into Lexer and Parser rules, which correspond to two different phases:

Lexical analysis phase: driven by the lexical rules defined for the Lexer; the result is a stream of tokens.

Parsing phase: builds a parse tree or syntax tree from the tokens produced by lexical analysis.

As shown in the following figure:

5.2 Balancing lexical and syntactic parsing

As a general rule, syntactic parsing costs more than lexical parsing, so as much work as possible should be done in the lexical phase to reduce the cost of the syntactic phase. The main examples are:

Merge token types that the language does not distinguish. For example, some languages (such as JS) do not distinguish int from double and only have number, so the lexer does not need to separate int and double and can merge them into a single number token.

Whitespace, comments and similar information contribute little to syntactic parsing and can be removed during lexical analysis.

Common tokens such as identifiers, keywords, strings and numbers should be recognized during lexical analysis, not in the syntactic phase.

However, besides saving parsing cost, these choices also have consequences for us:

Even if the language itself is not type-sensitive (for example, only number, with no int and double), static code analysis may need to know the exact type in order to detect specific defects.

Although comments contribute little to the code itself, we sometimes need to analyze their contents. If they are not available during syntactic parsing, we have to traverse the tokens to find them, which adds overhead to static code analysis.

...

How to deal with such problems?

5.3 Parse tree vs syntax tree

In most material, the tree structure generated by Antlr4 is called either a parse tree or a syntax tree. Looking closely, parse tree is the more accurate term: the result Antlr4 produces reflects only a straightforward parse of the grammar, whereas a syntax tree should carry information that reflects the language's syntactic and semantic characteristics. The information raised in the questions above, for instance, is hard to obtain from the parse tree generated by Antlr4.

For this reason many tools wrap Antlr4 and process its result further to obtain a richer syntax tree, CheckStyle being one example. If you only need simple parsing of a language, you can develop directly on the Antlr4 result. For more in-depth processing, the Antlr4 result needs to be transformed further into a form that better matches how we work (for example, an AST in Java Parser format or an AST in Clang format) before building on it.
