How to implement a small Compiler based on JS 07/06 Update SLTechnology News&Howtos

How to implement a small Compiler based on JS

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)05/31 Report--

This article mainly explains "how to achieve a small compiler based on JS". The content of the article is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought to study and learn how to achieve a small compiler based on JS.

Preface

The-super-tiny-compiler is a super small compiler with less than 200 lines of code, but through this compiler you can learn the most basic compile principles, including babel is also developed on this principle.

The example of the repository itself is that a set of lisp-style function syntax is compiled into C-style function syntax, for example:

LISP style C style 2 (add 2 2) add (2 subtract 2) 4-2 (subtract 42) substract (4, 2) 2 + (4-2) (add 2 (substract 42)) add (2, (substract (4, 2)

This is probably what compiler needs to do this time. The syntax may not seem complete, but it can also demonstrate the main ideas of modern compilers.

Most Compilers divides the compilation process into three main processes: parse, transform, and code generate:

Parse is mainly about transforming the source code into a more abstract code expression.

Transform is to do whatever compiler wants to do with the abstract expression above.

Code generate generates a new kind of code from the code processed by transform

Parse

Parse is mainly divided into two steps: lexical analysis and grammatical analysis.

Lexical analysis divides the source code into pieces according to expression. Tokens,tokens is a group of objects used to describe individual grammars, such as numbers, labels, punctuation, operators, etc.

Syntax analysis is to rearrange the tokens generated by lexical analysis to get a structure called Abstract Syntax Tree (AST). AST is a nested tree structure that is easy to use and can display complete information.

For example, the tokens generated after the previously mentioned add 2 (subtract 4 2) expression is processed by lexical analysis is about:

[{type: 'paren', value:' ('}, {type: 'name', value:' add'}, {type: 'number', value:' 2'}, {type: 'paren', value:' ('}, {type: 'name', value:' subtract'}, {type: 'number', value:' 4'} {type: 'number', value:' 2'}, {type: 'paren', value:')}, {type: 'paren', value:')'}]

The AST structure processed by parsing is as follows:

{type: 'Program', body: [{type:' CallExpression', name: 'add', params: [{type:' NumberLiteral', value: '2clients,}, {type:' CallExpression', name: 'subtract' Params: [{type: 'NumberLiteral', value:' 4percent,}, {type: 'NumberLiteral', value:' 2percent,}]}]} Transform

Transform mainly takes the abstract syntax tree obtained by parse and makes some changes based on it. The tranform process can either modify ast based on the style of the current language or use a new language style.

The following shows the workflow of the transform process based on the previous ast structure.

As you can see, the elements in the ast look very similar. These elements make up the child nodes of the ast, whose data structure types describe a separate part of the code (such as variables, declaration statements, expressions, and so on).

For example, the node mentioned above is of type NumberLiteral:

{type: 'NumberLiteral', value:' 2clients,}

Or a more complex type of child node:

{type: 'CallExpression', name:' substract', params: [... nested nodes here...],}

In the transfrom process, we can add / delete / modify the abstract syntax tree nodes, or we can directly create a new one based on the current abstract syntax tree.

Since our goal here is to convert the input code into a new language-style code (Lisp-> C), we will create a new AST for the new language.

So here we need to clarify the operation of modifying AST:

Traversal (ergodic)

In order to handle all nodes, we can traverse them with a depth-first search:

{type: 'Program', body: [{type:' CallExpression', name: 'add', params: [{type:' NumberLiteral', value:'2'}, {type: 'CallExpression', name:' subtract', params: [{type: 'NumberLiteral', value:' 4'}, {type: 'NumberLiteral' Value:'2'}]}]]}

The traversal process goes like this:

Program-start from the top node of AST

Call_Expression (add)-the first child element of Program

NumberLiteral (2)-the first child element of Call_Expression (add)

Call_Expression (subtract)-the second child element of Call_Expression (add)

NumberLiteral (4)-the first child element of Call_Expression (subtract)

NumberLiteral (2)-the second child element of Call_Expression (subtract)

If you operate directly within the ast instead of generating a new ast, you may need to introduce all kinds of abstractions. For now, however, the method of accessing all nodes is sufficient.

The word visiting represents a pattern for manipulating elements within the object structure.

Visitors (access)

Here we can create a visitor object that includes methods for receiving different nodes.

For example:

Var visitor = {NumberLiteral () {}, Call_Expression () {}}

So when we iterate through the ast, if we match the corresponding type node, we can call the method in the visitor to handle it.

Code generate

The last phase of Compiler is generate, and what you do at this stage may overlap with transformation, but the most important part of code generation is to output code according to AST.

Generate works in several different ways. Some Compilers reuse the previously generated token, while others create independent code representations to make it easier to output code linearly, but then we focus on using the previously generated AST.

Our generator needs to know how to print all types of nodes in AST, and then recursively call itself, knowing that all the code is printed into a long string.

Code implementation

These are all parts of Compiler, but not all Compiler are like this. Different compiler has different purposes, so different steps may be required.

Let's start writing the code:

Lexical analyzer (tokenizer)

According to the previous theoretical analysis, we first carry out the lexical analyzer (tokenizer) in this stage of parser.

This function takes a string and splits it into an array of token:

Ex:

(add 2 (substract 4 2)) = > [{type: 'paren', value:' ('},...]

So you can write a function like this:

Const tokenizer = (input) = > {const tokens = []; let current = 0; while (current)

< input.length) { let char = input[current]; if (char === '(') { tokens.push({ type: 'paren', value: '(', }) current++; continue; } if (char === ')') { tokens.push({ type: 'paren', value: ')', }) current ++; continue; } // 空格目前不需要处理 if (/\s/.test(char)) { current++; continue; } // 处理数字 if (/[0-9]/.test(char)) { let value = ''; // 一直遍历直到遇到非数字的字符 while (/[0-9]/.test(char)) { value += char; char = input[++current]; } tokens.push({ type: 'number', value, }) continue; } // 处理字符串 if(/[a-z]/i.test(char)) { let value = ''; while(/[a-z]/i.test(char)) { value += char; char = input[++current]; } tokens.push({ type: 'name', value, }); continue; } // 如果存在匹配不上的 token 就抛错 throw new Error(`Unknown token: ${char}`); } return tokens;}语法分析器(parser) 词法分析器接收语法分析得到的 token 数组，然后将其转换成 AST 结构。例如: [{ type: 'paren', value: '(' }, ...] =>

{type: 'Program', body: [...]}

Const parser = (tokens) = > {let current = 0; const walk = () = > {const token = tokens [current]; / / if it is a node of type number, return a new ast node to if (token.type = = 'number') {current++ Return {type: 'NumberLiteral', value: token.value,}} / / next check the CallExpression type, starting with the left parenthesis if (token.type =' paren' & & token.value = ='(') {/ / skip the left parenthesis token = tokens [+ + current] / / first a root node of CallExpression is created, and then name is set to the current token.value / / because the token after the left parenthesis must be the function name const node = {type: 'CallExpression', name: token.value, params: [],}; / / jump again here the function name token = tokens [+ + current] / / iterate through each of the next token Until you encounter the right parentheses / / these token will become `params` / / in lisp-style code: (add 2 (substract 4 2)) / * there will be many parentheses in token * [{type: 'paren', value:' ('}, {type: 'name', value:' add'} {type: 'number', value:' 2'}, {type: 'paren', value:' ('}, {type: 'name', value:' subtract'}, {type: 'number', value:' 4'}, {type: 'number', value:' 2'} {type: 'paren', value:')'}, {const method = visitor [node.type] If (method) {method (node, parent);} / / A pair of different nodes separately processes switch (node.type) {case 'Program': traverseArray (node.body, node); break; case' CallExpression': traverseArray (node.params, node); break / / in this case, there are no child nodes. Break directly case 'NumberLiteral': break; default: throw new Error (`Unknown node type: ${node.type} `);}} traverseNode (ast, null);} Converter (transformer)

The converter is used in conjunction with the above traversal, which receives the previously built ast and passes it into the traversal along with the visitor to get a brand new AST.

The original AST structure is (add 2 (subtract 4 2)):

The AST structure generated after the transformation is (add (2, subtract (4, 2)):

{type: 'Program', body: [{type:' ExpressionStatement', expression: {type: 'CallExpression', callee: {type:' Identifier', name: 'add',}, arguments: [{type:' NumberLiteral', value: '2percent,}, {type:' CallExpression' Callee: {type: 'Identifier', name:' subtract',}, arguments: [{type: 'NumberLiteral', value:' 4percent,}, {type: 'NumberLiteral', value:' 2percent,}]}

Next, we can write the corresponding converter code as follows:

Const transformer = (ast) = > {const newAst = {type: 'Program', body: [],} ast._context = newAst.body; traverser (ast, {/ / processing NumberLiteral type NumberLiteral: (node, parent) = > {parent._context.push ({type:' NumberLiteral', value: node.value,})) }, / / handle CallExpression type CallExpression: (node, parent) = > {/ / initialize a new node of CallExpression / / put a nested Identifier node const expression = {type: 'CallExpression', callee: {type:' Identifier', name: node.name}, arguments: [],} Node._context = expression.arguments / / see if the parent node is CallExpression if (parent.type! = = 'CallExpression') {/ / if not, wrap the CallExpression node in a statement node called `ExpressionStatement` / / do so because top level's CallExpression can also be treated as a declaration statement expression = {type:' ExpressionStatement', expression,} in JavaScript } / / finally we put CallExpression into the context of the parent node parent._context.push (expression);}}); return newAst;} code generator (code generator)

The code generator is also a recursive function that finally prints each node in the AST to a large string:

Const codeGenerator = (node) = > {switch (node.type) {/ / if it is Program, it traverses each node in the 'body' attribute / / and recursively codeGenerator those nodes, and prints the result into a new line case' Program': return node.body.map (codeGenerator) .join ('\ n') / / Recursive call to expression attribute for ExpressionStatement with a semicolon case 'ExpressionStatement': return `${codeGenerator (node.expression)};` / / for CallExpression to make a recursive call to the callee attribute, and then add (parentheses / / and then make a recursive call to the arguments attribute, plus) parentheses case 'CallExpression': return `${codeGenerator (node.callee)} (${node.arguments.map (codeGenerator) .join (',')})`; / / for Identifier, return name case 'Identifier': return node.name directly / / for NumberLiteral, directly return value case 'NumberLiteral': return node.value; default: throw new Error (`Unknown node type: ${node.type} `);}} compiler (Compiler)

At this point, basically all the processes are complete, and we can create a compiler function, and the entire compiler can be done by calling the above function:

Input = > tokenizer = > tokens

Tokens = > parser = > ast

Ast = > transformer = > newAst

NewAst = > generator = > output

The code only needs the following simple steps:

Const compiler = (input) = > {const tokens = tokenizer (input); const ast = parser (tokens); const newAst = transformer (ast); return codeGenerator (newAst);}

We can enter the previous sets of test examples to ensure that the results are correct.

Thank you for your reading, the above is the content of "how to achieve a small compiler based on JS". After the study of this article, I believe you have a deeper understanding of how to achieve a small compiler based on JS, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.