What is the lexical grammar of JavaScript?

2025-07-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article explains in detail the lexical elements (the "words") of JavaScript and is shared here as a reference. After reading it you should have a working understanding of how JS source text is divided into input elements.

InputElement input element

Input elements are the most basic elements obtained by the JS lexical scanner, that is, "words" that express a specific meaning in the source code of the JS program.

There are four types of input elements:

InputElement ::
    WhiteSpace
    Comment
    Token
    LineTerminator

It is worth noting that the JS specification actually defines two kinds of InputElement:

InputElementDiv ::
    WhiteSpace
    Comment
    Token
    LineTerminator
    DivPunctuator

InputElementRegExp ::
    WhiteSpace
    Comment
    Token
    LineTerminator
    RegularExpressionLiteral

This is because the division operator and regular expression literals of JS both start with the / character, and the two cannot be told apart by lexical analysis alone. The lexical analyzer of JS therefore has two states, one scanning for InputElementDiv and the other scanning for InputElementRegExp. The state must be set by the parser, so the lexical analysis and syntax analysis of JavaScript have to be interleaved.
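The two lexer states can be sketched roughly as follows. The function name `scanSlash`, the goal names "div" and "regexp", and the simplified pattern are assumptions made for illustration only; they are not taken from the specification or from any engine. The point is that the same / produces a different input element depending on the goal the parser has selected.

```javascript
// Simplified pattern for a RegularExpressionLiteral (an assumption that is
// good enough for illustration, not a full implementation of the grammar).
const REGEXP_LITERAL =
  /\/(?![*\/])(?:[^\\\/\[\n\r\u2028\u2029]|\\.|\[(?:[^\]\\\n\r\u2028\u2029]|\\.)*\])+\/[A-Za-z0-9_$]*/y;

// Hypothetical helper: scan the input element that starts with "/" at `pos`.
// `goal` is chosen by the parser: "div" or "regexp". Comments are ignored here.
function scanSlash(source, pos, goal) {
  if (goal === "regexp") {
    REGEXP_LITERAL.lastIndex = pos;
    const match = REGEXP_LITERAL.exec(source);
    if (match) {
      return { type: "RegularExpressionLiteral", text: match[0] };
    }
    return null; // no regular expression literal starts at this position
  }
  // InputElementDiv: the only punctuators starting with "/" are "/" and "/=".
  const text = source.startsWith("/=", pos) ? "/=" : "/";
  return { type: "DivPunctuator", text };
}

const scannedAsRegExp = scanSlash("/a/g", 0, "regexp");
const scannedAsDivision = scanSlash("/a/g", 0, "div");
console.log(scannedAsRegExp.text);   // "/a/g"
console.log(scannedAsDivision.text); // "/"
```

A real lexer would receive the goal from the parser at every token boundary; here it is simply passed in by hand.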


The following example illustrates the conflict between division and regular expression syntax:

    if (a + b) /a/g;
    (a + b) /a/g;

Both lines contain exactly the same /a/g (and the characters before it are the same too), yet it may be either a division or a regular expression. Telling them apart requires the grammatical context, so lexical analysis alone cannot decide which reading to use.
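A small check of the two readings, as a sketch: the variable values are arbitrary assumptions, and direct `eval` is used only so that both statements can be evaluated and their completion values inspected.

```javascript
const a = 6, b = 2, g = 3;

// After the if-condition a new statement begins, so /a/g is scanned as a
// regular expression literal; it is the completion value of the statement.
const fromIfStatement = eval("if (a + b) /a/g;");

// After the complete expression (a + b), "/" can only be division:
// the text is read as ((a + b) / a) / g.
const fromExpression = eval("(a + b) /a/g;");

console.log(fromIfStatement instanceof RegExp); // true
console.log(typeof fromExpression);             // "number"
```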

Because almost no editing environment performs a full syntactic parse of the text, this problem also causes many syntax highlighting systems to handle JS regular expressions poorly.

For readers who are not language implementers, the lexical grammar of JS can be understood according to the first production above (InputElement).

WhiteSpace white space character

Without going into detail, all JS programmers are familiar with this term. JavaScript accepts five characters from the Latin-1 range as white space; in addition, the BOM and every character in the Unicode "Separator, space" category can be used as white space:

WhiteSpace ::
    <TAB>
    <VT>
    <FF>
    <SP>
    <NBSP>
    <BOM>
    <USP>

<TAB> is U+0009, the tab character used for indentation, written as '\t' in a string.

<VT> is U+000B, the vertical tab character '\v'. It is hard to type on a keyboard, so it is rarely used.

<FF> is U+000C, Form Feed (page break), written as '\f' in a string literal. Source programs are rarely printed nowadays, so this character is rarely used in JS source code.

<SP> is U+0020, the most common space.

<NBSP> is U+00A0, the no-break space. It is a variant of <SP>: in typeset text it prevents a line break at the space, and in every other respect it behaves like an ordinary space. Most JS editing environments treat it as a normal space (source code editors generally do not wrap lines automatically anyway).

<BOM> is U+FEFF, newly added as white space in ES5. It is the zero width no-break space in Unicode. Documents in a UTF encoding often carry an extra U+FEFF at the beginning of the file, and a program that reads the document can guess which UTF encoding is in use from how the U+FEFF is represented. This character is also called the "byte order mark".

<USP> represents all characters in the "Separator, space (Zs)" category in Unicode, including:

Character   Name
U+0020      SPACE
U+00A0      NO-BREAK SPACE
U+1680      OGHAM SPACE MARK
U+180E      MONGOLIAN VOWEL SEPARATOR
U+2000      EN QUAD
U+2001      EM QUAD
U+2002      EN SPACE
U+2003      EM SPACE
U+2004      THREE-PER-EM SPACE
U+2005      FOUR-PER-EM SPACE
U+2006      SIX-PER-EM SPACE
U+2007      FIGURE SPACE
U+2008      PUNCTUATION SPACE
U+2009      THIN SPACE
U+200A      HAIR SPACE
U+202F      NARROW NO-BREAK SPACE
U+205F      MEDIUM MATHEMATICAL SPACE
U+3000      IDEOGRAPHIC SPACE

Note that although the JS specification accepts all of these characters as white space, they should be avoided as far as possible unless the source code has special printing or typesetting requirements, especially since a considerable number of fonts cannot display every character in <USP>.
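As a quick illustration, the sketch below separates tokens with several of these unusual white space characters. They are written as escapes inside a string so that they stay visible, and `new Function` is used only as a convenient way to parse that string as code.

```javascript
// Each gap is a different white space character: NBSP, IDEOGRAPHIC SPACE,
// BOM, EM SPACE and VT. Inside the string they become the real characters.
const exoticSource = "var\u00A0x\u3000=\uFEFF1\u2003+\u000B2; return x;";

// new Function parses the string as a function body and runs it.
const exoticResult = new Function(exoticSource)();
console.log(exoticResult); // 3
```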

Depending on the team's coding conventions, <TAB> is often used for indentation. The debate over whether to indent with <TAB> or with four <SP> never ends, so we will not discuss it here.

In JS, WhiteSpace is mostly used to separate tokens and to keep the code neat. The WhiteSpace produced by the lexical analyzer is, for the most part, simply discarded by the parser.

So most WhiteSpace can be removed without affecting the execution of the program at all. In some situations, however, WhiteSpace must be present. Consider the following code:

    1 .toString();
    1.toString(); // error

In the code above, the white space separates 1 from ., so they are understood as two tokens; without it, 1. is scanned as a single NumericLiteral.

    1.["toString"]();
    1 .["toString"](); // error

Here the opposite is true.
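The four cases can be checked with a small helper. `parsesAsCode` is a hypothetical name introduced here; it only reports whether a string parses as a function body.

```javascript
// Hypothetical helper: true if `code` parses as a function body.
function parsesAsCode(code) {
  try {
    new Function(code);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

console.log(parsesAsCode('1 .toString();'));     // true:  1  .  toString
console.log(parsesAsCode('1.toString();'));      // false: "1." is followed directly by "t"
console.log(parsesAsCode('1.["toString"]();'));  // true:  1.  [  "toString"  ]
console.log(parsesAsCode('1 .["toString"]();')); // false: "." must be followed by a name

// The two legal spellings produce the same string.
const sameResult = 1 .toString() === 1.["toString"]();
console.log(sameResult); // true
```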

LineTerminator line Terminator

This is also a very common concept. JS provides only four characters as line terminators:

LineTerminator ::
    <LF>
    <CR>
    <LS>
    <PS>

<LF> is U+000A, the most common newline character, written as '\n' in a string.

<CR> is U+000D, the actual "carriage return", written as '\r' in a string. In some Windows-style text editors a line break is the two characters \r\n.

<LS> is U+2028, the line separator in Unicode.

<PS> is U+2029, the paragraph separator in Unicode.

Most LineTerminator are discarded by the parser after they are scanned by the lexical analyzer, but line breaks affect two important grammatical features of JS: the automatic insertion of semicolons and the "no line terminator" rule.

Consider the following code:

    var a = 1, b = 1;
    a
    ++
    b

Under the automatic semicolon insertion rules of JS, the interpretation of this code might seem ambiguous.

But because the postfix increment operator is restricted by "no line terminator here", the actual result is equivalent to:

    var a = 1, b = 1;
    a;
    ++b;
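The effect of the "no line terminator" restriction on ++ can be observed directly. In this sketch the statements are assembled as a string with explicit line breaks and run through `new Function`, purely so that the line structure is unmistakable.

```javascript
// "a", "++" and "b" each sit on their own line.
const incrementBody = [
  "var a = 1, b = 1;",
  "a",
  "++",
  "b",
  "return [a, b];",
].join("\n");

// A postfix ++ may not be preceded by a line terminator, so the ++ is read
// as a prefix operator applied to b: the code runs as  a;  ++b;
const [finalA, finalB] = new Function(incrementBody)();
console.log(finalA, finalB); // 1 2
```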

Consider the following two pieces of code:

    return
    123;

    return 123;

Because return is subject to the "no line terminator" restriction, the first piece of code is actually equivalent to:

    return;
    123;

Comment comment

JS comments are divided into single-line comments and multi-line comments:

Comment ::
    MultiLineComment
    SingleLineComment

Multiline comments are defined as follows:

MultiLineComment ::
    /* MultiLineCommentChars_opt */

MultiLineCommentChars ::
    MultiLineNotAsteriskChar MultiLineCommentChars_opt
    * PostAsteriskCommentChars_opt

PostAsteriskCommentChars ::
    MultiLineNotForwardSlashOrAsteriskChar MultiLineCommentChars_opt
    * PostAsteriskCommentChars_opt

MultiLineNotAsteriskChar ::
    SourceCharacter but not asterisk *

MultiLineNotForwardSlashOrAsteriskChar ::
    SourceCharacter but not forward-slash / or asterisk *

This definition is a little complicated, but it is in fact just a strict description of the multiline comment syntax as we know it.

MultiLineNotAsteriskChar, that is, any character except *, may appear freely in a multiline comment. A * may appear as well, but it cannot be followed directly by a forward slash /, because */ ends the comment.

Single-line comments are relatively simple:

SingleLineComment ::
    // SingleLineCommentChars_opt

SingleLineCommentChars ::
    SingleLineCommentChar SingleLineCommentChars_opt

SingleLineCommentChar ::
    SourceCharacter but not LineTerminator

Except for the four LineTerminator characters, every character can appear in a single-line comment.
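Two consequences of these productions, shown as a sketch: a single-line comment ends at any of the four line terminators, including U+2028, and multiline comments do not nest. `parsesAsFunctionBody` is a hypothetical helper name.

```javascript
// Hypothetical helper: true if `code` parses as a function body.
function parsesAsFunctionBody(code) {
  try {
    new Function(code);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// U+2028 (line separator) ends the single-line comment, so "return 1" runs.
const afterLineSeparator = new Function("// comment\u2028return 1;")();

// The first */ closes the comment; the leftover " c */" is not valid code.
const nestedCommentParses = parsesAsFunctionBody("/* a /* b */ c */");

console.log(afterLineSeparator);  // 1
console.log(nestedCommentParses); // false
```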

In general, neither single-line nor multiline comments affect the meaning of the program, but a multiline comment that contains a LineTerminator does affect automatic semicolon insertion:

    return/*
    */123;

    return /**/ 123;

The two behave differently.
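The difference can be seen by wrapping each form in a function. The function names below are made up for this sketch; the third function shows that a comment containing a line break behaves just like a bare line break after return.

```javascript
function newlineInsideComment() {
  return/*
  */123;
}

function commentOnOneLine() {
  return /**/ 123;
}

function plainNewline() {
  return
  123;
}

console.log(newlineInsideComment()); // undefined
console.log(commentOnOneLine());     // 123
console.log(plainNewline());         // undefined
```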

Token

Token is the smallest semantic unit in JS that can be understood by the engine.

There are four kinds of Token in JS:

Token ::
    IdentifierName
    Punctuator
    NumericLiteral
    StringLiteral

If the conflict between division and regular expressions is set aside, Token should also include RegularExpressionLiteral, and Punctuator should also include the / and /= symbols.

IdentifierName

IdentifierName is defined as:

IdentifierName ::
    IdentifierStart
    IdentifierName IdentifierPart

IdentifierStart ::
    UnicodeLetter
    $
    _
    \ UnicodeEscapeSequence

IdentifierPart ::
    IdentifierStart
    UnicodeCombiningMark
    UnicodeDigit
    UnicodeConnectorPunctuation
    <ZWNJ>
    <ZWJ>

An IdentifierName can start with the dollar sign $, the underscore _, or a Unicode letter. After the start character, Unicode combining marks, digits, and connector punctuation can also be used.

Any character of an IdentifierName can be written as a JS Unicode escape (\uXXXX), but the escaped character must still be one that is legal at that position; an escape cannot be used to introduce a character that would otherwise be illegal.

An IdentifierName can be an Identifier, NullLiteral, BooleanLiteral, or keyword. In an ObjectLiteral, any IdentifierName can also be used directly as a property name.

An IdentifierName is resolved to an Identifier only if it is not a reserved word.
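A few of these rules in runnable form, as a sketch with made-up variable names. It also shows the limit on escapes: \u0031 is the digit "1", which cannot start an identifier, so the declaration is rejected.

```javascript
// Reserved words are still IdentifierNames, so they work as property names.
const wordTable = { if: 1, new: 2, null: 3, true: 4 };
const wordSum = wordTable.if + wordTable.new + wordTable.null + wordTable.true;

// $ and _ are legal start characters; \u0061 is an escape for "a",
// so the second declaration introduces the variable abc.
var $count = 1, _total = 2;
var \u0061bc = 5;

// An escape cannot smuggle in a character that is illegal at that position.
let digitStartAccepted = true;
try {
  new Function("var \\u0031a = 1;");
} catch (error) {
  digitStartAccepted = !(error instanceof SyntaxError);
}

console.log(wordSum, $count + _total, abc, digitStartAccepted); // 10 3 5 false
```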

UnicodeLetter, UnicodeCombiningMark, UnicodeDigit and UnicodeConnectorPunctuation correspond to the following Unicode general categories:

JS lexical name               Unicode category name     Code   Number of characters
UnicodeLetter                 Uppercase letter          Lu     1441
                              Lowercase letter          Ll     1751
                              Titlecase letter          Lt     31
                              Modifier letter           Lm     237
                              Other letter              Lo     11788
                              Letter number             Nl     224
UnicodeCombiningMark          Non-spacing mark          Mn     1280
                              Combining spacing mark    Mc     353
UnicodeDigit                  Decimal number            Nd     460
UnicodeConnectorPunctuation   Connector punctuation     Pc     10

Note that <ZWNJ> and <ZWJ> are two format control characters newly allowed in ES5; at the time of writing, no browser supported them.

The keywords in JS are:

Keyword:: one of break do instanceof typeof case else new var catch finally return void continue for switch while debugger function this with default if throw delete in try

There are seven keywords reserved for future use:

FutureReservedKeyword:: one of class enum extends super const export import

In strict mode, some additional keywords are reserved for future use:

    implements let private public interface package protected static yield

In addition to these, NullLiteral:

NullLiteral ::
    null

and BooleanLiteral:

BooleanLiteral ::
    true
    false

are also reserved words and cannot be used as an Identifier.
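The reserved word rules can be probed with a helper that only checks whether a declaration parses. `isLegalDeclaration` is a hypothetical name; code passed to `new Function` is non-strict unless it starts with a "use strict" directive, which is what makes the last two cases differ.

```javascript
// Hypothetical helper: true if `code` parses as a function body.
function isLegalDeclaration(code) {
  try {
    new Function(code);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

console.log(isLegalDeclaration("var nullable = 1;"));   // true: not a reserved word
console.log(isLegalDeclaration("var null = 1;"));       // false: NullLiteral
console.log(isLegalDeclaration("var true = 1;"));       // false: BooleanLiteral
console.log(isLegalDeclaration("var while = 1;"));      // false: keyword
console.log(isLegalDeclaration("var enum = 1;"));       // false: reserved for future use
console.log(isLegalDeclaration("var implements = 1;")); // true outside strict mode
console.log(isLegalDeclaration("'use strict'; var implements = 1;")); // false
```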

Punctuator

JavaScript uses 48 punctuators. Because of the division and regular expression ambiguity mentioned earlier, the / and /= operators are split off into DivPunctuator. The remaining ones are:

Punctuator :: one of
    {     }     (     )     [     ]
    .     ;     ,     <     >     <=
    >=    ==    !=    ===   !==
    +     -     *     %     ++    --
    <<    >>    >>>   &     |     ^
    !     ~     &&    ||    ?     :
    =     +=    -=    *=    %=    <<=
    >>=   >>>=  &=    |=    ^=

Each punctuator appears as a distinct symbol in the parser.

NumericLiteral

The JS specification defines two ways of writing numeric literals: decimal literals and hexadecimal integer literals. Although the standard does not mention it, most JS implementations also support octal integers that begin with 0.

So in practice, the NumericLiteral production of JS looks like this:

NumericLiteral ::
    DecimalLiteral
    HexIntegerLiteral
    OctalIntegerLiteral (non-standard)

Only decimal literals can represent floating point numbers. DecimalLiteral is defined as follows:

DecimalLiteral ::
    DecimalIntegerLiteral . DecimalDigits_opt ExponentPart_opt
    . DecimalDigits ExponentPart_opt
    DecimalIntegerLiteral ExponentPart_opt

DecimalIntegerLiteral ::
    0
    NonZeroDigit DecimalDigits_opt

DecimalDigits ::
    DecimalDigit
    DecimalDigits DecimalDigit

DecimalDigit :: one of
    0 1 2 3 4 5 6 7 8 9

NonZeroDigit :: one of
    1 2 3 4 5 6 7 8 9

ExponentPart ::
    ExponentIndicator SignedInteger

ExponentIndicator :: one of
    e E

SignedInteger ::
    DecimalDigits
    + DecimalDigits
    - DecimalDigits


The digits before or after the decimal point can be omitted, so 1. and .1 are both legal numeric literals. In particular, the integer part of a decimal literal cannot begin with 0 unless it is exactly 0 (a leading 0 is in effect reserved for octal integers).

The . character is also a Punctuator. During lexical analysis, .123 should be scanned as a single NumericLiteral rather than as a Punctuator followed by a NumericLiteral.
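The DecimalLiteral production translates almost line by line into a regular expression. The following is a sketch of that translation; the constant names are assumptions made for this example.

```javascript
const INTEGER_PART = "(?:0|[1-9][0-9]*)";     // DecimalIntegerLiteral
const EXPONENT_PART = "(?:[eE][+-]?[0-9]+)?"; // ExponentPart, optional

// One alternative per line of the DecimalLiteral production.
const DECIMAL_LITERAL = new RegExp(
  "^(?:" +
    INTEGER_PART + "\\.[0-9]*" + EXPONENT_PART + "|" + // 1.   1.5   1.5e3
    "\\.[0-9]+" + EXPONENT_PART + "|" +                // .1   .12e-10
    INTEGER_PART + EXPONENT_PART +                     // 0    42    1e3
  ")$"
);

const decimalSamples = ["1.", ".1", ".12e-10", "0", "42", "1e3", "012", "1e", "."];
for (const text of decimalSamples) {
  console.log(text, DECIMAL_LITERAL.test(text));
}
// The first six samples match; "012", "1e" and "." do not.
```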

The hexadecimal integer production is as follows:

HexIntegerLiteral ::
    0x HexDigit
    0X HexDigit
    HexIntegerLiteral HexDigit

HexDigit :: one of
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The 0x prefix may be written in either case, and upper and lower case hexadecimal digits can be mixed freely.
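A short check of the hexadecimal forms, together with the non-standard octal form that begins with 0. The octal literal is evaluated through `new Function` because it is only accepted in non-strict code; this is a sketch, and the variable names are made up.

```javascript
// Prefix case and digit case can be mixed freely.
const hexValues = [0xff, 0XFF, 0xFf, 0x1A];
console.log(hexValues); // [ 255, 255, 255, 26 ]

// Legacy octal: 010 is eight in non-strict code.
const legacyOctal = new Function("return 010;")();
console.log(legacyOctal); // 8

// In strict mode the same literal is a syntax error.
let strictOctalAccepted = true;
try {
  new Function("'use strict'; return 010;");
} catch (error) {
  strictOctalAccepted = !(error instanceof SyntaxError);
}
console.log(strictOctalAccepted); // false
```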

Octal integers are non-standard, but most engines support them:

OctalIntegerLiteral ::
    0 OctalDigit
    OctalIntegerLiteral OctalDigit

OctalDigit :: one of
    0 1 2 3 4 5 6 7

StringLiteral

StringLiteral in JS supports both single and double quotation marks.

StringLiteral ::
    " DoubleStringCharacters_opt "
    ' SingleStringCharacters_opt '

Single and double quotation marks differ only in how the literal is written. In a double-quoted string literal, double quotation marks must be escaped; in a single-quoted string literal, single quotation marks must be escaped.

DoubleStringCharacters ::
    DoubleStringCharacter DoubleStringCharacters_opt

SingleStringCharacters ::
    SingleStringCharacter SingleStringCharacters_opt

DoubleStringCharacter ::
    SourceCharacter but not double-quote " or backslash \ or LineTerminator
    \ EscapeSequence
    LineContinuation

SingleStringCharacter ::
    SourceCharacter but not single-quote ' or backslash \ or LineTerminator
    \ EscapeSequence
    LineContinuation

The other characters that must be escaped in a string are \ and all line terminators.

JS supports four forms of escape. There is also an octal escape that most implementations support, although the standard does not define it.

EscapeSequence ::
    CharacterEscapeSequence
    0 [lookahead ∉ DecimalDigit]
    HexEscapeSequence
    UnicodeEscapeSequence
    OctalEscapeSequence (non-standard)

The first is single character escape. This is the form of a backslash followed by a character.

CharacterEscapeSequence ::
    SingleEscapeCharacter
    NonEscapeCharacter

SingleEscapeCharacter :: one of
    ' " \ b f n r t v

NonEscapeCharacter ::
    SourceCharacter but not EscapeCharacter or LineTerminator

The characters with special meaning are the 9 characters defined by SingleEscapeCharacter, as shown in the following table:

Escape character   Escaped to
"                  U+0022
'                  U+0027
\                  U+005C
b                  U+0008
f                  U+000C
n                  U+000A
r                  U+000D
t                  U+0009
v                  U+000B

Apart from these nine characters, the digits, x and u, and all line terminators, any other character escaped with \ simply stands for itself.

A hexadecimal escape has exactly two digits, so this form can only express code points from U+0000 to U+00FF:

HexEscapeSequence ::
    x HexDigit HexDigit

A Unicode escape can express any character in the BMP:

UnicodeEscapeSequence ::
    u HexDigit HexDigit HexDigit HexDigit
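The escape forms can be compared against each other directly. This is a small sketch; the property names in `escapeChecks` are labels invented for the example.

```javascript
const escapeChecks = {
  singleEscape: "\n" === "\u000A" && "\t" === "\u0009",
  quoteEscape: 'It\'s' === "It's",
  nonEscape: "\q" === "q",         // NonEscapeCharacter: the backslash is dropped
  nullEscape: "\0" === "\u0000",   // \0 not followed by a digit
  hexEscape: "\x41" === "A",
  hexBeyondAscii: "\xE9" === String.fromCharCode(0xe9),    // two hex digits reach U+00FF
  unicodeEscape: "\u4E2D" === String.fromCharCode(0x4e2d), // a BMP character
};

console.log(escapeChecks); // every property is true
```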

LineContinuation can be understood as a special kind of escape. Used flexibly, it can make long string literals more readable.

LineContinuation ::
    \ LineTerminatorSequence

LineTerminatorSequence ::
    <LF>
    <CR> [lookahead ∉ <LF>]
    <LS>
    <PS>
    <CR> <LF>

To accommodate Windows-style text, JS treats "\r\n" as a single line terminator sequence.

Note that because a lone CR is not shown as a line break in some Windows-style editors, misusing it can have strange effects.
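A LineContinuation contributes nothing to the value of the string, as this minimal sketch shows. The second line deliberately starts in the first column, because any indentation there would become part of the string.

```javascript
const continued = "Hello, \
world";

console.log(continued === "Hello, world"); // true
console.log(continued.length);             // 12
```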

RegularExpressionLiteral

Regular expressions consist of two parts, Body and Flags:

RegularExpressionLiteral:: / RegularExpressionBody / RegularExpressionFlags

The Body part has at least one character, and the first character cannot be * (because /* would conflict lexically with the start of a multiline comment).

RegularExpressionBody ::
    RegularExpressionFirstChar RegularExpressionChars

RegularExpressionChars ::
    [empty]
    RegularExpressionChars RegularExpressionChar

RegularExpressionFirstChar ::
    RegularExpressionNonTerminator but not * or \ or / or [
    RegularExpressionBackslashSequence
    RegularExpressionClass

RegularExpressionChar ::
    RegularExpressionNonTerminator but not \ or / or [
    RegularExpressionBackslashSequence
    RegularExpressionClass

At the lexical level, every character in a JS regular expression is an ordinary character except for the three characters \, / and [.

RegularExpressionBackslashSequence ::
    \ RegularExpressionNonTerminator

RegularExpressionNonTerminator ::
    SourceCharacter but not LineTerminator

You can use\ and a non-newline character to form a RegularExpressionBackslashSequence, which can be used to represent special characters in regular expressions.

RegularExpressionClass ::
    [ RegularExpressionClassChars ]

In a regular expression, a class is written in square brackets. The only special characters inside a class are ] and \.

A class is allowed to be empty.

Escapes are also supported inside a class.

RegularExpressionClassChars ::
    [empty]
    RegularExpressionClassChars RegularExpressionClassChar

RegularExpressionClassChar ::
    RegularExpressionNonTerminator but not ] or \
    RegularExpressionBackslashSequence

The flags of a regular expression are not restricted at the lexical stage. Only a few flags, such as i and g, are actually valid, but any sequence of IdentifierPart characters is considered legal by the lexer.

RegularExpressionFlags ::
    [empty]
    RegularExpressionFlags IdentifierPart

As a result, some literals are considered legal by lexical analysis but do not actually conform to regular expression grammar.
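A few literals of this kind, chosen for illustration (they are not taken from the original article): each one matches the RegularExpressionLiteral production, yet is rejected once the pattern or the flags are checked. `regexpLiteralParses` is a hypothetical helper name.

```javascript
// Hypothetical helper: true if `code` parses as a function body.
function regexpLiteralParses(code) {
  try {
    new Function(code);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

console.log(regexpLiteralParses("/a/g;"));   // true:  valid pattern and flag
console.log(regexpLiteralParses("/[/]/;"));  // true:  "/" inside a class does not end the body
console.log(regexpLiteralParses("/a/xyz;")); // false: xyz are IdentifierParts but not valid flags
console.log(regexpLiteralParses("/(/;"));    // false: "(" is an ordinary character to the lexer
console.log(regexpLiteralParses("/a**/;"));  // false: nothing to repeat
```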

Appendix: summary of the JS lexical grammar

English name | Name | Brief description | Example

InputElement | input element | all legal "words" in JS |
┣ Comment | comment | text that helps people read the code |
┃┣ SingleLineComment | single-line comment | comment that starts with // | // I'm comments
┃┗ MultiLineComment | multiline comment | comment that begins with /* and ends with */ | /* I'm comments,too.*/
┣ WhiteSpace | white space | blank characters that separate tokens or keep the code tidy |
┣ Token | lexical token | all meaningful lexical tokens in JS |
┃┣ IdentifierName | identifier name | a word that begins with a letter, _ or $; can be used as a property name |
┃┃┣ Identifier | identifier | IdentifierName that is not a reserved word; can be used as a variable name or property name | abc
┃┃┣ Keyword | keyword | IdentifierName with a special grammatical meaning | while
┃┃┣ NullLiteral | Null literal | represents a value of type Null | null
┃┃┗ BooleanLiteral | Boolean literal | represents a value of type Boolean | true
┃┣ Punctuator | punctuator | symbol with a special meaning | *
┃┣ NumericLiteral | numeric literal | represents a value of type Number | .12e-10
┃┣ StringLiteral | string literal | represents a value of type String | "Hello world!"
┃┗ RegularExpressionLiteral | regular expression literal | represents an object of class RegExp | /[a-z]+$/g
┗ LineTerminator | line terminator | newline characters that separate tokens or keep the code tidy; may affect automatic semicolon insertion |

Appendix: overview of all invisible characters in the JS lexical grammar

U+0009   tab, used as white space
U+000B   vertical tab, used as white space
U+000C   form feed, used as white space
U+0020   space, used as white space
U+00A0   no-break space, used as white space
U+FEFF   zero width no-break space (byte order mark), used as white space
U+200C   zero width non-joiner, used in identifiers
U+200D   zero width joiner, used in identifiers
U+000A   line feed, used as a line terminator
U+000D   carriage return, used as a line terminator
U+2028   line separator, used as a line terminator
U+2029   paragraph separator, used as a line terminator

That concludes this overview of the JavaScript lexical grammar. I hope the content above is of some help to you.
