In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >
Share
Shulou(Shulou.com)06/03 Report--
This article honors Ken Thompson and Dennis Ritchie, the great masters of computer science, and pays tribute to all others who have contributed to the development of Unix. The dust of history
As a world-famous operating system, Unix has a history of more than 40 years. Around the development of this ancient operating system, a series of peripheral software ecological groups have been derived, one of which is shell. It is the outermost interface of the operating system, which is responsible for direct user interaction and provides kernel services, including command line interface (CLI) or graphical interface interface (GUI). Taking CLI as an example, it provides a set of command specifications and is an interpretive language, which converts user input into real system calls through the output of the interpreter (interpreter), and realizes the function of human-computer interaction.
Like the operating system, shell has experienced a long history of evolution. Most of the data today tell that the oldest shell is from Bourne Shell in 1977. It was originally transplanted to Unix V7 and was recognized as the ancestor of the entire shell family, from which later populations branched out.
A lot of materials about the history before 1977 are mostly mentioned or omitted. In fact, the first shell transplanted to Unix was not written by Steve Bourne. As early as May 1975, Bell Labs released the first widely spread version of Unix-Unix V6 (the previously developed version is for internal research only). The / bin/sh in its root directory is the first shell that comes with Unix, written by Ken Thompson, so it is also known as Thompson Shell. Even earlier, dating back to 1971, Thompson Shell was implemented as a kernel-independent application, but from its official launch in 1975 to its replacement in 1977, its short two-year life made it rarely known to most people.
The reasons for the replacement of Thompson Shell will be explained later, and some technical details of the shell itself will be highlighted here. Frankly, information about Thompson Shell is a little scarce, but at least the source code and online documentation can be found online. Thompson Shell itself is composed of an interpreter with less than 900 lines of code and some external command tool components (utilities), written with Knowledr C. the relevant source code and documentation links for each component are given below.
Interpreter sh: parses a variety of shell commands, including built-in and external commands; source sh.c; installation path / bin/sh; manual sh (1). The built-in command manual includes chdir (1), login (1), newgrp (1), shift (1) and wait (1).
Here are the external commands:
Exit command: exit a file; source exit.c; installation path / bin/exit; manual exit (1). Goto command: jump to shell control flow in a file; source code goto.c; installation path / bin/goto; manual goto (1). If command: conditional judgment expression, is the predecessor of the test command; source code if.c; installation path / bin/if), manual if (1). Glob command: extended command parameter wildcard; source code glob.c; installation path / etc/glob; manual glob (8). Command structure and specification
Although it was later "eliminated", Thompson Shell still has an undeniable historical status, and its greatest value lies in that it lays the foundation for the structure and specification of shell command language, and its interpreter is portable across platforms, and affects the design and implementation of various scripting languages, including Bourne Shell. Here are five of these features to review some of the already familiar command specifications, you can also check the raw material through the sh (1) manual.
Filter / Pipe Line (filter/pipeline). This is definitely an invention to go down in the annals of Unix, and it was Douglas McIlroy,Thompson Shell who introduced and implemented the great concept of one or more commands forming a chain of filters separated by'|'or'^ 'symbols. Except for the last command, the standard output of each command is used as standard input for the next command. In this way, each command runs as a separate process and is connected to neighboring processes through pipes. The sequence of commands in parentheses can be implemented as a filter instead of a single command as a whole, for example, the user can type "(Ascape B) | C". Command sequence and background process. The semicolon'; 'indicates that multiple commands are serialized to execute.' The & 'symbol indicates that the command executes asynchronously in the background, so that the preceding pipeline line does not have to wait for it to terminate, but only reports a process id so that users can communicate with it later through the kill command. It is beneficial to process management. Ipaw O redirect. It takes advantage of an important feature of Unix design-everything is a file, represented by three symbols: "redirect the output, create it if the file does not exist, truncate it if the file does not exist, redirect the output in'> 'append mode, create it if the file does not exist, and append the output to the end if the file exists. Wildcard extension (globbing). The concept of wildcards is derived from regular expressions, which enables the interpreter to intelligently handle incomplete user input, such as not remembering file names, entering multiple files at once, and so on.' ,' Matches any single character;'* 'matches any string (including empty strings); pairs of' ['and'] 'define a class of character sets, which can match any member in brackets, and a series of consecutive character matching ranges can be specified at both ends of' -'. Parameters are passed. The concepts of position parameter and option parameter are mainly introduced here:'$n 'indicates the nth parameter substitution of the shell call. Two option parameters,'- t 'and'-caches, are also defined for interaction, causing shell to read a line from standard input as a system command executed by the user, while the latter instructs shell to execute the next argument attached as a command (which handles newline characters correctly), which complements'- t', especially if the caller has read some of the characters in the command. The principle and implementation of directly reading the file name interpreter without option parameters
In order to understand the principle of the shell interpreter, we need to describe its entire workflow (here is an annotated sh.c source code analysis). Students who have read the principles of compilation know that the implementation of the interpreter is similar to that of the compiler, except that the step of generating the object code is omitted and the user input (shell command) is directly converted into output (system call). The front end of the software is consistent, including preprocessing, lexical scanning, syntax analysis, semantic analysis, and finally an additional process management. Of course, compared with modern compilers, the Thompson Shell interpreter is much simpler in algorithm and scale, but it is similar in principle, not to mention that it is older than Lex & Yacc. Although the sparrow is small and has all the internal organs, it may be a good choice for beginners to start with the compilation principle from Thompson Shell.
Preprocessing (preprocessor)
Just as the C preprocessor needs to expand the macros and header files contained in the source code in advance, Thompson Shell first needs to deal with the option parameters and position parameters in the command. There are two types of option parameters,'- t 'and'-cached, which determine whether shell reads characters from the standard input or parameter cache (see sh (1)). In addition, the backslash'\ 'should also be processed in the character sequence to determine whether it is an escape character or a line continuation. The former sets a reference identification to the next character, indicating that ordinary character processing is done, while the latter will filter out the newline immediately after it.
The position parameter starts with the dollar sign'$', followed by a number, such as'$'. The preprocessor counts the shell command parameter from scratch and returns the parameter position specified by the number n. If you encounter double'$$', it means the current process ID. Call getpid () to get it.
Notice that the preprocessor needs to read more than one character at a time, so that it will read an unnecessary character. The interpreter provides a peek for this, that is, each time a character is read from the input stream, it is put into a read-ahead cache (only a stack the size of int), which is also called push back. After that, it is read from the read-ahead cache, and if the cache is finished, it is read from the input stream.
Lexical scanning (lexical scanning)
The preprocessed character sequence will be cut into a series of lexical tokens (token) and placed in the token list, and the scanner will do the following processing for the following types of characters.
Whitespace and tab: simple filtering. Quotation marks: need to appear in pairs, the characters themselves are filtered, and all characters between a pair of quotation marks are set as a token. Metacharacters: such as'&','|', etc., the character itself acts as a separate token. Other characters: all token will be filled in until the above characters are separated.
For example, when we enter the command "(ls; cat tail) > junk", the token list image will look like this:
Parsing (syntax parser)
Parsing is to take the elements in the token list as expressions (expression) and build a syntax tree in nodes. A simple command is an expression, while a compound command and a command sequence are a combination of multiple expressions. In Thompson Shell, simple arrays are used as containers for syntax trees, which is actually a variant of the structure, except that each member field is the same size (all sizeof int). A syntax tree node has up to six fields (the size varies according to the type), which are
DTYP (Node Type): each node has a unique type, which is divided into four types-TCOM (simple command), TPAR (compound command), TFIL (filter / pipe line), and TLST (command sequence). DLEF (left subtree node): equivalent to a linked list pointer, which varies according to the definition of DTYP. For example, the left subtree node of the filter type is the output redirection file of the previous command, and the right subtree node is the input redirection file of the latter command. DRIG (right subtree node): same as above. DFLG (Node attribute): this is a flag that determines which attributes the node contains for the command and in what state to execute. DSPR (subcommand): double meaning: for simple commands, this field is empty; for compound commands, this field refers to the child syntax tree node. DCOM (command character): refers to a sequence of command characters.
The order of syntax tree node generation depends on the priority (priority) of each element in the token list. First, traverse the entire list, find the highest priority token as the root node, and then generate the left and right subtrees respectively, which is the simplest top-down (top-down) solution. Each token priority depends on the DTYP field
Priority
Token
DTYP
The first level
'&';'\ n'
TLST
The second level
'|' ^'
TFIL
The third level
'(')'
TPAR
The fourth level
Other characters
TCOM
A dynamic programming algorithm based on finite state machine (finite-state machine) is also used in the construction of syntax tree, which divides the whole logic flow into four states: syntax, syn1, syn2 and syn3. Corresponding to the above token priority, the program generates a corresponding type of node in each state and four strategies at the same time. To decide what state to move to next (search for the corresponding token based on priority). These four strategies are
Generate the left subtree: the token list on the left progresses to the lower state. Generate the right subtree: the token list on the right and go back to the upper state or recursive call. Cannot find the corresponding token: keep the original token list progressive to the lower state. Generate node: return directly to the node.
When we have traversed the entire token list, the program can always return to the original call point, that is, the root node, thus generating a complete syntax tree. The advantage of this algorithm is that programmers do not have to pay attention to every detail of the specific implementation, just pay attention to the corresponding state and formulate the corresponding transition strategy. It is also worth mentioning that each transfer strategy occurs on an assignment statement or a return statement, and uses function arguments to save temporary variables, thus avoiding stack overflows caused by too many calls.
Still give two examples, such as the syntax tree corresponding to the command "A &; B | C"
Syntax tree corresponding to the command "(A; B) | C":
Semantic analysis (Semantic Analyzer)
Syntax analysis only stays at the level of legitimacy of token expressions, it does not know whether the expression is meaningful, such as which commands are to be run in the background, which commands are redirected to the pipeline, how to extend wildcards, and so on, which depends on semantic analysis. The "semantics" here are reflected in the dynamic handling of special characters and the field settings of syntax tree nodes, depending on the context (context). For example, for metacharacters'>', we need to determine which file the output is redirected to, truncated or appended. For the wildcard characters'?','*', and'[...]', we have to decide which characters to extend, which are handled specifically in / etc/glob. For syntax tree nodes, in addition to their own inherent attributes, you also need to inherit the attributes of the upper node, and push the attributes down to the lower subtree node, which is described in a table below.
DTYP
DLEF/DRIG
DFLG
DSPR
TLST
It can be empty or other node, and the type can be TLST/TFIL/TCOM itself attribute 0. If you take'&', push down the attribute FINT | FAND | FPRS to the left and right subtree (ignore signal, background asynchronous, print pid) empty TFIL
Must exist at the same time, the type can only be TCOM or TPAR its own attributes inherit from the upper TLST; push FPIN down to the left subtree node; push FPOU to the right subtree node. Empty TPAR
Null inheritance upper TLST and TFIL; if it is an append mode redirect output, plus FCAT; if it is the last subcommand in the compound command, plus FPAR, it will not fork the child process. Subcommand TCOM
The left subtree node is the input redirection file, and the right subtree outputs the redirection file for the node. Null execute command (Executor)
After the previous series of steps, if the error count is 0, the interpreter starts from the root node of the syntax tree, traverses all nodes in depth first, and executes the included commands one by one according to the types and attributes obtained from the previous syntax and semantic analysis. to generate the final system call.
For command sequence (TLST) nodes, the subtree node commands are executed sequentially from left to right.
For the filter (TFIL) node, create a pipe file handle as a redirect file for the left and right subtrees.
For simple command (TCOM) and compound command (TPAR) nodes, first filter out the system built-in command (built-in), and for the remaining external commands, fork a child process to execute it. If it is the last subcommand in the compound command, it is still executed on the original process without having to create a new process. The executable path is searched in order: ① local path; ② / bin; ③ / usr/bin.
In a multi-process environment, special attention should be paid to file handle management. In addition to sharing standard input and output devices between commands, the pipeline will also be redirected, and the parent process will get a copy of the file handle after the fork, so the parent process must close the idle pipe line handle (if any) immediately after the fork to avoid resource leakage, and the child process will also close the pipe line handle after the redirection.
For background commands, the pid needs to be printed, but there is no need to respond to the interrupt signal, and the parent process does not have to wait for the child process to terminate. The interrupt signal can be captured during the execution of other process commands and transferred to the corresponding processing function.
The interpreter saves the process termination status with the built-in errno global variable and generates a termination report (termination report). The system calls wait () to return the pid of the terminated process and output the report message index.
Which is better or worse?
Although Thompson Shell is an excellent command interpreter and has produced a number of historical innovations, it is a pity that it is still not favored by the goddess of fate, which is due to its own defects-single function, decentralized commands, simple control, and can not be used to write scripts (script). As Unix grows, it can no longer cope with increasingly complex programming projects. At that time, there was a PWB Shell (also known as Mashey Shell) written by a man named John Mashey, which made some improvements based on Thompson Shell, expanded the command set, added shell variables, and added control logic such as if-then-else-endif,for,while. Unfortunately, it is shorter-lived than Thompson Shell because it met a strong rival in 1977.
Yes, that's Bourne Shell, and its main advantage is that it really implements structured scripting, which is better than the previous shell, and what's more, it's not compatible with the first two shell, so a standardization debate begins. In the article "ksh-An Extensible High Level Language" written by David G. Korn (author of ksh), Steve Bourne and John Mashey argued over their respective reasons at three consecutive Unix user group gatherings. Between these gatherings, each enhances their shell to have each other's functions. A committee was also set up to select the standard shell, and eventually Bourne shell was chosen as the standard.
So the "Bourne Shell Family" mentioned above has been available since Unix V7. However, there is no perfect technology in history, and with the rapid development of the operating system in the 1980s and 1990s, there are more and more criticisms against Bourne Shell. In the implementation of the interpreter itself, I saw that one of the comments on the Internet is "universally considered to be one of the most horrible C code ever written". As for the reason, you can see the reason by taking a look at mac.h. A large number of macro definitions, including basic operators and keywords, make the whole code look like it is not written in C. Maybe Bourne wants to make the interpreter its own unique style. It is no wonder that the later bash named "born again" was a playful tease of its ancestors. In addition, some defects in memory management bring about the problem of platform portability, and the technical details are a little advanced, which are beyond the scope of this article.
Thompson Again Shell?
Although history did not give Thompson Shell a chance, it did not become an ancient "fossil" in open source museums like Unix V6. As a work from the top, as a product of popular computers along with a great operating system like Unix, it is still remembered by programmers at home and abroad, or rewrite it, or make notes for it. For example, a foreign site v6shell.org has implemented a free and open source portable shell, which is compatible with and extends the original Thompson Shell and can be used for scripting. For another example, Chinese programmer Han Cicada Tui Shi released an annotated version on his personal blog and made some changes to the original version, mainly to convert Kappa R C into ANSI C, and comply with the POSIX specification, making the original obscure source code clear and easy to read. It was precisely because I came into contact with his version that aroused my interest in the archaeology of the old Unix that I had this Archaeological Note. I wonder if there will be a tash like bash in the future.
Some feelings.
This should be the end of the full text, but I can't help but want to say a few more words at this moment. This note did not mean to do it at the beginning, and it gradually became a chapter when the feeling accumulated in the process of hacking source code. It took more than four weeks to read the code, make notes, check materials, and write this article, which was pieced together by the scattered pieces of time squeezed out by the "milking ditch" in the complicated work. Although the text is not long, it takes a lot of painstaking effort. I realize the difficulty of diving into the heart to do research after stepping into society. Now faced with such a less than 900 lines of writing, not a single extra line of code, concise (clarity), clean (clean), fast (fast), this is the charm of Pure C, I am deeply impressed by this heavy programming skills, the so-called "the road to simplicity". Although it takes a lot of time to fully understand it, I believe the price is worth it.
Finally, when Dennis Ritchie died in 2011, someone asked him, "how long does it take to learn C to become a skilled developer and write important product code?" Ritchie replied, "I don't know. I've never been to C." (I don't know. I never had to learn C.) In fact, the answer has been given here-there is no better choice than to read the Unix source code, and in a sense the C language is made for Unix.
references
The Unix Heritage Society:Unix community legacy with source code for V6 and V7 and other derivative versions of the operating system.
A brief history of The Traditional Bourne Shell Family:Bourne Shell family.
V6shell:osh, an open source portable old shell based on Thompson Shell.
Han Cicada's blog: an annotated version of Thompson Shell.
Evolution of shells in Linux: a brief introduction to the history of Linux Shell evolution.
Appendix A Chinese annotated shell source code
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.