
How to understand the hardcore operating system


This article explains how to understand the hardcore operating system. The explanations are kept simple and clear, and they are easy to learn and follow. Let's study "how to understand the hardcore operating system" together!

1 Von Neumann system

1.1 A brief introduction to the von Neumann system

Von Neumann, the father of the modern computer, first proposed the idea of the stored program and successfully applied it to computer design. This idea specified that binary is used for both computation and storage, and defined the basic structure of a computer as five parts: the central processing unit (CPU), memory, input devices, output devices, and the bus.

Memory: code and data are stored linearly in RAM and ROM. Data is represented in binary bits, and the smallest addressable unit of storage is the byte.

Bus: the bus is used for communication between the CPU, memory, and other devices. There are three main types of bus:

Address bus: used to specify the memory address that CPU will operate on.

Data bus: data used to read and write memory.

Control bus: used to send and receive signals such as interrupts and device reset; when the CPU receives or responds to such a signal, that also goes through the control bus.

Input / output devices: the input device feeds data into the computer, and after computation the results are sent to the output device. For example, when you press a key on the keyboard and need to notify the CPU, the control bus is used.

CPU: the central processing unit, analogous to the human brain. As the computing and control core of the computer system, it is the final execution unit for information processing and program running. The CPU uses registers to store the data needed for calculation; there are generally three types of registers:

General registers: used to store the data being operated on, for example the two operands of an addition.

Program counter: used to store the memory address where CPU will execute the next instruction.

Instruction register: used to store the instruction itself pointed to by the program counter.

The process of computer instruction execution under the von Neumann system:

The CPU reads the program counter to get the memory address of the instruction; the control unit puts that address on the address bus, the memory returns the instruction over the data bus, and the instruction is stored in the instruction register.

The CPU decodes the instruction in the instruction register: computation-type instructions are handed to the arithmetic/logic unit, and storage-type (memory access) instructions are handed to the control unit for execution.

After the CPU executes the instruction, the program counter is incremented to point to the next instruction; on a 32-bit CPU, for example, it increases by 4.

After the increment, the next instruction is executed, and the cycle repeats until the program ends.
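The fetch-decode-execute cycle above can be summarized in a small sketch. This is an illustrative simplification, not real CPU microcode: the memory array, the two opcodes, and the register names are all invented for the example, and the "program counter" advances by 1 because the array is word-indexed (a real byte-addressed 32-bit CPU would advance by 4).

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch of the von Neumann fetch-decode-execute loop.
   Memory layout, opcodes, and registers are invented for this example. */
enum { OP_HALT = 0, OP_ADD = 1 };

int main(void) {
    uint32_t memory[16] = { OP_ADD, OP_ADD, OP_HALT }; /* "program" stored in memory */
    uint32_t pc  = 0;   /* program counter: address of the next instruction */
    uint32_t ir  = 0;   /* instruction register: the fetched instruction    */
    uint32_t acc = 0;   /* general register used by the ALU                 */

    for (;;) {
        ir = memory[pc];            /* fetch: address bus -> memory -> data bus */
        pc += 1;                    /* point to the next instruction            */
        if (ir == OP_HALT) break;   /* decode + execute                         */
        if (ir == OP_ADD)  acc += 1;/* computation handled by the ALU           */
    }
    printf("result = %u\n", acc);
    return 0;
}
```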

CPU bit width: a 32-bit CPU can operate on 4 bytes at a time and a 64-bit CPU on 8 bytes; this is a hardware property. A 32-bit or 64-bit operating system, by contrast, is a software property and refers to how many bits the instructions in a program use.

Bus bit width: the CPU transfers instructions and data through changes between high and low voltage levels. Transmission can be serial or parallel; the number of lines transmitting in parallel equals the bit width.

1.2 Introduction to the CPU

Central Processing Unit CPU, as the core of operation and control of computer system, is the final execution unit of information processing and program running.

CPU

CPU core: a CPU generally has multiple cores. Multi-core usually means integrating two or more complete computing engines in one processor; a core is a part of the CPU.

Registers: the storage units closest to the CPU. A 32-bit CPU's registers hold 4 bytes and a 64-bit CPU's hold 8 bytes. Register access generally takes half a CPU clock cycle, i.e. it is at the nanosecond level.

L1 cache: each CPU core has its own, used to cache both data and instructions. Its size is generally 32KB~256KB, and access takes about 2 to 4 CPU clock cycles.

cat /sys/devices/system/cpu/cpu0/cache/index0/size   # L1 data cache
cat /sys/devices/system/cpu/cpu0/cache/index1/size   # L1 instruction cache

L2 cache: each CPU core has its own; its size is generally 128KB~2MB, and access takes about 10 to 20 CPU clock cycles.

cat /sys/devices/system/cpu/cpu0/cache/index2/size   # L2 cache size

L3 cache: shared by multiple CPU cores; its size is generally 2MB~64MB, and access takes about 20 to 60 CPU clock cycles.

cat /sys/devices/system/cpu/cpu0/cache/index3/size   # L3 cache size

Memory: shared by all CPUs, nowadays generally 4G~512G; access takes roughly 200~300 CPU clock cycles.

Solid state drive (SSD): mainstream desktop computers now ship with one. The registers, caches, and memory above are volatile (their contents are lost on power-off), but an SSD is not. Its size is generally 128G~1T, and it is 10 to 1000 times slower than memory.

Mechanical hard disk (HDD): the mainstream disk of years past; capacity varies from 512G to 8T, and access is roughly 100,000 times slower than memory.

Data access order: when the CPU needs data, it goes through the hierarchy above roughly in that order, and only when a level misses does it look in the next level down.

CacheLine: the CPU loads data into the cache in units of cache lines, each cache line being 64 bytes. Because L1 and L2 are private to each core, false sharing can be triggered, so hot data may be deliberately placed in different cache lines to avoid it; for example, LongAdder, newly added in JDK8, relies on this idea.

False sharing: the cache system stores data in units of cache lines. When multiple threads modify independent variables that happen to share the same cache line, they inadvertently hurt each other's performance; this is called false sharing.
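As a concrete illustration of avoiding false sharing, the sketch below pads two counters so that they land in different cache lines (the same idea behind LongAdder's cells and @Contended in the JVM). The 64-byte line size, the struct layout, and the iteration counts are assumptions made for this example.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Without the padding, a and b would very likely share one cache line,
   and the two threads would invalidate each other's L1/L2 copies on
   every increment (false sharing). */
struct counters {
    volatile uint64_t a;
    char pad[CACHE_LINE - sizeof(uint64_t)];  /* push b onto the next line */
    volatile uint64_t b;
};

static struct counters c;

static void *bump_a(void *arg) { (void)arg; for (int i = 0; i < 10000000; i++) c.a++; return NULL; }
static void *bump_b(void *arg) { (void)arg; for (int i = 0; i < 10000000; i++) c.b++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%lu b=%lu\n", (unsigned long)c.a, (unsigned long)c.b);
    return 0;
}
```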

JMM: all this layering keeps raising data access speed, but it also brings problems. Multiple CPUs operating on the same data at the same time can produce all kinds of bugs, which require locking; this is discussed in detail in JUC concurrency.

1.3 CPU access method

CPU access mode

When memory data is mapped into the CPU Cache, the formula (Block number) mod (number of cache lines, CacheLineMax) determines which cache line a memory block is placed in. A CPU cache entry is mainly composed of four parts.

Cache Line: the CPU cache does not store or read data byte by byte; it stores and reads data in whole cache lines.

Valid bit: marks whether the entry is valid. A value of 0 means the CPU ignores whatever is in the cache line and goes directly to memory, reloading the data.

Tag: the group tag. Different memory blocks can map to the same cache line, and the tag distinguishes which block is currently stored there.

Data: the actual data cached from memory.

When the CPU actually accesses data in memory through the cache, it only needs to specify three parts:

Cache Line Index: locates the cache line to access.

Tag: indicates which memory block the cached data belongs to.

Offset: when the CPU reads from the CPU Cache, it does not take the whole cache line at once; it reads just the piece of data it needs, called a Word. Locating a Word inside the line requires an offset, the Offset.
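A minimal sketch of how an address could be split into tag, cache line index, and offset. The geometry (64-byte lines, 64-line direct-mapped cache) and the sample address are assumptions for the example, not a real CPU's parameters.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed cache geometry: 64-byte lines, 64 lines, direct-mapped. */
#define LINE_SIZE 64u   /* offset field: log2(64) = 6 bits */
#define NUM_LINES 64u   /* index field:  log2(64) = 6 bits */

int main(void) {
    uint32_t addr = 0x12345678;                        /* arbitrary address        */

    uint32_t offset = addr % LINE_SIZE;                /* which Word inside the line        */
    uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;  /* which cache line: Block mod lines */
    uint32_t tag    = addr / LINE_SIZE / NUM_LINES;    /* distinguishes blocks sharing a line */

    printf("addr=0x%08x tag=0x%x index=%u offset=%u\n", addr, tag, index, offset);
    return 0;
}
```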

1.4 CPU access Speed

Access time comparison

As the figure above shows, access gets slower the further you move from the CPU, so the CPU should read data from the caches close to it whenever possible. By Moore's Law, CPU speed roughly doubles every 18 months, while memory access improves only about 10% in the same period, so the performance gap between CPU and memory keeps widening. How, then, can we raise the CPU cache hit rate as much as possible?

1. Data cache: when traversing data, access it in memory-layout order. Because the CPU cache operates on data in whole cache lines, sequential reads are faster, just like sequential disk writes.

2. Instruction cache: make conditional branches as regular as possible so the CPU branch predictor can do its job; the CPU automatically prefetches the instructions it expects to need into the instruction cache, which further improves execution efficiency.

3. Thread binding to CPU: if task A runs on CPU core 1 in one time slice and on CPU core 2 in the next, the warmed-up L1 and L2 caches are wasted. The operating system therefore offers a way to bind a process or thread to run on a given CPU; Linux, for example, provides the sched_setaffinity method, and other operating systems have similar APIs (see the sketch below). When multiple threads perform intensive computation at the same time and the cache hit rate matters, binding each thread to its own CPU core can significantly improve performance.
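A minimal, Linux-only sketch of pinning the calling thread to CPU core 0 with sched_setaffinity, as mentioned above. Error handling is reduced to a print, and core 0 is an arbitrary choice for the example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* bind to CPU core 0 */

    /* pid 0 means "the calling thread"; after this call the scheduler keeps
       the thread on core 0, so its L1/L2 caches stay warm across time slices. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now pinned to core 0 (currently on cpu %d)\n", sched_getcpu());
    return 0;
}
```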

1.5 operating system

Computer structure

With the von Neumann computer system in place, a computer needs an operating system (Operating System) to provide convenient services to users. The operating system is a special layer of software sitting on top of the hardware: it manages the computer's hardware and software resources and provides a large number of services to other applications. You can think of it as the interface between everyday applications and the hardware. You use Windows/Linux every day and the operating system makes life extremely convenient, but how much do you really know about it? How does it manage memory, processes, files, and input/output?

2 memory management

Suppose your computer runs a 32-bit operating system, which can address at most 4G of memory. Have you ever wondered how, with only 2G of physical memory, you can still run several programs at the same time? Applications do not use physical addresses directly: the operating system assigns each running process its own set of virtual addresses, and a process cannot access physical memory addresses directly. The mapping from virtual to physical addresses is invisible to the process; the operating system itself provides the mechanism that maps each process's virtual addresses onto different parts of physical memory.

Virtual memory

2.1 MMU

The Memory Management Unit (MMU) is computer hardware responsible for handling the CPU's memory access requests. Its functions include translating virtual addresses to physical addresses, memory protection, and CPU cache control. Modern CPUs basically all use an MMU.

When a process holds a virtual memory address, CPU manipulates virtual memory when it executes the process, while MMU automatically maps virtual memory operations to physical memory.

MMU

Note in passing that the addresses you see when running Java are JVM virtual addresses, not real physical addresses.

2.2 memory management mode

The operating system mainly uses memory segmentation and memory paging to manage the mapping between virtual and physical addresses. Segmentation came first; today paging is used almost everywhere, but it is not pure paging: it is paging built on top of segmentation.

2.2.1 memory segmentation

JVM memory model

Take the JVM memory model in the figure above as an example: programmers think of a program as being made up of a code segment, data segment, stack segment, heap segment, and so on. Different segments have different attributes, and users do not care where in memory these elements actually live. Segmentation is the memory management scheme that supports this view: the logical address space is a collection of segments, each with a name and a length, and an address specifies a segment and an offset within it, so the segment number plus the intra-segment offset addresses data of different attributes. The mapping from virtual addresses to physical addresses goes through the segment table, as the figure below shows.

Memory segmentation management

As shown above, the virtual address has five segments, and each segment is stored as shown in the figure. Each segment has an entry in the segment table, which includes the base address of the beginning of the segment in physical memory and the boundary length of the segment. For example, segment 2 is 400 bytes long and starts at position 4300. Therefore, the reference to segment 2 byte 53 is mapped to position 4300 + 53 = 4353. The reference to segment 3 bytes 852 is mapped to the position 3200 + 852 = 4052.
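The base + offset translation in the example above can be sketched as a table lookup with a limit check. The segment table contents are taken from the figure's example (segment 3's limit is not given in the text, so the value below is assumed), and fault handling is simplified to a return code.

```c
#include <stdint.h>
#include <stdio.h>

/* Segment table from the example: segment 2 starts at 4300 and is 400 bytes long;
   segment 3 starts at 3200 (its limit is assumed here, large enough for byte 852). */
struct segment { uint32_t base, limit; };

static struct segment seg_table[] = {
    [2] = { .base = 4300, .limit = 400  },
    [3] = { .base = 3200, .limit = 1100 },
};

/* Translate (segment, offset) to a physical address, or report a fault. */
static int translate(uint32_t seg, uint32_t offset, uint32_t *phys) {
    if (offset >= seg_table[seg].limit) return -1;   /* out of bounds: segment fault */
    *phys = seg_table[seg].base + offset;
    return 0;
}

int main(void) {
    uint32_t phys;
    if (translate(2, 53,  &phys) == 0) printf("seg 2 + 53  -> %u\n", phys); /* 4353 */
    if (translate(3, 852, &phys) == 0) printf("seg 3 + 852 -> %u\n", phys); /* 4052 */
    return 0;
}
```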

Segmented mapping is simple, but it leads to memory fragmentation and inefficient memory swapping. Memory management distinguishes two kinds of fragmentation, internal and external.

Internal fragmentation: the memory allocated to a request is larger than what the request actually needs, so part of the allocated space goes unused.

External fragmentation: a block of free memory space that has not been allocated but is too small to be allocated to the new process requesting space.

Using the figure above as an example, the system now has 1400 + 800 + 600 = 2800 free, but if a program needs a contiguous block of 2000 it cannot be satisfied under memory segmentation: those three free blocks are external fragments. Of course, the system's Swap space can be used: segment 0 is first written out to disk and then reallocated. That keeps things usable in the end, but every such step involves disk reads and writes, so memory interaction becomes inefficient.

Swap space utilization

2.2.2 memory paging

With memory paging, the whole of virtual memory and physical memory is cut into fixed-size pieces. Each piece is called a page (Page); in Linux a page is 4KB. The mapping between virtual and physical memory is then implemented by the page table.

With paging, memory is allocated and released page by page, so there is no external fragmentation. When space runs short, the least used pages are swapped out to disk (swap-out) according to the operating system's replacement policy and swapped back in (swap-in) when needed, which keeps disk traffic as low as possible and makes memory interaction efficient.

The virtual address in paging mode is mainly composed of two parts: the page number and the intra-page offset. The physical memory address is found by querying the page table through the page number, and then combined with the intra-page offset to find the real physical memory address.

Paged memory addressing

In a 32-bit operating system a process can address 4GB of virtual memory. If a page is 4KB, that requires 4GB / 4KB = 2^20 page table entries. At 4 bytes per entry, 2^20 entries take 4MB for the page table of just one process. What if there are 10, 100, or 1000 processes? Page tables alone would consume a great deal of memory.

To solve this problem, multi-level page tables are used; the core idea is local, on-demand allocation. In a 32-bit operating system, the 4G address space is covered by a page directory of 1024 entries (4KB in total), and each directory entry points to a page table with 1024 entries, as shown below:

32-bit system two-level paging

The physical address of the page directory is stored in the control register cr3, so the page directory can be found through cr3. The Directory part (the top 10 bits) of the 32-bit linear address selects an entry in the page directory, and that directory entry stores the physical base address of the page table to use. Combined with the middle 10 bits (the Table part) of the linear address, the page table entry for the page frame is found. The Offset part of the linear address occupies the low 12 bits, so the physical address of the page frame + the Offset part = any byte within the page frame.
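The 10 + 10 + 12 split of a 32-bit linear address described above can be sketched as simple bit manipulation. The address value is arbitrary, and the actual page directory and page table contents are omitted since they live in the kernel.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t linear = 0xC0ABC123;  /* arbitrary 32-bit linear address */

    uint32_t dir    = (linear >> 22) & 0x3FF; /* top 10 bits: page directory index    */
    uint32_t table  = (linear >> 12) & 0x3FF; /* middle 10 bits: page table index     */
    uint32_t offset =  linear        & 0xFFF; /* low 12 bits: offset in the 4KB page  */

    /* In hardware: cr3 -> page_directory[dir] -> page_table[table] -> page frame,
       then physical address = frame base + offset. */
    printf("dir=%u table=%u offset=0x%03x\n", dir, table, offset);
    return 0;
}
```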

The first-level page table covers the whole 4G virtual address space, and if a region is never used, the corresponding second-level page tables simply are not created: the core idea is creation on demand. Although the system gives each process a 4G address space, no process occupies all of it. If only 10% of the first-level directory entries are in use, the page table space = 4KB for the first-level table + 0.1 * 4MB for the second-level tables, which is far better than a flat 4MB per process!

The drawback of multi-layer paging is the increase in access time.

With a single-level page table, reading data requires two memory accesses: one to read the page table entry and one to read the data itself.

With a two-level page table, three accesses are needed: one for the page directory entry, one for the page table entry, and one for the data. More memory accesses mean more total time spent reaching the data.

For 64-bit systems, two-level paging is not enough; Linux has used a four-level paging model since 2.6.11:

Page Global Directory: global page directory entry

Page Upper Directory: upper page directory entry

Page Middle Directory: middle page directory entry

Page Table Entry: page table entry

Offset: offset within the page.

64-bit addressing

2.2.3 TLB

The Translation Lookaside Buffer (TLB), sometimes called the fast table, is a module inside the CPU and part of the MMU. It is essentially a cache that holds the page table entries for recently used data (virtual-to-physical address mappings). It exists to speed up address translation and reduce repeated page table walks; it is not strictly necessary, but with it everything is faster.

TLB

The TLB is small, so it cannot cache much; it mainly holds the mappings for recently used data. The structure of the TLB is shown below:

TLB query

To access a piece of data in memory, the CPU takes its virtual address and queries the TLB first. On a hit it gets the physical address directly and fetches the data from memory; on a miss it has to walk the page table the hard way. The typical flow for a CPU data read is as follows:

CPU read data flow chart

When the process address space context switches, for example from running process 1 to process 2, the entries sitting in the TLB still belong to process 1 and are useless to process 2. There are two ways for the TLB to refresh its contents:

Full flush: very simple, but expensive; a lot of entries that did not need refreshing are flushed too, adding needless cost.

Partial refresh: refresh the data that needs to be refreshed according to the flag bits, and retain the data that does not need to be refreshed.

2.2.4 Segment-page memory management

Memory segmentation and memory paging are not mutually exclusive; they can be combined in the same system, which is commonly called segment-page memory management. It is implemented as follows:

First of all, the data is divided into different segments, that is, the segmentation mechanism mentioned above.

Then each segment is paged internally, i.e. the paging mechanism described earlier.

At this time, the address structure = segment number + intra-segment page number + intra-page displacement.

Each process has a segment table, and each segment establishes a page table. The address in the segment table is the starting address of the page table, and the address in the page table is the physical page number of a page.

Segment-page management

At the same time, we often see two professional words logical address and linear address.

Logical address: an address that has not yet gone through the segmentation mapping.

Linear address: the address produced by the segmentation mapping, before page translation; commonly known as the virtual address.

At present, Intel X86 CPUs use memory segmentation combined with memory paging, where paging adds another layer of address mapping on top of the address produced by the segmentation mapping.

X86 memory management mode

2.2.5 Linux memory management

First, the conclusion: Linux is an operating system built on X86 CPUs, so it also uses the segment-page memory management mode.

We know that the addressable range of the 32-bit operating system is 4G, and the operating system divides the 4G accessible memory space into user space and kernel space.

Kernel space: the area accessed by the operating system kernel, independent of ordinary applications, and is protected memory space. In kernel mode, CPU can execute any instruction and freely access any valid address.

User space: the memory area accessible to ordinary applications. Code executing there is subject to many CPU restrictions; a process can only access virtual addresses whose page table entries mark the page as accessible in user mode.

Why are two spaces needed at all? Embedded environments and old systems such as WIN98 made no such distinction. The two spaces are a division enforced by the CPU, with the operating system running on top of it; a single-user, single-task operating system has no need to distinguish a user mode from a kernel mode. The distinction arose from the requirements of multi-user, multi-task systems: with the cooperation of CPU hardware manufacturers, operating systems were built that restrict, by hardware means (and it can only be done by hardware means), certain code from controlling the entire physical machine. Code belonging to different users and different tasks then has no power to modify the physical hardware at will, which protects the core low-level code of the operating system, and the data of other users, from being destroyed or stolen either by accident or on purpose.

Researchers later divided CPU execution into four privilege levels, Ring0~Ring3, by how much power the running code has. Ring0 is the highest level, followed by Ring1 and Ring2, and Ring3 is the lowest. The operating system kernel runs at the highest level, Ring0, where privileged instructions can control interrupts, modify page tables, access devices, and so on. Application code runs at the lowest level, Ring3, where it cannot perform privileged operations and can only touch the space assigned to it. To do things like access a disk or write a file, it must execute a system call: when a system call runs, the CPU's privilege level switches from Ring3 to Ring0 and jumps to the corresponding kernel code, the kernel performs the device access on the application's behalf, and then control returns from Ring0 to Ring3. This process is what is meant by switching between user mode and kernel mode.

To use a device or perform IO from user mode, you must go through a system call and let the kernel carry out the operation. To avoid disturbing the current process context, the CPU uses a soft interrupt to save the context, picks up the system call number and parameters, and hands them to the corresponding kernel system-call function for execution.

Linux system structure

You can see that each application has its own independent virtual memory, but the kernel portion of every virtual address space actually maps to the same large region, so kernel memory is easy to reach whenever a process switches into kernel mode. For example, when Java code creates a thread with new Thread and calls start, tracing the JVM source shows that pthread_create is ultimately called, which involves exactly this switch from user mode to kernel mode.
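As a small illustration of the kernel-thread creation mentioned above (what new Thread().start() eventually reaches via pthread_create), here is a minimal pthread sketch; the work done in the thread is just a placeholder.

```c
#include <pthread.h>
#include <stdio.h>

/* Placeholder thread body; in the JVM case this would end up running run(). */
static void *worker(void *arg) {
    (void)arg;
    printf("hello from the new thread\n");
    return NULL;
}

int main(void) {
    pthread_t tid;
    /* pthread_create enters the kernel to create a kernel-scheduled thread,
       which is the user-mode -> kernel-mode switch described above. */
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}
```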

3 Process management

3.1 Process basics

A process is an execution of a program, an activity that occurs when a program and its data are executed sequentially on the machine, and a running process of a program with independent functions on a data set. It is a basic unit for resource allocation and scheduling of the system. The scheduling status of the process is as follows:

State change diagram

Let's focus on suspending and blocking:

Blocking is usually when the system performs an IO operation, when the process enters the blocking state, waiting for the return of an event.

Suspended means the process has been moved out of physical memory and written to disk; the process is then in a suspended state.

Blocking suspend: the process is written to the hard disk and waits for an event to occur.

Ready-suspended: the process is on disk; once it is brought back into memory it goes straight into the ready state.

3.2 PCB

In order to describe and control the operation of the process, the system defines a data structure for each process-the process control block Process Control Block, which is a part of the process entity and the most important recording data structure in the operating system.

The function of PCB is to make a program that cannot be run independently in a multiprogramming environment become a basic unit that can run independently, a process that can execute concurrently with other processes:

As the symbol of the basic unit of independent operation

To achieve intermittent mode of operation

Provide the information needed for process management

Provide the information needed for process scheduling

Achieve synchronization and communication with other processes

3.2.1 PCB Information

In order to achieve the above functions, PCB contains a lot of information:

Process identifier: used to uniquely identify a process, which usually has two identifiers:

Internal process identifier: identifies each process, each process has a unique identifier, the internal identifier is set mainly to facilitate the use of the system.

External process identifier: provided by the creator, the user identity can be set to indicate the user who owns the process. It is often used by a user process when accessing the process. In general, in order to describe the family relationship of the process, the parent process identity and child process identity should also be set.

Processor status: made up of the contents of the various registers. Much of the relevant information lives in registers, and saving it makes it easy to resume the program later.

General register, instruction counter, program status word PSW, user stack pointer and other information.

Process scheduling information

Process status: indicates the current state of the process as the basis for process scheduling and exchange.

Process priority: an integer used to describe the priority of a process using a processor, and a high-priority process should be given priority to the processor.

Other information needed for process scheduling: related to the process scheduling algorithm used, such as the sum of the time that the process has waited for CPU, the sum of time that the process has executed, and so on.

Event: the event the process is waiting for when it moves from the running state to the blocked state, i.e. the reason for blocking.

Resource list

Information about the memory address space or virtual address space, the list of open files, and the I/O devices in use.

3.2.2 PCB organization mode

There are too many PCB in the operating system, how to manage it is a problem, generally in the following ways.

Linear array

Linear mode:

Organize all the PCB of the system into a linear table and store the header address in a dedicated area of memory.

It has the advantages of simple implementation and low overhead, but the whole table needs to be scanned every time, which is suitable for systems with a small number of processes.

Index mode

Indexing method:

The processes in the same state are organized in an index table, the index table entries point to the corresponding PCB, and different states correspond to different index tables.

Linked list mode

Link method:

Link the PCB in the same state into a queue to form a ready queue, blocking queue, blank queue, and so on. The ready queues are often arranged according to the priority of the process, and the priority is high in front of the queue.

This mode is generally adopted because processes are created, destroyed, and scheduled frequently.

3.3 process control

Process control is the most basic function of process management, which mainly includes creating new processes, terminating completed processes, blocking abnormal processes, waking up processes and so on.

3.3.1 process creation

The parent process can create a child process, and the child process will be terminated after the parent process terminates. The child process can inherit all the resources of the parent process, and the termination of the child process needs to return the inherited resources to the parent process. Next, let's take a look at the general process of creation.

Assign a unique process identification number to the new process, and then create a blank PCB. Note that the number of PCBs is limited, so creation may fail.

Try to allocate the required resources for the new process, and if the resources are insufficient, the process will enter a waiting state.

To initialize PCB, you can do the following.

Identification information: fill the system-assigned identifier and the parent process identifier into the new PCB

Processor status information: make the program counter point to the program entry address and the stack pointer to the top of the stack

Processor control information: set the process to the ready (or ready-suspended) state, usually with the lowest priority

If the process scheduling queue can accept the new process, the process is inserted into the ready queue and waits to be scheduled to run.

3.3.2 process termination

Process termination is generally divided into three types: normal termination, abnormal termination and external intervention.

Normal end

Abnormal end

Out-of-bounds error: the accessed store is out of the process's area

Protection error: an attempt to access a resource that is not allowed, or to access it in an inappropriate way (for example, writing to a read-only area)

Illegal instruction: an attempt to execute an instruction that does not exist (maybe the program mistakenly transferred to the data area, and the data is treated as an instruction)

Privileged instruction error: the user process attempted to execute an instruction that is only allowed to be executed by OS

Run timeout: execution time exceeds the specified maximum

Wait timeout: the process waits for something to exceed the specified maximum

Arithmetic error: an attempt to perform a prohibited operation (divided by 0)

I/O fault

Outside intervention

Operator or OS intervention (deadlock)

At the request of the parent process, when the child process has completed the task the parent assigned to it

The parent process terminates and all child processes should end

Termination process:

According to the identifier of the terminated process, the PCB is retrieved from the PCB collection and the process status is read.

If it is in the state of execution, it immediately terminates the execution and allocates CPU resources to other processes.

If the process has a descendant process, all its descendant processes will be terminated.

All resources are returned to the parent process or operating system.

The PCB of the process is removed from the queue / linked list.

3.3.3 process blocking

Blocking means the process stops partway through execution and must be woken up by some event; a typical example is an IO read. Common blocking events are as follows:

A request for shared resources failed: the system does not have enough resources to allocate

Wait for an operation to complete

New data has not yet arrived (process of cooperation)

Waiting for a new task

Blocking process:

Find the PCB that corresponds to the process identification number to be blocked.

Change the process from a running state to a blocking state.

Insert the process PCB into the blocking queue.

3.3.4 process Wake up

The wake-up primitive wake up is generally used in pairs with blocking. The wake-up process is as follows:

Find the desired PCB from the blocking queue.

Remove the PCB from the blocking queue and change the process state to ready.

Insert the PCB into the ready queue, where it waits for CPU resources to be allocated.

3.4 process scheduling

The number of processes is generally greater than the number of CPU, and the process state switching is mainly scheduled by the scheduler. Generally speaking, CPU scheduling is mainly divided into preemptive scheduling and non-preemptive scheduling.

Non-preemptive: scheduling that allows a process to run until it is finished or blocked, easy to implement and suitable for dedicated systems.

Preemptive: each process can be scheduled and run by CPU only when it gets a time slice, which can prevent a single process from monopolizing the CPU system for a long time.

3.4.1 process scheduling principles

CPU utilization

CPU utilization = busy time / total time.

The scheduler should try to keep CPU busy all the time, which can improve the utilization of CPU. For example, when an IO read occurs, don't wait to execute another process.

System throughput

System throughput = total number of jobs completed / total time spent.

Processes with long jobs consume longer CPU resources, resulting in reduced throughput, while processes with short jobs increase system throughput.

Turnaround time

Turnaround time = job completion time - job submission time

Average turnaround time = sum of the turnaround times of all jobs / number of jobs

Weighted turnaround time = job turnaround time / actual running time of the job

Average weighted turnaround time = sum of weighted turnaround times / number of jobs

Minimize turnaround time as much as possible.

Waiting time

Refers to the time that a process waits in a waiting queue and generally needs to be as short as possible.

Response time

Response time = time of the system's first response - time of user submission. In interactive systems, response time is the main criterion for judging the quality of a scheduling algorithm.

3.4.2 scheduling algorithm

FCFS algorithm

The First Come First Served (FCFS) algorithm follows the first-come-first-served principle: each time it picks the process that has waited longest in the ready queue, runs it, and then moves on to the next.

This approach favors long jobs and suits CPU-bound workloads; it is not suitable for I/O-bound jobs, because they leave CPU utilization low.

SJF algorithm

Shortest Job First shortest Job first algorithm, which gives priority to the process with the shortest running time to execute, which can improve the throughput.

Contrary to FCFS, it is unfavorable to long jobs.

SRTN algorithm

The Shortest Remaining Time Next minimum remaining time first algorithm can be regarded as a preemptive version of SJF. When a newly ready process has a shorter completion time than the current running process, the system preempts the current process and selects the newly ready process to execute.

It achieves the shortest average turnaround time, but it is unfair: a steady stream of short tasks can keep long tasks from running for a very long time.

HRRN algorithm

The Highest Response Ratio Next (HRRN) algorithm balances the previous two by running processes in order of response ratio, from high to low; it is a compromise between them.

Priority = (waiting time + required service time) / required service time

RR algorithm

The Round Robin (RR) time-slice algorithm: the operating system sets a time slice (quantum), and each process may run only for that time slice before yielding, so every process gets the CPU in fair turns.

Time slices are generally 20ms~50ms: too small, and context switching becomes too frequent; too large, and short interactive requests respond poorly.

HPF algorithm

Highest Priority First highest priority scheduling algorithm, which selects the highest priority process from the ready queue to execute first.

The priority setting is either initialized or dynamically adjusted according to the waiting time or performance while the code is running.

The disadvantage is that low-priority processes may never get to run.

MFQ algorithm

Multilevel Feedback Queue multi-level feedback queue scheduling algorithm can be considered as a combination of RR algorithm and HPF algorithm.

There are multiple ready queues in the system at the same time, each queue priority is arranged from high to low, and the higher the priority, the shorter the time slice.

The new process will first join the highest priority queue, and if the new process has a higher priority than the currently executing process, it will stop the current process and execute the new process. If the new process is not finished in the time slice, it needs to be moved down to the secondary priority queue.

Multi-level feedback queue scheduling algorithm

3.5 Thread

3.5.1 Thread definition

In the early operating system, there was no concept of threads, and threads were added later. Why is there a thread? That is because in the past, in the multi-process phase, it often involved how to communicate between processes and how to share data. And the process is associated with the life cycle of the PCB, which is expensive to manage. Threads are introduced to solve this problem.

A thread is an execution process in a process. Multiple threads in the same process can share resources such as code segments, data segments, open files, and so on. At the same time, each thread has an independent set of registers and stacks to ensure that the control flow of the thread is independent.

Just as a process is described and managed through a PCB, the operating system manages threads through a Thread Control Block (TCB).

3.5.2 advantages and disadvantages of threads

Advantages

A process can contain multiple threads at the same time, and these threads can execute concurrently.

Resources such as address space and files can be shared between threads.

Shortcoming

When a thread in a process collapses, it causes all threads of the process to collapse.

Multithreaded programming is the kind of thing that makes people's heads hurt.

Thread execution overhead is small, but it is not conducive to the isolated management and protection of resources, while the process is just the opposite.

3.5.3 processes are associated with threads

Process:

It is an independent unit for resource allocation and scheduling.

It is an execution of a program, and each process has its own address space, memory, data stack and other auxiliary data to record the running track.

Thread:

Is an entity of a process and the basic unit of CPU scheduling and dispatching. It is a smaller basic unit that can run independently than a process.

All threads run in the same process and share the same running resources and environment

Threads are generally executed concurrently, which enables multi-task parallelism and data sharing.

Differences between process threads:

A thread can only belong to one process, and a process can have multiple threads, but at least one thread.

The division scale of threads is smaller than that of processes (resources are less than processes), which makes multithreaded programs have high concurrency.

The process has an independent memory unit in the process of execution, and multiple threads share memory, which greatly improves the running efficiency of the program.

Resources are assigned to a process, and all threads of the same process share all resources of that process.

Resources are allocated to processes, but what actually runs on the CPU is the thread.

Threads cannot execute on their own; they must exist within a process.

Where is the thread fast?

When a thread is created, many resources do not need to be allocated or managed separately; they are taken directly from the process, and the thread only has to manage its own registers and stack.

Threads within a process share data, so passing data between them does not need to go through the kernel or be copied.

A process owns a virtual address space and a page table, and its threads share them, so a context switch between two threads of the same process is much cheaper than a switch between processes.

3.5.4 Thread implementation

Kernel mode and user mode are mentioned in the previous memory management. The creation of corresponding threads is also divided into user-mode threads and kernel-mode threads.

3.5.4.1 user mode thread

The thread implemented in the user space is managed by the user-mode thread library. The operating system is scheduled according to the process dimension, and when the thread is created in the user state, the application needs to realize the thread creation, maintenance and scheduling in the user space. The operating system knows nothing about the existence of threads! The operating system can only see the process and not the thread. All threads are implemented in user space. From the operating system's point of view, each process has only one thread.

User mode thread

Benefits:

Even if the operating system does not support thread mode, it can also support thread mode through user-level library functions, and TCB is maintained by user-level thread library functions.

Using library function mode to implement threads can avoid switching from user mode to kernel mode.

Disadvantages:

CPU does not know that threads exist, and the time slice switching of CPU is based on the dimension of the process. A thread is blocked because of operations such as IO, and the operating system will block the whole process, even if other threads in the process are still working.

A user-mode thread cannot be preempted; it runs until it voluntarily gives up the CPU.

3.5.4.2 Kernel state thread

Threads implemented in the kernel are managed by the kernel: the thread's TCB lives in the operating system, and creation, termination, and management are all the kernel's responsibility. In this mode, one user thread corresponds to one kernel thread.

Kernel mode thread

Note: JVM in Linux has been implemented based on pthread since version 1.2, so now threads in Java are essentially threads in the operating system.

Advantages:

Blocking one thread in a process does not affect the running of other kernel threads.

With user-level threads, one time slice must be shared by all the threads of a process; with kernel-level threads, time slices are allocated to threads directly, so the process as a whole gets more CPU time.

Disadvantages:

Kernel-level thread scheduling is expensive: scheduling a kernel thread can cost nearly as much as scheduling a process, and far more than a user-level thread. The default thread stack is 1MB, so a large number of threads consumes a lot of memory.

Thread tables are stored in a fixed table space or stack space of the operating system, so the number of kernel-level threads is limited.

3.5.4.3 Lightweight processes

The original definition of a process has three parts: the program, the resources, and their execution. The program usually means the code; at the operating system level, the resources usually include memory, IO, signal handling, and so on; and the execution of the program is usually understood as the execution context, including its occupation of the CPU, which later evolved into threads. Before the thread concept appeared, in order to reduce the cost of process switching, operating system designers gradually loosened the process concept, letting the resources a process occupies be separated from the process itself and letting some processes share resources such as files, signals, data memory, and even code. This is how the concept of the lightweight process developed.

A lightweight process (LWP) is a user thread supported by the kernel; it is a higher-level abstraction built on kernel threads, so a system can have LWPs only if it supports kernel threads first. A process can have one or more LWPs, and each LWP maps one-to-one onto a kernel thread, i.e. every LWP is backed by a kernel thread.

LWP mode

Lightweight processes are still processes in nature, but compared with ordinary processes, LWP shares most of the logical address space and system resources with other processes. LWP lightweight is reflected in that it only has a minimum execution context and statistical information needed by the scheduler. It is the executive part of the process, with only execution-related information.

Linux features:

There are no real threads in Linux because Linux does not prepare specific data structures for threads. From the point of view of the kernel, there are only processes and no threads, and they are also scheduled as processes. What Linux calls a thread is actually a process that shares resources with other processes. But there are threads in windows.

What look like threads in Linux are simulated by processes.

So in Linux, from a CPU point of view, a process is called a lightweight process LWP.

3.5.5 Coroutines

3.5.5.1 Coroutine definition

Most web services and Internet services are essentially IO-intensive services. The bottleneck of IO-intensive services is not CPU processing speed, but to read and write data with high concurrency and multiple connections as quickly as possible. There used to be two solutions:

Multi-process: there is the problem of frequent scheduling switching, and there is also the problem that each process resource is not shared, which needs to be solved by introducing additional inter-process communication mechanism.

Multithreading: a large number of IO waits in high concurrency scenarios will cause multithreading to be suspended and switched frequently, which consumes system resources very much. At the same time, multithreading access to shared resources has a competitive problem.

This is where coroutines come in. A coroutine (Coroutine) is a kind of task unit even lighter than a thread. By analogy: a process can have multiple threads, and a thread can have multiple coroutines. A coroutine can be understood roughly as a subroutine call, with each subroutine able to run in its own coroutine.

Coroutines

Coroutines run on top of a thread. When one coroutine finishes (or yields), it can voluntarily give way and let another coroutine run on the current thread. Coroutines do not increase the number of threads; they time-share the thread they run on, and switching between coroutines happens entirely in user mode, which costs far less than a thread's switch between user mode and kernel mode. Coroutines come up mainly in Python and Go, especially in the high-performance language Go.

3.5.5.2 Points to note about coroutines

Coroutines run on a thread, and if a coroutine calls a blocking IO operation, the operating system, which knows nothing about coroutines and only sees the thread, puts the whole thread into the blocked state; the current coroutine and every other coroutine bound to that thread are then stuck and cannot be scheduled.

Therefore, operations that block the thread, such as printing, reading files, or blocking Socket calls, should not be invoked inside a coroutine. Coroutines only show their full power when combined with asynchronous IO, and they only pay off for IO-intensive tasks.

3.6 process communication

The user address space of processes is independent of each other and cannot access each other, but the kernel space is shared by all processes, so the communication between processes must go through the kernel. Inter-process communication is mainly programmed through pipes, message queues, shared memory, semaphores, signals and Socket.

3.6.1 Pipeline

Pipes are mainly divided into anonymous pipes and named pipes, which can realize the one-way flow of data. It is easy to use, but the communication mode of pipeline is inefficient and is not suitable for frequent data exchange between processes.

Anonymous pipes:

The | used every day on Linux is an anonymous pipe: the output of the previous command becomes the input of the next command.

Named pipes:

A named pipe is typically created with mkfifo SoWhatPipe; then echo "sw" > SoWhatPipe writes into it and cat < SoWhatPipe reads from it, implementing the input and output.

The anonymous pipe implementation relies on the int pipe(int fd[2]) function, where fd[0] is the read-end descriptor and fd[1] is the write-end descriptor. In essence the kernel creates an in-memory buffer: unformatted data is written in at one end and read out at the other. Note that a pipe's capacity is limited.

Pipe communication internals

Anonymous pipes can only be used between processes with a parent-child relationship. Since an anonymous pipe has no entity, i.e. no pipe file, the file system is not involved; the only way to share it is to fork a child process, which copies the parent's fd file descriptors, and let parent and child communicate through the shared pipe. Because a pipe can only be written at one end and read at the other, parent and child usually follow these conventions:

The parent process closes the read end fd[0] and keeps only the write end fd[1].

The child process closes the write end fd[1] and keeps only the read end fd[0].

Shell pipe communication

Note that when the shell runs an anonymous pipe a | b, the shell parent process actually forks two child processes to do the communicating; a and b themselves have no parent-child relationship. A named pipe, by contrast, can be used directly between unrelated processes, because it has a pipe file.

3.6.2 Message queue

Message queue

A message queue is a linked list of messages kept in the kernel, so using it involves switching back and forth between user mode and kernel mode. Both sides agree on the data structure of the message body; when sending, the data is split into independent message-body units. Note that both the whole queue and a single message have size limits. RabbitMQ and Redis, which we use day to day, both involve message queues.

3.6.3 Shared memory

Shared space

Modern operating systems manage memory with virtual memory technology: each process has its own independent virtual address space, and different processes' virtual memory maps to different physical memory, so even if process A and process B use the same virtual address, they actually access different physical addresses. Shared memory maps a region of each process's virtual address space to the same physical memory; once set up, this mode does not involve switching back and forth between user mode and kernel mode. The JVM uses the shared-memory model, and concurrent programming on shared data is itself a hard problem.

3.6.4 Semaphore

Since shared memory easily leads to data races, a semaphore is needed when we simply want shared data to be accessible by only one process at any moment.

A semaphore is actually an integer counter, used mainly to implement mutual exclusion and synchronization between processes, not to buffer the data being communicated.

The semaphore represents the number of available resources, and the key point is that its value is controlled atomically, through two atomic operations, P and V:

The P operation subtracts 1 from the semaphore; if the result is < 0, the resource is already occupied and the process must block and wait; if the result is >= 0, resources are still available and the process continues normally.

The V operation adds 1 to the semaphore; if the result is <= 0, some process is still blocked waiting on it and should be woken up; if the result is > 0, no process is currently blocked.
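A minimal sketch of the P/V operations using POSIX unnamed semaphores shared between a parent and a forked child: sem_wait plays the role of P and sem_post the role of V. Placing the semaphore in an anonymous shared mapping via mmap is an implementation detail chosen for this example, and error handling is omitted.

```c
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Put the semaphore in shared memory so parent and child see the same counter. */
    sem_t *sem = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(sem, 1 /* shared between processes */, 1 /* one resource available */);

    if (fork() == 0) {          /* child */
        sem_wait(sem);          /* P: semaphore--, block if it would drop below 0 */
        printf("child in critical section\n");
        sem_post(sem);          /* V: semaphore++, wake a waiter if any           */
        _exit(0);
    }

    sem_wait(sem);              /* P in the parent */
    printf("parent in critical section\n");
    sem_post(sem);              /* V in the parent */

    wait(NULL);
    sem_destroy(sem);
    return 0;
}
```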

3.6.5 signal

For processes in abnormal states, the signal mechanism is used to notify them; the Linux system provides many signals (see kill -l) in response to various events. Signals are the only asynchronous mechanism among the inter-process communication mechanisms: a signal can be sent to a process at any moment. For example:

kill -9 1412 sends the SIGKILL signal to the process with PID 1412 to end it immediately.

The keyboard Ctrl+C generates a SIGINT signal that terminates the process.

The keyboard Ctrl+Z generates a SIGTSTP signal to stop the process, but it is not finished yet.

When a signal occurs, the process generally responds to the signal in three ways:

Perform default actions: the Linux operating system is equipped with special processing operations for many signals.

Capture signal: equip the captured signal with a special signal processing function.

Ignore the signal: explicitly ignore certain signals; however, SIGKILL and SIGSTOP cannot be ignored, so that a process can always be ended or stopped.
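A small sketch of the second response mode (catching a signal): it installs a handler for SIGINT so that Ctrl+C prints a message instead of terminating, and ignores SIGTSTP; SIGKILL, as noted, can be neither caught nor ignored. Using signal() here is a simplification; sigaction is the more robust API.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Custom handler: the "capture signal" response mode. */
static void on_sigint(int signo) {
    (void)signo;
    /* write() is async-signal-safe, unlike printf */
    const char msg[] = "caught SIGINT (Ctrl+C), not exiting\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void) {
    signal(SIGINT, on_sigint);   /* capture SIGINT                 */
    signal(SIGTSTP, SIG_IGN);    /* ignore Ctrl+Z in this program  */
    /* signal(SIGKILL, ...) would fail: SIGKILL cannot be caught or ignored */

    for (;;) pause();            /* wait for signals forever */
}
```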

3.6.6 Socket programming

The previously mentioned pipes, message queues, shared memory, semaphores, and signals all communicate between processes on the same host, so Socket communication is required to communicate across the network and between processes on different hosts.

int socket(int domain, int type, int protocol)

The above are the core functions of socket programming, you can specify IPV4 or IPV6 type, TCP or UDP type. For example, the socket programming model of TCP protocol communication is as follows:

Socket programming

The server and client initialize the socket to get the file descriptor.

The server calls bind, which binds to the IP address and port.

The server calls listen to listen.

The server calls accept and waits for the client to connect.

The client calls connect to initiate a connection request to the address and port of the server.

The server accept returns the file descriptor of the socket used for the transfer.

The client calls write to write data, and the server calls read to read the data.

When the client is done it calls close to disconnect, so the server's read eventually returns EOF; after processing the remaining data, the server also calls close to close the connection.

When the server calls accept, the successful connection returns a socket that has completed the connection, which is later used to transfer data. There are two socket on the server, one is called snooping socket and the other is called completed connection socket.

After a successful connection is established, both parties begin to read and write data through read and write functions.
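A condensed sketch of the server side of the flow above (socket -> bind -> listen -> accept -> read/write -> close). Error handling is trimmed, it handles a single connection, and port 8080 is an arbitrary choice for the example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* 1. create the listening socket */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);

    /* 2. bind to 0.0.0.0:8080 */
    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));

    /* 3. listen, 4. accept: returns a second socket used for the transfer */
    listen(listen_fd, 16);
    int conn_fd = accept(listen_fd, NULL, NULL);

    /* 5. read the client's data and echo it back; read() == 0 would mean EOF */
    char buf[1024];
    ssize_t n = read(conn_fd, buf, sizeof(buf));
    if (n > 0) write(conn_fd, buf, (size_t)n);

    close(conn_fd);    /* the connected (transfer) socket */
    close(listen_fd);  /* the listening socket            */
    return 0;
}
```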

UDP transmission

UDP is comparatively simple: it is a connectionless, broadcast-like form of transmission and does not maintain a connection. bind is still required, though, and every call to sendto and recvfrom passes in the IP address and port of the target host.

3.7 Multithreaded programming

Since multiple processes are too expensive, what we often use is multithreaded programming. Memory models, JMM, Volatile, critical sections, and so on, may be involved during this period. This is covered in the Java concurrent programming column.

4 File management

4.1 VFS, the virtual file system

In the operating system, the file system is mainly responsible for storing the file data information to the disk and plays the role of persisting the file. The basic unit of the file system is the file, and different ways of file composition will form different file systems.

There are many kinds of file systems, but once plugged into the operating system they all need to present a unified interface to the outside. Here the classic design idea applies: there is nothing that cannot be solved by adding another layer. A virtual file system layer (Virtual File System, VFS) is inserted between the user layer and the various concrete file systems.

The virtual file system layer defines a set of data structures and standard interfaces supported by all file systems, so that programmers do not need to understand how the file system works, only the unified interface provided by VFS.

Virtual file system

There are generally three kinds of daily file systems:

Disk file systems: the familiar EXT2/3/4 series.

Memory file systems: the data is not stored on disk but in memory, for example /sys and /proc; some process data is mapped into /proc.

Network file system: common network disk mount NFS, etc., by accessing the data of other hosts.

4.2 document composition

Take the Linux system as an example, in the Linux system, everything is a file, and the Linux file system assigns an index node inode and a directory entry directory entry to each file to record the file content and directory hierarchy.

4.2.1 inode

To understand inode, start with file storage. Files are stored on the hard disk, and the smallest storage unit of the hard disk is called the sector. Each sector stores 512 bytes. When the operating system reads the hard disk, it will not read each sector one by one, so the efficiency is too low. Generally, 8 sectors (4KB) are read continuously at one time as a block. This block composed of multiple sectors is the smallest unit of file access.

File data is stored in blocks, and we also need somewhere to store the file's meta-information, such as the inode number, file size, creation time, modification time, location on disk, and access permissions. Almost all file metadata except the file name is stored in a structure called the index node, or inode. You can view a file's inode information with stat filename.

Each inode has a number, and the operating system identifies files by inode number. Internally, Unix/Linux does not use file names at all, only inode numbers, and a user can see each file's number with ls -i. To the system, the file name is just an easy-to-recognize alias for the inode number. When a file with an unusual name is hard to delete, you can try deleting it by inode number; moving or renaming a file does not change its inode. When a user opens a file by name, the system actually goes through three steps:

The system finds the inode number corresponding to this file name.

Through the inode number, obtain inode information, carry out permission verification and other operations.

According to the inode information, find the block where the file data is located, and read the data.
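The same metadata that the stat command shows can also be read programmatically through the stat(2) system call; a small sketch in C (the path is just an example):

#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("/etc/hostname", &st) == 0) {
        printf("inode: %lu\n", (unsigned long)st.st_ino);
        printf("size : %lld bytes\n", (long long)st.st_size);
        printf("links: %lu\n", (unsigned long)st.st_nlink);
    }
    return 0;
}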

Note that inode also consumes hard disk space. After formatting, the hard disk is divided into three areas: Super block, index node area and data block area:

Super block area: stores details of the file system, such as block size and total number of blocks. It is generally loaded into memory when the file system is mounted.

Inode area: stores the inode table. Each inode is typically 128 or 256 bytes, and roughly one inode is reserved for every 1KB or 2KB of data. The inode table is usually cached in memory to speed up lookups.

Block area: the place where disk data is actually stored.

df -i                                           # view the total and used inode counts of each partition
sudo dumpe2fs -h /dev/hda | grep "Inode size"   # view the size of each inode

4.2.2 Directory

In the Unix/Linux system, a directory is also a kind of file; opening a directory really means opening a directory file. The contents of a directory file are a list of directory entries, each containing a file's name, file type, inode pointer, and its hierarchical relationship to other directory entries.

To avoid reading directory files from disk over and over, the kernel caches directory files that have already been read in memory using the directory entry (dentry) data structure, so that directory information can be read quickly the next time. A directory entry can refer to a directory or to a file; don't be surprised that directories are included, since directory entries record the files inside a directory one by one.
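A small sketch that lists a directory with opendir/readdir: each entry carries exactly what the text describes, a file name plus its inode number (the "." path is just an example).

#include <dirent.h>
#include <stdio.h>

int main(void) {
    DIR *d = opendir(".");
    if (!d) return 1;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)                     /* one directory entry per iteration */
        printf("%-20s inode=%lu\n", e->d_name, (unsigned long)e->d_ino);
    closedir(d);
    return 0;
}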

4.2.3 soft links and hard links

Soft links and hard links

Hard link: after creating several hard links B and C for an existing file A, the files A, B, and C share the same inode, which is why hard links cannot cross file systems. The underlying file is deleted only after A, B, and C have all been deleted.

Soft link: it is equivalent to creating a new file B on top of the old file A. B has its own new inode, but its content is the path of the old file A, so soft links can cross file systems. When the old file A is deleted, file B still exists, but the file it points to can no longer be found.

[sowhat@localhost ~]$ ln [option] source_file target_file
    -s : create a soft (symbolic) link; without -s a hard link is created
    -f : force; if the target file already exists, delete it first, then create the link

4.3 File Storage

Before looking at file storage, note that the basic unit the file system operates on is the data block; the bytes that users normally read and write have to be converted to data blocks, and the file system handles this conversion for us. Next, let's see how a file system lays out data blocks, divided into contiguous space storage and non-contiguous space storage.

4.3.1 continuous space storage

Implementation: contiguous space storage works like an array: a stretch of contiguous space is found and the data is stored in it in one piece; the file header only needs to record the starting position and the data length.

Advantages: high reading and writing efficiency, disk addressing can be done once.

Disadvantages: prone to space fragmentation, and expanding a file is inconvenient.

Continuous storage

4.3.2 linked list of discontinuous space storage

Implicit linked list

Implementation: the file header records a StartBlock and an EndBlock, and each block carries a hidden next pointer, just like a singly linked list.

Disadvantages: data can only be found by following the chain block by block; fast random access is not supported.

Implicit linked list

Explicit linked list

Implementation: the next pointer of every block is stored in an in-memory file allocation table, and all of a file's data is found by walking through that table (a small sketch follows the figure).

Disadvantages: every block (say every 1KB) needs its own pointer entry, so on a large disk a very large file allocation table is needed to hold the mapping.

Explicit linked list
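A minimal sketch of the explicit linked list (FAT-style) idea: the per-block next pointers live in an in-memory table, and a file's blocks are visited by walking that table. The block count and table layout are illustrative.

#define NBLOCKS   1024
#define BLOCK_EOF (-1)

static int fat[NBLOCKS];                    /* fat[b] = next block of the same file, or BLOCK_EOF */

/* visit every block of a file whose first block is start_block */
static void walk_file(int start_block, void (*visit)(int block)) {
    for (int b = start_block; b != BLOCK_EOF; b = fat[b])
        visit(b);
}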

4.3.3 Index of discontiguous spatial storage

Implementation: think of the whole file system as a Xinhua dictionary: the real data blocks sit in the body of the dictionary, while the locations of the data blocks a file needs are gathered into an index, placed up front like the dictionary's table of contents.

Advantages: no fragmentation, files can grow dynamically, and both sequential and random access are supported.

Disadvantages: even a small file occupies a whole index block, and when a file is so large that one index block cannot hold all of its pointers, a multi-level index or an index + linked list scheme is needed.

Index storage

Each of these storage methods has its own pros and cons, so the operating system generally switches the storage scheme dynamically according to file size, just as the sort in the STL is implemented underneath as quicksort + insertion sort + heapsort.

4.3.4 Free space management

In order to prevent users from traversing all the disk space to find available data blocks when storing data, there are generally the following recording methods.

Free table: maintains a table of free extents, each row recording the starting position and length of a free extent. Suitable when the free extents are few.

Free table

Free linked list: chains the free data blocks together with next pointers. The drawback is that it cannot be accessed randomly.

Free linked list

Bitmap method: uses a bit (0/1) to record whether each data block is in use or free; simple and compact. Both inodes and free data blocks are managed this way (a small sketch follows the figure).

Bitmap method
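A minimal sketch of bitmap-based free block management (the sizes are illustrative): one bit per block, scan for a 0 bit to allocate, clear the bit to free.

#include <stdint.h>

#define NBLOCKS 1024
static uint8_t bitmap[NBLOCKS / 8];                /* 1 bit per block: 0 = free, 1 = in use */

static int alloc_block(void) {
    for (int i = 0; i < NBLOCKS; i++) {
        if (!(bitmap[i / 8] & (1u << (i % 8)))) {  /* found a free block */
            bitmap[i / 8] |= (1u << (i % 8));      /* mark it used */
            return i;
        }
    }
    return -1;                                     /* no free block left */
}

static void free_block(int i) {
    bitmap[i / 8] &= ~(1u << (i % 8));             /* mark it free again */
}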

5 input and output management 5.1 device controller and driver

5.1.1 device Controller

Device controller

In order to manage the large number of devices and hide the differences between them, the operating system equips each device with a small "CPU" of its own, called the device controller. Each controller knows the function and usage of its peripheral, and each has its own registers through which it communicates with the CPU.

By reading the device's register values, the CPU learns the device's status and whether it can accept new instructions.

The operating system can write some instructions to the device register to send data, receive data and so on.

The controller is generally divided into data register, command register and status register. CPU controls the device conveniently by reading and writing the registers in the device controller.

Data register: the CPU writes the data to be transferred into it. For example, to print "what", the CPU first sends the character 'w' to the corresponding I/O device.

Command register: the CPU writes a command into it to tell the I/O device to perform an input or output operation; the device takes the job, and when the task is finished it marks the status register as complete.

Status register: used to tell the CPU whether work is still in progress or already finished; only when the status register shows complete can the CPU send the next character or command.

At the same time, input and output devices can be divided into block devices and character devices.

Block devices: data is stored in fixed-size blocks, each with its own address; hard disks and USB drives are common block devices. Because block devices transfer large amounts of data, the controller contains a read/write buffer to avoid frequent I/O. The Linux kernel also introduces a generic block layer to hide the differences between block devices. The generic block layer is an abstraction layer between the file system and the disk drivers, and it mainly provides the following two functions:

Upward, it provides a standard interface for file systems and applications to access block devices, abstracting the various different disks into unified block devices, and it provides a kernel-level framework for managing these devices' drivers.

Downward, it schedules the I/O requests issued by file systems and applications (the I/O scheduler reorders and merges them) in order to improve the efficiency of disk reads and writes.

Character device: sends or receives a character stream in units of characters. The character device is not addressable and there is no seek operation. The mouse is a common character device.

CPU generally communicates with the device's control registers and data buffers through IO ports and memory-mapped IO.

I/O port: each control register is assigned an I/O port number, and these registers are manipulated with special assembly instructions such as in/out.

Memory-mapped I/O: maps all control registers into the memory address space, so that the registers and data buffers can be read and written just like ordinary memory.

5.1.2 driver interface

Driver program

The device controller hides the device details, but each controller's registers, buffers and usage patterns still differ; these belong to the hardware. To hide the differences between device controllers, device drivers are introduced: the drivers of different devices all expose a unified interface for the operating system to call, so the kernel can invoke a device driver interface just as it would call native code.

When a device issues an I/O interrupt request, it is handled in the device driver, which calls the corresponding interrupt handler according to the interrupt type.

Interrupt request process

5.2 IO control

The CPU sends instructions to the device controller to read and write data; how is the CPU notified when the work is done?

5.2.1 polling mode

The controller has a status register, and the CPU keeps polling it to check the device's state. In this mode the CPU is foolishly occupied the whole time (a small sketch follows the figure).

Polling mode
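A minimal sketch of polling-mode output in C: the CPU spins on the device's status register until the device is idle, then writes the data register. The register addresses are invented for illustration; on real hardware they come from the device's memory map.

#include <stdint.h>

#define DEV_STATUS  (*(volatile uint32_t *)0x10000000u)   /* hypothetical status register */
#define DEV_DATA    (*(volatile uint32_t *)0x10000004u)   /* hypothetical data register   */
#define STATUS_BUSY 0x1u

/* send one byte in polling mode: the CPU burns cycles the whole time it waits */
static void dev_putc(uint8_t c) {
    while (DEV_STATUS & STATUS_BUSY) {
        /* busy-wait: nothing useful gets done here */
    }
    DEV_DATA = c;
}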

5.2.2 IO interrupt request

Interrupt mode

The controller is wired to an interrupt controller. When the device finishes its task, it raises an interrupt to the interrupt controller, which then notifies the CPU to handle the interrupt request. There are two kinds of interrupts: soft interrupts, triggered for example by the INT instruction in code, and hardware interrupts, raised by the interrupt controller. However, interrupt mode is not friendly to frequent disk reads and writes, because the CPU would be interrupted constantly.

Here we should mention the disk cache, PageCache, which caches the data the CPU has accessed recently in memory and also performs read-ahead: the next 16KB of data may already be cached before you have finished reading the first 16KB.

PageCache: the page cache. When a process needs to read a disk file, Linux allocates some memory, copies the data from disk into it, and then passes the data to the process. When a process needs to write data to disk, Linux first allocates memory to receive the user data and then writes it from memory to disk. Because its size is limited, PageCache generally caches only recently accessed data, and the disk must be accessed when the data is not there.

5.2.3 DMA mode

Direct Memory Access. With the support of a hardware DMA controller, when data is transferred between an I/O device and memory, all of the copying work is handed to the DMA controller; the CPU no longer takes part in anything related to moving the data and is left free to handle other things.

DMA mode

It can be seen that during the whole transfer the CPU does not move the data itself; the DMA controller is responsible for the reading. Nowadays every I/O device generally has its own DMA controller, and CPU involvement is needed only at the very beginning and end of the transfer.

5.2.4 Zero Copy

Zero copy does not move data through the CPU: all data is transferred by DMA. It needs only 2 context switches and 2 DMA copies, which makes it at least twice as fast as the original read/write approach. Zero copy has in fact already come up in the Kafka article.

5.2.4.1 read and write the old version

The old-style read-then-write path, which does nothing to the data itself, involves 4 switches between user mode and kernel mode, 2 DMA copies, and 2 CPU copies.

Old-fashioned reading and writing

The way to speed this up is to reduce the number of user/kernel context switches and memory copies. Copying data from the kernel's read buffer to the user buffer and then from the user buffer to the socket buffer is unnecessary. Next, let's walk through the history of zero copy in three versions.

5.2.4.2 mmap and write

Mmap + write

The idea is to use mmap instead of the read function. When mmap is called, the data in the kernel buffer is mapped directly into user space, which saves one data copy; but the data still has to be copied from the kernel buffer to the socket buffer by the CPU, and there are still 4 context switches, because there are still 2 system calls.

buf = mmap(file, len);
write(sockfd, buf, len);
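A fuller sketch of the mmap + write path using the real mmap signature, assuming in_fd (the file) and sockfd (the socket) are already open:

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int send_with_mmap(int in_fd, int sockfd) {
    struct stat st;
    if (fstat(in_fd, &st) < 0) return -1;

    void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in_fd, 0);
    if (buf == MAP_FAILED) return -1;

    ssize_t n = write(sockfd, buf, st.st_size);    /* CPU still copies page cache -> socket buffer */
    munmap(buf, st.st_size);
    return n < 0 ? -1 : 0;
}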

5.2.4.3 sendfile

The function sendfile () is provided in version 2.1 of the Linux kernel.

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
    out_fd : destination file descriptor
    in_fd  : source file descriptor
    offset : offset within the source file
    count  : length of data intended to be copied
    return : length of data actually copied (ssize_t)

You can see that sendfile = read + write: it avoids switching back and forth between user mode and kernel mode, and the data is copied from the kernel buffer straight to the socket buffer, leaving only 2 context switches and 3 data copies.

Sendfile mode
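A usage sketch of sendfile, again assuming in_fd and sockfd are already open; the kernel advances the offset itself as data is copied:

#include <sys/sendfile.h>
#include <sys/stat.h>

static int send_with_sendfile(int in_fd, int sockfd) {
    struct stat st;
    if (fstat(in_fd, &st) < 0) return -1;

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sockfd, in_fd, &offset, st.st_size - offset);
        if (n <= 0) return -1;                     /* offset has been advanced on success */
    }
    return 0;
}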

5.2.4.4 True zero copy

Since Linux kernel 2.4, if the network card supports SG-DMA, the copy from the kernel buffer to the socket buffer done by the CPU can be eliminated as well.

$ ethtool -k eth0 | grep scatter-gather
scatter-gather: on

SG-DMA technology can directly copy the data from the kernel cache to the buffer of the network card, and this process does not need to copy the data from the operating system kernel buffer to the socket buffer, thus reducing a data copy.

ZeroCopy

5.2.4.5 File transfer rules

Don't assume that once you know about zero copy you should use it for files of every size. In practice, zero copy is generally used for small files, while large files use asynchronous I/O. As for why, look at the following analysis:

The reads described above go from the disk into the kernel buffer, that is, into PageCache, which provides caching and read-ahead. However, PageCache does not help when transferring very large files: a large file quickly fills the PageCache area, yet its pages are unlikely to be accessed again, and meanwhile other hot, small files can no longer use the PageCache. So for large files, asynchronous I/O is used instead of PageCache. What exactly asynchronous I/O is will be discussed later.
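One concrete form of asynchronous I/O is POSIX AIO (aio_read and friends, linked with -lrt on glibc); the sketch below submits a read, overlaps other work, then collects the result. Large-file transfer usually pairs asynchronous I/O with direct I/O so the data bypasses PageCache; the file path here is just an example.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/bigfile", O_RDONLY);       /* example path */
    if (fd < 0) return 1;

    static char buf[1 << 20];                      /* 1 MB chunk */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                                 /* returns immediately, the kernel reads in the background */
    while (aio_error(&cb) == EINPROGRESS) {
        /* overlap useful work here instead of blocking */
    }
    printf("read %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}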

5.3 IO layering

IO layering

The Linux storage system can be divided into file system layer, general block layer and device layer from top to bottom.

The file system layer provides a standard file access interface to the applications above it and, through the generic block layer below, stores and manages data on disk.

The generic block layer contains the block devices' I/O queues and the I/O scheduler, which processes the I/O requests.

The device layer includes the hardware devices, device controllers and drivers, and is responsible for the I/O operations on the final physical devices.

IO read acceleration in Linux system:

In order to improve the efficiency of file access, a variety of caching mechanisms such as page cache, index node cache and directory item cache are used to reduce the direct calls to block devices.

In order to improve the access efficiency of the block device, a buffer is used to cache the data of the block device.

6 End

This roughly 30,000-word piece is finally over. I hope reading it has left you with a general impression of the operating system. You use Windows every day, perhaps without realizing that after 30 years of accumulation the complete Windows source tree is more than 0.5 TB in size, involves more than 560,000 folders and more than 4 million files, and totals more than a billion lines. Add the demands of compatibility, maintainability and manageability between all of these features, and as the code is refined again and again it finally becomes such a work of art!

Thank you for reading. The above is the content of "how to understand the hard-core operating system". After studying this article, I believe you have a deeper understanding of the topic; the specifics still need to be verified in practice. The editor will keep publishing more articles on related knowledge points; welcome to follow!
