2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how the JVM interacts with the operating system, working through the underlying mechanisms step by step.
To the naked eye, a computer is made up of hardware such as the CPU, memory, and monitor, yet most of us work in software development. The underlying principles of the computer are the bridge between hardware and software, and understanding them makes the road of programming faster and easier. Once you look at how a high-level language executes from the operating system's point of view, you will find that many software designs follow the same patterns and that many language features depend on underlying mechanisms. This article walks through them.
Understanding how a line of Java code is executed, starting from the CPU
According to von Neumann's design, a computer works in binary and must include an arithmetic unit, a control unit, memory, and input and output devices, as shown in the following figure.
[Figure: von Neumann architecture]
Let's first look at how the CPU works. Most modern CPU chips integrate a control unit, an arithmetic unit, and a storage unit. The control unit is the control center of the CPU: it tells the CPU what to do next, that is, which instruction to execute. It contains the instruction register (IR), the instruction decoder (ID), and the operation controller (OC).
When a program is loaded, its instructions reside in memory. Memory here means the main memory outside the CPU, that is, the memory modules in a PC. The instruction pointer register (IP) holds the address of the next instruction to execute, and the control unit loads the instruction at that address from main memory into the instruction register.
The instruction register is also a storage device, but it is integrated in the CPU. An instruction that arrives from main memory is just a string of binary digits, so the decoder must work out what the opcode is and where the operands are. The arithmetic unit then performs the arithmetic operation (addition, subtraction, multiplication, division) or logical operation (comparison, shift). Instruction execution is therefore roughly: fetch (load the instruction from main memory into the instruction register), decode (identify the opcode and bring the operands from main memory into the L1 cache), and execute (perform the operation).
[Figure: CPU structure and instruction execution]
A note on the figure above: the storage unit integrated in the CPU is SRAM, as opposed to the DRAM used for main memory. RAM is random access memory: data at any address can be accessed directly, whereas a disk is far slower at random access and is best read sequentially. RAM comes in static and dynamic forms. Static RAM has low integration density, small capacity, and high speed. Dynamic RAM has high density and stores each bit by charging and discharging a capacitor, so it is slower than static RAM. Dynamic RAM is therefore generally used as main memory, while static RAM serves as the cache between the CPU and main memory, hiding the speed gap between them. These are the L1 and L2 caches we often see; each successive level is slower and larger.
The following figure shows the memory hierarchy and the process by which the CPU accesses main memory. There are two points to note. The first is the cache coherence protocol (MESI is a common example) introduced for the multi-level caches to keep data consistent. The second is the mapping between cache and main memory. The unit of the cache is the cache line, which corresponds to a block of main memory rather than to a single variable. This follows from the locality of CPU accesses. Temporal locality: a memory cell that has just been accessed is likely to be accessed again soon. Spatial locality: the neighbors of a memory cell that has just been accessed are likely to be accessed soon as well.
There are several mapping schemes. A simple one is: cache line number = main memory block number mod total number of cache lines. Given a main memory address, first compute its block number in main memory, then compute the cache line number from the block number.
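The modulo mapping can be written out in a few lines. This is only a sketch of a direct-mapped cache: the class name, the 64-byte line size, and the 512-line capacity are assumptions chosen for illustration, not values taken from any particular CPU.

```java
final class DirectMappedCacheSketch {
    // Assumed geometry for illustration: 64-byte lines, 512 lines (32 KB in total).
    static final int LINE_SIZE_BYTES = 64;
    static final int TOTAL_LINES = 512;

    /** Block number in main memory that contains the given byte address. */
    static long blockNumber(long address) {
        return address / LINE_SIZE_BYTES;
    }

    /** Cache line number = main memory block number mod total number of cache lines. */
    static long cacheLine(long address) {
        return blockNumber(address) % TOTAL_LINES;
    }

    public static void main(String[] args) {
        long address = 0x12345L;
        System.out.println("block " + blockNumber(address) + " -> line " + cacheLine(address));
    }
}
```

Two addresses whose block numbers differ by a multiple of TOTAL_LINES land on the same cache line, which is why real caches usually add set associativity on top of this scheme.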
Let's return to instruction execution. Fetch, decode, execute: this is the life cycle of one instruction, and every instruction goes through the stages in this order. Different instructions, however, can overlap. On a single-core CPU only one instruction can occupy the execution unit at a time, where execution is the third of the three stages (fetch, decode, execute), that is, the work done by the arithmetic unit.
To raise instruction throughput, the preparatory work should be finished before the arithmetic unit needs it, so that the arithmetic unit is always busy. In the serial flow above, the arithmetic unit is idle during fetch and decode, and on a cache miss fetch and decode must go to main memory, which is far slower than the CPU. An instruction pipeline therefore greatly improves processing speed. The following figure shows a three-stage pipeline; the Pentium 4 used a much deeper pipeline of around 30 stages, obtained by splitting the three stages above into finer steps.
[Figure: three-stage instruction pipeline]
Besides the instruction pipeline, the CPU uses other techniques to gain speed, such as branch prediction and out-of-order execution. Now back to the point: how does a line of Java code execute?
For a line of code to execute, it needs an execution context: instruction registers, data registers, stack space, and other memory resources. It must also be recognized by the operating system's task scheduler as part of an execution stream and be assigned CPU time. Finally, the instructions it stands for must be decodable by the CPU. So a line of Java code has to be translated into the corresponding CPU instructions before it can run. Let's follow the translation of the line System.out.println("Hello world").
Java is a high-level language; it cannot run directly on hardware and must run on a virtual machine that understands Java. The Java compiler converts Java source code into a sequence of instructions that the virtual machine recognizes, known as Java bytecode. It is called bytecode because each opcode is exactly one byte long. The following is the bytecode compiled from System.out.println("Hello world"):
0x00:  b2 00 02        getstatic java.lang.System.out
0x03:  12 03           ldc "Hello, World!"
0x05:  b6 00 04        invokevirtual java.io.PrintStream.println
0x08:  b1              return
In the listing above, the left column is the offset, the middle column is the bytecode that the virtual machine reads, and the right column is the corresponding mnemonic. The listing below shows machine code for a comparable native program: the first column is the offset, the middle column is the machine code, and the last column is the corresponding assembly instruction:
0x00:  55                     push rbp
0x01:  48 89 e5               mov rbp,rsp
0x04:  48 83 ec 10            sub rsp,0x10
0x08:  48 8d 3d 3b 00 00 00   lea rdi,[rip+0x3b]            ; load "Hello, World!\n"
0x0f:  c7 45 fc 00 00 00 00   mov DWORD PTR [rbp-0x4],0x0
0x16:  b0 00                  mov al,0x0
0x18:  e8 0d 00 00 00         call 0x12                     ; call the printf method
0x1d:  31 c9                  xor ecx,ecx
0x1f:  89 45 f8               mov DWORD PTR [rbp-0x8],eax
0x22:  89 c8                  mov eax,ecx
0x24:  48 83 c4 10            add rsp,0x10
0x28:  5d                     pop rbp
0x29:  c3                     ret
After the JVM loads the bytecode in the class file through the class loader, the interpreter translates it into machine instructions that the CPU can recognize (shown above in assembly form). The interpreter is implemented in software, which is what lets the same Java bytecode run on different hardware platforms; decoding and executing the resulting machine instructions is done directly by hardware and is very fast. To improve efficiency, the JVM can also compile hot code (typically a whole method) into machine instructions once and then execute the compiled code. This is just-in-time (JIT) compilation, the counterpart of interpreted execution. At JVM startup, -Xint and -Xcomp select interpreted-only or compile-first execution.
From the software side, after a class file is loaded into the virtual machine, its class information is stored in the method area, and at run time the code in the method area is executed. All threads in the JVM share the heap and the method area, while each thread has its own Java method stack, native method stack (for native methods), and PC register (which records where the thread is executing). When a method is called, the JVM pushes a stack frame onto the current thread's method stack; the frame holds the bytecode operands and local variables. When the method finishes, the frame is popped. A thread executes many methods in succession, each corresponding to the push and pop of a frame, and the JVM interprets and executes a method's code after its frame has been pushed.
[Figure]
Interrupts
As just described, once the CPU is powered on it runs like a perpetual motion machine, fetching and executing instructions over and over. Interrupts are the soul of the operating system: as the name says, an interrupt breaks into the CPU's current flow of execution and makes it do something else.
For example, a fatal error during execution requires the program to be terminated. When a user program makes a system call such as mmap, it raises an interrupt so that the CPU switches context into kernel space. When a program is blocked waiting for user input and the user finishes typing, the kernel data is ready and an interrupt signal is sent to wake the user program so that it can fetch the data from the kernel; otherwise the kernel buffer could overflow. A disk that hits a fatal error also notifies the CPU through an interrupt, and the timer raises a clock interrupt on every clock tick.
We will not classify the types of interrupts here. Interrupts resemble what we call event-driven programming, but how is this notification mechanism implemented? A hardware interrupt is delivered as a signal over a wire connected to the CPU. A software interrupt is raised by a specific instruction, for example the instruction that performs the system call used to create a thread. After executing each instruction, the CPU checks whether an interrupt is pending in the interrupt register; if so, it takes the interrupt and runs the corresponding handler.
Trapping into the kernel: when designing software we pay attention to how often a program switches context, because too high a frequency hurts performance. Trapping into the kernel is described from the CPU's point of view: execution moves from user mode to kernel mode, so the CPU that was running the user program is now running kernel code. This switch is triggered by a system call.
A system call runs the operating system's own low-level code. To protect the operating system, the designers of Linux separated a process's execution into kernel mode and user mode. Within one process, kernel and user share the same address space: on 32-bit Linux this is typically a 4 GB virtual address space, of which 1 GB belongs to the kernel and 3 GB to user code. When programming, we should minimize switches from user mode to kernel mode. Creating a thread, for example, requires a system call, which is one reason thread pools exist.
Understanding the JVM memory model from the perspective of Linux memory management
Process context
We can think of a program as a set of executable instructions. When the program starts, the operating system allocates it resources such as CPU time and memory. The running program is what we call a process, which is the operating system's abstraction of a program running on the processor.
The memory and CPU resources allocated to the process, together with the current instruction and variable values, make up the process context. Once started, the JVM is just an ordinary process on Linux. The physical entity of a process and the environment that supports its execution are collectively called its context. A context switch replaces the currently running process with another one on the processor, which lets multiple processes run concurrently. A context switch may be triggered by the operating system scheduler or from within the program; reading from I/O, for example, switches between user code and operating system code.
[Figure]
Virtual storage
Suppose we start several JVMs at the same time and each executes System.out.println(new Object()); This prints the object's hash code, which is treated here as the object's memory address, and every JVM prints java.lang.Object@4fca772d. In other words, multiple processes report the same memory address.
The example above suggests that each process on Linux has its own separate address space. Before going further, let's look at how the CPU accesses memory.
Suppose there were no virtual addresses, only physical addresses. The compiler turns the high-level language into machine instructions, and the CPU must specify an address whenever it accesses memory. If that address is an absolute physical address, the program must be loaded at a fixed place in memory, and that place must be known at compile time. It is easy to see how impractical this is.
If I wanted to run two copies of Word at the same time, they would operate on the same piece of memory. Early computer designers solved this by having the CPU access memory as segment base address + offset within the segment. The segment base address is fixed when the program starts; it is still an absolute physical address, but several programs can now run at once. To access memory this way, the CPU needs a segment base register and an offset register, and the two values are added together before being sent to the address bus.
With memory segmentation, each process is allocated a memory segment that must be a contiguous region, and main memory holds many such segments. When a process needs more memory than is physically available, some rarely used segment is written out to disk and loaded back when memory becomes available again. This is swapping, and every swap moves an entire segment.
Contiguous address space is precious. With 50 MB of memory, gaps between segments can make it impossible to run five programs that each need 10 MB. How can the addresses within a segment be made non-contiguous? The answer is memory paging.
In protected mode every process has its own address space, so the segment base address is fixed and only the offset within the segment needs to be given. This offset is called the linear address, and linear addresses are contiguous. Paging maps contiguous linear addresses to physical pages, so logically contiguous linear addresses can correspond to non-contiguous physical addresses.
The physical address space can be shared by multiple processes, and the mapping is maintained in page tables. The standard page size is usually 4 KB, so after paging, physical memory is divided into 4 KB pages. When a process requests memory, the request can be mapped to several 4 KB physical pages. The page is the smallest unit when data is read and also the unit of swapping with the disk.
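The address translation that paging performs can be sketched as a lookup keyed by page number. The class below is a toy model under stated assumptions: a single-level page table held in a HashMap, a 4 KB page size, and a missing entry treated as a page fault. Real page tables are multi-level structures walked by hardware, and all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class PagingSketch {
    static final int PAGE_SIZE = 4096; // 4 KB pages, as in the text

    // Toy single-level page table: virtual page number -> physical frame number.
    private final Map<Long, Long> pageTable = new HashMap<>();

    void map(long virtualPage, long physicalFrame) {
        pageTable.put(virtualPage, physicalFrame);
    }

    /** Splits a linear address into page number and offset, then looks up the frame. */
    long translate(long virtualAddress) {
        long virtualPage = virtualAddress / PAGE_SIZE;
        long offset = virtualAddress % PAGE_SIZE;
        Long frame = pageTable.get(virtualPage);
        if (frame == null) {
            // In a real system the OS would handle this by loading the page from disk.
            throw new IllegalStateException("page fault at virtual page " + virtualPage);
        }
        return frame * PAGE_SIZE + offset;
    }

    public static void main(String[] args) {
        PagingSketch table = new PagingSketch();
        table.map(0, 7); // adjacent virtual pages ...
        table.map(1, 3); // ... backed by non-adjacent physical frames
        System.out.println(table.translate(10));
        System.out.println(table.translate(4096 + 5));
    }
}
```

Virtual pages 0 and 1 are contiguous, yet they resolve to frames 7 and 3, which is the point of paging: contiguous linear addresses over scattered physical memory.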
Modern computers use virtual memory, which lets every process believe it owns the whole memory space. This virtual space is in fact an abstraction over main memory and disk. The benefit is that each process sees a uniform virtual address space, which simplifies memory management and means a process does not compete with others for address space.
Because the space is private, each process is also protected from corruption by other processes. In addition, virtual memory treats main memory as a cache for the disk: only the active code and data segments are kept in main memory. When the required data is not in main memory, a page fault occurs and the data is loaded from disk; when physical memory runs short, pages are swapped out to disk. The page table stores the mapping from virtual addresses to physical addresses. It is an array in which each element describes the mapping for one page, which may point to main memory or to disk. The page table itself is stored in main memory, and the portion of it cached in a fast on-chip buffer is called the TLB (translation lookaside buffer).
[Figure: page table entry fields]
Load bit: indicates whether the page is in main memory; if it is not, the data is still on disk.
Storage location: establishes the mapping between the virtual page and the physical page for address translation; null means the page is unallocated.
Modified bit: records whether the data has been modified.
Permission bit: controls whether reading and writing are allowed.
Cache-disable bit: mainly used to keep the cache, main memory, and disk consistent.
Normally, reading a file means first reading the data from disk through a system call into the operating system's kernel buffer and then copying it from the kernel buffer to user space. Memory mapping instead maps the disk file directly into the process's virtual address space and maintains the virtual-address-to-disk mapping through the page table. Reading a file this way avoids the copy from the kernel buffer to user space, because data is loaded from disk straight into memory the process can address, and it reduces system call overhead. To the user it looks like operating directly on the file on disk, and because virtual memory is used, no contiguous region of main memory is needed to hold the data.
[Figure: memory-mapped file]
In Java, memory mapping is done with MappedByteBuffer, which is off-heap memory. Mapping does not occupy physical memory immediately. When a page is first accessed, the page table lookup finds that it has not been loaded, a page fault is raised, and the data is loaded from disk into memory. Middleware with strict latency requirements works around this. In RocketMQ, messages are stored in files of 1 GB each; to speed up reads and writes, after a file is mapped into memory one byte is written to every page so that the entire 1 GB file is loaded into memory and no page fault occurs during real reads and writes. RocketMQ calls this file warm-up.
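Before turning to production code, the idea can be tried on a small scale with the standard NIO API. The sketch below maps a temporary file and touches one byte per page. The class name, the 4 KB page size, and the temp-file prefix are assumptions for illustration, and the code omits the flush policy and memory locking that a real message store would need.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapWarmupSketch {
    static final int PAGE_SIZE = 4096; // assumed OS page size

    /** Maps a temp file of the given size, writes one zero byte per page, returns pages touched. */
    static int mapAndWarm(int fileSize) {
        try {
            Path path = Files.createTempFile("mmap-sketch", ".dat");
            path.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(
                    path, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // A READ_WRITE mapping larger than the file grows the file to fit.
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
                int pages = 0;
                for (int i = 0; i < fileSize; i += PAGE_SIZE) {
                    buffer.put(i, (byte) 0); // first touch faults the page into memory
                    pages++;
                }
                buffer.force(); // write the touched pages back to the file
                return pages;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pages touched: " + mapAndWarm(16 * PAGE_SIZE));
    }
}
```

Touching a page only makes it resident at that moment; the operating system may still evict it later unless the memory is locked.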
Next is a piece of code from RocketMQ's message storage module. It is located in the MappedFile class, the core class of RocketMQ message storage, which you can study on your own. Of the two methods below, the first creates the file mapping and the second warms up the file; after every 1000 pages the thread yields the CPU.
private void init(final String fileName, final int fileSize) throws IOException {
    this.fileName = fileName;
    this.fileSize = fileSize;
    this.file = new File(fileName);
    this.fileFromOffset = Long.parseLong(this.file.getName());
    boolean ok = false;
    ensureDirOK(this.file.getParent());
    try {
        this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
        this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
        TOTAL_MAPPED_VIRTUAL_MEMORY.addAndGet(fileSize);
        TOTAL_MAPPED_FILES.incrementAndGet();
        ok = true;
    } catch (FileNotFoundException e) {
        log.error("create file channel " + this.fileName + " Failed. ", e);
        throw e;
    } catch (IOException e) {
        log.error("map file " + this.fileName + " Failed. ", e);
        throw e;
    } finally {
        if (!ok && this.fileChannel != null) {
            this.fileChannel.close();
        }
    }
}

// File warm-up. OS_PAGE_SIZE = 4 KB, so this writes one zero byte every 4 KB and
// loads all pages into memory, so that no page fault occurs during real use.
public void warmMappedFile(FlushDiskType type, int pages) {
    long beginTime = System.currentTimeMillis();
    ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
    int flush = 0;
    long time = System.currentTimeMillis();
    for (int i = 0, j = 0; i < this.fileSize; i += MappedFile.OS_PAGE_SIZE, j++) {
        byteBuffer.put(i, (byte) 0);
        // force flush when flush disk type is sync
        if (type == FlushDiskType.SYNC_FLUSH) {
            if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                flush = i;
                mappedByteBuffer.force();
            }
        }
        // prevent gc
        if (j % 1000 == 0) {
            log.info("j={}, costTime={}", j, System.currentTimeMillis() - time);
            time = System.currentTimeMillis();
            try {
                // sleep(0) asks the thread to give up the CPU so that other, higher
                // priority threads can run; the thread goes from running to ready
                Thread.sleep(0);
            } catch (InterruptedException e) {
                log.error("Interrupted", e);
            }
        }
    }
    // force flush when prepare load finished
    if (type == FlushDiskType.SYNC_FLUSH) {
        log.info("mapped file warm-up done, force to disk, mappedFile={}, costTime={}",
            this.getFileName(), System.currentTimeMillis() - beginTime);
        mappedByteBuffer.force();
    }
    log.info("mapped file warm-up done. mappedFile={}, costTime={}",
        this.getFileName(), System.currentTimeMillis() - beginTime);
    this.mlock();
}
Memory layout of objects in the JVM
In a native program on Linux, knowing the starting address of a variable and its size is enough to read its value, because the two together locate its end address. In Java, we can read the value of a field with Field.get(object), that is, through reflection, which is ultimately implemented by the Unsafe class. Let's look at the code.
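As a quick illustration of the reflective read itself, here is a minimal sketch using only java.lang.reflect. The Point class, its field x, and the helper name are hypothetical and exist only for this example.

```java
import java.lang.reflect.Field;

final class ReflectionReadSketch {
    /** Hypothetical class with a private field, used only for this example. */
    static final class Point {
        private int x = 42;
    }

    /** Reads a private int field by name through reflection. */
    static int readPrivateInt(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true); // bypass the private access check
            return field.getInt(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readPrivateInt(new Point(), "x"));
    }
}
```

The call to field.getInt(target) is the entry point whose JDK implementation is traced next.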
The getInt method of Field first performs an access check and then delegates to a FieldAccessor:
@CallerSensitive
public int getInt(Object obj) throws IllegalArgumentException, IllegalAccessException {
    if (!override) {
        if (!Reflection.quickCheckMemberAccess(clazz, modifiers)) {
            Class<?> caller = Reflection.getCallerClass();
            checkAccess(caller, clazz, obj, modifiers);
        }
    }
    return getFieldAccessor(obj).getInt(obj);
}
The accessor obtains fieldOffset, the offset of the field's address within the object:
UnsafeFieldAccessorImpl(Field var1) {
    this.field = var1;
    if (Modifier.isStatic(var1.getModifiers())) {
        this.fieldOffset = unsafe.staticFieldOffset(var1);
    } else {
        this.fieldOffset = unsafe.objectFieldOffset(var1);
    }
    this.isFinal = Modifier.isFinal(var1.getModifiers());
}
UnsafeStaticIntegerFieldAccessorImpl then calls the corresponding method in Unsafe:
public int getInt(Object var1) throws IllegalArgumentException {
    return unsafe.getInt(this.base, this.fieldOffset);
}
As the code above shows, a field's value can be read and written through its offset from the object's starting address. This is the principle behind Java reflection, and it is used in many places in the JDK, for example to set the blocker object in LockSupport.park. So what determines a field's offset? Let's take this opportunity to analyze the memory layout of a Java object.
In the Java virtual machine every Java object has an object header, made up of a mark word and a type pointer. The mark word stores the object's hash code, GC information, and lock information; the type pointer points to the object's class. On a 64-bit JVM the mark word occupies 64 bits and the type pointer also occupies 64 bits, so a Java object with no fields takes 16 bytes. Compressed pointers are now enabled by default, so the type pointer needs only 32 bits and the object header occupies 12 bytes. Compressed pointers apply to the object header and also to fields of reference type.
The JVM reorders fields for memory alignment. Alignment here mainly means that the starting address of an object on the heap is a multiple of 8; if an object uses fewer than 8N bytes, the remainder is padded. In addition, a field inherited by a subclass has the same offset as in the parent class. Take Long as an example: it has a single non-static field, value. Although the object header takes only 12 bytes, the offset of value can only be 16, so 4 bytes are wasted. Field reordering exists to reduce this kind of waste, which also makes it hard to work out how much space a Java object actually occupies before its bytecode is loaded. We can only estimate the size by recursively walking all fields of the parent classes; the actual size can be obtained from Instrumentation in a Java agent.
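The arithmetic in the Long example can be checked with a small calculation. This sketch assumes the layout described in the text (a 12-byte header with compressed pointers, 8-byte object alignment, and a field aligned to its own size) and models an object with a single field only. The class and method names are hypothetical, and a real JVM may lay objects out differently.

```java
final class ObjectLayoutSketch {
    static final int HEADER_BYTES = 12;   // 8-byte mark word + 4-byte compressed type pointer
    static final int OBJECT_ALIGNMENT = 8;

    /** Rounds value up to the next multiple of alignment. */
    static long alignUp(long value, long alignment) {
        return (value + alignment - 1) / alignment * alignment;
    }

    /** Offset of a single field of the given size placed right after the header. */
    static long singleFieldOffset(int fieldSizeBytes) {
        return alignUp(HEADER_BYTES, fieldSizeBytes);
    }

    /** Total size of an object holding exactly one field of the given size. */
    static long singleFieldObjectSize(int fieldSizeBytes) {
        return alignUp(singleFieldOffset(fieldSizeBytes) + fieldSizeBytes, OBJECT_ALIGNMENT);
    }

    public static void main(String[] args) {
        // java.lang.Long: one 8-byte field, so 4 bytes of padding follow the header.
        System.out.println("long field offset: " + singleFieldOffset(8));
        System.out.println("Long object size:  " + singleFieldObjectSize(8));
        // java.lang.Integer: one 4-byte field fits directly after the header.
        System.out.println("Integer object size: " + singleFieldObjectSize(4));
    }
}
```

Under these assumptions a Long costs 24 bytes while an Integer costs 16, because the 4-byte field fills the gap that the 8-byte field has to skip.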
Another reason for alignment is to keep a field within a single CPU cache line. If a field is not aligned, part of it may sit in cache line 1 and the rest in cache line 2. Reading the field then has to bring in two cache lines, and writing it invalidates the other cached data on both lines, which hurts performance.
Alignment prevents a field from straddling two cache lines, but it cannot fully prevent false sharing, where several variables are stored in the same cache line. When different cores of a multi-core CPU update these variables in parallel, they compete for write ownership of the cache line. When one core writes, the cache line holding that field is invalidated in the other cores, which invalidates the other fields in the same line as well.
[Figure: false sharing of a cache line]
Disruptor pads the object with several meaningless fields so that its size is exactly 64 bytes, the size of one cache line. The cache line is then used by this variable alone, which avoids false sharing. In JDK 7, however, unused fields are optimized away, so this approach no longer works, and the padding fields have to be inherited from a parent class to survive. JDK 8 provides the @Contended annotation to indicate that a variable or object should have a cache line to itself. To use the annotation in application code, the JVM must be started with -XX:-RestrictContended. This is, in effect, trading space for time.
JDK 6:
public final static class VolatileLong {
    public volatile long value = 0L;
    public long p1, p2, p3, p4, p5, p6; // padding fields
}
JDK 7, using inheritance:
public class VolatileLongPadding {
    public volatile long p1, p2, p3, p4, p5, p6; // padding fields
}
public class VolatileLong extends VolatileLongPadding {
    public volatile long value = 0L;
}
JDK 8, using the annotation:
@Contended
public class VolatileLong {
    public volatile long value = 0L;
}
NPTL and Java's threading model
By the textbook definition, a process is the smallest unit of resource management and a thread is the smallest unit of CPU scheduling and execution. Threads exist to reduce the cost of context switching (a thread switch is much cheaper than a process switch) and to make better use of multi-core CPUs, since the threads of one process can run on different cores. Thread support can be implemented inside the kernel or outside it. Outside the kernel, a switch only needs to swap the running stack, so scheduling overhead is small, but this approach cannot exploit multiple CPUs because the underlying process still runs on a single CPU, and it also places heavy demands on the user program. Mainstream operating systems therefore support threads in the kernel. On Linux a thread is a lightweight process, one for which the scheduling cost has been optimized.
A JVM thread maps one-to-one to a kernel thread, and thread scheduling is left entirely to the kernel. When Thread.start() is called, a kernel thread is created through a system call (on Linux, clone() rather than fork()). This involves a switch between user mode and kernel mode, so it is not as fast as staying in user mode. Because kernel threads are used directly, the maximum number of threads is also limited by the kernel. The current threading model on Linux is NPTL (Native POSIX Thread Library), which uses the one-to-one model, is compatible with the POSIX standard, and needs no manager thread, so it runs better on multi-core CPUs. At the operating system level a thread has three states: ready, running, and blocked. In the JVM, blocking comes in four kinds, which we can see by generating a thread dump with jstack.
BLOCKED (on object monitor): the thread is trying to acquire a lock through a synchronized(obj) block and is waiting for another thread to release the object lock. The dump shows "waiting to lock".
TIMED WAITING (on object monitor) and WAITING (on object monitor): after acquiring the lock, the thread called object.wait() and is waiting for another thread to call object.notify(). The difference between the two is whether a timeout was given.
TIMED WAITING (sleeping): the program called Thread.sleep(). Note that sleep(0) does not enter the blocked state; the thread goes directly from running to ready.
TIMED WAITING (parking) and WAITING (parking): the program called Unsafe.park(), and the thread is suspended waiting for a condition. The dump shows "waiting on condition".
At the kernel level, a routine such as thread_block accepts a parameter stat that likewise has three values, TASK_BLOCKED, TASK_WAITING, and TASK_HANGING, and the scheduler only schedules threads whose state is READY. Another point is that blocking is initiated by the thread itself, which amounts to voluntarily giving up its CPU time slice. When the thread is woken up, its remaining time slice is unchanged, and it can run only for that remainder. If it has not finished when the time slice expires, its state changes from RUNNING to READY and it waits for the scheduler to pick it again.
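These states can also be observed from inside a program with Thread.getState(). The sketch below puts a helper thread into three of the situations described above and polls until the state appears. The class and method names are hypothetical, the five-second polling limit is an arbitrary assumption, and note that Thread.State uses the enum names BLOCKED, WAITING, and TIMED_WAITING rather than the longer labels printed by jstack.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

final class ThreadStateSketch {
    /** Polls until the thread reaches the wanted state or about five seconds pass. */
    private static Thread.State awaitState(Thread thread, Thread.State wanted) {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (thread.getState() != wanted && System.nanoTime() < deadline) {
            Thread.yield();
        }
        return thread.getState();
    }

    private static void joinQuietly(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** A thread that cannot enter a synchronized block is BLOCKED. */
    static Thread.State blockedOnMonitor() {
        Object lock = new Object();
        Thread thread = new Thread(() -> { synchronized (lock) { /* enter and leave */ } });
        Thread.State observed;
        synchronized (lock) { // hold the monitor so the helper cannot enter
            thread.start();
            observed = awaitState(thread, Thread.State.BLOCKED);
        }
        joinQuietly(thread);
        return observed;
    }

    /** A thread suspended by LockSupport.park() without a timeout is WAITING. */
    static Thread.State parked() {
        AtomicBoolean release = new AtomicBoolean(false);
        Thread thread = new Thread(() -> {
            while (!release.get()) {
                LockSupport.park(); // loop guards against spurious wakeups
            }
        });
        thread.start();
        Thread.State observed = awaitState(thread, Thread.State.WAITING);
        release.set(true);
        LockSupport.unpark(thread);
        joinQuietly(thread);
        return observed;
    }

    /** A thread inside Thread.sleep(n) with n > 0 is TIMED_WAITING. */
    static Thread.State sleeping() {
        Thread thread = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // woken early by the interrupt below
            }
        });
        thread.start();
        Thread.State observed = awaitState(thread, Thread.State.TIMED_WAITING);
        thread.interrupt();
        joinQuietly(thread);
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(blockedOnMonitor());
        System.out.println(parked());
        System.out.println(sleeping());
    }
}
```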
All right, that's all for threading. For Java concurrent packages, the core is in AQS, and the bottom layer is implemented through the cas method of the UnSafe class and the park method. Later, we are looking for time to analyze separately, and now we are looking at the process synchronization scheme of Linux.
POSIX stands for Portable Operating System Interface of UNIX. The POSIX standard defines the interfaces that an operating system should provide to applications.
The CAS operation needs CPU support: the comparison and the exchange are executed as a single atomic instruction. CAS generally takes three parameters: a memory location, the expected original value, and the new value. The compareAndSwap methods in the Unsafe class therefore locate the memory location by the offset of the field relative to the start address of the object.
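Application code normally reaches CAS through the atomic classes rather than through Unsafe directly. The sketch below, which is an illustration and not taken from the article, shows the usual retry loop built on AtomicInteger.compareAndSet; the class CasCounter and the helper countWithThreads are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    /** Increments with an explicit compare-and-swap retry loop. */
    int increment() {
        while (true) {
            int expected = value.get();   // expected original value
            int updated = expected + 1;   // new value
            if (value.compareAndSet(expected, updated)) {
                return updated;
            }
            // Another thread changed the value first; read again and retry.
        }
    }

    int get() {
        return value.get();
    }

    /** Runs several threads against one counter and returns the final count. */
    static int countWithThreads(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 10_000));
    }
}
```

No update is lost even though no lock is taken, because a failed compareAndSet simply means another thread got there first and the loop tries again with the fresh value.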
Synchronization of threads
The fundamental reason for thread synchronization is that accessing a shared resource takes several operations, and the sequence of operations is not atomic. The task scheduler can interrupt it partway through, and another thread can then corrupt the shared resource. So threads must be synchronized in the critical section. Let us first clarify that concept: the critical section is the set of instructions through which multiple tasks access a shared resource such as memory or a file. It is the instructions, not the resource being accessed.
POSIX defines five kinds of synchronization objects: mutexes, condition variables, spin locks, read-write locks, and semaphores. Each has a counterpart in the JVM, but not all of them are built directly on the API defined by POSIX. Implementing them in Java is more flexible and avoids the performance overhead of calling native methods. Ultimately, however, blocking still depends on the pthread mutex, which has to enter the kernel under contention and is therefore expensive. For this reason the JVM upgrades and downgrades locks automatically. The implementation based on AQS will be analyzed later; here we mainly talk about the synchronized keyword.
When a block of code is declared synchronized, the compiled bytecode contains one monitorenter and multiple monitorexit instructions (there are multiple exit paths, normal and exceptional). When monitorenter executes, it checks whether the counter of the target lock object is 0. If it is 0, the holding thread of the lock object is set to the current thread and the counter is incremented by 1, which acquires the lock. If it is not 0, it checks whether the holding thread is the current thread: if so, the counter is incremented by 1 and the lock is acquired again; if not, the thread blocks and waits. On exit, monitorexit decrements the counter by 1, and when the counter reaches 0 the holding-thread mark on the lock object is cleared. This is why synchronized is reentrant.
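The reentrancy described above is easy to check in a few lines. This is a small illustrative sketch; ReentrantMonitorDemo and its methods are hypothetical names. Compiling it and running javap -c on the class shows the monitorenter and monitorexit instructions for each synchronized block.

```java
class ReentrantMonitorDemo {
    private final Object lock = new Object();

    /** Acquires the monitor, then calls a method that acquires it again. */
    int outer() {
        synchronized (lock) {
            return inner() + 1;
        }
    }

    /** Re-enters the same monitor on the same thread; this does not block. */
    int inner() {
        synchronized (lock) {
            return Thread.holdsLock(lock) ? 1 : 0;
        }
    }

    /** After both blocks have exited, the monitor is no longer held. */
    boolean heldAfterExit() {
        outer();
        return Thread.holdsLock(lock);
    }

    public static void main(String[] args) {
        ReentrantMonitorDemo demo = new ReentrantMonitorDemo();
        System.out.println(demo.outer());
        System.out.println(demo.heldAfterExit());
    }
}
```

If the monitor were not reentrant, the call to inner() from inside outer() would block forever on a lock its own thread already holds.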
As just mentioned, blocking a thread involves a system call with high overhead, so the JVM uses adaptive spinning: when the lock cannot be acquired, the thread spins on the CPU and waits for the other thread to release the lock. The spin time mainly depends on how spinning went the last time. For example, if the lock was not acquired after spinning for 5 milliseconds last time, the thread may spin for 6 milliseconds this time. Spinning keeps the CPU busy without doing useful work. Another side effect is unfairness, because a spinning thread can grab the lock while other, already blocked threads are still waiting. In addition to spinning, the JVM implements lightweight locks and biased locks through CAS, to handle the cases where multiple threads use a lock at different times and where a lock is only ever used by one thread. These two kinds of locks do not call the underlying semaphore implementation at all. Semaphores are what lets thread B acquire the lock once thread A releases it, for example by calling wait(), and that handover can only be done by the kernel. Because there is no contention in these two scenarios, no such handover is needed, and the lock ownership is maintained in user space, which is more efficient.
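To make the idea of "spin first, block later" concrete, here is a deliberately simplified sketch. It is not HotSpot's algorithm: the class SpinThenParkLock, the fixed spin budget maxSpins, and the one-microsecond back-off are all assumptions made for illustration, and it assumes Java 9 or later for Thread.onSpinWait().

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class SpinThenParkLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private final int maxSpins;

    SpinThenParkLock(int maxSpins) {
        this.maxSpins = maxSpins;
    }

    void lock() {
        int spins = 0;
        while (!locked.compareAndSet(false, true)) {   // CAS attempt in user space
            if (spins < maxSpins) {
                spins++;
                Thread.onSpinWait();                   // stay on the CPU and retry
            } else {
                LockSupport.parkNanos(1_000L);         // give up the CPU briefly
            }
        }
    }

    void unlock() {
        locked.set(false);
    }

    /** Increments a plain int under the lock from several threads. */
    static int countWithLock(int threads, int perThread) {
        SpinThenParkLock lock = new SpinThenParkLock(100);
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(countWithLock(4, 10_000));
    }
}
```

The sketch also shows the unfairness mentioned above: a thread that happens to be spinning when the lock is released can win the CAS ahead of threads that backed off earlier.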
[Figure: threads moving between the entry set and the wait set of an ObjectMonitor (image not available)]
As shown in the figure above, when a thread executes monitorenter and cannot get the lock, it puts itself into the entry set queue of the ObjectMonitor and then blocks. If the thread currently holding the lock calls the wait method, it releases the lock and wraps itself as an ObjectWaiter in the wait set queue of the ObjectMonitor. At that point, a thread in the entry set competes for the lock and becomes active. If that thread calls the notify method, the first ObjectWaiter in the wait set is taken out and put into the entry set (depending on the policy, it may spin first). When the thread that called notify executes monitorexit to release the lock, the threads in the entry set start competing for the lock and become active.
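From the Java side, the movement between the two queues is driven by wait() and notify()/notifyAll(). The following one-slot mailbox is a minimal sketch of that protocol; Mailbox and roundTrip are hypothetical names, not something defined in the article.

```java
class Mailbox {
    private final Object monitor = new Object();
    private String message;

    /** Stores a message and wakes up threads waiting on the monitor. */
    void put(String text) {
        synchronized (monitor) {
            message = text;
            monitor.notifyAll();   // waiters leave the wait set and compete for the lock
        }
    }

    /** Waits until a message is present, then removes and returns it. */
    String take() throws InterruptedException {
        synchronized (monitor) {
            while (message == null) {
                monitor.wait();    // releases the monitor and joins the wait set
            }
            String text = message;
            message = null;
            return text;
        }
    }

    /** Sends one message from a second thread and receives it on the caller's thread. */
    static String roundTrip(String text) {
        Mailbox box = new Mailbox();
        Thread producer = new Thread(() -> box.put(text));
        producer.start();
        try {
            return box.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

The wait() call sits inside a while loop because a woken thread still has to win the lock again, and the condition may have changed by the time it does.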
To protect applications from data races, the Java memory model defines the happens-before relation to describe memory visibility between two operations: if operation X happens-before operation Y, then the result of X is visible to Y.
In the JVM, both volatile and locks come with happens-before rules. Underneath, the JVM restricts reordering by inserting memory barriers. Taking volatile as an example, the memory barrier does not allow statements before a write to a volatile field to be reordered after that write, nor does it allow statements after a read of a volatile field to be reordered before that read. The inserted barrier instructions have different effects depending on their type. For example, the cache is forced to be flushed after a lock is released by monitorexit, while the barrier for volatile forces a flush to main memory after every write. Because of the nature of a volatile field, the compiler cannot keep it in a register, so it is read from main memory every time. volatile is therefore suited to scenarios with more reads than writes, ideally with only one thread writing and multiple threads reading. If writes are frequent, the cache is flushed constantly and performance can suffer.
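The single-writer, multiple-reader pattern can be sketched with a volatile flag that publishes an ordinary field. This is an illustrative example, with VolatileFlagDemo and handOff as hypothetical names; it assumes Java 9 or later for Thread.onSpinWait().

```java
class VolatileFlagDemo {
    private int payload;               // plain field, written before the flag
    private volatile boolean ready;    // the volatile write publishes the payload

    /** Writer side: the payload write cannot be reordered after the volatile write. */
    void publish(int value) {
        payload = value;
        ready = true;
    }

    /** Reader side: once the volatile read sees true, the payload is visible too. */
    int awaitPayload() {
        while (!ready) {
            Thread.onSpinWait();
        }
        return payload;
    }

    /** Publishes a value from a second thread and reads it on the caller's thread. */
    static int handOff(int value) {
        VolatileFlagDemo demo = new VolatileFlagDemo();
        Thread writer = new Thread(() -> demo.publish(value));
        writer.start();
        return demo.awaitPayload();
    }

    public static void main(String[] args) {
        System.out.println(handOff(42));
    }
}
```

Without the volatile modifier on ready, the reader loop would have no happens-before edge to the writer and could spin forever or read a stale payload.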
As for how many threads to configure in an application, a common rule of thumb is to set the maximum to the number of CPU cores * 2. When we write code, we may not know what hardware it will run on, but we can get the number of CPU cores through Runtime.getRuntime().availableProcessors().
However, the right number of threads mainly depends on how much of the time the tasks running in those threads spend blocked. If all tasks are compute-intensive, a number of threads equal to the number of CPU cores already achieves the highest CPU utilization. Setting it higher hurts performance because of thread context switching. If the tasks contain blocking operations, the CPU can run tasks in other threads while one is blocked. The most suitable number of threads can be estimated with the formula: number of threads = number of cores / (1 - blocking rate), where the blocking rate is the blocked time of a task divided by its total execution time.
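The formula above translates directly into a small helper. This is only a sketch under the article's own formula; PoolSizing, blockingRate, and threadsFor are hypothetical names, and the sample timings in main are made-up inputs rather than measurements.

```java
class PoolSizing {

    /** Blocking rate = time spent blocked / total execution time of the task. */
    static double blockingRate(long blockedMillis, long totalMillis) {
        if (totalMillis <= 0 || blockedMillis < 0 || blockedMillis > totalMillis) {
            throw new IllegalArgumentException("need 0 <= blocked <= total and total > 0");
        }
        return (double) blockedMillis / totalMillis;
    }

    /** Number of threads = number of cores / (1 - blocking rate), rounded to the nearest int. */
    static int threadsFor(int cores, double blockingRate) {
        if (cores < 1 || blockingRate < 0.0 || blockingRate >= 1.0) {
            throw new IllegalArgumentException("need cores >= 1 and 0 <= rate < 1");
        }
        return (int) Math.max(cores, Math.round(cores / (1.0 - blockingRate)));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assumed example: a task that takes 100 ms in total, 75 ms of it waiting on I/O.
        double rate = blockingRate(75, 100);
        System.out.println(cores + " cores, blocking rate " + rate
                + " -> " + threadsFor(cores, rate) + " threads");
    }
}
```

With a blocking rate of 0 the helper returns the core count, and with a rate of 0.5 it returns twice the core count, which matches the "cores * 2" rule of thumb.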
Microservice architectures involve a large number of RPC calls, so multithreading can greatly improve execution efficiency. We can measure the time consumed by RPC calls with distributed tracing, and that time is the blocking time of the task. Of course, to get the best result we still need to try different values and test them.
How to realize timing tasks in Java
Timers are an indispensable part of modern software, for example to check a status every 5 seconds, to poll for new email, or to implement an alarm clock. Java already has ready-made APIs for this, but if you want to design more efficient and more accurate timer tasks, you need to understand the underlying hardware. When implementing a distributed task-scheduling middleware, for instance, you may also have to consider clock synchronization between applications.
There are two ways to implement timed tasks in Java: the Timer class, and ScheduledExecutorService in JUC. You may be curious how the JVM implements timed tasks. Does it keep polling the time to see whether a task is due? Polling continuously without releasing the CPU is clearly not advisable. Or does the thread block and get woken up when the time is up? If so, how does the JVM know that the time is up, and how does it wake the thread?
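For reference, the two approaches look like this in use. It is a minimal sketch; TimedTaskDemo and its two methods are hypothetical names, and a CountDownLatch is used only so the caller can tell that the task actually ran.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class TimedTaskDemo {

    /** Runs a one-shot task with java.util.Timer; returns true if it ran. */
    static boolean runWithTimer(long delayMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Timer timer = new Timer("demo-timer", true);   // daemon timer thread
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                done.countDown();
            }
        }, delayMillis);
        try {
            return done.await(delayMillis + 2_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel();
        }
    }

    /** Runs a one-shot task with ScheduledExecutorService; returns true if it ran. */
    static boolean runWithScheduler(long delayMillis) {
        CountDownLatch done = new CountDownLatch(1);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.schedule(done::countDown, delayMillis, TimeUnit.MILLISECONDS);
        try {
            return done.await(delayMillis + 2_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("Timer ran task: " + runWithTimer(50));
        System.out.println("Scheduler ran task: " + runWithScheduler(50));
    }
}
```

In both cases the calling thread does not poll; it blocks on the latch while a worker thread waits for the deadline.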
First of all, a look through the JDK shows roughly three groups of time-related APIs, which also differ in time precision:

Object.wait(long millis): the parameter is in milliseconds and must be greater than or equal to 0. If it is 0, the thread blocks until another thread wakes it up. The Timer class is implemented through the wait() method. There is also a second overload of wait:

    public final void wait(long timeout, int nanos) throws InterruptedException {
        if (timeout < 0) {
            throw new IllegalArgumentException("timeout value is negative");
        }
        if (nanos < 0 || nanos > 999999) {
            throw new IllegalArgumentException("nanosecond timeout value out of range");
        }
        if (nanos > 0) {
            timeout++;
        }
        wait(timeout);
    }

This method is meant to provide a timeout with nanosecond precision, but it crudely adds 1 millisecond whenever the nanosecond part is non-zero.

Thread.sleep(long millis): this is the usual way to release the CPU. If the parameter is 0, the thread yields the CPU to higher-priority threads and moves from the running state to the runnable state to wait for scheduling. It also has an overload that accepts nanoseconds. It differs from wait in that it uses 500000 as the threshold for deciding whether to add 1 millisecond:

    public static void sleep(long millis, int nanos) throws InterruptedException {
        if (millis < 0) {
            throw new IllegalArgumentException("timeout value is negative");
        }
        if (nanos < 0 || nanos > 999999) {
            throw new IllegalArgumentException("nanosecond timeout value out of range");
        }
        if (nanos >= 500000 || (nanos != 0 && millis == 0)) {
            millis++;
        }
        sleep(millis);
    }

LockSupport.parkNanos(long nanos): Condition.await() calls this method, ScheduledExecutorService uses condition.await() to block for a given timeout, and other methods with timeout parameters are built on it as well. Most timers today are implemented through this method. The underlying Unsafe.park(boolean isAbsolute, long time) also takes a boolean that selects the precision of the time value: a relative timeout in nanoseconds or an absolute deadline in milliseconds.
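The two rounding rules quoted above are easy to compare side by side. The sketch below copies only the rounding logic from those listings into two pure functions; NanosRounding, waitMillis, and sleepMillis are hypothetical names, and other JDK versions may round differently from the code shown in this article.

```java
class NanosRounding {

    private static void check(long millis, int nanos) {
        if (millis < 0) {
            throw new IllegalArgumentException("timeout value is negative");
        }
        if (nanos < 0 || nanos > 999999) {
            throw new IllegalArgumentException("nanosecond timeout value out of range");
        }
    }

    /** Rule from the wait(long, int) listing: any non-zero nanos adds one millisecond. */
    static long waitMillis(long millis, int nanos) {
        check(millis, nanos);
        return nanos > 0 ? millis + 1 : millis;
    }

    /** Rule from the sleep(long, int) listing: round at 500000 ns, never down to 0 ms. */
    static long sleepMillis(long millis, int nanos) {
        check(millis, nanos);
        return (nanos >= 500000 || (nanos != 0 && millis == 0)) ? millis + 1 : millis;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 499_999, 500_000, 999_999};
        for (int nanos : samples) {
            System.out.println("10 ms + " + nanos + " ns -> wait " + waitMillis(10, nanos)
                    + " ms, sleep " + sleepMillis(10, nanos) + " ms");
        }
    }
}
```

Either way, the effective timeout is a whole number of milliseconds; the nanosecond argument only decides whether one millisecond is added.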
Both System.currentTimeMillis() and System.nanoTime() depend on the underlying operating system. The former works in milliseconds, and in tests on Windows its update interval could exceed 10 ms. The latter works in nanoseconds, with a resolution of roughly 100 ns. So if you need a more precise time, the latter is recommended.

That covers the API. Now let's look at the underlying implementation of timers. Modern PCs have three kinds of hardware clocks, all of which are synchronized by the square-wave signal generated by a crystal oscillator.

Real-time clock (RTC): a device that keeps the system time over the long term. Even when the system is powered off, it keeps counting on the motherboard battery. When Linux starts, it reads the time and date from the RTC as initial values, and then uses other timers to maintain the system time while running.
Programmable interval timer (PIT): this timer holds an initial value that is decremented by 1 on every clock cycle. When the value reaches 0, a clock interrupt is sent to the CPU over a wire, and the CPU can then execute the corresponding interrupt handler, that is, the task registered as the callback.
Timestamp counter (TSC): all Intel 80x86 CPUs (starting with the Pentium) contain a register for the timestamp counter. The value of this register is incremented by 1 on every clock signal the CPU receives. It is more precise than the PIT, but it cannot be programmed and can only be read.

Clock cycle: the interval at which the hardware timer produces clock pulses. The clock frequency is the number of clock pulses produced in 1 second; for the PIT it is usually 1193180 Hz at present.
Clock tick: a clock interrupt is generated when the initial value in the PIT has been decremented to 0. The initial value is set by programming.
When Linux starts, it first obtains the initial time from the RTC. The kernel then maintains the date through the clock ticks of the PIT and writes the date back to the RTC periodically. An application timer mainly works by setting the initial value of the PIT: when that value has been decremented to 0, the callback function is due. You may wonder at this point whether that means only one timer can exist at a time, since an application certainly has many timer tasks, and so do the other applications running beside it. The implementation of ScheduledExecutorService gives the answer.
We only need to sort the scheduled tasks by time, with the task that is due soonest placed first. The timer is set for the first task, and when it fires, the timer is set again for the second task relative to the current time. After all, a CPU can only run one task at a time. As for accuracy, time cannot be completely accurate at the software level, because CPU scheduling is not fully under the control of user programs. The larger dependence, of course, is on the clock frequency of the hardware, and at present the TSC offers the higher precision.
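The "sort by time and only wait for the earliest task" idea can be sketched with a priority queue. This is a single-threaded illustration and not the actual ScheduledExecutorService implementation; TinyTimerQueue, Entry, schedule, and runAll are hypothetical names.

```java
import java.util.PriorityQueue;
import java.util.concurrent.locks.LockSupport;

class TinyTimerQueue {

    /** A task together with its absolute deadline on the System.nanoTime() clock. */
    private static final class Entry implements Comparable<Entry> {
        final long deadlineNanos;
        final Runnable task;

        Entry(long deadlineNanos, Runnable task) {
            this.deadlineNanos = deadlineNanos;
            this.task = task;
        }

        @Override
        public int compareTo(Entry other) {
            // Compare by difference so that nanoTime wrap-around is handled.
            return Long.signum(deadlineNanos - other.deadlineNanos);
        }
    }

    private final PriorityQueue<Entry> queue = new PriorityQueue<>();

    /** Adds a task; the queue keeps the soonest deadline at the head. */
    void schedule(Runnable task, long delayMillis) {
        queue.add(new Entry(System.nanoTime() + delayMillis * 1_000_000L, task));
    }

    /** Runs every task in deadline order on the calling thread. */
    void runAll() {
        while (!queue.isEmpty()) {
            long remaining = queue.peek().deadlineNanos - System.nanoTime();
            if (remaining > 0) {
                // Only one timed wait is outstanding: the one for the earliest task.
                LockSupport.parkNanos(remaining);   // may return early; the loop re-checks
                continue;
            }
            queue.poll().task.run();
        }
    }

    public static void main(String[] args) {
        TinyTimerQueue timers = new TinyTimerQueue();
        timers.schedule(() -> System.out.println("third"), 60);
        timers.schedule(() -> System.out.println("first"), 20);
        timers.schedule(() -> System.out.println("second"), 40);
        timers.runAll();
    }
}
```

Deadlines are stored as absolute values so that, after one task has run, the wait for the next one is computed from the current time rather than from when it was scheduled.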
Now we know that a timeout in Java is implemented by setting an initial value on the programmable interval timer and then waiting for its interrupt signal. Because of the hardware clock frequency, the precision is generally in milliseconds. After all, light travels only about 0.3 meters in 1 nanosecond, so the nanosecond parameters in the JDK are handled crudely and are reserved for the arrival of higher-precision timers. For reading the current time, System.currentTimeMillis() is more efficient but only has millisecond precision, since it reads the date maintained by the Linux kernel. System.nanoTime() prefers the TSC; its performance is slightly lower, but it works in nanoseconds. The Random class uses nanoTime to generate seeds in order to avoid collisions.
How Java communicates with external Devices
The external devices of a computer include the mouse, keyboard, printer, network card, and so on. The transfer of information between an external device and main memory is usually called an I/O operation. By their operating characteristics, devices can be divided into output devices, input devices, and storage devices. Modern devices use the channel mode to interact with main memory. A channel is a device dedicated to handling I/O tasks. When the CPU encounters an I/O request while running the main program, it starts the selected device on the specified channel. Once the start succeeds, the channel takes over control of the device, and the CPU can continue with other tasks. When the I/O operation is complete, the channel raises an I/O-completion interrupt, and the processor turns to handling the events that follow the end of the I/O. Other ways of handling I/O, such as polling, interrupts, and DMA, do not perform as well as channels and are not covered here. Of course, communication between Java programs and external devices is also done through system calls, and we will not go any further here.
Thank you for reading. That is all for "what is the method of interaction between JVM and operating system". After working through this article, you should have a deeper understanding of how the JVM interacts with the operating system.