Linux operating system tuning


1 CPU tuning

1 System topology

1 Overview

Most modern systems have multiple processors, and the way those processors are connected to each other and to other resources — the system topology — has a significant impact on system and application performance.

2 The two topology types

1 SMP (symmetric multiprocessing)

The SMP (symmetric multiprocessing) topology allows all processors to access memory in the same way. However, because memory access is shared and equal, SMP systems must serialize memory access as the number of CPUs grows, which limits performance. This is no longer acceptable, so modern server systems use the NUMA (Non-Uniform Memory Access) mechanism instead.

2 NUMA (Non-Uniform Memory Access) topology

Compared with the SMP topology, the NUMA (Non-Uniform Memory Access) topology was developed more recently. In a NUMA system, multiple processors are physically grouped into a socket, and each socket has a dedicated memory area; the combination of processors with local access to that memory is called a node.

Processors on a node can access that node's memory bank at high speed, but accessing the memory banks of other nodes is slower, so accessing non-local memory carries a performance penalty.

3 considerations for NUMA topology adjustment

Given this performance penalty, performance-sensitive applications on a NUMA topology system should access memory on the same node as the processor executing them and should avoid accessing remote memory wherever possible.

Therefore, when tuning application performance on a NUMA topology system, it is important to consider where the application executes and which memory bank is closest to that execution point.

In a NUMA topology system, the /sys file system contains connection information for processors, memory, and peripheral devices.

The /sys/devices/system/cpu directory contains details about how the processors in the system are connected to each other. The /sys/devices/system/node directory contains information about the NUMA nodes in the system and the relative distances between nodes.
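As a quick illustration (paths as on a typical RHEL 7 system; the node directories you see depend on your hardware), the node layout, per-node CPUs, and inter-node distances can be inspected directly:

# ls /sys/devices/system/node/
# cat /sys/devices/system/node/node0/cpulist
# cat /sys/devices/system/node/node0/distance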

4 determine the topology of the system

1 Use the numactl --hardware command to describe the system topology

[root@python ~]# numactl --hardware
available: 1 nodes (0)      # memory has only one node, node 0
node 0 cpus: 0 1 2 3
node 0 size: 4095 MB
node 0 free: 177 MB
node distances:
node   0
  0:  10

2 Use the lscpu command to query

The lscpu command is provided by the util-linux package and reports CPU architecture information such as the number of CPUs, threads, cores, sockets, and NUMA nodes.

[root@python ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
Stepping:              9
CPU MHz:               3408.003
BogoMIPS:              6816.00
Virtualization:        VT-x
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp tpr_shadow vnmi ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec arat spec_ctrl intel_stibp arch_capabilities

2 Scheduling

The system scheduler determines which processor runs a thread and for how long. Because the scheduler is primarily concerned with keeping the system busy, it may not schedule threads optimally for application performance.

For example, in a NUMA system, suppose a processor on node B becomes available while an application is running on node A. To keep the processor on node B busy, the scheduler moves one of the application's threads to node B. However, that thread still needs to access memory on node A. Because the thread now runs on node B, node A's memory is no longer local to it, so access takes longer. It may well cost more for the thread to finish running on node B than to wait for an available processor on node A and execute on the source node with local memory access.

1 kernel tick signal

Tick signal: in earlier Linux versions, the kernel periodically interrupted each CPU to check for tasks that needed to be completed, and used the results to make decisions about process scheduling and load balancing. This periodic interrupt is the tick signal.

Disadvantage: the tick does not consider whether the core actually has tasks to execute, so even idle cores are forced into a higher-power state periodically, which prevents the system from making effective use of the deep sleep states of x86 processors.

On-demand interrupts:

The default kernels in Red Hat Enterprise Linux 6 and 7 no longer interrupt idle CPUs, which tend to be in low-power states. When one or more tasks are running, on-demand interrupts replace the periodic tick, so CPUs can stay idle or in low-power states for longer and power consumption is reduced.

Dynamic tickless setting:

Red Hat Enterprise Linux 7 provides a dynamic tickless option (nohz_full) that further reduces kernel interference with user-space tasks and improves determinism. This option can be enabled on specified cores with the nohz_full kernel parameter. When it is enabled on a core, all timekeeping activity is moved to cores that are not latency-sensitive. This is useful for both high-performance computing and real-time computing workloads, whose user-space tasks are particularly sensitive to the microsecond-level delays caused by the kernel timer tick.

2 Interrupt request (IRQ) management

An interrupt request, or IRQ, is a signal sent from hardware to a processor requesting timely attention. Each device in the system is assigned one or more IRQ numbers so that it can send a unique interrupt signal. When interrupts are enabled, a processor that receives an interrupt request immediately pauses execution of the current application thread in order to handle it. A high interrupt rate can seriously degrade system performance because normal work is constantly interrupted, but the cost of interrupts can be reduced by setting interrupt affinity or by sending a batch of low-priority interrupts at once ("interrupt coalescing").

3 Monitoring and diagnosing performance issues

1 turbostat

turbostat reports counter results at specified intervals and helps administrators identify unexpected server behavior, such as excessive power consumption, failure to enter deep sleep states, or unnecessary system management interrupts (SMIs).

The turbostat tool is part of the kernel-tools package. It is supported on systems with AMD64 and Intel 64 processors. It requires root privileges to run, and the processor must support invariant timestamp counters as well as the APERF and MPERF model-specific registers.

2 numastat

The numastat tool enumerates per-NUMA-node memory statistics for processes and the operating system, and tells the administrator whether process memory is spread across the system or concentrated on specific nodes.

Cross-reference the numastat output with per-processor top output to confirm that process threads are running on the same nodes where their memory is allocated.

[root@centos8 ~]# numastat
                           node0
numa_hit                 4372424
numa_miss                      0
numa_foreign                   0
interleave_hit             19604
local_node               4372424
other_node                     0

numa_hit: the number of allocation attempts that succeeded on this node.

numa_miss: the number of allocations that were satisfied on this node because memory was low on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node.

numa_foreign: the number of allocations initially intended for this node that were allocated on another node instead. Each numa_foreign event has a corresponding numa_miss event on another node.

3 /proc/interrupts

The /proc/interrupts file lists, for each I/O device, the number of interrupts sent to each processor: the interrupt request (IRQ) number, the number of interrupt requests handled by each processor in the system, the type of interrupt sent, and a comma-separated list of devices that respond to the listed interrupt request.

If a particular application or device generates a large number of interrupt requests that are handled by a remote processor, its performance will suffer. In that case, the problem can be alleviated by having a processor on the same node as the application or device handle its interrupt requests, that is, by assigning interrupt handling to a specific processor.
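As a minimal sketch (the IRQ number 18 and the mask are placeholders; smp_affinity is covered in detail later in this article), interrupt affinity is set by writing a CPU bit mask to the IRQ's smp_affinity file:

# grep eth0 /proc/interrupts            # find the IRQ number used by the device
# echo 2 > /proc/irq/18/smp_affinity    # mask 0x2 = CPU 1 only (assumed IRQ 18)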

4 configuration recommendations

By default, Red Hat Enterprise Linux 7 uses a tickless kernel that does not interrupt idle CPUs, which reduces power consumption and lets newer processors take advantage of deep sleep states.

Red Hat Enterprise Linux 7 also provides a dynamic tickless option (disabled by default), which helps latency-sensitive workloads such as high-performance computing or real-time computing.

5 configure kernel tick signal time

To enable dynamic tickless behavior on particular cores, set the nohz_full parameter on the kernel command line. On a 16-core system, nohz_full=1-15 enables dynamic tickless behavior on cores 1 through 15 and moves all timekeeping to the only unlisted core (core 0). This behavior can be enabled temporarily at boot time or permanently in the /etc/default/grub file. For persistent behavior, run the grub2-mkconfig -o /boot/grub2/grub.cfg command to save the configuration.
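A minimal sketch of the persistent configuration (the kernel command line shown is an assumed example for a 16-core machine):

# grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX="... nohz_full=1-15"
# grub2-mkconfig -o /boot/grub2/grub.cfg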

Dynamic tickless behavior requires some manual administration.

When the system boots, you must manually move the rcu threads to a core that is not latency-sensitive, in this case core 0:

for i in $(pgrep rcu); do taskset -pc 0 $i; done

Use the isolcpus parameter on the kernel command line to isolate specific cores from user-space tasks. You can optionally set the CPU affinity of the kernel's write-back bdi-flush threads to the housekeeping core:

echo 1 > /sys/bus/workqueue/devices/writeback/cpumask

To verify that the dynamic tickless configuration is working properly, execute the following command, where stress is a program that spins on a CPU for 1 second:

perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 stress -t 1 -c 1

The default kernel timer configuration shows about 1000 ticks on a busy CPU:

# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 stress -t 1 -c 1
1000 irq_vectors:local_timer_entry

With the dynamic tickless kernel configuration, the user should see only one tick instead:

# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 stress -t 1 -c 1
1 irq_vectors:local_timer_entry

6 Setting the hardware performance policy

The x86_energy_perf_policy tool allows administrators to define the relative importance of performance versus energy efficiency. Processors that support this feature use the information when making trade-offs between performance and efficiency.

By default, it operates on all processors in performance mode. Processor support is indicated by CPUID.06H.ECX.bit3, and the tool must be run with root privileges. x86_energy_perf_policy is provided by the kernel-tools package.

7 Using taskset to set processor affinity

The taskset tool is provided by the util-linux package. taskset allows administrators to retrieve and set the processor affinity of a running process, or to launch a process with a specified processor affinity.

1 Using taskset to set CPU affinity

taskset retrieves and sets the CPU affinity of a running process (by process ID). It can also be used to launch a process with a given CPU affinity, binding the specified process to a specified CPU or set of CPUs. However, taskset does not guarantee local memory allocation.

If you also want the additional performance benefit of local memory allocation, use numactl instead.

CPU affinities are represented as bit masks, with the lowest-order bit corresponding to the first logical CPU and the highest-order bit to the last logical CPU. The masks are usually given in hexadecimal, so 0x00000001 (binary 0001) represents processor 0, and 0x00000009 (binary 1001) represents processors 0 and 3.

Set the affinity between CPU and process

To set the CPU affinity of a running process, execute taskset -p mask pid.
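For example (PID 1234 is a placeholder), to restrict an existing process to CPUs 0 and 1:

# taskset -p 0x00000003 1234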

To start a process with a given affinity, execute

taskset mask -- program

-c specifies the CPUs to bind to, as a list instead of a mask:

taskset -c 0,1,2,3 -- myprogram

To see which CPU a process is running on:

ps axo pid,psr

The PSR column shows which processor each process is currently running on.

ps axo pid,psr | grep pid

Binding in this way only ensures that the bound process runs on that CPU; it does not guarantee that other processes will not also run on the same CPU. If a process needs to be bound from the moment the system starts, the binding can be placed in a startup script, for example:

[root@master ~]# cat /etc/rc.local
taskset -c 1 httpd
[root@master ~]# ps axo pid,psr | grep -E "1296|1298"
 1296   2
 1298   2

This method ensures that the process runs on the specified CPU, but it does not guarantee that the data the process uses is allocated in local memory on the corresponding node.

8 Using numactl to manage NUMA affinity

Administrators can use numactl to run a process with a specified scheduling or memory placement policy. numactl can also set a persistent policy for shared memory segments or files, and set the processor and memory affinity of a process.

In a NUMA system, the greater the distance between a processor and a memory bank, the slower the processor's access to that memory. Performance-sensitive programs should therefore be configured to allocate memory from the closest memory bank, ideally using memory and CPUs on the same NUMA node.

Points to note:

Performance-sensitive multithreaded applications often benefit more from being configured to run on a specific NUMA node than on specific individual processors. Whether this is appropriate depends on the system and the needs of the application. If multiple application threads access the same cached data, configuring those threads to run on the same processor may be appropriate. However, if threads that access and cache different data run on the same processor, each thread may evict cached data that another thread needs. This means each thread "misses" the cache and wastes run time fetching data from memory and replacing it in the cache.

The /sys file system describes how CPUs, memory, and peripheral devices are connected via the NUMA interconnects. In particular, the /sys/devices/system/cpu directory contains information about how the CPUs in the system are connected to one another.

The /sys/devices/system/node directory contains information about the NUMA nodes in the system and the relative distances between them.

/proc — kernel-level information

/sys — hardware and memory related information

numactl parameters in detail

--show

Displays the NUMA policy settings of the current process.

--hardware

Displays a list of the available nodes in the system.

--membind (memory affinity)

Allocates memory only from the specified nodes. When this parameter is used, allocation fails if there is not enough memory on those nodes. Usage:

numactl --membind=nodes program, where nodes is the list of nodes to allocate memory from and program is the program whose memory should be allocated there. Node numbers can be given as a comma-separated list, a range, or a combination of the two.

--cpunodebind (bind CPU and memory to nodes)

Executes the command (and its child processes) only on CPUs belonging to the specified nodes. Usage: numactl --cpunodebind=nodes program, where nodes is the list of nodes whose CPUs the specified program should be bound to. Node numbers can be given as a comma-separated list, a range, or a combination of the two.
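A minimal usage sketch (myapp is a placeholder program name): to keep both the CPUs and the memory of a process on node 0, the two options are commonly combined:

# numactl --cpunodebind=0 --membind=0 myapp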

--physcpubind

Executes the command only on the specified CPUs. Usage: numactl --physcpubind=cpu program, where cpu is a comma-separated list of physical CPU numbers as shown in the processor field of /proc/cpuinfo, and program is the program to execute only on those CPUs. The CPUs can also be specified relative to the current cpuset.

--localalloc

Specifies that memory should always be allocated on the current node, so that the process runs and stores its data on the memory node corresponding to the CPU it is scheduled on.

--preferred

Where possible, allocates memory on the specified node; if memory cannot be allocated there, falls back to other nodes:

numactl --preferred=nodes

These policies must be specified when the program is launched; for a service, place the numactl invocation in its startup script.

[root@python ~]# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3
cpubind: 0
nodebind: 0
membind: 0

9 Automated NUMA affinity management with numad

1 Overview

numad is an automatic NUMA affinity management daemon. It monitors the NUMA topology and resource usage in the system in order to dynamically improve NUMA resource allocation and management.

numad also provides a pre-placement advice service that can be queried by different job management systems to help with the initial binding of CPU and memory resources for a process. This pre-placement advice is available regardless of whether numad is running as an executable or as a service.

Depending on system load, numad can improve baseline performance by up to 50%. To achieve this, numad periodically reads information from the /proc file system to monitor the available system resources on each node. The daemon then tries to place significant processes on NUMA nodes with sufficient memory and CPU resources to optimize NUMA performance. The current thresholds for managing a process are at least 50% of one CPU and at least 300 MB of memory. numad tries to maintain these resource utilization levels and, when needed, rebalances by moving processes between NUMA nodes.

numad also provides a pre-configuration advice service that can be queried through various task management systems to assist with the initial CPU binding and memory placement of processes. This advice is available whether or not numad is running as a service on the system.

2 Two ways to use numad

1 As a service

When numad runs as a service, it tries to tune the system dynamically based on the current workload. Its activity is logged in /var/log/numad.log. To start the service, run: # systemctl start numad.service. To make the service persist across reboots, run:

# chkconfig numad on

2 As an executable

Use numad on the command line

To use numad as an executable, simply run:

numad

While numad runs, its activity is logged in /var/log/numad.log.

It continues to run until terminated with the following command:

numad -i 0

Terminating numad does not remove the changes it has made to improve NUMA affinity. If system usage changes significantly, running numad again will adjust affinity to improve performance under the new conditions.

To restrict numad management to specific processes, start it with the following options:

numad -S 0 -p pid

-p pid

Adds the specified pid to the explicit inclusion list. The specified process is managed once it crosses numad's process significance threshold.

-S 0

Sets the process scanning type to 0, which restricts numad management to explicitly included processes.

4 Scheduling policies

1 Overview

The Linux scheduler implements a number of scheduling policies that determine where and for how long threads run. There are two main categories: normal policies and real-time policies. Normal policies are used for ordinary-priority tasks; real-time policies are used for time-critical tasks that must complete without interruption.

Real-time threads are not subject to time slicing, which means they run until they block, exit, yield, or are preempted by a higher-priority thread. Even the lowest-priority real-time thread is scheduled before any thread with a normal policy.

2 Scheduling policies in detail

The scheduler's responsibility is to keep the CPUs in the system busy. The Linux scheduler applies a scheduling policy that decides when, and on which CPU, a given thread runs.

1 Real-time policies (priority scan)

1 SCHED_FIFO

SCHED_FIFO is a static-priority scheduling policy in which each thread has a fixed priority. The scheduler scans the list of SCHED_FIFO threads in priority order and schedules the highest-priority thread that is ready to run. That thread runs until it blocks, exits, or is preempted by a higher-priority thread. This policy is recommended for time-sensitive tasks that do not run for long periods.

Even the lowest-priority real-time thread is scheduled ahead of any non-real-time thread; if there is only one real-time thread in the system, its SCHED_FIFO priority value hardly matters. The priority of a SCHED_FIFO thread can be any integer from 1 to 99, with 99 the highest. Red Hat recommends starting with a low number and increasing the priority only after a latency problem has been identified.

Because real-time threads are not subject to time slicing, Red Hat does not recommend setting a priority of 99. Threads at that level share priority with the migration and watchdog threads; if your thread enters a compute loop and blocks them, they cannot run. On a single-processor system this will eventually hang the machine.

Administrators can limit SCHED_FIFO bandwidth to prevent real-time application programmers from starting real-time tasks that monopolize a processor.

/proc/sys/kernel/sched_rt_period_us

This parameter defines, in microseconds, the period that is considered 100% of processor bandwidth. The default value is 1000000 μs, or 1 second.

/proc/sys/kernel/sched_rt_runtime_us

This parameter defines, in microseconds, how much of that period is available to run real-time threads. The default value is 950000 μs, or 0.95 s.
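A quick sketch of inspecting and adjusting these limits with sysctl (the 980000 value is only an illustrative choice):

# sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
# sysctl kernel.sched_rt_runtime_us=980000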

2 SCHED_RR

SCHED_RR is round-robin scheduling. It also gives SCHED_RR threads a fixed priority between 1 and 99, but threads with the same priority are scheduled round-robin within a fixed time slice (quantum). The sched_rr_get_interval system call returns the value of the time slice, but users cannot set its duration. This policy is helpful when multiple threads need to run at the same priority.

Modify the real-time priority of a process

[root@python ~]# chrt -h
Show or change the real-time scheduling attributes of a process.

Set policy:
 chrt [options] <priority> <command> [<arg>...]
 chrt [options] --pid <priority> <pid>

Get policy:
 chrt [options] -p <pid>

Policy options:
 -b, --batch          set policy to SCHED_BATCH
 -d, --deadline       set policy to SCHED_DEADLINE
 -f, --fifo           set policy to SCHED_FIFO
 -i, --idle           set policy to SCHED_IDLE
 -o, --other          set policy to SCHED_OTHER
 -r, --rr             set policy to SCHED_RR (default)

Scheduling options:
 -R, --reset-on-fork       set SCHED_RESET_ON_FORK for FIFO or RR
 -T, --sched-runtime <ns>  runtime parameter for DEADLINE
 -P, --sched-period <ns>   period parameter for DEADLINE
 -D, --sched-deadline <ns> deadline parameter for DEADLINE

Other options:
 -a, --all-tasks      operate on all the tasks (threads) for a given pid
 -m, --max            show min and max valid priorities
 -p, --pid            operate on existing given pid
 -v, --verbose        display status information
 -h, --help           display this help and exit
 -V, --version        output version information and exit

chrt sets the specified priority; -p specifies the pid.

If no policy option is specified, the SCHED_RR policy is used by default.
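A brief usage sketch (PID 1234 and the program name are placeholders): start a program under SCHED_FIFO at priority 10, change the policy of an existing process, then verify:

# chrt -f 10 ./myapp
# chrt -r -p 20 1234
# chrt -p 1234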

2 Normal policies (user processes)

Daemons and batch processes

Interactive process

SCHED_OTHER or SCHED_NORMAL

SCHED_OTHER is the default scheduling policy in Red Hat Enterprise Linux 7. It uses the Completely Fair Scheduler (CFS) to provide fair processor access to all threads under this policy. CFS builds a dynamic priority list based in part on the niceness value of each thread, which gives users some indirect control over process priority; the dynamic priority itself can only be modified by CFS. This scheduling algorithm is most useful when there are a large number of threads or when data throughput is the priority, because it schedules threads more efficiently over time.

For low priority tasks

SCHED_BATCH

SCHED_IDLE

3 Commonly used scheduling policies

SCHED_RR

Set with chrt -r.

Priorities run from 1 to 99; the higher the number, the higher the priority.

SCHED_OTHER

The kernel automatically assigns a dynamic priority.

Manual adjustment uses nice and renice.

Priorities run from 100 to 139; the smaller the number, the higher the priority.
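As a small illustration (the program name and PID are placeholders), niceness ranges from -20 (highest priority) to 19 (lowest) and maps onto those 100-139 priority levels:

# nice -n 10 ./myapp        # start a program at lower priority
# renice -n -5 -p 1234      # raise the priority of a running process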

4 Choosing a policy

Choosing the correct scheduler policy for an application's threads is not always straightforward. In general, real-time policies should be reserved for time-critical or important tasks that must be scheduled quickly and do not run for long periods. Normal policies usually produce better data throughput than real-time policies because they let the scheduler run threads more efficiently (that is, they do not need to reschedule preempted processes as frequently).

If you need to manage a large number of processes and are concerned with data throughput (network packets per second, writes to disk, and so on), use SCHED_OTHER and let the system manage the CPUs for you.

If you are concerned about event response latency, use SCHED_FIFO. If there are only a small number of threads, consider isolating a CPU socket and moving the threads onto that socket's cores so that no other threads compete with them.

3 Isolating CPUs

1 Isolating CPUs at boot with isolcpus

Users can isolate one or more CPUs from the scheduler with the isolcpus boot parameter, which prevents the scheduler from placing any user-space threads on those CPUs.

Once a CPU is isolated, you must assign processes to it manually, using the CPU affinity system calls or the numactl command.

To isolate the third CPU and the sixth through eighth CPUs in the system, add the following to the kernel command line:

isolcpus=2,5-7

2 Isolating CPUs with the Tuna tool

Users can also isolate CPUs with the Tuna tool. Tuna can isolate a CPU at any time, not just at boot. However, this isolation method differs slightly from the isolcpus parameter, and it does not yet achieve the performance gains associated with isolcpus.

On systems with more than 32 processors, the smp_affinity value must be split into discrete 32-bit groups. For example, if you want only the first 32 processors of a 64-processor system to service an interrupt request, you could run:

echo 0xffffffff,00000000 > /proc/irq/IRQ_NUMBER/smp_affinity

3 Using Tuna to configure CPU, thread, and interrupt affinity

Tuna can control CPU, thread, and interrupt affinity, and provides a number of operations for each type of entity it controls.

To remove all threads from one or more specific CPUs, run the following command, replacing CPUs with the numbers of the CPUs you want to isolate:

tuna --cpus CPUs --isolate

To add a CPU to the list of CPUs that may run a specific thread, run the following command, replacing CPUs with the numbers of the CPUs you want to include:

tuna --cpus CPUs --include

To move an interrupt request to a specific CPU, run the following command, replacing CPU with the CPU number and IRQs with a comma-separated list of the interrupt requests you want to move:

tuna --irqs IRQs --cpus CPU --move

In addition, you can find all interrupt requests matching the sfc1* pattern with the following command:

tuna -q sfc1* -c7 -m -x

To change the policy and priority of a thread, run the following command, replacing thread with the thread you want to change, policy with the name of the desired scheduling policy, and level with an integer from 0 (lowest priority) to 99 (highest priority):

tuna --threads thread --priority policy:level

4 Interrupts and interrupt request (IRQ) tuning

An interrupt request (IRQ) is a request for service sent from the hardware level; it can be delivered over a dedicated hardware line or as an information packet across a hardware bus.

When interrupts are enabled, receiving an IRQ prompts a switch to interrupt context. The kernel's interrupt dispatch code looks up the list of registered interrupt service routines (ISRs) associated with the IRQ number and calls each ISR in turn. The ISR acknowledges the interrupt, ignores redundant interrupts from the same IRQ, queues a deferred handler to finish processing the interrupt, and ignores subsequent interrupts before ending.

[root@master ~]# cat /proc/interrupts
           CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7
  0:     148000         0         0         0         0         0         0         0   IO-APIC-edge   timer
  1:       1300         0         0         0         0         0         0         0   IO-APIC-edge   i8042
  8:          1         0         0         0         0         0         0         0   IO-APIC-edge   rtc0

The /proc/interrupts file lists, per IO device, the number of interrupts sent to each CPU, the number of interrupts handled by each CPU core, the type of interrupt, and a comma-separated list of drivers registered to receive that interrupt.

Each IRQ has an associated affinity property, smp_affinity, which defines the CPUs allowed to run the ISR for that IRQ. Program performance can be improved by assigning both the interrupt affinity and the program's thread affinity to one or more specific CPU cores, which allows cache lines to be shared between the interrupt and the program threads.

The interrupt affinity value for a particular IRQ number is stored in the corresponding /proc/irq/IRQ_NUMBER/smp_affinity file, which can be viewed and modified by the root user. The value stored in this file is a hexadecimal bit mask representing all the CPU cores in the system.

[root@master ~]# grep eth0 /proc/interrupts
 18:      24909         0         0         0   IO-APIC-fasteoi   eth0
[root@master ~]# cat /proc/irq/18/smp_affinity
00000000,00000000,00000000,000000ff        # ff here means all CPUs

For certain hardware, the IRQ is pinned to particular CPUs when the driver registers it.

For some common hardware, such as network cards and hard drives, the interrupts may be continually rescheduled by the kernel.

Most hardware interrupts are handled on CPU0.

The default smp_affinity value is f, meaning the IRQ can be serviced by any CPU in the system. Setting this value to 1 looks like this:

[root@master ~]# echo 1 > /proc/irq/18/smp_affinity
[root@master ~]# cat /proc/irq/18/smp_affinity
00000000,00000000,00000000,00000001

Isolating interrupts

Before binding, move the interrupts currently handled on the target CPU to another CPU:

[root@master ~]# echo 1 > /proc/irq/18/smp_affinity
[root@master ~]# cat /proc/irq/18/smp_affinity
00000000,00000000,00000000,00000001

Bind the target process to a specific CPU and keep that CPU from handling any other requests or interrupts, to avoid switching between interrupt context and process context.

2 Memory tuning

1 Considerations

Red Hat Enterprise Linux 7 is tuned by default for moderate workloads. If your application or use case requires a large amount of memory, changing how the system handles virtual memory can improve application performance.

2 Pages and memory management

1 Page management

Physical memory is managed in units called pages, and the physical location of each page is mapped to a virtual location so that the processor can access it. This mapping is stored in a data structure called the page table. Physical memory is organized into page frames, while the linear address space is organized into pages; data is stored in page-sized units, and the pages of a process may be mapped to non-contiguous physical page frames.

By default a page is 4 KB, so 1 MB of memory equals 256 pages and 1 GB equals roughly 256,000 pages. CPUs have a built-in memory management unit that caches these mappings, each referenced by a page table entry. Because the default page size is small, a large amount of memory requires a huge number of pages, and the mapping cache can hold only a limited number of address mappings, which are expensive and difficult to scale up. To keep performance acceptable as memory requirements grow, 32-bit systems therefore generally support 4 KB and 4 MB pages, and 64-bit systems 4 KB and 2 MB pages.
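A quick way to confirm the page sizes in use on a given machine (output will vary by system):

# getconf PAGE_SIZE
4096
# grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB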

2 Huge Translation Lookaside Buffer (HugeTLB)

Translating physical memory addresses to virtual memory addresses is part of memory management. The mapping between physical and virtual addresses is stored in a data structure called the page table. Because looking up the page table for every address translation is time- and resource-consuming, the most recently used translations are cached; this cache is the Translation Lookaside Buffer (TLB).

However, the TLB can cache only a limited number of address mappings. If a required mapping is not in the TLB, the page table must be read to determine the physical-to-virtual mapping; this is a TLB miss. Because of the relationship between memory requirements and the mappings the TLB can cache, applications with large memory requirements suffer TLB misses more than applications that use less memory, since every miss involves a page table read. It is therefore important to avoid these misses as much as possible.

The Huge Translation Lookaside Buffer (HugeTLB) allows memory to be managed in very large segments so that more address space can be cached at a time. This reduces the likelihood of TLB misses, which in turn improves performance for applications with large memory requirements.

Red Hat Enterprise Linux provides the HugeTLB feature, which divides memory into large segments for management. This allows a large address range to be cached at once, reducing the likelihood of TLB misses and improving the performance of applications that require large amounts of memory.

3 Huge pages and transparent huge pages

When data is accessed, only the data needed in the near term is loaded rather than all of it. If other, not-yet-loaded data is needed, a page fault occurs and the data must be brought in through I/O.

For processes that consume a great deal of memory, enabling huge pages is one tuning approach.

There are two ways for a system to manage large amounts of memory:

1 Increase the number of page table entries in the hardware memory management unit

2 Use larger pages

The first method is expensive: the hardware memory management unit in current processors supports only hundreds or thousands of page table entries, and hardware and memory management algorithms that work well for thousands of pages may not handle millions or even billions of pages well. This causes performance problems: when a program needs more pages than the memory management unit supports, the system falls back to slower, software-based memory management, and the whole system slows down.

Huge pages are memory blocks of 2 MB or 1 GB. 2 MB pages are suitable for managing many gigabytes of memory, while 1 GB pages are the best choice for terabytes of memory.

Huge pages must be allocated at boot time, are difficult to manage manually, and often require code changes to be used effectively, so Red Hat Enterprise Linux also provides transparent huge pages (THP).

THP is an abstraction layer that automates the creation, management, and most other aspects of using huge pages.

Static huge pages must be configured at boot time, whereas transparent huge pages can be configured at any time.

THP hides much of the complexity of using huge pages from system administrators and developers. Because the goal of THP is to improve performance, its developers have tested and optimized it across a wide range of systems, configurations, and workloads, so the default THP settings improve performance for most system configurations.

Varnish is not compatible with transparent huge pages, which may cause memory leaks.

Varnish performance is mediocre on 2 MB pages.

The transparent huge page feature manages anonymous memory segments.

Red Hat Enterprise Linux also provides static huge pages for managing very large amounts of memory per page; these can be allocated in sizes up to 1 GB, but they are difficult to manage and must be allocated at boot time.

Transparent huge pages are largely an automated alternative to static huge pages. They are 2 MB in size and enabled by default. They can sometimes interfere with latency-sensitive applications, so they are often disabled when latency problems are severe.
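A minimal sketch of checking and disabling THP at runtime (sysfs paths as found on RHEL 7; the change lasts until reboot):

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# echo never > /sys/kernel/mm/transparent_hugepage/enabled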

2 Monitoring and diagnosing performance problems

Red Hat Enterprise Linux 7 provides a number of useful tools to monitor system performance and performance issues related to system memory

1 vmstat monitors memory usage

vmstat is provided by the procps-ng package and outputs reports on system processes, memory, paging, block input/output, interrupts, and CPU activity.

For more information, please see:

https://blog.51cto.com/11233559/2152153
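A brief usage sketch (column meanings as documented in vmstat's man page): report every second, five times, with memory shown in megabytes:

# vmstat -S M 1 5       # the si/so columns show swap-in/swap-out activity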

2 Using Valgrind to analyze application memory usage

1 Installation

Valgrind is a framework that provides instrumentation of user-space binaries. It includes a number of tools for profiling and analyzing program performance. The Valgrind tools described in this section can help detect memory errors such as use of uninitialized memory and improper memory allocation or deallocation.

To use valgrind or its tools, install the valgrind package:

yum install valgrind

2 memcheck

memcheck is the default Valgrind tool. It detects and reports a large number of memory errors that are otherwise difficult to detect and diagnose, such as:

Memory accesses that should not occur

Use of undefined or uninitialized values

Incorrectly freed heap addresses

Overlapping pointers

Memory leaks

memcheck can only report these errors; it cannot prevent them. If a program accesses memory in a way that would normally cause a segmentation fault, the fault still occurs, but memcheck logs a message immediately before it.

Because memcheck uses instrumentation, applications executed through memcheck run 10-30 times slower than usual.

To run memcheck on the application, execute the following instructions:

valgrind --tool=memcheck application

Users can also use the following options to focus memcheck output on specific problem types.

3 Parameters in detail

--leak-check

After the application finishes running, memcheck searches for memory leaks. The default is --leak-check=summary, which displays the number of leaks found. You can specify --leak-check=yes or --leak-check=full to output details of each leak, or disable the check with --leak-check=no.

--undef-value-errors

The default is --undef-value-errors=yes, which reports errors when undefined values are used. Setting --undef-value-errors=no disables these reports and slightly speeds up memcheck.

--ignore-ranges

Specifies one or more ranges that memcheck should ignore when checking addressability, for example --ignore-ranges=0xPP-0xQQ,0xRR-0xSS.
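Putting these options together, a minimal invocation (the program name and log file are placeholders):

# valgrind --tool=memcheck --leak-check=full --log-file=memcheck.log ./myapp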

3 cachegrind

Using cachegrind to analyze cache usage

cachegrind simulates the application's interaction with the system's cache hierarchy and branch predictor, tracking simulated instruction and data cache usage to detect poor interaction between the code and that level of cache. It also tracks the last-level cache (second- or third-level) in order to track memory access. Applications run with cachegrind execute 20-100 times slower than usual.

cachegrind collects statistics while the application runs and outputs a summary to the console. To run cachegrind on an application, execute the following command:

valgrind --tool=cachegrind application

Users can also use the following options to focus cachegrind output on a specific problem.

Parameters in detail

--I1

Specifies the size, associativity, and line size of the first-level instruction cache, in the form --I1=size,associativity,line_size.

--D1

Specifies the size, associativity, and line size of the first-level data cache, in the form --D1=size,associativity,line_size.

--LL

Specifies the size, associativity, and line size of the last-level cache, in the form --LL=size,associativity,line_size.

--cache-sim

Enables or disables the collection of cache access and miss counts; enabled by default (--cache-sim=yes). Disabling both this and --branch-sim leaves cachegrind with no information to collect.

--branch-sim

Enables or disables the collection of branch instruction and misprediction counts; enabled by default (--branch-sim=yes). Disabling both this and --cache-sim leaves cachegrind with no information to collect.

cachegrind writes detailed profiling information to a per-process cachegrind.out.pid file, where pid is the process identifier. This detail can be further processed with the cg_annotate tool, as follows:

cg_annotate cachegrind.out.pid

cachegrind also provides the cg_diff tool, which makes it easier to compare program performance before and after a code change. To compare the output files, execute the following command, replacing first with the original profile output file and second with the subsequent profile output file:

cg_diff first second

4 Using massif to profile heap space

massif measures the heap space used by a particular application: both the useful space and any additional space allocated for bookkeeping and alignment. massif helps you understand how to reduce the application's memory usage so that it runs faster and is less likely to exhaust the system's swap space. Applications executed with massif run about 20 times slower than usual.

To run massif on an application, execute the following command:

valgrind --tool=massif application

Users can also use the following options to focus the output of massif on a specific problem.

--heap

Sets whether massif profiles the heap. The default is --heap=yes. To disable heap profiling, set --heap=no.

--heap-admin

Sets the number of bytes per block used for administration when heap profiling is enabled. The default is 8 bytes.

--stacks

Sets whether massif profiles the stack. The default is --stacks=no because stack profiling greatly slows down massif. Set this option to --stacks=yes to enable stack profiling. Note that massif assumes the main stack starts with size zero in order to better show the changes in stack size that relate to the application being profiled.

--time-unit

Sets the unit massif uses when collecting profiling data. The default is i (instructions executed). You can also specify ms (milliseconds, real time) or B (bytes allocated or deallocated on the heap and stack). Examining bytes allocated is useful for short-running applications and for testing, because it is the most reproducible across different hardware.

massif outputs profiling data to a massif.out.pid file, where pid is the process identifier of the profiled application. The ms_print tool graphs this data, showing memory consumption over the application's execution as well as details of the allocation sites responsible at the point of peak memory allocation. To graph the data in the massif.out.pid file, execute:

ms_print massif.out.pid

3 Configuration tools

1 Introduction

Memory usage is usually configured by setting one or more kernel parameter values. These can be set temporarily by changing file contents in the /proc file system, or persistently with the sysctl tool, which is provided by the procps-ng package.

For example, to temporarily set the overcommit_memory parameter to 1, run:

echo 1 > /proc/sys/vm/overcommit_memory

To set this value persistently, pass the parameter to sysctl:

sysctl vm.overcommit_memory=1

Setting a parameter temporarily is helpful for determining its effect on the system. Once you have confirmed that the parameter value has the desired effect, you can set it permanently.
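A minimal sketch of making such a setting survive reboots (the file name under /etc/sysctl.d/ is an arbitrary choice):

# echo "vm.overcommit_memory = 1" > /etc/sysctl.d/99-overcommit.conf
# sysctl -p /etc/sysctl.d/99-overcommit.conf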

2 Configuring huge pages

Huge pages depend on contiguous memory areas, so it is best to define them at boot time, before memory becomes fragmented. To do so, add the following parameter to the kernel boot command line:

hugepages

Defines the number of persistent 2 MB huge pages configured in the kernel at boot time. The default is 0. Huge pages can be allocated (or freed) only if there are enough physically contiguous free pages in the system. Pages reserved by this parameter cannot be used for other purposes.

This value can be adjusted after boot by changing the /proc/sys/vm/nr_hugepages file. For more details, see the relevant kernel documentation, installed by default at /usr/share/doc/kernel-doc-kernel_version/Documentation/vm/hugetlbpage.txt.
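For example (512 is an arbitrary illustrative count), to reserve huge pages at runtime and verify the result:

# echo 512 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo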

/ proc/sys/vm/nr_overcommit_hugepages

Defines the maximum number of additional huge pages that the system can create and use by overcommitting memory. Writing any non-zero value to this file means that, when the persistent huge page pool is exhausted, the system can obtain up to this number of huge pages from the kernel's normal page pool. As these surplus huge pages become unused, they are freed and returned to the kernel's normal page pool.

3 configure system memory capacity

Virtual memory parameters

The parameters listed here are under /proc/sys/vm unless otherwise indicated.

dirty_ratio: a percentage value. When this percentage of total system memory has been modified (dirtied), the kernel writes the changes to disk via pdflush. The default is 20 percent.

dirty_background_ratio: a percentage value. When this percentage of total system memory has been modified, the kernel writes the changes to disk in the background. The default is 10 percent.

overcommit_memory: defines the conditions that determine whether a large memory request is accepted or rejected.

The default value is 0. By default, the kernel uses a heuristic to handle memory overcommit: it estimates the amount of available memory and fails requests that are blatantly too large. However, because memory is allocated using a heuristic rather than a precise algorithm, this setting still allows memory to be overcommitted.

When this parameter is set to 1, the kernel performs no overcommit handling at all, which increases the possibility of memory overload but also improves performance for memory-intensive tasks.

When this parameter is set to 2, the kernel rejects requests for memory greater than or equal to the total available swap space plus the percentage of physical RAM specified in overcommit_ratio. This reduces the risk of overcommitting memory, but it is recommended only on systems whose swap space is larger than their physical memory.

overcommit_ratio

Sets the percentage of physical RAM taken into account when overcommit_memory is set to 2. The default is 50.

max_map_count

Defines the maximum number of memory-mapped areas a process may use. The default (65530) is appropriate in most cases; increase this value if your application needs to map more than that number of files.

min_free_kbytes

Specifies the minimum number of kilobytes to keep free across the system. This value is used to compute a watermark for each low-memory zone, which is then assigned a number of reserved free pages proportional to its size.

Warning:

Extreme values can damage the system. Setting min_free_kbytes too low prevents the system from reclaiming memory, which can cause the system to hang and OOM-kill processes. However, setting min_free_kbytes too high (for example, to 5-10% of total system memory) puts the system into an out-of-memory state almost immediately, causing it to spend too much time reclaiming memory.

oom_adj

When the system runs out of memory and the panic_on_oom parameter is set to 0, the oom_killer function kills processes, starting with the process with the highest oom_score, until the system can recover.

The oom_adj parameter helps determine the oom_score of a process, and it is set per process identifier. A value of -17 disables the oom_killer for that process; other valid values range from -16 to 15. Processes spawned by an adjusted process inherit that process's oom_score.

"swappiness

A value from 0 to 100 that controls how aggressively the system swaps. High values favor system efficiency and aggressively swap inactive processes out of physical memory. Low values favor responsiveness and avoid swapping processes out of physical memory for as long as possible. The default is 60.
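For instance (the value 10 is only a common illustrative choice for latency-sensitive servers), swappiness can be lowered at runtime and checked:

# sysctl vm.swappiness=10
# cat /proc/sys/vm/swappiness
10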

4 File system parameters

The parameters listed here are under /proc/sys/fs unless otherwise indicated.

aio-max-nr

Defines the maximum number of events allowed in asynchronous input/output contexts. The default is 65536. Modifying this value does not pre-allocate or resize any kernel data structures.

File-max

Defines the maximum number of file handles the kernel will allocate. The default matches the files_stat.max_files value in the kernel, which is set to the larger of NR_FILE (8192 in Red Hat Enterprise Linux) or the following result:

(mempages * (PAGE_SIZE / 1024)) / 10

Increase this value to resolve errors caused by the lack of available file handles

5 Kernel parameters

The parameters here are all in / proc/sys/kernel unless otherwise indicated.

msgmax

Defines the maximum size, in bytes, of any single message in a message queue. This value must not exceed the size of the queue (msgmnb). The default is 65536.

msgmnb

Defines the maximum size, in bytes, of each message queue. The default is 65536.

msgmni

Defines the maximum number of message queue identifiers (and therefore the maximum number of queues). On 64-bit systems the default is 1985.

shmall

Defines the total amount of shared memory, in pages, that the system can use at one time.

shmmni

Defines the system-wide maximum number of shared memory segments. The default is 4096 on all systems.

threads-max

Defines the system-wide maximum number of threads the kernel can use at one time. The default value is the same as the kernel parameter max_threads, or the result of:

mempages / (8 * THREAD_SIZE / PAGE_SIZE)

The minimum value is 20.

3 Storage and file systems

1 Considerations

Appropriate settings for storage and file system performance depend largely on the purpose of the storage. I/O and file system performance are affected by the following factors:

1 Data write or read patterns

2 Data alignment with the underlying geometry

3 Block size

4 File system size

5 Journal size and location

6 Recording access times

7 Ensuring data reliability

8 Pre-fetching data

9 Pre-allocating disk space

10 File fragmentation

11 Resource contention

2 Solid-state disks

SSDs (solid-state drives) use flash memory chips rather than rotating platters to store persistent data. They provide a constant access time across the whole logical block address range and do not incur the measurable seek costs of their rotating counterparts. Storage is more expensive per gigabyte and less dense, but SSDs have lower latency and higher throughput than HDDs.

Performance usually degrades as the used blocks on an SSD approach the capacity of the disk. The degree of degradation varies by vendor, but all devices degrade in this situation; enabling discard helps mitigate the degradation.

The default I/O scheduler and virtual memory options are suitable for SSDs.

3 I/O schedulers

1 Overview

The I/O scheduler determines when, and for how long, I/O operations run on a storage device. It is also known as the I/O elevator.

2 deadline

deadline is the default I/O scheduler for all block devices except SATA disks. deadline tries to provide a guaranteed latency for requests once they reach the I/O scheduler. It is suitable for most use cases, especially those where read operations occur more often than writes.

Queued I/O requests are sorted into read or write batches and executed in increasing LBA order. By default, read batches take precedence over write batches because applications are more likely to block waiting on read I/O. After a batch is processed, deadline checks how long write operations have been starved of processor time and schedules the next read or write batch as appropriate. The number of requests per batch, the number of read batches issued per write batch, and the amount of time before a request expires are all configurable.

3 cfq

cfq is the default scheduler only for devices identified as SATA disks. The Completely Fair Queueing scheduler, cfq, divides processes into three separate classes: real-time, best-effort, and idle. Real-time class processes always execute before best-effort class processes, which in turn execute before idle class processes. This means real-time processes can starve best-effort and idle processes of processor time. By default, processes are assigned to the best-effort class.

cfq uses historical data to predict whether an application will issue more I/O requests in the near future. If more I/O is expected, cfq idles waiting for the new I/O, even if I/O requests from other processes are waiting to be processed.

Because of this tendency to idle, the cfq scheduler should not be used with hardware that does not suffer large seek penalties unless it is tuned for that purpose, and it should not be stacked with other non-work-conserving schedulers.

cfq behavior is highly configurable.

4 noop

The noop I/O scheduler implements a simple FIFO (first-in, first-out) scheduling algorithm. Requests are merged at the generic block layer through a simple last-hit cache. This is the best scheduler for CPU-bound systems using very fast storage.
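A minimal sketch of inspecting and switching the scheduler for a block device at runtime (sda is a placeholder device name; the change lasts until reboot):

# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
# echo noop > /sys/block/sda/queue/scheduler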

4 File systems

1 XFS

XFS is a reliable and highly scalable 64-bit file system, and it is the default file system in Red Hat Enterprise Linux 7. XFS uses extent-based allocation and supports several allocation schemes, including pre-allocation and delayed allocation, both of which reduce fragmentation and help performance. It also supports metadata journaling, which aids crash recovery, and it can be defragmented and enlarged while mounted and active. XFS supports file systems of up to 500 TB and a maximum file offset of 8 EB.

2 Ext4

Ext4 is a scalable extension of the Ext3 file system, and its default behavior is optimal for most workloads. However, it only supports file systems of up to 50 TB and files of up to 16 TB.

3 Btrfs (Technical Preview)

Btrfs is a copy-on-write file system that provides scalability, fault tolerance, and easy administration. It includes built-in snapshot and RAID support, and provides data integrity through checksums of data and metadata. It also improves performance and space efficiency through data compression. As a Technology Preview, Btrfs supports file systems of up to 50 TB. Btrfs is best suited to desktop and cloud storage.

4 GFS2

GFS2 is part of the High Availability Add-On, which provides clustered file system support for Red Hat Enterprise Linux 7. A GFS2 cluster presents a consistent file system image to all servers in the cluster, allowing them to read and write within a single shared file system.

GFS2 supports file systems of up to 150 TB.

5 Considerations for formatting a file system 1 Overview

Some file system configuration decisions cannot be changed after the device is formatted, so the intended use must be considered carefully before formatting.

2 size

Create file systems of a reasonable size for the workload. Smaller file systems take proportionally less time to back up, and file system checks require less time and memory. However, if a file system is too small, heavy fragmentation will degrade its performance.

3 Block size

A block is the unit of work in a file system. The block size determines how much data a single block can store, and therefore the smallest amount of data that can be read or written at one time.

The default block size is appropriate for most use cases. However, the file system performs better and stores data more efficiently if the block size, or the size of several blocks, is the same as or slightly larger than the amount of data typically read or written at one time. A small file still uses an entire block, and a file can be spread across multiple blocks, which creates additional runtime overhead. Some file systems are also limited to a certain number of blocks, which in turn limits the maximum size of the file system.

When formatting a device with the mkfs command, the block size is specified as part of the file system options; the parameter that specifies the block size varies with the file system.
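For example, a minimal sketch of specifying the block size at format time (the device name and sizes are illustrative; check the mkfs.xfs and mkfs.ext4 man pages for your release before use):

# mkfs.xfs -b size=4096 /dev/sdX    # XFS: block size in bytes
# mkfs.ext4 -b 4096 /dev/sdX        # ext4: block size in bytes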

4 geometry

File system geometry concerns the distribution of data across the file system. If the system uses striped storage, such as RAID, performance can be improved by aligning data and metadata with the underlying storage geometry when the device is formatted.

Many devices export a recommended geometry, which is set automatically when the device is formatted with a particular file system. If the device does not export these recommendations, or if you want to change the recommended settings, you must specify the geometry manually when formatting the device with mkfs.
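As an illustrative sketch only, an ext4 file system on a hypothetical RAID volume with a 64 KB chunk size across 4 data disks could be formatted with explicit geometry; the values below are assumptions and should be derived from your actual array:

# mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/md0   # stride = chunk size / block size, stripe-width = stride * data disks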

5 External journal

A journaling file system records the changes that a write operation will make in a journal file before the operation is executed. This reduces the likelihood that the storage device becomes corrupted in the event of a system crash or power failure, and speeds up the recovery process.

Metadata-intensive workloads update the journal very frequently. A larger journal uses more memory but reduces the frequency of write operations. Additionally, the seek time of a device with a metadata-intensive workload can be improved by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage.

Ensure that the external journal device is reliable: losing an external journal device corrupts the file system.

External journals must be created at format time, and the journal device must be specified at mount time.
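A minimal sketch of creating and mounting an XFS file system with an external journal, assuming /dev/sdX holds the data, /dev/sdY is the dedicated log device, and /mnt/data exists (all names and the log size are illustrative):

# mkfs.xfs -l logdev=/dev/sdY,size=128m /dev/sdX
# mount -o logdev=/dev/sdY /dev/sdX /mnt/data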

6 Considerations for mount options 1 Barriers

File system barriers ensure that file system metadata is correctly written and ordered on persistent storage, and that data transmitted with fsync persists across a power outage. In previous versions of Red Hat Enterprise Linux, enabling file system barriers could significantly slow down applications that relied heavily on fsync, or that created and deleted many small files.

In Red Hat Enterprise Linux 7, file system barrier performance has been improved to the point that the performance benefit of disabling barriers is negligible (less than 3%).

2 access time

Every time a file is read, its metadata is updated with the access time (atime), which involves additional write I/O. In most cases this overhead is minimal, because by default Red Hat Enterprise Linux 7 only updates atime when the previous access time is older than the last modification time (mtime) or status change time (ctime).

However, if updating this metadata is time-consuming and exact access times are not required, the file system can be mounted with the noatime option, which disables metadata updates when a file is read. It also enables nodiratime behavior, which disables metadata updates when a directory is read.
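For example, a hedged sketch of disabling access-time updates, either for a one-off mount or persistently through /etc/fstab (the device, mount point, and file system type are illustrative):

# mount -o noatime /dev/sdX /mnt/data
# grep /mnt/data /etc/fstab
/dev/sdX  /mnt/data  xfs  defaults,noatime  0 0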

3 Read-ahead

Read-ahead speeds up file access by prefetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved faster than from disk. The higher the read-ahead value, the further ahead the system prefetches data.

Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about the file system, but this detection is not always accurate. Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles shipped with Red Hat Enterprise Linux 7 raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
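A sketch of inspecting and raising the read-ahead value for a single device (sda and the value of 4096 KB are illustrative; measure the workload before and after changing it):

# cat /sys/block/sda/queue/read_ahead_kb              # current read-ahead in KB
# echo 4096 > /sys/block/sda/queue/read_ahead_kb      # raise read-ahead (illustrative value)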

7 maintenance

Periodically discarding blocks that are not in use by the file system is a recommended practice for both solid-state disks and thinly-provisioned storage. There are two ways to discard unused blocks:

Batch discard

This type of discard is part of the fstrim command, which discards all unused blocks in a file system that match criteria specified by the administrator.

Red Hat Enterprise Linux 7 supports batch discard on XFS- and ext4-formatted devices that support physical discard operations (that is, HDD devices where the value of /sys/block/devname/queue/discard_max_bytes is not zero, and SSD devices where the value of /sys/block/devname/queue/discard_granularity is not zero).

Online discard

This type of discard is configured at mount time with the discard option and runs in real time without user intervention. However, online discard only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 7 supports online discard on XFS- and ext4-formatted devices.

Red Hat recommends batch discard, except where online discard is required to maintain performance, or where batch discard is not feasible for the system's workload.
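A minimal sketch of checking whether a device supports discard and running a batch discard manually (the device and mount point are illustrative):

# cat /sys/block/sda/queue/discard_max_bytes   # a non-zero value means the device supports discard
# fstrim -v /mnt/data                          # discard unused blocks on the mounted file system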

Pre-allocation

Pre-allocation marks disk space as allocated to a file without writing any data into that space. This can help limit data fragmentation and poor read performance. Red Hat Enterprise Linux 7 supports pre-allocating space on XFS, ext4, and GFS2 devices at mount time.
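As an illustration, space can be pre-allocated to a file with the fallocate utility (the path and size are illustrative); applications can achieve the same effect with the fallocate(2) call:

# fallocate -l 1G /mnt/data/prealloc.img   # reserve 1 GB for the file without writing data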

6 Use SystemTap to monitor storage

The following SystemTap example scripts relate to storage performance and may be helpful in diagnosing storage or file system performance problems. By default, they are installed in the /usr/share/doc/systemtap-client/examples/io directory.

"disktop.stp

Check the status of the read / write hard drive every 5 seconds and output the top ten items during this period.

"iotime.stp

Displays the amount of time spent in read and write operations, as well as the number of bytes read and written.

"traceio.stp

Prints the top ten executables every second, based on the cumulative I/O traffic observed.

"traceio2.stp

Displays executable names and process identifiers when reading and writing to a specific device.

"inodewatch.stp

Whenever a read or write operation is performed on a specific inode on a specific primary / secondary device, the executable name and process identifier are displayed.

"inodewatch3.stp

Displays the executable name, process identifier, and attribute whenever an attribute changes on a specific inode on a specific major/minor device.

6 Configuration tools 1 Configure a tuning profile for storage performance

Tuned and tuned-adm provide a number of profiles designed to improve performance for specific use cases. The following profiles are particularly important for improving storage performance.

Its functions are as follows

1 latency-performance

2 throughput-performance

To configure a profile on the system, run the following command, replacing name with the name of the profile you want to use.

$ tuned-adm profile name

The tuned-adm recommend command recommends an appropriate profile for the system. It is also used to set the default profile at install time, so it can be used to return to the default profile.
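For example, a short sketch of switching profiles and checking the active one (profile names as listed above):

$ tuned-adm recommend                       # show the profile recommended for this system
$ tuned-adm profile throughput-performance  # switch to the throughput-oriented profile
$ tuned-adm active                          # confirm which profile is active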

2 Set the default I/O scheduler

The default I/O scheduler is the scheduler that is used if no scheduler is explicitly specified for a device.

To set the default I/O scheduler, specify the scheduler you want to use by appending the elevator parameter to the kernel command line at boot time, or by editing the /etc/grub2.conf file.

elevator=scheduler_name

3 Configure the I/O scheduler for a device

To set the scheduler for a specific storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device you want to configure.

[root@python ~]# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
[root@python ~]# echo cfq > /sys/block/sda/queue/scheduler
[root@python ~]# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

4 Adjust the deadline scheduler

When deadline is in use, queued I/O requests are sorted into read and write batches and then scheduled for dispatch in increasing LBA order. By default, read batches take precedence over write batches, because applications are more likely to block on read I/O. After a batch is processed, deadline checks how long write operations have been starved of processor time and schedules the next read or write batch as appropriate.

The following parameters affect the behavior of the deadline scheduler (a sketch of setting them through sysfs follows the parameter list).

fifo_batch

The number of read or write operations to issue in a single batch. The default is 16. Higher values can increase throughput, but they also increase latency.

front_merges

If your workload will never generate front merges, this tunable can be set to 0. However, unless you have measured the overhead of this check, the default value of 1 is recommended.

read_expire

The number of milliseconds within which a read request should be scheduled for service. The default value is 500 (0.5 seconds).

write_expire

The number of milliseconds within which a write request should be scheduled for service. The default value is 5000 (5 seconds).

writes_starved

The number of read batches that can be processed before a write batch is processed. The higher the value, the greater the preference given to read batches.
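These tunables live under /sys/block/devname/queue/iosched/ while deadline is the active scheduler. A hedged sketch of reading and adjusting them (sda and the values are illustrative; changes do not persist across reboots):

# cat /sys/block/sda/queue/iosched/fifo_batch
# echo 32 > /sys/block/sda/queue/iosched/fifo_batch     # larger batches: more throughput, more latency
# echo 250 > /sys/block/sda/queue/iosched/read_expire   # tighten the read deadline, in milliseconds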

5 Adjust the cfq scheduler

When cfq is in use, processes are placed into three classes: real-time, best-effort, and idle. All real-time processes are scheduled before any best-effort processes, which in turn are scheduled before any idle processes. By default, processes are classified as best-effort. A process's class can be adjusted manually with the ionice command.

The behavior of the cfq scheduler can be further adjusted with the following parameters, by changing the corresponding files in the /sys/block/devname/queue/iosched directory:

[root@python iosched]# ll
total 0
-rw-r--r-- 1 root root 4096 December 10 23:01 back_seek_max
-rw-r--r-- 1 root root 4096 December 10 23:01 back_seek_penalty
-rw-r--r-- 1 root root 4096 December 10 23:01 fifo_expire_async
-rw-r--r-- 1 root root 4096 December 10 23:01 fifo_expire_sync
-rw-r--r-- 1 root root 4096 December 10 23:01 group_idle
-rw-r--r-- 1 root root 4096 December 10 23:01 low_latency
-rw-r--r-- 1 root root 4096 December 10 23:01 quantum
-rw-r--r-- 1 root root 4096 December 10 23:01 slice_async
-rw-r--r-- 1 root root 4096 December 10 23:01 slice_async_rq
-rw-r--r-- 1 root root 4096 December 10 23:01 slice_idle
-rw-r--r-- 1 root root 4096 December 10 23:01 slice_sync
-rw-r--r-- 1 root root 4096 December 10 23:01 target_latency
[root@python iosched]# pwd
/sys/block/sda/queue/iosched

back_seek_max

The maximum distance, in KB, over which cfq performs a backward seek. The default is 16 KB. Backward seeks typically damage performance, so large values are not recommended.

back_seek_penalty

The multiplier applied to backward seeks when deciding whether to move the disk head forward or backward. The default value is 2. If the head position is at 1024 KB and there are equidistant requests on either side, back_seek_penalty is applied to the backward seek distance and the disk moves forward.

fifo_expire_async

The length of time, in milliseconds, that an asynchronous (buffered write) request can remain unserviced. After this time expires, a single starved asynchronous request is moved to the dispatch list. The default is 250 milliseconds.

fifo_expire_sync

The length of time, in milliseconds, that a synchronous (read or O_DIRECT write) request can remain unserviced. After this time expires, a single starved synchronous request is moved to the dispatch list. The default value is 125 milliseconds.

group_idle

By default, this parameter is set to 0 (disabled). When set to 1 (enabled), the cfq scheduler idles on the last process issuing I/O in a control group. This is useful when using proportional-weight I/O control groups and when slice_idle is set to 0 (on fast storage).

"group_isolation

By default, this parameter is set to 0 (disabled). When set to 1 (enabled), it provides greater isolation between groups, but throughput decreases because fairness is used for random and sequential workloads. When group_isolation is disabled (set to 0), fairness is provided only to sequential workloads.

low_latency

By default, this parameter is set to 1 (enabled). When enabled, cfq favors fairness over throughput by providing a maximum wait time of 300 ms for each process issuing I/O on the device. When set to 0 (disabled), target latency is ignored and each process receives a full time slice.

quantum

This parameter defines the number of I/O requests that cfq sends to a device at one time, essentially a limit on queue depth. The default value is 8 requests. The device in use may support a greater queue depth, but increasing quantum also increases latency, especially for large sequential write workloads.

slice_async

This parameter defines the length of the time slice allotted to each process issuing asynchronous I/O requests. The default is 40 milliseconds.

slice_idle

This parameter specifies the length of time, in milliseconds, that cfq idles while waiting for the next request. The default is 0 (no idling at the queue or service tree level). The default value is ideal for throughput on external RAID storage, but it can degrade throughput on internal non-RAID storage by increasing the overall number of seek operations.

slice_sync

This parameter defines the length of the time slice allotted to each process issuing synchronous I/O requests. The default value is 100 milliseconds.

Adjust cfq for fast storage

Unlike hardware that suffers a large seek penalty, the cfq scheduler is not recommended for fast storage such as fast external storage arrays or solid-state disks. If you must use cfq on such storage, edit the following configuration files (a sketch of applying these settings follows the list):

Set /sys/block/devname/queue/iosched/slice_idle to 0

Set /sys/block/devname/queue/iosched/quantum to 64

Set /sys/block/devname/queue/iosched/group_idle to 1
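A minimal sketch of applying the three settings above to a single device (sdb is illustrative; the changes take effect immediately but do not persist across reboots):

# echo 0  > /sys/block/sdb/queue/iosched/slice_idle
# echo 64 > /sys/block/sdb/queue/iosched/quantum
# echo 1  > /sys/block/sdb/queue/iosched/group_idle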

6 adjust the noop scheduler

The noop I/O scheduler is mainly useful for CPU-bound systems that use fast storage. Requests are merged at the block layer, so noop behavior is modified by editing the block layer parameters in the files under the /sys/block/sdx/queue/ directory (a sketch of adjusting a few of them follows the parameter descriptions below).

[root@python ~]# echo noop > /sys/block/sda/queue/scheduler
[root@python ~]# cat /sys/block/sda/queue/scheduler
[noop] deadline cfq
[root@python ~]# cd /sys/block/sda/queue/iosched/
[root@python iosched]# ll
total 0
[root@python iosched]# cd ..
[root@python queue]# pwd
/sys/block/sda/queue
[root@python queue]# ll
total 0
-rw-r--r-- 1 root root 4096 December 10 23:01 add_random
-r--r--r-- 1 root root 4096 December 10 23:01 discard_granularity
-r--r--r-- 1 root root 4096 December 10 23:01 discard_max_bytes
-r--r--r-- 1 root root 4096 December 10 23:01 discard_zeroes_data
-r--r--r-- 1 root root 4096 December 10 23:01 hw_sector_size
drwxr-xr-x 2 root root    0 December 10 23:27 iosched
-rw-r--r-- 1 root root 4096 December 10 23:01 iostats
-r--r--r-- 1 root root 4096 December 10 23:01 logical_block_size
-r--r--r-- 1 root root 4096 December 10 23:01 max_hw_sectors_kb
-r--r--r-- 1 root root 4096 December 10 23:01 max_integrity_segments
-rw-r--r-- 1 root root 4096 December 10 23:01 max_sectors_kb
-r--r--r-- 1 root root 4096 December 10 23:01 max_segments
-r--r--r-- 1 root root 4096 December 10 23:01 max_segment_size
-r--r--r-- 1 root root 4096 December 10 23:01 minimum_io_size
-rw-r--r-- 1 root root 4096 December 10 23:01 nomerges
-rw-r--r-- 1 root root 4096 December 10 23:01 nr_requests
-r--r--r-- 1 root root 4096 December 10 23:01 optimal_io_size
-r--r--r-- 1 root root 4096 December 10 23:01 physical_block_size
-rw-r--r-- 1 root root 4096 December  3 21:14 read_ahead_kb
-rw-r--r-- 1 root root 4096 December 10 23:01 rotational
-rw-r--r-- 1 root root 4096 December 10 23:01 rq_affinity
-rw-r--r-- 1 root root 4096 December 10 23:27 scheduler
-rw-r--r-- 1 root root 4096 December 10 23:01 unpriv_sgio
-r--r--r-- 1 root root 4096 December 10 23:01 write_same_max_bytes

add_random

Some I/O event timings contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable.

max_sectors_kb

Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device, and the maximum value is determined by the value of max_hw_sectors_kb.

Some solid-state disks perform poorly when I/O requests are larger than the internal erase block size. In this case, Red Hat recommends reducing max_hw_sectors_kb to the internal erase block size.

nomerges

Most workloads benefit from request merging; however, disabling merging can be useful for debugging purposes. By default, nomerges is set to 0 (merging enabled); set it to a non-zero value to disable merging.

"nr_requests

Limit the maximum number of read and write requests queued at the same time. The default value is 128, which means that 128 read requests and 128 write requests are queued before the next process requesting a read or write operation enters sleep mode. For latency-sensitive applications, lower the parameter value and limit the depth of the command queue on the storage so that the write-back Icano cannot populate the device queue with write requests. When the device queue is filled, other processes that try to perform the Icano operation go into sleep mode until queue space is available. The request is then allocated as a round-robin fashion (circular) to prevent a process from continuously using all the points in the queue.

optimal_io_size

Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to, and in multiples of, the optimal I/O size wherever possible.

"read_ahead_kb

Defines the number of kilobytes that the operating system will read in advance during the sequential read operation phase to store information that may be immediately needed in the page cache. Device mappers often benefit from a high read_ahead_kb value of 128 KB; this is a good starting point for accessing the devices to be mapped.

rotational

Some solid-state disks do not correctly advertise their solid-state status and are instead treated as traditional rotational disks. If your solid-state disk does not set this parameter to 0 automatically, set it manually to disable unnecessary seek-reducing logic in the scheduler.

"rq_affinity

By default, Iamp O fulfillment can be done on different processors, rather than limited to the processor that issued the Iripple O request. Set rq_affinity to 1 to disable this capability and execute it only on the processor that issued the Icano request. This can improve the effectiveness of processor data caching.

7 Configure file systems for performance

If file fragmentation or resource contention causes performance loss, performance can usually be improved by reconfiguring the file system. However, in some use cases, you may need to change the application.

1 adjust XFS

The XFS default formatting and mount settings apply to most workloads. Red Hat recommends that you change specific configurations only if they are beneficial to your workload.

1 Formatting options

Directory block size

The directory block size affects the amount of directory information that can be retrieved or modified per I/O operation. The minimum directory block size is the file system block size (4 KB by default). The maximum directory block size is 64 KB.

At any given directory block size, a larger directory requires more I/O than a smaller directory, and a system with a larger directory block size consumes more processing power per I/O operation than one with a smaller directory block size. It is therefore recommended to keep directories and directory block sizes as small as possible for your workload.

Red Hat recommends the directory block sizes shown below for file systems with no more than the listed number of entries for read-heavy and write-heavy workloads.

Table 5.1 Maximum recommended directory entries for directory block sizes

Directory block size | Max. entries (read-heavy) | Max. entries (write-heavy)
4 KB                 | 100000-200000             | 1000000-2000000
16 KB                | 100000-1000000            | 1000000-10000000
64 KB                | >1000000                  | >10000000

See the XFS documentation for the effect of directory block size on read and write workloads in file systems of different sizes.

Allocation group

An allocation group is an independent structure that indexes free space and allocated inodes across a section of the file system. Each allocation group can be modified independently as long as concurrent operations affect different allocation groups, so XFS can perform allocation and de-allocation operations concurrently. The number of concurrent operations that can be performed in the file system is therefore equal to the number of allocation groups. However, because the ability to perform concurrent operations is also limited by the number of processors available, Red Hat recommends that the number of allocation groups be greater than or equal to the number of processors in the system.

A single directory cannot be modified by multiple allocation groups simultaneously. Therefore, Red Hat recommends that applications that create and remove large numbers of files do not store all of their files in a single directory.
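For illustration only, the number of allocation groups can be set explicitly at format time; the agcount value below is an assumption and should match or exceed the processor count of the target system:

# mkfs.xfs -d agcount=16 /dev/sdX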

Growth constraint

If you may need to increase the size of the file system after formatting, consider that the allocation group size cannot be changed after formatting: it must be set according to the eventual capacity of the file system, not its initial capacity. The number of allocation groups in a fully grown file system should not exceed several hundred, unless the allocation groups are at their maximum size (1 TB). Therefore, for most file systems, the recommended maximum growth to allow for is ten times the initial size.

Additional care must be taken when growing a file system on a RAID array, because the device size must align to an exact multiple of the allocation group size so that new allocation group headers are correctly aligned on the newly added storage. The new storage must also have the same geometry as the existing storage, since geometry cannot be changed after formatting; storage with a different geometry on the same block device therefore cannot be optimized.

INode size and inline attributes

If the inode has sufficient free space, XFS can write attribute names and values directly into the inode. Because no additional I/O is needed, these inline attributes can be retrieved and modified up to an order of magnitude faster than attributes stored in separate attribute blocks.

The default inode size is 256 bytes, of which only around 100 bytes are available for attribute storage, depending on the number of data extent pointers stored in the inode. Increasing the inode size when formatting the file system increases the amount of space available for storing attributes.

Both attribute names and attribute values are limited to a maximum size of 254 bytes. If either the name or the value exceeds 254 bytes, the attribute is pushed to a separate attribute block instead of being stored inline.
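A minimal sketch of increasing the inode size at format time to leave more room for inline attributes (the size shown is an assumption; the device name is illustrative):

# mkfs.xfs -i size=512 /dev/sdX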

RAID

If software RAID is in use, mkfs.xfs automatically configures itself with an appropriate stripe unit and width for the underlying hardware. However, if hardware RAID is in use, the stripe unit and width may need to be configured manually, because not all hardware RAID devices export this information. Use the mkfs.xfs -d option to configure the stripe unit and width. See the mkfs.xfs man page for more information.
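As a sketch under assumed geometry (a 64 KB stripe unit across 8 data disks), the stripe unit and width can be given explicitly when the hardware RAID controller does not export them:

# mkfs.xfs -d su=64k,sw=8 /dev/sdX   # su = stripe unit, sw = number of data disks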

Log size

Pending changes are accumulated in memory until synchronization events are triggered, at which point they are written to the log. Log size determines the number of modifications in progress at the same time. It also determines the maximum number of changes that can be accumulated in memory, thus determining how often the recorded data is written to disk. Compared with large logs, small logs cause data to be written back to disk more frequently. However, large logs use more memory to record pending changes, so systems with limited memory will not benefit from large logs.

Logs perform better when they are aligned to the underlying stripe unit; in other words, when they start and end at stripe unit boundaries. Use the mkfs.xfs -d option to align logs to stripe units; see the mkfs.xfs man page for more information.

Use the following mkfs.xfs options to configure the log size and replace logsize with the log size:

# mkfs.xfs -l size=logsize

See the mkfs.xfs man page for more details:

$man mkfs.xfs

Log stripe unit

Log writes on storage devices that use RAID5 or RAID6 layouts may perform better when they start and end at stripe boundaries (aligned to the underlying stripe unit). mkfs.xfs attempts to set an appropriate log stripe unit automatically, but this depends on the RAID device exporting this information. Setting a large log stripe unit can harm performance if your workload triggers synchronization events very frequently, because small writes must be padded to the size of the log stripe unit, which increases latency. If your workload is bound by log write latency, Red Hat recommends setting the log stripe unit to 1 block so that log writes do not need to be padded. The maximum supported log stripe unit is the size of the maximum log buffer (256 KB). It is therefore possible for the underlying storage to have a larger stripe unit than can be configured on the log; in that case, mkfs.xfs issues a warning and sets a log stripe unit of 32 KB. Configure the log stripe unit with one of the following options, where N is the number of blocks to use as the stripe unit and size is the stripe unit size in KB.

mkfs.xfs -l sunit=Nb
mkfs.xfs -l su=size

See the mkfs.xfs man page for more details:

$ man mkfs.xfs

2 Mount options

INode allocation

The inode64 mount option is strongly recommended for file systems larger than 1 TB. It configures XFS to allocate inodes and data across the entire file system, which ensures that inodes are not allocated largely at the beginning of the file system and data is not allocated largely at the end, improving performance on large file systems.

Log buffer size and number

The larger the log buffer, the fewer I/O operations it takes to write all changes to the log. A larger log buffer can improve performance on systems with I/O-intensive workloads that do not have a non-volatile write cache.

The log buffer size is configured with the logbsize mount option and defines the maximum amount of information stored in a log buffer. If a log stripe unit is not set, buffer writes can be shorter than the maximum, so there is no need to reduce the log buffer size for synchronization-heavy workloads.

The default log buffer size is 32 KB and the maximum is 256 KB. Other supported sizes are 64 KB, 128 KB, or power-of-2 multiples of the log stripe unit between 32 KB and 256 KB.

The number of log buffers is defined by the logbufs mount option. The default (and maximum) value is 8; the minimum is 2. There is usually no need to reduce the number of log buffers, except on memory-constrained systems that cannot afford to allocate memory for the additional buffers. Reducing the number of log buffers tends to reduce log performance, especially for workloads sensitive to I/O latency.
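A hedged sketch combining the mount options discussed above for a large XFS file system (the device, mount point, and values are illustrative):

# mount -o inode64,logbsize=256k,logbufs=8 /dev/sdX /mnt/data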

Delayed change logging

XFS has the option to aggregate changes in memory before they are written to the log. The delaylog parameter allows frequently modified metadata to be written to the log periodically, rather than every time it changes. This option increases the potential number of operations lost in a crash and the amount of memory used to track metadata. However, it can increase metadata modification speed and scalability by an order of magnitude, and it does not reduce data or metadata integrity when fsync, fdatasync, or sync are used to ensure that data and metadata are written to disk.

2 Adjust ext4 1 Formatting options

INode table initialization

Initializing all inodes in the file system can take a very long time on very large file systems. By default, initialization is deferred (lazy inode table initialization is enabled). However, if the system does not have an ext4 driver, lazy inode table initialization is disabled by default; it can be enabled by setting lazy_itable_init to 1. In that case, kernel processes continue to initialize the file system after it is mounted.
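For illustration, lazy inode table initialization can be requested explicitly at format time (the device name is illustrative):

# mkfs.ext4 -E lazy_itable_init=1 /dev/sdX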

2 Mount option

Initialization rate of inode table

When lazy inode table initialization is enabled, you can control the rate at which initialization occurs by specifying a value for the init_itable parameter. The amount of time spent performing background initialization is approximately equal to 1 divided by this parameter's value. The default value is 10.

Custom automatic file synchronization

Some applications do not correctly perform an fsync after renaming an existing file, or after truncating and rewriting it. By default, ext4 automatically synchronizes files after each of these operations; however, this can be time-consuming.

If this level of synchronization is not required, you can disable this behavior by specifying the noauto_da_alloc option when mounting. If noauto_da_alloc is set, the application must explicitly use fsync to ensure data persistence.

Journal I/O priority

By default, journal I/O has a priority of 3, which is slightly higher than that of regular I/O. The journal_ioprio mount parameter controls the priority of journal I/O. Valid values for journal_ioprio range from 0 to 7, with 0 being the highest-priority I/O.
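A minimal sketch that combines the ext4 mount options from this section (the device, mount point, and chosen values are illustrative):

# mount -o noauto_da_alloc,init_itable=10,journal_ioprio=2 /dev/sdX /mnt/data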

3 adjust btrfs

Since Red Hat Enterprise Linux 7.0, btrfs has been available to users as a Technology Preview. This chapter will be updated in the future if btrfs becomes fully supported.

4 Adjust GFS2

This section covers the tuning parameters available to the GFS2 file system when formatting and mounting.

Directory spacing: all directories created in the top-level directory of the GFS2 mount point are automatically spaced to reduce fragmentation and improve write speed in those directories. To space another directory like a top-level directory, mark that directory with the T attribute as shown, replacing dirname with the path to the directory you want to space.

# chattr +T dirname

Chattr is provided to users as part of the e2fsprogs software package.

Reduce contention: GFS2 uses a global locking mechanism that requires communication between the nodes of a cluster. Contention for files and directories between multiple nodes degrades performance. By minimizing the areas of the file system that are shared between multiple nodes, you minimize the risk of cross-node cache invalidation.

4 Network

The networking subsystem is made up of many different parts with sensitive connections. Red Hat Enterprise Linux 7 networking is therefore designed to provide optimal performance for most workloads and to optimize its performance automatically, so manual tuning of network performance is not usually necessary. This chapter discusses further optimizations that can be made to a working network system.

1 Considerations

Before tuning, users should have a thorough understanding of how packet reception works in Red Hat Enterprise Linux.

2 Basic principles 1 How network packets are received and processed

A packet sent to a Red Hat Enterprise Linux system is received by the NIC (network interface card) and placed in the kernel hardware buffer, or ring buffer. The NIC then raises a hardware interrupt request, which prompts a software interrupt operation to handle it. As part of the software interrupt operation, the packet is transferred from the buffer to the network stack. Depending on the packet and the user's network configuration, the packet is then forwarded, discarded, or passed to an application's socket receive queue and removed from the network stack. This process continues until there are no packets left in the NIC hardware buffer, or until a certain number of packets (specified in /proc/sys/net/core/dev_weight) have been transferred.

2 factors affecting network performance

The most common network performance problems are caused by hardware failures or infrastructure layer failures.

Packet acceptance bottleneck

Although the network stack is largely self-optimizing, there are a number of points in network stack processing where packets can stall and become bottlenecks, reducing performance.

NIC hardware buffer or ring buffer

The hardware buffer may become a bottleneck if a large number of packets are being dropped; use ethtool to monitor dropped packets on the system.

Hardware or software interrupt queue

Interrupts increase latency and compete for processors.

Socket receive queue for the application

A bottleneck in an application's receive queue is indicated by a large number of packets that are not copied to the requesting application, or by an increase in UDP input errors (InErrors) in /proc/net/snmp.

3 Monitoring and diagnosing performance issues 1 ss

ss is a command-line utility that prints statistical information about sockets, allowing administrators to assess device performance over time. By default, ss lists open, non-listening TCP sockets that have established connections, but it provides a number of useful options to help administrators filter out statistics about specific sockets.

Red Hat recommends using ss instead of netstat in Red Hat Enterprise Linux 7.
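For example, a few hedged ss invocations that narrow the output to specific sockets (the port number is illustrative):

# ss -s                                          # summary counts per socket type
# ss -tnp state established '( dport = :443 )'   # established TCP connections to port 443, with owning process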

2 ip

The ip utility allows administrators to manage and monitor routes, devices, routing policies, and tunnels. The ip monitor command continuously monitors the state of devices, addresses, and routes.

3 dropwatch

Dropwatch is an interactive tool for monitoring and recording kernel-discarded packets.

4 ethtool

The ethtool utility allows administrators to view and edit network interface card settings. It is useful for observing statistics about a specific device, such as the number of packets dropped by that device.

Use ethtool -S followed by the name of the device you want to monitor to view that device's counters.

ethtool -S devname

5 /proc/net/snmp

The /proc/net/snmp file displays data that is used by the snmp agent for IP, ICMP, TCP, and UDP monitoring and management. Examining this file regularly helps administrators spot unusual values and thereby identify potential performance problems. For example, an increase in UDP input errors (InErrors) in /proc/net/snmp can indicate a bottleneck in a socket receive queue.

[root@centos8 ~]# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
IcmpMsg: InType3 InType8 OutType0 OutType3
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
(each header line is followed by a line of counter values; output abbreviated)

4 Configuration tools 1 Network performance tuned-adm profiles

tuned-adm provides a number of profiles to improve performance in specific use cases, including:

latency-performance

network-latency

network-throughput

2 configure hardware buffer

If a large number of packets are discarded in the hardware buffer, there are many possible solutions.

Slow down input traffic

Filter incoming traffic, reduce the number of multicast groups joined, or reduce broadcast traffic to reduce queue filling rates.

Adjust the hardware buffer queue

Reduce the number of packets being dropped by increasing the size of the queue so that it does not overflow as easily. Use the ethtool command to change the rx/tx ring parameters of the network device:

ethtool --set-ring devname value

Change the drain rate of the queue

Device weight refers to the number of packets a device can receive at one time (in a single scheduled processor access). You can increase the rate at which a queue is drained by increasing its device weight, which is controlled by the dev_weight parameter. This parameter can be changed temporarily by altering the contents of the /proc/sys/net/core/dev_weight file, or permanently with sysctl, which is provided by the procps-ng package.

Altering the drain rate of a queue is usually the simplest way to mitigate poor network performance. However, increasing the number of packets a device can receive at one time consumes additional processor time, during which no other processes can be scheduled, so this can cause other performance problems.
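A sketch of raising the device weight temporarily and through sysctl (the value is illustrative):

# echo 128 > /proc/sys/net/core/dev_weight   # temporary, lost at reboot
# sysctl -w net.core.dev_weight=128          # same effect via sysctl

To make the change persistent, the setting can be placed in a file under /etc/sysctl.d/.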

3 configure interrupt queue

If the analysis shows high latency, the system may benefit from polling-based packet reception rather than interrupt-based packet reception.

1 Configure busy polling

Busy polling helps reduce latency in the network receive path by allowing socket layer code to poll the receive queue of a network device and by disabling network interrupts. This removes delays caused by interrupts and the resulting context switches. However, it increases CPU utilization, and busy polling prevents the CPU from sleeping, which incurs additional power consumption.

Busy polling is disabled by default. To enable busy polling on specific sockets, do the following (a sysctl sketch follows this list):

Set sysctl.net.core.busy_poll to a value other than 0. This parameter controls the number of microseconds to wait for packets on the device queue for socket poll and select calls. Red Hat recommends a value of 50.

Add the SO_BUSY_POLL socket option to the socket.

To enable busy polling globally, also set sysctl.net.core.busy_read to a value other than 0. This parameter controls the number of microseconds to wait for packets on the device queue for socket reads, and it sets the default value of the SO_BUSY_POLL option. Red Hat recommends a value of 50 for a small number of sockets and 100 for large numbers of sockets. For extremely large numbers of sockets (more than several hundred), use epoll instead.
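A minimal sketch of enabling busy polling globally with the values recommended above (apply with sysctl; persist in /etc/sysctl.d/ if desired):

# sysctl -w net.core.busy_poll=50
# sysctl -w net.core.busy_read=50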

Busy polling is supported by the following drivers. Red Hat Enterprise Linux 7 supports these drivers.

bnx2x

be2net

ixgbe

mlx4

myri10ge

4 configure socket receive queue

If analysis suggests that packets are being dropped because the socket queue drains too slowly, there are several ways to alleviate the resulting performance problems.

"reduce the speed of incoming traffic

The queue filling rate can be reduced by filtering or discarding packets before entering the queue, or by reducing the weight of the device.

Device weight refers to the number of packets that a device can receive at a time. The device weight is controlled by the dev_weight parameter, which can be changed temporarily by changing the contents of the / proc/sys/net/core/dev_weight file, or permanently by using sysctl

[root@centos8 ~]# cat /proc/sys/net/core/dev_weight
64

Increase queue depth

Increasing the depth of an application's socket queue is usually the easiest way to improve the drain rate of a socket queue, but it is unlikely to be a long-term solution.

To increase the queue depth, you need to increase the size of the socket receive buffer. You can make the following changes:

Increase the value of /proc/sys/net/core/rmem_default

This parameter controls the default size of the receive buffer used by sockets. Its value must be no greater than the value of /proc/sys/net/core/rmem_max.

Use setsockopt to configure a large SO_RCVBUF value

This parameter controls the maximum size of a socket's receive buffer, in bytes. Use the getsockopt system call to determine the current value of the buffer.
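A hedged sketch of raising the default and maximum receive buffer sizes (the values are illustrative; rmem_default must not exceed rmem_max):

# sysctl -w net.core.rmem_max=8388608
# sysctl -w net.core.rmem_default=1048576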

5 configure RSS

RSS (Receive Side Scaling), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.

To determine whether your network interface card supports RSS, check whether multiple interrupt request queues are associated with the interface in /proc/interrupts. For example, if you are interested in the p1p1 interface:

# egrep 'CPU|p1p1' /proc/interrupts
      CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
89:   40187      0       0       0       0       0   IR-PCI-MSI-edge   p1p1-0
90:       0    790       0       0       0       0   IR-PCI-MSI-edge   p1p1-1
91:       0      0     959       0       0       0   IR-PCI-MSI-edge   p1p1-2
92:       0      0       0    3310       0       0   IR-PCI-MSI-edge   p1p1-3
93:       0      0       0       0     622       0   IR-PCI-MSI-edge   p1p1-4
94:       0      0       0       0       0    2475   IR-PCI-MSI-edge   p1p1-5

The output shows that the NIC driver created six receive queues (p1p1-0 through p1p1-5) for the p1p1 interface. It also shows how many interrupts were processed by each queue and which CPU serviced them. In this case there are six queues because, by default, this particular NIC driver creates one queue per CPU and the system has six CPUs. This is a fairly common pattern among NIC drivers. Alternatively, you can check the output of ls -1 /sys/devices/*/*/device_pci_address/msi_irqs after the network driver is loaded.

For example, if you are interested in a device with a PCI address of 0000:01:00.0, you can list its interrupt request queues with the following command:

# ls -1 /sys/devices/*/*/0000:01:00.0/msi_irqs
101
102
103
104
105
106
107
108
109

RSS is enabled by default. The number of queues (or the CPUs that should process network activity) for RSS is configured in the appropriate network device driver: the bnx2x driver uses the num_queues parameter, and the sfc driver uses the rss_cpus parameter. Regardless of the driver, it is typically configured in /sys/class/net/device/queues/rx-queue/, where device is the name of the network device (such as eth2) and rx-queue is the name of the appropriate receive queue.

When configuring RSS, Red Hat recommends limiting the number of queues to one per physical CPU core. Hyper-threads are often represented as separate cores in analysis tools, but configuring queues for all cores, including logical cores such as hyper-threads, has not proven beneficial to network performance.

When enabled, RSS distributes network processing equally between the available CPUs based on the amount of processing each CPU has queued. However, you can use the ethtool --show-rxfh-indir and --set-rxfh-indir parameters to change how network activity is distributed and to weight certain types of network activity as more important than others.
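For example, a sketch of inspecting and adjusting the RSS indirection table with ethtool (eth0 and the queue count are illustrative):

# ethtool --show-rxfh-indir eth0           # display the current indirection table
# ethtool --set-rxfh-indir eth0 equal 4    # spread flows evenly over the first 4 receive queues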

The irqbalance daemon can be used in conjunction with RSS to reduce the likelihood of cross-node memory transfers and cache line bouncing, which lowers the latency of processing network packets.

6 configure RPS

RPS (Receive Packet Steering) is similar to RSS in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, which helps prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic.

RPS has several advantages over hardware-based RSS:

RPS can be used with any network interface card.

It is easy to add software filters to RPS to handle new protocols.

RPS does not increase the hardware interrupt rate of the network device, although it does introduce inter-processor interrupts.

RPS is configured per network device and receive queue, in the /sys/class/net/device/queues/rx-queue/rps_cpus file, where device is the name of the network device (such as eth0) and rx-queue is the name of the appropriate receive queue (such as rx-0).

The default value of the rps_cpus file is 0. This disables RPS, so the CPU that handles the network interrupt also processes the packet.

To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue.

The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of its position in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15).
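A minimal sketch of enabling RPS for CPUs 0-3 on one receive queue (the device and queue names are illustrative):

# echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# cat /sys/class/net/eth0/queues/rx-0/rps_cpus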

For network devices with single transmit queues, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance.

For network devices with multiple queues, there is typically no benefit to configuring both RPS and RSS, because RSS already maps a CPU to each receive queue by default. However, RPS may still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain.

7 configure RFS

RFS (Receive Flow Steering) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS back end to calculate the most appropriate CPU and then forwards packets based on where the application consuming the packet is running. This increases CPU cache efficiency.

RFS is disabled by default. To enable RFS, the user must edit two files:

/proc/sys/net/core/rps_sock_flow_entries

Set the value of this file to the maximum expected number of concurrently active connections. The recommended value for moderate server loads is 32768. All values entered are rounded up to the nearest power of 2.

/sys/class/net/device/queues/rx-queue/rps_flow_cnt

Change device to the name of the network device you want to configure (for example, eth0) and rx-queue to the name of the receive queue you want to configure (for example, rx-0).

Set the value of this file to the value of rps_sock_flow_entries divided by N, where N is the number of receive queues on the device. For example, if rps_sock_flow_entries is set to 32768 and there are 16 configured receive queues, rps_flow_cnt should be set to 2048. For single-queue devices, the value of rps_flow_cnt is the same as the value of rps_sock_flow_entries.
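A sketch using the example values above, assuming a device named eth0 with 16 receive queues:

# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
# for q in /sys/class/net/eth0/queues/rx-*; do echo 2048 > $q/rps_flow_cnt; done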

Data received from a single sender is not sent to more than one CPU. If the amount of data received from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the number of interrupts, and therefore the amount of processing work for that CPU. Alternatively, consider NIC offload options or faster CPUs.

Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets, or NUMA nodes. This can help prevent packets from being processed out of order.

8 Configure accelerated RFS

Accelerated RFS boosts the speed of RFS by adding hardware assistance. Like RFS, packets are forwarded based on where the application consuming the packet is running. Unlike traditional RFS, however, packets are sent directly to a CPU that is local to the thread consuming the data: either the CPU executing the application, or a CPU local to that CPU in the cache hierarchy.

Accelerated RFS can be used only if the following conditions are met (a sketch of checking and enabling ntuple filtering follows the list):

The network interface card must support accelerated RFS. Accelerated RFS is supported by cards that export the ndo_rx_flow_steer() netdevice function.

Ntuple filtering must be enabled.
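For illustration, ntuple filtering is usually checked and toggled with ethtool (eth0 is illustrative; support depends on the NIC driver):

# ethtool -k eth0 | grep ntuple    # check whether ntuple filtering is available and enabled
# ethtool -K eth0 ntuple on        # enable ntuple filtering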

Once these conditions are met, the CPU-to-queue mapping is deduced automatically based on the traditional RFS configuration; that is, it is deduced from the IRQ affinities configured by the driver for each receive queue. See the section on configuring RFS above for details on traditional RFS. Red Hat recommends using accelerated RFS wherever RFS can be used and the network interface card supports hardware acceleration.

5 other network tuning parameters

net.ipv4.tcp_max_tw_buckets

The maximum number of sockets held in the TIME_WAIT state at the same time. The default is 8192.

net.ipv4.ip_local_port_range = 1024 65000

The range of local ports the system is allowed to use, with the lower bound first and the upper bound second. The default is 32768 61000.

Note: this usable range determines the upper limit on the number of connections that can end up in the TIME_WAIT state. The following two options can effectively reduce the number of connections in that state.

net.ipv4.tcp_tw_recycle = {0|1}

Whether to enable fast recycling of TIME_WAIT sockets. Note: enabling this feature can cause serious problems in NAT environments, because TCP caches the most recent timestamp seen for each connection; subsequent requests whose timestamps are older than the cached timestamp are considered invalid and the corresponding packets are dropped. Whether Linux applies this behavior depends on tcp_timestamps and tcp_tw_recycle; the former is enabled by default, so enabling tcp_tw_recycle activates this behavior.

In a NAT environment, tcp_tw_recycle should therefore be disabled for safety. An alternative is to set tcp_timestamps to 0; with tcp_timestamps set to 0, setting tcp_tw_recycle to 1 has no effect.

net.ipv4.tcp_tw_reuse = {0|1}

Whether to enable TIME_WAIT reuse, that is, whether sockets in the TIME_WAIT state may be reused for new TCP connections.

net.ipv4.tcp_syncookies = {0|1}

Whether to enable SYN cookies. When the SYN backlog queue overflows, cookies are used so that as many connection requests as possible can still be accommodated.

net.ipv4.tcp_timestamps = 0

Controls TCP packet timestamps. Timestamps help avoid problems with sequence number wrap-around (the sequence number coming back around); setting this parameter to 0 disables them.

net.ipv4.tcp_max_syn_backlog = 262144

The maximum number of remembered connection requests that have not yet received an acknowledgement from the client. The default is 128; it can be increased.

net.ipv4.tcp_synack_retries = #   (the number of times the server retransmits SYN+ACK when no response is received from the client)

To open the return path of a connection, the kernel must send a SYN with an ACK that answers the earlier SYN; this is the second packet of the three-way handshake. This setting determines the number of SYN+ACK packets sent before the kernel gives up on the connection. On busy servers it is recommended to set this to 0 or 1.

net.ipv4.tcp_syn_retries = #

The number of SYN retransmissions when the host actively initiates a TCP connection; that is, the number of SYN packets the kernel sends before giving up on establishing the connection. On busy servers it is recommended to set this to 0 or 1.

net.ipv4.tcp_max_orphans = 2622144

The maximum number of TCP sockets in the system that are not attached to any user file handle. If this number is exceeded, orphaned connections are reset immediately and a warning is printed. This limit exists only to prevent simple denial-of-service attacks; do not rely on it or artificially lower the value. If it needs to be changed, and memory permits, it should be increased.

net.ipv4.tcp_fin_timeout = 5

If the socket is closed by the local end, this parameter determines how long it stays in the FIN-WAIT-2 state. The default is 60 seconds.

However, the peer may misbehave or crash unexpectedly and never close the connection. Even on a lightly loaded web server there is a risk of accumulating a large number of dead sockets and running out of memory. FIN-WAIT-2 is less dangerous than FIN-WAIT-1, because each such connection consumes at most about 1.5 KB of memory, but these connections can live longer.

net.ipv4.tcp_keepalive_time = 30

The interval at which TCP sends keepalive messages when keepalive is enabled. The default is 2 hours.

Parameters beginning with net.core mainly define buffer sizes, in bytes, that are not specific to TCP.

net.core.rmem_max = 8388608

Defines the maximum receive buffer size the kernel uses for all types of connections.

net.core.rmem_default = 65536

Defines the default receive buffer size the kernel uses for all types of connections.

net.core.wmem_max = 8338608

Defines the maximum send buffer size the kernel uses for all types of connections.

net.core.wmem_default = 65536

Defines the default send buffer size the kernel uses for all types of connections.

net.ipv4.tcp_mem = '8388608 8388608 8388608'

Defines the memory available to the TCP protocol stack; the three values are the minimum, pressure, and maximum thresholds.

net.ipv4.tcp_rmem = '4096 37380 8388608'

Defines the memory used for receive buffers in the TCP protocol stack. The first value is the minimum: even when memory on the host is tight, the TCP stack is guaranteed at least this much receive buffer space. The second value is the default, and it overrides the receive buffer size defined for all protocols by net.core.rmem_default. The third value is the maximum amount of memory that can be used for a TCP receive buffer.

net.ipv4.tcp_wmem = '4096 65535 8388608'

Defines the memory used by the TCP protocol stack for send buffers, with the same minimum, default, and maximum semantics as tcp_rmem.
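A hedged sketch of making a few of these settings persistent by placing them in a drop-in file and loading it (the file name and chosen values are illustrative):

# cat /etc/sysctl.d/99-network-tuning.conf
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 262144
net.core.rmem_max = 8388608
# sysctl -p /etc/sysctl.d/99-network-tuning.conf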

5 Related tools in detail 1 irqbalance

irqbalance is a command-line tool that distributes hardware interrupts across processors to improve system performance. By default it runs as a daemon, but it can be run once only with the --oneshot option.

The following parameters can be used to improve performance

--powerthresh

Sets the number of CPUs that can be idle before a CPU is placed into powersave mode. If more CPUs than the threshold are more than one standard deviation below the average softirq workload, no CPU is more than one standard deviation above the average, and more than one IRQ is assigned to them, a CPU is placed into powersave mode. In powersave mode, the CPU is not part of IRQ balancing, so it is not woken unnecessarily.

--hintpolicy

Determines how irq kernel affinity hints are handled. Valid values are exact (the irq affinity hint is always applied), subset (the irq is balanced, but the assigned object is a subset of the affinity hint), or ignore (the irq affinity hint is ignored completely).

--policyscript

Defines the location of a script to execute for each interrupt request, with the device path and irq number passed as arguments and a zero exit code expected by irqbalance. The script can specify zero or more key-value pairs to guide irqbalance in managing the passed irq.

The following are valid key-value pairs:

ban

Valid values are true (exclude the passed irq from balancing) or false (perform balancing on this irq).

balance_level allows the user to override the balance level of the passed irq. By default, the balance level is based on the PCI device class of the device that owns the irq. Valid values are none, package, cache, or core. numa_node allows the user to override the NUMA node that is considered local to the passed irq. If information about the local node is not specified in ACPI, devices are considered equidistant from all nodes. Valid values are integers (starting from 0) that identify a specific NUMA node, and -1, which specifies that the irq should be considered equidistant from all nodes.

--banirq

Adds an interrupt with the specified interrupt request number to the list of prohibited interrupts.

You can also use the IRQBALANCE_BANNED_CPUS environment variable to specify a mask of CPUs for irqbalance to ignore.
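A brief hedged illustration of the options above (the CPU mask and irq number are placeholders):

# keep irqbalance away from CPU 0 and perform a single balancing pass
IRQBALANCE_BANNED_CPUS=00000001 irqbalance --oneshot

# exclude irq 42 from balancing
irqbalance --banirq=42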

2 tuna

Tuna enables you to control processor and scheduling affinity. This section covers the command-line interface, but a graphical interface with the same range of functionality is also available; run the command tuna to start the graphical tool.

Tuna accepts a variety of command-line arguments, which are processed sequentially. The following command distributes the load across four sockets:

tuna --socket 0 --isolate \
     --thread my_real_time_app --move \
     --irq serial --socket 1 --move \
     --irq eth* --socket 2 --spread \
     --show_threads --show_irqs

--gui

Opens the graphical user interface.

--cpus

Takes a comma-separated list of CPUs to be controlled by Tuna. The list is in effect until a new list is specified.

--config_file_apply

Takes the name of a profile to apply to the system.

--config_file_list

Lists the preloaded configuration profiles.

--cgroup

Used in conjunction with --show_threads. If control groups are enabled, displays the type of control group that the processes displayed by --show_threads belong to.

--affect_children

When specified, Tuna affects child threads as well as parent threads.

--filter

Filters the display, showing only the affected entities.

--isolate

Takes a comma-separated list of CPUs. Tuna migrates threads away from the specified CPUs.

--include

Takes a comma-separated list of CPUs; Tuna allows all threads to run on the specified CPUs.

--no_kthreads

When this parameter is specified, Tuna does not affect kernel threads.

--move

Moves the selected entities to the specified CPUs.

--priority

Specifies the thread scheduler policy and priority. Valid scheduler policies are OTHER, FIFO, RR, BATCH, or IDLE.

When the policy is FIFO or RR, valid priority values are integers from 1 (lowest) to 99 (highest). The default value is 1. For example, tuna --threads 7861 --priority=RR:40 sets an RR (round-robin) policy and a priority of 40 for thread 7861.

When the policy is OTHER, BATCH, or IDLE, the only valid priority value is 0, which is also the default.

--show_threads

Displays a list of threads.

--show_irqs

Displays a list of irqs.

--irqs

Takes a comma-separated list of IRQs to be affected by Tuna. The list is in effect until a new list is specified. IRQs can be added to the list using + and removed using -.

--save

Saves the kernel thread schedule to the specified file.

--sockets

Takes a comma-separated list of CPU sockets to be controlled by Tuna. This option takes the topology of the system into account, such as cores that share a single processor cache and are on the same physical chip.

--threads

Takes a comma-separated list of threads to be controlled by Tuna. The list is in effect until a new list is specified. Threads can be added to the list using + and removed using -.

--no_uthreads

Prevents operations from affecting user threads.

--what_is

To get further help, view the selected entities.

--spread

Distributes the threads specified by --threads evenly across the CPUs specified by --cpus.
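As a further sketch of how these options combine (the CPU number and thread name are placeholders, not taken from the text above):

# clear CPU 2 of other threads, then move a latency-sensitive thread onto it with FIFO priority 50
tuna --cpus=2 --isolate \
     --threads=my_rt_thread --move --priority=FIFO:50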

3 ethtool

The ethtool tool allows administrators to view and edit network interface card settings. This helps to observe the statistics of some devices, such as the number of packets dropped by the device.
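For example (the interface name eth0 is an assumption; substitute the actual device):

# show driver statistics for eth0, including drop counters
ethtool -S eth0

# show the current and maximum ring buffer sizes for eth0
ethtool -g eth0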

4 ss

ss is a command-line tool that displays socket statistics, allowing administrators to assess device performance over time. By default, ss lists open, non-listening TCP sockets that have established connections, but it provides administrators with a number of useful options for filtering out statistics about specific sockets.

ss -tmpie is a commonly used command that displays all TCP sockets (t), internal TCP information (i), socket memory usage (m), processes using the socket (p), and detailed socket information (e).
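A short hedged example of such filtering (the port number is arbitrary):

# established TCP sockets talking to or from port 443, with timer information
ss -o state established '( dport = :443 or sport = :443 )'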

5 tuned

tuned is a daemon that tunes operating system settings according to a selected profile to improve performance under a given workload.

It can be configured to react to changes in CPU and network use, adjusting settings to improve performance in active devices and to reduce power consumption in inactive devices.

To configure dynamic tuning behavior, edit the dynamic_tuning parameter in the /etc/tuned/tuned-main.conf file. You can also configure the number of seconds between usage checks and tuning updates with the update_interval parameter.
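A minimal sketch of those two parameters in /etc/tuned/tuned-main.conf (the values shown are illustrative):

# enable dynamic tuning and check usage every 10 seconds
dynamic_tuning = 1
update_interval = 10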

6 tuned-adm

tuned-adm is a command-line tool that provides a number of profiles to improve performance for specific use cases. It also provides a subcommand (tuned-adm recommend) that assesses the system and outputs a recommended tuning profile. This subcommand also sets the default profile for your system at installation time, and can be used to return to the default profile later.
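A few common invocations, shown as a hedged illustration:

# display the profile tuned-adm recommends for this system
tuned-adm recommend

# switch to the throughput-performance profile and confirm which profile is active
tuned-adm profile throughput-performance
tuned-adm active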

Since Red Hat Enterprise Linux 7, tuned-adm has the ability to run any command as part of enabling or disabling a tuning profile. This allows you to add environment-specific checks that are not available in tuned-adm itself, for example, checking whether the system is the primary database node before selecting which tuning profile to apply.

Red Hat Enterprise Linux 7 provides the include parameter in the configuration definition file, which allows you to build your own tuned-adm profile based on an existing one.

The following tuning profiles are provided with tuned-adm and are supported in Red Hat Enterprise Linux 7.

throughput-performance

A server profile focused on improving throughput. This is the default profile and is recommended for most systems.

By setting intel_pstate and max_perf_pct=100, this profile favors performance over energy savings. It enables transparent huge pages, uses cpupower to set the performance CPU frequency governor, and sets the input/output scheduler to deadline. It also sets kernel.sched_min_granularity_ns to 10 μs, kernel.sched_wakeup_granularity_ns to 15 μs, and vm.dirty_ratio to 40%.

latency-performance

A server profile focused on lowering latency. This profile is recommended for latency-sensitive workloads that benefit from c-state tuning and the increased TLB efficiency of transparent huge pages.

By setting intel_pstate and max_perf_pct=100, this profile favors performance over energy savings. It enables transparent huge pages, uses cpupower to set the performance CPU frequency governor, and requests a cpu_dma_latency value of 1.

network-latency

A server profile focused on lowering network latency.

By setting intel_pstate and max_perf_pct=100, this profile favors performance over energy savings. It disables transparent huge pages and automatic NUMA balancing. It uses cpupower to set the performance CPU frequency governor and requests a cpu_dma_latency value of 1. It also sets busy_read and busy_poll times to 50 μs, and tcp_fastopen to 3.

network-throughput

A server profile focused on improving network throughput.

By setting intel_pstate and max_perf_pct=100, this profile favors performance over energy savings. It enables transparent huge pages and uses cpupower to set the performance CPU frequency governor. It also sets kernel.sched_min_granularity_ns to 10 μs, kernel.sched_wakeup_granularity_ns to 15 μs, and vm.dirty_ratio to 40%.

virtual-guest

virtual-guest is a profile focused on optimizing performance in Red Hat Enterprise Linux 7 virtual machines.

By setting intel_pstate and max_perf_pct=100, this profile favors performance over energy savings. It lowers the swappiness of virtual memory, enables transparent huge pages, and uses cpupower to set the performance CPU frequency governor. It also sets kernel.sched_min_granularity_ns to 10 μs, kernel.sched_wakeup_granularity_ns to 15 μs, and vm.dirty_ratio to 40%.

virtual-host

virtual-host is a profile focused on optimizing performance on Red Hat Enterprise Linux 7 virtualization hosts.

By setting intel_pstate and max_perf_pct=100, this profile favors performance over energy savings. It lowers the swappiness of virtual memory, enables transparent huge pages, and writes dirty pages back to disk more aggressively. It uses cpupower to set the performance CPU frequency governor, and sets kernel.sched_min_granularity_ns to 10 μs, kernel.sched_wakeup_granularity_ns to 15 μs, kernel.sched_migration_cost to 5 ms, and vm.dirty_ratio to 40%.

7 perf

perf provides a number of useful commands, some of which are listed in this section. For more information about perf, see the Red Hat Enterprise Linux 7 Developer Guide, available at http://access.redhat.com/site/documentation/Red_Hat_Enterprise_Linux/, or see the man pages.

perf stat

This command provides overall statistics for common performance events, including instructions executed and clock cycles consumed. You can use option flags to gather statistics on events other than the default measurement events. As of Red Hat Enterprise Linux 6.4, it is possible to use perf stat to filter monitoring based on one or more specified control groups (cgroups).

See the man page for more information:

$ man perf-stat
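A short hedged example (the measured command is arbitrary):

# count cycles and instructions for a single command
perf stat -e cycles,instructions ls /tmp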

perf record

This command records performance data to a file that can then be analyzed using perf report. See the man page for more information:

$ man perf-record

perf report

This command reads performance data from a file and analyzes the recorded data. See the man page for more information:

$ man perf-report
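A hedged sketch of the record-then-report workflow:

# sample the whole system for 10 seconds, collecting call graphs
perf record -a -g -- sleep 10

# browse the recorded profile in perf.data
perf report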

perf list

This command lists the events available on a particular machine. These events vary depending on the performance monitoring hardware and software configuration of the system. See the man page for more information:

$ man perf-list

perf top

This command performs a function similar to that of the top tool. It generates and displays a performance counter profile in real time. See the man page for more information:

$ man perf-top

perf trace

This command performs a function similar to that of the strace tool. It monitors the system calls used by a specified thread or process and all signals received by that application. Additional trace targets are available; see the man page for a full list:

$ man perf-trace

8 x86_energy_perf_policy

The x86_energy_perf_policy tool allows administrators to define the relative importance of performance and energy efficiency. It is provided by the kernel-tools package.

To view the current policy, run the following command:

x86_energy_perf_policy -r

To set a new policy, run the following command:

x86_energy_perf_policy profile_name

Replace profile_name with one of the following profiles.

Performance

The processor does not degrade performance in order to save energy. This is the default value.

Normal

Processors can tolerate small performance degradation caused by potentially significant energy savings. This is a reasonable savings for most servers and desktops.

Powersave

Processors accept potentially significant performance degradation to take full advantage of energy efficiency.

9 turbostat

The turbostat tool provides details of the time it takes for the system to be in different states. Turbostat is provided by the kernel-tools software package.

By default, turbostat displays a summary of counter results for the entire system, followed by counter results every five seconds with the following headers:

Pkg

Processor package number.

Core

Processor core number.

CPU

Linux CPU (logical processor) number.

%c0

The percentage of the interval for which the CPU retired instructions.

GHz

The average clock speed while the CPU was in the c0 state.

TSC

The average clock speed over the course of the entire interval.

%c1, %c3, and %c6

The percentage of the interval for which the processor was in the c1, c3, or c6 state, respectively.

%pc3 or %pc6

The percentage of the interval for which the processor was in the pc3 or pc6 state, respectively.

Use the -i option to specify a different period between counter results; for example, run turbostat -i 10 to display results every 10 seconds.

10 numastat 1 Overview

numastat is provided by the numactl package and displays memory statistics (such as allocation hits and misses) for processes and the operating system on a per-NUMA-node basis.

2 numastat default tracking category

The default trace categories for the numastat command are as follows:

numa_hit

The number of pages that were successfully allocated on this node.

numa_miss

The number of pages that were allocated on this node because memory was low on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node.

numa_foreign

The number of pages initially intended for this node that were allocated to another node instead. Each numa_foreign event has a corresponding numa_miss event on the other node.

interleave_hit

The number of interleave policy pages successfully allocated on this node.

local_node

The number of pages successfully allocated on this node by a process on this node.

other_node

The number of pages allocated on this node by a process on another node.

Supplying any of the following options changes the displayed units to megabytes of memory (rounded to two decimal places), and changes other numastat behaviors as described below:

-c

Horizontally condenses the displayed table of information. This is useful on systems with a large number of NUMA nodes, although column width and inter-column spacing are somewhat unpredictable. When this option is used, the amount of memory is rounded to the nearest megabyte.

-m

Displays system-wide memory usage information on a per-node basis, similar to the information found in /proc/meminfo.

-n

Displays the same information as the original numastat command (numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, and other_node), but using an updated format with megabytes as the unit of measurement.

-p pattern

Displays per-node memory information for the specified pattern. If the value of pattern is composed of digits, numastat assumes that it is a numeric process identifier; otherwise, numastat searches process command lines for the specified pattern.

Command-line arguments entered after the value of the -p option are assumed to be additional patterns to filter on. Additional patterns expand, rather than narrow, the filter.

-s

Sorts the displayed data in descending order so that the biggest memory consumers (according to the total column) are listed first.

You can also specify a node, so that the table is sorted by the node column. When using this option, the node value must follow the -s option immediately, as shown here:

numastat -s2

Do not include white space between the option and its value.

-v

Displays more verbose information; namely, process information for multiple processes will display detailed information for each process.

-V

Displays numastat version information.

-z

Omits table rows and columns that contain only zero values from the displayed information. Note that some near-zero values that are rounded to zero for display purposes are not omitted from the output.
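A brief hedged example of the options above (the pattern qemu is hypothetical):

# condensed per-node memory usage, omitting zero-only rows and columns
numastat -c -m -z

# per-node memory information for processes whose command line matches 'qemu'
numastat -p qemu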

11 numactl 1 Overview

numactl allows administrators to run a process with a specified scheduling or memory placement policy. numactl can also set a persistent policy for shared memory segments or files, and set the processor affinity and memory affinity of a process.

2 parameters

numactl provides a number of useful options. This section outlines some of them and offers suggestions for their use, but it is not exhaustive.

--hardware

Displays an inventory of available nodes in the system, including the relative distances between nodes.

--membind

Ensures that memory is allocated only from the specified nodes. If there is not enough memory available at the specified location, the allocation fails.

--cpunodebind

Ensures that the specified command and its child processes execute only on the specified nodes.

--physcpubind

Ensures that the specified command and its child processes execute only on the specified processors.

--localalloc

Specifies that memory should always be allocated from the local node.

--preferred

Specifies a preferred node from which to allocate memory. If memory cannot be allocated from the specified node, another node is used as a fallback.
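A brief hedged example (node numbers and the application name are placeholders):

# run an application with both CPUs and memory confined to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_app

# prefer node 1 for allocations, falling back to other nodes if needed
numactl --preferred=1 ./my_app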

12 numad 1 Overview

numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within the system in order to dynamically improve NUMA resource allocation and management.

Note that when numad is enabled, its behavior overrides the default behavior of automatic NUMA balancing.

2 use numad on the command line

To use numad as an executable, simply run:

numad

While numad runs, its activities are logged in /var/log/numad.log. It continues to run until terminated with the following command:

numad -i 0

Terminating numad does not remove the changes it has made to improve NUMA affinity. If system use changes significantly, running numad again adjusts affinity to improve performance under the new conditions.

To restrict numad management to a specific process, start it with the following options:

numad -S 0 -p pid

-p pid

Adds the specified pid to an explicit inclusion list. The specified process will not be managed until it meets the numad process significance threshold.

-S 0

Sets the type of process scanning to 0, which limits numad management to explicitly included processes.

3 run numad as a service

While running as a service, numad attempts to tune the system dynamically based on the current workload. Its activities are logged in /var/log/numad.log.

To start the service, run:

systemctl start numad.service

To make the service persist across reboots, run:

chkconfig numad on
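On systemd-based systems such as Red Hat Enterprise Linux 7, the following should achieve the same persistence (offered as an alternative to the chkconfig form above):

systemctl enable numad.service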

13 taskset

The taskset tool is provided by the util-linux package. It allows administrators to retrieve and set the processor affinity of a running process, or to launch a process with a specified processor affinity.

Important: taskset does not guarantee local memory allocation. If you require the additional performance benefits of local memory allocation, Red Hat recommends using numactl instead of taskset.

To set the CPU affinity of a running process, run the following command:

taskset -pc processors pid

Replace processors with a comma-separated list of processors or ranges of processors (for example, 1,3,5-7). Replace pid with the process identifier of the process that you want to reconfigure.

To launch a process with a specified affinity, run the following command:

taskset -c processors -- application

Replace processors with a comma-separated list of processors or ranges of processors. Replace application with the command, options, and arguments of the application that you want to run.
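As a concrete hedged illustration (the pid and application name are placeholders):

# pin an already-running process with PID 7861 to CPUs 0 and 2
taskset -pc 0,2 7861

# launch an application restricted to CPUs 1 and 3
taskset -c 1,3 -- ./my_app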
