1. Neutron problem: openvswitch hangs, interrupting all networking on the host.
Problem: The L3 agent is down, no network can be reached, and the public IP address of the physical machine hosting the L3 agent is inaccessible.
The DHCP agent is also down, so none of the virtual machines can obtain an address, and the VMs' public IP addresses are unreachable.
Phenomenon:
The network is down and all public IP addresses are unreachable.
Troubleshooting process:
1. View the running status of openvswitch
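For example, on CentOS 6 (the platform this case was seen on) the daemon state can be checked with something like the following (a sketch; service and process names are assumed):
service openvswitch status
ps -ef | grep -E 'ovsdb-server|ovs-vswitchd'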
2. Trace the flow of data traffic:
View all OVS bridges:
ovs-vsctl show
View the OVS flows:
ovs-ofctl dump-flows br-int
The dump-flows command hangs with no response.
Because ovs-vsctl show reads its data from ovsdb, its normal output indicates that the ovsdb-server process is running fine. ovs-ofctl, on the other hand, communicates with the ovs-vswitchd process, so the hang is most likely caused by ovs-vswitchd failing to respond to client requests.
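One way to confirm which daemon is unresponsive is to probe each daemon's control socket directly with ovs-appctl (an illustration; the generic version command is assumed to be available in the installed OVS release):
ovs-appctl -t ovsdb-server version    (answers if ovsdb-server is alive)
ovs-appctl -t ovs-vswitchd version    (hangs or errors if ovs-vswitchd is stuck)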
3. View the log
vim /var/log/openvswitch/ovsdb-server.log
WARN|unix: send error: Broken pipe
2015-09-07T19:43:59.058Z|00010|reconnect|WARN|unix: connection dropped (Broken pipe)
Both warnings show the control connection being dropped (broken pipe).
View /var/log/openvswitch/ovs-vswitchd.log.
A line similar to the following appears:
WARN|Unreasonably long 16518ms poll interval
This indicates that ovs-vswitchd is probably deadlocked or has an unresponsive thread.
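Such warnings can be counted to gauge how often the daemon stalls (illustrative):
grep -c 'Unreasonably long' /var/log/openvswitch/ovs-vswitchd.log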
4. Find where the process is stuck:
strace -p PID
where PID is the process ID of ovs-vswitchd.
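For example (a sketch; assumes pidof is available and the process is named ovs-vswitchd):
strace -tt -p $(pidof ovs-vswitchd)
If the daemon is deadlocked, strace typically shows it blocked in a single futex or poll system call.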
Solution:
Restart the openvswitch service
Use a cron job as a watchdog:
cat /etc/cron.d/monitor_vswitchd
* * * * * root timeout -s SIGKILL 2s ovs-ofctl show br-mgmt || (date >> /var/log/mon_openvswitch.log; service openvswitch restart >> /var/log/mon_openvswitch.log 2>&1)
Every minute this probes ovs-vswitchd through ovs-ofctl with a 2-second timeout; if the probe hangs, the job logs a timestamp and restarts openvswitch.
Upgrade the kernel
In the long run it is better not to rely on cron but to upgrade the kernel. The problem was resolved after upgrading to 2.6.32-504.16.2.el6.x86_64.
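A typical upgrade path on CentOS 6 might look like the following (a sketch; package availability depends on the configured repositories):
yum install kernel-2.6.32-504.16.2.el6
reboot
uname -r    (should now report 2.6.32-504.16.2.el6.x86_64)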
2. When the Neutron DHCP agent restarts or fails over, some virtual machines lose connectivity.
Phenomenon:
When the DHCP agent restarts or fails over, some virtual machines lose connectivity.
The cause of the problem:
When there are many virtual networks, the DHCP agent also manages many qdhcp network namespaces. After the Neutron DHCP agent fails over or restarts, rebuilding these resources takes a long time, sometimes 3-5 minutes. If a virtual machine needs to renew its lease during this window, its requests to the DHCP server get no response, the renewal times out, and it fails. Even once the DHCP service is answering again, the VM does not retry on its own; at that point you have to enter the virtual machine's console and run ifup eth0.
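The rebuild can be watched on the node running the DHCP agent as the namespaces reappear (illustrative; assumes the ip netns utility is available):
ip netns list | grep qdhcp
watch -n 5 'ip netns list | grep -c qdhcp'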
For CentOS, we recommend increasing the retry timeout in the dhclient configuration so that a failed renewal keeps retrying while waiting for the DHCP server to recover.
Solution:
Modify the configuration file /etc/dhcp/dhclient.conf:
timeout 300;
With this setting, a CentOS virtual machine whose lease renewal fails keeps retrying for 5 minutes, waiting for the DHCP service to recover.
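To verify the behavior, a renewal can be forced from inside a VM (illustrative; the interface name eth0 is assumed):
dhclient -r eth0    (release the current lease)
dhclient eth0       (request a new one; with the timeout 300 setting, dhclient keeps trying for up to 5 minutes)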
3. Adjust the NIC RX ring buffer length to fix packet loss
Problem: on a public cloud platform, the storage networks of the compute1 and compute4 compute nodes cannot reach each other.
Resolution process:
1. Ping compute4 from compute1 while capturing packets with tcpdump on both nodes: the ICMP request and the ICMP reply are both seen on compute4, but compute1 never receives the ICMP reply, and tcpdump reports "xxx packets dropped by interface".
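The capture in this step can be reproduced with something like the following (illustrative; the interface name comes from the ifconfig output below, and the filter is an assumption):
tcpdump -ni bond2 icmp    (run on both compute1 and compute4 while pinging)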
2. Log in to the Pica8 switch and check the physical- and link-layer connections of the two machines.
3. Check the physical NIC on compute1: there is heavy packet loss on RX:
[root@compute1 ~]# ifconfig bond2
bond2     Link encap:Ethernet  HWaddr 00:0A:F7:5D:4A:E2
          inet addr:172.16.3.51  Bcast:172.16.3.255  Mask:255.255.255.0
          inet6 addr: fe80::20a:f7ff:fe5d:4ae2/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:5974542045 errors:8394 dropped:1892018 overruns:8394 frame:0
          TX packets:30430136566 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5387974623010 (4.9 TiB)  TX bytes:28489033161925 (25.9 TiB)
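Whether drops are still accumulating can also be checked with the kernel's interface statistics (illustrative):
ip -s link show bond2    (watch whether the RX dropped/overrun counters keep growing)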
4. Use ethtool --show-ring (or ethtool -g) to view the RX/TX ring buffers of the physical NICs underlying bond2:
[root@compute1 ~]# ethtool --show-ring p6p2
Ring parameters for p6p2:
Pre-set maximums:
RX: 4078
RX Mini: 0
RX Jumbo: 0
TX: 4078
Current hardware settings:
RX: 453
RX Mini: 0
RX Jumbo: 0
TX: 4078
5. Suspicion: the RX ring buffer on the NIC is set too small (453 of a possible 4078 descriptors) to absorb the Ethernet frames arriving at the NIC, so frames are dropped.
6. Resize the RX ring buffer with ethtool --set-ring (or ethtool -G):
[root@compute1 ~]# ethtool --set-ring p6p2 rx 4078
Cannot set device ring parameters: Input/output error
Despite the error message, the readback below shows that the new value took effect.
[root@compute1 ~]# ethtool --show-ring p6p2
Ring parameters for p6p2:
Pre-set maximums:
RX: 4078
RX Mini: 0
RX Jumbo: 0
TX: 4078
Current hardware settings:
RX: 4078
RX Mini: 0
RX Jumbo: 0
TX: 4078
7. This change does not survive a reboot; the setting reverts to the default. It is recommended to write the commands into /etc/rc.local:
ethtool -G p6p2 rx 4078
ethtool -G p7p2 rx 4078
4. Problems caused by NIC driver defects
Phenomenon: a NIC driver defect means that with offload enabled, ping works normally but TCP connections are slow or break. Diagnosis and solution follow.
Common reasons are:
1. MTU problems
Confirm whether the MTU of the physical server NIC matches the uplink switch. Hardware vendors generally default the MTU to 1500, but there are exceptions; for example, on Pica8 SDN switches packets are dropped if the MTU is set below 1512.
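Host-side MTU and path MTU can be probed with standard tools (a sketch; the interface name and peer address are placeholders):
ip link show bond2              (prints the configured mtu)
ping -M do -s 1472 <peer-ip>    (1472 bytes of payload + 28 bytes of headers = 1500; fails if the path MTU is smaller)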
2. Physical NIC offload
Fuel deployments enable the physical NIC's offload features by default. With offload enabled, TCP or UDP checksum inconsistencies can cause packet loss or retransmission.
Solution:
The TCP checksum guarantees that a segment does not change in transit; if the checksum does not match, TCP discards the segment or triggers a retransmission after a timeout. The TCP checksum is mandatory, while the UDP checksum is optional. In this situation it is recommended to turn off both rx and tx checksum offload.
RX Checksum:
With this feature enabled, the NIC rather than the kernel protocol stack verifies the transport-layer checksum of received packets; only packets whose checksum is correct are handed to the kernel, which saves CPU resources.
Toggle the feature: ethtool -K DEVNAME rx on|off
TX Checksum:
With this feature enabled, the checksum is computed by the NIC before the packet is sent: the kernel fills the TCP or UDP checksum field with a placeholder, and the physical NIC writes the correct value into the frame.
Toggle the feature: ethtool -K DEVNAME tx on|off
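The current offload state can be confirmed before and after the change (illustrative; the device name is a placeholder):
ethtool -k DEVNAME | grep -i checksum    (shows rx-checksumming / tx-checksumming on|off)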
Persist offload settings
You can add the ethtool commands to /etc/rc.local, or use CentOS's ifcfg- scripts. For example, to turn off checksum offload for eth0 in both directions, edit /etc/sysconfig/network-scripts/ifcfg-eth0, add the line ETHTOOL_OPTS="-K eth0 rx off; -K eth0 tx off", and then run ifup eth0 for the settings to take effect.