• Arun Nukula

Changing MTU value causes VMNIC to flap

Updated: May 22


This morning, I was involved in an escalation where NIC flaps were seen on 6 out of 7 hosts on a brand new VxRail cluster.

It looks like it all started once the administrator began changing MTU values.

The hostd logs didn't help a great deal, as they only reported that the vmnic had gone down.

When we checked the vmkernel logs, the MTU was constantly flipping between 1500 and 9000.


grep -i "changing MTU" vmkernel.log

2017-12-18T02:54:18.954Z cpu19:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 9000 to 1500
2017-12-18T02:54:19.594Z cpu19:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 9000 to 1500
2017-12-18T03:02:28.343Z cpu21:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000
2017-12-18T03:02:28.985Z cpu26:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000
2017-12-18T03:02:59.633Z cpu25:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 9000 to 1500
2017-12-18T03:03:00.269Z cpu30:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 9000 to 1500
2017-12-18T03:08:48.374Z cpu25:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000
2017-12-18T03:08:49.006Z cpu25:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000
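To quantify the flapping instead of eyeballing the grep output, the matching lines can be tallied per NIC. The following is a minimal Python sketch, assuming the log line format shown above; the function name count_mtu_changes and the sample list are illustrative and not part of any VMware tooling.

```python
import re
from collections import Counter

# Matches the driver message format seen in the vmkernel.log excerpt above.
MTU_CHANGE = re.compile(
    r"^(?P<ts>\S+) .*?(?P<nic>vmnic\d+): changing MTU from (?P<old>\d+) to (?P<new>\d+)"
)

def count_mtu_changes(lines):
    """Return a Counter of (nic, old_mtu, new_mtu) -> number of occurrences."""
    counts = Counter()
    for line in lines:
        m = MTU_CHANGE.search(line)
        if m:
            counts[(m["nic"], int(m["old"]), int(m["new"]))] += 1
    return counts

# Three lines copied from the excerpt; in practice, pass an open log file instead.
sample = [
    "2017-12-18T02:54:18.954Z cpu19:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 9000 to 1500",
    "2017-12-18T03:02:28.343Z cpu21:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000",
    "2017-12-18T03:02:59.633Z cpu25:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 9000 to 1500",
]
mtu_counts = count_mtu_changes(sample)
for (nic, old, new), n in sorted(mtu_counts.items()):
    print(f"{nic}: {old} -> {new} seen {n} time(s)")
```

A NIC that shows roughly equal counts in both directions within a short period is flapping rather than being reconfigured once.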


What we see in the vmkernel logs is an MTU flap. The server was using the ixgbe driver.

Whenever an MTU change is made, the driver brings down the NIC, makes the necessary changes to the hardware, and then brings it back up. The link status is reported to vmkernel, and vobd records these changes.


vmkernel.log snippet

2017-12-18T03:45:09.454Z cpu23:69310 opID=4758f26a)NetOverlay: 1107: class:vxlan is already instantiated one on depth 0
2017-12-18T03:45:09.470Z cpu20:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000
2017-12-18T03:45:10.101Z cpu23:66107)vxlan: VDL2PortOutputUplinkChangeCB:649: Output Uplink change event with priority :5 was ignored for portID: 400000e.
2017-12-18T03:45:10.101Z cpu23:66107)netschedHClk: NetSchedHClkNotify:2892: vmnic1: link down notification
2017-12-18T03:45:10.101Z cpu22:65646)vdrb: VdrHandleUplinkEvent:1605: SYS:DvsPortset-0: Uplink event 2 for port 0x400000a, linkstate 0
2017-12-18T03:45:10.101Z cpu28:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000
2017-12-18T03:45:10.738Z cpu2:66105)vxlan: VDL2GetlEndpointAndSetUplink:387: Now, no active uplinks in tunnel group:67108878.
2017-12-18T03:45:10.738Z cpu2:66105)netschedHClk: NetSchedHClkNotify:2892: vmnic0: link down notification
2017-12-18T03:45:10.738Z cpu2:66105)netschedHClk: NetSchedHClkDoFlushQueue:3818: vmnic0: dropping 6 packets from queue netsched.pools.persist.default
2017-12-18T03:45:10.738Z cpu2:66105)netschedHClk: NetSchedHClkDoFlushQueue:3818: vmnic0: dropping 3 packets from queue netsched.pools.persist.mgmt
2017-12-18T03:45:10.738Z cpu8:65647)vdrb: VdrHandleUplinkEvent:1605: SYS:DvsPortset-0: Uplink event 2 for port 0x4000008, linkstate 0
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)Team.etherswitch: TeamESLACPLAGEventCB:6277: Received a LAG DESTROY event version :0, lagId :0, lagLinkStatus :NOT USED,lagName :, uplinkName :, portLinkStatus :NOT USED, portID :0x0
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)netioc: NetIOCSetRespoolVersion:245: Set netioc version for portset: DvsPortset-0 to 3,old threshold: 3
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)netioc: NetIOCSetupUplinkReservationThreshold:135: Set threshold for portset: DvsPortset-0 to 75, old threshold: 75
2017-12-18T03:45:10.741Z cpu28:69310 opID=4758f26a)netioc: NetIOCPortsetNetSchedStatusSet:1207: Set sched status for portset: DvsPortset-0 to Active, old:Active
2017-12-18T03:45:10.741Z cpu28:69310 opID=4758f26a)VLANMTUCheck: NMVCDeployClear:871: can't not find psReq for ps DvsPortset-0
2017-12-18T03:45:10.784Z cpu15:69324 opID=1731b730)World: 12230: VC opID de31f2ec maps to vmkernel opID 1731b730
2017-12-18T03:45:10.784Z cpu15:69324 opID=1731b730)Tcpip_Vmk: 263: Lookup route failed
2017-12-18T03:45:13.357Z cpu19:10575834)CMMDS: AgentSendHeartbeatRequest:211: Agent requesting a reliable heartbeat from node 5a1e3e8b-2e6e-82a4-bfac-a0369fdec1c4
2017-12-18T03:45:41.383Z cpu22:66079)netschedHClk: NetSchedHClkNotify:2892: vmnic1: link down notification
2017-12-18T03:45:41.383Z cpu22:66079)netschedHClk: NetSchedHClkDoFlushQueue:3818: vmnic1: dropping 10 packets from queue netsched.pools.persist.mgmt
2017-12-18T03:45:41.383Z cpu1:65647)vdrb: VdrHandleUplinkEvent:1605: SYS:DvsPortset-0: Uplink event 2 for port 0x400000a, linkstate 0
2017-12-18T03:45:41.383Z cpu21:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 9000 to 1500
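The snippet above shows each MTU change followed, well under a second later, by a link down notification on the same vmnic. A rough Python sketch of that correlation is below, assuming the line formats from the snippet; the function name, the 2-second window, and the regular expressions are assumptions made for illustration.

```python
import re
from datetime import datetime

MTU_RE = re.compile(r"^(\S+) .*?(vmnic\d+): changing MTU from (\d+) to (\d+)")
DOWN_RE = re.compile(r"^(\S+) .*?(vmnic\d+): link down notification")

def parse_ts(ts):
    # Timestamps in the snippet look like 2017-12-18T03:45:09.470Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def link_down_after_mtu_change(lines, window_s=2.0):
    """Pair each MTU change with the next link-down on the same NIC within window_s."""
    pending = {}  # nic -> (timestamp, old_mtu, new_mtu) of its latest MTU change
    pairs = []
    for line in lines:
        m = MTU_RE.search(line)
        if m:
            pending[m.group(2)] = (parse_ts(m.group(1)), int(m.group(3)), int(m.group(4)))
            continue
        d = DOWN_RE.search(line)
        if d and d.group(2) in pending:
            changed_at, old, new = pending.pop(d.group(2))
            delay = (parse_ts(d.group(1)) - changed_at).total_seconds()
            if delay <= window_s:
                pairs.append((d.group(2), old, new, delay))
    return pairs

# Four lines copied from the snippet above.
sample = [
    "2017-12-18T03:45:09.470Z cpu20:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000",
    "2017-12-18T03:45:10.101Z cpu23:66107)netschedHClk: NetSchedHClkNotify:2892: vmnic1: link down notification",
    "2017-12-18T03:45:10.101Z cpu28:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000",
    "2017-12-18T03:45:10.738Z cpu2:66105)netschedHClk: NetSchedHClkNotify:2892: vmnic0: link down notification",
]
pairs = link_down_after_mtu_change(sample)
for nic, old, new, delay in pairs:
    print(f"{nic}: MTU {old} -> {new}, link down {delay:.3f}s later")
```

If every link down event pairs with a preceding MTU change, the flaps are a side effect of reconfiguration rather than a cabling or hardware fault.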

After a detailed investigation of the logs, we noticed that all DVS operations were coming from vCenter.

Moreover, we saw the following error message:

"The operation reconfigureDistributedVirtualSwitch on the host <<hostname>> disconnected the host and was rolled back"

From the above message we could infer that there was a connectivity issue between vCenter and the hosts: each reconfiguration disconnected the host, so vCenter rolled the change back.

Since we did not see any driver- or hardware-related errors in the logs, we decided to try increasing the timeout value for network rollback under the vpxd advanced settings.

Procedure

Use the vSphere Web Client to increase the timeout for a rollback on vCenter Server. If you encounter the same problem again, increase the rollback timeout in 60-second increments until the operation has enough time to succeed.

1. On the Manage tab of a vCenter Server instance, click Settings.
2. Select Advanced Settings and click Edit.
3. If the property is not present, add the config.vpxd.network.rollbackTimeout parameter to the settings.
4. Type a new value, in seconds, for the config.vpxd.network.rollbackTimeout parameter.
5. Click OK.
6. Restart the vCenter Server system to apply the changes.

The value was changed to 600 seconds.
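For anyone who prefers to step the timeout up gradually instead of jumping straight to a large value, the 60-second increments from the procedure can be listed ahead of time. This is only a small Python sketch of the arithmetic: the function name is hypothetical, the starting value of 60 is an arbitrary example rather than a documented default, and the 600-second ceiling is simply the value used in this case. Each value still has to be set manually in the vCenter advanced settings.

```python
def rollback_timeout_candidates(start_s, step_s=60, max_s=600):
    """Yield timeout values to try, in seconds, from start_s up to and including max_s."""
    value = start_s
    while value <= max_s:
        yield value
        value += step_s

# Example starting point of 60 seconds (an assumption; check the current value first).
candidates = list(rollback_timeout_candidates(60))
print(candidates)
```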

Once done, all hosts in the cluster were in a stable state and ready for NSX configuration.

It looks like the changes were not being committed by vCenter Server in time. With the new timeout value, it had enough time to commit the transaction, which stopped the NIC flaps.


#vSphere


Copyright © 2019 nukescloud