Copyright © 2019 nukescloud

Arun Nukula

Changing MTU value causes VMNIC to flap


This morning, I was involved in an escalation where NIC flaps were seen on 6 out of 7 hosts in a brand new VxRail cluster.

It looks like it all started once the administrator began changing MTU values.

The hostd logs didn't help a great deal, as they only reported that the vmnic had gone down.

When we checked the vmkernel logs, the MTU was constantly flapping between 1500 and 9000:

grep -i "changing MTU" vmkernel.log
2017-12-18T02:54:18.954Z cpu19:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 9000 to 1500
2017-12-18T02:54:19.594Z cpu19:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 9000 to 1500
2017-12-18T03:02:28.343Z cpu21:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000
2017-12-18T03:02:28.985Z cpu26:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000
2017-12-18T03:02:59.633Z cpu25:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 9000 to 1500
2017-12-18T03:03:00.269Z cpu30:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 9000 to 1500
2017-12-18T03:08:48.374Z cpu25:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000
2017-12-18T03:08:49.006Z cpu25:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000

What we see in the vmkernel logs is an MTU flap. The server was using the ixgbe driver.
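To quantify how often each uplink flapped, the vmkernel log can be summarized with a short pipeline. This is only a sketch: the sample file below reuses lines from the excerpt above, and on a live ESXi host you would point the pipeline at /var/log/vmkernel.log instead.

```shell
# Build a small sample from the log excerpt above (illustration only);
# on an ESXi host, use /var/log/vmkernel.log instead.
cat <<'EOF' > /tmp/vmkernel.sample.log
2017-12-18T02:54:18.954Z cpu19:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 9000 to 1500
2017-12-18T02:54:19.594Z cpu19:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 9000 to 1500
2017-12-18T03:02:28.343Z cpu21:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000
2017-12-18T03:02:28.985Z cpu26:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000
EOF

# Count MTU-change events per vmnic: extract the vmnic name from each
# matching line, then tally occurrences.
grep -i "changing MTU" /tmp/vmkernel.sample.log \
  | sed 's/.*\(vmnic[0-9]*\):.*/\1/' \
  | sort | uniq -c
```

A sudden spike in the per-vmnic counts is a quick way to spot which uplinks are flapping without reading the whole log.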

Whenever an MTU change is made, the driver brings the NIC down, applies the necessary changes to the hardware, and then brings it back up. The link status is reported to the vmkernel, and vobd records these state changes.
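As a quick sanity check during such an event, the driver in use and the currently configured MTU of each uplink can be inspected directly on the affected ESXi host with standard esxcli commands (these must run on the host itself, not on vCenter):

```shell
# List all physical NICs with their driver, link state, and configured MTU.
esxcli network nic list

# Show detailed driver and firmware information for a single uplink.
esxcli network nic get -n vmnic0
```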

vmkernel log snippet

2017-12-18T03:45:09.454Z cpu23:69310 opID=4758f26a)NetOverlay: 1107: class:vxlan is already instantiated one on depth 0
2017-12-18T03:45:09.470Z cpu20:65645)<6>ixgbe 0000:01:00.1: vmnic1: changing MTU from 1500 to 9000
2017-12-18T03:45:10.101Z cpu23:66107)vxlan: VDL2PortOutputUplinkChangeCB:649: Output Uplink change event with priority :5 was ignored for portID: 400000e.
2017-12-18T03:45:10.101Z cpu23:66107)netschedHClk: NetSchedHClkNotify:2892: vmnic1: link down notification
2017-12-18T03:45:10.101Z cpu22:65646)vdrb: VdrHandleUplinkEvent:1605: SYS:DvsPortset-0: Uplink event 2 for port 0x400000a, linkstate 0
2017-12-18T03:45:10.101Z cpu28:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 1500 to 9000
2017-12-18T03:45:10.738Z cpu2:66105)vxlan: VDL2GetlEndpointAndSetUplink:387: Now, no active uplinks in tunnel group:67108878.
2017-12-18T03:45:10.738Z cpu2:66105)netschedHClk: NetSchedHClkNotify:2892: vmnic0: link down notification
2017-12-18T03:45:10.738Z cpu2:66105)netschedHClk: NetSchedHClkDoFlushQueue:3818: vmnic0: dropping 6 packets from queue netsched.pools.persist.default
2017-12-18T03:45:10.738Z cpu2:66105)netschedHClk: NetSchedHClkDoFlushQueue:3818: vmnic0: dropping 3 packets from queue netsched.pools.persist.mgmt
2017-12-18T03:45:10.738Z cpu8:65647)vdrb: VdrHandleUplinkEvent:1605: SYS:DvsPortset-0: Uplink event 2 for port 0x4000008, linkstate 0
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)VMKAPIMOD: 86: Failed to check if port is Uplink : Failure
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)Team.etherswitch: TeamESLACPLAGEventCB:6277: Received a LAG DESTROY event version :0, lagId :0, lagLinkStatus :NOT USED,lagName :, uplinkName :, portLinkStatus :NOT USED, portID :0x0
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)netioc: NetIOCSetRespoolVersion:245: Set netioc version for portset: DvsPortset-0 to 3,old threshold: 3
2017-12-18T03:45:10.739Z cpu28:69310 opID=4758f26a)netioc: NetIOCSetupUplinkReservationThreshold:135: Set threshold for portset: DvsPortset-0 to 75, old threshold: 75
2017-12-18T03:45:10.741Z cpu28:69310 opID=4758f26a)netioc: NetIOCPortsetNetSchedStatusSet:1207: Set sched status for portset: DvsPortset-0 to Active, old:Active
2017-12-18T03:45:10.741Z cpu28:69310 opID=4758f26a)VLANMTUCheck: NMVCDeployClear:871: can't not find psReq for ps DvsPortset-0
2017-12-18T03:45:10.784Z cpu15:69324 opID=1731b730)World: 12230: VC opID de31f2ec maps to vmkernel opID 1731b730
2017-12-18T03:45:10.784Z cpu15:69324 opID=1731b730)Tcpip_Vmk: 263: Lookup route failed
2017-12-18T03:45:13.357Z cpu19:10575834)CMMDS: AgentSendHeartbeatRequest:211: Agent requesting a reliable heartbeat from node 5a1e3e8b-2e6e-82a4-bfac-a0369fdec1c4

2017-12-18T03:45:41.383Z cpu22:66079)netschedHClk: NetSchedHClkNotify:2892: vmnic1: link down notification
2017-12-18T03:45:41.383Z cpu22:66079)netschedHClk: NetSchedHClkDoFlushQueue:3818: vmnic1: dropping 10 packets from queue netsched.pools.persist.mgmt
2017-12-18T03:45:41.383Z cpu1:65647)vdrb: VdrHandleUplinkEvent:1605: SYS:DvsPortset-0: Uplink event 2 for port 0x400000a, linkstate 0
2017-12-18T03:45:41.383Z cpu21:65645)<6>ixgbe 0000:01:00.0: vmnic0: changing MTU from 9000 to 1500

Now, after a detailed investigation of the logs, we noticed that all the DVS operations were coming from vCenter.

Moreover, we see the following error message:

"The operation reconfigureDistributedVirtualSwitch on the host <<hostname>> disconnected the host and was rolled back"

From the above message we can clearly interpret that there is a connectivity issue between vCenter and the hosts.

Since we did not see any driver or hardware related errors in the logs, we wanted to attempt increasing the timeout value for network rollback under the vpxd advanced settings.

Procedure: Use the vSphere Web Client to increase the timeout for rollback on vCenter Server. If you encounter the same problem again, increase the rollback timeout in 60-second increments until the operation has enough time to succeed.

1. On the Manage tab of a vCenter Server instance, click Settings.
2. Select Advanced Settings and click Edit.
3. If the property is not present, add the config.vpxd.network.rollbackTimeout parameter to the settings.
4. Type a new value, in seconds, for the config.vpxd.network.rollbackTimeout parameter.
5. Click OK.
6. Restart the vCenter Server system to apply the changes.
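For reference, the same advanced setting can also be changed from the command line with the govc CLI. This is an assumption about tooling: govc is a separate open-source client, not part of vCenter, and the GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD environment variables must point at your vCenter before running it.

```shell
# Read the current rollback timeout, if the key has already been set.
govc option.ls config.vpxd.network.rollbackTimeout

# Set the rollback timeout to 600 seconds.
# vCenter Server must still be restarted for the change to take effect.
govc option.set config.vpxd.network.rollbackTimeout 600
```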

The value was changed to 600 seconds.

Once done, all hosts in the cluster were in a stable state and ready for NSX configuration.

It looks like the changes were not being committed by vCenter Server in time. With the new timeout value, vCenter had enough time to commit the transaction, which eventually stopped the NIC flaps.

#vSphere
