Tuesday, 11 October 2011

vSwitch Network Failover Detection Testing: Beacon Probing and Link Status

In ESXi 5.0, under vSwitch Network NIC Teaming, there is a Network Failover Detection field. It has two settings: Beacon Probing and Link Status Only.

The help section states the following:

Link Status only
Relies solely on the link status that the network adapter provides. This option detects failures, such as cable pulls and physical switch power failures, but not configuration errors, such as a physical switch port being blocked by spanning tree or mis-configured to the wrong VLAN or cable pulls on the other side of a physical switch.
Beacon Probing
Sends out and listens for beacon probes on all NICs in the team and uses this information, in addition to link status, to determine link failure. This option detects many of the failures mentioned above that are not detected by link status alone.
Note: Do not use beacon probing with IP-hash load balancing.
 

After three days of testing over the weekends, I found Beacon Probing very confusing, but it works.

Link Status Only does what is stated above, so there is not much to test. Beacon Probing, however, is Beacon + Link Status.



What is Beacon Probing
After searching Google and digging through many online notes on Beacon Probing, I came to the following conclusion.

Beacon Probing checks the health and connectivity between each vmnic (physical NIC) in the same vSwitch.

ESXi sends a small packet out of its physical network card and checks whether this packet is received by the other physical network cards in the same vSwitch. If a vmnic receives the packet, the connectivity between those two physical NICs is healthy.

Why? If you have a single vSwitch with 2 pNICs, and each pNIC is connected to a different physical switch, we need some way to tell whether these two physical switches are connected to each other.

For example,
  • vSwitch0 has 2 pNICs: vmnic1 and vmnic2.
  • vmnic1 is connected to Access Switch 1.
  • vmnic2 is connected to Access Switch 2.
  • Access Switch 1 is connected to Distribution Switch 1.
  • Access Switch 2 is connected to Distribution Switch 2.
  • Distribution Switch 1 is directly connected to Distribution Switch 2.
  • Note: in this example, Access Switch 1 and Access Switch 2 each have only one uplink, not 2 uplinks to both Distribution Switches.
If the link between Distribution Switch 1 and Distribution Switch 2 breaks, we have a split LAN problem. The LAN is split in two, and depending on which side of the LAN you are on, you can connect to only half of your virtual machines, even though the ESX host is running fine. This is like a split brain, but from a network perspective.
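The split can be sketched as a reachability check over the example topology. This is a minimal Python model, assuming that a beacon from one vmnic reaches another only when a layer-2 path exists between their access switches. The labels AS1, AS2, DS1 and DS2 are shorthand introduced here for the switches in the list, and this is not how ESXi implements probing internally.

```python
from collections import deque

# Hypothetical model of the example topology: switches are nodes, links are edges.
LINKS = {("AS1", "DS1"), ("AS2", "DS2"), ("DS1", "DS2")}
VMNIC_UPLINK = {"vmnic1": "AS1", "vmnic2": "AS2"}  # where each vmnic is plugged in


def reachable(start, links):
    """Return every switch reachable from `start` over the given links (breadth-first search)."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def beacon_reaches(src, dst, links):
    """A beacon from vmnic `src` can arrive on vmnic `dst` only if their switches are connected."""
    return VMNIC_UPLINK[dst] in reachable(VMNIC_UPLINK[src], links)


print(beacon_reaches("vmnic1", "vmnic2", LINKS))   # True: LAN intact
broken = LINKS - {("DS1", "DS2")}                  # the Distribution Switch link fails
print(beacon_reaches("vmnic1", "vmnic2", broken))  # False: split LAN
```

With the DS1 to DS2 link removed, neither vmnic can hear the other, which is exactly the condition Beacon Probing is meant to notice.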

To avoid such a split brain, Beacon Probing comes into play to detect these network issues or misconfigurations.

But after testing Beacon Probing for the past few days, I think it is best not to use Beacon Probing unless you know what you are doing. Otherwise, it will cause more network issues than it prevents.

Below are some of my findings on Beacon Probing:

1. At least 3 Physical Network Ports
You must have at least 3 physical network ports in the same vSwitch before you turn on Beacon Probing.
The reason is simple: if you have only 2 physical network ports in the same vSwitch and their beacon packets cannot reach each other, which network port should be disabled? vmnic1 or vmnic2?

Disabling either one of the vmnics will cause a problem.

If you have 3 vmnics, the 2 vmnics that receive each other's beacons stay active and the other one is disabled.

2. Best if you have an odd number of Network Ports
If you have 2 vmnics connected to Access Switch 1 and 2 vmnics connected to Access Switch 2, and the connectivity between Distribution Switch 1 and Distribution Switch 2 fails, you end up with the same tie as in the 1-to-1 vmnic case. Which side should be disabled?
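Findings 1 and 2 both come down to a majority vote. The sketch below is a hypothetical Python model of that reasoning, not VMware's actual algorithm: each vmnic is assigned to the LAN segment it lands on after a split, the largest segment stays active, and a tie between the largest segments means there is no safe choice. The segment names are illustrative.

```python
from collections import Counter


def choose_active(segment_of):
    """Hypothetical majority rule, not VMware's algorithm.

    `segment_of` maps each vmnic to the LAN segment it lands on after a split;
    vmnics in the same segment can hear each other's beacons.
    Returns the set of vmnics to keep active, or None if the largest segments tie.
    """
    ranked = Counter(segment_of.values()).most_common()
    if not ranked:
        return set()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no majority, so no side can be picked safely
    winner = ranked[0][0]
    return {nic for nic, seg in segment_of.items() if seg == winner}


# Finding 1: 2 vmnics split 1 vs 1
print(choose_active({"vmnic1": "Switch1", "vmnic2": "Switch2"}))  # None
# Finding 1: 3 vmnics split 2 vs 1, the pair that hears each other wins
print(sorted(choose_active({"vmnic1": "Switch1", "vmnic2": "Switch1",
                            "vmnic3": "Switch2"})))  # ['vmnic1', 'vmnic2']
# Finding 2: 4 vmnics split 2 vs 2
print(choose_active({"vmnic1": "Switch1", "vmnic2": "Switch1",
                     "vmnic3": "Switch2", "vmnic4": "Switch2"}))  # None
```

Under this simple model, an odd number of vmnics guarantees a majority only for a two-way split; a split into three or more segments could still tie.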

3. Do not enable IP Hashing for Load Balancing
When you use IP Hash, you have to connect 2 uplinks to the same physical switch, running EtherChannel or 802.3ad without LACP (I will explain this next time). From the physical switch's perspective, the two physical uplinks are one logical (virtual) port, and MAC addresses are learned against that single EtherChannel port.
When vmnic1 sends a beacon to check on vmnic2, the packet will never reach vmnic2, because a physical switch never forwards a frame back out of the port it arrived on. Hence, if you run EtherChannel on the physical switch (IP Hash on the ESXi vSwitch), beacon probing will never work!
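The "never send it back out of the ingress port" rule can be sketched with a simplified learning-switch model. This Python example is an assumption for illustration, not vendor code, and the port names (Gi0/1, Gi0/2, Gi0/24, Po1) are hypothetical. It checks both an unknown destination (flooded) and a learned destination.

```python
def egress_ports(ingress, dst_port, all_ports):
    """Simplified learning-switch forwarding (an assumption, not vendor code).

    `dst_port` is the logical port learned for the destination MAC, or None
    if unknown. A frame is never sent back out of the port it arrived on.
    """
    if dst_port is None:
        return {p for p in all_ports if p != ingress}  # flood, except the ingress port
    if dst_port == ingress:
        return set()  # destination is behind the ingress port: filter the frame
    return {dst_port}


def beacon_delivered(src, dst, port_of, learned):
    """Can a beacon sent by vmnic `src` come back in on vmnic `dst`?"""
    ports = set(port_of.values()) | {"Gi0/24"}  # plus one unrelated access port
    dst_port = port_of[dst] if learned else None
    return port_of[dst] in egress_ports(port_of[src], dst_port, ports)


# Hypothetical switch port names.
separate_ports = {"vmnic1": "Gi0/1", "vmnic2": "Gi0/2"}
etherchannel = {"vmnic1": "Po1", "vmnic2": "Po1"}  # both uplinks bundled into one logical port

for learned in (False, True):
    print(learned,
          beacon_delivered("vmnic1", "vmnic2", separate_ports, learned),
          beacon_delivered("vmnic1", "vmnic2", etherchannel, learned))
```

With separate ports the beacon gets through in both cases; once both vmnics sit behind the same logical port, the switch has nowhere to send it.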


Now, after some testing, there are some things I want to highlight. I can't find the Beacon packet! If you can see the Beacon, please let me know how you did it!

Based on my lab, I can't see the Beacon being transmitted across the network. I don't know how VMware implements the Beacon packet or sends it across the network. I have sniffed all the ports and uplinks my ESXi host is connected to, and I can't find any packet that looks like a Beacon on my physical switch.

After two days of searching for the Beacons on my physical switch, I decided to look inside my vSwitch. And yes, I saw the Beacon.




I added 5 pNICs to my vSwitch. Looking at the sniffed packets, I can see a 2-second gap between beacon transmissions. Each pNIC transmitted 4 packets, one to each of the other 4 pNICs, using EtherType (protocol number) 0x8922.
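The observed volume is simple arithmetic: with n pNICs, each one sends a beacon to each of the other n - 1, so one round carries n(n - 1) beacons. The sketch below assumes that pattern and the 2-second gap from this one capture hold for any team size.

```python
def beacons_per_interval(n_pnics):
    """Each pNIC sends one beacon to every other pNIC in the team (pattern seen in the capture)."""
    return n_pnics * (n_pnics - 1)


INTERVAL_S = 2  # gap between beacon rounds observed in the capture (assumed constant)

for n in (3, 5):
    total = beacons_per_interval(n)
    print(f"{n} pNICs: {total} beacons every {INTERVAL_S} s ({total / INTERVAL_S:g} per second)")
```

For the 5-pNIC lab this gives 20 beacons per round, which matches 5 pNICs each sending 4 packets.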


According to some VMware documents (ESX 2.x and 3.x), the physical NIC transmits the Beacon packet using the physical NIC's MAC address. I have never tested the Beacon in 3.x or 4.x. But in ESXi 5.0, it uses "half" of the physical MAC address.

Below are my actual physical NIC addresses.



Taking vmnic0 as an example, below is what I captured in Wireshark.



ESXi replaced the first 4 bytes of the MAC address, 00:1b:21:84 (00:1b:21 is an Intel vendor code), with 00:50:56:54 (00:50:56 is the VMware vendor code), and kept the last 2 bytes, d4:c8.
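The rewrite observed for vmnic0 can be expressed as a small function. This is a sketch inferred from a single capture: it assumes the prefix 00:50:56:54 is the same for every vmnic and that only the last 2 bytes of the physical MAC are carried over. It is not a documented VMware rule.

```python
BEACON_PREFIX = "00:50:56:54"  # first 4 bytes seen in the captured beacon (assumed fixed)


def beacon_source_mac(physical_mac):
    """Keep the last 2 bytes of the pNIC MAC and replace the first 4 with BEACON_PREFIX."""
    octets = physical_mac.lower().split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {physical_mac!r}")
    return BEACON_PREFIX + ":" + ":".join(octets[4:])


print(beacon_source_mac("00:1b:21:84:d4:c8"))  # 00:50:56:54:d4:c8
```

A helper like this can be used to predict which modified MAC address to look for in the physical switch's MAC address table.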

Looking into the packet data, you can also see the actual physical MAC address and the vmnic number.

The same logic applies to the rest of the Beacon packets.

At the physical switch level, this is what I see: ESXi did register the modified MAC address in the physical switch.



To confirm that these were the Beacon packets, I changed Network Failover Detection to Link Status Only. Protocol 0x8922 no longer appeared in the vSwitch, and looking back at my physical switch, the modified MAC address was gone!

And yes, this is the Beacon packet! But this packet is not transmitted out of the vSwitch to the physical switch, as I can't find any such packet on my physical switches. Trust me, I spent many hours searching my switches for this beacon and found none.

So, did my physical switch drop the packet? And if it did, that would mean the beacon theory cannot work in my environment, right? So I ran a test to verify whether the Beacon Probing theory works.

I connected 3 pNICs (vmnic0, vmnic1 and vmnic4) to physical Switch 1 and the other 2 pNICs (vmnic2 and vmnic3) to physical Switch 2. I then broke the link between the two physical switches. And yes! ESXi selected my 3 pNICs as the primary LAN and blocked all traffic from vmnic2 and vmnic3. When I moved vmnic4 to Switch 2, the pNICs on the other side started to transmit, and vmnic0 and vmnic1 stopped transmitting.

So, this means the Beacon works!


Through further testing, I realised that the Beacon is only transmitted on the untagged VLAN. This means that if you have 30 VLANs trunked to your vmnics, Beacon Probing only works if all vmnics are connected to the same untagged VLAN. (Additional information: if you trunk 30 VLANs on your physical switch, you actually have 29 tagged VLANs and 1 untagged VLAN. Most likely the untagged VLAN is the one carrying Spanning Tree BPDUs.)
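To check this observation in a capture, a frame can be classified by its EtherType and 802.1Q tag. This Python sketch assumes standard Ethernet framing (MACs in bytes 0 to 11, then the type field) and flags only untagged frames carrying EtherType 0x8922, which matches what I saw. The test frames are made up; the MACs and payload are placeholders, not captured data.

```python
import struct

BEACON_ETHERTYPE = 0x8922  # protocol number seen in the vSwitch capture
VLAN_TPID = 0x8100         # IEEE 802.1Q tag identifier


def classify(frame: bytes):
    """Return (vlan_id, ethertype); vlan_id is None for an untagged frame.

    Bytes 0-5 are the destination MAC, bytes 6-11 the source MAC,
    and bytes 12-13 the EtherType (or the 802.1Q TPID when tagged).
    """
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == VLAN_TPID:
        tci, inner_type = struct.unpack_from("!HH", frame, 14)
        return tci & 0x0FFF, inner_type  # low 12 bits of the TCI are the VLAN ID
    return None, ethertype


def is_untagged_beacon(frame: bytes) -> bool:
    vlan, ethertype = classify(frame)
    return vlan is None and ethertype == BEACON_ETHERTYPE


# Made-up test frames: placeholder MACs and payload.
dst = bytes.fromhex("ffffffffffff")
src = bytes.fromhex("00505654d4c8")
payload = bytes(46)
untagged = dst + src + struct.pack("!H", BEACON_ETHERTYPE) + payload
tagged = dst + src + struct.pack("!HHH", VLAN_TPID, 10, BEACON_ETHERTYPE) + payload

print(is_untagged_beacon(untagged))  # True
print(classify(tagged))              # (10, 35106)
```

35106 is simply 0x8922 printed in decimal.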

Next problem: you cannot tell whether Beacon Probing has detected a problem. When traffic swings from one set of vmnics to another, there are no alerts. Looking at the vSwitch in the Networking view, all vmnics are green. So, if you use Beacon Probing and you have a split network problem, you can't tell!

There is no troubleshooting guide or explanation from VMware on the proper use of Beacon Probing, and there are very few commands to verify what went wrong with a Beacon Probing result. Yes, Beacon Probing tests whether the network has a problem and reacts to it, but it does not tell you what problem it found, and it reacts without your knowledge. So, for me, I will never use this method.

Link Status Only is the most straightforward. When there is a problem, the physical network card is marked with a red cross and an alert is raised, so you know there is a problem. To further enhance link status detection, you can configure your physical switch to monitor your upstream links. Should any important link fail, the physical switch shuts down the port, and in turn your ESXi vmnic is disconnected.
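The upstream monitoring idea can be sketched as follows. This is a hypothetical Python model that follows the rule as written above (any failed tracked upstream link shuts the downstream port). Real switch features may behave differently; for example, some only shut downstream ports when all upstream links in a group are down, so check your switch documentation. The switch and uplink names are illustrative.

```python
def downstream_up(upstream_up):
    """Rule from the text: any failed tracked upstream link shuts the downstream ports."""
    return all(upstream_up.values())


def vmnic_link_status(vmnic_switch, switch_upstreams):
    """What Link Status Only would then report for each vmnic (hypothetical model)."""
    return {
        nic: "connected" if downstream_up(switch_upstreams[sw]) else "disconnected"
        for nic, sw in vmnic_switch.items()
    }


vmnic_switch = {"vmnic1": "Access Switch 1", "vmnic2": "Access Switch 2"}
switch_upstreams = {
    "Access Switch 1": {"uplink to Distribution Switch 1": True},
    "Access Switch 2": {"uplink to Distribution Switch 2": False},  # failed uplink
}
print(vmnic_link_status(vmnic_switch, switch_upstreams))
# {'vmnic1': 'connected', 'vmnic2': 'disconnected'}
```

The upstream failure becomes a plain link-down on vmnic2, which Link Status Only detects and shows with the usual red cross and alert.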
4 comments:

  1. thank you, very good article!

  2. Hi!

    Nice article!
    Little amendment: we tested beacon probing in vSphere 5.0 U1 build: 721882. We found that beacon probing works on a per-VLAN basis.

    Bye:
    Fast Driver

  3. Excellent article.

  4. Hi Thomas,
    Just came across this article and it's excellent. I know you wrote it a while ago, but it's still very relevant. Good job
