ACI Troubleshooting - The case of the Blackhole Leaf!
The other day I was working with some folks on their ACI fabric and we ran into some, what I thought at the time, weird stuff. Turns out the troubleshooting revealed the fabric was acting exactly the way it was supposed to (which may not be directly obvious if you’re not familiar with ACI) and the whole thing was a good reminder to not make assumptions about stuff :) Anyway, figured I’d try to explain the perceived problem, a bit of the troubleshooting steps, and the ultimate resolution so hopefully this doesn’t stump anyone else!
Let’s start with a quick overview of the topology we’ll be discussing:
Nothing earth-shattering here – basic ACI fabric (not totally drawn, only the two relevant leaf nodes pictured to keep the clutter away), an L2 connection to another switch (over a “cloud” in this case), and a pair of out-of-band switches. We ended up needing to temporarily connect the out-of-band switches to the fabric to aid in some file transfers and the like (obviously not so out-of-band at that point, but it is just temporary!).
The setup here is pretty straightforward – both of the out-of-band switches have simple /30 links to the fabric – “OOB1” connects to Leaf_1, and “OOB2” connects to Leaf_2. OSPF peering is up and exchanging routes between the fabric and the out-of-band switches.
The other non-ACI switch (top of the diagram) is trunking some VLANs into ACI, including VLAN 999. VLAN 999 has IP interfaces on both the ACI fabric and the top switch. We’re basically using VLAN 999 to route across this otherwise simple L2 trunk. The fabric has 10.9.9.4/24 (remember, the whole fabric runs anycast, so every leaf has this IP), and the top switch has 10.9.9.1, 10.9.9.2, and 10.9.9.3 (there is a second switch, not pictured, in an HSRP pair, hence the three IPs). The top switch has a static route to 10.10.10.0/24, which is our out-of-band subnet living on the OOB switches; the static route points to the 10.9.9.4 address living on VLAN 999 in the Fabric. The last important bit about the top switch is that it also has VLAN 1000 with 10.0.0.1/24 (more on this in a bit).
The Fabric has a default route pointing back to the top switch on VLAN 999. Finally, the OOB switches are learning Fabric local routes via OSPF, but have simple static routes for all RFC1918 subnets pointing at the Fabric to cover everything else. (Their default route needs to go another way, so a different OSPF neighbor advertises the default with a better cost; the default route on the OOB switches is therefore in OSPF, just not learned from the Fabric.)
Phew… still with me? Lots of setup here for this post I guess!
So if you’re tracking this far, basically we have OSPF from the OOB to ACI, but really are just relying on static RFC1918 in this direction. Then ACI has a default route via OSPF to VLAN 999 on the other switch. In the reverse direction, the “top switch” has a static route to the OOB subnet via VLAN 999 in ACI, then ACI knows about the OOB subnet via OSPF from the OOB switches.
Okay, so on to the issue: top switch (really should have named that something… too late now!) can ping 10.10.10.1 and 10.10.10.2 sourced from VLAN 999, and it can ping 10.10.10.1 and 10.10.10.2 sourced from VLAN 1000 as well. So far so good, right? Here’s where it gets problematic – again from VLAN 999, top switch can NOT ping 10.10.10.3. Same story from VLAN 1000. So what gives?
Let’s start looking at basic routing to confirm that what I’ve outlined here isn’t a lie :)
OOB1 has the following routes in the FIB; note that 10.255.0.0/30 is the routed link between OOB1 and Leaf_1:
OOB1# show ip route
10.0.0.0/8, ubest/mbest: 1/0
    *via 10.255.0.2, [1/0], 2d07h, static
172.16.0.0/12, ubest/mbest: 1/0
    *via 10.255.0.2, [1/0], 2d07h, static
192.168.0.0/16, ubest/mbest: 1/0
    *via 10.255.0.2, [1/0], 2d07h, static
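To make the return path concrete, here is a minimal Python sketch of a longest-prefix-match lookup over the three statics on OOB1. The `lookup` helper is a made-up name for illustration, and the OSPF-learned routes (including the default from the other neighbor) are deliberately left out, so this is only a rough model of the static part of the table.

```python
import ipaddress

# The three RFC1918 statics on OOB1; next hop is Leaf_1's side of the /30.
OOB1_STATICS = {
    "10.0.0.0/8": "10.255.0.2",
    "172.16.0.0/12": "10.255.0.2",
    "192.168.0.0/16": "10.255.0.2",
}


def lookup(routes, destination):
    """Return (prefix, next_hop) for the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    network, next_hop = max(matches, key=lambda match: match[0].prefixlen)
    return str(network), next_hop


# Top switch addresses on VLAN 1000 and VLAN 999 both fall under 10.0.0.0/8.
print(lookup(OOB1_STATICS, "10.0.0.1"))
print(lookup(OOB1_STATICS, "10.9.9.1"))
# A public address matches none of the statics (the default is not modeled).
print(lookup(OOB1_STATICS, "8.8.8.8"))
```

Both top switch addresses resolve to the 10.0.0.0/8 static, so return traffic from the OOB side always gets handed to the locally attached leaf.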
OOB2 has the same statics in the FIB, just going to 10.255.0.6 (Leaf_2). Okay, that’s great. What about ACI – and maybe a better question if you haven’t played with ACI – how the hell do I see what routes ACI has?
For now the easiest way to do this kind of troubleshooting in ACI is still the CLI (I know I know I know – don’t be the elevator operator, but that’s just where we are at for now). So SSH into your APIC (or you can go directly to your Leaf nodes if you have in-band or out-of-band management set up for that). Then you can connect to the Leaf node of your choosing with the following command:
Of course «LEAF_NAME» is what you named whatever Leaf you want to get into. In our case, we’ll start at Leaf_1. When you get into the Leaf, you are initially dropped into iBash. This is a Linux shell that has some NX-OS-like commands you can run (stuff like “show vrf”). From here we want to get into vShell, which is still a Linux thing, but behaves like normal NX-OS (for most show commands at least). You can do that by simply running the command “vsh”.
Once in the vShell you have a very NX-OS-like place to poke about. For now we want to simply look at the routing table for the appropriate VRF – remember that everything in ACI is in a VRF under the covers. In our case we just want to peek at the routing table for VRF “Tenant.”
Leaf_1# show ip route vrf Tenant
IP Route Table for VRF "Tenant"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
0.0.0.0/0, ubest/mbest: 1/0
    *via 10.9.9.1, vlan999, [110/1], 1d06h, ospf-default, type-2, tag 1
10.10.10.0/24, ubest/mbest: 2/0
    *via 10.255.0.1, vlan2, [110/15], 1d06h, ospf-default, intra
Some things to note about all this… the VLANs in real life will be whatever ACI decides to assign. This is because this is all internal to the Fabric. I just changed things to make it a bit more readable. Anyway, you can see this Leaf has the default route as we expected and the more specific OSPF route via the OOB switch. The OOB switches are single attached, so this Leaf only has the route from OOB1, but if that link failed the route from OOB2 would be populated in the FIB. So basically, so far so good, right? Let’s look at Leaf_2 just for kicks:
Leaf_2# show ip route vrf Tenant
IP Route Table for VRF "Tenant"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
0.0.0.0/0, ubest/mbest: 1/0
    *via 10.9.9.1, vlan999, [110/1], 1d06h, ospf-default, type-2, tag 1
10.10.10.0/24, ubest/mbest: 2/0
    *via 10.255.0.5, vlan2, [110/15], 1d06h, ospf-default, intra
Same deal here, just learning the OOB subnet via the locally attached OOB switch. Okay, great!
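If you end up comparing these tables across a bunch of leaves, it can help to pull out just the prefix, next hop, and egress interface. The Python sketch below does that for the Leaf_2 route lines from this post. The regular expressions and the `parse_routes` name are my own assumptions, tuned to this sample only, so treat it as a starting point and not a general NX-OS parser.

```python
import re

# Route lines from the Leaf_2 output in this post (legend lines omitted).
SAMPLE = """\
0.0.0.0/0, ubest/mbest: 1/0
    *via 10.9.9.1, vlan999, [110/1], 1d06h, ospf-default, type-2, tag 1
10.10.10.0/24, ubest/mbest: 2/0
    *via 10.255.0.5, vlan2, [110/15], 1d06h, ospf-default, intra
"""

# Assumed patterns: a prefix line, then one or more indented "*via" lines.
PREFIX_RE = re.compile(r"^(\d+\.\d+\.\d+\.\d+/\d+), ubest/mbest")
VIA_RE = re.compile(r"^\s*\*via (\d+\.\d+\.\d+\.\d+), ([^,\s]+),")


def parse_routes(text):
    """Return one dict per best next hop: prefix, next_hop, interface."""
    routes = []
    prefix = None
    for line in text.splitlines():
        prefix_match = PREFIX_RE.match(line)
        if prefix_match:
            prefix = prefix_match.group(1)
            continue
        via_match = VIA_RE.match(line)
        if via_match and prefix is not None:
            routes.append(
                {
                    "prefix": prefix,
                    "next_hop": via_match.group(1),
                    "interface": via_match.group(2),
                }
            )
    return routes


for route in parse_routes(SAMPLE):
    print(route["prefix"], "via", route["next_hop"], "on", route["interface"])
```

The interesting column is the interface: every route points at some VLAN, and whether that VLAN actually exists on the leaf is a separate question from whether the route is in the table.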
Finally, let’s take a look at the “top switch” routing table:
Top_Switch#sh ip route
S* 10.10.10.0/24 [1/0] via 10.9.9.4
That looks good - of course there are more routes, but that’s the relevant one for us.
So the bottom line is that things are looking how you would expect them to look. Let’s test things out from top switch:
Top_Switch#traceroute 10.10.10.2
Type escape sequence to abort.
Tracing the route to 10.10.10.2
  1 10.9.9.4 0 msec 0 msec 4 msec
  2 10.255.0.1 0 msec 4 msec 0 msec
Top_Switch#traceroute 10.10.10.3
Type escape sequence to abort.
Tracing the route to 10.10.10.3
  1 10.9.9.4 0 msec 0 msec 4 msec
  2 10.255.0.1 0 msec 4 msec 0 msec
  3 * * *
  4 * * *
  5 * * *
Okay, weird, right? So basically top switch can hit OOB1, but dies at OOB1 while trying to get to OOB2… If it can get to OOB1, and the destination is in the same subnet, you would think that we would be able to get to OOB2 unless OOB2 doesn’t have a route back to top switch… okay, so let’s look at that:
OOB2# traceroute 10.0.0.1
traceroute to 10.0.0.1 (10.0.0.1), 30 hops max, 40 byte packets
 1  10.255.0.6 (10.255.0.6)  1.332 ms  1.048 ms  1.057 ms
 2  * * *
 3  * * *
Well, we get to Leaf_2… I guess that’s a start, but why the hell is it dying there? We know that Leaf_2 has a route to top switch as well as the OOB subnet, so why wouldn’t that work?
Let’s look back at this one key bit:
0.0.0.0/0, ubest/mbest: 1/0
    *via 10.9.9.1, vlan999, [110/1], 1d06h, ospf-default, type-2, tag 1
VLAN 999 shows up here on both leafs as the place we are learning the default route from. If, however, we look at what’s going on in VLAN land on both leafs, we see something a bit confusing:
Leaf_1# show vlan
VLAN Name                             Status    Ports
---- -------------------------------- --------- -------------------------------
999  --                               active    Eth1/48
Leaf_2# show vlan
VLAN Name                             Status    Ports
---- -------------------------------- --------- -------------------------------
Obviously these switches would have some VLAN action happening, but the point I’m trying to make here is that VLAN 999 has not been created on Leaf_2. Of course in ACI there is no “vlan 999” config to enter, so what gives? Simply put, ACI will never instantiate a VLAN on a leaf node that doesn’t have an EPG configured with a static path binding on it. Okay, what does that mean in non-ACI wordy words? Basically, if there is no port on a leaf that needs to be configured for a VLAN, that VLAN will never get created on that leaf. This is a thing for a great reason – why the hell would we build a ton of VLANs that we don’t need? That’s the point: let’s not waste TCAM space and have config clutter etc. when we don’t need to.
So why is this a problem for us? Well… Leaf_2 is trying to route to a next hop in a VLAN that it doesn’t have – probably won’t work out too well, eh? It turns out it doesn’t. Remember WAY long ago in the beginning of this post where I said don’t make assumptions? Yeah… well, I made the assumption that “top switch” was dual attached to the fabric. If that was the case, then the EPG for VLAN 999 would have had a static path binding (basically hooking an EPG – and in turn a VLAN – to a port) for both leaf switches. Given that was NOT the case, VLAN 999 never existed on Leaf_2 and it basically became a giant black hole.
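The logic is easy to lose in all the prose, so here is a toy Python model of it. Everything in it is hypothetical and just for illustration: the dictionaries, the function names, and especially the Leaf_2 port used for the dual-attached case, since no second uplink exists in the real setup. It does not talk to an APIC or reflect how ACI stores any of this internally.

```python
# Hypothetical model of the fabric state described in this post.
# Maps each VLAN to the (leaf, port) pairs that have a static path binding.
SINGLE_ATTACHED = {999: [("Leaf_1", "Eth1/48")]}

# Assumed port for a second uplink; the real setup had no such link.
DUAL_ATTACHED = {999: [("Leaf_1", "Eth1/48"), ("Leaf_2", "Eth1/48")]}

# The default route both leaves hold in VRF "Tenant".
DEFAULT_ROUTE = {"prefix": "0.0.0.0/0", "next_hop": "10.9.9.1", "vlan": 999}


def deployed_vlans(bindings, leaf):
    """VLANs that exist on a leaf: only those bound to one of its ports."""
    return {
        vlan
        for vlan, paths in bindings.items()
        if any(bound_leaf == leaf for bound_leaf, _port in paths)
    }


def can_forward(bindings, leaf, route):
    """A leaf can use a route only if the next hop's VLAN exists locally."""
    return route["vlan"] in deployed_vlans(bindings, leaf)


for name, bindings in (("single", SINGLE_ATTACHED), ("dual", DUAL_ATTACHED)):
    for leaf in ("Leaf_1", "Leaf_2"):
        works = can_forward(bindings, leaf, DEFAULT_ROUTE)
        verdict = "forwards" if works else "black-holes"
        print(f"{name}-attached top switch: {leaf} {verdict} default-routed traffic")
```

With the single-attached bindings, Leaf_2 holds the default route but has no VLAN 999 to send it out of, which is exactly the black hole we hit. Add a binding on Leaf_2 and the same route starts working.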
So the end “fix” is one of two things – either dual attach “top switch,” or attach OOB2 directly to Leaf_1. Obviously the better thing to do would be to dual attach “top switch,” but in our case, since this was a temp thing, we just attached OOB2 to Leaf_1.
Now I’m fairly certain there was something else wrong with this fabric unfortunately, but I didn’t have time with the customer to dig any deeper and since it was a temporary thing it didn’t end up warranting too much effort to investigate. In any case, the point of this was just to walk through some troubleshooting that may seem a bit abnormal at first, but is really a lot of the same stuff we’ve been doing on “normal” Nexus gear for a while. If I’m being honest this was a while ago so some of the output and stuff has been made up a bit to help fill in the blanks in my memory, but all in all should be relatively close to real outputs :)