[c-nsp] IOS Initial setup function & DHCP
Justin Shore
justin at justinshore.com
Thu Jan 8 14:24:48 EST 2009
Gert Doering wrote:
>> set off a major multi-hour outage for us.
>
> Ouch.
Definitely. To be clear, it wasn't the fault of the 7201. It just did
something at a point where I didn't think it would do anything.
The problem, as it turns out, is partly due to what I alluded to earlier:
VLAN 1. I connected the 7201 to an interface on each of our 7600s. I
rigged up the console port so I could get into it remotely and work on
it from home. One of the ports was shut. The other port was unshut and
had previously been used to stage a CMTS. I had removed the access vlan
from that port so all it had was switchport and mode access. That put
the int in VLAN 1. I never use VLAN 1. On devices that create a VLAN 1
SVI I also shut it and label it with "DO NOT USE". I never gave it a
second thought when I plugged in that router though. I expected all its
interfaces to be down; I didn't anticipate some autoinstall process to
run DHCP. Even if it did do that and I expected it I wouldn't have
thought it would be a problem.
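To make the port state above concrete, here is a rough sketch of a check for enabled switchports that silently fall back to VLAN 1. It is only an illustration: the function name, the interface names, and VLAN 200 are made up, and it assumes config text in the usual "show running-config" layout where each interface stanza ends with "!".

```python
def ports_defaulting_to_vlan1(config_text):
    """Return enabled access switchports that have no explicit access VLAN.

    Assumption: a port with "switchport" but no "switchport access vlan N"
    line, not trunking and not shut down, ends up in VLAN 1.
    """
    flagged = []
    name, lines = None, []

    def check(name, lines):
        if name is None:
            return
        is_switchport = "switchport" in lines
        has_access_vlan = any(l.startswith("switchport access vlan ") for l in lines)
        is_trunk = "switchport mode trunk" in lines
        is_shut = "shutdown" in lines
        if is_switchport and not has_access_vlan and not is_trunk and not is_shut:
            flagged.append(name)

    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            check(name, lines)                      # close the previous stanza
            name, lines = line.split(None, 1)[1], []
        elif line == "!":
            check(name, lines)                      # "!" ends the stanza
            name, lines = None, []
        elif name is not None and line:
            lines.append(line)
    check(name, lines)                              # stanza without a trailing "!"
    return flagged


# Hypothetical sample: only the first port matches the situation described.
SAMPLE = """\
interface GigabitEthernet1/1
 switchport
 switchport mode access
!
interface GigabitEthernet1/2
 switchport
 switchport mode access
 shutdown
!
interface GigabitEthernet1/3
 switchport
 switchport access vlan 200
 switchport mode access
!
"""

print(ports_defaulting_to_vlan1(SAMPLE))
```

Run against a saved config, this would have flagged the unshut staging port before anything was plugged into it.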
The 7600s are connected via a trunk with no VLAN restrictions. In each
7600 is an SSC-400 with a 2G IPSec SPA. A VPN SPA essentially. It's
running in VRF mode. I talked about this on the list once before. A
Cisco AS SME came out to help with the initial config of the SPA and
other special SMs in the 7600s. He configured a list of VLANs on both
virtual interfaces of the SPAs. One int is the encrypted outside and
the other int is the unencrypted inside. The 2 ints configure like 1Q
trunks with allowed VLANs. He configured them manually but as it turns
out in VRF Mode they are self-configuring. His major goof was that he
configured the same VLANs on both virtual interfaces. Packets were
recirculated by the IPSec SPA as fast as it could process them. That
didn't hurt the chassis per se. It created a lot of packet loss on
client VPN sessions but the RP didn't get hit so I never noticed it.
Then one day I turned up a new SVI with HSRP configured that happened to
be in that list of VLANs. As soon as I did that the CPU went to 100%
and the RP got bogged down. It ultimately crashed the RP. I did a
packet capture of the IPSec SPA ints during that time and was getting
close to 1M pps of the same HSRP hello packet. The RP has to process
those which is what killed the RP. We removed all the crypto config,
rebooted and put it all back in there w/o modifying the 2 virtual
interfaces and that fixed the problem.
Last night's problem stemmed from that initial fix. When the IOS
configures the 2 IPSec SPA virtual ints in VRF Mode it also includes
some default VLANs, namely 1002-1005 and 1. It includes those VLANs on
both virtual interfaces. Why, I'm not sure. You shouldn't ever have
the same VLANs on both ints at the same time. 1002-1005 doesn't matter;
few people will run into those being used today. VLAN 1 is a problem
though. While I didn't intend to use VLAN 1 it got used nonetheless.
The DHCP DISCOVER from the new 7201 I connected is the packet that was
being recirculated indefinitely by the VPN SPAs. Each 7600 thought that
the source MAC on the DHCP packet belonged to the other 7600 and punted
the packet across the trunk, where it was flooded out all ports in
VLAN 1 on that 7600, which included both sides of the SPA. Rinse and
repeat. The port-channel between the 7600s was overwhelmed as a result
and had massive output drops on both sides.
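The rule of thumb above (no VLAN allowed on both SPA virtual interfaces at once) can be checked mechanically. The sketch below is only an illustration: the function names are made up, the VLAN lists are typed in by hand as strings in the familiar "1,1002-1005" style, and VLANs 100 and 200 are hypothetical.

```python
def expand_vlans(spec):
    """Expand a string like '1,1002-1005' into a set of VLAN IDs."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            vlans.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            vlans.add(int(part))
    return vlans


def shared_vlans(inside_spec, outside_spec):
    """Return the sorted VLAN IDs allowed on both interfaces."""
    return sorted(expand_vlans(inside_spec) & expand_vlans(outside_spec))


# Hypothetical allowed-VLAN lists for the inside and outside virtual ints.
print(shared_vlans("1,100,1002-1005", "1,200,1002-1005"))
# prints [1, 1002, 1003, 1004, 1005]
```

Anything in the output is a VLAN that can be recirculated between the two sides; here that is exactly the default set of 1 and 1002-1005.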
The default config presents a fairly easy way to cause this problem. My
questions for the Cisco.com people lurking on c-nsp are:
1) is there any technical reason why VLAN 1 should be allowed on the
IPSec SPA at all or at least on both virtual interfaces?
2) is there any way to remove VLAN 1 from the virtual interfaces without
pissing off the IOS process that auto-configures those 2 interfaces?
#2 worries me. Our TAC engineer told us that if we alter the virtual
ints' config at all that the auto-config process would break. I can't
think of any reason why VLAN 1 should be allowed on the IPSec SPA at
all. I definitely can't think of any reason why it should be allowed on
both virtual interfaces. That's just setting the system up for failure.
So that's what happened last night. We were down for 2 full hours. The
packet loss caused the firewalls in front of our class5 phone switch to
freak out and fight over who was the master (keepalive packets were
dropped, so each thought the other was dead). It did the same thing to the
FWSMs in the 7600s. The outage took out almost all voice for the entire
telco.
Any suggestions on how to fix this? I won't leave a switchport at the
default of VLAN 1 again but that's a minor thing that set off a config
problem. How do I address the misconfiguration that the auto-config
does on the IPSec SPA ports?
Thanks to Gert and Tony for replying earlier.
Justin