[c-nsp] IOS Initial setup function & DHCP

Justin Shore justin at justinshore.com
Thu Jan 8 14:24:48 EST 2009


Gert Doering wrote:
>> set off a major multi-hour outage for us.
> 
> Ouch.

Definitely.  To be clear, it wasn't the fault of the 7201.  It just did 
something at a point where I didn't think it would do anything.

The problem as it turns out is partly due to what I alluded to earlier: 
VLAN 1.  I connected the 7201 to an interface on each of our 7600s.  I 
rigged up the console port so I could get into it remotely and work on 
it from home.  One of the ports was shut.  The other port was unshut and 
had previously been used to stage a CMTS.  I had removed the access vlan 
from that port so all it had was switchport and mode access.  That put 
the int in VLAN 1.  I never use VLAN 1.  On devices that create a VLAN 1 
SVI I also shut it and label it with "DO NOT USE".  I never gave it a 
second thought when I plugged in that router though.  I expected all its 
interfaces to be down; I didn't anticipate some autoinstall process to 
run DHCP.  Even if it did do that and I expected it I wouldn't have 
thought it would be a problem.
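
A port like the one described above can be spotted ahead of time by 
scanning a saved config.  What follows is only a minimal sketch: the 
function name and the sample config are hypothetical, and it assumes the 
usual IOS layout of an unindented "interface" line followed by indented 
sub-commands.  It flags L2 ports that are not trunks, are not shut, and 
have no explicit "switchport access vlan" line.

```python
def ports_left_in_vlan1(config_text):
    """Return interfaces that would fall back to the default access VLAN."""
    blocks = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if raw.startswith("interface "):
            # Start of a new interface block.
            current = raw.split(None, 1)[1].strip()
            blocks[current] = []
        elif raw.startswith("!") or (raw and not raw[0].isspace()):
            # A separator or any other top-level command ends the block.
            current = None
        elif current is not None and line:
            blocks[current].append(line)

    flagged = []
    for name, lines in blocks.items():
        is_l2 = any(l == "switchport" or l.startswith("switchport ") for l in lines)
        is_trunk = "switchport mode trunk" in lines
        has_vlan = any(l.startswith("switchport access vlan ") for l in lines)
        is_shut = "shutdown" in lines
        if is_l2 and not is_trunk and not has_vlan and not is_shut:
            flagged.append(name)
    return flagged


# Hypothetical sample config, not taken from the real 7600s.
SAMPLE = """\
interface GigabitEthernet1/1
 switchport
 switchport mode access
!
interface GigabitEthernet1/2
 switchport
 switchport mode access
 shutdown
!
interface GigabitEthernet1/3
 switchport
 switchport access vlan 210
 switchport mode access
!
"""

print(ports_left_in_vlan1(SAMPLE))
```

With the sample above only GigabitEthernet1/1 is reported: 1/2 is shut 
and 1/3 has an explicit access VLAN.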

The 7600s are connected via a trunk with no VLAN restrictions.  In each 
7600 is an SSC-400 with a 2G IPSec SPA.  A VPN SPA essentially.  It's 
running in VRF mode.  I talked about this on the list once before.  A 
Cisco AS SME came out to help with the initial config of the SPA and 
other special SMs in the 7600s.  He configured a list of VLANs on both 
virtual interfaces of the SPAs.  One int is the encrypted outside and 
the other int is the unencrypted inside.  The 2 ints are configured like 
802.1Q trunks with allowed VLANs.  He configured them manually but as it turns 
out in VRF Mode they are self-configuring.  His major goof was that he 
configured the same VLANs on both virtual interfaces.  Packets were 
recirculated by the IPSec SPA as fast as it could process them.  That 
didn't hurt the chassis per se.  It created a lot of packet loss on 
client VPN sessions but the RP didn't get hit so I never noticed it. 
Then one day I turned up a new SVI with HSRP configured that happened to 
be in that list of VLANs.  As soon as I did that the CPU went to 100% 
and the RP got bogged down.  It ultimately crashed the RP.  I did a 
packet capture of the IPSec SPA ints during that time and was getting 
close to 1M pps of the same HSRP hello packet.  The RP has to process 
those, which is what killed it.  We removed all the crypto config, 
rebooted and put it all back in there w/o modifying the 2 virtual 
interfaces and that fixed the problem.
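
The condition behind that first incident, the same VLAN allowed on both 
virtual interfaces, is easy to check for mechanically.  This is a rough 
sketch only: the function names are made up, and the VLAN lists in the 
example are hypothetical values written in the usual "allowed vlan" 
list syntax, not the ones from the real config.

```python
def parse_vlan_list(spec):
    """Expand a list such as '100-110,205' into a set of VLAN IDs."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            vlans.update(range(int(lo), int(hi) + 1))
        else:
            vlans.add(int(part))
    return vlans


def shared_vlans(inside_spec, outside_spec):
    """Return the sorted VLAN IDs allowed on both virtual interfaces."""
    return sorted(parse_vlan_list(inside_spec) & parse_vlan_list(outside_spec))


# Hypothetical allowed-VLAN lists for the inside and outside ints.
print(shared_vlans("100-103,205", "102-104,300"))
```

Any non-empty result means a frame in one of those VLANs can reach both 
sides of the SPA.  For the hypothetical lists above the overlap is 
VLANs 102 and 103.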

Last night's problem stemmed from that initial fix.  When the IOS 
configures the 2 IPSec SPA virtual ints in VRF Mode it also includes 
some default VLANs, namely 1002-1005 and 1.  It includes those VLANs on 
both virtual interfaces.  Why, I'm not sure.  You shouldn't ever have 
the same VLANs on both ints at the same time.  1002-1005 doesn't matter; 
few people will run into those being used today.  VLAN 1 is a problem 
though.  While I didn't intend to use VLAN 1 it got used nonetheless. 
The DHCP DISCOVER from the new 7201 I connected is the packet that was 
being recirculated indefinitely by the VPN SPAs.  Each 7600 thought that 
the source MAC on the DHCP packet belonged to the other 7600 and punted the 
packet across the trunk, where it was flooded out all ports associated with 
VLAN 1 on that 7600, which included both sides of the SPA.  Rinse and 
repeat.  The port-channel between the 7600s was overwhelmed as a result 
and had massive output drops on both sides.
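
The "rinse and repeat" behaviour can be illustrated with a toy model.  
This is a sketch under loud assumptions, not a description of how the 
SPA really forwards: it treats each SPA as if a frame sent to one 
virtual port simply comes back in on the other, it ignores MAC learning 
(a DHCP DISCOVER is a broadcast, so it is flooded anyway), and it 
ignores line rate.  All port names are hypothetical.

```python
def simulate(rounds, vlan_on_both_spa_ports=True):
    """Count frames crossing the inter-chassis trunk in each round."""
    # Ports in VLAN 1 on each chassis; "host" faces the new router.
    ports = {
        "A": ["host", "trunk", "spa_in", "spa_out"],
        "B": ["trunk", "spa_in", "spa_out"],
    }
    if not vlan_on_both_spa_ports:
        # Control case: VLAN 1 exists on only one side of each SPA.
        for sw in ports:
            ports[sw].remove("spa_out")

    # Where a frame lands after leaving (switch, port).
    # ASSUMPTION: the SPA hands a frame from one virtual port to the other.
    links = {
        ("A", "trunk"): ("B", "trunk"),
        ("B", "trunk"): ("A", "trunk"),
        ("A", "spa_in"): ("A", "spa_out"),
        ("A", "spa_out"): ("A", "spa_in"),
        ("B", "spa_in"): ("B", "spa_out"),
        ("B", "spa_out"): ("B", "spa_in"),
    }

    arrivals = [("A", "host")]  # one broadcast frame from the new router
    trunk_frames = []
    for _ in range(rounds):
        next_arrivals = []
        crossed = 0
        for sw, ingress in arrivals:
            if ingress not in ports[sw]:
                continue  # ingress port is not in the VLAN: frame dropped
            for egress in ports[sw]:
                if egress == ingress:
                    continue  # never flood back out the ingress port
                if egress == "trunk":
                    crossed += 1
                dest = links.get((sw, egress))
                if dest is not None:
                    next_arrivals.append(dest)
        arrivals = next_arrivals
        trunk_frames.append(crossed)
    return trunk_frames


print(simulate(3))         # VLAN 1 on both SPA ports
print(simulate(4, False))  # VLAN 1 on one SPA port only
```

In this model a single broadcast puts 1, 2, then 4 frames on the trunk 
in the first three rounds and keeps growing, while in the control case 
the frame crosses the trunk once and then dies out.  Real hardware is 
bounded by line rate rather than growing without limit, which fits the 
output drops seen on the port-channel.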


The default config presents a fairly easy way to cause this problem.  My 
questions for the Cisco.com people lurking on c-nsp are:

1) is there any technical reason why VLAN 1 should be allowed on the 
IPSec SPA at all or at least on both virtual interfaces?

2) is there any way to remove VLAN 1 from the virtual interfaces without 
pissing off the IOS process that auto-configures those 2 interfaces?

#2 worries me.  Our TAC engineer told us that if we alter the virtual 
ints' config at all that the auto-config process would break.  I can't 
think of any reason why VLAN 1 should be allowed on the IPSec SPA at 
all.  I definitely can't think of any reason why it should be allowed on 
both virtual interfaces.  That's just setting the system up for failure.

So that's what happened last night.  We were down for 2 full hours.  The 
packet loss caused the firewalls in front of our class5 phone switch to 
freak out and fight over who was the master (keepalive packets were 
dropped and each thought the other was dead).  It did the same thing to the 
FWSMs in the 7600s.  The outage took out almost all voice for the entire 
telco.

Any suggestions on how to fix this?  I won't leave a switchport at the 
default of VLAN 1 again but that's a minor thing that set off a config 
problem.  How do I address the misconfiguration that the auto-config 
does on the IPSec SPA ports?

Thanks to Gert and Tony for replying earlier.

Justin



