[c-nsp] 7206-VXR/NPE-300 memory problems

Thu Feb 2 06:19:57 EST 2006

Hello all,

First off, I'll admit I'm a bit rusty on the Cisco front; I take care
of less gear and end up with more major unix emergencies than
networking emergencies these days, so bear with me...

I have a 7206-VXR with an NPE-300 maxed out at 256MB of RAM.  Running
12.2.17a. Going by the Cisco docs, this is the absolute max the
processor card will take.  We take two full BGP feeds (about 170,000
prefixes each) and either memory is just getting tight due to the
ever-expanding route table, or I've got a nasty memory leak.

The first sign of this is that rancid starts sending me config
diffs that are missing a few thousand lines of the config.  Then I'll
look at the logs and see malloc failures from the SSH session rancid
uses to pull the config:

Jan 31 20:00:08 l3-router-lo-0 1052: Jan 31 20:00:06.919 EDT: %SYS-2-MALLOCFAIL: Memory allocation of 387048 bytes failed from 0x605151A0, alignment 0
Jan 31 20:00:08 l3-router-lo-0 1053: Pool: Processor  Free: 1362196  Cause: Memory fragmentation
Jan 31 20:00:08 l3-router-lo-0 1054: Alternate Pool: None  Free: 0  Cause: No Alternate pool
Jan 31 20:00:08 l3-router-lo-0 1055:
Jan 31 20:00:08 l3-router-lo-0 1056: -Process= "SSH Process", ipl= 0, pid= 137
Jan 31 20:00:08 l3-router-lo-0 1057: -Traceback= 605B2450 605B40EC 605151A8 60540778 60540864 60535744 605436A0 6179A4F4 6179AC14 605A7644 605A7630
Jan 31 20:00:38 l3-router-lo-0 1058: Jan 31 20:00:37.679 EDT: %SYS-2-MALLOCFAIL: Memory allocation of 387048 bytes failed from 0x605151A0, alignment 0
Jan 31 20:00:38 l3-router-lo-0 1059: Pool: Processor  Free: 1360192  Cause: Memory fragmentation
Jan 31 20:00:38 l3-router-lo-0 1060: Alternate Pool: None  Free: 0  Cause: No Alternate pool
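
As a rough way to read these messages, the sketch below pulls the requested size and the free byte count out of a MALLOCFAIL entry and compares them. The regular expressions and the helper name `summarize_mallocfail` are assumptions based only on the sample lines above, not a documented log format.

```python
import re

# Patterns assumed from the sample log lines only (hypothetical, not an
# official description of the message format).
ALLOC_RE = re.compile(r"Memory allocation of (\d+) bytes failed")
# The lookbehind skips the "Alternate Pool:" line so only the primary
# pool is reported.
POOL_RE = re.compile(r"(?<!Alternate )Pool: (\S+)\s+Free: (\d+)")

SAMPLE = (
    "%SYS-2-MALLOCFAIL: Memory allocation of 387048 bytes failed from "
    "0x605151A0, alignment 0\n"
    "Pool: Processor  Free: 1362196  Cause: Memory fragmentation\n"
    "Alternate Pool: None  Free: 0  Cause: No Alternate pool\n"
)


def summarize_mallocfail(text):
    """Return a list of (requested_bytes, pool_name, free_bytes) tuples."""
    requested = [int(m.group(1)) for m in ALLOC_RE.finditer(text)]
    pools = [(m.group(1), int(m.group(2))) for m in POOL_RE.finditer(text)]
    # Pair each failed request with the pool line that follows it.
    return [(req, name, free) for req, (name, free) in zip(requested, pools)]


for req, name, free in summarize_mallocfail(SAMPLE):
    # Free memory exceeding the request means no single contiguous block
    # was large enough, which matches the reported cause (fragmentation).
    verdict = "fragmented" if free >= req else "exhausted"
    print(f"{name}: wanted {req} bytes, {free} free -> {verdict}")
```

In both entries above the pool had more free bytes than the request needed, which is consistent with the "Memory fragmentation" cause the router reports rather than with the pool being completely empty.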

To "fix" this, I console in, take both transit links down, turn off
CEF, and then save the config and verify there are no malloc failures.
If all is well, a reload takes care of things for 8 months or so.

Tonight, the issue crept right back up within hours.  One transit
provider bounced their BGP session, and right away rancid was
reporting changed config lines again and the malloc failures were back
in the log.

Some stats follow.  I'm new to gmail, so I'm hoping that this displays
properly in a fixed-width font...

sh mem:

                Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor   62861FE0   192536608   162616484    29920124      122200      956688
      I/O   20000000    33554480      550192    33004288    33004288    33001052
    I/O-2    E000000    33554440     4171032    29383408    29383408    29383356

sh proc memory sorted:

Total: 192536608, Used: 162666684, Free: 29869924
 PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
 143   0  123282800    4292240  108104816          0          0 BGP Router
   0   0     163292       1848   40975948          0          0 *Init*
  43   0   88769800   25537696    8184684      74832          0 IP Input
   0   0  269744272  259355632    1779388     190832          0 *Dead*
  40   0     660260        392     667908       4536          0 ATM PA Helper
 117   0     445700          0     506528          0          0 CCPROXY_CT
  53   0     473188        244     408600          0          0 IP Background
  38   0     104244          0     114072          0          0 ATM Periodic
   6   0     358400     124332      97180     188452      76608 Pool Manager
  90   0      65800          0      90628          0          0 QOS_MODULE_MAIN
   4   2   29611104   42749844      70492          0          0 SSH Process
  96   0     118688          0      57860          0          0 Proxy Session Ap
  16   0     260364    4690700      57828      24960          0 ARP Input
  72   0    3681844    3621232      51192          0          0 DHCPD Receive
  89   0      21060          0      45888          0          0 TSP
  86   3     710580     679768      43696          0          0 SSH Process

So to me, it looks like there are two things of interest.  I have
192MB of processor memory left after subtracting the I/O memory; BGP
is holding 108MB of that, and "dead" processes are holding about
1.8MB (1779388 bytes).  I assume "init" (41MB) is basically the IOS
kernel itself.
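
To put those numbers side by side, here is a small sketch that turns the Holding column into shares of the processor pool. The figures are copied from the "sh mem" and "sh proc memory sorted" output above; the names `HOLDING` and `share` are just illustrative.

```python
# Processor pool figures copied from the "sh mem" output (bytes).
TOTAL = 192_536_608
FREE = 29_920_124
LARGEST = 956_688  # largest contiguous free block

# "Holding" column for the biggest entries in "sh proc memory sorted".
HOLDING = {
    "BGP Router": 108_104_816,
    "*Init*": 40_975_948,
    "IP Input": 8_184_684,
    "*Dead*": 1_779_388,
}


def share(held, total=TOTAL):
    """Percentage of the processor pool held by one entry."""
    return 100.0 * held / total


for name, held in HOLDING.items():
    print(f"{name:<12} {held / 1e6:7.1f} MB  {share(held):5.1f}% of pool")

# A largest free block that is tiny relative to total free memory is one
# rough indicator of fragmentation.
print(f"largest free block: {100.0 * LARGEST / FREE:.1f}% of free memory")
```

The last line is the interesting one for the malloc failures: even with roughly 30MB free shortly after a reload, the largest contiguous block is under 1MB.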

The amount of dead memory seems a bit disconcerting considering this
thing has only been up for less than 12 hours now.  BGP seems somewhat
huge and does not agree with the memory usage reported in "sh ip bgp
sum", but I don't know enough to say whether that's normal or not.

In the short term, what are my options to free up memory?  If
something's leaky, is there a decent 12.2 release that's going to
serve me much better?  Can I do anything with BGP to make it less of a
pig?

Some ideas I had included just taking customer routes from one
provider, and ignoring any routes more specific than a /24.  I did try
the customer route option and while it did bring me down to 16K routes
on that transit connection, it did not free up any memory on an
inbound soft reconfig for that neighbor.
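
For clarity on the second idea, the match logic for "nothing more specific than a /24" can be sketched as below. This is only an illustration in Python of what the filter would accept and reject (on the router it would be an inbound prefix-list along the lines of `permit 0.0.0.0/0 le 24`); the function name `keep_prefix` and the sample prefixes are made up for the example.

```python
import ipaddress


def keep_prefix(prefix, max_len=24):
    """Accept a route only if its mask is no longer than max_len bits."""
    return ipaddress.ip_network(prefix).prefixlen <= max_len


# Hypothetical sample routes using documentation/private address space.
sample = [
    "10.0.0.0/8",
    "192.0.2.0/24",
    "198.51.100.128/25",
    "203.0.113.64/26",
]

kept = [p for p in sample if keep_prefix(p)]
dropped = [p for p in sample if not keep_prefix(p)]

print("kept:   ", kept)
print("dropped:", dropped)
```

Only the /25 and /26 are dropped here, so how much memory this would save depends on how many longer-than-/24 prefixes the providers actually send.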

At this point, I don't desperately need full routes - we try to steer
as much traffic as possible towards one provider and use the other
provider as backup.  Is there any clever way to take far fewer routes
from both transit providers and still have automatic failover and
full connectivity?

Lastly, I'm trying to push for some new hardware.  I'd like to
relegate this router to just aggregating DSL and T1 lines and put a
72xx with either an NPE-400 (max 512MB RAM) or an NPE-G1 (max 1GB RAM)
in front of it.  This would give me the ability to limp along if
either failed and give us some actual survivability in smaller DoS
attacks (pipes not full, but high pps).  The NPE-400 still seems a bit
anemic to me though.  And if anyone has a used/refurb dealer that they
prefer, please contact me off-list.

Sorry for the rather basic questions, but I want to stay away from
those "enterprise" folks on comp.dcom.cisco for the time being... 
Technically we are a (small) NSP. :)

Thanks,

Charles