[c-nsp] 7206-VXR/NPE-300 memory problems
Charles Sporkman
spork.sporkman at gmail.com
Thu Feb 2 06:19:57 EST 2006
Hello all,
First off, I'll admit I'm a bit rusty on the Cisco front; I take care
of less gear and end up with more major unix emergencies than
networking emergencies these days, so bear with me...
I have a 7206-VXR with an NPE-300 maxed out at 256MB of RAM. Running
12.2.17a. Going by the Cisco docs, this is the absolute max the
processor card will take. We take two full BGP feeds (about 170,000
prefixes each) and either memory is just getting tight due to the
ever-expanding route table, or I've got a nasty memory leak.
The first sign of this is that rancid will start sending me config
diffs that are missing a few thousand lines of the config. Then I'll
look at the logs and see malloc failures from rancid trying to write
out the config:
Jan 31 20:00:08 l3-router-lo-0 1052: Jan 31 20:00:06.919 EDT: %SYS-2-MALLOCFAIL: Memory allocation of 387048 bytes failed from 0x605151A0, alignment 0
Jan 31 20:00:08 l3-router-lo-0 1053: Pool: Processor Free: 1362196 Cause: Memory fragmentation
Jan 31 20:00:08 l3-router-lo-0 1054: Alternate Pool: None Free: 0 Cause: No Alternate pool
Jan 31 20:00:08 l3-router-lo-0 1055:
Jan 31 20:00:08 l3-router-lo-0 1056: -Process= "SSH Process", ipl= 0, pid= 137
Jan 31 20:00:08 l3-router-lo-0 1057: -Traceback= 605B2450 605B40EC 605151A8 60540778 60540864 60535744 605436A0 6179A4F4 6179AC14 605A7644 605A7630
Jan 31 20:00:38 l3-router-lo-0 1058: Jan 31 20:00:37.679 EDT: %SYS-2-MALLOCFAIL: Memory allocation of 387048 bytes failed from 0x605151A0, alignment 0
Jan 31 20:00:38 l3-router-lo-0 1059: Pool: Processor Free: 1360192 Cause: Memory fragmentation
Jan 31 20:00:38 l3-router-lo-0 1060: Alternate Pool: None Free: 0 Cause: No Alternate pool
To "fix" this, I console in and take both transit links down, turn off
cef, and then save the config and verify there are no malloc failures.
If all is well, a reload takes care of things for 8 months or so.
Tonight, the issue crept right back up within hours. One transit
provider bounced their bgp session and right away I was seeing rancid
report config lines changed again and then saw the malloc failures in
the log.
Some stats follow. I'm new to gmail, so I'm hoping that this displays
properly in a fixed-width font...
sh mem:
              Head   Total(b)    Used(b)    Free(b)  Lowest(b) Largest(b)
Processor 62861FE0  192536608  162616484   29920124     122200     956688
      I/O 20000000   33554480     550192   33004288   33004288   33001052
    I/O-2  E000000   33554440    4171032   29383408   29383408   29383356
sh proc memory sorted:
Total: 192536608, Used: 162666684, Free: 29869924
 PID TTY  Allocated      Freed    Holding  Getbufs  Retbufs Process
 143   0  123282800    4292240  108104816        0        0 BGP Router
   0   0     163292       1848   40975948        0        0 *Init*
  43   0   88769800   25537696    8184684    74832        0 IP Input
   0   0  269744272  259355632    1779388   190832        0 *Dead*
  40   0     660260        392     667908     4536        0 ATM PA Helper
 117   0     445700          0     506528        0        0 CCPROXY_CT
  53   0     473188        244     408600        0        0 IP Background
  38   0     104244          0     114072        0        0 ATM Periodic
   6   0     358400     124332      97180   188452    76608 Pool Manager
  90   0      65800          0      90628        0        0 QOS_MODULE_MAIN
   4   2   29611104   42749844      70492        0        0 SSH Process
  96   0     118688          0      57860        0        0 Proxy Session Ap
  16   0     260364    4690700      57828    24960        0 ARP Input
  72   0    3681844    3621232      51192        0        0 DHCPD Receive
  89   0      21060          0      45888        0        0 TSP
  86   3     710580     679768      43696        0        0 SSH Process
So to me, it looks like there are two things of interest. I have
192MB of RAM left after subtracting the I/O memory, BGP is sucking
down 108MB of that and "dead" processes are holding about 1.8MB
(1779388 bytes in the Holding column). I assume "init" (41MB) is
basically the IOS kernel itself.
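As a rough sanity check on that arithmetic, the Holding column can be
converted from bytes to decimal megabytes and to a share of the
processor pool. This is only a minimal Python sketch: the byte counts
are copied from the "sh proc memory sorted" output above, while the
names HOLDING, PROCESSOR_TOTAL, to_mb and share are made up for
illustration.

```python
# Holding values (bytes) copied from the "sh proc memory sorted" output.
HOLDING = {
    "BGP Router": 108104816,
    "*Init*": 40975948,
    "IP Input": 8184684,
    "*Dead*": 1779388,
}
PROCESSOR_TOTAL = 192536608  # "Total:" line of the same output


def to_mb(num_bytes):
    """Convert bytes to decimal megabytes (10**6 bytes), one decimal place."""
    return round(num_bytes / 1e6, 1)


def share(num_bytes, total=PROCESSOR_TOTAL):
    """Percentage of the processor pool held, one decimal place."""
    return round(100.0 * num_bytes / total, 1)


# One line per process: name, megabytes held, share of the pool.
for name, held in HOLDING.items():
    print(f"{name:12s} {to_mb(held):6.1f} MB  {share(held):5.1f}%")
```

Megabytes here are decimal (10**6 bytes), which is the convention the
"192MB" and "108MB" figures in the text appear to follow.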
The amount of dead memory seems a bit disconcerting considering this
thing has only been up for less than 12 hours now. BGP seems somewhat
huge and does not agree with the memory usage reported in "sh ip bgp
sum", but I don't know enough to say whether that's normal or not.
In the short term, what are my options to free up memory? If
something's leaky, is there a decent 12.2 release that's going to
serve me much better? Can I do anything with BGP to make it less of a
pig?
Some ideas I had included just taking customer routes from one
provider, and ignoring any routes more specific than a /24. I did try
the customer route option and while it did bring me down to 16K routes
on that transit connection, it did not free up any memory on an
inbound soft reconfig for that neighbor.
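To make the "nothing more specific than a /24" idea concrete, here is
a small Python sketch of the selection rule only. It is not router
configuration and says nothing about how IOS would account for the
memory; the sample prefixes are hypothetical (documentation and
private ranges) and keep_route is an illustrative name.

```python
import ipaddress

MAX_PREFIX_LEN = 24  # drop anything more specific than a /24


def keep_route(prefix, max_len=MAX_PREFIX_LEN):
    """Return True if the IPv4 prefix is a /24 or shorter."""
    return ipaddress.ip_network(prefix).prefixlen <= max_len


# Hypothetical sample prefixes, not real routes from either feed.
sample = ["10.0.0.0/8", "192.0.2.0/24", "198.51.100.0/25", "203.0.113.128/26"]
kept = [p for p in sample if keep_route(p)]
print(kept)
```

With this rule the /8 and the /24 are kept, and the /25 and /26 are
dropped.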
At this point, I don't desperately need full routes - we try to steer
as much traffic as possible towards one provider and use the other
provider as backup. Is there any clever way to take far fewer routes
from both transit providers and still have automatic failover and
full connectivity?
Lastly, I'm trying to push for some new hardware. I'd like to
relegate this router to just aggregating DSL and T1 lines and put a
72xx with either an NPE-400 (max 512MB RAM) or an NPE-G1 (max 1GB RAM)
in front of it. This would give me the ability to limp along if
either failed and give us some actual survivability in smaller DoS
attacks (pipes not full, but high pps). The NPE-400 still seems a bit
anemic to me though. And if anyone has a used/refurb dealer that they
prefer, please contact me off-list.
Sorry for the rather basic questions, but I want to stay away from
those "enterprise" folks on comp.dcom.cisco for the time being...
Technically we are a (small) NSP. :)
Thanks,
Charles