[c-nsp] 12.1(5)T19 + quagga + md5 = crash
Jon Lewis
jlewis at lewis.org
Thu Aug 19 11:45:11 EDT 2004
Has anyone else run into this? I've duplicated it twice now to be sure it
was the cause. When we bring up a customer BGP session on a 7206 running
c7200-js-mz.121-5.T19.bin, and the customer is running quagga (derived
from zebra) on Linux at their end of the T1, the 7206 crashes as the
session comes up.
When the session was activated initially, our end was configured for md5,
but I don't think the customer's was (yet). I got the following messages:
Aug 19 09:21:42 gsvlflma-br-1 594588: Aug 19 09:21:42: %TCP-6-BADAUTH: No
MD5 digest from 209.208.x.y:179 to 209.208.x.z:63234 (RST)
Aug 19 09:21:43 gsvlflma-br-1 594589: Aug 19 09:21:42: %SYS-2-BADSHARE:
Bad refcount in datagram_done, ptr=62CDB208, count=0
Aug 19 09:21:43 gsvlflma-br-1 594590: -Traceback= 605E35FC 606F9DF8
606FB210
6073C104 60720710 6071E598 6071E6C4 6071E850 60620664 60620650
Aug 19 09:22:00 gsvlflma-br-1 594593: Aug 19 09:21:58: %TCP-6-BADAUTH: No
MD5 digest from 209.208.x.y:179 to 209.208.x.z:63234 (RST)
Aug 19 09:22:00 gsvlflma-br-1 594594: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pool_getbuffer, ptr=62674870, count=1
Aug 19 09:22:00 gsvlflma-br-1 594595: -Traceback= 605E26C0 606A3254
6041F070 6042A6A8
Aug 19 09:22:00 gsvlflma-br-1 594596: Aug 19 09:21:58: %SYS-2-LINKED: Bad
enqueue of 6266FB00 in queue 62C4D430
Aug 19 09:22:00 gsvlflma-br-1 594597: -Process= "IP Input", ipl= 4, pid=
124
Aug 19 09:22:00 gsvlflma-br-1 594598: -Traceback= 6063C6A4 60639DA0
605F9624
6071D204 605F9564 6132D948 61300B24 61300BB4 612F6CB8 606F7430 60720710
6071E598 6071E6C4 6071E850 60620664 60620650
Aug 19 09:22:00 gsvlflma-br-1 594599: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pool_getbuffer, ptr=6266FB00, count=1
Aug 19 09:22:00 gsvlflma-br-1 594600: -Traceback= 605E26C0 606A3254
6041F070 6042A6A8
Aug 19 09:22:00 gsvlflma-br-1 594601: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pool_getbuffer, ptr=62673250, count=1
Aug 19 09:22:00 gsvlflma-br-1 594602: -Traceback= 605E26C0 606A3254
6041F070 6042A6A8
Aug 19 09:22:00 gsvlflma-br-1 594603: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pak_enqueue, ptr=6266EA68, count=0
[lots more, I won't bother pasting them all]
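For anyone wading through a similar flood of messages, a tally by message code can stand in for pasting them all. The following is only a minimal sketch: it assumes each syslog entry has been rejoined onto a single line (the entries above are wrapped by the mail client), and the `tally` helper and its regular expression are illustrative names, not part of any Cisco tool.

```python
import re
from collections import Counter

# Matches the %FACILITY-SEVERITY-MNEMONIC: tag that IOS puts in each message.
MSG_RE = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")

def tally(lines):
    """Count log lines per FACILITY-SEVERITY-MNEMONIC code (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            counts["-".join(match.groups())] += 1
    return counts

# Sample entries taken from the log above, unwrapped to one line each.
sample = [
    "Aug 19 09:21:42 gsvlflma-br-1 594588: Aug 19 09:21:42: %TCP-6-BADAUTH: "
    "No MD5 digest from 209.208.x.y:179 to 209.208.x.z:63234 (RST)",
    "Aug 19 09:21:43 gsvlflma-br-1 594589: Aug 19 09:21:42: %SYS-2-BADSHARE: "
    "Bad refcount in datagram_done, ptr=62CDB208, count=0",
    "Aug 19 09:22:00 gsvlflma-br-1 594596: Aug 19 09:21:58: %SYS-2-LINKED: "
    "Bad enqueue of 6266FB00 in queue 62C4D430",
    "Aug 19 09:22:00 gsvlflma-br-1 594599: Aug 19 09:21:58: %SYS-2-BADSHARE: "
    "Bad refcount in pool_getbuffer, ptr=6266FB00, count=1",
]

counts = tally(sample)
for code, n in counts.most_common():
    print(f"{n:4d}  {code}")
```
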
Then the customer enabled md5 on their side and the session came up, but
the router still wasn't happy.
Aug 19 09:22:36 gsvlflma-br-1 595169: Aug 19 09:22:35: %BGP-5-ADJCHANGE:
neighbor 209.208.x.y Up
Aug 19 09:22:38 gsvlflma-br-1 595170: Aug 19 09:22:38: %ALIGN-3-SPURIOUS:
Spurious memory access made at 0x6071D2F4 reading 0x5C
Aug 19 09:22:38 gsvlflma-br-1 595171: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071D2F4 6071E6C4 6071E850 60620664 60620650 00000000 00000000
00000000
Aug 19 09:22:38 gsvlflma-br-1 595172: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071D2FC 6071E6C4 6071E850 60620664 60620650 00000000 00000000
00000000
Aug 19 09:22:38 gsvlflma-br-1 595173: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071906C 607205C8 6071E598 6071E6C4 6071E850 60620664 60620650
00000000
Aug 19 09:22:38 gsvlflma-br-1 595174: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071907C 607205C8 6071E598 6071E6C4 6071E850 60620664 60620650
00000000
[lots more tracebacks]
At this point, an ATM DS3 used for DSL aggregation ceased seeing packets
from the telco even though the circuit was still up/up.
We had to reboot to get the ATM DS3 back to normal working order.
shut/no shut on the ATM interface didn't help.
After the reboot, I tried unshutting this new BGP session again.
Aug 19 10:57:05 gsvlflma-br-1 993: Aug 19 10:57:04: %BGP-3-NOTIFICATION:
received from neighbor 209.208.x.y 2/7 (unsupported/disjoint capability)
6 bytes 01040001 0001
Aug 19 10:57:45 gsvlflma-br-1 998: Aug 19 10:57:45: %TCP-6-BADAUTH: No MD5
digest from 209.208.x.y:179 to 209.208.x.z:11057 (RST)
Aug 19 10:57:45 gsvlflma-br-1 999: Aug 19 10:57:45: %SYS-2-BADSHARE: Bad
refcount in datagram_done, ptr=62CC7B60, count=0
Aug 19 10:57:45 gsvlflma-br-1 1000: -Traceback= 605E35FC 606F9E18 606FB230
6073C18C 60720798 6071E5F0 6071E71C 6071E8A8 60620664 60620650
Aug 19 10:57:45 gsvlflma-br-1 1001: Aug 19 10:57:45: %TCP-6-BADAUTH: No
MD5 digest from 209.208.x.y:179 to 209.208.x.z:11057 (RST)
Aug 19 10:57:46 gsvlflma-br-1 1002: Aug 19 10:57:45: %SYS-2-BADSHARE: Bad
refcount in datagram_done, ptr=62CC80F8, count=0
At this point, the router became unresponsive and apparently reloaded
itself.
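As an aside on the NOTIFICATION logged at 10:57:04 above: error code 2 is an OPEN message error and subcode 7 is "unsupported capability", so the 6 data bytes should be a capability in code/length/value form. The sketch below decodes them on that assumption. The function names are made up for illustration, and the decoding says nothing about whether this exchange is related to the crash.

```python
import struct

def decode_capabilities(data: bytes):
    """Split a BGP capabilities field into (code, value) pairs.

    Each capability is encoded as: 1-byte code, 1-byte length, then value.
    """
    caps = []
    offset = 0
    while offset + 2 <= len(data):
        code, length = data[offset], data[offset + 1]
        value = data[offset + 2 : offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated capability")
        caps.append((code, value))
        offset += 2 + length
    return caps

def decode_mp_extensions(value: bytes):
    """Capability code 1 (multiprotocol extensions): AFI, reserved, SAFI."""
    afi, _reserved, safi = struct.unpack("!HBB", value)
    return afi, safi

# Data bytes exactly as logged: "01040001 0001"
notification_data = bytes.fromhex("01040001 0001".replace(" ", ""))

caps = decode_capabilities(notification_data)
code, value = caps[0]
afi, safi = decode_mp_extensions(value)
print(f"capability code={code} afi={afi} safi={safi}")
```

Read this way, the neighbor appears to be objecting to capability code 1 (multiprotocol extensions) for AFI 1 (IPv4), SAFI 1 (unicast).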
System returned to ROM by bus error at PC 0x612F7418, address 0xB5C at
10:57:47 EDT Thu Aug 19 2004
System restarted at 10:59:56 EDT Thu Aug 19 2004
System image file is "slot0:c7200-js-mz.121-5.T19.bin"
cisco 7206 (NPE225) processor (revision A) with 245760K/16384K bytes of
memory.
Processor board ID 23683557
R527x CPU at 262Mhz, Implementation 40, Rev 10.0, 2048KB L2 Cache
6 slot midplane, Version 1.3
Minimum process stacks:
Free/Size Name
5556/6000 CDP Protocol
11416/12000 Router Init
7616/12000 Init
5292/6000 RADIUS INITCONFIG
5712/6000 BGP Accepter
4848/6000 BGP Open
10824/12000 Exec
7572/9000 DHCP Client
9240/12000 Virtual Exec
Interrupt level stacks:
Level Called Unused/Size Name
1 10762102 7000/9000 Network interfaces
2 24990 8576/9000 DMA/Timer Interrupt
3 3 8100/9000 PA Management Int Handler
4 127404 8548/9000 Console Uart
5 0 9000/9000 OIR/Error Interrupt
7 520815 8604/9000 NMI Interrupt Handler
Spurious interrupts: 39
System was restarted by bus error at PC 0x612F7418, address 0xB5C at
10:57:47 EDT Thu Aug 19 2004
7200 Software (C7200-JS-M), Version 12.1(5)T19, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Compiled Tue 08-Jun-04 06:46 by ccai
Image text-base: 0x60008960, data-base: 0x616F8000
Stack trace from system failure:
FP: 0x62C3C7E8, RA: 0x612F7418
FP: 0x62C3C808, RA: 0x606F7450
FP: 0x62C3C840, RA: 0x60720798
FP: 0x62C3C860, RA: 0x6071E5F0
FP: 0x62C3C8D0, RA: 0x6071E71C
FP: 0x62C3C8F8, RA: 0x6071E8A8
FP: 0x62C3C930, RA: 0x60620664
FP: 0x62C3C948, RA: 0x60620650
Unfortunately, bootflash was full, so we got no crashinfo.
Obviously, this is an IOS problem, perhaps only with BGP md5 handling.
None of the other sessions on this router use md5. Unfortunately, this
seems to be the latest version in the 12.1T train. We tried an early
12.3M version on this router some months ago, but several things in the
handling of DSL aggregation broke and we quickly downgraded back to
12.1T.
If anyone's run into this before, do you know if this is a bug in quagga
(or the md5 patch it requires in the Linux kernel) tickling a bug in IOS?
Will simply removing the BGP password from this new session keep us from
crashing (even if the customer tries md5 for some reason)?
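For context on what the BADAUTH messages are checking: BGP md5 authentication rides in a TCP option (kind 19, length 18, per RFC 2385), and "No MD5 digest" means a segment arrived without it. Below is a rough sketch of that presence check against a raw TCP header. The `has_md5_option` and `build_header` helpers are hypothetical and written for illustration only; this does not verify the digest and is not how IOS or the Linux patch implements it.

```python
import struct

TCP_OPT_EOL = 0   # end of option list
TCP_OPT_NOP = 1   # no-operation (padding)
TCP_OPT_MD5 = 19  # MD5 signature option, total length 18

def has_md5_option(tcp_header: bytes) -> bool:
    """Return True if the TCP header carries a well-formed MD5 signature option."""
    data_offset = (tcp_header[12] >> 4) * 4   # header length in bytes
    options = tcp_header[20:data_offset]
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break                              # option length byte missing
        length = options[i + 1]
        if length < 2:
            break                              # malformed option
        if kind == TCP_OPT_MD5 and length == 18:
            return True
        i += length
    return False

def build_header(options: bytes = b"") -> bytes:
    """Build a bare TCP header for testing (ports borrowed from the log above)."""
    options += bytes(-len(options) % 4)        # pad options to a 32-bit boundary
    offset_words = (20 + len(options)) // 4
    fixed = struct.pack("!HHIIBBHHH", 179, 63234, 0, 0,
                        offset_words << 4, 0x04, 0, 0, 0)  # 0x04 = RST flag
    return fixed + options

signed = build_header(b"\x01\x01\x13\x12" + bytes(16))  # NOP, NOP, MD5 option
unsigned = build_header(b"\x02\x04\x05\xb4")            # MSS option only
print(has_md5_option(signed), has_md5_option(unsigned))
```
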
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________