[c-nsp] 12.1(5)T19 + quagga + md5 = crash

Jon Lewis jlewis at lewis.org
Thu Aug 19 11:45:11 EDT 2004


Has anyone else run into this?  I've duplicated it twice now to be sure of
the cause: when we bring up a customer BGP session on a 7206 running
c7200-js-mz.121-5.T19.bin, and the customer is running quagga (derived
from zebra) on Linux at their end of the T1, the 7206 crashes as the
session comes up.

When the session was activated initially, our end was configured for md5,
but I don't think the customer's was (yet).  I got the following messages:

Aug 19 09:21:42 gsvlflma-br-1 594588: Aug 19 09:21:42: %TCP-6-BADAUTH: No
MD5 digest from 209.208.x.y:179 to 209.208.x.z:63234 (RST)
Aug 19 09:21:43 gsvlflma-br-1 594589: Aug 19 09:21:42: %SYS-2-BADSHARE:
Bad refcount in datagram_done, ptr=62CDB208, count=0
Aug 19 09:21:43 gsvlflma-br-1 594590: -Traceback= 605E35FC 606F9DF8
606FB210
6073C104 60720710 6071E598 6071E6C4 6071E850 60620664 60620650
Aug 19 09:22:00 gsvlflma-br-1 594593: Aug 19 09:21:58: %TCP-6-BADAUTH: No
MD5 digest from 209.208.x.y:179 to 209.208.x.z:63234 (RST)
Aug 19 09:22:00 gsvlflma-br-1 594594: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pool_getbuffer, ptr=62674870, count=1
Aug 19 09:22:00 gsvlflma-br-1 594595: -Traceback= 605E26C0 606A3254
6041F070 6042A6A8
Aug 19 09:22:00 gsvlflma-br-1 594596: Aug 19 09:21:58: %SYS-2-LINKED: Bad
enqueue of 6266FB00 in queue 62C4D430
Aug 19 09:22:00 gsvlflma-br-1 594597: -Process= "IP Input", ipl= 4, pid=
124
Aug 19 09:22:00 gsvlflma-br-1 594598: -Traceback= 6063C6A4 60639DA0
605F9624
6071D204 605F9564 6132D948 61300B24 61300BB4 612F6CB8 606F7430 60720710
6071E598 6071E6C4 6071E850 60620664 60620650
Aug 19 09:22:00 gsvlflma-br-1 594599: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pool_getbuffer, ptr=6266FB00, count=1
Aug 19 09:22:00 gsvlflma-br-1 594600: -Traceback= 605E26C0 606A3254
6041F070 6042A6A8
Aug 19 09:22:00 gsvlflma-br-1 594601: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pool_getbuffer, ptr=62673250, count=1
Aug 19 09:22:00 gsvlflma-br-1 594602: -Traceback= 605E26C0 606A3254
6041F070 6042A6A8
Aug 19 09:22:00 gsvlflma-br-1 594603: Aug 19 09:21:58: %SYS-2-BADSHARE:
Bad refcount in pak_enqueue, ptr=6266EA68, count=0
[lots more, I won't bother pasting them all]
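
For readers unfamiliar with the %TCP-6-BADAUTH "No MD5 digest" message: it
indicates that a TCP segment arrived without the MD5 signature option
(option kind 19, length 18, per RFC 2385) on a connection where one was
expected.  The Python sketch below only illustrates that check against a
raw TCP header; it is not how IOS implements it.  The helper names
find_md5_digest and make_header are hypothetical, and the ports are simply
borrowed from the log above.

```python
import struct

TCP_OPT_EOL = 0      # end of option list
TCP_OPT_NOP = 1      # one-byte padding option
TCP_OPT_MD5SIG = 19  # TCP MD5 signature option (RFC 2385)
MD5SIG_LEN = 18      # kind + length + 16-byte digest


def find_md5_digest(tcp_header: bytes):
    """Return the 16-byte digest from a raw TCP header, or None if absent."""
    if len(tcp_header) < 20:
        raise ValueError("TCP header shorter than 20 bytes")
    header_len = (tcp_header[12] >> 4) * 4  # data offset is in 32-bit words
    if header_len < 20 or header_len > len(tcp_header):
        raise ValueError("bad data offset")
    options = tcp_header[20:header_len]
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option
        if kind == TCP_OPT_MD5SIG and length == MD5SIG_LEN:
            return options[i + 2:i + length]
        i += length
    return None


def make_header(options: bytes = b"") -> bytes:
    """Build a minimal SYN header for illustration (hypothetical helper)."""
    padded = options + b"\x00" * (-len(options) % 4)
    offset_words = 5 + len(padded) // 4
    fixed = struct.pack("!HHIIBBHHH", 179, 63234, 0, 0,
                        offset_words << 4, 0x02, 16384, 0, 0)
    return fixed + padded
```

A header built with make_header(b"") has no options, so find_md5_digest
returns None for it, which is the condition the router was logging while
only one side had the password configured.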

Then the customer enabled md5 on their side and the session came up, but
the router still wasn't happy.

Aug 19 09:22:36 gsvlflma-br-1 595169: Aug 19 09:22:35: %BGP-5-ADJCHANGE:
neighbor 209.208.x.y Up
Aug 19 09:22:38 gsvlflma-br-1 595170: Aug 19 09:22:38: %ALIGN-3-SPURIOUS:
Spurious memory access made at 0x6071D2F4 reading 0x5C
Aug 19 09:22:38 gsvlflma-br-1 595171: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071D2F4 6071E6C4 6071E850 60620664 60620650 00000000 00000000
00000000
Aug 19 09:22:38 gsvlflma-br-1 595172: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071D2FC 6071E6C4 6071E850 60620664 60620650 00000000 00000000
00000000
Aug 19 09:22:38 gsvlflma-br-1 595173: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071906C 607205C8 6071E598 6071E6C4 6071E850 60620664 60620650
00000000
Aug 19 09:22:38 gsvlflma-br-1 595174: Aug 19 09:22:38: %ALIGN-3-TRACE:
-Traceback= 6071907C 607205C8 6071E598 6071E6C4 6071E850 60620664 60620650
00000000
[lots more tracebacks]

At this point, an ATM DS3 used for DSL aggregation ceased seeing packets
from the telco even though the circuit was still up/up.

We had to reboot to get the ATM DS3 back to normal working order.
shut/no shut on the ATM interface didn't help.

After the reboot, I tried unshutting this new BGP session again.

Aug 19 10:57:05 gsvlflma-br-1 993: Aug 19 10:57:04: %BGP-3-NOTIFICATION:
received from neighbor 209.208.x.y 2/7 (unsupported/disjoint capability)
6 bytes 01040001 0001
Aug 19 10:57:45 gsvlflma-br-1 998: Aug 19 10:57:45: %TCP-6-BADAUTH: No MD5
digest from 209.208.x.y:179 to 209.208.x.z:11057 (RST)
Aug 19 10:57:45 gsvlflma-br-1 999: Aug 19 10:57:45: %SYS-2-BADSHARE: Bad
refcount in datagram_done, ptr=62CC7B60, count=0
Aug 19 10:57:45 gsvlflma-br-1 1000: -Traceback= 605E35FC 606F9E18 606FB230
6073C18C 60720798 6071E5F0 6071E71C 6071E8A8 60620664 60620650
Aug 19 10:57:45 gsvlflma-br-1 1001: Aug 19 10:57:45: %TCP-6-BADAUTH: No
MD5 digest from 209.208.x.y:179 to 209.208.x.z:11057 (RST)
Aug 19 10:57:46 gsvlflma-br-1 1002: Aug 19 10:57:45: %SYS-2-BADSHARE: Bad
refcount in datagram_done, ptr=62CC80F8, count=0
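
As an aside on the %BGP-3-NOTIFICATION line above: error code 2 is an OPEN
Message Error and subcode 7 is Unsupported Capability, and the 6 data
bytes carry the capability being objected to, encoded as code, length,
value.  The Python sketch below decodes those bytes, assuming the standard
layout of the multiprotocol extensions capability (code 1: 2-byte AFI,
1 reserved byte, 1-byte SAFI).  The function name decode_capability is
hypothetical.

```python
import struct


def decode_capability(hex_dump: str) -> dict:
    """Decode one BGP capability (code, length, value) from a hex dump."""
    raw = bytes.fromhex(hex_dump.replace(" ", ""))
    if len(raw) < 2:
        raise ValueError("capability needs at least code and length bytes")
    code, length = raw[0], raw[1]
    value = raw[2:2 + length]
    if len(value) != length:
        raise ValueError("capability value is truncated")
    result = {"code": code, "length": length}
    # Code 1 is the multiprotocol extensions capability: AFI, reserved, SAFI.
    if code == 1 and length == 4:
        afi, _reserved, safi = struct.unpack("!HBB", value)
        result["afi"] = afi
        result["safi"] = safi
    return result


print(decode_capability("01040001 0001"))
```

For the bytes in the log this yields code 1, length 4, AFI 1, SAFI 1, that
is, the multiprotocol capability for IPv4 unicast.  Why the neighbor
objected to it is not something the log alone establishes.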

At this point, the router became unresponsive and apparently reloaded
itself.

System returned to ROM by bus error at PC 0x612F7418, address 0xB5C at
10:57:47 EDT Thu Aug 19 2004
System restarted at 10:59:56 EDT Thu Aug 19 2004
System image file is "slot0:c7200-js-mz.121-5.T19.bin"

cisco 7206 (NPE225) processor (revision A) with 245760K/16384K bytes of
memory.
Processor board ID 23683557
R527x CPU at 262Mhz, Implementation 40, Rev 10.0, 2048KB L2 Cache
6 slot midplane, Version 1.3

Minimum process stacks:
 Free/Size   Name
 5556/6000   CDP Protocol
11416/12000  Router Init
 7616/12000  Init
 5292/6000   RADIUS INITCONFIG
 5712/6000   BGP Accepter
 4848/6000   BGP Open
10824/12000  Exec
 7572/9000   DHCP Client
 9240/12000  Virtual Exec

Interrupt level stacks:
Level    Called Unused/Size  Name
  1    10762102   7000/9000  Network interfaces
  2       24990   8576/9000  DMA/Timer Interrupt
  3           3   8100/9000  PA Management Int Handler
  4      127404   8548/9000  Console Uart
  5           0   9000/9000  OIR/Error Interrupt
  7      520815   8604/9000  NMI Interrupt Handler

Spurious interrupts: 39
System was restarted by bus error at PC 0x612F7418, address 0xB5C at
10:57:47 EDT Thu Aug 19 2004
7200 Software (C7200-JS-M), Version 12.1(5)T19,  RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Compiled Tue 08-Jun-04 06:46 by ccai
Image text-base: 0x60008960, data-base: 0x616F8000


Stack trace from system failure:
FP: 0x62C3C7E8, RA: 0x612F7418
FP: 0x62C3C808, RA: 0x606F7450
FP: 0x62C3C840, RA: 0x60720798
FP: 0x62C3C860, RA: 0x6071E5F0
FP: 0x62C3C8D0, RA: 0x6071E71C
FP: 0x62C3C8F8, RA: 0x6071E8A8
FP: 0x62C3C930, RA: 0x60620664
FP: 0x62C3C948, RA: 0x60620650

Unfortunately, bootflash was full, so we got no crashinfo.

Obviously, this is an IOS problem, perhaps only with BGP md5 handling.
None of the other sessions on this router use md5.  Unfortunately, this
seems to be the latest version in the 12.1T train, and when we tried an
early 12.3M version on this router some months ago, several things in its
handling of DSL aggregation broke, so we quickly downgraded back to
12.1T.

If anyone's run into this before, do you know if this is a bug in quagga
(or the md5 patch it requires in the Linux kernel) tickling a bug in IOS?
Will simply removing the BGP password from this new session keep us from
crashing (even if the customer tries md5 for some reason)?

----------------------------------------------------------------------
 Jon Lewis                   |  I route
 Senior Network Engineer     |  therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________

