[j-nsp] MX NSR issue

Chris Evans chrisccnpspam2 at gmail.com
Fri Sep 3 21:46:18 EDT 2010


Latest update:

Turns out that the PFEs were discarding ALL traffic. What happened is that
when the master RE failed and the backup came online, the fabric channels
didn't get re-established properly in software. So while the 'show' commands
showed everything as up and there were no alarms, that obviously wasn't the
case. After off-lining the SCB module and bringing it back online, traffic
started to flow again.
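
For anyone wanting to try the same recovery, a rough sketch of the SCB cycle
from operational mode follows. Slot 0 and the hostname are placeholders only
(which SCB to cycle depends on the chassis), and the exact command form is an
assumption that may differ by Junos release, so verify it on your own box
first. Taking an SCB offline is a disruptive step in its own right. Since the
status commands looked healthy in this case, the PFE discard counters may be
a more useful thing to watch than the fabric state.

```
user@mx> show pfe statistics traffic
user@mx> request chassis cb slot 0 offline
user@mx> request chassis cb slot 0 online
user@mx> show chassis fabric summary
user@mx> show pfe statistics traffic
```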

It took us 5 hours to get to that point. So perhaps some of the other failures
seen by members here might have been the same scenario.

ATAC is going to send me the root cause of the issue on Tuesday (hopefully).

Chris


On Fri, Sep 3, 2010 at 3:56 PM, Chris Evans <chrisccnpspam2 at gmail.com> wrote:

> Felix,
>
> Interesting that you say this. I'm working with ATAC right now
> troubleshooting this. We were looking at the PFE jsim information and found
> that it is reporting 'firewall discard' on a lot of traffic. I do not have any
> filters applied on the device at all, but it is giving us this message.
>
> Hopefully it'll lead to an answer.
>
> Chris
>
>
>
> On Fri, Sep 3, 2010 at 9:13 AM, Felix Schueren <
> felix.schueren at hosteurope.de> wrote:
>
>> Chris,
>>
>>
>> >
>> > #1 - I have two eBGP neighbors using BFD. One of the neighbors tripped,
>> > now BFD won't re-establish. BGP is up however.
>> > #2 - I'm using IRB interfaces on the MX platform. After the failover,
>> > traffic will not forward.. You can communicate RE to host, but HOST to
>> > HOST on the same box or external<>HOST connectivity is broken.
>>
>> I sometimes run into issues after RE failover on M and MX boxes. Here's
>> a snippet of a case I opened about a year ago:
>>
>>
>> +++snip+++
>> The effect was that the router was left in a half-working state: IS-IS
>> was up, most IPv4 BGP peerings were up (but not all of them), and none of
>> the IPv6 BGP peerings were up. Pings to the loopback address failed, and
>> we could not ping some directly attached hosts even though ARP was
>> working fine. There were no error messages or any other indication of
>> anything being wrong. When testing locally via ping from the router CLI,
>> we got "sendto: operation not permitted" messages. The NTP daemon was
>> logging "sendto(x.x.x.x): Operation not permitted" (and did not work),
>> and ping to y.y.y.y (a directly attached host in the same subnet as
>> x.x.x.x, living in z.z.z.z/27) returned the same "operation not
>> permitted" messages, even though ARP was working fine.
>>
>> one of the not-working BGP sessions logged this:
>> task_connect: task BGP_remoteAS.a.b.c.d+179 addr a.b.c.d+179: Operation not permitted
>>
>> "restart routing" (at 05:24 CEST) did not help.
>> +++snip+++
>>
>> The remedy for this was (and has been every time I ran into this, about
>> once a year since 2004): deactivate the lo0 filters, commit, activate the
>> lo0 filters, commit. The instant the commit with the deactivated lo0
>> filters is finished, everything works properly, and continues to work
>> even with the lo0 filters back in place.
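>>
>> As a rough sketch, that sequence looks something like this in
>> configuration mode. It assumes the filters are applied under lo0 unit 0,
>> family inet; the unit, family and hostname are placeholders, and an inet6
>> filter would need the same treatment.
>>
>> ```
>> [edit]
>> user@mx# deactivate interfaces lo0 unit 0 family inet filter
>> user@mx# commit
>> user@mx# activate interfaces lo0 unit 0 family inet filter
>> user@mx# commit
>> ```
>>
>> Keep in mind that the RE is unprotected between the two commits, so the
>> window should be kept as short as possible.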
>>
>> A race condition of some sort, maybe?
>>
>> kind regards,
>>
>> Felix
>>
>> --
>> Felix Schüren
>> Head of Network
>>
>> -----------------------------------------------------------------------
>> Host Europe GmbH - http://www.hosteurope.de
>> Welserstraße 14 - 51149 Köln - Germany
>> Phone: 0800 467 8387 - Fax: +49 180 5 66 3233 (*)
>> HRB 28495, Cologne Local Court (Amtsgericht Köln) - VAT ID: DE187370678
>> Managing Directors:
>> Uwe Braun - Alex Collins - Mark Joseph - Patrick Pulvermüller
>>
>> (*) EUR 0.14/min from German landlines; at most EUR 0.42/min from
>> German mobile networks
>>
>>
>

