[c-nsp] FRR Recovery Time
gladston at br.ibm.com
Sun Jan 22 15:58:38 EST 2006
Sorry it took so long to get the requested information. It was necessary
to wait for a maintenance window.
There is new information. We used RSVP hellos to detect the failure, and
this test showed that FRR itself is working well. The problem is that,
without RSVP hellos, the POS alarms are not enough to bring down the
interface on the remote-end router, so bidirectional communication is not
recovered.
The failure is simulated by disconnecting the fiber on the POS interface of
RA1.
I am wondering if the Carrier has some configuration that does not let the
POS alarms arrive at RB1 when the fiber on RA1 is disconnected.
Any feedback on this is really appreciated.
I have studied the POS alarms, and from what I understood the defaults
should be enough to allow the remote router to detect a failure on the
local router.
This is from Cisco pages:
"You can issue the pos delay triggers path command in order to configure
various path alarms as triggers and in order to specify an activation
delay between 0 and 511 ms. The default delay value is 100 ms."
I did not try that command (pos delay triggers path) to activate path
alarms to bring the protocol down. Do you think that would help?
=============================================================================
>
> RA1-(pos)-----------------------(pos)--RB1
>  (giga)                         (giga)
>    |                              |
>  (giga)                         (giga)
> RA2-(pos)----------------------(pos)--RB2
>
> The main tunnel is RA1-RB1
> The backup tunnel is RA1-RA2-RB2-RB1
What type of protection do you have configured? Which path does the
RA1->RB2 tunnel take? Or is the tunnel RA1->RB1 and you protect the link
RA1-RB2?
==============================================================================
Link protection on RA1 and RB1. The protected link is RA1--RB1. It is
protected by a tunnel that takes the path RA1-RA2-RB2-RB1.
=================================
copy&paste error? Which link is what? can you please re-post your exact
configuration?
=================================
Yes, sorry, my error. 155Mbps between RA1-RB1 and RA2-RB2.
====================================
Do you have bi-directional tunnels configured, or are you measuring
one-way only (somewhat doubt it using your 2nd measurements with the IP
application).
=====================================
Yes, bi-directional tunnels. And now it is clear that FRR works properly
on the local router, but the remote router does not detect the failure
within a few milliseconds, so it does not start FRR. It seems the Carrier
does not let the alarms arrive at the remote router.
==========================================
Which routers/interface-cards are you using running which IOS release?
How large is your FRR database, i.e. how many tunnels do you protect? Do
you have tuned your IGP to fast-convergence and how many routes are in
your routing table (IGP + BGP)?
The FRR debug would be interesting..
============================================
7609 routers with 4-port OC3 POS controllers and Gigabit Ethernet, running
version 12.2.18SXD6.
Just one tunnel is protected.
We did not change the IGP timers.
There are 100 routes. Only OSPF is used.
Debugs from RA1:
2006-01-18 07:40:31 Local7.Warning 172.19.2.60 899: *Jan 18
04:26:38.936 COL: %SONET-4-ALARM: POS3/3: LRDI
2006-01-18 07:40:31 Local7.Debug 172.19.2.60 900: *Jan 18
04:26:40.620 COL: LFIB-FRR: PO3/3: "holddown enabled" -> "holddown
disabled" (Down)
2006-01-18 07:40:31 Local7.Debug 172.19.2.60 901: *Jan 18
04:26:40.620 COL: LFIB-FRR: discarded interface DOWN event for PO3/3
(Down)
2006-01-18 07:40:32 Local7.Notice 172.19.2.60 902: *Jan 18
04:26:40.620 COL: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.118.22 on POS3/3
from FULL to DOWN, Neighbor Down: Interface down or detached
2006-01-18 07:40:41 Local7.Warning 172.19.2.60 903: *Jan 18
04:26:48.964 COL: %SONET-4-ALARM: POS3/3: LRDI cleared
2006-01-18 07:40:44 Local7.Error 172.19.2.60 904: SLOT 3/0:
04:02:24: %LINK-3-UPDOWN: Interface POS3/3, changed state to down
2006-01-18 07:41:46 Local7.Warning 172.19.2.60 905: *Jan 18
04:27:54.180 COL: %SONET-4-ALARM: POS3/3: LRDI
2006-01-18 07:41:51 Local7.Warning 172.19.2.60 906: *Jan 18
04:27:59.208 COL: %SONET-4-ALARM: POS3/3: B1 TC alert threshold exceeded
2006-01-18 07:41:55 Local7.Warning 172.19.2.60 907: *Jan 18
04:28:04.232 COL: %SONET-4-ALARM: POS3/3: SLOS cleared
2006-01-18 07:41:56 Local7.Debug 172.19.2.60 908: *Jan 18
04:28:04.236 COL: LFIB-FRR: discarded interface GOING DOWN event for PO3/3
(Down)
2006-01-18 07:41:59 Local7.Error 172.19.2.60 909: SLOT 3/0:
04:03:39: %LINK-3-UPDOWN: Interface POS3/3, changed state to up
2006-01-18 07:42:01 Local7.Warning 172.19.2.60 910: *Jan 18
04:28:09.252 COL: %SONET-4-ALARM: POS3/3: B1 TC alert cleared
2006-01-18 07:42:01 Local7.Warning 172.19.2.60 911: *Jan 18
04:28:10.252 COL: %SONET-4-ALARM: POS3/3: LRDI cleared
2006-01-18 07:42:04 Local7.Debug 172.19.2.60 912: *Jan 18
04:28:12.256 COL: LFIB-FRR: enqueued interface UP event for PO3/3 (Down)
2006-01-18 07:42:04 Local7.Debug 172.19.2.60 913: *Jan 18
04:28:12.260 COL: LFIB-FRR: processing interface UP event for PO3/3 (Down)
2006-01-18 07:42:04 Local7.Debug 172.19.2.60 914: *Jan 18
04:28:12.260 COL: LFIB-FRR: group PO3/3->Tu21121: output if fixup: PO3/3
(Up), Tu21121 (Up)
2006-01-18 07:42:04 Local7.Debug 172.19.2.60 915: *Jan 18
04:28:12.260 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 0 items output if
fixup
2006-01-18 07:42:12 Local7.Notice 172.19.2.60 916: *Jan 18
04:28:20.332 COL: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.118.22 on POS3/3
from LOADING to FULL, Loading Done
2006-01-18 07:51:51 Local7.Debug 172.19.2.60 925: *Jan 18
04:37:59.281 COL: LFIB-FRR: PO3/3: "holddown disabled" -> "holddown
enabled" (Up)
2006-01-18 07:51:51 Local7.Debug 172.19.2.60 926: *Jan 18
04:37:59.281 COL: LFIB-FRR: enqueued interface DOWN event for PO3/3 (Up)
2006-01-18 07:51:51 Local7.Debug 172.19.2.60 927: *Jan 18
04:37:59.281 COL: LFIB-FRR: processing interface DOWN event for PO3/3 (Up,
held down)
2006-01-18 07:51:51 Local7.Debug 172.19.2.60 928: *Jan 18
04:37:59.281 COL: LFIB-FRR: group PO3/3->Tu21121: output if fixup: PO3/3
(Down), Tu21121 (Up)
2006-01-18 07:51:51 Local7.Debug 172.19.2.60 929: *Jan 18
04:37:59.285 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 22 items output if
fixup
2006-01-18 07:51:52 Local7.Debug 172.19.2.60 930: *Jan 18
04:38:01.281 COL: LFIB-FRR: PO3/3: "holddown enabled" -> "holddown
disabled" (Down)
2006-01-18 07:51:52 Local7.Debug 172.19.2.60 931: *Jan 18
04:38:01.281 COL: LFIB-FRR: discarded interface DOWN event for PO3/3
(Down)
2006-01-18 07:51:53 Local7.Notice 172.19.2.60 932: *Jan 18
04:38:01.281 COL: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.118.22 on POS3/3
from FULL to DOWN, Neighbor Down: Interface down or detached
2006-01-18 07:52:04 Local7.Debug 172.19.2.60 933: *Jan 18
04:38:12.313 COL: LFIB-FRR: enqueued interface UP event for PO3/3 (Down)
2006-01-18 07:52:04 Local7.Debug 172.19.2.60 934: *Jan 18
04:38:12.317 COL: LFIB-FRR: processing interface UP event for PO3/3 (Down)
2006-01-18 07:52:04 Local7.Debug 172.19.2.60 935: *Jan 18
04:38:12.317 COL: LFIB-FRR: group PO3/3->Tu21121: output if fixup: PO3/3
(Up), Tu21121 (Up)
2006-01-18 07:52:04 Local7.Debug 172.19.2.60 936: *Jan 18
04:38:12.317 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 0 items output if
fixup
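To make the timing in the debug output easier to read, here is a minimal
sketch that computes the elapsed time between selected events. It is only
an illustration: the three entries are copied from the RA1 output above
with the wrapped lines rejoined, the function names (parse_events,
elapsed_ms) are made up for this example, and it assumes all events fall
on the same day.

```python
import re

# Entries copied from the RA1 debug output above (wrapped lines rejoined).
LOG = """\
*Jan 18 04:37:59.281 COL: LFIB-FRR: enqueued interface DOWN event for PO3/3 (Up)
*Jan 18 04:37:59.285 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 22 items output if fixup
*Jan 18 04:38:01.281 COL: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.118.22 on POS3/3 from FULL to DOWN, Neighbor Down: Interface down or detached
"""

# Router timestamp (HH:MM:SS.mmm) followed by the message text.
STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) COL: (.*)")


def parse_events(log):
    """Return a list of (milliseconds since midnight, message) tuples."""
    events = []
    for line in log.splitlines():
        match = STAMP.search(line)
        if match:
            hours, minutes, seconds, millis = (int(g) for g in match.groups()[:4])
            stamp_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
            events.append((stamp_ms, match.group(5)))
    return events


def elapsed_ms(events, start_marker, end_marker):
    """Milliseconds from the first start_marker to the next end_marker."""
    start = next(ts for ts, msg in events if start_marker in msg)
    end = next(ts for ts, msg in events if ts >= start and end_marker in msg)
    return end - start


events = parse_events(LOG)
# Time for the local FRR fixup on RA1 after the DOWN event was enqueued.
print(elapsed_ms(events, "enqueued interface DOWN", "fixed 22 items"))  # 4
# Time until OSPF reported the adjacency down on the same interface.
print(elapsed_ms(events, "enqueued interface DOWN", "OSPF-5-ADJCHG"))   # 2000
```

On these entries the local fixup on RA1 completes 4 ms after the DOWN
event is enqueued. This only covers the local side; it says nothing about
when RB1 detects the failure.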
Cordially,
------------------------------------------------------------------
Alaerte Gladston Vidali
IBM Global Services - SO
Tel.55+11+2121-2879 Fax:55+11+2121-2449
"Oliver Boehmer \(oboehmer\)" <oboehmer at cisco.com>
16/01/2006 05:20
To: Alaerte Gladston Vidali/Brazil/IBM at IBMBR, <cisco-nsp at puck.nether.net>
Subject: RE: [c-nsp] FRR Recovery Time
gladston at br.ibm.com wrote on Sunday, January 15, 2006 9:20 PM:
Hi,
> If you have measured FRR recovery time, did you find your lab results
> consistent with the 50msec stated in theory?
> We measured it using two tools. However, the results are inconsistent
> with the expected value of 50msec.
>
> This is the network:
>
> RA1--------------------------RB1
>  |                            |
>  |                            |
> RA2-------------------------RB2
>
> The main tunnel is RA1-RB2
> The backup tunnel is RA1-RA2-RB2-RB1
What type of protection do you have configured? Which path does the
RA1->RB2 tunnel take? Or is the tunnel RA1->RB1 and you protect the link
RA1-RB2?
> The links are 155Mbps between RA1-RA2 and RB1-RB2 and Giga between
> RA1-RA2 and RB1-RB2
copy&paste error? Which link is what? can you please re-post your exact
configuration?
Do you have bi-directional tunnels configured, or are you measuring
one-way only (somewhat doubt it using your 2nd measurements with the IP
application).
[...]
> Comparing the results:
> These were the result using the first tool:
> 70msec
> 481msec
> 411msec
> 371msec
[...]
> Nevertheless, as it is different from the documentation, which states
> 50msec for the operation of FRR, we would like to double check the
> result. We used two ways to fail the link:
> -disconnect the POS fiber manually
> -shutdown the POS interface (using the command POS ais-shut)
> As the time using manual shutdown increased to 800msec, we discarded
> the test using this way of failing the link and just use the first
> way, manually disconnecting the fiber.
>
> Your comments and recommendations are more than welcome.
>
> I am wondering if there is any timer on Cisco that can be configured,
> such as "carrier-delay ms x", to improve FRR time. However, unlike
> that command, which introduces some delay, I would like to reduce the
> FRR time if possible and test again.
Which routers/interface-cards are you using running which IOS release?
How large is your FRR database, i.e. how many tunnels do you protect? Do
you have tuned your IGP to fast-convergence and how many routes are in
your routing table (IGP + BGP)?
The FRR debug would be interesting..
oli