[c-nsp] FRR Recovery Time

Sun Jan 22 15:58:38 EST 2006

Sorry it took so long to get the requested information. It was necessary 
to wait for a maintenance window.

There is new information. We used RSVP hellos to detect the failure, and 
this test showed that FRR itself works well. The problem is that, without 
RSVP hellos, the POS alarms are not enough to bring down the interface on 
the remote-end router, so bidirectional communication is not recovered 
quickly.

The failure is simulated by disconnecting the fiber on the POS interface 
of RA1.

I am wondering if the Carrier has some configuration that does not let the 
POS alarms arrive at RB1 when the fiber on RA1 is disconnected.
Any feedback on this is really appreciated.

I have studied the POS alarms, and from what I understood the defaults 
should be enough to allow the remote router to detect a failure on the 
local router.
This is from the Cisco documentation:
"You can issue the pos delay triggers path command in order to configure 
various path alarms as triggers and in order to specify an activation 
delay between 0 and 511 ms. The default delay value is 100 ms."
I have not tried that command (pos delay triggers path) to make path 
alarms bring the line protocol down. Do you think that would help?

=============================================================================
>
>      RA1-(pos)-----------------------(pos)--RB1
>     (giga)                                  (giga)
>        |                                      |
>     (giga)                                  (giga)
>       RA2-(pos)----------------------(pos)--RB2
>
>  The main tunnel is RA1-RB1
>  The backup tunnel is RA1-RA2-RB2-RB1

What type of protection do you have configured? Which path does the
RA1->RB2 tunnel take? Or is the tunnel RA1->RB1 and you protect the link
RA1-RB2?
==============================================================================

Link protection on RA1 and RB1. The protected link is RA1--RB1. It is 
protected by a tunnel that takes the path RA1-RA2-RB2-RB1.

=================================
copy&paste error? Which link is what? can you please re-post your exact
configuration?
=================================

Yes, sorry, my error. 155Mbps between RA1-RB1 and RA2-RB2.

====================================
Do you have bi-directional tunnels configured, or are you measuring
one-way only (somewhat doubt it using your 2nd measurements with the IP
application).
=====================================

Yes, bi-directional tunnels. It is now clear that FRR works properly on 
the local router, but the remote router does not detect the failure within 
a few milliseconds and so does not start FRR in time. It seems the Carrier 
does not let the alarms arrive at the remote router.

==========================================
Which routers/interface-cards are you using running which IOS release?
How large is your FRR database, i.e. how many tunnels do you protect? Do
you have tuned your IGP to fast-convergence and how many routes are in
your routing table (IGP + BGP)?
The FRR debug would be interesting..
============================================

7609 routers with a 4-port OC3 POS controller and Gigabit Ethernet, 
running version 12.2.18SXD6.
Just one tunnel is protected.
We did not change the IGP timers.
There are 100 routes. Only OSPF is used.

Debugs from RA1:
2006-01-18 07:40:31     Local7.Warning  172.19.2.60     899: *Jan 18 
04:26:38.936 COL: %SONET-4-ALARM:  POS3/3: LRDI
2006-01-18 07:40:31     Local7.Debug    172.19.2.60     900: *Jan 18 
04:26:40.620 COL: LFIB-FRR: PO3/3: "holddown enabled" -> "holddown 
disabled" (Down)
2006-01-18 07:40:31     Local7.Debug    172.19.2.60     901: *Jan 18 
04:26:40.620 COL: LFIB-FRR: discarded interface DOWN event for PO3/3 
(Down)
2006-01-18 07:40:32     Local7.Notice   172.19.2.60     902: *Jan 18 
04:26:40.620 COL: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.118.22 on POS3/3 
from FULL to DOWN, Neighbor Down: Interface down or detached
2006-01-18 07:40:41     Local7.Warning  172.19.2.60     903: *Jan 18 
04:26:48.964 COL: %SONET-4-ALARM:  POS3/3: LRDI cleared
2006-01-18 07:40:44     Local7.Error    172.19.2.60     904: SLOT 3/0: 
04:02:24: %LINK-3-UPDOWN: Interface POS3/3, changed state to down
2006-01-18 07:41:46     Local7.Warning  172.19.2.60     905: *Jan 18 
04:27:54.180 COL: %SONET-4-ALARM:  POS3/3: LRDI
2006-01-18 07:41:51     Local7.Warning  172.19.2.60     906: *Jan 18 
04:27:59.208 COL: %SONET-4-ALARM:  POS3/3: B1 TC alert threshold exceeded
2006-01-18 07:41:55     Local7.Warning  172.19.2.60     907: *Jan 18 
04:28:04.232 COL: %SONET-4-ALARM:  POS3/3: SLOS cleared
2006-01-18 07:41:56     Local7.Debug    172.19.2.60     908: *Jan 18 
04:28:04.236 COL: LFIB-FRR: discarded interface GOING DOWN event for PO3/3 
(Down)
2006-01-18 07:41:59     Local7.Error    172.19.2.60     909: SLOT 3/0: 
04:03:39: %LINK-3-UPDOWN: Interface POS3/3, changed state to up
2006-01-18 07:42:01     Local7.Warning  172.19.2.60     910: *Jan 18 
04:28:09.252 COL: %SONET-4-ALARM:  POS3/3: B1 TC alert cleared
2006-01-18 07:42:01     Local7.Warning  172.19.2.60     911: *Jan 18 
04:28:10.252 COL: %SONET-4-ALARM:  POS3/3: LRDI cleared
2006-01-18 07:42:04     Local7.Debug    172.19.2.60     912: *Jan 18 
04:28:12.256 COL: LFIB-FRR: enqueued interface UP event for PO3/3 (Down)
2006-01-18 07:42:04     Local7.Debug    172.19.2.60     913: *Jan 18 
04:28:12.260 COL: LFIB-FRR: processing interface UP event for PO3/3 (Down)
2006-01-18 07:42:04     Local7.Debug    172.19.2.60     914: *Jan 18 
04:28:12.260 COL: LFIB-FRR: group PO3/3->Tu21121: output if fixup: PO3/3 
(Up), Tu21121 (Up)
2006-01-18 07:42:04     Local7.Debug    172.19.2.60     915: *Jan 18 
04:28:12.260 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 0 items output if 
fixup
2006-01-18 07:42:12     Local7.Notice   172.19.2.60     916: *Jan 18 
04:28:20.332 COL: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.118.22 on POS3/3 
from LOADING to FULL, Loading Done

2006-01-18 07:51:51     Local7.Debug    172.19.2.60     925: *Jan 18 
04:37:59.281 COL: LFIB-FRR: PO3/3: "holddown disabled" -> "holddown 
enabled" (Up)
2006-01-18 07:51:51     Local7.Debug    172.19.2.60     926: *Jan 18 
04:37:59.281 COL: LFIB-FRR: enqueued interface DOWN event for PO3/3 (Up)
2006-01-18 07:51:51     Local7.Debug    172.19.2.60     927: *Jan 18 
04:37:59.281 COL: LFIB-FRR: processing interface DOWN event for PO3/3 (Up, 
held down)
2006-01-18 07:51:51     Local7.Debug    172.19.2.60     928: *Jan 18 
04:37:59.281 COL: LFIB-FRR: group PO3/3->Tu21121: output if fixup: PO3/3 
(Down), Tu21121 (Up)
2006-01-18 07:51:51     Local7.Debug    172.19.2.60     929: *Jan 18 
04:37:59.285 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 22 items output if 
fixup
2006-01-18 07:51:52     Local7.Debug    172.19.2.60     930: *Jan 18 
04:38:01.281 COL: LFIB-FRR: PO3/3: "holddown enabled" -> "holddown 
disabled" (Down)
2006-01-18 07:51:52     Local7.Debug    172.19.2.60     931: *Jan 18 
04:38:01.281 COL: LFIB-FRR: discarded interface DOWN event for PO3/3 
(Down)
2006-01-18 07:51:53     Local7.Notice   172.19.2.60     932: *Jan 18 
04:38:01.281 COL: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.118.22 on POS3/3 
from FULL to DOWN, Neighbor Down: Interface down or detached
2006-01-18 07:52:04     Local7.Debug    172.19.2.60     933: *Jan 18 
04:38:12.313 COL: LFIB-FRR: enqueued interface UP event for PO3/3 (Down)
2006-01-18 07:52:04     Local7.Debug    172.19.2.60     934: *Jan 18 
04:38:12.317 COL: LFIB-FRR: processing interface UP event for PO3/3 (Down)
2006-01-18 07:52:04     Local7.Debug    172.19.2.60     935: *Jan 18 
04:38:12.317 COL: LFIB-FRR: group PO3/3->Tu21121: output if fixup: PO3/3 
(Up), Tu21121 (Up)
2006-01-18 07:52:04     Local7.Debug    172.19.2.60     936: *Jan 18 
04:38:12.317 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 0 items output if 
fixup
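
As a cross-check of the local switchover time, the interval between the 
DOWN event and the fixup in the second debug block can be computed 
directly. This is only a rough sketch: it assumes the wrapped syslog lines 
have been joined back into single lines, and the helper names 
(router_time, frr_switch_ms) are made up for illustration. It reads the 
router timestamp (the one after the "*"), not the syslog server timestamp, 
which only has one-second resolution.

```python
import re
from datetime import datetime, timedelta

# Messages 926 and 929 from the RA1 debug above, unwrapped onto single lines.
LOG = """\
2006-01-18 07:51:51 Local7.Debug 172.19.2.60 926: *Jan 18 04:37:59.281 COL: LFIB-FRR: enqueued interface DOWN event for PO3/3 (Up)
2006-01-18 07:51:51 Local7.Debug 172.19.2.60 929: *Jan 18 04:37:59.285 COL: LFIB-FRR: group PO3/3->Tu21121: fixed 22 items output if fixup
"""

# Router-side timestamp, e.g. "*Jan 18 04:37:59.281"
STAMP = re.compile(r"\*\w{3} +\d+ (\d{2}:\d{2}:\d{2}\.\d{3})")


def router_time(line):
    """Return the router time of day of a log line (the date part is a dummy)."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError("no router timestamp in line: " + line)
    return datetime.strptime(match.group(1), "%H:%M:%S.%f")


def frr_switch_ms(log_text):
    """Milliseconds from the enqueued DOWN event to the first non-empty fixup."""
    down = fixed = None
    for line in log_text.splitlines():
        if down is None and "enqueued interface DOWN event" in line:
            down = router_time(line)
        elif down is not None and re.search(r"fixed [1-9]\d* items", line):
            fixed = router_time(line)
            break
    if down is None or fixed is None:
        raise ValueError("DOWN event or fixup line not found")
    # Integer division of two timedeltas gives whole milliseconds.
    return (fixed - down) // timedelta(milliseconds=1)


print(frr_switch_ms(LOG))  # prints 4
```

For messages 926 and 929 this gives 4 ms on RA1, which is consistent with 
FRR working well on the local side; the extra delay comes from the remote 
router detecting the failure late.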

Cordially,
------------------------------------------------------------------
Alaerte Gladston Vidali
IBM Global Services - SO
Tel.55+11+2121-2879   Fax:55+11+2121-2449

"Oliver Boehmer \(oboehmer\)" <oboehmer at cisco.com> 
16/01/2006 05:20

To
Alaerte Gladston Vidali/Brazil/IBM at IBMBR, <cisco-nsp at puck.nether.net>
cc

Subject
RE: [c-nsp] FRR Recovery Time

gladston at br.ibm.com <> wrote on Sunday, January 15, 2006 9:20 PM:

Hi,

> If you have measured FRR recovery time, did you find consistent times
> between your findings in the lab and the 50msec stated in theory?
> We measure it using two tools. However, the results are inconsistent
> with the expected value of 50msec.
>
> This is the network:
>
>      RA1--------------------------RB1
>        |                                          |
>        |                                          |
>       RA2-------------------------RB2
>
>  The main tunnel is RA1-RB2
>  The backup tunnel is RA1-RA2-RB2-RB1

What type of protection do you have configured? Which path does the
RA1->RB2 tunnel take? Or is the tunnel RA1->RB1 and you protect the link
RA1-RB2?

> The links are 155Mbps between RA1-RA2 and RB1-RB2 and Giga between
> RA1-RA2 and RB1-RB2

copy&paste error? Which link is what? can you please re-post your exact
configuration?

Do you have bi-directional tunnels configured, or are you measuring
one-way only (somewhat doubt it using your 2nd measurements with the IP
application).

[...]
> Comparing the results:
> These were the result using the first tool:
> 70msec
> 481msec
> 411msec
> 371msec

[...]

> Nevertheless, as it is different from the documentation, which states
> 50msec for the operation of FRR, we would like to double check the
> result. We used two ways to fail the link:
>     -disconnect the POS fiber manually
>     -shutdown the POS interface (using the command POS ais-shut)
> As the time using manual shutdown increased to 800msec, we discarded
> the test using this way of failing the link and just use the first
> way, manually disconnecting the fiber.
>
> Your comments and recommendations are more than welcome.
>
> I am wondering if there is any timer on Cisco that can be configured,
> such as "carrier-delay ms x", to improve FRR time. However, as opposed
> to this command, which introduces some delay, I would like to improve
> the FRR time if possible and test again.

Which routers/interface-cards are you using running which IOS release?
How large is your FRR database, i.e. how many tunnels do you protect? Do
you have tuned your IGP to fast-convergence and how many routes are in
your routing table (IGP + BGP)?
The FRR debug would be interesting..

oli