Fw: some notes on ARIN RPKI outages

Tue Jun 8 08:37:14 EDT 2021

TLP:RED

To be very clear: I am not a fan of rolling blackouts.

But, I feel it is important to warn ahead of time what my perspective is
on the 'tolerance levels' of RPKI services. My hope is that this will
help all parties make better informed decisions.

Kind regards,

Job

----- Forwarded message from Job Snijders <job at fastly.com> -----

Date: Tue, 8 Jun 2021 14:14:42 +0200
From: Job Snijders <job at fastly.com>
To: jcurran at arin.net, bgorman at arin.net
Subject: some notes on RPKI outages

Dear Brad & John,

Over the weekend I reflected a bit more on the outage proposal at hand,
and on which actions would venture into needless BGP routing churn
(balanced against the need to better understand uptime expectations).

I think everyone agrees that 'the RPKI' is an optional security feature:
if for one reason or another (ROA was deleted, Certificate expired, etc)
the RPKI is 'down', the expectation is that the Internet continues to
work as if RPKI didn't exist at all. In other words: don't reject routes
solely because they are "NotFound". Cool - so far so good :-)
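
To make the three validation states concrete, here is a minimal sketch
of the origin validation procedure as I understand it from RFC 6811,
plus the 'only reject Invalid' policy. The (prefix, max length, origin
AS) tuple layout, the function names, and the documentation
prefixes/ASNs are illustrative assumptions, not anyone's production
code.

```python
import ipaddress


def validation_state(prefix, origin_asn, vrps):
    """Return 'Valid', 'Invalid' or 'NotFound' for a single BGP route.

    vrps is a list of (prefix, max_length, origin_asn) tuples; this
    layout is an assumption made for the sketch.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, vrp_asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # A VRP covers the route when the route is equal to, or more
        # specific than, the VRP prefix (same address family only).
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if route.prefixlen <= max_length and origin_asn == vrp_asn:
            return "Valid"
    # Covered but never matched: Invalid. Not covered at all: NotFound.
    return "Invalid" if covered else "NotFound"


def accept(state):
    """Policy from the text: only 'Invalid' routes are rejected."""
    return state != "Invalid"


vrps = [("192.0.2.0/24", 24, 64496)]
print(validation_state("192.0.2.0/24", 64496, vrps))  # Valid
print(validation_state("192.0.2.0/24", 64497, vrps))  # Invalid
print(validation_state("192.0.2.0/24", 64496, []))    # NotFound
print(accept(validation_state("192.0.2.0/24", 64496, [])))  # True
```

With an empty VRP list (the RPKI being 'down'), every route comes out
as "NotFound" and is still accepted, which is the 'as if RPKI didn't
exist' behaviour described above.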

A complication to keep in mind is that when the RPKI Validation State of
a BGP route transitions from "Valid" to "NotFound", there is a cost to
bear in the global routing system. This is somewhat (but not entirely)
similar to the negative aspect of BGP WITHDRAW messages. BGP WITHDRAWs
propagate 'in kinda the wrong order' through the DFZ, potentially
causing micro blackholes along the WITHDRAW propagation path. While
everyone agrees that BGP WITHDRAWs are part of normal operations, most
organizations try to avoid them at all costs. In a similar way we should
all strive to minimize 'RPKI ROA withdraws', because those trigger BGP
UPDATES, which in turn potentially cause re-routing.

I know of three BGP aspects that suffer disproportionally when (lots of)
RPKI ROAs disappear from the view:

    * any network tagging all their routes based on validation state
    * any network with Juniper boxes vulnerable to PR1483097
    * any network using Cisco IOS XE (where RFC 8097 tagging cannot be turned off)

I expect that over time the above three 'unfortunate complications' will
become less common, as more and more networks fix their configs or
upgrade their Junos. Unfortunately the velocity from 'bugfix published'
to 'bugfix deployed at global scale' is really slow: quite a few carriers
only upgrade their routers once a year (if even that often)...

So what _could_ be done?
========================

Option #1: I would focus on access to the repository, along these lines:

    * 1 hour of IPv4/RSYNC down
    * 1 hour of IPv6/RSYNC down
    * 1 hour of IPv4+IPv6/RSYNC down
    * 1 hour of IPv4/RRDP down
    * 1 hour of IPv6/RRDP down
    * 1 hour of IPv4+IPv6/RRDP down
    * 1 hour of DNSSEC failure on rpki.arin.net
    * 1 hour of DNSSEC failure on rrdp.arin.net
    (or variants of the above)

Going beyond 1 hour might raise eyebrows.
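
For anyone wanting to observe such access outages from the outside, a
rough sketch of a reachability probe over the same protocol/address
family matrix could look like the following. Assumptions: that
rpki.arin.net is the RSYNC host and rrdp.arin.net the RRDP host, that
the standard ports apply (873 for rsync, 443 for HTTPS), and that a
plain TCP connect is a good enough signal. It does not fetch any RPKI
data and says nothing about DNSSEC.

```python
import socket

# Hostnames are taken from the list above; the host-to-protocol mapping
# and the port numbers are assumptions made for this sketch.
ENDPOINTS = {
    "RSYNC": ("rpki.arin.net", 873),
    "RRDP": ("rrdp.arin.net", 443),
}
FAMILIES = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}


def reachable(host, port, family, timeout=5.0):
    """True if a TCP connect to host:port succeeds over the given family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no address of this family (or DNS failure)
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address, if any
    return False


def probe_all(check=reachable):
    """Return {(protocol, family_name): bool} for the whole matrix."""
    return {
        (proto, fam_name): check(host, port, family)
        for proto, (host, port) in ENDPOINTS.items()
        for fam_name, family in FAMILIES.items()
    }
```

Calling probe_all() every few minutes and logging the result would give
a simple external timeline of which combination was down and when. The
check parameter only exists so the probe can be swapped out or faked.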

I fear that purposefully breaking 'crypto' aspects (such as corrupting
Manifests, revoking top level objects, letting things expire, etc)
rather than 'access aspects', will yield unfavorable responses from the
community and decrease trust in the ARIN trust anchor. For example, a
(shorter) repeat of the August 2020 outage should be avoided.

Option #2: Targeted outages: test prefixes
==========================================

Another path to explore would be to get some test prefixes up in the BGP
DFZ and add/remove ROAs for just those test prefixes. Then the
timestamps of the add/remove actions can be looked up in the routeviews
archive to see if the path changes because of the RPKI validation state
change. Then as a follow-up, ARIN could reach out to networks along the
AS_PATH where you see BGP communities change (probably caused by
suboptimal router configs).
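
The correlation step could be sketched roughly as below. This assumes
the BGP updates for one test prefix, as seen from one vantage point,
have already been extracted from the archive into simple records;
parsing the archive files is out of scope here. The record layout, the
one hour window, and all ASNs and community values are hypothetical.

```python
from datetime import datetime, timedelta


def changes_after(roa_change, updates, window=timedelta(hours=1)):
    """Report path/community changes seen shortly after a ROA change.

    updates: list of dicts with 'time', 'as_path' and 'communities'
    keys (an assumed layout). Each update inside the window is compared
    with the announcement that preceded it.
    """
    updates = sorted(updates, key=lambda u: u["time"])
    before = [u for u in updates if u["time"] < roa_change]
    after = [u for u in updates
             if roa_change <= u["time"] <= roa_change + window]
    if not before:
        return []  # nothing to compare against
    baseline = before[-1]
    findings = []
    for update in after:
        findings.append({
            "time": update["time"],
            "path_changed": update["as_path"] != baseline["as_path"],
            "communities_changed":
                set(update["communities"]) != set(baseline["communities"]),
        })
        baseline = update
    return findings


roa_removed = datetime(2021, 6, 8, 12, 0)
updates = [
    {"time": datetime(2021, 6, 8, 11, 0),
     "as_path": (64500, 64496), "communities": ("64500:100",)},
    {"time": datetime(2021, 6, 8, 12, 20),
     "as_path": (64500, 64496), "communities": ("64500:200",)},
]
for finding in changes_after(roa_removed, updates):
    print(finding)
```

In this made-up example the AS_PATH stays the same but the community
set changes twenty minutes after the ROA removal, which is the kind of
signal that would point at a network tagging routes based on
validation state.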

Just my 2 cents! I'm happy to evaluate your plans and provide feedback.

Kind regards,

Job

----- End forwarded message -----