[j-nsp] MX104 with full BGP table problems
Brad Fleming
bdflemin at gmail.com
Fri May 16 15:28:01 EDT 2014
Thanks for the response; answers inline...
On May 16, 2014, at 1:58 PM, Tyler Christiansen <tyler.christiansen at adap.tv> wrote:
> I don't have experience with the MX104s but do with the rest of the line (MX80 to MX2010 [excluding MX104, of course]). The MX80 isn't dual RE, but the CPUs are the same family between the MX80 and MX104 IIRC--the MX104 is just 500 or 600 MHz faster. And the MX80 kind of chokes when receiving a full feed (even just one feed at a time can easily send it up to ~40% CPU during the initial feed consumption). ;)
>
> The MX80 and MX104 being sold as edge BGP routers is pretty much only because they have enough memory to do it...not because it's a good idea.
>
> It's pretty odd for the backup RE to show significant CPU utilization (based on experience with the other dual-RE MX devices). Some, yes, but not ~100% utilization as you show there. I would buy 100% utilization during initial feed consumption on the master. After you have some stability in the network, though, the CPU should be back down to ~5-15% (depending on what you have going on).
I agree; we’ve run a few M10i routers and never had this issue, but that’s a totally different platform on a much older version of Junos, so I generally discounted the comparison. These are the first multi-RE boxes we’ve had running any Junos newer than 10.0. Thanks for pointing it out; it’s something I missed in my previous email. As the previous output shows, the 15-minute load averages for each RE are ~1.20, so the load remains elevated. I just confirmed that the 15-minute load average after about 2 hours of “sitting” remains ~1.22.
>
> How aggressive are your BGP timers? You may want to consider BFD instead of BGP timers for aggressive keepalives.
BGP timers are default; however, we’ve tried relaxing them with no change in behavior.
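If we do go the BFD route Tyler suggests, the Junos knob is bfd-liveness-detection under the BGP neighbor (or group). A minimal sketch; the group name, peer address, and timer values are illustrative, not our actual config:

    [edit protocols bgp group ibgp-test]
    neighbor 192.0.2.1 {
        bfd-liveness-detection {
            minimum-interval 300;  # 300 ms tx/rx, so failure detection moves to BFD
            multiplier 3;          # declare the session down after 3 missed packets
        }
    }

That would let us keep the BGP hold-time at its relaxed default while still getting fast failure detection.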
>
> Are you doing just plain IPv4 BGP, or are you utilizing MBGP extensions? MBGP extensions can inflate the size of the BGP tables and make the router do more work.
We’ve tried both with no difference in performance. The example outputs in my original message were with MBGP extensions enabled but doing only IPv4 unicast on the session produces the same result.
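For anyone reproducing this: restricting the session to plain IPv4 unicast is just the usual family statement (group name illustrative):

    [edit protocols bgp group ibgp-test]
    family inet {
        unicast;  # IPv4 unicast only; no other NLRI negotiated on the session
    }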
>
> In all scenarios, you really should probably have loopback IPs in the IGP and have the nexthop set to the loopback IPs for iBGP sessions. I'm not sure why you have /30 P2P links as the next-hops as they're potentially unstable (even if they're not now, they can easily become unstable once in production). I assume that since you mentioned you know it's not recommended, you're going to be changing that.
This is a bit of a legacy issue within our network. We’ve operated for nearly 12 years carrying the actual PtP /30s in our IGP and retaining them as next-hops in BGP advertisements. It is something we plan to resolve with the deployment of this gear (as well as several new MX960s that were part of the same PO).
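The fix will be the standard Junos idiom: loopbacks in the IGP plus a next-hop-self export policy on the iBGP sessions. Roughly, with illustrative policy and group names:

    [edit policy-options]
    policy-statement nhs {
        term ibgp {
            then {
                next-hop self;  # rewrite the advertised next-hop to the local address
            }
        }
    }
    [edit protocols bgp group ibgp-test]
    export nhs;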
>
> In scenario #2, how many RRs does the MX104 peer with? And are they sending full routes or full routes + more?
The box was only peering with a single RR. The RR was only sending the standard, full table (~496K routes), no VPN, no mcast, etc.
>
> Finally, in scenario #3, if you're trying to do a full mesh with 11 other peers, the MX104 will choke if they're all trying to load full tables. There are about 500,000 routes in the global table, so you're trying to load 5,500,000 routes into a box with a 1.8 GHz CPU and 4 GB of RAM.
In scenario #3, the total number of routes entering the RE was ~867K, with ~496K active.
>
> Regardless, I would think that the MX104 should be perfectly capable of scaling to at least five or six full feeds. I would suspect either a bug in the software or very aggressive timers.
>
> On Fri, May 16, 2014 at 11:00 AM, Brad Fleming <bdflemin at gmail.com> wrote:
> We’ve been working with a handful of MX104s on the bench in preparation for putting them into a live network. We started pushing a full BGP table into the device and stumbled across some CPU utilization problems.
>
> We tried pushing a full table into the box three different ways:
> 1) via an eBGP session
> 2) via a reflected session on an iBGP session
> 3) via a full mesh of iBGP sessions (11 other routers)
>
> In situation #1: RE CPU was slightly elevated but remained ~60% idle and 1min load averages were around 0.3.
>
> In situation #2: RE CPU is highly elevated. We maintain actual p-t-p /30s for our next-hops (I know, not best practice for many networks), which results in a total of about 50-65 next-hops network-wide.
>
> In situation #3: RE CPU is saturated at all times. In this case we configured the mesh sessions to advertise routes with “next-hop-self” so the number of next-hops is reduced to 11 total.
>
> It appears that rpd is the process actually killing the CPU; it is nearly always running at 75+% and in a “RUN” state. If we enable task accounting, it shows “Resolve tree 2” as the task consuming tons of CPU time (see below). There’s plenty of RAM remaining, we’re not using any swap space, and we’ve not exceeded the number of routes licensed for the system; we paid for the full 1-million+ route scaling. Logs are full of messages about lost communication with the backup RE; however, if we disable all the BGP sessions, that issue goes away completely (for days on end).
>
> Has anyone else tried shoving a full BGP table into one of these routers yet? Have you noticed anything similar?
>
> I’ve opened a JTAC case for the issue but I’m wondering if anyone with more experience in multi-RE setups has seen similar. Thanks in advance for any thoughts, suggestions, or insights.
>
>
> Incoming command output dump….
>
> netadm at test-MX104> show chassis routing-engine
> Routing Engine status:
> Slot 0:
> Current state Master
> Election priority Master (default)
> Temperature 39 degrees C / 102 degrees F
> CPU temperature 42 degrees C / 107 degrees F
> DRAM 3968 MB (4096 MB installed)
> Memory utilization 32 percent
> CPU utilization:
> User 87 percent
> Background 0 percent
> Kernel 11 percent
> Interrupt 2 percent
> Idle 0 percent
> Model RE-MX-104
> Serial ID CACH2444
> Start time 2009-12-31 18:05:43 CST
> Uptime 21 hours, 31 minutes, 32 seconds
> Last reboot reason 0x200:normal shutdown
> Load averages: 1 minute 5 minute 15 minute
> 1.06 1.12 1.23
> Routing Engine status:
> Slot 1:
> Current state Backup
> Election priority Backup (default)
> Temperature 37 degrees C / 98 degrees F
> CPU temperature 38 degrees C / 100 degrees F
> DRAM 3968 MB (4096 MB installed)
> Memory utilization 30 percent
> CPU utilization:
> User 62 percent
> Background 0 percent
> Kernel 15 percent
> Interrupt 24 percent
> Idle 0 percent
> Model RE-MX-104
> Serial ID CACD1529
> Start time 2010-03-18 05:16:34 CDT
> Uptime 21 hours, 45 minutes, 26 seconds
> Last reboot reason 0x200:normal shutdown
> Load averages: 1 minute 5 minute 15 minute
> 1.22 1.19 1.20
>
> netadm at test-MX104> show system processes extensive
> last pid: 20303; load averages: 1.18, 1.14, 1.22 up 0+21:33:35 03:03:41
> 127 processes: 8 running, 99 sleeping, 20 waiting
> Mem: 796M Active, 96M Inact, 308M Wired, 270M Cache, 112M Buf, 2399M Free
> Swap: 1025M Total, 1025M Free
> PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
> 3217 root 1 132 0 485M 432M RUN 120:56 72.85% rpd
>
> netadm at test-MX104> show task accounting
> Task accounting is enabled.
>
> Task Started User Time System Time Longest Run
> Scheduler 32294 0.924 0.148 0.000
> Memory 26 0.001 0.000 0.000
> RT 5876 0.947 0.162 0.003
> hakr 6 0.000 0.000 0.000
> OSPF I/O./var/run/ppmd_co 117 0.002 0.000 0.000
> BGP rsync 192 0.007 0.001 0.000
> BGP_RT_Background 78 0.001 0.000 0.000
> BGP_Listen.0.0.0.0+179 2696 1.101 0.218 0.009
> PIM I/O./var/run/ppmd_con 117 0.003 0.000 0.000
> OSPF 629 0.005 0.000 0.000
> BGP Standby Cache Task 26 0.000 0.000 0.000
> BFD I/O./var/run/bfdd_con 117 0.003 0.000 0.000
> BGP_2495_2495.164.113.199 1947 0.072 0.012 0.000
> BGP_2495_2495.164.113.199 1566 0.056 0.010 0.000
> BGP_2495_2495.164.113.199 1388 0.053 0.008 0.000
> Resolve tree 3 1421 24.523 13.270 0.102
> Resolve tree 2 14019 16:33.079 20.983 0.101
> Mirror Task.128.0.0.6+584 464 0.018 0.004 0.000
> KRT 1074 0.157 0.159 0.004
> Redirect 9 0.000 0.000 0.000
> MGMT_Listen./var/run/rpd_ 54 0.009 0.005 0.000
> SNMP Subagent./var/run/sn 258 0.052 0.052 0.001