[c-nsp] Meraki...information

Thu Oct 24 13:52:37 EDT 2013

In response to my own post, this unsurprisingly turned out not to be a problem with the Meraki switches.  Nor was it, and this is a shocker, a problem with my most expertly diagnosed "fundamental misunderstanding" of switching and buffering by an astutely omnipotent responder.  The tl;dr is that it so far appears to be a bug in the ME3600.  Read further if you want the long long long explanation.

For those that don't want to read the whole thing, I have one question to get out there: does anyone have a diagram of the packet flow through the ME3600 architecture?  I thought that I had one somewhere, but I can't seem to find it.

The novel:

What we found out was that the ME3600 drops traffic to certain destinations when the 10G port is configured in an EFP-type configuration, but not when the 10G port is configured as a normal 'switchport' with the WAN VLAN as a tagged member of the port (switchport trunk allowed vlan 100).  It just so happened that these "certain destinations" included Akamai servers on our network which serve up content for our customers, so any sites displaying Akamai content were loading incredibly slowly, if at all.

Furthermore, this only occurred to destinations that exist within any /18, /19, or /20 that our distribution network readvertises from the core to the ME3600, but not if there is a more specific route.  For example, if the client is trying to ping 192.168.198.1 and the most specific route in the ME3600 routing table is 192.168.192.0/18, the switch drops the traffic.  If either the /18 is removed from the BGP routing table of the ME3600 (so the path follows the default route) OR a more specific route gets advertised to the ME3600 (such as 192.168.193.0/24), then everything works perfectly.  Lastly, if the aggregate route is seen through the IGP instead of BGP, things work fine.  It may be important to note that the 'aggregate' route here is just a nailed up /18 to null0 in the core, not an 'aggregate' in the BGP sense of the word.
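To make the route-selection part of this concrete, here is a minimal sketch in Python of plain longest-prefix matching, using only the standard ipaddress module.  It models which route the routing table should select for 192.168.198.1 in each of the three cases, nothing more; it says nothing about what the ME3600 hardware actually does with the packet afterwards.  The function name, the table names and the 192.168.198.0/24 more-specific are hypothetical, chosen only so that the more-specific covers the example destination.

```python
import ipaddress


def longest_prefix_match(destination, routes):
    """Return the most specific prefix in `routes` that covers `destination`.

    `routes` is a list of prefix strings; returns an ip_network or None.
    """
    addr = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(r) for r in routes]
    matching = [n for n in networks if addr in n]
    # The longest prefix length wins; None if nothing matches at all.
    return max(matching, key=lambda n: n.prefixlen, default=None)


# Hypothetical routing tables for the three cases described above.
default_only = ["0.0.0.0/0"]                              # /18 removed: follow the default
with_aggregate = ["0.0.0.0/0", "192.168.192.0/18"]        # the failing case
with_specific = with_aggregate + ["192.168.198.0/24"]     # assumed covering more-specific

for name, table in [
    ("default only", default_only),
    ("with /18", with_aggregate),
    ("with more-specific", with_specific),
]:
    print(name, "->", longest_prefix_match("192.168.198.1", table))
```

In this model the only case that selects the /18 is the middle one, which is the case where we see the drops.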

The TAC engineer believes that there is some problem between the ASICs that handle routing and the 10G module.  A different ASIC handles the 10G module than the 1G ports, which would explain why things worked properly on the 1G port (Gi0/24) and not on the 10G port (Te0/1) with the exact same config.  The switch is dropping this traffic at some point along the internal path.  We discovered yesterday that an IP access-list applied to the WAN SVI shows the traffic hitting the ACL, but a passive network tap installed on the fiber shows that it never actually leaves the switch.

This condition has been completely reproducible in our lab, but I'm not sure why we're not seeing this problem in our network on a more widespread basis, as all the ME3600s we have deployed are practically identical in configuration (they use 10G EFP ports, ISIS, BGP, MPLS).  It's not a software version issue, because we've tried three versions so far, one of which matches the IOS we're running in production.  This occurs on 15.3(3)S, 15.3(2)S1 and 15.3(2)S, at a minimum.

If this does turn out to be a bug, I'll be happy to post it to the list or you can contact me directly to request further details.

-evt

> -----Original Message-----
> From: cisco-nsp [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Eric
> Van Tol
> Sent: Thursday, October 10, 2013 1:04 PM
> To: cisco-nsp at puck.nether.net
> Subject: [c-nsp] Meraki...information
> 
> Hi all,
> We ran into a very strange problem last night with a customer who utilizes
> Meraki switches.  I'd like to ask anyone on the list who is familiar with
> this model of switch whether there is *any* possibility that an upstream
> modification would cause issues with traffic traversing these switches.
> 
> A little background: we attempted to perform a migration of a transport
> circuit in our network from 1G to 10G last night, but the single customer
> attached to the ME3600 where the transport circuit was changed, started to
> have issues.  There are no errors being reported on either end of the
> circuit, light levels are good, and we get consistent 1500-byte df-bit pings
> to their firewall from both inside and outside our borders.  The transport
> circuit is not even a circuit that "touches" the customer's network.
> However, they report slow browsing from within their LAN (but not from their
> DMZ on the same ASA).  When switching the transport circuit back to 1G,
> everything works fine.  There is absolutely no difference in the routing,
> path, or IP addresses on this transport circuit - the only difference is
> link speed.
> 
> Customer now believes the problem is with their Meraki switches, but we are
> both confused about how a change two physical hops upstream from their LAN
> would cause such issues.  The "slow browsing" issue is definitely contained
> within their network, as they are not even able to browse their own website
> which is located entirely on their infrastructure and doesn't pass through
> the 10G link, or even through the CPE we provide.
> 
> I know nothing about the Meraki product, besides the fact that it's a cloud
> managed solution.  Has anyone ever heard of a problem like this before with
> this model of switch?
> 
> Thanks,
> evt