



I believe (based solely on a single long-ago observation that an M/M/1/K queuing model seemed to predict the measured behaviour of core routers with short buffers pretty accurately) that the speculation in your third sentence above may in fact be true. To relate this to another recent thread on this list, however, this seems like one of those things that should require no speculation since it is not so difficult to measure, yet I know of no good-quality data from "typical" core circuits which has been published anywhere. The network we've built is constructed with insufficient instrumentation to enable us to understand what it is we've built with any certainty, so we speculate.
---- dennis@juniper.net
http://www.postel.org/pipermail/end2end-interest/2003-January/002720.html
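The M/M/1/K model mentioned in the quote has a closed-form blocking (loss) probability, which is what makes the speculation straightforward to check against measured loss on a short-buffer router. A minimal sketch (the function name is ours; K counts the packet in service plus the buffer slots):

```python
def mm1k_loss_probability(rho: float, k: int) -> float:
    """Blocking probability of an M/M/1/K queue.

    rho: offered load (arrival rate / service rate)
    k:   system capacity (packet in service + buffer slots)
    """
    if rho == 1.0:
        # Degenerate case: all K+1 states equally likely.
        return 1.0 / (k + 1)
    return ((1.0 - rho) * rho**k) / (1.0 - rho**(k + 1))

# e.g. a 16-slot system at 95% offered load:
# mm1k_loss_probability(0.95, 16)
```

Comparing this curve against measured drop rates on a core circuit would be exactly the kind of "not so difficult to measure" validation the quote asks for.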
Obstacles to modeling and simulation
The following information is critical for large-scale simulation of the routing mesh, yet unavailable to anyone lacking source code access:
Deviations from RFC
- timers (vendors often change these based on tuning tests, e.g., iBGP timers, hold-down timers, and in general timers designed to prevent propagation of state changes, or to slow them down or speed them up)
- decision algorithms (e.g., the BGP decision process, changed based on user feedback?); sometimes documented in release notes
Bugs/Failure Modes
Completely ignored by all current Internet simulations. Probably requires a research effort to uncover the bugs, but vendor engineers could likely provide assistance in prioritizing some issues in widely deployed software/hardware that researchers should consider.
BGP failure propagation: router stress
We need to define the purpose of the model very clearly to get the right abstraction level. There are two issues: router stress (leading to failures/reboots) and BGP behavior. Each requires vastly different kinds of information. Recent research (NMS PI Ogielski) has focused on the effects of worms, such as Code Red, and of other potential infrastructure attacks that cause some routers to fail and reboot. Prompted by the Cisco advisory on Code Red, Ogielski's group formulated a high-abstraction-level stress model (for which we still lack most of the parameters). Assumptions:
Failures are triggered by either memory exhaustion (some memory area) or prolonged CPU overload at interrupt level. Thus our high level model is:
                         f_mem                       f_mem_fail
Packets of certain   -------->  memory requirements ----------->  mem-induced failure
types are switched
at rates R1, R2, ... -------->  CPU requirements    ----------->  cpu-induced failure
                         f_cpu                       f_cpu_fail
f_mem and f_cpu are functions that translate packet rates Rx into memory and CPU resource requirements, respectively. f_mem_fail and f_cpu_fail are functions that describe how the router (IOS) responds to the given memory/CPU requirements. Required knowledge:
- How can we quantify the resource requirements? That is, we'd like to be able to plug in numbers for activities involved in packet switching.
- How can we quantify the effects? For instance,
- How does IOS react to linecard/cache/main memory being exhausted? Can it recover through memory management? When does it reboot? Under what conditions?
- Our current basic assumption regarding CPU load is that only interrupt-level CPU load is dangerous in terms of triggering failures. Is this really correct? Is there a mean time to failure under 100% load?
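The two-path model above can be sketched with pluggable functions. Every functional form and threshold below is a placeholder, since, as noted, we still lack most of the real parameters:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class RouterStressModel:
    """Sketch of the high-level stress model; all parameters illustrative."""
    f_mem: Callable[[Sequence[float]], float]   # packet rates -> memory demand
    f_cpu: Callable[[Sequence[float]], float]   # packet rates -> CPU demand
    f_mem_fail: Callable[[float], bool]         # memory demand -> failure?
    f_cpu_fail: Callable[[float], bool]         # CPU demand -> failure?

    def step(self, rates: Sequence[float]) -> dict:
        mem = self.f_mem(rates)
        cpu = self.f_cpu(rates)
        return {
            "mem_demand": mem,
            "cpu_demand": cpu,
            "mem_failure": self.f_mem_fail(mem),
            "cpu_failure": self.f_cpu_fail(cpu),
        }

# Placeholder parameterization: linear resource costs, hard thresholds
# (all constants made up for illustration).
model = RouterStressModel(
    f_mem=lambda rates: sum(r * 0.5 for r in rates),   # KB per pkt/s (assumed)
    f_cpu=lambda rates: sum(r * 0.01 for r in rates),  # % load per pkt/s (assumed)
    f_mem_fail=lambda mem: mem > 256_000,              # 256 MB exhausted
    f_cpu_fail=lambda cpu: cpu > 100.0,                # sustained interrupt-level overload
)
```

Answering the required-knowledge questions above amounts to replacing these placeholders with empirically measured functions.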
Aside (question): a failure-postponing mode
What is the feasibility of introducing a mode that would gloss over some link failures and postpone dropping the BGP session until a few minutes (or some other configured flap time) have passed, while in the meantime forwarding packets to the addresses affected by the failure over backup or non-best paths? (Perhaps switching if you hear valid routes over another session, or at least lowering local-pref on the routes learned over the interface that died and putting a TTL on them. Lots of iBGP dynamics details need attention if you do that.)
Concern (avi): the whole fast-failover/fast-recovery area seems like a wildcard, because if the router is dying or links are failing due to corrupted RAM because the software sucks, then it could be leaving around bad routes. An advantage of peering between loopbacks: if a specific interface dies but other paths are up (even indirect ones) that provide a route between two iBGP-speaking hosts, it takes a few more seconds before the BGP session dies.
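The timer logic of such a mode can be sketched as a small state machine. This is purely hypothetical (no vendor implements it as described here); the names and the 180-second default flap time are made up, and a real implementation would need all the iBGP-dynamics care noted above:

```python
class PostponedSession:
    """Sketch of a hypothetical failure-postponing mode for a BGP session."""

    def __init__(self, flap_time_s: float = 180.0):
        self.flap_time_s = flap_time_s   # configured flap time before teardown
        self.failed_at = None            # timestamp of the link failure, if any
        self.local_pref_demoted = False  # routes over the dead interface demoted?

    def on_link_failure(self, now: float) -> None:
        # Don't drop the session yet: demote the affected routes so traffic
        # shifts to backup/non-best paths, and start the postponement clock.
        self.failed_at = now
        self.local_pref_demoted = True

    def on_link_recovery(self) -> None:
        # Link came back within the flap time: cancel the teardown.
        self.failed_at = None
        self.local_pref_demoted = False

    def should_drop_session(self, now: float) -> bool:
        # Tear the session down only once the flap time has elapsed.
        return (self.failed_at is not None
                and (now - self.failed_at) >= self.flap_time_s)
```

The avi concern applies directly: if the demoted routes are themselves corrupt, this machinery keeps bad state alive for the whole flap window.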
US govt role in shepherding collaborations between industry and research community
US federal agencies that fund network research should strongly consider funding proposals that work sufficiently closely with vendors to integrate data such as the above into their modeling, simulation, and analysis activities. Ideally such a collaboration would also involve a provider willing to supply operational empirical data. As an example, NSF could fund researchers to work with Cisco to make available laboratories of used/returned Cisco gear, never to be resold or put into production, and potentially to make available a full-featured IOS running on a generic platform. Note that even this kind of software availability (object form only) would not catch all failure modes on a distributed-switching platform, but it would be a huge step. (This would be most useful for using the devices as if they were core routers, and simulating input in terms of routing and traffic.)
(Note: John Morgridge personally donated a stack of routers to the University of Wisconsin Advanced Internet Lab. The stated purpose of the lab is to re-create portions of the Internet for research purposes, testing "Internet congestion and loss behavior, network performance and management, techniques for wide area network measurement, and routing protocol behavior".)
Open questions: Will this lab allow researchers to download arbitrary images and configurations onto the routers to emulate realistic deployed field conditions? Can researchers get access to download images onto the routers from www.cisco.com, and will they have access to look up the caveats on those images?
Sensitivity to vendors
A lot of this information may require NDAs with researchers, though much of it, e.g., bugs and RFC deltas, is discussed on public mailing lists now. It is imperative that network researchers respect the sensitivity of vendor-specific information and its potential impact on the vendor as well as the infrastructure, and follow these guidelines:
- If you find a bug or some optimization you feel is important, the associated vendor should be the first to hear about it, rather than the press or the nanog list or anything more public, and that vendor should have the opportunity to fix it before it gets discovered/thrashed by the press.
- Recognize that the vendor is giving researchers access to enormously overloaded engineers. Respect their time by asking real questions whose answers have material relevance to the Internet, as opposed to wondering, theorizing, and supplying free advice.