Notes taken at/around IETF 56
=============================

Note: only minutes that have been approved by participants are included.
[]: surround comments added later


RouteScience visit 14 Mar 2003 (including Mike Lloyd and Sean Finn)
-------------------------------------------------------------------

Answer to our question: is there intelligent handling of prefixes that
appear in an update message, and are also equivalent in the router's
state?
  Conjecture that there might be intelligent handling by Junipers and
  Foundries.

Rough sketch of the number of ASes at a particular distance from some AS
('AS' below). The intent is to show that it is diamond-shaped.

                  / \
               / . . . \            |
            / . . . . . . \         |
         | . . . . . . . . . . . |  |
            \ . . . . . . . /       |
               \ . . . . . /        |
                  \ AS /         distance

You want detailed policy near an origin AS, but you do not want to tell
the world (further away) about it. Can atoms be a solution here? Is it
possible to declare atoms some levels away from an origin AS?

Mike takes the opportunity to discuss an idea he has had for setting an
AS TTL in BGP announcements. A BGP route will be propagated a maximum of
'TTL' ASes away from the origin AS. This allows fine-grained policy near
the origin AS, which is not visible further away. See also the above
diagram.

Concerned about consistency if the distribution of the atom<->prefix
('adjective') mapping and the atom routing protocols ('noun') are
separate.
[ Patrick: right now these are carried together in one protocol, and kept
  together. However, there may be advantages to separating them into two
  protocols. In that case, consistency is indeed an important issue. ]

Mike: can you use an attribute (to tag routes during routing) without
having the separate layer containing the atomID (i.e. without doing
encapsulation during forwarding)? The goal for the attribute would be to
detect efficiently whether two prefixes are in the same atom and can be
treated the same.
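
The AS TTL idea described above can be illustrated with a minimal Python
sketch. It assumes a hypothetical `as_ttl` field that travels with each
announcement and is decremented at every AS boundary. The names
(`Announcement`, `propagate`) and the chain topology are illustrative
only; they are not part of BGP or of any draft mentioned in these notes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Announcement:
    prefix: str
    origin_as: int
    as_ttl: int  # hypothetical field: remaining number of AS hops


def propagate(ann, neighbors):
    """Return {AS: distance from origin} for every AS that receives `ann`.

    `neighbors` maps an AS number to the ASes it peers with. An AS only
    re-announces the route while the remaining TTL is positive, so the
    route is seen at most `ann.as_ttl` AS hops away from the origin.
    """
    seen = {ann.origin_as: 0}
    queue = deque([(ann.origin_as, ann.as_ttl)])
    while queue:
        asn, ttl = queue.popleft()
        if ttl == 0:
            continue  # TTL exhausted: do not announce any further
        for peer in neighbors.get(asn, ()):
            if peer not in seen:
                seen[peer] = seen[asn] + 1
                queue.append((peer, ttl - 1))
    return seen


# Illustrative chain of ASes: 1 - 2 - 3 - 4 - 5
topology = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
specific = Announcement("192.0.2.0/24", origin_as=1, as_ttl=2)
reach = propagate(specific, topology)
```

With a TTL of 2, the specific route is visible in ASes 2 and 3 but never
reaches ASes 4 and 5, which would only see whatever coarser announcement
the origin makes without a TTL limit.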

To save work / state in transit routers (in the island), edge routers
need to perform extra work / store extra state. How does that compare?
Must also take into account the costs of atoms in a systemic way: cost of
edge [capex and opex] and cost of transit [capex and opex]; not just the
cost of hardware/software complexity but also management/operations.

Re: the statistic of the Tel Aviv study into atoms, which says that 2-3 %
of prefixes change atom membership over 8 hours.
  This may actually be a lot.
  [ Patrick: don't agree. But we do need to verify these figures
    ourselves. ]

Sean: encourages consideration of the larger context of re-architecting
Internet routing: is this a solution or a part of a solution?
[ Sean later expands on this point by email: ]

  ... the general idea is that the existing Internet routing
  architecture is based on the high-level distinction:

      "my network" (i.e., controlled by my IGP)
      "everything else" (i.e., controlled by BGP)

  where "my network" advertises to "everything else", and "everything
  else" advertises back ... modulo policy, link failures,
  implementation bugs, etc.

  Per the general point that Mike and Andrew made:

  > Mike  : what is the killer argument for atoms?
  > Andrew: [improvement in] convergence time would be.

  An example I used was that specific details of connectivity to small
  ISPs (indeed, small enterprise networks, etc.) in a faraway location
  like Thailand are probably not of local interest in North America.

  One problem seems to be that the "everything else" set is starting to
  become unwieldy for current control structures. (another is that the
  policy language we have to work with is weak ... but that is a
  separate discussion ;)

  Thus, perhaps a 'three tier' taxonomy to the network fabric would
  help:

      "my network"
      "the 'local' Internet"
      "the 'remote' Internet"

  If large sections of "the 'remote' Internet" could be wrapped up in an
  "Atom-like" unit, it appears that table size, and convergence time,
  could be significantly improved.

  A simplistic way to describe this is to aggregate based on
  'continental routing registries'. Geoff mentioned some legitimate
  criticisms of this, based on current topology: there are "networks"
  that span continents. This is true, but ... if it is possible to
  discuss re-evaluating what "routing" is, it's not clear to me that the
  current definition of "network" necessarily needs to be held
  constant ...


IETF Monday 17 March 9AM: includes Mike Lloyd, Andrew Lange
-----------------------------------------------------------

Mike: putting the 'mpls' slide toward the end avoids the nasty but
critically important question here, which is 'does this require changing
the forwarding plane?' This is the million dollar question.

Andrew mentioned that though 2547 VPN is to be a replacement for ATM or
frame relay, it turns out that people who have ATM or frame relay like
them and want to keep them.

Mike  : what is the killer argument for atoms?
Andrew: convergence time would be. The only justifiable argument for
        atoms is if you can use atoms to reduce BGP convergence time
        (20 to 5 min).
[ Patrick: means convergence time of a single box (rather than the
  convergence time of a single prefix in the Internet as described by
  Labovitz). ]

Andrew?: Existing BGP features already buy you a lot. Cisco used to have
one NLRI per update message. Just a small change (multiple NLRIs in
update) cut convergence time in half (group updates by attribute).

Andrew: if you were to reboot a GSR (GSR 12000) with full routes, that
can take 20 min! Full routes here defined as: @120,000 routes, 2 million
paths. This experience is on the old IOS, before the recent improvements
in NLRI packing.
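
The effect of grouping updates by attribute, mentioned above, can be
illustrated with a small Python sketch. It is not how any router
implements packing: the `Route` tuple and the `pack_updates` helper are
hypothetical, path attributes are reduced to an AS path plus a next hop,
and the BGP message size limit is ignored.

```python
from collections import defaultdict
from typing import NamedTuple


class Route(NamedTuple):
    prefix: str
    as_path: tuple
    next_hop: str


def pack_updates(routes):
    """Group prefixes that share identical attributes into one update.

    Returns a list of (attributes, [prefixes]) pairs, one pair per
    update message, instead of one message per prefix.
    """
    by_attrs = defaultdict(list)
    for r in routes:
        by_attrs[(r.as_path, r.next_hop)].append(r.prefix)
    return list(by_attrs.items())


routes = [
    Route("192.0.2.0/24", (64500, 64501), "10.0.0.1"),
    Route("198.51.100.0/24", (64500, 64501), "10.0.0.1"),
    Route("203.0.113.0/24", (64500, 64502), "10.0.0.1"),
]
updates = pack_updates(routes)
print(len(routes), "prefixes packed into", len(updates), "updates")
```

The first two prefixes share an AS path and next hop and therefore travel
in one update; the third has a different AS path and needs its own.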

Convergence time, two kinds:

  o Labovitz (convergence of the state of multiple routers, one prefix)
      Mike: go through four scenarios in Labovitz papers (Tup, Tdown,
      Tlong, Tshort), and see how much atoms will help
      [ k: i think this is important ]
  o Reboot (convergence of one router, all prefixes)
      o Controlled reboot
      o Spontaneous reboot (crash)

Andrew explains how a controlled reboot is performed. It involves
changing IGP metrics while the box is allowed to converge. If atoms were
to help in the case of a spontaneous reboot, that would be a good selling
point.
[ k: not sure they would be though, still would need to do all the
  atom/prefix mapping ]

Re: extending atoms with reachability bits
Patrick: an announcement of an atom over a link is accompanied with a
per-prefix bit specifying whether the prefix is currently reachable
through that link. Avoids regrouping prefixes into atoms after link
failure: atom contents are administrative, not dependent on link
failures.

Patrick shows example:

                 C
                 |
             ---------
             |       |
         .................
         .   A       B   .
         .   |       |   .  AS
         .  (not shown)  .
         .   |   |   |   .
         .   P1  P2  P3  .
         .................

Patrick: AS contains prefixes P1, P2, P3 connected to BGP routers A and
B, which are connected to BGP router C in some other AS. P1, P2, P3 have
independent IGP links to A and B (not shown, too hard in ASCII), i.e.
each link from Px to A or B can fail independently.

In this situation, A and B can announce one atom A1 to C containing P1,
P2, P3. What happens if the link P1 - A breaks? In that case, there are
two atoms: A2 containing only P1 (reachable through B), the other A3
containing P2 and P3 (reachable through A and B). Therefore, two updates
need to occur: a new atom (A2) is announced, and atom A1 is transformed
into A3 by removing prefix P1. These updates are visible throughout the
Internet.

Using reachability bits, A and B can continue to announce atom A1 to C.
A1 retains all three prefixes, but in its announcement A uses a
per-prefix bit to specify that P1 is not currently reachable. B does not
change its announcement at all. The result is a single update, which
might be absorbed at C. C needs to make a per-prefix decision for the
next hop (A or B). P1 can be forwarded to B, P2 and P3 to the next hop
for the atom.

The problem being solved only occurs if there is a partition of the AS
(in this case for P1): B cannot reach P1. Without a partition, A and B
will always have reachability to prefixes of the AS. How often does that
occur?

Andrew or Mike notes that internal partitioning of an AS violates BGP:
BGP peers of an AS are required to maintain IBGP sessions with each
other. That does not prevent partitioning from occurring of course.

Mike : C is going to make a single decision for the atom, except when it
doesn't. It's bad to change a fundamental part of BGP, which is that
'nothing but the best path matters'.

Mike and Brad don't like reachability bits. Patrick remains unconvinced.
[ Patrick: instead of using reachability bits to forward traffic over a
  backup path for some prefix, they might be used to stop forwarding
  traffic for the prefix. This will avoid carrying traffic that will not
  reach its destination anyway. ]

Re: origin-unreachability.
[ Patrick: the property of atoms that an origin AS can specify that a
  prefix is *completely* unreachable by removing the prefix from the atom
  in the atom<->prefix mapping. ]
Andrew likes the idea of origin-unreachability. The question is: how do
you detect a situation in which a prefix is completely unreachable (i.e.
there are no alternative paths to the prefix)?

Answer to our question: is there intelligent handling of prefixes that
appear in an update message, and are also equivalent in the router's
state?
  There is no *intelligent* processing, however the packing of multiple
  NLRI in an update takes care of the parsing overhead etc.
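
The reachability-bit example discussed earlier (atom A1 containing P1, P2
and P3, announced to C by A and B) can be sketched in Python as follows.
The sketch only illustrates the per-prefix decision at C; the data layout
and the `next_hops` function are assumptions made for illustration, not a
proposed encoding.

```python
def next_hops(atom_prefixes, announcements, preferred):
    """Choose a next hop per prefix for one atom.

    `announcements` maps each announcing peer to its per-prefix
    reachability bits. `preferred` is the peer chosen for the atom as a
    whole. A prefix falls back to another peer only if the preferred
    peer marks it unreachable; None means no peer can reach it.
    """
    result = {}
    for prefix in atom_prefixes:
        if announcements[preferred][prefix]:
            result[prefix] = preferred
            continue
        backups = [peer for peer, bits in announcements.items()
                   if peer != preferred and bits[prefix]]
        result[prefix] = backups[0] if backups else None
    return result


atom_a1 = ["P1", "P2", "P3"]

# After the link P1 - A breaks: A clears the bit for P1, B is unchanged.
announcements = {
    "A": {"P1": False, "P2": True, "P3": True},
    "B": {"P1": True, "P2": True, "P3": True},
}
decision = next_hops(atom_a1, announcements, preferred="A")
```

With A as the preferred next hop for the atom, P1 is sent to B while P2
and P3 keep following the atom's next hop. A result of None for a prefix
would correspond to Patrick's alternative of not forwarding traffic for
that prefix at all.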

Suggestion: atoms as a (community) attribute to be dealt with by the BGP
Decision Process as the first attribute following local attributes. If
paths that carry the attribute are preferred over paths that do not,
there is an incentive to use atoms. Also: billing.

Mike discusses his idea of AS TTLs in BGP announcements (see earlier),
also wrt Andrew's draft of flexible communities.


-----------------------------------------------------------------------
IETF: Geoff Huston, Andrew Partan, Sean Finn, Mike Lloyd, Andrew Lange,
Jeffrey Haas (GateD)
-----------------------------------------------------------------------

Geoff: null atoms. An AS declares all possible policies that any of its
prefixes might have as empty atoms. Prefixes are assigned to these atoms.
Changing the policy of a prefix consists of reassigning it to another
atom. The underlying assumption is that such changes are cheap: flood
updates to the mapping, rather than rerouting atoms.

Sean : Should look for static structures in the Internet.

Jeffrey: Most ISPs announce one or two routes.
[ Jeffrey: I believe this is data I've seen from CAIDA ]

When processing BGP updates:
  o MED "election" is the most expensive piece of route selection.
  o Policy on path attributes is more expensive than policy on prefixes.
    In general, regular expressions are expensive to process.
  o The cost of picking the active route is low.
  o When you've completed route selection, you send the results to all
    internal peers. Thus, you build a single packet and send it to all
    of your internal peers. For external peers, you have to build a
    packet for each of them. Thus, each external peer is more work than
    a large number of internal peers.

The cost of running policy is relatively high. Convergence of a system is
more of an issue than individual policy. Specifically, convergence of the
Internet is our primary issue.

If the goal of the atoms work is to reduce the number of
prefixes/elements to which policy can be applied, then, while "atoms"
make up part of the load of the work that must be accomplished, we have
more problems with the convergence of the system, which reducing the
policy elements likely would not address.


IETF Tuesday 18 March: Dave Meyer, Ted Seely of Sprintlink
----------------------------------------------------------

Dave's answer to our question: what aspects of routing table size matter
most?
  o The number of entries in tables is bounded by 2^32.
  o The dynamics are what matter. There might be some non-standard
    packing of updates between Cisco routers.

Bad:
  o Per-peer prefix mapping: another table
  o Convergence issues for distributing the mapping: adding extra
    inconsistency is not good. Suggests that even deflection routing is
    better than that.
  o Any assumptions about providers cooperating

Re: Route merging
[ Patrick: announcing a primary path and backup paths together using an
  AS set, to absorb updates which switch between different paths ]

Dave / Ted: Does this situation arise enough? Is it a big deal? Should do
a topology study to look for cases.

This hides information. Customers that multihome their prefixes want info
about their prefixes. If the primary path fails, customers will notice
the performance change and will be on the phone.
Patrick: what difference does it make to a customer whether a performance
change is accompanied by a change in BGP?

IBGP behaviour is rapidly improving. EBGP is just the way it is
(factorial convergence).

Really important issue right now is OPEX. Last year CAPEX was cut, this
year it's OPEX. There are razor-thin margins. Therefore, it is very
important that atoms do not increase management/operations.

We should check that we're not breaking anything in RFC 2547.


IETF Tuesday 18 March: Pedro Marques
------------------------------------

Pedro: Thinks that current aggregation that happens in routers is already
better than atoms.
Can do IBGP/EBGP. It is preferable to do it on a hop-by-hop basis.
[ Patrick: referring to packing of multiple destinations in BGP updates ]

An internal IBGP change that doesn't result in EBGP attribute changes can
sometimes result in a flap outside, because the implementation is less
than perfect. In particular, Cisco may be using heuristics on outbound
processing.

k: What is the biggest problem then?
Pedro: I get tier1 customers telling me they have 100 IBGP events/minute
on their networks: what are those? external flaps? The IBGP network can
amplify the number of updates because it receives the same external event
from several ingress points. Withdrawal of a route with a MED may cause
100s of IBGP updates. Why?

Another interesting question is how well routers actually perform packing
of updates. In the past, some Ciscos sent 100,000 updates when 40,000
would have sufficed.

The number of attributes matters more than the number of prefixes. Origin
declared atoms have limited value. There are 30-40,000 distinct
attributes vs. 20,000 distinct paths.

For reliability a multihomed AS announces all of its prefixes on all
links. For load balancing, a prefix is announced with a different
'metric' on the various links (e.g. using a prepended AS path). This
results in a greater number of atoms if non-AS path attributes are taken
into account.

In IBGP, routes from different IBGP peers carry different attributes
(notably next hop).
[ Patrick: that shouldn't matter. What matters is whether different
  prefixes in an atom get assigned a different attribute by an (IBGP)
  peer. LOCAL_PREF might be interesting here. How does LOCAL_PREF get
  assigned? ]


Thur 20 March: Vijay Gill
-------------------------

Re: Route merging (see earlier)
Vijay asks Ted Seely of Sprintlink why AS sets are a problem.
Ted: upstream customers may still see the path available but not be using
it.

Vijay: AOL has 600 prefixes, less than 100 are used for traffic
engineering (since they hold most of the traffic).
[ k: he used 80/20 rule, wonder if he had actually measured. ]
[ vijay: The 80/20 rule comes from empirical views. I regularly do
  traffic engineering and we all know which prefixes to move around to
  make multi gigabit/s shifts. ]

BGP is essentially pretty simple.

Vijay still likes atoms; essentially the idea of routing on ASes instead
of prefixes is intuitively nice. He thinks there are good nuggets in the
atoms work, especially the idea of having the distribution be a separate
protocol; folks could use it as a replacement for 2547 or anything else.

'AS container' routing is a very good idea that needs to be thought out
more. But from an operational perspective, it's hard to use even in an
intra-domain sense, for traffic engineering. Currently pretty hacky.
Don't announce consistent prefixes or attributes. Selectively block,
announce, distribute. Vijay draws figure of a large AS with several POPs
to illustrate.