Notes taken at/around IETF 56
=============================

Note: only minutes that have been approved by participants are included.
[]: surround comments added later

RouteScience visit 14 Mar 2003 (including Mike Lloyd and Sean Finn)
-------------------------------------------------------------------

  Answer to our question: is there intelligent handling of prefixes that appear
  in an update message, and are also equivalent in the router's state?

    Conjecture that there might be intelligent handling by Junipers and
    Foundries.

  Rough sketch of the number of ASes at a particular distance from some AS
  ('AS' below). The intent is to show that it is diamond-shaped.  


    / \           /   . . .   \
     |
     |         /    . . . . . .   \
     |
     |       | . . . . . . . . . . . |
     |
     |        \    . . . . . . .  /
     |
     |          \    . . . . .  /
     |
     |             \    AS   /
  distance      

  You want detailed policy near an origin AS, but you do not want to tell the
  world (further away) about it. Can atoms be a solution here? Is it possible
  to declare atoms some levels away from an origin AS?

  Mike takes the opportunity to discuss an idea he has had for setting an AS
  TTL in BGP announcements. A BGP route will be propagated a maximum of
  'TTL' ASes away from the origin AS. This allows fine-grained policy near the
  origin AS, which is not visible further away. See also the above diagram.
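
  A minimal sketch of the AS TTL idea, not a protocol design: an announcement
  is flooded breadth-first over a hypothetical AS-level adjacency map and is
  not re-announced once it has crossed 'ttl' AS hops. The function name, the
  topology and the AS numbers are illustrative assumptions; policy and
  best-path selection are ignored.

```python
from collections import deque


def propagate(adjacency, origin, ttl):
    """Return {AS: hop distance} for every AS that hears the route.

    The origin is at distance 0 and the announcement crosses at most
    `ttl` AS-level hops. Policy and best-path selection are not modelled.
    """
    reached = {origin: 0}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        if reached[current] == ttl:
            continue  # TTL exhausted: this AS does not re-announce
        for neighbour in adjacency.get(current, ()):
            if neighbour not in reached:
                reached[neighbour] = reached[current] + 1
                queue.append(neighbour)
    return reached


# Hypothetical chain topology: AS1 - AS2 - AS3 - AS4
ADJ = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(propagate(ADJ, origin=1, ttl=2))  # {1: 0, 2: 1, 3: 2}
```

  With ttl=2 the route is seen by AS2 and AS3 but never reaches AS4, so a
  more specific policy near the origin stays invisible further away.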

  Concerned about consistency if the distribution of the atom<->prefix
  ('adjective') mapping and the atom routing protocols ('noun') are separate.

    [ Patrick: right now these are carried together in one protocol, and kept
      together. However, there may be advantages to separate into two
      protocols. In that case, consistency is indeed an important issue.
    ]

  Mike: can you use attribute (to tag routes during routing) without having
        the separate layer containing atomID (i.e. without doing encapsulation
        during forwarding)?

    The goal for the attribute would be to detect efficiently whether two
    prefixes are in the same atom and can be treated the same.

  To save work / state in transit routers (in the island), edge routers need to
  perform extra work / store extra state. How does that compare? Must also
  take into account the costs of atoms in a systemic way: cost of edge [capex
  and opex] and cost of transit [capex and opex], counting not just
  hardware/software complexity but also management/operations.

  Re: the statistic of the Tel Aviv study into atoms, which says that 2-3 % 
  of prefixes change atom membership over 8 hours

    This may actually be a lot.

    [ Patrick: don't agree. But we do need to verify these figures ourselves. ]

  Sean: encourages consideration of larger context of re-architecting Internet
        routing, is this a solution or a part of a solution?

  [ Sean later expands on this point by email: ]

      ... the general idea is that the existing Internet routing architecture
      is based on the high-level distinction:

        "my network"      (i.e., controlled by my IGP)
        "everything else" (i.e., controlled by BGP)

      where "my network" advertises to "everything else", and "everything else"
      advertises back ... modulo policy, link failures, implementation bugs,
      etc.

      Per the general point that Mike and Andrew made:

      >   Mike  : what is the killer argument for atoms?
      >   Andrew: [improvement in] convergence time would be.
   
      An example I used was that specific details of connectivity to small ISPs
      (indeed, small enterprise networks, etc.) in a faraway location like
      Thailand are probably not of local interest in North America.

      One problem seems to be that the "everything else" set is starting to
      become unwieldy for current control structures. (another is that the
      policy language we have to work with is weak ... but that is a separate
      discussion ;)

      Thus, perhaps a 'three tier' taxonomy to the network fabric would
      help:
   
         "my network"
         "the 'local' Internet"
         "the 'remote' Internet"
   
      If large sections of "the 'remote' Internet" could be wrapped up
      in an "Atom-like" unit, it appears that table size, and convergence
      time, could be significantly improved.
   
      A simplistic way to describe this is to 'aggregate based on
      continental routing registries'. Geoff mentioned some legitimate
      criticisms of this, based on current topology: there are "networks"
      that span continents. This is true, but ... if it is possible to
      discuss re-evaluating what "routing" is, it's not clear to me
      that the current definition of "network" necessarily needs to
      be held constant ...

IETF Monday 17 March 9AM: includes Mike Lloyd, Andrew Lange
-----------------------------------------------------------

  Mike: putting the 'mpls' slide toward the end avoids the
        nasty but critically important question here which is
        'does this require changing the forwarding plane'
        this is the million dollar question.

  Andrew mentioned that although 2547 VPN is meant to be a replacement for ATM
  or frame relay, it turns out that people who have ATM or frame relay like
  them and want to keep them.

  Mike  : what is the killer argument for atoms?
  Andrew: convergence time would be.
          The only justifiable argument for atoms is if you can use atoms to
          reduce BGP convergence time (from 20 to 5 min).
    [ Patrick: means convergence time of a single box (rather than the
        convergence time of a single prefix in the Internet as described by
        Labovitz).
    ]

  Andrew?: Existing BGP features already buy you a lot. Cisco used to
           have one NLRI per update message. Just a small change (multiple
           NLRIs in update) cut convergence time in half (group updates by
           attribute).
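
  A minimal sketch of "group updates by attribute": prefixes whose path
  attributes are identical are packed into one update instead of one update
  per NLRI. The route representation, attribute names and addresses are
  assumptions for illustration, and the limit on BGP message size, which
  bounds how many NLRIs fit in one update, is not modelled.

```python
from collections import defaultdict


def pack_updates(routes):
    """Group prefixes that share identical path attributes.

    `routes` is a list of (prefix, attributes) pairs, where attributes is a
    dict with hashable values. Returns a list of (attributes, [prefixes]),
    one entry per update message that would be sent.
    """
    grouped = defaultdict(list)
    for prefix, attrs in routes:
        key = tuple(sorted(attrs.items()))  # order-independent attribute key
        grouped[key].append(prefix)
    return [(dict(key), prefixes) for key, prefixes in grouped.items()]


# Hypothetical routes using documentation prefixes and private AS numbers.
ROUTES = [
    ("192.0.2.0/24", {"as_path": (64500, 64501), "next_hop": "10.0.0.1"}),
    ("198.51.100.0/24", {"as_path": (64500, 64501), "next_hop": "10.0.0.1"}),
    ("203.0.113.0/24", {"as_path": (64500, 64502), "next_hop": "10.0.0.1"}),
]
updates = pack_updates(ROUTES)
print(len(updates))  # 2 updates instead of 3
```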

  Andrew: if you were to reboot a gsr (GSR 12000) with full routes, that can
          take 20 min!
          Full routes here defined as: ~120,000 routes, 2 million paths.
          This experience is on the old IOS, before the recent improvements in
          NLRI packing.

  Convergence time, two kinds:

    o Labovitz (convergence of the state of multiple routers, one prefix)

      Mike: go through four scenarios in Labovitz papers (Tup, Tdown, Tlong,
            Tshort), and see how much atoms will help

        [ k: i think this is important ]

    o Reboot (convergence of one router, all prefixes)

      o Controlled reboot

      o Spontaneous reboot (crash)

      Andrew explains how a controlled reboot is performed. It involves
      changing IGP metrics while the box is allowed to converge.

      If atoms were to help in the case of a spontaneous reboot, that would be
      a good selling point.

        [ k: not sure they would be though, still would need to do all the
          atom/prefix mapping
        ]


  Re: extending atoms with reachability bits

    Patrick: an announcement of an atom over a link is accompanied with a
             per-prefix bit specifying whether the prefix is currently
             reachable through that link. Avoids regrouping prefixes into atoms
             after link failure: atom contents is administrative, not dependent
             on link failures

    Patrick shows example:


                      C
                      |
                  ---------
                 |         |
              .................
              .  A         B  .
              .  |         |  . AS
              .  (not shown)  .
              .  |    |    |  .
              .  P1   P2   P3 .
              .................

    Patrick: AS contains prefixes P1, P2, P3 connected to BGP routers A and B,
             which are connected to BGP router C in some other AS. P1, P2, P3
             have independent IGP links to A and B (not shown, too hard in
             ASCII), i.e. each link from Px to A or B can fail independently. In
             this situation, A and B can announce one atom A1 to C containing
             P1, P2, P3. What happens if the link P1 - A breaks? In that case,
             there are two atoms: A2 containing only P1 (reachable through B),
             the other A3 containing P2 and P3 (reachable through A and B).
             Therefore, two updates need to occur: a new atom (A2) is
             announced, and atom A1 is transformed into A3 by removing prefix
             P1. These updates are visible throughout the Internet.

             Using reachability bits, A and B can continue to announce atom
             A1 to C. A1 retains all three prefixes, but in its announcement
             A uses a per-prefix bit to specify that P1 is not currently
             reachable. B does not change its announcement at all. The result
             is a single update, which might be absorbed at C.
             
             C needs to make a per-prefix decision for the next hop (A or B).
             P1 can be forwarded to B, P2 and P3 to the next hop for the atom.
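
    A minimal sketch of the per-prefix decision at C in the example above,
    after the link P1 - A has failed. The data layout and the assumption
    that C's best path for atom A1 goes via A are illustrative only; nothing
    here describes an actual encoding of reachability bits.

```python
# Atom A1 as announced to C by each neighbour: prefix -> reachability bit.
ANNOUNCEMENTS = {
    "A": {"P1": False, "P2": True, "P3": True},  # link P1 - A is down
    "B": {"P1": True, "P2": True, "P3": True},   # announcement unchanged
}
ATOM_NEXT_HOP = "A"  # assumption: C's best path for atom A1 is via A


def next_hop(prefix, announcements=ANNOUNCEMENTS, preferred=ATOM_NEXT_HOP):
    """Use the atom's next hop unless its bit marks the prefix unreachable."""
    if announcements[preferred].get(prefix):
        return preferred
    for neighbour, bits in announcements.items():
        if bits.get(prefix):
            return neighbour  # per-prefix exception to the atom's decision
    return None  # no neighbour reaches the prefix


print({p: next_hop(p) for p in ("P1", "P2", "P3")})
# {'P1': 'B', 'P2': 'A', 'P3': 'A'}
```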

    The problem being solved only occurs if there is a partition of the AS
    (in this case for P1): B cannot reach P1. Without a partition, A and
    B will always have reachability to prefixes of the AS. How often does that
    occur?

    Andrew or Mike notes that internal partitioning of an AS violates BGP: BGP
    peers of an AS are required to maintain IBGP sessions with each other. That
    does not prevent partitioning from occurring of course.

    Mike   : C is going to make a single decision for the atom, except when it
             doesn't.

             It's bad to change a fundamental part of BGP, which is that
             'nothing but the best path matters'.

    Mike and Brad don't like reachability bits. Patrick remains unconvinced.

      [ Patrick: instead of using reachability bits to forward traffic over
                 a backup path for some prefix, they might be used to stop
                 forwarding traffic for the prefix. This will avoid carrying
                 traffic that will not reach its destination anyway.
      ]

  Re: origin-unreachability.
    [ Patrick: the property of atoms that an origin AS can specify that a
               prefix is *completely* unreachable by removing the prefix from
               the atom in the atom<->prefix mapping.]

    Andrew likes the idea of origin-unreachability.

    The question is: how do you detect a situation in which a prefix is
    completely unreachable (i.e. there are no alternative paths to the prefix).
  
  Answer to our question: is there intelligent handling of prefixes that appear
  in an update message, and are also equivalent in the router's state?

    There is no *intelligent* processing, however the packing of multiple
    NLRI in an update takes care of the parsing overhead etc.

  Suggestion: atoms as a (community) attribute to be dealt with by the BGP
  Decision Process as the first attribute following local attributes. If paths
  that carry the attribute are preferred over paths that do not, there is an
  incentive to use atoms. Also: billing.

  Mike discusses his idea of AS TTLs in BGP announcements (see earlier), also
  wrt Andrew's draft of flexible communities.


-----------------------------------------------------------------------
IETF: Geoff Huston, Andrew Partan, Sean Finn, Mike Lloyd, Andrew Lange,
      Jeffrey Haas (GateD)
-----------------------------------------------------------------------

  Geoff: null atoms. An AS declares all possible policies that any of its
         prefixes might have as empty atoms. Prefixes are assigned to these
         atoms. Changing the policy of a prefix consists of reassigning it to
         another atom. The underlying assumption is that such changes
         are cheap: flood updates to the mapping, rather than rerouting atoms.
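
  A minimal sketch of null atoms as described above: every possible policy
  is declared up front as an empty atom, and a policy change is only a
  change to the atom<->prefix mapping. The class, the policy names and the
  shape of the returned mapping update are hypothetical.

```python
class OriginAS:
    """Origin AS that pre-declares one (initially empty) atom per policy."""

    def __init__(self, policies):
        self.atoms = {name: set() for name in policies}
        self.atom_of = {}  # prefix -> atom currently holding it

    def assign(self, prefix, atom):
        """Move `prefix` into `atom`; return the mapping update to flood.

        The atoms themselves are never announced or withdrawn here, which
        is the point of declaring them in advance.
        """
        if atom not in self.atoms:
            raise KeyError(f"atom {atom!r} was not declared in advance")
        old = self.atom_of.get(prefix)
        if old is not None:
            self.atoms[old].discard(prefix)
        self.atoms[atom].add(prefix)
        self.atom_of[prefix] = atom
        return (prefix, old, atom)


origin = OriginAS(["primary", "backup"])
first = origin.assign("P1", "primary")   # ("P1", None, "primary")
second = origin.assign("P1", "backup")   # ("P1", "primary", "backup")
```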

  Sean : Should look for static structures in the Internet.

  Jeffrey: Most ISPs announce one or two routes.
           [ Jeffrey: I believe this is data I've seen from CAIDA ]
           
           When processing BGP updates:
           
             o MED "election" is the most expensive piece of route selection.

             o Policy on path attributes is more expensive than policy on
               prefixes. In general, regular expressions are expensive to
               process.

             o The cost of picking the active route is low.

             o When you've completed route selection, you send the results to
               all internal peers.  Thus, you build a single packet and send it
               to all of your internal peers. For external peers, you have to
               build a packet for each of them.  Thus, each external peer is
               more work than a large number of internal peers. The cost of
               running policy is relatively high.

           Convergence of a system is more of an issue than individual policy.
           Specifically, convergence of the Internet is our primary issue.   If
           the goal of the atoms work is to reduce the number of prefixes/
           elements to which policy can be applied, while "atoms" make up part
           of the load of the work that must be accomplished, we have more
           problems with the convergence of the system that reducing the policy
           elements likely would not address.

IETF Tuesday 18 March: Dave Meyer, Ted Seely of Sprintlink
----------------------------------------------------------

  Dave's answer to our question: what aspects of routing table size matter
  most?

    o The number of entries in tables is bounded by 2^32.

    o The dynamics are what matter.

  There might be some non-standard packing of updates between Cisco routers.

  Bad:
    o Per-peer prefix mapping: another table

    o Convergence issues for distributing the mapping: adding extra
      inconsistency is not good. Suggests that even deflection routing is
      better than that.

    o Any assumptions about providers cooperating

  Re: Route merging
    [ Patrick: announcing a primary path and backup paths together using an
               AS set, to absorb updates which switch between different paths
    ]

    Dave / Ted:
      Does this situation arise enough? Is it a big deal? Should do topology
      study to look for cases.

      This hides information. Customers that multihome their prefixes want info
      about their prefixes.

      If the primary path fails, customers will notice performance change and
      will be on the phone.

    Patrick: what difference does it make to a customer whether a performance
             change is accompanied by a change in BGP?
  
  IBGP behaviour is rapidly improving. EBGP is just the way it is (factorial
  convergence).

  Really important issue right now is OPEX. Last year CAPEX was cut, this year
  it's OPEX. There are razor-thin margins. Therefore, it is very important
  that atoms do not increase management/operations.

  We should check that we're not breaking anything in RFC 2547.

IETF Tuesday 18 March: Pedro Marques
------------------------------------

  Pedro: Thinks that current aggregation that happens in routers is already
         better than atoms.  Can do IBGP/EBGP. It is preferable to do it on
         a hop-by-hop basis.  
         [ Patrick: referring to packing of multiple destinations in BGP
                    updates
         ]

         Internal IBGP change that doesn't result in EBGP attribute changes can
         sometimes result in a flap outside, because the implementation is less
         than perfect. In particular, Cisco may be using heuristics on 
         outbound processing.

      k: What is the biggest problem then?

  Pedro: I get tier1 customers telling me they have 100 IBGP events/minute on
         their networks: what are those?  external flaps?  IBGP network can
         amplify the number of updates because it receives same external event
         from several ingress points.

         Withdrawal of a route with a MED may cause 100s of IBGP updates. Why?

         Another interesting question is how well routers actually perform
         packing of updates. In the past, some Ciscos sent 100,000 updates
         when 40,000 would have sufficed.

  The number of attributes matters more than the number of prefixes. Origin
  declared atoms have limited value.

  There are 30-40,000 distinct attributes vs. 20,000 distinct paths.

  For reliability a multihomed AS announces all of its prefixes on all links.
  For load balancing, a prefix is announced with different 'metric' on the
  various links (e.g. using a prepended AS path). This results in a greater
  number of atoms if non-AS path attributes are taken into account.

  In IBGP, routes from different IBGP peers carry different attributes (notably
  next hop).
    [ Patrick: that shouldn't matter. What matters is whether different
      prefixes in an atom get assigned a different attribute by an (IBGP)
      peer. LOCAL_PREF might be interesting here. How does LOCAL_PREF get
      assigned?
    ]


Thur 20 March: Vijay Gill
-------------------------

  Re: Route merging (see earlier)
    Vijay asks Ted Seely of Sprintlink why AS sets are a problem.
    Ted: upstream customers may still see the path available but not be using
         it.
        
  Vijay: AOL has 600 prefixes, fewer than 100 are used for traffic engineering
         (since they hold most of the traffic).
           [ k: he used 80/20 rule, wonder if he had actually measured.]

           [ vijay: The 80/20 rule comes from empirical views. I regularly do
                    traffic engineering and we all know which prefixes to move
                    around to make multi gigabit/s shifts. ]

         BGP is essentially pretty simple.  Vijay still likes atoms;
         the idea of routing on ASes instead of prefixes is
         intuitively nice.  He thinks there are good nuggets in the atoms work,
         especially the idea of having the distribution be a separate protocol;
         folks could use it as a replacement for 2547 or anything else.

         'AS container' routing is a very good idea, that needs to be
         thought out more.  But from an operational perspective, it's hard to
         use even in an intra-domain sense, for traffic engineering. Currently
         pretty hacky. Don't announce consistent prefixes or attributes.
         Selectively block, announce, distribute. Vijay draws figure of a 
         large AS with several POPs to illustrate.