ISMA Data Catalog 2004 Workshop Report

On June 3, 2004, CAIDA hosted its 11th Internet Statistics and Metrics Analysis workshop. This workshop focused on evaluating the design for an Internet Measurement Data Catalog. We invited a group of researchers spanning both active producers of network data and ardent consumers of available data. We asked participants to discuss their existing Internet data sets, their policies for sharing that data, and their methods of dataset management and distribution.

In this report, we present our goals for the workshop, the key findings and future work that resulted from the discussion. For those who wish to delve more deeply into the workshop proceedings, we include informal summaries of meeting presentations with links to actual slides in .pdf format.

Introduction/Goals

The Internet infrastructure currently has no framework for system-level studies of wide-area, cross-domain Internet traffic behavior. As a result, neither representative longitudinal analysis of macroscopic workload trends nor sound preparation for the growing expectations of Internet users is possible today.

While measurement data is difficult to collect and distribute, the largest problem impeding the process of computer network research is a lack of relevant data for testing new theories and technologies. While we encourage additional data sets and collection projects, simply making researchers aware of data that already exists will leverage both the quality and the quantity of research performed today. To this end, CAIDA is designing an Internet Measurement Data Catalog (IMDC) to organize the heterogeneous datasets (both publicly accessible and restricted usage) into a database that researchers can query to find relevant data to support their work. We provide annotation capabilities for researchers so that bugs, novel features, and other information about datasets can be shared by investigators with experience using a particular dataset. In addition to providing the fodder for new inquiries, the IMDC will also facilitate robust science by documenting exactly the data used in a study in a way that allows others to reproduce published results.

Our goal is to create a database architecture and annotation system that will accommodate the diversity of existing and future data sets. We will provide both a web-based user-interface for users to query the database and an API to allow trusted parties to automate contribution of catalog entries for their data sets.

The Internet Measurement Data Catalog will not actually store copies of the dataset, in the same way that a mail order catalog is distinct from the warehouse; we provide a clearinghouse of information about data available elsewhere, complete with Acceptable Use Policy information and access instructions.

The main objectives of the workshop were:

present our IMDC design to the research community
get feedback on compatibility of the proposed architecture and their data production and usage practices
identify remaining problems with data catalog creation and distribution that require further research

Highlights and Key Findings

Workshop participants enthusiastically welcomed the idea of a publicly available catalog of Internet measurement data. Such a catalog, if populated with relevant high quality data sets and properly maintained, would greatly benefit the Internet research community. It would advance the reproducibility of analyses and results, enable longitudinal and cross-discipline studies of the Internet, and open up new cross-domain areas of networking research. The participants strongly encouraged continuation of CAIDA's IMDC work and of NSF support of this project.

Over the course of the workshop, the participants made the following suggestions for improved design and future features of the IMDC:

implement some version of "derived-from" DataD modifier

consider possibility of cataloging scripts with data

add some scoring system to distinguish "good" data from "bad" data

add an indicator showing if a given data item is "freely and easily" obtainable or not

add "warning" to the list of standard annotations

add "smart IDs" to the database (URL-like) for further use or, at least, for citation purposes

provide a mechanism for continuous addition of data (i.e. a trace every day)

enable the simplest version of search, "single-box"

provide google-like search

implement an option to create an output in XML format

keep track of IMDC usage and display plots of relevant statistics vs. time

implement automatic e-mail notification of interested users

release the database code to other groups for their internal use

The participants identified convenience of catalog use and the security and reliability of catalog information as two main conditions enabling widespread use and general popularity of the catalog. The following areas in the catalog design raised concerns:

find ways to compress/automate display of search results (do not show a search result of 10,000 nearly identical entries)

deal carefully with public/private information for Contacts

find ways to prevent pollution of the catalog with useless/poor data

provide convenient means to insert catalog entries for existing voluminous data sets into the catalog and to export search results

The participants concluded that making data used in publications available to other researchers should become an integral part of the Internet research process.

Future Work:

Organize a BOF devoted to IMDC at the Internet Measurement Conference (October 25-27, 2004 in Taormina, Sicily, Italy). Prepare the next phase of the IMDC in time to be demonstrated at the IMC. This version should catalog most CAIDA data sets and a limited number of external data sets, and allow users to search them.

Cooperate with NLANR researchers in preparing and annotating their data for inclusion into IMDC.

Write an Open Letter to the Community articulating the position of this ISMA DC Workshop on promoting a new attitude toward sharing and using data in the Internet community (Mark Allman):

funding agencies should make public disclosure of data a condition of funding

researchers should utilize responsible practices in data collection to qualify their data for sharing

data collection tools should be designed to output as much information about the running environment and circumstances of collection as possible (e.g. the tool name and options used to run the collection, the time and duration of the measurement, the name of the machine and the operating system supporting the collection, the names of any input files utilized in performing the collection, etc.)

conference and workshop PCs should prefer work based on openly available data

researchers should cite datasets (like they cite papers), in addition to acknowledging the individuals, projects, and funding sources responsible for the collection and distribution of public data

establish an annual award for the best paper based on a new and public dataset

establish an annual award for the data set best supporting community research efforts

We are open to additional suggestions from the community for methods to promote and reward data sharing practices.
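The recommendation above that collection tools record their running environment can be sketched as follows. This is a minimal illustration, not a format proposed at the workshop: the function name `collection_metadata`, the field names, the tool name `example-probe`, and the input file `targets.txt` are all hypothetical.

```python
import json
import platform
import socket
import sys
import time


def collection_metadata(tool_name, argv, input_files, start_time, end_time):
    """Describe one measurement run; all field names are illustrative assumptions."""
    return {
        "tool": tool_name,
        "options": list(argv),
        "start_time_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(start_time)),
        "duration_seconds": round(end_time - start_time, 3),
        "hostname": socket.gethostname(),
        "operating_system": f"{platform.system()} {platform.release()}",
        "input_files": list(input_files),
    }


start = time.time()
# ... the actual measurement would run here ...
end = time.time()

# Hypothetical tool name and input file, used only to show the output shape.
record = collection_metadata("example-probe", sys.argv[1:], ["targets.txt"], start, end)
print(json.dumps(record, indent=2))
```

Writing such a record next to every output file costs the tool author little and gives a catalog entry most of the descriptive fields it needs.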

Workshop Session Notes

Internet Measurement Data Catalog

kc claffy (CAIDA) - [pdf slides]

introduced IMDC. This database is one of the tasks of a 3-year project "Correlating Heterogeneous Measurement Data to Achieve System-Level Analysis of Internet Traffic Trends" funded by NSF. The project is currently at the end of the 2nd year.

Internet measurement is rife with challenges and obstacles. Scientifically rigorous monitoring, or even instrumentation, was not a high priority in the post-NSFnet Internet. The data that the research community does use are disparate, incoherent, limited in scope, unindexed, and sometimes proprietary. There is a widely recognized need for globally relevant measurements, including rational architectures for data collection, and hardware support for monitoring high speed links.

Informational science or "data about data" has become an essential NSF goal. This project is about developing meta-data and annotations to describe the data. The mission is to help researchers share data and streamline future collections. Well managed meta-data should include: how collected, by whom, when, where saved, access policies, format, packaging, compression.

Colleen Shannon (CAIDA)

talked about motivation and challenges of IMDC. There are lots of data out there (traces, routing tables, traceroutes, security, names, geographic) of variable quality and research importance. The main goal of IMDC is to provide an easy way for users to find data and for contributors to publish their data. There is a perpetual conflict between these two communities. Users want perfect (100% complete, 100% accurate) freely available data. Contributors do not want to (or cannot) spend time and effort on disseminating their data, and often are not funded for this task.

IMDC design goals include: (1) flexible framework for contributors; (2) good search capabilities (both simple and sophisticated modes); and (3) ability to share information discovered in data and to correct wrong information.

Design principles include: (1) be ambitious, anticipate possible future uses; (2) start with simple implementation and build it up in steps; and (3) provide multiple access modes as necessary.

David Moore (CAIDA)

presented the central concepts and the overall architecture of the IMDC database and demonstrated the currently built prototype.

The focus of IMDC is on helping users find data. Below we list the main types of objects in the database. Fields common for all objects in the database are: creator (of data); contributor (actually puts data in database); creation time; modification time.

Data Descriptor (DD):

the central conceptual object (atomic entity) of the data catalog. A data descriptor represents a single file containing data that resides on a computer somewhere in the world. A single data descriptor is used to reference all copies of the data item, even copies on disparate computers at different sites. DD fields are: name; description - long, short; URL; keywords; file size; format; location - geographic, net, logistic; platform; time period - start, end, time zone offset, time zone name; creation process*; MD5 hash (to detect duplicates, to check for corruption).
* - The creation process will be a text field until we gain better understanding of what people might want to put here. It may indicate that data was derived from other data.
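The role of the MD5 hash in a data descriptor can be illustrated with a short sketch. The class below carries only a subset of the DD fields listed above. The attribute names, `md5_of_file`, and `find_duplicates` are assumptions made for this example, not part of the IMDC prototype.

```python
import hashlib
from dataclasses import dataclass, field


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class DataDescriptor:
    """Illustrative subset of the DD fields; attribute names are assumptions."""
    name: str
    short_description: str
    file_format: str
    file_size: int
    md5: str
    keywords: list = field(default_factory=list)
    creation_process: str = ""  # free text for now, as the footnote explains


def find_duplicates(descriptors):
    """Group descriptor names by MD5; a group larger than one suggests duplicates."""
    by_hash = {}
    for descriptor in descriptors:
        by_hash.setdefault(descriptor.md5, []).append(descriptor.name)
    return {h: names for h, names in by_hash.items() if len(names) > 1}
```

The same digest serves the second purpose named above: a user who downloads a file can recompute `md5_of_file` and compare it with the cataloged value to check for corruption.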

Format Descriptor (FD):

points to information about file formats. FD contains: name, description, keywords, package or data format, type (ASCII/binary/mixed), file suffixes.

Package Descriptor (PD):

physical grouping of one or more data files, can be thought of as a downloadable unit. A package may have multiple data files, and a data file may be in multiple packages. PD fields are: name, description, keywords, file size, format ID, MD5 hash, linkage to contained DD/PD via a path.

Location Descriptor (LD):

tells how to actually fetch some data. Packages may be available from multiple locations, but not all packages will be directly available. LD fields: download URL, download procedure (includes AUPs), geographic location of server.

Contact Descriptor (CD):

human component of the database. CD fields are: login, password, name, description (long, short, URL), email (hideable), phone (hideable), address (hideable), country (hideable), organization, research interests.

Tool and ToolSet Descriptors (TD):

what tools are available to conduct measurements, what tools were used to generate data, versions information. TD fields: name, description, keywords, release date, OS. We will finalize the fields after getting some experience with usage. Notes and bugs will be in annotations.

Study Descriptors (SD):

keeps track of data and results used in a particular publication (but is not meant to replace/overtake citeseer). SD fields: name, description, keywords, linkage to DDs, TDs, linkage to StudyWriteup (i.e. actual text of publication).

Collections:

In general, collections are logical groupings of data with a specific purpose. Such groupings may not exist physically, but they could be very important for identifying the data sets used in a paper, or for others to use.

Annotations

include all additional information about a given object in the database. They can be used to let other people know about important findings in the data. Annotations dictionary: key name (e.g. hierarchical namespace, FORMAT-pcap-snaplen), description, value type, position type (time range, all, string). We will standardize certain annotations further when widely accepted. Annotation fields: dictionary key, "object" of annotation (DD, PD, LD), value, position (e.g. time).
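The relationship between the annotations dictionary and individual annotations can be sketched as below. Only the key `FORMAT-pcap-snaplen` comes from the description above. The key `WARNING-clock-skew`, the attribute names, and the validation rules are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Dictionary of known annotation keys: key -> (description, value type, position type).
# The second entry is a made-up example of a "warning" annotation.
ANNOTATION_DICTIONARY = {
    "FORMAT-pcap-snaplen": ("bytes captured per packet", int, "all"),
    "WARNING-clock-skew": ("capture clock was unreliable", str, "time range"),
}


@dataclass
class Annotation:
    key: str                 # must exist in ANNOTATION_DICTIONARY
    object_id: str           # identifier of the annotated DD, PD, or LD
    value: object
    position: Optional[Tuple[float, float]] = None  # (start, end) for time ranges

    def __post_init__(self):
        if self.key not in ANNOTATION_DICTIONARY:
            raise KeyError(f"unknown annotation key: {self.key}")
        _, value_type, position_type = ANNOTATION_DICTIONARY[self.key]
        if not isinstance(self.value, value_type):
            raise TypeError(f"{self.key} expects a {value_type.__name__} value")
        if position_type == "time range" and self.position is None:
            raise ValueError(f"{self.key} requires a (start, end) time range")
```

Checking each annotation against the dictionary at creation time is one way to keep keys and value types consistent across contributors.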

The first phase of the catalog implementation deals with creating and cross-linking tables of data, formats, packages, contacts, and locations. A demo version allows the user to browse the database, search for objects of a specified type, look at the detailed information to decide what data is interesting and find out how to get it. (In the course of discussion following the IMDC description, participants of the workshop strongly approved of the proposed design and made many useful suggestions. Their recommendations, aimed at improving the accuracy and versatility of the IMDC, are summarized under Highlights and Key Findings at the beginning of this report.)

CAIDA data sets

Colleen Shannon (CAIDA)

presented current project areas and related passive data collections at CAIDA. She identified the main challenges in trace collecting and storing: (1) maintenance of remote monitors; (2) large file transfers from monitor sites to UCSD; (3) storage of data.

Availability of CAIDA-housed traces is on a case-by-case basis. Provider-specific agreements determine the use policy for information collected on the backbone links. Data captured on UCSD links are available to researchers but have rigid restrictions on capture of user payload. Release of the UCSD network telescope data is subject to a number of security constraints.

Enabling data access involves a number of steps: (1) sanitation: anonymization, payload stripping; (2) developing Acceptable Use Policies (AUP); (3) setting up a system of user tracking and support; (4) data pre-processing, aggregation, and packaging with time-sensitive information (such as contemporaneous name lookups, routing tables, etc.).

Brad Huffaker (CAIDA) - [pdf slides]

discussed active Internet probing projects at CAIDA and resulting data. A probing tool skitter that we use to collect IP forward path topology is deployed on 25 monitors in 8 countries on 4 continents. We have collected this forward IP topology data continuously since 1998. Another active probing tool is iffinder, which finds multiple interfaces belonging to the same router.

Signing an AUP agreement is required in order to access the raw topology data. AS- and router-level topology derived from raw IP path data are downloadable without restrictions. The probing lists can be released, but with appropriate restrictions including prohibition of active probing for non-CAIDA projects. Note that responding IPs disappear at the rate of about 1% per month, forcing us to replenish the lists on a regular basis. CAIDA also maintains a "do-not-probe-me" list.

CAIDA has limited abilities for mapping IPs to geographical locations. Our own NetGeo database has been unsupported since 2002 and is becoming obsolete. A new tool owl that will parse whois databases is in development, but it is not funded. CAIDA also has a private contract for use of netacuity geographical server (Digital Envoy commercial tool).

Existing Internet Measurement Data
Participants of the workshop shared their experience in data collection and management.

Supratik Bhattacharyya (Sprint ATL) - [pdf slides]

uses a special IPMON system that includes a GPS clock and a DAG card to collect 44 bytes of each packet on selected links in SprintLink PoPs. They also collect Cisco Netflow data, periodic BGP tables, continuous BGP and IS-IS table updates, and SNMP utilization. Currently, there are more than 60 IPMON systems deployed. Sprint uses the Sistina Global File System as its data storage and management infrastructure. Meta-data are entered by hand and data are archived on tapes.

The original goal of data management was to provide fully automated analyses of the traces for requesting researchers. This approach did not work for a number of reasons:

difficult to support arbitrary operations,

necessary to filter and sanitize results,

necessary to automate allocation of computing resources,

existing user base was not ready: tools were unstable and poorly documented, users wanted direct access rather than through queries, it was hard to keep meta-data updated.

Now the system is partially automated: after a trace is archived on tape, cleaned, and put on the SAN, a script checks for new clean traces and runs flow analysis. The IPMON web site organizes traces by date of collection. For each trace, the following parameters are shown: link utilization, active flows, traffic breakdown by protocol and by application, packet size distribution, and packet count.

Current measurement projects at Sprint ATL are:

CMON (continuous monitoring system), which runs on a 24/7 basis, computes low-level statistics correlated with routing information, and retains a limited history of packet-level information for troubleshooting. Two systems are deployed in the San Jose PoP, and more are to follow.

packet trace analysis for security aimed at establishing characteristics of "normal" behavior (a non-trivial problem)

3G data network monitoring, which attempts to carry measurement techniques over from the wired network to the wireless Sprint PCS 3G data network.

Dan Gunter (LBNL) - [pdf slides]

representing the Network Measurements Working Group (NM-WG) of the Global Grid Forum, discussed their standard schemas for Grid network measurements. The group published a document that targets end users, network admins and researchers, and Grid middleware developers. They proposed a classification and naming methodology for Grid network measurements. The NM-WG focus is now on XML schemas used to describe sets of results and to structure user requests (such as querying archived data, running tests on demand, etc.).

Martin Swany (U. Delaware) - [pdf slides]

continued discussion of NM-WG work. Created schemas have to be very versatile in order to be useful for a broad community. Ideally, they would like to re-use a single interface in many different ways.

For series of data, consistent parts of meta data can be incorporated by reference (to an XML object) rather than repeated in every measurement. The next step to data normalization is identifying all data by three broad classes of meta data and timestamp:

characteristic: what measured, type of event

entity/subject/target: what entity measured, generated the event

parameters/methodology: what parameters were in the measurement tool, conditions, what system, who measured

Normalization enables more efficient querying.
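The idea of incorporating shared meta data by reference can be illustrated with plain Python structures. This is not the NM-WG XML schema: the identifiers, host names, tool name, and the `select` helper are all hypothetical, and only the three meta data classes (characteristic, subject, parameters) plus a timestamp follow the description above.

```python
# Meta data shared by a whole series is stored once and referenced by identifier.
metadata = {
    "meta1": {
        "characteristic": "one-way delay",
        "subject": {"source": "hostA.example.org", "destination": "hostB.example.org"},
        "parameters": {"tool": "example-prober", "packet_size_bytes": 64},
    }
}

# Each measurement carries only a reference, a timestamp, and a value.
measurements = [
    {"metadata_ref": "meta1", "timestamp": 1086220800, "value": 0.0312},
    {"metadata_ref": "meta1", "timestamp": 1086220860, "value": 0.0297},
    {"metadata_ref": "meta1", "timestamp": 1086220920, "value": 0.0305},
]


def select(measurements, metadata, characteristic, start, end):
    """Return values for one characteristic with start <= timestamp < end."""
    wanted = {ref for ref, m in metadata.items() if m["characteristic"] == characteristic}
    return [
        m["value"]
        for m in measurements
        if m["metadata_ref"] in wanted and start <= m["timestamp"] < end
    ]
```

Because the query first narrows the small meta data table and only then scans the measurements, it does not need to inspect repeated descriptive fields in every record.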

In dealing with derived data streams, the subject becomes a view. The characteristic and parameters encode the transformation of the original data. A document describing the approach to chains of derived data is in progress and will be presented to the community when ready.

Henk Uijterwaal (RIPE NCC) - [pdf slides]

talked about Internet measurements and data at RIPE.
Test Traffic Measurements (TTM) measures key parameters of the connectivity between a user's site and other points on the Internet: delay, losses, and other IPPM metrics. Raw data are: traceroutes, packets sent, packets arrived. A database storing processed data opened up for public access on January 1, 2004. Anonymization of data and circulating the paper in RIPE community for comments prior to publication are the two conditions of access.

Routing Information Service (RIS) collects routing information by using Remote Route Collectors at different locations around the world and integrates this information into a comprehensive view. Raw data are: RIB dumps (3 per day), timestamped BGP updates (IPv4 - from up to 12 locations, since September 1999, 250 Gbyte/yr; IPv6 - since October 2002, a few Gbyte/yr). Log files and software to read files are available online. The AUP allows users to download and analyze the data and asks them to inform RIPE about resulting publications.

DNSMON is a beta service monitoring all DNS root and seven TLD servers from a few dozen locations. Full deployment is expected next year. Data will be open for research.

RIPE also offers access to whois database (restricted due to contact information) and to regularly updated registration (allocation) data. Future plans envision further development of information services with emphasis on providing data for the community. Possibly, different sets of data can be created as necessary for different target groups.

Matthew Zekauskas (Internet2) - [html slides]

presented Abilene Observatory datasets: flow data (last 11 bits of IP addresses are zeroed), one-way latencies for 2*11² paths, router snapshots, 1 and 5 min SNMP usage data, throughput (measured with iperf). They will start collecting more types of data in the near future.
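Zeroing the last 11 bits of an address, as described for the flow data above, amounts to a bitwise mask. The sketch below shows the operation for IPv4 using Python's standard `ipaddress` module. The function name `zero_low_bits` is an assumption, and the actual Abilene tooling may work differently.

```python
import ipaddress


def zero_low_bits(address, bits=11):
    """Return an IPv4 address string with its lowest `bits` bits set to zero."""
    value = int(ipaddress.IPv4Address(address))
    mask = (0xFFFFFFFF >> bits) << bits  # ones in the high (32 - bits) positions
    return str(ipaddress.IPv4Address(value & mask))


# Documentation-range addresses, used purely as examples.
print(zero_low_bits("192.0.2.77"))       # 192.0.0.0
print(zero_low_bits("198.51.100.200"))   # 198.51.96.0
```

With 11 bits removed, all hosts within the same block of 2048 addresses map to the same value, so individual hosts are hidden while coarse network structure is preserved.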

The data are archived in many places with different AUPs. Summaries (graphs, tables, time series of summaries) are stored forever and served on the Web. Raw data in diverse formats (with, probably, insufficient meta-data) are available only by special request and have to be manually recovered. Flow data (collected using Mark Fullmer's flow tools) are kept for 30 days. 5 min SNMP data (polled using custom software) are stored in RRD files.

Future plans include creating new databases for IGP and BGP data and using a Homeland Security grant to clean up databases and to improve access.

George Riley (Georgia Tech) - [pdf slides]

advertised NETI@Home, an open-source software package for conducting passive Internet measurements from world-wide vantage end-points. It collects network performance statistics for a number of commonly used Internet protocols in order to capture "real" users experiences. The software can operate on multiple platforms and is easy to install and upgrade. It runs in the background and reports results to Georgia Tech for subsequent analysis and posting. Users can protect their privacy by selecting the desired disclosure level (e.g., no IP address, or first 24 bits of the IP address (default), or full disclosure). The tool does not sniff packets in a promiscuous mode, but does measurements on a per flow bidirectional basis.

Altruism and pretty pictures (NETIMap) are expected to provide motivation for potential users. There have been 730 unique users since Jan 7, 2004; 10,000 users would be ideal. About 500 MB of uncompressed binary data have been collected in the one week since May 26, 2004.

Christos Papadopoulos (USC/ISI) - [pdf slides]

talked about data collection projects on Los Nettos, a 15-year-old regional network for the Los Angeles area. They currently monitor one of three upstream providers and Internet2. Two-minute traces are taken using tcpdump on FreeBSD PCs and stored on RAID boxes.

The data were used to study DDoS attack signatures (single source vs. multiple source) and to attempt detection of congested links. Data on about 80 DDoS attacks are anonymized, binned into 1 ms time series, and available on DVDs for external researchers with a reasonable one-page AUP. So far, 8-10 users have requested access to these data.
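Binning a trace into a 1 ms time series can be sketched as below. The report does not say which quantity was binned, so this example assumes per-bin packet counts. It also assumes integer microsecond timestamps, which avoids floating-point rounding at bin boundaries. The function name `bin_packets_us` is hypothetical.

```python
from collections import Counter


def bin_packets_us(timestamps_us, bin_width_us=1000):
    """Count packets per bin, from the first packet's bin to the last occupied bin.

    timestamps_us: integer packet arrival times in microseconds.
    bin_width_us:  bin width in microseconds (1000 corresponds to 1 ms).
    """
    if not timestamps_us:
        return []
    start = min(timestamps_us)
    counts = Counter((t - start) // bin_width_us for t in timestamps_us)
    # Emit zero for empty bins so the series has a uniform time step.
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


print(bin_packets_us([0, 500, 999, 1000, 3500]))  # [3, 1, 0, 1]
```

A series of this form discards addresses and payloads entirely, which is one reason binned data are easier to release than raw traces.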

Les Cottrell's (SLAC) - [pdf slides]

main interest is in end-user Internet measurements. He presented PingER - a more than 7 year old Internet measurement project involving 35 monitor sites and about 550 remote sites in more than a hundred countries. Very lightweight ping probes are sent every 30 minutes between a growing number (currently, about 3700) of source-destination pairs. A monitor site collects about 0.5 MB/pair/month. Data are archived at SLAC and FNAL. About 40 users access these data when they are posted, and there are a few requests for archived data per year.

Another project, IEPM-BW, monitors high performance paths using iperf, bbcp, bbftp, GridFTP, and ping. Ten monitor sites and about 60 remote hosts from nine countries participate in this project. Raw measurements are stored in flat files and in an Oracle database. Recent data are available via Web Services (using the NM-WG request schema).

Continuous measurements are hard. Keeping remote sites accessible, collecting data from monitor hosts, and continuous evolution of NM-WG schema definitions are among challenging issues.

Nick Feamster (MIT) - [pdf slides]

presented wide-area network data and analysis efforts at MIT. They use the RON testbed of 31 widely distributed nodes with stratum 1 NTP servers and CDMA time synchronization. Periodic pairwise active probes measure one-way delay and loss, while three consecutive lost probes trigger a traceroute. There are also daily pairwise traceroutes over the testbed topology and iBGP feeds at eight measurement hosts. All data are pushed to a central measurement host.
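The rule that three consecutive lost probes trigger a traceroute can be expressed as a small per-peer counter. This is a sketch of the rule as stated, not the RON testbed's code. The class name `LossTrigger` and the choice to fire only once per run of losses are assumptions.

```python
class LossTrigger:
    """Signal when `threshold` consecutive probes to a peer are lost."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_losses = {}

    def record(self, peer, received):
        """Record one probe result; return True if a traceroute should start."""
        if received:
            # Any successful probe ends the current run of losses.
            self.consecutive_losses[peer] = 0
            return False
        count = self.consecutive_losses.get(peer, 0) + 1
        self.consecutive_losses[peer] = count
        # Fire exactly once per loss run, when the run first reaches the threshold.
        return count == self.threshold


trigger = LossTrigger()
results = [False, False, True, False, False, False, False]
print([trigger.record("peer-1", r) for r in results])
# [False, False, False, False, False, True, False]
```

Firing once per run avoids launching a new traceroute for every additional lost probe during a long outage. A deployment might instead repeat the traceroute periodically while the outage lasts.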

The following problems are associated with the data:

changes in connectivity (IP renumbering, upstream providers change)

non-standardized and sometimes buggy tools

data management (continuous collection vs. archival, equipment failures and outages, complaints, etc.)

miscellaneous issues (keeping track of occurring problems, hosts are not firewalled, iBGP sessions to border router on the same LAN, etc.)

A few projects make use of the collected data. The BGP monitor overview summarizes BGP updates by time. A failure characterization study showed that failures typically occur about 3-4 minutes before BGP activity. 60% of failures that appeared at 3 or more hops from an end host coincided with at least one BGP message. An invalid prefix advertisement study showed that a large number of offending ASes leak routes from private address space. Simple static filters would alleviate this transgression. Over 50% of bogus routes persist for more than one hour and many of them stay around for a day or more.

Yuval Shavitt (Tel Aviv University)

proposed to let the Internet measure itself. His project Distributed Internet MEasurement and Simulation (DIMES) aims to study the structure and topology of the Internet and is similar in concept to NETI@home. It relies upon assistance of thousands of volunteers, who will download and run the open source DIMES agent to perform network measurements such as Ping and Traceroute from all corners of the globe. The tool has a very low bandwidth consumption (< 1 KB/s) and does not monitor any activity performed by a user.

The project is now in its testing phase and a fully working version is expected by the fall. The data will be collected and archived at the Tel Aviv University with processed data made available on the web. The following analyses have been proposed: characterizing completeness of Internet AS maps, tracking Internet growth, studying router PoP level topology, investigating BGP optimality and convergence, testing Internet virus protection methodologies.

Bill Yurcik (NCSA) - [pdf slides]

gave an overview of scalable security data management for internal/external data sharing. There are many different incentives to share data: saving time and effort, getting economic advantages, legal requirements, research interests, and SECURITY (probably, the carrot that often drives data collection). Thus, there is no one-size-fits-all solution for data sharing. It is important to recognize that cooperation and sharing need to be promoted since they make us less vulnerable to malicious Internet activities*.
* - provided that the data are shared "with the right people".

Security solution space is multidimensional. Network data come in 18 commonly available logs, and each one has unique characteristics. Processing algorithms should look at different attributes across all logs in order to achieve maximum situational awareness and to enable smart human decisions. Log anonymization at multiple levels may be a good solution for data sharing.

The Forum of Incident Response and Security Teams (FIRST) is a good example of cooperation that works. It started from ground up and, currently, has more than 100 members (by invitation only) from government, commercial, and academic organizations. FIRST members cooperate in reacting quickly and preventing incidents and share the relevant information among themselves and with the community at large.

Discussion of supporting tools and techniques

Mark Allman (ICIR) - [pdf slides]

started this session with a discussion of challenges encountered in building a culture that values data catalogs. Obviously, a working data catalog would lead to better science by improving reproducibility of results, adding more vantage points, and providing longitudinal views of the Internet. Why, then, do researchers not share more data? As a rule, they do not get credit for releasing their data although the effort required is comparable to that of writing a paper or software. Multiple privacy/policy/legal/competitive issues impede sharing of passive measurements. In dealing with active measurements, laziness (or the lack of designated funding?) is the main obstacle to sharing since cleaning and packaging the data is often a time consuming task. Also, it is difficult to make data collected for someone's own purposes useful for others when meta-data are often insufficient.

Route Views is a prominent and instructive example of broad participation in gathering data and using them. Features that led to its success are: homogeneous measurements of just one type, easy to set up, useful to both researchers and operators (giving them a motivation to participate). The impact of this project on the community is dramatic since "everyone uses routeviews".

We need a real cultural shift and commitment from the research community in order to change prevalent attitudes toward data and to keep the IMDC repository operational. Some suggestions and recommendations are:

repository must be easy and useful to researchers

tools should help to collect meta-data

we need tools to help researchers integrate their measurements into the catalog

we need anonymization techniques that work

concentrate on easy stuff first: catalog active measurements

find pioneers to seed the system with their data sets

in publications, require citations as acknowledgments of data used

More drastic suggestions:

reject papers whose authors do not release the data

make data contribution a condition for funding.

Ethan Blanton (Purdue University) - [pdf slides]

shared his view of the Scalable Internet Measurement Repository (SIMR) which appeared as a forerunner of the IMDC. Working schema definitions are the crux of the project. Careful enumeration of interesting characteristics maximizes consistency and makes searching more effective, but decreases flexibility and may impede future developments. Details of measurements (level of anonymization, concurrent conditions at measurement time, host location, sampling used, etc.) are very important, but often invisible when looking at the data. Annotating all details is hard, especially because different studies care about different things.

Other challenges faced by a large measurement database are: (i) drawing an explicit line between data to catalog and derived results, (ii) database pollution and preserving signal-to-noise ratio above a certain threshold, (iii) scalability of user interaction with database to find/get individual data items.

Yu. Shavitt: would it be a good idea to design the data catalog for maximum submission simplicity, encouraging people to put their information in? If over-engineered and cumbersome to contribute to, the catalog will not be populated. At the same time, there is a huge community of users who cannot generate data, but are willing to filter signal from noise in their searches. Eventually, tools will emerge to clean up data or to get good data.
E. Blanton: both paths are bad, leading to either a giant database of worthless crud or to an empty database. We need to find the right balance.

Timur Friedman (Paris 6) - [pdf slides]

works on the French measurement infrastructure Metropolis. It is a multi-partner project funded by the government for three years to measure RENATER and other French networks with emphasis on security. For active measurements they use RIPE TTM boxes, SATURNE boxes, and generic BSD and Linux boxes (equipped with NIMI or a new tool, Pandora) installed at each of the French partners. These measurements will be extended to other European partners as well. For passive measurements they employ DAG cards, QoSMOS and Ipanema boxes. Sampling is implemented as necessary to measure OC192 links.

Currently, the data are not advertised, but are available to researchers on request. AUPs and restrictions vary depending on which institution conducted the measurements. Passive traces undergo a prefix-preserving transformation.
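To illustrate what "prefix-preserving" means, here is a minimal Python sketch. It is not the tool Metropolis used (the talk did not name one); it follows the general bit-by-bit construction known from Crypto-PAn, with HMAC-SHA256 swapped in as the keyed function. The function names and the key are hypothetical. Two addresses that share their first k bits before anonymization share exactly their first k bits afterwards.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(addr: str, key: bytes) -> str:
    """Prefix-preserving anonymization of one IPv4 address (illustrative sketch).

    Bit i of the output is bit i of the input XORed with a keyed
    pseudo-random bit that depends only on the i bits before it.
    """
    value = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        prefix = value >> (32 - i)            # the top i bits (0 when i == 0)
        # Include the prefix length so that prefixes "0" and "00" differ.
        msg = i.to_bytes(1, "big") + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (value >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits that two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()
```

Because each flip bit depends only on the preceding input bits, subnet structure survives anonymization while the actual addresses are hidden from anyone without the key.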

Timur gave the following recommendations to the IMDC team:

always discuss what was NOT measured (e.g., distributed monitors may fail randomly and would bias results)

plan experiments and build meta-data (tools, arguments and parameters, platforms, times, etc.) into distributed systems

support the idea of data 'publication' and 'citation'

convert data to XML format (easy to parse, standardized - but cumbersome)
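As a rough illustration of these recommendations, the sketch below records the meta-data items mentioned above (tool, arguments, platform, times) together with an explicit note of what was not measured, and serializes the record to XML. The element names and the `MeasurementMeta` class are hypothetical; the workshop did not define a schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class MeasurementMeta:
    """Hypothetical meta-data record for one measurement run."""
    tool: str
    arguments: List[str]
    platform: str
    start_time: str                      # e.g. ISO 8601, UTC
    end_time: str
    failed_monitors: List[str] = field(default_factory=list)  # what was NOT measured


def to_xml(meta: MeasurementMeta) -> str:
    """Serialize the record to an XML string (element names are assumptions)."""
    root = ET.Element("measurement")
    ET.SubElement(root, "tool").text = meta.tool
    args = ET.SubElement(root, "arguments")
    for arg in meta.arguments:
        ET.SubElement(args, "arg").text = arg
    ET.SubElement(root, "platform").text = meta.platform
    ET.SubElement(root, "time", start=meta.start_time, end=meta.end_time)
    missing = ET.SubElement(root, "not_measured")
    for monitor in meta.failed_monitors:
        ET.SubElement(missing, "monitor").text = monitor
    return ET.tostring(root, encoding="unicode")
```

Generating such a record inside the measurement tool itself, rather than by hand afterwards, is one way to read the advice to build meta-data into distributed systems.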

Juana Sanchez (UCLA) - [pdf slides]

teaches Statistics to graduate and undergraduate students using the Internet as an example of a complex probabilistic system. Her objective is to introduce students to the field and motivate them to propose ideas and solutions for real situations. For educational purposes, she would like to have access to already processed datasets free of engineering issues.

Examples of possible research problems:

probabilistic modeling: do packet counts follow a mixture of Poisson distributions?

statistical characterization of traces (using Hurst parameter)

what causes burstiness and why do bursts cause long-range dependencies?

does a burst correspond to a bump in the wavelet spectrum?

is queuing theory (which models telephone networks very well) applicable to the Internet?

general network tomography: apply pseudo-likelihood approach to estimate source-destination traffic intensities from link data

network topology identification: knowing final delays, can we estimate the link tree structure in the middle?

sampling problems: how to get full information about a certain population at a lower cost than a full census, and what are the right metrics?
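For the first problem in this list, a student might start with a simple sanity check before fitting any model. The Python sketch below computes the index of dispersion (variance divided by mean) of per-interval packet counts. A single Poisson distribution has an index of 1, while a mixture of Poisson distributions with different rates is overdispersed (index above 1). The function name is hypothetical, and this is only a first diagnostic, not a mixture fit.

```python
from statistics import mean, pvariance
from typing import Sequence


def dispersion_index(counts: Sequence[int]) -> float:
    """Variance-to-mean ratio of packet counts per time interval.

    Close to 1: consistent with a single Poisson distribution.
    Well above 1: overdispersed, as expected for a Poisson mixture.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; dispersion index is undefined")
    return pvariance(counts) / m
```

For example, counts alternating between 1 and 9 packets per interval have mean 5 and variance 16, giving an index of 3.2, far from the Poisson value of 1.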

Dave Plonka (U. of Wisconsin - Madison) - [pdf slides]

presented his approach to bare-bones measurement data archiving. He deals with the following types of data:

passive: exported flow data and SNMP-gathered measurement data (50K RRD files for 16K switch ports on campus)

active: traceroute and ping-like text output, BGP from Route Views and from campus routers.

Flow data are packet-sampled flow records from Juniper routers (with varying sample rates and varying regularity) and non-sampled flow data from Cisco routers (sometimes lossy, always voluminous).

Raw (binary) flow files, sometimes compressed, are kept for 5-14 days. This lifetime is enough for operational use, while storage space limitations make longer retention infeasible. RRD files storing up to 10 years of data at 5-minute granularity, along with occasional copies of raw data, are archived long term. Anonymization is rather cumbersome and takes hours for a day-long flow data set.

Each data directory contains detailed README files and follows a meaningful file naming convention: {collector}.{date}.{time}{TZ}{encoding}.{fmt}. There is also a journal/log of events ('events.txt').
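A naming convention like this can be built and parsed mechanically. The following Python sketch shows one way to do so. The talk gave only the template, so the concrete field shapes here are assumptions: date as YYYYMMDD, time as HHMMSS, TZ as a numeric offset such as -0500, encoding as a short lowercase tag (possibly empty), and a collector name without dots. The example file name is invented.

```python
import re
from typing import NamedTuple


class TraceName(NamedTuple):
    collector: str
    date: str       # assumed YYYYMMDD
    time: str       # assumed HHMMSS
    tz: str         # assumed numeric offset, e.g. "-0500"
    encoding: str   # assumed lowercase tag, e.g. "gz"; may be empty
    fmt: str


# {collector}.{date}.{time}{TZ}{encoding}.{fmt}, under the assumptions above
PATTERN = re.compile(
    r"^(?P<collector>[^.]+)\."
    r"(?P<date>\d{8})\."
    r"(?P<time>\d{6})(?P<tz>[+-]\d{4})(?P<encoding>[a-z]*)\."
    r"(?P<fmt>[A-Za-z0-9]+)$"
)


def build_name(name: TraceName) -> str:
    """Assemble a file name from its fields."""
    return f"{name.collector}.{name.date}.{name.time}{name.tz}{name.encoding}.{name.fmt}"


def parse_name(filename: str) -> TraceName:
    """Split a file name back into its fields, or raise ValueError."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename!r}")
    return TraceName(**match.groupdict())
```

With such a parser, a catalog tool could extract basic meta-data (collector, time range, format) from file names alone, without opening the data files.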

The AUP for data access initially resembled the NLANR/CAIDA model: signing usage agreement documents, keeping data (and therefore analysis) on the central server, releasing as little as possible (but no less), and asking researchers to describe their projects when they apply for data. This approach turned out to be impractical and eventually evolved into trust relationships between researchers and the practitioner (creator/archiver). The result is minimally successful, time-consuming, and not scalable. A possible future solution: the older the data, the fewer the restrictions on releasing them.

The authors of the following two talks were not available to present at the workshop, but their slide sets are available:

Dave Meyer (U. of Oregon and Cisco Systems)

Route Views Update

Bill Manning (ep.net)

DNS software - authoritative server checks

Acknowledgments

We are indebted to Matthew Zekauskas and Les Cottrell for their invaluable minutes taken during the workshop and heavily used in preparing this report.

The workshop was sponsored by the NSF grant "Correlating Heterogeneous Measurement Data to Achieve System-Level Analysis of Internet Traffic Trends" (NSF ANI-0137121) and by a WIDE gift fund.