



We propose to develop a system to enable discovery of the full potential value of massive raw Internet end-to-end path measurement data sets.
Principal Investigator(s): kc claffy
Funding source: NSF CNS-1925729. Period of performance: September 1, 2019 - August 31, 2022.
Project Summary
Internet cartography has emerged as a new field spanning computer science and network science, with several global Internet measurement infrastructures continuously executing comprehensive topology mapping experiments for years. UCSD's Center for Applied Internet Data Analysis (CAIDA) has operated the longest-running of these measurement platforms (Archipelago), which has supported scientific measurement experiments of the global Internet since September 2007. This platform has collected 90 billion traceroutes in 39 TB of files, growing by 16 billion traces and 7 TB annually (5-year doubling rate). These data sets have already yielded impacts across a broad range of CISE sub-disciplines. Yet the biggest remaining obstacle to even more productive scientific use of this wealth of information is infrastructural: the lack of an easy-to-use and analytically powerful exploratory interface to the data.
Researchers have made explicit requests for search functionality that would have a transformative impact on several focused areas of CISE-funded research. In response to community feedback, we propose to develop the FANTAIL system - Facilitating Advances in Network Topology Analysis - to enable discovery of the full potential value of massive raw Internet end-to-end path measurement data sets. We envision a four-component system: (1) an interactive web interface; (2) an API built on web standards; (3) a full-text search system based on Elasticsearch; and (4) a big data processing system based on Spark, leveraging SDSC's cluster resources. Although our goal is to enhance the general accessibility and utility of this data, our project will be driven by specific compelling use cases, in response to research community needs for interactive exploratory capabilities. To this end, we will identify and implement reusable components, called analysis modules, that will serve as primitives for constructing more complex data-processing pipelines. Users will specify, via the web interface or API, a sequence of analysis modules to execute on the set of traceroute paths matched by their queries. The FANTAIL system will then perform the queries, run the analysis modules, and provide the output for download so that researchers can analyze or process it further on their own systems. We will implement analysis modules that are useful for (1) performing data reduction (to minimize the amount of data users have to download and process), (2) enhancing raw traceroute data with various annotations available publicly or created by us, and (3) offloading commonly needed analysis and data processing tasks from users.
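The idea of chaining analysis modules over the traces matched by a query can be illustrated with a minimal Python sketch. It is not FANTAIL's implementation or API: the record layout (a dict with `dst` and `hops`), the function names `unique_paths`, `annotate_with`, and `run_pipeline`, and the label table are all hypothetical, chosen only to show one data-reduction module and one annotation module composed in sequence.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical record shape: a traceroute is a dict with a "dst" address and
# a "hops" list of responding IP addresses. FANTAIL's real schema may differ.
Trace = Dict[str, object]
Module = Callable[[List[Trace]], List[Trace]]


def unique_paths(traces: List[Trace]) -> List[Trace]:
    """Data reduction: keep the first trace seen for each distinct hop sequence."""
    seen = set()
    kept = []
    for trace in traces:
        key = tuple(trace["hops"])
        if key not in seen:
            seen.add(key)
            kept.append(trace)
    return kept


def annotate_with(labels: Dict[str, str]) -> Module:
    """Annotation: build a module that attaches a label to every hop."""
    def module(traces: List[Trace]) -> List[Trace]:
        return [
            {**t, "hop_labels": [labels.get(h, "unknown") for h in t["hops"]]}
            for t in traces
        ]
    return module


def run_pipeline(traces: List[Trace], modules: Iterable[Module]) -> List[Trace]:
    """Apply each analysis module, in order, to the matched traces."""
    for module in modules:
        traces = module(traces)
    return traces


# Illustrative data only (documentation address ranges, private-use AS numbers).
matched = [
    {"dst": "192.0.2.9", "hops": ["198.51.100.1", "203.0.113.7"]},
    {"dst": "192.0.2.9", "hops": ["198.51.100.1", "203.0.113.7"]},
    {"dst": "192.0.2.9", "hops": ["198.51.100.1", "203.0.113.8"]},
]
labels = {"198.51.100.1": "AS64500", "203.0.113.7": "AS64501"}
result = run_pipeline(matched, [unique_paths, annotate_with(labels)])
```

Running the reduction before the annotation means the annotation step only touches the paths that survive deduplication, which is the same motivation the summary gives for data reduction: less data to process and download.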
Projected Timeline
Task | Description | Projected Date | Status |
---|---|---|---|
Infrastructure Development | |||
1 | Acquire and deploy a new server to support FANTAIL | Year 1 | |
2 | Set up Elasticsearch and Spark on SDSC cluster computers, and develop software needed to connect Elasticsearch with Spark | Year 1 | done |
3 | Convert most recent few years of CAIDA traceroute data into JSON format and import into Elasticsearch | Year 1 | done |
4 | Develop a command-line tool to perform queries through the Elasticsearch API, and determine the exact Elasticsearch query expressions needed to execute traceroute queries | Year 1 | done |
5 | Implement analysis modules to (1) reduce a set of traceroute paths to the set of unique paths or to a graph, and (2) extract, analyze, and compute various statistics on round-trip time data | Year 1 | done |
6 | Implement a web API to perform traceroute queries, construct and execute a data processing pipeline, and download results in a suitable format | Year 1 | Ongoing |
7 | Develop tools to store DNS, IXP, bdrmapIT, and TNT data in a database and make accessible from FANTAIL | Year 2 | |
8 | Implement all remaining analysis modules | Year 2 | |
9 | Implement an interactive web site to perform traceroute queries, construct and execute a data processing pipeline, and download results in a suitable format | Year 2 | |
10 | Implement all analysis recipes | Year 3 | |
11 | Add support for executing analysis recipes to the interactive web site and API | Year 3 | |
12 | Import most recent few years of RIPE Atlas traceroute data into Elasticsearch | Year 3 | |
13 | Import the remainder of CAIDA traceroute data into Elasticsearch | Year 3 | |
14 | Develop tools to automate importing of new CAIDA and RIPE data | Year 3 | |
15 | Document FANTAIL and its capabilities for operational maintenance | Year 3 | |
Community Activities | |||
1 | Create and open a project web site | Year 1 | done |
2 | Organize a Community Workshop; publish the workshop report and recommendations | Year 1 | done |
3 | Attend CCRI PI Community meeting | Year 1 | |
4 | Create a mailing list to support FANTAIL users | Year 1 | |
5 | Organize a Community Workshop; publish the workshop report and recommendations | Year 2 | |
6 | Discuss with researchers their needs for generally usable analysis modules | Year 2 | |
7 | Attend CCRI PI Community meeting | Year 2 | |
8 | Conduct FANTAIL user survey #1 | Year 2 | |
9 | Identify and prioritize improvements of FANTAIL capabilities based on users' feedback | Year 2 | |
10 | Organize a Community Workshop; publish the workshop report and recommendations | Year 3 | |
11 | Attend CCRI PI Community meeting | Year 3 | |
12 | Conduct FANTAIL user survey #2 | Year 3 | |
13 | Refine FANTAIL capabilities based on users' feedback | Year 3 | |
14 | Engage FANTAIL users in sustainability discussions | Year 3 | |
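Infrastructure task 5 in the timeline above describes reducing a set of traceroute paths to a graph and computing statistics on round-trip time data. A minimal Python sketch of both steps follows. It makes simplifying assumptions that are not taken from the project: each path is a list of (hop address, RTT in milliseconds) pairs, every hop responded, and the function names `paths_to_graph` and `rtt_stats` are hypothetical.

```python
import statistics
from typing import Dict, List, Set, Tuple

# Hypothetical input: each path is a list of (hop address, RTT in ms) pairs.
# Real traceroute data also contains non-responding hops, ignored here.
Path = List[Tuple[str, float]]


def paths_to_graph(paths: List[Path]) -> Set[Tuple[str, str]]:
    """Reduce paths to a set of directed links between consecutive hops."""
    edges: Set[Tuple[str, str]] = set()
    for path in paths:
        addrs = [addr for addr, _ in path]
        edges.update(zip(addrs, addrs[1:]))
    return edges


def rtt_stats(paths: List[Path]) -> Dict[str, Dict[str, float]]:
    """Compute the min, median, and max RTT observed for each hop address."""
    samples: Dict[str, List[float]] = {}
    for path in paths:
        for addr, rtt in path:
            samples.setdefault(addr, []).append(rtt)
    return {
        addr: {"min": min(v), "median": statistics.median(v), "max": max(v)}
        for addr, v in samples.items()
    }


# Illustrative data only (documentation address ranges).
paths = [
    [("198.51.100.1", 1.0), ("203.0.113.7", 12.0)],
    [("198.51.100.1", 3.0), ("203.0.113.8", 20.0)],
    [("198.51.100.1", 2.0), ("203.0.113.7", 10.0)],
]
graph = paths_to_graph(paths)
stats = rtt_stats(paths)
```

Here three paths collapse to two distinct links, which shows why a graph reduction can shrink the output a user has to download.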