Center for Applied Internet Data Analysis
Internet Protocol Address (IP) Geolocation Bibliography
This page presents an annotated bibliography of papers and datasets related to the field of Internet Protocol (IP) address geolocation. Many applications require the association of Internet resources with an accurate geographic label at some granularity. For some applications knowing the country of origin might be sufficient; for others a more precise indication at state, city or zip code granularity, or even a specific latitude/longitude is needed. Below we provide an overview of published literature related to geolocation in an attempt to describe the current state of the art. We conducted this literature search as part of our efforts to compare geolocation tools.

Introduction

IP address geolocation reminds one of the classic bumper sticker, "think globally, act locally." In today's far reaching Internet, organizations and institutions of all kinds from corporations to governments want exactly that, the ability to communicate to the entire world and, at the same time, to develop applications which help them to target, limit, customize their messages, balance resources, and coordinate responses based on the location of the receiver. Organizations accomplish this by using tools and services that translate an IP address or prefix range into a geographic location (country, state, city, zip, geographic latitude/longitude) associated with the address(es). Simple, right?

However, which method(s) work best? Which sources of geolocation services and information return the most reliable locations and at what cost? What is the geographic resolution? Further, if a source provides the geographic location of the owner of an IP address, is this location the same as the location where the device is actually broadcasting and receiving packets? And, if different, can the difference be quantified?

What constitutes a "good" geolocation result? Some numbers: with a total land area of 1.5×10⁸ km² and 195 countries, the average country size on Earth is about 7.7×10⁵ km², or a linear size of 880 km. The surface area of the US is about 10⁷ km². With 50 states, over 3,000 counties and on the order of 43,000 zip codes, the average linear size of a state, county or zip code is about 450, 55 and 15 km, respectively. Looking at another big country, China (about the same size as the USA) has 33 provinces, 333 prefectures, about 3000 counties, and about 42000 townships, giving sizes of 550, 170, 60 and 18 km, respectively. To begin to be useful a geolocation method would at the very least need to be able to pinpoint the correct country, and, in large countries like the USA or China, the correct state or province. Looking at the above numbers this would require geolocation errors of at most a few hundred kilometers. An accuracy measured in tens of kilometers would be required to be effective at a truly local level (county or zip code).
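The size estimates above follow from treating each region as a square whose area equals the mean area per region. The short sketch below reproduces that arithmetic; the function name and the dictionary layout are illustrative choices, not part of the original page.

```python
import math

def mean_linear_size_km(total_area_km2: float, n_regions: int) -> float:
    """Side length (km) of a square whose area equals the mean region area."""
    return math.sqrt(total_area_km2 / n_regions)

# Figures quoted in the text above.
WORLD_LAND_KM2 = 1.5e8
US_AREA_KM2 = 1.0e7

sizes = {
    "country (world)": mean_linear_size_km(WORLD_LAND_KM2, 195),
    "US state": mean_linear_size_km(US_AREA_KM2, 50),
    "US county": mean_linear_size_km(US_AREA_KM2, 3000),
    "US zip code": mean_linear_size_km(US_AREA_KM2, 43000),
}

for name, km in sizes.items():
    print(f"{name}: ~{km:.0f} km")
```

The results round to the figures quoted in the text within a few kilometers; the square-region approximation is only meant to give an order of magnitude for the required accuracy.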

Useful Definitions

A number of concepts are commonly encountered in the geolocation literature. We define the main ones here.

  • IP geolocation describes methods of assigning a geographic label to an individual Internet Protocol address (IP).
  • A Vantage Point (VP) is a measurement infrastructure node with a known geographic location.
  • A Landmark is a responsive Internet identifier with a known location to which the VP will launch a measurement that can serve to calibrate other measurements to potentially unknown geographic locations. Some papers use the term Active Landmarks to refer to points which act as both landmark and vantage point. Often they are part of an infrastructure platform like PlanetLab.
  • A Target is an Internet identifier whose location will be inferred from a given method. Typically some targets have known geographic locations (ground truth), which researchers can use to evaluate the accuracy of their geolocation methodology.
  • A Location is a geographic place that geolocation techniques attempt to infer for a given target. Examples include cities and ISP Points of Presence (PoPs).

Not all terms are used in all papers.

Geolocation Papers

The tables below contain annotations for papers on the topic of geolocation. We have collected and reviewed papers published between 1996 and 2010, starting with papers from peer-reviewed academic research conferences, and then including papers cited from this initial seeding, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers.

The first table emphasizes papers that directly address geolocation methodology, introducing new methods, extensions to previous methods, performance analysis, etc. The second table includes papers that address other geolocation-related issues, including applications of geolocation, and coordinate-based methods for modeling network delays.

Alongside author and publication information, the tables include a number of additional columns.

The papertype gives a category indication; we use "survey", "analysis", "methodology", "tools", and "other". "Methodology" papers develop specific methods of geolocation; "analysis" papers focus on providing a quantitative foundation for geolocation methods (e.g., by comparing results for several methods); "survey" papers provide an overview of geolocation issues.

Data describes the type of data on which the results claimed in the paper are based. We mention here if the paper describes "ground truth" (authoritative mappings between IP addresses and geographic locations) used to validate geolocation results.

Findings gives a brief description of the main results claimed in the paper.

Probes gives an indication of the experimental setup (probes, landmarks, targets) used in a geolocation experiment (where appropriate).

Click on a checkbox below to show that attribute for each paper in a separate column. Enter text in the 'Filter' field to limit the listing. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.


| ID | Title | Year | Publication | Authors | Paper Type | Method | Data | Findings | Measurement Setup |
|---|---|---|---|---|---|---|---|---|---|
| P-01 | A Means for Expressing Location Information in the Domain Name System | 1996 | RFC Editor | Davis, C. and Vixie, P. and Goodwin, T. and Dickinson, I. | RFC 1876 | Description of DNS LOC records | | | |
| P-02 | GTrace - A Graphical Traceroute Tool | 1999 | Usenix LISA | Periakaruppan, R. and Nemeth, E. | tool | GUI for displaying traceroute results on geographic map | uses geo-info from DNS LOC, WHOIS (NetGeo), IP-to-location databases, hostname heuristics, combined with RTT-based verification | | |
| P-03 | Where in the World is netgeo.caida.org? | 2000 | INET | Moore, D. and Periakaruppan, R. and Donohoe, J. and claffy, k | tool | | | | |
| P-04 | An Investigation of Geographic Mapping Techniques for Internet Hosts | 2001 | SIGCOMM | Padmanabhan, V.N. and Subramanian, L. | methodology | IP2Geo suite: GeoTrack (traceroute info + host name heuristics), GeoPing (based on similarities in RTT-delay patterns), GeoCluster (IP-to-location database + BGP routing info) | IP-to-location datasets: 41772 Hotmail users at state granularity; 181246 IPs from bCentral web-hosting company at zipcode granularity; 142807 IPs from FooTV at zipcode granularity | GeoCluster most promising, with median errors of 28 km (well-connected hosts) to a few hundred km. Median errors for geolocation on same set of univ hosts: 28 km for GeoCluster; 102 km for GeoTrack; 382 km for GeoPing | 14 landmarks and 365 targets (in US, mostly univ-based) |
| P-05 | Similarity Models for Internet Host Location | 2003 | ICON | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | analysis | explores various similarity measures for GeoPing-type (see P-04) geolocation | RIPE TTM delay measurements from Dec 2002 to Jan 2003 | similarity based on "city-block" distance measure works best | 55 landmarks (RIPE TTM hosts) |
| P-06 | Toward a measurement-based geographic location service | 2004 | Passive and Active Network Measurement Workshop (PAM) | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | analysis | explores various similarity measures for GeoPing-type (see P-04) geolocation | delays from probes to all landmarks, and from probes to target host | sufficient correlation between geographic distance and network delay exists for coarse-grained geolocation; explores mostly distance-based similarity measures; median distance error of 314 km | 397 landmarks (RIPE TTM and LibWeb servers); 9 probes (NIMI) |
| P-07 | Constraint-based Geolocation of Internet Hosts | 2004 | IEEE/ACM Transactions on Networking | Gueye, B. and Ziviani, A. and Crovella, M. and Fdida, S. | methodology | multilateration based on geometric constraints (upper limit on target host distance from landmark) derived from delay measurements between landmarks; provides location estimate and confidence region | one-way delays between TTM hosts (Dec 2002-Feb 2003); RTT delays between AMP hosts (30 Jan 2003); known landmark locations | median error of 95 km (NLANR, USA) and 22 km (TTM, Europe) | Landmarks: 95 NLANR AMP hosts; 42 RIPE TTM hosts |
| P-08 | Demographic Placement for Internet Host Location | 2003 | GLOBECOM | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | methodology | develops methodology for efficiently deploying landmarks and probes for delay-based geolocation methods (in particular GeoPing from P-04) | | Landmarks are placed using demographic criteria (locations with high user density); probes are placed sparsely at locations with high connectivity | |
| P-09 | Improving the accuracy of measurement-based geographic location of Internet hosts | 2005 | Computer Networks and ISDN Systems | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | methodology | explores several issues related to implementation of GeoPing-type geolocation: correlation between RTT and geographic distance; optimal placements of landmarks and probes; methods for evaluating similarities between delay patterns | delays from LIP6 (Paris, France) to 109 LibWeb hosts in June 2002; delays between 55 RIPE TTM hosts from Dec 2002 to Feb 2003 | number and location of landmarks and probes is optimized using a demographic approach; similarity based on "city-block distances" outperforms Euclidean distance model | 109 LibWeb webservers; 55 RIPE TTM hosts |
| P-10 | Towards IP geolocation using delay and topology measurements | 2006 | Internet Measurement Conf (IMC) | Katz-Bassett, E. and John, J.P. and Krishnamurthy, A. and Wetherall, D. and Anderson, T. and Chawathe, Y. | methodology | Develops Topology-based geolocation (TBG): improves on pure CBG based on end-to-end delays by leveraging network topology at the router level and validated external hints; uses global optimization approach to determine router and target locations simultaneously. | PlanetLab hosts are used as landmarks. Geolocation experiments using targets collocated with 11 Abilene PoPs, 22 Sprint PoPs and 128 Univ. hosts | Typically reduces errors by factors of 3 to 4 compared with CBG | 68 PlanetLab landmarks; 128 US Univ. host targets |
| P-11 | Leveraging Buffering Delay Estimation for Geolocation of Internet Hosts | 2006 | Int Federation for Information Processing Technical Committee 6 (IFIP-TC6) Networking Conf | Gueye, B. and Uhlig, S. and Ziviani, A. and Fdida, S. | methodology | GeoBuD: CBG (P-07) augmented with buffering delay estimates at intermediate routers derived from traceroutes | traceroutes from PlanetLab landmarks for 17 Oct 2005 (US dataset); and 21 Nov 2005 (WE dataset) | incorporating buffering delays at intermediate routers in CBG (P-07) reduces median geolocation error from 228 to 144 km for US dataset and 137 to 100 km for WE dataset | 29 US PlanetLab landmarks with 87 AMP targets; 27 WE PlanetLab landmarks with 57 RIPE TTM targets |
| P-12 | IP Geolocation | 2007 | Internet Measurement seminar | Holzhauer, F. | survey | review of methods, emphasizing CBG (P-07) and TBG (P-10) | | | |
| P-13 | Octant: A Comprehensive Framework for the Geolocalization of Internet Hosts | 2007 | USENIX Symp on Networked System Design and Implementation (NSDI) | Wong, B. and Stoyanov, I. and Gun Sirer, E. | methodology | uses positive and negative constraints on hosts and intermediate routers; assigns "weights" to handle uncertainty in constraints; uses fictitious "height" to capture last hop delays; uses geometric technique based on Bezier curves that can incorporate extraneous geographic hints | Latency data collected on 1-Feb-2006 and 18-Sep-2006 between landmarks, intermediate routers and targets | Median geolocation error of 35 km | 51 PlanetLab hosts; 53 public traceroute servers |
| P-14 | Investigating the Imprecision of IP Block-based Geolocation | 2007 | Passive and Active Network Measurement Workshop (PAM) | Gueye, B. and Uhlig, S. and Fdida, S. | analysis | uses CBG geolocation to investigate geographic spread of IP addresses in same IP block | CBG (P-07) locations for 18759 IPs in 876 IP blocks between 31 Mar and 19 Apr 2006. IPs are CoralCDN Web clients, linked to IP blocks using database from paper R-05 | ~60% of IP blocks have spread in excess of 200 km | 74 PlanetLab landmarks |
| P-15 | Assessing the geographic resolution of exhaustive tabulation for geolocating Internet hosts | 2008 | Passive and Active Network Measurement Workshop (PAM) | Siwpersad, S.S. and Gueye, B. and Uhlig, S. | analysis | comparison of location estimates from CBG (P-07) with locations from MaxMind and Hexasoft databases | single IPs from 41758 MaxMind and 15823 Hexasoft IP blocks are geolocated with CBG | database location for more than 90% of IP blocks lies outside CBG confidence region | 39 PlanetLab landmarks |
| P-16 | Internet geolocation: evasion and counterevasion | 2009 | ACM Computing Surveys | Muir, J. and van Oorschot, P.C. | survey | overview of geolocation methods with general discussion of limitations; discussion of ways adversaries can avoid geolocation; mentions extraction of IP using Java applet, and RTT measurement by HTTP refreshes | | no geolocation method is robust (works for all IP addresses, network configs, and against adversarial users); those trying to evade geolocation can complicate the task for locators, but geographic information can leak in many ways | |
| P-17 | Mining the Web and the Internet for Accurate IP Address Geolocations | 2009 | IEEE Conf on Computer Communications (INFOCOM) | Guo, C. and Liu, Y. and Shen, W. and Wang, H.J. and Yu, Q. and Zhang, Y. | methodology | database mining technique (Structon): geographic information from large database of Web pages, combined with a number of heuristics to increase accuracy and coverage | 500 million Chinese URLs, augmented with traceroutes, WHOIS and BGP information | 87% accuracy at city-level granularity | |
| P-18 | Statistical geolocation of Internet hosts | 2009 | Int Conf on Computer Communications and Networks (ICCCN) | Youn, I. and Mark, B.L. and Richards, D. | methodology | delay-based statistical method: delay-to-distance relationship is expressed in a probability density function; solution by iterative force-directed method | delay measurements between all pairs of landmarks every five minutes for one week | Compared to GeoPing (P-04) and CBG (P-07), median errors are reduced by ~20% and mean errors by ~50% (i.e., significant improvement, especially in reducing large errors in GeoPing and CBG) | 85 PlanetLab landmarks |
| P-19 | A study of geolocation databases | 2010 | arXiv cs.NI/1005.5674v3 | Shavitt, Y. and Zilberman, N. | survey | statistical analysis of PoP "range of convergence" and deviations of IP and PoP locations within PoP | PoP map of 3800 PoPs (52K IPs) derived from DIMES traceroute measurements in March 2010 | vast majority of location info in databases is correct, but there are also errors in the range of 1000s of km | DIMES |
| P-20 | GeoWeight: Internet host geolocation based on a probability model for latency measurements | 2010 | Australasian Conf on Computer Science (ASCS) | Arif, M.J. and Karunasekera, S. and Kulkarni, S. | methodology | constraint-based (CBG) augmented by a probability model for latency vs. geographic distance | 150000 latency measurements between PlanetLab landmarks from 23 Sep 2008 to 25 Oct 2008; latencies from landmarks to 80 NA targets | Median geolocation errors of ~44 km compared to > 200 km for Octant and > 500 km for CBG | 50 PlanetLab landmarks and 80 targets in North America |
| P-21 | A model based approach for improving router geolocation | 2010 | Computer Networks: The Int Journal of Computer and Telecommunications Networking | Laki, S. and Matray, P. and Haga, P. and Csabai, I. and Vattay, G. | methodology | develops detailed path-latency model (separating propagation and per-hop delays); uses global optimization to solve for target locations (similar to P-10) | | mean geolocation errors of ~150 km | ETOMIC landmarks and 41 GEANT2 targets; 151 PlanetLab nodes, used as landmarks and targets |
| P-22 | Internet Host geolocation using maximum likelihood estimation technique | 2010 | IEEE Int Conf on Advanced Information Networking and Applications (AINA) | Arif, M.J. and Karunasekera, S. and Kulkarni, S. and Gunatilaka, A. and Ristic, B. | methodology | delay-based statistical method: delay-to-distance relationship is expressed as a probability density function; solution by MLE method | delay measurements between landmarks from 23 Sep to 25 Oct 2008 | median error of 134 km, compared to 216 km for Octant (P-13) and 506 km for CBG (P-07) on same dataset | 50 NA PlanetLab landmarks and 50 other NA hosts as targets |
| P-23 | Dude, where's that IP? Circumventing measurement-based IP geolocation | 2010 | Usenix Security Symp | Gill, P. and Ganjali, Y. and Wong, B. | analysis | Simulated "attacks" using PlanetLab testbed to foil delay-based and topology-aware geolocation attempts | | topology-aware techniques are more susceptible to tampering than simpler delay-based techniques | 50 NA and 30 WE PlanetLab nodes |
| P-24 | A learning-based approach for IP geolocation | 2010 | PAM | Eriksson, B. and Barford, P. and Sommers, J. and Nowak, R. | methodology | statistical method: expresses relation of distance to delay and hop count as a probability density function; solution by learning-based classification method | iPlane data for 12 Dec 2008 to 8 Jan 2009, supplemented with traceroutes between 375 PlanetLab hosts. Three sets of PlanetLab traceroutes between 11 Dec 2008 and 6 Jan 2009 | MaxMind db is used as "ground truth". Results are compared with CBG (P-07): mean error reduces from 519 km for CBG to 408 km for proposed learning-based method | 375 NA PlanetLab nodes |
| P-25 | Spotter: A model based active geolocation service | 2011 | INFOCOM | Laki, S. and Matray, P. and Haga, P. and Sebok, T. and Vattay, G. | methodology | delay-based statistical method: combining spatial probability density functions for all landmarks defines estimated region for location of target | uses PlanetLab nodes as targets; also uses 23000 Cogent IP address locations in Europe and US | all landmarks are described by same probabilistic delay-distance model | PlanetLab |
| P-26 | Towards street-level client-independent IP geolocation | 2011 | Usenix | Wang, Y. and Burgener, D. and Flores, M. and Kuzmanovic, A. and Huang, C. | methodology | combines active measurement approach with an active web-mining technique. Uses CBG for "coarse" geolocation; refines location using "relative network distance" in combination with large number of landmarks located using web-mining technique. | method evaluated using 88 PlanetLab nodes; a set of 72 residential IP addresses; and a third undisclosed dataset | claims median errors of 1-2 km for the three datasets used. | 88 PlanetLab targets |
| P-27 | iPlane Nano: path prediction for peer-to-peer applications | 2009 | Usenix Symp on Networked systems design and implementation (NSDI) | Madhyastha, H.V. and Katz-Bassett, E. and Anderson, T. and Krishnamurthy, A. and Venkataramani, A. | methodology | provides atlas of PoP-level paths, with latency and loss-rate predictions between arbitrary hosts on the Internet, by path stitching across inferred PoP paths. | iPlane data | iPlane iNano provides PoP-level paths between arbitrary end-hosts with an atlas that is less than 7MB in size and can be updated | |
| P-28 | Matchmaking for online games and other latency-sensitive P2P systems | 2009 | ACM SIGCOMM Computer Communication Review | Agarwal, S. and Lorch, J.R. | methodology | Htrae: places pairs of clients in a network coordinate system to provide client-to-client latency prediction. The network coordinate system is seeded with MaxMind GeoLite coordinates. | 3.5 million console-to-console latencies from Halo (Microsoft); GeoLite IP-to-geographic-location data (MaxMind) | 50% of predictions under 15 ms for Htrae, 24 ms for GeoLite. 95% of Htrae's predictions within 138 ms, and 208 ms for geolocation. | one-time volunteer deployment of Htrae on 11 home machines |
| P-29 | IP geolocation databases: unreliable? | 2011 | ACM SIGCOMM Computer Communication Review | Poese, I. and Uhlig, S. and Kaafar, M.A. and Donnet, B. and Gueye, B. | survey | Compares prefix distributions in 5 geolocation DBs with ground truth. | Ground truth: 357 BGP prefixes from large European ISP with city-level location of router advertising subnet inside ISP. Databases of HostIP, IP2Location, InfoDB, Maxmind, Software77 | databases are strongly biased to popular countries; db IP blocks use official advertisements of ISP; while some of the ISP address space is geolocated decently (e.g. 20% of Maxmind within 10s of km of ground truth), in most cases DBs are off by 100s to 1000s of km. | |
| P-30 | Geocompare: a comparison of public and commercial geolocation databases | 2011 | CAIDA Technical Report | Huffaker, B. and Fomenkov, M. and claffy, kc | survey | compares geolocation databases against each other and against a ground truth dataset | RIR, Software77, HostIP, IPligence, Cyscape, MaxMind GeoIP, MaxMind GeoLite, IPInfoDB, and Digital Envoy | databases roughly all agree on country; MaxMind GeoIP and Digital Envoy did best on ground truth. Digital Envoy did best on routers. | |
| P-31 | Network measurement based modeling and optimization for IP geolocation | 2012 | Computer Networks | Dong, Z. and Perera, R.D.W. and Chandramouli, R. and Subbalakshmi, K.P. | methodology | measurement-based geolocation method tested on PlanetLab nodes; k-means clustering of landmark-to-landmark measurements defines distance segments, each of which fits RTT vs distance using polynomial regression; semidefinite programming method finds optimized location of target host from estimated distances to landmarks. | traceroutes from PlanetLab probes to landmarks for one week in Nov 2010 | Best results: 27-32 km in North America; 41-53 km in Europe | 81 North-American and 90 European PlanetLab probes; 206 PlanetLab landmarks |
| P-32 | A structural approach for PoP geo-location | 2012 | Computer Networks | Feldman, D. and Shavitt, Y. and Zilberman, N. | methodology | Method to generate PoP-level geographic maps from IP-level graph based on DIMES traceroutes. PoP identification based on structure ('motifs') and a partitioning algorithm that assigns nodes to PoPs; geographic location assigned to PoP from geoloc DBs | DIMES: 56M traceroutes from 2009 Jul and 33M from 2010 Oct. GeoDBs used: MaxMind, IPligence, HostIP, IP2Location, GeoBytes | Comparison with published PoP maps finds that most large PoPs are found, but few small ones. The majority of incorrect links is attributed to database errors. | DIMES, 1308 agents in 49 countries |
| P-33 | Posit: a lightweight approach for IP geolocation | 2012 | ACM SIGMETRICS Performance Evaluation Review | Eriksson, B. and Barford, P. and Maggs, B. and Nowak, R. | methodology | CBG for geolocation to constrained region; finds the most likely location from limited pool (i.e., city) given RTT to target and landmarks in constrained region. | | | 431 Akamai vantage points/targets and an additional 283 targets |
| P-34 | Enhancing the classification accuracy of IP geolocation | 2012 | Military Communications Conf (MILCOM) | Maziku, H. and Shetty, S. and Han, K. and Rogers, T. | methodology | Machine-learning approach (extension of P-24). Larger set of classifiers includes average, median, mode and std dev of delay measurement; hop count; pop. density. | 142,937 (23,843 after de-aliasing) router IP addresses from traceroutes between PlanetLab nodes between Jun and Oct 2011. | Results heavily depend on good coverage by landmarks. Median errors vary from 0 (NE US) to ~500 km (N-Central US). | 67 well-distributed PlanetLab nodes across US serve as landmarks and probes |
| P-35 | Towards geolocation of millions of IP addresses | 2012 | ACM Internet measurement conference (IMC) | Hu, Z. and Heidemann, J. and Pradkin, Y. | methodology | selects the 10 Vantage Points nearest (by RTT) to probe a /24 prefix | 400 PlanetLab nodes to 25 known landmarks | shows that selecting the 10 Vantage Points with lowest RTT values to a given /24 prefix greatly reduces the amount of traffic needed to geolocate without large increases in error | |
| P-36 | Using Whois based geolocation and Google maps API for support cybercrime investigations | 2013 | Recent Advances in Telecommunications and Circuits | Butkovic, A. and Orucevic, F. and Tanovic, A. | methodology | | | | |
| P-37 | Topology mapping and geolocating for China's Internet | 2013 | IEEE Trans. On Parallel and Distributed Systems | Tian, Y. and Dey, R. and Liu, Y. and Ross, K.W. | methodology | | | | |

Related Papers


| ID | Title | Year | Publication | Authors | Paper Type | Method | Data | Findings | Measurement Setup |
|---|---|---|---|---|---|---|---|---|---|
| R-01 | Predicting Internet Network Distance with Coordinates-Based Approaches | 2002 | IEEE Conference on Computer Communications (INFOCOM) | Ng, T. S. E. and Zhang, H. | methodology | develops GNP, a coordinate-based method for estimating minimum RTT using absolute coordinates | distance (minimum RTT) measurements between landmarks and two sets of targets | Euclidean embedding combined with a relative error measurement function works best | 19 landmarks (12 in NA; 5 in AP; 2 in EU); two target sets: 869 global; 127 Abilene-connected |
| R-02 | Virtual Landmarks for the Internet | 2003 | Internet Measurement Conference (IMC) | Tang, L. and Crovella, M. | methodology, analysis | coordinate-based method for estimating minimum RTT using Euclidean embedding; uses "virtual landmarks" for speed and scalability | seven collections of RTT data | network distances can be described with 7-9 orthogonal vectors; ~90% of distances preserved with relative error <0.5; "virtual landmark" method simpler and faster than nonlinear optimization | NLANR AMP |
| R-03 | On the geographic location of Internet resources | 2003 | J. on Selected Areas in Communications, Vol. 21, pp. 934-947 | Lakhina, A. and Byers, J.W. and Crovella, M. and Matta, I. | analysis | statistical analysis of geographic properties of router topology. | Uses CAIDA Skitter data (26 Dec 2001 to 1 Jan 2002) and Scan Project Mercator data (Aug 1999). Geolocation is done using IxMapper and EdgeScape. | Superlinear relation between router and population density; connection patterns linked to geographic distance. Number of AS locations correlates with AS degree and AS geographic dispersal. | |
| R-04 | Vivaldi: A Decentralized Network Coordinate System | 2004 | SIGCOMM | Dabek, F. and Cox, R. and Kaashoek, F. and Morris, R. | methodology | Coordinate-based method for predicting communication latencies using 2D Euclidean embedding augmented with a "height" component | RTTs between 192 PlanetLab nodes; and between 1740 DNS servers | Median relative error in RTT prediction of 11% | PlanetLab |
| R-05 | Geographic Locality of IP Prefixes | 2005 | Internet Measurement Conference (IMC) | Freedman, M. J. and Vutukuru, M. and Feamster, N. and Balakrishnan, H. | analysis | Statistical analysis of geographic properties of IP prefixes within context of implications for routing policies. Uses undns for geolocation (i.e., IP-to-geographic location mapping based on geographic information in DNS names) | 170000 IP prefixes from RouteViews from 27-Feb-2005; traceroutes to CoralCDN clients and servers; traceroutes from PlanetLab hosts to 4 IPs per prefix | discontiguous prefixes announced by an AS from a single location are usually due to fragmented allocation by registries; contiguous prefixes announced by an AS from different geographic locations limit opportunities for aggregation of prefixes | 25 PlanetLab hosts |
| R-06 | Geolocalization of Proxied Services and its Application to Fast-Flux Hidden Servers | 2009 | IMC | Castelluccia, C. and Kaafar, M.A. and Manils, P. and Perito, D. | application | application of CBG (P-07) to geolocation of fast-flux hidden servers | | | |
| R-07 | Eyeball ASes: From Geography to Connectivity | 2010 | Internet Measurement Conference (IMC) | Rasti, A. and Magharei, N. and Rejaie, R. and Willinger, W. | analysis | IP geolocation (48×10⁶ IP addresses from P2P apps) done using GeoIP and IP2Location; used to determine geo- and PoP-level footprints of ASes | PoP info for 45 ASes in NA and EU compiled from online data | | |
| R-08 | Improving AS Relationship Inference Using PoPs | 2013 | Traffic Monitoring and Analysis Workshop (TMA 2013) | Neudorfer, L. and Shavitt, Y. and Zilberman, N. | methodology | Method to use PoP-level maps to find complex and anomalous AS relationships | 29M DIMES traceroutes (May 2012); DIMES IP-to-PoP mapping (May 2012) with 5215 PoPs, 98650 IPs in 2636 ASes; CAIDA AS rank data from August 2012 with 119,924 AS pairs. | Discusses several examples of complex and/or anomalous relationships between ASes (different at different ASes) | DIMES |

Measurement Infrastructure

The above IP geolocation bibliography references several active measurement infrastructures either because they are used directly in a geolocation experiment, or because datasets obtained with the infrastructure are analysed. These resources are listed here with references back to the papers.


| ID | Name | Organization | Description | Paper ID |
|---|---|---|---|---|
| D-01 | PlanetLab | Princeton University | Global research network for the development of new network services. Currently over 1000 nodes worldwide | P-10, P-11, P-13, P-14, P-15, P-18, P-20, P-21, P-22, P-23, P-24, P-25, P-26, R-04, R-05, R-06 |
| D-02 | iPlane | University of Washington | Scalable service for predictions of Internet path performance for emerging overlay services (incl. access to iPlane datasets) | P-20, P-22, P-24 |
| D-03 | TTM | RIPE | Test Traffic Measurement Service (TTM) measures key parameters of the connectivity between points on the Internet | P-05, P-06, P-07, P-09, P-11 |
| D-04 | AMP | NLANR | NLANR Active Measurement Project (AMP), active between 1998 and 2006. Datasets available at RIPE. | P-07, P-11, R-02 |
| D-05 | ETOMIC | | European Traffic Observatory Measurement Infrastructure (ETOMIC) is a measurement infrastructure, distributed throughout Europe, that is able to carry out active measurements | P-21 |
| D-06 | GEANT2 | | High-bandwidth, academic Internet serving Europe's research and education community | P-21 |
| D-07 | DIMES | Tel Aviv Univ. | Distributed scientific research project, aimed at studying structure and topology of the Internet | P-19 |
| D-08 | Skitter | CAIDA | Tool for actively probing the Internet for topology and performance analysis. Retired in 2008. Dataset available from CAIDA | R-02 |

Geographic Information and Geolocation Methods

Current geolocation techniques can be broadly divided into two categories: database-driven (also called registry-based; P-25) and measurement-based. This categorization mirrors a similar division in the types of geographic information available for IP geolocation: qualitative data and numerical (quantitative) data. Both have been present in geolocation efforts from the outset.

The class of quantitative data includes the workhorse of measurement-based geolocation methods: delay measurements from probes to landmarks and targets. A number of publications establish the relationship between Internet delay and geographic distance (P-06, P-09, P-10, P-25) in the presence of obfuscating factors such as circuitous routing, buffering and other delays (P-11, P-21). Also included in this class is network topology information, typically derived from traceroute measurements. Topology information can be an integral part of a geolocation algorithm (e.g., when intermediate routers to an end target are geolocated alongside the target itself in a global optimization; P-10), but is also used in simpler arguments that relate topological proximity to geographic proximity (e.g., when the last intermediate router is geolocated because the real target is unreachable). Hop counts (also derived from traceroutes) are explored in a recent paper (P-24) as another quantitative measure of geographic distance.
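The simplest link between delay and distance is a physical upper bound: a packet cannot travel farther than the signal speed allows in half the round-trip time. The sketch below illustrates that bound under the common rule of thumb that signals in fiber travel at about two thirds of the speed of light (roughly 200 km per millisecond). The function names and the example coordinates are illustrative assumptions, not taken from any of the cited papers.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: propagation in fiber at ~2/3 the speed of light, ~200 km/ms.
FIBER_KM_PER_MS = 200.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two lat/lon points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def max_distance_km(rtt_ms):
    """Upper bound on distance implied by an RTT (one-way time x signal speed)."""
    return (rtt_ms / 2.0) * FIBER_KM_PER_MS

def is_plausible(vantage_point, candidate, rtt_ms):
    """True if the candidate location does not violate the RTT distance bound."""
    return great_circle_km(*vantage_point, *candidate) <= max_distance_km(rtt_ms)

# Hypothetical example: vantage point near New York, candidate near Los Angeles.
new_york = (40.71, -74.01)
los_angeles = (34.05, -118.24)
print(is_plausible(new_york, los_angeles, rtt_ms=20.0))  # bound is 2000 km
print(is_plausible(new_york, los_angeles, rtt_ms=60.0))  # bound is 6000 km
```

Circuitous routing and buffering only add delay, so the bound stays valid but becomes loose; this is why the papers above put effort into modeling the extra delay components.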

The class of qualitative data includes the usual suspects (WHOIS registry, DNS LOC records, DNS names, BGP routing tables), but also databases based on information gathered from the Internet community (either directly through user input, or indirectly, e.g., by parsing large quantities of URLs; P-17). This probably also includes the various types of proprietary databases used in commercial geolocation products. All of these contain geographic information (directly as in DNS LOC records, or indirectly by linking to an organization or AS number) that, if correctly interpreted, provides clues about the geographic location of an IP address or IP address block.

The earliest geolocation attempts, GTrace (P-02; constructed around NetGeo), GeoTrack and GeoCluster (P-04), emphasize qualitative data (primarily WHOIS records and DNS names), but already incorporate delay (RTT) measurements. GTrace uses RTT data to validate results using "speed-of-light" arguments; GeoPing (P-04) is purely RTT-based. Since these early attempts, a number of measurement-based algorithms have appeared in the academic literature. The table below provides an overview of the accuracy achieved by the various techniques. In this list only the first three are database-driven; all others (starting with GeoPing) are measurement-based.

GeoPing (P-04) compares delay "fingerprints" (based on delay measurements from a set of probes) of the target and of a set of landmarks, and assigns the target the location of the landmark with the most similar fingerprint. As the first measurement-based geolocation method, it appears to be mostly of historical significance at this point. Constraint-based geolocation (CBG; P-07), which uses deterministic geometric constraints derived from delay measurements to bound the probable location of a target, has set the stage for future development, and is the benchmark against which more recent models are most commonly compared.
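The two ideas in the preceding paragraph can be sketched in a few lines. This is a simplified illustration, not the published algorithms: `geoping_estimate` picks the landmark with the nearest delay fingerprint using Euclidean distance (an assumed similarity measure), and `cbg_estimate` intersects distance constraints on a coarse latitude/longitude grid and returns the centroid of the feasible points. The fixed conversion of 100 km per millisecond of RTT is an assumption (two thirds of the speed of light, one way), and the centroid is a plain lat/lon average that ignores wrap-around at the antimeridian. All function and parameter names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def geoping_estimate(target_fingerprint, landmarks):
    """Return the location of the landmark whose delay fingerprint is closest.

    landmarks is a list of ((lat, lon), fingerprint) pairs; every fingerprint
    holds delays in ms measured from the same ordered set of probes.
    """
    def fingerprint_distance(fp):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(target_fingerprint, fp)))

    location, _ = min(landmarks, key=lambda lm: fingerprint_distance(lm[1]))
    return location


def cbg_estimate(probes, rtts_ms, km_per_rtt_ms=100.0, step_deg=1.0):
    """Return the centroid of grid points that satisfy every constraint, or None.

    Each probe at a known (lat, lon) constrains the target to lie within
    rtt * km_per_rtt_ms kilometers of that probe.
    """
    radii = [rtt * km_per_rtt_ms for rtt in rtts_ms]
    feasible = []
    for i in range(int(180 / step_deg) + 1):
        lat = -90.0 + i * step_deg
        for j in range(int(360 / step_deg)):
            point = (lat, -180.0 + j * step_deg)
            if all(haversine_km(p, point) <= r for p, r in zip(probes, radii)):
                feasible.append(point)
    if not feasible:
        return None  # the constraints do not overlap on this grid
    return (sum(p[0] for p in feasible) / len(feasible),
            sum(p[1] for p in feasible) / len(feasible))
```

The fingerprint approach can only ever return a landmark location, so its accuracy is limited by landmark density. The constraint approach can place a target anywhere inside the feasible region, and the size of that region gives a natural confidence measure.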

Subsequent geolocation methods show an increasing sophistication in extracting geographic information, either by supplementing delay measurements with additional data, or by using more complex algorithms. Topology-based geolocation (TBG; P-10) introduces topology measurements to simultaneously geolocate intermediate routers and targets. Further refinements include an improved analysis of delay measurements (separating the distance-sensitive propagation delays from other processing delays; P-11, P-21), incorporating database-driven approaches to improve geolocation accuracy (P-10), and integrating hop counts into the geolocation algorithm (P-24). Algorithms are also evolving. The most recent models favor probabilistic approaches, which seem to be a better match to the essentially statistical nature of the relation between geographic distance and delay measurements. GeoWeight (P-20) marks a transition by combining deterministic constraints, similar to CBG, with probability assignments; P-18, P-22, P-24 and P-25 describe delay measurements using probability density functions, and use various statistical methods to build a geolocation algorithm.
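To make the probabilistic idea concrete, the sketch below scores candidate locations by the likelihood of the observed delays and returns the best-scoring one. It is a toy model under stated assumptions, not the method of any cited paper: the expected delay is taken to be linear in distance (`base_ms + ms_per_km * distance`) with Gaussian noise of standard deviation `sigma_ms`, and all three parameter values, like the function names, are hypothetical. The published methods estimate such distributions from calibration measurements.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def log_likelihood(candidate, probes, rtts_ms,
                   base_ms=5.0, ms_per_km=0.015, sigma_ms=10.0):
    """Log-likelihood of the observed RTTs if the target were at `candidate`."""
    total = 0.0
    for probe, rtt in zip(probes, rtts_ms):
        expected = base_ms + ms_per_km * haversine_km(probe, candidate)
        total += (-0.5 * ((rtt - expected) / sigma_ms) ** 2
                  - math.log(sigma_ms * math.sqrt(2 * math.pi)))
    return total


def most_likely_location(candidates, probes, rtts_ms):
    """Return the name of the candidate with the highest log-likelihood.

    candidates maps a name to a (lat, lon) pair.
    """
    return max(candidates,
               key=lambda name: log_likelihood(candidates[name], probes, rtts_ms))
```

Unlike a hard geometric constraint, a single inflated delay only lowers a candidate's score instead of excluding it outright, which is one reason probabilistic models tolerate noisy measurements better.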

Few detailed descriptions of database-driven techniques exist in the literature. The exceptions are NetGeo (P-03) and Structon (P-17). Not surprisingly, published literature contains little concrete information about algorithms employed in commercial geolocation products. Whether the qualitative input data are web pages, WHOIS registry records, or DNS names, a database-driven geolocation algorithm tends to be a collage of various heuristic arguments, approximations and intelligent guesswork.

Error Distance Matrix

The table below compiles numbers from geolocation experiments described in the above publications for measurement-based techniques. The column headers indicate a range of median errors in geolocation distance reported in the papers; the values in the columns are the number of experiments that report median errors in the indicated range. Even though direct comparison of these numbers is tricky due to the wide variations in experiment characteristics (different types of targets, different sets of landmarks, etc.), the picture that emerges is that state-of-the-art measurement-based techniques can comfortably geolocate targets with median errors of < 250 km, while some techniques under favorable conditions can approach an accuracy of < 100 km. To put this in context: 1000 km can be roughly viewed as country granularity; 50 km approaches city or zip code granularity.

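Columns: Method (ID), followed by the number of experiments reporting a median error distance d in each of the ranges d < 5 km, 5 < d < 50 km, 50 < d < 100 km, 100 < d < 250 km, 250 < d < 500 km, 500 < d < 750 km. Only non-empty cells are listed, from left to right.

NetGeo (P-03): 1
GeoTrack (P-04): 1, 2, 1
GeoCluster (P-04): 1
GeoPing (P-04): 1, 5, 2
CBG (P-07): 1, 3, 6, 2, 2
TBG (P-10): 1, 1
GeoBuD (P-11): 1, 1
Octant (P-13): 2, 4, 2
SG (P-18): 1
GeoWeight (P-20): 1
Geo-Rh (P-21): 1
MLE (P-22): 1
Naive Bayes (P-24): 1
Spotter (P-25): 1, 1
Street GeoLoc (P-26): 1
Dong et al. (P-31): 1
Maziku et al. (P-34): 1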
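The quantity tallied in the table, the median error distance of an experiment, can be computed from estimated and true coordinates as in the sketch below. It is a minimal illustration with hypothetical names; the bin edges follow the table's column ranges, and treating each lower edge as inclusive is an assumption, since the table does not say where boundary values fall.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0
# Column ranges of the table, in km; the lower edge is treated as inclusive.
BINS_KM = [(0, 5), (5, 50), (50, 100), (100, 250), (250, 500), (500, 750)]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def median_error_km(estimates, truths):
    """Median great-circle distance between estimated and true locations."""
    return median(haversine_km(e, t) for e, t in zip(estimates, truths))


def error_bin(median_km):
    """Label of the table column that a median error distance falls into."""
    for low, high in BINS_KM:
        if low <= median_km < high:
            return f"{low} <= d < {high} km"
    return "d >= 750 km"
```

For example, `error_bin(median_error_km(estimates, truths))` assigns one experiment to one column; repeating this per experiment and counting per method yields a row of the table.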

Discussion

A direct comparison between measurement-based and database-driven approaches, or even just between measurement-based algorithms, is tricky at best. A systematic comparison would require a reliable "ground truth" database of IP addresses at known geographic locations, which is difficult to find. In practice, the pool of potential test targets at known locations is limited: most recent published experiments select their ground truth from hosts in measurement infrastructures like PlanetLab in North America or Europe. So, even though this is hard to quantify, the ground truth in some published experiments is probably similar. In some papers the same ground truth is used to compare different algorithms (typically CBG is used as a benchmark, which explains the high number of entries for CBG in the above table), providing some insight into comparative performance. Obvious questions remain, though. How representative are results based on a limited number of PlanetLab targets for the Internet as a whole? How much does the accuracy of a method vary from well-connected hosts (routers) to a heterogeneous collection of end hosts? Looking at the above table, the median errors for CBG experiments vary from better than 50 km to more than 500 km (one order of magnitude), presumably reflecting a wide variation in experiment characteristics.

In an average sense the performance of the best geolocation techniques can be quantified reasonably well: the best measurement-based methods have median errors of at most a few hundred km (well within country granularity), with the best results maybe approaching 50 km (city or zip level). Similarly, database-driven techniques also appear to do quite well at the country level, but start running out of steam at the city level. Whether database-driven or measurement-based, all techniques suffer from what might be called an outlier syndrome: each is plagued by outliers with location errors well in excess of 1000 km (or country level). It would seem that for any potential application of geolocation the key question to ask is whether being right most of the time is good enough. If the answer is yes, a secondary question is whether the average accuracy of a selected algorithm is satisfactory.
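The two questions above correspond to two different summary statistics, which the following sketch reports side by side. It is only an illustration: the function name `summarize_errors` is hypothetical, and the 1000 km outlier threshold is taken from the country-level figure used in the text.

```python
from statistics import median


def summarize_errors(errors_km, outlier_km=1000.0):
    """Report the typical error (median) and the share of gross outliers."""
    if not errors_km:
        raise ValueError("errors_km must not be empty")
    outliers = sum(1 for e in errors_km if e > outlier_km)
    return {
        "median_km": median(errors_km),
        "outlier_fraction": outliers / len(errors_km),
    }


if __name__ == "__main__":
    # Hypothetical error distances in km: mostly accurate, one gross outlier.
    print(summarize_errors([10.0, 20.0, 30.0, 40.0, 5000.0]))
```

A method can have an excellent median and still be unusable for an application that cannot tolerate occasional errors at the scale of a continent, which is why both numbers are worth reporting.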

  Last Modified: Tue Oct-13-2020 22:21:40 UTC
  Page URL: https://www.caida.org/projects/cybersecurity/geolocation/bib/index.xml