Center for Applied Internet Data Analysis
Internet Traffic Classification
Internet traffic classification continues to attract attention as many applications emerging on the Internet adopt obfuscation techniques. Related papers tend to classify whatever traffic samples a researcher can find, with no systematic integration of results. To fill this gap, we have created a structured taxonomy of traffic classification papers and their data sets. We also hope to reveal issues and challenges in traffic classification.

Introduction

The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The field of Internet traffic classification research includes many papers representing various attempts to classify whatever traffic samples a given researcher has access to, with no systematic integration of results. Here we provide a rough taxonomy of papers, and explain some issues and challenges in traffic classification.

Application Trends

Many media-rich entertainment applications have emerged on the Internet. They often use obfuscation techniques such as encrypted data transmission, random or changing ports, or proprietary communication protocols to prevent detection or filtering by network or content owners who believe the traffic threatens their (infrastructural or intellectual) property. Other applications, e.g., PPStream, uTorrent, and PPLive, use UDP in place of TCP. The rapidly changing nature of applications, and even of different versions of the same application, presents a challenge for traffic classification techniques.

Definitions

We use the phrase traffic classification to describe methods of classifying traffic based on features passively observed in the traffic, and according to specific classification goals. One might have only a coarse classification goal, e.g., whether the traffic is transaction-oriented, bulk-transfer, or peer-to-peer file sharing. Or one might have a finer-grained classification goal, e.g., the exact application represented by the traffic. Traffic features could include the port number, application payload, or temporal, packet size, and addressing characteristics of the traffic. Classification methods include exact matching (e.g., of port number or payload), heuristics, and machine learning (statistics).
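As an illustration of the exact-matching methods named above, the following minimal sketch classifies a flow first by payload prefix and then by port number. The lookup tables and function names are hypothetical and chosen only for this example; they are not taken from any of the surveyed papers, and real classifiers use much larger, curated signature sets.

```python
from typing import Optional

# Hypothetical lookup tables for illustration only.
PORT_TABLE = {25: "SMTP", 53: "DNS", 80: "HTTP", 443: "HTTPS"}
PAYLOAD_SIGNATURES = {
    b"\x13BitTorrent protocol": "BitTorrent",  # BitTorrent handshake prefix
    b"GET ": "HTTP",
    b"EHLO": "SMTP",
}


def classify_by_port(src_port: int, dst_port: int) -> Optional[str]:
    """Exact match on a well-known port, checking the destination first."""
    for port in (dst_port, src_port):
        if port in PORT_TABLE:
            return PORT_TABLE[port]
    return None


def classify_by_payload(payload: bytes) -> Optional[str]:
    """Exact match on the leading bytes of the application payload."""
    for signature, label in PAYLOAD_SIGNATURES.items():
        if payload.startswith(signature):
            return label
    return None


def classify(src_port: int, dst_port: int, payload: bytes) -> str:
    """Prefer payload evidence, fall back to ports, otherwise 'unknown'."""
    return (
        classify_by_payload(payload)
        or classify_by_port(src_port, dst_port)
        or "unknown"
    )


# A BitTorrent handshake sent to port 80 is still labeled by its payload.
label = classify(51413, 80, b"\x13BitTorrent protocol" + b"\x00" * 8)
```

Checking the payload before the port reflects the point made in the Application Trends section: applications that use random or changing ports defeat a port-only lookup, while encrypted payloads defeat the signature lookup and leave only the port or "unknown".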

Annotated Papers

We have collected and reviewed papers published between 1994 and 2009 (please email info@caida.org if you know one that should be added), starting with papers from peer-reviewed academic research conferences, and then including many papers cited from this initial seeding set of papers, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers, including data sets and methods used, goals, and basic empirical findings. We use five paper categories: survey, analysis, methodology, tools, and other. Analysis papers typically attempt to derive trustworthy numbers on actual traffic cross-section, while methodology papers focus on methods of classification. Click on a checkbox below to show that attribute for each paper in a separate column. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.

ID | Title | Year | Publication | Authors | Paper Type | Classification Goals | Classification Characteristics | Method | Empirical Findings | % of traffic P2P | DataID
1 | Toward the Accurate Identification of Network Applications | 2005 | PAM | A. Moore, K. Papagiannaki | Methodology | Coarse-grained Classification | Application Payload | Exact Matching | Port-based: 64.54% BULK; 27.30% WWW; Content-based: 45.00% BULK; 20.40% WWW | 1.5% | [D-48]
2 | Flow Clustering Using Machine Learning Techniques | 2004 | PAM | A. McGregor, M. Hall, P. Lorier, J. Brunskill | Methodology | Coarse-grained Classification | Flow Characteristics | Machine Learning/Stat (EM) |  |  | 
3 | Internet Traffic Classification Using Bayesian Analysis Techniques | 2005 | SIGMETRICS | A. Moore, D. Zuev | Methodology | Coarse-grained Classification | Flow Characteristics | Machine Learning/Stat (Bayesian) | 65% accuracy on per-flow classification and better than 95% with refinements |  | [D-48]
4 | Class-of-service Mapping for QoS | 2004 | IMC | M. Roughan, S. Sen, O. Spatscheck, N. Duffield | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (NN, LDA) |  |  | [D-59][D-68]
5 | Is P2P Dying or just Hiding? | 2004 | GLOBECOM | T. Karagiannis, A. Broido, N. Brownlee, K. Claffy, M. Faloutsos | Methodology | Fine-grained Classification | Application Payload | Exact Matching | 2003: HTTP(72%,47.7%); SMTP(1.3%,1.2%); P2P(8%,10.7%); 2004: HTTP(56%,52.1%); SMTP(3.2%,9.7%); P2P(14%,9.9%) | 8%,10.7% in 2003; 14%,9.9% in 2004 | [D-69]
6 | Transport Layer Identification of P2P Traffic | 2004 | SIGCOMM | T. Karagiannis, A. Broido, M. Faloutsos, K. Claffy | Methodology | Coarse-grained Classification (I) | Flow Characteristics | Heuristics | P2P traffic continues to grow unabatedly | 15%-20% | [D-70]
7 | An Analysis of Internet Chat Systems | 2003 | SIGCOMM | C. Dewes, A. Wichmann, A. Feldmann | Profiling/Analysis |  |  |  | miss less than 8.3% of all existing chat connections and to correctly classify at least 93.1% |  | [D-67]
8 | Profiling Internet Backbone Traffic: Behavior Models and Applications | 2005 | SIGCOMM | K. Xu, Z. Zhang, S. Bhattacharyya | Profiling/Analysis |  |  |  |  |  | [D-49]
9 | Accurate Scalable In-Network Identification of P2P Traffic | 2004 | WWW | S. Sen, O. Spatscheck, D. Wang | Methodology | Coarse-grained Classification (I) | Flow Characteristics | Heuristics | less than 5% false positive and false negative ratios |  | [D-71][D-72]
10 | BLINC Multilevel Traffic Classification in the Dark | 2005 | SIGCOMM | T. Karagiannis, K. Papagiannaki, M. Faloutsos | Methodology | Coarse-grained Classification | Flow Characteristics | Heuristics | web: 14%,37.5%,33.5%; data(ftp): 67.4%,7.6%,5.4% | 1.2%,31.9%,31.3% | [D-48]
11 | CoralReef Software Suite as a Tool for System and Network Administrators | 2001 | LISA | D. Moore, K. Keys, R. Koga, E. Lagache, K. Claffy | Tools |  |  |  |  |  | 
12 | Packet-level Traffic Measurements from the Sprint IP Backbone | 2003 | IEEE Network | C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockell, T. Seely | Profiling/Analysis |  |  |  | over 90% flows have packet sizes of 1495 bytes or greater | 0.1%-80% (p2p+unknown) | [D-77]
13 | Snort Lightweight Intrusion Detection for Networks | 1999 | LISA | M. Roesch | Tools |  |  |  |  |  | [D-66]
14 | Internet Traffic Characterization | 1994 |  | K. Claffy | Profiling/Analysis |  |  |  |  |  | 
15 | Discriminators for Use in Flow-based Classification | 2004 |  | A. Moore, D. Zuev, M. Crogan | Other |  |  |  |  |  | [D-48]
16 | Identifying the TCP Behavior of Web Servers | 2001 |  | J. Padhye, S. Floyd | Profiling/Analysis |  |  |  |  |  | [D-58]
17 | Inherent Behaviors for On-line Detection of Peer-to-Peer File Sharing | 2007 | IEEE Global Internet | G. Bartlett, J. Heidemann, C. Papadopoulos | Methodology | Fine-grained Classification (I) | Flow Characteristics | Heuristics | achieve up to an 83% true positive rate with only a 2% false positive rate |  | [D-6][D-7]
18 | Graption: Automated Detection of P2P Applications Using Traffic Dispersion Graphs | 2008 | Technical Report | M. Iliofotou, P. Pappu, M. Faloutsos | Methodology | Coarse-grained Classification (I) | Flow Characteristics | Heuristics | more than 90% precision and recall for P2P detection | 9.19% | [D-8][D-11]
19 | Traffic Classification through Simple Statistical Fingerprinting | 2007 | SIGCOMM CCR | M. Crotti, M. Dusi, F. Gringoli, L. Salgarelli | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (Normalized thresholds) |  |  | [D-12]
20 | Traffic Classification Using Clustering Algorithms | 2006 | SIGCOMM | J. Erman, M. Arlitt, A. Mahanti | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (K-Means, DBSCAN) | 47.3% (Bytes, HTTP); 35.1% (Bytes, P2P); 6.0% (Bytes, SMTP) | 35.1% (Bytes) | [D-36][D-39]
21 | Unexpected Means of Protocol Inference | 2006 | IMC | J. Ma, K. Levchenko, C. Kreibich, S. Savage, G. Voelker | Methodology | Fine-grained Classification | Application Payload | Machine Learning/Stat (Product Distribution; Markov Processes; CSG) |  |  | [D-40][D-41][D-42]
22 | On Inferring Application Protocol Behaviors in Encrypted Network Traffic | 2006 | Journal of Machine Learning Research | C. Wright, F. Monrose, G. Masson | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (HMM) | achieve greater than 90% for several protocols in aggregate traffic |  | [D-43]
23 | Traffic Classification on the fly | 2006 | SIGCOMM CCR | L. Bernaille, R. Teixeira, I. Akodjenou, A. Soule, K. Salamatian | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (K-Means) |  |  | [D-44]
24 | Blind Application Recognition Through Behavioral Classification | 2005 |  | L. Bernaille, A. Soule, M. Jeannin, K. Salamatian | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (HMM) |  |  | 
25 | Early Application Identification | 2006 | CONEXT | L. Bernaille, R. Teixeira, K. Salamatian | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (K-Means, GMM, Spectral Clustering) | classify known applications with an accuracy over 90%; identify new applications as unknown with a probability of 60% |  | [D-15][D-16][D-17][D-18][D-19][D-20]
26 | Early Recognition of Encrypted Application | 2007 | PAM | L. Bernaille, R. Teixeira | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat | more than 85% accuracy in recognizing the application in an SSL connection |  | [D-13][D-14][D-20]
27 | Traffic Classification using a Statistical Approach | 2005 | PAM | D. Zuev, A. Moore | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (Bayes) | achieve better than 83% accuracy on both a per-byte and a per-packet basis |  | [D-48]
28 | Appmon: An Application for Accurate per Application Network Traffic Characterization | 2006 | BroadBand Europe | D. Antoniades, M. Polychronakis, S. Antonatos, E. Markatos, S. Ubik | Tools |  |  |  |  |  | 
29 | Profiling the End Host | 2007 | PAM | T. Karagiannis, K. Papagiannaki, N. Taft, M. Faloutsos | Profiling/Analysis |  |  |  |  |  | [D-21][D-22]
30 | Revealing Skype Traffic: when randomness plays with you | 2007 | SIGCOMM | D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, P. Tofanelli | Profiling/Analysis |  |  |  |  |  | [D-24][D-25]
31 | A Traffic Characterization of Popular on-line Games | 2005 | IEEE/ACM Transactions on Networking | W. Chang Feng, F. Chang, W. chi Feng, J. Walpole | Profiling/Analysis |  |  |  |  |  | [D-52]
32 | Hit-list worm detection and bot identification in large networks using protocol graphs | 2007 | RAID | M. Collins, M. Reiter | Profiling/Analysis |  |  |  |  |  | [D-23]
33 | Identifying Known and Unknown Peer-to-Peer Traffic | 2006 | IEEE NCA | F. Constantinou, P. Mavrommatis |  |  |  |  |  |  | 
34 | Offline/realtime Traffic Classification Using Semi-supervised Learning | 2007 | Perform. Eval | J. Erman, A. Mahanti, M. Arlitt, I. Cohen, C. Williamson | Methodology | Coarse-grained Classification |  | Machine Learning/Stat (K-Means) | 37.4% (Bytes, P2P, Campus); 80% (Bytes, P2P, Residential); 61.5% (Bytes, P2P, WLAN) | 37.4% (Bytes, Campus); 80% (Bytes, Residential); 61.5% (Bytes, WLAN) | [D-26]
35 | Identifying and Discriminating between Web and Peer-to-Peer Traffic in the Network Core | 2007 | WWW | J. Erman, A. Mahanti, M. Arlitt, C. Williamson | Methodology | Coarse-grained Classification | Flow Characteristics | Machine Learning/Stat (K-Means) |  | 38.3% | [D-27]
36 | Acas: Automated Construction of Application Signatures | 2005 | SIGCOMM | P. Haffner, S. Sen, O. Spatscheck, D. Wang | Methodology | Fine-grained Classification | Application Payload (64 bytes) | Machine Learning/Stat (Naive Bayes, AdaBoost, Maximum Entropy) |  |  | [D-53]
37 | Comparison of Internet Traffic Classification Tools | 2007 | IMRG WACI | H. Kim, K. Claffy, M. Fomenkova, N. Brownlee, D. Barman, M. Faloutsos | Survey/Compare |  |  |  |  |  | [D-28][D-29][D-30][D-31][D-32][D-33][D-34]
38 | A Survey of Techniques for Internet Traffic Classification Using Machine Learning | 2008 | IEEE Communications Surveys and Tutorials | T. Nguyen, G. Armitage | Survey/Compare |  |  |  |  |  | 
39 | Towards Automated Application Signature Generation | 2008 | NOMS | B. Park, Y. Won, M. Kim, J. Hong | Methodology | Fine-grained Classification | Application Payload | Heuristics |  |  | [D-1]
40 | Analyzing Peer-to-Peer Traffic across Large networks | 2004 | IEEE/ACM Transactions on Networking | S. Sen, J. Wang | Profiling/Analysis |  |  |  |  |  | [D-73]
41 | Role Classification of Hosts within Enterprise Networks based on Connection Patterns | 2003 | USENIX | G. Tan, M. Poletto, J. Guttag, F. Kaashoek | Methodology |  | Flow Characteristics | Heuristics |  |  | [D-64][D-65]
42 | Self-learning IP Traffic Classification based on Statistical Flow Characteristics | 2005 | PAM | S. Zander, T. Nguyen, G. Armitage | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (EM) |  |  | [D-36]
43 | Tunnel Hunter: Detecting Application-Layer Tunnels with Statistical Fingerprinting | 2008 | Computer Networks | M. Dusi, M. Crotti, F. Gringoli, L. Salgarelli | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat |  |  | [D-2]
44 | A Preliminary Look at the Privacy of SSH Tunnels | 2008 | ICC | M. Dusi, F. Gringoli, L. Salgarelli | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (GMM) |  |  | [D-3]
45 | A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection | 2003 | SIAM | A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava | Survey/Compare |  |  |  |  |  | [D-63]
46 | Automatically Inferring Patterns of Resources Consumption in Network Traffic | 2003 | SIGCOMM | C. Estan, S. Savage, G. Varghese | Methodology | Coarse-grained Classification | Flow Characteristics | Heuristics |  |  | [D-60][D-61][D-62]
47 | Mining Anomalies Using Traffic Feature Distributions | 2005 | SIGCOMM | A. Lakhina, M. Crovella, C. Diot | Methodology | Coarse-grained Classification |  | Heuristics (K-Means, Hierarchical Agglomerative Algorithm) |  |  | [D-54][D-55]
48 | Network Traffic Analysis using Traffic Dispersion Graphs (TDGs): Techniques and Hardware Implementation | 2007 | Technical Report | M. Iliofotou, P. Pappu, M. Faloutsos, M. Mitzenmacher, S. Singh, G. Varghese | Tools |  |  |  |  |  | [D-35][D-36][D-37][D-38]
49 | Heuristics to Classify Internet Backbone Traffic based on Connection Patterns | 2008 | ICOIN | W. John, S. Tafvelin | Methodology | Coarse-grained Classification | Flow Characteristics | Heuristic | leave only 0.2% of the data unclassified; can identify 95% of P2P flows | 42% (average) in connections, 79% (average) in traffic | [D-4]
50 | Trends and Differences in Connections Behavior within Classes of Internet Backbone Traffic | 2008 | PAM | W. John, S. Tafvelin, T. Olovsson | Profiling/Analysis |  |  |  | P2P and HTTP traffic exhibit different peak times. P2P traffic was found to be clearly dominating with 90% of the transfer volumes, especially during evening and night times. In contrast, HTTP traffic has its main activities (9% of the data volumes) during office hours. | 93% (evening); 91% (night); 86% (office hours) | [D-4][D-5]
51 | Flow Analysis of Internet Traffic: World Wide Web versus Peer-to-Peer | 2005 | Systems and Computers in Japan | M. Perenyi, T. Dang, A. Gefferth, S. Molnar | Profiling/Analysis |  |  |  | 57.52% for WWW, 21.53% for P2P, 20.95% for other | 21.53% | [D-56]
52 | Identification and Analysis of Peer-to-Peer Traffic | 2006 | Journal of Communications | M. Perenyi, T. Dang, A. Gefferth, S. Molnar | Methodology | Fine-grained Classification | Flow Characteristics | Heuristics |  | 60%-80% | [D-45]
53 | Analysis of Peer-to-Peer Traffic on ADSL | 2005 | PAM | L. Plissonneau, J. Costeux, P. Brown | Profiling/Analysis |  |  |  | 40% of connections are only connection reattempts, and it concerns about 30% of peers | 60% in 2004, 65% in 2003 (lies on P2P ports) | [D-57]
54 | Analysis of Internet Backbone Traffic and Header Anomalies Observed | 2007 | IMC | W. John, S. Tafvelin | Profiling/Analysis |  |  |  |  |  | [D-4]
55 | Flow Classification by Histograms or How to Go on Safari in the Internet | 2004 | SIGMETRICS | A. Soule, K. Salamatian, N. Taft, R. Emilion, K. Papagiannaki | Methodology | Coarse-grained Classification | Flow Characteristics | Machine Learning/Stat (EM) |  |  | [D-74]
56 | Streaming Video Traffic: Characterization and Network Impact | 2002 | WCW | J. Merwe, S. Sen, C. Kalmanek | Profiling/Analysis |  |  |  |  |  | [D-59]
57 | The architecture of CoralReef: an Internet traffic monitoring software suite | 2001 | PAM | K. Keys, D. Moore, R. Koga, E. Lagache, M. Tesch, K. Claffy | Tools |  |  |  |  |  | 
58 | A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification | 2006 | SIGCOMM | N. Williams, S. Zander, G. Armitage | Survey/Compare |  |  |  |  |  | [D-36][D-46][D-47]
59 | Internet Traffic Identification using Machine Learning | 2006 | GLOBECOM | J. Erman, A. Mahanti, M. Arlitt | Methodology | Fine-grained Classification | Flow Characteristics | Machine Learning/Stat (EM, Bayes) | 81.2% http; 3.1% smtp; 2.0% dns |  | [D-36]
60 | Estimating P2P Traffic Volume at USC | 2007 |  | G. Bartlett, J. Heidemann, C. Papadopoulos, J. Pepin | Profiling/Analysis |  | Port Number and Flow Characteristics | Exact Matching and Heuristics | 3%-13% of active hosts on campus participate in P2P | 21%-33% (less than, Byte) | [D-75]
61 | A Longitudinal Study of P2P Traffic Classification | 2006 | MASCOTS | A. Madhukar, C. Williamson | Survey/Compare |  |  |  | 30%-70% of the campus Internet traffic for 2003-2005 was P2P | 30%-70% | [D-76]
62 | On the Validation of Traffic Classification Algorithms | 2008 | PAM | G. Szabo, D. Orincsay, S. Malomsoky and I. Szabo | Other | Fine-grained Classification, for validating classification methods |  |  | P2P: 70%; Web: 26%; VoIP: 2%; Streaming: 1%; Secure Channel: 1% | 70% | [D-78]
63 | Accurate Traffic Classification | 2007 | WoWMoM | G. Szabo, I. Szabo and D. Orincsay | Methodology | Fine-grained Classification |  |  |  |  | [D-79]
64 | Observing Slow Crustal Movement in Residential User Traffic | 2008 | ACM CoNEXT | Kenjiro Cho, Kensuke Fukuda, Hiroshi Esaki and Akira Kato | Analysis |  | Port Number |  | Between May 2005 and May 2008, the average annual growth rate of (A1) RBB customers: 26% for inbound; 28% for outbound; 27% for the combined volume | 0.92% (gnutella, 2005); 0.25% (bittorrent, 2005); 0.12% (edonkey, 2005); 0.94% (gnutella, 2008); 0.22% (bittorrent, 2008); 0.13% (edonkey, 2008) | [D-80][D-81]
65 | Classification of Network Traffic via Packet-Level Hidden Markov Models | 2008 | GLOBECOM | Alberto Dainotti, Walter de Donato, Antonio Pescape and Pierluigi Salvo Rossi | Methodology |  | Flow Characteristics | Machine Learning (HMM) |  |  | [D-82][D-83]
66 | TIE: a Community-oriented Traffic Classification Platform | 2009 | IMA | Alberto Dainotti, Walter de Donato, Antonio Pescape and Giorgio Ventre | Tools |  |  |  |  |  | 
67 | Challenging Statistical Classification for Operational Usage: the ADSL Case | 2009 | IMC | Marcin Pietrzyk, Jean-Laurent, Guillaume Urvoy-Keller and Taoufik | Analysis |  | TCP Flow Characteristics | Machine Learning/Stat | Most bytes and flows are due to eDonkey; the vast majority of traffic in the HTTP Streaming class is due to Dailymotion and Youtube | 5%-40% (flows); 10%-50% (bytes) | [D-84]
68 | On Dominant Characteristics of Residential Broadband Internet Traffic | 2009 | IMC | Gregor Maier, Anjia Feldmann, Vern Paxson and Mark Allman | Analysis |  | Port Number and Application Payload | Exact Matching | HTTP carries more than 50% traffic; Flash Video contributes 25% of all HTTP traffic | 14% (bytes) | [D-85][D-86]

Datasets

Several public and private passive measurement infrastructures have provided a variety of different datasets for Internet traffic classification studies, which we group into four categories:

  • Packet-based: packet-level traces, captured by hardware or software. Endace DAG cards are often used for packet capture on high-bandwidth links (CAIDA uses these cards for its OC-192 backbone trace capture: equinix-chicago, equinix-sanjose). Other capture hardware used over the years includes ATM FORE, OC12 or OC13 POINT ATM, Napatech, and INVEA-TECH. DAG cards can capture traffic on links of up to 10 Gbps with less than 15 ns timestamp resolution. Most software tools for capturing packets build on tcpdump/libpcap; CoralReef and Appmon are based on libpcap. Other packet sniffers and network analyzers are also available.
  • SNMP-based: traffic counters and statistics obtained from network devices through the SNMP and RMON MIBs.
  • Flow-based: flow-level descriptions of a traffic stream (Cisco NetFlow, Juniper cflowd, Foundry sFlow, Huawei NetStream).
  • Other: data not covered above, such as application-level session logs from web sites.
The data come from three types of capture environments: Intranet, Edge/Border, and Backbone.
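To make the difference between packet-based and flow-based data concrete, the sketch below reduces packet records to flow records keyed by the usual 5-tuple. The `Packet` record and `aggregate_flows` function are hypothetical names used only for illustration. The sketch treats each direction as a separate flow and omits the idle and active timeouts that flow exporters such as NetFlow apply.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet record, e.g. parsed from a libpcap trace."""
    timestamp: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    length: int


FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, float]]:
    """Group packets by 5-tuple and keep per-flow counters and timestamps."""
    flows: Dict[FlowKey, Dict[str, float]] = {}
    for pkt in packets:
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)
        flow = flows.setdefault(
            key,
            {"packets": 0, "bytes": 0,
             "first_seen": pkt.timestamp, "last_seen": pkt.timestamp},
        )
        flow["packets"] += 1
        flow["bytes"] += pkt.length
        flow["first_seen"] = min(flow["first_seen"], pkt.timestamp)
        flow["last_seen"] = max(flow["last_seen"], pkt.timestamp)
    return flows


packets = [
    Packet(0.00, "10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 60),
    Packet(0.05, "10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 1500),
    Packet(0.10, "10.0.0.3", "10.0.0.4", 5000, 53, "UDP", 80),
]
flows = aggregate_flows(packets)
```

Flow records of this kind carry no payload, which is why the papers that rely on flow-level data use flow characteristics rather than payload signatures.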


ID | Name | Year | Link Type | Capture Environments | Geographic Location | Payload | Size and Length | PaperID
1 | POSTECH-2007 | 2007 | Academic | Backbone | Asia | Yes(full) | 3 h; 450 Gbytes | [P-39]
2 | TunnelHunter-2007 | 2007 | Academic | Edge/Border | Europe | Yes(part) |  | [P-43]
3 | TunnelSSH-2006 | 2006 | Academic | Edge/Border | Europe | Yes(full) | 0.25 h each hour for three weeks; 50 Gbytes | [P-44]
4 | SUNET1-2006 | 2006 | Academic | Backbone | Europe | No | 0.33*4 h each day for 20 days | [P-49][P-50][P-54]
5 | SUNET2-2006 | 2006 | Academic | Backbone | Europe | No | 276 randomized times (10 mins) during 80 days | [P-50]
6 | LosNettos-2005 | 2005 | Academic and Commercial | Backbone | North America |  | 24 h, 08/31/2005 | [P-17]
7 | LosNettos-2006 | 2006 | Academic and Commercial | Backbone | North America |  | 24 h, 10/03/2006 | [P-17]
8 | CAIDA-OC48-2003 | 2003 |  | Backbone | North America | No | 1 h; 95 Gbytes | [P-18]
9 | Abilene-ABIL-2004 | 2004 | Academic | Backbone | North America | No | 1 h; 714 Gbytes | [P-18]
10 | PAIX-PAY1-2004 | 2004 |  | Backbone | North America | Yes(16 bytes) | 1 h; 435 Gbytes | [P-18]
11 | PAIX-PAY2-2004 | 2004 |  | Backbone | North America | Yes(16 bytes) | 1 h; 374 Gbytes | [P-18]
12 | UNIBS- |  | Academic | Edge/Border | Europe | Yes(part) |  | [P-19]
13 | Paris6-2004 | 2004 | Academic | Edge/Border | Europe | Yes | 1 h | [P-26]
14 | Paris6-2006 | 2006 | Academic | Edge/Border | Europe | Yes | 1 h | [P-26]
15 | Paris6-2004-2005 | 2004-2005 | Academic | Edge/Border | Europe | Yes | 1*3 h; 27 Gbytes; 35 Gbytes; 300 Gbytes | [P-25]
16 | College-2003 | 2003 | Academic |  | Europe | No | 15 mins; 900 Mbytes | [P-25]
17 | ADSL-2004 | 2004 |  |  | None | No | 15 mins; 2.3 Gbytes | [P-25]
18 | WirelessCrawdad-2003 | 2003 | Academic |  | Europe | No | 5 h 30 mins; 330 Mbytes | [P-25]
19 | Enter- |  | Commercial | Edge/Border | None | Yes | 1 h 20 mins; 300 Mbytes | [P-25]
20 | UMass-2005 | 2005 | Academic | Edge/Border | North America | Yes(4 bytes) |  | [P-25][P-26]
21 | MicroResearch1-2005 | 2005 | Commercial | Edge/Border | North America | Yes | 1 month | [P-29]
22 | MicroResearch2-2005 | 2005 | Commercial | Edge/Border | North America | Yes | 2 weeks | [P-29]
23 | CISCO- |  |  | Backbone(/8) |  | No |  | [P-32]
24 | Polito-Academic-2006 | 2006 | Academic |  | Europe |  | 95 h | [P-30]
25 | Polito-ISP-2006 | 2006 |  | Backbone | Europe |  | 24 h | [P-30]
26 | Calgary1-2006 | 2006 | Academic and Commercial |  | North America | Yes(full) | 1*48 h over 6 months | [P-34]
27 | Calgary2-2006 | 2006 | Academic |  | North America | Yes(full) | 1*8 h over 4 days | [P-35]
28 | PAIX-I | 2004 | Commercial | Backbone | North America | Yes(16 bytes) | 2 h; 91 Gbytes | [P-37]
29 | PAIX-II | 2004 | Commercial | Backbone | North America | Yes(16 bytes) | 2 h 2 mins; 891 Gbytes | [P-37]
30 | WIDE-1 | 2006 |  | Backbone | Oceania | Yes(40 bytes) | 55 mins; 14 Gbytes | [P-37]
31 | KEIO-I | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 30 mins; 16 Gbytes | [P-37]
32 | KEIO-II | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 30 mins; 16 Gbytes | [P-37]
33 | KAIST-I | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 48 h 12 mins; 506 Gbytes | [P-37]
34 | KAIST-II | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 21 h 16 mins; 259 Gbytes | [P-37]
35 | WIDE-2 | 2006 |  | Backbone | Oceania |  | 2 h | [P-48]
36 | AUCK- | 2001/2003 |  | Edge/Border | Oceania |  | 1 h / 3 days | [P-20][P-42][P-48][P-58][P-59]
37 | OC48-2003 | 2003 |  | Backbone | North America |  | 1 h 2 mins | [P-48]
38 | UCSD-honeypot |  | Academic | Intranet | North America |  | 5 mins | [P-48]
39 | Calgary3-2006 | 2006 | Academic | Edge/Border | North America | Yes(full) | 1 h | [P-20]
40 | Cambridge-2003 | 2003 | Academic |  | Europe |  | 24 h | [P-21]
41 | Wireless-2006 | 2006 | Academic | Intranet | North America |  | 5 days | [P-21]
42 | UCSDDepart-2006 | 2006 | Academic | Backbone | North America |  | 1 h | [P-21]
43 | GMU-2003 | 2003 | Academic |  | North America |  | 10 mins of each quarter hour over 2 months | [P-22]
44 | PMC- |  | Academic | Edge/Border | None | Yes | 1 hour | [P-23]
45 | Callrecords-2005 | 2005 | Commercial | Edge/Border | Europe |  | logs | [P-52]
46 | Leipzig-II-20030221 | 2003 |  |  |  |  | 1 h | [P-58]
47 | Nzix-II-2000 | 2000 |  |  |  |  | 1 h | [P-58]
48 | Genome Academic |  | Academic | Edge/Border | Europe | Yes(full) | 24 h / 43.9 h; 268 Gbytes / 495 Gbytes | [P-1][P-3][P-10][P-15][P-27]
49 | Tier1ISP- |  |  | Backbone | North America | No | 24 h / 3 h; 11-98 Gbytes | [P-8]
50 | University-weekday | 2004 | Academic |  | None |  | 24.6 h; 1223 Gbytes | [P-10]
51 | University-weekend | 2004 | Academic |  | None |  | 33.6 h; 1652 Gbytes | [P-10]
52 | Mshmro-2002 | 2002 | Commercial |  | None | Yes | 7 days; 60 Gbytes | [P-31]
53 | Accessnetwork-2004/5 | 2004-2005 |  |  | None |  | 1/4/8 h; 100 Gbytes | [P-36]
54 | Abilene-2003 | 2003 | Academic | Backbone | North America | No | 20 days (1/100 pkts) | [P-47]
55 | Geant-2004 | 2004 |  | Backbone | Europe | No | 23 days (1/1000 pkts) | [P-47]
56 | Waseda-2002 | 2002 | Academic | Edge/Border | Asia | No | weekday nights over 1 month | [P-51]
57 | ADSL-2002/3 | 2003-2004 |  | Backbone | None |  | weekdays and weekend days of Sep 2004 and June 2003 | [P-53]
58 | WebServer- |  |  |  | None |  | 10,000 web servers for testing purposes | [P-16]
59 | StreamingLogs | 2001 | Commercial |  | None |  |  | [P-4][P-56]
60 | UCSD-NAP-2002 | 2002 | Commercial |  | North America |  | 31 days | [P-46]
61 | Research-2002 | 2002 | Academic | Edge/Border | None |  | 39 days (roughly 15,000 hosts) | [P-46]
62 | OC48-2001 | 2001 |  | Backbone | None |  | 8 h | [P-46]
63 | DARPA-1998 | 1998 |  |  |  |  | 2 and 7 weeks of network-based attacks | [P-45]
64 | Mazu- |  | Commercial |  | North America |  |  | [P-41]
65 | BigComany- |  | Commercial |  | None |  |  | [P-41]
66 | Tier1-multi | 2001-2002 |  | Backbone | None |  | 1 h - 6 days | [P-13]
67 | Saarland-2002 | 2002 | Academic |  | Europe | Yes | 8 days; 950 Gbytes | [P-7]
68 | Gigascope | 2003/2004 |  |  |  |  |  | [P-4]
69 | CAIDA-OC48-2002/4 | 2002-2004 |  | Backbone | North America | Yes(4 bytes) | 1 h | [P-5]
70 | CAIDA-OC48-2003/4 | 2003-2004 |  | Backbone | North America | Yes(part) | 1-2 h | [P-6]
71 | InternetAccessTrace-2003 | 2003 |  |  | None | Yes(full) | 24 h and 18 h; 120 Gbytes | [P-9]
72 | VPN-2003 | 2003 |  |  | None | Yes(full) | 6 days; 1.8 Tbytes | [P-9]
73 | MultiRouter-2001 | 2001 |  | Backbone | None |  | 8000 million flow-level records | [P-40]
74 | Tier1ISP-OC12-2001 | 2001 |  | Backbone | North America | No | 3.5 days | [P-55]
75 | USC-2006 | 2006 |  | Edge/Border | North America | No | 14 hour period | [P-60]
76 | USC-2006 | 2003-4 |  | Edge/Border | North America | No | 2 years | [P-61]
77 | POPs- |  |  | Backbone | North America | No |  | [P-12]
78 | AccessNetwork- |  |  | Intranet | None | No | 43 hours; 6 Gbytes | [P-62]
79 | MobileNetwork- |  |  |  | Europe and Asia | Yes(full) |  | [P-63]
80 | JanpanISPSNMP- | 2004-2008 | Commercial | Backbone | Asia |  | aggregated SNMP data (month-long) from 6 ISPs | [P-64]
81 | JanpanISPNetFlow- | 2005, 2008 | Commercial | Backbone | Asia |  | sampled NetFlow data from 1 ISP | [P-64]
82 | NapleItaly |  | Academic | Backbone | Europe |  | generated by a set of controlled boxes | [P-65]
83 | WPIUSA |  | Academic | Backbone | North America |  |  | [P-65]
84 | ADSLPoPFrance | 2006, 2008 | Commercial |  | Europe | Yes(full) | 1-2 h; 26-60 Gbytes | [P-67]
85 | ISPEuropean | 2008, 2009 | Commercial |  | Europe |  | 2*24 h, 14*90 mins; >4 Tbytes, 100-600 Gbytes | [P-68]
86 | DSLSession | 2009 | Commercial |  | Europe |  | 10 days, 6*24 h (DSL session) | [P-68]

Discussion

P2P traffic is one of the most challenging traffic types to classify, partly due to substantial legal interest in identifying it and even more substantial negative repercussions to the user if P2P traffic is accurately identified. The misaligned incentives between those who want to use and those who want to identify P2P applications, together with the tremendous legal and privacy constraints against traffic research, render scientific study of this question nearly impossible. Even if such study were possible, wide variation across links would prevent a simple numeric answer to the question of how much P2P traffic there is on the Internet. But our taxonomy does reveal insights: the fraction of peer-to-peer file sharing traffic observed ranges from 1.2% to 93% across the 18 papers that provide such numbers. We also know that the average fractions reported increased considerably from 2002 to 2006 (Table 1). Tables 2 and 3 show that results also vary widely by link and geographic location. Table 3 suggests that P2P is more popular in Europe, probably due to stricter policies (MPAA and RIAA) in North America. Note that the Asian results are from Japanese data sets, in which the 1.34% and 1.29% figures are based on port numbers and are therefore likely to significantly underestimate the fraction of P2P traffic. Furthermore, the amount of P2P traffic also varies by time of day, with higher fractions at night (Table 4).

One study [34] suggests that peer-to-peer applications are used more often at home than in the office. Finally, a European study [50] found a higher fraction of P2P traffic on a university link than Canadian researchers [34] found on their campus.

Some numbers are based on statistical or host-behavioral classification. The remaining numbers are based on P2P detection via payload signature matching, which is the most reliable method of detecting an application (if unencrypted) but is fraught with legal and privacy issues.

Year | Range | PaperID
2002 | 21.5% | [51]
2004 | 9.19-60% | [5],[6],[10],[18],[53]
2006 | 35.1-93% | [20],[34],[35],[50]
Table 1. P2P Range (Year).
 
Year | Link Location | Range | PaperID
2004 | Campus link | 31.3% | [10]
2004 | ADSL link | 60% | [53]
2004 | Backbone link | 9-14% | [5],[18]
 |  | 17-25% | [6]
Table 2. P2P Range (Link Location).
Geographic Location | Year | P2P Range | PaperID
Europe | 2005 | 60-80% | [52]
 | 2006 | 79-93% | [49],[50]
North America | 2003 | 8%, 10.7% | [5]
 | 2004 | 14%, 9.9% | [5]
 | 2003-04 | 9.19-70% | [6],[18],[61]
 | 2006 | 21-35.1% | [20],[34],[35]
Asia | 2002 | 21.53% | [51]
 | 2005 | 1.34% (port-based) | [64]
 | 2008 | 1.29% (port-based) | [64]
Table 3. P2P Range (Geographic).
 
Year | Time | Range | DataID | PaperID
2006 | midnight to 10am | 80% | [D-26] | [34]
 | 9am to 10am | 61.5% |  | 
2006 | evening | 93% | [D-4],[D-5] | [50]
 | night | 91% |  | 
 | office hours | 86% |  | 
Table 4. P2P Range (Time).
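The kind of meta-analysis behind these tables can be sketched in a few lines. The example below transcribes the values of Table 3 and recomputes the lowest and highest reported P2P fraction per region. The `REPORTS` list and `range_by_region` function are hypothetical names introduced for this illustration, and the sketch ignores the differences in classification method and measurement unit (bytes, flows, connections) between papers.

```python
from collections import defaultdict

# (region, year label, reported P2P percentages), transcribed from Table 3.
REPORTS = [
    ("Europe", "2005", [60.0, 80.0]),
    ("Europe", "2006", [79.0, 93.0]),
    ("North America", "2003", [8.0, 10.7]),
    ("North America", "2004", [14.0, 9.9]),
    ("North America", "2003-04", [9.19, 70.0]),
    ("North America", "2006", [21.0, 35.1]),
    ("Asia", "2002", [21.53]),
    ("Asia", "2005", [1.34]),  # port-based, likely an underestimate
    ("Asia", "2008", [1.29]),  # port-based, likely an underestimate
]


def range_by_region(reports):
    """Return {region: (lowest, highest)} over all reported percentages."""
    values = defaultdict(list)
    for region, _year, percentages in reports:
        values[region].extend(percentages)
    return {region: (min(v), max(v)) for region, v in values.items()}


ranges = range_by_region(REPORTS)
for region, (low, high) in ranges.items():
    print(f"{region}: {low}% to {high}%")
```

The width of each range illustrates why a single number for "the" P2P share is not meaningful without naming the link, year, and classification technique.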

UDP Traffic Analysis

It is still commonly assumed that Internet traffic is dominated by TCP, an assumption that also underlies most current traffic classification work. However, the rise of new streaming applications (e.g., IPTV such as PPStream and PPLive) and new P2P protocols (e.g., uTP) that try to avoid traffic shaping is expected to increase the use of UDP as a transport protocol.

In this section, we collect UDP measurements from existing work and then compare the usage of UDP and TCP on several traffic traces collected in different network and geographic locations, as well as in different time periods.

Table 5 shows that the UDP/TCP ratio reported in existing work ranges from 0.01 to 0.20, with the highest value observed in a residential trace. To better evaluate the amount of UDP and TCP traffic in real traces (in terms of flows, packets, and bytes), we analyzed several available traces collected between 2002 and 2009 on several backbone links located in the US and Sweden. Table 6 shows that the use of UDP as a transport protocol increased rapidly from 2002 to 2009, although TCP sessions are still responsible for most packets and bytes. In terms of flows, however, UDP turns out to be the dominant transport protocol.

PaperID | Year | UDP/TCP Ratio (pkts / bytes / flows) | Notes
P-1 | around 2003 | 0.01, 0.01 | 
 | 2006 | 0.11, 0.05 | 
 |  | 0.02, 0.04 | WLAN Trace
 |  | 0.20, 0.20 | 10-hour Residential Trace
 | 2006 | 1.12 | 
 | 2005 | 0.01 | 
 | 2008 | 0.02 | 
Table 5. Values of UDP/TCP Ratio (from papers).
 
Trace | Sample | UDP/TCP Ratio (pkts) | UDP/TCP Ratio (bytes) | UDP/TCP Ratio (flows) | Total IP Traffic (pkts/bytes/flows)
 | 08-2002 | 0.11 | 0.03 | 0.11 | (1371M/838GB/79M)
 | 01-2003 | 0.12 | 0.05 | 0.27 | (463M/267GB/26M)
GigaSUNET | 04-2006 | 0.06 | 0.02 | 1.06 | (422M/294GB/9M)
 | 11-2006 | 0.08 | 0.03 | 1.45 | 
 | 06-2008 | 0.14 | 0.05 | 1.43 | (4427M/2279GB/197M)
 | 02-2009 | 0.19 | 0.07 | 2.34 | (1922M/1410GB/110M)
OptoSUNET | 01-2009 | 0.21 | 0.11 | 3.09 | (1100M/657GB/41M)
 | 02-2009 | 0.20 | 0.11 | 2.63 | 
Table 6. Values of UDP/TCP Ratio (from real-traces).
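For readers who want to reproduce ratios of this kind on their own traces, the following minimal sketch computes the UDP/TCP ratio in packets, bytes, and flows from per-flow records. The input format and the `udp_tcp_ratios` function are assumptions made for this example and do not describe the tooling used to produce Table 6.

```python
def udp_tcp_ratios(flows):
    """Compute UDP/TCP ratios from (protocol, packets, bytes) flow records.

    Records for other protocols are ignored. A ratio is NaN when the
    trace contains no TCP traffic for that unit.
    """
    totals = {"TCP": [0, 0, 0], "UDP": [0, 0, 0]}  # [pkts, bytes, flows]
    for protocol, packets, nbytes in flows:
        if protocol in totals:
            counters = totals[protocol]
            counters[0] += packets
            counters[1] += nbytes
            counters[2] += 1
    tcp, udp = totals["TCP"], totals["UDP"]
    return {
        unit: (udp[i] / tcp[i] if tcp[i] else float("nan"))
        for i, unit in enumerate(("pkts", "bytes", "flows"))
    }


# Made-up sample: two large TCP flows and four small UDP flows.
sample = [
    ("TCP", 1000, 1_000_000),
    ("TCP", 1000, 1_000_000),
    ("UDP", 50, 10_000),
    ("UDP", 50, 10_000),
    ("UDP", 50, 10_000),
    ("UDP", 50, 10_000),
    ("ICMP", 2, 128),
]
ratios = udp_tcp_ratios(sample)
```

In this made-up sample the flow ratio exceeds 1 while the packet and byte ratios stay small, the same pattern Table 6 shows for the 2006-2009 traces: many short UDP flows alongside fewer, much larger TCP flows.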

Conclusion

This overview page presented a rough taxonomy of traffic classification approaches, based on features, methods, goals and data sets.

Our survey also reveals shortcomings in current traffic classification efforts. The variety of data sets used does not allow systematic comparison of methods. Few research groups (can) share their datasets. As was already true ten years ago, the field of traffic classification research still needs publicly available, modern data sets as reference data for validating approaches. The poor comparability of results is further amplified by the lack of standardized measures and classification goals. For example, there exists no clear definition for traffic classes such as P2P or file-sharing.
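The effect of non-standardized measures can be seen in a small worked example. The sketch below scores the same hypothetical classification result by flows and by bytes. The function name and the sample data are invented for illustration and are not drawn from any surveyed paper.

```python
def flow_and_byte_accuracy(results):
    """Score (true_label, predicted_label, byte_count) records two ways.

    Returns (flow_accuracy, byte_accuracy): the fraction of flows, and
    the fraction of bytes, that received the correct label.
    """
    total_flows = correct_flows = total_bytes = correct_bytes = 0
    for true_label, predicted_label, nbytes in results:
        total_flows += 1
        total_bytes += nbytes
        if true_label == predicted_label:
            correct_flows += 1
            correct_bytes += nbytes
    if total_flows == 0 or total_bytes == 0:
        raise ValueError("need at least one flow with a non-zero byte count")
    return correct_flows / total_flows, correct_bytes / total_bytes


# Nine small web flows labeled correctly, one large P2P flow mislabeled.
results = [("WWW", "WWW", 1_000)] * 9 + [("P2P", "WWW", 991_000)]
flow_acc, byte_acc = flow_and_byte_accuracy(results)
```

Here the same classifier output scores 90% by flows but under 1% by bytes, so accuracy figures from different papers are only comparable when the unit of measurement is stated.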

However, the taxonomy above allows meta-analyses of relevant open questions, such as trends and development of traffic classes or features, yielding new insights into Internet traffic. We showed this by shedding some light on questions such as: "how much of modern Internet traffic is P2P?" Though we found some trends and indications, we have far too little data available to make conclusive claims beyond "there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."

Acknowledgements

This work was made possible thanks to funding from DHS-PREDICT, the National Science Foundation, Beijing Jiaotong University, and the China Scholarship Council.

  Last Modified: Tue Oct-13-2020 22:21:54 UTC
  Page URL: https://www.caida.org/research/traffic-analysis/classification-overview/index.xml