Internet Traffic Classification

Internet traffic classification continues to attract attention as more applications that use obfuscation techniques emerge on the Internet. Related papers tend to classify whatever traffic samples a researcher can find, with no systematic integration of results. To fill this gap, we have created a structured taxonomy of traffic classification papers and their data sets. We also hope to reveal issues and challenges in traffic classification.

Introduction

The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The field of Internet traffic classification research includes many papers representing various attempts to classify whatever traffic samples a given researcher has access to, with no systematic integration of results. Here we provide a rough taxonomy of papers, and explain some issues and challenges in traffic classification.

Application Trends

Many media-rich entertainment applications have emerged on the Internet. They often use obfuscation techniques such as encrypted data transmission, random or changing ports, or proprietary communication protocols to prevent detection or filtering by network or content owners who believe the traffic threatens their (infrastructural or intellectual) property. Other applications, e.g., PPStream, uTorrent, and PPLive, use UDP in place of TCP. The rapidly changing nature of applications, and even differences between versions of the same application, presents a challenge for traffic classification techniques.

Definitions

We use the phrase traffic classification to describe methods of classifying traffic based on features passively observed in the traffic, and according to specific classification goals. One might have only a coarse classification goal, e.g., whether the traffic is transaction-oriented, bulk-transfer, or peer-to-peer file sharing. Or one might have a finer-grained classification goal, e.g., the exact application represented by the traffic. Traffic features could include the port number, the application payload, or the temporal, packet size, and addressing characteristics of the traffic. Classification methods include exact matching (e.g., of port number or payload), heuristics, and machine learning (statistics).
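As a rough illustration of exact matching on port number with a coarse classification goal, the sketch below maps well-known ports to the three coarse classes named above. The `Flow` type, the `PORT_TO_CLASS` table, and the function name are hypothetical and chosen only for this example; the port assignments are commonly used defaults, not a table taken from any of the surveyed papers.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """Minimal flow record (hypothetical; only the fields this sketch needs)."""
    src_port: int
    dst_port: int
    protocol: str  # "tcp" or "udp"


# Illustrative port-to-class table (an assumption for this sketch).
# Real classifiers use much larger tables and still miss
# applications that use random or changing ports.
PORT_TO_CLASS = {
    53: "transaction-oriented",            # DNS
    80: "transaction-oriented",            # HTTP
    20: "bulk-transfer",                   # FTP data
    21: "bulk-transfer",                   # FTP control
    6881: "peer-to-peer file sharing",     # traditional BitTorrent default
    4662: "peer-to-peer file sharing",     # traditional eDonkey default
}


def classify_by_port(flow: Flow) -> str:
    """Return a coarse class by exact match on either port, else 'unknown'."""
    for port in (flow.dst_port, flow.src_port):
        label = PORT_TO_CLASS.get(port)
        if label is not None:
            return label
    return "unknown"


if __name__ == "__main__":
    print(classify_by_port(Flow(51234, 21, "tcp")))
    print(classify_by_port(Flow(50000, 50001, "udp")))
```

A flow on two ephemeral ports falls through to "unknown", which is the failure mode that motivates payload-based, heuristic, and statistical methods.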

Annotated Papers

We have collected and reviewed papers published between 1994 and 2009 (please email info@caida.org if you know of one that should be added), starting with papers from peer-reviewed academic research conferences, and then including many papers cited by this initial seed set, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers, including data sets and methods used, goals, and basic empirical findings. We use five paper categories: survey, analysis, methodology, tools, and other. Analysis papers typically attempt to derive trustworthy numbers on the actual traffic cross-section, while methodology papers focus on classification methods. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.

ID	Title	Year	Publication	Authors	Paper Type	Classification Goals	Classification Characteristics	Method	Empirical Findings	% of traffic P2P	PDF	DataID
1	Toward the Accurate Identification of Network Applications	2005	PAM	A. Moore, K. Papagiannaki	Methodology	Coarse-grained Classification	Application Payload	Exact Matching	Port-based: 64.54%BULK;27.30%WWW; Content-based: 45.00%BULK;20.40%WWW;	1.5%		[D-48]
2	Flow Clustering Using Machine Learning Techniques	2004	PAM	A. McGregor, M. Hall, P. Lorier, J. Brunskill	Methodology	Coarse-grained Classification	Flow Characteristics	Machine Learning/Stat (EM)
3	Internet Traffic Classification Using Bayesian Analysis Techniques	2005	SIGMETRICS	A. Moore, D. Zuev	Methodology	Coarse-grained Classification	Flow Characteristics	Machine Learning/Stat (Bayesian)	65% accuracy on per-flow classification and better than 95% with refinements			[D-48]
4	Class-of-service Mapping for QoS	2004	IMC	M. Roughan, S. Sen, O. Spatscheck, N. Duffield	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (NN,LDA)				[D-59][D-68]
5	Is P2P Dying or just Hiding?	2004	GLOBECOM	T. Karagiannis, A. Broido, N. Brownlee, K. Claffy, M. Faloutsos	Methodology	Fine-grained Classification	Application Payload	Exact Matching	2003:HTTP(72%,47.7%);SMTP(1.3%,1.2%);P2P(8%,10.7%) 2004:HTTP(56%,52.1%);SMTP(3.2%,9.7%);P2P(14%,9.9%)	8%,10.7% in 2003;14%,9.9% in 2004		[D-69]
6	Transport Layer Identification of P2P Traffic	2004	SIGCOMM	T. Karagiannis, A. Broido, M. Faloutsos, K. Claffy	Methodology	Coarse-grained Classification (I)	Flow Characteristics	Heuristics	P2P traffic continues to grow unabatedly	15%-20%		[D-70]
7	An Analysis of Internet Chat Systems	2003	SIGCOMM	C. Dewes, A. Wichmann, A. Feldmann	Profiling/Analysis				miss less than 8.3% of all existing chat connections and to correctly classify at least 93.1%			[D-67]
8	Profiling Internet Backbone Traffic: Behavior Models and Applications	2005	SIGCOMM	K. Xu, Z. Zhang, S. Bhattacharyya	Profiling/Analysis							[D-49]
9	Accurate Scalable In-Network Identification of P2P Traffic	2004	WWW	S. Sen, O. Spatscheck, D. Wang	Methodology	Coarse-grained Classification (I)	Flow Characteristics	Heuristics	less than 5% false positive and false negative ratios			[D-71][D-72]
10	BLINC Multilevel Traffic Classification in the Dark	2005	SIGCOMM	T. Karagiannis, K. Papagiannaki, M. Faloutsos	Methodology	Coarse-grained Classification	Flow Characteristics	Heuristics	web:14%,37.5%,33.5%;data(ftp):67.4%,7.6%,5.4%;	1.2%,31.9%,31.3%		[D-48]
11	CoralReef Software Suite as a Tool for System and Network Administrators	2001	LISA	D. Moore, K. Keys, R. Koga, E. Lagache, K. Claffy	Tools
12	Packet-level Traffic Measurements from the Sprint IP Backbone	2003	IEEE Network	C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockell, T. Seely	Profiling/Analysis				over 90% flows have packet sizes of 1495 bytes or greater	0.1%-80%(p2p+unknown)		[D-77]
13	Snort Lightweight Intrusion Detection for Networks	1999	LISA	M. Roesch	Tools							[D-66]
14	Internet Traffic Characterization	1994		K. Claffy	Profiling/Analysis
15	Discriminators for Use in Flow-based Classification	2004		A. Moore, D. Zuev, M. Crogan	Other							[D-48]
16	Identifying the TCP Behavior of Web Servers	2001		J. Padhye, S. Floyd	Profiling/Analysis							[D-58]
17	Inherent Behaviors for On-line Detection of Peer-to-Peer File Sharing	2007	IEEE Global Internet	G. Bartlett, J. Heidemann, C. Papadopoulos	Methodology	Fine-grained Classification (I)	Flow Characteristics	Heuristics	achieve up to an 83% true positive rate with only a 2% false positive rate			[D-6][D-7]
18	Graption: Automated Detection of P2P Applications Using Traffic Dispersion Graphs	2008	Technical Report	M. Iliofotou, P. Pappu, M. Faloutsos	Methodology	Coarse-grained Classification (I)	Flow Characteristics	Heuristics	more than 90% precision and recall for P2P detection	9.19%		[D-8][D-11]
19	Traffic Classification through Simple Statistical Fingerprinting	2007	SIGCOMM CCR	M. Crotti, M. Dusi, F. Gringoli, L. Salgarelli	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (Normalized thresholds)				[D-12]
20	Traffic Classification Using Clustering Algorithms	2006	SIGCOMM	J. Erman, M. Arlitt, A. Mahanti	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (K-Means, DBSCAN)	47.3%(Bytes,HTTP);35.1%(Bytes,P2P);6.0%(Bytes,SMTP)	35.1%(Bytes)		[D-36][D-39]
21	Unexpected Means of Protocol Inference	2006	IMC	J. Ma, K. Levchenko, C. Kreibich, S. Savage, G. Voelker	Methodology	Fine-grained Classification	Application Payload	Machine Learning/Stat (Product Distribution; Markov Processes; CSG)				[D-40][D-41][D-42]
22	On Inferring Application Protocol Behaviors in Encrypted Network Traffic	2006	Journal of Machine Learning Research	C. Wright, F. Monrose, G. Masson	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (HMM)	achieve greater than 90% for several protocols in aggregate traffic			[D-43]
23	Traffic Classification on the fly	2006	SIGCOMM CCR	L. Bernaille, R. Teixeira, I. Akodjenou, A. Soule, K. Salamatian	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (K-Means)				[D-44]
24	Blind Application Recognition Through Behavioral Classification	2005		L. Bernaille, A. Soule, M. Jeannin, K. Salamatian	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (HMM)
25	Early Application Identification	2006	CONEXT	L. Bernaille, R. Teixeira, K. Salamatian	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (K-Means, GMM, Spectral Clustering)	classify known applications with an accuracy over 90%; identify new applications as unknown with a probability of 60%			[D-15][D-16][D-17][D-18][D-19][D-20]
26	Early Recognition of Encrypted Application	2007	PAM	L. Bernaille, R. Teixeira	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat	more than 85% accuracy in recognizing the application in an SSL connection			[D-13][D-14][D-20]
27	Traffic Classification using a Statistical Approach	2005	PAM	D. Zuev, A. Moore	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (Bayes)	achieve better than 83% accuracy on both a per-byte and a per-packet basis			[D-48]
28	Appmon: An Application for Accurate per Application Network Traffic Characterization	2006	BroadBand Europe	D. Antoniades, M. Polychronakis, S. Antonatos, E. Markatos, S. Ubik	Tools
29	Profiling the End Host	2007	PAM	T. Karagiannis, K. Papagiannaki, N. Taft, M. Faloutsos	Profiling/Analysis							[D-21][D-22]
30	Revealing Skype Traffic: when randomness plays with you	2007	SIGCOMM	D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, P. Tofanelli	Profiling/Analysis							[D-24][D-25]
31	A Traffic Characterization of Popular on-line Games	2005	IEEE/ACM Transactions on Networking	W. Chang Feng, F. Chang, W. chi Feng, J. Walpole	Profiling/Analysis							[D-52]
32	Hit-list worm detection and bot identification in large networks using protocol graphs	2007	RAID	M. Collins, M. Reiter	Profiling/Analysis							[D-23]
33	Identifying Known and Unknown Peer-to-Peer Traffic	2006	IEEE NCA	F. Constantinou, P. Mavrommatis
34	Offline/realtime Traffic Classification Using Semi-supervised Learning	2007	Perform. Eval	J. Erman, A. Mahanti, M. Arlitt, I. Cohen, C. Williamson	Methodology	Coarse-grained Classification		Machine Learning/Stat (K-Means)	37.4%(Bytes,P2P,Campus);80%(Bytes,P2P,Residential);61.5%(Bytes,P2P,WLAN)	37.4%(Bytes,Campus);80%(Bytes,Residential);61.5%(Bytes,WLAN)		[D-26]
35	Identifying and Discriminating between Web and Peer-to-Peer Traffic in the Network Core	2007	WWW	J. Erman, A. Mahanti, M. Arlitt, C. Williamson	Methodology	Coarse-grained Classification	Flow Characteristics	Machine Learning/Stat (K-Means)		38.3%		[D-27]
36	Acas: Automated Construction of Application Signatures	2005	SIGCOMM	P. Haffner, S. Sen, O. Spatscheck, D. Wang	Methodology	Fine-grained Classification	Application Payload (64 bytes)	Machine Learning/Stat (Naive Bayes, AdaBoost, Maximum Entropy)				[D-53]
37	Comparison of Internet Traffic Classification Tools	2007	IMRG WACI	H. Kim, K. Claffy, M. Fomenkov, N. Brownlee, D. Barman, M. Faloutsos	Survey/Compare							[D-28][D-29][D-30][D-31][D-32][D-33][D-34]
38	A Survey of Techniques for Internet Traffic Classification Using Machine Learning	2008	IEEE Communications Surveys and Tutorials	T. Nguyen, G. Armitage	Survey/Compare
39	Towards Automated Application Signature Generation	2008	NOMS	B. Park, Y. Won, M. Kim, J.Hong	Methodology	Fine-grained Classification	Application Payload	Heuristics				[D-1]
40	Analyzing Peer-to-Peer Traffic across Large networks	2004	IEEE/ACM Transactions on Networking	S. Sen, J. Wang	Profiling/Analysis							[D-73]
41	Role Classification of Hosts within Enterprise Networks based on Connection Patterns	2003	USENIX	G. Tan, M. Poletto, J. Guttag, F. Kaashoek	Methodology		Flow Characteristics	Heuristics				[D-64][D-65]
42	Self-learning IP Traffic Classification based on Statistical Flow Characteristics	2005	PAM	S. Zander, T. Nguyen, G. Armitage	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (EM)				[D-36]
43	Tunnel Hunter: Detecting Application-Layer Tunnels with Statistical Fingerprinting	2008	Computer Networks	M. Dusi, M. Crotti, F. Gringoli, L. Salgarelli	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat				[D-2]
44	A Preliminary Look at the Privacy of SSH Tunnels	2008	ICC	M. Dusi, F. Gringoli, L. Salgarelli	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (GMM)				[D-3]
45	A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection	2003	SIAM	A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava	Survey/Compare							[D-63]
46	Automatically Inferring Patterns of Resources Consumption in Network Traffic	2003	SIGCOMM	C. Estan, S. Savage, G. Varghese	Methodology	Coarse-grained Classification	Flow Characteristics	Heuristics				[D-60][D-61][D-62]
47	Mining Anomalies Using Traffic Feature Distributions	2005	SIGCOMM	A. Lakhina, M. Crovella, C. Diot	Methodology	Coarse-grained Classification		Heuristics (K-Means, Hierarchical Agglomerative Algorithm)				[D-54][D-55]
48	Network Traffic Analysis using Traffic Dispersion Graphs (TDGs): Techniques and Hardware Implementation	2007	Technical Report	M. Iliofotou, P. Pappu, M. Faloutsos, M. Mitzenmacher, S. Singh, G. Varghese	Tools							[D-35][D-36][D-37][D-38]
49	Heuristics to Classify Internet Backbone Traffic based on Connection Patterns	2008	ICOIN	W. John, S. Tafvelin	Methodology	Coarse-grained Classification	Flow Characteristics	Heuristic	leave only 0.2% of the data unclassified; can identify 95% of P2P flows	42%(average) in connections, 79%(average) in traffic		[D-4]
50	Trends and Differences in Connections Behavior within Classes of Internet Backbone Traffic	2008	PAM	W. John, S. Tafvelin, T. Olovsson	Profiling/Analysis				P2P and HTTP traffic exhibit different peak times. P2P traffic was found to be clearly dominating with 90% of the transfer volumes, especially during evening and night times. In contrast, HTTP traffic has its main activities (9% of the data volumes) during office hours.	93%(evening);91%(night);86%(office hours)		[D-4][D-5]
51	Flow Analysis of Internet Traffic: World Wide Web versus Peer-to-Peer	2005	Systems and Computers in Japan	M. Perenyi, T. Dang, A. Gefferth, S. Molnar	Profiling/Analysis				57.52% for WWW, 21.53% for P2P, 20.95% for other	21.53%		[D-56]
52	Identification and Analysis of Peer-to-Peer Traffic	2006	Journal of Communications	M. Perenyi, T. Dang, A. Gefferth, S. Molnar	Methodology	Fine-grained Classification	Flow Characteristics	Heuristics		60%-80%		[D-45]
53	Analysis of Peer-to-Peer Traffic on ADSL	2005	PAM	L. Plissonneau, J. Costeux, P. Brown	Profiling/Analysis				40% of connections are only connection reattempts, and it concerns about 30% of peers	60% in 2004, 65% in 2003(lies on P2P ports)		[D-57]
54	Analysis of Internet Backbone Traffic and Header Anomalies Observed	2007	IMC	W. John, S. Tafvelin	Profiling/Analysis							[D-4]
55	Flow Classification by Histograms or How to Go on Safari in the Internet	2004	SIGMETRICS	A. Soule, K. Salamatian, N. Taft, R. Emilion, K. Papagiannaki	Methodology	Coarse-grained Classification	Flow Characteristics	Machine Learning/Stat (EM)				[D-74]
56	Streaming Video Traffic: Characterization and Network Impact	2002	WCW	J. Merwe, S. Sen, C. Kalmanek	Profiling/Analysis							[D-59]
57	The architecture of CoralReef: an Internet traffic monitoring software suite	2001	PAM	K. Keys, D. Moore, R. Koga, E. Lagache, M. Tesch, K. Claffy	Tools
58	A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification	2006	SIGCOMM	N. Williams, S. Zander, G. Armitage	Survey/Compare							[D-36][D-46][D-47]
59	Internet Traffic Identification using Machine Learning	2006	GLOBECOM	J. Erman, A. Mahanti, M. Arlitt	Methodology	Fine-grained Classification	Flow Characteristics	Machine Learning/Stat (EM, Bayes)	81.2%http;3.1%smtp;2.0%dns			[D-36]
60	Estimating P2P Traffic Volume at USC	2007		G. Bartlett, J. Heidemann, C. Papadopoulos, J. Pepin	Profiling/Analysis		Port Number and Flow Characteristics	Exact Matching and Heuristics	3%-13% of active hosts on campus participate in P2P	21%-33%(less than, Byte)		[D-75]
61	A Longitudinal Study of P2P Traffic Classification	2006	MASCOTS	A. Madhukar, C. Williamson	Survey/Compare				30%-70% of the campus Internet traffic for 2003-2005 was P2P	30%-70%		[D-76]
62	On the Validation of Traffic Classification Algorithms	2008	PAM	G. Szabo, D. Orincsay, S. Malomsoky and I. Szabo	Other	Fine-grained Classification, for validating classification methods			P2P:70%;Web:26%;VoIP:2%;Streaming:1%;Secure Channel:1%;	70%		[D-78]
63	Accurate Traffic Classification	2007	WoWMoM	G. Szabo, I. Szabo and D. Orincsay	Methodology	Fine-grained Classification						[D-79]
64	Observing Slow Crustal Movement in Residential User Traffic	2008	ACM CoNEXT	Kenjiro Cho, Kensuke Fukuda, Hiroshi Esaki and Akira Kato	Analysis		Port Number		Between May 2005 and May 2008, the average annual growth rate of (A1) RBB customers: 26% for inbound; 28% for outbound; 27% for the combined volume;	0.92%(gnutella,2005);0.25%(bittorrent,2005);0.12%(edonkey,2005);0.94%(gnutella,2008);0.22%(bittorrent,2008);0.13%(edonkey,2008);		[D-80][D-81]
65	Classification of Network Traffic via Packet-Level Hidden Markov Models	2008	GLOBECOM	Alberto Dainotti, Walter de Donato, Antonio Pescape and Pierluigi Salvo Rossi	Methodology		Flow Characteristics	Machine Learning (HMM)				[D-82][D-83]
66	TIE: a Community-oriented Traffic Classification Platform	2009	IMA	Alberto Dainotti, Walter de Donato, Antonio Pescape and Giorgio Ventre	Tools
67	Challenging Statistical Classification for Operational Usage: the ADSL Case	2009	IMC	Marcin Pietrzyk, Jean-Laurent, Guillaume Urvoy-Keller and Taoufik	Analysis		TCP Flow Characteristics	Machine Learning/Stat	Most bytes and flows are due to eDonkey; The vast majority of traffic in the HTTP Streaming class is due to Dailymotion and Youtube;	5%-40%(flows);10%-50%(bytes)		[D-84]
68	On Dominant Characteristics of Residential Broadband Internet Traffic	2009	IMC	Gregor Maier, Anja Feldmann, Vern Paxson and Mark Allman	Analysis		Port Number and Application Payload	Exact Matching	HTTP carries more than 50% traffic; Flash Video contributes 25% of all HTTP traffic;	14%(bytes)		[D-85][D-86]

Datasets

Several public and private passive measurement infrastructures have provided a variety of different datasets for Internet traffic classification studies, which we group into four categories:

Packet-based: packet-level traces, captured by hardware or software. Endace DAG cards are often used for packet capture on high-bandwidth links (CAIDA uses these cards for its OC-192 backbone trace capture at equinix-chicago and equinix-sanjose). Other capture hardware used over the years includes ATM FORE, OC12 or OC13 POINT ATM, Napatech, and INVEA-TECH. DAG cards can capture traffic on links of up to 10 Gbps with less than 15 ns timestamp resolution. Most software tools for capturing packets are based on kernel implementations such as tcpdump/libpcap; CoralReef and Appmon, for example, are based on libpcap. Other packet sniffers and network analyzers are also available.
SNMP-based: traffic counters and statistics obtained from network devices through the SNMP and RMON MIBs.
Flow-based: flow-level descriptions of a traffic stream (Cisco NetFlow, Juniper cflowd, Foundry sFlow, Huawei NetStream).
Other: data not covered by the categories above, such as application-level session logs from web sites.

The data come from three types of capture environments: intranet, edge/border, and backbone.
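To make the difference between the packet-based and flow-based categories concrete, the following sketch aggregates packet records into unidirectional flows keyed by the usual 5-tuple. It is a simplified illustration, not how any particular exporter works: the `Packet` type and `aggregate_flows` function are hypothetical names, and the sketch assumes no flow timeouts, whereas real flow exporters also split flows on timeouts.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Minimal packet record (hypothetical; headers only, no payload)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size: int  # bytes on the wire


FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, int]]:
    """Group packets by 5-tuple and count packets and bytes per flow.

    Assumption: flows are unidirectional and never expire.
    """
    flows: Dict[FlowKey, Dict[str, int]] = defaultdict(
        lambda: {"packets": 0, "bytes": 0}
    )
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


if __name__ == "__main__":
    trace = [
        Packet("10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 60),
        Packet("10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 1500),
        Packet("10.0.0.3", "10.0.0.4", 5000, 53, "udp", 80),
    ]
    for key, stats in aggregate_flows(trace).items():
        print(key, stats)
```

Flow records of this kind keep the addressing, size, and count information that flow-characteristic classifiers use, but discard the payload that exact-matching methods need.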

ID	Name	Year	Link Type	Capture Environments	Geographic Location	Payload	Size and Length	PaperID
1		2007	Academic	Backbone	Asia	Yes(full)	3 h; 450 Gbytes
2	TunnelHunter-2007	2007	Academic	Edge/Border	Europe	Yes(part)
3	TunnelSSH-2006	2006	Academic	Edge/Border	Europe	Yes(full)	0.25 h each hour for three weeks; 50 Gbytes
4		2006	Academic	Backbone	Europe	No	0.33*4 h each day for 20 days	[P-49][P-50][P-54]
5		2006	Academic	Backbone	Europe	No	276 randomized times(10 mins) during 80 days
6	LosNettos-2005	2005	Academic and Commercial	Backbone	North America		24 h, 08/31/2005
7	LosNettos-2006	2006	Academic and Commercial	Backbone	North America		24 h, 10/03/2006
8	CAIDA-OC48-2003	2003		Backbone	North America	No	1 h; 95 Gbytes
9	Abilene-ABIL-2004	2004	Academic	Backbone	North America	No	1 h; 714 Gbytes
10	PAIX-PAY1-2004	2004		Backbone	North America	Yes(16 bytes)	1 h; 435 Gbytes
11	PAIX-PAY2-2004	2004		Backbone	North America	Yes(16 bytes)	1 h; 374 Gbytes
12	UNIBS-		Academic	Edge/Border	Europe	Yes(part)
13		2004	Academic	Edge/Border	Europe	Yes	1 h
14		2006	Academic	Edge/Border	Europe	Yes	1 h
15	Paris6-2004-2005	2004-2005	Academic	Edge/Border	Europe	Yes	1*3 h; 27 Gbytes; 35 Gbytes; 300 Gbytes
16	College-2003	2003	Academic		Europe	No	15 mins; 900 Mbytes
17	ADSL-2004	2004			None	No	15 mins; 2.3 Gbytes
18	WirelessCrawdad-2003	2003	Academic		Europe	No	5 h 30 mins; 330 Mbytes
19	Enter-		Commercial	Edge/Border	None	Yes	1 h 20 mins; 300 Mbytes
20	UMass-2005	2005	Academic	Edge/Border	North America	Yes(4 bytes)
21	MicroResearch1-2005	2005	Commercial	Edge/Border	North America	Yes	1 month
22	MicroResearch2-2005	2005	Commercial	Edge/Border	North America	Yes	2 weeks
23	CISCO-			Backbone(/8)		No
24	Polito-Academic-2006	2006	Academic		Europe		95 h
25	Polito-ISP-2006	2006		Backbone	Europe		24 h
26	Calgary1-2006	2006	Academic and Commercial		North America	Yes(full)	1*48 h over 6 months
27	Calgary2-2006	2006	Academic		North America	Yes(full)	1*8 h over 4 days
28	PAIX-I	2004	Commercial	Backbone	North America	Yes(16 bytes)	2 h; 91 Gbytes
29	PAIX-II	2004	Commercial	Backbone	North America	Yes(16 bytes)	2 h 2 mins; 891 Gbytes
30	WIDE-1	2006		Backbone	Oceania	Yes(40 bytes)	55 mins; 14 Gbytes
31	KEIO-I	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	30 mins; 16 Gbytes
32	KEIO-II	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	30 mins; 16 Gbytes
33	KAIST-I	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	48 h 12 mins; 506 Gbytes
34	KAIST-II	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	21 h 16 mins; 259 Gbytes
35	WIDE-2	2006		Backbone	Oceania		2 h
36	AUCK-	2001/2003		Edge/Border	Oceania		1 h /3 days	[P-20][P-42][P-48][P-58][P-59]
37	OC48-2003	2003		Backbone	North America		1 h 2 mins
38	UCSD-honeypot		Academic	Intranet	North America		5 mins
39	Calgary3-2006	2006	Academic	Edge/Border	North America	Yes(full)	1 h
40	Cambridge-2003	2003	Academic		Europe		24 h
41	Wireless-2006	2006	Academic	Intranet	North America		5 days
42	UCSDDepart-2006	2006	Academic	Backbone	North America		1 h
43	GMU-2003	2003	Academic		North America		10 mins of each quarter hour over 2 months
44	PMC-		Academic	Edge/Border	None	Yes	1 hour
45	Callrecords-2005	2005	Commercial	Edge/Border	Europe		logs
46	Leipzig-II-20030221	2003					1 h
47	Nzix-II-2000	2000					1 h
48	Genome Academic		Academic	Edge/Border	Europe	Yes(full)	24 h/43.9 h; 268 Gbytes/495 Gbytes	[P-1][P-3][P-10][P-15][P-27]
49	Tier1ISP-			Backbone	North America	No	24 h/3 h; 11-98 Gbytes
50	University-weekday	2004	Academic		None		24.6 h; 1223 Gbytes
51	University-weekend	2004	Academic		None		33.6 h; 1652 Gbytes
52	Mshmro-2002	2002	Commercial		None	Yes	7 days; 60 Gbytes
53	Accessnetwork-2004/5	2004-2005			None		l/4/8 h; 100 Gbytes
54	Abilene-2003	2003	Academic	Backbone	North America	No	20 days(1/100 pkts)
55	Geant-2004	2004		Backbone	Europe	No	23 days(1/1000 pkts)
56	Waseda-2002	2002	Academic	Edge/Border	Asia	No	weekday nights over 1 month
57	ADSL-2002/3	2003-2004		Backbone	None		weekdays and weekend days of Sep 2004 and June 2003
58	WebServer-				None		10,000 webServers as testing purposes
59		2001	Commercial		None
60	UCSD-NAP-2002	2002	Commercial		North America		31 days
61	Research-2002	2002	Academic	Edge/Border	None		39 days (roughly 15,000 hosts)
62	OC48-2001	2001		Backbone	None		8 h
63	DARPA-1998	1998					2 and 7 weeks of network-based attacks
64	Mazu-		Commercial		North America
65	BigComany-		Commercial		None
66	Tier1-multi	2001-2002		Backbone	None		1 h - 6 days
67	Saarland-2002	2002	Academic		Europe	Yes	8 days; 950 Gbytes
68		2003/2004
69	CAIDA-OC48-2002/4	2002-2004		Backbone	North America	Yes(4 bytes)	1 h
70	CAIDA-OC48-2003/4	2003-2004		Backbone	North America	Yes(part)	1-2 h
71	InternetAccessTrace-2003	2003			None	Yes(full)	24 h and 18 h; 120 Gbytes
72	VPN-2003	2003			None	Yes(full)	6 days; 1.8 Tbytes
73	MultiRouter-2001	2001		Backbone	None		8000 million flow level records
74	Tier1ISP-OC12-2001	2001		Backbone	North America	No	3.5 days
75	USC-2006	2006		Edge/Border	North America	No	14 hour period
76	USC-2006	2003-4		Edge/Border	North America	No	2 years
77	POPs-			Backbone	North America	No
78	AccessNetwork-			Intranet	None	No	43 hours; 6 Gbytes
79	MobileNetwork-				Europe and Asia	Yes(full)
80	JanpanISPSNMP-	2004-2008	Commercial	Backbone	Asia		aggregated SNMP data (month-long) from 6 ISPs;
81	JanpanISPNetFlow-	2005,2008	Commercial	Backbone	Asia		Sampled NetFlow data from 1 ISP;
82	NapleItaly		Academic	Backbone	Europe		Generated by a set of controlled boxes;
83	WPIUSA		Academic	Backbone	North America
84	ADSLPoPFrance	2006, 2008	Commercial		Europe	yes(full)	1-2 h; 26-60 Gbytes
85	ISPEuropean	2008, 2009	Commercial		Europe		2*24 h, 14*90 mins; >4 Tbytes, 100-600 Gbytes
86	DSLSession	2009	Commercial		Europe		10 days, 6*24 h (DSL session)

Discussion

P2P traffic is one of the most challenging traffic types to classify, partly due to substantial legal interest in identifying it and the even more substantial negative repercussions for the user if P2P traffic is accurately identified. The misaligned incentives between those who want to use and those who want to identify P2P applications, together with the tremendous legal and privacy constraints on traffic research, render scientific study of this question nearly impossible. Even if such study were possible, wide variation across links would prevent a simple numeric answer to the question of how much P2P traffic there is on the Internet. But our taxonomy does reveal insights: the fraction of peer-to-peer file sharing traffic observed ranges from 1.2% to 93% across the 18 papers that provide such numbers. The average fractions reported also increased considerably from 2002 to 2006 (Table 1). Tables 2 and 3 show that results vary widely by link and geographic location. Table 3 suggests that P2P is more popular in Europe, probably due to stricter policies (e.g., those pursued by the MPAA and RIAA) in North America. Note that the Asian results are from Japanese data sets, in which 1.34% and 1.29% are based on port numbers and therefore likely to significantly underestimate the fraction of P2P traffic. Furthermore, the amount of P2P traffic also varies by time of day, with higher fractions at night (Table 4).

One study [34] suggests that peer-to-peer applications are used more often at home than in the office. Finally, a European study [50] found a higher fraction of P2P traffic on a university link than Canadian researchers [34] found on their campus.

Some numbers are based on statistical or host-behavioral classification. The remaining numbers are based on P2P detection via payload signature matching, the most reliable method of detecting an application (if the traffic is unencrypted), but one that is fraught with legal and privacy issues.
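A minimal sketch of payload signature matching, and of why it breaks down on encrypted traffic, is given below. The signature list is a small illustrative assumption (the BitTorrent handshake prefix, an HTTP request method, and the SSH version banner), not the signature set used by any surveyed paper, and the function name is hypothetical.

```python
from typing import List, Tuple

# Illustrative payload prefixes (an assumption for this sketch).
# A BitTorrent handshake starts with the byte 0x13 followed by the
# string "BitTorrent protocol"; SSH peers open with an "SSH-" banner.
SIGNATURES: List[Tuple[bytes, str]] = [
    (b"\x13BitTorrent protocol", "BitTorrent"),
    (b"GET ", "HTTP"),
    (b"SSH-", "SSH"),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the application whose signature prefixes the payload."""
    for prefix, application in SIGNATURES:
        if payload.startswith(prefix):
            return application
    return "unknown"


if __name__ == "__main__":
    print(classify_by_payload(b"\x13BitTorrent protocol" + b"\x00" * 8))
    print(classify_by_payload(b"GET /index.html HTTP/1.1\r\n"))
    # Encrypted or obfuscated payloads match no plaintext signature.
    print(classify_by_payload(bytes([0x17, 0x03, 0x03, 0x00, 0x20])))
```

The approach needs access to at least the first bytes of the payload, which is why the Payload column in the data set table matters and why traces with stripped or truncated payloads force researchers toward port-based or statistical methods.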

Table 1. P2P Range(Year).
Year	Range	PaperID
2002	21.5%	[51]
2004	9.19-60%	[5],[6],[10],[18],[53]
2006	35.1-93%	[20],[34],[35],[50]

Table 2. P2P Range(Link Location).
Year	Link Location	Range	PaperID
2004	Campus link	31.3%	[10]
2004	ADSL link	60%	[53]
2004	Backbone link	9-14%	[5],[18]
2004	Backbone link	17-25%	[6]

Table 3. P2P Range(Geographic).
Geographic Location	Year	P2P Range	PaperID
Europe	2005	60-80%	[52]
Europe	2006	79-93%	[49],[50]
North America	2003	8%,10.7%	[5]
North America	2004	14%,9.9%	[5]
North America	2003-04	9.19-70%	[6],[18],[61]
North America	2006	21-35.1%	[20],[34],[35]
Asia	2002	21.53%	[51]
Asia	2005	1.34% (port-based)	[64]
Asia	2008	1.29% (port-based)	[64]

Table 4. P2P Range(Time).
Year	Time	Range	DataID	PaperID
2006	midnight to 10am	80%	[D-26]	[34]
2006	9am to 10am	61.5%	[D-26]	[34]
2006	evening	93%	[D-4],[D-5]	[50]
2006	night	91%	[D-4],[D-5]	[50]
2006	office hours	86%	[D-4],[D-5]	[50]

UDP Traffic Analysis

It is still commonly assumed that Internet traffic is dominated by TCP, and this assumption is the basis of most current traffic classification work. However, the rise of new streaming applications (e.g., IPTV such as PPStream and PPLive) and new P2P protocols (e.g., uTP) that try to avoid traffic shaping is expected to increase the use of UDP as a transport protocol.

In this section, we collect UDP measurements from existing work, and then compare the use of UDP and TCP in several traffic traces collected in different network and geographic locations, as well as in different time periods.

Table 5 shows that the UDP/TCP ratio reported in existing work ranges from 0.01 to 0.20 in terms of packets and bytes, with the highest value in a residential trace. To better evaluate the amount of UDP and TCP traffic in real traces (in terms of flows, packets, and bytes), we analyzed several available traces collected between 2002 and 2009 on several backbone links located in the US and Sweden. Table 6 shows that the use of UDP as a transport protocol increased rapidly from 2002 to 2009, although TCP sessions are still responsible for most packets and bytes. In terms of flows, however, UDP turns out to be the dominant transport protocol.

Table 5. Values of UDP/TCP Ratio(from papers).
PaperID	Year	UDP/TCP Ratio (pkts)	UDP/TCP Ratio (bytes)	UDP/TCP Ratio (flows)	Notes
P-1	around 2003	0.01	0.01
P-34	2006	0.11	0.05
		0.02	0.04		WLAN Trace
		0.20	0.20		10-hour Residential Trace
P-49 P-54	2006			1.12
P-64	2005		0.01
P-64	2008		0.02

Table 6. Values of UDP/TCP Ratio(from real-traces).
Trace	Sample	UDP/TCP Ratio (pkts)	UDP/TCP Ratio (bytes)	UDP/TCP Ratio (flows)	Total IP Traffic (pkts/bytes/flows)
CAIDA-OC48	08-2002	0.11	0.03	0.11	(1371M/838GB/79M)
CAIDA-OC48	01-2003	0.12	0.05	0.27	(463M/267GB/26M)
GigaSUNET	04-2006	0.06	0.02	1.06	(422M/294GB/9M)
GigaSUNET	11-2006	0.08	0.03	1.45	(422M/294GB/9M)
CAIDA-OC192	06-2008	0.14	0.05	1.43	(4427M/2279GB/197M)
CAIDA-OC192	02-2009	0.19	0.07	2.34	(1922M/1410GB/110M)
OptoSUNET	01-2009	0.21	0.11	3.09	(1100M/657GB/41M)
OptoSUNET	02-2009	0.20	0.11	2.63	(1100M/657GB/41M)
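The three ratios reported in Tables 5 and 6 can be computed from per-flow records roughly as follows. This is a sketch under stated assumptions: the input format (one `(protocol, packets, bytes)` tuple per flow) and the function name are hypothetical, and the sample numbers are made up to show the effect, not taken from any trace.

```python
from typing import Dict, Iterable, Optional, Tuple

FlowRecord = Tuple[str, int, int]  # (protocol, packets, bytes); hypothetical format


def udp_tcp_ratios(flows: Iterable[FlowRecord]) -> Dict[str, Optional[float]]:
    """Return UDP/TCP ratios for packets, bytes, and flow counts.

    A ratio is None when the trace contains no TCP traffic for that metric.
    """
    totals = {
        "tcp": {"pkts": 0, "bytes": 0, "flows": 0},
        "udp": {"pkts": 0, "bytes": 0, "flows": 0},
    }
    for protocol, packets, nbytes in flows:
        if protocol not in totals:
            continue  # ignore ICMP and other protocols
        totals[protocol]["pkts"] += packets
        totals[protocol]["bytes"] += nbytes
        totals[protocol]["flows"] += 1

    ratios: Dict[str, Optional[float]] = {}
    for metric in ("pkts", "bytes", "flows"):
        tcp_total = totals["tcp"][metric]
        ratios[metric] = totals["udp"][metric] / tcp_total if tcp_total else None
    return ratios


if __name__ == "__main__":
    # One large TCP flow and three tiny UDP flows (made-up numbers).
    sample = [
        ("tcp", 100, 90_000),
        ("udp", 2, 200),
        ("udp", 1, 100),
        ("udp", 1, 100),
    ]
    print(udp_tcp_ratios(sample))
```

In the made-up sample, UDP accounts for a small share of packets and bytes but three times as many flows as TCP, the same qualitative pattern as the later rows of Table 6, where many short UDP flows coexist with fewer, larger TCP flows.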

Conclusion

This overview page presented a rough taxonomy of traffic classification approaches, based on features, methods, goals and data sets.

Our survey also reveals shortcomings in current traffic classification efforts. The variety of data sets used does not allow systematic comparison of methods, and few research groups share (or can share) their datasets. As was already true ten years ago, the field of traffic classification research still needs publicly available, modern data sets as reference data for validating approaches. The poor comparability of results is compounded by the lack of standardized measures and classification goals. For example, there is no clear definition of traffic classes such as P2P or file sharing.

However, the taxonomy above allows meta-analyses of relevant open questions, such as trends in and development of traffic classes or features, yielding new insights into Internet traffic. We showed this by shedding some light on questions such as: "how much of modern Internet traffic is P2P?" Though we found some trends and indications, we have far too little data available to make conclusive claims beyond "there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."

Acknowledgements

This work was made possible thanks to funding from DHS-PREDICT, the National Science Foundation, Beijing Jiaotong University, and the China Scholarship Council.

Last Modified: Tue Oct-13-2020 22:21:54 UTC

Page URL: https://www.caida.org/research/traffic-analysis/classification-overview/index.xml