Architecture
Figure: Corsaro-Out Architectural Overview

Conceptually, the Corsaro package comprises three main components, each of which interacts with the libtrace trace processing library. At the core is the libcorsaro library, which is responsible for coordinating the analysis plugins and for reading and writing files. The plugins module contains a set of analysis plugins which libcorsaro uses to process packets and deserialize plugin output. The libcorsaro library can then be leveraged by Tools to carry out analysis.

libcorsaro

At the heart of Corsaro is the libcorsaro library. libcorsaro fundamentally operates in one of two modes, Corsaro-Out and Corsaro-In. At a high level, the Corsaro-Out mode is used to analyze packet traces and generate aggregated output. The Corsaro-In mode, on the other hand, is used to read the data generated by Corsaro in the Corsaro-Out mode and allow further analysis to be carried out.

One function that Corsaro takes care of in both modes is plugin management. Once a plugin has been correctly implemented within the plugin sub-library (see the Creating a New Plugin tutorial for more), Corsaro takes care of compiling, linking and running the plugin.

There are currently two methods for controlling which of the available plugins are used: at compile time, by passing the --with[out]-<plugin> option to configure (see the Installation section), and at run time, by giving an explicit list of plugins to the corsaro tool using the -p argument. The configure option has the effect of only compiling and linking the enabled plugins. That is, disabled plugins are not present in the final binary. The runtime option, on the other hand, allows a subset of the compiled plugins to be run. Neither method is significantly more or less efficient, but conceptually, one can think of the compile-time method as setting the 'default' plugin set, and the runtime method as specifying a custom list of plugins for a specific run of the tool. Note that disabling a plugin at compile time means that tools such as cors2ascii will not be able to read data written by that plugin.

Intervals

The primary method that Corsaro uses to facilitate data aggregation is the compression of data into time bins. These bins are known as intervals and are an integral part of the Corsaro trace analysis framework.

Based on a configurable interval length (see corsaro_set_interval), Corsaro notifies plugins at the start and end of each interval (see Plugins). Interval times are based on the timestamps of packets (that is, the time the packet was captured at), with the first interval beginning at the time of the first packet (truncated down to the nearest second). An interval has both a start time and an end time associated with it. All packets that fall within the interval will have a timestamp which is greater than or equal to the start time, and less than or equal to the end time. That is:

start_time <= packet.timestamp <= end_time

Corsaro supports interval durations at second granularity. That is, the shortest possible interval is 1 second. As such, all packet times are truncated down to the nearest second when making interval calculations. For example, given the following interval (of length 60):

# CORSARO_INTERVAL_START 0 1325390400
# CORSARO_INTERVAL_END 0 1325390459

Packets with the following timestamps would be included in this interval:

1325390400.806566 # when truncated is == start_time
1325390420.365544 # when truncated is > start_time and < end_time
1325390459.734654 # when truncated is == end_time

The last timestamp is one which may seem counter-intuitive at first, but this policy allows for the accurate representation of the last interval of an analysis run, the length of which may be less than the specified interval length. If the last packet is received before the scheduled interval end time, the recorded end time of the interval will be the timestamp of the last packet in the interval, truncated down to the nearest second.
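The truncation rule can be illustrated with a minimal sketch. The helper names below (demo_interval_end, demo_in_interval) are hypothetical and are not part of libcorsaro; they only assume what the example above shows, namely that an interval of length 60 starting at 1325390400 ends at 1325390459 and that packet times are truncated to whole seconds before comparison.

```c
#include <stdint.h>

/* Hypothetical helper (not libcorsaro): scheduled end time of an interval.
 * Assumes length >= 1, since the shortest possible interval is 1 second. */
static uint32_t demo_interval_end(uint32_t start, uint32_t length)
{
  return start + length - 1;
}

/* Hypothetical helper (not libcorsaro): returns non-zero when a packet
 * timestamp, truncated down to the nearest second, satisfies
 * start_time <= timestamp <= end_time. */
static int demo_in_interval(double ts, uint32_t start, uint32_t end)
{
  uint32_t sec = (uint32_t)ts; /* truncate the fractional part */
  return sec >= start && sec <= end;
}
```

With these definitions, a packet stamped 1325390459.734654 belongs to the interval above, while one stamped 1325390460.000001 would fall into the next interval.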

When using the Corsaro-Out mode to process a trace file, plugins can use intervals to generate aggregated statistics. For example, the FlowTuple plugin maintains flow records which are written out and reset at the end of each interval. Note that while plugins may take advantage of the interval architecture, they are not required to, and may simply ignore the interval start and end event notifications (or a subset of them). For example, the RS DoS plugin aggregates data to 5 minute intervals regardless of the interval length passed to Corsaro.

For more information about the implementation of the interval framework, see the Corsaro-Out and Plugins sections of this document.

IO Framework

Corsaro is fundamentally designed to batch process large volumes of trace data and to output analyzed, but still potentially large, quantities of aggregated data. It has also been designed to be easily extensible. To facilitate this, the Corsaro IO framework was created to abstract away many of the details of opening, reading from, and writing to Corsaro files.

The Corsaro IO framework is built on the libwandio library (included with libtrace) which we leverage to provide threaded and compressed IO. By default each file has a dedicated thread to perform (de)compression (if needed) and disk access. Compression is enabled based on the suffix of the file passed to corsaro_alloc_output, as described in Corsaro-Out.

IO performance can be affected by several factors, such as disk speed. CPU speed can also significantly affect performance if compression is used. As such it is important to test Corsaro under peak-load if you plan to use it on a live interface.

This section provides an introductory guide to the basic features of the Corsaro IO framework. A more advanced tutorial is planned for future releases.

Opening a File

When a plugin needs to write to an output file, most of the time, it simply calls corsaro_io_prepare_file and passes it a pointer to the current corsaro state and a string which represents the plugin name. For example, the FlowTuple plugin makes the following call:

state->outfile = corsaro_io_prepare_file(corsaro, plugin->name);

This opens a file using the current compression method and output format.

The name of the opened file is built from the template given to the corsaro_alloc_output function, as described in Corsaro-Out.

There are currently two Corsaro-specific specifiers that are supported in the template string:

Specifier   Description    Example
%P          Plugin Name    flowtuple
%N          Monitor Name   ucsd-nt

In addition to these, all format specifiers supported by strftime(3) are also recognized and will be populated with the start time of the first interval in the file. Output file rotation was added in the 2.0.0 release of Corsaro. Plugins should use the corsaro_is_rotate_interval function to determine whether to rotate their output files. See the Creating a New Plugin tutorial for more information about rotating output files.

For example, given a monitor name of ucsd-nt (see corsaro_set_monitorname), a first-interval start time of 2011-11-11 00:00 UTC, and the template:

%N.%s.%P.cors.gz

the FlowTuple plugin would open a file named:

ucsd-nt.1320969600.flowtuple.cors.gz
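The expansion described above can be sketched in two passes: first substitute the Corsaro-specific specifiers, then hand the result to strftime(3). The function demo_expand_template below is a hypothetical illustration, not the libcorsaro implementation. It assumes that times are interpreted as UTC, that plugin and monitor names contain no '%' characters, and that the template does not end in a lone '%'. It uses %Y%m%d rather than %s in its usage example because %s is a platform extension to strftime whose result can depend on the local time zone.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not libcorsaro): expands %P and %N, then lets
 * strftime(3) expand everything else using the interval start time in UTC.
 * Returns 0 on success, -1 if a buffer is too small or the time is invalid. */
static int demo_expand_template(const char *tmpl, const char *plugin,
                                const char *monitor, time_t start,
                                char *out, size_t outlen)
{
  char stage[1024]; /* template after %P and %N substitution */
  size_t used = 0;
  const char *p;
  struct tm *tm;

  for (p = tmpl; *p != '\0'; p++) {
    const char *sub = NULL;

    if (p[0] == '%' && p[1] == 'P') {
      sub = plugin;
    } else if (p[0] == '%' && p[1] == 'N') {
      sub = monitor;
    }

    if (sub != NULL) {
      size_t n = strlen(sub);
      if (used + n >= sizeof(stage)) return -1;
      memcpy(stage + used, sub, n);
      used += n;
      p++; /* skip the specifier character */
    } else if (p[0] == '%' && p[1] != '\0') {
      /* pass any other specifier (including %%) through to strftime */
      if (used + 2 >= sizeof(stage)) return -1;
      stage[used++] = *p++;
      stage[used++] = *p;
    } else {
      if (used + 1 >= sizeof(stage)) return -1;
      stage[used++] = *p;
    }
  }
  stage[used] = '\0';

  tm = gmtime(&start);
  if (tm == NULL) return -1;
  /* strftime returns 0 when the result does not fit in the output buffer */
  return strftime(out, outlen, stage, tm) == 0 ? -1 : 0;
}
```

For instance, expanding the template %N.%Y%m%d.%P.cors.gz with the monitor name ucsd-nt, the plugin name flowtuple and a start time of 1320969600 would yield ucsd-nt.20111111.flowtuple.cors.gz.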

To determine which output mode the file has been opened in (and as such, which mode Corsaro is operating in), simply use the CORSARO_FILE_MODE macro. If you need more control over the name of the file, or compression, mode, etc., then the corsaro_io_prepare_file_full function may be more appropriate.

Writing to a File

Corsaro supports many methods of writing to files, but the two simplest (and most flexible) are corsaro_file_write and corsaro_file_printf.

If Corsaro is operating in binary mode, the corsaro_file_write function should be used, whereas in ASCII mode, the corsaro_file_printf function should be used.

Binary Output

corsaro_file_write takes four arguments:

  1. a pointer to the corsaro state
  2. a pointer to a corsaro_file_t opaque structure (returned by corsaro_io_prepare_file)
  3. a pointer to the data to write
  4. the length of the data

For example, the FlowTuple plugin writes a corsaro_flowtuple record by calling:

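/* opening of the call reconstructed from the argument list above;
   the file handle name is assumed */
corsaro_file_write(corsaro, state->outfile,
                   flowtuple, sizeof(corsaro_flowtuple_t));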

ASCII Output

corsaro_file_printf functions in much the same way as fprintf(3), albeit requiring 3 arguments rather than 2:

  1. a pointer to the corsaro state
  2. a pointer to a corsaro_file_t opaque structure (returned by corsaro_io_prepare_file)
  3. a pointer to the format string (just like fprintf(3))

For example, the FlowTuple plugin prints the contents of a corsaro_flowtuple record by calling:

/* first line reconstructed from the argument list; the file handle name
   and the leading "%s|%s" are assumed */
corsaro_file_printf(corsaro, state->outfile, "%s|%s"
                    "|%"PRIu16"|%"PRIu16
                    "|%"PRIu8"|%"PRIu8"|0x%02"PRIx8
                    "|%"PRIu16
                    ",%"PRIu32"\n",
                    ip_a, ip_b,
                    ntohs(flowtuple->src_port),
                    ntohs(flowtuple->dst_port),
                    flowtuple->protocol,
                    flowtuple->ttl,
                    flowtuple->tcp_flags,
                    ntohs(flowtuple->ip_len),
                    ntohl(flowtuple->packet_cnt));
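To show how the format string and the byte-order conversions fit together, the sketch below formats a simplified record into a buffer with snprintf(3). The demo_flowtuple structure and demo_format_flowtuple function are hypothetical stand-ins (the real corsaro_flowtuple_t may differ), and the assumption that the two addresses are already rendered as strings is ours.

```c
#include <arpa/inet.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for corsaro_flowtuple_t.
 * Multi-byte fields are assumed to be stored in network byte order. */
struct demo_flowtuple {
  uint16_t src_port;
  uint16_t dst_port;
  uint8_t protocol;
  uint8_t ttl;
  uint8_t tcp_flags;
  uint16_t ip_len;
  uint32_t packet_cnt;
};

/* Formats one record as a single ASCII line. Returns the value of
 * snprintf: the number of characters needed, or a negative value on error. */
static int demo_format_flowtuple(char *buf, size_t len,
                                 const char *ip_a, const char *ip_b,
                                 const struct demo_flowtuple *ft)
{
  return snprintf(buf, len,
                  "%s|%s"
                  "|%"PRIu16"|%"PRIu16
                  "|%"PRIu8"|%"PRIu8"|0x%02"PRIx8
                  "|%"PRIu16
                  ",%"PRIu32"\n",
                  ip_a, ip_b,
                  ntohs(ft->src_port), /* convert to host byte order */
                  ntohs(ft->dst_port),
                  ft->protocol,
                  ft->ttl,
                  ft->tcp_flags,
                  ntohs(ft->ip_len),
                  ntohl(ft->packet_cnt));
}
```

The PRIu16-style macros from inttypes.h keep the conversion specifiers matched to the fixed-width field types regardless of platform.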

Closing a File

To close an output file, simply call corsaro_file_close and pass a pointer to the corsaro state, and a pointer to the corsaro_file_t structure which represents the file to be closed.

Geolocation Framework

Version 2.0.0 of Corsaro added a geolocation framework which simplifies the code needed for a plugin to augment a packet with geolocation-type meta-data. It also provides consumers with a well-defined mechanism for retrieving the meta-data associated with a packet.

Creating a Geolocation Provider

The geolocation framework consists of three major pieces:

  • Providers
  • Data-Structures
  • Records

Providers

A geolocation provider is a supplier of geolocation information. A provider will likely be implemented in a plugin and will provide a lookup from a specific database for each packet that is processed. For example, the Geolocation plugin implements two providers: Maxmind Geolocation, and Net Acuity Edge Geolocation.

A provider is registered with the geolocation framework using the corsaro_geo_init_provider function, which requires four arguments:

  1. a pointer to the current corsaro object
  2. the ID of the provider to be registered
  3. the ID of the data-structure to use
  4. a boolean value indicating whether to set this provider as the default
    • if this is set to 1, this will be the provider returned by corsaro_geo_get_default (unless a later plugin also asks to be default)

This will create a data-structure of the appropriate type and return an object that can then be used to fill the data-structure with corsaro_geo_record objects.

Data-Structures

To further reduce the implementation cost for plugins, the geolocation framework abstracts the details of the underlying data-structure which is used to map from an IP address prefix (32 bit address and 8 bit mask) to a geolocation record. This allows different data-structure implementations to be used, and easily switched between by plugins. In the current release, the only implemented data-structure is a Patricia Trie.

The specific data-structure used by a provider should affect neither the provider implementation nor the consumer implementation, and is specified at run-time. This allows end-users to choose the most appropriate data-structure for their use-case.

The implemented data-structures are listed in the corsaro_geo_datastructure_id enum.

Records

A geolocation record represents a set of fields that can be associated with any number of IP addresses (prefixes).

Initialization

It is the responsibility of the provider to, at initialization time, create and insert all possible records into the data-structure. To demonstrate this process, consider the implementation of the maxmind geolocation provider.

The maxmind geolocation databases are normalized into two tables. The first is a set of 'location' records which have fields (similar to those in corsaro_geo_record) representing a physical location. The second table is a set of IP prefixes ('blocks'), and the location that each corresponds to.

A packet is matched to a location by determining which prefix best matches the source IP address, and then retrieving the corresponding record. Therefore, when the Geolocation plugin parses the Locations table, it creates a corsaro_geo_record for each row by calling corsaro_geo_init_record, and caches these records temporarily in a hash which maps the ID of the record to the record object.

The Blocks file is then parsed, and the appropriate record is retrieved from the temporary hash by using the ID foreign key in the Blocks file. At this point the record is associated with the prefix by calling corsaro_geo_provider_associate_record. Only one record may be associated with each prefix. When a record is associated with a prefix, it is inserted into the data-structure using the prefix as a key.

Processing

After the provider is initialized and all records have been inserted into the data-structure, packet processing will begin. When a provider processes a packet it must determine which (if any) record matches the address of the packet.

To do this, the corsaro_geo_provider_lookup_record function is used. The returned record can then be used to 'tag' the packet by calling corsaro_geo_provider_add_record. Multiple records can be added for each packet. Also, note that the 'tagged' records are not cleared automatically. The provider must explicitly call corsaro_geo_provider_clear to remove previously tagged records. This should be done before tagging the packet with new records.
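The associate-then-lookup flow can be sketched with a deliberately simple data-structure. The code below uses a small linear table with a longest-prefix-match scan instead of a Patricia Trie, and every name in it (demo_record, demo_table, demo_associate, demo_lookup) is hypothetical rather than part of the geolocation API. Addresses are assumed to be in host byte order, and replacing the record when a prefix is associated twice is our reading of "only one record per prefix".

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PREFIXES 16

/* Hypothetical, minimal stand-in for a geolocation record. */
struct demo_record {
  uint32_t id;
  const char *country;
};

struct demo_entry {
  uint32_t addr;   /* network address, host byte order */
  uint8_t masklen; /* prefix length, 0..32 */
  const struct demo_record *rec;
};

struct demo_table {
  struct demo_entry entries[DEMO_MAX_PREFIXES];
  size_t count;
};

static uint32_t demo_netmask(uint8_t masklen)
{
  return masklen == 0 ? 0u : 0xFFFFFFFFu << (32 - masklen);
}

/* Associates a record with a prefix; an existing prefix has its record
 * replaced so that each prefix maps to exactly one record. */
static int demo_associate(struct demo_table *t, uint32_t addr,
                          uint8_t masklen, const struct demo_record *rec)
{
  size_t i;

  if (masklen > 32) return -1;
  addr &= demo_netmask(masklen);
  for (i = 0; i < t->count; i++) {
    if (t->entries[i].addr == addr && t->entries[i].masklen == masklen) {
      t->entries[i].rec = rec;
      return 0;
    }
  }
  if (t->count == DEMO_MAX_PREFIXES) return -1;
  t->entries[t->count].addr = addr;
  t->entries[t->count].masklen = masklen;
  t->entries[t->count].rec = rec;
  t->count++;
  return 0;
}

/* Returns the record of the longest matching prefix, or NULL if none. */
static const struct demo_record *demo_lookup(const struct demo_table *t,
                                             uint32_t addr)
{
  const struct demo_entry *best = NULL;
  size_t i;

  for (i = 0; i < t->count; i++) {
    const struct demo_entry *e = &t->entries[i];
    if ((addr & demo_netmask(e->masklen)) == e->addr &&
        (best == NULL || e->masklen > best->masklen)) {
      best = e;
    }
  }
  return best == NULL ? NULL : best->rec;
}
```

A linear scan costs time proportional to the number of prefixes per lookup, which is why a trie is the more practical choice for full geolocation databases.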

Using a Geolocation Provider

Retrieving a Provider

If there is a specific provider that is needed (e.g. the pfx2as provider is required by the Prefix-to-AS plugin), then the corsaro_geo_get_by_id or corsaro_geo_get_by_name functions can be used. If the plugin can use any default provider, the corsaro_geo_get_default function should be used. This will allow the end-user to use a different geolocation provider without altering the code.

Retrieving Tagged Records

Once a provider object has been retrieved, the head of the list of tagged corsaro_geo_record objects which match the current packet can be obtained by calling the corsaro_geo_next_record function.

In most cases it will be sufficient to call this function only once to get the first record, but there may be instances where multiple records are tagged (though currently there cannot be multiple records associated with a prefix in the data-structure).

Logging

Corsaro supports limited logging functionality, both to stderr, and to a dedicated log file.

To write to the Corsaro log file (and to stderr if the --enable-debug option is passed to configure), one can simply call the corsaro_log function (corsaro_log_in if operating in Corsaro-In mode). The log file is named using the normal output file template, but using the reserved plugin name, log.

corsaro_log and corsaro_log_in are both based on printf(3), and require 3 arguments:

  1. a string which identifies the function writing to the log
    • __func__ is a predefined identifier that the compiler makes available with the name of the enclosing function
  2. a pointer to the corsaro (or corsaro_in) state
  3. a printf(3) style format string

For example, we could record a message in the log when malloc has failed within the corsaro_alloc_output function:

corsaro_log(__func__, corsaro, "malloc failed");

If this ever gets called, the log will contain a line giving the time at which the log message was issued, the function that issued it, and the actual message. The output for the above example would be similar to the following:

[15:52:43:828] corsaro_alloc_output: malloc failed
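A printf(3)-style logging function of this shape is typically built on the stdarg(3) facilities. The sketch below is a hypothetical illustration, not the corsaro_log implementation: demo_format_log writes the "function: message" part of a log line into a caller-supplied buffer and omits the timestamp and the corsaro state argument.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not corsaro_log): formats "func: message" into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
static int demo_format_log(char *buf, size_t len, const char *func,
                           const char *fmt, ...)
{
  va_list ap;
  int used;
  int n;

  used = snprintf(buf, len, "%s: ", func);
  if (used < 0 || (size_t)used >= len) return -1;

  va_start(ap, fmt);
  n = vsnprintf(buf + used, len - (size_t)used, fmt, ap);
  va_end(ap);
  if (n < 0 || (size_t)n >= len - (size_t)used) return -1;

  return used + n;
}
```

A caller would normally pass __func__ as the func argument, so that the name of the calling function appears in the line without being typed by hand.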

There are other logging functions which may be useful on occasions when a corsaro state pointer is not available. See corsaro_log.h for more information.

Note
The Corsaro Logging framework is very limited and as such will likely be replaced or improved in the future.

Corsaro-Out

The Corsaro-Out mode is used when processing packet traces. In this mode, Corsaro is passed a series of packets, which are in turn passed down a chain of analysis plugins.

The basic process for using libcorsaro in the Corsaro-Out mode is:

  1. Allocate a corsaro instance using corsaro_alloc_output
  2. Optionally call corsaro_set_traceuri, etc. to set parameters
  3. Call corsaro_start_output to initialize the plugins (and create the files)
  4. Call corsaro_per_packet with each packet to be processed
  5. Call corsaro_finalize_output when all packets have been processed

corsaro_alloc_output

To initialize Corsaro-Out, the corsaro_alloc_output function is used. corsaro_alloc_output takes a single argument, a string which describes the template to use for the output files created. The template must contain the %P field, which is replaced with a plugin-specific string, usually the plugin name; the other supported specifiers are described in Opening a File. The suffix of the file name will be checked to determine the appropriate compression (if any) to use. Supported options are .gz (gzip) and .bz2 (bzip2).
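Suffix-based selection of a compression method can be sketched as a simple string comparison. The enum and function names below are hypothetical and do not correspond to libcorsaro or libwandio identifiers; only the two suffixes named above are recognized.

```c
#include <string.h>

/* Hypothetical compression identifiers (not libcorsaro or libwandio names). */
enum demo_compress {
  DEMO_COMPRESS_NONE,
  DEMO_COMPRESS_GZIP,
  DEMO_COMPRESS_BZIP2
};

/* Returns non-zero when name ends with suffix. */
static int demo_has_suffix(const char *name, const char *suffix)
{
  size_t nlen = strlen(name);
  size_t slen = strlen(suffix);

  return nlen >= slen && strcmp(name + (nlen - slen), suffix) == 0;
}

/* Picks a compression method from the file name suffix. */
static enum demo_compress demo_compression_for(const char *name)
{
  if (demo_has_suffix(name, ".gz")) return DEMO_COMPRESS_GZIP;
  if (demo_has_suffix(name, ".bz2")) return DEMO_COMPRESS_BZIP2;
  return DEMO_COMPRESS_NONE;
}
```

Because only the end of the name is inspected, a template such as %N.%P.cors.gz selects gzip no matter what the specifiers expand to.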

Upon allocation, libcorsaro sends allocate events to each of the enabled plugins. For the moment, each plugin simply passes back a reference to a static structure describing properties of the plugin, such as the name and unique ID. This allows plugins to be dynamically enabled and disabled based on name. A pointer to an initialized corsaro opaque structure is then passed back to the user. Every function in libcorsaro that references some state must be passed a pointer to the appropriate structure. This is usually a pointer to a corsaro structure. That is to say, there is no global state in libcorsaro, thus potentially allowing multiple instances of Corsaro-Out to be used simultaneously. While this functionality is fully implemented, no effort has thus far been made to make Corsaro thread-safe.

corsaro_set_[option]

Using this object, the user may then optionally call functions to set parameters such as corsaro_set_traceuri, corsaro_set_interval and corsaro_set_monitorname.

Enabling output rotation

As of version 2.0.0, Corsaro supports the rotation of output files.

To enable output rotation for all files, call corsaro_set_output_rotation and pass the number of intervals after which files should be rotated.

To use a different rotation interval for the meta-data output files (Global Output File and Log File), use the corsaro_set_meta_output_rotation function.

To "align" the intervals to a multiple of the interval length use the corsaro_set_interval_alignment function. For example, given a 1 minute interval, intervals will end at whole minute values (10:01, 10:02, etc.). If this option is not used (the default), intervals will end based on the time of the first packet received.

corsaro_start_output

Once all options have been set, the corsaro_start_output function is called, again passing a pointer to the corsaro structure returned by corsaro_alloc_output. The plugin manager is now started, which scans the list of available plugins, and issues an initialization event to all enabled plugins. Each plugin then establishes any state needed for analysis, opens output files, etc. The Global Output File is also opened at this time. This file contains meta-data about the parameters provided to libcorsaro, such as the trace URI, interval duration, start time, finish time and active plugins. It also provides a common place for all plugins to write meta-data to.

corsaro_per_packet

libcorsaro is now ready to start processing packets. The user now simply calls corsaro_per_packet, passing a reference to a libtrace packet object. The first packet will cause a start interval event to be sent to each plugin. Based on the timestamp in this first packet and the interval specified (either by the default in corsaro_int.h or using the corsaro_set_interval function), Corsaro will determine the time at which the interval should complete. The packet is then passed down the chain of plugins.

The corsaro_per_packet function should be repeatedly called, once for each packet to be processed. Once a packet is received which has a timestamp outside the calculated interval bounds, the plugins are notified with an interval end event, and they (optionally) write out aggregated data for the interval. If the time of the packet which triggered the interval end event exceeds the calculated interval end by more than the interval length (i.e. at least an entire interval passed before receiving the packet), interval start and end events will be generated until the packet's timestamp is within the current interval - effectively generating empty intervals. This ensures that plugins receive interval events for every possible interval between the start and end times.
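The interval bookkeeping described above, including the generation of empty intervals, can be sketched as a small state machine. The demo_intervals structure and the demo_per_packet and demo_finalize functions are hypothetical illustrations, not libcorsaro code. They count events instead of notifying plugins, ignore interval alignment, and assume an interval length of at least 1 second. demo_finalize models the end of a run, where the final interval ends at the truncated time of the last packet.

```c
#include <stdint.h>

/* Hypothetical stand-in for the interval bookkeeping (not libcorsaro). */
struct demo_intervals {
  uint32_t length;   /* interval length in seconds, must be >= 1 */
  uint32_t start;    /* start time of the current interval */
  uint32_t last_end; /* end time reported by the most recent end event */
  uint32_t last_pkt; /* truncated timestamp of the most recent packet */
  unsigned starts;   /* number of interval start events issued */
  unsigned ends;     /* number of interval end events issued */
  int started;       /* non-zero once the first packet has been seen */
};

/* Processes one packet timestamp, issuing start and end events as needed. */
static void demo_per_packet(struct demo_intervals *s, double ts)
{
  uint32_t sec = (uint32_t)ts; /* truncate down to the nearest second */

  if (!s->started) {
    /* the first packet opens the first interval */
    s->started = 1;
    s->start = sec;
    s->starts++;
  }

  /* close intervals (possibly empty ones) until the packet fits */
  while (sec > s->start + s->length - 1) {
    s->last_end = s->start + s->length - 1;
    s->ends++;
    s->start += s->length;
    s->starts++;
  }
  s->last_pkt = sec;
}

/* Ends the final interval at the time of the last packet received. */
static void demo_finalize(struct demo_intervals *s)
{
  if (s->started) {
    s->last_end = s->last_pkt;
    s->ends++;
  }
}
```

With a 60 second interval starting at 1325390400, a packet arriving at 1325390660 after a gap would close the intervals ending at 1325390519, 1325390579 and 1325390639, two of which contain no packets.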

corsaro_finalize_output

Once all packets have been processed, libcorsaro is notified using the corsaro_finalize_output function. The current interval is ended, using the timestamp of the last packet received (this is the only interval that may be shorter than the configured duration). The trailers are then written to the Global Output File which is then closed. Each plugin is sent a finalize event which allows them to free state and close output files. The corsaro structure is then freed.

Corsaro-In

The Corsaro-In mode is used when processing data which has been created by a Corsaro plugin in the Corsaro-Out mode. This allows further analysis to be carried out on the aggregated data generated by a plugin.

The basic steps for using libcorsaro in the Corsaro-In mode are:

  1. Allocate a corsaro_in instance using corsaro_alloc_input
  2. Call corsaro_start_input to open the input file and load the appropriate plugin
  3. Call corsaro_in_read_record until all records have been read
    • For each record returned, cast to the appropriate type and carry out any processing required
  4. Call corsaro_finalize_input when all records have been processed

corsaro_alloc_input

To initialize Corsaro-In, the corsaro_alloc_input function is used. corsaro_alloc_input takes a single argument - the file to be read. As with the initialization process for Corsaro-Out, the plugin manager generates a set of properties which describe the available plugins, but none of them are initialized until corsaro_start_input is called. Once the necessary data structures have been created, corsaro_alloc_input returns a pointer to a corsaro_in opaque structure that will be used to maintain state as described in the Corsaro-In section.

corsaro_start_input

The corsaro_start_input function can then be called with the corsaro_in pointer returned by corsaro_alloc_input. When corsaro_start_input is called, the input file (as passed to corsaro_alloc_input) is opened, and Corsaro attempts to find the correct plugin to read the data contained in the file.

There are two methods that are used to identify the plugin to use to read an input file. The first is by searching the file name for a string that identifies the plugin that created it. For example, all files generated by the FlowTuple plugin contain the string "flowtuple" in the name (unless manually renamed after creation). If no plugin name is found in the file name, the first few bytes of the file are inspected in an attempt to find a magic number that identifies a plugin. For example, the FlowTuple plugin searches for the number 0x53495854 (or 0x53495855 if compiled without /8 optimizations) beginning in the 14th byte of the file (after the corsaro magic numbers).
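The two-step detection heuristic can be sketched as follows. All function names are hypothetical and not part of libcorsaro. The sketch assumes the magic number is stored in big-endian (network) byte order, and it takes the byte offset as a parameter because the exact position depends on the file layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FLOWTUPLE_MAGIC 0x53495854u /* value quoted in the text above */

/* Step 1: does the file name contain the plugin name? */
static int demo_name_matches(const char *filename, const char *plugin)
{
  return strstr(filename, plugin) != NULL;
}

/* Step 2: do the four bytes at the given offset equal the magic number?
 * Assumes the magic is stored in big-endian (network) byte order. */
static int demo_magic_matches(const uint8_t *head, size_t len,
                              size_t offset, uint32_t magic)
{
  uint32_t v;

  if (len < 4 || offset > len - 4) return 0;
  v = ((uint32_t)head[offset] << 24) |
      ((uint32_t)head[offset + 1] << 16) |
      ((uint32_t)head[offset + 2] << 8) |
      (uint32_t)head[offset + 3];
  return v == magic;
}

/* Tries the file name first and falls back to the magic number. */
static int demo_is_flowtuple(const char *filename, const uint8_t *head,
                             size_t len, size_t offset)
{
  if (demo_name_matches(filename, "flowtuple")) return 1;
  return demo_magic_matches(head, len, offset, DEMO_FLOWTUPLE_MAGIC);
}
```

Checking the name first avoids reading the file at all in the common case; the magic number covers files that were renamed after creation.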

The global output file, described in the Corsaro-Out section, is treated as a special plugin and detected using the same heuristics as other plugins.

corsaro_in_read_record

Once Corsaro-In has been started, corsaro_in_read_record can then be called to read the first record in the file. corsaro_in_read_record takes three parameters:

  1. corsaro, a corsaro_in instance pointer
  2. record_type, a pointer to a corsaro_in_record_type_t value
  3. record, a pointer to an allocated corsaro_in_record_t object

The corsaro_in pointer contains the state of this Corsaro-In instance. It is the same pointer that was returned by corsaro_alloc_input, and will be needed for almost all Corsaro-In functions.

The corsaro_in_record_type_t pointer (record_type) is used both for input to, and output from, the function. If the value it points to is set on input, it will act as a filter, and only records with that type will be returned. Setting the pointed-to value to NULL on input acts as a wildcard, and the next record from the file will be returned, regardless of type. When the function returns, the value will be set to the type of the record. See corsaro_in_record_type for the list of possible values.

The final parameter is a pointer to a corsaro_in_record_t structure. A corsaro_in_record structure provides a reusable buffer for reading data from files. It is allocated by using the corsaro_in_alloc_record function, and should be reused for each call to corsaro_in_read_record.

If successful, corsaro_in_read_record returns the number of bytes read from the file. A return value of 0 indicates EOF, and -1 indicates that an error occurred. If the function was successful, the record_type parameter will be set to the type of record read, and the data will have been loaded into the record parameter. The corsaro_in_get_record_data function returns a void pointer to the record data. This pointer can be cast to the appropriate type by checking the value of the record_type parameter. See the note in the corsaro_in_record_type documentation for more information about how to cast this pointer.

The corsaro_in_read_record function should be called repeatedly until the desired record(s) have been read, or until 0 (EOF) is returned.
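The calling convention described above (a type parameter that filters on input and reports on output, plus a return value of bytes read, 0 for EOF, or -1 for an error) can be sketched against an in-memory buffer. Everything below is hypothetical: the record layout (one type byte, one length byte, then the payload), the DEMO_TYPE_ANY wildcard and the demo_read_record function are illustrative and do not reflect the Corsaro file format or API.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_TYPE_ANY 0 /* hypothetical wildcard value */

/* Hypothetical reader state over an in-memory buffer. */
struct demo_in {
  const uint8_t *buf;
  size_t len;
  size_t pos;
};

/* Reads the next record whose type matches *type (or any record when *type
 * is DEMO_TYPE_ANY). On success, sets *type and *data and returns the size
 * of the returned record in bytes. Returns 0 at end of input and -1 if the
 * input is malformed. */
static long demo_read_record(struct demo_in *in, int *type,
                             const uint8_t **data)
{
  int wanted = *type;

  while (in->pos < in->len) {
    size_t remaining = in->len - in->pos;
    const uint8_t *rec = in->buf + in->pos;
    uint8_t rec_type;
    uint8_t rec_len;

    if (remaining < 2) return -1; /* truncated header */
    rec_type = rec[0];
    rec_len = rec[1];
    if (remaining - 2 < rec_len) return -1; /* truncated payload */

    in->pos += 2 + (size_t)rec_len;
    if (wanted == DEMO_TYPE_ANY || wanted == rec_type) {
      *type = rec_type;
      *data = rec + 2;
      return 2 + (long)rec_len;
    }
    /* type did not match the filter: skip this record */
  }
  return 0;
}
```

A typical caller would loop until the function returns 0, resetting the type argument before each call, since it is overwritten with the type of the record that was read.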

corsaro_finalize_input

Once all desired records have been read, corsaro_finalize_input should be called to shut down the plugins, close the input file, and free any allocated memory. At this point the corsaro_in state object is freed and must no longer be used.

Plugins

Corsaro has been designed to facilitate the easy addition of packet analysis logic. This is implemented by a set of plugins that each contain some specific logic for analyzing packets and generating aggregated output.

Plugin Events

At a high level, the plugin interface is a simple set of event notifications from libcorsaro. There are also several other functions which can optionally be implemented if the plugin complies with the Corsaro-In API for de-serializing data.

The events that Corsaro-Out issues to a plugin are:

  • Initialize Output
    • Establish state and open any output files needed.
  • Close Output
    • Close all output files and free state.
  • Start Interval
    • Establish state for a new interval.
    • Open any output files that were rotated.
  • End Interval
    • Write out any data for the current interval.
    • Close any output files that must be rotated.
  • Process Packet
    • Analyze the given packet and possibly update state.

The Process Packet events are issued in sequence to each plugin, in order from highest to lowest priority. This allows plugins to pass state down the chain by augmenting the corsaro_packet_state structure contained in the corsaro_packet structure passed with the event. Passing state between plugins minimizes rework. For example, the RS DoS plugin uses the class determined by the FlowTuple plugin to detect when a packet has been classified as backscatter.

Run-Time Options

As of version 2.0.0, plugins also support run-time options. Currently options are implemented using a string of command-line-style options which is given to the corsaro_enable_plugin function along with the name of the plugin.

This string will be parsed, and an argv array (along with argc) will be populated in the appropriate corsaro_plugin structure. These may then be parsed using the standard getopt(3) functions.
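The path from an option string to parsed values can be sketched with strtok(3) and getopt(3). The function names and the -a and -i options below are hypothetical and are not options of any Corsaro plugin. The sketch splits on single spaces only (no quoting support) and resets optind so that getopt can be called more than once in the same process.

```c
#define _POSIX_C_SOURCE 200809L /* for getopt when compiling in strict mode */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_MAX_ARGS 16

/* Splits opts (modified in place) on spaces into argv, using the plugin
 * name as argv[0]. Returns argc. Quoting is not supported in this sketch. */
static int demo_split_args(const char *name, char *opts, char **argv, int max)
{
  int argc = 0;
  char *tok;

  argv[argc++] = (char *)name;
  for (tok = strtok(opts, " "); tok != NULL && argc < max - 1;
       tok = strtok(NULL, " ")) {
    argv[argc++] = tok;
  }
  argv[argc] = NULL;
  return argc;
}

/* Parses the hypothetical options -a (flag) and -i <seconds>.
 * Returns 0 on success, -1 on an unknown option or missing argument. */
static int demo_parse_args(int argc, char **argv, int *interval, int *align)
{
  int opt;

  optind = 1; /* restart option scanning */
  while ((opt = getopt(argc, argv, "ai:")) != -1) {
    switch (opt) {
    case 'a':
      *align = 1;
      break;
    case 'i':
      *interval = atoi(optarg);
      break;
    default:
      return -1;
    }
  }
  return 0;
}
```

Placing the plugin name in argv[0] matters because getopt starts scanning at argv[1] and uses argv[0] when printing diagnostics.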

For more information about the structure of a plugin and instructions for creating a new plugin, see the Creating a New Plugin tutorial.

Tools

In the Corsaro architecture, Tools are pieces of software that use the libcorsaro library to provide functionality to users. As with libcorsaro, these tools fall into two broad categories:

  1. Tools that read trace data and analyze packets
    • use libcorsaro in Corsaro-Out mode
    • e.g. the corsaro tool
  2. Tools that read Corsaro data and perform further transformation/analysis
    • use libcorsaro in Corsaro-In mode
    • e.g. the cors2ascii tool

For more details about these and other tools provided in the Corsaro package, see the Tools page.