animation-bear dataset

HPL is pleased to announce the release of a large collection of traces collected at a feature animation company.

Several different types of data were collected. We are in the process of making all of it available, so some dangling links are to be expected.

All of the traces are stored in the DataSeries format. Current versions of the conversion programs and some tools for analyzing these traces are available as part of the DataSeries source distribution. The source distribution is available from http://tesla.hpl.hp.com/opensource/ in the files DataSeries-2008-02-27.tar.bz2 and Lintel-2008-02-27.tar.bz2 (or later versions). Trace examples and some data will be available from http://iotta.snia.org/traces/ and http://apotheca.hpl.hp.com/pub/datasets.

If you write a paper using these traces, or create an interesting analysis, we request that you send mail to software@cello.hpl.hp.com with a copy of the paper or analysis. We would like to be able to show customers the benefit of letting us make traces of their systems available, and we would like to incorporate analyses into the DataSeries distribution so that other people can take advantage of them. You can also send questions about the traces to that address and to skottie@pobox.com; we may be able to answer more detailed questions such as "does this encrypted filename map to ..." or "is this group of encrypted strings related in some way?"

Details on the different data types:

* NFS

The NFS data is divided into two groups: nfs-1, collected from fall
2003 to spring 2004, and nfs-2, collected from fall 2006 to spring 2007.
Filenames were encrypted during trace conversion.

  * NFS-1 -- ~55 billion requests or replies

Each set of data is a single contiguous trace from a single network
location.  In general, the network tracing dropped relatively few
packets (a few percent), although detailed statistics on drops are only
available for some of the traces.  If present, the files
ifconfig-interval.out, lindump-mmap-network.out,
slow-copy-network.out, and trace-network.log provide details on the
tracing process.  During processing, a few of the individual trace
files were lost; these can be easily identified by gaps in the
numbering within an individual set of data.
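
The gap check described above can be sketched in a few lines of Python.
This is a minimal sketch, assuming the file numbers have already been
parsed out of the file names into integers; the function name
find_missing_numbers is illustrative and is not part of the DataSeries
tools.

```python
def find_missing_numbers(present):
    """Return the sorted file numbers absent between the lowest and highest seen.

    `present` is any iterable of integer file numbers. How the numbers are
    extracted from the trace file names is an assumption left to the caller.
    """
    seen = set(present)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Illustrative numbers only: a run of files with two missing in the middle.
print(find_missing_numbers([23917, 23918, 23921, 23922]))
```

Note that this only finds gaps inside the observed range; files lost
before the first or after the last surviving file cannot be detected
this way.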

What was traced:
  set-[0-7]: A pair of render racks, nominally 79 clients.
  set-[8-10]: A version control server
  set-[11,20]: A commercial relational database server
  set-[12,13,15-19]: Various NFS servers
  set-14: An NFS cache
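
For scripts that process several sets at once, the list above can be
written out as a lookup table.  This is a sketch under one assumption
about the bracket notation: "[0-7]" is read as the range 0 through 7,
and "[11,20]" as the two sets 11 and 20.  The names NFS1_SETS and
describe_nfs1_set are illustrative.

```python
# What each nfs-1 set traced, transcribed from the list above.
# Assumption: "[a-b]" is an inclusive range and "[a,b]" lists individual sets.
NFS1_SETS = {}
for n in range(0, 8):
    NFS1_SETS[n] = "render racks"
for n in (8, 9, 10):
    NFS1_SETS[n] = "version control server"
for n in (11, 20):
    NFS1_SETS[n] = "commercial relational database server"
for n in (12, 13, 15, 16, 17, 18, 19):
    NFS1_SETS[n] = "NFS server"
NFS1_SETS[14] = "NFS cache"


def describe_nfs1_set(n):
    """Return the description of nfs-1 set number `n` (KeyError if unknown)."""
    return NFS1_SETS[n]


print(describe_nfs1_set(14))
```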

Known issues:
  set-[0-11]: due to an error in conversion not discovered until after
    the original data was removed, the IP extents are missing entries for
    IP fragment packets.
  set-6: a few converted files were lost (23919, 23920).
  set-11: NFS V3 operation parsing was not implemented when this dataset
    was converted, so only the common NFS fields were extracted.

  * NFS-2

Each data set here is also a contiguous trace.  Due to improvements in
the tracing technology (a switch from 1 Gbit to 10 Gbit tracing), these
traces have much more data collected over a shorter period of time.
Conversion statistics are now part of the DataSeries files.  Drops were
even more negligible than before: a few thousand packets dropped out
of billions.  The capture-log file for each trace is the output from
the capture tool during that trace.  Each of these trace sets covers
about 1 week of tracing.

What was traced:
  set-[0-3]: A number of render racks; we selected the higher performing
    (newer) machines to trace, much as was done earlier.
  set-4: A collection of NFS caches.

* LSF

LSF accounting records from three different clusters {ers,rwc,gld} were
collected.  Most of the columns in this dataset are encrypted, but all
that generally matters is determining whether two jobs were for the same
production, sequence, shot, object, etc.  The encryption was
consistent with the encryption in the filesystem traces, so many
directory names in the filesystem traces can be found as production,
sequence, shot, and object entries in the LSF traces.
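
Because the encryption is consistent, equal encrypted strings can be
treated as equal names, so jobs can be grouped without decrypting
anything.  The following is a minimal sketch, assuming each job has
already been loaded as a dictionary; the keys "job_id", "production",
"sequence", and "shot" are hypothetical and may not match the actual
column names in the LSF data.

```python
from collections import defaultdict


def group_jobs(jobs, fields=("production", "sequence", "shot")):
    """Group job ids by the tuple of (encrypted) values found in `fields`."""
    groups = defaultdict(list)
    for job in jobs:
        key = tuple(job[name] for name in fields)
        groups[key].append(job["job_id"])
    return dict(groups)


# Made-up records; the short hex strings stand in for encrypted column values.
jobs = [
    {"job_id": 1, "production": "a3f1", "sequence": "09bc", "shot": "77e0"},
    {"job_id": 2, "production": "a3f1", "sequence": "09bc", "shot": "77e0"},
    {"job_id": 3, "production": "a3f1", "sequence": "5d21", "shot": "10aa"},
]
for key, ids in group_jobs(jobs).items():
    print(key, ids)
```

Passing a shorter tuple such as fields=("production",) groups at a
coarser level.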

The ers cluster was only active in 2003-2005, so complete statistics
are available for that cluster.  The rwc cluster had statistics going
back to 1999; however, early use of the cluster had great variety in
the job naming conventions, making it less likely that the features
were successfully extracted.  The gld cluster had statistics going back
to 2004.

* SAR

The system accounting records for the ers cluster were collected over a
subset of the time that the cluster was operating.  Over 100
statistics, such as disk, CPU, and network usage, were collected each
second.

* IP-hostname

At a few points in time, we recorded a mapping from hostnames to
IP addresses.  This can be used to correlate the NFS and LSF data.  We
have a snapshot from 2003 and a snapshot from 2007.  Some of the 2003
traces were taken after an IP renumbering performed at the customer
site; unfortunately, we did not keep a snapshot during that interval.
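
Since an address seen in a trace may not resolve the same way in both
snapshots, one cautious approach is to look it up in every snapshot and
keep all candidates rather than a single answer.  This is a minimal
sketch, assuming each snapshot has been loaded and inverted into a
dictionary from IP address to hostname; the addresses, hostnames, and
the function name candidate_hosts are all made up for illustration.

```python
def candidate_hosts(ip, snapshots):
    """Return {snapshot_label: hostname} for every snapshot that lists `ip`."""
    return {label: table[ip] for label, table in snapshots.items() if ip in table}


# Made-up data: snapshot label -> {ip address: hostname}.
snapshots = {
    "2003": {"192.0.2.10": "render-a"},
    "2007": {"192.0.2.10": "render-b", "192.0.2.11": "cache-a"},
}
print(candidate_hosts("192.0.2.10", snapshots))
```

An address that returns different hostnames from the two snapshots, or
none at all, should be treated as ambiguous when correlating traces.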