Latest News and Events

The SAMSI-FODAVA Workshop on Interactive Visualization and Analysis of Massive Data will be held on December 10-12, 2012.
Posted: October 02, 2012
The FODAVA Annual Meeting (Dec 12-13) will immediately follow the SAMSI/FODAVA joint workshop at the same location.
Posted: September 05, 2012
Many of the modern data sets such as text and image data can be represented in high-dimensional vector spaces and have benefited from computational methods that utilize advanced techniques from num
Posted: June 30, 2012

Data Sets

(ACLED) Armed Conflict Location and Event Database

World event data with time, location, casualty count, etc. Relatively low volume and only for specific countries, although there are over 70 sets of data. ACLED data are presented in two forms. The first is a simple Excel sheet called “Country_X”, which gives all information on the politically violent events in which actors from this country are involved (even if abroad). The Shapefile for each country is based on the Full Excel file.

Airplane Crashes Data Set

Over 5,000 Geo-temporal points with time, location, and other metadata information.

AT&T Face Database

This Database of Faces (formerly 'The ORL Database of Faces') contains a set of face images taken between April 1992 and April 1994 at the lab. The database was used in the context of a face recognition project carried out in collaboration with the Speech, Vision and Robotics Group of the Cambridge University Engineering Department.

Cloud Images Data Set (MATLAB - 1.1 MB) and documentation

Contains 41 cloud images generated for weather forecasting. These cloud images are used to test clustering algorithms that segment an image into clusters of clouds. The shapes of the cloud clusters that human vision tends to perceive are highly non-elliptical. This poses difficulty for many widely used clustering algorithms, such as k-means or mixture-model-based clustering, which implicitly assume Gaussian-type clusters.
for M. Qiao and J. Li, "Two-way Gaussian mixture models for high dimensional classification"

CNN Transcript Collection (2000-2012) (April 25, 2012)

This is a collection of CNN's (Cable News Network) publicly provided transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research. The compressed download file is approximately 1 GB.

Enron Email Data Set

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

Federal Election Commission Data Set

Contribution records with contributor, candidate, committee and summary data files. The detailed annual files contain all individual contributions and make up the bulk of the data, which covers 10 years.

Flickr Shapefiles Data Set

Dataset that was built to identify places by name using tags on images. Version 2.0 has shapes for roughly two hundred and seventy thousand (270K) WOE IDs.

Flight Info Data Set

Dataset provides a record of every US flight between 1987 and 2008 along with reported delays, arrival times, and other flight details. The data is massive, is broken down into individual years, and is very useful for historical model fitting.

Foursquare - Friend Network

This dataset is a graph of friendships among the users of Foursquare, a location-based online social network. The graph contains 106,218 nodes and 3,473,834 edges (11 MB).

KDD Cup 1999: Data

Computer network intrusion detection for machine learning (18 MB uncompressed data) from the KDD Cup Challenge from 1999.

KDD Cup 2003: Data

Network mining and usage log analysis. There are approximately 29,000 hep-th papers with 1.7 GB of data. The papers have been compressed to about 500 MB and divided into separate years for downloading.

KDD Cup 2005: Data

The data set consists of 800,000 unique search queries from end-user internet search activities. The data is in a text file, one query per line (7.5 MB).

Kiva (or Processed Data Including Term-Document Matrices for Loan Description and Other Text Fields)

Lending information for microloans which includes locations as well as payback rates and full justification for loans. The data is archived nightly so it is most useful for apps that don't require live data, such as data analyses and visuals.

Microsoft's GeoLife GPS Trajectories

This GPS trajectory dataset was collected in the Microsoft Research Asia GeoLife project by 182 users over a period of more than five years (from April 2007 to August 2012). Each GPS trajectory in this dataset is represented by a sequence of time-stamped points, each of which contains latitude, longitude and altitude information. This dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours.
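
As an illustration of how totals such as distance and duration can be derived from time-stamped points, the following is a minimal Python sketch. It assumes each point has already been parsed into a hypothetical (timestamp, latitude, longitude, altitude) tuple; the actual file layout of the dataset is not described here, and the haversine formula with a mean Earth radius of 6371 km is only one reasonable choice for the distance.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trajectory_summary(points):
    """Return (distance_km, duration_hours) for one trajectory.

    `points` is a time-ordered list of (timestamp, lat, lon, alt) tuples.
    Altitude is carried along but ignored in the distance computation.
    """
    if len(points) < 2:
        return 0.0, 0.0
    distance_km = sum(
        haversine_km(p[1], p[2], q[1], q[2]) for p, q in zip(points, points[1:])
    )
    duration_hours = (points[-1][0] - points[0][0]).total_seconds() / 3600.0
    return distance_km, duration_hours


# Hypothetical two-point trajectory along the equator, one hour apart.
sample = [
    (datetime(2008, 1, 1, 0, 0), 0.0, 0.0, 10.0),
    (datetime(2008, 1, 1, 1, 0), 0.0, 1.0, 12.0),
]
print(trajectory_summary(sample))
```

Summing the per-trajectory results over all trajectories would give dataset-level totals comparable to the figures quoted above.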

OpenStreetMap

This is a feed of individually submitted GPS traces. There are over 2 billion of these points, although they must be individually downloaded and processed into a larger data set.

SNAP Brightkite Data Set

A location-based social networking data set where users shared their locations by checking in. The friendship network was collected using Brightkite's public API, and consists of 58,228 nodes and 214,078 edges. The network was originally directed but was converted into a network with undirected edges, with an edge wherever there is a friendship in both directions. There is a total of 4,491,143 check-ins by these users over the period of Apr. 2008 - Oct. 2010.
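
The directed-to-undirected conversion described above can be sketched as follows. This is a minimal Python illustration, not the procedure actually used to build the data set; the function name and the sample edge list are hypothetical.

```python
def reciprocal_undirected_edges(directed_edges):
    """Keep an undirected edge (u, v) only when both u->v and v->u exist.

    Each undirected edge is returned once, with the smaller endpoint first.
    Self-loops are dropped (an assumption of this sketch).
    """
    edge_set = set(directed_edges)
    undirected = set()
    for u, v in edge_set:
        if u != v and (v, u) in edge_set:
            undirected.add((min(u, v), max(u, v)))
    return sorted(undirected)


# Hypothetical directed friendships: only 1<->2 and 3<->4 are mutual.
directed = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (1, 4)]
print(reciprocal_undirected_edges(directed))  # prints [(1, 2), (3, 4)]
```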

SNAP Enron Data Set

A graph derived from about half a million emails. The 36,692 nodes of the network are email addresses; if an address i sent at least one email to address j, the graph contains an undirected edge between i and j (367,662 edges in total). Note that non-Enron email addresses act as sinks and sources in the network, as we only observe their communication with the Enron email addresses.
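
The edge rule described above (an undirected edge between two addresses once at least one email has been sent between them) can be sketched in Python as follows. The input format, a list of hypothetical (sender, recipient) pairs, and the decision to skip messages sent to oneself are assumptions of this sketch, not properties of the published data set.

```python
from collections import defaultdict


def build_email_graph(messages):
    """Build an undirected adjacency map from (sender, recipient) pairs.

    Repeated emails between the same pair collapse into a single edge.
    """
    adjacency = defaultdict(set)
    for sender, recipient in messages:
        if sender == recipient:
            continue  # assumption: ignore self-addressed mail
        adjacency[sender].add(recipient)
        adjacency[recipient].add(sender)
    return adjacency


def count_edges(adjacency):
    """Each undirected edge appears in two adjacency sets, so halve the sum."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2


# Hypothetical messages: a<->b exchanged several emails, a wrote to c once.
messages = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")]
graph = build_email_graph(messages)
print(len(graph), count_edges(graph))  # prints 3 2
```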

SNAP Epinions Data Set

Data of the friend network from Epinions.com which contains 75,789 nodes and 508,837 edges.

SNAP EU Email Communication Network

Data generated from a European research institution, recorded from October 2003 to May 2005. Overall there are 3,038,531 emails between 287,755 different email addresses. The network has 265,214 nodes and 420,045 edges.

SNAP Flickr Data Set

This dataset of 105,938 nodes and 2,316,948 edges is built by forming links between images sharing common metadata from Flickr. Edges are formed between images from the same location, submitted to the same gallery, group, or set, images sharing common tags, images taken by friends, etc.

SNAP Google Webgraph Data Set

The 875,713 nodes represent web pages and the 5,105,039 directed edges represent hyperlinks between them. The data was released in 2002 by Google as part of the Google Programming Contest.

SNAP Gowalla Data Set

This dataset is Stanford's collection of location-based social networking check-ins. The friendship network is undirected and was collected using Gowalla's public API, and consists of 196,591 nodes and 950,327 edges. There is a total of 6,442,890 check-ins of these users over the period of Feb. 2009 - Oct. 2010.

SNAP LiveJournal Data Set

This dataset is Stanford's collection of the LiveJournal social network community. The graph contains 68,993,773 edges over 4,847,571 users in their friend network.

SNAP Memetracker Data Set

This dataset is Stanford's collection of 96 million memes from Memetracker. For each document (blog post or news media article), it contains the URL (author), time, memes, and links.

SNAP Patent Citation Network Data Set

Patent citation edge list with 3,774,768 nodes and 16,518,948 edges. The data set spans 37 years (January 1, 1963 to December 30, 1999), and includes all the utility patents granted during that period, totaling 3,923,922 patents. The citation graph includes all citations made by patents granted between 1975 and 1999, totaling 16,522,438 citations.

SNAP Slashdot Data Set

A friend/foe network data set from Slashdot, a social site. This snapshot from February 2009 contains a network with 82,168 nodes and 948,464 edges.

SNAP Twitter Data Set

467 million Twitter posts from 20 million users covering a seven-month period from June 1, 2009 to December 31, 2009. It is estimated to be about 20-30% of all public tweets published on Twitter during that time frame. It contains author, time and content information for each data point.

SNAP Wikipedia Vote Network

A complete dump of Wikipedia page edit history (from January 3, 2008) with the extracted administrator elections and vote history data. It contains 2,794 elections with 103,663 total votes and 7,066 users participating in the elections (either casting a vote or being voted on).

Synthetic Kronecker and Erdos-Renyi Data Sets

Synthetic Kronecker graph generator tool with diameter 2 and synthetic Erdos-Renyi random graph generator tool.

TDT2 (Topic Detection and Tracking Phase 2)

The TDT2 English Corpus has been designed to include six months of material drawn on a daily basis from six English news sources. The period of time covered is from January 4 to June 30, 1998. The six sources are the New York Times News Service, the Associated Press Worldstream News Service, CNN "Headline News", ABC "World News Tonight", Public Radio International's "The World", and the Voice of America.

VAST Benchmark Repository

The repository has the VAST datasets and the materials provided by the teams who submitted entries (their answers, videos and explanations) to the Challenge, and the solutions.

YouTube Data Set

A directed graph of YouTube videos, where each video is a node in the graph. If a video b is in the related video list (first 20 only) of a video a, then there is a directed edge from a to b.
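
The edge rule above can be sketched in Python as follows. The input mapping from a video ID to its ordered related-video list is a hypothetical format chosen for illustration; the data set's actual file layout is not described here.

```python
MAX_RELATED = 20  # only the first 20 related videos produce edges


def build_related_graph(related_lists):
    """Return directed edges (a, b) where b is among the first 20 related videos of a.

    `related_lists` maps a video ID to its ordered list of related video IDs.
    """
    edges = []
    for video, related in related_lists.items():
        for other in related[:MAX_RELATED]:
            edges.append((video, other))
    return edges


# Hypothetical input: video "a" lists 25 related videos, video "b" lists one.
related_lists = {
    "a": [f"v{i}" for i in range(25)],
    "b": ["a"],
}
edges = build_related_graph(related_lists)
print(len(edges))  # prints 21
```

Truncating with a slice keeps the rule in one place, so entries beyond the twentieth never generate edges.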