Efficient Data Reduction and Summarization

Ping Li

The ubiquitous phenomenon of massive data (including data streams) imposes considerable challenges in data visualization and exploratory data analysis. About 15 years ago, terabyte datasets were still considered "ridiculous." However, modern datasets managed by the Stanford Linear Accelerator Center (SLAC), NASA, NSA, etc. have reached the petabyte scale or larger. Corporations such as Amazon, Wal-Mart, eBay, and search engine firms are also major generators and users of massive data. The general theme of data reduction and summarization has become an active and highly interdisciplinary area of research.

This project proposes to develop various approximation techniques that generate a "fingerprint" or "sketch" of the massive data by transforming the original data. These sketches are reasonably small (hence easy to store) and can provide approximate answers that are usually good enough for practical purposes. This proposal concerns the fundamental problems of processing and transforming massive (possibly dynamic) data. In particular, it focuses on (A) developing systematic fundamental tools for effective data reduction and efficient data summarization; (B) applying these tools to improve numerical analysis, visualization, and exploratory data analysis.

Two lines of theoretically sound techniques for data reduction and summarization will be developed and further improved: (1) the method of stable random projections (SRP), effective for heavy-tailed data; (2) the method of Conditional Random Sampling (CRS), mainly for sparse data. Concrete applications of SRP and CRS will be investigated. Widely used basic numerical algorithms can be rewritten to take advantage of SRP or CRS. Popular methods and tools for exploratory data analysis will also benefit considerably from the development of data reduction techniques.
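
To make the idea of a "sketch" concrete, the following is a minimal illustration of one special case of stable random projections, namely Cauchy projections (stability index 1) used to approximate the l1 distance between two vectors. It is a sketch under stated assumptions, not the project's implementation: the function names (cauchy_matrix, sketch, estimate_l1), the sample-median estimator, and the toy sizes are all choices made here for illustration, and only the Python standard library is used.

```python
import math
import random
import statistics


def cauchy_matrix(num_rows, dim, seed=0):
    """Return a num_rows x dim matrix of i.i.d. standard Cauchy entries.

    Hypothetical helper: the standard Cauchy law is the symmetric stable
    law with index alpha = 1, which is the case illustrated here.
    """
    rng = random.Random(seed)
    # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy
    # when U is uniform on (0, 1).
    return [
        [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(dim)]
        for _ in range(num_rows)
    ]


def sketch(vector, projection):
    """Reduce a length-dim vector to len(projection) numbers."""
    return [sum(r * x for r, x in zip(row, vector)) for row in projection]


def estimate_l1(sketch_a, sketch_b):
    """Estimate the l1 distance from two sketches built with one projection.

    Each coordinate of the sketch difference is Cauchy with scale equal to
    the l1 distance, and the median of the absolute value of a Cauchy
    variable equals its scale, so the sample median is a simple estimator.
    """
    return statistics.median(abs(a - b) for a, b in zip(sketch_a, sketch_b))


if __name__ == "__main__":
    data_rng = random.Random(1)
    dim, sketch_size = 5000, 500  # assumed toy sizes
    u = [data_rng.uniform(-1.0, 1.0) for _ in range(dim)]
    v = [data_rng.uniform(-1.0, 1.0) for _ in range(dim)]

    projection = cauchy_matrix(sketch_size, dim, seed=2)
    exact = sum(abs(a - b) for a, b in zip(u, v))
    approx = estimate_l1(sketch(u, projection), sketch(v, projection))
    print(f"exact l1 distance: {exact:.2f}, estimate from sketches: {approx:.2f}")
```

Once the sketches are stored, the distance estimate needs only the two short sketches rather than the original vectors. The sample median is only one possible estimator, chosen here for simplicity; its relative error shrinks roughly in proportion to one over the square root of the sketch size.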