Global Structure Discovery on Sampled Spaces
Over the past decade, the precipitous drop in the cost of disk storage and the build-up of world-wide high-bandwidth fiber optic communications have made massive amounts of data of different modalities (text, images, video) easily available to everyone over the Web. In science, engineering, business, and medicine, high-bandwidth sensors, large-scale simulations, and data collection bots generate immense data sets that need to be analyzed. Making sense of all this disparate data is becoming increasingly challenging. Unlike traditional databases, where data is carefully massaged to adhere to rigid schemata, much of the above data comes unstructured, is often dynamic rather than static, can contain large amounts of noise or even errors, and can be incomplete.

This project aims to develop general, rigorous, and efficient techniques for analyzing massive and distributed sets of unstructured data. The basic aim is to exploit certain ideas from computational topology and geometry in the study of the global structure of large, distributed data sets -- and especially to develop data representations and transformations that make this structure more apparent. Topology studies the connectivity of spaces, so it is global by its very nature. It is able to determine certain connectivity invariants in a way that is unaffected by deformations of an object and does not require explicit parameterizations of the object geometry. Its strength lies, in a sense, in its relative insensitivity to geometric properties, which permits it to discern underlying combinatorial information about how the geometric object is constructed, and therefore to detect some qualitative properties.
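To make the idea of a connectivity invariant on a sampled space concrete, the short Python sketch below (illustrative only, not taken from the project) builds an epsilon-neighborhood graph on a finite point sample and counts its connected components -- a discrete stand-in for the 0-th Betti number that needs no parameterization of the underlying object. The sample data and the radius eps are assumptions chosen purely for the example.

```python
# Minimal sketch (illustrative assumptions throughout): estimate one simple
# topological invariant -- the number of connected components (0-th Betti
# number) -- of a sampled space via an epsilon-neighborhood graph + union-find.

import math
import random

def neighborhood_components(points, eps):
    """Count connected components of the graph joining points closer than eps."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < eps:
                union(i, j)

    return len({find(i) for i in range(len(points))})

# Illustrative data: noisy samples from two separate circles in the plane.
random.seed(0)
def circle_samples(cx, cy, r, n):
    return [(cx + r * math.cos(t) + random.gauss(0, 0.02),
             cy + r * math.sin(t) + random.gauss(0, 0.02))
            for t in (2 * math.pi * k / n for k in range(n))]

cloud = circle_samples(0.0, 0.0, 1.0, 100) + circle_samples(3.0, 0.0, 1.0, 100)
print(neighborhood_components(cloud, eps=0.25))   # expected: 2 components
```

The answer (two components) is insensitive to how the circles are bent or stretched, which is exactly the kind of qualitative, deformation-invariant information described above; richer invariants (higher Betti numbers, persistence) follow the same spirit but require more machinery.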
This type of global analysis can be quite important in understanding the overall structure of data sets. Geometry, though more local by nature, can also be used to study global structure by discovering how parts of an object relate to one another, or how parts of different objects can be similar. For example, the Erlanger program of Felix Klein has, for over a century, fueled mathematicians' interest in invariance under certain group actions as a key principle for understanding geometric spaces.
Such invariances or symmetries can also be key to understanding and reasoning about data sets.
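As a toy illustration of detecting such an invariance in sampled data, the hedged sketch below (again, an assumption-laden example rather than the project's method) tests whether a finite planar point set is approximately invariant under rotation by 2*pi/k about its centroid, a discrete analogue of invariance under a cyclic group action.

```python
# Minimal sketch (illustrative assumptions throughout): test approximate
# invariance of a planar point set under rotation by 2*pi/k about its centroid.

import math

def is_rotation_symmetric(points, k, tol=1e-6):
    """Return True if rotating every point by 2*pi/k about the centroid lands near some original point."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    c, s = math.cos(2 * math.pi / k), math.sin(2 * math.pi / k)

    def rotate(p):
        x, y = p[0] - cx, p[1] - cy
        return (cx + c * x - s * y, cy + s * x + c * y)

    return all(
        min(math.dist(rotate(p), q) for q in points) <= tol
        for p in points
    )

# A square's vertices are invariant under the group of 4 rotations, but not 3.
square = [(1, 0), (0, 1), (-1, 0), (0, -1)]
print(is_rotation_symmetric(square, 4))   # True
print(is_rotation_symmetric(square, 3))   # False
```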
The methods proposed here can be applied in many different settings where massive unstructured data sets arise. In science or engineering, large-scale distributed simulations can produce immense data sets; as an example, consider the Folding@Home project at Stanford, which generates protein folding trajectories using hundreds of thousands of CPUs throughout the world. In business, companies such as Google and Yahoo! have to mine billions of web clicks to develop algorithms for matching ads to web page content or to individual users. In medicine, 3D imaging is becoming commonplace. Medical imaging diagnostic systems, distributed throughout medical offices nationwide, should be able to efficiently share information about shapes of organs and thereby collectively learn whether certain variations are associated with different diagnostic outcomes or treatment successes. In all these cases, understanding the global structure of the data can provide valuable scientific, engineering, or medical insights, enable better business decisions, or lead to more effective medical treatment planning.