Data-Intensive Scalable Computing for Science
Wednesday, February 4, 2009
Building 3 Auditorium - 11:00 AM
(Refreshments at 11:00 AM)
Data analytics "at scale" becomes extremely difficult as dataset sizes increase. These tasks are data intensive in nature: they are constrained by I/O bandwidth and obtain little benefit from abundant computational resources. Internet services companies have developed systems and abstractions to support their search business. Scalable systems such as GFS/HDFS, Map-Reduce and BigTable are used to build distributed applications that process, index, and analyze web-scale datasets. Open-source implementations of these systems, such as Hadoop/HDFS/HBase, are available and in wide use for analyzing unstructured data.
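The Map-Reduce model mentioned above can be illustrated with the classic word-count example. The following is a minimal in-memory sketch of the two user-defined functions and the framework's shuffle step; a real Hadoop job distributes these phases across a cluster, but the programming model is the same. The function and variable names here are illustrative, not part of any Hadoop API.

```python
# Minimal in-memory sketch of the MapReduce programming model.
# A real framework (e.g. Hadoop) runs these phases in parallel across
# many machines; this sketch only shows the data flow.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-defined mapper to every input record."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-defined reducer to each key's value list."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1) pairs, the reducer sums them.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def sum_reducer(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts["the"])  # 3
```

Because the mapper and reducer are the only user-supplied pieces, the framework can transparently partition the input, re-run failed tasks, and scale the same program from one machine to thousands.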
We are exploring the use of these frameworks as building blocks for data-intensive scalable computing systems for science (DISCS) that are easy to program. I will show how we are using Hadoop to solve and better understand "data challenges" in Earth Sciences and Astrophysics. This insight will inform the design and construction of next-generation scalable services for data-intensive scientific computation.
Julio López is a Systems Scientist in the Parallel Data Laboratory at Carnegie Mellon University. His research interests are in systems and application support for data-intensive computing at large scale. His current research focuses on creating scalable approaches for data analytics in high-performance computing. His work includes methods for compressing large seismic wavefields, scalable I/O for ground motion simulations, and indexing techniques for multi-dimensional meshes. He and the rest of the CMU Quake team won the 2006 Supercomputing Analytics Challenge and the 2003 Supercomputing Gordon Bell Award. He obtained his M.S. and Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, and his B.Eng. in Computer Systems from Universidad EAFIT in Medellín, Colombia.
IS&T Colloquium Committee Host: Ben Kobler