Grid Workflow Execution using GENESIS SciFlo
Wednesday, November 30, 2005
Building 3 Auditorium - 3:30 PM
(Refreshments at 3:00 PM)
Brian Wilson will talk about Grid Workflow Execution using GENESIS SciFlo. The General Earth Science Investigation Suite (GENESIS) project is a NASA-sponsored partnership between the Jet Propulsion Laboratory, academia, and NASA data centers to develop a new suite of Web Services tools to facilitate multi-sensor investigations in Earth System Science. The goal of GENESIS is to enable large-scale, multi-instrument atmospheric science using combined datasets from the AIRS, MODIS, MISR, and GPS sensors. Investigations include cross-comparison of spaceborne climate sensors, cloud spectral analysis, study of upper troposphere-stratosphere water transport, study of the aerosol indirect cloud effect, and global climate model validation. The challenges are to bring together very large datasets, reformat and understand the individual instrument retrievals, co-register or re-grid the retrieved physical parameters, perform computationally intensive data fusion and data mining operations, and accumulate complex statistics over months to years of data. To meet these challenges, we have developed a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data access, subsetting, registration, mining, fusion, compression, and advanced statistical analysis.
SciFlo is a system for Scientific Knowledge Creation on the Grid using a Semantically-Enabled Dataflow Execution Environment. SciFlo leverages Simple Object Access Protocol (SOAP) Web Services and the Grid Computing standards (WS-* and Globus Alliance toolkits), and enables scientists to do multi-instrument Earth Science by assembling reusable Web Services and native executables into a distributed computing flow (tree of operators). The SciFlo client and server engines optimize the execution of such distributed data flows and allow the user to transparently find and use datasets and operators without worrying about the actual location of the Grid resources. The scientist injects a distributed computation into the Grid by simply filling out an HTML form or directly authoring the underlying XML dataflow document, and results are returned directly to the scientist's desktop. Once an analysis has been specified for a chunk or day of data, it can be easily repeated with different control parameters or over months of data.
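As a rough illustration of what a "tree of operators" means, the sketch below evaluates a small operator tree in Python, running sibling branches concurrently. It is not SciFlo code: the `Operator` class, the `execute` function, the operator names, and the sample temperature values are all hypothetical, and local threads stand in for the distributed Grid services a real engine would call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Operator:
    """One node of the tree: a function applied to the outputs of its inputs.

    Hypothetical structure for illustration; not the SciFlo data model.
    """
    name: str
    func: Callable[..., Any]
    inputs: List["Operator"] = field(default_factory=list)


def execute(node: Operator) -> Any:
    """Evaluate a node after evaluating its input subtrees concurrently."""
    if not node.inputs:
        # Leaf operator: produces data without upstream inputs.
        return node.func()
    # Each node gets its own small pool, so nested calls never wait on
    # a worker slot held by their parent.
    with ThreadPoolExecutor(max_workers=len(node.inputs)) as pool:
        args = list(pool.map(execute, node.inputs))
    return node.func(*args)


# Leaves return made-up temperatures (K); a real leaf would call a data service.
airs = Operator("airs_temperature", lambda: [220.0, 230.0, 240.0])
gps = Operator("gps_temperature", lambda: [221.0, 229.0, 242.0])

# Interior operators combine the outputs of their inputs.
difference = Operator(
    "difference",
    lambda a, b: [x - y for x, y in zip(a, b)],
    inputs=[airs, gps],
)
mean_bias = Operator(
    "mean_bias",
    lambda d: sum(d) / len(d),
    inputs=[difference],
)

result = execute(mean_bias)
print(f"{mean_bias.name}: {result:.3f} K")
```

Because the analysis is described as data (the tree) rather than as a fixed script, rerunning it with different control parameters or over a longer time span only requires swapping the leaf operators.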
We will discuss the design issues and solutions used in the implementation of SciFlo, including XML dataflow documents, heavy use of XML datatyping and semantic web ontologies, parallel dataflow execution engines, data access simply by naming objects, and distributed catalog lookup of operator bundles. To illustrate the SciFlo concepts, an example dataflow will be demonstrated in which atmospheric temperature and water vapor profiles from the AIRS, GPS, and MODIS instruments are retrieved using SOAP (data query and access) services, co-registered, and visually and statistically compared on demand. Such cross-validation analyses have already been run for years of GPS and AIRS retrievals (see http://sciflo.jpl.nasa.gov for more information).
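The co-registration and statistical comparison step can be pictured with a minimal Python sketch, given here under stated assumptions rather than as the method used in SciFlo. It pairs each reference profile with the nearest profile from a second instrument inside a distance and time window, then computes the bias and RMS difference. The `Profile` type, the 100 km and one-hour thresholds, and the sample values are all hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class Profile(NamedTuple):
    lat: float     # degrees north
    lon: float     # degrees east
    time_s: float  # seconds since an arbitrary epoch
    temp_k: float  # temperature at a single pressure level, in kelvin


def great_circle_km(a: Profile, b: Profile) -> float:
    """Haversine distance between two profile locations, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def match(ref: List[Profile], other: List[Profile],
          max_km: float = 100.0, max_s: float = 3600.0) -> List[Tuple[Profile, Profile]]:
    """Pair each reference profile with the nearest profile inside the window."""
    pairs = []
    for p in ref:
        candidates = [q for q in other
                      if abs(q.time_s - p.time_s) <= max_s
                      and great_circle_km(p, q) <= max_km]
        if candidates:
            nearest = min(candidates, key=lambda q: great_circle_km(p, q))
            pairs.append((p, nearest))
    return pairs


def bias_and_rms(pairs: List[Tuple[Profile, Profile]]) -> Tuple[float, float]:
    """Mean and root-mean-square of the paired temperature differences."""
    if not pairs:
        raise ValueError("no matched profiles to compare")
    diffs = [a.temp_k - b.temp_k for a, b in pairs]
    bias = sum(diffs) / len(diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return bias, rms


# Made-up sample values, for illustration only.
gps = [Profile(10.0, 20.0, 0.0, 250.0), Profile(-30.0, 100.0, 0.0, 240.0)]
airs = [Profile(10.1, 20.1, 600.0, 251.0),
        Profile(10.5, 20.5, 600.0, 255.0),
        Profile(50.0, 50.0, 0.0, 200.0)]

pairs = match(gps, airs)
bias, rms = bias_and_rms(pairs)
print(f"{len(pairs)} matchup(s), bias {bias:+.2f} K, RMS {rms:.2f} K")
```

The brute-force search is quadratic in the number of profiles, which is fine for a sketch. Matching over months or years of retrievals would call for a spatial index or pre-binning by time and location.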
IS&T Colloquium Committee Host: Jim Tilton