Welcome to Penn Provenance
Funded by NIH NIBIB #1U01EB020954-01, "Approximating and Reasoning about Provenance" and NSF ACI-1547360, "Data Provenance: Provenance-Based Trust Management for Collaborative Data Curation".
In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about them to ensure consistent and correct results. Provenance-enabled scientific workflow systems promise to help, yet they are often avoided due to perceived inflexibility, a lack of good provenance analytics tools, and an emphasis on supporting the data consumer rather than the producer. We propose to better incentivize the adoption of workflow and other provenance-tracking tools:
- Instead of requiring a single workflow system across the entire pipeline, which can be inflexible, we allow for integration across multiple autonomous systems (provenance-enabled workflow systems, provenance tracking systems for languages like Python and R, etc.), and even across steps performed without any provenance tracking at all.
- We develop provenance reasoning capabilities specifically useful to the data provider, such as provenance analytics across time, sites, and users; finding the code modules that best explain why two results are different; regression testing to determine whether a code change would affect prior results; and reconstructing missing provenance for steps that were not captured. These capabilities are expected to lead to wider tracking of data provenance, and ultimately to more consistent, reproducible, and reliable science. We will validate this hypothesis through the evaluation of our technologies within a Next-Generation Sequencing pipeline run by one of the PIs with collaborators at other institutions.
- We are investigating mechanisms for combining curation or annotations from multiple users, computing trust, and determining consensus annotations based on provenance.
- We are developing generalizations of data provenance for "non-relational" operators such as those in linear algebra, time series manipulations, and more.
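As an illustration of the regression-testing capability listed above, the following minimal Python sketch walks recorded derivations to find which prior results a code change could affect. The `Step` record, the `affected_results` function, and the pipeline file names are hypothetical, chosen only for this example; they are not the project's tools or data model.

```python
from collections import defaultdict, deque
from typing import List, NamedTuple, Set, Tuple


class Step(NamedTuple):
    """One recorded derivation: `module` read `inputs` and wrote `output`."""
    module: str
    inputs: Tuple[str, ...]
    output: str


def affected_results(steps: List[Step], changed_module: str) -> Set[str]:
    """Return every recorded output that depends, directly or
    transitively, on a step executed by `changed_module`."""
    # Index: data item -> outputs derived directly from it.
    consumers = defaultdict(set)
    for step in steps:
        for item in step.inputs:
            consumers[item].add(step.output)

    # Seed with outputs written by the changed module, then follow
    # derivation edges downstream (breadth-first).
    affected = {s.output for s in steps if s.module == changed_module}
    queue = deque(affected)
    while queue:
        item = queue.popleft()
        for downstream in consumers[item]:
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected


# Hypothetical sequencing pipeline; all names are made up for illustration.
steps = [
    Step("trim", ("reads.fastq",), "trimmed.fastq"),
    Step("align", ("trimmed.fastq", "genome.fa"), "aligned.bam"),
    Step("count", ("aligned.bam",), "counts.tsv"),
    Step("qc", ("reads.fastq",), "qc_report.html"),
]

print(sorted(affected_results(steps, "align")))
# → ['aligned.bam', 'counts.tsv']
```

In this sketch a change to the hypothetical `align` module flags only the alignment and the counts derived from it, so the QC report and the trimmed reads would not need to be re-checked. A real system would also have to account for module versions and for steps whose provenance was never captured.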
Related publications:
- Fine-Grained Provenance for Matching and ETL. Nan Zheng, Abdussalam Alawini, Zachary G. Ives. To appear, ICDE 2019.
- Dataset Relationship Management. Zachary G. Ives, Soonbo Han, Yi Zhang, Nan Zheng. CIDR 2019.
- Collaborating and Sharing Data in Epilepsy Research. Joost Wagenaar, Greg Worrell, Matthias Dumpelmann, Zachary Ives, Brian Litt, Andreas Schulze-Bonhage. Journal of Clinical Neurophysiology.
- Looking at Everything in Context. Zachary Ives, Zhepeng Yan, Nan Zheng, Brian Litt, Joost B. Wagenaar. CIDR 2015.
- Approximated Summarization of Data Provenance. Eleanor Ainy, Pierre Bourhis, Susan B. Davidson, Daniel Deutch, Tova Milo. PROX. EDBT 2016: 620-623.
- Fine-grained Provenance for Linear Algebra Operators. Zhepeng Yan, Val Tannen, Zachary Ives. TaPP 2016.
Tools related to automatic provenance capture (Tracker) and a cloud-based PROV repository are available from the site menu.
The Penn Provenance Team includes members from computer science, bioengineering, biology, and medicine. Key participants and collaborators include:
- Zachary Ives, CIS
- Junhyong Kim, Biology
- Susan Davidson, CIS
- Sampath Kannan, CIS
- Val Tannen, CIS
- Brian Litt, Bioengineering and Neurology
- Abdussalam Alawini, CIS
- Soonbo Han
- John Frommeyer, SEAS
- Nan Zheng, CIS
- Stephen Fisher, Biology