Blockchain use for data provenance in scientific workflow

S. Sigurjonsson. Blockchain use for data provenance in scientific workflow. 7, 2018.

  • Sindri Mar Kaldal Sigurjonsson

In scientific workflows, data provenance plays a major role. Through data provenance, the execution of the workflow is documented and information about the data involved is stored. This information can be used to reproduce scientific experiments or to prove how the results of the workflow came to be. It is therefore vital that the data stored in the provenance database always stays synchronized with its corresponding workflow, so that one can verify that the provenance database has not been tampered with. Blockchain technology has gained a lot of attention since Satoshi Nakamoto released the Bitcoin paper in 2008. A blockchain is an append-only ledger that is stored and replicated across a peer-to-peer network and offers high tamper resistance through its consensus protocol. This thesis explores whether blockchain technology is a suitable solution for synchronizing a workflow with its provenance data. A system that generates a workflow from a definition written in a Domain Specific Language was extended to use blockchain technology to synchronize the workflow itself and its results. Furthermore, the InterPlanetary File System (IPFS) was used to assist with versioning individual executions of the workflow. IPFS made it possible to compare individual workflow executions in more detail and to discover how they differ. The solution was analyzed with respect to the 21 CFR Part 11 regulations imposed by the FDA to see how it could help fulfill their requirements. Analysis of the system shows that the blockchain extension can be used to verify whether the synchronization between a workflow and its results has been tampered with. Experiments revealed that the size of the workflow did not have a significant effect on the execution time of the extension. Additionally, the proposed solution offers a constant cost in digital currency regardless of the workflow. However, even though the extension shows some promise in helping to fulfill the requirements of 21 CFR Part 11, analysis revealed that it does not fully comply with the regulations, owing to their complexity.
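The general idea of verifying that a workflow and its results are still synchronized can be illustrated with a minimal sketch. It is not the system built in the thesis: the `Ledger` class is a hypothetical in-memory, hash-chained list standing in for a real blockchain, and the names `digest` and `verify`, as well as the choice of SHA-256 over a JSON encoding, are assumptions made for illustration only.

```python
import hashlib
import json


def digest(workflow_definition: str, results: dict) -> str:
    """Hash a workflow definition together with its results (assumed encoding)."""
    payload = json.dumps(
        {"workflow": workflow_definition, "results": results},
        sort_keys=True,  # stable key order so equal inputs give equal digests
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Ledger:
    """Hypothetical in-memory, hash-chained, append-only list.

    A stand-in for a blockchain; it has no network and no consensus protocol.
    """

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    @staticmethod
    def _link(prev_hash: str, record: str) -> str:
        return hashlib.sha256((prev_hash + record).encode("utf-8")).hexdigest()

    def append(self, record: str) -> int:
        """Append a record and return its index in the ledger."""
        prev_hash = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        self.blocks.append(
            {"prev": prev_hash, "record": record, "hash": self._link(prev_hash, record)}
        )
        return len(self.blocks) - 1

    def is_intact(self) -> bool:
        """Recompute every link; any edited block breaks the chain."""
        prev_hash = self.GENESIS
        for block in self.blocks:
            if block["prev"] != prev_hash:
                return False
            if block["hash"] != self._link(prev_hash, block["record"]):
                return False
            prev_hash = block["hash"]
        return True


def verify(ledger: Ledger, index: int, workflow_definition: str, results: dict) -> bool:
    """Check that the stored digest still matches the workflow and its results."""
    stored = ledger.blocks[index]["record"]
    return ledger.is_intact() and stored == digest(workflow_definition, results)


if __name__ == "__main__":
    ledger = Ledger()
    workflow = "load -> transform -> report"  # placeholder workflow definition
    results = {"rows": 120, "status": "ok"}   # placeholder results

    index = ledger.append(digest(workflow, results))
    print(verify(ledger, index, workflow, results))                       # True
    print(verify(ledger, index, workflow, {"rows": 121, "status": "ok"}))  # False
```

In this sketch only a fixed-size digest is recorded, so the ledger entry has the same size for any workflow. That is one plausible reading of the constant-cost observation above, although the thesis may achieve it differently.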