WorkflowDSL: Scalable Workflow Execution with Provenance

Authors Tharidu Fernando
Editors
Title WorkflowDSL: Scalable Workflow Execution with Provenance
Type master thesis
Organization Software Competence Center Hagenberg
Institution School of Information and Communication Technology
School KTH Royal Institute of Technology
Month September
Year 2017
SCCH ID# 17068
Abstract

Scientific workflow systems enable scientists to perform large-scale data intensive scientific experiments using distributed computing resources. Due to the diversity of domains and complexity of technology, delivering a successful outcome efficiently requires collaboration between domain experts and technical experts. However, existing scientific workflow systems require a large investment of time to familiarise and adapt existing workflows. Thus, many scientific workflows are still being implemented by script based languages (such as Python and R) due to familiarity and extensive third party library support. In this thesis, we implement a framework that uses a domain specific language that enables domain experts to collaborate on fine-tuning workflows. Technical experts are able to use Python for task implementations. Moreover, the framework includes support for parallel execution without any specialized code. It also provides a provenance capturing framework that enables users to analyse past executions and retrieve complete lineage of any data item generated. Experiments which were performed using a real-world scientific workflow from the bioinformatics domain show that users were able to execute workflows efficiently while using our DSL for workflow composition and Python for task implementations. Moreover, we show that captured provenance can be useful for analysing past workflow executions.