WorkflowDSL: Scalable Workflow Execution with Provenance

Authors Tharidu Fernando
Title WorkflowDSL: Scalable Workflow Execution with Provenance
Type Master's thesis
Organization Software Competence Center Hagenberg
Institution School of Information and Communication Technology
University KTH Royal Institute of Technology
Month September
Year 2017
SCCH ID# 17068

Scientific workflow systems enable scientists to perform large-scale, data-intensive scientific experiments using distributed computing resources. Due to the diversity of domains and the complexity of the technology, delivering a successful outcome efficiently requires collaboration between domain experts and technical experts. However, existing scientific workflow systems require a large investment of time to learn and to adapt existing workflows to them. Thus, many scientific workflows are still implemented in script-based languages (such as Python and R) because of their familiarity and extensive third-party library support. In this thesis, we implement a framework built around a domain-specific language (DSL) that enables domain experts to collaborate on fine-tuning workflows, while technical experts use Python for task implementations. Moreover, the framework supports parallel execution without any specialized code. It also provides a provenance-capturing framework that enables users to analyse past executions and retrieve the complete lineage of any generated data item. Experiments performed using a real-world scientific workflow from the bioinformatics domain show that users were able to execute workflows efficiently while using our DSL for workflow composition and Python for task implementations. Moreover, we show that captured provenance can be useful for analysing past workflow executions.
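
The three ideas in the abstract (tasks written as plain Python functions, parallel execution without task-level parallel code, and lineage retrieval from captured provenance) can be illustrated with a minimal sketch. This is not the thesis's DSL or implementation: the `WORKFLOW` dictionary, the `run` and `lineage` helpers, and the toy tasks are all hypothetical names chosen for illustration, and a thread pool stands in for whatever execution backend the framework actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy tasks: ordinary Python functions with no parallelism-specific code.
def load():
    return [3, 1, 2]

def sort_values(values):
    return sorted(values)

def total(values):
    return sum(values)

def report(sorted_values, value_sum):
    return {"sorted": sorted_values, "sum": value_sum}

# Hypothetical workflow description: task name -> (function, upstream task names).
WORKFLOW = {
    "load": (load, []),
    "sort": (sort_values, ["load"]),
    "total": (total, ["load"]),
    "report": (report, ["sort", "total"]),
}

def run(workflow):
    """Run tasks in waves; all tasks whose inputs are ready run concurrently."""
    results, provenance = {}, {}
    pending = dict(workflow)
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [name for name, (_, deps) in pending.items()
                     if all(dep in results for dep in deps)]
            if not ready:
                raise ValueError("cycle or missing dependency in workflow")
            futures = {
                name: pool.submit(pending[name][0],
                                  *(results[dep] for dep in pending[name][1]))
                for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()
                # Provenance here is just the direct inputs of each task.
                provenance[name] = list(pending[name][1])
                del pending[name]
    return results, provenance

def lineage(provenance, name):
    """Return every upstream task that contributed to `name`, sorted by name."""
    seen = set()
    stack = list(provenance[name])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(provenance[current])
    return sorted(seen)

results, provenance = run(WORKFLOW)
```

In this sketch, `sort` and `total` depend only on `load`, so they are submitted in the same wave and may run concurrently, and `lineage(provenance, "report")` walks the recorded inputs back to `load`. A real provenance store would typically also record parameters, timestamps, and data identifiers, which are omitted here.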