Design and implementation of a domain specific language for data preprocessing pipelines

S. Luftensteiner. Design and implementation of a domain specific language for data preprocessing pipelines. 6, 2017.

  • Sabrina Luftensteiner

Data preprocessing pipelines are used especially in machine learning projects to transform raw data into data that is easier to process. As most existing solutions for defining data processing pipelines are tied to a target language or platform, this bachelor thesis concentrates on the development of an independent domain specific language. Before the main part, the company and the project are introduced. The main part of the thesis deals with the development of a domain specific language adapted to the problems mentioned. First, the definition of the grammar is discussed. The grammar defines the rules of the domain specific language and therefore its structure. Afterwards, validators are outlined. Validators are additional constraint checks and are also used to specify further requirements. Another important part of the thesis is code generation, which enables the generation of code for different target languages; this thesis uses Python and R. The generated code is divided into three parts: the business layer, the abstract layer and the workflow layer. The business layer contains a skeletal implementation of the business logic, the workflow layer contains the definition of the workflow, and the abstract layer acts as a separator between them. Additionally, the testing for each area is described. Finally, the created domain specific language is demonstrated by means of an example using R and Python as target languages.
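The three-part structure of the generated code can be illustrated with a minimal Python sketch. This is not the code the thesis generator emits. The class names `AbstractSteps` and `BusinessSteps`, the function `run_workflow`, and the two steps `drop_missing` and `scale` are hypothetical and chosen only to show how an abstract layer might separate the workflow definition from the business logic.

```python
from abc import ABC, abstractmethod


# Abstract layer (hypothetical): the interface that separates the
# workflow definition from the business logic.
class AbstractSteps(ABC):
    @abstractmethod
    def drop_missing(self, rows):
        """Remove missing values from the input."""

    @abstractmethod
    def scale(self, rows):
        """Scale the remaining values."""


# Business layer (hypothetical): a generator would emit empty method
# skeletons here; the bodies below are filled in by hand for illustration.
class BusinessSteps(AbstractSteps):
    def drop_missing(self, rows):
        return [r for r in rows if r is not None]

    def scale(self, rows):
        top = max(rows)
        return [r / top for r in rows]


# Workflow layer (hypothetical): fixes the order of the steps and depends
# only on the abstract interface, never on a concrete implementation.
def run_workflow(steps: AbstractSteps, rows):
    rows = steps.drop_missing(rows)
    return steps.scale(rows)


result = run_workflow(BusinessSteps(), [2, None, 4, 8])
print(result)
```

Under these assumptions, the business logic can be changed or regenerated without touching the workflow, because the workflow only sees the abstract interface.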