Automated continuous data quality measurement with QuaIIe

L. Ehrlinger, B. Werth, W. Wöß. Automated continuous data quality measurement with QuaIIe. International Journal on Advances in Software, volume 11, number 3&4, pages 400-417, 12, 2018.

  • Lisa Ehrlinger
  • Bernhard Werth
  • Wolfram Wöß
Journal: International Journal on Advances in Software

Data quality measurement is essential to gain knowledge about data used for decision-making and to evaluate the trustworthiness of those decisions. Example applications based on automated decision-making include self-driving cars, smart factories, and weather forecasting. One-time data quality measurement is an important starting point for any data quality project to detect critical data that does not meet expectations and to define improvement goals for data cleansing activities. The complementary task of continuous data quality measurement is essential to ensure that data continues to conform to requirements and to detect unexpected changes in the data. However, most existing data quality tools allow quality measurement only at a specific point in time and leave the automation and scheduling to the user. In this paper, we highlight the need for (1) domain-independent ad hoc measurement, to provide quick insight into an information system's qualitative condition, and (2) continuous data quality measurement, to observe how data quality evolves over time. Both requirements can be met with our data quality tool QuaIIe (Quality Assessment for Integrated Information Environments, pronounced ['kvAl@]), which we developed to calculate metrics for the quality dimensions accuracy, correctness, completeness, pertinence, timeliness, minimality, readability, and normalization at both the data level and the schema level. The quality measurements can either be exported as a user- and machine-readable quality report or be stored periodically in a database, which allows for long-term analysis. In this paper, we demonstrate the application of QuaIIe for ad hoc and continuous data quality measurement.