Data quality measurement in wide-column stores

J. Hilber. Data quality measurement in wide-column stores. 6, 2018.

  • Julia Hilber
  • Wolfram Wöß
  • Dr. Lisa Ehrlinger

Many companies and organizations make decisions based on data stored in databases. It is therefore important to know the quality of that data, because poor data quality leads to poor decisions. Most data issues stem from human error during data acquisition, faulty processes, poorly designed architectures, inconsistent definitions, and incorrect usage of data.

NoSQL stores are currently popular, driven by the growth of unstructured data on the web, which makes it all the more important to assess their data quality. This thesis therefore provides an approach for assessing the data quality of a Cassandra store, restricted to the schema level. The schema is given priority over the instances because changing a schema may not be possible once applications depend on it, whereas instances are easier to assess and correct. Cassandra was chosen because it is the second most used database among the NoSQL stores.

In this work, an extension of the tool QuaIIe for assessing Cassandra schemas is described and implemented. The extension builds on a detailed analysis of the existing program, in particular its connector for MySQL databases. The next step is the transformation of the Cassandra schema into the DSD vocabulary, followed by a direct comparison of the data quality assessment of a Cassandra store with that of a MySQL database. Finally, the assessment of the Cassandra schema is performed and evaluated.