Benefits and drawbacks of representing and analyzing source code and software engineering artifacts with graph databases

R. Ramler, G. Buchgeher, C. Klammer, M. Pfeiffer, C. Salomon, H. Thaller, L. Linsbauer. Benefits and drawbacks of representing and analyzing source code and software engineering artifacts with graph databases. volume 338, pages 125-148, DOI, 1, 2019.

  • Rudolf Ramler
  • Georg Buchgeher
  • Claus Klammer
  • Michael Pfeiffer
  • Christian Salomon
  • Hannes Thaller
  • Lukas Linsbauer
  • Dietmar Winkler
  • Stefan Biffl
  • Johannes Bergsmann
Book: Software Quality: The Complexity and Challenges of Software Engineering and Software Quality in the Cloud - Proc. SWQD 2019
Type: In conference proceedings
Series: Lecture Notes in Business Information Processing

Source code and related artifacts of software systems encode valuable expert knowledge accumulated over many person-years of development. Analyzing software systems and extracting this knowledge requires processing the source code and reconstructing structure and dependency information. In analysis projects over recent years, we have created tools and services that use graph databases to represent and analyze source code and other software engineering artifacts as well as their dependencies. Graph databases such as Neo4j are optimized for storing, traversing, and manipulating data in the form of nodes and relationships. They are scalable, extensible, and can quickly be adapted to different application scenarios. In this paper, we share our insights and experience from five different cases in which graph databases have been used as a common solution concept for analyzing source code and related artifacts. The cases cover a broad spectrum of use cases from industry and research, ranging from lightweight dependency analysis to analyzing the architecture of a large-scale software system with 44 million lines of code. We discuss the benefits and drawbacks of using graph databases in the reported cases. The benefits are related to representing dependencies between source code elements and other artifacts, the support for rapid prototyping of analysis solutions, and the power and flexibility of the graph query language. The drawbacks concern the generic frontends of graph databases and the lack of support for time series data. A summary of application scenarios for using graph databases concludes the paper.
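
To illustrate the idea of storing source code elements and other artifacts as nodes and relationships, the following is a minimal in-memory sketch in Python. It is not the tooling described in the paper and does not use Neo4j or its query language; the class `CodeGraph`, the labels (`Class`, `Requirement`), the relationship types (`DEPENDS_ON`, `IMPLEMENTED_BY`), and all element names are hypothetical and chosen only for this example.

```python
from collections import defaultdict, deque


class CodeGraph:
    """Tiny in-memory stand-in for a graph database (illustrative only)."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(list)  # source id -> [(relationship type, target id)]

    def add_node(self, node_id, label, **props):
        # A node carries a label (its kind) plus arbitrary properties.
        self.nodes[node_id] = {"label": label, **props}

    def add_relationship(self, source, rel_type, target):
        # Relationships are directed and typed; both endpoints must exist.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[source].append((rel_type, target))

    def reachable(self, start, rel_types):
        # Breadth-first traversal that follows only the given relationship
        # types and returns every node reachable from `start`.
        seen = set()
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for rel_type, target in self.edges.get(current, []):
                if rel_type in rel_types and target != start and target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen


# Hypothetical artifacts: three classes and one requirement.
graph = CodeGraph()
graph.add_node("OrderService", "Class", lines_of_code=420)
graph.add_node("OrderRepository", "Class", lines_of_code=180)
graph.add_node("Database", "Class", lines_of_code=950)
graph.add_node("REQ-1", "Requirement", title="Place an order")

graph.add_relationship("OrderService", "DEPENDS_ON", "OrderRepository")
graph.add_relationship("OrderRepository", "DEPENDS_ON", "Database")
graph.add_relationship("REQ-1", "IMPLEMENTED_BY", "OrderService")

# Transitive code dependencies of one class.
code_deps = graph.reachable("OrderService", {"DEPENDS_ON"})

# Code elements affected by a requirement, crossing artifact types.
impact = graph.reachable("REQ-1", {"IMPLEMENTED_BY", "DEPENDS_ON"})
```

In this sketch, `code_deps` contains `OrderRepository` and `Database`, while `impact` additionally contains `OrderService`, because the traversal crosses from a requirement into the code. A graph database would express such a traversal declaratively in its query language instead of a hand-written search.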