

Open source software repositories such as SourceForge and GitHub provide a rich and varied source of data to mine. Data mining analyses (e.g., clustering and regression) based on the newly accessible information from software repositories (e.g., contributors, commits, code frequency, active issues and active pull requests) must be developed with the aim of proactively improving software quality, rather than only reacting to issues.

Descriptive statistics (e.g., mean, median, mode, quartiles of the data set, variance and standard deviation) are not enough to generalize specific behaviours, such as how prone a file is to change. The ability to examine not only static snapshots of software but also the way it has evolved over time is opening up new and exciting lines of research towards the goal of enhancing the quality assessment process. In recent years, scientists and engineers have started turning their attention to the field of software repository mining. In this project, the aim was to improve software development processes in relation to change control, release planning, test recording, code review and project planning. Open software repositories, with their availability and wide spectrum of data attributes, are an exciting testing ground for software repository mining and quality assessment research.
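As a rough illustration of the descriptive statistics listed above, the following sketch summarizes how often each file in a repository has been changed. The file names and commit counts are invented for the example and do not come from the project; only Python's standard `statistics` module is used. The summary describes the distribution as a whole, which is why such figures alone say little about how any individual file will behave.

```python
import statistics

# Hypothetical data: number of commits that touched each file.
# These names and counts are made up for illustration only.
changes_per_file = {
    "core/parser.py": 42,
    "core/lexer.py": 17,
    "ui/window.py": 5,
    "ui/dialog.py": 5,
    "docs/index.md": 1,
}


def describe(values):
    """Return the descriptive statistics named in the text for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points between the quartiles.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
    }


summary = describe(list(changes_per_file.values()))
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up data the mean is well above the median because a single heavily edited file dominates the total, yet neither figure identifies which file that is or whether it will keep changing.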

With the recent boom in data mining and the growth in available processing power, software repository mining now represents a promising tool for developing better software.

By Jesús Alonso Abad, Carlos López Nozal and Jesús M.
