GitHub repository

research, empirical software engineering, data analysis

Software Engineering Lab

I explored methods of detecting GitHub wiki pages in order to automatically mine additional sources of software documentation.

As part of another ongoing research project, I explored ML models in Weka for application to fNIRS (brain-imaging) data, to detect instances of cognitive load in software developers when they encounter linguistic anti-patterns in open-source software.

Plan

I iterated on methods of accessing open-source repositories to gather information about each project: using BeautifulSoup to extract data from the repository's web pages, followed by further analysis of the source code and documentation.

Execution

To help characterize the occurrence of documentation in the wiki pages of open-source software projects, the scripts:

(1) Take as input a list of GitHub projects.
(2) Scrape the web pages of those GitHub projects with BeautifulSoup to determine which actually have wiki pages, and generate a text report detailing:
    1 - the number of projects that no longer exist
    2 - the number of projects with a default (empty) wiki page or no wiki page
    3 - the number of projects with wiki pages that have content
(3) Convert the text report to CSV for further analysis.
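The classification and CSV steps above can be sketched as follows. This is a minimal illustration, not the original scripts: the `#wiki-body` selector is an assumption about GitHub's wiki-page markup (the real pages would need to be inspected), and fetching the HTML (e.g. with `requests`) is left out.

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical selector: the original scripts' exact markup checks are not
# shown, so "#wiki-body" is an assumed container for wiki content.
WIKI_BODY_SELECTOR = "#wiki-body"

def classify_wiki_html(html: str) -> str:
    """Classify one fetched wiki page into the report's three buckets."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select_one(WIKI_BODY_SELECTOR)
    if body is None:
        return "no_wiki"          # page lacks the wiki container entirely
    if body.get_text(strip=True):
        return "has_content"      # wiki page with real content
    return "empty_wiki"           # default (empty) wiki page

def write_counts_csv(counts: dict, path: str) -> None:
    """Convert the tallied report to CSV for further analysis."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["category", "count"])
        for category, count in counts.items():
            writer.writerow([category, count])
```

A driver loop would fetch each project's `/wiki` URL, tally the result of `classify_wiki_html`, and call `write_counts_csv` on the final counts.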

What Was Observed

In a sample of 87,674 GitHub projects:

1) no longer exist: 12,891 (14.70%)
2) empty or no wiki: 73,570 (83.91%)
3) wiki with content: 1,213 (1.38%)

Excluding the 12,891 projects that no longer exist leaves 74,783 GitHub projects, 1,213 (1.62%) of which have wiki pages with content.
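The two different percentage bases (share of all sampled projects vs. share of projects that still exist) can be reproduced from the raw counts:

```python
# Recomputing the report's percentages from the raw counts.
total = 87_674          # sampled GitHub projects
gone = 12_891           # projects that no longer exist
empty_or_none = 73_570  # default (empty) or missing wiki
with_wiki = 1_213       # wiki pages with content

# The three buckets partition the sample.
assert gone + empty_or_none + with_wiki == total

existing = total - gone                         # 74,783 projects still exist
share_of_sample = with_wiki / total * 100       # ~1.38% of all sampled projects
share_of_existing = with_wiki / existing * 100  # ~1.62% of existing projects
```

This is why the wiki figure appears twice in the summary: 1.38% is computed against the full sample, 1.62% against only the projects that still exist.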