I had some fun recently after agonizing over the problem of bug prevention. I put out the observation that bugs have a tendency to cluster, and that the more often code is edited the greater the need for the code to be frightfully virtuous, clean, and thoroughly tested. The problem was with determining which code should be cleaned up in order to reduce the injection of bugs.
I considered various metrics and measurements, including cyclomatic complexity, code duplication, and various design details, but ultimately dismissed them for fear that they would be unconvincing. I didn't want my audience to think the heat map was just "coach Tim lecturing." In keeping with our description of effective information radiators, we needed the information to be simple and stark.
Our SCM is based on git, and our tracking software is Jira. We have to include a recognizable, open, Jira ticket id in every code commit. Many of the tickets are enhancements, improvements, or investigations. Some are problems with legacy data or data import, and some are defects. Tickets are associated with releases, and the release dates are in the database as custom fields. I decided to count tickets, and not commits. We commit frequently, and some more frequently than others. Counting commits would be mostly noise, but counting tickets seemed like a pretty good idea.
I fired up Python and grabbed a partner. By using the PyGit library, we were able to parse all the git commits, collect ticket numbers, and parse the diffs to find the names of all files touched in each commit. I tossed together a shelf (bdb) mapping file names to ticket numbers.
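For anyone who wants to try the same thing, here is a minimal sketch of the file-to-ticket mapping. It is not the code we wrote: instead of PyGit it parses the plain-text output of `git log --name-only --pretty=format:"@@%s"`, and the `@@` marker, the ticket pattern, and the function name `files_to_tickets` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed ticket-ID shape (e.g. "PROJ-123"); adjust to your own Jira project keys.
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def files_to_tickets(log_text):
    """Map each file path to the set of ticket IDs whose commits touched it.

    Expects text shaped like the output of:
        git log --name-only --pretty=format:"@@%s"
    where every commit starts with a line "@@<subject>" followed by the
    paths it touched. The "@@" marker is an arbitrary choice for this sketch.
    """
    mapping = defaultdict(set)
    current_tickets = set()
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank separator between commits
        if line.startswith("@@"):
            # New commit: remember the ticket IDs named in its subject line.
            current_tickets = set(TICKET_RE.findall(line))
        else:
            # A file path belonging to the current commit.
            mapping[line].update(current_tickets)
    return dict(mapping)
```

Because the values are sets, ten commits against one ticket still count as a single ticket for that file, which is the point of counting tickets rather than commits. The resulting dictionary could be stored in a `shelve` file if you want it to persist between runs.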
Next I pulled down SOAPpy and my Jira-smart partner helped me build a Jira query. We pulled down the tickets and created another handy shelf, including the ticket number as a key and key facts like the descriptions, summaries, a defect indicator for defect tickets, and release numbers with dates.
Finally we walked through all of the tickets and ordered them by release date. I guess that's pretty coarse-grained, but it makes sense to keep it pegged to recognizable events. Releases have more data available (BOMs, release notes, etc.) in case we need to dive deeper.
So then we get to the fudgy part. A file that doesn't get any activity is not very interesting. It may be good or bad, tested or untested, clean or filthy, DRY or very non-DRY, but nobody is messing with it so we can ignore it for a while. On the other hand, the more a file is touched, the more important it is that the file is easy to work with.
We don't have any information about which file(s) contained a defect, only the set of files that were touched while resolving the defect. That rightfully includes any tests, flawed files, test utilities improved, and any other source files touched by renames or other refactorings. The more often a file is updated in the resolution of defects, the closer it probably is to the problem file. I needed to weight activity related to defects much more heavily than activity related to non-defect tickets.
As a rough heat map metric, I took the total tickets for a period and added back the total defects * 2, effectively counting each defect three times. That just feels about right to me and my partners. If I have two updates to one file with no defects, the code might not need much preventative maintenance. If I have two defects in one file in one release, then that file definitely needs some love. I came up with a ranking and graphed the top five files.
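The arithmetic is small enough to sketch. This is an illustration of the formula above, not our actual script; the names `heat_scores`, `hottest`, and `DEFECT_WEIGHT` are made up here, and it assumes you already have a file-to-tickets mapping and a set of ticket IDs flagged as defects.

```python
DEFECT_WEIGHT = 2  # added on top of the ticket count, so a defect counts 3x in total


def heat_scores(file_tickets, defect_ids):
    """Score each file as (total tickets) + DEFECT_WEIGHT * (defect tickets).

    file_tickets: dict of file path -> set of ticket IDs touching that file
    defect_ids:   set of ticket IDs that are defects
    """
    scores = {}
    for path, tickets in file_tickets.items():
        defect_count = len(tickets & defect_ids)
        scores[path] = len(tickets) + DEFECT_WEIGHT * defect_count
    return scores


def hottest(scores, n=5):
    """Return the n highest-scoring (path, score) pairs, ties broken by path."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]
```

With this weighting, a file touched by two non-defect tickets scores 2, while a file touched by two defect tickets scores 6.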
That left me with a jittery graph, so I decided to use a longer period, which I settled on being 3 releases. The January release "heat index" (heh) is based on November through January, and the February heat index is December through February. It gives us a smoother graph, but may not be the smartest way to smooth it. It is simple, though, and that counts for something.
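The smoothing amounts to a trailing three-release window. Here is a rough sketch, assuming each ticket belongs to exactly one release so that per-release heat values can simply be summed; `rolling_heat` is a name invented for this example.

```python
def rolling_heat(per_release, window=3):
    """Sum each release's heat with the (window - 1) releases before it.

    per_release: heat values for one file, ordered oldest release first.
    The earliest releases have fewer predecessors, so their windows are
    shorter; this sketch sums whatever is available rather than padding.
    """
    smoothed = []
    for i in range(len(per_release)):
        start = max(0, i - window + 1)
        smoothed.append(sum(per_release[start:i + 1]))
    return smoothed
```

For a file with per-release heat `[1, 0, 5, 2]`, this gives `[1, 1, 6, 7]`: the spike in the third release still shows up, but it decays over the following releases instead of vanishing immediately.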
The top five files were pretty interesting. When I looked them over with a production support (triage) person, the files were very familiar to them. A manager-type remembered the defects associated with them quite well. A features programmer told me that these files really were troublesome. Some files had been remedied by dedicated refactoring and testing time, and we could see how those files had lost 'heat' over the period of a year. Other files were trending upward in heat, because of a combination of discovered defects and being in a functional area that is seeing a lot of expansion.
We checked out the top two files which were trending hotter and found that they had low unit test coverage. We suspect a connection there.
The interesting parts of this experiment (to me) were:
1) We closed a feedback loop (request->code->release->defect report->code)
2) The metric is not based on my ideas of quality, just historical fact, so the answers are credible
3) There is so much more we can do with this data. More on that later.
This is the new hotness, and it is presented as an information radiator in the bullpen area. We are all "hot" on the idea of testing and refactoring these files, so hopefully we'll see some marked improvement over the next few releases.