I had some fun recently after agonizing over the problem of bug prevention. I put out the observation that bugs have a tendency to cluster, and that the more often code is edited, the greater the need for it to be frightfully virtuous, clean, and thoroughly tested. The problem was determining which code should be cleaned up in order to reduce the injection of bugs.
I considered various metrics and measurements, including cyclomatic complexity, code duplication, and various design details, but ultimately dismissed them for fear that they would be unconvincing. I didn't want my audience to think the heat map is just "coach Tim lecturing." In keeping with our description of effective information radiators, we needed the information to be simple and stark.
Our SCM is based on git, and our tracking software is Jira. We have to include a recognizable, open Jira ticket id in every code commit. Many of the tickets are enhancements, improvements, or investigations. Some are problems with legacy data or data import, and some are defects. Tickets are associated with releases, and the release dates are in the database as custom fields. I decided to count tickets, not commits. We commit frequently, some of us more frequently than others, so counting commits would be mostly noise, but counting tickets seemed like a pretty good idea.
I fired up Python and grabbed a partner. Using the PyGit library, we were able to parse all the git commits, collect ticket numbers, and parse the diffs to find the names of all files touched in each commit. I tossed together a shelf (bdb) mapping file names to ticket numbers.
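The real script belongs to my employer, but a minimal sketch of the idea looks something like this. It shells out to `git log` instead of using PyGit, and the ticket-id pattern and shelf name are just examples:

```python
import re
import shelve
import subprocess
from collections import defaultdict

# Example pattern only -- adjust to your project's Jira key format (e.g. "ABC-1234").
TICKET_RE = re.compile(r"\b[A-Z]+-\d+\b")

def files_by_ticket(repo_path="."):
    """Walk the git log and collect the set of files touched for each ticket id."""
    log = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:__COMMIT__%s"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout

    tickets_to_files = defaultdict(set)
    current_tickets = []
    for line in log.splitlines():
        if line.startswith("__COMMIT__"):
            current_tickets = TICKET_RE.findall(line)       # ticket ids in the commit message
        elif line.strip():
            for ticket in current_tickets:
                tickets_to_files[ticket].add(line.strip())  # file path touched by this commit
    return tickets_to_files

# Invert into a shelf of file name -> ticket numbers, as described above.
with shelve.open("file_tickets") as db:
    for ticket, files in files_by_ticket().items():
        for path in files:
            db[path] = sorted(set(db.get(path, [])) | {ticket})
```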
Next I pulled down SOAPpy and my Jira-smart partner helped me build a Jira query. We pulled down the tickets and created another handy shelf, keyed by ticket number and holding key facts like the description, summary, a defect indicator for defect tickets, and release numbers with dates.
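We used SOAPpy against the old Jira SOAP API, which I won't try to reproduce from memory; a rough equivalent against the REST API would look like the sketch below. The URL, JQL, credentials, and the use of fixVersions to carry the release dates (ours are actually custom fields) are all placeholders:

```python
import shelve
import requests

JIRA_URL = "https://jira.example.com"         # placeholder
JQL = "project = ABC AND status = Closed"     # example query only

def fetch_tickets():
    """Page through a Jira search and return {ticket_key: facts} dictionaries."""
    start, tickets = 0, {}
    while True:
        resp = requests.get(
            f"{JIRA_URL}/rest/api/2/search",
            params={"jql": JQL, "startAt": start, "maxResults": 100,
                    "fields": "summary,description,issuetype,fixVersions"},
            auth=("user", "secret"),          # placeholder credentials
        )
        resp.raise_for_status()
        data = resp.json()
        for issue in data["issues"]:
            f = issue["fields"]
            tickets[issue["key"]] = {
                "summary": f["summary"],
                "description": f["description"],
                "is_defect": f["issuetype"]["name"].lower() in ("bug", "defect"),
                "releases": [(v["name"], v.get("releaseDate")) for v in f["fixVersions"]],
            }
        start += len(data["issues"])
        if not data["issues"] or start >= data["total"]:
            break
    return tickets

with shelve.open("tickets") as db:
    db.update(fetch_tickets())
```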
Finally we walked through all of the tickets and ordered them by release date. I guess that's pretty coarse-grained, but it makes sense to keep it pegged to recognizable events. Releases have more data available (BOMs, release notes, etc) in case we need to dive deeper.
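For illustration, bucketing the sketch data above by release could look like:

```python
from collections import defaultdict

def tickets_by_release(ticket_facts):
    """Group ticket keys by the (date, name) of the release they shipped in."""
    releases = defaultdict(list)
    for key, facts in ticket_facts.items():
        for name, date in facts["releases"]:
            if date:                           # skip unreleased fix versions
                releases[(date, name)].append(key)
    return dict(sorted(releases.items()))      # chronological, since the dates are ISO strings
```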
So then we get to the fudgy part. A file that doesn't get any activity is not very interesting. It may be good or bad, tested or untested, clean or filthy, DRY or very non-DRY, but nobody is messing with it, so we can ignore it for a while. On the other hand, the more a file is touched, the more important it is that the file is easy to work with.
We don't have any information about which file(s) contained a defect, only the set of files that were touched while resolving the defect. That rightly includes any tests, the flawed files themselves, improved test utilities, and any other source files touched by renames or other refactorings. Still, the more often a file is updated in the resolution of defects, the closer it probably is to the problem file. I needed to weight activity related to defects much more highly than activity related to non-defect tickets.
As a rough heat map metric, I took the total tickets for a period and added back the total defects * 2, effectively counting each defect three times. That just feels about right to me and my partners. If I have two updates to one file with no defects, the code might not need much preventive maintenance. If I have two defects in one release, then it definitely needs some love. I came up with a ranking and graphed the top five files.
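In code, the scoring for one period boils down to something like this, where `period_tickets` is the set of ticket ids that landed in the period (again, just a sketch on top of the structures above):

```python
from collections import Counter

def heat_index(file_tickets, ticket_facts, period_tickets):
    """Per-file heat for one period: every ticket counts once, defects count three times."""
    heat = Counter()
    for path, tickets in file_tickets.items():
        for key in tickets:
            if key not in period_tickets:
                continue
            heat[path] += 1                      # any ticket touching the file
            if ticket_facts[key]["is_defect"]:
                heat[path] += 2                  # add back defects * 2
    return heat

# The top five files for a period:
#   heat_index(file_tickets, ticket_facts, period_tickets).most_common(5)
```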
That left me with a jittery graph, so I decided to use a longer period, which I settled on as three releases. The January release "heat index" (heh) is based on November through January, and the February heat index is December through February. It gives us a smoother graph, but it may not be the smartest way to smooth it. It is simple, though, and that counts for something.
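The smoothing is nothing fancier than a trailing three-release sum, roughly:

```python
from collections import Counter

def rolling_heat(per_release_heat, window=3):
    """Combine each release's heat Counter with the heat of the releases before it."""
    releases = sorted(per_release_heat)          # release keys sort chronologically
    smoothed = {}
    for i, release in enumerate(releases):
        combined = Counter()
        for earlier in releases[max(0, i - window + 1): i + 1]:
            combined.update(per_release_heat[earlier])   # Counter.update adds counts
        smoothed[release] = combined
    return smoothed
```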
The top five files were pretty interesting. When I looked over it with a production support (triage) person, the files were very familiar to them. A manager-type remembered the defects associated with them quite well. A features programmer told me that these files really were troublesome. Some files had been remedied by dedicated refactoring and testing time, and we could see how those files had lost 'heat' over the period of a year. Other files were trending upward in heat, because of a combination of discovered defects and being in a functional area that is seeing a lot of expansion.
We checked out the top two files which were trending hotter and found that they had low unit test coverage. We suspect a connection there.
The interesting parts of this experiment (to me) were:
1) We closed a feedback loop (request->code->release->defect report->code)
2) The metric is not based on my ideas of quality, just historic fact, so the answers are credible
3) There is so much more we can do with this data. More on that later.
This is the new hotness, and it is presented as an information radiator in the bullpen area. We are all "hot" on the idea of testing and refactoring these files, so hopefully we'll see some marked improvement over the next few releases.
Sounds like a great idea. Is there a public version of this heatmap collector? I'd also throw in the counting of task annotations in a file (@todo, @fixme, etc.) that may refer to technical debt. The more there are in a file, the more attention it needs.
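Counting them would be cheap, too; something like this quick sketch (marker list and file glob are just examples):

```python
import re
from pathlib import Path

MARKERS = re.compile(r"@(todo|fixme|hack)", re.IGNORECASE)   # example marker set

def marker_counts(src_root="src", glob="**/*.py"):
    """Count technical-debt markers per file under a source tree."""
    return {
        str(path): len(MARKERS.findall(path.read_text(errors="ignore")))
        for path in Path(src_root).glob(glob)
    }
```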
The code was written for my employer, and belongs to the company. I really should get into some open source projects so that I can show my work. It is not available, but could be reproduced.
It is fairly trivial once you know what you want. I spent most of my time figuring out what data I wanted, how I wanted to index it, what was meaningful, etc.
You have enough of a head start here that you could almost certainly reproduce the results using the tools for your project.
I considered digging data out of the code itself, but I'm pleased that instead I used the historical record without looking into the code at all. We parsed our Simian report (similarity index) and originally were adding duplication into the heat map, but then decided that a purely historical record was more credible than one mixed with code theory.
What I would like to incorporate in the future is code coverage, but we would have to build up a record of the code coverage at each release, and we don't have that currently. It should be a matter of checking out the code, running the unit and FitNesse tests, and parsing the records. Coverage % would reduce the heat index and show us which files need work instead of just showing us which need to be clean.
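I haven't worked out the math yet, but the simplest adjustment I can imagine is scaling each file's heat by its uncovered fraction, roughly:

```python
def coverage_adjusted_heat(heat, coverage_pct):
    """One possible adjustment: a fully covered file keeps no heat, an uncovered one keeps it all."""
    return heat * (1 - coverage_pct / 100.0)
```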
I like your marker idea, and think it would be an interesting additional criterion to look at along with coverage.
Brilliant, Tim. I like how that gets Management and Engineering on the same team.
Don't get too granular with it though. Breaking it down file by file is universal in more than one way. It's language neutral, all managers understand what files are, and every version control system understands files.
If people want to develop the idea further, they should get statisticians involved. They can help us develop those creepy-smart algorithms for classifying files and improving our decisions.
Anyway, completely brilliant. You and whoever else built that should be consuming Guinness and yelling a lot. For like 2 weeks at least.
Now that it's a little later and we're able to act on it, the heat index is proving to be a mighty good "bird dog."
It is almost as if there is a correlation between ugliness of code and defects. I think maybe the takeaway is this:
Ugly code touched often breaks often.
I'll try to punctuate that later.
Nice simple idea. It would be interesting (at least to me) to see whether the factor for defect-related check-ins actually makes much of a difference.
Just guessing, I would assume new or newly changed code causes more trouble than old code.
If this is true, you should get more or less the same heat map (at least for the hottest files) independently of the weighting factor.
Tim, you inspired a colleague and me. I've set up a project at https://github.com/andrewheald/code-heatmap to implement your ideas. Early days yet, but it's taking shape. I'll post again when some solid results come of it.
Great idea Tim. In my Google tech talk http://bit.ly/80-20-rules I talk a lot about bug clusters. I mentioned using check-out records as a guide for programs that need maintenance, but this extends that idea significantly by adding the "bugs found" weighting to the heat map. Great stuff!
We have a heat map built into our Jenkins build now at a client site. It is not a proper word cloud (would that be cool?) but instead is HTML. It sizes the files by the total number of changes, and colors them by the % of changes that are defects.
We changed the script to use a 60-day sampling period. In the past two months, I can tell you where our activity has been and how much of the work was defects, and at a glance I know which files need to be REALLY clean.
The visualization came from Brandon Carlson, including the idea of hovering on the name to see the path, total changes, and total defects.
It's pretty sweet, really.
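It isn't Brandon's actual page, but the shape of it is roughly this sketch, with made-up size and color choices:

```python
import html

def heat_map_html(stats):
    """stats: {path: (total_changes, defect_changes)} -> a crude one-page heat map."""
    max_changes = max((total for total, _ in stats.values()), default=1)
    spans = []
    for path, (total, defects) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
        size = 10 + 30 * total / max_changes                 # font size tracks total changes
        red = int(255 * defects / total) if total else 0     # redness tracks % defect changes
        spans.append(
            f'<span title="{html.escape(path)}: {total} changes, {defects} defects" '
            f'style="font-size:{size:.0f}px; color: rgb({red},0,0)">'
            f'{html.escape(path.rsplit("/", 1)[-1])}</span>'
        )
    return "<p>" + " ".join(spans) + "</p>"
```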
Interesting read! I love how you tackled the challenge of bug prevention and used a practical approach with Jira and git commits. The heat map idea is genius—it provides a clear visual of where attention is needed. Jira time tracking can indeed play a crucial role in identifying patterns and areas for improvement. Looking forward to more insights from your data-driven approach!