Tuesday, February 9, 2010

The Problem Of Copying

Many programmers see their jobs as a process of gluing together code fragments to make things work. Many of them produce results very quickly and are lauded for productivity. TDD, mocking libraries, dependency injection, and refactoring tools seem to them a silly waste of time; they can hack working examples together faster than most "craftsman" programmers.

There is some validity to the approach. Quite often programmers find code samples on the web or in a very fine "python cookbook" or the like, and they copy them into their code base. Likewise, an API comes with example code for a reason. There is nothing wrong with starting from a good external example, as this will often jump-start your application's use of an API or technique. If plagiarism is avoided (i.e., samples are public domain or appropriately licensed) then this kind of copy-and-paste may be valid and useful.

The problem comes when copying and pasting within an application. In this context, it is a quick-and-dirty approach to programming. A search of the web or Twitter for the phrase "quick and dirty" will show primarily examples of the term being used as a virtue or feature. It suggests that the information it references is wholesome, folksy, and free from self-indulgent over-revision. In such cases, I fear that the word 'Quick' has distracted the reader and that the word 'Dirty' has been swept under the rug.

Dirty is not a virtue. While copying code is a known development disabler, it is hard to tell copy-and-paste programmers that the way they work is wrong (let alone telling their managers!). Quick is clearly a virtue in the marketplace, so how can it be wrong to be done sooner?
"The trouble with quick and dirty is that dirty remains long after quick has been forgotten." ~Steve McConnell
The problem is that they're not done sooner. Copy-n-pasters just stop sooner, and the code goes out in poor condition. Lacking unit tests, we can't know if it is really done, and code pasted together in long stringy functions is notably hard to test. To make it testable, it usually has to be refactored into methods, which then are refactored to remove duplication. If we aren't refactoring, it's rather unlikely that we're testing. If we haven't tested the code then who is to say it's done? If there are not tests around the methods up and down the call stack from our paste-receiving function, who is to say that we've not broken some other code? And what of code not directly in our call stack, but which touches upon the data or structures we've modified?

In what kind of bizarro world is "broken" compatible with "done"?

Would anyone sign up for a program where we would intentionally degrade system development speed by even 5% per quarter for the life of our product? Would we approve a continual increase in both fresh bugs and error regression? It is unlikely, all else being equal.

Would businesses sign up if they could have the next five features implemented before the mud hits the fan? Most of them would, and gladly. The problem is that "quick and dirty" sounds very quick and not so dirty. There is a lot of pressure to quiet the angry customer, fulfill a contract, or close a sale. It is foolish to deny that "sooner" matters. Any programmer worth his pay would love to be able to go home with even 10% more accomplishment. "Faster", sings the siren, "faster still!".

The allure of copying is that it causes problems that accumulate "later" while seeming to give benefits that are paid "now." The ability to defer gratification is a famous indicator of a student's likelihood of success, but businesses exist to gratify customers as quickly as possible.

The promise is false. Code copying does not position us to work faster next week, but slower. Say you have five lines of code in an if/else statement inside a loop in one of your functions: code which converts a Lead to a Customer. I need to do the same, so I copy those lines from your function and paste them into mine. As a result:
  1. The compiler has to compile that code twice.
  2. If there is a defect, it needs to be corrected twice or it will be reported as a regression.
  3. If I realize an improvement in my code, I have to back-feed it into yours or callously leave yours unimproved. Is it likely I will spend my feature development time fixing your code?
  4. There is now even more untested code in the system.
  5. If Lead-to-Customer conversion gathers new requirements, we have to find both examples and enhance them. Heaven help us if we miss the one the customers actually use most!
  6. We are working against our IDE, which would happily have located the conversion method for us, had it not been scribbled into the middle of our individual functions.
  7. If I write unit tests for my copy, yours is still untested.
  8. When QC tests your functionality, they will need to cover the edge cases of customer conversion. Then they will need to cover those cases again in my code.
  9. When our two versions diverge, programmers may use either my improved or your unimproved version. They may reinvent my improvement, wasting time for little gain.
  10. Our business people request that our coworkers build a batch customer conversion feature, but there is no function to convert leads to customers. The coworkers must reinvent the process or find and evaluate our copies. Either way, they have wasted development time.
  11. The code we pasted into now is doing multiple things, making it harder to read and comprehend.
  12. The code we pasted into is likely leaking variables that would have been encapsulated in a single function call.
  13. It is very likely that our merged functions are harder to optimize than they would have been with a shared function call in place.
If the code is written perfectly, never needed anywhere else, and subject to no requirement changes, then copying would be harmless, but in such cases copying is not necessary. To copy code, then, is to ignore the consequences to the team in order to seem to be done sooner. This is an act of selfishness.
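
For contrast, here is a minimal sketch of the alternative: the conversion rule extracted once and called from both places. Everything in it is hypothetical and only for illustration. The `Lead` and `Customer` fields, the "qualified" check, the email normalization, and the function names are assumptions, not code from any real system.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    # Hypothetical fields, assumed for this sketch.
    name: str
    email: str
    qualified: bool


@dataclass
class Customer:
    # Hypothetical fields, assumed for this sketch.
    name: str
    email: str
    tier: str


def convert_lead_to_customer(lead: Lead) -> Customer:
    """The one shared home for the Lead-to-Customer rule."""
    if not lead.qualified:
        raise ValueError(f"lead {lead.name!r} is not qualified")
    # Assumed rule: normalize the email and start everyone on one tier.
    # A defect fixed here is fixed for every caller at once.
    return Customer(
        name=lead.name,
        email=lead.email.strip().lower(),
        tier="standard",
    )


def your_import_job(leads):
    """'Your' function: loops and delegates instead of inlining the rule."""
    return [convert_lead_to_customer(lead) for lead in leads if lead.qualified]


def my_batch_conversion(leads):
    """'My' function: reuses the same rule and collects the rejects."""
    converted, rejected = [], []
    for lead in leads:
        try:
            converted.append(convert_lead_to_customer(lead))
        except ValueError:
            rejected.append(lead)
    return converted, rejected
```

With one named function, the IDE can find the conversion, a single set of unit tests covers it, and the coworkers who need batch conversion have something to call rather than something to rediscover.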

The bias against code duplication goes well beyond being a pet peeve of the Agile Otter. It has been well-explained by many programming texts. For a few online references, see what people say about duplicate code at Wikipedia, C2, Ralph Johnson's blog, the "prag progs", or the abstraction principle. Big balls of mud grow from accumulated dirt, and much of that dirt is the residue from "quick and dirty" programming. After all, what is mud but dirt that is not DRY?

Habitually copying and pasting code among function bodies is not an act of heroism, but rather an act of corporate sabotage. It is sad that we do not treat it as such.


  1. Excellent Rant, Tim.

    "Copy and Paste" doesn't sound harmful at all. In fact, "Copy and Paste" is something business folks do all the time. It is understandable that they would hear the phrase and feel comfort in the familiarity and the knowledge that you also don't waste time reinventing the wheel.

    But "Copy and Paste" doesn't accurately depict what is actually happening. That's why I call it "Copy/Paste/Molest". That is what the "developer" is really doing. They are engaging in an act that causes drastic, immediate, and potentially permanent damage to the health of the code.

  2. When talking about copy-paste in context of source code I always say copy-paste-bug...

  3. Part of the reason I left my previous position is that copy-pasters were habitually rewarded with raises and promotions. Indeed the ability to remember the duplication was praised as a development skill. "See he knows the 7 places to change."

    Worst offense: a defect was copy-pasted into the product. It was released into the wild. A prominent customer angrily reported it. The same developer then made a "quick and dirty" fix....

    ... and was awarded a bonus for customer responsiveness.

  4. Yeah, copying and pasting is evil, no doubt. In its pure form, it's easy to hate. But how about not so pure forms, which, I suspect, is a more common occurrence than the pure variation. For example:

    - I need to copy the code, but with a *slight* change. Do I factor out the code and parameterize it now? hm, maybe... what if it happens again? more parameters? maybe not... entirely different refactoring? maybe... or not... :)

    - I do factor out the code, but what if there is no common place for the factored-out code to live, because larger decoupling considerations have separated the copy-and-paste points too far from each other and the common code has no clear home?

    Don't know if this makes sense. But, in my experience, gray areas are very common.

    Great post though, no doubt!

  5. I can certainly see the legitimate copying of a code example from a personal collection of code snippets (field stones), API documentation, or a snippet site on the web. If this is the first time it is copied into the application, and is non-plagiaristic, then I think that can be okay.

    I'm less keen on the copying of code within an application (or amongst the various dll/jar/module/etc within an application). If it needs copied, it probably needs refactored and/or moved into some kind of common library. If multiple apps need it, then it needs a better home where they can all have it.

    Sometimes I will admit that it may have to be re-inlined in one place to be better extracted, as it may have temporal binding that you don't need. I see that as a good thing.

    I tend to start tests by copying, but leaving repeated code in tests just eats at me. I tend to re-refactor tests to get rid of it. If there is much of that going on, it signals a need for either utility methods or setup/teardown.

    The part that really eats me is when the code you want is woven into a form or report or similar place. There is a pressure to duplicate the code in your unit tests and your automated ATs/regression/etc. That pressure must be met aggressively. Code woven into inconvenient places needs refactoring more than code duplicated in convenient places.

    Refactoring is like weight loss. The harder it seems, the more you need to do it.

    Creating new places for common stuff is an interesting task. There are certainly times to allow a "misc" library to exist... for now.

    I believe that quite a lot of design is emergent. Code is like life in that we often don't know which parts are the most important at the time. Like life, we need to try to treat it all as being important.

    A wise young man told me that the difference between the way one codes when one has time and the way one codes under pressure is the degree to which one "sucks". If the gray is in the code, then you have an interesting situation to blog and discuss later. If it is in your circumstance, you have to decide how good you want to be.

  6. Ah to be able to surround oneself with rational programs that aren't hacked up, half baked, unplanned, irrational junk...

    Copy and paste has its place in the world of art, or perhaps journalism. Beyond that...

  7. Maybe it's our languages that are wrong. Or our compilers. If people are using copy-and-paste to get faster short-term results than proper coding, then maybe we need to copy-and-paste more and let our compilers take care of the burden?

    Why can't I copy-and-paste to my heart's content, then have my compiler sniff around for duplication and eliminate it automatically?

    Would this be the best of both worlds?

    Don't answer that: it's rhetorical.

  8. Nah, I don't know that there *should* be a tool that spins crap into gold. Besides, learning to see and eliminate duplication is a profound skill and well worth learning.

  9. IMHO, when you use "Copy and Paste" you stop thinking about the code you use, because the brain is lazy and thinking demands a lot of energy from the body. The result of not thinking about the code is well known! So every time you copy even a word, you give your brain a chance to stop thinking!

    Anton Jorov Antonov