Lies, Damned Lies and Statistics

Posted by Jim Jagielski on Friday, November 18, 2005

Yesterday I received an email from a friend. Among other things, he said he had noticed that I'd been doing a lot of work lately on Apache (I'm using Apache as the common shorthand for the Apache Web Server here :-) ). I agreed that yes, I had ramped up a bit over the last few weeks, fixing some bugs and adding a feature here and there. No, he said... I mean "a *LOT*" of work.

This was confusing to me. Although I had been doing more development on it than usual (I do pride myself on the fact that, even after more than 10 years, I'm still actively developing and committing on Apache), I would have called the amount of work "a good amount," but certainly not "a *LOT*". How did he decide on that adjective?

He then sent me a link he had come across. It points to a page on an "open source" company's site that provides "development statistics" on various open source projects. These stats are based on a tool called 'mpy-svn-stats', which tracks things such as the "size of the log entry," the "number of commits" and the "number of paths" (that is, the number of touched files).

That's when I figured out what the problem was. You see, in addition to doing some real development, I have also been spending time on simple code cleanups: things like detabbing source code indentation, removing trailing whitespace, etc. This made it appear that I was developing code like a maniac, when instead I was "just" cleaning up code. Please note: none of those changes did *anything* to fix bugs, improve performance, add features, or anything like that. They were completely non-functional changes. Yet they made it appear that I was doing a lot of "real" work.

That's why I hate those types of "metrics": they are basically meaningless. The "value" or "amount" of development depends on what the changes *do*.
A single commit that changes one file and adjusts maybe a dozen lines of code can be (and usually is) significantly more important than two dozen commits that adjust the formatting of if-else statements. Yet the latter is "seen" as more active.

It's also for this reason that the ASF avoids these types of metrics. Not only are they meaningless, but they can do serious harm to the community around the project. They tend to imply a "hierarchy" of developers, rather than the real communal aspect so important to the ASF projects. Not only that, but they give the wrong impression, provide a totally inaccurate picture of reality, and can encourage bad behavior. They are also very easy to trick, abuse and defraud.

That is another good aspect of open source: you can go deeper than simple-minded statistics and see what the changes actually do. But, imo, it's best to just ignore those "stats" completely... In development, it's usually quality over quantity.
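
For the curious, here is a rough sketch of what "going deeper" could look like for the whitespace case. It is only an illustration in Python: the function names are made up for this example, it is not how mpy-svn-stats or any ASF tooling works, and the normalization is deliberately naive (it ignores indentation entirely, so it would be wrong for a language like Python, and it will not recognize a brace moved onto another line as a cosmetic change).

```python
def normalize(source: str) -> list[str]:
    """Collapse every run of whitespace to one space and drop blank lines.

    Hypothetical helper: this throws away indentation and trailing
    whitespace, which is exactly what a detab/cleanup commit touches.
    """
    collapsed = (" ".join(line.split()) for line in source.splitlines())
    return [line for line in collapsed if line]


def is_whitespace_only(old: str, new: str) -> bool:
    """Return True if old and new differ, at most, in whitespace."""
    return normalize(old) == normalize(new)


# A tab-indented snippet with trailing whitespace...
before = "\tif (x) {\t\n\t\treturn 1;  \n\t}\n"
# ...the same code after a detab / trailing-whitespace cleanup...
cleanup = "    if (x) {\n        return 1;\n    }\n"
# ...and a change that actually alters what the code does.
bugfix = "    if (x != NULL) {\n        return 1;\n    }\n"

print(is_whitespace_only(before, cleanup))
print(is_whitespace_only(before, bugfix))
```

The cleanup touches every line of the snippet and the bugfix touches one, so a count of changed lines or paths ranks them backwards. Even a check this crude sees the difference, although it still says nothing about how much the bugfix is worth.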
