Monday 20 October 2008
The Tragedy Of Naive Software Development
A friend of mine is doing a graduate project in geography and statistics. He's trying to do some kind of stochastic simulation of populations. His advisor suggested he get hold of another academic's simulator and adapt it for this project.
The problem is, the software is a disaster. It's a festering pile of copy-paste coding. There's lots of use of magic numbers instead of symbolic constants ("if (x == 27) y = 35;"). The names are all wrong, it's full of parallel arrays instead of records, there are no abstractions, there are no comments except for code that is commented out, half the code that isn't commented out is dead, and so on.
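To give a flavour of it, here's a hypothetical before-and-after in the same vein (invented names, not the actual project code):

    // Before: magic numbers, and one parallel array per "field" of a cell.
    class Before {
        void grow(int cellType, double[] lats, double[] lons,
                  int[] populations, int[] growthRates, int i) {
            if (cellType == 27) growthRates[i] = 35;   // what are 27 and 35?
        }
    }

    // After: symbolic constants, and one class per logical record.
    class After {
        static final int URBAN_CELL = 27;
        static final int URBAN_GROWTH_PERCENT = 35;

        static class Cell {
            double lat, lon;
            int population;
            int growthPercent;
        }

        void grow(int cellType, Cell cell) {
            if (cellType == URBAN_CELL) cell.growthPercent = URBAN_GROWTH_PERCENT;
        }
    }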
It gets worse. These people don't know how to use source control, which is why they comment code out so they can get it back if they need it. No-one told them about automated tests. They just make some changes, run the program (sometimes), and hope the output still looks OK.
This probably isn't anyone's fault. As far as I know, it was written by someone with no training and little experience who had to get a job done quickly. But I think this is not uncommon. I know other people who did research in, say, aeronautics but spent most of their time grappling with gcc and gdb. That is a colossal waste of resources.
What's the solution? Obviously anyone who is likely to depend on programming to get their project done needs to take some good programming classes, just as I'd need to take classes before anyone let me near a chemistry or biology lab. This means that someone would actually have to teach good but not all-consuming programming classes, which is pretty hard to do. But I think it's getting easier, because these days we have more best practices and rules of thumb that aren't bogus enterprise software process management --- principles that most people, even hardcore hackers, will agree on. (A side benefit of forcing people into those classes is that maybe some will discover they really like programming and have the epiphany that blood and gears will pass away, but software is all.)
There is some good news in this story. This disaster is written in Java, which is no panacea but at least the nastiest sorts of errors are off-limits. The horror of this program incarnated in full memory-corrupting C glory is too awful to contemplate. I'm also interested to see that Eclipse's Java environment is really helping amateur programmers. The always-instant, inline compiler error redness means that wrestling with compiler errors is not a conscious part of the development process. We are making progress. I would love to see inline marking of dead code, though.
Comments
Besides, in this project there are so many compiler warnings that the dead variable/function warnings are lost in the noise.
- Parallel arrays and index masks are the biggest design pattern known to the scientific community. Any scientific coder would instantly know what was going on if he saw these things used in practice. They are often used in scientific codes to quickly add extra 'properties' to a calculation. This lets you create an array to hold intermediate results, calculate pressure over a finite difference field, and generally make the mathematics code look like math (see the pressure-update sketch below this list). Changing this to standard objects and records often leads to an even bigger mess.
- Commenting out a routine is perfectly appropriate when the code is traded around and no one looks at version control. Seriously, when was the last time you traced backwards through version control when you had a problem understanding a given function? At least with the commented-out version, enabling features is as easy as removing or adding a comment. Thank your stars you don't have to worry about #ifdefs.
- Count your blessings. Procedural code without horrendous abstractions is easier to refactor. The hardest part is dealing with global variables, and that can be fixed with search and replace.
- Creating unit tests for numerical applications is actually a huge pain. Only the most trivial problems have analytical solutions you can check against. Furthermore, you can't take binary diffs of the program's output: most C/C++/FORTRAN compilers play fast and loose with floating point operator associativity when optimizing, which can produce slightly different floating point results when two unrelated statements are reordered. Often it's easier to run the simulation and compare the visualizations.
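The associativity point is easy to demonstrate in a few lines (Java here, since that's what the app in question is written in; Java fixes the evaluation order, but the difference below is exactly what a reordering C or FORTRAN compiler introduces):

    public class FpAssoc {
        public static void main(String[] args) {
            double a = 0.1, b = 0.2, c = 0.3;
            // Same three numbers, two groupings, two different results.
            System.out.println((a + b) + c);                 // 0.6000000000000001
            System.out.println(a + (b + c));                 // 0.6
            System.out.println((a + b) + c == a + (b + c));  // false
        }
    }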
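And here is roughly what the parallel-array style looks like in practice: a made-up one-dimensional pressure update, not from any real code, with each physical quantity in its own array indexed by cell:

    public class PressureStep {
        public static void main(String[] args) {
            int n = 100;
            double dx = 0.01, dt = 1e-4;
            double[] pressure = new double[n];
            double[] density  = new double[n];
            double[] velocity = new double[n];
            java.util.Arrays.fill(density, 1.0);
            for (int i = 0; i < n; i++) pressure[i] = Math.sin(i * dx);

            // The update reads like the finite-difference formula itself:
            // dv/dt = -(1/rho) * dp/dx, central difference in x.
            for (int i = 1; i < n - 1; i++) {
                velocity[i] -= dt * (pressure[i + 1] - pressure[i - 1])
                               / (2 * dx * density[i]);
            }
            System.out.println("v[n/2] = " + velocity[n / 2]);
        }
    }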
Your solutions to the problem are also naive. Look at "How Java’s Floating-Point Hurts Everyone Everywhere" for a good description of why scientific coders shy away from Java.
http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf
I didn't say refactoring was hard. With tools like Eclipse, it's very simple. It's just not done.
I do VCS archaeology almost every day. There are great tools to support it, for example:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/layout/generic/nsBlockFrame.cpp&rev=3.959
Commenting out code does not scale over time and makes code hard to read. VCS browsing not only lets you see what changed, but why and when, and what the context looked like at the time...
I don't claim Java is "the solution" to every problem, I just said it's a lot better for this app than C.
For lots of FP apps, Kahan's issues are not relevant and Java would work fine and be better than C. For the others, use something else. (Actually I suspect for most people, availability of quality libraries is more likely to determine the language choice than Kahan's bugbears.) Using the right tool for the job (and knowing what tools are available and what their strengths and weaknesses are) is another area where a little training and experience provide great benefit.
I understand that automated testing of stochastic and numerically sensitive algorithms is hard. But you can at least do sanity checks on output, catch crashes/exceptions and infinite loops, and in some cases compare to expected results with tolerance.
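For instance, here's a minimal sketch of the kind of test I mean, using JUnit. simulate() is a made-up stand-in for the real model; the important trick is fixing the random seed so the stochastic output is reproducible:

    import org.junit.Test;
    import static org.junit.Assert.*;

    public class SimulationSanityTest {
        // Made-up stand-in for the real model; the point is the fixed seed.
        static double[] simulate(long seed) {
            java.util.Random rng = new java.util.Random(seed);
            double[] populations = new double[1000];
            for (int i = 0; i < populations.length; i++)
                populations[i] = 50.0 + 10.0 * rng.nextGaussian();
            return populations;
        }

        @Test
        public void outputIsSane() {
            double[] pop = simulate(42L);
            double mean = 0.0;
            for (double p : pop) {
                assertFalse("NaN leaked into the output", Double.isNaN(p));
                assertTrue("population went negative", p >= 0.0);
                mean += p;
            }
            mean /= pop.length;
            // Fixed seed makes the run deterministic; compare with a
            // tolerance, never with ==.
            assertEquals(50.0, mean, 1.0);
        }
    }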
What really scares me is that code like this produces publishable results. If you don't have validation that's precise enough to be automatable then why should I trust your science? Having a human look over the results and see if they match expectations is obviously a hopeless methodology.
A quick comment on: "but it's not doing enough whole-program analysis to know which public functions of classes are dead"
IMHO this is quasi-impossible. It would be plain wrong for Eclipse (or any other tool) to assume that a public method that is never explicitly called is not called at all and to show a warning. The method could be part of an external API or could be called via reflection - either way, there is no way for Eclipse to figure that out.
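To make the reflection point concrete: a call like the one below can invoke any public no-arg method whose name only shows up as a string at runtime, so no amount of workspace analysis can prove such a method dead (the program here is invented for illustration):

    import java.lang.reflect.Method;

    public class ReflectiveCall {
        public static void main(String[] args) throws Exception {
            // Class and method names arrive as strings at runtime;
            // nothing in the source ever mentions them directly.
            Class<?> cls = Class.forName(args[0]);
            Object instance = cls.newInstance();
            Method m = cls.getMethod(args[1]);
            m.invoke(instance);
        }
    }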
For example, the same principle could be applied to your site design, which has left this article in six 11-line columns on my screen (borderline unreadable).
Try not to be so harsh to your lessers, lest your betters remind you how it feels from the other side of the fence.
b.m: that's a weakness of Java, really. It should be possible to declare what your public API is, and what methods are exposed to reflection, and then tools can do intelligent dead code detection.
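Something as lightweight as a marker annotation would be enough to express that. A sketch of what I mean (the @PublicApi annotation is invented; no tool actually supports this today):

    import java.lang.annotation.*;

    // Invented marker: "this is an entry point or reflection target;
    // don't flag it as dead even if nothing in the workspace calls it."
    @Retention(RetentionPolicy.CLASS)
    @Target({ElementType.METHOD, ElementType.TYPE})
    @interface PublicApi {}

    class Simulator {
        @PublicApi
        public void run() { step(); }  // declared API: never flagged

        public void step() { }         // reachable from run(): live

        public void debugDump() { }    // public, unannotated, no callers:
                                       // fair game for a dead-code warning
    }

A tool could then treat public-but-unannotated methods with no callers in the workspace as dead-code candidates.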
This generation of n00blar crap coding needs to disappear
Disclaimer: I work for CodeGear.
When the number of warnings grows beyond what you can keep in your head, you have to stop and do some refactoring and fixing.
There was no dynamic allocation at all.
It parsed a 30MB blob of numbers in ASCII with scanf (this took a few minutes), stored it in a fixed 2D array, and had rampant copy-paste programming.
It also had 2000-line functions (with 1500-line loops inside) and some headers... that were treated no differently from regular source files...
Luckily my only task was to get it running... which it did without me having to change the code.
Greg Wilson was my prof for a course and got me the job I mentioned in my other post...
Small world