dreadedmonkeygod . net

HTML and Regular Expressions

Jeff Atwood pauses to consider parsing HTML with regular expressions, a practice that begins life in your program as an expedient three-line hack and gradually grows into the primary source of bugs, monopolizing development time.

So, yes, generally speaking, it is a bad idea to use regular expressions when parsing HTML. We should be teaching neophyte developers that, absolutely. Even though it's an apparently neverending job. But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings. And how to tell which is the right approach for the task at hand.

Absolutely. Weighing the strengths and weaknesses of different approaches to a problem is central to the practice of software development.

But there's an even bigger lesson lurking here.

When you first start, it's really easy to handle 90% of cases with a couple of regexps. And if that's good enough, so be it.
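To make that concrete, here is a rough sketch of what the first hack might look like. The `extract_links` function and its pattern are made up for illustration, not taken from any real project: they pull `href` values out of anchor tags and quietly assume double quotes and `href` as the first attribute.

```python
import re

# Hypothetical three-line hack: assumes href is the first attribute
# and that its value is wrapped in double quotes.
HREF_RE = re.compile(r'<a href="([^"]*)"')

def extract_links(html):
    """Return every href value the pattern happens to match."""
    return HREF_RE.findall(html)

# The common case works.
common = extract_links('<a href="/home">Home</a>')

# Perfectly valid HTML that the pattern silently misses:
# single quotes, and another attribute ahead of href.
missed = extract_links("<a class='nav' href='/home'>Home</a>")
```

Nothing fails loudly on the second input. The link just goes missing, which is exactly the kind of gap that tends to get patched with one more regexp.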

But as the program grows and parsing code intermingles with processing logic, the code gets steadily more fragile. Eventually every bugfix requires a new regexp, and every regexp introduces two new bugs, and the descent into madness is complete.

At the beginning, using regexps was clearly the right choice. And at every step along the way, continuing to use regexps was clearly the expedient choice. But at the end, when every change to the code involves long hours of tear-your-hair-out debugging, using regexps is very, very wrong. And you're stuck: switching from regexps to a more robust tool takes more time than you have, and sticking with regexps makes maintenance too expensive to continue.

Choosing the right tool for the job is just the beginning. One must manage the growth of a piece of software as one tool reaches its limits and a new one is required. And that is what's missing from most undergrad programs. Programs are built once, given a grade, and then abandoned. There's very little chance to build the know-how to make small, cheap course corrections as a project grows, rather than herculean, bet-the-business overhauls later on. Just as important as recognizing the right tool is knowing when the tool you picked two years (or two weeks) ago has reached the end of its usefulness, and having the foresight to build a program that can survive the transition to a new tool, and thrive afterward.
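One way to leave room for that transition, sketched here under assumptions, is to keep all the parsing behind a single small function so the processing logic never sees how the HTML was read. The names `extract_links` and `count_internal_links` are hypothetical. The body of `extract_links` uses `HTMLParser` from Python's standard library, but it could just as well have started life as a regexp; what matters is that callers only depend on "HTML in, list of links out".

```python
from html.parser import HTMLParser

class _LinkCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def extract_links(html):
    """The one seam: swap the tool in here, and nowhere else."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

def count_internal_links(html):
    """Processing logic: knows nothing about regexps or parsers."""
    return sum(1 for link in extract_links(html) if link.startswith("/"))
```

With the seam in place, replacing the tool is a change to one function rather than a rewrite, and the small course correction stays small.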
