As programmers, we like to solve problems. We like to get ideas to spring from our heads, channel through our fingertips, and create magical solutions.
But sometimes we are too quick to jump in and start cranking out code to solve the problem without considering all the implications of the issues we’re trying to solve. We don’t consider that someone else might have already solved this problem, with code available for our use that has already been written, tested, and debugged. Sometimes we just need to stop and think before we start typing.
For example, when you encounter these seven coding problems, you’re almost always better off looking for an existing solution than trying to code one yourself:
1. Parsing HTML or XML
The task whose complexity is most often underestimated, at least judging by how often it’s asked about on Stack Overflow, is parsing HTML or XML. Extracting data from arbitrary HTML looks deceptively simple, but it really should be left to libraries. Say you’re looking to extract a URL from an <img> tag like:
<img src="foo.jpg">
A simple regular expression that captures whatever sits between the quotes after src= seems to do the job.
The string “foo.jpg” will be in capture group #1 and can be assigned to a string. But will your code handle tags with other attributes like:
<img id="bar" src="foo.jpg">
And after you change your code to handle that, will it handle alternate quotes like:
<img src='foo.jpg'>
or no quotes at all, like:
<img src=foo.jpg>
What about if the tag spans multiple lines and is self-closing, like:
<img id="bar"
     src="foo.jpg"
/>
And will your code know to ignore this commented-out tag:
<!-- <img src="foo.jpg"> -->
By the time you’ve gone through the cycle of finding yet another valid case your code doesn’t handle, modifying the code, retesting it, and trying it again, you could have used a proper library and been done with it.
That’s the story with all of these examples: You’ll spend far less time finding an existing library and learning to use it than you will rolling your own code, debugging it, and then extending it to fit all the cases you hadn’t thought of when you started.
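To make that concrete, here is a minimal sketch in Python using the standard library’s html.parser module. The NAIVE_IMG pattern is only my guess at the kind of regular expression described above, and the names ImgSrcCollector, extract_img_sources, and naive_extract are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the "simple regular expression" approach.
NAIVE_IMG = re.compile(r'<img src="(.+?)">')


class ImgSrcCollector(HTMLParser):
    """Collects the src value of every <img> tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already stripped.
        # Self-closing tags reach this method too, via the default
        # handle_startendtag implementation.
        if tag == "img":
            self.sources.extend(v for k, v in attrs if k == "src" and v)


def extract_img_sources(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


def naive_extract(html):
    return NAIVE_IMG.findall(html)


SAMPLES = [
    '<img src="foo.jpg">',
    '<img id="bar" src="foo.jpg">',
    "<img src='foo.jpg'>",
    "<img src=foo.jpg>",
    '<img id="bar"\n     src="foo.jpg"\n/>',
    '<!-- <img src="foo.jpg"> -->',
]

for sample in SAMPLES:
    print(repr(sample), naive_extract(sample), extract_img_sources(sample))
```

The parser reports foo.jpg for the first five samples and nothing for the commented-out tag. The regex gets only the first sample right, and it happily pulls a URL out of the comment.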
2. Parsing CSV and JSON
CSV files are deceptively simple, yet fraught with peril. Files of comma-separated values are trivial to parse, right?
# ID, name, city
1, Queen Elizabeth II, London
Sure, until you have double-quoted values to handle embedded commas:
2, J. R. Ewing, "Dallas, Texas"
Once you get around those double-quoted values, what happens when you have embedded double quotes in a string that have to be escaped:
3, "Larry \"Bud\" Melman", "New York, New York"
You can get around those, too, until you have to deal with embedded newlines in the middle of a record.
JSON has all the same data type hazards as CSV, with the added headache of being able to store multi-level data structures.
Save yourself the hassle and inaccuracy. Any data that can’t be handled by splitting the string on a comma should be left to a library.
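Here is a rough sketch of what that looks like with Python’s standard csv and json modules. The sample rows are the ones above plus a made-up fourth record with an embedded newline. The reader options are my assumptions about this particular dialect: a space after each comma and backslash-escaped quotes. Many CSV files double their quotes instead, which is the csv module’s default.

```python
import csv
import io
import json

RAW = (
    '1, Queen Elizabeth II, London\n'
    '2, J. R. Ewing, "Dallas, Texas"\n'
    '3, "Larry \\"Bud\\" Melman", "New York, New York"\n'
    '4, "Line one\nline two", Chicago\n'  # hypothetical record with a newline
)


def parse_rows(text):
    reader = csv.reader(
        io.StringIO(text, newline=""),
        skipinitialspace=True,  # the sample puts a space after each comma
        doublequote=False,      # quotes are escaped with a backslash...
        escapechar="\\",        # ...rather than doubled
    )
    return list(reader)


for row in parse_rows(RAW):
    print(row)

# JSON gets the same treatment: let the library deal with escapes and nesting.
NESTED = json.loads(
    '{"name": "Larry \\"Bud\\" Melman", "cities": ["Dallas, Texas"]}'
)
print(NESTED["name"], NESTED["cities"][0])
```

Every record comes back as exactly three fields, including the ones with embedded commas, quotes, and newlines.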
If it’s bad to read structured data in an unstructured way, it’s even worse to try to modify it in place. People often say things like, “I want to change all the <img> tags with such-and-such a URL so they have a new attribute.” But even something as seemingly simple as “I want to change any fifth field in this CSV with the name Bob to Steve” is dangerous because, as noted above, you can’t just count commas. To be safe, you need to read the data into an internal structure using a comprehensive library, modify the data, and then write it back out with the same library. Anything less risks corrupting the data if its structure doesn’t precisely match your expectations.
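The read, modify, write cycle for the Bob-to-Steve example might be sketched like this in Python. The function name rename_fifth_field and the sample rows are invented for illustration, and I’m assuming the csv module’s default dialect (doubled quotes, no space after commas).

```python
import csv
import io


def rename_fifth_field(text, old="Bob", new="Steve"):
    """Parse the CSV, change matching fifth fields, and write it back out."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # row[4] is the real fifth field, no matter how many commas
        # appear inside earlier quoted fields.
        if len(row) >= 5 and row[4] == old:
            row[4] = new
        writer.writerow(row)
    return out.getvalue()


DATA = (
    '1,"Smith, Bob",x,y,Bob\n'  # counting commas would land on "y"
    '2,"a,b",c,Bob,e\n'         # counting commas would land on "Bob"
)
print(rename_fifth_field(DATA), end="")
```

Only the first row changes. In the second row Bob is really the fourth field, which counting commas would have gotten wrong.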
3. Email address validation
There are two ways you can validate an email address. You can have a very simple check like, “I need to have some characters before an @ sign, and then some characters after,” matching it against this regular expression:
.+@.+
It’s not complete and it lets invalid stuff through, but at least you’ve got an @ sign in the middle.
Or you can validate it against the rules in RFC 822 (since superseded by RFC 2822 and then RFC 5322). Have you read RFC 822? It covers all sorts of things that you rarely see but are still legal. A simple regular expression isn’t going to cut it. You’re going to need to use a library that someone else has already written.
If you’re not going to validate against RFC 822, then anything else you do is going to be a matter of using a subset of rules that seem reasonable but might not be correct. That’s often a valid design tradeoff to make, but don’t fool yourself into thinking that you’ve covered all the cases unless you go back to the RFC, or use a library written by someone who has.
(For far more discussion about validation of email addresses, see this Stack Overflow thread.)
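For the record, the simple check is about this much code in Python. The function name simple_email_check is made up, and this is only the “there’s an @ in the middle” test, not RFC validation of any kind.

```python
import re

# Some characters, an @ sign, some more characters. Nothing else is checked.
SIMPLE_EMAIL = re.compile(r".+@.+")


def simple_email_check(text):
    return SIMPLE_EMAIL.fullmatch(text) is not None


for candidate in ["user@example.com", "no-at-sign", "@example.com", "a@b@c"]:
    print(candidate, simple_email_check(candidate))
```

Note that a@b@c sails through. That’s the tradeoff: the check is cheap and honest about what it does, as long as you don’t mistake it for real validation.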
4. Processing URLs
URLs aren’t nearly as odious as email addresses, but they’re still full of annoying little rules you have to remember. What characters need to be encoded? How do you handle spaces? How about + signs? What characters are valid to go in that part after the # sign?
Whatever language you’re working in, there’s code to break apart URLs into the components you need, and to reassemble them from parts, properly formatted.
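In Python, for instance, that code lives in urllib.parse. The sketch below splits a URL, adds a query parameter, and puts it back together. The helper add_query_param and the example.com URL are made up for illustration.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit, urlunsplit


def add_query_param(url, key, value):
    """Return url with key=value appended to its query string."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query.setdefault(key, []).append(value)
    # urlencode handles the escaping: spaces, + signs, & and the rest.
    new_query = urlencode(query, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


URL = "https://example.com/search?q=c%2B%2B+tips#results"

parts = urlsplit(URL)
print(parts.netloc, parts.path, parts.fragment)
print(parse_qs(parts.query))  # the decoded value is "c++ tips"
print(add_query_param(URL, "lang", "en & fr"))
print(quote("my file.txt"))   # path-style escaping uses %20 for spaces
```

The library already knows that + in a query string means a space, and that a literal + has to be written as %2B, so you don’t have to remember.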
5. Date/time manipulation
Date/time manipulation is the king of problem sets with rules you can’t possibly wrap your head around all at once. Date/time handling has to account for time zones, daylight saving time, leap years, and even leap seconds. In the contiguous United States, we have only four time zones to think about, and they’re all an hour apart. The rest of the world is not so simple.
Whether it’s date arithmetic where you’re figuring out what three days after another day is, or you’re validating that an input string is in fact a valid date, use an existing library.
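Both of those tasks take a few lines with Python’s standard datetime module. This is a minimal sketch. The function names are invented, and I’m assuming dates arrive as YYYY-MM-DD strings.

```python
from datetime import date, datetime, timedelta


def three_days_after(day):
    # The library knows how long each month is, leap years included.
    return day + timedelta(days=3)


def is_valid_date(text, fmt="%Y-%m-%d"):
    """True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


print(three_days_after(date(2024, 2, 27)))  # 2024 is a leap year
print(three_days_after(date(2023, 2, 27)))  # 2023 is not
print(is_valid_date("2024-02-29"), is_valid_date("2023-02-29"))
```

The same starting day lands on March 1 in one year and March 2 in the next, and February 29 is valid only in the leap year. None of that logic is yours to maintain.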
6. Templating systems
It’s almost a rite of passage. A junior programmer has to create lots of boilerplate text and comes up with a simple little format like:
Dear #user#, Thank you for your interest in #product#...
It works for a while, but then she winds up having to add multiple output formats, and numeric formatting, and outputting structured data in tables, and on and on until she’s built an ad hoc monster requiring endless care and feeding.
If you’re doing anything more complex than simple string-for-string substitution, step back and find yourself a good templating library. To make things even simpler, if you’re writing in PHP, the language itself is a templating system (though that is often forgotten these days).
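Even the simple string-for-string case is usually already covered. As a sketch, Python’s standard string.Template handles the letter above, using $name placeholders rather than the #name# format. The render_letter helper and the sample values are made up.

```python
from string import Template

LETTER = Template("Dear $user, Thank you for your interest in $product...")


def render_letter(user, product):
    # substitute() raises KeyError if a placeholder has no value,
    # so a forgotten field fails loudly instead of shipping as "$user".
    return LETTER.substitute(user=user, product=product)


print(render_letter("Ada", "Widgets"))
```

Once you need loops, conditionals, or number formatting, that’s the point to reach for a full templating library rather than bolting features onto this.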
7. Logging frameworks
Logging tools are another example of projects that start small and grow into behemoths. One little function for logging to a file soon needs to be able to log to multiple files, or to send email on completion, or to support varying log levels, and so on. Whatever language you’re using, there are at least three log packages that have already been around for years and will save you no end of aggravation.
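In Python, one of those packages ships with the language. Here is a small sketch of multiple destinations with different log levels using the standard logging module. The build_logger helper, the logger name, and the in-memory streams standing in for files are all assumptions for the example.

```python
import io
import logging


def build_logger(name, destinations):
    """destinations is a list of (stream, minimum_level) pairs."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.propagate = False
    for stream, level in destinations:
        handler = logging.StreamHandler(stream)
        handler.setLevel(level)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


everything = io.StringIO()   # stand-in for a verbose log file
errors_only = io.StringIO()  # stand-in for an errors-only log file

log = build_logger(
    "demo", [(everything, logging.DEBUG), (errors_only, logging.ERROR)]
)
log.debug("starting up")
log.error("disk full")

print(everything.getvalue(), end="")
print(errors_only.getvalue(), end="")
```

Swapping a stream for a file, or adding an email handler, is a matter of changing which handler class you attach. Your calling code doesn’t change.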
But isn’t a library overkill?
Before you pooh-pooh the idea of a module, take a hard look at your objections. The number one objection is usually, “Why do I need an entire library just to do (validate this date/parse this HTML/etc.)?” My response is, “What’s the harm in using it?” Chances are you’re not writing microcontroller code for a toaster where you have to squeeze out every byte of code space.
If you have speed concerns, consider that avoiding a library may be premature optimization. Maybe loading up an entire date/time library makes your date validation take 10 times as long as your mostly correct homemade solution, but profile your code first to see if it actually matters.
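Measuring is cheap. As a rough sketch, Python’s standard timeit module can compare a homemade check against a library call. The two validators here are hypothetical, and the numbers will vary by machine, so none are shown.

```python
import re
import timeit
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def homemade_is_valid(text):
    # "Mostly correct": checks the shape only, so 2023-02-31 slips through.
    return DATE_SHAPE.fullmatch(text) is not None


def library_is_valid(text):
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def time_both(sample="2024-02-29", number=10_000):
    """Return total seconds taken by each validator over `number` calls."""
    return {
        "homemade": timeit.timeit(lambda: homemade_is_valid(sample), number=number),
        "library": timeit.timeit(lambda: library_is_valid(sample), number=number),
    }


for name, seconds in time_both().items():
    print(f"{name}: {seconds:.4f}s for 10,000 calls")
```

Whatever the ratio turns out to be, compare the absolute cost per call with everything else your program does before deciding it matters.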
We programmers are proud of our skills, and we like the process of creating code. That’s fine. Just remember that your job as a programmer is not to write code but to solve problems, and often the best way to solve a problem is to write as little code as possible.