Screen scraping: it’s one of those things that we all do, but no one likes to admit to. That reminds me of something else that I can’t quite put my hand on…
I did my first screen scrape before I even knew what it was. I was probably a teenager at the time. I wrote a simple MS-DOS program (remember MS-DOS?!) to compare two directories. It was called dircmp.exe. You would run the command with two arguments and it would tell you which files were in one directory that weren’t in the other and vice versa. The proper way of creating this type of program is to use DOS system calls to read the directories. I didn’t really know what those were, or how to use them, so in my mind, the only sensible way was for my code to call ‘dir’ and then process the text, formatted just like it would be on the screen.
And it worked! It was fast and efficient, and worked on a few versions of DOS. I haven’t run it in a while, but I’d be surprised if it still worked. Since it expected the directory listing to be formatted in a particular way, if the listing format changed, my program would very quickly crumble like a dried leaf.
That was the first time I screen scraped, and it wasn’t the last. My next foray into screen scraping was trying to get and process box scores for a fantasy hockey league I ran back in the late 90s. There weren’t other readily available options for someone who, like me, wasn’t willing to spend money to connect to some sort of reliable API. Once a day, I’d manually copy the webpage into a text box in my program, and it would tell me how many points each player had in each game. Except, of course, when it didn’t, and I’d have to edit the mistakes in the copied text.
APIs are generally more reliable. These days it’s all about REST, and while REST is not always implemented well, when it exists, it’s usually pretty good.
But sometimes the data is just not available in a nice neat format. That’s when it’s screen scraping to the rescue!
So now that you have decided to screen scrape, how do you minimize the potential problems? Here are some suggestions:
1. Use multiple sources. They let you corroborate the data, fill in missing pieces, compensate for mistakes, and keep going if one or more sources disappear.
2. Wrap the screen scrape code with something that can easily be swapped out in case a better solution (e.g. an API) becomes available.
3. Use defensive programming techniques. (Do this anyway, screen scraping or not!) Unreliable or spotty data sources WILL cause your code to crash if it’s not written well.
4. Consider building your own API that you can consume. The data would then be filled in by a separate process that does the screen scraping.
5. If your program runs unattended, build an alert system that can contact you if something goes wrong and cannot be automatically recovered.
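
To make a few of these suggestions concrete, here is a minimal Python sketch of one way they might fit together. It is not the code from my old hockey program, just an illustration. Everything in it is an assumption made up for the example: the "Name Goals Assists" line layout, the player names, and the names `PointsSource`, `ScrapedSource`, `parse_box_score` and `fetch_with_fallback`. The scraper hides behind a small interface so it can be swapped for a real API later, the parser refuses to guess when the layout looks wrong, and a fallback loop tries several sources and reports failures through an alert callback.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Protocol


class ScrapeError(Exception):
    """Raised when scraped text does not match the expected layout."""


class PointsSource(Protocol):
    """Anything that can return points per player (scraper today, API tomorrow)."""

    def fetch_points(self) -> Dict[str, int]: ...


# Hypothetical layout, one player per line: "<name> <goals> <assists>"
LINE_RE = re.compile(r"^\s*(?P<name>[A-Za-z .'\-]+?)\s+(?P<goals>\d+)\s+(?P<assists>\d+)\s*$")


def parse_box_score(text: str) -> Dict[str, int]:
    """Defensively parse scraped text, skipping lines that do not fit the layout."""
    points: Dict[str, int] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # header, blank line, or junk: ignore rather than crash
        points[match["name"]] = int(match["goals"]) + int(match["assists"])
    if not points:
        # Nothing recognisable at all usually means the page layout changed.
        raise ScrapeError("no player lines recognised")
    return points


@dataclass
class ScrapedSource:
    """Wraps a function that returns raw page text (how it is fetched is up to you)."""

    get_text: Callable[[], str]

    def fetch_points(self) -> Dict[str, int]:
        return parse_box_score(self.get_text())


def fetch_with_fallback(
    sources: Iterable[PointsSource], alert: Callable[[str], None]
) -> Optional[Dict[str, int]]:
    """Try each source in order; alert on every failure, return None if all fail."""
    for source in sources:
        try:
            return source.fetch_points()
        except Exception as exc:  # a scraper can fail in many ways
            alert(f"{type(source).__name__} failed: {exc}")
    alert("all sources failed")
    return None


# Usage with made-up data: the first source has changed layout, the second still works.
broken = ScrapedSource(lambda: "<html>Site redesigned</html>")
good = ScrapedSource(lambda: "Player G A\nAlice Smith 1 2\nBob Jones 0 1\n")
alerts = []
result = fetch_with_fallback([broken, good], alerts.append)
```

In a real program the `alert` callback would send an email or a text message instead of appending to a list, and a class that talks to a proper API could replace `ScrapedSource` without touching the code that consumes the points.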