Real World Data

When you’re developing an application, it can sometimes be difficult to get data from the real world that matches your data model. Instead, all you are given is a schema or spec for that data and you have to make up sample data.

There are various reasons why you might not get real data: it might be difficult to find, you may not have permissions (or can’t find anyone with permissions), or the data might be sensitive and people are unwilling to part with it easily. Data scrubbers are available, but they still have to be configured and run, which the data owner may not have the time or inclination to do.

Ultimately, it doesn’t matter how simple your data model is. You will likely not get your software right unless you can test with real world data. You just don’t know what you’re going to run into. The data might not match the schema, or there may be holes in the schema that you don’t see. Real world data will slip through like a hernia! I came across one of these holes recently. I was given a schema that had data laid out in a tree structure. What was not mentioned was that the branches of the tree could loop back on themselves! I only found that out when I looked at a sample of the real data. Perhaps it was an omission in the schema, or perhaps it was a mistake in my interpretation. Either way, I had to rework part of my application.
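A check along the following lines is one way to catch looping branches as soon as real data arrives. This is only a sketch under assumptions: the data is taken to be a plain dict mapping each node id to a list of child ids, and the names `find_cycle`, `children` and `root` are made up for illustration rather than taken from any particular schema.

```python
def find_cycle(children, root):
    """Return a list of node ids forming a loop reachable from root, or None.

    `children` is a hypothetical layout: {node_id: [child_id, ...]}.
    """
    path = []        # nodes on the current walk down from the root
    on_path = set()  # same nodes as `path`, for fast membership tests
    done = set()     # nodes whose whole subtree has been checked

    def visit(node):
        if node in on_path:
            # We walked back into a node we are still underneath: a loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for child in children.get(node, []):
            loop = visit(child)
            if loop:
                return loop
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    return visit(root)


tree = {"a": ["b", "c"], "b": ["d"], "d": ["b"]}
print(find_cycle(tree, "a"))
```

The function reports the first loop it finds as a list that starts and ends on the same node. It is recursive, so a very deep tree could hit Python’s recursion limit; an explicit stack would avoid that.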

With real world data, you may spot opportunities to improve your application. For example, if you’re told that a field must be a string but all the real world data is numeric, you can change your data model to account for it. That might actually speed things up, or make it easier to perform mathematical operations. (Just a word of caution though: check if there is a risk the data type might change in the future.)
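A quick way to test a hunch like this is to measure how much of a field actually parses as a number. The sketch below assumes the field’s values have already been pulled into a list; `numeric_share` is a hypothetical helper name, not part of any library.

```python
def numeric_share(values):
    """Return the fraction of non-empty values that parse as a number."""
    seen = 0
    numeric = 0
    for value in values:
        text = str(value).strip()
        if not text:
            continue  # skip blanks rather than counting them as non-numeric
        seen += 1
        try:
            float(text)
            numeric += 1
        except ValueError:
            pass
    return numeric / seen if seen else 0.0


print(numeric_share(["10", " 42 ", "3.5", ""]))
print(numeric_share(["10", "abc"]))
```

A result of 1.0 only says that every value seen so far parses. Python’s `float()` is also fairly permissive: it accepts strings such as "1e5", "nan" and "inf", and a value like "00123" parses but would lose its leading zeros if stored as a number, so it is worth looking at the values themselves before changing the data model.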

Even if the schema is correct and the data is ‘perfect’, you still can’t consider every combination of input. To know which combinations to focus on, which is particularly important during testing, you will need to understand how your objects are populated. You can’t make that up. Sample data will get you going, but only real world data will help you finish.

