HACKER Q&A
📣 superzamp

How do you monitor data integrity?


I'm finding little if any literature on this topic; all resources on monitoring are heavily focused on capturing and analyzing run-time traces: app logs, HTTP statuses, etc. Database constraints and unit tests do exist, but so do bugs. Since it's almost impossible to prevent bad data from being saved sooner or later, do you have a strategy for monitoring your data integrity?


  👤 hypewatch Accepted Answer ✓
You mentioned database constraints and testing already, which are great automated methods.

Besides these automated methods, another way to monitor data integrity is to have a human manually analyze the data for inconsistencies, and to ask other domain experts at your company to do the same. This is sometimes called the “smell test”. It can be an effective way to find deeper integrity issues in your data that would be hard to catch with automated tests.
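
One way to make the smell test repeatable is to script the anomaly queries and have a human review the report. Here's a minimal sketch in Python with sqlite3; the `orders`/`line_items` schema and the specific queries are hypothetical, stand-ins for whatever "this should never happen" rules your domain experts come up with:

    import sqlite3

    # Recurring "smell test" queries; a human reviews every non-zero count
    # and decides whether it is a real integrity problem.
    # The orders/line_items schema below is hypothetical.
    SMELL_CHECKS = {
        "orders with non-positive totals":
            "SELECT COUNT(*) FROM orders WHERE total_cents <= 0",
        "orders shipped before they were placed":
            "SELECT COUNT(*) FROM orders WHERE shipped_at < placed_at",
        "orders with no line items":
            "SELECT COUNT(*) FROM orders o WHERE NOT EXISTS "
            "(SELECT 1 FROM line_items li WHERE li.order_id = o.id)",
    }

    def run_smell_checks(conn):
        for description, query in SMELL_CHECKS.items():
            (count,) = conn.execute(query).fetchone()
            flag = "ok" if count == 0 else "INVESTIGATE"
            print(f"{flag:12} {count:6d}  {description}")

    if __name__ == "__main__":
        # Tiny in-memory demo with one deliberately bad row.
        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER,
                                 placed_at TEXT, shipped_at TEXT);
            CREATE TABLE line_items (id INTEGER PRIMARY KEY, order_id INTEGER);
            INSERT INTO orders VALUES (1, -500, '2024-01-02', '2024-01-01');
        """)
        run_smell_checks(conn)

Run on a schedule, the report gives the humans something concrete to sniff at instead of eyeballing raw tables.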

Maintaining data integrity is very difficult, domain dependent, and takes lots of work. Good luck!


👤 Someone1234
If it were possible to create a rule that let me find "bad data", then I'd use that same rule in a constraint and outlaw the bad data as it was being written. "Data integrity" is entirely domain dependent, and externalities outside the actual data can cause it to lose integrity (i.e. real life can ruin your data's integrity).
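
As a minimal sketch of that first point, in Python with sqlite3 (the `bookings` table and its rule are hypothetical): the rule that would *find* bad rows lives in the schema instead, and rejects them as they're written.

    import sqlite3

    # The rule "a booking must end after it starts" could be a monitoring
    # query; as a CHECK constraint it rejects bad rows at write time.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE bookings (
            id        INTEGER PRIMARY KEY,
            starts_at TEXT NOT NULL,
            ends_at   TEXT NOT NULL,
            CHECK (ends_at > starts_at)
        )
    """)

    try:
        conn.execute("INSERT INTO bookings VALUES (1, '2024-06-02', '2024-06-01')")
    except sqlite3.IntegrityError as exc:
        # Rejected immediately, not discovered later by a monitoring query.
        print("rejected at write time:", exc)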

For example, you could have mapping data telling drivers how to travel between sites. But what if a site's entrance moves? What if there's temporary road construction? What if a new bridge is built, shortening the route? All of these harm your data's integrity without the data itself changing.

So my only strategy is:

- When you find yourself correcting bad data, consider if a constraint is possible.

- Referential constraints are superior to logical ones, but both are better than nothing (logical constraints risk encoding business rules into the schema, which can balloon over time).

- Data types are "super weapon" constraints. Define exactly what you plan on storing, don't future-proof, and avoid nullable types whenever possible. Storing "magical strings" is often wrong; create a table instead (see the sketch after this list).

- Simpler data sets generally stay higher quality for longer. Consider simplifying and streamlining your actual data design (e.g. using generic building blocks and references for common domain tasks).
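
Here's a minimal sketch of the data-type and referential points, again in Python with sqlite3 (the schema is hypothetical): a nullable "magical string" status column becomes a NOT NULL foreign key into a small lookup table, so only known statuses can ever be stored.

    import sqlite3

    # Instead of a free-form status string, store a NOT NULL reference
    # into a lookup table of the statuses that actually exist.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection
    conn.executescript("""
        CREATE TABLE order_statuses (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE       -- 'placed', 'shipped', ...
        );
        INSERT INTO order_statuses (name) VALUES ('placed'), ('shipped');

        CREATE TABLE orders (
            id        INTEGER PRIMARY KEY,
            status_id INTEGER NOT NULL REFERENCES order_statuses(id)
        );
    """)

    conn.execute("INSERT INTO orders (status_id) VALUES (1)")  # 'placed': fine
    try:
        # A typo'd or invented status has no row in the lookup table,
        # so the referential constraint rejects the write.
        conn.execute("INSERT INTO orders (status_id) VALUES (99)")
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)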

In general, hunting for bad data/low integrity is like being able to completely define the whole problem domain in one go and then deciding not to use that same information for constraints; it doesn't really make sense. If you're consistently finding yourself looking for and correcting data integrity problems, go back to the design and fix it there.