I am investigating methods of record linkage, but it does not seem like a solved problem.
Is there a tried and tested method to go about matching the records and fixing the duplication?
It's an incredibly tough problem, and I haven't seen it solved with technology alone (mind you, I've been working at banks for the last few years, which usually have 100k+ databases, so projects tend to run out of money before all the data is consolidated!)
From a theoretical standpoint, the holy grail would be a single source that becomes the system of record. In practice, every application now tries to become the system of record for its own domain, which it really shouldn't.
There isn't really a tried and tested method to match records and fix duplicates, other than going in and creating pipeline rules that clean the data before it's saved/ingested (a rough sketch of what such a rule might look like is below). Tools like Trifacta can help.
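To make that concrete, here's a minimal sketch of a matching rule in Python, using only the standard library. The field names (`name`, `email`) and the 0.85 similarity threshold are assumptions for illustration; real pipelines add blocking keys and survivorship rules on top of this.

```python
from difflib import SequenceMatcher

def normalize(record):
    """Canonicalize fields before comparison (lowercase, strip whitespace)."""
    return {k: str(v).strip().lower() for k, v in record.items()}

def similarity(a, b):
    """Fuzzy string similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()

def is_duplicate(r1, r2, threshold=0.85):
    """Treat two records as duplicates if emails match exactly,
    or if the names are sufficiently similar."""
    r1, r2 = normalize(r1), normalize(r2)
    if r1.get("email") and r1.get("email") == r2.get("email"):
        return True
    return similarity(r1.get("name", ""), r2.get("name", "")) >= threshold

def dedupe(records):
    """Naive O(n^2) pass: keep the first record of each duplicate cluster.
    Real systems block on a key (e.g. postcode) to avoid comparing every pair."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k) for k in kept):
            kept.append(rec)
    return kept

# Example: the second record collapses into the first on exact email match.
rows = [
    {"name": "Jane Smith", "email": "jane@example.com"},
    {"name": "Jane Smith", "email": "JANE@EXAMPLE.COM"},
    {"name": "John Doe",   "email": "john@example.com"},
]
print(dedupe(rows))  # -> two surviving records
```

In a real pipeline you'd run something like this as a pre-ingest step and log the collapsed pairs for manual review, since a false merge is usually more costly than a missed one.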
The real way to solve the problem is a competent IT team, at both the executive and operator level, that creates a proper strategy and consistently delivers value while the data cleansing exercise is underway. In practice, that means mapping out all the systems, identifying and prioritizing where the business gets the most value from the data cleansing exercise, and starting to fix things there.
The challenge is that data cleansing exercises are usually multi-year projects. At the same time, the business expects you to deliver value, so you need to figure out the strategy and continually win and maintain buy-in from the business. Vendors touting their latest and greatest only exacerbate the problem.