The Problem with Dirty Data
The best business intelligence and analytics platform will fail if it runs on dirty data. No company sets out to build mountains of dirty data; it’s just that, from day one, few of us thought to make sure every spreadsheet and every database used matching labels, names, categories, formats and even matching database models. And who would ever think that leaving out a data field or entry could call business intelligence and analytics reports into question?
“Data Governance is a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.” – the Data Governance Institute
Data governance seems to become an issue only long after the problem could have been solved very easily. Just imagine if, from day one, everyone had used the same format for dates. Consider Microsoft Excel, since so many small and even medium-sized businesses depend on the venerable spreadsheet for collecting data and creating reports. Excel offers 17 different built-in formats for dates. Even ZIP Codes can be entered two different ways – don’t forget ZIP Plus Four. Excel also has a “custom” data format so individual users can create their own specialized formats.
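To see why inconsistent date formats matter, here is a minimal Python sketch of the kind of normalization step a data-cleaning pipeline ends up needing. The format list and the function name are illustrative assumptions, not a complete inventory of Excel's formats:

```python
from datetime import datetime

# A few of the many date formats that show up in real spreadsheets (illustrative).
KNOWN_FORMATS = ["%m/%d/%Y", "%d-%b-%y", "%Y-%m-%d", "%B %d, %Y"]

def normalize_date(raw: str) -> str:
    """Parse a date string in any known format and return ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

# The same date, entered four different ways, collapses to one canonical form.
for raw in ["3/14/2024", "14-Mar-24", "2024-03-14", "March 14, 2024"]:
    print(normalize_date(raw))  # prints "2024-03-14" each time
```

Had every source agreed on one format up front, this conversion step would never be needed – which is the whole point of governance from day one.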
What if one data source uses “Payment” as a database naming convention while another calls it “Payable”? Even if you are embedding BI, don’t expect your application to deduce that the two mean the same thing. Sometimes your employees won’t even know. How about salary? Or is it wages, pay, earnings, remuneration or compensation? Moving that same data to the cloud didn’t clean it up, either.
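One common way to tame the synonym problem is a hand-maintained mapping from each source's column names to a single canonical schema. A minimal Python sketch, with a hypothetical mapping and canonical names chosen purely for illustration:

```python
# Hypothetical mapping from source column names to one canonical schema.
CANONICAL_NAMES = {
    "payment": "payable",
    "payable": "payable",
    "salary": "compensation",
    "wages": "compensation",
    "pay": "compensation",
    "earnings": "compensation",
    "remuneration": "compensation",
}

def harmonize_columns(columns):
    """Rename known synonyms; pass unknown columns through for human review."""
    return [CANONICAL_NAMES.get(c.strip().lower(), c) for c in columns]

print(harmonize_columns(["Salary", "Payment", "EmployeeID"]))
# → ['compensation', 'payable', 'EmployeeID']
```

The catch, of course, is that somebody with domain knowledge has to build and maintain that mapping – which is exactly the governance work that gets skipped on day one.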
Michael Stonebraker, a researcher at MIT’s Computer Science and Artificial Intelligence Laboratory, knows a thing or two about data, considering he helped develop relational database systems including Ingres, Postgres, Vertica and VoltDB, to name a few. He calls the end-to-end process of creating good data “data curation.” Stonebraker even cofounded a company that uses machine learning to take a bottom-up approach to data quality.
Preparing Your Data for Business Intelligence
If the number of data sources your business or client accesses is manageable, cleaning the data the top-down way may be preferable. Someone in IT and/or a DBA should be put in charge of standardizing all of the data sources or, at least, identifying the conflicts so that data integration can include conversions (a messy way to do things). The problem that person or team will face is a lack of domain knowledge about each data set. We can’t expect IT staff to be experts in health care, for example.
The larger the volume of data and the more data sources that need to be integrated through a BI and analytics platform, the better suited a bottom-up approach becomes. In this method, the system asks a domain expert whether two things are the same or different when it encounters a conflict such as “wages” vs. “salaries.” Stonebraker’s product, Tamr, asks a human administrator when it doesn’t know – and then collects those answers for future comparisons.
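The bottom-up idea can be sketched in a few lines of Python: auto-match field names when a similarity score is high, auto-reject when it is low, and defer to a domain expert in between – remembering the expert's answer for next time. This is a toy illustration of the pattern, not Tamr's actual algorithm, and the class name and thresholds are assumptions:

```python
from difflib import SequenceMatcher

class BottomUpMatcher:
    """Auto-decide on string similarity; defer ambiguous cases to a human.
    (A toy sketch of the human-in-the-loop idea, not Tamr's algorithm.)"""

    def __init__(self, auto_yes=0.9, auto_no=0.4):
        self.auto_yes, self.auto_no = auto_yes, auto_no
        self.answers = {}  # remembered expert decisions

    def same_field(self, a, b, ask_expert):
        key = frozenset((a.lower(), b.lower()))
        if key in self.answers:            # reuse an earlier expert answer
            return self.answers[key]
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= self.auto_yes:
            return True                    # confident enough to auto-match
        if score <= self.auto_no:
            return False                   # confident enough to auto-reject
        verdict = ask_expert(a, b)         # ambiguous: ask the human...
        self.answers[key] = verdict        # ...and learn from the reply
        return verdict

matcher = BottomUpMatcher()
expert = lambda a, b: True  # stand-in for a real domain expert saying "same"
print(matcher.same_field("wages", "salaries", expert))  # → True, via the expert
```

The payoff is that the expert is only consulted once per ambiguous pair; the system answers from its memory on every later encounter, which is how the workload stays manageable at thousands of sources.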
Whether a business picks a top-down or a bottom-up approach to imposing data quality depends on factors such as how many data sources it accesses. A business with thousands and thousands of spreadsheets may not have the manpower to clean all of that data by hand. But it had better make sure the data used to develop business intelligence and analytics is clean.
Learn more about how Izenda delivers self-service analytics that anyone in the user community can take advantage of, not just the analysts.