Data Hygiene: Making the case for the right amount of clean data

Published By: SiliconANGLE

How much Big Data is enough? Many people are under the impression that the more data you have, the more accurate your analysis will be. Rather than basing research on a small sample of a population, why not study the entire population? Many analysts who work with Big Data now say this assumption is false. Past a certain point, having more data makes little difference. Moreover, the quality of the knowledge you glean from analyzing Big Data depends less on the quantity of data and more on sound analytic processes.
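One way to see why extra volume stops helping is the standard error of a sample mean, which for independent random observations shrinks only with the square root of the sample size. The sketch below is purely illustrative: the standard deviation of 10 is an assumed figure, not something taken from the analysts quoted here, and the formula says nothing about bias, which more data does not remove.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean for n independent observations."""
    return sigma / math.sqrt(n)


# Assumed (illustrative) population standard deviation.
SIGMA = 10.0

# Each step uses 100 times more data but only cuts the error by a factor of 10.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  standard error={standard_error(SIGMA, n):.3f}")
```

Going from ten thousand to a million records costs a hundred times the storage and cleansing effort for a tenfold gain in precision, which may or may not matter for the decision at hand.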

Rather than spending time amassing as much data as possible, discerning data analysts are looking for clean data, which is data that has been purged of inaccuracies and vetted for authenticity. With Big Data, cleansing becomes a larger problem, and the more data you have, the bigger your mess can become. Therefore, it is better to start with a clean approach to gathering the data rather than haphazardly mining any and all data, only to have to sort through it later.
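As a rough sketch of what a clean approach to gathering might look like in practice, the example below checks each record as it arrives instead of storing everything and sorting it out later. The `CustomerRecord` type, its fields, and the validity rules are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class CustomerRecord:
    """Hypothetical record type used only for this illustration."""
    customer_id: str
    age: int


def is_clean(record: CustomerRecord) -> bool:
    """Assumed rules: a non-blank ID and a plausible age."""
    return bool(record.customer_id.strip()) and 0 <= record.age <= 120


def gather(records: Iterable[CustomerRecord]) -> Iterator[CustomerRecord]:
    """Yield only records that pass validation at collection time."""
    for record in records:
        if is_clean(record):
            yield record


incoming = [
    CustomerRecord("c-001", 34),
    CustomerRecord("", 51),        # rejected: missing ID
    CustomerRecord("c-003", 430),  # rejected: implausible age
]
kept = list(gather(incoming))
print(f"kept {len(kept)} of {len(incoming)} records")
```

Which rules count as "clean" is a judgment for the analyst; the point of the sketch is only that the check happens before the data piles up.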

Predictive analysis can play a big role in ensuring that data gathering methods will produce clean data. Someone with skills in soft data science can determine the most reliable variables that will strengthen a data model. It would therefore be shortsighted to suggest that a computer system, even a current generation artificial intelligence, can take the place of a good analyst who can start with clean data rather than having to cleanse data as an afterthought.

Data is always messy


Fractal Analytics founder and CEO Srikanth Velamakanni agrees that it is possible to have too much data but also maintains that working with samples is “always risky.” Regarding data hygiene, he said, “Data is always messy. Everybody has messy data. Having said that, we’ve never considered this a problem performing analytics and getting results. You can always get enough from it that it’s useful. If you’re digging for gas, you don’t mind the more expensive process because the value of the gas you finally get to.”

Ultimately, Velamakanni explains, you must find a balance between amassing too much data and relying on too small a sample. You do not want to work with data from ten years ago if you know it is no longer relevant to your research. At the same time, you do not want to ignore current data because it is larger than a traditional sample size. Quality Big Data analytics is found somewhere between the extremes.

Another risk data analysts take with Big Data is attempting to quantify human behavior. After all, business always deals with human behavior on some level, whether it is with customers in a retail environment or prisoners in a correctional facility. You can amass quite a bit of quantitative data about people but still not have a true understanding of why they do what they do or the circumstances surrounding their actions. Many businesses still gather plenty of qualitative data about customers and their opinions. That may take the form of interviews, focus groups and other softer approaches to gathering data. This can help them fine-tune their quantitative predictive analysis.

Big Data is not going away, but as organizations start to sift through their mountains of information, they will soon find that much of it is not what they need. They will need to make better decisions about how to gather clean data, and there may be an opening in the Big Data market for vendors and service providers that can help businesses do just that. How much Big Data is enough? The answer may be in the data itself.