We’ve shared a few of our experiences around data preparation for migrations, the importance of a data strategy, and why data quality is so important to your big data architecture.
But we haven’t yet touched on the definition of data quality or, crucially, on why it is so critical!
Data is everywhere, and both its volume and the number of sources it is captured from are increasing daily.
We think it’s good timing, therefore, to provide you with a definition of data quality: what it means, why it’s so important to you and your business, and whether you should really care about it.
Getting to grips with your data
Your data may come from a variety of sources and in varying volumes. You could have vast new volumes of data coursing through your business, or huge volumes of incumbent data sitting within your systems and applications.
If your business hasn’t figured out how to use its data effectively, you’re missing out on critical opportunities to transform the business, launch new products or services, or remain competitive.
Once you get to grips with your data, however, you’re faced with another challenge: its quality.
It is impossible to make informed, timely business decisions if your data quality is bad. Many terms are used to define and describe data quality, including “complete” and “accurate”, but simply put, the larger concept of data quality is about whether or not your data is fit for the purpose or purposes you need to use it for.
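To make “fit for purpose” a little more concrete, here is a minimal Python sketch. It assumes a simple measure of completeness (the share of records with every required field filled in) and uses made-up customer records; the function name, field names and data are all hypothetical and only there for illustration.

```python
def completeness(records, required_fields):
    """Return the share of records with a non-empty value for every required field."""
    if not records:
        return 0.0
    complete = sum(
        1
        for record in records
        if all(record.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(records)


# Hypothetical customer records, made up for illustration only.
customers = [
    {"name": "Ann", "email": "ann@example.com", "postcode": "AB1 2CD"},
    {"name": "Bob", "email": "bob@example.com", "postcode": ""},
    {"name": "Cy", "email": "cy@example.com", "postcode": None},
    {"name": "Di", "email": "", "postcode": "EF3 4GH"},
]

# Purpose 1: an email campaign needs a name and an email address.
email_score = completeness(customers, ["name", "email"])

# Purpose 2: a postal mailing needs a name and a postcode.
postal_score = completeness(customers, ["name", "postcode"])
```

The same four records score differently depending on which fields a given purpose needs, which is the point: quality is judged against what you intend to do with the data.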
How hard is Data Quality to achieve?
Undeniably, it used to be difficult. Modern solutions for achieving data quality are not as onerous as some might think, yet most businesses, regardless of industry, still have challenges achieving it.
In fact, nearly 85% of CEOs around the world say they’re concerned about the quality of the data they are using to make business-critical decisions. That is partly down to the fact that poor-quality data has been reported to cost companies up to 25% of their annual revenue in lost sales, bad decisions and poor productivity.
Common Data Quality Challenges
Data silos, or isolated data, are often a challenge. These are separate sets of data owned by different business units or groups, often contained within a specific software package. Because of this confinement and separation, the data is often inaccessible to the rest of the business: it may have strict permissions or be incompatible with other software around the company. Because it’s not easily accessible, the business can’t get a complete picture of it, nor much value out of it.
Data quality is also often hard to gauge in complex and big data. Data comes from many different sources, can be structured or unstructured with varying criteria, and often arrives in huge volumes. Making sense of this data is often labour intensive and time consuming, with the result that the data may already be out of date by the time it is even collected!
The big question is – how do you approach data quality?
As with any business endeavour, you will most likely manage improving your data quality as a multi-step, multi-method process that may involve various components. Of course, what you choose depends very much on what you want to get out of your data, which echoes our earlier point that data quality is really about whether your data is fit for its purpose.
You may consider:
Big data scripting – requires a significant understanding of the types of data that need to be synthesized in order to know which scripting language to use.
Open source tools – tools are available; however, in practical terms they require some level of customization before any real benefit is realized, and support is limited, meaning you fall back on your existing IT team to make them work for you.
Traditional ETL (extract, transform, load) – integrates data from various sources into a data warehouse where it is prepped and managed for analysis. The challenge here is that it requires a team of skilled data scientists to scrub and cleanse the data first to address incompatibilities. Also, the tools used in ETL often process in batches instead of in real time.
Modern cleansing & ETL tools – an example of which is iData, which removes the manual work of traditional ETL tools by automating the cleansing, validation, transformation and de-duplication of your data before it is moved securely and stored in a data warehouse or data lake. What is more, iData continues to monitor your data, catches bad data inputs, raises alerts for remedial action and provides total coverage.
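The cleansing, validation and de-duplication steps mentioned in the last option can be sketched in a few lines of Python. This is only an illustration under simple assumptions and is not how iData or any other product works: the record layout, the function names and the deliberately loose email pattern are all hypothetical.

```python
import re

# Deliberately loose pattern: something@something.something (an assumption, not a full email check).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def cleanse(record):
    """Trim whitespace and lower-case the email so equivalent values compare equal."""
    return {
        "name": (record.get("name") or "").strip(),
        "email": (record.get("email") or "").strip().lower(),
    }


def is_valid(record):
    """A record is valid here if it has a name and a plausible-looking email."""
    return bool(record["name"]) and bool(EMAIL_PATTERN.match(record["email"]))


def prepare(records):
    """Cleanse, validate and de-duplicate records before they are loaded anywhere."""
    seen_emails = set()
    clean, rejected = [], []
    for raw in records:
        record = cleanse(raw)
        if not is_valid(record):
            rejected.append(raw)  # keep the raw input for remedial action
            continue
        if record["email"] in seen_emails:
            continue  # duplicate of a record already accepted
        seen_emails.add(record["email"])
        clean.append(record)
    return clean, rejected


# Hypothetical input, made up for illustration only.
incoming = [
    {"name": " Ann ", "email": "ANN@Example.com "},
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "", "email": "bob@example.com"},
    {"name": "Cy", "email": "not-an-email"},
]

clean, rejected = prepare(incoming)
```

Here the two “Ann” rows collapse into one clean record once whitespace and letter case are normalised, while the rows with a missing name or a malformed email are set aside for someone to fix rather than silently loaded.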