I will bet that most data migrations seem relatively easy at first: take data from one source and move it to another, a database to a data warehouse, for example. Of course, once you start you soon realize there is a lot more to it than you first expected. There will be more data than you thought, coming from sources you did not even know the business had or had never mapped out.
You then must ensure that the data matches your unified criteria, which means mapping it all to a common schema to ensure it is (a) useful and (b) of good quality.
Add to this security and compliance issues and resource constraints, and you have the recipe for what seemed like a straightforward project becoming a complex, extremely time-consuming undertaking to get right!
We have learned a lot from traditional data migrations, and I don't think any business these days can afford to settle for 'good enough' when it comes to data quality, because it impacts so much: decision making, confidence in direction, goals and outcomes.
A more up-to-date approach, I believe, is to spend less time on the ETL part of the process and put more focus on better data quality and faster access to it, without increasing the overall time spent.
So how do we minimize our focus on the ETL side?
Data sources are a good starting point here – for the most part they tend to be either legacy or modern. Legacy systems by their very nature tend to be complex, usually on-premise, with limited accessibility. Modern data sources include applications, SaaS services, modern databases, streaming data and so on. At this point I would recommend a data census as part of your data migration plan to identify how many data sources your organization has.
I think the key to success is treating legacy and modern data sources differently. They do not share the same challenges, and those differences have a big impact on your ETL process as you formulate strategies for how to deal with each type of data source.
As I touched on, legacy and modern data sources have different inherent challenges, so knowing which approach to use for each is key. Let us say you apply either a modern or a traditional approach to both modern and legacy sources; here is what the outcomes may look like, depending on what you are looking to achieve:
Migrating to a cloud data warehouse – with a modern approach this is likely to be a phased migration, whereas a traditional approach tends to be an all-at-once migration (which is time consuming).
Data modelling – with a modern approach you are likely to use your cloud data warehouse to de-normalize the data, while with a traditional approach it is likely you will normalize in a staging area and then migrate (again, very time consuming); there is a rough sketch of the modern pattern after this list.
Capturing data changes – with a modern approach you will significantly reduce overheads by capturing only the changes, whereas with a traditional approach it is likely you will copy over all the data each time (very time consuming)! A small sketch of the incremental idea also follows this list.
Managing semi-structured data sources – with a modern approach you will leverage the semi-structured data types in your modern cloud data warehouse, while with a traditional approach it is likely you will have to stringify and extract (which is very error prone); again, see the sketch after this list.
Migrating schemas – this is automatic with a modern approach, and completely manually mapped with a traditional one.
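To make the data modelling point a little more concrete, here is a minimal sketch of the modern pattern: land the source tables in the warehouse as they are and let the warehouse do the de-normalization with a simple view. The schemas, table and column names, and the build_denormalized_model function are all hypothetical, purely for illustration.

```python
# Minimal sketch: land normalized source tables as-is, then de-normalize
# inside the warehouse with a view. All names here are hypothetical.

DENORMALIZED_VIEW = """
CREATE OR REPLACE VIEW analytics.orders_wide AS
SELECT o.order_id,
       o.order_date,
       o.amount,
       c.customer_id,
       c.customer_name,
       c.region
FROM   raw.orders    o
JOIN   raw.customers c ON c.customer_id = o.customer_id;
"""

def build_denormalized_model(connection) -> None:
    # The heavy lifting (the join) happens inside the warehouse,
    # not in a staging area before the migration.
    with connection.cursor() as cur:
        cur.execute(DENORMALIZED_VIEW)
```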
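On capturing data changes, here is a rough sketch of the incremental idea, assuming the source table exposes an updated_at column you can use as a high-water mark. Real change data capture tools typically read the database log instead, but the contrast with a full copy is the same; the table and column names are again hypothetical.

```python
from datetime import datetime

def extract_changes(source_conn, last_synced: datetime):
    """Modern-style run: pull only the rows that changed since last time."""
    query = "SELECT * FROM orders WHERE updated_at > %s"
    with source_conn.cursor() as cur:
        cur.execute(query, (last_synced,))
        return cur.fetchall()

def extract_everything(source_conn):
    """Traditional-style run: copy the whole table every time."""
    with source_conn.cursor() as cur:
        cur.execute("SELECT * FROM orders")
        return cur.fetchall()
```

The saving comes from the first function only touching rows that actually changed, rather than re-copying the entire table on every run.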
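And on semi-structured data, the traditional 'stringify and extract' pattern pulls individual fields out of a JSON string up front, which silently drops anything you did not anticipate and breaks when the payload changes; the modern pattern keeps the whole document and lets the warehouse's semi-structured column type (VARIANT, JSON, SUPER and so on, depending on the platform) handle it. A hypothetical event payload, just for illustration:

```python
import json

event = '{"user": {"id": 42, "plan": "pro"}, "action": "login"}'

# Traditional: extract the fields you thought of up front and keep only those.
# Any field you did not anticipate is lost, and a payload change breaks this.
parsed = json.loads(event)
flat_row = {
    "user_id": parsed["user"]["id"],
    "action": parsed["action"],
}

# Modern: keep the whole document in a semi-structured column and let
# analysts pull fields out later with the warehouse's JSON path syntax.
raw_row = {"payload": event}
```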
I believe the most effective road to success is to spend less time on the ETL phase by using an innovative tool designed to handle your data sources effectively and properly.