What is Data Lineage? Explained in simple terms

Today, data is a crucial part of how we run our businesses and do our work. When we rely on data every day, it becomes important to know its background, especially when your business depends on that data to make important decisions.

As more organizations adopt data initiatives and become data driven, it has never been more important to understand where our data comes from. The price of not knowing the background of your data can directly hurt your business and may even cost you clients.

Let’s explore the concept of data lineage in some more depth.

Data lineage helps you trace the history of your data back to its origin. It also helps you validate the accuracy and consistency of your data, which ultimately improves data quality. And because it shows the transformations your data has undergone along the way, it helps you achieve regulatory compliance. It is like knowing where an apple came from, who harvested it, and who transported it.

All of this adds up to having more trust in your data. Data lineage tools help automate these processes by providing a record of the data throughout its lifecycle, including all of its sources, all of the transformations applied to it, and even impact analysis.
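To make the idea of "a record of the data throughout its lifecycle" concrete, here is a minimal sketch in Python. It is not how any particular lineage tool works; the class names (`LineageEvent`, `LineageLog`) and the dataset names are hypothetical, and the sketch assumes that steps are recorded in the order they ran.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    """One step in a dataset's history (hypothetical structure)."""
    step: str          # name of the transformation that ran
    inputs: list[str]  # datasets the step read
    output: str        # dataset the step produced
    at: str            # UTC timestamp of when the step was recorded


@dataclass
class LineageLog:
    """Append-only log of lineage events, kept in execution order."""
    events: list[LineageEvent] = field(default_factory=list)

    def record(self, step, inputs, output):
        stamp = datetime.now(timezone.utc).isoformat()
        self.events.append(LineageEvent(step, list(inputs), output, stamp))

    def history(self, dataset):
        """Return the steps that contributed to `dataset`, oldest first."""
        needed = {dataset}
        steps = []
        # Walk the log backwards, collecting every step that produced
        # something the target dataset depends on.
        for event in reversed(self.events):
            if event.output in needed:
                steps.append(event)
                needed.update(event.inputs)
        steps.reverse()
        return steps


log = LineageLog()
log.record("load_orders", ["orders.csv"], "raw_orders")
log.record("load_customers", ["customers.csv"], "raw_customers")
log.record("clean_orders", ["raw_orders"], "clean_orders")
log.record("monthly_report", ["clean_orders"], "report")

for event in log.history("report"):
    print(event.step, event.inputs, "->", event.output)
```

Asking for the history of `report` returns only the steps that fed into it, so the unrelated `load_customers` step is left out. Real lineage tools capture this kind of record automatically instead of relying on hand-written `record` calls.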

For data-driven organizations and AI startups, data lineage becomes a vital part of delivering trusted data to its consumers, whether that is a data scientist, an analyst, or an auditor. It is also critical for AI models, where we need to know what kind of data has been fed to them. Lineage is the basic foundation of that trust.

In the past, understanding the lineage of the food supply was a manual effort that required significant labor and was prone to human error. Nowadays, RFID tags and tracking technologies have revolutionized our food supply, ensuring higher-quality, fresher, and more consistent food from farms to our tables.

Understanding Data Lineage with an example:

Let’s take an apple as an example. Have you ever thought about the trust that we put in our food supply chain?
Can you have the same blind trust in your data that you have in your food supply chain?

Generally, we buy apples from a grocery store, where we see baskets of apples on display and select the ones we want. But how do we know that we can trust the apples we bought? Where did they come from, and where were they grown? Who picked them from the trees, and when? Which warehouse were they stored in, and for how long?

Data Lineage example

These same questions apply to data in organizations too, where we ask them as part of data lineage in order to maintain regulatory compliance.

Now let’s get back to our analogy. Apples grow on trees, and a farmer may have hundreds or even thousands of trees on the farm. The farmer waters and tends them until the apples are ready to be picked. Once it’s time to harvest, the apples are put into bushels and collected on the farm.

Those bushels are then loaded onto a truck and delivered to a warehouse, where they are sorted and selected for quality. In the final stage, they are sold to grocery stores. What I’m walking through here is the lineage of that apple, and the same goes for data.

Here, I showed you the history of the apple from its inception in the field all the way to the shopper who buys it for their favorite snack or recipe. Now let’s take it a step further.

Each apple carries a significant amount of metadata. It can tell you which distribution center the apple went to and which truck transported it, all the way back to the tree on the farm where it was grown. That’s a lot of information for one apple. But the point is that metadata is the key to its lineage: it ensures that the apple can be trusted and so helps maintain regulatory compliance.

Similarly, in the data world, having this type of information is crucial for ensuring that processes and standards comply with regulatory rules.

Now let’s say that a grocer has a complaint about the quality of the apples they have been receiving from their distributor. The distributor identifies that they come from one particular farm. The data also shows that the apples in question came from a specific set of trees on that farm. Now we are getting to the root cause of the problem.

Understanding this level of history and impact can help the farmer identify which trees are producing the bad apples and take the necessary corrective actions. It can also help the distributor understand which farms are producing good apples and stop purchasing bad apples from a specific farm or from a specific part of a farm. And it helps the distributor deliver high-quality, trusted, and delicious apples to consumers.
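The two questions in this story, "where did the bad apples come from?" and "who else received apples from those trees?", can be sketched as walks over a small lineage graph. This is only an illustration under stated assumptions: the node names (`tree_row_8`, `crate_2`, `store_101`, and so on) and the `FEEDS` mapping are made up, and a real lineage tool would build the graph from captured metadata.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the items listed as its values.
FEEDS = {
    "tree_row_7": ["bushel_A"],
    "tree_row_8": ["bushel_B"],
    "bushel_A": ["crate_1"],
    "bushel_B": ["crate_2", "crate_3"],
    "crate_1": ["store_101"],
    "crate_2": ["store_101"],
    "crate_3": ["store_102"],
}


def reachable(start, edges):
    """Breadth-first walk: every node reachable from `start`, excluding it."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Flip every edge so the graph can be walked back toward the origin."""
    inverted = {}
    for src, dsts in edges.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return inverted


def upstream(node):
    """Root-cause question: what did this item come from?"""
    return reachable(node, invert(FEEDS))


def downstream(node):
    """Impact question: what does this item end up in?"""
    return reachable(node, FEEDS)


print(sorted(upstream("crate_2")))       # origin of the crate the grocer complained about
print(sorted(downstream("tree_row_8")))  # everything affected by that row of trees
```

Walking upstream from the bad crate leads back to one bushel and one row of trees, and walking downstream from that row shows every crate and store that may be affected. In data terms, the first walk is root-cause analysis and the second is impact analysis.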

Conclusion:

Automated data lineage does the same for our data and for the AI models that data is fed into. In short, automated data lineage solutions help clients and organizations create dynamic, real-time lineage views of their data. These views show the history of the data all the way back to its origin, help validate accuracy and consistency, improve data quality, and reveal all the transformations the data has undergone so that regulatory compliance can be ensured. All of this adds up to more trust and confidence in your data.
