Let’s start from the very basics.
Data generally flows through an organization in four steps:
- Data collection & storage: First, we collect and ingest data from different sources such as web traffic, web forms, surveys, or media consumption. Note that at this stage all the data is stored in its raw format.
- Data preparation: Next, we prepare the data. This includes cleaning it (for example, finding missing or duplicate values) and converting it into a more organized form, such as unifying it into a single format for a specific use.
- Exploration & data visualization: Once the data is cleaned and organized, it can be put to use. We can explore it, visualize it in the form of charts or graphs, and build dashboards to track changes or compare two or more sets of data.
- Experiment & prediction: Finally, when we have a good grasp of our data, we are ready to run experiments, such as evaluating which article title gets the most hits, or to build predictive models, such as a weather forecasting application or a model to predict stock prices.
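To make the four steps concrete, here is a minimal sketch in Python. The sample records, the field names (`title`, `visitor`) and the function names are all made up for illustration; a real pipeline would read from actual sources and use proper storage.

```python
from collections import Counter

# Step 1: collection & storage. Raw records are kept exactly as received.
# These sample rows are hypothetical.
raw_events = [
    {"title": "Title A", "visitor": "u1"},
    {"title": "title a ", "visitor": "u2"},   # same title, different formatting
    {"title": "Title B", "visitor": "u3"},
    {"title": "Title B", "visitor": "u3"},    # exact duplicate
    {"title": None, "visitor": "u4"},         # missing value
]


def prepare(events):
    """Step 2: drop rows with a missing title, unify the format, remove duplicates."""
    seen = set()
    cleaned = []
    for event in events:
        if event["title"] is None:
            continue
        row = (event["title"].strip().lower(), event["visitor"])
        if row not in seen:
            seen.add(row)
            cleaned.append({"title": row[0], "visitor": row[1]})
    return cleaned


def explore(events):
    """Step 3: count hits per title, the kind of number a chart would show."""
    return Counter(event["title"] for event in events)


def best_title(hits):
    """Step 4: a very simple experiment readout, the title with the most hits."""
    return hits.most_common(1)[0][0]


cleaned = prepare(raw_events)
hits = explore(cleaned)
print(hits)
print(best_title(hits))
```

Each function maps to one step, so the output of one step is the input of the next.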

Data engineers are generally responsible for the first step of this process: ingesting the collected data and storing it. They carry a great responsibility, as they lay the groundwork for data analysts, data scientists and machine learning engineers.
At the ground level, if the data is scattered around, corrupted or difficult to access, there is not much to prepare, explore or experiment with. That is exactly why you need a data engineer. Their job is to deliver the correct data, in the right format, to the right people as efficiently as possible. They ingest data from different sources, optimize databases for analysis and deal with data corruption.
Data engineers develop, construct, test and maintain architectures such as databases and large-scale processing systems that handle massive amounts of data.
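As a rough illustration of "ingesting data from different sources" and delivering it "in the right format", the sketch below reads a CSV source and a JSON source and maps both to one common record shape. The inputs, the field names and the functions are assumptions for the example, not a prescribed design.

```python
import csv
import io
import json

# Hypothetical raw inputs from two sources with different shapes.
csv_source = "user_id,signup_date\n1,2024-01-05\n2,2024-02-10\n"
json_source = '[{"id": 3, "signedUp": "2024-03-15"}]'


def from_csv(text):
    """Map CSV rows to the common {user_id, signup_date} shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"user_id": int(row["user_id"]), "signup_date": row["signup_date"]}
        for row in reader
    ]


def from_json(text):
    """Map JSON objects, which use different key names, to the same shape."""
    return [
        {"user_id": int(item["id"]), "signup_date": item["signedUp"]}
        for item in json.loads(text)
    ]


def ingest(csv_text, json_text):
    """Combine both sources into one uniformly formatted list of records."""
    return from_csv(csv_text) + from_json(json_text)


records = ingest(csv_source, json_source)
for record in records:
    print(record)
```

Whoever uses `records` afterwards no longer needs to know which source a row came from.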
Data Engineers and Big Data:
With the advent of big data, the demand for data engineers has increased as well. Big data can be defined as data so large that you have to think about how to deal with its size. As data grows, it becomes difficult to manage with traditional data management systems, and this is why data engineers are in such demand today. The graph below helps make sense of big data.

In order of volume, big data is mainly composed of sensor and device data, social media data, enterprise data and VoIP data (voice communications, multimedia sessions). Big data is commonly described by five V’s, given below:
- Volume (how much?): the quantity of data points.
- Variety (what kind?): the type and nature of the data, like text, photos, videos, audio etc.
- Velocity (how frequent?): how fast the data is generated and processed.
- Veracity (how accurate?): how trustworthy the data sources are.
- Value (how useful?): how actionable the data is and how it can be used in real-world scenarios/problems.
Data engineers have to take all of the above V’s into consideration during the data collection and storage process.
Data engineers and Data scientists:
If you’re a beginner in data engineering, you have probably heard about data scientists as well, and many people confuse the two roles. To avoid the confusion and assumptions that come with buzzwords, let’s clarify how data engineers and data scientists differ.
As you now know, data engineers mainly focus on the first part of the workflow: data collection and storage. Their role is to ingest and store the data so that it is easily accessible and ready to be analyzed. Data scientists, by contrast, work on the rest of the workflow: they prepare the data according to their analysis needs, explore it, build insightful visualizations, and then run experiments and build predictive models.
Data engineers lay the groundwork that makes data science possible. Now let’s see how data engineers enable data scientists.
Data engineers:
- Ingest and store data
- Set up databases
- Build data pipelines
- Need strong software skills
Data scientists:
- Exploit data
- Access databases
- Use pipeline outputs
- Need strong analytical skills
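The division of work above can be sketched with Python’s built-in `sqlite3` module: one function plays the engineer who sets up the database and loads data, the other plays the scientist who only queries it. The table name `page_views`, its columns and the sample rows are hypothetical.

```python
import sqlite3


def build_store(rows):
    """Engineer side: set up a database and load the ingested rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_views (page TEXT NOT NULL, views INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO page_views (page, views) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def views_per_page(conn):
    """Scientist side: access the database and analyze what is stored."""
    cursor = conn.execute(
        "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
    )
    return cursor.fetchall()


sample_rows = [("home", 10), ("pricing", 4), ("home", 5)]
conn = build_store(sample_rows)
totals = views_per_page(conn)
print(totals)
conn.close()
```

The scientist’s function never creates tables or loads data; it relies entirely on what the engineer’s step has already put in place.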
Conclusion:
In short, data engineering plays a fundamental role in managing and using large volumes of data. It covers data collection, data storage, pipeline automation, data integration, security & compliance and more.