Evolution of Data Pipelines
Published on: Tuesday, 03-05-2022
This article takes a look at a few emerging architectures, such as Data Mesh, and the associated areas one needs to consider.

In the past, when data had to be updated, operators manually entered it into a data table. This led to manual entry errors and time lag. Since the work was mostly done in batches, typically as a daily job, there was substantial lead time from the moment an event occurred to the moment it was reported. Decision makers had to live with this lag and often made decisions on stale data.
Fast forward to the present, and real-time updates and insights are commonplace requirements. Data pipelines were essentially built with the intent of moving data from one layer (transactional or event sources) to data warehouses or lakes, where insights were derived.
The question is: with these growing requirements for real-time insights and other quality attributes, are we still efficient when using traditional architectures or the popular ETL approaches? Let's find out!
Current state of data pipeline architectures and challenges
Data pipelines are important to any Product Digitisation programme. Over the latter half of the last decade we witnessed an immense focus on digital architectures and the technologies adopted to build them; the continued strong growth in the adoption of microservices and containerisation establishes this fact. However, these technology advancements have largely been applied to the traditional 'OLTP' side, i.e., core services and business logic.
The story is a bit different when one inspects the patterns on the data pipeline, or 'OLAP', side of things. Here we observe limited adoption of the technology evolution seen in the core services space. Most data pipelines are still built using traditional ETL or ELTL architectures, the de facto industry approaches. Though these do solve the larger problem at hand, i.e., deriving actionable insights, they come with certain limitations. Let's explore some of these challenges:
Siloed Teams: The ETL process requires specialised skills in data extraction and migration. This often means the technical team is layered or structured around the technical nuances of the process. For example, an ETL engineer is often oblivious to the insights being derived and how they are consumed by end users.
Limited Manifestation: The implementation team ends up trying to fit every desired use case into the same fixed structure or pattern. Though this is not always a problem or a wrong thing to do, at times it can be inefficient. For example, how does one extract from an unstructured source and model the intermediate persistence schema?
Latency: The time taken to extract, transform and load the data often introduces lags. This lag can be attributed to the fact that data is processed in batches, or to the intermediate load steps needed to persist interim results. In some business scenarios this is not acceptable. For example, data streams emanating from an IoT service are stored and batch processed at a later scheduled time, introducing a lag between data generation and updated insights on dashboards; the sketch below contrasts this with per-event processing.
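To make the contrast concrete, here is a minimal, illustrative sketch of per-event processing, where a dashboard metric is updated as each IoT reading arrives instead of after the next batch run. It uses the kafka-python client; the topic name, broker address, and payload fields are all assumptions, not part of any specific pipeline described in this article.

```python
# Minimal per-event processing sketch (hypothetical topic, broker, and payload fields).
from kafka import KafkaConsumer  # pip install kafka-python
import json

consumer = KafkaConsumer(
    "iot-sensor-readings",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",          # hypothetical broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

count, running_avg = 0, 0.0
for message in consumer:                         # processes each event as it arrives
    reading = message.value                      # e.g. {"device_id": "d1", "temperature": 21.4}
    count += 1
    running_avg += (reading["temperature"] - running_avg) / count
    # The metric is fresh immediately, rather than after a nightly batch job completes.
    print(f"events={count} avg_temperature={running_avg:.2f}")
```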
Future state of data pipeline architecture and key considerations
Just as general software architecture has advanced with microservices, Service Mesh, and so on, data pipelines need a similar modernisation. One key emerging approach is to distribute the data pipeline across domains instead of building a single centralised pipeline; each domain contributes its own pipelines, and together these form a Data Mesh. Data Mesh aims to address the challenges above by adopting a different approach:
1. Teams or pods aligned to functional feature delivery,
2. Treating data as a product (discoverable, self-contained and secure), and
3. Polyglot storage and communication facilitated via the Mesh.
(An initial read on Data Mesh can be found at: https://www.ltts.com/blog/understanding-data-mesh)
Data Mesh can be implemented in various ways. One effective pattern is to use an event-driven approach, with event storming used to identify the Data Products. A domain can comprise one or more Data Products. This also means data may be redundant and persisted in more than one store, which is referred to as polyglot storage. Finally, these Data Products are consumed via Mesh APIs designed around each domain's requirements.
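To make the "data as a product" idea concrete, below is a minimal, illustrative Python sketch. All names, fields, and the registry structure are assumptions rather than a prescribed implementation: each domain-owned product carries its own metadata (discoverable), serves its data through a well-defined output port (self-contained), and enforces an entitlement check (secure); the mesh is then simply a catalogue of such products.

```python
# Illustrative "data as a product" sketch; all names are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataProduct:
    domain: str                              # owning domain, e.g. "orders"
    name: str                                # product name, e.g. "daily-order-facts"
    schema: Dict[str, str]                   # published contract: column -> type
    read_port: Callable[[], List[dict]]      # output port returning records
    allowed_consumers: List[str] = field(default_factory=list)

    def describe(self) -> dict:
        """Discoverability: expose metadata so other teams can find and understand the product."""
        return {"domain": self.domain, "name": self.name, "schema": self.schema}

    def read(self, consumer: str) -> List[dict]:
        """Secure, self-contained access through the product's own output port."""
        if consumer not in self.allowed_consumers:
            raise PermissionError(f"{consumer} is not entitled to {self.name}")
        return self.read_port()


# The "mesh" is, at its simplest, a registry of products keyed by domain and name.
catalog: Dict[str, DataProduct] = {}

orders_product = DataProduct(
    domain="orders",
    name="daily-order-facts",
    schema={"order_id": "string", "amount": "decimal", "order_date": "date"},
    read_port=lambda: [{"order_id": "A1", "amount": 120.0, "order_date": "2022-05-01"}],
    allowed_consumers=["analytics"],
)
catalog[f"{orders_product.domain}.{orders_product.name}"] = orders_product

print(catalog["orders.daily-order-facts"].describe())
print(catalog["orders.daily-order-facts"].read(consumer="analytics"))
```

In a real mesh, the read port would typically front a query engine or an API endpoint rather than an in-memory list, but the contract, ownership, and access boundary stay with the domain team.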
Other architectural styles include Data Lake, Data Hub and Data Virtualisation. A brief comparison of these can be found at the link mentioned above.
Some other considerations that one should evaluate:
a. Facilitate easy data access at any time using standard interfaces like SQL. Technologies such as Snowflake, dbt, and Materialize enable such real-time joins, which not only enables BI but also helps with the low-level plumbing of the pipeline;
b. Design data pipelines to be robust and fault tolerant, e.g., checkpoint intermediate results, where required, for further analysis;
c. Leverage distributed, loosely coupled processing units that scale and allow polyglot technologies, e.g., Spark jobs or Python models (a sketch of points b and c follows this list);
d. Use Data Virtualisation to mitigate bottlenecks, e.g., to shorten the lead time for data availability; and
e. Use DataOps effectively to track and evaluate your data pipeline's performance.
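As a concrete illustration of points (b) and (c), here is a minimal PySpark sketch in which a cleansing step checkpoints its intermediate result to a staging location, and a separate, loosely coupled aggregation step resumes from that checkpoint rather than re-reading the raw source. The bucket paths, column names, and zone layout are hypothetical assumptions for the sake of the example.

```python
# Minimal fault-tolerant pipeline sketch in PySpark; paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-pipeline").getOrCreate()

# Processing unit 1: cleanse raw events and checkpoint the intermediate result.
raw = spark.read.json("s3://raw-zone/orders/")               # hypothetical raw source
cleaned = (
    raw.filter(F.col("amount") > 0)
       .withColumn("order_date", F.to_date("order_timestamp"))
)
cleaned.write.mode("overwrite").parquet("s3://staging-zone/orders_cleaned/")

# Processing unit 2 (loosely coupled): resume from the checkpoint, not the raw source.
daily = (
    spark.read.parquet("s3://staging-zone/orders_cleaned/")
         .groupBy("order_date")
         .agg(F.sum("amount").alias("total_amount"))
)
daily.write.mode("overwrite").parquet("s3://curated-zone/daily_order_totals/")
```

If the aggregation step fails, it can be rerun from the staged parquet data alone, and the staged result also remains available for further analysis, as suggested in point (b).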
Conclusion
Finally, I would like to conclude with a disclaimer: this article does not set out to discard the current architectures associated with ETL. In fact, for certain use cases, such as batch jobs, ETL is still a very good option to adopt. The intent is rather to recognise that requirements vary, and to explore further architectures that could suit those needs well. In this article, we looked at a few such architectures, like Data Mesh, and the associated areas one needs to consider.
Article Courtesy: NASSCOM Community – an open knowledge sharing platform for the Indian technology industry: https://community.nasscom.in/communities/data-science-ai-community/evolution-data-pipelines
(This article is credited to L&T Technology Services)