Moving from ELT/ETL to real-time processing¶
This chapter evaluates the considerations for transforming logic and SQL scripts designed for batch processing into scripts suitable for real-time processing. A direct "lift and shift" approach, where SQL scripts are converted to Flink statements on a one-to-one basis, is not feasible. Refactoring is essential, because batch and streaming SQL differ significantly as soon as complexity and stateful operators, such as joins, come into play.
Most filtering and selection scripts can be ported one to one, while most stateful processing needs to be refactored and deeply adapted to better manage state and complexity.
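As a minimal illustration of the one-to-one case, a stateless filter and projection written for a batch engine is typically valid Flink SQL as-is; only the table declaration (for example a Flink `CREATE TABLE` backed by a Kafka connector) changes, not the query. The `orders` table and its columns below are hypothetical, used only for illustration.

```sql
-- Stateless filter/projection: the same statement works in batch SQL
-- and in Flink SQL once `orders` is declared as a (streaming) table.
-- The `orders` table and its columns are hypothetical examples.
SELECT
    order_id,
    customer_id,
    total_amount
FROM orders
WHERE order_status = 'COMPLETED'
  AND total_amount > 100;
```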
There is a repository with tools to process an existing dbt project, find the dependencies between tables, use a local LLM to perform some of the transformations, and create target pipelines per sink table.
Motivation for moving to data stream processing¶
Data integration practices are evolving with the need to act on real-time data and to reduce batch processing cost and complexity. The following table illustrates the pros and cons of data integration practices along two axes: time to insights and data integrity.
| Time to insights | Data integrity: Low | Data integrity: High |
|---|---|---|
| **High** | **Lakehouse or ELT**: + self-service. - Runaway cost, - knowledge of the data is lost, - complex data governance, - data silos. | **Data Stream Platform**: + real-time decision making, + operations and analytics on the same platform, + single source of truth, + reduced TCO, + governance. |
| **Low** | **Hand coding**: + customized solution specific to needs. - Slow, - difficult to scale, - opaque, - challenging governance. | **ETL**: + rigorous, + data model design, + governed, + reliable. - Slow, - point to point, - difficult to scale. |
Assessment questions¶
Try to get an understanding of the data integration requirements by looking at:
- Current data systems and the data producers feeding a messaging system like Kafka
- Development time needed to build new streaming logic or an ETL job
- What are the different data landing zones and what is their purpose? Review zone ownership.
- Level of lakehouse adoption and governance, and which technology is used (Iceberg?)
- Is there a data loop back from the data lake to the OLTP systems?
- Where is data cleaning done?
- Are there any micro-batching jobs currently running, at what frequency, and for which consumers?
- What data governance is used?
- How is data quality control done?
Time considerations¶
Some implementation challenges¶
Join considerations¶
A SQL LEFT JOIN returns the records that match the specified condition as well as the records from the left table that do not match. For non-matching records, the columns from the right table are populated with NULL. Some SQL dialects support LEFT ANTI JOIN, but Flink SQL does not, so one solution in Flink SQL is to apply an IS NULL filter on the right-side join column after the LEFT JOIN:
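Below is a minimal sketch of this pattern, assuming hypothetical `orders` and `shipments` tables joined on `order_id`: the LEFT JOIN keeps every order, and the IS NULL filter on the right-side key retains only the orders with no matching shipment, which reproduces the anti-join semantics.

```sql
-- Emulate a LEFT ANTI JOIN in Flink SQL with LEFT JOIN + IS NULL filter.
-- `orders` and `shipments` are hypothetical tables used for illustration.
SELECT
    o.order_id,
    o.customer_id
FROM orders AS o
LEFT JOIN shipments AS s
    ON o.order_id = s.order_id
WHERE s.order_id IS NULL;   -- keep only orders without a shipment
```

Note that in streaming mode this is a regular, stateful join: both sides of the join accumulate state, so a state retention policy (such as a state TTL) is generally needed to keep the operator state bounded.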