
Moving from ELT/ETL to real-time processing

This chapter covers the considerations involved in transforming logic and SQL scripts designed for batch processing into their real-time equivalents. A direct "lift and shift" approach, where SQL scripts are converted to Flink statements one to one, is not feasible. Refactoring is essential, because streaming SQL differs significantly from batch SQL once complexity and stateful operators, such as joins, are involved.

Most of the filtering and selection scripts can be ported one to one, while most stateful processing needs to be refactored and deeply adapted to better manage state and complexity.
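As a minimal illustration, a stateless filter and projection written for a batch engine usually ports to Flink SQL unchanged; the table and column names below (orders, large_eu_orders) are hypothetical:

-- hypothetical continuous pipeline: same filter/projection logic as a batch script
-- orders could be backed by a Kafka topic, large_eu_orders by a sink table
insert into large_eu_orders
select order_id, customer_id, amount
from orders
where amount > 100
  and region = 'EU';

By contrast, a join between two streams must keep state for both sides and typically has to be bounded, for example with an interval join or a state TTL, which is where most of the refactoring effort goes.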

There is a repository with tools to process an existing dbt project: it finds the dependencies between tables, uses a local LLM to perform some of the transformations, and creates target pipelines per sink table.

Motivation for moving to data stream processing

Data integration practice is evolving, driven by the need to act on real-time data and to reduce batch processing cost and complexity. The following table illustrates the pros and cons of data integration practices along two axes: time to insights and data integrity.

| Time to insights | Data integrity: Low | Data integrity: High |
| --- | --- | --- |
| High | Lakehouse or ELT: + self-service. - Runaway cost, - knowledge of the data is lost, - complex data governance, - data silos. | Data Stream Platform: + real-time decision making, + operations and analytics on the same platform, + single source of truth, + reduced TCO, + governance. |
| Low | Hand coding: + customized solution specific to needs. - Slow, - difficult to scale, - opaque, - challenging governance. | ETL: + rigorous, + data model design, + governed, + reliable. - Slow, - point to point, - difficult to scale. |

Assessment questions

Try to get an understanding of the data integration requirements by looking at:

  • Current data systems and which data producers already publish to a messaging system like Kafka
  • Development time needed to build new streaming logic or an ETL job
  • What the different data landing zones are and what purpose each serves; review zone ownership
  • Level of Lakehouse adoption and governance, and which technology is used (Iceberg?)
  • Is there a data loop back from the data lake to the OLTP systems?
  • Where is data cleaning done?
  • Are any micro-batching jobs currently running, at what frequency, and for which consumers?
  • What data governance is used?
  • How is data quality control done?

Time considerations

Some implementation challenges

Join considerations

A SQL LEFT JOIN returns every record from the left table, whether or not it matches a record in the right table on the specified condition. For non-matching records, the right-side columns are populated with NULL. Some SQL dialects support a LEFT ANTI JOIN, but Flink SQL does not expose this syntax. One solution in Flink SQL is to perform a LEFT JOIN and filter on NULL for the right-side join column:

select table_left.*
from table_left
left join table_right
    on table_left.column_used_for_join = table_right.column_used_for_join
where table_right.column_used_for_join is NULL;
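
Depending on the Flink version and planner, the same anti-join semantics can often be expressed with a NOT EXISTS subquery, which the optimizer can rewrite into an anti join internally. A minimal sketch, reusing the same hypothetical table and column names:

-- anti join expressed as NOT EXISTS: keep left rows with no match in the right table
select table_left.*
from table_left
where not exists (
    select 1
    from table_right
    where table_right.column_used_for_join = table_left.column_used_for_join
);

With the LEFT JOIN pattern, filter on the join column itself rather than on an arbitrary nullable column from the right table, otherwise matched rows that merely contain NULL values could be misclassified as non-matching.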