Flink Project Demonstrations¶

This git repository includes multiple projects to illustrate shifting from batch processing to data stream processing using Flink, and how to manage data as a product.

The first project is a classical customer 360 assessment. The different components are in the customer_360 folder. It starts from one Spark project to illustrate how to implement a customer 360 analytic data product using Apache Spark batch processing, then supports the same semantic with a data streaming processing using Apache Flink. This examples illustrates how to use Agentic solution to automate most of the migration using the shit_left utilities
The project online_store illustrates starting from a white page to process online store activities for customer behavior and real-time inventory. This is a fictuous demonstration to build using a set of incremental labs.

Customer 360 introduction¶

The high level architecture for batch processing is presented in the figure below, and use a simplified version of the star schema of the Kimball methodology, to build dimensional models:

To shift left by moving the batch-based Spark processing to a real-time processing with Kafka and Flink project, the architecture looks like:

The project also demonstrates an automatic migration from Spark to Flink using the shift_left tool and agentic AI.

A data as a product design¶

Any data as a product design starts by assessing the domain, the data ownership, the data sources, and consumer of the analytic data products,... We recommend to read this chapter to review a methodology to build data as a product.

The specific customer 360 use case is a multi-channel retailer (bricks-and-mortar stores, e-commerce, mobile app) migrating their analytics platform to an Analytics Lakehouse with real-time processing.

The project identifies the following key business domains and assigning the following ownership:

Domain	Data Owner/Team	Key Data Sources
Customer	Customer Experience/CRM Team	CRM system, loyalty program, support tickets, app usage logs
Sales	Finance/Sales Operations Team	Point of Sale (POS) system, e-commerce transactions, regional sales ledgers
Product	Merchandising Team	Product catalog, inventory system, supplier data
Logistics	Supply Chain Team	Warehouse management, shipping manifests, tracking data

The Customer Domain Team is responsible for building a core data product named, the Customer 360 Data Product.

Metadata	Description
Product Name:	customer.analytics.C360
Purpose:	To provide a comprehensive, high-quality, and up-to-date view of a customer for analytics, BI, and ML initiatives across all other domains.
Data Storage:	Uses the Lakehouse's object storage in an open format (e.g. Iceberg tables) for ACID-compliant, performant data.
Ingestion:	Real-time are managed by the Customer Domain Team to ingest and transform raw data from source systems into the Lakehouse.

Data Product Output:¶

The final curated data is exposed via well-defined, easily consumable interfaces:

SQL Endpoint: A materialized view or table on the Lakehouse that other domains can query directly using SQL.
API Service: A REST API for low-latency, record-by-record lookups (e.g., for a personalized recommendation service).
File Export: Secure, versioned file exports for large-scale ML model training.

Data Product Characteristics:¶

Discoverable: Registered in a central Data Catalog (a self-service component of the Mesh).
Addressable: Has a unique identifier and clear documentation (customer.analytics.C360).
Trustworthy: Includes built-in data quality checks (DQ) and validation rules enforced by the Customer Team.
Self-Describing: Contains rich, up-to-date metadata (schema, lineage, ownership, DQ metrics).
Secure & Governed: Access is controlled using federated governance rules (e.g., fine-grained, tag-based access control on the Lakehouse).

Cross-Domain Consumption for Analytics¶

Other domains may consume the data as a product to achieve their analytical goals:

Consuming Domain	Analytical Goal	Data Product Consumed
Marketing	Predict churn for a targeted email campaign.	customer.analytics.Customer360Profile (to get demographics, loyalty score, purchase history).
Product	Analyze which customer segments are buying a new line of shoes.	customer.analytics.Customer360Profile joined with the Sales Domain's sales.transaction.AggregatedDailySales product.
Finance	Calculate the Customer Lifetime Value (CLV)	Queries the customer.analytics.Customer360Profile via the SQL endpoint.

Other projects¶

TBC

flink_data_products includes a set of small data as a product examples implemented using Flink SQL. They are used to demonstrate pipeline management with shift_left tool and simplest use cases.
- the saleops is a simple example of Kimball structure for a star schema about revenu computation in the context of sales of products within different channels (See the readme for details).
Ksql project: a set of files to be used for testing automatic migration from ksql.

To Be continued....