Apache Flink - Core Concepts¶

Version

Update 07/2025 - Review done with simplification and reduced redundancy.
Update - revision 11/23/25
2/2026: Refactor content as part of the new cookbook chapter

Why Flink?¶

Traditional data processing faces key challenges:

Transactional Systems: Monolithic applications with shared databases create scaling challenges.
Analytics Systems: ETL pipelines produce stale data, require massive storage, and often duplicate data across systems. An ETL job extracts data from a transactional database, transforms it into a common representation (validation, normalization, encoding, deduplication, and schema transformation), and loads the new records into the target analytical database. These jobs run periodically in batches.

Flink enables real-time stream processing with three application patterns:

Event-Driven Applications: Reactive systems using messaging
Data Pipelines: Low-latency transformation and enrichment
Real-Time Analytics: Immediate computation and action on streaming data

Flink Apps bring stateful processing to serverless. Developers write event handlers in Java (similar to serverless functions) but with annotations for state, timers, and multi-stream correlation. State is automatically partitioned, persisted, and restored. Event-time processing handles late-arriving data correctly. Exactly-once guarantees ensure critical business logic executes reliably.

Overview of Apache Flink¶

Apache Flink is a distributed stream processing engine for stateful computations over bounded and unbounded data streams. It has become an industry standard due to its performance and comprehensive feature set.

Key Features:

Event-Time Semantics: Provides consistent and accurate results, even with out-of-order events.
Exactly-Once Consistency: Ensures reliable state management to prevent duplicates and message loss.
Low Latency and High Throughput: Achieves millisecond latencies while processing millions of events per second.
Powerful APIs: Provides APIs for operations such as map, reduce, join, window, split, and connect.
Fault Tolerance and High Availability: Supports failover for TaskManager nodes and a high-availability setup for the JobManager, to avoid single points of failure.
Multilingual Support: Enables streaming logic implementation in Java, Scala, Python, and SQL.
Extensive Connectors: Integrates seamlessly with various systems, including Kafka, Cassandra, Pulsar, Elasticsearch, file systems, JDBC-compliant databases, HDFS, and S3.
Kubernetes Native: Supports containerization and deployment on Kubernetes with a dedicated Kubernetes operator to manage session, job, or application clusters as well as Job and Task Managers.
Dynamic Code Updates: Allows for application code updates and job migrations across different Flink clusters without losing application state.
Batch Processing: Also transparently supports traditional batch processing workloads, as reading a table at rest becomes a stream in Flink.

Stream Processing Concepts¶

A Flink application runs as a job - a processing pipeline structured as a directed acyclic graph (DAG) with:

Sources: Read from streams (Kafka, Kinesis, Queue, CDC etc.)
Operators: Transform, filter, enrich data
Sinks: Write results to external systems

Operations can run in parallel across partitions. Some operators (like Group By) require data reshuffling or repartitioning.

Bounded and unbounded data¶

A stream is a sequence of events that is either bounded or unbounded.

Apache Flink supports batch processing by processing all the data in one job with a bounded dataset. It is used when we need all the data to assess trends, develop AI models, and when throughput matters more than latency. Jobs are run when needed, on input that can be pre-sorted by time or by any other key.

The results are reported at the end of the job execution. Any failure requires a full restart of the job.

Hadoop was designed to do batch processing. Flink has the capability to replace Hadoop MapReduce processing.

When latency is a major requirement—for example monitoring and alerting or fraud detection—streaming is the natural choice.

Dataflow¶

In Flink 2.1.x, applications are composed of streaming dataflows. A dataflow can consume from Kafka, Kinesis, queues, and other sources. A typical high-level view of a Flink application is presented in the figure below:

A Flink application (source: Apache Flink product documentation)

Stream processing applies a set of functions to transform data and produce a new output stream. An operator in Flink is a component that performs a specific operation on the data stream. An operation can be a transformation (e.g., map, filter, reduce), an action (e.g., print, save), or a source or sink. Intermediate steps compute rolling aggregations like min, max, or mean, or collect and buffer records in a time window to compute metrics on a finite set of events.

Streaming dataflow (source: Apache Flink product documentation)

Data is partitioned for parallel processing. Each stream has multiple partitions, and each operator runs as several parallel subtasks for scalability. A task is the basic unit of execution in Flink: a piece of work that is scheduled and executed by the Flink runtime.

Each task executes a specific part of the processing logic of the job. A subtask is one parallel instance of a task: a task is split into multiple subtasks that run at the same time, each processing a portion of the data.

Distributed processing (source: Apache Flink product documentation)

Operations like GROUP BY require data reshuffling across the network, which can be costly but enables distributed aggregation.

INSERT INTO results
SELECT key, COUNT(*) FROM events
WHERE color <> 'blue'
GROUP BY key;
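
The reshuffle behind this query can be pictured in plain Java. The following is a minimal sketch and not the Flink API: the class name, the `{key, color}` array layout, and the routing by `hashCode` are assumptions made for illustration (Flink itself routes keys through key groups).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: hypothetical names, not the Flink API.
final class ReshuffleSketch {

    // Route a key to one of `parallelism` subtasks so that equal keys always meet on the same subtask.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Each event is {key, color}. Drop blue events, reshuffle by key, count per key inside each subtask.
    static List<Map<String, Long>> countPerSubtask(List<String[]> events, int parallelism) {
        List<Map<String, Long>> subtasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            subtasks.add(new HashMap<>());
        }
        for (String[] event : events) {
            String key = event[0];
            String color = event[1];
            if (!"blue".equals(color)) {                      // WHERE color <> 'blue'
                subtasks.get(subtaskFor(key, parallelism))    // reshuffle on the GROUP BY key
                        .merge(key, 1L, Long::sum);           // COUNT(*) per key
            }
        }
        return subtasks;
    }

    public static void main(String[] args) {
        List<String[]> events = List.of(
                new String[] {"a", "red"}, new String[] {"b", "blue"},
                new String[] {"a", "green"}, new String[] {"b", "red"});
        System.out.println(countPerSubtask(events, 2));
    }
}
```

Because every record with the same key lands on the same subtask, each subtask can keep its own counters without coordinating with the others; the cost is the network transfer needed to move records to the subtask that owns their key.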

Runtime architecture¶

Flink consists of a Job Manager and n Task Managers deployed on k hosts.

Client applications compile batch or streaming applications into a dataflow graph. The client submits the DAG to the JobManager. The JobManager controls the execution of one or more applications. Developers submit their application (JAR file or SQL statements) via CLI, REST, or a Kubernetes manifest. The Job Manager receives the Flink application for execution and builds a task execution graph from the defined JobGraph. It parallelizes the job and distributes slices of the DataStream processing logic the developers defined. Each parallel slice of the job is a task that runs in a task slot.

Once the job is submitted, the Job Manager schedules its tasks to task slots across the Task Managers. The Job Manager may create resources from a compute pool, or when deployed on Kubernetes, it creates pods.

The Resource Manager manages task slots and uses an underlying orchestrator such as Kubernetes or YARN.

A task slot is the unit of resource in a TaskManager in which tasks run. The Task Managers execute the actual stream processing logic. There are multiple Task Managers running in a cluster. The number of slots limits the number of tasks a TaskManager can execute. After it has been started, a TaskManager registers its slots with the ResourceManager.

The Dispatcher exposes an API to submit applications for execution. It also hosts the web user interface.

Once the job is running, the Job Manager coordinates the activities of the Flink cluster, such as checkpointing and restarting TaskManagers that may have failed.

Tasks load data from sources, perform processing, then send data among themselves for repartitioning and rebalancing, and finally push results to the sinks.

When Flink is not able to process a real-time event, it may have to buffer it until other necessary data has arrived. This buffer has to be persisted in durable storage so data is not lost if a TaskManager fails and has to be restarted. In batch mode, the job can reload the data from the beginning. In batch, the results are computed once the job is done (for example, select count(*) AS `count` from bounded_pageviews; returns one result), while in streaming mode each event may be the last one received, so results are produced incrementally after every event or after a period of time based on timers.
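
The difference between the two modes can be sketched in plain Java. This is an illustration under assumptions, not Flink code: the class and method names are hypothetical, and the streaming variant simply emits an updated count after every event.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: hypothetical names, not the Flink API.
final class CountModes {

    // Batch: the input is bounded, so a single result is produced once all records are read.
    static long batchCount(List<String> boundedInput) {
        long count = 0;
        for (String ignored : boundedInput) {
            count++;
        }
        return count;
    }

    // Streaming: any event may be the last one, so an updated result is emitted after each event.
    static List<Long> streamingCounts(Iterable<String> events) {
        List<Long> emitted = new ArrayList<>();
        long count = 0;
        for (String ignored : events) {
            count++;
            emitted.add(count);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> pageviews = List.of("v1", "v2", "v3");
        System.out.println(batchCount(pageviews));
        System.out.println(streamingCounts(pageviews));
    }
}
```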

Parameters

taskmanager.numberOfTaskSlots: 2
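
As a rough capacity check, the slot count can be related to job parallelism. The sketch below is plain Java with hypothetical names, and it assumes Flink's default slot sharing, under which a job needs as many slots as its highest operator parallelism.

```java
// Plain-Java illustration only: hypothetical helper, not part of Flink.
final class SlotMath {

    // Slots offered by the cluster: every TaskManager contributes taskmanager.numberOfTaskSlots.
    static int totalSlots(int taskManagers, int slotsPerTaskManager) {
        return taskManagers * slotsPerTaskManager;
    }

    // Assumption: default slot sharing, so the job needs max(operator parallelism) slots.
    static int requiredSlots(int... operatorParallelism) {
        int max = 0;
        for (int p : operatorParallelism) {
            max = Math.max(max, p);
        }
        return max;
    }

    static boolean fits(int taskManagers, int slotsPerTaskManager, int... operatorParallelism) {
        return requiredSlots(operatorParallelism) <= totalSlots(taskManagers, slotsPerTaskManager);
    }

    public static void main(String[] args) {
        // Three TaskManagers with 2 slots each; source=2, aggregation=4, sink=1.
        System.out.println(fits(3, 2, 2, 4, 1));
    }
}
```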

Only one Job Manager is active at a given point in time, and there may be n Task Managers. Without a high-availability setup, the Job Manager is a single point of failure, but it starts quickly and can use checkpoint data to resume processing.

There are different deployment models:

Session mode: Deploy on a long-running cluster where multiple jobs share the same Flink cluster (one JobManager coordinates several jobs).
Per-job mode: spins up a cluster per job submission. This provides better resource isolation. This mode was deprecated in Flink 1.15 and removed in Flink 2.0 in favor of application mode.
Application mode: creates a cluster per application with the main() method executed on the JobManager. It can include multiple jobs, but they run inside the app. This saves CPU cycles and reduces the bandwidth needed to ship dependencies to workers.

Flink can run on common resource managers such as Hadoop YARN or Kubernetes (Mesos support was removed in Flink 1.14). For development, you can use Docker images to deploy a session or job cluster.

See also the deployment to Kubernetes chapter for how to use the Flink Kubernetes operator to deploy and monitor Flink applications.

State Management¶

In Flink, state consists of information that an operator remembers about past events, which is used to influence the processing of future events.

Core Concept of State¶

Stateful operations are required for many common use cases, such as:

- Windowing: Aggregating events over time (e.g., sum of sales per minute).
- Pattern Detection: Tracking a sequence of events to find specific patterns.
- Machine Learning: Updating model parameters based on a stream of data.
- Analytics: Maintaining counters or profiles for unique users.

We can distinguish different types of operations:

- Stateless Operations process each event independently without retaining information:
  - Basic operations: INSERT, SELECT, WHERE, FROM
  - Scalar/table functions, projections, filters
- Stateful Operations maintain state across events for complex processing:
  - JOIN operations (except CROSS JOIN UNNEST)
  - GROUP BY aggregations (windowed/non-windowed)
  - OVER aggregations and MATCH_RECOGNIZE patterns

Types of State¶

Flink distinguishes between two main categories of state:

Managed State: Handled by the Flink runtime. Flink manages the storage, rescaling, and fault tolerance of this state.
Raw State: Handled by the user in their own data structures. It is generally not recommended as Flink cannot automatically manage it during rescaling.

Within Managed State, there are several sub-types:

Keyed State: Tied to a specific key (e.g., a user ID). It is partitioned across the cluster so that each key's state is handled by exactly one parallel task. Flink maintains one state instance per key value and Flink partitions all records with the same key to the operator task that maintains the state for this key. The key-value map is sharded across all parallel tasks:

Keyed state

Each task maintains its state locally to improve latency. The configured state backend determines where the state lives: the heap-based backend keeps it on the JVM heap, which suits small state, while RocksDB keeps it on local disk, which suits larger state. The state backend also takes care of checkpointing the state of a task to remote, persistent storage.
Operator State: Tied to a parallel operator instance rather than a key. It is often used for source/sink connectors (e.g., tracking Kafka offsets).
Broadcast State: A special type of operator state where the state is duplicated across all parallel tasks of an operator.
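
The idea of keyed state can be sketched in plain Java as a map from key to value, where the function only ever touches the entry of the key it is currently processing. This is loosely analogous to a per-key `ValueState<Long>`; the class and method names are hypothetical and nothing here is Flink API.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration only: one state entry per key, as keyed state provides per parallel task.
final class KeyedCounter {

    // In Flink this map would be managed, checkpointed, and sharded by the runtime.
    private final Map<String, Long> statePerKey = new HashMap<>();

    // Process one event for `key`: read the key's state, update it, and return the new value.
    long process(String key) {
        return statePerKey.merge(key, 1L, Long::sum);
    }

    public static void main(String[] args) {
        KeyedCounter counter = new KeyedCounter();
        counter.process("user-1");
        counter.process("user-2");
        System.out.println(counter.process("user-1"));
    }
}
```

Since all records with the same key are routed to the same parallel task, each task only needs the slice of the map for the keys it owns.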

Flink makes state fault tolerant by combining stream replay and checkpointing.

All data maintained by a task and used to compute the results of a function belongs to the state of the task. Functions may use <K, V> pairs to store values and may implement CheckpointedFunction to make local state fault tolerant.

While processing the data, the task can read and update its state, and it computes its results from both its input data and its state.

Flink state management can handle very large state, and no state is lost in case of failure.

Within a DAG, each operator needs to register its state. Operator state is scoped to an operator task: all records processed by the same parallel task have access to the same state.

State can grow over time. Local state persistence improves latency while remote checkpointing ensures fault tolerance.

Flink and Kafka integration with state management

Flink ensures fault tolerance through checkpoints and savepoints that persistently store application state.

Fault Tolerance and Consistency¶

Checkpoints: Flink periodically takes distributed snapshots of the state and stores them in durable storage.
Exactly-once Semantics: By combining checkpoints with replayable data sources, Flink guarantees that each event affects the state exactly once, even in the event of a failure.
Savepoints: These are manually triggered snapshots used for operational tasks like application upgrades, A/B testing, or migrating to a different cluster.
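
Why checkpoints plus a replayable source give exactly-once state updates can be shown with a small simulation. This is a plain-Java sketch under assumptions: the names are hypothetical, the "source" is an array that can be re-read from any offset, and a snapshot stores the source offset together with the operator state.

```java
// Plain-Java illustration only: simulates checkpoint, failure, restore, and replay.
final class CheckpointReplay {

    // A snapshot captures the source position and the operator state at the same point.
    static final class Snapshot {
        final int offset;
        final long sum;

        Snapshot(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    // Sums the source, snapshotting every `checkpointEvery` records and crashing once at `failAt`.
    static long runWithFailure(int[] source, int checkpointEvery, int failAt) {
        int offset = 0;
        long sum = 0;
        Snapshot last = new Snapshot(0, 0);
        boolean failed = false;
        while (offset < source.length) {
            if (!failed && offset == failAt) {
                // Simulated crash: restore the state and rewind the source to the last snapshot.
                failed = true;
                offset = last.offset;
                sum = last.sum;
                continue;
            }
            sum += source[offset];
            offset++;
            if (offset % checkpointEvery == 0) {
                last = new Snapshot(offset, sum);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] source = {1, 2, 3, 4, 5};
        System.out.println(runWithFailure(source, 2, 3));  // with a failure
        System.out.println(runWithFailure(source, 2, -1)); // without a failure
    }
}
```

Records processed after the last snapshot are processed again after the restore, but their earlier effect on the state was discarded with the rollback, so each record is reflected in the final state once.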

See deeper explanations in the cookbook chapter.

State Backends¶

State backends determine how the state is physically stored. Options typically include:

HashMap State Backend: Stores state as Java objects on the JVM heap (very fast, but limited by available memory). Reusing heap-backed state across incompatible changes is unsafe.
Embedded RocksDB: Stores state in an embedded database on local disk. RocksDB is a key-value store based on Log-Structured Merge-Trees (LSM trees). Flink organizes state into "Key Groups", and each RocksDB instance on a TaskManager handles a specific set of these groups. State is serialized to bytes and snapshots are taken asynchronously. Each key and each value is limited to 2^31 bytes. It supports incremental checkpoints, which is important for maintaining performance as state grows into the terabytes.
- The process: When an operator updates state, it writes to the RocksDB MemTable and a Write-Ahead Log (WAL). Once the MemTable is full, it is flushed to disk as an immutable SST file.
ForSt (Disaggregated): The preferred option in Flink 2.x for cloud-native scaling and fast recovery. It uses a tree-structured key-value store and can keep state in object storage or another distributed file system (DFS), allowing Flink to scale state beyond the local disk capacity of a TaskManager.
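
The key-group idea can be sketched in plain Java: a key maps to a fixed key group, and contiguous ranges of key groups map to parallel subtasks, so rescaling moves whole groups instead of individual keys. The names below are hypothetical, and the formulas are a simplification stated as an assumption: Flink additionally applies a murmur hash to `hashCode()` before the modulo.

```java
// Plain-Java illustration only: simplified key-group assignment, not the Flink implementation.
final class KeyGroups {

    // A key always maps to the same key group, independent of the current parallelism.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Contiguous ranges of key groups are assigned to each parallel subtask.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int group = keyGroupFor("user-42", maxParallelism);
        System.out.println("key group " + group
                + ", subtask at parallelism 4: " + subtaskFor(group, maxParallelism, 4)
                + ", subtask at parallelism 8: " + subtaskFor(group, maxParallelism, 8));
    }
}
```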

Windowing¶

Windows group stream events into finite buckets for processing. Flink provides window table-valued functions (TVFs): Tumbling, Hop, Cumulate, and Session.

Tumbling windows¶

A tumbling window assigns events to non-overlapping buckets of fixed size. Records are assigned to the window based on an event-time attribute field, specified by the DESCRIPTOR() function. Once the window boundary is crossed, all events within that window are sent to an evaluation function for processing.

Count-based tumbling windows define how many events are collected before triggering evaluation.

Time-based tumbling windows define a time interval (for example, n seconds) during which events are collected. The amount of data within a window can vary depending on the incoming event rate.

.keyBy(...).window(TumblingProcessingTimeWindows.of(Duration.ofSeconds(2)))
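
How a timestamp maps to its tumbling window comes down to rounding down to a multiple of the window size. The following plain-Java sketch uses hypothetical names and assumes windows aligned to the epoch with no offset.

```java
// Plain-Java illustration only: tumbling window bounds for a timestamp, aligned to the epoch.
final class TumblingAssigner {

    // Start of the window (inclusive) that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // End of the window (exclusive).
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        long size = 2_000L; // 2-second windows
        System.out.println(windowStart(4_500L, size) + " .. " + windowEnd(4_500L, size));
    }
}
```

Every timestamp falls into exactly one window, which is what makes tumbling windows non-overlapping.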

In SQL:

-- computes the sum of the price in the orders table within 10-minute tumbling windows
SELECT window_start, window_end, SUM(price) as `sum`
FROM TABLE(
    TUMBLE(TABLE `examples`.`marketplace`.`orders`, DESCRIPTOR($rowtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;

See example TumblingWindowOnSale.java in the my-flink folder. To test it, do the following:

# Start SaleDataServer: it listens on socket 9181, reads avg.txt, and sends each line to the socket
java -cp target/my-flink-1.0.0-SNAPSHOT.jar jbcodeforce.sale.SaleDataServer
# Inside the JobManager container, start the job with:
flink run -d -c jbcodeforce.windows.TumblingWindowOnSale /home/my-flink/target/my-flink-1.0.0-SNAPSHOT.jar
# The job creates the data/profitPerMonthWindowed.txt file with accumulated sales and number of records in a 2-second tumbling window
(June,Bat,Category5,154,6)
(August,PC,Category5,74,2)
(July,Television,Category1,50,1)
(June,Tablet,Category2,142,5)
(July,Steamer,Category5,123,6)
...

Sliding windows¶

Sliding windows allow overlapping periods, meaning an event can belong to multiple buckets. This is particularly useful for capturing trends over time. The slide interval defines how often a new window starts and the window size defines its duration. For example, the following snippet defines a 2-second window that starts every 1 second:
```
.keyBy(...).window(SlidingProcessingTimeWindows.of(Duration.ofSeconds(2), Duration.ofSeconds(1)))
```
As a result, each event that arrives during this period will be included in multiple overlapping windows, enabling more granular analysis of the data stream.
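
The set of windows an event belongs to can be computed by walking back from the latest window start that is not after the event. This is a plain-Java sketch with hypothetical names, assuming epoch-aligned windows with no offset; it is meant to illustrate the overlap, not to reproduce Flink's assigner.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: which sliding windows contain a given timestamp.
final class SlidingAssigner {

    // Returns {start, end} pairs (end exclusive), most recent window first.
    static List<long[]> windowsFor(long timestampMs, long sizeMs, long slideMs) {
        List<long[]> windows = new ArrayList<>();
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            windows.add(new long[] {start, start + sizeMs});
        }
        return windows;
    }

    public static void main(String[] args) {
        // 2-second windows sliding every second: the event at 2500 ms belongs to two windows.
        for (long[] w : windowsFor(2_500L, 2_000L, 1_000L)) {
            System.out.println(w[0] + " .. " + w[1]);
        }
    }
}
```

With a window size of 2 seconds and a slide of 1 second, each event is assigned to size / slide = 2 windows.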

Session window¶

A session window begins when the data stream processes records and ends after a defined period of inactivity. The inactivity threshold is set using a timer, which determines how long to wait before closing the window.

.keyBy(...).window(ProcessingTimeSessionWindows.withGap(Duration.ofSeconds(5)))

The operator first creates one window for each data element received, then merges the windows that fall within the gap of each other. When 5 seconds pass without new events, the session window closes. This makes session windows particularly useful for grouping events by user activity or interaction sessions in intermittent data streams.
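
The one-window-per-element approach followed by merging can be sketched in plain Java. The names are hypothetical, the input is assumed to be sorted by timestamp, and the treatment of an event arriving exactly at the end of the gap (here it starts a new session) is a choice of this sketch rather than a statement about Flink.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: session windows built by merging per-element windows.
final class SessionMerger {

    // Each element opens a window [ts, ts + gap); overlapping windows merge into one session.
    static List<long[]> sessions(long[] sortedTimestampsMs, long gapMs) {
        List<long[]> result = new ArrayList<>();
        for (long ts : sortedTimestampsMs) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts < last[1]) {
                last[1] = ts + gapMs;                        // still active: extend the session
            } else {
                result.add(new long[] {ts, ts + gapMs});     // gap elapsed: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] events = {0L, 2_000L, 4_000L, 12_000L, 13_000L};
        for (long[] s : sessions(events, 5_000L)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```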

Global¶

A global window assigns all elements with the same key to a single window. It never closes on its own, so it requires an explicit trigger.

See Windowing TVF documentation.

.keyBy(value -> value.f0)
.window(GlobalWindows.create())
.trigger(CountTrigger.of(5))

See Windowing Table-Valued Functions details in Confluent documentation.

Trigger¶

A Trigger in Flink determines when a window is ready to be processed.

Each window assigner has a default trigger. For example, a 2-second tumbling window fires by default when time (event time or processing time, depending on the assigner) passes the end of the window, while a global window never fires by default and requires an explicit trigger definition.

You can implement custom triggers by creating a class that extends the abstract Trigger class, which includes methods such as onElement(..), onEventTime(..), and onProcessingTime(..).

Flink provides several default triggers:

EventTimeTrigger fires based on the progress of event time, as measured by the watermark.
ProcessingTimeTrigger fires based on the progress of processing time.
CountTrigger fires when the number of elements in a window reaches a specified count.
PurgingTrigger wraps another trigger and purges the window content each time that trigger fires, which keeps window state from growing.
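
The behavior of a count-based trigger combined with purging can be sketched in plain Java. This is not the Flink Trigger API: the class is hypothetical and returns the window content when it fires, or an empty list otherwise.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration only: fire when the count is reached, then purge the buffer.
final class CountTriggerSketch {

    private final int maxCount;
    private final List<Integer> buffer = new ArrayList<>();

    CountTriggerSketch(int maxCount) {
        this.maxCount = maxCount;
    }

    // Called for every element; returns the fired window content, or an empty list if not fired.
    List<Integer> onElement(int value) {
        buffer.add(value);
        if (buffer.size() >= maxCount) {
            List<Integer> fired = new ArrayList<>(buffer);
            buffer.clear();                  // purge: drop the content after firing
            return fired;
        }
        return Collections.emptyList();      // keep buffering
    }

    public static void main(String[] args) {
        CountTriggerSketch trigger = new CountTriggerSketch(3);
        for (int i = 1; i <= 7; i++) {
            System.out.println(i + " -> " + trigger.onElement(i));
        }
    }
}
```

Without the purge step the buffer would keep every element ever seen, which is the state growth that purging is meant to avoid.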

Eviction¶

An Evictor removes elements from a window after the trigger fires, before and/or after the window function is applied. The specific logic for removing elements is application-specific and can be tailored to meet the needs of your use case.

The predefined evictors:

CountEvictor keeps up to a specified number of elements in the window and discards the older ones from the beginning of the window buffer.
DeltaEvictor uses a DeltaFunction and a threshold: it computes the delta between the last element in the window buffer and each of the other elements, and removes those whose delta is greater than or equal to the threshold.
TimeEvictor removes elements based on time: it takes an interval and keeps only the elements whose timestamps fall within that interval of the most recent timestamp in the window.
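
The count-based and time-based eviction rules can be sketched in plain Java. The names are hypothetical and nothing here is the Flink Evictor API; in particular, whether an element exactly on the time cutoff is kept is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: count-based and time-based eviction over a window buffer.
final class EvictorSketch {

    // Count-based: keep at most `maxCount` elements, dropping from the beginning of the buffer.
    static List<Integer> keepLast(List<Integer> elements, int maxCount) {
        int from = Math.max(0, elements.size() - maxCount);
        return new ArrayList<>(elements.subList(from, elements.size()));
    }

    // Time-based: keep elements whose timestamp is within `intervalMs` of the newest timestamp.
    static List<Long> keepRecent(List<Long> timestampsMs, long intervalMs) {
        long maxTs = Long.MIN_VALUE;
        for (long ts : timestampsMs) {
            maxTs = Math.max(maxTs, ts);
        }
        List<Long> kept = new ArrayList<>();
        for (long ts : timestampsMs) {
            if (ts >= maxTs - intervalMs) {
                kept.add(ts);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepLast(List.of(1, 2, 3, 4, 5), 2));
        System.out.println(keepRecent(List.of(1_000L, 4_000L, 9_000L, 10_000L), 3_000L));
    }
}
```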

Event time¶

Time is a central concept in stream processing and can have different interpretations based on the context of the flow or environment:

Processing Time refers to the system time of the machine executing the task. It offers the best performance and lowest latency since it relies on the local clock. But it may lead to non-deterministic results due to factors like ingestion delays, parallel execution, clock skew, and backpressure.
Event Time is the timestamp embedded in the record at the event source level. Using event-time ensures consistent and deterministic results, regardless of the order in which events are processed. This is crucial for accurately reflecting the actual timing of events.
Ingestion Time denotes the time when an event enters the Flink system. It captures the latency introduced during the event's journey into the processing framework.

In any time window, the order of arrival may not match event time, and some events with an older timestamp may fall outside the time window boundaries. To address this challenge, particularly when computing aggregates, it's essential to ensure that all relevant events have arrived within the intended time frame.

The watermark serves as a heuristic for this purpose.

Watermarks¶

Watermarks are special markers periodically injected by the Flink source operator to indicate event-time progress in each stream. They track how time advances and help handle out-of-order records. Each watermark carries a timestamp, expressed in milliseconds since the epoch. This is the core mechanism for triggering computation based on event time rather than processing time.

Watermarks are only used by time-based operators, such as time windows, temporal joins, or pattern matching. For example, they determine when a window can safely close by estimating when all events for a time period have arrived.

Operations which are NOT based on time (e.g. simple JOIN, UNION ALL, filtering by WHERE conditions) do not use Watermarks. Watermarks are also not used in batch mode/snapshot queries.
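
A common way to compute a watermark is to track the highest timestamp seen and subtract a tolerated out-of-orderness. The plain-Java sketch below uses hypothetical names and follows that bounded out-of-orderness idea; it is an illustration, not the Flink WatermarkStrategy API.

```java
// Plain-Java illustration only: bounded out-of-orderness watermark and window closing.
final class WatermarkSketch {

    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen;

    WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Start low enough that the first watermark is Long.MIN_VALUE without underflow.
        this.maxTimestampSeen = Long.MIN_VALUE + maxOutOfOrdernessMs + 1;
    }

    // Out-of-order events never move the watermark backwards.
    void onEvent(long timestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, timestampMs);
    }

    // Claim: no event with a timestamp at or below this value is expected anymore.
    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs - 1;
    }

    // A window [start, end) can close once the watermark reaches its last timestamp, end - 1.
    boolean canClose(long windowEndExclusiveMs) {
        return currentWatermark() >= windowEndExclusiveMs - 1;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(2_000L);
        wm.onEvent(10_000L);
        System.out.println(wm.currentWatermark() + " closes [6000, 8000)? " + wm.canClose(8_000L));
    }
}
```

A larger out-of-orderness bound tolerates later events at the price of results that are emitted later, since windows wait longer before closing.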

Confluent Cloud for Flink

Confluent Cloud for Apache Flink provides a default watermark strategy based on the $rowtime column for all tables, whether created automatically from a Kafka topic or with a CREATE TABLE statement. The watermark is computed per Kafka partition and requires a minimum of 250 records per partition. Developers may change the watermark of an existing table with ALTER TABLE, or declare it in CREATE TABLE on a TIMESTAMP(3) column.

Key Concepts¶

Watermarks are generated in the data stream at regular intervals, by the source operator or immediately after it. Each parallel subtask of the source typically generates its watermarks independently, based on the events it processes. This is especially important for partitioned sources like Kafka, where each source subtask might read from one or more partitions and emit a watermark for each partition.
Watermark generation logic is defined using a WatermarkStrategy. This strategy is typically applied directly