Confluent Platform for Flink

The official product documentation, as of the 07/2025 release, is available here.

The main features are:

  • Fully compatible with open-source Apache Flink.
  • Deploy on Kubernetes using Helm (see the install sketch after this list).
  • Define environments, which logically group Flink applications to provide access isolation and configuration sharing.
  • Deploy applications with a web user interface and a Task Manager cluster.
  • Expose a custom Kubernetes operator for Confluent-specific CRDs.
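
As a minimal sketch of the Helm-based install, assuming the Confluent Helm repository and the chart names shown here (verify the exact chart names and versions against the official documentation):

    # Add the Confluent Helm repository (repo URL and chart names are assumptions)
    helm repo add confluentinc https://packages.confluent.io/helm
    helm repo update

    # Install the Flink Kubernetes Operator, then Confluent Manager for Apache Flink (CMF)
    kubectl create namespace flink
    helm upgrade --install cp-flink-kubernetes-operator confluentinc/flink-kubernetes-operator --namespace flink
    helm upgrade --install cmf confluentinc/confluent-manager-for-apache-flink --namespace flink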

The figure below presents the Confluent Flink components deployed on Kubernetes:

Confluent Platform for Flink
  • CFK (Confluent for Kubernetes) supports the management of the custom CRDs, built on top of the Flink Kubernetes Operator (CFK 3.0.0 supports CP 8.0.0).
  • CMF (Confluent Manager for Apache Flink) adds security controls and a REST API server used by the confluent CLI or any HTTP client.
  • FKO is the open-source Apache Flink Kubernetes Operator.
  • Flink clusters are created from CLI commands and CRDs, and they run Flink applications within an environment.
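
To check what actually landed on the cluster, plain kubectl is enough (the flink namespace is an assumption; use whichever namespace the operators were installed in):

    # List the Flink-related CRDs installed by FKO/CFK and confirm the operator pods are running
    kubectl get crds | grep -i flink
    kubectl get pods -n flink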

Be sure to have the confluent CLI installed.

  • Environment: provides access-control isolation and configuration sharing.
  • Compute pool: represents the resources used to run Task Managers and Job Managers. Each Flink SQL statement is associated with exactly one compute pool. See the example pool definition in the cmf folder and the sketch after this list.
  • Catalog: groups databases for Flink SQL table queries. It references a Schema Registry instance and one or more Kafka clusters.
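
A hypothetical sequence with the confluent CLI talking to the CMF REST API; the flag names, the URL, and the pool.json content are assumptions, so check confluent flink --help and the example in the cmf folder:

    # Point the CLI at the CMF endpoint (here assumed port-forwarded to localhost:8080)
    confluent flink environment create dev-env --url http://localhost:8080 --kubernetes-namespace flink

    # Create a compute pool from a spec file; each Flink SQL statement references exactly one pool
    confluent flink compute-pool create pool.json --environment dev-env --url http://localhost:8080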

Set up

Important source of information for deployment

Metadata management service for RBAC

SQL specific

Understanding Sizing

The observed core performance rule is that Flink can process roughly 10,000 records per second per CPU core; for example, a stream of 100,000 records per second needs about 10 cores at this baseline. The baseline may decrease with larger messages, bigger state, key skew, or a high number of distinct keys.

Several characteristics should be assessed before sizing:

Statement Complexity Impact

Statement complexity reflects the usage of complex operators like:

  • Joins between multiple streams
  • Windowed aggregations
  • Complex analytics functions

More complex operations require additional CPU and memory resources.

Architecture Assumptions

  • Source & Sink latency: Assumed to be minimal
  • Key size: Assumed to be small (few bytes)
  • Minimum cluster size: 3 nodes required
  • CPU limit: Maximum 8 CPU cores per Flink node

State Management & Checkpointing

  • Working State: Kept locally in RocksDB
  • Backup: State backed up to distributed storage
  • Recovery: Checkpoint size affects recovery time
  • Frequency: Checkpoint interval determines state capture frequency
  • Size: Aggregate state size consumes backup bandwidth and impacts recovery latency.

Resource Guidelines

Typical Flink node configuration includes (see the sketch below for where these values go):

  • 4 CPU cores with 16 GB of memory
  • Expected to process 20-50 MB/s of data
  • Jobs with significant state benefit from additional memory

Scaling Strategy: Scale vertically before horizontally
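
A minimal sketch of where those values go, using the open-source FlinkDeployment CRD from FKO (with CMF the equivalent fields appear in the FlinkApplication and compute pool specs); the name, image tag, and checkpoint location are illustrative assumptions:

    kubectl apply -n flink -f - <<'EOF'
    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: sizing-example                            # hypothetical name
    spec:
      image: flink:1.19
      flinkVersion: v1_19
      serviceAccount: flink
      flinkConfiguration:
        taskmanager.numberOfTaskSlots: "4"
        state.backend: "rocksdb"                      # working state kept locally in RocksDB
        state.checkpoints.dir: "s3://<bucket>/ckpt"   # backup to distributed storage (placeholder)
        execution.checkpointing.interval: "30s"       # see the latency section below
      jobManager:
        resource:
          cpu: 1
          memory: "2g"
      taskManager:
        resource:
          cpu: 4                                      # 4 cores / 16 GB per the guideline above
          memory: "16g"
    EOF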

Latency Impact on Resources

Lower latency requirements significantly increase resource needs:

  • Sub-500ms latency: +50% CPU, frequent checkpoints (10s), boosted parallelism
  • Sub-1s latency: +20% CPU, extra buffering memory, 30s checkpoints
  • Sub-5s latency: +10% CPU, moderate buffering, 60s checkpoints
  • Relaxed latency (>5s): Standard resource allocation

Stricter latency requires more frequent checkpoints for faster recovery, additional memory for buffering, and extra CPU for low-latency optimizations.

Estimation Heuristics

The estimator increases task managers until there are sufficient resources for:

  • Expected throughput requirements
  • Latency-appropriate checkpoint intervals
  • Acceptable recovery times
  • Handling data skew and key distribution
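
A back-of-the-envelope sketch of that loop, combining the ~10,000 records/s/core baseline with the latency overheads above; the throughput figure and overhead are illustrative assumptions, not the estimator's actual algorithm:

    # Rough sizing: cores from throughput, inflated by the latency overhead, capped at 8 cores per node
    TARGET_RPS=200000          # expected peak throughput in records per second (assumption)
    BASELINE_PER_CORE=10000    # observed baseline: records per second per CPU core
    LATENCY_OVERHEAD_PCT=20    # e.g. +20% CPU for sub-1s latency (see the latency list above)
    MAX_CORES_PER_NODE=8

    CORES=$(( (TARGET_RPS + BASELINE_PER_CORE - 1) / BASELINE_PER_CORE ))
    CORES=$(( CORES * (100 + LATENCY_OVERHEAD_PCT) / 100 ))
    NODES=$(( (CORES + MAX_CORES_PER_NODE - 1) / MAX_CORES_PER_NODE ))
    echo "~${CORES} CPU cores over at least ${NODES} task-manager nodes (minimum cluster size is 3)"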

Important Notes

  • These are estimates - always test with your actual workload
  • Start with conservative estimates and adjust based on testing
  • Monitor your cluster performance and tune as needed
  • Consider your specific data patterns and business requirements

See this estimator project for a tool to help estimate cluster sizing.

Monitoring

Troubleshooting

Submitted query pending

When submitting a SQL query from the Flink SQL shell, it may stay pending: the Job Manager pod is created and running, but the Task Manager pod remains pending. Assess what is going on with kubectl describe pod <pod_id> -n flink, which reports an error message such as "Insufficient cpu". Increasing the compute pool CPU makes things worse, because the compute pool spec defines the default Task Manager and Job Manager settings used to create those pods, so the larger request is even harder to schedule.
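
A few plain kubectl checks to confirm it is a scheduling/capacity problem (the flink namespace follows the examples above):

    # Find the pending Task Manager pod and read the scheduler events ("Insufficient cpu", ...)
    kubectl get pods -n flink
    kubectl describe pod <pod_id> -n flink

    # Compare against what the nodes can still allocate
    kubectl describe nodes | grep -A 8 "Allocated resources"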