Confluent Platform for Flink¶
The official product documentation, as of the 07/2025 release, is here.
The main features are:
- Fully compatible with open-source Flink.
- Deploys on Kubernetes using Helm.
- Defines environments, which logically group Flink applications to provide access isolation and configuration sharing.
- Deploys applications with a user interface and a task manager cluster.
- Exposes a custom Kubernetes operator for specific CRDs.
The figure below presents the Confluent Flink components deployed on Kubernetes:

- CFK (Confluent for Kubernetes) supports the management of custom CRDs, built on top of the Flink Kubernetes Operator (CFK 3.0.0 supports CP 8.0.0).
- CMF (Confluent Manager for Apache Flink) adds security controls and a REST API server for the CLI or any HTTP client.
- FKO is the open-source Apache Flink Kubernetes Operator.
- Flink clusters are created from CLI commands and CRDs, and run Flink applications within an environment.
Be sure to have the confluent CLI installed.
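A minimal setup sketch is shown below; the chart names, service name, and CLI flags are assumptions based on the usual Helm workflow, so verify them against the product documentation:

```sh
# Add the Confluent Helm repository (assumed URL) and install the two pieces.
helm repo add confluentinc https://packages.confluent.io/helm
helm repo update

# Flink Kubernetes Operator (FKO), packaged by Confluent (assumed chart name)
helm upgrade --install cp-flink-kubernetes-operator \
  confluentinc/flink-kubernetes-operator --namespace flink --create-namespace

# Confluent Manager for Apache Flink (CMF) (assumed chart name)
helm upgrade --install cmf \
  confluentinc/confluent-manager-for-apache-flink --namespace flink

# Expose the CMF REST API locally and verify the CLI can reach it
# (service name and port are assumptions)
kubectl port-forward svc/cmf-service 8080:80 -n flink &
confluent flink environment list --url http://localhost:8080
```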
Specific concepts added on top of Flink¶
- Environment: provides access-control isolation and configuration sharing.
- Compute pool: represents the resources used to run task managers and job managers. Each Flink SQL statement is associated with exactly one compute pool. See the example pool definition in the cmf folder, and the sketch after this list.
- Catalog: groups databases for Flink SQL table queries. It references a Schema Registry instance and one or more Kafka clusters.
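As a sketch of how these concepts fit together (the command shapes and the ComputePool field names below are assumptions for illustration; use the pool definition in the cmf folder as the reference):

```sh
# Create an environment bound to a Kubernetes namespace (illustrative flags).
confluent flink environment create dev-env \
  --kubernetes-namespace flink --url http://localhost:8080

# A hypothetical compute pool definition; field names are assumptions.
cat > pool.json <<'EOF'
{
  "apiVersion": "cmf.confluent.io/v1",
  "kind": "ComputePool",
  "metadata": { "name": "dev-pool" },
  "spec": {
    "type": "DEDICATED",
    "clusterSpec": {
      "flinkVersion": "v1_19",
      "jobManager":  { "resource": { "cpu": 1, "memory": "1024m" } },
      "taskManager": { "resource": { "cpu": 1, "memory": "1024m" } }
    }
  }
}
EOF
confluent flink compute-pool create pool.json \
  --environment dev-env --url http://localhost:8080
```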
Set up¶
- See my dedicated chapter for Confluent Platform Kubernetes deployment.
- See the makefile to deploy CMF, and the product documentation.
Important sources of information for deployment¶
- Confluent Platform deployment using the Kubernetes operator
- Deployment overview for Apache Flink.
- CP Flink supports K8s HA only.
- Flink fine-grained resource management documentation.
- CP v8 announcement: builds on Kafka 4.0; next-gen Confluent Control Center (integrating with the OpenTelemetry protocol (OTLP) and Confluent Manager for Apache Flink® (CMF)); KRaft-native (supports significantly larger clusters with millions of partitions); client-side field-level encryption. CP for Flink supports SQL, and Queues for Kafka is now in Early Access.
Metadata management service for RBAC¶
- Metadata Service Overview
- Single broker Kafka+MDS Deployment
- Git repo with working CP on K8s deployment using RBAC via SASL/Plain and LDAP
- Configure CP Flink to use MDS for Auth
- Additional Flink RBAC Docs
- How to secure a Flink job with RBAC
- Best Practices for K8s + RBAC
SQL specific¶
Understanding Sizing¶
The observed core performance rule is that Flink can process ~10,000 records per second per CPU core. This baseline may decrease with larger messages, bigger state, key skew, or a high number of distinct keys.
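For example, under this rule 50 MB/s of 1 KB records is roughly 50,000 records per second, which maps to a baseline of about 5 CPU cores before any complexity or latency adjustments.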
There are several characteristics to assess before sizing:
Statement Complexity Impact¶
Statement complexity reflects the usage of complex operators like:
- Joins between multiple streams
- Windowed aggregations
- Complex analytics functions
More complex operations require additional CPU and memory resources.
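For instance, a statement like the following sketch (table and column names are hypothetical) combines a stream-to-stream interval join with a windowed aggregation; both operators hold state, so it costs considerably more than a simple projection or filter:

```sql
-- Hypothetical streams: interval join of orders with shipments,
-- followed by a one-minute tumbling-window aggregation.
SELECT
  o.product_id,
  TUMBLE_START(o.order_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS orders_shipped
FROM orders o
JOIN shipments s
  ON o.order_id = s.order_id
  AND s.ship_time BETWEEN o.order_time AND o.order_time + INTERVAL '1' HOUR
GROUP BY o.product_id, TUMBLE(o.order_time, INTERVAL '1' MINUTE);
```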
Architecture Assumptions¶
- Source & Sink latency: Assumed to be minimal
- Key size: Assumed to be small (few bytes)
- Minimum cluster size: 3 nodes required
- CPU limit: Maximum 8 CPU cores per Flink node
State Management & Checkpointing¶
- Working State: Kept locally in RocksDB
- Backup: State backed up to distributed storage
- Recovery: Checkpoint size affects recovery time
- Frequency: Checkpoint interval determines state capture frequency
- State size: aggregate state size consumes bandwidth and impacts recovery latency
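As a sketch, the corresponding standard Apache Flink settings look like this (the checkpoint directory and values are illustrative):

```yaml
# Illustrative flink-conf.yaml fragment; tune values to your workload.
state.backend.type: rocksdb                        # working state kept locally in RocksDB
state.checkpoints.dir: s3://my-bucket/checkpoints  # hypothetical distributed storage location
execution.checkpointing.interval: 30s              # determines state capture frequency
execution.checkpointing.min-pause: 10s             # breathing room between checkpoints
```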
Resource Guidelines¶
Typical Flink node configuration includes:
- 4 CPU cores with 16 GB of memory
- Such a node should process 20-50 MB/s of data
- Jobs with significant state benefit from more memory
Scaling Strategy: Scale vertically before horizontally
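In FKO terms, that typical node shape maps onto the resource section of a FlinkDeployment; a minimal fragment (name and versions are hypothetical) could look like:

```yaml
# Sketch of a FlinkDeployment fragment matching the 4 CPU / 16 GB guideline.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-flink-app            # hypothetical name
spec:
  flinkVersion: v1_19
  jobManager:
    resource:
      cpu: 1
      memory: "2048m"
  taskManager:
    resource:
      cpu: 4                    # 4 cores, within the 8-core-per-node limit
      memory: "16384m"          # 16 GB
  serviceAccount: flink
```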
Latency Impact on Resources¶
Lower latency requirements significantly increase resource needs:
- Sub-500ms latency: +50% CPU, frequent checkpoints (10s), boosted parallelism
- Sub-1s latency: +20% CPU, extra buffering memory, 30s checkpoints
- Sub-5s latency: +10% CPU, moderate buffering, 60s checkpoints
- Relaxed latency (>5s): Standard resource allocation

Stricter latency requires more frequent checkpoints for faster recovery, additional memory for buffering, and extra CPU for low-latency optimizations.
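As an illustrative combination of these rules: 100,000 records/s needs a baseline of about 10 cores under the 10,000 records/s/core rule; a sub-500ms latency target adds 50%, bringing it to roughly 15 cores, i.e. about 4 of the typical 4-core nodes described above, before accounting for statement complexity or skew.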
Estimation Heuristics¶
The estimator increases task managers until there are sufficient resources for:
- Expected throughput requirements
- Latency-appropriate checkpoint intervals
- Acceptable recovery times
- Handling data skew and key distribution
Important Notes
- These are estimates - always test with your actual workload
- Start with conservative estimates and adjust based on testing
- Monitor your cluster performance and tune as needed
- Consider your specific data patterns and business requirements
See this estimator project for a tool to help estimate cluster sizing.
Monitoring¶
Troubleshooting¶
Submitted query pending¶
When submitting a SQL query from the Flink SQL shell, it may end up pending: the job manager pod is created and running, but the task manager pod stays pending. Assess what is going on with kubectl describe pod <pod_id> -n flink, which gives an error message like insufficient cpu. Increasing the compute pool CPU makes things worse, since the compute pool spec for the task manager and job manager defines the default settings used to create those pods, so larger requests are even harder to schedule.
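A minimal triage sketch (pod names are illustrative):

```sh
# Find the pending task manager pod and read its scheduling events.
kubectl get pods -n flink
kubectl describe pod <tm_pod_id> -n flink        # look for "Insufficient cpu"
kubectl get events -n flink --sort-by=.lastTimestamp | tail

# Compare the pod's CPU requests against what the nodes can actually offer.
kubectl describe nodes | grep -A 7 "Allocated resources"
```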