Flink Getting Started Guide¶
Updates
- Created 2018
- Updated 02/14/2025: improved the note on Kubernetes deployment, added a simple demo reference, review done.
- 03/30/25: consolidated the notes and updated the referenced links.
- 09/25: review and simplifications.
There are four different approaches to deploy and run Apache Flink / Confluent Flink:
- Local Binary Installation
- Docker-based local deployment
- Kubernetes deployment (Colima, AKS, EKS, GKE, Kind, or minikube)
- Confluent Cloud Managed Service
Pre-requisites¶
Before getting started, ensure you have the following dependencies, as needed for your chosen deployment:
- Java 11 or higher (OpenJDK) for developing Java-based solutions
- Docker Engine and Docker CLI (for Docker and Kubernetes deployments)
- kubectl and Helm (for Kubernetes deployment)
- Confluent Cloud account (for Confluent Cloud deployment, and Flink SQL development)
- Confluent CLI, for both Confluent Cloud and Confluent Platform
- A git clone of this repository, to access the code, deployment manifests, and scripts
1. Open Source Apache Flink Local Binary Installation¶
This approach is ideal for development and testing on a single machine.
Installation Steps¶
- Download and extract the Flink binary. See the release page to download the latest version.
- Download and extract the Kafka binary. See download versions.
- Set environment variables.
- Start the Flink cluster.
- Access the Web UI at http://localhost:8081
- Submit a job (Java application).
- Download the needed SQL connector for Kafka.
- Stop the cluster.
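Assuming a Bash shell and Flink 1.20.x (the version, Scala suffix, and paths below are illustrative, not prescriptive), the steps above can be sketched as:

```shell
# Download and extract Flink (version is an assumption; check the release page)
FLINK_VERSION=1.20.0
curl -LO "https://archive.apache.org/dist/flink/flink-${FLINK_VERSION}/flink-${FLINK_VERSION}-bin-scala_2.12.tgz"
tar -xzf "flink-${FLINK_VERSION}-bin-scala_2.12.tgz"

# Set environment variables
export FLINK_HOME="$PWD/flink-${FLINK_VERSION}"
export PATH="$FLINK_HOME/bin:$PATH"

# Start the local cluster, then open http://localhost:8081
"$FLINK_HOME/bin/start-cluster.sh"

# Submit the bundled WordCount example as a first job
"$FLINK_HOME/bin/flink" run "$FLINK_HOME/examples/streaming/WordCount.jar"

# Stop the cluster when done
"$FLINK_HOME/bin/stop-cluster.sh"
```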
Running a simple SQL application¶
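With a local cluster running and `FLINK_HOME` set, a minimal SQL smoke test can be sketched as follows (the query is only an illustration):

```shell
# Start the interactive Flink SQL client against the local cluster
"$FLINK_HOME/bin/sql-client.sh"

# Inside the client, a minimal statement to verify the setup:
#   SELECT 'hello flink' AS greeting;
#
# For Kafka-backed tables, first copy the flink-sql-connector-kafka jar
# into $FLINK_HOME/lib and restart the cluster.
```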
Troubleshooting¶
- If port 8081 is already in use, modify `$FLINK_HOME/conf/flink-conf.yaml`
- Check logs in the `$FLINK_HOME/log` directory
- Ensure Java 11+ is installed and JAVA_HOME is set correctly
2. Docker-based Deployment¶
This approach provides containerized deployment using Docker Compose.
Quick Start¶
- Build a custom Apache Flink image with your own connectors. Verify the current Docker image tag, then use the Dockerfile.
- Start the Flink session cluster.
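As a sketch, using the Dockerfile and compose file paths mentioned in this section (the image tag is an assumption, so verify it locally):

```shell
# Build the custom Flink image with the connector jars baked in
docker build -t custom-flink:latest deployment/custom-flink-image/

# Start the Flink session cluster defined in the compose file
docker compose -f deployment/docker/flink-oss-docker-compose.yaml up -d

# Tear it down when finished
docker compose -f deployment/docker/flink-oss-docker-compose.yaml down
```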
Docker Compose with Kafka¶
To run Flink with Kafka:
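A minimal sketch, assuming the repository ships a compose file that bundles Kafka brokers next to the Flink services (the file name here is hypothetical; check the deployment/docker folder):

```shell
# File name is an assumption; use the compose file that includes Kafka
docker compose -f deployment/docker/flink-kafka-docker-compose.yaml up -d
```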
Customization¶
- Modify `deployment/custom-flink-image/Dockerfile` to add required connectors
- Update `deployment/docker/flink-oss-docker-compose.yaml` for configuration changes
During development, we can use Docker Compose to start a simple Flink session cluster, or a standalone job manager that executes a single job with the application jar mounted inside the Docker image. The same environment also supports SQL-based Flink apps.
Since the Task Manager executes the job, the container running the Flink code must have access to the jars needed to connect to external sources such as Kafka, or to tools like Flink Faker. Therefore, deployment/custom-flink-image contains a Dockerfile that fetches the needed jars to build a custom Flink image usable by the Task Manager and the SQL client. Always update the jar versions when moving to a new Flink version.
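A minimal sketch of such a Dockerfile, assuming the Kafka SQL connector is the only extra jar needed (the base image tag and connector version below are illustrative and must be aligned with the target Flink version):

```dockerfile
# Base image tag is an assumption; align it with the target Flink version
FROM flink:1.20-scala_2.12

# Fetch the Kafka SQL connector into Flink's lib folder so both the
# Task Manager and the SQL client can use it (version is illustrative)
RUN wget -q -P /opt/flink/lib \
    https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.3.0-1.20/flink-sql-connector-kafka-3.3.0-1.20.jar
```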
See Docker Hub and Maven Central for the image and connector links.
3. Kubernetes Deployment¶
This approach provides scalable, production-ready deployment using Kubernetes.
Apache Flink¶
See the K8S deployment deeper dive chapter and the lab readme for Apache Flink OSS details.
Confluent Platform Manager for Flink on Kubernetes¶
See the Kubernetes deployment chapter for detailed instructions, along with the deployment folder and its readme. See also the Confluent operator documentation on submitting Flink SQL statements with Confluent Manager for Apache Flink.
4. Confluent Cloud Deployment¶
This section presents an SRE-facing checklist for onboarding a new data engineer so they can access Flink workspaces and deploy Flink SQL statements in Confluent Cloud. Most of the infrastructure as code for Confluent resources, service accounts, and role bindings can be done with Terraform.
The typical tasks a Data Engineer should then be able to perform are:
- Engineer can log into Confluent Cloud, open the Flink workspace, and see the correct catalogs/databases/tables based on RBAC.
- Engineer can run a simple SELECT query against an allowed table and see results.
- Engineer can deploy one background statement (e.g., `INSERT INTO ... SELECT ...`) using the appropriate service account / compute pool, and monitoring shows it as healthy and consuming data.
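As an illustration, such a background statement could look like the following (the table and column names are hypothetical):

```sql
-- Hypothetical tables: continuously enrich orders into a target table
INSERT INTO enriched_orders
SELECT o.order_id, o.amount, c.customer_name
FROM orders AS o
JOIN customers AS c
  ON o.customer_id = c.customer_id;
```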
The following is a detailed checklist to go through with the Data Engineer, covering the full scope of what to consider before deploying Flink statements in production:
1. Organization, Environments, and Regions¶
- Which Confluent Cloud organization will the engineer use (ID / name)?
- Which environments (dev, test, prod) should they have access to?
- For each environment, which cloud provider and region(s) are in scope for Flink (must match Kafka regions)?
2. Identity, RBAC, and Service Accounts¶
- What is the IdP/SSO setup and which SSO groups should map to Flink roles?
- Should the engineer be granted FlinkDeveloper only, or also FlinkAdmin in any environments? See this summary for access control.
- Which Kafka cluster / topic-level RBAC roles do they need to read/write tables from Flink?
- Will long-running Flink SQL statements use a service account? If yes:
- Which service account(s) should be created or reused?
- Who grants the Assigner role binding so the engineer can run statements with that service account?
3. Networking and Connectivity¶
- Will the engineer access Flink via public networking (allowed IPs) or private networking?
- If private networking:
- Are we using a Confluent Cloud Network (CCN), PrivateLink Attachment (PLATT), or both for the relevant environments/regions?
- Are Flink endpoints reachable from the engineer’s network path (CCN / PLATT / peering)?
- Are any corporate firewalls / proxies changes required (domains, ports) to reach:
- Confluent Cloud UI (Flink workspace)
- Flink SQL Shell / drivers (e.g., confluent-sql)
- Metrics / observability endpoints (Metrics API, Prometheus, Datadog, etc.)?
4. Kafka & Schema Registry Prerequisites¶
- Which Kafka clusters and topics will the engineer read/write from Flink?
- Are the necessary topics and schemas already created?
- Is Schema Registry enabled and reachable in the same regions, and are SR API / RBAC permissions configured for this engineer (or service account)?
- Are there any naming conventions (topics, tables, catalogs/databases) the engineer must follow?
5. Flink Compute Pools and Capacity¶
- Are we using the default compute pool or explicit compute pools for this environment/region?
- For explicit compute pools:
- Which compute pool(s) should this engineer target (name, region, environment)?
- What are the max CFU limits per pool and any per-statement limits we want to enforce?
- Do we have any workload isolation rules (e.g., separate pools for dev vs prod, or per team/use case)?
- Who is responsible for monitoring and adjusting CFU capacity as workloads grow?
6. Tooling, Automation, and CI/CD¶
- Which tools are approved/standard for managing Flink SQL:
  - Confluent Cloud console (Flink workspace)
  - Confluent CLI
  - Terraform Confluent provider
  - dbt-confluent (dbt adapter)
  - shift_left utils to manage Flink projects at scale with best-practices enforcement
- Where should Flink SQL artifacts live (git repo, dbt project, etc.) and what is the review process?
- How is promotion handled (dev → test → prod) for Flink SQL statements?
7. Observability, Monitoring, and Alerts¶
- Which observability stack is in use for Flink - outside of Confluent Cloud?
- Are standard dashboards already available for:
- Compute pools (CFU usage, backpressure, errors)
- Flink statement health and lag (pending records / messages behind)?
- What alerting rules should be configured for this engineer’s workloads (e.g., lag thresholds, failure/restart loops, CFU saturation)?
- How will the engineer access logs / error details for their statements (UI, API, log integrations)?
8. Guardrails, Data Governance, and Safety¶
- Are there RBAC guardrails to prevent:
- DDL in prod (CREATE/DROP/ALTER) by individuals
- Writes to sensitive topics/tables
- Cross-environment/table joins that violate governance?
- What is the approved approach for handling late data, DLQs, and error-handling.mode on tables (if already standardized)?
9. Incident Management and Support Paths¶
- For Flink-related issues, what is the internal escalation path (Slack channels, on-call rotations, runbooks)?
- Are there any runbooks specific to:
- Stuck or degraded statements
- Compute pool capacity exhaustion
- Misconfigured networking / access failures?
- Who is the primary SRE contact for this engineer’s environment/region?
Demonstrate with the Confluent CLI
This approach provides a fully managed Flink service and is the easiest way to get started quickly, without managing Flink or Kafka clusters. It uses the Confluent CLI:
- Upgrade the CLI
- Create a Flink compute pool
- Start the SQL client
- Submit SQL statements:
```sql
CREATE TABLE my_table (
    id INT,
    name STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'my-topic',
    'properties.bootstrap.servers' = 'pkc-xxxxx.region.provider.confluent.cloud:9092',
    'properties.security.protocol' = 'SASL_SSL',
    'properties.sasl.mechanism' = 'PLAIN',
    'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="<API_KEY>" password="<API_SECRET>";',
    'format' = 'json'
);
```
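The CLI steps above can be sketched as follows (pool name, cloud, region, CFU value, and the bracketed IDs are placeholders, not values from this repository):

```shell
# Upgrade the Confluent CLI to the latest version
confluent update

# Create a Flink compute pool (cloud/region/CFU values are illustrative)
confluent flink compute-pool create my-pool --cloud aws --region us-east-1 --max-cfu 10

# Start an interactive Flink SQL shell against that pool
confluent flink shell --compute-pool <POOL_ID> --environment <ENV_ID>
```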
See Confluent Cloud Flink documentation for more details.
Explore the Shift Left project, a dedicated CLI for scaling and organizing Confluent Cloud Flink projects with an opinionated, streamlined approach.
Additional Resources¶
- Apache Flink Documentation
- Confluent Cloud Documentation
- Flink Kubernetes Operator
- Docker Hub Flink Images
- Shift Left project to manage Flink projects at scale.