Environment Setup¶
Version
Created January 2025 - Updated 03/21/25: the setup guide now separates the setup for users of the CLI from the setup for developers of this project.
This chapter covers the CLI tool used to manage a Flink project, either for a shift_left exercise or, in the near future, for a data-as-a-product project. The CLI is Python based; install it inside a virtual environment.
Additionally, when using the tool to migrate existing code to Apache Flink SQL, an LLM running within Ollama is used. The Qwen model with 32 billion parameters requires 64 GB of memory and a GPU with 32 GB of VRAM. Therefore, it may be more practical to create an EC2 instance with the appropriate resources, handle the migration of each fact and dimension table, generate the staged migrated SQL scripts, and then terminate the instance. Currently, tuning the pipeline and finalizing each automatically migrated SQL script is done manually, as some migrations are not perfect. There is still work to be done on prompting and on chaining AI agents.
To create this EC2 machine, Terraform configurations are provided in the IaC folder; see its readme. With Terraform and the setup.sh script, the steps described in the next sections are automated. The EC2 instance does not need to run Docker.
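For reference, the usual Terraform workflow for such an instance looks like the following sketch (the exact folder name and variables come from the IaC readme; adjust as needed):

# from the repository root
cd IaC
terraform init
terraform plan -out=tfplan
terraform apply tfplan
# ...run the migration work on the EC2 instance...
terraform destroy    # terminate the instance once the migrated SQL is saved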
Important
Ollama and the LLM are used for the migration and to classify SQL statements.
Pre-requisites¶
- On Windows - enable WSL2
- All Platforms - install git
- All Platforms - install make:
    - Windows - install make for Windows
    - Mac OS - brew install make
    - Linux - sudo apt-get install build-essential
- All Platforms - install the Confluent CLI
- All Platforms - install Python 3.12
- Clone this repository (this will not be necessary once the CLI is available on PyPI):
git clone https://github.com/jbcodeforce/shift_left_utils.git
cd shift_left_utils
- Create a Python virtual environment:
python -m venv .venv
- Activate the environment:
source .venv/bin/activate
- Install the shift_left CLI using the following command (this step is temporary until the CLI is published to PyPI):
pip install src/shift_left/dist/shift_left-0.1.1-py3-none-any.whl
- Validate the CLI is available via:
shift_left --help
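As an optional sanity check, you can also verify that the prerequisite tools are on the PATH:

git --version
make --version
confluent version
python --version    # should report 3.12.x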
Create a new Flink data project¶
This step is only needed when starting a new Confluent Cloud Flink project, or a new data-as-a-product project using Flink and Confluent.
shift_left project init <project_name> <project_path> [--project-type <project-type>]
# example for a default Kimball project
shift_left project init flink-project ../
# For a project more focused on developing data as a product
shift_left project init flink-project ../ --project-type data-product
At this stage, you should have three folders: the Flink project (flink-project), the dbt source project (src_dbt_project), and shift_left_utils. For the Kimball-based Flink project, there is a pipelines folder with the structure defined by the Kimball guidelines:
├── flink-project
│   └── pipelines
│       ├── common.mk
│       ├── dimensions
│       ├── facts
│       ├── intermediates
│       ├── sources
│       └── stage
└── src_dbt_project
or for a data product:
├── pipelines
│   ├── common.mk
│   └── data_product_1
│       ├── dimensions
│       ├── facts
│       │   └── fct_order
│       │       ├── Makefile
│       │       ├── sql-scripts
│       │       │   ├── ddl.fct_order.sql
│       │       │   └── dml.fct_order.sql
│       │       ├── tests
│       │       └── tracking.md
│       ├── intermediates
│       └── sources
└── staging
Working in a project¶
- Start a Terminal
- Connect to Confluent Cloud with the CLI, then get the environment and compute pool identifiers:
confluent login --save
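If the environment and the compute pool already exist, their identifiers can be listed with the Confluent CLI (verify the exact subcommands against your CLI version):

confluent environment list
confluent flink compute-pool list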
- Get the credentials for the Confluent Cloud Kafka cluster and the Flink compute pool. If you do not have such an environment, the Confluent CLI has a quickstart plugin:
confluent flink quickstart --name dbt-migration --max-cfu 50 --region us-west-2 --cloud aws
- Define the following environment variables in the .env file:
FLINK_PROJECT=.
CCLOUD_ENV_NAME=
CLOUD_PROVIDER=
CLOUD_REGION=
CCLOUD_CONTEXT=
CCLOUD_KAFKA_CLUSTER=
CCLOUD_COMPUTE_POOL_ID=
SRC_FOLDER=../../src-dbt-project/models
STAGING=$FLINK_PROJECT/staging
PIPELINES=$FLINK_PROJECT/pipelines
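One generic way to export those variables into the current shell session is shown below (a sketch only; depending on your setup, the Makefiles or the CLI may read the .env file directly):

set -a
source .env
set +a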
- Define a config.yaml file to keep some important parameters of the CLI.
cp src/shift_left/src/shift_left/core/templates/config_tmpl.yaml ./config.yaml
- Modify the config.yaml file at the root of the Flink project with the corresponding values. The kafka section is used to access the Kafka cluster and topics:
kafka:
  bootstrap.servers: pkc-<uid>.us-west-2.aws.confluent.cloud:9092
  security.protocol: SASL_SSL
  sasl.mechanisms: PLAIN
  sasl.username: <key name>
  sasl.password: <key secret>
  session.timeout.ms: 5000
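Optionally, those credentials can be checked with kcat (not part of the prerequisites; shown only as an optional connectivity check using the same placeholder values):

kcat -b pkc-<uid>.us-west-2.aws.confluent.cloud:9092 \
  -X security.protocol=SASL_SSL -X sasl.mechanisms=PLAIN \
  -X "sasl.username=<key name>" -X "sasl.password=<key secret>" -L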
The registry section is for the Schema Registry:
registry:
  url: https://psrc-<uid>.us-west-2.aws.confluent.cloud
  registry_key_name: <registry-key-name>
  registry_key_secret: <registry-key-secrets>
Those declarations are loaded by the Kafka producers and consumers, and by the tools accessing the model definitions from the Schema Registry.
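A quick way to validate the registry credentials is to call the Schema Registry REST API directly (a sketch using the placeholder values above):

curl -u "<registry-key-name>:<registry-key-secrets>" \
  https://psrc-<uid>.us-west-2.aws.confluent.cloud/subjects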
Security access
The config.yaml file is ignored by Git, so having the keys in this file is not a major concern as it is used by the developer only. Accessing secrets through a key manager API could be a future enhancement.
You are now ready to use the different tools. As a next step, read an example of the migration approach in this note, or use the recipes to learn how to perform common activities.
Working with the migration AI agent¶
- Install Ollama using one of the available downloads.
- Start Ollama using ollama serve, then download one of the Qwen models used by the AI agent: qwen2.5-coder:32b or qwen2.5-coder:14b, depending on your memory and GPU resources.
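For example, to download a model:

ollama pull qwen2.5-coder:32b
# or, for smaller hardware
ollama pull qwen2.5-coder:14b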