Environment Setup

Version

Created January 2025 - Updated 03/21/25: the setup guide now separates CLI users from developers of this project.

This chapter covers the CLI tool used to manage a Flink project, either in a shift_left exercise or, in the near future, in a data-as-a-product project. The CLI is Python-based; install it in a virtual environment.

Additionally, when using the tool to migrate existing code to Apache Flink SQL, an LLM running within Ollama is used. The Qwen model with 32 billion parameters requires 64 GB of memory and a GPU with 32 GB of VRAM. Therefore, it may be more practical to create an EC2 instance with the appropriate resources, migrate each fact and dimension table, generate the staged migrated SQLs, and then terminate the instance. Currently, tuning the pipeline and finalizing each automatically migrated SQL is done manually, as some migrations are not perfect. There is still work to be done on prompting and chaining AI agents.
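
As a sketch, assuming Ollama is installed on the machine and a Qwen 32B model tag such as qwen2.5-coder:32b is used (the exact tag expected by the tool may differ), pulling and smoke-testing the model looks like:

# Install Ollama on Linux (official installer), then pull a Qwen 32B model.
curl -fsSL https://ollama.com/install.sh | sh
# The model tag below is an example; adjust it to the model the migration tool expects.
ollama pull qwen2.5-coder:32b
# Quick local smoke test of the model
ollama run qwen2.5-coder:32b "Rewrite SELECT id, name FROM users as Flink SQL"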

To create this EC2 machine, Terraform configurations are defined in the IaC folder; see its readme. With Terraform and the setup.sh script, the steps described in the next sections are automated. The EC2 instance does not need to run Docker.
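
As a minimal sketch of that workflow (the actual variables, folder layout, and whether setup.sh runs through user data or manually are defined in the IaC readme):

cd IaC
terraform init
terraform plan      # review the EC2 instance and related resources to be created
terraform apply     # provision the instance; run setup.sh as described in the readme
# ... perform the migration work on the instance ...
terraform destroy   # terminate the instance once the migrated SQLs are saved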

Important: Ollama and an LLM are used for the migration and to classify SQL statements.

Pre-requisites

git clone  https://github.com/jbcodeforce/shift_left_utils.git
cd shift_left_utils
  • Create a Python virtual environment:
python -m venv .venv
  • Activate the environment:
source .venv/bin/activate
  • Install the shift_left CLI using the command below (this is temporary until the CLI is published to PyPI). The available wheel versions are under the src/shift_left/dist folder; the version number changes over time, so the documentation may not reflect the latest one and the file name may need to be adjusted.
pip install src/shift_left/dist/shift_left-0.1.12-py3-none-any.whl
# For developers using uv, the installation is:
uv tool list
uv tool uninstall shift_left
uv tool install shift_left@src/shift_left/dist/shift_left-0.1.12-py3-none-any.whl
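
To confirm the wheel is installed in the active virtual environment and the CLI is on the PATH, a quick check such as the following can be used (the version shown depends on the wheel installed):

pip show shift_left      # package metadata, including the installed version
which shift_left         # path to the CLI entry point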

Set up configuration yaml file

The configuration file has to be named config.yaml and is referenced by the CONFIG_FILE environment variable.

  • Copy the config.yaml template file; it holds the important parameters for the CLI.
cp src/shift_left/src/shift_left/core/templates/config_tmpl.yaml ./config.yaml
  • Modify config.yaml with the corresponding values. Some sections are mandatory.

The kafka section is used to access the Kafka cluster and topics:

kafka:
  bootstrap.servers: pkc-<uid>.us-west-2.aws.confluent.cloud:9092
  security.protocol: SASL_SSL
  sasl.mechanisms: PLAIN
  sasl.username: <key name>
  sasl.password: <key secret>
  session.timeout.ms: 5000
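
Optionally, and outside the CLI itself, the Kafka credentials can be sanity-checked with kcat (formerly kafkacat) if it is installed, using the same values as in config.yaml:

# List the cluster metadata over SASL_SSL with the API key and secret
kcat -b pkc-<uid>.us-west-2.aws.confluent.cloud:9092 \
     -X security.protocol=SASL_SSL \
     -X sasl.mechanisms=PLAIN \
     -X sasl.username=<key name> \
     -X sasl.password=<key secret> \
     -L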

The registry section is for the Schema Registry.

registry:
  url: https://psrc-<uid>.us-west-2.aws.confluent.cloud
  registry_key_name: <registry-key-name>
  registry_key_secret: <registry-key-secrets>

These declarations are loaded by the Kafka producer and consumer, and by the tools accessing the model definitions from the Schema Registry.
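
The registry credentials can be verified with a plain REST call to the Schema Registry, which returns the list of registered subjects:

# List the subjects known to the Schema Registry (basic authentication with the registry key)
curl -s -u <registry-key-name>:<registry-key-secrets> \
     https://psrc-<uid>.us-west-2.aws.confluent.cloud/subjects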

Security access

The config.yaml file is ignored by Git, so having the keys in this file is not a major concern as it is used by the developer only. Accessing secrets through a key-manager API could be a future enhancement.

Environment variables

Ensure the following environment variables are set, for example in a set_env file. In a project where the source repository is cloned to your-src-dbt-folder and the target Flink project is flink-project, use these settings:

export FLINK_PROJECT=$HOME/Code/flink-project
export STAGING=$FLINK_PROJECT/staging
export PIPELINES=$FLINK_PROJECT/pipelines
export SRC_FOLDER=$HOME/Code/datawarehouse/models


export CONFIG_FILE=./config.yaml
# The following variables are used when deploying with the Makefile.

export CCLOUD_ENV_ID=env-xxxx
export CCLOUD_ENV_NAME=j9r-env
export CCLOUD_KAFKA_CLUSTER=jxxxxx
export CLOUD_REGION=us-west-2
export CLOUD_PROVIDER=aws
export CCLOUD_CONTEXT=login-<user_id>@<domain>.com-https://confluent.cloud
export CCLOUD_COMPUTE_POOL_ID=lfcp-xxxx

To configure these environment variables for your terminal session, run:

source set_env
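
A quick check that the variables are loaded in the current shell:

# These should all print non-empty values
echo $PIPELINES $STAGING $SRC_FOLDER
echo $CONFIG_FILE $CCLOUD_ENV_ID $CCLOUD_COMPUTE_POOL_ID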

Validate the CLI

shift_left --help

Next>> Recipes section