
Environment Setup

This chapter covers the essential tools required and the project setup. There are two options for running the tools: using a Python virtual environment, or using Docker with a prebuilt Python environment image that includes all the necessary modules and code.

Additionally, running the Ollama Qwen model with 32 billion parameters requires 64 GB of memory and a GPU with 32 GB of VRAM. Therefore, it may be more practical to create an EC2 instance with the appropriate resources, use it to migrate each fact and dimension table and generate the staged migrated SQL, and then terminate the instance. Currently, tuning the pipeline and finalizing each SQL version is done manually.

To create this EC2 instance, Terraform configurations are defined in the IaC folder; see its README. With Terraform and the setup.sh script, the steps described in the next sections are automated, and a typical Terraform workflow is sketched below. The EC2 instance does not need to run Docker.
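As an outline, the standard Terraform workflow applies (a minimal sketch; the actual variables and outputs depend on the configurations in the IaC folder):

cd IaC
# download providers and initialize state
terraform init
# review the EC2 resources to be created, then create them
terraform plan
terraform apply
# ...run the migration work on the instance...
# terminate the instance once the staged SQL is generated
terraform destroy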

Common Pre-requisites

  • On Windows - enable WSL2
  • All Platforms - install git
  • All Platforms - install make (on Windows, use the WSL2 environment):

    • Mac OS: brew install make
    • Linux (and WSL2): sudo apt-get install build-essential
  • All Platforms - install the Confluent CLI

  • Go to the parent folder of the dbt source project. For example, if your dbt project is in /home/user/code, work from this code folder.
  • Clone this repository:
git clone https://github.com/jbcodeforce/shift_left_utils.git
  • Use the setup.sh script to create the project structure and copy some important files for the new Flink project:
cd shift_left_utils
./setup.sh <a_flink_project_name | default is flink_project>
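Before going further, you can verify that the prerequisite tools are available on your PATH:

git --version
make --version
confluent version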

At this stage, you should have three folders for the project: flink-project, src_dbt_project, and shift_left_utils. The flink-project folder contains a pipelines folder with the same structure as defined by the Kimball guidelines:

├── flink-project
│   ├── docker-compose.yaml
│   ├── pipelines
│   │   ├── common.mk
│   │   ├── dimensions
│   │   ├── facts
│   │   ├── intermediates
│   │   ├── sources
│   │   └── stage
│   └── start-ollama.sh
├── shift_left_utils
└── src_dbt_project
  • Connect to Confluent Cloud with the CLI, then get the environment and compute pool identifiers:
confluent login --save
  • Get the credentials for the Confluent Cloud Kafka cluster and Flink compute pool. If you do not have such an environment, the Confluent CLI has a quickstart plugin:
confluent flink quickstart --name dbt-migration --max-cfu 50 --region us-west-2 --cloud aws
  • Modify the config.yaml with the corresponding values. The kafka section is used to access the topics:
kafka:
  bootstrap.servers: pkc-<uid>.us-west-2.aws.confluent.cloud:9092
  security.protocol: SASL_SSL
  sasl.mechanisms: PLAIN
  sasl.username: <key name>
  sasl.password: <key secret>
  session.timeout.ms: 5000
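Optionally, these credentials can be verified outside of the tools with kcat (not a project dependency, just a quick connectivity check; substitute your own bootstrap server and API key values):

kcat -b pkc-<uid>.us-west-2.aws.confluent.cloud:9092 \
  -X security.protocol=SASL_SSL -X sasl.mechanisms=PLAIN \
  -X sasl.username=<key name> -X sasl.password=<key secret> -L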

The registry section is for the Schema Registry.

registry:
  url: https://psrc-<uid>.us-west-2.aws.confluent.cloud
  registry_key_name: <registry-key-name>
  registry_key_secret: <registry-key-secret>

These declarations are loaded by the Kafka producer and consumer, and by the tools accessing the model definitions from the Schema Registry (see the utils/kafka/app_config.py code).
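As a quick sanity check of what the tools will read, the values can be inspected with yq (optional; yq is not required by the project):

# print the bootstrap servers and Schema Registry URL from config.yaml
yq '.kafka."bootstrap.servers"' config.yaml
yq '.registry.url' config.yaml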

  • Modify the values for the cloud provider, the environment name, and the Confluent context in the pipelines/common.mk file. The first lines of this common makefile set up some variables:
ENV_NAME=aws-west    # name of the Confluent Cloud environment
CLOUD=aws            # cloud provider
REGION=us-west-2     # cloud provider region where the compute pool and Kafka cluster run
MY_CONTEXT=login-jboyer@confluent.io-https://confluent.cloud
DB_NAME=cluster_0    # name of the Kafka cluster, mapped to the database name in Flink
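The context name referenced by MY_CONTEXT can be checked with the CLI (the context string above comes from a login session and will differ in your environment):

# list the available contexts, then switch to the one used in common.mk
confluent context list
confluent context use login-jboyer@confluent.io-https://confluent.cloud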
Security access

The config.yaml file is ignored by Git, so keeping the keys in this file is not a major concern, as it is used by the developer only. Accessing secrets through a key manager API could be a future enhancement.
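As an illustration of such an enhancement (purely hypothetical, not part of the current tools), the keys could be fetched at runtime from a secret manager such as AWS Secrets Manager instead of being stored in the file:

# hypothetical: read a Kafka API secret stored under a chosen secret id
aws secretsmanager get-secret-value \
  --secret-id dbt-migration/kafka-api-key \
  --query SecretString --output text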

Using Python Virtual Environment

Pre-requisites

  • Create a Python virtual environment:
python -m venv .venv
  • Activate the environment:
source .venv/bin/activate
  • Work from the shift_left_utils folder
  • Install the needed Python modules:
# under the shift_left_utils folder
pip install -r utils/requirements.txt
  • Define the environment variables, changing the folder paths below to reflect your own settings:
export CONFIG_FILE=../../flink-project/config.yaml
export SRC_FOLDER=../../src-dbt-project/models
export STAGING=../../flink-project/staging
export REPORT_FOLDER=../../flink-project/reports
  • Install Ollama using one of the available downloads.
  • Start Ollama with ollama serve, then download one of the Qwen models used by the AI agent: qwen2.5-coder:32b or qwen2.5-coder:14b, depending on your memory and GPU resources.
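For example (pick the model tag matching your hardware):

# start the Ollama server in the background
ollama serve &
# download the model used by the AI agent
ollama pull qwen2.5-coder:32b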

You are ready to use the different tools. As a next step, read the migration approach in this note.

Using Docker Pythonenv image

Use this setup if you do not want to use a virtual environment.

Pre-requisites

  • Install Docker; these tools were tested with:
docker --version
Docker version 27.3.1, build ce12230

If, for some reason, you cannot use Docker Desktop, you can try Colima, Rancher Desktop, or Podman.

For Colima, the following configuration was used to run these tools on a Mac M3 with 64 GB of RAM:

colima start --cpu 4 -m 48 -d 100 -r docker

Project setup

  • Modify the docker-compose.yaml file to reference the source code folder path, mapped to the dbt-src folder. For example, using a source project named src-dbt-project, defined at the same level as the root folder of this repository, the statement looks like:
  pythonenv:
    image: jbcodeforce/shift-left-utils
    # ... more config here
    volumes:
    # change the path below
    - ../src-dbt-project/:/app/dbt-src
    # do not touch the other mounted volumes
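Then start the environment and open a shell in the container (assuming the service keeps the pythonenv name shown above):

# start the service in the background and get a shell inside it
docker compose up -d pythonenv
docker compose exec pythonenv bash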

You are ready to use the different tools. As a next step, read an example of the migration approach in this note.