Skip to content

Database playground and CDC to kafka studies

info

Updated 8/20/2022 - postgres content

This repository supports simple studies for data base play (SQL, JPA) like db2, postgresql and mysql and change data capture to move data to kafka for example.

SQL basic

Examples of some basic commands applied to customers table.

select * from customers;
select distinct(name) from customers;
select count(distinct(rate)) from films;

-- how many customer has the name bob and older that 18
select count(*) from customer where name = 'bob' and age >= 18;
-- order on a column name
select name, salary from customers order by salary DESC;

-- limit the number of records returned
SELECT name, salary FROM customers ORDER BY salary DESC LIMIT 10;

-- BETWEEN
select count(*) from payment where amount between 8 and 9;
select * from payment where payment_date between '2007-02-01' and '2007-2-15';

-- IN to test value in a list of options
select count(*) from payment where amount in(0.99, 1.98, 1.99);

-- LIKE and ILIKE (case incensitive) to do pattern matching on string
select * from customer where first_name ilike 'J%';

-- Modify the table
alter table customers add column email varchar(100);

update customers  set email='max@email.com' where id = 4;

-- Aggregate min, nax, abg, cont,...
select round(avg(replacement_code),3) from film;

-- GROUP BY combined with aggregate. Who is the customer spending the most
select customer_id, sum(amount) from  payment group by customer_id order by sum(amount) DESC;

-- get the day with the most transactions
select DATE(payment_date), sum(amount) from  payment group by  DATE(payment_date) order by sum(amount) DESC;

-- HAVING to allow us to use the aggregate result to filter the result along with group by
select customer_id, sum(amount) from  payment  
where staff_id = 2 
group by  customer_id having sum(amount) >= 110;

See also postgres study for information to run SQL on a local postgresql started with docker compose.

Joins and more complex SQL

Join to group records from two tables. Global Inner join SQL structure looks like: build a projection of all elements in table A also present in table B. Get all columns of both table

SELECT * FROM table_a INNER JOIN table_b on table_a.column = table_b.column_b
  • Get the top 10 customer name who do the most renting
select first_name, last_name, count(*) from payment 
inner join customer 
on payment.customer_id = customer.customer_id 
group by first_name, last_name 
order by count(*) desc limit 10;
  • OUTER JOIN: to deal with column only in one table

    -- full outer join - customer never buy anything
    select * from customer 
    full outer join payment 
    on payment.customer_id = customer.customer_id
    where customer.customer_id IS null;
    -- set of records that are in the left table, present or not in the right
    select title, inventory_id, store_id from film 
    left join inventory 
    on film.film_id = inventory.film_id
    

    -- film in catalog not in the inventory
    select title, inventory_id, store_id from film 
    left join inventory 
    on film.film_id = inventory.film_id
    where inventory_ID IS null
    
    -- get email address of customer leaving in california
    select district, email from customer
    join address
    on address.address_id = customer.address_id
    where address.district = 'California'
    

Exercises on the dvdrental database

See postgres to restore the database schema and data from the tar file.

  • How many payment transactions were greater than $5.00?

    select count(*) from payment where amount > 5
    
  • How many actors have a first name that starts with the letter P?

    select count(first_name) from actor where first_name like 'P%';
    
    * How many unique districts are our customers from?

    select count(distinct(district)) from address;
    -- Retrieve the list of names for those distinct districts 
    select distinct(district) from address where district is not null and district != '' order by district asc;
    
  • How many films have a rating of R and a replacement cost between $5 and $15?

    select count(*) from film where  rating = 'R' and replacement_cost between 5 and 15;
    
    * How many films have the word Truman somewhere in the title?

    select count(*) from film where title ilike '%truman%';
    
    * which staff member processes the biggest number of transaction

    select staff_id, count(amount) from  payment group by  staff_id;
    
    * what is the avg replacement cost per film rating

    select rating, round(avg(replacement_cost),2) from  film group by  rating;
    
    * Customer elligible for platinium status having more than 40 transactions

    select customer_id, count(amount) from  payment group by  customer_id having count(amount) >= 40 ORDER by count(amount);
    
  • Customer who spent more than 100$ with a given staff

    select customer_id, sum(amount) from  payment  where staff_id = 2 group by  customer_id having sum(amount) >= 100;
    
  • what are the film with a given actor

    select  title,first_name, last_name from film_actor as fa
    inner join actor
    on actor.actor_id = fa.actor_id
    inner join film
    on fa.film_id = film.film_id
    where last_name = 'Wahlberg' and first_name = 'Nick'
    
  • Film returned on a specific date: uses subquery, and joins

    select film_id,title from film
    where film_id in
    (select inventory.inventory_id from rental 
    inner join inventory
    on inventory.inventory_id = rental.inventory_id
    where rental.return_date between '2005-05-29' and ' 2005-05-30')
    ORDER BY title
    

Intermediate SQL

  • Protect against null for value and substitute using COALESCE
select product, (price - coalesce(discount,0)) from products;

Play with timestamp

-- get the year of a date
select extract(year from payment_date) as myyear from payment;
-- can use month, quarter, month, day
-- get how old is a record
select AGE(payment_date) from payment;
--
select TO_CHAR(payment_date, "MM-DD-YYYY") from payment;

Mathematical functions

Compute things from columns. Examples from postgresql

  • Get list of students who scored better than average grade
-- avg grade
select AVG(grade) from test_scores
-- solution with subquery
select student, grade from test_scrores
where grade > ( select AVG(grade) from test_scores);

Self-join

A query in which table is joined to itself: used to compare values in a column of rows within the same table. Need to use aliases. Get the employe's name and the name of his manager:

select emp.name, report.name as manager from employees as emp
join employees as report 
on emp.emp_id = report.report_id 
  • find the films with the same length presented as pair
from f1.title, f2.title, f1.length from film as f1
inner join film as f2
on f1.film_id != f2.film_id and f1.length = f2.length
  • Compare the various amounts of films per movie rating
select
sum(CASE rating
    WHEN 'R' THEN 1
    ELSE 0
END) as R,
sum(CASE rating
    WHEN 'PG' THEN 1
    ELSE 0
END) as PG,
sum(CASE rating
    WHEN 'PG-13' THEN 1
    ELSE 0
END) as PG13
from film

Add views

It is a stored query to be executed often.

create view customer_info as 
SELECT ... -- the sql to repeat

Table creation

For primary key try to using numerical type and Postgresl sequence like:

CREATE TABLE players (player_id SERIAL PRIMARY_KEY, age smallint not null)

Use CHECK to put constraint on column and between column

hire_date DATE check (hire_date > birthday)

Alter table

-- drop a constraint
alter table people alter column name drop not null

Alter table in postgresql - doc

Common queries to clean data

Remove duplicates

-- search for same transaction
SELECT transaction_id, COUNT(transaction_id)
FROM Transaction__Table
GROUP BY transaction_id
HAVING COUNT(transaction_id) > 1;
-- 

Quarkus with panache and DB2

The project is inventory app and the interesting part is the property file

quarkus.datasource.username=cbm39816
quarkus.datasource.password=
quarkus.datasource.url=jdbc:db2://dashdb-txn-sbox-yp-dal09-03.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;
quarkus.datasource.max-size=8
quarkus.datasource.min-size=2
quarkus.datasource.driver=com.ibm.db2.jcc.DB2Driver
quarkus.hibernate-orm.dialect=org.hibernate.dialect.DB2Dialect
quarkus.hibernate-orm.database.generation=drop-and-create
quarkus.hibernate-orm.database.default-schema=CBM39816
quarkus.hibernate-orm.log.sql=true
quarkus.hibernate-orm.sql-load-script=import.sql
quarkus.smallrye-openapi.path=/inventory/swagger
quarkus.swagger-ui.always-include=true

Postgresql

See dedicated note.

Setting CDC on those table

The Kafka connector is deployed as part of Strimzi.

TBC