# CDC with Debezium studies
## Tutorial summary
The official Debezium tutorial is summarized in this note.
## Things to remember
- Debezium is a set of Kafka Connect source connectors, with one dedicated connector per database engine: PostgreSQL, MySQL, etc.
## Tutorial steps
The commands are provided in the Makefile under `code/debezium-tutorial`.
- The lab setup: start the ZooKeeper, Kafka, MySQL, and Kafka Connect containers.
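
A minimal sketch of the container startup, mirroring the tutorial and assuming the same `quay.io/debezium` image version (3.2) used by the watcher command below:

```sh
# ZooKeeper and Kafka
docker run -d --rm --name zookeeper -p 2181:2181 quay.io/debezium/zookeeper:3.2
docker run -d --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper quay.io/debezium/kafka:3.2
# MySQL pre-populated with the 'inventory' sample database
docker run -d --rm --name mysql -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw \
  quay.io/debezium/example-mysql:3.2
# Kafka Connect with the Debezium connectors installed
docker run -d --rm --name connect -p 8083:8083 \
  -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs \
  -e OFFSET_STORAGE_TOPIC=my_connect_offsets -e STATUS_STORAGE_TOPIC=my_connect_statuses \
  --link kafka:kafka --link mysql:mysql \
  quay.io/debezium/connect:3.2
```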
Potential issues to address: when connecting to the MySQL server with the client, the root user may be denied access. One fix is to ensure the root user (or the specific user Debezium connects with) has the required privileges granted for the correct host.
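
A sketch of the grant, run from an administrative MySQL session; the privilege list matches what the Debezium MySQL connector documentation requires, while the `debezium` user name and `%` host are assumptions:

```sh
docker exec -it mysql mysql -uroot -pdebezium -e "
  GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
    ON *.* TO 'debezium'@'%';
  FLUSH PRIVILEGES;"
```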
- Register the Debezium connector so it starts monitoring the database server's binlog:

```sh
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" \
  localhost:8083/connectors/ -d@config.json
```
When you register a connector, it generates a large amount of log output in the Kafka Connect container.
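
The config.json referenced above is not shown in this note; here is a sketch based on the tutorial's register-mysql.json (the host names, credentials, and server id are the tutorial's defaults and may differ in other setups):

```sh
cat > config.json <<'EOF'
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "dbserver1",
    "database.include.list": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schemahistory.inventory"
  }
}
EOF
```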
- The connector performs an initial snapshot: it reads each table's schema and all of its existing rows, creates the corresponding Kafka topics, and then transitions to continuously reading the binlog.
- The topics created match the tables in the database:

| topic name | role |
| --- | --- |
| `dbserver1` | The schema change topic, to which all DDL statements are written. |
| `dbserver1.inventory.products` | Captures change events for the `products` table in the `inventory` database. |
| `dbserver1.inventory.products_on_hand` | Captures change events for the `products_on_hand` table in the `inventory` database. |
| `dbserver1.inventory.customers` | Captures change events for the `customers` table in the `inventory` database. |
| `dbserver1.inventory.orders` | Captures change events for the `orders` table in the `inventory` database. |
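
To verify, the topics can be listed from inside the Kafka container (the `/kafka` install path is an assumption about the `quay.io/debezium/kafka` image):

```sh
docker exec -it kafka /kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```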
- View the events created:

```sh
docker run -it --rm --name watcher --link zookeeper:zookeeper --link kafka:kafka \
  quay.io/debezium/kafka:3.2 watch-topic -a -k dbserver1.inventory.customers  # or make view_customers
```
For each event we should see two JSON documents: one for the key and one for the value. Each of them has two parts: a schema and a payload. The schema is a Kafka Connect schema describing what is in the payload.
- The envelope provides metadata about what the event represents (see the abbreviated example after this list):
- op: A required field that contains a string value describing the type of operation: c for create (insert), u for update, d for delete, or r for read (during a snapshot).
- before: An optional field that, if present, contains the state of the row before the event occurred.
- after: An optional field that, if present, contains the state of the row after the event occurred.
- source: A required field that contains a structure describing the source metadata for the event. For MySQL it contains several fields: the connector name, the name of the binlog file where the event was recorded, the position in that binlog file where the event appeared, the row within the event (if there is more than one), the names of the affected database and table, the MySQL thread ID that made the change, whether this event was part of a snapshot, and, if available, the MySQL server ID and the timestamp in seconds.
- ts_ms: An optional field that, if present, contains the time (using the system clock in the JVM running the Kafka Connect task) at which the connector processed the event.
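
An abbreviated update-event payload, based on the tutorial's customers example; the schema part is omitted, and the source and ts_ms values shown are illustrative:

```json
{
  "payload": {
    "op": "u",
    "before": { "id": 1004, "first_name": "Anne",       "last_name": "Kretchmar", "email": "annek@noanswer.org" },
    "after":  { "id": 1004, "first_name": "Anne Marie", "last_name": "Kretchmar", "email": "annek@noanswer.org" },
    "source": { "connector": "mysql", "db": "inventory", "table": "customers", "file": "mysql-bin.000003", "pos": 364 },
    "ts_ms": 1700000000000
  }
}
```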
- The JSON converter includes the key and value schemas in every message, so it produces very verbose events. With Avro, the schema is persisted once in a remote schema registry and each message only carries the schema's compact identifier, which keeps events small.
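
A sketch of the Kafka Connect worker settings for switching to Avro, assuming Confluent's AvroConverter is on the Connect classpath and a schema registry is reachable at `schema-registry:8081` (both assumptions):

```properties
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter.schema.registry.url=http://schema-registry:8081
```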
- Update table records in the MySQL client:
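
A sketch using the tutorial's update statement, assuming the example-mysql container above (root password `debezium`):

```sh
docker exec -it mysql mysql -uroot -pdebezium \
  -e "UPDATE customers SET first_name='Anne Marie' WHERE id=1004;" inventory
```

The watcher should then print an update event whose before and after fields differ only in first_name.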
- Deleting a record produces an event whose key is the same as before, while the payload's after field is null.
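
A sketch of the delete, again assuming the tutorial's database (rows in other tables referencing the customer may need to be removed first to satisfy foreign-key constraints):

```sh
docker exec -it mysql mysql -uroot -pdebezium \
  -e "DELETE FROM customers WHERE id=1004;" inventory
```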
- Stop the Kafka Connect service, add new records to the database, then restart the service: the connector resumes from its last recorded binlog position and emits the events it missed.
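
A sketch of this step; the inserted values are illustrative:

```sh
docker stop connect
docker exec mysql mysql -uroot -pdebezium \
  -e "INSERT INTO customers VALUES (default, 'Sarah', 'Thompson', 'kitt@acme.com');" inventory
# Restart Kafka Connect by re-running the original 'docker run ... quay.io/debezium/connect'
# command (the container was started with --rm, so 'docker start' will not work here).
```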
- Clean up: stop all the containers.
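
A sketch, assuming the container names used above:

```sh
docker stop watcher connect mysql kafka zookeeper
```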
## To keep in mind
If a topic is log compacted, Kafka removes older messages whenever a later message with the same key exists. After a delete, Debezium emits a tombstone event: a message with the same key and a null value. This tells Kafka that all prior messages with that key can be removed. Even though those prior messages are removed, consumers reading the topic from the beginning still see the tombstone and therefore do not miss the deletion.
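
A sketch of enabling compaction on one of the change-event topics (the `/kafka` path inside the container is an assumption):

```sh
docker exec -it kafka /kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name dbserver1.inventory.customers \
  --alter --add-config cleanup.policy=compact
```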
The Kafka Connect service automatically manages the tasks of its registered connectors, so when it restarts after going offline, it starts any tasks that are not running. This means that even if Debezium was not running while changes were made to the database, it will still report those changes once it is restarted, by resuming from the last recorded binlog position.