
Synch


Chinese documentation

Introduction

Sync data from other databases to ClickHouse. Currently supports PostgreSQL and MySQL, with both full and incremental ETL.


Features

  • Full data ETL and real-time incremental ETL.
  • DDL and DML sync: DDL currently supports add column, drop column, and change column; DML is fully supported.
  • Rich configurable options.
  • Supports Kafka and Redis as the broker.

Requirements

  • redis: caches the MySQL binlog file and position and serves as the broker; Redis cluster is also supported.
  • kafka: required if you use Kafka as the broker.
  • clickhouse-jdbc-bridge: required if you use PostgreSQL and either set auto_full_etl = True or run the synch etl command.
  • sentry: error reporting; enabled when a dsn is set in the config.

Install

> pip install synch

Usage

synch.ini

synch reads its default config from ./synch.ini, or you can use synch -c to specify a config file.

Don't delete any section from synch.ini even if you don't use it; just keep the defaults.

[core]
# when set to True, SQL statements are logged
debug = True
# currently supports redis and kafka
broker_type = redis
# source database, currently supports mysql and postgres
source_db = mysql
# tables to skip deletes for, comma-separated, in schema.table format
skip_delete_tables =
# tables to skip updates for, comma-separated, in schema.table format
skip_update_tables =
# DML types to skip, comma-separated, example: delete,update
skip_dmls =
# how many rows to buffer before each submit, 20000 recommended in production
insert_num = 1
# how many seconds between submits, 60 recommended in production
insert_interval = 1
# automatically do a full ETL first when the target table does not exist
auto_full_etl = True

[sentry]
# sentry environment
environment = development
# sentry dsn
dsn =

[redis]
host = 127.0.0.1
port = 6379
password =
db = 0
prefix = synch
# enable redis sentinel
sentinel = False
# redis sentinel hosts, comma-separated
sentinel_hosts = 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002
sentinel_master = master
# stream max length, older entries are trimmed FIFO
queue_max_len = 200000

[mysql]
server_id = 1
# optional, read from `show master status` result if empty
init_binlog_file =
# optional, read from `show master status` result if empty
init_binlog_pos =
host = mysql
port = 3306
user = root
password = 123456

# schema to sync, one section per schema, section name format is mysql.<schema>
[mysql.test]
# tables to sync, comma-separated
tables = test
# kafka partition, required when broker_type = kafka
kafka_partition = 0

# when source_db = postgres
[postgres]
host = postgres
port = 5432
user = postgres
password =

[postgres.postgres]
tables = test
kafka_partition = 0

[clickhouse]
host = 127.0.0.1
port = 9000
user = default
password =

# required when broker_type = kafka
[kafka]
# kafka servers, comma-separated
servers = 127.0.0.1:9092
topic = synch
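
If your config file lives somewhere other than ./synch.ini, pass it with -c. A minimal sketch, assuming -c is given as a global option before the subcommand and that the path is illustrative:

> synch -c /etc/synch/synch.ini produce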

Full data etl

You may need to run a full data ETL before continuously syncing data from MySQL to ClickHouse, and you can redo the full ETL with --renew.

> synch etl -h

usage: synch etl [-h] --schema SCHEMA [--tables TABLES] [--renew] [--partition-by PARTITION_BY] [--settings SETTINGS] [--engine ENGINE]

optional arguments:
  -h, --help            show this help message and exit
  --schema SCHEMA       Schema to full etl.
  --tables TABLES       Tables to full etl, multiple tables split with comma.
  --renew               Etl after try to drop the target tables.
  --partition-by PARTITION_BY
                        Table create partitioning by, like toYYYYMM(created_at).
  --settings SETTINGS   Table create settings, like index_granularity=8192
  --engine ENGINE       Table create engine, default MergeTree.

Full ETL of table test.test:

> synch etl --schema test --tables test
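
For example, to drop and recreate the target table first, partition it by month, and pass table settings, the flags documented above can be combined (values are illustrative):

> synch etl --schema test --tables test --renew --partition-by "toYYYYMM(created_at)" --settings "index_granularity=8192"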

Produce

Listen to all MySQL binlog events and produce them to the broker.

> synch produce

Consume

Consume messages from the broker and insert them into ClickHouse; you can skip error rows with --skip-error. When auto_full_etl = True is set in synch.ini, synch performs a full ETL first.

> synch consume -h

usage: synch consume [-h] --schema SCHEMA [--skip-error] [--last-msg-id LAST_MSG_ID]

optional arguments:
  -h, --help            show this help message and exit
  --schema SCHEMA       Schema to consume.
  --skip-error          Skip error rows.
  --last-msg-id LAST_MSG_ID
                        Redis stream last msg id or kafka msg offset, depend on broker_type in config.

Consume schema test and insert into ClickHouse:

> synch consume --schema test
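
To resume from a known broker position while ignoring bad rows, a sketch (the ID below is only an illustrative Redis stream ID; with Kafka it would be a message offset):

> synch consume --schema test --skip-error --last-msg-id 1621023123000-0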

Use docker-compose (recommended)

Redis Broker, lightweight and suitable for low concurrency
version: "3"
services:
  producer:
    depends_on:
      - redis
    image: long2ice/synch
    command: synch produce
    volumes:
      - ./synch.ini:/synch/synch.ini
  consumer.test:
    depends_on:
      - redis
    image: long2ice/synch
    command: synch consume --schema test
    volumes:
      - ./synch.ini:/synch/synch.ini
  redis:
    hostname: redis
    image: redis:latest
    volumes:
      - redis:/data
volumes:
  redis:
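
To sync more than one schema, one pattern is to add a consumer service per schema under services. A sketch, assuming a hypothetical [mysql.test2] section is configured in synch.ini:

  consumer.test2:
    depends_on:
      - redis
    image: long2ice/synch
    command: synch consume --schema test2
    volumes:
      - ./synch.ini:/synch/synch.ini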
Kafka Broker, for high concurrency
version: "3"
services:
  zookeeper:
    image: bitnami/zookeeper:3
    hostname: zookeeper
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
    volumes:
      - zookeeper:/bitnami
  kafka:
    image: bitnami/kafka:2
    hostname: kafka
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
      - JMX_PORT=23456
      - KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE=true
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
    depends_on:
      - zookeeper
    volumes:
      - kafka:/bitnami
  kafka-manager:
    image: hlebalbau/kafka-manager
    ports:
      - "9000:9000"
    environment:
      ZK_HOSTS: "zookeeper:2181"
      KAFKA_MANAGER_AUTH_ENABLED: "false"
    command: -Dpidfile.path=/dev/null
  producer:
    depends_on:
      - redis
      - kafka
      - zookeeper
    image: long2ice/synch
    command: synch produce
    volumes:
      - ./synch.ini:/synch/synch.ini
  consumer.test:
    depends_on:
      - redis
      - kafka
      - zookeeper
    image: long2ice/synch
    command: synch consume --schema test
    volumes:
      - ./synch.ini:/synch/synch.ini
  redis:
    hostname: redis
    image: redis:latest
    volumes:
      - redis:/data
volumes:
  redis:
  kafka:
  zookeeper:
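
Start either stack with standard docker-compose commands, for example:

> docker-compose up -d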

Important

  • Synch doesn't support composite primary keys; every synced table must have a primary key or a unique key.
  • DDL sync is not supported for PostgreSQL.
  • PostgreSQL sync is not fully tested; be careful using it in production.

QQ Group

Support this project

  • Just give a star!
  • Join QQ group for communication.
  • Donation.

AliPay

WeChat Pay

PayPal

Donate via PayPal to my account long2ice.

ThanksTo

Powerful Python IDE PyCharm from JetBrains.


License

This project is licensed under the Apache-2.0 License.
