K Knowledge Base

ClickHouse (via Collector method) - v3.0.0

About Collectors


Pre-requisites

Collector Server Minimum Requirements

ClickHouse Requirements

  • Access to the following tables:

    1. system.databases

    2. system.tables

    3. system.columns


Step 1: Enabling logging

TBC


Step 2: Create the Source in K

Create a ClickHouse source in K

  • Go to Settings, select Sources and click Add Source

  • Select "Load from File system" option

  • Give the source a Name - e.g. ClickHouse Production

  • Add the Host name for the ClickHouse Server

  • Click Finish Setup


Step 3: Getting Access to the Source Landing Directory


Step 4: Install the Collector

You can download the latest Core Library and whl via Platform Settings → Sources → Download Collectors

Run the following command to install the collector

pip install kada_collectors_extractors_<version>-none-any.whl

You will also need to install the common library kada_collectors_lib for this collector to function properly.

pip install kada_collectors_lib-<version>-none-any.whl
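After installing both wheels, you can confirm that Python can see them. A minimal sketch using only the standard library; the distribution names below are assumed from the wheel filenames above and may differ in your environment:

```python
from importlib import metadata

def check_installed(distributions):
    """Return a mapping of distribution name -> installed version (or None if absent)."""
    versions = {}
    for dist in distributions:
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[dist] = None  # not installed
    return versions

# Distribution names assumed from the wheel filenames above
print(check_installed(["kada_collectors_extractors", "kada_collectors_lib"]))
```

If either entry comes back as None, re-run the corresponding pip install before continuing.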

Step 5: Configure the Collector

The ClickHouse collector extracts metadata only; it does not extract or process query usage on the database. The collector is driven by a configuration file with the following fields:

| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
| --- | --- | --- | --- |
| username | string | Username to log into ClickHouse | "myuser" |
| password | string | Password to log into ClickHouse | "password" |
| server | string | ClickHouse instance server | "t1x6j03yyo.ap-southeast-2.aws.clickhouse.cloud" |
| port | integer | The port to connect to the ClickHouse instance; generally this is 9440 | 9440 |
| host | string | The onboarded host in K for the ClickHouse source | "t1x6j03yyo.ap-southeast-2.aws.clickhouse.cloud" |
| database_name | string | The onboarded database name in K for the ClickHouse source | "myclickhouse" |
| meta_only | boolean | Currently only meta_only set to true is supported | true |
| output_path | string | Absolute path to the output location | "/tmp/output" |
| mask | boolean | Whether to enable masking | true |
| compress | boolean | Whether to compress the output to .csv.gz | true |
| timeout | integer | Timeout setting in seconds | 80000 |

kada_clickhouse_extractor_config.json

{
    "username": "",
    "password": "",
    "server": "",
    "port": 9440,
    "database_name": "",
    "host": "",
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true,
    "meta_only": true,
    "timeout": 80000
}
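Before running the collector, it can help to sanity-check the config file for missing keys or wrong types. A minimal sketch; this helper is not part of the KADA library, and the required fields mirror the table above:

```python
import json

# Expected fields and their types, mirroring the configuration table above
REQUIRED_FIELDS = {
    "username": str, "password": str, "server": str, "port": int,
    "database_name": str, "host": str, "output_path": str,
    "mask": bool, "compress": bool, "meta_only": bool, "timeout": int,
}

def validate_config(path):
    """Raise ValueError if any required field is missing or has the wrong type."""
    with open(path) as f:
        config = json.load(f)
    for field, expected in REQUIRED_FIELDS.items():
        if field not in config:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(config[field], expected):
            raise ValueError(f"{field} should be of type {expected.__name__}")
    return config
```

Run this against kada_clickhouse_extractor_config.json before invoking the wrapper script in the next step.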

Step 6: Run the Collector

This is the wrapper script: kada_clickhouse_extractor.py

import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.clickhouse import Extractor

get_generic_logger('root')  # set up the generic root logger

_type = 'clickhouse'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

# Allow the config path and collector name to be overridden on the command line
parser = argparse.ArgumentParser(description='KADA ClickHouse Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Fetch the high water marks for this collector run
start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))
ext.test_connection()  # verify connectivity before extracting
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

# Record the new high water mark once the run succeeds
publish_hwm(args.name, end_hwm)

Step 7: Check the Collector Outputs

K Extracts

A set of files (e.g. metadata, databaselog, linkages, events) will be generated in the output_path directory.
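To confirm the run produced output, you can list what landed in output_path. A minimal sketch; the exact file names depend on your source, and with compress set to true the extracts will be .csv.gz files:

```python
from pathlib import Path

def list_extracts(output_path):
    """Return the names of CSV extracts (plain or gzipped) under output_path."""
    out = Path(output_path)
    return sorted(p.name for p in out.iterdir()
                  if p.suffix == ".csv" or p.name.endswith(".csv.gz"))
```

An empty result usually means the extractor did not complete; check the collector logs before moving on.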

High Water Mark File

A high water mark file called clickhouse_hwm.txt is created.


Step 8: Push the Extracts to K

Once the files have been validated, you can push the files to the K landing directory.
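The push is a copy of the validated extracts into the landing directory obtained in Step 3. A minimal sketch using shutil; the landing path shown in the comment is hypothetical, substitute your own:

```python
import shutil
from pathlib import Path

def push_to_landing(output_path, landing_dir):
    """Copy every extract file from output_path into the K landing directory."""
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    pushed = []
    for extract in Path(output_path).iterdir():
        if extract.is_file():
            shutil.copy2(extract, landing / extract.name)  # preserves timestamps
            pushed.append(extract.name)
    return pushed

# Hypothetical landing path; use the directory obtained in Step 3
# push_to_landing("/tmp/output", "/mnt/k_landing/clickhouse")
```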


Example: Using Airflow to orchestrate the Extract and Push to K
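The extract and push steps above can be chained by any scheduler; in Airflow, each step typically becomes its own task in a DAG. As a scheduler-agnostic sketch, the same sequence in plain Python, where each command would map to one Airflow task (the script name comes from Step 6; the push step is the copy from Step 8):

```python
import subprocess
import sys

def run_step(cmd):
    """Run one pipeline step, aborting the pipeline if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"step {cmd} failed: {result.stderr}")
    return result.stdout

def run_pipeline(steps):
    """Run each step in order; in Airflow these would be individual tasks."""
    return [run_step(cmd) for cmd in steps]

# Example wiring (paths are illustrative):
# run_pipeline([
#     [sys.executable, "kada_clickhouse_extractor.py", "--name", "clickhouse"],
#     # followed by a step that pushes /tmp/output to the K landing directory
# ])
```

Stopping on the first failed step ensures a failed extract never pushes stale files to K.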