K Knowledge Base

Databricks (via Collector method) - v3.0.0

About Collectors


Prerequisites

Collector server minimum requirements

Databricks Requirements

  1. Unity Catalog must be enabled. Hive catalogs are currently not supported.

    1. https://community.databricks.com/t5/bangalore/how-do-we-enable-unity-catalog-for-our-workspace/td-p/73258

  2. Enable the System Schemas for

    1. access

    2. query

    3. Follow this documentation to enable them

      1. https://docs.databricks.com/en/admin/system-tables/index.html#enable

      2. https://docs.databricks.com/en/dev-tools/auth/pat.html

      3. https://kb.databricks.com/unity-catalog/find-your-metastore-id

        curl -v -X PUT -H "Authorization: Bearer <PAT TOKEN>" "https://<YOUR WORKSPACE>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<METASTORE ID>/systemschemas/access"
        curl -v -X PUT -H "Authorization: Bearer <PAT TOKEN>" "https://<YOUR WORKSPACE>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<METASTORE ID>/systemschemas/query"
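As a sketch, the two curl calls above can also be assembled programmatically, which is convenient when enabling schemas across several metastores. The snippet below only builds the request URL and auth header from the same placeholders used above; sending the PUT (e.g. with `requests.put`) is left to the reader.

```python
# Build the PUT request used to enable one Unity Catalog system schema.
# The endpoint path matches the curl commands above; workspace, metastore
# ID, and token are the same placeholders.

def build_enable_request(workspace: str, metastore_id: str, schema: str, token: str):
    """Return the (url, headers) pair for enabling one system schema."""
    url = (
        f"https://{workspace}.cloud.databricks.com"
        f"/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/{schema}"
    )
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers

# One request each for the "access" and "query" schemas.
for schema in ("access", "query"):
    url, headers = build_enable_request("<YOUR WORKSPACE>", "<METASTORE ID>", schema, "<PAT TOKEN>")
    print(url)
```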
        

Step 1: Create the Source in K

Create a source in K

  • Go to Settings, select Sources, and click Add Source

  • Select the "Load from File" option

  • Give the source a Name, e.g. Databricks Production

  • Add the Host name of the Databricks instance

  • Click Next & Finish Setup


Step 2: Getting Access to the Source Landing Directory


Step 3: Install the Collector

You can download the latest Core Library and Databricks whl via Platform Settings → Sources → Download Collectors

Run the following command to install the collector

pip install kada_collectors_extractors_databricks-3.0.0-py3-none-any.whl

You will also need to install the corresponding common library kada_collectors_lib-x.x.x for this collector to function properly.

pip install kada_collectors_lib-x.x.x-py3-none-any.whl

Step 4: Configure the Collector

FIELD               | FIELD TYPE   | DESCRIPTION                                                         | EXAMPLE
access_token        | string       | Databricks personal access token for authentication                 |
server_hostname     | string       | Server address of the Databricks service                            | adb-<workspaceId>.<instance>.azuredatabricks.net
http_path           | string       | HTTP path to either a DBSQL endpoint or a DBR interactive cluster   | /sql/1.0/warehouses/<warehouseId>
statement_timeout   | integer      | Query time limit in seconds. Default is 600.                        | 600
host                | string       | The onboarded host value in K                                       |
databases           | list<string> | List of databases (catalogs in Databricks) to extract               | ["dwh", "adw"]
information_catalog | string       | The catalog to extract system information from. Default is "system" | system
output_path         | string       | Absolute path to the output location                                | "/tmp/output"
mask                | boolean      | Whether to mask the output                                          | true
compress            | boolean      | Whether to gzip the output                                          | true
meta_only           | boolean      | Whether to extract metadata only                                    | false

kada_databricks_extractor_config.json

JSON
{
    "access_token": "",
    "server_hostname": "",
    "http_path": "",
    "statement_timeout": 600,
    "host": "",
    "databases": [],
    "information_catalog": "system",
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true,
    "meta_only": true
}
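Before handing this file to the collector, it can be worth a quick sanity check that no fields have been dropped. The sketch below is illustrative (the helper name is not part of the collector library); the required key list simply mirrors the field table above.

```python
import json

# Keys the collector expects, mirroring the field table above.
REQUIRED_KEYS = {
    "access_token", "server_hostname", "http_path", "statement_timeout",
    "host", "databases", "information_catalog", "output_path",
    "mask", "compress", "meta_only",
}

def missing_keys(config: dict) -> list:
    """Return any required keys absent from the loaded config, sorted."""
    return sorted(REQUIRED_KEYS - config.keys())

# Example: check a partially filled config before running the extractor.
config = json.loads('{"access_token": "", "server_hostname": ""}')
print(missing_keys(config))
```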

Step 5: Run the Collector

This code sample, kada_databricks_extractor.py, handles loading the configuration and running the extractor

Python
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.databricks import Extractor

get_generic_logger('root')  # set up the root logger

# Default to the config file created in Step 4, next to this script
_type = 'databricks'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA Databricks Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Fetch the high water marks from the previous run
start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

# Record the new high water mark for the next run
publish_hwm(args.name, end_hwm)

Step 6: Check the Collector Outputs

K Extracts

A set of files (e.g. metadata, databaselog, linkages, events, etc.) will be generated in the output_path directory.
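The exact file names depend on your configuration, but a quick listing like the sketch below (extensions assumed to be .csv, or .csv.gz when compress is enabled) can confirm the run produced extracts before you push anything:

```python
import os

def list_extracts(output_path: str) -> list:
    """Return the extract files found in output_path, sorted by name."""
    return sorted(
        name for name in os.listdir(output_path)
        if name.endswith((".csv", ".csv.gz"))
    )

# Example: print what the collector produced.
# print(list_extracts("/tmp/output"))
```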

High Water Mark File

A high water mark file called databricks_hwm.txt is created.


Step 7: Push the Extracts to K

Once the files have been validated, you can push the files to the K landing directory.


Example: Using Airflow to orchestrate the Extract and Push to K
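As a hedged sketch (paths and the landing location are placeholders, and the DAG wiring is shown only in comments), an Airflow setup would typically wrap the extract and push as two callables, run in sequence:

```python
import shutil
import subprocess
from pathlib import Path

OUTPUT_PATH = Path("/tmp/output")      # matches output_path in the config
LANDING_DIR = Path("/tmp/k_landing")   # placeholder for the K landing directory

def run_extract() -> None:
    """Task 1: run the extractor script from Step 5."""
    subprocess.run(["python", "kada_databricks_extractor.py"], check=True)

def push_to_k(output_path: Path = OUTPUT_PATH, landing_dir: Path = LANDING_DIR) -> None:
    """Task 2: copy the generated extracts into the K landing directory."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    for extract in output_path.glob("*.csv*"):
        shutil.copy2(extract, landing_dir / extract.name)

# In a DAG the tasks would be wired roughly as:
#   extract = PythonOperator(task_id="extract", python_callable=run_extract)
#   push = PythonOperator(task_id="push_to_k", python_callable=push_to_k)
#   extract >> push
```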