About Collectors
Prerequisites
Collector server minimum requirements
Databricks Requirements
- Unity-enabled catalog. Hive catalogs are not currently supported.
- Enable the system schemas for:
  - access
  - query
- Follow this documentation to enable them:
  - https://docs.databricks.com/en/admin/system-tables/index.html#enable
  - https://kb.databricks.com/unity-catalog/find-your-metastore-id

curl -v -X PUT -H "Authorization: Bearer <PAT TOKEN>" "https://<YOUR WORKSPACE>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<METASTORE ID>/systemschemas/access"
curl -v -X PUT -H "Authorization: Bearer <PAT TOKEN>" "https://<YOUR WORKSPACE>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<METASTORE ID>/systemschemas/query"
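If you prefer to script the schema-enablement step rather than run the curl commands by hand, the two PUT calls can be sketched in Python. This is a minimal illustration only; the workspace, metastore ID, and token placeholders are yours to fill in, and `enable_schema_url`/`enable_schema` are helpers written for this guide, not KADA or Databricks functions.

```python
# Sketch: enable the "access" and "query" system schemas via the Unity Catalog
# REST API, mirroring the curl commands above. All three values below are
# placeholders you must replace with your own.
import urllib.request

WORKSPACE = "YOUR_WORKSPACE"      # e.g. the subdomain of your workspace URL
METASTORE_ID = "YOUR_METASTORE"   # see the metastore-id KB article above
TOKEN = "YOUR_PAT_TOKEN"          # a Databricks personal access token

def enable_schema_url(schema: str) -> str:
    """Build the PUT endpoint for enabling one system schema."""
    return (f"https://{WORKSPACE}.cloud.databricks.com"
            f"/api/2.0/unity-catalog/metastores/{METASTORE_ID}"
            f"/systemschemas/{schema}")

def enable_schema(schema: str) -> None:
    """Issue the PUT request; raises urllib.error.HTTPError on failure."""
    req = urllib.request.Request(
        enable_schema_url(schema),
        method="PUT",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    urllib.request.urlopen(req)

# Uncomment once the placeholders are filled in:
# for schema in ("access", "query"):
#     enable_schema(schema)
```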
Step 1: Create the Source in K
Create a source in K
- Go to Settings, select Sources, and click Add Source
- Select the "Load from File" option
- Give the source a Name - e.g. Databricks Production
- Add the Host name for the Databricks instance
- Click Next & Finish Setup
Step 2: Getting Access to the Source Landing Directory
Step 3: Install the Collector
You can download the latest Core Library and Databricks whl via Platform Settings → Sources → Download Collectors.
Run the following command to install the collector:
pip install kada_collectors_extractors_databricks-3.0.0-py3-none-any.whl
You will also need to install the corresponding common library kada_collectors_lib-x.x.x for this collector to function properly.
pip install kada_collectors_lib-x.x.x-py3-none-any.whl
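To confirm both wheels landed in the active environment, a small check like the following can help. `is_installed` is a helper written for this guide, not part of the KADA library; the `kada_collectors` module name is taken from the imports in the extractor script below.

```python
# Sanity check that a package is importable in the current environment.
import importlib.util

def is_installed(mod: str) -> bool:
    """True if the module can be resolved by the import machinery."""
    return importlib.util.find_spec(mod) is not None

# After installing both wheels, this should print True:
# print(is_installed("kada_collectors"))
```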
Step 4: Configure the Collector
| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| access_token | string | Databricks personal access token for authentication | |
| server_hostname | string | Server address of the Databricks service | |
| http_path | string | HTTP path to either a DBSQL endpoint or a DBR interactive cluster | |
| statement_timeout | integer | Query time limit in seconds. Default is 600. | 600 |
| host | string | The onboarded host value in K | |
| databases | list<string> | List of databases to extract (catalogs in Databricks) | ["dwh", "adw"] |
| information_catalog | string | The catalog to extract information from. Default is "system". | system |
| output_path | string | Absolute path to the output location | "/tmp/output" |
| mask | boolean | Whether to enable masking | true |
| compress | boolean | Whether to gzip the output | true |
| meta_only | boolean | Extract metadata only | false |
kada_databricks_extractor_config.json
{
"access_token": "",
"server_hostname": "",
"http_path": "",
"statement_timeout": 600,
"host": "",
"databases": [],
"information_catalog": "system",
"output_path": "/tmp/output",
"mask": true,
"compress": true,
"meta_only": true
}
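Before running the collector, it can be useful to sanity-check the config file against the field table above. The `validate_config` helper below is a sketch written for this guide, not part of the KADA library; the actual collector performs its own validation via `load_config`.

```python
# Sketch: verify the config file has every expected field with a plausible type.
import json

REQUIRED = {
    "access_token": str, "server_hostname": str, "http_path": str,
    "statement_timeout": int, "host": str, "databases": list,
    "information_catalog": str, "output_path": str,
    "mask": bool, "compress": bool, "meta_only": bool,
}

def validate_config(path: str) -> dict:
    """Load the JSON config and raise if a key is missing or mistyped."""
    with open(path) as f:
        cfg = json.load(f)
    for key, typ in REQUIRED.items():
        if key not in cfg:
            raise ValueError(f"missing config key: {key}")
        if not isinstance(cfg[key], typ):
            raise TypeError(f"{key} should be {typ.__name__}")
    return cfg
```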
Step 5: Run the Collector
This code sample, kada_databricks_extractor.py, handles the configuration details and runs the extractor.

import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.databricks import Extractor

get_generic_logger('root')  # route collector logging to the root logger

_type = 'databricks'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA Databricks Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Fetch the extraction window (high water marks) for this named collector
start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

# Record the new high water mark so the next run resumes from here
publish_hwm(args.name, end_hwm)
Step 6: Check the Collector Outputs
K Extracts
A set of files (e.g. metadata, databaselog, linkages, events) will be generated in the output_path directory.
High Water Mark File
A high water mark file is created called databricks_hwm.txt.
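To inspect the high water mark between runs, something like the sketch below can be used. Note the single-line timestamp format assumed here is a guess, not a documented contract; in normal operation the file is managed by `get_hwm`/`publish_hwm` and should be treated as opaque.

```python
# Illustrative only: read the current high water mark, if one exists.
# The file's internal format is an assumption for this sketch.
from pathlib import Path
from typing import Optional

def read_hwm(path: str = "databricks_hwm.txt") -> Optional[str]:
    p = Path(path)
    if not p.exists():
        return None  # no previous successful run recorded
    return p.read_text().strip()
```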
Step 7: Push the Extracts to K
Once the files have been validated, you can push the files to the K landing directory.
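If the landing directory is reachable as a local or mounted path, the push can be as simple as copying the validated files across. This is an illustrative sketch (`push_extracts` is not a KADA function); your actual transfer mechanism, e.g. SFTP or a cloud-storage upload, may differ.

```python
# Sketch: copy every file from the extractor's output_path into the K landing
# directory, returning the names that were pushed.
import os
import shutil

def push_extracts(output_path: str, landing_dir: str) -> list:
    pushed = []
    os.makedirs(landing_dir, exist_ok=True)
    for name in sorted(os.listdir(output_path)):
        src = os.path.join(output_path, name)
        if os.path.isfile(src):  # skip subdirectories
            shutil.copy2(src, os.path.join(landing_dir, name))
            pushed.append(name)
    return pushed
```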