About Collectors
Pre-requisites
- Collector Server Minimum Requirements
- BigQuery Requirements:
  - Access to BigQuery
Step 1: Establish BigQuery Access
This step is performed by the Google Cloud Admin.

1. Create a Service Account in the Google Cloud console (IAM & Admin → Service Accounts):
   - Give the Service Account a name (e.g. KADA BQ Integration)
   - Select the projects that include the BigQuery instance(s) you want to catalog
   - Click Save
2. Create a Service Account key:
   - Click on the Service Account
   - Select the Keys tab and click Create new key
   - Select the JSON option. After clicking Create, the JSON key file will automatically download to your device.
3. Add permission grants on the Service Account by going to the IAM page:
   - Click Add
   - Add the Service Account to the 'New principals' field
   - Grant the following roles to this principal:
     - BigQuery Job User
     - BigQuery Metadata Viewer
     - BigQuery Read Session User
     - BigQuery Resource Viewer
   - Click Save
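Before moving on, it is worth sanity-checking the downloaded key file. The field names below are the standard Google Cloud service-account key fields; the validation helper itself is illustrative and not part of the collector:

```python
import json

# Fields present in every Google Cloud service-account key file
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def validate_key(path):
    """Parse a service-account key file and check it has the expected shape."""
    with open(path) as f:
        key = json.load(f)
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"Key file is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"Expected type 'service_account', got {key['type']!r}")
    return key
```

For example, `validate_key("kada-bq-key.json")` (a hypothetical file name) returns the parsed key, or raises if the file is not a service-account key.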
Step 2: Create the Source in K
Create a BigQuery source in K:
- Go to Settings, select Sources, and click Add Source
- Select the "Load from File system" option
- Give the source a name, e.g. BigQuery Production
- Add the host name for the BigQuery server
- Click Finish Setup
Step 3: Getting Access to the Source Landing Directory
Step 4: Install the Collector
You can download the latest Core Library and whl via Platform Settings → Sources → Download Collectors.

Run the following command to install the collector:

```shell
pip install kada_collectors_extractors_<version>-none-any.whl
```

You will also need to install the common library kada_collectors_lib for this collector to function properly:

```shell
pip install kada_collectors_lib-<version>-none-any.whl
```

Under the covers this uses the BigQuery Client API and may have OS dependencies; see https://cloud.google.com/bigquery/docs/reference/libraries
Step 5: Configure the Collector
| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| regions | list&lt;string&gt; | List of valid regions to inspect | "us" |
| projects | list&lt;string&gt; | List of project IDs to inspect across the specified regions | "kada-data" |
| host | string | The host that was onboarded in K for BigQuery | "bigquery" |
| json_credentials | JSON | Service account credentials JSON | {"type": "service_account", "project_id": "...", ...} |
| output_path | string | Absolute path to the output location | "/tmp/output" |
| mask | boolean | Whether to mask the extracted data | true |
| compress | boolean | Whether to gzip the output | true |
kada_bigquery_extractor_config.json

```json
{
    "regions": [],
    "projects": [],
    "host": "",
    "json_credentials": {},
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true
}
```
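Before running the collector, you can check that the config file parses and contains every field from the table above with the expected type. A minimal sketch (the helper is illustrative, not part of kada_collectors):

```python
import json

# Expected fields and their Python types, per the configuration table above
EXPECTED = {
    "regions": list,
    "projects": list,
    "host": str,
    "json_credentials": dict,
    "output_path": str,
    "mask": bool,
    "compress": bool,
}

def check_config(path):
    """Load the config JSON and verify every expected field is present and typed correctly."""
    with open(path) as f:
        config = json.load(f)
    for field, expected_type in EXPECTED.items():
        if field not in config:
            raise KeyError(f"Missing field: {field}")
        if not isinstance(config[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
    return config
```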
Step 6: Run the Collector
This is the wrapper script: kada_bigquery_extractor.py

```python
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.bigquery import Extractor

get_generic_logger('root')  # set up logging

_type = 'bigquery'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA BigQuery Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Fetch the last high water mark, run the extract for the window, then publish the new mark
start_hwm, end_hwm = get_hwm(args.name)
ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})
publish_hwm(args.name, end_hwm)
```
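As a rough mental model of what `get_hwm`/`publish_hwm` accomplish, a file-backed high-water-mark store might look like the following. This is a hypothetical stand-in written for illustration, not the kada_collectors implementation:

```python
import os
from datetime import datetime, timedelta, timezone

def get_hwm(name, state_dir=".", default_lookback_days=1):
    """Return (start, end) for this run: start is the last published mark,
    or a default lookback window on the first run; end is now."""
    path = os.path.join(state_dir, f"{name}_hwm.txt")
    end = datetime.now(timezone.utc)
    if os.path.exists(path):
        with open(path) as f:
            start = datetime.fromisoformat(f.read().strip())
    else:
        start = end - timedelta(days=default_lookback_days)
    return start, end

def publish_hwm(name, end_hwm, state_dir="."):
    """Persist the watermark so the next run resumes where this one stopped."""
    with open(os.path.join(state_dir, f"{name}_hwm.txt"), "w") as f:
        f.write(end_hwm.isoformat())
```

The point of the pattern is incremental extraction: each run only queries activity between the previous run's end mark and now.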
Step 7: Check the Collector Outputs
K Extracts
A set of files (e.g. metadata, databaselog, linkages, events) will be generated in the output_path directory.
High Water Mark File
A high water mark file called bigquery_hwm.txt is created; it records where this extract finished so the next run resumes from that point.
Step 8: Push the Extracts to K
Once the files have been validated, push them to the K landing directory identified in Step 3.
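The push itself is typically just a file copy from output_path into the landing directory. A minimal sketch, assuming the landing directory is mounted as a local path (the `push_extracts` helper is illustrative, not part of the collector):

```python
import glob
import os
import shutil

def push_extracts(output_path, landing_dir):
    """Copy every generated extract file from output_path into the K landing directory."""
    os.makedirs(landing_dir, exist_ok=True)
    pushed = []
    for src in sorted(glob.glob(os.path.join(output_path, "*"))):
        if os.path.isfile(src):
            shutil.copy2(src, landing_dir)
            pushed.append(os.path.basename(src))
    return pushed
```

If the landing directory is reached over SFTP or object storage instead, substitute the copy call with the appropriate client.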