About Collectors
Pre-requisites
Collector Server Minimum Requirements
Azure Data Factory Requirements
- Access to Azure Data Factory
Step 1: Create the Source in K
Create an Azure Data Factory source in K
- Go to Settings, select Sources, and click Add Source
- Select the "Load from File" option
- Give the source a Name, e.g. Azure Data Factory Production
- Add the Host name for the Azure Data Factory Server
- Click Finish Setup
Step 2: Getting Access to the Source Landing Directory
Step 3: Install the Collector
You can download the latest Core Library and .whl files via Platform Settings → Sources → Download Collectors.
Run the following command to install the collector.
pip install kada_collectors_extractors_<version>-none-any.whl
You will also need to install the common library kada_collectors_lib for this collector to function properly.
pip install kada_collectors_lib-<version>-none-any.whl
Step 4: Configure the Collector
| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| client | string | Onboarded client in Azure used to access ADF | |
| secret | string | Onboarded client secret in Azure used to access ADF | |
| tenant | string | Tenant ID where the ADF exists | |
| subscription_id | string | Azure subscription the ADF is associated with | |
| resource_group_name | string | Azure resource group the ADF is associated with | |
| factory_name | string | The name of the ADF factory | |
| output_path | string | Absolute path to the output location | "/tmp/output" |
| mask | boolean | Whether to enable masking | true |
| timeout | integer | Timeout in seconds allowed against the ADF APIs | 20 |
| mapping | json | Mapping of data source names to the onboarded host and database name in K | {"myDSN": {"host": "myhost", "database": "mydatabase"}} |
| compress | boolean | Whether to compress the output | true |
| active_days | integer | A pipeline must have run within this many days from today to be considered active | 60 |
kada_adf_extractor_config.json
{
    "client": "",
    "secret": "",
    "tenant": "",
    "subscription_id": "",
    "resource_group_name": "",
    "factory_name": "",
    "output_path": "/tmp/output",
    "mask": true,
    "timeout": 20,
    "mapping": {
        "myDSN": {
            "host": "myhost",
            "database": "mydatabase"
        }
    },
    "compress": true,
    "active_days": 60
}
Step 5: Run the Collector
This is the wrapper script: kada_adf_extractor.py
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.adf import Extractor

get_generic_logger('root') # set up logging

_type = 'adf'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA ADF Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Fetch the extraction window: from the last recorded high water mark to now
start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

# Record the end of the processed window so the next run resumes from here
publish_hwm(args.name, end_hwm)
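Both flags in the wrapper are optional: --config defaults to the config json sitting next to the script, and --name defaults to "adf". The snippet below reproduces just the wrapper's argument handling to make the defaults visible; the paths and the "adf_prod" instance name are illustrative, not prescribed by the collector.

```python
import argparse

# Minimal reproduction of the wrapper's CLI (illustrative only): --config
# selects the config file, --name keys the high water mark for this instance.
parser = argparse.ArgumentParser(description='KADA ADF Extractor.')
parser.add_argument('--config', '-c', dest='config', default='kada_adf_extractor_config.json')
parser.add_argument('--name', '-n', dest='name', default='adf')

# Running a second factory under its own --name keeps its high water mark
# separate from the default "adf" instance (example values are placeholders):
args = parser.parse_args(['--config', '/etc/kada/adf_prod.json', '--name', 'adf_prod'])
```

Giving each factory its own name matters because the high water mark file (see Step 6) is tracked per collector name.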
Step 6: Check the Collector Outputs
K Extracts
A set of files (e.g. metadata, databaselog, linkages, events, etc.) will be generated in the output_path directory.
High Water Mark File
A high water mark file is created called adf_hwm.txt.
Step 7: Push the Extracts to K
Once the files have been validated, you can push the files to the K landing directory.