K Knowledge Base

Azure Data Factory (via Collector method) - v3.0

About Collectors


Prerequisites

  • Python 3.6 - 3.10

  • Access to K landing directory

  • Access to Azure Data Factory (see section below)

Azure Data Factory Permissions

  • Refer to Steps 1 to 3 in the Azure Data Factory direct connect documentation


Step 1: Create the Source in K

Create an Azure Data Factory source in K

  • Go to Settings, select Sources, and click Add Source

  • Select "Load from File" option

  • Give the source a Name - e.g. Azure Data Factory Production

  • Add the Host name for the Azure Data Factory Server

  • Click Finish Setup


Step 2: Getting Access to the Source Landing Directory


Step 3: Install the Collector

You can download the latest Core Library and whl via Platform Settings → Sources → Download Collectors

Run the following command to install the collector.

pip install kada_collectors_extractors_<version>-none-any.whl

You will also need to install the common library kada_collectors_lib for this collector to function properly.

pip install kada_collectors_lib-<version>-none-any.whl
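After installing both wheels, a quick importability check can confirm the environment is set up. This is a minimal sketch; the module paths are taken from the wrapper script in Step 5 and may differ in your installed version.

```python
import importlib.util

def missing_modules(*names):
    """Return the module names from *names that cannot be imported."""
    missing = []
    for name in names:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except ModuleNotFoundError:
            # Parent package itself is not installed.
            missing.append(name)
    return missing

# Module paths as used by the wrapper script in Step 5.
missing = missing_modules("kada_collectors.extractors.utils",
                          "kada_collectors.extractors.adf")
if missing:
    print("Not installed:", ", ".join(missing))
```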

Step 4: Configure the Collector

FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE
----- | ---------- | ----------- | -------
client | string | Onboarded client in Azure to access ADF |
secret | string | Onboarded client secret in Azure to access ADF |
tenant | string | Tenant ID of where ADF exists |
subscription_id | string | Subscription in Azure which the ADF is associated to |
resource_group_name | string | Resource group in Azure which the ADF is associated to |
factory_name | string | The name of the ADF factory |
output_path | string | Absolute path to the output location | "/tmp/output"
mask | boolean | To enable masking or not | true
timeout | integer | Timeout in seconds allowed against the ADF APIs | 20
mapping | json | Mapping file of data source names against the onboarded host and database name in K | {"myDSN": {"host": "myhost", "database": "mydatabase"}}
compress | boolean | To compress the output | true
kada_adf_extractor_config.json

JSON
{
    "client": "",
    "secret": "",
    "tenant": "",
    "subscription_id": "",
    "resource_group_name": "",
    "factory_name": "",
    "output_path": "/tmp/output",
    "mask": true,
    "timeout": 20,
    "mapping": {
        "myDSN": {
            "host": "myhost",
            "database": "mydatabase"
        }
    },
    "compress": true
}
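Before running the collector, the config file can be sanity-checked against the fields in the table above. This is a minimal sketch (the collector performs its own validation); `validate_config` is a helper written here for illustration.

```python
import json

# Required fields and their expected JSON types, per the table above.
REQUIRED = {
    "client": str, "secret": str, "tenant": str,
    "subscription_id": str, "resource_group_name": str, "factory_name": str,
    "output_path": str, "mask": bool, "timeout": int,
    "mapping": dict, "compress": bool,
}

def validate_config(path):
    """Return a list of problems found in the extractor config file."""
    with open(path) as f:
        config = json.load(f)
    problems = []
    for field, expected in REQUIRED.items():
        if field not in config:
            problems.append("missing field: {}".format(field))
        elif not isinstance(config[field], expected):
            problems.append("{} should be {}".format(field, expected.__name__))
    return problems
```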

Step 5: Run the Collector

This is the wrapper script: kada_adf_extractor.py

Python
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.adf import Extractor

get_generic_logger('root')

_type = 'adf'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA ADF Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
args = parser.parse_args()

start_hwm, end_hwm = get_hwm(_type)

ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

publish_hwm(_type, end_hwm)

Step 6: Check the Collector Outputs

K Extracts

A set of files (e.g. metadata, databaselog, linkages, events) will be generated in the output_path directory.
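To confirm the run produced output, you can list what landed in output_path. A minimal sketch; the exact file names vary by collector version, and `list_extracts` is a helper written here for illustration.

```python
import os

def list_extracts(output_path):
    """Return the files generated in the output directory, newest first."""
    files = [os.path.join(output_path, name)
             for name in os.listdir(output_path)
             if os.path.isfile(os.path.join(output_path, name))]
    return sorted(files, key=os.path.getmtime, reverse=True)
```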

High Water Mark File

A high water mark file is created called adf_hwm.txt.


Step 7: Push the Extracts to K

Once the files have been validated, you can push the files to the K landing directory.
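As one approach, the validated files can be copied with a small script. This is a sketch under the assumption that the landing directory is mounted as a local path; `landing_dir` is a placeholder for your K landing location, which may instead be an object store or SFTP endpoint requiring its own client.

```python
import glob
import os
import shutil

def push_to_landing(output_path, landing_dir):
    """Copy the generated extract files into the K landing directory.

    Assumes landing_dir is a locally accessible path (a placeholder here).
    """
    os.makedirs(landing_dir, exist_ok=True)
    pushed = []
    for path in glob.glob(os.path.join(output_path, "*")):
        if os.path.isfile(path):
            shutil.copy2(path, landing_dir)
            pushed.append(os.path.basename(path))
    return pushed
```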


Example: Using Airflow to orchestrate the Extract and Push to K