About Collectors
Collectors are extractors that are developed and managed by you (a customer of K).
KADA provides python libraries that customers can use to quickly deploy a Collector.
Why you should use a Collector
There are several reasons why you may use a collector vs the direct connect extractor:
-
You are using the KADA SaaS offering and it cannot connect to your sources due to firewall restrictions
-
You want to push metadata to KADA rather than allow it to pull data for security reasons
-
You want to inspect the metadata before pushing it to K
Using a collector requires you to manage:
-
Deploying and orchestrating the extract code
-
Managing a high water mark so the extract only pulls the latest metadata
-
Storing and pushing the extracts to your K instance
Pre-Requisites
Collector Server Minimum Requirements
For the collector to operate effectively, it will need to be deployed on a server with the below minimum specifications:
-
CPU: 2 vCPU
-
Memory: 8GB
-
Storage: 30GB (depends on historical data extracted)
-
OS: unix distro e.g. RHEL preferred but can also work with Windows Server
-
Python 3.10.x or later
-
Access to K landing directory
Athena Requirements
-
Access to Athena
Step 1: Establish Athena Access
It is advised you create a new Role and a separate s3 bucket for the service user provided to KADA and have a policy that allows the below, see Identity and access management in Athena - Amazon Athena
The service user/account/role will require permissions to the following
-
Execute queries against Athena with access to the INFORMATION_SCHEMA in particular the following tables:
-
information_schema.views
-
information_schema.tables
-
information_schema.columns
-
-
Executing queries in Athena requires an s3 bucket to temporarily store results. We will also require the policy to allow Read Write Listing access to objects within that bucket, conversely, the bucket must also have policy to allow to do the same.
-
Call the following Athena APIs (Note that access to Athena metadata through the below APIs will also require access to the Glue catalog).
-
The service user/account/role will need permissions to access all workgroups to be able to extract all data, if you omit workgroups, that information will not be extracted and you may not see the complete picture in K.
-
See IAM policies for accessing workgroups - Amazon Athena on how to add policy entries to have fine grain control at the workgroup level. Note that the extractor runs queries on Athena, If you do choose to restrict workgroup access, ensure that Query based actions (e.g. StartQueryExecution) are allowed for the workgroup the service user/account/role is associated to.
Note that user usage will be associated to the workgroup level rather than individual users, these workgroups are published as users in K in the form "athena_workgroup_<name>"
Example Role Policy to allow Athena Access with least privileges for actions, this example allows the ACCOUNT ARN to assume the role. Note the variables ATHENA RESULTS BUCKET NAME. You may also choose to just assign the policy directly to a new user and use that user without assuming roles. In the scenario you do wish to assume a role, please note down the role ARN to be used when onboarding/extracting.
AWSTemplateFormatVersion: "2010-09-09"
Description: 'AWS IAM Role - Athena Access to KADA'
Resources:
KadaAthenaRole:
Type: "AWS::IAM::Role"
Properties:
RoleName: "KadaAthenaRole"
MaxSessionDuration: 43200
Path: "/"
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: "Allow"
Principal:
AWS: "[ACCOUNT ARN]"
Action: "sts:AssumeRole"
KadaAthenaPolicy:
Type: 'AWS::IAM::Policy'
Properties:
PolicyName: root
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- athena:BatchGetQueryExecution
- athena:GetQueryExecution
- athena:GetQueryResults
- athena:GetQueryResultsStream
- athena:ListQueryExecutions
- athena:StartQueryExecution
- athena:ListWorkGroups
- athena:ListDataCatalogs
- athena:ListDatabases
- athena:ListTableMetadata
Resource: '*'
- Effect: Allow
Action:
- glue:GetDatabase
- glue:GetDatabases
- glue:GetTable
- glue:GetTables
- glue:GetPartition
- glue:GetPartitions
Resource: '*'
- Effect: Allow
Action:
- s3:GetBucketLocation
- s3:GetObject
- s3:ListBucket
- s3:ListBucketMultipartUploads
- s3:ListMultipartUploadParts
- s3:AbortMultipartUpload
- s3:PutObject
- s3:PutBucketPublicAccessBlock
- s3:DeleteObject
Resource:
- arn:aws:s3:::[ATHENA RESULTS BUCKET NAME]
Roles:
- !Ref KadaAthenaRole
Alternatively, the following managed policy will also provide the necessary permissions for the collector - https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonAthenaFullAccess.html
aws iam attach-role-policy \
--role-name YOUR_ROLE_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess
After this step you should have the following information
-
Athena User
-
Role
-
Key
-
Secret
-
Athena S3 bucket location
Step 2: Create the Source in K
Create an Athena source in K
-
Go to Settings, Select Sources and click Add Source
-
Select "Load from File system" option
-
Give the source a Name - e.g. Athena Production
-
Add the Host name for the Athena Server
-
Click Finish Setup
Step 3: Getting Access to the Source Landing Directory
When using a Collector you will push metadata to a K landing directory.
To find your landing directory you will need to:
-
Go to Platform Settings - Settings. Note down the value of this setting:
-
If using Azure: storage_azure_storage_account
-
If using AWS:
-
storage_root_folder - the AWS s3 bucket
-
storage_aws_region - the region where the AWS s3 bucket is hosted
-
-
-
Go to Sources - Edit the Source you have configured. Note down the landing directory in the About this Source section.
To connect to the landing directory you will need:
-
If using Azure: a SAS token to push data to the landing directory. Request this from KADA Support (support@kada.ai)
-
If using AWS:
-
An Access key and Secret. Request this from KADA Support (support@kada.ai)
-
OR provide your IAM role to KADA Support to provision access.
-
Step 4: Install the Collector
It is recommended to use a python environment such as pyenv or pipenv if you are not intending to install this package at the system level.
Some python packages also have dependencies on the OS level packages, so you may be required to install additional OS packages if the below fails to install.
You can download the Latest Core Library and Athena whl via Platform Settings → Sources → Download Collectors
Run the following command to install the collector
pip install kada_collectors_extractors_<version>-none-any.whl
You will also need to install the common library kada_collectors_lib for this collector to function properly.
pip install kada_collectors_lib-<version>-none-any.whl
Under the covers this uses boto3 and may have OS dependencies see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
Step 5: Configure the Collector
The collector requires a set of parameters to connect to and extract metadata from Athena
|
FIELD |
FIELD TYPE |
DESCRIPTION |
EXAMPLE |
|---|---|---|---|
|
key |
string |
Key for the AWS user |
"xcvsdsdfsdf" |
|
secret |
string |
Secret for the AWS user |
"sgsdfdsfg" |
|
server |
string |
This is the host that was onboarded in K for Athena |
"athena.cloud" |
|
bucket |
string |
Bucket location to temporarily store Athena query results |
"s3://mybucket/myathenaresults" |
|
catalogs |
list<string> |
List of catalogs to extract from Athena |
["AwsDataCatalog"] |
|
region |
string |
Set the region for AWS for where Athena exists |
ap-southeast-2 |
|
role |
string |
If your access requires role assumption, place the full arn value here, otherwise leave it blank |
"" |
|
output_path |
string |
Absolute path to the output location where files are to be written |
"/tmp/output" |
|
mask |
boolean |
To enable masking or not |
true |
|
compress |
boolean |
To gzip the output or not |
true |
kada_athena_extractor_config.json
{
"key": "",
"secret": "",
"server": "athena",
"bucket": "s3://examplebucket/examplefolder",
"catalogs": ["AwsDataCatalog"],
"region": "ap-southeast-2",
"role": "",
"output_path": "/tmp/output",
"mask": true,
"compress": true
}
Step 6: Run the Collector
This is the wrapper script: kada_athena_extractor.py
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.athena import Extractor
get_generic_logger('root')
_type = 'athena'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))
parser = argparse.ArgumentParser(description='KADA Athena Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()
start_hwm, end_hwm = get_hwm(args.name)
ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})
publish_hwm(args.name, end_hwm)
Step 7: Check the Collector Outputs
K Extracts
A set of files (eg metadata, databaselog, linkages, events etc) will be generated in the output_path directory.
High Water Mark File
A high water mark file is created called athena_hwm.txt.
Refer to Collector Integration General Notes for more information
Step 8: Push the Extracts to K
Once the files have been validated, you can push the files to the K landing directory.
Example: Using Airflow to orchestrate the Extract and Push to K
The following example is how you can orchestrate the Tableau collector using Airflow and push the files to K hosted on Azure. The code is not expected to be used as-is but as a template for your own DAG.
# built-in
import os
# Installed
from airflow.operators.python_operator import PythonOperator
from airflow.models.dag import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup
from plugins.utils.azure_blob_storage import AzureBlobStorage
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.tableau import Extractor
# To be configured by the customer.
# Note variables may change if using a different object store.
KADA_SAS_TOKEN = os.getenv("KADA_SAS_TOKEN")
KADA_CONTAINER = ""
KADA_STORAGE_ACCOUNT = ""
KADA_LANDING_PATH = "lz/tableau/landing"
KADA_EXTRACTOR_CONFIG = {
"server_address": "http://tabserver",
"username": "user",
"password": "password",
"sites": [],
"db_host": "tabserver",
"db_username": "repo_user",
"db_password": "repo_password",
"db_port": 8060,
"db_name": "workgroup",
"meta_only": False,
"retries": 5,
"dry_run": False,
"output_path": "/set/to/output/path",
"mask": True,
"mapping": {}
}
# To be implemented by the customer.
# Upload to your landing zone storage.
# Change '.csv' to '.csv.gz' if you set compress = true in the config
def upload():
output = KADA_EXTRACTOR_CONFIG['output_path']
for filename in os.listdir(output):
if filename.endswith('.csv'):
file_to_upload_path = os.path.join(output, filename)
AzureBlobStorage.upload_file_sas_token(
client=KADA_SAS_TOKEN,
storage_account=KADA_STORAGE_ACCOUNT,
container=KADA_CONTAINER,
blob=f'{KADA_LANDING_PATH}/{filename}',
local_path=file_to_upload_path
)
with DAG(dag_id="taskgroup_example", start_date=days_ago(1)) as dag:
# To be implemented by the customer.
# Retrieve the timestamp from the prior run
start_hwm = 'YYYY-MM-DD HH:mm:SS'
end_hwm = 'YYYY-MM-DD HH:mm:SS' # timestamp now
ext = Extractor(**KADA_EXTRACTOR_CONFIG)
start = DummyOperator(task_id="start")
with TaskGroup("taskgroup_1", tooltip="extract tableau and upload") as extract_upload:
task_1 = PythonOperator(
task_id="extract_tableau",
python_callable=ext.run,
op_kwargs={"start_hwm": start_hwm, "end_hwm": end_hwm},
provide_context=True,
)
task_2 = PythonOperator(
task_id="upload_extracts",
python_callable=upload,
op_kwargs={},
provide_context=True,
)
# To be implemented by the customer.
# Timestamp needs to be saved for next run
task_3 = DummyOperator(task_id='save_hwm')
end = DummyOperator(task_id='end')
start >> extract_upload >> end