About Collectors
Pre-Requisites
Collector Server Minimum Requirements
Athena Requirements
- Access to Athena
Step 1: Establish Athena Access
We advise creating a new role and a separate S3 bucket for the service user provided to KADA, with a policy that allows the permissions below. See Identity and access management in Athena - Amazon Athena.
The service user/account/role requires permissions to do the following:
- Execute queries against Athena with access to INFORMATION_SCHEMA, in particular the following tables:
  - information_schema.views
  - information_schema.tables
  - information_schema.columns
- Executing queries in Athena requires an S3 bucket to temporarily store results. The policy must allow read, write, and list access to objects within that bucket, and the bucket's own policy must allow the same.
- Call the following Athena APIs. Note that access to Athena metadata through these APIs also requires access to the Glue catalog.
- The service user/account/role needs permission to access all workgroups in order to extract all data. If you omit workgroups, their information will not be extracted and you may not see the complete picture in K.
- See IAM policies for accessing workgroups - Amazon Athena for how to add policy entries for fine-grained control at the workgroup level. The extractor runs queries on Athena, so if you restrict workgroup access, ensure that query-based actions (e.g. StartQueryExecution) are allowed for the workgroup the service user/account/role is associated with.
Note that usage is associated with workgroups rather than individual users. These workgroups are published as users in K in the form "athena_workgroup_<name>".
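The access requirements above can be smoke-tested before the collector is run. The sketch below only builds one probe query per INFORMATION_SCHEMA table listed above and derives the K user name for a workgroup. The helper names (`preflight_queries`, `k_user_for_workgroup`) and the `LIMIT 1` probe are illustrative assumptions, not part of the KADA collector.

```python
# Hypothetical helpers for checking the access described above.
# Nothing here is part of the KADA collector API.

INFORMATION_SCHEMA_TABLES = ("views", "tables", "columns")


def preflight_queries():
    """Return one cheap probe query per INFORMATION_SCHEMA table listed above."""
    return [
        "SELECT * FROM information_schema.{} LIMIT 1".format(table)
        for table in INFORMATION_SCHEMA_TABLES
    ]


def k_user_for_workgroup(workgroup_name):
    """Return the user name under which K publishes usage for a workgroup."""
    return "athena_workgroup_{}".format(workgroup_name)


if __name__ == "__main__":
    for sql in preflight_queries():
        print(sql)
    print(k_user_for_workgroup("primary"))
```

Running each probe query as the service user, in the workgroup and results bucket intended for the collector, is one way to confirm both the query permission and the S3 access in a single step.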
The example role policy below allows Athena access with least-privilege actions and allows the ACCOUNT ARN to assume the role. Replace the variables [ACCOUNT ARN] and [ATHENA RESULTS BUCKET NAME] with your own values. Alternatively, you can assign the policy directly to a new user and use that user without assuming a role. If you do assume a role, note down the role ARN to use when onboarding/extracting.
AWSTemplateFormatVersion: "2010-09-09"
Description: 'AWS IAM Role - Athena Access to KADA'
Resources:
  KadaAthenaRole:
    Type: "AWS::IAM::Role"
    Properties:
      RoleName: "KadaAthenaRole"
      MaxSessionDuration: 43200
      Path: "/"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              AWS: "[ACCOUNT ARN]"
            Action: "sts:AssumeRole"
  KadaAthenaPolicy:
    Type: 'AWS::IAM::Policy'
    Properties:
      PolicyName: root
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - athena:BatchGetQueryExecution
              - athena:GetQueryExecution
              - athena:GetQueryResults
              - athena:GetQueryResultsStream
              - athena:ListQueryExecutions
              - athena:StartQueryExecution
              - athena:ListWorkGroups
              - athena:ListDataCatalogs
              - athena:ListDatabases
              - athena:ListTableMetadata
            Resource: '*'
          - Effect: Allow
            Action:
              - glue:GetDatabase
              - glue:GetDatabases
              - glue:GetTable
              - glue:GetTables
              - glue:GetPartition
              - glue:GetPartitions
            Resource: '*'
          - Effect: Allow
            Action:
              - s3:GetBucketLocation
              - s3:GetObject
              - s3:ListBucket
              - s3:ListBucketMultipartUploads
              - s3:ListMultipartUploadParts
              - s3:AbortMultipartUpload
              - s3:PutObject
              - s3:PutBucketPublicAccessBlock
              - s3:DeleteObject
            Resource:
              - arn:aws:s3:::[ATHENA RESULTS BUCKET NAME]
              # Object-level actions (e.g. s3:GetObject, s3:PutObject) apply to the objects in the bucket
              - arn:aws:s3:::[ATHENA RESULTS BUCKET NAME]/*
      Roles:
        - !Ref KadaAthenaRole
Alternatively, the following managed policy will also provide the necessary permissions for the collector - https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonAthenaFullAccess.html
aws iam attach-role-policy \
--role-name YOUR_ROLE_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess
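If you take the role-assumption route, the credential exchange can be sketched with the boto3 STS client. This is a generic illustration, not the collector's own implementation: the helper name `assume_role_credentials` and the session name are assumptions, and the STS client is passed in as an argument so the function can be exercised without AWS access.

```python
# A minimal sketch, assuming boto3 is installed where this is actually used.
# The helper name and the default session name are hypothetical.


def assume_role_credentials(sts_client, role_arn, session_name="kada-athena-check"):
    """Assume role_arn and return keyword arguments suitable for boto3.client(...)."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    credentials = response["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


# Example usage (needs boto3 and valid credentials, so it is left commented out):
#   import boto3
#   sts = boto3.client("sts", aws_access_key_id=KEY, aws_secret_access_key=SECRET)
#   athena = boto3.client("athena", region_name="ap-southeast-2",
#                         **assume_role_credentials(sts, ROLE_ARN))
#   print(athena.list_work_groups()["WorkGroups"])
```

Listing workgroups with the assumed credentials is a quick check that the role, key, and secret noted in this step work together.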
After this step you should have the following information:
- Athena User
- Role
- Key
- Secret
- Athena S3 bucket location
Step 2: Create the Source in K
Create an Athena source in K:
- Go to Settings, select Sources and click Add Source
- Select the "Load from File system" option
- Give the source a Name, e.g. Athena Production
- Add the Host name for the Athena Server
- Click Finish Setup
Step 3: Getting Access to the Source Landing Directory
Step 4: Install the Collector
We recommend using a Python environment manager such as pyenv or pipenv if you do not intend to install this package at the system level.
Some Python packages depend on OS-level packages, so you may need to install additional OS packages if the installation below fails.
You can download the latest Core Library and Athena whl via Platform Settings → Sources → Download Collectors.
Run the following command to install the collector:
pip install kada_collectors_extractors_<version>-none-any.whl
You will also need to install the common library kada_collectors_lib for this collector to function properly.
pip install kada_collectors_lib-<version>-none-any.whl
Under the covers the collector uses boto3, which may have OS dependencies. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
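After installing both wheels, a quick import check can confirm that the environment is usable. The sketch below is an assumption about how you might verify the install, not a KADA tool. It checks only top-level module names: `kada_collectors` is the package imported by the wrapper script later in this guide, and `boto3` is the dependency mentioned above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["kada_collectors", "boto3"])
    print("Missing modules:", missing or "none")
```

An empty result suggests the wheels landed in the active environment. A non-empty result usually means pip installed into a different interpreter or virtual environment.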
Step 5: Configure the Collector
The collector requires a set of parameters to connect to and extract metadata from Athena
| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| key | string | Key for the AWS user | "xcvsdsdfsdf" |
| secret | string | Secret for the AWS user | "sgsdfdsfg" |
| server | string | This is the host that was onboarded in K for Athena | "athena.cloud" |
| bucket | string | Bucket location to temporarily store Athena query results | "s3://mybucket/myathenaresults" |
| catalogs | list<string> | List of catalogs to extract from Athena | ["AwsDataCatalog"] |
| region | string | Set the region for AWS for where Athena exists | ap-southeast-2 |
| role | string | If your access requires role assumption, place the full arn value here, otherwise leave it blank | "" |
| output_path | string | Absolute path to the output location where files are to be written | "/tmp/output" |
| mask | boolean | To enable masking or not | true |
| compress | boolean | To gzip the output or not | true |
kada_athena_extractor_config.json
{
    "key": "",
    "secret": "",
    "server": "athena",
    "bucket": "s3://examplebucket/examplefolder",
    "catalogs": ["AwsDataCatalog"],
    "region": "ap-southeast-2",
    "role": "",
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true
}
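A typo in the config file is easier to catch before the collector runs. The sketch below checks a loaded config against the field names and types in the parameter table above. The helper names are hypothetical, and the `s3://` prefix check is an assumption based on the example values rather than a documented rule.

```python
import json

# Field names and types are taken from the parameter table above.
EXPECTED_TYPES = {
    "key": str,
    "secret": str,
    "server": str,
    "bucket": str,
    "catalogs": list,
    "region": str,
    "role": str,
    "output_path": str,
    "mask": bool,
    "compress": bool,
}


def config_problems(config):
    """Return a list of readable problems; an empty list means no issue was found."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in config:
            problems.append("missing field: {}".format(field))
        elif not isinstance(config[field], expected):
            problems.append("{} should be {}".format(field, expected.__name__))
    bucket = config.get("bucket")
    # Assumption: the bucket is given as an s3:// URI, as in the examples above.
    if isinstance(bucket, str) and not bucket.startswith("s3://"):
        problems.append("bucket should start with s3://")
    return problems


def load_and_check(path):
    """Load a JSON config file and return (config, problems)."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    return config, config_problems(config)
```

For example, `load_and_check("kada_athena_extractor_config.json")` returns the parsed config together with any problems found, so a wrapper can stop early when the list is not empty.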
Step 6: Run the Collector
This is the wrapper script: kada_athena_extractor.py
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.athena import Extractor

# Set up logging using the helper from the collectors library
get_generic_logger('root')

_type = 'athena'
# By default, look for the config file next to this script
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA Athena Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Fetch the start and end high water marks for this run
start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

# Publish the end high water mark once the run has finished
publish_hwm(args.name, end_hwm)
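The wrapper script accepts two optional flags, `--config` and `--name`, both visible in its argument parser. A minimal sketch of invoking it from another Python process follows. The helper names and the config path in the usage comment are hypothetical, and the script is assumed to sit in the current working directory.

```python
import subprocess
import sys


def build_command(config_path, name="athena", script="kada_athena_extractor.py"):
    """Build the argv list for running the wrapper script with the current interpreter."""
    return [sys.executable, script, "--config", config_path, "--name", name]


def run_collector(config_path, name="athena"):
    """Run the wrapper and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_command(config_path, name), check=True)


# Example usage (hypothetical path, left commented out because it needs the collector installed):
#   run_collector("/opt/kada/kada_athena_extractor_config.json")
```

Using `sys.executable` keeps the run inside the same virtual environment the collector was installed into.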
Step 7: Check the Collector Outputs
K Extracts
A set of files (e.g. metadata, databaselog, linkages, events, etc.) will be generated in the output_path directory.
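One way to check the outputs is to list what was written and flag anything empty. The sketch below is a generic directory summary under the assumption that extracts are plain files under output_path. It does not know the collector's file naming, and the helper names are hypothetical.

```python
from pathlib import Path


def summarize_outputs(output_path):
    """Return (relative file name, size in bytes) pairs under output_path, sorted by name."""
    root = Path(output_path)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


def empty_files(output_path):
    """Return the names of zero-byte files, which may deserve a second look."""
    return [name for name, size in summarize_outputs(output_path) if size == 0]


if __name__ == "__main__":
    # "/tmp/output" matches the output_path used in the example config.
    for name, size in summarize_outputs("/tmp/output"):
        print("{:>12}  {}".format(size, name))
```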
High Water Mark File
A high water mark file called athena_hwm.txt is created.
Step 8: Push the Extracts to K
Once the files have been validated, you can push the files to the K landing directory.