K Knowledge Base

Athena (via Collector method) - v3.0.0

About Collectors


Pre-Requisites

Collector Server Minimum Requirements

Athena Requirements

  • Access to Athena


Step 1: Establish Athena Access

We recommend creating a new role and a separate S3 bucket for the service user provided to KADA, with a policy that allows the permissions below. See Identity and access management in Athena - Amazon Athena.

The service user/account/role requires permissions to do the following:

  1. Execute queries against Athena with access to INFORMATION_SCHEMA, in particular the following tables:

    1. information_schema.views

    2. information_schema.tables

    3. information_schema.columns

  2. Executing queries in Athena requires an S3 bucket to temporarily store results. The policy must allow read, write and list access to objects within that bucket, and the bucket policy must in turn allow the service user/account/role the same access.

  3. Call the following Athena APIs (Note that access to Athena metadata through the below APIs will also require access to the Glue catalog).

    1. BatchGetQueryExecution

    2. GetQueryExecution

    3. GetQueryResults

    4. ListQueryExecutions

    5. StartQueryExecution

    6. ListWorkGroups

    7. ListDataCatalogs

    8. ListDatabases

    9. ListTableMetadata

  4. The service user/account/role needs permission to access all workgroups in order to extract all data. If you omit workgroups, their information will not be extracted and you may not see the complete picture in K.

  5. See IAM policies for accessing workgroups - Amazon Athena for how to add policy entries for fine-grained control at the workgroup level. Note that the extractor runs queries on Athena. If you do choose to restrict workgroup access, ensure that query-based actions (e.g. StartQueryExecution) are allowed for the workgroup the service user/account/role is associated with.
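
The permissions listed above can be smoke-tested before running the collector. The following is a minimal sketch, not part of the KADA collector: it assumes a boto3 Athena client is passed in, and the function names, the "primary" workgroup default and the LIMIT 1 probe queries are illustrative assumptions. It ignores pagination, so only the first page of workgroups and catalogs is listed.

```python
from typing import Any, Dict, List

# Metadata tables the service user/account/role must be able to query.
INFORMATION_SCHEMA_TABLES = ("views", "tables", "columns")


def build_probe_queries(limit: int = 1) -> List[str]:
    """Build one small SELECT per information_schema table (illustrative only)."""
    return [
        f"SELECT * FROM information_schema.{table} LIMIT {limit}"
        for table in INFORMATION_SCHEMA_TABLES
    ]


def check_athena_access(
    client: Any, output_location: str, workgroup: str = "primary"
) -> Dict[str, List[str]]:
    """Exercise ListWorkGroups, ListDataCatalogs and StartQueryExecution.

    `client` is assumed to behave like boto3.client("athena").
    A permission problem surfaces as an exception from the failing call.
    """
    workgroups = [wg["Name"] for wg in client.list_work_groups()["WorkGroups"]]
    catalogs = [
        c["CatalogName"] for c in client.list_data_catalogs()["DataCatalogsSummary"]
    ]
    query_ids = []
    for sql in build_probe_queries():
        response = client.start_query_execution(
            QueryString=sql,
            ResultConfiguration={"OutputLocation": output_location},
            WorkGroup=workgroup,
        )
        query_ids.append(response["QueryExecutionId"])
    return {
        "workgroups": workgroups,
        "catalogs": catalogs,
        "query_execution_ids": query_ids,
    }
```

With boto3 installed, the client could be created with boto3.client("athena", region_name="ap-southeast-2"). The sketch only submits the probe queries; polling GetQueryExecution for their final state is left out.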

Note that usage is attributed at the workgroup level rather than to individual users. These workgroups are published as users in K in the form "athena_workgroup_<name>".
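
To illustrate the naming convention, here is a small sketch that maps a workgroup name to the user name published in K and counts queries per published user. Only the "athena_workgroup_<name>" format comes from this page; the function names are hypothetical, the "WorkGroup" key mirrors the field Athena returns for a query execution, and how K aggregates usage internally is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def workgroup_user_name(workgroup: str) -> str:
    """Return the user name under which a workgroup is published in K."""
    return f"athena_workgroup_{workgroup}"


def count_queries_per_user(
    query_executions: Iterable[Mapping[str, object]]
) -> Dict[str, int]:
    """Count query executions per published user (illustrative aggregation)."""
    counts = Counter(
        workgroup_user_name(str(execution["WorkGroup"]))
        for execution in query_executions
    )
    return dict(counts)
```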

Example role policy that allows Athena access with least privilege for actions. This example allows the ACCOUNT ARN to assume the role. Replace the placeholders [ACCOUNT ARN] and [ATHENA RESULTS BUCKET NAME] with your own values. You may also choose to assign the policy directly to a new user and use that user without assuming a role. If you do wish to assume a role, note down the role ARN to be used when onboarding/extracting.

AWSTemplateFormatVersion: "2010-09-09"
Description: 'AWS IAM Role - Athena Access to KADA'
Resources: 
  KadaAthenaRole: 
    Type: "AWS::IAM::Role"
    Properties: 
      RoleName: "KadaAthenaRole"
      MaxSessionDuration: 43200
      Path: "/"
      AssumeRolePolicyDocument: 
        Version: "2012-10-17"
        Statement: 
        - Effect: "Allow"
          Principal:
            AWS: "[ACCOUNT ARN]"
          Action: "sts:AssumeRole"

  KadaAthenaPolicy: 
    Type: 'AWS::IAM::Policy'
    Properties:
      PolicyName: root
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: 
              - athena:BatchGetQueryExecution
              - athena:GetQueryExecution
              - athena:GetQueryResults
              - athena:GetQueryResultsStream
              - athena:ListQueryExecutions
              - athena:StartQueryExecution
              - athena:ListWorkGroups
              - athena:ListDataCatalogs
              - athena:ListDatabases
              - athena:ListTableMetadata
            Resource: '*'
          - Effect: Allow
            Action: 
              - glue:GetDatabase
              - glue:GetDatabases
              - glue:GetTable
              - glue:GetTables
              - glue:GetPartition
              - glue:GetPartitions
            Resource: '*'
          - Effect: Allow
            Action: 
              - s3:GetBucketLocation
              - s3:GetObject
              - s3:ListBucket
              - s3:ListBucketMultipartUploads
              - s3:ListMultipartUploadParts
              - s3:AbortMultipartUpload
              - s3:PutObject
              - s3:PutBucketPublicAccessBlock
              - s3:DeleteObject
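            Resource:
              # Bucket-level actions (e.g. s3:ListBucket) match the bucket ARN
              - arn:aws:s3:::[ATHENA RESULTS BUCKET NAME]
              # Object-level actions (e.g. s3:GetObject, s3:PutObject) match object ARNs
              - arn:aws:s3:::[ATHENA RESULTS BUCKET NAME]/*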
      Roles:
        - !Ref KadaAthenaRole
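
If you use role assumption, the role ARN noted above is exchanged for temporary credentials through AWS STS. The sketch below shows one way that could look; it is not the collector's own implementation. The function name and the default session name are hypothetical, and `sts_client` is assumed to behave like boto3.client("sts").

```python
from typing import Any, Dict


def assume_role_credentials(
    sts_client: Any, role_arn: str, session_name: str = "kada-athena-collector"
) -> Dict[str, str]:
    """Assume the role and return credentials as boto3 client keyword arguments."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,  # hypothetical session name
    )
    credentials = response["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

The returned dictionary can be unpacked into a client call, for example boto3.client("athena", region_name="ap-southeast-2", **assume_role_credentials(sts, role_arn)).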

Alternatively, the AWS managed policy AmazonAthenaFullAccess also provides the necessary permissions for the collector, although it grants broader access than the least-privilege example above: https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonAthenaFullAccess.html

aws iam attach-role-policy \
    --role-name YOUR_ROLE_NAME \
    --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess

After this step you should have the following information:

  • Athena User

  • Role

  • Key

  • Secret

  • Athena S3 bucket location


Step 2: Create the Source in K

Create an Athena source in K

  • Go to Settings, select Sources and click Add Source

  • Select the "Load from File system" option

  • Give the source a Name - e.g. Athena Production

  • Add the Host name for the Athena Server

  • Click Finish Setup


Step 3: Getting Access to the Source Landing Directory


Step 4: Install the Collector

We recommend using a Python environment manager such as pyenv or pipenv if you do not intend to install this package at the system level.

Some Python packages also depend on OS-level packages, so you may need to install additional OS packages if the installation below fails.

You can download the latest Core Library and Athena whl via Platform Settings → Sources → Download Collectors

Run the following command to install the collector

pip install kada_collectors_extractors_<version>-none-any.whl

You will also need to install the common library kada_collectors_lib for this collector to function properly.

pip install kada_collectors_lib-<version>-none-any.whl

Under the covers this uses boto3, which may have OS dependencies. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
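
After installing, a quick import check can confirm the packages are visible to the Python environment you intend to run the collector in. This is a minimal sketch: the helper name is hypothetical, and the top-level module names boto3 and kada_collectors are assumptions taken from the boto3 note above and the imports in the wrapper script in Step 6.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names; adjust if your installed packages differ.
    missing = missing_modules(["boto3", "kada_collectors"])
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All required modules were found.")
```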


Step 5: Configure the Collector

The collector requires a set of parameters to connect to and extract metadata from Athena

| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| key | string | Key for the AWS user | "xcvsdsdfsdf" |
| secret | string | Secret for the AWS user | "sgsdfdsfg" |
| server | string | The host that was onboarded in K for Athena | "athena.cloud" |
| bucket | string | Bucket location to temporarily store Athena query results | "s3://mybucket/myathenaresults" |
| catalogs | list<string> | List of catalogs to extract from Athena | ["AwsDataCatalog"] |
| region | string | The AWS region where Athena exists | "ap-southeast-2" |
| role | string | If your access requires role assumption, place the full ARN value here, otherwise leave it blank | "" |
| output_path | string | Absolute path to the output location where files are to be written | "/tmp/output" |
| mask | boolean | Whether to enable masking | true |
| compress | boolean | Whether to gzip the output | true |

kada_athena_extractor_config.json

{
    "key": "",
    "secret": "",
    "server": "athena",
    "bucket": "s3://examplebucket/examplefolder",
    "catalogs": ["AwsDataCatalog"],
    "region": "ap-southeast-2",
    "role": "",
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true
}
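
Before running the collector it can help to sanity-check the config file against the field table. The following sketch is not part of the KADA library; the function name and the checks are assumptions derived from the field types and examples on this page (for instance, that bucket starts with "s3://" and that output_path is absolute).

```python
import json
import os
from typing import Any, Dict, List

# Field names and types as listed in the parameter table.
EXPECTED_TYPES = {
    "key": str, "secret": str, "server": str, "bucket": str,
    "catalogs": list, "region": str, "role": str,
    "output_path": str, "mask": bool, "compress": bool,
}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in config:
            problems.append(f"missing field: {field}")
        elif not isinstance(config[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    bucket = config.get("bucket")
    if isinstance(bucket, str) and not bucket.startswith("s3://"):
        problems.append("bucket should start with s3://")
    output_path = config.get("output_path")
    if isinstance(output_path, str) and not os.path.isabs(output_path):
        problems.append("output_path should be an absolute path")
    catalogs = config.get("catalogs")
    if isinstance(catalogs, list) and not all(isinstance(c, str) for c in catalogs):
        problems.append("catalogs should contain only strings")
    return problems


def validate_config_file(path: str) -> List[str]:
    """Load a JSON config file and validate it."""
    with open(path, "r", encoding="utf-8") as handle:
        return validate_config(json.load(handle))
```

For example, validate_config_file("kada_athena_extractor_config.json") returns an empty list when every field is present with the expected type.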

Step 6: Run the Collector

This is the wrapper script: kada_athena_extractor.py

import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.athena import Extractor

# Set up logging for the run
get_generic_logger('root')

_type = 'athena'
# By default the config file is expected next to this script
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

# Allow the config path and the extract name to be overridden on the command line
parser = argparse.ArgumentParser(description='KADA Athena Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Get the high water mark window for this extract
start_hwm, end_hwm = get_hwm(args.name)

# Build the extractor from the config, test the connection, then run the extract
ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

# Reached only if the run above did not raise
publish_hwm(args.name, end_hwm)

Step 7: Check the Collector Outputs

K Extracts

A set of files (e.g. metadata, databaselog, linkages, events) will be generated in the output_path directory.
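
One way to check the outputs is to list what was written and confirm that compressed files are readable. This is a minimal sketch under assumptions: the function name is hypothetical, and it assumes that gzipped extracts carry a ".gz" suffix, which this page does not state.

```python
import gzip
import os
from typing import Dict


def summarize_outputs(output_path: str) -> Dict[str, int]:
    """Map each file in output_path to its size in bytes.

    Files ending in .gz are read through to the end, so a corrupt or
    truncated archive raises an error instead of being listed as valid.
    """
    summary = {}
    for name in sorted(os.listdir(output_path)):
        full_path = os.path.join(output_path, name)
        if not os.path.isfile(full_path):
            continue  # skip sub-directories
        if name.endswith(".gz"):
            with gzip.open(full_path, "rb") as handle:
                while handle.read(1024 * 1024):
                    pass  # reading validates the gzip stream
        summary[name] = os.path.getsize(full_path)
    return summary
```

Calling summarize_outputs("/tmp/output") after a run gives a quick view of which files exist and whether any are empty.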

High Water Mark File

A high water mark file is created called athena_hwm.txt.


Step 8: Push the Extracts to K

Once the files have been validated, you can push the files to the K landing directory.
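
How the files are transferred depends on how your landing directory was provisioned, which this page does not specify. The sketch below therefore keeps the transfer step abstract: `upload` is a hypothetical callable you supply (for example a wrapper around your storage client), and the function name is illustrative.

```python
import os
from typing import Callable, List


def push_extracts(output_path: str, upload: Callable[[str, str], None]) -> List[str]:
    """Call upload(local_path, file_name) for every file in output_path.

    Returns the names of the files that were handed to `upload`, in order.
    """
    pushed = []
    for name in sorted(os.listdir(output_path)):
        local_path = os.path.join(output_path, name)
        if os.path.isfile(local_path):
            upload(local_path, name)
            pushed.append(name)
    return pushed
```

Keeping the transfer behind a callable means the same loop can be reused whether the landing directory is reached through a cloud storage SDK or a mounted path.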


Example: Using Airflow to orchestrate the Extract and Push to K