About Collectors
Collectors are extractors that are developed and managed by you (a customer of K).
KADA provides Python libraries that customers can use to quickly deploy a Collector.
Why you should use a Collector
There are several reasons why you may use a collector instead of the direct connect extractor:
- You are using the KADA SaaS offering and it cannot connect to your sources due to firewall restrictions
- You want to push metadata to K rather than allow it to pull data, for security reasons
- You want to inspect the metadata before pushing it to K
Using a collector requires you to manage:
- Deploying and orchestrating the extract code
- Maintaining a high water mark so the extract only pulls the latest metadata
- Storing and pushing the extracts to your K instance
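The high water mark pattern above can be sketched as follows. This is a minimal, hypothetical illustration of the idea (the KADA libraries ship their own get_hwm/publish_hwm helpers, used in Step 5); the file location and timestamp format are assumptions.

```python
import os

def get_hwm(path, default="1970-01-01T00:00:00"):
    """Read the last extracted watermark, falling back to a full-history default."""
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    return default

def publish_hwm(path, value):
    """Persist the watermark only after the extract has been pushed successfully."""
    with open(path, "w") as f:
        f.write(value)

hwm_file = "/tmp/greenplum_hwm.txt"  # hypothetical location
start = get_hwm(hwm_file)            # lower bound for this extract run
# ... run the extract for the window (start, now], push the files to K ...
publish_hwm(hwm_file, "2024-01-01T00:00:00")
```

Advancing the watermark only after a successful push means a failed run is simply retried over the same window on the next schedule.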
Pre-requisites
Collector Server Minimum Requirements
Greenplum Requirements
User access to Greenplum database(s). For each database, the user will need access to a set of PG Catalog tables and GP Metrics tables outlined below.
- Connection to each DB
  The user to be configured must be able to connect to each database.
  CREATE USER kadauser WITH PASSWORD 'complexpassword';
  GRANT CONNECT ON DATABASE testdatabase TO kadauser;
- Access to PG Catalog
  Generally, all users have access to the pg_catalog tables on database creation. In the event the user doesn't have access, explicit grants will need to be made per new database in Greenplum.
  GRANT USAGE ON SCHEMA pg_catalog TO <kada user>;
  GRANT SELECT ON ALL TABLES IN SCHEMA pg_catalog TO <kada user>;
  Alternatively, you may restrict the SELECT grant to the specific tables listed below:
  - pg_attribute
  - pg_class
  - pg_namespace
  - pg_proc
  - pg_database
  - pg_language
  - pg_type
  - pg_collation
  - pg_depend
  - pg_constraint
  - pg_roles
  - pg_auth_members
- Access to Query history
  This step assumes you have configured gpperfmon for query history logging.
  The user must have read access to the gpcc_queries_history table in the gpperfmon database.
  - gpcc_queries_history
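Before running the collector, you can verify the grants above by issuing one cheap query per required table as the configured user. The helper below is hypothetical (not part of the KADA libraries), and it assumes gpcc_queries_history sits in the gpmetrics schema of the audit database, as is typical for Greenplum Command Center setups.

```python
# Tables from the pre-requisites list above
PG_CATALOG_TABLES = [
    "pg_attribute", "pg_class", "pg_namespace", "pg_proc", "pg_database",
    "pg_language", "pg_type", "pg_collation", "pg_depend", "pg_constraint",
    "pg_roles", "pg_auth_members",
]

def build_access_checks():
    """Return one cheap SELECT per required table; any failure indicates a missing grant."""
    checks = ["SELECT 1 FROM pg_catalog.{} LIMIT 1".format(t) for t in PG_CATALOG_TABLES]
    # Run this last check while connected to the audit database (gpperfmon by default)
    checks.append("SELECT 1 FROM gpmetrics.gpcc_queries_history LIMIT 1")
    return checks

# Execute each statement with your Greenplum client of choice (e.g. psycopg2) as the
# kada user; a permission error on any of them points at the grant to fix.
```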
Step 1: Create the Source in K
Create a Greenplum source in K
- Go to Settings, select Sources and click Add Source
- Select the "Load from File" option
- Give the source a Name, e.g. Greenplum Production
- Add the Host name for the Greenplum Server
- Click Finish Setup
Step 2: Getting Access to the Source Landing Directory
Step 3: Install the Collector
You can download the latest Core Library via Platform Settings → Sources → Download Collectors
Run the following command to install the collector.
pip install kada_collectors_extractors_<version>-none-any.whl
You will also need to install the common library kada_collectors_lib for this collector to function properly.
pip install kada_collectors_lib-<version>-none-any.whl
Step 4: Configure the Collector
| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| host | string | Greenplum host as onboarded in the K platform | "example.greenplum.localhost" |
| server | string | Greenplum host to establish a connection to | "example.greenplum.localhost" |
| username | string | Username to log into Greenplum | "greenplum_user" |
| password | string | Password to log into Greenplum | |
| databases | list<string> | A list of databases to extract from Greenplum | ["dwh", "adw"] |
| port | integer | Greenplum port; the default is generally 5432 | 5432 |
| output_path | string | Absolute path to the output location | "/tmp/output" |
| mask | boolean | Whether to enable masking | true |
| compress | boolean | Whether to gzip the output | true |
| meta_only | boolean | Whether to extract metadata only | false |
| audit_database | string | The database where gpmetrics has been set up | gpperfmon |
kada_greenplum_extractor_config.json
{
"host": "",
"server": "",
"username": "",
"password": "",
"databases": [],
"port": 5432,
"output_path": "/tmp/output",
"mask": true,
"compress": true,
"meta_only": true,
"audit_database": "gpperfmon"
}
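Before running the collector, you can sanity-check the config file against the fields described above. This is a minimal sketch; validate_config is a hypothetical helper, not part of the KADA libraries.

```python
REQUIRED = {
    "host": str, "server": str, "username": str, "password": str,
    "databases": list, "port": int, "output_path": str,
    "mask": bool, "compress": bool, "meta_only": bool, "audit_database": str,
}

def validate_config(config):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for field, expected in REQUIRED.items():
        if field not in config:
            problems.append("missing field: {}".format(field))
        elif not isinstance(config[field], expected):
            problems.append("{} should be {}".format(field, expected.__name__))
    if not config.get("databases"):
        problems.append("databases must list at least one database")
    return problems

# Load kada_greenplum_extractor_config.json with json.load and report any problems
# before invoking the extractor.
```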
Step 5: Run the Collector
This is the wrapper script: kada_greenplum_extractor.py. It can be run directly; the --config and --name arguments are optional and default to the config file alongside the script and the source type greenplum.
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.greenplum import Extractor

get_generic_logger('root')  # set up logging

_type = 'greenplum'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA Greenplum Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()

# Fetch the current high water mark window, run the extract, then advance the mark
start_hwm, end_hwm = get_hwm(args.name)
ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})
publish_hwm(args.name, end_hwm)
Step 6: Check the Collector Outputs
K Extracts
A set of files (e.g. metadata, databaselog, linkages, events) will be generated in the output_path directory.
High Water Mark File
A high water mark file called greenplum_hwm.txt is created. It records where the last extract finished, so the next run only pulls newer metadata.
Step 7: Push the Extracts to K
Once the files have been validated, you can push the files to the K landing directory.
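How the push happens depends on how your landing directory is exposed (see Step 2). As a minimal sketch, assuming the landing directory is reachable as a local or mounted path, the generated files can be copied across once validated; push_extracts and both paths below are hypothetical.

```python
import glob
import os
import shutil

def push_extracts(output_path, landing_dir):
    """Copy every generated extract file (gzipped or not) to the landing directory."""
    os.makedirs(landing_dir, exist_ok=True)
    pushed = []
    for path in glob.glob(os.path.join(output_path, "*")):
        if os.path.isfile(path):
            shutil.copy2(path, landing_dir)
            pushed.append(os.path.basename(path))
    return pushed

# Example (hypothetical paths):
# pushed = push_extracts("/tmp/output", "/mnt/kada_landing/greenplum")
```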