About Collectors
Pre-requisites
Collector server minimum requirements
Azure Synapse Requirements
Setting up Azure Synapse for metadata extraction using a service principal (Application).
Step 1: Registering an Entra App for KADA Azure Synapse Collector Application
Create an Entra Application for the kada synapse collector or reusing an existing kada application in Entra.
Generate a secret for the Entra Application and note down the application id, tenant id.
Step 2: Establish Azure Synapse Access
Apply in MASTER using an Azure Synapse Admin user
CREATE USER [<ENTRA APPLICATION NAME>] FROM EXTERNAL PROVIDER;
Apply per database in scope for metadata collection.
CREATE USER [<ENTRA APPLICATION NAME>] FROM EXTERNAL PROVIDER;
GRANT VIEW DEFINITION TO [<ENTRA APPLICATION NAME>];
GRANT VIEW DATABASE STATE TO [<ENTRA APPLICATION NAME>];
The following table should also be available to SELECT by the user created in each database
-
INFORMATION_SCHEMA.ROUTINES
-
INFORMATION_SCHEMA.VIEWS
-
INFORMATION_SCHEMA.TABLE_CONSTRAINTS
-
INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE
-
INFORMATION_SCHEMA.TABLES
-
INFORMATION_SCHEMA.COLUMNS
-
sys.foreign_key_columns
-
sys.objects
-
sys.tables
-
sys.schemas
-
sys.columns
-
sys.databases
Synapse has the concept of a serverless and dedicated sql pool per workspace.
If you use both serverless and dedicated sql pools in a workspace you will need to onboard each as a separate source in K.
For serverless sql pools the master database can't be selected for metadata extraction.
To onboard multiple Synapse workspace each workspace will need to be onboarded as a new source in K.
Step 1: Create the Source in K
Create a source in K
-
Go to Settings, Select Sources and click Add Source
-
Select "Load from File" option
-
Give the source a Name - e.g. SQLServer Azure Production
-
Add the Host name for the SQLServer Azure Instance
-
Click Next & Finish Setup
Step 2: Getting Access to the Source Landing Directory
Step 3: Install the Collector
You can download the Latest Core Library and Azure Synapse whl via Platform Settings → Sources → Download Collectors
Run the following command to install the collector
pip install kada_collectors_extractors_azure_synapse-3.0.0-py3-none-any.whl
You will also need to install the corresponding common library kada_collectors_lib-x.x.x for this collector to function properly.
pip install kada_collectors_lib-x.x.x-py3-none-any.whl
Note that you will also need an ODBC package installed at the OS level for pyodbc to use as well as a SQLServer ODBC driver, refer to https://docs.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver15
Step 4: Configure the Collector
|
FIELD |
FIELD TYPE |
DESCRIPTION |
EXAMPLE |
|---|---|---|---|
|
client |
string |
Onboarded client in Azure to access Azure Synapse |
|
|
secret |
string |
Onboarded client secret in Azure to access Azure Synapse |
|
|
tenant |
string |
Tenant ID of where Azure Synapse exists |
|
|
server |
string |
Azure Synapse server |
"<workspace-name>.sql.azuresynapse.net,1433" |
|
host |
string |
The onboarded host value in K |
"<workspace-name>.sql.azuresynapse.net,1433" |
|
database_name |
string |
The name of the database that will be used to test the connection |
master |
|
databases |
list<string> |
A list of databases to extract from SQLServer Azure |
["dwh", "adw"] |
|
driver |
string |
This is the ODBC driver |
"ODBC Driver 17 for SQL Server" |
|
meta_only |
boolean |
Extract metadata only |
true |
|
output_path |
string |
Absolute path to the output location |
"/tmp/output" |
|
mask |
boolean |
To enable masking or not |
true |
|
compress |
boolean |
To gzip the output or not |
true |
|
connection_timeout |
integer |
Timeout in seconds for Synapse Sql Pool connection |
30 |
kada_azure_synapse_extractor_config.json
{
"client": "",
"secret": "",
"tenant": "",
"server": "",
"host": "",
"driver": "ODBC Driver 17 for SQL Server",
"database_name": "master",
"databases": [""],
"output_path": "/tmp/output",
"mask": true,
"compress": true,
"meta_only": true,
"connection_timeout": 30
}
Step 5: Run the Collector
This code sample uses the kada_azure_synapse_extractor.py for handling the configuration details
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.azure_synapse import Extractor
get_generic_logger('root')
_type = 'azure_synapse'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))
parser = argparse.ArgumentParser(description='KADA Azure Synapse Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename)
parser.add_argument('--name', '-n', dest='name', default=_type)
args = parser.parse_args()
start_hwm, end_hwm = get_hwm(args.name)
ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})
publish_hwm(args.name, end_hwm)
In some scenarios, you may receive an error message about the SSL settings.
This error can be resolved via the Open SSL settings. Refer to: https://github.com/mkleehammer/pyodbc/issues/610#issuecomment-534920201
Edited /etc/ssl/openssl.cnf
# Change or add
MinProtocol = TLSv1.0
CipherString = DEFAULT@SECLEVEL=1
Step 6: Check the Collector Outputs
K Extracts
A set of files (eg metadata, databaselog, linkages, events etc) will be generated in the output_path directory.
High Water Mark File
A high water mark file is created called azure_synapse_hwm.txt.
Step 7: Push the Extracts to K
Once the files have been validated, you can push the files to the K landing directory.