
Configuring the KADA Great Expectations Plugin

Introducing the KADA Great Expectations (GX) Plugin

The KADA GX plugin is used by Great Expectations to push validation results to your K instance.

The plugin is available on PyPI (https://pypi.org/project/kada-gx-plugin/) or can be provided on request (please reach out to support@kada.ai).

The plugin uploads the validation results to the correct landing directory in K, formats the file name, and adds some additional metadata to the validation result to give the best experience in K.

1. Installing the KADA GX Plugin

Install the Python wheel into your GX environment:

CODE
pip install kada-gx-plugin

The KADA plugin has been tested with GX versions 0.15.41 - 0.17.2 and Python 3.8 - 3.11.

Once installed, you will need to complete the following:

  1. Add the storage action to your checkpoint yamls

  2. Add batch_metadata to configured datasource and predefined assets inside your great_expectations.yaml

  3. Add kada_targets to query based batches inside checkpoint yamls

2. Add the Plugin to Checkpoint Action List

Ensure your checkpoint is not using the SimpleCheckpoint class, as it has pre-defined actions and the action_list you specify in the checkpoint won't apply.

Add the plugin to your checkpoint.yaml files as part of the action_list.

AZURE_BLOB_SAS_URL is the Azure container SAS URL (not just the token), which can be generated by navigating to the Storage Account > Container > Shared access tokens. It should have Read/Write/List/Add/Create/Delete permissions.

prefix is the path relative to the container itself; the validation result files will be written to this path.

For best practices on storing credential variables, see https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials/#using-the-config_variablesyml-file
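For example, a config_variables.yml entry for the SAS URL might look like the following sketch (the storage account, container, and token values are placeholders you must replace with your own):

YAML
# config_variables.yml (referenced from great_expectations.yml via
# config_variables_file_path); all bracketed values are placeholders
AZURE_BLOB_SAS_URL: "https://<storage-account>.blob.core.windows.net/<container>?<sas-token>"

The ${AZURE_BLOB_SAS_URL} reference in the checkpoint action_list is then substituted at runtime by GX's config variable mechanism.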

YAML
action_list:
  - name: store_kada_validation_result
    action:
      class_name: KadaStoreValidationResultsAction
      module_name: kada_ge_store_plugin.kada_store_validation
      prefix: lz/ge_landing/landing
      azure_blob_sas_url: ${AZURE_BLOB_SAS_URL}

If you simply want to test the action locally and target a local file directory first, you can provide a test_directory to the action.

For example, the below configuration will push formatted validation results to /tmp/ge_validations/lz/ge_landing/landing on your local file system.

YAML
action_list:
  - name: store_kada_validation_result
    action:
      class_name: KadaStoreValidationResultsAction
      module_name: kada_ge_store_plugin.kada_store_validation
      prefix: lz/ge_landing/landing
      test_directory: /tmp/ge_validations

Remove the test_directory parameter once you are ready to push to the K landing area.

If you already have another action that stores the results and you add this action, GX will simply push the validations to both locations, so it won't impact any existing process you may have that requires the validation results.

3. Coding Standards

To get the best experience of viewing your Data Quality objects in K, you should add the following to your existing setup or keep these conventions in mind when coding for GX.

As a general rule, assets defined upfront within great_expectations.yaml should include batch_metadata; assets not defined upfront and query assets should include kada_targets under evaluation_parameters in the checkpoint.yaml files.

3.1. Great Expectations Configuration

You will need to add batch_metadata / batch_spec_passthrough with the following values to the connection types listed below:

  1. kada_database_name

  2. kada_host_name

kada_database_name will hold the name of the targeted database

kada_host_name will hold the service name or host of the targeted database

3.1.1. ConfiguredDatasourceConnectors

For datasources where the associated assets are defined upfront in great_expectations.yaml, add the batch_metadata section to each defined asset. Note: for non-fluent style datasources (v0.15.x or older), use batch_spec_passthrough instead of batch_metadata.

Where MY_DB and MY_HOST can be either hard coded or environment driven

YAML
fluent_datasources:
  postgres:
    type: sql
    assets:
      test_table:
        type: table
        order_by: []
        batch_metadata:
          kada_database_name: ${MY_DB}
          kada_host_name: ${MY_HOST}
        table_name: node
        schema_name: public

For query type assets you have the option to do the same, but this is not required as you will be adding a value called kada_targets in your checkpoint file which is explained in 3.2. Checkpoints

YAML
fluent_datasources:
  postgres:
    type: sql
    assets:
      query_asset_node_ref:
        type: query
        order_by: []
        batch_metadata:
          kada_database_name: ${MY_DB}
          kada_host_name: ${MY_HOST}
        query: SELECT id as new_id, name FROM node_ref

If using GX v0.15.x or older (non-fluent style), use batch_spec_passthrough instead:

YAML
      conf_asset_data_connector:
        name: conf_asset_data_connector
        class_name: ConfiguredAssetSqlDataConnector
        module_name: great_expectations.datasource.data_connector
        assets:
          table_asset_edge:
            class_name: Asset
            module_name: great_expectations.datasource.data_connector.asset
            schema_name: public
            batch_spec_passthrough:
              kada_database_name: ${MY_DB}
              kada_host_name: ${MY_HOST}
            table_name: edge
            type: table

3.1.2. InferredDatasourceConnectors

No additions are required here; instead, make these additions at the checkpoint yaml level. This applies to non-file-based datasources.

3.1.3. RuntimeDatasourceConnectors

No additions are required here; instead, make these additions at the checkpoint yaml level.

3.2. Checkpoints

For query-based assets, or runtime assets that are query-based, add evaluation_parameters (if it does not already exist) to each applicable batch request. Under this element, add:

  1. kada_targets

kada_targets provides metadata to the K platform to help determine what your query is testing, as it may not be obvious from the query alone.

This will define what the intended target table or column is for the query asset or run time query asset

It should be in the form

YAML
HOST_NAME.DATABASE_NAME.SCHEMA_NAME.TABLE_NAME.COLUMN_NAME

Note that the period delimiting is important, as it tells K which part of the name is the Database/Schema/Table/Column etc. If your names themselves contain a period, please replace it with an underscore (_).

Note if you associate kada_targets to one or many columns, do not associate the corresponding table, as this will result in a double count of the test result.
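To illustrate the underscore rule above, here is a small hypothetical helper (not part of the KADA plugin) that builds a kada_targets entry from its name parts:

```python
# Hypothetical helper (not part of the KADA plugin) that joins name parts
# into a kada_targets string, replacing any period inside an individual part
# with an underscore so K can still split the string on the period delimiters.
def build_kada_target(*parts: str) -> str:
    return ".".join(part.replace(".", "_") for part in parts)

# A table literally named "daily.sales" becomes "daily_sales" in the target:
print(build_kada_target("MY_HOST", "MY_DB", "public", "daily.sales", "id"))
# MY_HOST.MY_DB.public.daily_sales.id
```

The helper is purely illustrative; in the checkpoint yaml examples below the same strings are written out by hand.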

For runtime query assets, such as runtime_defined_test_node below:

YAML
  - batch_request:
      datasource_name: postgres_datasource
      data_connector_name: my_runtime_data_connector
      data_asset_name: runtime_defined_test_node
      runtime_parameters:
        query: "select id, name, node_ref_id from node"
      batch_identifiers:
        default_identifier_name: "some name"
    expectation_suite_name: test_suite
    evaluation_parameters:
      kada_targets:
        - ${MY_HOST}.${MY_DB}.${MY_SCHEMA}.node

For inferred asset types (with example of associating to multiple columns)

YAML
  - batch_request:
      datasource_name: postgres_datasource_inferred
      data_connector_name: whole_table
      data_asset_name: public.node
    expectation_suite_name: test_suite
    evaluation_parameters:
      kada_targets:
        - ${MY_HOST}.${MY_DB}.public.node.created_at
        - ${MY_HOST}.${MY_DB}.public.node.id

Similarly for configured query assets, such as this predefined query asset query_asset_node_ref:

YAML
  - batch_request:
      datasource_name: postgres
      data_asset_name: query_asset_node_ref
    expectation_suite_name: query_test_suite
    evaluation_parameters:
      kada_targets:
        - ${MY_HOST}.${MY_DB}.public.node

If you define your batch requests in python, simply add evaluation_parameters to the kwargs for the BatchRequest / RuntimeBatchRequest object

PY
batch_request = BatchRequest(
    datasource_name="postgres",
    data_asset_name="query_asset_node_ref",
    expectation_suite_name="query_test_suite",
    evaluation_parameters={"kada_targets": ["${MY_HOST}.${MY_DB}.public.node"]}
)
