
Sensitive Data (e.g. PII) Scanner

Pre-requisites

  • Python 3.6 to 3.10, excluding 3.9.0 (3.9.1 and later 3.9.x releases are supported).

  • Access to K landing directory

  • Read access to the source that you are going to run the PII Scanner against.

  • Install the corresponding collector package for the source you are scanning.

    • For example, if you are scanning Snowflake, you also need to install the Snowflake collector package.

    • To run the PII scanner on multiple sources, install the collector package for each source.

    • Refer to the Source collector page for instructions on how to install collector packages.
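The Python version requirement above can be checked before any packages are installed. The following is a minimal sketch; `is_supported_python` is a hypothetical helper name and is not part of any KADA package.

```python
import sys

def is_supported_python(version=None):
    """Return True for Python 3.6 through 3.10, except 3.9.0."""
    # Default to the running interpreter when no version tuple is given.
    major, minor, micro = tuple(version or sys.version_info)[:3]
    if (major, minor, micro) == (3, 9, 0):
        return False
    return (3, 6) <= (major, minor) <= (3, 10)
```

Calling `is_supported_python()` with no arguments checks the interpreter that is currently running, so the check should be run with the same interpreter that will run the scanner.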

Limitations

The Scanner has a number of known limitations. The following scenarios result in a FAILED scan status:

  1. Tables with case-sensitive names, which normally require quoting in SQL

  2. Tables whose names contain special characters that break SQL syntax unless quoted

  3. Tables named after SQL keywords, which require quoting

  4. Tables whose name, schema name, or database name contains a period (.)

  5. Tables containing a column that causes a data retrieval error

  6. Views whose underlying stored procedure or SQL fails to execute
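Names that are likely to hit the first four limitations can be flagged before a scan is configured. The following is a rough sketch only, not part of the scanner: `quoting_risks` and `EXAMPLE_KEYWORDS` are hypothetical names, the keyword set is a small illustrative subset, and both the "plain identifier" pattern and the mixed-case heuristic are assumptions that vary by database.

```python
import re
from typing import List

# Illustrative subset only; each database has its own reserved-word list.
EXAMPLE_KEYWORDS = {"SELECT", "TABLE", "ORDER", "GROUP", "USER", "FROM", "WHERE"}

# Assumed shape of a name that needs no quoting: a letter or underscore,
# followed by letters, digits or underscores.
_PLAIN_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def quoting_risks(name: str) -> List[str]:
    """List the reasons a database, schema or table name may fail a scan."""
    risks = []
    if "." in name:
        risks.append("contains a period")
    elif not _PLAIN_IDENTIFIER.match(name):
        risks.append("contains special characters")
    # Heuristic: a name mixing upper and lower case was probably created quoted.
    if name != name.upper() and name != name.lower():
        risks.append("mixed case, may be case sensitive")
    if name.upper() in EXAMPLE_KEYWORDS:
        risks.append("matches a SQL keyword")
    return risks
```

An empty list means none of these heuristics fired. It does not guarantee a successful scan, because limitations 5 and 6 only surface when the data is read.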


Step 1: Generate a scanner configuration

The scanner configuration is generated for all tables in a database or schema within a database source.

Log in to K and go to Data Applications. Select Ask K, then select the Scanner tab. Go to the Scanner config and click Run.

Select the source and tables you want to run the scanner on. You can select all the tables by database or schema. Click Create to generate the scanner config.


Step 2: Install the PII Scanner Collector

The PII Scanner Collector is hosted in KADA's Azure Blob Storage. Reach out to KADA Support (support@kada.ai) to obtain the collector package and receive a SAS token to access the repository.

Make sure that you've already set up the collector packages for the Sources (e.g. Snowflake) that you'd like to run the PII Scanner Collector on.


Step 3: Configure the Collector

Check to ensure that the following steps have been completed:

  • Installed the relevant Source Collector .whl

  • Installed any external dependencies described on the Source Collector page

  • Installed the common library package kada_collectors_lib-<version>-py3-none-any.whl or higher

  • Installed the PII Scanner .whl (as per Step 2)

  • Created the Source Collector config json file as described on the Source Collector page

Wrapper script: kada_pii_scanner.py

Python
import csv
import argparse
from kada_collectors.extractors.utils import load_config, get_generic_logger
from kada_collectors.extractors.pii_scanner import PIIScanner, VALID_DEFAULT_DETECTORS

get_generic_logger('root')

parser = argparse.ArgumentParser(description='KADA PII Scanner.')
parser.add_argument('--extractor-config', '-e', dest='extractor_config', type=str, required=True)
parser.add_argument('--objects-file-path', '-f', dest='objects_file_path', type=str, required=True)
parser.add_argument('--source-type', '-t', dest='source_type', type=str, required=True)
parser.add_argument('--sample-size', '-s', dest='sample_size', type=int, required=True)
parser.add_argument('--parallel', '-p', dest='concurrency', type=int, default=1)
parser.add_argument('--default-detectors', '-d', dest='default_detectors', type=str)
parser.add_argument('--delta', '-a', dest='delta', action='store_true')
parser.add_argument('--pii-output-path', '-o', dest='pii_output_path', type=str, required=True)
args = parser.parse_args()

def read_validate_object_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        header = next(reader)
        if [x.upper() for x in header] != ['OBJECT_TYPE','OBJECT_ID']:
            raise Exception('Invalid object file')
        return [x for x in reader]

if __name__ == '__main__':
    extractor_config = load_config(args.extractor_config)
    object_list = read_validate_object_file(args.objects_file_path)
    default_detectors = [x.strip() for x in args.default_detectors.split(',')] if args.default_detectors else []
    pii_scanner = PIIScanner(args.source_type, args.sample_size, args.concurrency, object_list, args.pii_output_path, default_detectors=default_detectors, delta=args.delta, **extractor_config)
    pii_scanner.scan()

Example execution:

Bash
python kada_pii_scanner.py -e ./kada_snowflake_extractor_config.json -f ./pii_test_scan.csv -t snowflake -s 10 -p 8 -o /tmp/output -d Email,AUPhoneNumber,CreditCard,AUTaxFileNumber,AUZipCode,AUDriversLicense,AUMedicare,AUPassport,AUABN,IPAddress

Arguments:

| ARGUMENT | SHORT | TYPE | OPTIONAL | DESCRIPTION |
| --- | --- | --- | --- | --- |
| --extractor-config | -e | STRING | N | Location of the extractor configuration json |
| --objects-file-path | -f | STRING | N | Location of the comma-separated file (e.g. .txt or .csv) listing the objects to scan |
| --source-type | -t | STRING | N | Source type e.g. snowflake, oracle |
| --sample-size | -s | INTEGER | N | Number of rows to sample (0 = all rows) |
| --parallel | -p | INTEGER | Y | Parallelism level (default 1) |
| --default-detectors | -d | STRING | Mandatory unless custom detectors defined | Comma-separated list: Email, CreditCard, AUPhoneNumber, AUTaxFileNumber, AUAddress, AUDriversLicense, AUMedicare, AUPassport, AUABN, IPAddress |
| --delta | -a | FLAG | Y | Produces a DELTA extract file for partial scans |
| --pii-output-path | -o | STRING | N | Output folder path for PII extract |
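A mistyped detector name in the `-d` value is easier to catch before the scan starts. The sketch below splits the value the same way the wrapper script does and rejects names that are not in the documented list. `parse_default_detectors` and `DOCUMENTED_DETECTORS` are hypothetical names. The set is copied from the argument description and may be incomplete; the `VALID_DEFAULT_DETECTORS` value imported by the wrapper script should be treated as the authoritative list.

```python
from typing import List

# Names copied from the --default-detectors description (assumed, may be incomplete).
DOCUMENTED_DETECTORS = {
    "Email", "CreditCard", "AUPhoneNumber", "AUTaxFileNumber", "AUAddress",
    "AUDriversLicense", "AUMedicare", "AUPassport", "AUABN", "IPAddress",
}

def parse_default_detectors(raw: str) -> List[str]:
    """Split a comma-separated -d value and reject undocumented names."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    unknown = [name for name in names if name not in DOCUMENTED_DETECTORS]
    if unknown:
        raise ValueError("Unknown detectors: " + ", ".join(unknown))
    return names
```

Raising on the first bad value keeps a long scan from running with a detector list that silently differs from what was intended.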


Step 4 (Optional): Defining your own Detectors

Out-of-the-box detectors include: AUAddress, Email, CreditCard, AUTaxFileNumber, AUPhoneNumber, AUMedicare, AUPassport, AUABN, IPAddress, AUDriversLicense (with state variants).

To define your own detector:

Python
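from kada_collectors.extractors.pii_scanner import PIIScanner, register_detector, DatabaseDatumDetectors

# The decorator registers the custom detector class with the scanner.
@register_detector
class MyEmailDetector(DatabaseDatumDetectors):
    def detect(self, datum, column_name):
        # Returns a list of matches; an empty list means nothing was detected.
        matches = []
        if not isinstance(datum, bool):
            if '@' in str(datum):
                # NOTE: pii_cls is not defined in this snippet. Substitute the
                # PII type class that this detector should report.
                matches.append(pii_cls())
        return matches

if __name__ == '__main__':
    # source_type, sample_size, concurrency, object_list, output_path and
    # extractor_config are placeholders; populate them as in the Step 3 wrapper script.
    pii_scanner = PIIScanner(source_type, sample_size, concurrency, object_list, output_path, **extractor_config)
    pii_scanner.scan()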
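Detection logic can be exercised on sample values without a database connection before it is wrapped in a registered detector class. The sketch below is standalone and does not use the KADA API. `looks_like_email` is a hypothetical helper, and the regular expression is a deliberately loose pattern chosen for illustration, not the one used by the built-in Email detector.

```python
import re

# Loose pattern: something@something.tld with no whitespace. Illustrative only.
_EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(datum) -> bool:
    """Return True when a sampled value resembles an email address."""
    # Booleans are skipped, mirroring the isinstance check in the detector above.
    if datum is None or isinstance(datum, bool):
        return False
    return bool(_EMAIL_PATTERN.match(str(datum).strip()))
```

A stricter pattern than a bare `'@' in str(datum)` test reduces false positives on free-text columns, at the cost of missing unusual but valid addresses.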

References

Supported Data Sources

  1. Snowflake

  2. Oracle

  3. Redshift

Object List File

The object list file can be generated via Ask K - Scanner → Generate Scanner Configuration. It must be a comma-separated flat file (UTF-8 encoded) with headers OBJECT_TYPE,OBJECT_ID.

| HEADER | TYPE | DESCRIPTION |
| --- | --- | --- |
| OBJECT_TYPE | STRING | Object type being scanned (currently only TABLE) |
| OBJECT_ID | STRING | 4-part ID: <host>.<database>.<schema>.<table> (replace any periods in names with underscores) |
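When the object list cannot be generated from Ask K, a file in the same layout can be produced with a short script. This is a minimal sketch based only on the format described above. `write_object_list` is a hypothetical helper, and applying the period-to-underscore replacement to all four parts, including the host, is an assumption.

```python
import csv
from typing import Iterable, Tuple

def write_object_list(path: str, host: str,
                      tables: Iterable[Tuple[str, str, str]]) -> None:
    """Write OBJECT_TYPE,OBJECT_ID rows for (database, schema, table) tuples."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["OBJECT_TYPE", "OBJECT_ID"])
        for parts in tables:
            # Periods inside a name would add extra ID segments, so swap them out.
            names = [host] + list(parts)
            object_id = ".".join(name.replace(".", "_") for name in names)
            writer.writerow(["TABLE", object_id])
```

The header row matches what `read_validate_object_file` in the wrapper script checks for, so a file written this way passes that validation.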