
Data Validation

class SchemaValidation

The SchemaValidation class provides functions for validating a Pandas DataFrame against a Pandera schema.

To use the SchemaValidation class, import it, create an instance, and use that instance to validate a DataFrame against a Pandera schema.

from ganymede_sdk import SchemaValidation
import pandas as pd

# typically, this DataFrame would be read in using retrieve_sql or retrieve_tables, or retrieved from an input file widget
df = pd.DataFrame({
'column1': [1, 2, 3],
'column2': ['a', 'b', 'c']
})

# create a SchemaValidation instance; the validated DataFrame is
# available via the validated_df property
validated_df = SchemaValidation(df).validated_df

Constructor Parameters

  • prefix_cols : Optional[list[str]]
    • List of column prefixes to treat as a group. For example, if the DataFrame has columns ['a_1', 'a_2', 'b_1', 'b_2'] and prefix_cols is ['a_', 'b_'], then all columns starting with 'a_' or 'b_' receive the same schema as the first column in each group (as illustrated below).
  • nullable_cols : Optional[list[str]]
    • List of columns that are nullable in the schema. If not specified, all columns are considered nullable.
  • required_cols : Optional[list[str]]
    • List of columns that are required in the schema. If not specified, all columns are considered optional.
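
A brief sketch of how these parameters might be combined, using hypothetical columns 'a_1', 'a_2', 'b_1', and 'b_2' and assuming all three parameters are accepted as keyword arguments:

from ganymede_sdk import SchemaValidation
import pandas as pd

df = pd.DataFrame({
    'a_1': [1.0, 2.0],
    'a_2': [3.0, None],
    'b_1': ['x', 'y'],
    'b_2': ['z', None],
})

# columns starting with 'a_' share one schema and columns starting with 'b_' share another;
# only 'a_2' and 'b_2' may contain nulls, and only 'a_1' is required to be present
sv = SchemaValidation(
    df,
    prefix_cols=['a_', 'b_'],
    nullable_cols=['a_2', 'b_2'],
    required_cols=['a_1'],
)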

Attributes

  • property df : pd.DataFrame
    • Pandas DataFrame to validate
  • property schema_dict : dict[str, Column]
    • Dictionary of column names and Pandera Column objects
  • property schema : DataFrameSchema
    • Pandera DataFrameSchema object
  • property validated_df : pd.DataFrame
    • Validated Pandas DataFrame
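
Once an instance has been constructed, the inferred schema and the validated data can be read from these properties. A minimal sketch, assuming the properties behave as documented above:

sv = SchemaValidation(df)

sv.schema_dict                  # dict of column name -> Pandera Column
sv.schema                       # Pandera DataFrameSchema built from those columns
validated_df = sv.validated_df  # DataFrame validated against the schema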

Customizing types

Sometimes it is useful to tweak the Pandera schema types generated from a file. One example is a column of fluorescence readings that may also contain values indicating that a reading is above or below the limits of quantification.

In this case, modifying the types for a schema is a two-step process.

Step 1: Generate the schema using the SchemaValidation class.

from ganymede_sdk import SchemaValidation
import pandas as pd

df = pd.DataFrame({
'column1': [1, 2, 3],
'column2': ['a', 'b', 'c']
})

validate_df = SchemaValidation(df)
validate_df  # displaying the instance shows the inferred column schema

The output would be a dictionary like the following:

{
    "column1": Column('int64', nullable=True, required=False, coerce=True),
    "column2": Column('str', nullable=True, required=False, coerce=True)
}

Step 2: Copy and modify the schema definition as needed, and use it for validation.

from pandera import Column, DataFrameSchema

new_schema = DataFrameSchema(
    {
        # adjust dtypes, nullability, etc. here as needed
        "column1": Column('int64', nullable=True, required=False, coerce=True),
        "column2": Column('str', nullable=True, required=False, coerce=True)
    }
)
df_new = new_schema.validate(df)

More granular control over schema generation can be achieved by calling the generate_pandera_schema method on the SchemaValidation object. This method returns a Pandera DataFrameSchema object that can be modified before validation.
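
For instance, the generated schema could be adjusted with Pandera's update_column method before validating. This is a sketch only, and assumes generate_pandera_schema can be called without arguments:

from ganymede_sdk import SchemaValidation

# sketch: assumes generate_pandera_schema takes no required arguments
schema = SchemaValidation(df).generate_pandera_schema()

# update_column is a standard Pandera method; here column1 is widened from
# 'int64' to str so that markers such as '>ULOQ' or '<LLOQ' pass validation
schema = schema.update_column('column1', dtype=str)

validated_df = schema.validate(df)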

Jinja templating for SQL Queries

Variables from the Ganymede object can be used in SQL queries by using Jinja templating. The full set of variables available for Jinja templates can be found in the params attribute of either the GanymedeContext or MockGanymedeContext objects.

The following example demonstrates how to retrieve data from the most recent run of a Flow.

from ganymede_sdk import Ganymede

g = Ganymede()

# Retrieve dataframe with flow_run_id from most recent run
query_sql = 'SELECT * FROM tbl WHERE flow_run_id = "{{flow_run_id}}";'
result = g.retrieve_sql(query_sql)
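
To see which variables can be referenced inside the {{ ... }} placeholders, the params attribute mentioned above can be printed. A minimal sketch, assuming the context object is reachable from the Ganymede instance as g.ganymede_context (as in the compare_df_to_table example below):

from ganymede_sdk import Ganymede

g = Ganymede()

# sketch: list the variables available for Jinja templating
print(g.ganymede_context.params)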

function compare_df_to_table

compare_df_to_table compares the columns and schema of a Pandas DataFrame to a corresponding table in the Ganymede database. The function returns the differences in column names and data types between the DataFrame and the table, and can be used to check that the DataFrame matches the table schema before writing to it.

Parameters

  • ganymede_context : GanymedeContext
    • Ganymede context to get run attributes
  • df : pd.DataFrame
    • Pandas DataFrame to validate
  • table : str
    • Name of table to compare DataFrame to
  • diffs_only : bool, optional, default=True
    • If True, only return the column and schema differences between the DataFrame and the table (rather than all columns and schemas in both)

from ganymede_sdk import Ganymede
from ganymede_sdk.editor import compare_df_to_table

g = Ganymede()

# returns a Pandas DataFrame of schema and column differences
res = compare_df_to_table(g.ganymede_context, df, 'ganymede_table_name')

The following examples are outputs from the compare_df_to_table function:

Example:

Pandas DataFrame named df_plate with the following values:

| well | run1 | run2 |
|------|------|------|
| A1   | 2.5  | 3.7  |
| ...  | ...  | ...  |

Table in Ganymede data lake named plate_reader_run with the following values:

| well | run1 | run2 | run_diff |
|------|------|------|----------|
| A1   | 2.5  | 4    | 1.5      |
| ...  | ...  | ...  | ...      |

Would result in the following DataFrames:

# One column difference and one schema difference between DataFrame and table
compare_df_to_table(g.ganymede_context, df_plate, 'plate_reader_run')

| well     | table_df | table_bq | diff |
|----------|----------|----------|------|
| run2     | FLOAT    | INT      | diff |
| run_diff | NaN      | FLOAT    | diff |

# Full set of schema and column differences between DataFrame and table returned
compare_df_to_table(g.ganymede_context, df_plate, 'plate_reader_run', diffs_only=False)

| well     | table_df | table_bq | diff |
|----------|----------|----------|------|
| well     | STRING   | STRING   |      |
| run1     | FLOAT    | FLOAT    |      |
| run2     | FLOAT    | INT      | diff |
| run_diff | NaN      | FLOAT    | diff |
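
As a hypothetical follow-up, the comparison result can be used to gate a write step. This sketch assumes that, with the default diffs_only=True, an empty result means the DataFrame and table schemas match:

res = compare_df_to_table(g.ganymede_context, df_plate, 'plate_reader_run')

# sketch: stop before writing if any column or dtype differences were reported
if not res.empty:
    raise ValueError(f"df_plate does not match plate_reader_run:\n{res}")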

Allotrope Schema Validation

The Ganymede SDK provides a set of functions to validate that a Pandas DataFrame matches the schema of an Allotrope Data Model.