Data Validation

class SchemaValidation

The SchemaValidation class provides functions to validate a Pandas DataFrame against a Pandera schema.

To use SchemaValidation, import the class and create an instance from a DataFrame. The instance can then be used to validate that DataFrame against a Pandera schema.

from ganymede_sdk import SchemaValidation
import pandas as pd

# typically, this DataFrame would be read in using retrieve_sql or retrieve_tables, or retrieved from an input file widget
df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c']
})

# obtain a validated DataFrame
validated_df = SchemaValidation(df).validated_df

Constructor Parameters

prefix_cols: Optional[list[str]]: List of column prefixes to treat as a group. For example, if the DataFrame has columns ['a_1', 'a_2', 'b_1', 'b_2'] and ['a_', 'b_'] is specified as prefix_cols, then all columns that start with 'a_' receive the same schema as the first column in the 'a_' group, and likewise for 'b_'.

nullable_cols: Optional[list[str]]: List of columns that are nullable in the schema. If not specified, all columns are considered nullable.

required_cols: Optional[list[str]]: List of columns that are required in the schema. If not specified, all columns are considered optional.
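
The three constructor parameters interact, so a small pure-Python sketch may help picture the outcome. This is not the SDK's implementation: sketch_column_settings is a hypothetical helper, the type names are plain strings rather than Pandera objects, and the behavior when nullable_cols or required_cols is given (only the listed columns are nullable or required) is an assumption that goes beyond the documented defaults.

```python
from typing import Optional


def sketch_column_settings(
    inferred_types: dict[str, str],
    prefix_cols: Optional[list[str]] = None,
    nullable_cols: Optional[list[str]] = None,
    required_cols: Optional[list[str]] = None,
) -> dict[str, dict]:
    """Hypothetical helper: resolve per-column type, nullable, and required settings."""
    columns = list(inferred_types)
    types = dict(inferred_types)

    # Every column in a prefix group takes the type of the group's first column.
    for prefix in prefix_cols or []:
        group = [col for col in columns if col.startswith(prefix)]
        for col in group:
            types[col] = inferred_types[group[0]]

    settings = {}
    for col in columns:
        settings[col] = {
            "type": types[col],
            # Documented default: all columns are nullable when nullable_cols is omitted.
            # Assumption: when given, only the listed columns are nullable.
            "nullable": nullable_cols is None or col in nullable_cols,
            # Documented default: all columns are optional when required_cols is omitted.
            # Assumption: when given, only the listed columns are required.
            "required": required_cols is not None and col in required_cols,
        }
    return settings


inferred = {"a_1": "int64", "a_2": "float64", "b_1": "str", "b_2": "int64"}
settings = sketch_column_settings(inferred, prefix_cols=["a_", "b_"], required_cols=["a_1"])
```

Here a_2 ends up with the type of a_1, b_2 with the type of b_1, and only a_1 is marked as required.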

Attributes

property df: pd.DataFrame: Pandas DataFrame to validate

property schema_dict: dict[str, Column]: Dictionary of column names and Pandera Column objects

property schema: DataFrameSchema: Pandera DataFrameSchema object

property validated_df: pd.DataFrame: Validated Pandas DataFrame

Customizing Types

Sometimes it is useful to adjust the Pandera schema types generated from a file. For example, a column of fluorescence readings may also contain values indicating that a reading is above or below the limits of quantification.

In this case, modifying the types for a schema is a two-step process.

Step 1: Generate the schema using the SchemaValidation class.

from ganymede_sdk import SchemaValidation
import pandas as pd

df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c']
})

validate_df = SchemaValidation(df)
validate_df

The output is a dictionary similar to the following:

{
    "column1": Column('int64', nullable=True, required=False, coerce=True),
    "column2": Column('str', nullable=True, required=False, coerce=True)
}

Step 2: Copy and modify the schema definition as needed, and use it for validation.

from pandera import Column, DataFrameSchema

new_schema = DataFrameSchema(
    {
        "column1": Column('int64', nullable=True, required=False, coerce=True),
        "column2": Column('str', nullable=True, required=False, coerce=True)
    }
)
df_new = new_schema.validate(df)

More granular control over schema generation can be achieved by calling the generate_pandera_schema method on the SchemaValidation object. This method returns a Pandera DataFrameSchema object that can be modified before validation.



### Jinja templating for SQL Queries

Variables from the **Ganymede** object can be used in SQL queries by using [Jinja templating](https://jinja.palletsprojects.com/en/3.0.x/templates/). The full set of variables available for Jinja templates can be found in the params attribute of either the **GanymedeContext** or **MockGanymedeContext** objects.
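
To see what the substitution amounts to, the snippet below renders a query template directly with the jinja2 library. It is only an illustration: the params dictionary and the "run-123" value are hypothetical stand-ins for the context's params attribute, and no claim is made about how the SDK performs the rendering internally.

```python
from jinja2 import Template

# Hypothetical stand-in for the params attribute of a context object.
params = {"flow_run_id": "run-123"}

query_template = 'SELECT * FROM tbl WHERE flow_run_id = "{{flow_run_id}}";'

# Jinja replaces each {{ name }} placeholder with the matching value.
rendered = Template(query_template).render(**params)
print(rendered)  # prints: SELECT * FROM tbl WHERE flow_run_id = "run-123";
```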

The following example demonstrates how to retrieve data from the most recent run of a _flow_.

```python
from ganymede_sdk import Ganymede

g = Ganymede()

# Retrieve dataframe with flow_run_id from most recent run
query_sql = 'SELECT * FROM tbl WHERE flow_run_id = "{{flow_run_id}}";'
result = g.retrieve_sql(query_sql)
```

function compare_df_to_table

compare_df_to_table compares the columns and schema of a Pandas DataFrame to a corresponding table in the Ganymede database. The function returns the differences in column names and data types between the DataFrame and the table, and can be used to confirm that the DataFrame matches the table schema before writing the DataFrame to the table.

Parameters

  • ganymede_context : GanymedeContext
    • Ganymede context to get run attributes
  • df : pd.DataFrame
    • Pandas DataFrame to validate
  • table : str
    • Name of table to compare DataFrame to
  • diffs_only : bool, optional, default=True
    • If True, only return column and schema differences between the DataFrame and the table (rather than all columns and schemas in the DataFrame and the table)

from ganymede_sdk import Ganymede
from ganymede_sdk.editor import compare_df_to_table

g = Ganymede()

# df is the Pandas DataFrame to compare against the table
# returns a Pandas DataFrame of schema and column differences
res = compare_df_to_table(g.ganymede_context, df, 'ganymede_table_name')

The following examples are outputs from the compare_df_to_table function:

Example:

Pandas DataFrame named df_plate with the following values:

well  run1  run2
A1    2.5   3.7
...

Table in Ganymede data lake named plate_reader_run with the following values:

well  run1  run2  run_diff
A1    2.5   4     1.5
...

Would result in the following DataFrames:

# One column difference and one schema difference between DataFrame and table
compare_df_to_table(g.ganymede_context, df_plate, 'plate_reader_run')

well      table_df  table_bq  diff
run2      FLOAT     INT       diff
run_diff  NaN       FLOAT     diff

# Full set of schema and column differences between DataFrame and table returned
compare_df_to_table(g.ganymede_context, df_plate, 'plate_reader_run', diffs_only=False)

well      table_df  table_bq  diff
well      STRING    STRING
run1      FLOAT     FLOAT
run2      FLOAT     INT       diff
run_diff  NaN       FLOAT     diff
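
The comparison logic in these examples can be approximated with pandas alone. The sketch below is not the SDK's implementation: sketch_compare, KIND_TO_TYPE, and the hard-coded table schema dictionary are hypothetical, and missing columns are represented as None instead of NaN. It only illustrates how a type mismatch (run2) and a missing column (run_diff) both surface as differences.

```python
import pandas as pd

# Hypothetical mapping from NumPy dtype kinds to warehouse-style type names.
KIND_TO_TYPE = {"i": "INT", "u": "INT", "f": "FLOAT", "b": "BOOL", "M": "TIMESTAMP"}


def sketch_compare(df, table_schema, diffs_only=True):
    """Compare DataFrame column types to a {column: type} mapping for a table."""
    # Anything not in the mapping (e.g. object/string columns) is treated as STRING.
    df_types = {col: KIND_TO_TYPE.get(df[col].dtype.kind, "STRING") for col in df.columns}

    # Union of column names, DataFrame columns first, preserving order.
    names = list(dict.fromkeys([*df_types, *table_schema]))

    rows = []
    for name in names:
        df_type = df_types.get(name)         # None if missing from the DataFrame
        table_type = table_schema.get(name)  # None if missing from the table
        rows.append({
            "column": name,
            "table_df": df_type,
            "table_bq": table_type,
            "diff": "diff" if df_type != table_type else "",
        })

    result = pd.DataFrame(rows, columns=["column", "table_df", "table_bq", "diff"])
    if diffs_only:
        result = result[result["diff"] == "diff"].reset_index(drop=True)
    return result


df_plate = pd.DataFrame({"well": ["A1"], "run1": [2.5], "run2": [3.7]})
plate_reader_run_schema = {"well": "STRING", "run1": "FLOAT", "run2": "INT", "run_diff": "FLOAT"}

diffs = sketch_compare(df_plate, plate_reader_run_schema)
everything = sketch_compare(df_plate, plate_reader_run_schema, diffs_only=False)
```

With diffs_only left at its default, only run2 and run_diff are returned; with diffs_only=False, all four columns appear and the matching ones have an empty diff value.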

Allotrope Schema Validation

The Ganymede SDK provides a set of functions to validate that a Pandas DataFrame matches the schema of an Allotrope Data Model.