Data Validation
class
SchemaValidation
The SchemaValidation class provides functions for validating a Pandas DataFrame against a Pandera schema.
To use the SchemaValidation class, import it, create an instance, and use that instance to validate a DataFrame against a Pandera schema.
from ganymede_sdk import SchemaValidation
import pandas as pd
from pandera import DataFrameSchema
# typically, this DataFrame would be read in using retrieve_sql or retrieve_tables, or retrieved from a input file widget
df = pd.DataFrame({
'column1': [1, 2, 3],
'column2': ['a', 'b', 'c']
})
# obtain a validated DataFrame
validation_dict = SchemaValidation(df)
schema_df = DataFrameSchema(validation_dict)
df = schema_df.validate(df)
Constructor Parameters
prefix_cols: list[str] | None
: List of column prefixes to treat as a group. For example, if the DataFrame has columns ['a_1', 'a_2', 'b_1', 'b_2'], and ['a_', 'b_'] was specified as prefix_cols, then all columns that start with 'a_' or 'b_' would receive the same schema as the first element in the group.
nullable_cols: list[str] | None
: List of columns that are nullable in the schema. If not specified, all columns are considered nullable.
required_cols: list[str] | None
: List of columns that are required in the schema. If not specified, all columns are considered optional.
Attributes
property df: pd.DataFrame
: Pandas DataFrame to validate
property schema_dict: dict[str, Column]
: Dictionary of column names and Pandera Column objects
property schema: DataFrameSchema
: Pandera DataFrameSchema object
property validated_df: pd.DataFrame
: Validated Pandas DataFrame
Sometimes, it can be useful to tweak the Pandera schema types generated from a file. An example of this is a column that contains fluorescence readings which could take on values specifying that a reading is above the limits of quantification or below the limits of quantification.
In this case, modifying the types for a schema is a two-step process.
Step 1: Generate the schema using the SchemaValidation class.
from ganymede_sdk import SchemaValidation
import pandas as pd
df = pd.DataFrame({
'column1': [1, 2, 3],
'column2': ['a', 'b', 'c']
})
validate_df = SchemaValidation(df)
validate_df
The output would be something a dictionary like the following:
{
"column1": Column('int64', nullable=True, required=False, coerce=True),
"column2": Column('str', nullable=True, required=False, coerce=True)
}
Step 2: Copy and modify the schema definition as needed, and use for validation
from pandera import Column, DataFrameSchema
new_schema = DataFrameSchema(
{
"column1": Column('int64', nullable=True, required=False, coerce=True),
"column2": Column('str', nullable=True, required=False, coerce=True)
}
)
df_new = new_schema.validate(df)
More granular control over schema generation can be achieved by calling the generate_pandera_schema
method on the SchemaValidation object. This method returns a Pandera DataFrameSchema object that can be modified before validation.
Jinja templating for SQL Queries
Variables from the Ganymede object can be used in SQL queries by using Jinja templating. The full set of variables available for Jinja templates can be found in the params attribute of either the GanymedeContext or MockGanymedeContext objects.
The following example demonstrates how to retrieve data from the most recent run of a Flow.
from ganymede_sdk import Ganymede
g = Ganymede()
# Retrieve dataframe with flow_run_id from most recent run
query_sql = 'SELECT * FROM tbl WHERE flow_run_id = "{{flow_run_id}}";'
result = g.retrieve_sql(query_sql)
function
compare_df_to_table
compare_df_to_table allows users to compare the columns and schema on a Pandas DataFrame to a corresponding table in the Ganymede database. The function returns differences in column names and data types between the DataFrame and the table, and can be used to validate that the DataFrame matches the table schema before writing the DataFrame to the table.
Parameters
- ganymede_context :
GanymedeContext
- Ganymede context to get run attributes
- df :
pd.DataFrame
- Pandas DataFrame to validate
- table :
str
- Name of table to compare DataFrame to
- diffs_only:
bool, optional, default=True
- If True, only return columns and schemas differences between DataFrame and table (rather than all columns and schemas in DataFrame and table)
from ganymede_sdk import Ganymede
from ganymede_sdk.editor import compare_df_to_table
g = Ganymede()
# returns a Pandas DataFrame with schema and column
res = compare_df_to_table(g.ganymede_context, df, 'ganymede_table_name')
The following examples are outputs from the compare_df_to_table function:
Example:
Pandas DataFrame named df_plate with the following values:
well | run1 | run2 |
---|---|---|
A1 | 2.5 | 3.7 |
... |
Table in Ganymede data lake named plate_reader_run with the following values:
well | run1 | run2 | run_diff |
---|---|---|---|
A1 | 2.5 | 4 | 1.5 |
... |
Would result in the following DataFrames:
# One column difference and one schema difference between DataFrame and table
compare_df_to_table(g.ganymede_context, result_dfs, 'Example_Quickstart_Absorbance_Change_Python_results')
well | table_df | table_bq | diff |
---|---|---|---|
run2 | FLOAT | INT | diff |
run_diff | NaN | FLOAT | diff |
# Full set of schema and column differences between DataFrame and table returned
compare_df_to_table(g.ganymede_context, result_dfs, 'Example_Quickstart_Absorbance_Change_Python_results', diffs_only=False)
well | table_df | table_bq | diff |
---|---|---|---|
well | STRING | STRING | |
run1 | FLOAT | FLOAT | |
run2 | FLOAT | INT | diff |
run_diff | NaN | FLOAT | diff |
Allotrope Schema Validation
The Ganymede SDK provides a set of functions to validate that a Pandas DataFrame matches the schema of an Allotrope Data Models.