DataFrame?

dataframe.head()
| person_id | hours_worked | wage_per_hour |
|---|---|---|
| aafee0b | 38.5 | 15.1 |
| 0a917a3 | 41.25 | 15.0 |
| 2d1786b | 35.0 | 21.3 |
| 263cf89 | 27.75 | 17.5 |
| 89a09dc | 22.25 | 19.5 |
Data validation is the act of falsifying data against explicit assumptions for some downstream purpose, like analysis, modeling, and visualization.
"All swans are white"
/usr/local/miniconda3/envs/pandera-presentations/lib/python3.7/site-packages/pandas/core/ops/__init__.py in masked_arith_op(x, y, op)
445 if mask.any():
446 with np.errstate(all="ignore"):
--> 447 result[mask] = op(xrav[mask], com.values_from_object(yrav[mask]))
448
449 else:
TypeError: can't multiply sequence by non-int of type 'float'
def process_data(df):
...
def process_data(df):
return (
df.assign(
weekly_income=lambda x: x.hours_worked * x.wage_per_hour
)
)
def process_data(df):
import pdb; pdb.set_trace() # <- insert breakpoint
return (
df.assign(
weekly_income=lambda x: x.hours_worked * x.wage_per_hour
)
)
>>> print(df)
          hours_worked  wage_per_hour
person_id
aafee0b           38.5           15.1
0a917a3          41.25           15.0
2d1786b           35.0           21.3
263cf89          27.75           17.5
89a09dc          22.25           19.5
e256747          -20.5           25.5
>>> df.dtypes
hours_worked      object
wage_per_hour    float64
dtype: object
>>> df.hours_worked.map(type)
person_id
aafee0b    <class 'float'>
0a917a3    <class 'float'>
2d1786b      <class 'str'>
263cf89    <class 'float'>
89a09dc    <class 'float'>
e256747    <class 'float'>
Name: hours_worked, dtype: object
def process_data(df):
return (
df
.astype({"hours_worked": float, "wage_per_hour": float}) # <- make sure columns are floats
.assign(
hours_worked=lambda x: x.hours_worked.where( # <- replace negative values with nans
x.hours_worked >= 0, np.nan
)
)
.assign(
weekly_income=lambda x: x.hours_worked * x.wage_per_hour
)
)
>>> process_data(df)
| person_id | hours_worked | wage_per_hour | weekly_income |
|---|---|---|---|
| aafee0b | 38.50 | 15.1 | 581.350 |
| 0a917a3 | 41.25 | 15.0 | 618.750 |
| 2d1786b | 35.00 | 21.3 | 745.500 |
| 263cf89 | 27.75 | 17.5 | 485.625 |
| 89a09dc | 22.25 | 19.5 | 433.875 |
| e256747 | NaN | 25.5 | NaN |
@pa.check_input(in_schema)
@pa.check_output(out_schema)
def process_data(df):
return (
df.assign(
hours_worked=lambda x: x.hours_worked.where( # <- replace negative values with nans
x.hours_worked >= 0, np.nan
)
)
.assign(
weekly_income=lambda x: x.hours_worked * x.wage_per_hour
)
)
Here's what in_schema and out_schema are. It's like finding a NOTE that a fellow traveler has left for you.

import pandera as pa
from pandera import DataFrameSchema, Column, Check
# NOTE: this is what's supposed to be in `df` going into `process_data`
in_schema = DataFrameSchema({
"hours_worked": Column(pa.Float, coerce=True, nullable=True),
"wage_per_hour": Column(pa.Float, coerce=True, nullable=True),
})
# ... and this is what `process_data` is supposed to return.
out_schema = (
in_schema
.update_column("hours_worked", checks=Check.greater_than_or_equal_to(0))
.add_columns({"weekly_income": Column(pa.Float, nullable=True)})
)
@pa.check_input(in_schema)
@pa.check_output(out_schema)
def process_data(df):
...
The better you can reason about the contents of a dataframe, the faster you can debug.
The faster you can debug, the sooner you can focus on downstream tasks that you care about.
According to the European Statistical System:
Data validation is an activity in which it is verified whether or not a combination of values is a member of a set of acceptable value combinations. [Di Zio et al. 2015]
More formally, we can relate this definition to one of the core principles of the scientific method: falsifiability
$v(x) \twoheadrightarrow \{ {True}, {False} \}$
Where $x$ is a set of data values and $v$ is a surjective validation function, meaning that there exists at least one $x$ that maps onto each of the elements in the set $\{True, False\}$ [van der Loo et al. 2019].
$v(x) \rightarrow True$
Example: "my dataframe can have any number of columns of any type"
lambda df: True
$v(x) \rightarrow False$
Example: "my dataframe has an infinite number of rows and columns"
lambda df: False
Technical checks have to do with the properties of the data structure:

- income is a numeric variable, occupation is a categorical variable.
- email_address should be unique.
- there should be no null values in the occupation field.

Domain-specific checks have to do with properties specific to the topic under study:

- the age variable should be in the range 0 to 120.
- income for records where age is below the legal working age should be nan.
- records with certain occupation values tend to have higher income than others.

Checks that express hard-coded logical rules, e.g. "the mean age should be between 30 and 40 years old":

lambda age: 30 <= age.mean() <= 40

Checks that explicitly incorporate randomness and distributional variability, e.g. "the 95% confidence interval of the mean age should be between 30 and 40 years old":

def prob_check_age(age):
    mu = age.mean()
    ci = 1.96 * (age.std() / np.sqrt(len(age)))
    return 30 <= mu - ci and mu + ci <= 40
Verifying the assumptions about the distributional properties of a dataset to ensure that statistical operations on those data are valid.
- "x1, x2, and x3 are not correlated"
- "x is greater than some threshold t"
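As a sketch of how the second statement might be encoded with pandera (the column name x and the threshold value here are arbitrary assumptions, not from the original):

import pandera as pa
from pandera import Check, Column

t = 10  # some threshold; the actual value depends on the problem at hand
threshold_schema = pa.DataFrameSchema({"x": Column(pa.Float, Check.greater_than(t))})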
As a machine learning engineer who uses pandas every day, I want a data validation tool that's intuitive, flexible, customizable, and easy to integrate into my ETL pipelines, so that I can spend less time worrying about the correctness of a dataframe's contents and more time training models.

pandera

A design-by-contract data validation library that exposes an intuitive API for expressing dataframe schemas.
$v(x) \twoheadrightarrow \{ {True}, {False} \}$
$s(v, x) \rightarrow \begin{cases} x, & \text{if } v(x) = \text{True} \\ \text{error}, & \text{otherwise} \end{cases}$
Where $s$ is a schema function that takes two arguments: the validation function $v$ and some data $x$.
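As a minimal sketch of this idea in plain Python (an illustration only, not pandera's actual implementation):

def make_schema(validation_fn):
    # return a schema function s(x): pass the data through if the validation
    # function holds, otherwise raise an error
    def schema(df):
        if validation_fn(df):
            return df
        raise ValueError("data failed validation")
    return schema

# e.g. a schema that only accepts dataframes containing an "hours_worked" column
s = make_schema(lambda df: "hours_worked" in df.columns)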
Compositionality
Consider a data processing function $f(x) \rightarrow x'$ that cleans the raw dataset $x$.
We can use the schema to define any number of composite functions:
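For example (an illustration using the schema function $s$, an output schema $s'$, and the processing function $f$ defined above):

- $f(s(x))$: validate the raw data before processing it.
- $s'(f(x))$: validate the processed output.
- $s'(f(s(x)))$: validate the data both on the way in and on the way out.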
"A step toward creating an impartial, comprehensive, and searchable national database of people killed during interactions with law enforcement."
Example 1:
Undocumented immigrant Roberto Chavez-Recendiz, of Hidalgo, Mexico, was fatally shot while Rivera was arresting him along with his brother and brother-in-law for allegedly being in the United States without proper documentation. The shooting was "possibly the result of what is called a 'sympathetic grip,' where one hand reacts to the force being used by the other," Chief Criminal Deputy County Attorney Rick Unklesbay wrote in a letter outlining his review of the shooting. The Border Patrol officer had his pistol in his hand as he took the suspects into custody. He claimed the gun fired accidentally.
Example 2:
Andrew Lamar Washington died after officers tasered him 17 times within three minutes.
Example 3:
Biddle and his brother, Drake Biddle, were fleeing from a Nashville Police Department officer at a high rate of speed when a second Nashville Police Department officer, James L. Steely, crashed into him head-on.
Let's read in the dataset and start exploring it with the help of pandera.
import janitor
import pandas as pd
import requests
from IPython.display import Markdown, display
from pandas_profiling import ProfileReport
dataset_url = (
"https://docs.google.com/spreadsheets/d/"
"1dKmaV_JiWcG8XBoRgP8b4e9Eopkpgt7FL7nyspvzAsE/export?format=csv"
)
fatal_encounters = pd.read_csv(dataset_url, skipfooter=1, engine="python")
| | Unique ID | Subject's name | Subject's age | Subject's gender | Subject's race | Subject's race with imputations | Imputation probability | URL of image of deceased | Date of injury resulting in death (month/day/year) | Location of injury (address) | Location of death (city) | Location of death (state) | Location of death (zip code) | Location of death (county) | Full Address | Latitude | Longitude | Agency responsible for death | Cause of death | A brief description of the circumstances surrounding the death | Dispositions/Exclusions INTERNAL USE, NOT FOR ANALYSIS | Intentional Use of Force (Developing) | Link to news article or photo of official document | Symptoms of mental illness? INTERNAL USE, NOT FOR ANALYSIS | Video | Date&Description | Unique ID formula | Unique identifier (redundant) | Date (Year) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25746 | Samuel H. Knapp | 17 | Male | European-American/White | European-American/White | not imputed | NaN | 01/01/2000 | 27898-27804 US-101 | Willits | CA | 95490.0 | Mendocino | 27898-27804 US-101 Willits CA 95490 Mendocino | 39.470883 | -123.361751 | Mendocino County Sheriff's Office | Vehicle | Samuel Knapp was allegedly driving a stolen ve... | Unreported | Vehicle/Pursuit | https://drive.google.com/file/d/10DisrV8K5ReP1... | No | NaN | 1/1/2000: Samuel Knapp was allegedly driving a... | NaN | 25746 | 2000.0 |
| 1 | 25747 | Mark A. Horton | 21 | Male | African-American/Black | African-American/Black | not imputed | NaN | 01/01/2000 | Davison Freeway | Detroit | MI | 48203.0 | Wayne | Davison Freeway Detroit MI 48203 Wayne | 42.404526 | -83.092274 | NaN | Vehicle | Two Detroit men killed when their car crashed ... | Unreported | Vehicle/Pursuit | https://drive.google.com/file/d/1-nK-RohgiM-tZ... | No | NaN | 1/1/2000: Two Detroit men killed when their ca... | NaN | 25747 | 2000.0 |
| 2 | 25748 | Phillip A. Blurbridge | 19 | Male | African-American/Black | African-American/Black | not imputed | NaN | 01/01/2000 | Davison Freeway | Detroit | MI | 48203.0 | Wayne | Davison Freeway Detroit MI 48203 Wayne | 42.404526 | -83.092274 | NaN | Vehicle | Two Detroit men killed when their car crashed ... | Unreported | Vehicle/Pursuit | https://drive.google.com/file/d/1-nK-RohgiM-tZ... | No | NaN | 1/1/2000: Two Detroit men killed when their ca... | NaN | 25748 | 2000.0 |
Clean up the column names to make the analysis more readable.
def clean_columns(df):
return (
df.clean_names()
.rename(
columns=lambda x: (
x.strip("_")
.replace("&", "_and_")
.replace("subjects_", "")
.replace("location_of_death_", "")
.replace("_resulting_in_death_month_day_year", "")
.replace("_internal_use_not_for_analysis", "")
)
)
)
fatal_encounters_clean_columns = clean_columns(fatal_encounters)
Define just the columns that we need for creating the training set, and specify which ones are nullable.
clean_column_schema = pa.DataFrameSchema(
{
"age": Column(nullable=True),
"gender": Column(nullable=True),
"race": Column(nullable=True),
"cause_of_death": Column(nullable=True),
"symptoms_of_mental_illness": Column(nullable=True),
"dispositions_exclusions": Column(nullable=True),
}
)
Validate the output of clean_columns using pipe at the end of the method chain
def clean_columns(df):
return (
df.clean_names()
.rename(
columns=lambda x: (
x.strip("_")
.replace("&", "_and_")
.replace("subjects_", "")
.replace("location_of_death_", "")
.replace("_resulting_in_death_month_day_year", "")
.replace("_internal_use_not_for_analysis", "")
)
)
.pipe(clean_column_schema) # <- validate output with schema
)
If a column is not present as specified by the schema, a SchemaError is raised.
corrupted_data = fatal_encounters.drop("Subject's age", axis="columns")
try:
clean_columns(corrupted_data)
except pa.errors.SchemaError as exc:
print(exc)
column 'age' not in dataframe
unique_id name gender race \
0 25746 Samuel H. Knapp Male European-American/White
1 25747 Mark A. Horton Male African-American/Black
2 25748 Phillip A. Blurbridge Male African-American/Black
3 25749 Mark Ortiz Male Hispanic/Latino
4 2 Lester Miller Male Race unspecified
race_with_imputations imputation_probability url_of_image_of_deceased \
0 European-American/White not imputed NaN
1 African-American/Black not imputed NaN
2 African-American/Black not imputed NaN
3 Hispanic/Latino not imputed NaN
4 African-American/Black 0.947676492 NaN
date_of_injury location_of_injury_address city ... \
0 01/01/2000 27898-27804 US-101 Willits ...
1 01/01/2000 Davison Freeway Detroit ...
2 01/01/2000 Davison Freeway Detroit ...
3 01/01/2000 600 W Cherry Ln Carlsbad ...
4 01/02/2000 4850 Flakes Mill Road Ellenwood ...
a_brief_description_of_the_circumstances_surrounding_the_death \
0 Samuel Knapp was allegedly driving a stolen ve...
1 Two Detroit men killed when their car crashed ...
2 Two Detroit men killed when their car crashed ...
3 A motorcycle was allegedly being driven errati...
4 Darren Mayfield, a DeKalb County sheriff's dep...
dispositions_exclusions intentional_use_of_force_developing \
0 Unreported Vehicle/Pursuit
1 Unreported Vehicle/Pursuit
2 Unreported Vehicle/Pursuit
3 Unreported Vehicle/Pursuit
4 Criminal Intentional Use of Force, Deadly
link_to_news_article_or_photo_of_official_document \
0 https://drive.google.com/file/d/10DisrV8K5ReP1...
1 https://drive.google.com/file/d/1-nK-RohgiM-tZ...
2 https://drive.google.com/file/d/1-nK-RohgiM-tZ...
3 https://drive.google.com/file/d/1qAEefRjX_aTtC...
4 https://docs.google.com/document/d/1-YuShSarW_...
symptoms_of_mental_illness video \
0 No NaN
1 No NaN
2 No NaN
3 No NaN
4 No NaN
date_and_description unique_id_formula \
0 1/1/2000: Samuel Knapp was allegedly driving a... NaN
1 1/1/2000: Two Detroit men killed when their ca... NaN
2 1/1/2000: Two Detroit men killed when their ca... NaN
3 1/1/2000: A motorcycle was allegedly being dri... NaN
4 1/2/2000: Darren Mayfield, a DeKalb County she... NaN
unique_identifier_redundant date_year
0 25746 2000.0
1 25747 2000.0
2 25748 2000.0
3 25749 2000.0
4 2 2000.0
[5 rows x 28 columns]
pandas-profiling

training_data_schema = pa.DataFrameSchema(
{
# feature columns
"age": Column(pa.Float, Check.in_range(0, 120), nullable=True),
"gender": Column(pa.String, Check.isin(genders), nullable=True),
"race": Column(pa.String, Check.isin(races), nullable=True),
"cause_of_death": Column(pa.String, Check.isin(causes_of_death), nullable=True),
"symptoms_of_mental_illness": Column(pa.Bool, nullable=True),
# target column
"disposition_accidental": Column(pa.Bool, nullable=False),
},
coerce=True # <- coerce columns to the specified type
)
print(training_data_schema.to_yaml())
schema_type: dataframe
version: 0.4.2
columns:
age:
pandas_dtype: float
nullable: true
checks:
in_range:
min_value: 0
max_value: 120
gender:
pandas_dtype: string
nullable: true
checks:
isin:
- female
- male
- transgender
- transexual
race:
pandas_dtype: string
nullable: true
checks:
isin:
- african_american_black
- asian_pacific_islander
- european_american_white
- hispanic_latino
- middle_eastern
- native_american_alaskan
- race_unspecified
cause_of_death:
pandas_dtype: string
nullable: true
checks:
isin:
- asphyxiated_restrained
- beaten_bludgeoned_with_instrument
- burned_smoke_inhalation
- chemical_agent_pepper_spray
- drowned
- drug_overdose
- fell_from_a_height
- gunshot
- medical_emergency
- other
- stabbed
- tasered
- undetermined
- unknown
- vehicle
symptoms_of_mental_illness:
pandas_dtype: bool
nullable: true
checks: null
disposition_accidental:
pandas_dtype: bool
nullable: false
checks: null
index: null
coerce: true
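The serialized schema can also be loaded back in; a quick round-trip sketch (this assumes pandera's io module, which also appears in the roadmap section at the end):

# serialize the schema to a yaml string and reconstruct an equivalent schema from it
yaml_schema = training_data_schema.to_yaml()
schema_from_yaml = pa.io.from_yaml(yaml_schema)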
The cleaning function should normalize string values as specified
by training_data_schema.
def clean_data(df):
return (
df.dropna(subset=["dispositions_exclusions"])
.transform_columns(
[
"gender", "race", "cause_of_death",
"symptoms_of_mental_illness", "dispositions_exclusions"
],
lambda x: x.str.lower().str.replace('-|/| ', '_'), # clean string values
elementwise=False
)
.transform_column(
"symptoms_of_mental_illness",
lambda x: x.mask(x.dropna().str.contains("unknown")) != "no", # binarize mental illness
elementwise=False
)
.transform_column(
"dispositions_exclusions",
lambda x: x.str.contains("accident", case=False), # derive target column
"disposition_accidental",
elementwise=False
)
.query("gender != 'white'") # probably a data entry error
.filter_string(
"dispositions_exclusions",
"unreported|unknown|pending|suicide", # filter out unknown, unreported, or suicide cases
complement=True
)
)
@pa.check_input(clean_column_schema)
@pa.check_output(training_data_schema)
def clean_data(df):
return (
df.dropna(subset=["dispositions_exclusions"])
.transform_columns(
[
"gender", "race", "cause_of_death",
"symptoms_of_mental_illness", "dispositions_exclusions"
],
lambda x: x.str.lower().str.replace('-|/| ', '_'), # clean string values
elementwise=False
)
.transform_column(
"symptoms_of_mental_illness",
lambda x: x.mask(x.dropna().str.contains("unknown")) != "no", # binarize mental illness
elementwise=False
)
.transform_column(
"dispositions_exclusions",
lambda x: x.str.contains("accident", case=False), # derive target column
"disposition_accidental",
elementwise=False
)
.query("gender != 'white'") # probably a data entry error
.filter_string(
"dispositions_exclusions",
"unreported|unknown|pending|suicide", # filter out unknown, unreported, or suicide cases
complement=True
)
)
ValueError: Unable to coerce column to schema data type

try:
clean_data(fatal_encounters_clean_columns)
except ValueError as exc:
print(exc)
could not convert string to float: '18 months'
Normalize the age column so that its string values can be converted into floats.
def normalize_age(age):
return (
age.str.replace("s|`", "")
.pipe(normalize_age_range)
.pipe(normalize_age_to_year, "month|mon", 12)
.pipe(normalize_age_to_year, "day", 365)
)
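normalize_age_range and normalize_age_to_year are defined elsewhere in the notebook; here is a hypothetical sketch of what they might look like (illustrative implementations, not the originals):

import numpy as np
import pandas as pd

def normalize_age_range(age: pd.Series) -> pd.Series:
    # hypothetical: collapse ranges like "20-25" to their midpoint
    return age.map(
        lambda x: (
            str(np.mean([float(i) for i in x.split("-")]))
            if isinstance(x, str) and "-" in x
            else x
        )
    )

def normalize_age_to_year(age: pd.Series, pattern: str, periods_per_year: int) -> pd.Series:
    # hypothetical: convert e.g. "18 month" into years (18 / 12 = 1.5)
    has_unit = age.str.contains(pattern, na=False)
    return age.mask(
        has_unit,
        age[has_unit].str.replace(pattern, "", regex=True).str.strip().astype(float)
        / periods_per_year,
    )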
Apply normalize_age inside the clean_data function.

# data validation 🥪
@pa.check_input(clean_column_schema)
@pa.check_output(training_data_schema)
def clean_data(df):
return (
df.dropna(subset=["dispositions_exclusions"])
.transform_columns(
[
"gender", "race", "cause_of_death",
"symptoms_of_mental_illness", "dispositions_exclusions"
],
lambda x: x.str.lower().str.replace('-|/| ', '_'), # clean string values
elementwise=False
)
.transform_column(
"symptoms_of_mental_illness",
lambda x: x.mask(x.dropna().str.contains("unknown")) != "no", # binarize mental illness
elementwise=False
)
.transform_column("age", normalize_age, elementwise=False) # <- clean up age column
.transform_column(
"dispositions_exclusions",
lambda x: x.str.contains("accident", case=False), # derive target column
"disposition_accidental",
elementwise=False
)
.query("gender != 'white'") # probably a data entry error
.filter_string(
"dispositions_exclusions",
"unreported|unknown|pending|suicide", # filter out unknown, unreported, or suicide cases
complement=True
)
)
fatal_encounters_clean = clean_data(fatal_encounters_clean_columns)
with pd.option_context("display.max_rows", 5):
display(fatal_encounters_clean.filter(list(training_data_schema.columns)))
| | age | gender | race | cause_of_death | symptoms_of_mental_illness | disposition_accidental |
|---|---|---|---|---|---|---|
| 4 | 53.0 | male | race_unspecified | gunshot | False | False |
| 8 | 42.0 | female | race_unspecified | vehicle | False | False |
| ... | ... | ... | ... | ... | ... | ... |
| 28236 | 35.0 | male | european_american_white | drowned | False | False |
| 28276 | 26.0 | male | hispanic_latino | gunshot | False | False |
8068 rows × 6 columns
corrupt_data = fatal_encounters_clean.copy()
corrupt_data["gender"].iloc[:50] = "foo"
corrupt_data["gender"].iloc[50:100] = "bar"
try:
training_data_schema(corrupt_data)
except pa.errors.SchemaError as exc:
print(exc)
<Schema Column: 'gender' type=string> failed element-wise validator 0:
<Check _isin: isin({'transgender', 'female', 'male', 'transexual'})>
failure cases:
index count
failure_case
bar [146, 147, 148, 151, 153, 162, 165, 168, 171, ... 50
foo [4, 8, 9, 11, 12, 13, 17, 22, 31, 32, 40, 41, ... 50
The SchemaError exception object contains both the invalid dataframe and the failure cases, which are themselves available as a dataframe.
with pd.option_context("display.max_rows", 5):
try:
training_data_schema(corrupt_data)
except pa.errors.SchemaError as exc:
print("Invalid Data:\n-------------")
print(exc.data.iloc[:, :5])
print("\nFailure Cases:\n--------------")
print(exc.failure_cases)
Invalid Data:
-------------
unique_id name age gender \
4 2 Lester Miller 53.0 foo
8 25752 Doris Murphy 42.0 foo
... ... ... ... ...
28236 28209 Joe Deewayne Cothrum 35.0 male
28276 28358 Jose Santos Parra Juarez 26.0 male
race
4 race_unspecified
8 race_unspecified
... ...
28236 european_american_white
28276 hispanic_latino
[8068 rows x 5 columns]
Failure Cases:
--------------
index failure_case
0 4 foo
1 8 foo
.. ... ...
98 280 bar
99 284 bar
[100 rows x 2 columns]
What percent of cases in the training data are "accidental"?
percent_accidental = fatal_encounters_clean.disposition_accidental.mean()
display(Markdown(f"{percent_accidental * 100:0.02f}%"))
2.74%
Hypothesis: "the disposition_accidental target has a
class balance of ~2.75%"
from pandera import Hypothesis
# use the Column object as a stand-alone schema object
target_schema = Column(
pa.Bool,
name="disposition_accidental",
checks=Hypothesis.one_sample_ttest(
popmean=0.0275, relationship="equal", alpha=0.01
)
)
target_schema(fatal_encounters_clean);
For functions that have tuple/list-like output, specify an integer
index pa.check_output(schema, <int>) to apply the schema to a
specific element in the output.
from sklearn.model_selection import train_test_split
target_schema = pa.SeriesSchema(
pa.Bool,
name="disposition_accidental",
checks=Hypothesis.one_sample_ttest(
popmean=0.0275, relationship="equal", alpha=0.01
)
)
feature_schema = training_data_schema.remove_columns([target_schema.name])
@pa.check_input(training_data_schema)
@pa.check_output(feature_schema, 0)
@pa.check_output(feature_schema, 1)
@pa.check_output(target_schema, 2)
@pa.check_output(target_schema, 3)
def split_training_data(fatal_encounters_clean):
return train_test_split(
fatal_encounters_clean[list(feature_schema.columns)],
fatal_encounters_clean[target_schema.name],
test_size=0.2,
random_state=45,
)
X_train, X_test, y_train, y_test = split_training_data(fatal_encounters_clean)
Import the tools
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
DataFrameSchema -> ColumnTransformer 🤯

Create a transformer to numericalize the features using a schema object.
Checks have a statistics attribute that enables access to the properties
defined in the schema.
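For instance (a small illustration of the statistics attribute; the exact dict shape may vary slightly across pandera versions):

check = Check.isin(["female", "male", "transgender", "transexual"])
check.statistics  # e.g. {'allowed_values': ['female', 'male', 'transgender', 'transexual']}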
def column_transformer_from_schema(feature_schema):
def transformer_from_column(column):
column_schema = feature_schema.columns[column]
if column_schema.pandas_dtype is pa.String:
return make_pipeline(
SimpleImputer(strategy="most_frequent"),
OneHotEncoder(categories=[get_categories(column_schema)])
)
if column_schema.pandas_dtype is pa.Bool:
return SimpleImputer(strategy="median")
# otherwise assume numeric variable
return make_pipeline(
SimpleImputer(strategy="median"),
StandardScaler()
)
return ColumnTransformer([
(column, transformer_from_column(column), [column])
for column in feature_schema.columns
])
def get_categories(column_schema):
for check in column_schema.checks:
if check.name == "isin":
return check.statistics["allowed_values"]
raise ValueError("could not find Check.isin")
transformer = column_transformer_from_schema(feature_schema)
ColumnTransformer(transformers=[('age',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
['age']),
('gender',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(categories=[['female',
'male',
'transgender',
'transexual']]))]),
['gender']),
('race',...
'beaten_bludgeoned_with_instrument',
'burned_smoke_inhalation',
'chemical_agent_pepper_spray',
'drowned',
'drug_overdose',
'fell_from_a_height',
'gunshot',
'medical_emergency',
'other',
'stabbed',
'tasered',
'undetermined',
'unknown',
'vehicle']]))]),
['cause_of_death']),
('symptoms_of_mental_illness',
SimpleImputer(strategy='median'),
['symptoms_of_mental_illness'])])
You can even decorate object methods, specifying the argument name that you want to apply a schema to.
pipeline = Pipeline([
("transformer", transformer),
(
"estimator",
RandomForestClassifier(
class_weight="balanced_subsample",
n_estimators=500,
min_samples_leaf=20,
min_samples_split=10,
max_depth=10,
random_state=100,
)
)
])
fit_fn = pa.check_input(feature_schema, "X")(pipeline.fit)
fit_fn = pa.check_input(target_schema, "y")(fit_fn)
fit_fn(X_train, y_train)
Pipeline(steps=[('transformer',
ColumnTransformer(transformers=[('age',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
['age']),
('gender',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(categories=[['female',
'male',
'transgender',
'transex...
'medical_emergency',
'other',
'stabbed',
'tasered',
'undetermined',
'unknown',
'vehicle']]))]),
['cause_of_death']),
('symptoms_of_mental_illness',
SimpleImputer(strategy='median'),
['symptoms_of_mental_illness'])])),
('estimator',
RandomForestClassifier(class_weight='balanced_subsample',
max_depth=10, min_samples_leaf=20,
min_samples_split=10, n_estimators=500,
random_state=100))])
Use the check_input and check_output decorators to validate the estimator.predict_proba method.
from sklearn.metrics import roc_auc_score, roc_curve
pred_schema = pa.SeriesSchema(pa.Float, Check.in_range(0, 1))
# check that the feature array input to predict_proba method adheres to the feature_schema
predict_fn = pa.check_input(feature_schema)(pipeline.predict_proba)
# check that the prediction array output is a probability.
predict_fn = pa.check_output(pred_schema, lambda x: pd.Series(x))(predict_fn)
yhat_train = pipeline.predict_proba(X_train)[:, 1]
print(f"train ROC AUC: {roc_auc_score(y_train, yhat_train):0.04f}")
yhat_test = pipeline.predict_proba(X_test)[:, 1]
print(f"test ROC AUC: {roc_auc_score(y_test, yhat_test):0.04f}")
train ROC AUC: 0.9207
test ROC AUC: 0.8479
Plot the ROC curves using an in-line schema.
def plot_roc_auc(y_true, y_pred, label, ax=None):
fpr, tpr, _ = roc_curve(y_true, y_pred)
roc_curve_df = pd.DataFrame({"fpr": fpr, "tpr": tpr}).pipe(
pa.DataFrameSchema({
"fpr": Column(pa.Float, Check.in_range(0, 1)),
"tpr": Column(pa.Float, Check.in_range(0, 1)),
})
)
return roc_curve_df.plot.line(x="fpr", y="tpr", label=label, ax=ax)
import matplotlib.pyplot as plt
import seaborn as sns

with sns.axes_style("whitegrid"):
    _, ax = plt.subplots(figsize=(5, 4))
    plot_roc_auc(y_train, yhat_train, "train AUC", ax)
    plot_roc_auc(y_test, yhat_test, "test AUC", ax)
    ax.set_ylabel("true_positive_rate")
    ax.set_xlabel("false_positive_rate")
    ax.plot([0, 1], [0, 1], color="k", linestyle=":")
shap package: https://github.com/slundberg/shap
Create an explainer object. Here we want to check the inputs to the transformer.transform method.
import shap
explainer = shap.TreeExplainer(
pipeline.named_steps["estimator"],
feature_perturbation="tree_path_dependent",
)
transform_fn = pa.check_input(feature_schema)(
pipeline.named_steps["transformer"].transform
)
X_test_array = transform_fn(X_test).toarray()
shap_values = explainer.shap_values(X_test_array, check_additivity=False)
The probability of the case being ruled as accidental ⬆️ if the cause_of_death is vehicle, tasered, asphyxiated_restrained, medical_emergency, or drug_overdose, or race is race_unspecified or native_american_alaskan.
The probability of the case being ruled as accidental ⬇️ if the cause_of_death is gunshot, or race is european_american_white or asian_pacific_islander.
Create a dataframe with {variable} and {variable}_shap as columns
audit_dataframe = (
pd.concat(
[
pd.DataFrame(X_test_array, columns=feature_names),
pd.DataFrame(shap_values[1], columns=[f"{x}_shap" for x in feature_names])
],
axis="columns"
).sort_index(axis="columns")
)
audit_dataframe.head(3)
| | age | age_shap | cause_of_death_asphyxiated_restrained | cause_of_death_asphyxiated_restrained_shap | cause_of_death_beaten_bludgeoned_with_instrument | cause_of_death_beaten_bludgeoned_with_instrument_shap | cause_of_death_burned_smoke_inhalation | cause_of_death_burned_smoke_inhalation_shap | cause_of_death_chemical_agent_pepper_spray | cause_of_death_chemical_agent_pepper_spray_shap | ... | race_hispanic_latino | race_hispanic_latino_shap | race_middle_eastern | race_middle_eastern_shap | race_native_american_alaskan | race_native_american_alaskan_shap | race_race_unspecified | race_race_unspecified_shap | symptoms_of_mental_illness | symptoms_of_mental_illness_shap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.534156 | -0.039886 | 0.0 | -0.007479 | 0.0 | -0.001080 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.003194 | 0.0 | 0.0 | 0.0 | -0.001401 | 1.0 | 0.030105 | 0.0 | -0.007136 |
| 1 | -0.820163 | -0.012115 | 0.0 | -0.008413 | 0.0 | -0.001327 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | -0.002723 | 0.0 | 0.0 | 0.0 | -0.001538 | 0.0 | -0.012061 | 1.0 | -0.011863 |
| 2 | -0.319651 | -0.037459 | 0.0 | -0.008016 | 0.0 | -0.001250 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.002469 | 0.0 | 0.0 | 0.0 | -0.001629 | 0.0 | -0.014678 | 0.0 | -0.009156 |
3 rows × 56 columns
Define a two-sample t-test that tests the relative impact of a variable on the output probability of the model.
def hypothesis_accident_probability(feature, increases=True):
relationship = "greater_than" if increases else "less_than"
return {
feature: Column(checks=Check.isin([1, 0])),
f"{feature}_shap": Column(
pa.Float,
checks=Hypothesis.two_sample_ttest(
sample1=1,
sample2=0,
groupby=feature,
relationship=relationship,
alpha=0.01,
)
),
}
Programmatically construct the schema and validate feature_shap_df.
columns = {}
# increases probability of disposition "accidental"
for column in [
"cause_of_death_vehicle",
"cause_of_death_tasered",
"cause_of_death_asphyxiated_restrained",
"cause_of_death_medical_emergency",
"cause_of_death_drug_overdose",
"race_race_unspecified",
"race_native_american_alaskan",
]:
columns.update(hypothesis_accident_probability(column, increases=True))
# decreases probability of disposition "accidental"
for column in [
"cause_of_death_gunshot",
"race_european_american_white",
"race_asian_pacific_islander",
]:
columns.update(hypothesis_accident_probability(column, increases=False))
model_audit_schema = pa.DataFrameSchema(columns)
try:
model_audit_schema(audit_dataframe)
print("Model audit results pass! ✅")
except pa.errors.SchemaError as exc:
print("Model audit results fail ❌")
print(exc)
Model audit results pass! ✅
Open questions:

- Why might race_unspecified be associated with a higher probability of accidental rulings?
- Could we predict the disposition of unreported, unknown, or pending cases? What would that get us?
- What about race_african_american_black and other variables?

Takeaways:

- pandera schemas are executable contracts that enforce the statistical properties of a dataframe at runtime and can be flexibly interleaved with data processing and analysis logic.
- pandera doesn't automate data exploration or the data validation process. The user is responsible for identifying which parts of the pipeline are critical to test and defining the contracts under which data are considered valid.

Schema Inference
schema = pa.infer_schema(dataframe)
Schema Serialization
pa.io.to_yaml(schema, "schema.yml")
pa.io.to_script(schema, "schema.py")
Define domain-specific schemas, types, and checks, e.g. for machine learning
# validate a regression model dataset
schema = pa.machine_learning.supervised.TabularSchema(
targets={"regression": pa.TargetColumn(type=pa.ml_dtypes.Continuous)},
features={
"continuous": pa.FeatureColumn(type=pa.ml_dtypes.Continuous),
"categorical": pa.FeatureColumn(type=pa.ml_dtypes.Categorical),
"ordinal": pa.FeatureColumn(type=pa.ml_dtypes.Ordinal),
}
)
Generate synthetic data using the schema definition as constraints
dataset = schema.generate_samples(100)
X, y = dataset[schema.features], dataset[schema.targets]
estimator.fit(X, y)
Repo: https://github.com/pandera-dev/pandera
Check and Hypothesis methods)