Pandera: A Data Testing Toolkit for Data Science & Machine Learning¶

Niels Bantilan¶

Pycon Blackrock, November 22nd 2021


Background 🏞¶

  • 📜 B.A. in Biology and Dance
  • 📜 M.P.H. in Sociomedical Science and Public Health Informatics
  • 🤖 Machine Learning Engineer @ Union.ai
  • 🛩 Flytekit OSS Maintainer
  • ✅ Author and Maintainer of Pandera
  • 🛠 Make DS/ML practitioners more productive

Outline 📝¶

  • 🤔 What's Data Testing?
  • ✅ Pandera Quickstart
  • 🚦 Guiding Principles
  • 🏔 Scaling Pandera
  • ⌨️ Conclusion: Statistical Typing
  • 🛣 Future Roadmap

🤔 What's Data Testing?¶

The act of asking the question "are my data as I expect them to be?"

It's the act of writing programs that assert properties about not only the data themselves, but also the functions that produce them.

A Simple Example: Life Before Pandera¶

data_cleaner.py

In [2]:
import pandas as pd

raw_data = pd.DataFrame({
    "continuous": ["-1.1", "4.0", "10.25", "-0.1", "5.2"],
    "categorical": ["A", "B", "C", "Z", "X"],
})

def clean(raw_data):
    # do some cleaning 🧹✨
    clean_data = ...
    return clean_data

test_data_cleaner.py

In [3]:
import pytest

def test_clean():
    # assumptions about valid data
    mock_raw_data = pd.DataFrame({"continuous": ["1.0", "-5.1"], "categorical": ["X", "A"]})
    result = clean(mock_raw_data)
    
    # check that every row contains at least one null
    assert result.isna().any(axis="columns").all()

    # check data types of each column
    assert result["continuous"].dtype == float
    assert result["categorical"].dtype == object
    
    # check that non-null values have expected properties
    assert result["continuous"].dropna().ge(0).all()
    assert result["categorical"].dropna().isin(["A", "B", "C"]).all()
    
    # assumptions about invalid data
    with pytest.raises(KeyError):
        invalid_mock_raw_data = pd.DataFrame({"categorical": ["A"]})
        clean(invalid_mock_raw_data)
    print("tests pass! ✅")

Let's implement the clean function:

In [4]:
def clean(raw_data):
    raw_data = pd.DataFrame(raw_data)
    # do some cleaning 🧹✨
    clean_data = (
        raw_data
        .astype({"continuous": float, "categorical": str})
        .assign(
            continuous=lambda _: _.continuous.mask(_.continuous < 0),
            categorical=lambda _: _.categorical.mask(~_.categorical.isin(["A", "B", "C"]))
        )
    )
    return clean_data

clean(raw_data)
Out[4]:
   continuous categorical
0         NaN           A
1        4.00           B
2       10.25           C
3         NaN         NaN
4        5.20         NaN
In [5]:
test_clean()
tests pass! ✅

✅ Pandera Quickstart¶

An expressive and lightweight statistical validation tool for dataframes

  • Check the types and properties of dataframes
  • Easily integrate with existing data pipelines via function decorators
  • Synthesize data from schema objects for property-based testing
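
Here's a compact sketch of all three features on a toy schema (the schema and the pipeline_step function are illustrative, and the data-synthesis step assumes pandera's hypothesis-backed strategies extra is installed):

import pandas as pd
import pandera as pa

# a toy schema: "value" must be a non-negative float
schema = pa.DataFrameSchema({"value": pa.Column(float, pa.Check.ge(0))})

# 1. check the types and properties of dataframes
schema.validate(pd.DataFrame({"value": [1.0, 2.5]}))

# 2. integrate with existing pipelines via function decorators
@pa.check_output(schema)
def pipeline_step():
    return pd.DataFrame({"value": [0.1, 3.0]})

pipeline_step()

# 3. synthesize valid data from the schema for property-based testing
print(schema.example(size=3))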

Object-based API¶

Defining a schema looks and feels like defining a pandas dataframe

In [6]:
import pandera as pa

clean_data_schema = pa.DataFrameSchema(
    columns={
        "continuous": pa.Column(float, pa.Check.ge(0)),
        "categorical": pa.Column(str, pa.Check.isin(["A", "B", "C"])),
    },
    coerce=True,
)
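
Once defined, the schema object is itself the validation entry point. A quick usage sketch, with illustrative inline data:

# returns the validated (and coerced) dataframe on success,
# raises a SchemaError on failure
clean_data_schema.validate(
    pd.DataFrame({"continuous": [1.0, 2.5], "categorical": ["A", "B"]})
)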

Class-based API¶

Complex Types with Modern Python, Inspired by pydantic and dataclasses

In [7]:
from pandera.typing import Series

class CleanData(pa.SchemaModel):
    continuous: Series[float] = pa.Field(ge=0)
    categorical: Series[str] = pa.Field(isin=["A", "B", "C"])

    class Config:
        coerce = True

Pandera comes in two flavors

Pandera Raises Informative Errors¶

Know Exactly What Went Wrong with Your Data

In [8]:
raw_data = pd.DataFrame({
    "continuous": ["-1.1", "4.0", "10.25", "-0.1", "5.2"],
    "categorical": ["A", "B", "C", "Z", "X"],
})

try:
    CleanData.validate(raw_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)
  schema_context       column                        check  check_number failure_case  index
0         Column   continuous  greater_than_or_equal_to(0)             0         -1.1      0
1         Column   continuous  greater_than_or_equal_to(0)             0         -0.1      3
2         Column  categorical        isin({'A', 'B', 'C'})             0            Z      3
3         Column  categorical        isin({'A', 'B', 'C'})             0            X      4

Meta: This presentation notebook is validated by pandera!¶


🚦 Guiding Principles¶

A Simple Example: Life After Pandera¶

Let's define the types of the dataframes that we expect to see.

Here's data_cleaner.py again:

In [9]:
import pandera as pa
from pandera.typing import DataFrame, Series

class RawData(pa.SchemaModel):
    continuous: Series[float]
    categorical: Series[str]

    class Config:
        coerce = True


class CleanData(RawData):
    continuous = pa.Field(ge=0, nullable=True)
    categorical = pa.Field(isin=[*"ABC"], nullable=True)

Parse, then Validate¶

Pandera guarantees that input and output dataframes fulfill the types and constraints as defined by type annotations

In [10]:
@pa.check_types
def clean(raw_data: DataFrame[RawData]) -> DataFrame[CleanData]:
    return raw_data.assign(
        continuous=lambda _: _.continuous.mask(_.continuous < 0),
        categorical=lambda _: _.categorical.mask(~_.categorical.isin(["A", "B", "C"]))
    )
In [11]:
clean(raw_data)
Out[11]:
   continuous categorical
0         NaN           A
1        4.00           B
2       10.25           C
3         NaN         NaN
4        5.20         NaN

test_data_cleaner.py

In [12]:
def test_clean():
    # assumptions about valid data
    mock_raw_data = pd.DataFrame({"continuous": ["1.0", "-5.1"], "categorical": ["X", "A"]})
    
    # the assertions about the resulting data reduce to a simple execution test!
    clean(mock_raw_data)
    
    # assumptions about invalid data
    with pytest.raises(pa.errors.SchemaError):
        invalid_mock_raw_data = pd.DataFrame({"categorical": ["A"]})
        clean(invalid_mock_raw_data)
    print("tests pass! ✅")
In [13]:
test_clean()
tests pass! ✅

Maximize Reusability and Adaptability¶

Once you've defined a schema, you can import it in other parts of your code base, like your test suite!

In [14]:
# data_cleaner.py
def clean(raw_data: DataFrame[RawData]) -> DataFrame[CleanData]:
    return raw_data.assign(
        continuous=lambda _: _.continuous.mask(_.continuous < 0),
        categorical=lambda _: _.categorical.mask(~_.categorical.isin(["A", "B", "C"]))
    )

# test_data_cleaner.py
def test_clean():
    # assumptions about valid data
    mock_raw_data = RawData(pd.DataFrame({"continuous": ["1.0", "-5.1"], "categorical": ["X", "A"]}))
    
    # the assertions about the resulting data reduce to a simple execution test!
    CleanData(clean(mock_raw_data))
    
    # assumptions about invalid data
    with pytest.raises(pa.errors.SchemaError):
        invalid_mock_raw_data = RawData(pd.DataFrame({"categorical": ["A"]}))
        clean(invalid_mock_raw_data)
    print("tests pass! ✅")
    
test_clean()
tests pass! ✅

You can even represent dataframe joins!

In [15]:
class CleanData(RawData):
    continuous = pa.Field(ge=0, nullable=True)
    categorical = pa.Field(isin=[*"ABC"], nullable=True)
    
class SupplementaryData(pa.SchemaModel):
    discrete: Series[int] = pa.Field(ge=0, nullable=True)
        
class JoinedData(CleanData, SupplementaryData): pass


clean_data = pd.DataFrame({"continuous": ["1.0"], "categorical": ["A"]})
supplementary_data = pd.DataFrame({"discrete": [1]})
JoinedData(clean_data.join(supplementary_data))
Out[15]:
   continuous categorical  discrete
0         1.0           A         1

Bootstrap and Interoperate¶

Infer a schema definition from reference data¶
In [16]:
clean_data = pd.DataFrame({
    "continuous": range(100),
    "categorical": [*"ABCAB" * 20]
})

schema = pa.infer_schema(clean_data)
print(schema)
<Schema DataFrameSchema(
    columns={
        'continuous': <Schema Column(name=continuous, type=DataType(int64))>
        'categorical': <Schema Column(name=categorical, type=DataType(object))>
    },
    checks=[],
    coerce=True,
    dtype=None,
    index=<Schema Index(name=None, type=DataType(int64))>,
    strict=False
    name=None,
    ordered=False
)>
Write it to/from a yaml file¶
In [17]:
yaml_schema = schema.to_yaml()
print(yaml_schema)
schema_type: dataframe
version: 0.8.0
columns:
  continuous:
    dtype: int64
    nullable: false
    checks:
      greater_than_or_equal_to: 0.0
      less_than_or_equal_to: 99.0
    unique: false
    coerce: false
    required: true
    regex: false
  categorical:
    dtype: object
    nullable: false
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
checks: null
index:
- dtype: int64
  nullable: false
  checks:
    greater_than_or_equal_to: 0.0
    less_than_or_equal_to: 99.0
  name: null
  coerce: false
coerce: true
strict: false
unique: null

In [18]:
print(schema.from_yaml(yaml_schema))
<Schema DataFrameSchema(
    columns={
        'continuous': <Schema Column(name=continuous, type=DataType(int64))>
        'categorical': <Schema Column(name=categorical, type=DataType(object))>
    },
    checks=[],
    coerce=True,
    dtype=None,
    index=<Schema Index(name=None, type=DataType(int64))>,
    strict=False
    name=None,
    ordered=False
)>
Write it to a python script for further refinement using schema.to_script()¶
from pandera import DataFrameSchema, Column, Check, Index, MultiIndex

schema = DataFrameSchema(
    columns={
        "continuous": Column(
            dtype=pandera.engines.numpy_engine.Int64,
            checks=[
                Check.greater_than_or_equal_to(min_value=0.0),
                Check.less_than_or_equal_to(max_value=99.0),
            ],
            nullable=False,
            unique=False,
            coerce=False,
            required=True,
            regex=False,
        ),
        "categorical": Column(
            dtype=pandera.engines.numpy_engine.Object,
            checks=None,
            nullable=False,
            unique=False,
            coerce=False,
            required=True,
            regex=False,
        ),
    },
    index=Index(
        dtype=pandera.engines.numpy_engine.Int64,
        checks=[
            Check.greater_than_or_equal_to(min_value=0.0),
            Check.less_than_or_equal_to(max_value=99.0),
        ],
        nullable=False,
        coerce=False,
        name=None,
    ),
    coerce=True,
    strict=False,
    name=None,
)
Port schema from a frictionless table schema¶
In [20]:
from pandera.io import from_frictionless_schema

frictionless_schema = {
    "fields": [
        {
            "name": "continuous",
            "type": "number",
            "constraints": {"minimum": 0}
        },
        {
            "name": "categorical",
            "type": "string",
            "constraints": {"isin": ["A", "B", "C"]}
        },
    ],
}
schema = from_frictionless_schema(frictionless_schema)
print(schema)
<Schema DataFrameSchema(
    columns={
        'continuous': <Schema Column(name=continuous, type=DataType(float64))>
        'categorical': <Schema Column(name=categorical, type=DataType(string[python]))>
    },
    checks=[],
    coerce=True,
    dtype=None,
    index=None,
    strict=True
    name=None,
    ordered=False
)>

Facilitate Property-based Testing with Generative Schemas¶

Generate valid examples under the schema's constraints

In [21]:
RawData.example(size=3)
Out[21]:
   continuous categorical
0         0.0
1         0.0
2         0.0

(The blank categorical values are empty strings, the minimal valid example for an unconstrained str field.)
In [22]:
CleanData.example(size=3)
Out[22]:
continuous categorical
0 0.0 A
1 0.0 A
2 0.0 A
In [23]:
# Transform your unit test suite!

from hypothesis import given


@pa.check_types
def clean(raw_data: DataFrame[RawData]) -> DataFrame[CleanData]:
    return raw_data.assign(
        continuous=lambda _: _.continuous.mask(_.continuous < 0),
        categorical=lambda _: _.categorical.mask(~_.categorical.isin(["A", "B", "C"]))
    )

@given(RawData.strategy(size=5))
def test_clean(mock_raw_data):
    clean(mock_raw_data)
    
    
class InvalidData(pa.SchemaModel):
    foo: Series[int]
    

@given(InvalidData.strategy(size=5))
def test_clean_errors(mock_invalid_data):
    with pytest.raises(pa.errors.SchemaError):
        clean(mock_invalid_data)
    

def run_test_suite():
    test_clean()
    test_clean_errors()
    print("tests pass! ✅")
    
    
run_test_suite()
tests pass! ✅

🏔 Scaling Pandera¶

🚧 beta 🏗¶

In 0.8.0, pandera supports dask, modin, and koalas dataframes to scale data validation to big data.

In [24]:
display(raw_data)
  continuous categorical
0       -1.1           A
1        4.0           B
2      10.25           C
3       -0.1           Z
4        5.2           X

Dask¶

In [25]:
import dask.dataframe as dd

dask_dataframe = dd.from_pandas(raw_data, npartitions=1)

try:
    CleanData(dask_dataframe, lazy=True).compute()
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)
  schema_context       column                        check  check_number  \
0         Column   continuous  greater_than_or_equal_to(0)             0   
1         Column   continuous  greater_than_or_equal_to(0)             0   
2         Column  categorical        isin({'A', 'B', 'C'})             0   
3         Column  categorical        isin({'A', 'B', 'C'})             0   

  failure_case  index  
0         -1.1      0  
1         -0.1      3  
2            Z      3  
3            X      4  

Koalas¶

In [26]:
import databricks.koalas as ks

koalas_dataframe = ks.DataFrame(raw_data)

try:
    CleanData(koalas_dataframe, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)
  schema_context       column                        check  check_number  \
0         Column   continuous  greater_than_or_equal_to(0)             0   
1         Column  categorical        isin({'A', 'B', 'C'})             0   

                                        failure_case index  
0  AnalysisException("Resolved attribute(s) conti...  None  
1  AnalysisException("Resolved attribute(s) categ...  None  

Modin¶

In [27]:
import modin.pandas as mpd

modin_dataframe = mpd.DataFrame(raw_data)

try:
    CleanData(modin_dataframe, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)
  schema_context       column                        check  check_number  \
0         Column   continuous  greater_than_or_equal_to(0)             0   
1         Column   continuous  greater_than_or_equal_to(0)             0   
2         Column  categorical        isin({'A', 'B', 'C'})             0   
3         Column  categorical        isin({'A', 'B', 'C'})             0   

  failure_case  index  
0         -1.1      0  
1         -0.1      3  
2            Z      3  
3            X      4  

⌨️ Statistical Typing¶

Type systems help programmers reason about and write more robust code¶

In [28]:
from typing import Union

Number = Union[int, float]

def add_and_double(x: Number, y: Number) -> Number:
    ...

Can you predict the outcome of these function calls?¶

In [29]:
add_and_double(5, 2)
add_and_double(5, "hello")
add_and_double(11.5, -1.5)
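
The annotations alone make the answers predictable, even though the body is elided above. One plausible implementation (an assumption, since the talk never defines it):

def add_and_double(x: Number, y: Number) -> Number:
    return (x + y) * 2

add_and_double(5, 2)        # 14: both arguments are valid Numbers
add_and_double(11.5, -1.5)  # 20.0: also valid
# add_and_double(5, "hello") raises a TypeError at runtime;
# a static type checker like mypy flags it before the code ever runs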

Similarly...¶

In [30]:
import pandera as pa
from pandera.typing import DataFrame, Series

class Inputs(pa.SchemaModel):
    x: Series[int]
    y: Series[int]

    class Config:
        coerce = True


class Outputs(Inputs):
    z: Series[int]
        
    @pa.dataframe_check
    def custom_check(cls, df: DataFrame) -> Series:
        return df["z"] == (df["x"] + df["y"]) * 2
    
    
@pa.check_types
def add_and_double(raw_data: DataFrame[Inputs]) -> DataFrame[Outputs]:
    ...
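
The body is elided here too; a minimal sketch of an implementation that satisfies the Outputs schema, including its dataframe-level check (the implementation is an assumption):

@pa.check_types
def add_and_double(raw_data: DataFrame[Inputs]) -> DataFrame[Outputs]:
    # z = (x + y) * 2 satisfies custom_check by construction
    return raw_data.assign(z=lambda df: (df.x + df.y) * 2)

# the input is validated against Inputs, the output against Outputs
add_and_double(pd.DataFrame({"x": [1, 2], "y": [3, 4]}))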

Pandera is a Type System Geared Towards Data Science and Machine Learning¶

It provides a flexible and expressive API for defining types for dataframes.

This enables a more intuitive way of validating not only data, but also the functions that produce those data.

🛣 Future Roadmap¶

  • Add support for other dataframe-like libraries, e.g. pyarrow, xarray, polars, vaex, cudf, etc.
  • Improve interoperability and integrations with data ecosystem, e.g. pandas-profiling, fastapi, json-schema
  • Implement CLI and pytest-pandera plugin for data pipeline profiling and reporting

Where to Learn More¶

  • PyCon [2021] - Statistical Typing: A Runtime Typing System for Data Science and Machine Learning
    • video: https://youtu.be/PI5UmKi14cM
  • SciPy [2020] - Statistical Data Validation of Pandas Dataframes
    • video: https://youtu.be/PxTLD-ueNd4
    • talk: https://conference.scipy.org/proceedings/scipy2020/pdfs/niels_bantilan.pdf
  • Pandera Blog [2020]: https://blog.pandera.ci/statistical%20typing/unit%20testing/2020/12/26/statistical-typing.html
  • PyOpenSci Blog [2019]: https://www.pyopensci.org/blog/pandera-python-pandas-dataframe-validation
  • Personal Blog [2018]: https://cosmicbboy.github.io/2018/12/28/validating-pandas-dataframes.html

Join the Community!¶


  • Twitter: @cosmicbboy
  • Discord: https://discord.gg/vyanhWuaKB
  • Email: niels.bantilan@gmail.com
  • Repo: https://github.com/pandera-dev/pandera
  • Docs: https://pandera.readthedocs.io
  • Contributing Guide: https://pandera.readthedocs.io/en/stable/CONTRIBUTING.html