Niels Bantilan, Chief ML Engineer @ Union.ai
SciPy 2023, July 12th 2023
DataFrame?

dataframe.head()
| person_id | hours_worked | wage_per_hour |
| --- | --- | --- |
| 1014c40 | 38.5 | 15.1 |
| 9bdac94 | 41.25 | 15.0 |
| 4331eb8 | 35.0 | 21.3 |
| ee7af68 | 27.75 | 17.5 |
| d188ff6 | 22.25 | 19.5 |
Data validation is the act of falsifying data against explicit assumptions for some downstream purpose, like analysis, modeling, and visualization.
"All swans are white"
/usr/local/miniconda3/envs/pandera-presentations/lib/python3.7/site-packages/pandas/core/ops/__init__.py in masked_arith_op(x, y, op)
445 if mask.any():
446 with np.errstate(all="ignore"):
--> 447 result[mask] = op(xrav[mask], com.values_from_object(yrav[mask]))
448
449 else:
TypeError: can't multiply sequence by non-int of type 'float'
def process_data(df):
...
def process_data(df):
return df.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
def process_data(df):
import pdb; pdb.set_trace() # <- insert breakpoint
return df.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
print(df)
           hours_worked  wage_per_hour
person_id
1014c40            38.5           15.1
9bdac94           41.25           15.0
4331eb8            35.0           21.3
ee7af68           27.75           17.5
d188ff6           22.25           19.5
a0c4b6e           -20.5           25.5
df.dtypes
hours_worked      object
wage_per_hour    float64
dtype: object
df.hours_worked.map(type)
person_id
1014c40    <class 'float'>
9bdac94    <class 'float'>
4331eb8      <class 'str'>
ee7af68    <class 'float'>
d188ff6    <class 'float'>
a0c4b6e    <class 'float'>
Name: hours_worked, dtype: object
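A single stray string is enough to silently upcast the whole column to object dtype, which is exactly how the TypeError above sneaks in. A minimal sketch of the failure mode, using made-up values:

import pandas as pd

s = pd.Series([38.5, 41.25, "35.0"])  # one string among floats
print(s.dtype)  # object, not float64
s * 15.1  # TypeError: can't multiply sequence by non-int of type 'float'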
import numpy as np

def process_data(df):
return (
df
# make sure columns are floats
.astype({"hours_worked": float, "wage_per_hour": float})
# replace negative values with nans
.assign(hours_worked=lambda x: x.hours_worked.where(x.hours_worked >= 0, np.nan))
# compute weekly income
.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
)
process_data(df)
| person_id | hours_worked | wage_per_hour | weekly_income |
| --- | --- | --- | --- |
| 1014c40 | 38.50 | 15.1 | 581.350 |
| 9bdac94 | 41.25 | 15.0 | 618.750 |
| 4331eb8 | 35.00 | 21.3 | 745.500 |
| ee7af68 | 27.75 | 17.5 | 485.625 |
| d188ff6 | 22.25 | 19.5 | 433.875 |
| a0c4b6e | NaN | 25.5 | NaN |
@pa.check_types
def process_data(df: DataFrame[RawData]) -> DataFrame[ProcessedData]:
return (
# replace negative values with nans
df.assign(hours_worked=lambda x: x.hours_worked.where(x.hours_worked >= 0, np.nan))
# compute weekly income
.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
)
You look up what RawData and ProcessedData are, finding a NOTE that a fellow traveler has left for you.

import pandera as pa
# NOTE: this is what's supposed to be in `df` going into `process_data`
class RawData(pa.SchemaModel):
hours_worked: float = pa.Field(coerce=True, nullable=True)
wage_per_hour: float = pa.Field(coerce=True, nullable=True)
# ... and this is what `process_data` is supposed to return.
class ProcessedData(RawData):
hours_worked: float = pa.Field(ge=0, coerce=True, nullable=True)
weekly_income: float = pa.Field(nullable=True)
@pa.check_types
def process_data(df: DataFrame[RawData]) -> DataFrame[ProcessedData]:
...
The better you can reason about the contents of a dataframe, the faster you can debug.

The faster you can debug, the sooner you can focus on downstream tasks that you care about.

By validating data through explicit contracts, you also create data documentation and a simple, stateless data shift detector.
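To make the "stateless shift detector" concrete: the same contract that documents your assumptions will reject a new batch whose values drift outside them, with no reference dataset or stored statistics required. A minimal sketch with a hypothetical bound:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    # documented assumption: weekly hours are non-negative and bounded
    "hours_worked": pa.Column(float, pa.Check.in_range(0, 24 * 7)),
})

new_batch = pd.DataFrame({"hours_worked": [38.5, -20.5]})
try:
    schema.validate(new_batch)
except pa.errors.SchemaError as exc:
    print(exc)  # the shifted batch fails loudly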
From SciPy 2020 - pandera: Statistical Data Validation of Pandas Dataframes
The pandera programming model is an iterative loop of building statistical domain knowledge, implementing data transforms and schemas, and verifying data.
Data validation: The act of falsifying data against explicit assumptions for some downstream purpose, like analysis, modeling, and visualization.
Data Testing: Validating not only real data, but also the functions that produce them.
Defining a schema looks and feels like defining a pandas dataframe
import pandera as pa
clean_data_schema = pa.DataFrameSchema(
columns={
"continuous": pa.Column(float, pa.Check.ge(0), nullable=True),
"categorical": pa.Column(str, pa.Check.isin(["A", "B", "C"]), nullable=True),
},
coerce=True,
)
from pandera.typing import DataFrame, Series
class CleanData(pa.SchemaModel):
continuous: Series[float] = pa.Field(ge=0, nullable=True)
categorical: Series[str] = pa.Field(isin=["A", "B", "C"], nullable=True)
class Config:
coerce = True
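Both flavors validate the same way: the object-based schema exposes a validate method (and is itself callable), and the class-based model has a validate classmethod. A minimal usage sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    "continuous": [1.0, 2.5, 3.0],
    "categorical": ["A", "B", "C"],
})

clean_data_schema.validate(df)  # object-based API
CleanData.validate(df)          # class-based API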
Know Exactly What Went Wrong with Your Data
import pandas as pd

raw_data = pd.DataFrame({
"continuous": ["-1.1", "4.0", "10.25", "-0.1", "5.2"],
"categorical": ["A", "B", "C", "Z", "X"],
})
try:
CleanData.validate(raw_data, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases)
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
raw_data_schema = pa.DataFrameSchema(
columns={
"continuous": pa.Column(float),
"categorical": pa.Column(str),
},
coerce=True,
)
clean_data_schema.update_columns({
"continuous": {"nullable": True},
"categorical": {"checks": pa.Check.isin(["A", "B", "C"]), "nullable": True},
});
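Note that update_columns returns a new schema rather than mutating in place, so a cleaned-up contract can be derived from the raw one and bound to a name. A sketch of that pattern, reusing the checks above:

clean_data_schema = raw_data_schema.update_columns({
    "continuous": {"checks": pa.Check.ge(0), "nullable": True},
    "categorical": {"checks": pa.Check.isin(["A", "B", "C"]), "nullable": True},
})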
Inherit from pandera.SchemaModel to Define Type Hierarchies
class RawData(pa.SchemaModel):
continuous: Series[float]
categorical: Series[str]
class Config:
coerce = True
class CleanData(RawData):
continuous = pa.Field(ge=0, nullable=True)
categorical = pa.Field(isin=["A", "B", "C"], nullable=True);
Use decorators to add IO checkpoints to the critical functions in your pipeline
@pa.check_types
def fn(raw_data: DataFrame[RawData]) -> DataFrame[CleanData]:
return raw_data.assign(
        continuous=lambda df: df["continuous"].where(lambda x: x >= 0, np.nan),
categorical=lambda df: df["categorical"].where(lambda x: x.isin(["A", "B", "C"]), np.nan),
)
fn(raw_data)
| | continuous | categorical |
| --- | --- | --- |
| 0 | NaN | A |
| 1 | 4.00 | B |
| 2 | 10.25 | C |
| 3 | NaN | NaN |
| 4 | 5.20 | NaN |
Schemas that synthesize valid data under their constraints
CleanData.example(size=5)
| | continuous | categorical |
| --- | --- | --- |
| 0 | NaN | A |
| 1 | NaN | A |
| 2 | NaN | C |
| 3 | 4.501643e+15 | NaN |
| 4 | NaN | C |
Data Testing: Test the functions that produce clean data
from hypothesis import given
@given(RawData.strategy(size=5))
def test_fn(raw_data):
fn(raw_data)
def run_test_suite():
test_fn()
print("tests passed ✅")
run_test_suite()
tests passed ✅
- 📖 Documentation Improvements
- 🔤 Class-based API
- 📊 Data Synthesis Strategies
- ⌨️ Pandera Type System
Adding geopandas, dask, modin, and pyspark.pandas was relatively straightforward.
display(raw_data)
| | continuous | categorical |
| --- | --- | --- |
| 0 | -1.1 | A |
| 1 | 4.0 | B |
| 2 | 10.25 | C |
| 3 | -0.1 | Z |
| 4 | 5.2 | X |
dask

import dask.dataframe as dd
dask_dataframe = dd.from_pandas(raw_data, npartitions=1)
try:
CleanData(dask_dataframe, lazy=True).compute()
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases.sort_index())
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
modin

import modin.pandas as mpd
modin_dataframe = mpd.DataFrame(raw_data)
try:
CleanData(modin_dataframe, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases.sort_index())
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
pyspark.pandas

import pyspark.pandas as ps
pyspark_pd_dataframe = ps.DataFrame(raw_data)
try:
CleanData(pyspark_pd_dataframe, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases)
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
High-level approach: decoupling schema specification from backend
- pandera.api subpackage, which contains the schema specification that defines the properties of an underlying data structure.
- pandera.backends subpackage, which leverages the schema specification and implements the actual validation logic.
- Check namespace and registry, which registers type-specific implementations of built-in checks and allows contributors to easily add new built-in checks.

import sloth as sl
from pandera.api.base.schema import BaseSchema
from pandera.backends.base import BaseSchemaBackend

class DataFrameSchema(BaseSchema):
    def __init__(self, **kwargs):
        # add properties that this dataframe would contain
        ...

class DataFrameSchemaBackend(BaseSchemaBackend):
    def validate(
        self,
        check_obj: sl.DataFrame,
        schema: DataFrameSchema,
        **kwargs,
    ):
        # implement custom validation logic
        ...

# register the backend
DataFrameSchema.register_backend(
    sl.DataFrame,
    DataFrameSchemaBackend,
)
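The check registry is already user-facing on the pandas backend: pandera.extensions.register_check_method registers a custom check so it's usable like a built-in. A sketch with a hypothetical check name:

import pandera as pa
import pandera.extensions as extensions

@extensions.register_check_method(statistics=["max_value"])
def not_exceeding(pandas_obj, *, max_value):
    # vectorized check: returns a boolean Series/DataFrame
    return pandas_obj <= max_value

# the registered method is now available in the Check namespace
schema = pa.DataFrameSchema({
    "hours_worked": pa.Column(float, pa.Check.not_exceeding(max_value=168)),
})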
pyspark.sql.DataFrame in 0.16.0b!

https://pandera.readthedocs.io/en/latest/
import pandera.pyspark as pa
import pyspark.sql.types as T
from decimal import Decimal
from pyspark.sql import DataFrame
from pandera.pyspark import DataFrameModel
class PanderaSchema(DataFrameModel):
id: T.IntegerType() = pa.Field(gt=5)
product_name: T.StringType() = pa.Field(str_startswith="B")
price: T.DecimalType(20, 5) = pa.Field()
description: T.ArrayType(T.StringType()) = pa.Field()
meta: T.MapType(T.StringType(), T.StringType()) = pa.Field()
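Unlike the pandas-backed integrations, validation on native Spark DataFrames is non-blocking: errors accumulate on the returned object instead of raising. A minimal usage sketch, assuming an active Spark session and a df matching the schema above:

df_out = PanderaSchema.validate(check_obj=df)
print(df_out.pandera.errors)  # dict of any schema/data errors collected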
Additional approaches to put into practice in the future:
- Thoughtful design work.
- Library-independent error reporting.
- Decoupling metadata from data.
- Investing in governance and community.
What does this mean?
- pydantic v2
- pytest: collect data coverage statistics
- hypothesis: faster data synthesis