Niels Bantilan, Chief ML Engineer @ Union.ai
SciPy 2023, July 12th 2023
DataFrame?

dataframe.head()
| person_id | hours_worked | wage_per_hour |
| --- | --- | --- |
| 1014c40 | 38.5 | 15.1 |
| 9bdac94 | 41.25 | 15.0 |
| 4331eb8 | 35.0 | 21.3 |
| ee7af68 | 27.75 | 17.5 |
| d188ff6 | 22.25 | 19.5 |
Data validation is the act of falsifying data against explicit assumptions for some downstream purpose, like analysis, modeling, and visualization.
"All swans are white"
/usr/local/miniconda3/envs/pandera-presentations/lib/python3.7/site-packages/pandas/core/ops/__init__.py in masked_arith_op(x, y, op)
445 if mask.any():
446 with np.errstate(all="ignore"):
--> 447 result[mask] = op(xrav[mask], com.values_from_object(yrav[mask]))
448
449 else:
TypeError: can't multiply sequence by non-int of type 'float'
def process_data(df):
...
def process_data(df):
return df.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
def process_data(df):
import pdb; pdb.set_trace() # <- insert breakpoint
return df.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
print(df)
           hours_worked  wage_per_hour
person_id
1014c40            38.5           15.1
9bdac94           41.25           15.0
4331eb8            35.0           21.3
ee7af68           27.75           17.5
d188ff6           22.25           19.5
a0c4b6e           -20.5           25.5
df.dtypes
hours_worked      object
wage_per_hour    float64
dtype: object
df.hours_worked.map(type)
person_id
1014c40    <class 'float'>
9bdac94    <class 'float'>
4331eb8      <class 'str'>
ee7af68    <class 'float'>
d188ff6    <class 'float'>
a0c4b6e    <class 'float'>
Name: hours_worked, dtype: object
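A single stray string is enough to silently upcast the whole column to object dtype, which is exactly how the TypeError above sneaks in. A minimal sketch of the failure mode, using made-up values:

import pandas as pd

s = pd.Series([38.5, 41.25, "35.0"])  # one string among floats
print(s.dtype)  # object, not float64
s * 15.1  # TypeError: can't multiply sequence by non-int of type 'float'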
import numpy as np

def process_data(df):
return (
df
# make sure columns are floats
.astype({"hours_worked": float, "wage_per_hour": float})
# replace negative values with nans
.assign(hours_worked=lambda x: x.hours_worked.where(x.hours_worked >= 0, np.nan))
# compute weekly income
.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
)
process_data(df)
| person_id | hours_worked | wage_per_hour | weekly_income |
| --- | --- | --- | --- |
| 1014c40 | 38.50 | 15.1 | 581.350 |
| 9bdac94 | 41.25 | 15.0 | 618.750 |
| 4331eb8 | 35.00 | 21.3 | 745.500 |
| ee7af68 | 27.75 | 17.5 | 485.625 |
| d188ff6 | 22.25 | 19.5 | 433.875 |
| a0c4b6e | NaN | 25.5 | NaN |
@pa.check_types
def process_data(df: DataFrame[RawData]) -> DataFrame[ProcessedData]:
return (
# replace negative values with nans
df.assign(hours_worked=lambda x: x.hours_worked.where(x.hours_worked >= 0, np.nan))
# compute weekly income
.assign(weekly_income=lambda x: x.hours_worked * x.wage_per_hour)
)
You look up what RawData and ProcessedData are, finding a NOTE that a fellow traveler has left for you.

import pandera as pa
# NOTE: this is what's supposed to be in `df` going into `process_data`
class RawData(pa.SchemaModel):
hours_worked: float = pa.Field(coerce=True, nullable=True)
wage_per_hour: float = pa.Field(coerce=True, nullable=True)
# ... and this is what `process_data` is supposed to return.
class ProcessedData(RawData):
hours_worked: float = pa.Field(ge=0, coerce=True, nullable=True)
weekly_income: float = pa.Field(nullable=True)
@pa.check_types
def process_data(df: DataFrame[RawData]) -> DataFrame[ProcessedData]:
...
The better you can reason about the contents of a dataframe, the faster you can debug.

The faster you can debug, the sooner you can focus on downstream tasks that you care about.

By validating data through explicit contracts, you also create data documentation and a simple, stateless data shift detector.
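To make the "stateless shift detector" concrete: the same contract that documents your assumptions will reject a new batch whose values drift outside them, with no reference dataset or stored statistics required. A minimal sketch with a hypothetical bound:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    # documented assumption: weekly hours are non-negative and bounded
    "hours_worked": pa.Column(float, pa.Check.in_range(0, 24 * 7)),
})

new_batch = pd.DataFrame({"hours_worked": [38.5, -20.5]})
try:
    schema.validate(new_batch)
except pa.errors.SchemaError as exc:
    print(exc)  # the shifted batch fails loudly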
From SciPy 2020 - pandera: Statistical Data Validation of Pandas Dataframes
The pandera programming model is an iterative loop of building statistical domain knowledge, implementing data transforms and schemas, and verifying data.
Data validation: The act of falsifying data against explicit assumptions for some downstream purpose, like analysis, modeling, and visualization.
Data Testing: Validating not only real data, but also the functions that produce them.
Defining a schema looks and feels like defining a pandas dataframe
import pandera as pa
clean_data_schema = pa.DataFrameSchema(
columns={
"continuous": pa.Column(float, pa.Check.ge(0), nullable=True),
"categorical": pa.Column(str, pa.Check.isin(["A", "B", "C"]), nullable=True),
},
coerce=True,
)
from pandera.typing import DataFrame, Series
class CleanData(pa.SchemaModel):
continuous: Series[float] = pa.Field(ge=0, nullable=True)
categorical: Series[str] = pa.Field(isin=["A", "B", "C"], nullable=True)
class Config:
coerce = True
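Both flavors validate the same way: the object-based schema exposes a validate method (and is itself callable), and the class-based model has a validate classmethod. A minimal usage sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    "continuous": [1.0, 2.5, 3.0],
    "categorical": ["A", "B", "C"],
})

clean_data_schema.validate(df)  # object-based API
CleanData.validate(df)          # class-based API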
Know Exactly What Went Wrong with Your Data
import pandas as pd

raw_data = pd.DataFrame({
"continuous": ["-1.1", "4.0", "10.25", "-0.1", "5.2"],
"categorical": ["A", "B", "C", "Z", "X"],
})
try:
CleanData.validate(raw_data, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases)
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
raw_data_schema = pa.DataFrameSchema(
columns={
"continuous": pa.Column(float),
"categorical": pa.Column(str),
},
coerce=True,
)
clean_data_schema.update_columns({
"continuous": {"nullable": True},
"categorical": {"checks": pa.Check.isin(["A", "B", "C"]), "nullable": True},
});
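Note that update_columns returns a new schema rather than mutating in place, so a cleaned-up contract can be derived from the raw one and bound to a name. A sketch of that pattern, reusing the checks above:

clean_data_schema = raw_data_schema.update_columns({
    "continuous": {"checks": pa.Check.ge(0), "nullable": True},
    "categorical": {"checks": pa.Check.isin(["A", "B", "C"]), "nullable": True},
})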
Inherit from pandera.SchemaModel to Define Type Hierarchies
class RawData(pa.SchemaModel):
continuous: Series[float]
categorical: Series[str]
class Config:
coerce = True
class CleanData(RawData):
continuous = pa.Field(ge=0, nullable=True)
categorical = pa.Field(isin=["A", "B", "C"], nullable=True);
Use decorators to add IO checkpoints to the critical functions in your pipeline
@pa.check_types
def fn(raw_data: DataFrame[RawData]) -> DataFrame[CleanData]:
return raw_data.assign(
        continuous=lambda df: df["continuous"].where(lambda x: x >= 0, np.nan),
categorical=lambda df: df["categorical"].where(lambda x: x.isin(["A", "B", "C"]), np.nan),
)
fn(raw_data)
| | continuous | categorical |
| --- | --- | --- |
| 0 | NaN | A |
| 1 | 4.00 | B |
| 2 | 10.25 | C |
| 3 | NaN | NaN |
| 4 | 5.20 | NaN |
Schemas that synthesize valid data under their constraints
CleanData.example(size=5)
| | continuous | categorical |
| --- | --- | --- |
| 0 | NaN | A |
| 1 | NaN | A |
| 2 | NaN | C |
| 3 | 4.501643e+15 | NaN |
| 4 | NaN | C |
Data Testing: Test the functions that produce clean data
from hypothesis import given
@given(RawData.strategy(size=5))
def test_fn(raw_data):
fn(raw_data)
def run_test_suite():
test_fn()
print("tests passed ✅")
run_test_suite()
tests passed ✅
- 📖 Documentation Improvements
- 🔤 Class-based API
- 📊 Data Synthesis Strategies
- ⌨️ Pandera Type System
Adding geopandas, dask, modin, and pyspark.pandas was relatively straightforward.
display(raw_data)
| | continuous | categorical |
| --- | --- | --- |
| 0 | -1.1 | A |
| 1 | 4.0 | B |
| 2 | 10.25 | C |
| 3 | -0.1 | Z |
| 4 | 5.2 | X |
dask

import dask.dataframe as dd
dask_dataframe = dd.from_pandas(raw_data, npartitions=1)
try:
CleanData(dask_dataframe, lazy=True).compute()
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases.sort_index())
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
modin

import modin.pandas as mpd
modin_dataframe = mpd.DataFrame(raw_data)
try:
CleanData(modin_dataframe, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases.sort_index())
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
pyspark.pandas

import pyspark.pandas as ps
pyspark_pd_dataframe = ps.DataFrame(raw_data)
try:
CleanData(pyspark_pd_dataframe, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases)
| | schema_context | column | check | check_number | failure_case | index |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Column | continuous | greater_than_or_equal_to(0) | 0 | -1.1 | 0 |
| 1 | Column | continuous | greater_than_or_equal_to(0) | 0 | -0.1 | 3 |
| 2 | Column | categorical | isin(['A', 'B', 'C']) | 0 | Z | 3 |
| 3 | Column | categorical | isin(['A', 'B', 'C']) | 0 | X | 4 |
High-level approach: decoupling schema specification from backend
- pandera.api subpackage, which contains the schema specification that defines the properties of an underlying data structure.
- pandera.backends subpackage, which leverages the schema specification and implements the actual validation logic.
- Check namespace and registry, which registers type-specific implementations of built-in checks and allows contributors to easily add new built-in checks.

import sloth as sl
from pandera.api.base.schema import BaseSchema
from pandera.backends.base import BaseSchemaBackend

class DataFrameSchema(BaseSchema):
    def __init__(self, **kwargs):
        # add properties that this dataframe would contain
        ...

class DataFrameSchemaBackend(BaseSchemaBackend):
    def validate(
        self,
        check_obj: sl.DataFrame,
        schema: DataFrameSchema,
        **kwargs,
    ):
        # implement custom validation logic
        ...

# register the backend
DataFrameSchema.register_backend(
    sl.DataFrame,
    DataFrameSchemaBackend,
)
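The check registry is already user-facing on the pandas backend: pandera.extensions.register_check_method registers a custom check so it's usable like a built-in. A sketch with a hypothetical check name:

import pandera as pa
import pandera.extensions as extensions

@extensions.register_check_method(statistics=["max_value"])
def not_exceeding(pandas_obj, *, max_value):
    # vectorized check: returns a boolean Series/DataFrame
    return pandas_obj <= max_value

# the registered method is now available in the Check namespace
schema = pa.DataFrameSchema({
    "hours_worked": pa.Column(float, pa.Check.not_exceeding(max_value=168)),
})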
pyspark.sql.DataFrame in 0.16.0b!

https://pandera.readthedocs.io/en/latest/
import pandera.pyspark as pa
import pyspark.sql.types as T
from decimal import Decimal
from pyspark.sql import DataFrame
from pandera.pyspark import DataFrameModel
class PanderaSchema(DataFrameModel):
id: T.IntegerType() = pa.Field(gt=5)
product_name: T.StringType() = pa.Field(str_startswith="B")
price: T.DecimalType(20, 5) = pa.Field()
description: T.ArrayType(T.StringType()) = pa.Field()
meta: T.MapType(T.StringType(), T.StringType()) = pa.Field()
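Unlike the pandas-backed integrations, validation on native Spark DataFrames is non-blocking: errors accumulate on the returned object instead of raising. A minimal usage sketch, assuming an active Spark session and a df matching the schema above:

df_out = PanderaSchema.validate(check_obj=df)
print(df_out.pandera.errors)  # dict of any schema/data errors collected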
Additional approaches to put into practice in the future:
- Thoughtful design work.
- Library-independent error reporting.
- Decoupling metadata from data.
- Investing in governance and community.
What does this mean?
- pydantic v2
- pytest: collect data coverage statistics
- hypothesis: faster data synthesis