Description
Describe the bug
When I validate a pandas DataFrame with a schema that combines a custom parser (lower-casing strings), a built-in check such as isin, and Config with drop_invalid_rows = True to filter out rows that fail the check, the parsed column comes back with missing values (None) instead of the parsed strings.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandera.
- (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
import pandera.pandas as pa
from pandera.typing import Series

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["Google", "bing", "invalid_channel", "google"],
    "value": [10.0, 20.0, 30.0, 40.0]
})

allowed_channels = ["google", "bing"]

class ChannelSchema(pa.DataFrameModel):
    id: Series[int]
    channel: Series[str] = pa.Field(coerce=True, isin=allowed_channels)
    value: Series[float]

    @pa.parser("channel")
    @classmethod
    def parse_channel(cls, series: Series[str]) -> Series[str]:
        return series.astype(str).str.lower()  # type: ignore

    class Config:
        strict = "filter"
        drop_invalid_rows = True

ChannelSchema.validate(df, lazy=True)

Output
   id channel  value
0   1    None   10.0
1   2    None   20.0
3   4    None   40.0
Expected behavior
In this case I would expect the output to contain only the rows whose channel is allowed, with the channel in lower case and the value unchanged, such as:
   id channel  value
0   1  google   10.0
1   2    bing   20.0
3   4  google   40.0
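For reference, the expected result can be reproduced in plain pandas by applying the parser step first and the isin filter second. This is only a sketch of the intended semantics, not how pandera implements validation; the helper name expected_clean is hypothetical and not part of any library.

```python
import pandas as pd

ALLOWED_CHANNELS = ["google", "bing"]


def expected_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: lower-case the channel, then drop rows not in the allowed list."""
    out = df.copy()
    # Parser step: same transformation as parse_channel in the schema.
    out["channel"] = out["channel"].astype(str).str.lower()
    # Check + drop step: keep only rows that pass the isin check.
    return out[out["channel"].isin(ALLOWED_CHANNELS)]


df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["Google", "bing", "invalid_channel", "google"],
    "value": [10.0, 20.0, 30.0, 40.0],
})

result = expected_clean(df)
print(result)
```

The original index labels (0, 1, 3) are preserved, matching the expected output above.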
This also happens in pa.DataFrameSchema.
Here are some edge scenarios that may help debug:
When the channel column contains values with capital letters, but every value is in the allowed list once lower-cased, the behavior is as expected.
import pandas as pd
import pandera.pandas as pa
from pandera.typing import Series

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["google", "bing", "Bing", "Google"],
    "value": [10.0, 20.0, 30.0, 40.0]
})

class ChannelSchema(pa.DataFrameModel):
    id: Series[int]
    channel: Series[str] = pa.Field(coerce=True, isin=["google", "bing"])
    value: Series[float]

    @pa.parser("channel")
    @classmethod
    def parse_channel(cls, series: Series[str]) -> Series[str]:
        return series.astype(str).str.lower()  # type: ignore

    class Config:
        strict = "filter"
        drop_invalid_rows = True

clean_df = ChannelSchema.validate(df, lazy=True)

Output
   id channel  value
0   1  google   10.0
1   2    bing   20.0
2   3    bing   30.0
3   4  google   40.0
When a string is not part of the allowed channels, and we do not use a custom parser, the behavior is as expected.
import pandas as pd
import pandera.pandas as pa
from pandera.typing import Series

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["google", "bing", "other", "google"],
    "value": [10.0, 20.0, 30.0, 40.0]
})

class ChannelSchema(pa.DataFrameModel):
    id: Series[int]
    channel: Series[str] = pa.Field(coerce=True, isin=["google", "bing"])
    value: Series[float]

    class Config:
        strict = "filter"
        drop_invalid_rows = True

clean_df = ChannelSchema.validate(df, lazy=True)

Output
   id channel  value
0   1  google   10.0
1   2    bing   20.0
3   4  google   40.0
Desktop (please complete the following information):
- OS: IOS
- pandera: 0.29.0
- pandas: 2.2.3