custom parser + built-in validator + drop_invalid_rows = True results in missing data #2216

@AlexAbades

Describe the bug
When I validate a pandas DataFrame with a schema that combines a custom parser (lowercasing strings), a built-in check such as isin, and Config.drop_invalid_rows = True to filter out the rows that fail the check, the parsed column comes back with missing values (None) in every remaining row.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera.pandas as pa
from pandera.typing import Series

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["Google", "bing", "invalid_channel", "google"],
    "value": [10.0, 20.0, 30.0, 40.0]
})
allowed_channels = ["google", "bing"]

class ChannelSchema(pa.DataFrameModel):
    id: Series[int]
    channel: Series[str] = pa.Field(coerce=True, isin=allowed_channels)
    value: Series[float]

    @pa.parser("channel")
    @classmethod
    def parse_channel(cls, series: Series[str]) -> Series[str]:
        return series.astype(str).str.lower() # type: ignore

    class Config:
        strict = "filter"
        drop_invalid_rows = True

ChannelSchema.validate(df, lazy=True)

Output

    id channel  value
0   1    None   10.0
1   2    None   20.0
3   4    None   40.0

Expected behavior

In this case I would expect the output to contain only the rows whose channel is allowed, with the channel in lower case and the other columns intact:

    id channel  value
0   1  google   10.0
1   2    bing   20.0
3   4  google   40.0
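
For reference, the expected result can be spelled out in plain pandas, without pandera. This is only a sketch of the semantics I would expect (parse first, then check the parsed values, then drop the failing rows); it is not how pandera implements it, and the variable names parsed and expected are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["Google", "bing", "invalid_channel", "google"],
    "value": [10.0, 20.0, 30.0, 40.0],
})
allowed_channels = ["google", "bing"]

# Step 1: what the custom parser does, i.e. normalise the column to lower case.
parsed = df.assign(channel=df["channel"].astype(str).str.lower())

# Step 2: what isin + drop_invalid_rows should amount to, i.e. keep only the
# rows whose parsed value is allowed, leaving the parsed values in place.
expected = parsed[parsed["channel"].isin(allowed_channels)]

print(expected)
```

The original index labels (0, 1, 3) are kept, matching the table above.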

This also happens in pa.DataFrameSchema.

Here are some edge cases that may help with debugging:

When the channel column contains values with capital letters, but every value is one of the allowed channels once lowercased, the behavior is as expected.

import pandas as pd
import pandera.pandas as pa
from pandera.typing import Series

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["google", "bing", "Bing", "Google"],
    "value": [10.0, 20.0, 30.0, 40.0]
})

class ChannelSchema(pa.DataFrameModel):
    id: Series[int]
    channel: Series[str] = pa.Field(coerce=True, isin=["google", "bing"])
    value: Series[float]

    @pa.parser("channel")
    @classmethod
    def parse_channel(cls, series: Series[str]) -> Series[str]:
        return series.astype(str).str.lower() # type: ignore

    class Config:
        strict = "filter"
        drop_invalid_rows = True

clean_df = ChannelSchema.validate(df, lazy=True)

Output

    id channel  value
0   1  google   10.0
1   2    bing   20.0
2   3    bing   30.0
3   4  google   40.0

When a value is not one of the allowed channels and no custom parser is used, the behavior is as expected.

import pandas as pd
import pandera.pandas as pa
from pandera.typing import Series

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "channel": ["google", "bing", "other", "google"],
    "value": [10.0, 20.0, 30.0, 40.0]
})

class ChannelSchema(pa.DataFrameModel):
    id: Series[int]
    channel: Series[str] = pa.Field(coerce=True, isin=["google", "bing"])
    value: Series[float]

    class Config:
        strict = "filter"
        drop_invalid_rows = True

clean_df = ChannelSchema.validate(df, lazy=True)

Output

    id channel  value
0   1  google   10.0
1   2    bing   20.0
3   4  google   40.0

Desktop (please complete the following information):

  • OS: IOS
  • pandera: 0.29.0
  • pandas: 2.2.3

Labels: bug (Something isn't working)
