Description
Add a new MultiLabelBinarizerTransformer class that wraps sklearn's MultiLabelBinarizer to make it fully compatible with sklearn pipelines. This transformer should be added to a new transformers module in the package.
Motivation
Currently, sklearn's MultiLabelBinarizer has limitations when used in modern sklearn pipelines:
- Missing `get_feature_names_out` method: Unlike most sklearn transformers, `MultiLabelBinarizer` doesn't implement `get_feature_names_out`, which was standardized in sklearn 1.0+ (SLEP007). This breaks feature name propagation through pipelines and prevents integration with tools that rely on feature names.
- Input handling inconsistency: The transformer doesn't gracefully handle both list and array-like inputs without preprocessing.
- Type compatibility: Outputs may need conversion to float64 for downstream pipeline components that expect numeric dtypes.
This wrapper class solves these problems by:
- Implementing the complete transformer interface, including `get_feature_names_out`
- Handling both list and non-list inputs automatically
- Converting output to float64 for compatibility with downstream components
- Providing meaningful feature names based on the label classes
Proposed Implementation
New Module Structure
Create a new module: `ds_utils.transformers`
This module will house sklearn-compatible transformer wrappers and extensions.
Class: MultiLabelBinarizerTransformer
Inherits from: `BaseEstimator`, `TransformerMixin`
Suggested Implementation:
```python
import re

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelBinarizerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.mlb = MultiLabelBinarizer()

    def _sanitize_column_name(self, name):
        """Sanitize a column name to remove invalid characters for Delta tables.

        Invalid characters: space, comma, semicolon, braces, parentheses,
        newline, tab, equals.
        """
        name_str = str(name)
        # Replace invalid characters ( ,;{}()\n\t=) with underscores
        sanitized = re.sub(r'[ ,;{}()\n\t=]', '_', name_str)
        # Collapse consecutive underscores into a single underscore
        sanitized = re.sub(r'_+', '_', sanitized)
        # Remove leading/trailing underscores
        return sanitized.strip('_')

    def _handle_none_values(self, X):
        """Convert None/NaN values to empty lists for MultiLabelBinarizer."""
        X_list = X.tolist() if hasattr(X, 'tolist') else list(X)
        processed = []
        for item in X_list:
            if item is None or (isinstance(item, float) and pd.isna(item)):
                # Missing row -> empty label set
                processed.append([])
            elif isinstance(item, np.ndarray):
                # Convert the numpy array to a list and keep only hashable scalars
                item_list = item.tolist()
                if isinstance(item_list, list):
                    cleaned = []
                    for x in item_list:
                        if isinstance(x, np.ndarray):
                            x = x.item() if x.size == 1 else x.tolist()
                        if isinstance(x, (str, int, bool)) or (
                            isinstance(x, float) and not pd.isna(x)
                        ):
                            cleaned.append(x)
                    processed.append(cleaned)
                elif isinstance(item_list, (str, int, float, bool)):
                    # Zero-dimensional array: tolist() returned a single scalar
                    processed.append([item_list])
                else:
                    processed.append([])
            elif isinstance(item, list):
                # Filter out None/NaN values and unwrap any nested numpy arrays
                cleaned = []
                for x in item:
                    if isinstance(x, np.ndarray):
                        x = x.item() if x.size == 1 else x.tolist()
                    if isinstance(x, list):
                        cleaned.extend(
                            y for y in x
                            if isinstance(y, (str, int, float, bool)) and y is not None
                        )
                    elif isinstance(x, (str, int, bool)) or (
                        isinstance(x, float) and not pd.isna(x)
                    ):
                        cleaned.append(x)
                processed.append(cleaned)
            elif isinstance(item, (str, int, bool)) or (
                isinstance(item, float) and not pd.isna(item)
            ):
                # A single hashable value becomes a one-element label set
                processed.append([item])
            else:
                processed.append([])
        return processed

    def fit(self, X, y=None):
        processed_X = self._handle_none_values(X)
        self.mlb.fit(processed_X)
        return self

    def transform(self, X):
        processed_X = self._handle_none_values(X)
        # float64 output for compatibility with downstream numeric components
        return self.mlb.transform(processed_X).astype('float64')

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def get_feature_names_out(self, input_features=None):
        # Use the first input feature name as prefix, else default to "label"
        if input_features is not None and len(input_features) > 0:
            prefix = input_features[0]
        else:
            prefix = "label"
        # Sanitize label names to remove invalid characters for Delta tables
        sanitized_labels = [self._sanitize_column_name(label) for label in self.mlb.classes_]
        # sklearn convention: return an array of strings
        return np.asarray([f"{prefix}_{label}" for label in sanitized_labels], dtype=object)
```
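The sanitization rule can be sketched standalone (the same regex logic as `_sanitize_column_name`, extracted as a free function for illustration):

```python
import re

def sanitize(name):
    # Replace Delta-invalid characters ( ,;{}()\n\t=) with underscores,
    # collapse runs of underscores, then strip leading/trailing ones
    s = re.sub(r'[ ,;{}()\n\t=]', '_', str(name))
    s = re.sub(r'_+', '_', s)
    return s.strip('_')

print(sanitize("genre (main), v=2"))  # genre_main_v_2
print(sanitize("sci-fi"))             # sci-fi (hyphens are not in the invalid set)
```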
Key Features:
- Automatic conversion of non-list inputs to lists using `hasattr(X, 'tolist')`
- Returns float64 arrays for better pipeline compatibility
- Proper implementation of `get_feature_names_out`: returns feature names based on `self.mlb.classes_`, following sklearn conventions
- Handles the `input_features` parameter to customize the prefix for feature names
- Feature names follow the pattern `{prefix}_{label}` for each label class
Note on `get_feature_names_out` Implementation:
The implementation uses the `input_features` parameter to determine the prefix:
- If `input_features` is `None`, uses `"label"` as the default prefix
- If `input_features` is provided, uses its first feature name as the prefix
- Returns one name per label in `self.mlb.classes_`, in the format `f"{prefix}_{label}"`
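For concreteness, the naming rule can be exercised against `MultiLabelBinarizer` directly (a sketch of the pattern only, independent of the wrapper class):

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer().fit([["a", "b"], ["b"]])

# Default prefix when input_features is None
print([f"label_{c}" for c in mlb.classes_])    # ['label_a', 'label_b']

# First entry of input_features is used as the prefix
prefix = "genres"
print([f"{prefix}_{c}" for c in mlb.classes_])  # ['genres_a', 'genres_b']
```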
Implementation Checklist
- [ ] Create `ds_utils/transformers.py` module with an appropriate docstring
- [ ] Implement `MultiLabelBinarizerTransformer` class with complete docstrings
- [ ] Implement `get_feature_names_out` following sklearn API conventions
- [ ] Add tests for `get_feature_names_out` and `set_output(transform="pandas")`
- [ ] Update `__init__.py` to expose the new module
Example Usage
```python
from ds_utils.transformers import MultiLabelBinarizerTransformer
from sklearn.pipeline import Pipeline
import pandas as pd

# Basic usage
mlb_transformer = MultiLabelBinarizerTransformer()
X = [['sci-fi', 'action'], ['romance'], ['action', 'comedy']]
X_transformed = mlb_transformer.fit_transform(X)

# Get feature names
feature_names = mlb_transformer.get_feature_names_out()
print(list(feature_names))  # ['label_action', 'label_comedy', 'label_romance', 'label_sci-fi']

# In a pipeline with pandas output
pipeline = Pipeline([
    ('mlb', MultiLabelBinarizerTransformer()),
    # other transformers...
])
pipeline.set_output(transform="pandas")
df_transformed = pipeline.fit_transform(X)
print(df_transformed.columns)  # Will show the feature names

# In a full ML pipeline
from sklearn.ensemble import RandomForestClassifier

y = [0, 1, 0]  # example targets, one per sample
full_pipeline = Pipeline([
    ('mlb', MultiLabelBinarizerTransformer()),
    ('classifier', RandomForestClassifier())
])
full_pipeline.fit(X, y)
```
Benefits
This transformer enables:
- Full pipeline compatibility: Works seamlessly with sklearn's modern pipeline infrastructure
- Feature name tracking: Maintains feature names through complex pipelines
- Pandas integration: Compatible with `set_output(transform="pandas")` for DataFrame outputs
- Multi-label classification: Essential for multi-label ML problems where samples have multiple labels
- Feature engineering: Useful for binarizing categorical list data in preprocessing pipelines
References
Technical Notes
Why `get_feature_names_out` is critical:
Modern sklearn pipelines (v1.0+) rely on this method for feature name propagation. Without it, the transformer:
- Cannot be used with `ColumnTransformer` verbose output
- Breaks `set_output(transform="pandas")` functionality
- Prevents downstream feature importance analysis
- Is incompatible with model inspection tools
The implementation should follow sklearn's conventions: return a numpy array of strings, handle the optional `input_features` parameter, and generate meaningful names based on the binarized classes.