Add MultiLabelBinarizerTransformer to New Transformers Module #89

@idanmoradarthas

Description

Add a new MultiLabelBinarizerTransformer class that wraps sklearn's MultiLabelBinarizer to make it fully compatible with sklearn pipelines. This transformer should be added to a new transformers module in the package.

Motivation

Currently, sklearn's MultiLabelBinarizer has limitations when used in modern sklearn pipelines:

  1. Missing get_feature_names_out method: Unlike most sklearn transformers, MultiLabelBinarizer doesn't implement get_feature_names_out, which was standardized in sklearn 1.0+ (SLEP007). This breaks feature name propagation through pipelines and prevents integration with tools that rely on feature names.

  2. Inconsistent input handling: MultiLabelBinarizer does not gracefully handle both list and array-like inputs, so callers have to preprocess them first.

  3. Type compatibility: MultiLabelBinarizer returns an integer indicator matrix, which may need conversion to float64 for downstream pipeline components that expect floating-point input.

This wrapper class solves these problems by:

  • Implementing the complete transformer interface including get_feature_names_out
  • Handling both list and non-list inputs automatically
  • Converting output to float64 for compatibility with downstream components
  • Providing meaningful feature names based on the label classes
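
The limitations listed above can be checked against a local installation with a short script. This is only a sketch: the sample data is made up, and the printed values for the method check and the dtype depend on the installed scikit-learn and NumPy versions, so no output is asserted for them here.

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform([["sci-fi", "action"], ["romance"]])

# Whether the method exists depends on the installed scikit-learn version.
print(hasattr(mlb, "get_feature_names_out"))
# The dtype of the indicator matrix; the proposed wrapper casts it to float64.
print(encoded.dtype)
# The learned classes are the natural source for output column names.
print(list(mlb.classes_))
```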

Proposed Implementation

New Module Structure

Create a new module: ds_utils.transformers

This module will house sklearn-compatible transformer wrappers and extensions.

Class: MultiLabelBinarizerTransformer

Inherits from: BaseEstimator, TransformerMixin

Suggested Implementation:

import re

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelBinarizerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.mlb = MultiLabelBinarizer()

    def _sanitize_column_name(self, name):
        """Sanitize column name to remove invalid characters for Delta tables.

        Invalid characters: space, comma, semicolon, braces, parentheses, newline, tab, equals
        """
        # Convert to string if not already
        name_str = str(name)
        # Replace invalid characters ( ,;{}()\n\t=) with underscore
        sanitized = re.sub(r'[ ,;{}()\n\t=]', '_', name_str)
        # Replace multiple consecutive underscores with a single underscore
        sanitized = re.sub(r'_+', '_', sanitized)
        # Remove leading/trailing underscores
        sanitized = sanitized.strip('_')
        return sanitized

    def _handle_none_values(self, X):
        """Convert None/NaN values to empty lists for MultiLabelBinarizer"""
        # NumPy arrays and pandas Series expose tolist(); other iterables go through list()
        if hasattr(X, 'tolist'):
            X_list = X.tolist()
        else:
            X_list = list(X)

        # Handle None/NaN values - convert to empty list
        processed = []
        for item in X_list:
            if item is None or (isinstance(item, float) and pd.isna(item)):
                processed.append([])
            elif isinstance(item, np.ndarray):
                # Convert numpy array to list and ensure all items are hashable
                item_list = item.tolist()
                if isinstance(item_list, list):
                    # Filter and ensure hashable
                    cleaned = []
                    for x in item_list:
                        if isinstance(x, np.ndarray):
                            x = x.item() if x.size == 1 else x.tolist()
                        if isinstance(x, (str, int, float, bool)) and x is not None and not (isinstance(x, float) and pd.isna(x)):
                            cleaned.append(x)
                    processed.append(cleaned)
                else:
                    # Single value from a zero-dimensional array
                    if isinstance(item_list, (str, int, float, bool)) and item_list is not None:
                        processed.append([item_list])
                    else:
                        processed.append([])
            elif isinstance(item, list):
                # Filter out None values from lists, and convert any numpy arrays
                cleaned = []
                for x in item:
                    if isinstance(x, np.ndarray):
                        x = x.item() if x.size == 1 else x.tolist()
                        if isinstance(x, list):
                            cleaned.extend([y for y in x if isinstance(y, (str, int, float, bool)) and y is not None])
                        elif isinstance(x, (str, int, float, bool)) and x is not None:
                            cleaned.append(x)
                    elif isinstance(x, (str, int, float, bool)) and x is not None and not (isinstance(x, float) and pd.isna(x)):
                        cleaned.append(x)
                processed.append(cleaned)
            else:
                # If it's a single hashable value, wrap it in a list
                if isinstance(item, (str, int, float, bool)) and item is not None and not (isinstance(item, float) and pd.isna(item)):
                    processed.append([item])
                else:
                    processed.append([])

        return processed

    def fit(self, X, y=None):
        processed_X = self._handle_none_values(X)
        self.mlb.fit(processed_X)
        return self

    def transform(self, X):
        processed_X = self._handle_none_values(X)
        result = self.mlb.transform(processed_X)
        return result.astype('float64')

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def get_feature_names_out(self, input_features=None):
        # Use the first input feature name as prefix, falling back to "label".
        # Checking `is not None` and len() also works when input_features is a NumPy array.
        if input_features is not None and len(input_features) > 0:
            prefix = input_features[0]
        else:
            prefix = "label"
        # Sanitize label names to remove invalid characters for Delta tables
        sanitized_labels = [self._sanitize_column_name(label) for label in self.mlb.classes_]
        # Return an object-dtype array of strings, as sklearn transformers do
        return np.asarray([f"{prefix}_{label}" for label in sanitized_labels], dtype=object)
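
The sanitization rule used by `_sanitize_column_name` can be tried in isolation. The sketch below copies the same two regular expressions into a free function (the name `sanitize_column_name` and the sample labels are illustrative, not part of the proposal) to show what a few labels turn into.

```python
import re


def sanitize_column_name(name):
    """Standalone copy of the sanitization rule, for experimentation."""
    # Replace each character listed as invalid for Delta tables with an underscore.
    sanitized = re.sub(r"[ ,;{}()\n\t=]", "_", str(name))
    # Collapse runs of underscores, then trim them from both ends.
    return re.sub(r"_+", "_", sanitized).strip("_")


print(sanitize_column_name("sci fi (new)"))  # sci_fi_new
print(sanitize_column_name("a,b") == sanitize_column_name("a b"))  # True: both become a_b
print(repr(sanitize_column_name("()")))  # '': only invalid characters
```

One consequence worth keeping in mind: distinct labels can map to the same sanitized name, and a label made only of invalid characters becomes empty, so the generated feature names are not guaranteed to be unique. A unit test for that case may be worth adding.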

Key Features:

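  • Automatic conversion of non-list inputs to lists: inputs that expose tolist() (such as NumPy arrays or pandas Series) are converted with it, and any other iterable with list(X)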
  • Returns float64 arrays for better pipeline compatibility
  • Proper implementation of get_feature_names_out: Returns feature names based on self.mlb.classes_, following sklearn conventions
  • Handles input_features parameter to customize the prefix for feature names
  • Feature names follow the pattern {prefix}_{label} for each label class
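
The input normalization behind the first feature can be illustrated with a much smaller helper. This is a simplified sketch, not the proposed `_handle_none_values`: the name `to_label_lists` is hypothetical, and it skips the nested NumPy array handling and the type filtering that the full method performs.

```python
import math


def to_label_lists(X):
    """Simplified stand-in: return one list of labels per sample."""
    # Array-likes expose tolist(); any other iterable goes through list().
    rows = X.tolist() if hasattr(X, "tolist") else list(X)
    normalized = []
    for item in rows:
        if item is None or (isinstance(item, float) and math.isnan(item)):
            # Missing sample: no labels at all.
            normalized.append([])
        elif isinstance(item, (list, tuple)):
            # Drop None entries inside a label list.
            normalized.append([x for x in item if x is not None])
        else:
            # A bare scalar is treated as a single label.
            normalized.append([item])
    return normalized


print(to_label_lists([["a", None], None, float("nan"), "b"]))
# [['a'], [], [], ['b']]
```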

Note on get_feature_names_out Implementation:
The implementation uses the input_features parameter to determine the prefix:

  • If input_features is None, uses "label" as default prefix
  • If input_features is provided, uses the first feature name as prefix
  • Returns one string per label in self.mlb.classes_, in the format f"{prefix}_{label}"
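
The prefix rule can be sketched as a free function. The function name and sample values below are hypothetical. The sketch checks `input_features is not None` explicitly, because sklearn tooling may pass a NumPy array here and evaluating the truth value of an array with more than one element raises a ValueError.

```python
import numpy as np


def feature_names_out(classes, input_features=None):
    """Hypothetical free-function version of the prefix rule."""
    # Avoid `if input_features:` so that NumPy arrays are handled safely.
    if input_features is not None and len(input_features) > 0:
        prefix = input_features[0]
    else:
        prefix = "label"
    # Object-dtype array of Python strings, one per class.
    return np.asarray([f"{prefix}_{c}" for c in classes], dtype=object)


classes = ["action", "comedy"]
print(list(feature_names_out(classes)))
# ['label_action', 'label_comedy']
print(list(feature_names_out(classes, np.array(["genres", "year"]))))
# ['genres_action', 'genres_comedy']
```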

Implementation Checklist

  • Create new ds_utils/transformers.py module with appropriate docstring
  • Implement MultiLabelBinarizerTransformer class with complete docstrings
  • Implement get_feature_names_out method following sklearn API conventions
  • Add comprehensive unit tests covering:
    • Basic fit/transform functionality
    • Pipeline integration
    • List and array inputs
    • Feature name generation via get_feature_names_out
    • Integration with set_output(transform="pandas")
    • Edge cases (empty labels, single labels, etc.)
  • Add documentation:
    • Docstrings for all methods (following numpy/sklearn docstring format)
    • Usage examples in module documentation
    • Update README if appropriate
  • Update package __init__.py to expose the new module

Example Usage

from ds_utils.transformers import MultiLabelBinarizerTransformer
from sklearn.pipeline import Pipeline
import pandas as pd

# Basic usage
mlb_transformer = MultiLabelBinarizerTransformer()
X = [['sci-fi', 'action'], ['romance'], ['action', 'comedy']]
X_transformed = mlb_transformer.fit_transform(X)

# Get feature names
feature_names = mlb_transformer.get_feature_names_out()
print(list(feature_names))  # ['label_action', 'label_comedy', 'label_romance', 'label_sci-fi']

# In a pipeline with pandas output
pipeline = Pipeline([
    ('mlb', MultiLabelBinarizerTransformer()),
    # other transformers...
])
pipeline.set_output(transform="pandas")
df_transformed = pipeline.fit_transform(X)
print(df_transformed.columns)  # Will show the feature names

# In a full ML pipeline
from sklearn.ensemble import RandomForestClassifier
y = [1, 0, 1]  # illustrative target values, one per sample in X
full_pipeline = Pipeline([
    ('mlb', MultiLabelBinarizerTransformer()),
    ('classifier', RandomForestClassifier())
])
full_pipeline.fit(X, y)

Benefits

This transformer enables:

  • Full pipeline compatibility: Works seamlessly with sklearn's modern pipeline infrastructure
  • Feature name tracking: Maintains feature names through complex pipelines
  • Pandas integration: Compatible with set_output(transform="pandas") for DataFrame outputs
  • Multi-label classification: Essential for multi-label ML problems where samples have multiple labels
  • Feature engineering: Useful for binarizing categorical list data in preprocessing pipelines

Technical Notes

Why get_feature_names_out is critical:
Modern sklearn pipelines (v1.0+) rely on this method for feature name propagation. Without it, the transformer:

  • Cannot contribute column names to ColumnTransformer.get_feature_names_out (including the verbose_feature_names_out prefixes)
  • Breaks set_output(transform="pandas") functionality
  • Prevents downstream feature importance analysis
  • Is incompatible with model inspection tools

The implementation should follow sklearn's conventions: return a numpy array of strings, handle the optional input_features parameter, and generate meaningful names based on the binarized classes.

Metadata

Labels

enhancement (New feature or request)
