
SOLID Principles Review

Hub Prioritization Framework

Review Date: 2025-12-17
Reviewer: Claude Code
Codebase Version: Current branch claude/update-solid-docs-XLO46
Last Updated: 2025-12-17


Executive Summary

This document provides a comprehensive review of the Hub Prioritization Framework codebase against SOLID design principles. The codebase demonstrates strong adherence to most SOLID principles, particularly Single Responsibility and Interface Segregation. However, there are opportunities for improvement in Open/Closed Principle and Dependency Inversion Principle implementation.

Overall Assessment

| Principle | Status | Grade | Summary |
|---|---|---|---|
| Single Responsibility | ✅ Excellent | A | Modules and functions are well-scoped |
| Open/Closed | ⚠️ Partial | B- | Extensible but requires code changes |
| Liskov Substitution | ✓ N/A | – | Minimal inheritance usage |
| Interface Segregation | ✅ Good | A- | Focused, minimal interfaces |
| Dependency Inversion | ⚠️ Needs Work | C+ | Direct dependencies, no abstractions |

1. Single Responsibility Principle (SRP)

"A class should have one, and only one, reason to change."

Assessment: ✅ EXCELLENT (Grade: A)

The codebase demonstrates strong adherence to SRP across all modules.

Strengths

  1. Clear Module Boundaries

    • src/config.py: Configuration management only
    • src/scoring/activity.py: Passenger activity scoring only
    • src/scoring/service.py: Service and mode scoring only
    • src/scoring/monte_carlo.py: Monte Carlo aggregation only
    • src/scoring/mc_distribution.py: Monte Carlo distribution analysis only
    • src/scoring/ahp.py: AHP expert-driven scoring only
    • src/data/loaders.py: Data loading operations only
    • src/classification/eligibility.py: Eligibility filtering only
    • src/spatial/h3_operations.py: H3 spatial operations only
  2. Focused Functions

    # Good example from activity.py
    def calculate_activity_score(gdf, demand_column, tier_column, use_log):
        """Calculate passenger activity score."""
        # Single, well-defined responsibility
  3. Separation of Concerns

    • Data loading separated from processing
    • Scoring separated from aggregation
    • Validation separated from business logic

Examples of Good SRP Implementation

File: src/scoring/normalization.py

# Each function has ONE clear responsibility:
def normalize_minmax(values, min_val, max_val):
    """Min-max normalization only"""

def normalize_by_tier(df, value_column, tier_column):
    """Per-tier normalization only"""

def normalize_log10(values):
    """Log transformation + normalization only"""

File: src/data/loaders.py

# Each loader handles ONE data source:
def load_transit_nodes(filepath): ...
def load_lines_and_modes(filepath): ...
def load_demand_data(filepath): ...
def load_spatial_layer(filepath): ...

Recommendations

  • No immediate changes needed
  • Continue maintaining clear module boundaries
  • Resist temptation to add unrelated functionality to existing modules

2. Open/Closed Principle (OCP)

"Software entities should be open for extension, but closed for modification."

Assessment: ⚠️ PARTIAL COMPLIANCE (Grade: B-)

The codebase is partially compliant with OCP. While some components are extensible, adding new functionality often requires modifying existing code.

Strengths

  1. Configuration-Based Design

    # config.py allows extension without code changes
    MODE_WEIGHTS = {
        'Funicular': 1.0,
        'BRT': 3.0,
        'LRT': 4.0,
        'Metro': 5.0,
        # Easy to add new modes here
    }
  2. Functional Decomposition

    • Individual scoring functions can be called independently
    • Pipeline can be customized by calling functions in different orders
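The customization described above can be sketched as follows. This is a minimal, hypothetical stand-in: the real `normalize_minmax` and `calculate_activity_score` in `src/scoring/` are more involved; only the composition style is the point here.

```python
import pandas as pd

def normalize_minmax(values: pd.Series, min_val: float = 1.0,
                     max_val: float = 10.0) -> pd.Series:
    """Min-max normalize a Series into [min_val, max_val] (simplified stand-in)."""
    span = values.max() - values.min()
    if span == 0:
        return pd.Series(min_val, index=values.index)
    return min_val + (values - values.min()) / span * (max_val - min_val)

def calculate_activity_score(df: pd.DataFrame,
                             demand_column: str = "TotalDemand") -> pd.Series:
    """Stand-in activity scorer: normalized demand."""
    return normalize_minmax(df[demand_column])

# A caller composes its own pipeline, running only the steps it needs:
df = pd.DataFrame({"TotalDemand": [0, 500, 1000]})
df["activity_score"] = calculate_activity_score(df)
```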

Weaknesses

  1. Hard-Coded Scoring Criteria

    # monte_carlo.py - Adding a new criterion requires code changes
    def calculate_all_scores(gdf, tier_column):
        from .activity import calculate_activity_score
        from .service import calculate_service_score
        from .location import calculate_location_score
        from .demographics import calculate_pop_jobs_score
        from .terminals import calculate_terminal_score
    
        gdf_scored['activity_score'] = calculate_activity_score(...)
        gdf_scored['service_score'] = calculate_service_score(...)
        # Must modify this function to add new criterion
  2. No Abstract Base Classes

    • No common interface for scoring criteria
    • Each scorer has its own function signature
    • Difficult to add new scorers without modifying aggregation code
  3. Hard-Coded Column Names

    # In multiple places - brittle to changes
    score_columns = [
        'activity_score',
        'service_score',
        'location_score',
        'pop_jobs_score',
        'terminal_score'
    ]

Impact

Current: Adding a new scoring criterion requires:

  1. Creating a new scoring module ✅
  2. Importing it in monte_carlo.py ❌
  3. Calling it in calculate_all_scores() ❌
  4. Adding its column name to the score_columns list ❌

Ideal: Adding a new scoring criterion should only require:

  1. Creating new scoring module that implements common interface ✅
  2. Registering it (via config or decorator) ✅

Recommendations

HIGH PRIORITY: Implement Strategy Pattern for Scoring

Create Abstract Base Class for Scorers

# src/scoring/base.py
from abc import ABC, abstractmethod
import geopandas as gpd
import pandas as pd

class BaseScorer(ABC):
    """Abstract base class for all scoring criteria."""

    def __init__(self, name: str, weight_range: tuple = (0.0, 0.5)):
        self.name = name
        self.weight_range = weight_range

    @abstractmethod
    def calculate(self, gdf: gpd.GeoDataFrame, **kwargs) -> pd.Series:
        """
        Calculate scores for all hubs.

        Args:
            gdf: GeoDataFrame with hub data
            **kwargs: Additional parameters

        Returns:
            Series of scores normalized to 1-10 range
        """
        pass

    @abstractmethod
    def get_required_columns(self) -> list:
        """Return list of required column names."""
        pass

    def validate_inputs(self, gdf: gpd.GeoDataFrame) -> None:
        """Validate that required columns are present."""
        missing = [col for col in self.get_required_columns()
                   if col not in gdf.columns]
        if missing:
            raise ValueError(f"{self.name} missing columns: {missing}")

Refactor Existing Scorers

# src/scoring/activity.py
from .base import BaseScorer

class ActivityScorer(BaseScorer):
    """Passenger activity scoring criterion."""

    def __init__(self):
        super().__init__(name='activity')

    def get_required_columns(self) -> list:
        return ['TotalDemand', 'tier']

    def calculate(self, gdf, demand_column='TotalDemand',
                 tier_column='tier', use_log=True) -> pd.Series:
        self.validate_inputs(gdf)
        # ... existing calculation logic ...
        return scores

Create Scorer Registry

# src/scoring/registry.py
from typing import Dict, Type
from .base import BaseScorer

class ScorerRegistry:
    """Registry for all scoring criteria."""

    _scorers: Dict[str, Type[BaseScorer]] = {}

    @classmethod
    def register(cls, name: str):
        """Decorator to register a scorer."""
        def decorator(scorer_class: Type[BaseScorer]):
            cls._scorers[name] = scorer_class
            return scorer_class
        return decorator

    @classmethod
    def get_scorer(cls, name: str) -> BaseScorer:
        """Get scorer instance by name."""
        if name not in cls._scorers:
            raise ValueError(f"Unknown scorer: {name}")
        return cls._scorers[name]()

    @classmethod
    def get_all_scorers(cls) -> Dict[str, BaseScorer]:
        """Get all registered scorers."""
        return {name: scorer() for name, scorer in cls._scorers.items()}

# Usage
@ScorerRegistry.register('activity')
class ActivityScorer(BaseScorer):
    ...

Refactor Monte Carlo to Use Registry

# src/scoring/monte_carlo.py
from .registry import ScorerRegistry

def calculate_all_scores(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Calculate all registered scoring criteria."""
    logger.info("Calculating all scoring criteria...")

    gdf_scored = gdf.copy()
    scorers = ScorerRegistry.get_all_scorers()

    for i, (name, scorer) in enumerate(scorers.items(), 1):
        logger.info(f"\n{i}/{len(scorers)}: {scorer.name.title()} Score")
        score_column = f"{name}_score"
        gdf_scored[score_column] = scorer.calculate(gdf_scored)

    return gdf_scored

Benefits of This Approach:

  • ✅ Add new scorers without modifying existing code
  • ✅ Scorers are self-documenting (required columns, parameters)
  • ✅ Consistent interface across all scoring criteria
  • ✅ Easy to test individual scorers
  • ✅ Automatic discovery of available scorers
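The pattern above can be exercised end to end with a compact, self-contained sketch (the `DemandScorer` below is a toy criterion invented for illustration, not one of the framework's real scorers):

```python
from abc import ABC, abstractmethod
import pandas as pd

class BaseScorer(ABC):
    """Minimal base class: every criterion exposes calculate()."""
    name = "base"

    @abstractmethod
    def calculate(self, df: pd.DataFrame) -> pd.Series: ...

class ScorerRegistry:
    _scorers = {}

    @classmethod
    def register(cls, name):
        def decorator(scorer_class):
            cls._scorers[name] = scorer_class
            return scorer_class
        return decorator

    @classmethod
    def get_all_scorers(cls):
        return {n: sc() for n, sc in cls._scorers.items()}

@ScorerRegistry.register("demand")
class DemandScorer(BaseScorer):
    name = "demand"
    def calculate(self, df):
        return df["TotalDemand"] / df["TotalDemand"].max() * 10

# Aggregation never names a concrete scorer -- a new criterion only
# needs to register itself; this loop is never modified:
df = pd.DataFrame({"TotalDemand": [200, 400]})
for name, scorer in ScorerRegistry.get_all_scorers().items():
    df[f"{name}_score"] = scorer.calculate(df)
```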
MEDIUM PRIORITY: Configuration-Driven Pipeline
# config.py
SCORING_CRITERIA = [
    {
        'name': 'activity',
        'class': 'ActivityScorer',
        'enabled': True,
        'params': {'use_log': True}
    },
    {
        'name': 'service',
        'class': 'ServiceScorer',
        'enabled': True,
        'params': {}
    },
    # Easy to add/remove/configure criteria
]
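A sketch of how a pipeline might consume such entries (the scorer functions and the `SCORER_LOOKUP` table below are hypothetical stand-ins; real code would likely resolve the 'class' string against the scorer registry instead):

```python
import pandas as pd

SCORING_CRITERIA = [
    {"name": "activity", "class": "ActivityScorer", "enabled": True,
     "params": {"use_log": False}},
    {"name": "service", "class": "ServiceScorer", "enabled": False,
     "params": {}},
]

def activity_score(df, use_log=False):
    # Toy stand-in for the real activity criterion.
    return df["TotalDemand"] / df["TotalDemand"].max() * 10

def service_score(df):
    # Toy stand-in for the real service criterion.
    return df["n_lines"] / df["n_lines"].max() * 10

SCORER_LOOKUP = {"ActivityScorer": activity_score,
                 "ServiceScorer": service_score}

def run_scoring(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for entry in SCORING_CRITERIA:
        if not entry["enabled"]:
            continue  # disabled criteria are skipped without code changes
        fn = SCORER_LOOKUP[entry["class"]]
        out[f"{entry['name']}_score"] = fn(out, **entry["params"])
    return out
```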

3. Liskov Substitution Principle (LSP)

"Derived classes must be substitutable for their base classes."

Assessment: ✓ NOT APPLICABLE (No Grade)

The codebase uses functional programming rather than class hierarchies, making LSP largely not applicable.

Current State

  • Minimal use of classes
  • No inheritance hierarchies
  • No polymorphic behavior
  • Functions instead of objects

Observations

This is actually a strength of the current design:

  • Simpler to understand
  • Easier to test
  • No inheritance-related bugs
  • LSP violations are impossible

Recommendations

  • No changes needed
  • If classes are introduced (per OCP recommendations), ensure LSP compliance
  • Use composition over inheritance where possible
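For example, a scorer can compose a normalization strategy rather than inherit from it, keeping hierarchies flat (all names below are hypothetical illustrations):

```python
import pandas as pd

class MinMaxNormalizer:
    """Normalization strategy held by scorers, not inherited by them."""
    def __init__(self, lo: float = 1.0, hi: float = 10.0):
        self.lo, self.hi = lo, hi

    def __call__(self, values: pd.Series) -> pd.Series:
        span = values.max() - values.min()
        if span == 0:
            return pd.Series(self.lo, index=values.index)
        return self.lo + (values - values.min()) / span * (self.hi - self.lo)

class DemandScorer:
    """Composes a normalizer; swapping strategies needs no subclass."""
    def __init__(self, normalizer=None):
        self.normalizer = normalizer or MinMaxNormalizer()

    def calculate(self, df: pd.DataFrame) -> pd.Series:
        return self.normalizer(df["TotalDemand"])
```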

4. Interface Segregation Principle (ISP)

"Clients should not be forced to depend on interfaces they do not use."

Assessment: ✅ GOOD (Grade: A-)

The codebase demonstrates good adherence to ISP through focused function signatures.

Strengths

  1. Minimal Function Parameters

    # Good: Only required parameters
    def calculate_activity_score(gdf, demand_column='TotalDemand',
                                 tier_column='tier', use_log=True):
        """Focused interface with defaults"""
  2. No "God Functions"

    • Functions don't accept dozens of parameters
    • Each function has a clear, specific purpose
    • Optional parameters have sensible defaults
  3. Separation of Data and Configuration

    # Good: Config imported, not passed as parameters
    from ..config import MODE_WEIGHTS, MODE_DIVERSITY_BONUS_PCT
    
    def calculate_service_score(gdf, modes_column='modes'):
        # Doesn't force caller to know about all config

Minor Issues

  1. Some Functions Have Many Optional Parameters

    # normalization.py
    def normalize_minmax(values, min_val=SCORE_MIN, max_val=SCORE_MAX,
                        input_min=None, input_max=None):
        # 5 parameters - could use config object
  2. Inconsistent Parameter Naming

    # Different functions use different names for similar concepts
    calculate_activity_score(..., tier_column='tier')
    assign_hub_tiers(..., tier_column='HubType')  # Different default

Recommendations

LOW PRIORITY: Parameter Objects for Complex Functions
# Instead of many parameters:
def normalize_minmax(values, min_val=1, max_val=10, input_min=None, input_max=None):
    ...

# Use a configuration object:
from dataclasses import dataclass
from typing import Optional

import pandas as pd

@dataclass
class NormalizationConfig:
    min_val: float = 1.0
    max_val: float = 10.0
    input_min: Optional[float] = None
    input_max: Optional[float] = None

def normalize_minmax(values: pd.Series, config: Optional[NormalizationConfig] = None):
    config = config or NormalizationConfig()
    ...
LOW PRIORITY: Standardize Column Name Parameters
# Create standard parameter names
STANDARD_COLUMN_NAMES = {
    'tier': 'tier',  # Always use 'tier'
    'demand': 'TotalDemand',  # Always use 'TotalDemand'
    'modes': 'modes',  # Always use 'modes'
}

5. Dependency Inversion Principle (DIP)

"Depend on abstractions, not concretions."

Assessment: ⚠️ NEEDS WORK (Grade: C+)

The codebase has limited adherence to DIP, with most modules depending directly on concrete implementations.

Weaknesses

  1. Direct Imports of Concrete Implementations

    # monte_carlo.py
    from .activity import calculate_activity_score  # Concrete function
    from .service import calculate_service_score    # Concrete function
    from .location import calculate_location_score  # Concrete function
    
    # Should depend on abstraction (BaseScorer interface)
  2. Tight Coupling to GeoDataFrame

    # All functions directly depend on GeoDataFrame
    def calculate_activity_score(gdf: gpd.GeoDataFrame, ...):
        # Tightly coupled to GeoDataFrame structure
  3. Hard-Coded Dependencies on Config

    # Every module imports config directly
    from ..config import (
        MODE_WEIGHTS,
        MODE_LINE_DIMINISHING_RETURNS,
        MODE_DIVERSITY_BONUS_PCT,
    )
    # No way to inject alternative configuration
  4. No Dependency Injection

    # Functions create their own dependencies
    def calculate_all_scores(gdf):
        from .activity import calculate_activity_score
        # Dependencies hard-coded inside function
  5. Direct File System Access

    # loaders.py
    def load_transit_nodes(filepath):
        df = pd.read_csv(filepath, encoding=encoding)
        # Tightly coupled to file system

Impact

Current Limitations:

  • ❌ Difficult to unit test (requires real GeoDataFrames)
  • ❌ Cannot swap implementations (e.g., use different scoring algorithms)
  • ❌ Hard to mock dependencies for testing
  • ❌ Configuration changes require code changes
  • ❌ Cannot inject alternative data sources

Recommendations

HIGH PRIORITY: Implement Abstract Interfaces

Create Data Protocol

# src/core/protocols.py
from typing import Protocol
import pandas as pd

class HubDataProtocol(Protocol):
    """Protocol for hub data structures."""

    def __getitem__(self, key: str) -> pd.Series:
        """Get column by name."""
        ...

    def __len__(self) -> int:
        """Get number of hubs."""
        ...

    @property
    def columns(self) -> list:
        """Get column names."""
        ...

    @property
    def index(self) -> pd.Index:
        """Get index."""
        ...

# Now functions can depend on protocol, not GeoDataFrame
def calculate_activity_score(
    hub_data: HubDataProtocol,  # Abstract protocol
    demand_column: str = 'TotalDemand'
) -> pd.Series:
    """Works with any object matching HubDataProtocol."""
    ...
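Because `typing.Protocol` is structural, a plain `DataFrame` already satisfies such a protocol, which is what makes unit tests lighter. A minimal runnable sketch (the protocol here is trimmed to two methods, and the scorer body is a toy stand-in):

```python
from typing import Protocol, runtime_checkable
import pandas as pd

@runtime_checkable
class HubDataProtocol(Protocol):
    """Structural protocol: any object with these methods qualifies."""
    def __getitem__(self, key: str): ...
    def __len__(self) -> int: ...

def calculate_activity_score(hub_data: HubDataProtocol) -> pd.Series:
    # Toy stand-in: only protocol methods are used, so any conforming
    # object (DataFrame, GeoDataFrame, mock) works.
    demand = hub_data["TotalDemand"]
    return demand / demand.max() * 10

df = pd.DataFrame({"TotalDemand": [10, 20]})
assert isinstance(df, HubDataProtocol)  # structural match, no subclassing
```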

Create Configuration Interface

# src/core/config_protocol.py
from typing import Protocol

class ScoringConfigProtocol(Protocol):
    """Protocol for scoring configuration."""

    @property
    def mode_weights(self) -> dict:
        ...

    @property
    def score_range(self) -> tuple:
        ...

    @property
    def monte_carlo_iterations(self) -> int:
        ...

# Inject configuration instead of importing
def calculate_service_score(
    gdf: gpd.GeoDataFrame,
    config: ScoringConfigProtocol  # Injected dependency
) -> pd.Series:
    mode_weights = config.mode_weights  # From injected config
    ...

Create Data Loader Interface

# src/data/protocols.py
from abc import ABC, abstractmethod
from typing import Union
from pathlib import Path

import pandas as pd
import geopandas as gpd

class DataLoader(ABC):
    """Abstract interface for data loading."""

    @abstractmethod
    def load_transit_nodes(self, source: Union[str, Path]) -> gpd.GeoDataFrame:
        """Load transit nodes from source."""
        pass

    @abstractmethod
    def load_demand_data(self, source: Union[str, Path]) -> dict:
        """Load demand data from source."""
        pass

# Concrete implementation
class FileSystemDataLoader(DataLoader):
    """Load data from file system."""

    def load_transit_nodes(self, source):
        return pd.read_csv(source, ...)

    def load_demand_data(self, source):
        return pd.read_excel(source, ...)

# Alternative implementation for testing
class MockDataLoader(DataLoader):
    """Load data from in-memory mocks."""

    def __init__(self, mock_data: dict):
        self.mock_data = mock_data

    def load_transit_nodes(self, source):
        return self.mock_data['transit_nodes']

    def load_demand_data(self, source):
        return self.mock_data['demand']

# Usage with dependency injection
def run_pipeline(data_loader: DataLoader):
    """Pipeline depends on abstract DataLoader."""
    nodes = data_loader.load_transit_nodes('data/nodes.csv')
    demand = data_loader.load_demand_data('data/demand.xlsx')
    ...
MEDIUM PRIORITY: Dependency Injection Container
# src/core/container.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class DependencyContainer:
    """Container for dependency injection."""

    data_loader: 'DataLoader'
    config: 'ScoringConfigProtocol'
    logger: 'Logger'

    @classmethod
    def create_default(cls):
        """Create container with default implementations."""
        from ..data.loaders import FileSystemDataLoader
        from ..config import DefaultConfig
        from ..utils.logging import get_logger

        return cls(
            data_loader=FileSystemDataLoader(),
            config=DefaultConfig(),
            logger=get_logger(__name__)
        )

    @classmethod
    def create_for_testing(cls, mock_data: dict):
        """Create container for testing."""
        from ..data.loaders import MockDataLoader
        from ..config import TestConfig
        from ..utils.logging import get_logger

        return cls(
            data_loader=MockDataLoader(mock_data),
            config=TestConfig(),
            logger=get_logger(__name__)
        )

# Usage in main pipeline
def run_pipeline(container: DependencyContainer = None):
    container = container or DependencyContainer.create_default()

    nodes = container.data_loader.load_transit_nodes(...)
    scores = calculate_scores(nodes, config=container.config)
    ...

# Usage in tests
def test_pipeline():
    mock_data = {'transit_nodes': create_mock_nodes(), ...}
    container = DependencyContainer.create_for_testing(mock_data)

    result = run_pipeline(container)  # Fully testable!
    assert ...

Benefits:

  • ✅ Easy to test (inject mocks)
  • ✅ Flexible configuration (inject different configs)
  • ✅ Swappable implementations (different data sources)
  • ✅ Loose coupling (depend on abstractions)
  • ✅ Single Responsibility (each module has one dependency source)

Summary of Recommendations

Recent Improvements (Since Initial Review)

The codebase has evolved with several notable additions:

AHP Scoring Implementation (Partially addresses OCP)

  • Full AHP module with expert pairwise comparison support
  • Alternative scoring methodology alongside Monte Carlo
  • Separation of concerns maintained

Monte Carlo Distribution Analysis (Enhances SRP)

  • Dedicated mc_distribution.py module for robustness analysis
  • Distribution statistics and rank probability analysis
  • Clear separation from core Monte Carlo aggregation

⚠️ Still Pending: Strategy Pattern for adding new scoring criteria without code modification

Updated Priority Matrix

| Priority | Principle | Recommendation | Effort | Impact | Status |
|---|---|---|---|---|---|
| HIGH | OCP | Implement Strategy Pattern for Scoring | Medium | High | ⚠️ Pending |
| HIGH | DIP | Create Abstract Interfaces (Protocols) | Medium | High | ⚠️ Pending |
| MEDIUM | OCP | Configuration-Driven Pipeline | Low | Medium | 🔄 Partial |
| MEDIUM | DIP | Dependency Injection Container | Medium | Medium | ⚠️ Pending |
| LOW | ISP | Parameter Objects for Complex Functions | Low | Low | ⚠️ Pending |
| LOW | ISP | Standardize Column Name Parameters | Low | Medium | ⚠️ Pending |
| ONGOING | SRP | Maintain Clear Module Boundaries | – | High | ✅ Excellent |

Implementation Roadmap

Phase 1: Foundation (1-2 days)

  1. Create src/scoring/base.py with BaseScorer ABC
  2. Create src/core/protocols.py with data protocols
  3. Create src/scoring/registry.py with ScorerRegistry

Phase 2: Refactoring (2-3 days)

  1. Refactor existing scorers to extend BaseScorer
  2. Update monte_carlo.py to use registry
  3. Add protocol type hints to function signatures

Phase 3: Dependency Injection (1-2 days)

  1. Create DataLoader interface
  2. Implement FileSystemDataLoader and MockDataLoader
  3. Create DependencyContainer

Phase 4: Testing & Documentation (1 day)

  1. Update unit tests to use new interfaces
  2. Update documentation
  3. Add examples of extending the framework

Testing Impact

Current Testing Challenges

  1. Tight Coupling to Real Data

    # Current: Hard to test without real GeoDataFrame
    def test_activity_score():
        gdf = gpd.read_file('real_data.geojson')  # Requires real file
        scores = calculate_activity_score(gdf)
        assert ...
  2. Hard-Coded Dependencies

    # Current: Cannot mock config
    from ..config import MODE_WEIGHTS  # Hard-coded import

After Implementing Recommendations

# After: Easy to test with mocks
def test_activity_score():
    # Create minimal mock data
    mock_data = pd.DataFrame({
        'TotalDemand': [1000, 5000, 10000],
        'tier': ['עירוני', 'מטרופוליני', 'ארצי']  # urban, metropolitan, national
    })

    scorer = ActivityScorer()
    scores = scorer.calculate(mock_data)

    assert scores.min() >= 1
    assert scores.max() <= 10

# After: Can inject test configuration
def test_with_custom_config():
    test_config = TestConfig(
        mode_weights={'BRT': 5.0, 'Metro': 10.0},
        score_range=(0, 100)
    )

    scorer = ServiceScorer(config=test_config)
    scores = scorer.calculate(mock_data)
    ...

Conclusion

The Hub Prioritization Framework demonstrates strong software engineering practices, particularly in:

  • ✅ Clear separation of concerns (SRP)
  • ✅ Focused, minimal interfaces (ISP)
  • ✅ Readable, maintainable code
  • ✅ Active development with architectural improvements (AHP, MC distribution analysis)

Recent Progress: The codebase has grown to include sophisticated distribution analysis and alternative scoring methods while maintaining strong module boundaries. The addition of mc_distribution.py and ahp.py modules shows continued commitment to separation of concerns.

Remaining Opportunities:

  • ⚠️ Extensibility without modification (OCP) - Strategy Pattern for scorers
  • ⚠️ Abstraction over concretion (DIP) - Protocol interfaces and dependency injection

Implementing the remaining recommended changes would:

  1. Make the system more testable (mock dependencies easily)
  2. Enable extension without modification (add new scorers without changing existing code)
  3. Increase flexibility (swap implementations, different data sources)
  4. Improve maintainability (loose coupling, clear interfaces)
  5. Support future evolution (new scoring criteria, alternative algorithms)

The recommended changes are backwards-compatible and can be implemented incrementally without disrupting the existing functionality.

Current Code Quality Status: GOOD → VERY GOOD
The system continues to improve with each iteration while maintaining production stability.



Document Status: Complete
Next Review: After implementation of recommendations
Maintainer: Development Team