
feat: add-sklearn #2

Open
edwinjosechittilappilly wants to merge 7 commits into main from update-sklearn

Conversation


edwinjosechittilappilly commented Mar 10, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Scikit-Learn integration with five new components: dataset loader, model trainer, data scaler, train-test splitter, and classification report generator for machine learning workflows.
    • Added Scikit-Learn icon and UI support.
  • Chores

    • Added scikit-learn dependency.

EntelligenceAI PR Summary

This PR adds comprehensive scikit-learn integration to Langflow with five new ML components and frontend branding support.

  • Added scikit-learn (>=1.6.1) dependency in pyproject.toml and uv.lock
  • Implemented SklearnDatasetComponent for loading 5 built-in datasets (iris, digits, wine, breast_cancer, diabetes)
  • Created SklearnModelComponent supporting 8 model types (4 classifiers, 4 regressors) with training and prediction capabilities
  • Added DataScalerComponent with 4 scaler types (StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler)
  • Implemented TrainTestSplitComponent for dataset splitting with configurable parameters
  • Created ClassificationReportComponent for model evaluation with dictionary and DataFrame outputs
  • Added Scikit-Learn logo assets (SVG, JSX, React component) and registered icon in styleUtils.ts
  • Note: DataScalerComponent contains a duplicate fit_transform call on line 74 that should be addressed

Confidence Score: 2/5 - Changes Needed

  • The package naming issue (scikit-learn with hyphen) is a critical Python import problem that will cause runtime errors when attempting to import the module
  • New components with non-trivial ML functionality (model training, prediction, data processing) were added without any tests, creating significant risk for production bugs
  • These are not cosmetic issues - the import problem is a blocking technical issue and lack of tests for complex ML operations represents poor code quality and maintainability risk
  • While no new issues were found in current review, the two existing unresolved comments represent fundamental problems that should be addressed before merge
  • 8 previous unresolved comment(s) likely resolved in latest diff (score-only signal; thread status unchanged)
Files requiring special attention
  • src/backend/base/langflow/components/scikit-learn/__init__.py
  • src/backend/base/langflow/components/scikit-learn/model.py

Copilot AI review requested due to automatic review settings March 10, 2026 01:46

coderabbitai bot commented Mar 10, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d297031-d128-44e3-89e8-f4690bb3def5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This pull request integrates scikit-learn support into Langflow by adding five ML pipeline components (dataset loading, train-test splitting, data scaling, model training, and classification evaluation), along with frontend icon registration and dependency configuration. The changes enable data manipulation and model management workflows within the Langflow framework.

Changes

  • Dependency Configuration (pyproject.toml): Added scikit-learn>=1.6.1 as a project dependency.
  • Scikit-Learn Components (src/backend/base/langflow/components/scikit-learn/__init__.py, dataset.py, split.py, scaler.py, model.py, evaluator.py): Created five new ML workflow components: SklearnDatasetComponent (loads iris/digits/wine/breast_cancer/diabetes datasets), TrainTestSplitComponent (splits data into train/test subsets), DataScalerComponent (scales features using StandardScaler/MinMaxScaler/RobustScaler/MaxAbsScaler), SklearnModelComponent (trains classification/regression models), and ClassificationReportComponent (generates classification metrics). Each includes input validation, error handling, and DataFrame/Data output formatting.
  • Frontend Icons (src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.jsx, src/frontend/src/icons/ScikitLearn/index.tsx): Added SVG-based Scikit-Learn logo component with a forwardRef-enabled TypeScript wrapper for ref forwarding and prop spreading.
  • UI Integration (src/frontend/src/utils/styleUtils.ts): Registered the Scikit-Learn icon in the nodeIconsLucide mapping and added a Scikit-Learn entry to SIDEBAR_BUNDLES for UI navigation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 3 warnings, 1 inconclusive)

  • Test Coverage For New Implementations (❌ Error): PR adds 7 new component implementations (5 backend scikit-learn components, 2 frontend icon components) with zero test files, while the project has established testing conventions with 142 existing tests. Resolution: create comprehensive test files: backend unit tests for each component (dataset, evaluator, model, scaler, split), integration tests for component workflows, and frontend tests for icon components, covering identified bugs.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.43%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Quality And Coverage (⚠️ Warning): No test files were added for the five new scikit-learn components despite the implementation of complex ML pipeline functionality. Resolution: add comprehensive pytest test files covering data loading, model training, scaling, train-test splits, and classification reports with error handling validation.
  • Test File Naming And Structure (⚠️ Warning): The pull request adds 444 lines of scikit-learn component code across 6 Python files with zero test files, violating established repository testing patterns and leaving critical review issues unvalidated. Resolution: create comprehensive test files in src/backend/tests/integration/components/scikit_learn/ using pytest and src/frontend/tests/core/features/ using Playwright, covering positive cases, error handling, edge cases, and issues flagged in review comments.
  • Excessive Mock Usage Warning (❓ Inconclusive): The PR does not include any test files, so excessive mock usage cannot be assessed. Resolution: review the PR to determine if test coverage should be added for the new scikit-learn components.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat: add-sklearn' directly describes the main change: adding scikit-learn integration with multiple components to the codebase.

github-actions bot added the enhancement (New feature or request) label Mar 10, 2026

Copilot AI left a comment


Pull request overview

Adds Scikit-Learn as a first-class integration, including backend components for basic ML workflows and frontend icon/bundle wiring.

Changes:

  • Add scikit-learn dependency (and update uv.lock).
  • Introduce new backend components under langflow/components/scikit-learn/ (dataset loader, scaler, train/test split, model trainer/predictor, classification report).
  • Add Scikit-Learn icon assets and register the bundle/icon in the frontend sidebar/icon map.

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
pyproject.toml Adds scikit-learn>=1.6.1 to project dependencies.
uv.lock Updates lock metadata and adds scikit-learn; also alters wheel lists for some packages.
src/backend/base/langflow/components/scikit-learn/dataset.py New component to load built-in sklearn datasets as Data/DataFrame.
src/backend/base/langflow/components/scikit-learn/split.py New component to split a DataFrame into train/test sets.
src/backend/base/langflow/components/scikit-learn/scaler.py New component to scale features and return scaled DataFrame + scaler object.
src/backend/base/langflow/components/scikit-learn/model.py New component to train a sklearn model and generate predictions.
src/backend/base/langflow/components/scikit-learn/evaluator.py New component to generate a sklearn classification report as Data/DataFrame.
src/backend/base/langflow/components/scikit-learn/__init__.py Exports the new sklearn components.
src/frontend/src/utils/styleUtils.ts Registers Scikit-Learn in sidebar bundles and icon mapping.
src/frontend/src/icons/ScikitLearn/index.tsx Adds the Scikit-Learn icon React wrapper.
src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.jsx Adds the Scikit-Learn SVG React component.
src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.svg Adds the raw Scikit-Learn SVG asset.


Comment on lines +24 to +29
SliderInput(
    name="test_size",
    display_name="Test Size",
    value=0.25,
    range_spec=RangeSpec(min=0, max=0.9, step=0.01),
),

Copilot AI Mar 10, 2026


test_size slider allows a value of 0, but sklearn.model_selection.train_test_split rejects test_size=0.0 (must be > 0). Consider setting the slider min to a small positive value (e.g. 0.01) to prevent a runtime ValueError from valid UI input.
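This failure mode can be reproduced directly. The sketch below assumes a recent scikit-learn where a float test_size must lie strictly between 0 and 1; the data is a stand-in, not the component's dataset:

```python
# Reproduces the reviewer's concern: a float test_size of 0.0 is rejected
# by sklearn.model_selection.train_test_split, so a slider minimum of 0
# lets valid UI input trigger a runtime ValueError.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)

try:
    train_test_split(X, test_size=0.0)
    rejected = False
except ValueError:  # sklearn's InvalidParameterError subclasses ValueError
    rejected = True

# A small positive minimum such as 0.01 avoids the error entirely.
train, test = train_test_split(X, test_size=0.2, random_state=0)
print(rejected, len(train), len(test))
```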

Copilot uses AI. Check for mistakes.
Comment on lines +72 to +76
        raise ValueError(msg)
    self.scaler_instance = scaler_class()
    scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
    scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
    # Add the target column back to the scaled data

Copilot AI Mar 10, 2026


fit_transform is called twice in a row on the same data, which will double the work and can change results for stateful scalers. Remove the duplicate call and only compute scaled_data once.
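For illustration, the corrected step reduces to a single fit_transform. This is a sketch with stand-in data; scaler_instance and dataset_features mirror the names in the quoted snippet:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the component's feature frame.
dataset_features = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

scaler_instance = StandardScaler()
# One call both fits the scaler and returns the scaled array; repeating it
# only redoes the same work here, and would be actively wrong if the two
# calls ever received different data.
scaled_data = scaler_instance.fit_transform(dataset_features)

scaled_df = pd.DataFrame(scaled_data, columns=dataset_features.columns)
print(scaled_df.shape)
```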

Comment on lines +93 to +96
def predict(self) -> DataFrame:
    self.train_model()
    if not hasattr(self, "TrainedModel"):
        msg = "No trained model provided. Please connect a train-test split component."

Copilot AI Mar 10, 2026


predict() always calls self.train_model() first, which means generating predictions will retrain the model every time and overwrite any previously trained model. This is surprising (and expensive); consider only training when self.TrainedModel is not set, and otherwise reuse the existing model.
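A lazy-training pattern avoids the repeated fit. The class below is an illustrative stand-in, not the Langflow component API:

```python
# Sketch of lazy training: fit only when no trained model exists yet,
# then reuse the same estimator on later predict() calls.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

class ModelHolder:
    """Minimal stand-in for the component; names are illustrative."""

    def __init__(self):
        self.trained_model = None
        self.fit_calls = 0

    def train_model(self, X, y):
        self.fit_calls += 1
        self.trained_model = LogisticRegression(max_iter=1000).fit(X, y)

    def predict(self, X, y):
        if self.trained_model is None:  # train lazily, reuse afterwards
            self.train_model(X, y)
        return self.trained_model.predict(X)

X, y = load_iris(return_X_y=True)
holder = ModelHolder()
holder.predict(X, y)
holder.predict(X, y)
print(holder.fit_calls)  # the second predict() reuses the trained model
```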

Comment on lines +82 to +86
    # Get the selected model class and create an instance
    model_class = self.AVAILABLE_MODELS[self.model_type]

    model = model_class()

Copilot AI Mar 10, 2026


random_state is exposed as an input but isn't used when creating the model instance. Either remove the input or pass it through for models that support it so the UI setting has an effect.
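One way to wire the input through is to pass random_state only to estimators whose constructor accepts it. The helper below is an assumption for illustration, not the component's real code:

```python
# Sketch: inspect the estimator's constructor and forward random_state
# only when the class actually takes that parameter.
import inspect

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR

def make_model(model_class, random_state=None):
    accepts = "random_state" in inspect.signature(model_class).parameters
    if random_state is not None and accepts:
        return model_class(random_state=random_state)
    return model_class()

rf = make_model(RandomForestClassifier, random_state=42)
svr = make_model(SVR, random_state=42)  # SVR has no random_state; built without it
print(rf.random_state)
```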

Comment on lines +63 to +68
    # Generate the classification report
    self.report_data = classification_report(
        y_true,
        y_pred,
        output_dict=True
    )

Copilot AI Mar 10, 2026


The output_dict input is ignored: classification_report(..., output_dict=True) is hard-coded. Use the self.output_dict value so the component matches its UI/description, and note that the downstream str handling in get_report_dataframe() is currently unreachable because of the hard-coded True.
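The two return shapes the flag controls look like this (a minimal sketch with toy labels; passing the user-facing value instead of a hard-coded True makes both branches reachable):

```python
# classification_report returns a dict or a formatted string depending
# on output_dict, so the component should forward self.output_dict here.
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

as_dict = classification_report(y_true, y_pred, output_dict=True)
as_text = classification_report(y_true, y_pred, output_dict=False)

print(type(as_dict).__name__, type(as_text).__name__)
```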

Comment on lines +1 to +6
from .dataset import SklearnDatasetComponent
from .evaluator import ClassificationReportComponent
from .model import SklearnModelComponent
from .scaler import DataScalerComponent
from .split import TrainTestSplitComponent


Copilot AI Mar 10, 2026


This package lives under components/scikit-learn/, but scikit-learn is not a valid identifier for import ... statements (you can't write import langflow.components.scikit-learn). Even if components are discovered by file scanning, this makes direct imports and tooling awkward; consider renaming the directory to scikit_learn (and adjusting any references) to match Python module naming rules.
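The naming constraint is easy to verify: a hyphenated segment is not a valid Python identifier, so it can never appear in a plain import statement.

```python
# "import langflow.components.scikit-learn" is a SyntaxError because the
# last segment is not an identifier; the underscored form is fine, which
# is why the review suggests renaming the directory to scikit_learn.
print("scikit-learn".isidentifier())   # False
print("scikit_learn".isidentifier())   # True
```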

Comment on lines +66 to +72
def split_dataframe_train(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TrainSplit)

def split_dataframe_test(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TestSplit)

Copilot AI Mar 10, 2026


Both split_dataframe_train and split_dataframe_test call split_dataframe_base(), so requesting both outputs will recompute the split twice. Caching the result (e.g., only split if TrainSplit/TestSplit are unset or inputs changed) would avoid extra work on large datasets and guarantee both outputs come from the same split operation.
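The caching idea can be sketched as follows; the class and attribute names are illustrative stand-ins, not the Langflow component:

```python
# Sketch: compute the split once and serve both outputs from the cached
# result, so train and test are guaranteed to come from the same split.
import pandas as pd
from sklearn.model_selection import train_test_split

class SplitCache:
    """Illustrative stand-in for the split component."""

    def __init__(self, dataset, test_size=0.25, random_state=42):
        self.dataset = dataset
        self.test_size = test_size
        self.random_state = random_state
        self._train = self._test = None
        self.split_calls = 0

    def _ensure_split(self):
        if self._train is None:  # only split on the first request
            self.split_calls += 1
            self._train, self._test = train_test_split(
                self.dataset, test_size=self.test_size, random_state=self.random_state
            )

    def train(self):
        self._ensure_split()
        return self._train

    def test(self):
        self._ensure_split()
        return self._test

df = pd.DataFrame({"x": range(8), "y": range(8)})
s = SplitCache(df)
s.train()
s.test()
print(s.split_calls)  # both outputs share one split
```

If inputs can change between runs, the cache would also need invalidation, which is why the comment hedges with "or inputs changed".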

Comment on lines +95 to +98
    if not hasattr(self, "TrainedModel"):
        msg = "No trained model provided. Please connect a train-test split component."
        raise ValueError(msg)
    if not isinstance(self.test_data, DataFrame):

Copilot AI Mar 10, 2026


if not hasattr(self, "TrainedModel") will never be true because TrainedModel is defined as a class attribute (initialized to None). If you want to validate training, check self.TrainedModel is None (and ideally avoid hasattr for this case).
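A tiny demonstration of why the guard never fires (stand-in class, same attribute pattern as the component):

```python
# hasattr() sees class attributes, so "not hasattr(...)" is always False
# here; "is None" is the check that actually detects an untrained model.
class C:
    TrainedModel = None  # class attribute, as in the component

c = C()
print(hasattr(c, "TrainedModel"))  # True even though nothing was trained
print(c.TrainedModel is None)      # True: the meaningful "untrained" signal
```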

Comment on lines +13 to +20
class SklearnModelComponent(Component):
    display_name = "Sklearn Model"
    description = "Train and use scikit-learn models"
    documentation = "https://scikit-learn.org/stable/supervised_learning.html"
    icon = "ScikitLearn"
    TrainedModel = None
    Predictions = None
    ModelPerformance = None

Copilot AI Mar 10, 2026


New scikit-learn components introduce non-trivial behavior (dataset loading, scaling, splitting, model training/prediction, report generation) but no tests were added. The repo has extensive component unit/integration tests under src/backend/tests/; adding a small unit test suite for these components would help prevent regressions (e.g., target-column handling, deterministic splits, and that predict() doesn't retrain).

Comment on lines +101 to +106
    # Make predictions if test data is provided
    # if self.TrainedModel is None:
    #     self.TrainedModel = self.train_model()
    self.test_x = self.test_data.drop(self.target_column, axis=1)
    self.test_y = self.test_data[self.target_column]
    predictions = self.TrainedModel.predict(self.test_x)

Copilot AI Mar 10, 2026


predict() assumes self.target_column exists in test_data (it calls drop(self.target_column, ...)), but it isn't validated like it is for train_data. If the test set is missing the target column, this will raise a pandas KeyError; consider adding an explicit check and a clearer error message.
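A guarded version of the column handling might look like this; split_features is a hypothetical helper named for illustration:

```python
# Sketch: drop the target column only when it is present, so unlabeled
# inference-only frames do not raise a pandas KeyError.
import pandas as pd

def split_features(test_data: pd.DataFrame, target_column: str):
    if target_column in test_data.columns:
        return test_data.drop(target_column, axis=1), test_data[target_column]
    return test_data, None  # no labels available in this frame

labeled = pd.DataFrame({"f": [1, 2], "target": [0, 1]})
unlabeled = pd.DataFrame({"f": [3, 4]})

X1, y1 = split_features(labeled, "target")
X2, y2 = split_features(unlabeled, "target")
print(list(X1.columns), y2)
```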


entelligence-ai-pr-reviews bot left a comment


Walkthrough

This PR integrates scikit-learn machine learning capabilities into the Langflow framework. It adds scikit-learn as a project dependency and introduces five new components: dataset loading (supporting iris, digits, wine, breast_cancer, diabetes datasets), model training/prediction (supporting 8 classifier and regressor types), data preprocessing (train/test splitting and feature scaling with 4 scaler types), and model evaluation (classification reports). The frontend is updated with Scikit-Learn branding assets and icon integration. All components follow Langflow's architecture patterns with proper input validation, metadata, and output methods for seamless pipeline integration.

Changes

  • pyproject.toml, uv.lock: Added scikit-learn (>=1.6.1) as a new project dependency and updated the lockfile with package metadata. Removed platform-specific wheels for NVIDIA packages (aarch64) and legacy PyPy27 distributions.
  • src/backend/base/langflow/components/scikit-learn/__init__.py: Created an initialization file that imports and exports five scikit-learn component classes: SklearnDatasetComponent, ClassificationReportComponent, SklearnModelComponent, DataScalerComponent, and TrainTestSplitComponent.
  • src/backend/base/langflow/components/scikit-learn/dataset.py: Implemented SklearnDatasetComponent for loading five built-in datasets (iris, digits, wine, breast_cancer, diabetes), with outputs as a DataFrame and a Data object with metadata.
  • src/backend/base/langflow/components/scikit-learn/model.py: Added SklearnModelComponent supporting 8 model types (LogisticRegression, RandomForest, SVM, and DecisionTree classifiers and regressors) with train_model() and predict() methods.
  • src/backend/base/langflow/components/scikit-learn/scaler.py: Implemented DataScalerComponent with 4 scaler types (StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler) that scales features while preserving the target column. Contains a duplicate fit_transform call on line 74.
  • src/backend/base/langflow/components/scikit-learn/split.py: Created TrainTestSplitComponent wrapping train_test_split with configurable test_size, random_state, and target_column parameters, returning split train/test datasets.
  • src/backend/base/langflow/components/scikit-learn/evaluator.py: Added ClassificationReportComponent that generates classification reports from true and predicted labels, outputting both dictionary and DataFrame formats.
  • src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.jsx, src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.svg, src/frontend/src/icons/ScikitLearn/index.tsx: Added Scikit-Learn logo assets as a React component and SVG file, with a forwardRef wrapper for icon component integration.
  • src/frontend/src/utils/styleUtils.ts: Registered ScikitLearnIcon in SIDEBAR_BUNDLES and the nodeIconsLucide mapping to enable Scikit-Learn components display in the UI sidebar and node visualizations.

Sequence Diagram

This diagram shows the interactions between components:

sequenceDiagram
    participant Client
    participant Dataset as SklearnDatasetComponent
    participant Split as TrainTestSplitComponent
    participant Scaler as DataScalerComponent
    participant Model as SklearnModelComponent
    participant Evaluator as ClassificationReportComponent

    Note over Client,Evaluator: ML Pipeline Flow

    Client->>Dataset: load_dataset()
    activate Dataset
    Dataset-->>Client: raw_data
    deactivate Dataset

    Client->>Split: split(raw_data)
    activate Split
    Split-->>Client: X_train, X_test, y_train, y_test
    deactivate Split

    Client->>Scaler: fit(X_train)
    activate Scaler
    Scaler-->>Client: fitted_scaler
    deactivate Scaler

    Client->>Scaler: transform(X_train)
    Scaler-->>Client: X_train_scaled

    Client->>Scaler: transform(X_test)
    Scaler-->>Client: X_test_scaled

    Client->>Model: train(X_train_scaled, y_train)
    activate Model
    Model-->>Client: trained_model
    deactivate Model

    Client->>Model: predict(X_test_scaled)
    Model-->>Client: predictions

    Client->>Evaluator: generate_report(y_test, predictions)
    activate Evaluator
    Evaluator-->>Client: classification_metrics
    deactivate Evaluator

    Note over Client: Pipeline Complete


Comment on lines +49 to +72
def split_dataframe_base(self):
    if not hasattr(self, "dataset"):
        msg = "No dataset provided. Please connect a dataset component."
        raise ValueError(msg)
    # split the dataset into train and test
    if not isinstance(self.dataset, DataFrame):
        msg = "The dataset is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)

    if self.target_column not in self.dataset.columns:
        msg = f"Error: The target column '{self.target_column}' does not exist in the dataset."
        raise ValueError(msg)

    train_df, test_df = train_test_split(self.dataset, test_size=self.test_size, random_state=self.random_state)
    self.TrainSplit = train_df
    self.TestSplit = test_df

def split_dataframe_train(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TrainSplit)

def split_dataframe_test(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TestSplit)


Correctness: The component introduces several logic issues:

  1. Metadata Loss: It fails to propagate 'text_key' and 'default_value' from the input dataset to the output DataFrames, which breaks downstream components (like embeddings) that rely on knowing which column contains the text.
  2. Unused Validation: It validates the existence of 'target_column' but never uses it in the split logic, causing unnecessary failures on datasets without that specific column name.
  3. Redundant Computation: train_test_split is executed twice if both outputs are connected, which is inefficient.
🤖 AI Agent Prompt for Cursor/Windsurf

📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue

In `src/backend/base/langflow/components/scikit-learn/split.py`, refactor the split methods to: 1. Cache the split result in `self.TrainSplit` to avoid redundant computation when multiple outputs are used. 2. Remove the unused `target_column` validation block (lines 58-60). 3. Preserve `text_key` and `default_value` metadata when creating the output `DataFrame` objects by passing them from `self.dataset` to the `DataFrame` constructor.

github-actions bot added and then removed the enhancement (New feature or request) label Mar 10, 2026
edwinjosechittilappilly added a commit that referenced this pull request Mar 10, 2026
…e compatibility (langflow-ai#10008)

* fix: Restore Agent memory functionality by fixing inheritance and type compatibility

This commit fixes two critical bugs that completely broke Agent memory in the main branch:

## Bug #1: Inheritance Method Call Error
- Fixed incorrect method calls in Agent component inheritance
- Changed `get_base_inputs()` to `_base_inputs` in:
  - src/lfx/src/lfx/components/agents/agent.py:157
  - src/lfx/src/lfx/base/agents/agent.py:229

## Bug #2: Message Type Incompatibility
- Fixed type checking in Agent base class to handle both Message types
- Memory returns `langflow.schema.message.Message` but Agent expected `lfx.schema.message.Message`
- Updated type check to use duck typing instead of strict isinstance check
- Changed in src/lfx/src/lfx/base/agents/agent.py:148-150

## Impact
- Agents can now remember conversation context across messages
- Memory functionality restored to same level as release-1.6.0
- Fixes issue where agents would forget user information immediately

## Test Results
- Before: Agent says "I don't have access to your name or occupation"
- After: Agent says "Your name is VICTORY TEST, and you work as a memory bug hunter"

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* fix: Update data_to_messages function to accept both Data and Message types

This commit modifies the `data_to_messages` function to accept a list of both `Data` and `Message` types, enhancing type compatibility. The function's docstring has been updated to reflect the new input type and return type, ensuring clarity in its usage.

* fix: improve message validation in Agent

This commit updates the chat history processing in the LCAgentComponent to ensure that only messages with valid 'text' data are included. The method now checks for the presence of 'text' in the message data before converting it to the appropriate format. Additionally, the base input retrieval method has been changed from `_base_inputs` to `get_base_inputs()` for consistency and clarity.

* fix: enhance chat history validation to support Data type

* fix: improve input handling to support dynamic message conversion

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Gabriel Luiz Freitas Almeida <gabriel@langflow.org>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
github-actions bot added and then removed the enhancement (New feature or request) label Mar 10, 2026

coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (1)
src/backend/base/langflow/components/scikit-learn/model.py (1)

93-97: Reuse the trained estimator instead of fitting again in predict().

Line 94 retrains on every prediction request. That doubles fit cost when both outputs are resolved and can overwrite previously trained state. Train lazily only when self.TrainedModel is still None.

Proposed fix
     def predict(self) -> DataFrame:
-        self.train_model()
-        if not hasattr(self, "TrainedModel"):
+        if self.TrainedModel is None:
+            self.train_model()
+        if self.TrainedModel is None:
             msg = "No trained model provided. Please connect a train-test split component."
             raise ValueError(msg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 93 -
97, The predict() method currently calls train_model() unconditionally and
retrains on every prediction; change predict() to only call self.train_model()
when self.TrainedModel is None (i.e., train lazily), and afterwards proceed to
use self.TrainedModel for prediction or raise the existing ValueError if it's
still not set. Update references in predict() to rely on the trained estimator
attribute TrainedModel rather than refitting each call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 118: The langflow-base package is missing the scikit-learn runtime
dependency, causing imports under langflow/components/scikit-learn to fail for
standalone installs; add "scikit-learn>=1.6.1" to the langflow-base
pyproject.toml dependencies (the base package's pyproject.toml) so the
dependency is declared in the package manifest and will be installed with
langflow-base.

In `@src/backend/base/langflow/components/scikit-learn/evaluator.py`:
- Around line 15-16: The code currently caches report_data and only regenerates
it when report_data is None, which causes stale outputs if y_true, y_pred, or
output_dict change; fix it by removing the persistent cache or resetting it
whenever inputs change: either compute and return the classification report
fresh each call instead of using the module-level/instance variable report_data,
or ensure any setter or update path for y_true, y_pred, or output_dict clears
report_data (sets it to None) before the getters run; locate the report_data
variable and the getters that read it as well as code paths that assign y_true,
y_pred, and output_dict and apply one of these fixes to lines referenced around
the report_data usage (including the block at ~70-80).
- Around line 30-35: The UI BoolInput named "output_dict" is never used because
generate_report() hardcodes output_dict=True when calling
sklearn.metrics.classification_report, making the string branch unreachable;
update generate_report() to read the node input value (output_dict) and pass
that boolean into classification_report(...) instead of the hardcoded True, and
ensure downstream logic still handles both dict and string outputs (the existing
string-handling branch in generate_report remains applicable when output_dict is
False).

In `@src/backend/base/langflow/components/scikit-learn/model.py`:
- Around line 64-74: The train_model method currently requires self.test_data
and raises if it's missing which breaks training-only flows; remove or defer the
test_data isinstance check from train_model (keep the train_data existence/type
checks and the ValueError/TypeError for it) and move the test_data validation
into the predict method (or wherever model prediction is performed) so that the
Trained Model output can be produced without a wired test frame; update
predict() to raise a clear TypeError/ValueError if self.test_data is missing or
not a DataFrame before using it.
- Around line 98-105: The code currently assumes self.target_column exists in
self.test_data and always sets self.test_y, which breaks inference on unlabeled
feature frames; update the block in the method that handles self.test_data
(checking isinstance(self.test_data, DataFrame)) so that you set self.test_x by
dropping self.target_column only if self.target_column is present in
self.test_data.columns, otherwise set self.test_x to the entire DataFrame, and
only assign self.test_y if the target column exists; keep the TypeError for
non-DataFrame inputs and remove the unconditional read of self.test_y.
- Around line 45-50: The component exposes IntInput name="random_state" but the
estimator constructors are created without using it; update the estimator
instantiation code (where each estimator is constructed — e.g., the lines that
create LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier,
RandomForestRegressor, DecisionTreeRegressor) to pass
random_state=int(random_state) (or random_state=None when input is None/empty)
into their constructors so the user-specified reproducibility value is wired
into the created estimator; ensure you convert the IntInput value to an int and
only add the argument for estimators that accept random_state.

In `@src/backend/base/langflow/components/scikit-learn/scaler.py`:
- Around line 15-16: The component currently memoizes scaled_data and
scaler_instance on the instance (variables scaled_data and scaler_instance) so
subsequent executions can reuse stale state; modify the component to clear or
reinitialize these cached attributes at the start of each execution (e.g., in
the main execute/run method) or change the getters to recompute when inputs
change (scaler_type, target_column, or input dataset) — ensure you also clear
any other cached attributes referenced around lines 84-93 the same way so each
run recreates scaler_instance and recomputes scaled_data from the current
dataset/parameters.
- Around line 52-82: scale_data currently always fits a new scaler (using
SCALER_MAPPING and scaler_instance) which causes data leakage; change scale_data
to (1) accept an optional pre-fitted scaler_instance or a mode flag (e.g.,
fit=True/False) on the component, (2) if a fitted scaler_instance is provided or
fit=False call scaler_instance.transform(...) instead of fit_transform, and (3)
only call fit_transform when no fitted scaler is supplied and fit=True; ensure
you remove the duplicate fit_transform call and keep handling of target_column,
dataset_features, scaled_data, concat_data and scaled_data assignment intact so
transforms preserve original column order.

In `@src/backend/base/langflow/components/scikit-learn/split.py`:
- Around line 24-29: The SliderInput for the "test_size" parameter currently
allows min=0 which can pass 0 to train_test_split and cause a ValueError; update
the RangeSpec in the SliderInput (name="test_size") to use min=0.01 instead of
min=0 so the slider only permits values in (0.0,1.0), leaving the default value
(value=0.25) unchanged.

---

Nitpick comments:
In `@src/backend/base/langflow/components/scikit-learn/model.py`:
- Around line 93-97: The predict() method currently calls train_model()
unconditionally and retrains on every prediction; change predict() to only call
self.train_model() when self.TrainedModel is None (i.e., train lazily), and
afterwards proceed to use self.TrainedModel for prediction or raise the existing
ValueError if it's still not set. Update references in predict() to rely on the
trained estimator attribute TrainedModel rather than refitting each call.
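As a minimal sketch of the lazy-training pattern this nitpick asks for (class and attribute names here are hypothetical stand-ins, not the component's real API):

```python
class LazyModelSketch:
    """Illustrative stand-in for the component's predict() flow:
    train once, then reuse the fitted estimator on later calls."""

    def __init__(self):
        self.trained_model = None
        self.fit_calls = 0  # instrumentation: counts how often we fit

    def train_model(self):
        self.fit_calls += 1
        self.trained_model = "fitted-estimator"  # placeholder for a fitted model
        return self.trained_model

    def predict(self, features):
        # Train lazily: only fit when no trained model exists yet,
        # instead of refitting on every prediction call.
        if self.trained_model is None:
            self.train_model()
        return [0] * len(features)  # placeholder predictions


comp = LazyModelSketch()
comp.predict([[1.0], [2.0]])
comp.predict([[3.0]])  # reuses the already-fitted model
```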

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 178aa6b9-de4c-4adf-8362-ac0e3f5e0806

📥 Commits

Reviewing files that changed from the base of the PR and between 5220c22 and 9e194d1.

⛔ Files ignored due to path filters (2)
  • src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.svg is excluded by !**/*.svg
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • pyproject.toml
  • src/backend/base/langflow/components/scikit-learn/__init__.py
  • src/backend/base/langflow/components/scikit-learn/dataset.py
  • src/backend/base/langflow/components/scikit-learn/evaluator.py
  • src/backend/base/langflow/components/scikit-learn/model.py
  • src/backend/base/langflow/components/scikit-learn/scaler.py
  • src/backend/base/langflow/components/scikit-learn/split.py
  • src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.jsx
  • src/frontend/src/icons/ScikitLearn/index.tsx
  • src/frontend/src/utils/styleUtils.ts

pyproject.toml Outdated
"apify-client>=1.8.1",
"pylint>=3.3.4",
"ruff>=0.9.7",
"scikit-learn>=1.6.1",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e

echo "== Workspace/package manifests =="
fd '^pyproject\.toml$'

echo
echo "== sklearn imports under src/backend/base =="
rg -n --glob '*.py' '\bfrom sklearn\b|\bimport sklearn\b' src/backend/base

echo
echo "== scikit-learn declarations across manifests =="
rg -n --glob 'pyproject.toml' 'name\s*=\s*"langflow-base"|name\s*=\s*"langflow"|scikit-learn' .

Repository: Qentora/langflow

Length of output: 1548


🏁 Script executed:

cat src/backend/base/pyproject.toml

Repository: Qentora/langflow

Length of output: 7294


Add scikit-learn>=1.6.1 to src/backend/base/pyproject.toml dependencies.

The sklearn imports in src/backend/base/langflow/components/scikit-learn/ are missing from that package's manifest. Since langflow-base is independently published, standalone installs will fail to import sklearn at runtime. The dependency must be declared in src/backend/base/pyproject.toml, not only in the root package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 118, The langflow-base package is missing the
scikit-learn runtime dependency, causing imports under
langflow/components/scikit-learn to fail for standalone installs; add
"scikit-learn>=1.6.1" to the langflow-base pyproject.toml dependencies (the base
package's pyproject.toml) so the dependency is declared in the package manifest
and will be installed with langflow-base.
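As a sketch, the base package manifest would carry the same pin as the root manifest (the surrounding entries below are abbreviated and illustrative):

```toml
# src/backend/base/pyproject.toml (illustrative excerpt)
[project]
dependencies = [
    # ...existing langflow-base runtime dependencies...
    "scikit-learn>=1.6.1",
]
```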

Comment on lines +15 to +16
report_data = None


⚠️ Potential issue | 🟠 Major

Don't cache report_data across executions.

The getters only regenerate when report_data is None. If this component instance is reused, changing y_true, y_pred, or output_dict can still return the previous report.

Also applies to: 70-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/evaluator.py` around lines
15 - 16, The code currently caches report_data and only regenerates it when
report_data is None, which causes stale outputs if y_true, y_pred, or
output_dict change; fix it by removing the persistent cache or resetting it
whenever inputs change: either compute and return the classification report
fresh each call instead of using the module-level/instance variable report_data,
or ensure any setter or update path for y_true, y_pred, or output_dict clears
report_data (sets it to None) before the getters run; locate the report_data
variable and the getters that read it as well as code paths that assign y_true,
y_pred, and output_dict and apply one of these fixes to lines referenced around
the report_data usage (including the block at ~70-80).
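A cache-free version of the getter could look like the following sketch (the class and attribute names mirror the component loosely and are illustrative, not its real API):

```python
from sklearn.metrics import classification_report


class ReportSketch:
    """Hypothetical cache-free evaluator: no report_data memoization."""

    def __init__(self, y_true, y_pred, output_dict=True):
        self.y_true = y_true
        self.y_pred = y_pred
        self.output_dict = output_dict

    def generate_report(self):
        # Recompute on every call, so the result always reflects the
        # current y_true / y_pred / output_dict values.
        return classification_report(
            self.y_true, self.y_pred, output_dict=self.output_dict
        )


comp = ReportSketch(y_true=[0, 1, 1, 0], y_pred=[0, 1, 0, 0])
first = comp.generate_report()
comp.y_pred = [0, 1, 1, 0]  # changed input is picked up on the next call
second = comp.generate_report()
```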

Comment on lines +30 to +35
BoolInput(
name="output_dict",
display_name="Output as Dictionary",
info="If True, return output as dict instead of string",
value=True,
),

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/backend/base/langflow/components/scikit-learn/evaluator.py

Repository: Qentora/langflow

Length of output: 4112


🌐 Web query:

sklearn classification_report output_dict parameter

💡 Result:

In sklearn.metrics.classification_report, the parameter output_dict controls the return type:

  • output_dict=False (default): returns a formatted text report (string).
  • output_dict=True: returns a Python dict (nested) with per-class and aggregate metrics. Each class label maps to {'precision', 'recall', 'f1-score', 'support'}; aggregate keys include things like 'macro avg', 'weighted avg' (and sometimes 'micro avg' / 'samples avg' depending on the problem). Also, when output_dict=True, digits is ignored (values are not rounded). [1]

Example:

from sklearn.metrics import classification_report
report = classification_report(y_true, y_pred, output_dict=True)
f1_class0 = report["0"]["f1-score"]
acc = report["accuracy"]

Caveat: the dict keys include fixed names like 'accuracy', so if you have a class label named "accuracy", it can collide and make the output “brittle.” [2]

Sources: [1] scikit-learn docs for classification_report (sklearn.org), [2] scikit-learn GitHub issue discussing output_dict=True brittleness (github.com)

Citations:


Wire the output_dict input into classification_report().

The UI exposes output_dict as a BoolInput, but generate_report() always passes output_dict=True. This makes the input dead—toggling it has no effect, and the string-handling branch (lines 81–83) is unreachable.

Proposed fix
         self.report_data = classification_report(
             y_true,
             y_pred,
-            output_dict=True
+            output_dict=self.output_dict,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/evaluator.py` around lines
30 - 35, The UI BoolInput named "output_dict" is never used because
generate_report() hardcodes output_dict=True when calling
sklearn.metrics.classification_report, making the string branch unreachable;
update generate_report() to read the node input value (output_dict) and pass
that boolean into classification_report(...) instead of the hardcoded True, and
ensure downstream logic still handles both dict and string outputs (the existing
string-handling branch in generate_report remains applicable when output_dict is
False).

Comment on lines +45 to +50
IntInput(
name="random_state",
display_name="Random State",
value=42,
info="Random state for reproducibility",
),

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

In scikit-learn 1.6.1, which of these estimators accept a random_state constructor argument: LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier, LinearRegression, RandomForestRegressor, SVR, DecisionTreeRegressor?

💡 Result:

In scikit-learn 1.6.1, these accept random_state in the constructor:

  • Yes: LogisticRegression [1], RandomForestClassifier [2], SVC [3], DecisionTreeClassifier [4], RandomForestRegressor [6], DecisionTreeRegressor [8]
  • No: LinearRegression [5], SVR [7]

Sources: [1] https://scikit-learn.org/1.6/modules/generated/sklearn.linear_model.LogisticRegression.html, [2] https://scikit-learn.org/1.6/modules/generated/sklearn.ensemble.RandomForestClassifier.html, [3] https://scikit-learn.org/1.6/modules/generated/sklearn.svm.SVC.html, [4] https://scikit-learn.org/1.6/modules/generated/sklearn.tree.DecisionTreeClassifier.html, [5] https://scikit-learn.org/1.6/modules/generated/sklearn.linear_model.LinearRegression.html, [6] https://scikit-learn.org/1.6/modules/generated/sklearn.ensemble.RandomForestRegressor.html, [7] https://scikit-learn.org/1.6/modules/generated/sklearn.svm.SVR.html, [8] https://scikit-learn.org/1.6/modules/generated/sklearn.tree.DecisionTreeRegressor.html


🏁 Script executed:

find . -type f -name "*.py" -path "*/scikit-learn/*" | head -20

Repository: Qentora/langflow

Length of output: 435


🏁 Script executed:

cat -n src/backend/base/langflow/components/scikit-learn/model.py | sed -n '40,95p'

Repository: Qentora/langflow

Length of output: 2552


🏁 Script executed:

rg "AVAILABLE_MODELS" src/backend/base/langflow/components/scikit-learn/ -A 20

Repository: Qentora/langflow

Length of output: 6042


Wire random_state into estimator construction.

The component exposes a reproducibility input on lines 45-50, but line 85 instantiates the model without it. Six of the eight available estimators (LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier, RandomForestRegressor, DecisionTreeRegressor) accept random_state, so users setting this input still receive non-reproducible models.

Proposed fix
-        model = model_class()
+        model_kwargs = {}
+        if "random_state" in model_class().get_params():
+            model_kwargs["random_state"] = self.random_state
+        model = model_class(**model_kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 45 -
50, The component exposes IntInput name="random_state" but the estimator
constructors are created without using it; update the estimator instantiation
code (where each estimator is constructed — e.g., the lines that create
LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier,
RandomForestRegressor, DecisionTreeRegressor) to pass
random_state=int(random_state) (or random_state=None when input is None/empty)
into their constructors so the user-specified reproducibility value is wired
into the created estimator; ensure you convert the IntInput value to an int and
only add the argument for estimators that accept random_state.

Comment on lines +64 to +74
def train_model(self) -> BaseEstimator:
if not hasattr(self, "train_data"):
msg = "No training data provided. Please connect a train-test split component."
raise ValueError(msg)

if not isinstance(self.train_data, DataFrame):
msg = "The training data is not a DataFrame. Please connect a DataFrame component."
raise TypeError(msg)
if not isinstance(self.test_data, DataFrame):
msg = "The test data is not a DataFrame. Please connect a DataFrame component."
raise TypeError(msg)

⚠️ Potential issue | 🟠 Major

Don’t require test_data to build the model output.

Line 72 makes train_model() fail when no test frame is wired, even though the Trained Model output only needs train_data. In the current form, training-only flows will raise before fitting. Move test_data validation into predict().

Proposed fix
     def train_model(self) -> BaseEstimator:
         if not hasattr(self, "train_data"):
             msg = "No training data provided. Please connect a train-test split component."
             raise ValueError(msg)

         if not isinstance(self.train_data, DataFrame):
             msg = "The training data is not a DataFrame. Please connect a DataFrame component."
             raise TypeError(msg)
-        if not isinstance(self.test_data, DataFrame):
-            msg = "The test data is not a DataFrame. Please connect a DataFrame component."
-            raise TypeError(msg)
         if self.target_column not in self.train_data.columns:
             msg = f"Error: The target column '{self.target_column}' does not exist in the training data."
             raise ValueError(msg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 64 -
74, The train_model method currently requires self.test_data and raises if it's
missing which breaks training-only flows; remove or defer the test_data
isinstance check from train_model (keep the train_data existence/type checks and
the ValueError/TypeError for it) and move the test_data validation into the
predict method (or wherever model prediction is performed) so that the Trained
Model output can be produced without a wired test frame; update predict() to
raise a clear TypeError/ValueError if self.test_data is missing or not a
DataFrame before using it.

Comment on lines +98 to +105
if not isinstance(self.test_data, DataFrame):
msg = "The test data is not a DataFrame. Please connect a DataFrame component."
raise TypeError(msg)
# Make predictions if test data is provided
# if self.TrainedModel is None:
# self.TrainedModel = self.train_model()
self.test_x = self.test_data.drop(self.target_column, axis=1)
self.test_y = self.test_data[self.target_column]

⚠️ Potential issue | 🟠 Major

Allow prediction on unlabeled feature frames.

Lines 104-105 require target_column to be present in test_data, but this method never uses self.test_y. That breaks the common inference case where the input only contains features. Drop the target column only if it exists, and remove the unused label read.

Proposed fix
         if not isinstance(self.test_data, DataFrame):
             msg = "The test data is not a DataFrame. Please connect a DataFrame component."
             raise TypeError(msg)
         # Make predictions if test data is provided
         # if self.TrainedModel is None:
         # self.TrainedModel = self.train_model()
-        self.test_x = self.test_data.drop(self.target_column, axis=1)
-        self.test_y = self.test_data[self.target_column]
+        self.test_x = self.test_data.drop(columns=[self.target_column], errors="ignore")
         predictions = self.TrainedModel.predict(self.test_x)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 98 -
105, The code currently assumes self.target_column exists in self.test_data and
always sets self.test_y, which breaks inference on unlabeled feature frames;
update the block in the method that handles self.test_data (checking
isinstance(self.test_data, DataFrame)) so that you set self.test_x by dropping
self.target_column only if self.target_column is present in
self.test_data.columns, otherwise set self.test_x to the entire DataFrame, and
only assign self.test_y if the target column exists; keep the TypeError for
non-DataFrame inputs and remove the unconditional read of self.test_y.

Comment on lines +15 to +16
scaled_data = None
scaler_instance = None

⚠️ Potential issue | 🟠 Major

Reset cached scaler state between executions.

scaled_data and scaler_instance are memoized on the component instance, and the getters only recompute once. If this component is reused with a different dataset, scaler_type, or target_column, it can return the previous run's frame/scaler.

Also applies to: 84-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/scaler.py` around lines 15
- 16, The component currently memoizes scaled_data and scaler_instance on the
instance (variables scaled_data and scaler_instance) so subsequent executions
can reuse stale state; modify the component to clear or reinitialize these
cached attributes at the start of each execution (e.g., in the main execute/run
method) or change the getters to recompute when inputs change (scaler_type,
target_column, or input dataset) — ensure you also clear any other cached
attributes referenced around lines 84-93 the same way so each run recreates
scaler_instance and recomputes scaled_data from the current dataset/parameters.
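A minimal sketch of the reset-per-execution approach (the class shape and mapping keys are illustrative, not the component's real definitions):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class ScalerSketch:
    """Hypothetical component: clear memoized state at the start of each run."""

    SCALER_MAPPING = {"Standard Scaler": StandardScaler, "MinMax Scaler": MinMaxScaler}

    def __init__(self, scaler_type):
        self.scaler_type = scaler_type
        self.scaled_data = None
        self.scaler_instance = None

    def scale_data(self, rows):
        # Reset cached attributes so a reused instance never serves the
        # previous run's scaler or frame.
        self.scaled_data = None
        self.scaler_instance = None
        self.scaler_instance = self.SCALER_MAPPING[self.scaler_type]()
        self.scaled_data = self.scaler_instance.fit_transform(rows)
        return self.scaled_data


comp = ScalerSketch("Standard Scaler")
first = comp.scale_data([[0.0], [10.0]])
comp.scaler_type = "MinMax Scaler"  # changed input takes effect on rerun
second = comp.scale_data([[0.0], [10.0]])
```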

Comment on lines +52 to +82
def scale_data(self):
if not hasattr(self, "dataset"):
msg = "No dataset provided. Please connect a dataset component."
raise ValueError(msg)

if not isinstance(self.dataset, DataFrame):
msg = "The dataset is not a DataFrame. Please connect a DataFrame component."
raise TypeError(msg)

# Create the appropriate scaler instance
scaler_class = self.SCALER_MAPPING.get(self.scaler_type)
if scaler_class is None:
msg = f"Invalid scaler type: {self.scaler_type}"
raise ValueError(msg)

# Initialize and fit the scaler
if self.target_column in self.dataset.columns:
self.dataset_features = self.dataset.drop(self.target_column, axis=1)
else:
msg = "Target column not found in dataset"
raise ValueError(msg)
self.scaler_instance = scaler_class()
scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
# Add the target column back to the scaled data
scaled_data = pd.DataFrame(scaled_data, columns=self.dataset_features.columns)
concat_data = pd.concat([scaled_data, self.dataset[self.target_column]], axis=1)

# Convert to DataFrame with original column names
scale_df = pd.DataFrame(concat_data, columns=self.dataset.columns)
self.scaled_data = DataFrame(scale_df)

⚠️ Potential issue | 🟠 Major

This API bakes data leakage into train/test workflows.

scale_data() always calls fit_transform() on the provided dataset, but the component has no way to accept an already-fitted scaler and run transform() on a held-out/test dataset. With TrainTestSplitComponent in the same feature set, the only supported flows are either scaling before the split (leakage) or fitting separate scalers per split (invalid metrics).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/scaler.py` around lines 52
- 82, scale_data currently always fits a new scaler (using SCALER_MAPPING and
scaler_instance) which causes data leakage; change scale_data to (1) accept an
optional pre-fitted scaler_instance or a mode flag (e.g., fit=True/False) on the
component, (2) if a fitted scaler_instance is provided or fit=False call
scaler_instance.transform(...) instead of fit_transform, and (3) only call
fit_transform when no fitted scaler is supplied and fit=True; ensure you remove
the duplicate fit_transform call and keep handling of target_column,
dataset_features, scaled_data, concat_data and scaled_data assignment intact so
transforms preserve original column order.
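The leakage-free pattern being asked for is the standard fit-on-train, transform-on-test flow; a minimal sketch with made-up feature frames (in the component these would come from TrainTestSplitComponent):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train/test splits.
train_features = np.array([[0.0], [10.0]])
test_features = np.array([[5.0], [20.0]])

scaler = StandardScaler()
# Fit on the training split only: learns mean=5, scale=5.
train_scaled = scaler.fit_transform(train_features)
# Reuse the fitted instance on held-out data: transform, never refit.
test_scaled = scaler.transform(test_features)
```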

Comment on lines +24 to +29
SliderInput(
name="test_size",
display_name="Test Size",
value=0.25,
range_spec=RangeSpec(min=0, max=0.9, step=0.01),
),

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/backend/base/langflow/components/scikit-learn/split.py | head -40

Repository: Qentora/langflow

Length of output: 1737


🌐 Web query:

sklearn train_test_split test_size=0 parameter validation constraints

💡 Result:

In sklearn.model_selection.train_test_split, test_size is validated so the split is non-empty and the value is in-range:

  • If test_size is a float: it must be in the (0.0, 1.0) proportion range (docs say “between 0.0 and 1.0” and the implementation enforces it cannot be 0). Passing test_size=0.0 triggers a ValueError. [1]
  • If test_size is an int: it must be >= 1 and < n_samples (otherwise you’d get an empty train or test set). The common error for test_size=0 is: “test_size=0 should be either positive and smaller than the number of samples … or a float in the (0,1) range”. [2]
  • If both train_size and test_size are given: they must be compatible (for floats, their sum must be <= 1.0; for ints, <= n_samples), and the resulting train/test sizes must each be at least 1 sample (else you’ll get the “resulting train set will be empty” style error). [3]
  • Defaults: if both are None, test_size defaults to 0.25. [1]

Sources: scikit-learn train_test_split docs [1], Stack Overflow error example for test_size=0 [2], scikit-learn error behavior example referencing _validate_shuffle_split (“resulting train set will be empty”) [3].


Change min=0 to min=0.01 in the slider's range specification.

train_test_split() requires test_size to be strictly in the range (0.0, 1.0). The current slider allows test_size=0, which will fail at runtime with a ValueError.

Proposed fix
         SliderInput(
             name="test_size",
             display_name="Test Size",
             value=0.25,
-            range_spec=RangeSpec(min=0, max=0.9, step=0.01),
+            range_spec=RangeSpec(min=0.01, max=0.9, step=0.01),
         ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/base/langflow/components/scikit-learn/split.py` around lines 24 -
29, The SliderInput for the "test_size" parameter currently allows min=0 which
can pass 0 to train_test_split and cause a ValueError; update the RangeSpec in
the SliderInput (name="test_size") to use min=0.01 instead of min=0 so the
slider only permits values in (0.0,1.0), leaving the default value (value=0.25)
unchanged.

@github-actions

github-actions bot commented Mar 10, 2026

Frontend Unit Test Coverage Report

Coverage Summary

| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| 24% | 23.75% (8457/35606) | 16.57% (4643/28013) | 16.54% (1236/7472) |

Unit Test Results

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 2679 | 0 💤 | 0 ❌ | 0 🔥 | 45.73s ⏱️ |

@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Mar 10, 2026
@codecov

codecov bot commented Mar 10, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️


Labels

community enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants