**Walkthrough**

This pull request integrates scikit-learn support into Langflow by adding five ML pipeline components (dataset loading, train-test splitting, data scaling, model training, and classification evaluation), along with frontend icon registration and dependency configuration. The changes enable data manipulation and model management workflows within the Langflow framework.
**Estimated code review effort:** 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pull request overview
Adds Scikit-Learn as a first-class integration, including backend components for basic ML workflows and frontend icon/bundle wiring.
Changes:
- Add `scikit-learn` dependency (and update `uv.lock`).
- Introduce new backend components under `langflow/components/scikit-learn/` (dataset loader, scaler, train/test split, model trainer/predictor, classification report).
- Add Scikit-Learn icon assets and register the bundle/icon in the frontend sidebar/icon map.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| `pyproject.toml` | Adds `scikit-learn>=1.6.1` to project dependencies. |
| `uv.lock` | Updates lock metadata and adds scikit-learn; also alters wheel lists for some packages. |
| `src/backend/base/langflow/components/scikit-learn/dataset.py` | New component to load built-in sklearn datasets as Data/DataFrame. |
| `src/backend/base/langflow/components/scikit-learn/split.py` | New component to split a DataFrame into train/test sets. |
| `src/backend/base/langflow/components/scikit-learn/scaler.py` | New component to scale features and return scaled DataFrame + scaler object. |
| `src/backend/base/langflow/components/scikit-learn/model.py` | New component to train a sklearn model and generate predictions. |
| `src/backend/base/langflow/components/scikit-learn/evaluator.py` | New component to generate a sklearn classification report as Data/DataFrame. |
| `src/backend/base/langflow/components/scikit-learn/__init__.py` | Exports the new sklearn components. |
| `src/frontend/src/utils/styleUtils.ts` | Registers Scikit-Learn in sidebar bundles and icon mapping. |
| `src/frontend/src/icons/ScikitLearn/index.tsx` | Adds the Scikit-Learn icon React wrapper. |
| `src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.jsx` | Adds the Scikit-Learn SVG React component. |
| `src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.svg` | Adds the raw Scikit-Learn SVG asset. |
```python
SliderInput(
    name="test_size",
    display_name="Test Size",
    value=0.25,
    range_spec=RangeSpec(min=0, max=0.9, step=0.01),
),
```
The `test_size` slider allows a value of 0, but `sklearn.model_selection.train_test_split` rejects `test_size=0.0` (it must be > 0). Consider setting the slider minimum to a small positive value (e.g. 0.01) to prevent a runtime `ValueError` from otherwise valid UI input.
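Independent of tightening the slider's `RangeSpec`, the component could defensively clamp the value before calling `train_test_split`. A minimal sketch, assuming a hypothetical `safe_test_size` helper (not part of the PR):

```python
def safe_test_size(value: float, minimum: float = 0.01, maximum: float = 0.9) -> float:
    """Clamp a UI-provided test_size into a range train_test_split accepts."""
    return min(max(value, minimum), maximum)

print(safe_test_size(0.0))   # clamped up to 0.01
print(safe_test_size(0.25))  # unchanged
print(safe_test_size(1.5))   # clamped down to 0.9
```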
```python
    raise ValueError(msg)
self.scaler_instance = scaler_class()
scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
# Add the target column back to the scaled data
```
`fit_transform` is called twice in a row on the same data, which doubles the work and can change results for stateful scalers. Remove the duplicate call and compute `scaled_data` only once.
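Beyond removing the duplicate call, the usual scaler contract is to fit once on training data and reuse the fitted statistics for any later transform. A toy sketch with a hand-rolled standardizer (`MiniStandardScaler` is illustrative, not sklearn's class):

```python
from statistics import mean, pstdev

class MiniStandardScaler:
    """Toy stand-in for StandardScaler: fit learns mean/std, transform reuses them."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

scaler = MiniStandardScaler()
train_scaled = scaler.fit_transform([1.0, 2.0, 3.0])  # fit exactly once, on training data
test_scaled = scaler.transform([2.0])                 # reuse fitted statistics, no refit
```

Fitting again on the same (or test) data silently moves `mean_`/`scale_`, which is exactly the kind of state change a duplicated `fit_transform` risks.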
```python
def predict(self) -> DataFrame:
    self.train_model()
    if not hasattr(self, "TrainedModel"):
        msg = "No trained model provided. Please connect a train-test split component."
```
`predict()` always calls `self.train_model()` first, which means generating predictions retrains the model every time and overwrites any previously trained model. This is surprising (and expensive); consider training only when `self.TrainedModel` is not set, and otherwise reusing the existing model.
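The lazy-training pattern the comment suggests can be sketched in plain Python; `LazyTrainer` is a stand-in for the component, and the counter is only instrumentation to show that fitting happens once:

```python
class LazyTrainer:
    """Sketch of the suggested pattern: train at most once, then reuse the model."""

    def __init__(self):
        self.trained_model = None
        self.fit_calls = 0  # counts how many times training actually runs

    def train_model(self):
        self.fit_calls += 1
        self.trained_model = object()  # stand-in for a fitted estimator
        return self.trained_model

    def predict(self):
        if self.trained_model is None:  # train lazily instead of unconditionally
            self.train_model()
        return self.trained_model

trainer = LazyTrainer()
trainer.predict()
trainer.predict()
print(trainer.fit_calls)  # → 1
```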
```python
# Get the selected model class and create an instance
model_class = self.AVAILABLE_MODELS[self.model_type]

model = model_class()
```
`random_state` is exposed as an input but isn't used when creating the model instance. Either remove the input or pass it through for models that support it, so the UI setting has an effect.
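One way to honor the input only for estimators whose constructors accept it (e.g. `LinearRegression` and `SVR` do not take `random_state`) is to inspect the constructor signature. A sketch with stub classes standing in for sklearn estimators; `make_model` is a hypothetical helper, not part of the PR:

```python
import inspect

class WithSeed:
    """Stand-in for an estimator that accepts random_state (e.g. RandomForestClassifier)."""
    def __init__(self, random_state=None):
        self.random_state = random_state

class WithoutSeed:
    """Stand-in for an estimator without random_state (e.g. LinearRegression)."""
    def __init__(self):
        pass

def make_model(model_class, random_state):
    """Pass random_state only to constructors that declare it."""
    params = inspect.signature(model_class.__init__).parameters
    if "random_state" in params:
        return model_class(random_state=random_state)
    return model_class()

m1 = make_model(WithSeed, 42)      # seed wired through
m2 = make_model(WithoutSeed, 42)   # no TypeError even though it takes no seed
```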
```python
# Generate the classification report
self.report_data = classification_report(
    y_true,
    y_pred,
    output_dict=True
)
```
The output_dict input is ignored: classification_report(..., output_dict=True) is hard-coded. Use the self.output_dict value so the component matches its UI/description, and note that the downstream str handling in get_report_dataframe() is currently unreachable because of the hard-coded True.
```python
from .dataset import SklearnDatasetComponent
from .evaluator import ClassificationReportComponent
from .model import SklearnModelComponent
from .scaler import DataScalerComponent
from .split import TrainTestSplitComponent
```
This package lives under `components/scikit-learn/`, but `scikit-learn` is not a valid identifier for `import ...` statements (you can't write `import langflow.components.scikit-learn`). Even if components are discovered by file scanning, this makes direct imports and tooling awkward; consider renaming the directory to `scikit_learn` (and adjusting any references) to match Python module naming rules.
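A quick way to see why the directory name is a problem: each dotted segment of an import statement must be a valid Python identifier, and a hyphen disqualifies it.

```python
# "import langflow.components.scikit-learn" is a SyntaxError, because
# the segment "scikit-learn" is not a valid identifier.
print("scikit-learn".isidentifier())  # → False
print("scikit_learn".isidentifier())  # → True
```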
```python
def split_dataframe_train(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TrainSplit)

def split_dataframe_test(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TestSplit)
```
Both `split_dataframe_train` and `split_dataframe_test` call `split_dataframe_base()`, so requesting both outputs recomputes the split twice. Caching the result (e.g., splitting only if `TrainSplit`/`TestSplit` are unset or inputs changed) would avoid extra work on large datasets and guarantee both outputs come from the same split operation.
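The caching idea can be sketched in plain Python: a seeded shuffle stands in for `train_test_split`, and a counter shows the split runs once even when both outputs are requested (`CachedSplitter` is illustrative, not the component's API):

```python
import random

class CachedSplitter:
    """Sketch of caching a train/test split so both outputs share one computation."""

    def __init__(self, rows, test_size=0.25, random_state=42):
        self.rows = rows
        self.test_size = test_size
        self.random_state = random_state
        self._splits = None
        self.split_calls = 0  # instrumentation only

    def _split_base(self):
        if self._splits is None:  # compute once, reuse for both outputs
            self.split_calls += 1
            shuffled = self.rows[:]
            random.Random(self.random_state).shuffle(shuffled)
            n_test = max(1, int(len(shuffled) * self.test_size))
            self._splits = (shuffled[n_test:], shuffled[:n_test])
        return self._splits

    def train(self):
        return self._split_base()[0]

    def test(self):
        return self._split_base()[1]

s = CachedSplitter(list(range(20)))
train, test = s.train(), s.test()
print(s.split_calls)                            # → 1
print(sorted(train + test) == list(range(20)))  # → True (same split, no row lost)
```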
```python
if not hasattr(self, "TrainedModel"):
    msg = "No trained model provided. Please connect a train-test split component."
    raise ValueError(msg)
if not isinstance(self.test_data, DataFrame):
```
`if not hasattr(self, "TrainedModel")` will never be true because `TrainedModel` is defined as a class attribute (initialized to `None`). If you want to validate training, check `self.TrainedModel is None` (and ideally avoid `hasattr` for this case).
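The behavior is easy to demonstrate with a minimal class:

```python
class Component:
    TrainedModel = None  # class attribute, visible on every instance

c = Component()
print(hasattr(c, "TrainedModel"))  # → True, so the hasattr guard never fires
print(c.TrainedModel is None)      # → True, the check that actually works
```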
```python
class SklearnModelComponent(Component):
    display_name = "Sklearn Model"
    description = "Train and use scikit-learn models"
    documentation = "https://scikit-learn.org/stable/supervised_learning.html"
    icon = "ScikitLearn"
    TrainedModel = None
    Predictions = None
    ModelPerformance = None
```
New scikit-learn components introduce non-trivial behavior (dataset loading, scaling, splitting, model training/prediction, report generation) but no tests were added. The repo has extensive component unit/integration tests under src/backend/tests/; adding a small unit test suite for these components would help prevent regressions (e.g., target-column handling, deterministic splits, and that predict() doesn't retrain).
```python
# Make predictions if test data is provided
# if self.TrainedModel is None:
#     self.TrainedModel = self.train_model()
self.test_x = self.test_data.drop(self.target_column, axis=1)
self.test_y = self.test_data[self.target_column]
predictions = self.TrainedModel.predict(self.test_x)
```
`predict()` assumes `self.target_column` exists in `test_data` (it calls `drop(self.target_column, ...)`), but it isn't validated the way it is for `train_data`. If the test set is missing the target column, this raises a pandas `KeyError`; consider adding an explicit check and a clearer error message.
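A guarded version of the feature/target split might look like this (assuming pandas; `split_features` is a hypothetical helper, not the component's method):

```python
import pandas as pd

def split_features(test_data: pd.DataFrame, target_column: str):
    """Drop the target only when present, so unlabeled frames still work."""
    if target_column in test_data.columns:
        return test_data.drop(target_column, axis=1), test_data[target_column]
    return test_data, None  # inference-only frame: all columns are features

labeled = pd.DataFrame({"a": [1, 2], "target": [0, 1]})
unlabeled = pd.DataFrame({"a": [1, 2]})

x1, y1 = split_features(labeled, "target")
x2, y2 = split_features(unlabeled, "target")  # no KeyError; y2 is None
```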
Walkthrough
This PR integrates scikit-learn machine learning capabilities into the Langflow framework. It adds scikit-learn as a project dependency and introduces five new components: dataset loading (supporting iris, digits, wine, breast_cancer, diabetes datasets), model training/prediction (supporting 8 classifier and regressor types), data preprocessing (train/test splitting and feature scaling with 4 scaler types), and model evaluation (classification reports). The frontend is updated with Scikit-Learn branding assets and icon integration. All components follow Langflow's architecture patterns with proper input validation, metadata, and output methods for seamless pipeline integration.
Changes
| File(s) | Summary |
|---|---|
| `pyproject.toml`, `uv.lock` | Added scikit-learn (>=1.6.1) as a new project dependency and updated lockfile with package metadata. Removed platform-specific wheels for NVIDIA packages (aarch64) and legacy PyPy27 distributions. |
| `src/backend/base/langflow/components/scikit-learn/__init__.py` | Created initialization file that imports and exports five scikit-learn component classes: SklearnDatasetComponent, ClassificationReportComponent, SklearnModelComponent, DataScalerComponent, and TrainTestSplitComponent. |
| `src/backend/base/langflow/components/scikit-learn/dataset.py` | Implemented SklearnDatasetComponent for loading five built-in datasets (iris, digits, wine, breast_cancer, diabetes) with outputs as DataFrame and Data object with metadata. |
| `src/backend/base/langflow/components/scikit-learn/model.py` | Added SklearnModelComponent supporting 8 model types (LogisticRegression, RandomForest, SVM, DecisionTree classifiers and regressors) with train_model() and predict() methods. |
| `src/backend/base/langflow/components/scikit-learn/scaler.py` | Implemented DataScalerComponent with 4 scaler types (StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler) that scales features while preserving the target column. Contains a duplicate fit_transform call on line 74. |
| `src/backend/base/langflow/components/scikit-learn/split.py` | Created TrainTestSplitComponent wrapping train_test_split with configurable test_size, random_state, and target_column parameters, returning split train/test datasets. |
| `src/backend/base/langflow/components/scikit-learn/evaluator.py` | Added ClassificationReportComponent that generates classification reports from true and predicted labels, outputting both dictionary and DataFrame formats. |
| `src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.jsx`, `src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.svg`, `src/frontend/src/icons/ScikitLearn/index.tsx` | Added Scikit-Learn logo assets as React component and SVG file, with a forwardRef wrapper for icon component integration. |
| `src/frontend/src/utils/styleUtils.ts` | Registered ScikitLearnIcon in SIDEBAR_BUNDLES and nodeIconsLucide mapping to enable Scikit-Learn components display in the UI sidebar and node visualizations. |
Sequence Diagram
This diagram shows the interactions between components:
```mermaid
sequenceDiagram
    participant Client
    participant Dataset as SklearnDatasetComponent
    participant Split as TrainTestSplitComponent
    participant Scaler as DataScalerComponent
    participant Model as SklearnModelComponent
    participant Evaluator as ClassificationReportComponent

    Note over Client,Evaluator: ML Pipeline Flow
    Client->>Dataset: load_dataset()
    activate Dataset
    Dataset-->>Client: raw_data
    deactivate Dataset
    Client->>Split: split(raw_data)
    activate Split
    Split-->>Client: X_train, X_test, y_train, y_test
    deactivate Split
    Client->>Scaler: fit(X_train)
    activate Scaler
    Scaler-->>Client: fitted_scaler
    deactivate Scaler
    Client->>Scaler: transform(X_train)
    Scaler-->>Client: X_train_scaled
    Client->>Scaler: transform(X_test)
    Scaler-->>Client: X_test_scaled
    Client->>Model: train(X_train_scaled, y_train)
    activate Model
    Model-->>Client: trained_model
    deactivate Model
    Client->>Model: predict(X_test_scaled)
    Model-->>Client: predictions
    Client->>Evaluator: generate_report(y_test, predictions)
    activate Evaluator
    Evaluator-->>Client: classification_metrics
    deactivate Evaluator
    Note over Client: Pipeline Complete
```
```python
def split_dataframe_base(self):
    if not hasattr(self, "dataset"):
        msg = "No dataset provided. Please connect a dataset component."
        raise ValueError(msg)
    # split the dataset into train and test
    if not isinstance(self.dataset, DataFrame):
        msg = "The dataset is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)

    if self.target_column not in self.dataset.columns:
        msg = f"Error: The target column '{self.target_column}' does not exist in the dataset."
        raise ValueError(msg)

    train_df, test_df = train_test_split(self.dataset, test_size=self.test_size, random_state=self.random_state)
    self.TrainSplit = train_df
    self.TestSplit = test_df

def split_dataframe_train(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TrainSplit)

def split_dataframe_test(self) -> DataFrame:
    self.split_dataframe_base()
    return DataFrame(self.TestSplit)
```
Correctness: The component introduces several logic issues:
- Metadata Loss: It fails to propagate 'text_key' and 'default_value' from the input dataset to the output DataFrames, which breaks downstream components (like embeddings) that rely on knowing which column contains the text.
- Unused Validation: It validates the existence of 'target_column' but never uses it in the split logic, causing unnecessary failures on datasets without that specific column name.
- Redundant Computation: `train_test_split` is executed twice if both outputs are connected, which is inefficient.
🤖 AI Agent Prompt for Cursor/Windsurf
📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue
In `src/backend/base/langflow/components/scikit-learn/split.py`, refactor the split methods to: 1. Cache the split result in `self.TrainSplit` to avoid redundant computation when multiple outputs are used. 2. Remove the unused `target_column` validation block (lines 58-60). 3. Preserve `text_key` and `default_value` metadata when creating the output `DataFrame` objects by passing them from `self.dataset` to the `DataFrame` constructor.
…e compatibility (langflow-ai#10008)

* fix: Restore Agent memory functionality by fixing inheritance and type compatibility

  This commit fixes two critical bugs that completely broke Agent memory in the main branch:

  **Bug #1: Inheritance Method Call Error**
  - Fixed incorrect method calls in Agent component inheritance
  - Changed `get_base_inputs()` to `_base_inputs` in:
    - src/lfx/src/lfx/components/agents/agent.py:157
    - src/lfx/src/lfx/base/agents/agent.py:229

  **Bug #2: Message Type Incompatibility**
  - Fixed type checking in the Agent base class to handle both Message types
  - Memory returns `langflow.schema.message.Message` but Agent expected `lfx.schema.message.Message`
  - Updated the type check to use duck typing instead of a strict isinstance check
  - Changed in src/lfx/src/lfx/base/agents/agent.py:148-150

  **Impact**
  - Agents can now remember conversation context across messages
  - Memory functionality restored to the same level as release-1.6.0
  - Fixes the issue where agents would forget user information immediately

  **Test Results**
  - Before: Agent says "I don't have access to your name or occupation"
  - After: Agent says "Your name is VICTORY TEST, and you work as a memory bug hunter"

  🤖 Generated with [Claude Code](https://claude.ai/code)

* [autofix.ci] apply automated fixes
* [autofix.ci] apply automated fixes (attempt 2/3)
* fix: Update data_to_messages function to accept both Data and Message types. Modifies `data_to_messages` to accept a list of both `Data` and `Message` types, enhancing type compatibility; the docstring is updated to reflect the new input and return types.
* fix: improve message validation in Agent. Updates chat history processing in the LCAgentComponent to include only messages with valid 'text' data, checking for the presence of 'text' in the message data before converting it. The base input retrieval method was also changed from `_base_inputs` back to `get_base_inputs()` for consistency and clarity.
* fix: enhance chat history validation to support Data type
* fix: improve input handling to support dynamic message conversion

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Gabriel Luiz Freitas Almeida <gabriel@langflow.org>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
Actionable comments posted: 9
🧹 Nitpick comments (1)
`src/backend/base/langflow/components/scikit-learn/model.py` (1)

**93-97**: Reuse the trained estimator instead of fitting again in `predict()`.

Line 94 retrains on every prediction request. That doubles the fit cost when both outputs are resolved and can overwrite previously trained state. Train lazily only when `self.TrainedModel` is still `None`.

Proposed fix:

```diff
 def predict(self) -> DataFrame:
-    self.train_model()
-    if not hasattr(self, "TrainedModel"):
+    if self.TrainedModel is None:
+        self.train_model()
+    if self.TrainedModel is None:
         msg = "No trained model provided. Please connect a train-test split component."
         raise ValueError(msg)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 93 - 97, The predict() method currently calls train_model() unconditionally and retrains on every prediction; change predict() to only call self.train_model() when self.TrainedModel is None (i.e., train lazily), and afterwards proceed to use self.TrainedModel for prediction or raise the existing ValueError if it's still not set. Update references in predict() to rely on the trained estimator attribute TrainedModel rather than refitting each call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 118: The langflow-base package is missing the scikit-learn runtime
dependency, causing imports under langflow/components/scikit-learn to fail for
standalone installs; add "scikit-learn>=1.6.1" to the langflow-base
pyproject.toml dependencies (the base package's pyproject.toml) so the
dependency is declared in the package manifest and will be installed with
langflow-base.
In `@src/backend/base/langflow/components/scikit-learn/evaluator.py`:
- Around line 15-16: The code currently caches report_data and only regenerates
it when report_data is None, which causes stale outputs if y_true, y_pred, or
output_dict change; fix it by removing the persistent cache or resetting it
whenever inputs change: either compute and return the classification report
fresh each call instead of using the module-level/instance variable report_data,
or ensure any setter or update path for y_true, y_pred, or output_dict clears
report_data (sets it to None) before the getters run; locate the report_data
variable and the getters that read it as well as code paths that assign y_true,
y_pred, and output_dict and apply one of these fixes to lines referenced around
the report_data usage (including the block at ~70-80).
- Around line 30-35: The UI BoolInput named "output_dict" is never used because
generate_report() hardcodes output_dict=True when calling
sklearn.metrics.classification_report, making the string branch unreachable;
update generate_report() to read the node input value (output_dict) and pass
that boolean into classification_report(...) instead of the hardcoded True, and
ensure downstream logic still handles both dict and string outputs (the existing
string-handling branch in generate_report remains applicable when output_dict is
False).
In `@src/backend/base/langflow/components/scikit-learn/model.py`:
- Around line 64-74: The train_model method currently requires self.test_data
and raises if it's missing which breaks training-only flows; remove or defer the
test_data isinstance check from train_model (keep the train_data existence/type
checks and the ValueError/TypeError for it) and move the test_data validation
into the predict method (or wherever model prediction is performed) so that the
Trained Model output can be produced without a wired test frame; update
predict() to raise a clear TypeError/ValueError if self.test_data is missing or
not a DataFrame before using it.
- Around line 98-105: The code currently assumes self.target_column exists in
self.test_data and always sets self.test_y, which breaks inference on unlabeled
feature frames; update the block in the method that handles self.test_data
(checking isinstance(self.test_data, DataFrame)) so that you set self.test_x by
dropping self.target_column only if self.target_column is present in
self.test_data.columns, otherwise set self.test_x to the entire DataFrame, and
only assign self.test_y if the target column exists; keep the TypeError for
non-DataFrame inputs and remove the unconditional read of self.test_y.
- Around line 45-50: The component exposes IntInput name="random_state" but the
estimator constructors are created without using it; update the estimator
instantiation code (where each estimator is constructed — e.g., the lines that
create LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier,
RandomForestRegressor, DecisionTreeRegressor) to pass
random_state=int(random_state) (or random_state=None when input is None/empty)
into their constructors so the user-specified reproducibility value is wired
into the created estimator; ensure you convert the IntInput value to an int and
only add the argument for estimators that accept random_state.
In `@src/backend/base/langflow/components/scikit-learn/scaler.py`:
- Around line 15-16: The component currently memoizes scaled_data and
scaler_instance on the instance (variables scaled_data and scaler_instance) so
subsequent executions can reuse stale state; modify the component to clear or
reinitialize these cached attributes at the start of each execution (e.g., in
the main execute/run method) or change the getters to recompute when inputs
change (scaler_type, target_column, or input dataset) — ensure you also clear
any other cached attributes referenced around lines 84-93 the same way so each
run recreates scaler_instance and recomputes scaled_data from the current
dataset/parameters.
- Around line 52-82: scale_data currently always fits a new scaler (using
SCALER_MAPPING and scaler_instance) which causes data leakage; change scale_data
to (1) accept an optional pre-fitted scaler_instance or a mode flag (e.g.,
fit=True/False) on the component, (2) if a fitted scaler_instance is provided or
fit=False call scaler_instance.transform(...) instead of fit_transform, and (3)
only call fit_transform when no fitted scaler is supplied and fit=True; ensure
you remove the duplicate fit_transform call and keep handling of target_column,
dataset_features, scaled_data, concat_data and scaled_data assignment intact so
transforms preserve original column order.
In `@src/backend/base/langflow/components/scikit-learn/split.py`:
- Around line 24-29: The SliderInput for the "test_size" parameter currently
allows min=0 which can pass 0 to train_test_split and cause a ValueError; update
the RangeSpec in the SliderInput (name="test_size") to use min=0.01 instead of
min=0 so the slider only permits values in (0.0,1.0), leaving the default value
(value=0.25) unchanged.
---
Nitpick comments:
In `@src/backend/base/langflow/components/scikit-learn/model.py`:
- Around line 93-97: The predict() method currently calls train_model()
unconditionally and retrains on every prediction; change predict() to only call
self.train_model() when self.TrainedModel is None (i.e., train lazily), and
afterwards proceed to use self.TrainedModel for prediction or raise the existing
ValueError if it's still not set. Update references in predict() to rely on the
trained estimator attribute TrainedModel rather than refitting each call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 178aa6b9-de4c-4adf-8362-ac0e3f5e0806
⛔ Files ignored due to path filters (2)

- `src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.svg` is excluded by `!**/*.svg`
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- `pyproject.toml`
- `src/backend/base/langflow/components/scikit-learn/__init__.py`
- `src/backend/base/langflow/components/scikit-learn/dataset.py`
- `src/backend/base/langflow/components/scikit-learn/evaluator.py`
- `src/backend/base/langflow/components/scikit-learn/model.py`
- `src/backend/base/langflow/components/scikit-learn/scaler.py`
- `src/backend/base/langflow/components/scikit-learn/split.py`
- `src/frontend/src/icons/ScikitLearn/ScikitLearnLogo.jsx`
- `src/frontend/src/icons/ScikitLearn/index.tsx`
- `src/frontend/src/utils/styleUtils.ts`
`pyproject.toml`

```toml
"apify-client>=1.8.1",
"pylint>=3.3.4",
"ruff>=0.9.7",
"scikit-learn>=1.6.1",
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -e
echo "== Workspace/package manifests =="
fd '^pyproject\.toml$'
echo
echo "== sklearn imports under src/backend/base =="
rg -n --glob '*.py' '\bfrom sklearn\b|\bimport sklearn\b' src/backend/base
echo
echo "== scikit-learn declarations across manifests =="
rg -n --glob 'pyproject.toml' 'name\s*=\s*"langflow-base"|name\s*=\s*"langflow"|scikit-learn' .
```

Repository: Qentora/langflow
Length of output: 1548

🏁 Script executed:

```shell
cat src/backend/base/pyproject.toml
```

Repository: Qentora/langflow
Length of output: 7294
Add scikit-learn>=1.6.1 to src/backend/base/pyproject.toml dependencies.
The sklearn imports in src/backend/base/langflow/components/scikit-learn/ are missing from that package's manifest. Since langflow-base is independently published, standalone installs will fail to import sklearn at runtime. The dependency must be declared in src/backend/base/pyproject.toml, not only in the root package.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 118, The langflow-base package is missing the
scikit-learn runtime dependency, causing imports under
langflow/components/scikit-learn to fail for standalone installs; add
"scikit-learn>=1.6.1" to the langflow-base pyproject.toml dependencies (the base
package's pyproject.toml) so the dependency is declared in the package manifest
and will be installed with langflow-base.
```python
report_data = None
```
**Don't cache `report_data` across executions.**

The getters only regenerate when `report_data` is `None`. If this component instance is reused, changing `y_true`, `y_pred`, or `output_dict` can still return the previous report.
Also applies to: 70-80
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/evaluator.py` around lines
15 - 16, The code currently caches report_data and only regenerates it when
report_data is None, which causes stale outputs if y_true, y_pred, or
output_dict change; fix it by removing the persistent cache or resetting it
whenever inputs change: either compute and return the classification report
fresh each call instead of using the module-level/instance variable report_data,
or ensure any setter or update path for y_true, y_pred, or output_dict clears
report_data (sets it to None) before the getters run; locate the report_data
variable and the getters that read it as well as code paths that assign y_true,
y_pred, and output_dict and apply one of these fixes to lines referenced around
the report_data usage (including the block at ~70-80).
```python
BoolInput(
    name="output_dict",
    display_name="Output as Dictionary",
    info="If True, return output as dict instead of string",
    value=True,
),
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n src/backend/base/langflow/components/scikit-learn/evaluator.py
```

Repository: Qentora/langflow
Length of output: 4112
🌐 Web query:
sklearn classification_report output_dict parameter
💡 Result:
In sklearn.metrics.classification_report, the parameter output_dict controls the return type:
output_dict=False(default): returns a formatted text report (string).output_dict=True: returns a Python dict (nested) with per-class and aggregate metrics. Each class label maps to{'precision', 'recall', 'f1-score', 'support'}; aggregate keys include things like'macro avg','weighted avg'(and sometimes'micro avg'/'samples avg'depending on the problem). Also, whenoutput_dict=True,digitsis ignored (values are not rounded). [1]
Example:

```python
from sklearn.metrics import classification_report

report = classification_report(y_true, y_pred, output_dict=True)
f1_class0 = report["0"]["f1-score"]
acc = report["accuracy"]
```

Caveat: the dict keys include fixed names like `'accuracy'`, so if you have a class label named "accuracy", it can collide and make the output "brittle." [2]
Sources: [1] scikit-learn docs for classification_report (sklearn.org), [2] scikit-learn GitHub issue discussing output_dict=True brittleness (github.com)
Citations:
- 1: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html
- 2: classification_report with output_dict=True leads to brittle output scikit-learn/scikit-learn#29205
**Wire the `output_dict` input into `classification_report()`.**

The UI exposes `output_dict` as a BoolInput, but `generate_report()` always passes `output_dict=True`. This makes the input dead: toggling it has no effect, and the string-handling branch (lines 81–83) is unreachable.
Proposed fix
```diff
 self.report_data = classification_report(
     y_true,
     y_pred,
-    output_dict=True
+    output_dict=self.output_dict,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/evaluator.py` around lines
30 - 35, The UI BoolInput named "output_dict" is never used because
generate_report() hardcodes output_dict=True when calling
sklearn.metrics.classification_report, making the string branch unreachable;
update generate_report() to read the node input value (output_dict) and pass
that boolean into classification_report(...) instead of the hardcoded True, and
ensure downstream logic still handles both dict and string outputs (the existing
string-handling branch in generate_report remains applicable when output_dict is
False).
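For completeness, here is a stdlib-only sketch of downstream handling that covers both return shapes once the toggle is wired through. The dict layout mirrors the docs quoted above; the function name and summary shape are illustrative, not the component's real API:

```python
def summarize_report(report):
    """Handle a classification report that may be a dict (output_dict=True)
    or a preformatted string (output_dict=False)."""
    if isinstance(report, dict):
        # Aggregate keys are fixed names; everything else is a class label.
        aggregate_keys = {"accuracy", "macro avg", "weighted avg", "micro avg", "samples avg"}
        classes = sorted(k for k in report if k not in aggregate_keys)
        return {"accuracy": report.get("accuracy"), "classes": classes}
    return report  # string form passes through unchanged


dict_report = {
    "0": {"precision": 1.0, "recall": 0.5, "f1-score": 0.67, "support": 2},
    "1": {"precision": 0.5, "recall": 1.0, "f1-score": 0.67, "support": 1},
    "accuracy": 0.67,
    "macro avg": {"precision": 0.75, "recall": 0.75, "f1-score": 0.67, "support": 3},
    "weighted avg": {"precision": 0.83, "recall": 0.67, "f1-score": 0.67, "support": 3},
}

summary = summarize_report(dict_report)
assert summary == {"accuracy": 0.67, "classes": ["0", "1"]}
assert summarize_report("precision  recall ...") == "precision  recall ..."
```

With this in place, the currently unreachable string branch becomes exercisable whenever the user toggles output_dict off.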
IntInput(
    name="random_state",
    display_name="Random State",
    value=42,
    info="Random state for reproducibility",
),
🧩 Analysis chain
🌐 Web query:
In scikit-learn 1.6.1, which of these estimators accept a random_state constructor argument: LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier, LinearRegression, RandomForestRegressor, SVR, DecisionTreeRegressor?
💡 Result:
In scikit-learn 1.6.1, these accept random_state in the constructor:
- Yes: LogisticRegression [1], RandomForestClassifier [2], SVC [3], DecisionTreeClassifier [4], RandomForestRegressor [6], DecisionTreeRegressor [8]
- No: LinearRegression [5], SVR [7]
Sources: [1] https://scikit-learn.org/1.6/modules/generated/sklearn.linear_model.LogisticRegression.html, [2] https://scikit-learn.org/1.6/modules/generated/sklearn.ensemble.RandomForestClassifier.html, [3] https://scikit-learn.org/1.6/modules/generated/sklearn.svm.SVC.html, [4] https://scikit-learn.org/1.6/modules/generated/sklearn.tree.DecisionTreeClassifier.html, [5] https://scikit-learn.org/1.6/modules/generated/sklearn.linear_model.LinearRegression.html, [6] https://scikit-learn.org/1.6/modules/generated/sklearn.ensemble.RandomForestRegressor.html, [7] https://scikit-learn.org/1.6/modules/generated/sklearn.svm.SVR.html, [8] https://scikit-learn.org/1.6/modules/generated/sklearn.tree.DecisionTreeRegressor.html
🏁 Script executed:
find . -type f -name "*.py" -path "*/scikit-learn/*" | head -20

Repository: Qentora/langflow
Length of output: 435
🏁 Script executed:
cat -n src/backend/base/langflow/components/scikit-learn/model.py | sed -n '40,95p'

Repository: Qentora/langflow
Length of output: 2552
🏁 Script executed:
rg "AVAILABLE_MODELS" src/backend/base/langflow/components/scikit-learn/ -A 20

Repository: Qentora/langflow
Length of output: 6042
Wire random_state into estimator construction.
The component exposes a reproducibility input on lines 45-50, but line 85 instantiates the model without it. Six of the eight available estimators (LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier, RandomForestRegressor, DecisionTreeRegressor) accept random_state, so users setting this input still receive non-reproducible models.
Proposed fix
- model = model_class()
+ model_kwargs = {}
+ if "random_state" in model_class().get_params():
+     model_kwargs["random_state"] = self.random_state
+ model = model_class(**model_kwargs)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 45 -
50, The component exposes IntInput name="random_state" but the estimator
constructors are created without using it; update the estimator instantiation
code (where each estimator is constructed — e.g., the lines that create
LogisticRegression, RandomForestClassifier, SVC, DecisionTreeClassifier,
RandomForestRegressor, DecisionTreeRegressor) to pass
random_state=int(random_state) (or random_state=None when input is None/empty)
into their constructors so the user-specified reproducibility value is wired
into the created estimator; ensure you convert the IntInput value to an int and
only add the argument for estimators that accept random_state.
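The conditional-kwargs pattern from the proposed fix can be sketched without sklearn installed. DummyForest and DummyLinear below are stand-ins for estimators with and without a random_state parameter, and the helper uses inspect.signature in place of sklearn's get_params():

```python
import inspect


def build_estimator(model_class, random_state=None, **kwargs):
    """Instantiate an estimator, passing random_state only when its
    constructor accepts it (LinearRegression and SVR, for example, do not)."""
    params = inspect.signature(model_class.__init__).parameters
    if random_state is not None and "random_state" in params:
        kwargs["random_state"] = random_state
    return model_class(**kwargs)


class DummyForest:  # stands in for RandomForestClassifier
    def __init__(self, n_estimators=100, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state


class DummyLinear:  # stands in for LinearRegression (no random_state)
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept


forest = build_estimator(DummyForest, random_state=42)
linear = build_estimator(DummyLinear, random_state=42)  # arg silently skipped
assert forest.random_state == 42
assert not hasattr(linear, "random_state")
```

Either introspection approach works; get_params() is the more idiomatic choice inside a component that already depends on sklearn.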
def train_model(self) -> BaseEstimator:
    if not hasattr(self, "train_data"):
        msg = "No training data provided. Please connect a train-test split component."
        raise ValueError(msg)

    if not isinstance(self.train_data, DataFrame):
        msg = "The training data is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)
    if not isinstance(self.test_data, DataFrame):
        msg = "The test data is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)
Don’t require test_data to build the model output.
Line 72 makes train_model() fail when no test frame is wired, even though the Trained Model output only needs train_data. In the current form, training-only flows will raise before fitting. Move test_data validation into predict().
Proposed fix
def train_model(self) -> BaseEstimator:
    if not hasattr(self, "train_data"):
        msg = "No training data provided. Please connect a train-test split component."
        raise ValueError(msg)
    if not isinstance(self.train_data, DataFrame):
        msg = "The training data is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)
-   if not isinstance(self.test_data, DataFrame):
-       msg = "The test data is not a DataFrame. Please connect a DataFrame component."
-       raise TypeError(msg)
    if self.target_column not in self.train_data.columns:
        msg = f"Error: The target column '{self.target_column}' does not exist in the training data."
        raise ValueError(msg)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 64 -
74, The train_model method currently requires self.test_data and raises if it's
missing which breaks training-only flows; remove or defer the test_data
isinstance check from train_model (keep the train_data existence/type checks and
the ValueError/TypeError for it) and move the test_data validation into the
predict method (or wherever model prediction is performed) so that the Trained
Model output can be produced without a wired test frame; update predict() to
raise a clear TypeError/ValueError if self.test_data is missing or not a
DataFrame before using it.
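A minimal sketch of the suggested split, with stand-in names rather than the component's real API: train() validates only training data, and the test-frame check moves to predict(), where the data is actually used:

```python
class ModelComponent:
    """Illustrative stand-in: training-only flows succeed without a test frame."""

    def __init__(self, train_data=None, test_data=None):
        self.train_data = train_data
        self.test_data = test_data
        self.model = None

    def train(self):
        if self.train_data is None:
            raise ValueError("No training data provided.")
        # Placeholder "fit": record how many rows were trained on.
        self.model = {"fitted_on": len(self.train_data)}
        return self.model

    def predict(self):
        if self.model is None:
            self.train()
        # Test data is validated here, not in train().
        if self.test_data is None:
            raise ValueError("No test data provided; connect a test frame before predicting.")
        return [self.model["fitted_on"]] * len(self.test_data)


# Training-only flow now succeeds without a wired test frame:
trained = ModelComponent(train_data=[1, 2, 3]).train()
assert trained["fitted_on"] == 3
```

The same late-validation principle applies to any output that only consumes a subset of the component's inputs.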
    if not isinstance(self.test_data, DataFrame):
        msg = "The test data is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)
    # Make predictions if test data is provided
    # if self.TrainedModel is None:
    #     self.TrainedModel = self.train_model()
    self.test_x = self.test_data.drop(self.target_column, axis=1)
    self.test_y = self.test_data[self.target_column]
Allow prediction on unlabeled feature frames.
Lines 104-105 require target_column to be present in test_data, but this method never uses self.test_y. That breaks the common inference case where the input only contains features. Drop the target column only if it exists, and remove the unused label read.
Proposed fix
    if not isinstance(self.test_data, DataFrame):
        msg = "The test data is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)
    # Make predictions if test data is provided
    # if self.TrainedModel is None:
    #     self.TrainedModel = self.train_model()
-   self.test_x = self.test_data.drop(self.target_column, axis=1)
-   self.test_y = self.test_data[self.target_column]
+   self.test_x = self.test_data.drop(columns=[self.target_column], errors="ignore")
    predictions = self.TrainedModel.predict(self.test_x)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    if not isinstance(self.test_data, DataFrame):
        msg = "The test data is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)
    # Make predictions if test data is provided
    # if self.TrainedModel is None:
    #     self.TrainedModel = self.train_model()
    self.test_x = self.test_data.drop(columns=[self.target_column], errors="ignore")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/model.py` around lines 98 -
105, The code currently assumes self.target_column exists in self.test_data and
always sets self.test_y, which breaks inference on unlabeled feature frames;
update the block in the method that handles self.test_data (checking
isinstance(self.test_data, DataFrame)) so that you set self.test_x by dropping
self.target_column only if self.target_column is present in
self.test_data.columns, otherwise set self.test_x to the entire DataFrame, and
only assign self.test_y if the target column exists; keep the TypeError for
non-DataFrame inputs and remove the unconditional read of self.test_y.
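The errors="ignore" behavior can be illustrated without pandas by treating rows as plain dicts (the helper name is hypothetical): dropping the target column is a no-op when the column is absent, so labeled and unlabeled frames both work.

```python
def feature_frame(rows, target_column):
    """Drop target_column from each row only when present, mirroring
    pandas' df.drop(columns=[...], errors="ignore")."""
    return [{k: v for k, v in row.items() if k != target_column} for row in rows]


labeled = [{"x": 1.0, "y": 2.0, "label": 0}, {"x": 3.0, "y": 4.0, "label": 1}]
unlabeled = [{"x": 1.0, "y": 2.0}]

assert feature_frame(labeled, "label") == [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
assert feature_frame(unlabeled, "label") == unlabeled  # no-op when label absent
```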
    scaled_data = None
    scaler_instance = None
Reset cached scaler state between executions.
scaled_data and scaler_instance are memoized on the component instance, and the getters only recompute once. If this component is reused with a different dataset, scaler_type, or target_column, it can return the previous run's frame/scaler.
Also applies to: 84-93
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/scaler.py` around lines 15
- 16, The component currently memoizes scaled_data and scaler_instance on the
instance (variables scaled_data and scaler_instance) so subsequent executions
can reuse stale state; modify the component to clear or reinitialize these
cached attributes at the start of each execution (e.g., in the main execute/run
method) or change the getters to recompute when inputs change (scaler_type,
target_column, or input dataset) — ensure you also clear any other cached
attributes referenced around lines 84-93 the same way so each run recreates
scaler_instance and recomputes scaled_data from the current dataset/parameters.
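One way to avoid stale results is to key the cache on the inputs, as in this sketch (names are illustrative; the component's real getters would need the same guard): the cached value is reused only while the dataset and parameters are unchanged.

```python
class ScalerComponent:
    """Input-keyed memoization: a re-run with new inputs recomputes
    instead of returning the previous run's result."""

    def __init__(self):
        self._cache_key = None
        self._scaled = None

    def scale(self, dataset, scaler_type, target_column):
        key = (tuple(dataset), scaler_type, target_column)
        if key != self._cache_key:
            self._cache_key = key
            # Placeholder for the real fit/transform work.
            self._scaled = [x * 2 for x in dataset]
        return self._scaled


comp = ScalerComponent()
first = comp.scale([1, 2], "standard", "y")
second = comp.scale([1, 2], "standard", "y")
assert first is second          # same inputs: cached object reused
third = comp.scale([3, 4], "standard", "y")
assert third == [6, 8]          # new dataset: recomputed
```

Clearing the cached attributes at the start of each execution achieves the same correctness with less bookkeeping, at the cost of losing reuse across identical runs.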
def scale_data(self):
    if not hasattr(self, "dataset"):
        msg = "No dataset provided. Please connect a dataset component."
        raise ValueError(msg)

    if not isinstance(self.dataset, DataFrame):
        msg = "The dataset is not a DataFrame. Please connect a DataFrame component."
        raise TypeError(msg)

    # Create the appropriate scaler instance
    scaler_class = self.SCALER_MAPPING.get(self.scaler_type)
    if scaler_class is None:
        msg = f"Invalid scaler type: {self.scaler_type}"
        raise ValueError(msg)

    # Initialize and fit the scaler
    if self.target_column in self.dataset.columns:
        self.dataset_features = self.dataset.drop(self.target_column, axis=1)
    else:
        msg = "Target column not found in dataset"
        raise ValueError(msg)
    self.scaler_instance = scaler_class()
    scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
    scaled_data = self.scaler_instance.fit_transform(self.dataset_features)
    # Add the target column back to the scaled data
    scaled_data = pd.DataFrame(scaled_data, columns=self.dataset_features.columns)
    concat_data = pd.concat([scaled_data, self.dataset[self.target_column]], axis=1)

    # Convert to DataFrame with original column names
    scale_df = pd.DataFrame(concat_data, columns=self.dataset.columns)
    self.scaled_data = DataFrame(scale_df)
This API bakes data leakage into train/test workflows.
scale_data() always calls fit_transform() on the provided dataset, but the component has no way to accept an already-fitted scaler and run transform() on a held-out/test dataset. With TrainTestSplitComponent in the same feature set, the only supported flows are either scaling before the split (leakage) or fitting separate scalers per split (invalid metrics).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/scaler.py` around lines 52
- 82, scale_data currently always fits a new scaler (using SCALER_MAPPING and
scaler_instance) which causes data leakage; change scale_data to (1) accept an
optional pre-fitted scaler_instance or a mode flag (e.g., fit=True/False) on the
component, (2) if a fitted scaler_instance is provided or fit=False call
scaler_instance.transform(...) instead of fit_transform, and (3) only call
fit_transform when no fitted scaler is supplied and fit=True; ensure you remove
the duplicate fit_transform call and keep handling of target_column,
dataset_features, scaled_data, concat_data and scaled_data assignment intact so
transforms preserve original column order.
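The leakage-free pattern the component should support (fit on the training split, then transform both splits with the training statistics) in a tiny stdlib stand-in for sklearn's scalers:

```python
class MinMaxScalerSketch:
    """Minimal min-max scaler with the fit/transform split that avoids
    leakage: statistics come from the training split only."""

    def fit(self, values):
        self.min_ = min(values)
        self.range_ = (max(values) - self.min_) or 1.0  # guard zero range
        return self

    def transform(self, values):
        return [(v - self.min_) / self.range_ for v in values]


train, test = [0.0, 5.0, 10.0], [2.5, 12.0]
scaler = MinMaxScalerSketch().fit(train)      # statistics from train only
assert scaler.transform(train) == [0.0, 0.5, 1.0]
assert scaler.transform(test) == [0.25, 1.2]  # test values may fall outside [0, 1]
```

In component terms, this suggests exposing the fitted scaler as an output (the component already has a scaler_instance) and accepting one as an optional input, so a second scaler node can run transform() on the test split.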
SliderInput(
    name="test_size",
    display_name="Test Size",
    value=0.25,
    range_spec=RangeSpec(min=0, max=0.9, step=0.01),
),
🧩 Analysis chain
🏁 Script executed:
cat -n src/backend/base/langflow/components/scikit-learn/split.py | head -40

Repository: Qentora/langflow
Length of output: 1737
🌐 Web query:
sklearn train_test_split test_size=0 parameter validation constraints
💡 Result:
In sklearn.model_selection.train_test_split, test_size is validated so the split is non-empty and the value is in-range:
- If test_size is a float: it must be in the (0.0, 1.0) proportion range (docs say "between 0.0 and 1.0" and the implementation enforces it cannot be 0). Passing test_size=0.0 triggers a ValueError. [1]
- If test_size is an int: it must be >= 1 and < n_samples (otherwise you'd get an empty train or test set). The common error for test_size=0 is: "test_size=0 should be either positive and smaller than the number of samples … or a float in the (0,1) range". [2]
- If both train_size and test_size are given: they must be compatible (for floats, their sum must be <= 1.0; for ints, <= n_samples), and the resulting train/test sizes must each be at least 1 sample (else you'll get the "resulting train set will be empty" style error). [3]
- Defaults: if both are None, test_size defaults to 0.25. [1]
Sources: scikit-learn train_test_split docs [1], Stack Overflow error example for test_size=0 [2], scikit-learn error behavior example referencing _validate_shuffle_split (“resulting train set will be empty”) [3].
Change min=0 to min=0.01 in the slider's range specification.
train_test_split() requires test_size to be strictly in the range (0.0, 1.0). The current slider allows test_size=0, which will fail at runtime with a ValueError.
Proposed fix
SliderInput(
    name="test_size",
    display_name="Test Size",
    value=0.25,
-   range_spec=RangeSpec(min=0, max=0.9, step=0.01),
+   range_spec=RangeSpec(min=0.01, max=0.9, step=0.01),
),

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/base/langflow/components/scikit-learn/split.py` around lines 24 -
29, The SliderInput for the "test_size" parameter currently allows min=0 which
can pass 0 to train_test_split and cause a ValueError; update the RangeSpec in
the SliderInput (name="test_size") to use min=0.01 instead of min=0 so the
slider only permits values in (0.0,1.0), leaving the default value (value=0.25)
unchanged.
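A stdlib sketch of the constraint (not sklearn's actual validation code) shows why the slider's current minimum fails at runtime:

```python
def validate_test_size(test_size, n_samples):
    """Replicates the test_size constraints described above: floats must
    lie strictly in (0, 1); ints must satisfy 1 <= test_size < n_samples."""
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError(f"test_size={test_size} must be in the (0, 1) range")
    elif isinstance(test_size, int):
        if not 1 <= test_size < n_samples:
            raise ValueError(
                f"test_size={test_size} must be positive and smaller than n_samples={n_samples}"
            )
    return test_size


assert validate_test_size(0.25, 100) == 0.25
try:
    validate_test_size(0.0, 100)  # the value the slider currently allows
except ValueError as err:
    caught = str(err)
assert "must be in the (0, 1) range" in caught
```

Raising the slider minimum to 0.01 keeps every selectable value inside this accepted range.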
Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comments.

Thanks for integrating Codecov - We've got you covered ☂️
Summary by CodeRabbit
Release Notes
New Features
Chores
EntelligenceAI PR Summary
This PR adds comprehensive scikit-learn integration to Langflow with five new ML components and frontend branding support.
Confidence Score: 2/5 - Changes Needed
Files requiring special attention
- src/backend/base/langflow/components/scikit-learn/__init__.py
- src/backend/base/langflow/components/scikit-learn/model.py