Update computation logic for pod count#1862
Merged
mbvreddy merged 7 commits intokruize:pod-count-fixfrom Apr 6, 2026
Merged
Conversation
Signed-off-by: Bhaktavatsal Reddy <bhamaram@in.ibm.com>
Contributor
Reviewer's GuideAdds per-term pod_count metrics_info aggregation to container recommendations, refactors recommendation processing to use pre-filtered metric windows and metric-based minimum data checks, and removes explicit pod count handling from the core engine and namespace recommendations while introducing a custom serializer to emit integer metric aggregations. Sequence diagram for container term recommendation with pod_count metrics_infosequenceDiagram
actor User
participant API as UpdateRecommendations
participant Engine as RecommendationEngine
participant ContProc as ContainerRecommendationProcessor
participant Terms
participant Model as RecommendationModel
participant TermRec as TermRecommendations
participant Gson as Gson
participant Serializer as MetricAggregationInfoResultsIntSerializer
User->>API: trigger recommendation update
API->>Engine: generateRecommendations(kruizeObject)
loop for each container and term
Engine->>ContProc: generateRecommendationsBasedOnTerms(containerData, kruizeObject, termsMap)
ContProc->>ContProc: compute monitoringStartTime, monitoringEndTime
ContProc->>ContProc: build filteredResultsMap from containerData.results
ContProc->>Terms: checkIfMinDataAvailableForTerm(filteredResultsMap, term, measurementDuration)
Terms-->>ContProc: boolean hasMinData
alt not enough data
ContProc->>TermRec: new TermRecommendations
ContProc->>TermRec: addNotification(INFO_NOT_ENOUGH_DATA)
else enough data
ContProc->>TermRec: new TermRecommendations
ContProc->>ContProc: getPodCountAggrInfo(filteredResultsMap)
ContProc-->>ContProc: MetricAggregationInfoResults podCountAggrInfo
ContProc->>TermRec: addMetricsInfo(POD_COUNT, podCountAggrInfo)
loop for each RecommendationModel
Engine->>Engine: getModels()
Engine-->>ContProc: model list
ContProc->>Model: getCPURequestRecommendation(filteredResultsMap, notifications)
ContProc->>Model: getMemoryRequestRecommendation(filteredResultsMap, notifications)
ContProc->>Model: getAcceleratorRequestRecommendation(filteredResultsMap, notifications)
ContProc->>Engine: populateRecommendation(termEntry, mappedRecommendationForModel, notifications, internalMapToPopulate, cpuThreshold, memoryThreshold, acceleratorMap, runtimeRecommList)
Engine-->>ContProc: boolean success
end
end
end
API->>Gson: toJson(recommendationList)
Gson->>Serializer: serialize(MetricAggregationInfoResults)
Serializer-->>Gson: Json with avg, min, max as int
Gson-->>API: JSON payload with metrics_info.pod_count
API-->>User: HTTP response with updated recommendations
Class diagram for updated recommendation processing and pod_count metrics_infoclassDiagram
class ContainerRecommendationProcessor {
- RecommendationEngineService engineService
+ boolean generateRecommendationsBasedOnTerms(ContainerData containerData, KruizeObject kruizeObject, Map~String, Terms~ termsMap)
- MappedRecommendationForModel generateRecommendationBasedOnModel(RecommendationModel model, ContainerData containerData, Map~Timestamp, IntervalResults~ filteredResultsMap, KruizeObject kruizeObject, HashMap~AnalyzerConstants_ResourceSetting, HashMap~AnalyzerConstants_RecommendationItem, RecommendationConfigItem~~ currentConfigMap, Map_Entry~String, Terms~ termEntry)
- static MetricAggregationInfoResults getPodCountAggrInfo(Map~Timestamp, IntervalResults~ filteredResultsMap)
}
class NamespaceRecommendationProcessor {
- RecommendationEngineService engineService
+ boolean generateNamespaceRecommendationsBasedOnTerms(NamespaceData namespaceData, KruizeObject kruizeObject, Map~String, Terms~ termsMap)
- MappedRecommendationForModel generateNamespaceRecommendationBasedOnModel(RecommendationModel model, Map~Timestamp, IntervalResults~ filteredResultsMap, RecommendationSettings recommendationSettings, HashMap~AnalyzerConstants_ResourceSetting, HashMap~AnalyzerConstants_RecommendationItem, RecommendationConfigItem~~ currentNamespaceConfigMap, Map_Entry~String, Terms~ termEntry)
}
class Terms {
+ int days
+ double threshold_in_days
+ static int getMaxDays(Map~String, Terms~ terms)
+ static boolean checkIfMinDataAvailableForTerm(Map~Timestamp, IntervalResults~ metricsData, Terms term, double measurementDuration)
+ static boolean checkIfMinDataAvailableForTerm(ContainerData containerData, Terms term, Timestamp monitoringEndTime, double measurementDuration)
+ static boolean checkIfMinDataAvailableForTermForNamespace(Map~Timestamp, IntervalResults~ metricsData, Terms term, double measurementDuration)
+ static boolean checkIfMinDataAvailableForTermForNamespace(NamespaceData namespaceData, Terms term, Timestamp monitoringEndTime, double measurementDuration)
}
class TermRecommendations {
- HashMap~Integer, RecommendationNotification~ termLevelNotificationMap
- HashMap~String, MetricAggregationInfoResults~ metricsInfo
- Timestamp monitoringStartTime
- Timestamp monitoringEndTime
+ void addNotification(RecommendationNotification notification)
+ void addMetricsInfo(String metricName, MetricAggregationInfoResults metricAggregationInfoResults)
+ void setMonitoringStartTime(Timestamp monitoringStartTime)
+ void setMonitoringEndTime(Timestamp monitoringEndTime)
}
class RecommendationEngineService {
<<interface>>
+ boolean populateRecommendation(Map_Entry~String, Terms~ termEntry, MappedRecommendationForModel recommendationModel, ArrayList~RecommendationNotification~ notifications, HashMap~String, RecommendationConfigItem~ internalMapToPopulate, double cpuThreshold, double memoryThreshold, Map~AnalyzerConstants_RecommendationItem, RecommendationConfigItem~ recommendationAcceleratorRequestMap, List~RecommendationConfigEnv~ runtimeRecommList)
+ List~RecommendationModel~ getModels()
}
class RecommendationEngine {
+ boolean populateRecommendation(Map_Entry~String, Terms~ termEntry, MappedRecommendationForModel recommendationModel, ArrayList~RecommendationNotification~ notifications, HashMap~String, RecommendationConfigItem~ internalMapToPopulate, double cpuThreshold, double memoryThreshold, Map~AnalyzerConstants_RecommendationItem, RecommendationConfigItem~ recommendationAcceleratorRequestMap, List~RecommendationConfigEnv~ runtimeRecommList)
}
class MetricAggregationInfoResults {
+ Double avg
+ Double sum
+ Double min
+ Double max
+ Long count
+ Double median
+ Double mode
+ Double range
+ String format
+ Double getAvg()
+ Double getSum()
+ Double getMin()
+ Double getMax()
+ Long getCount()
+ Double getMedian()
+ Double getMode()
+ Double getRange()
+ String getFormat()
+ void setAvg(Double avg)
+ void setMin(Double min)
+ void setMax(Double max)
}
class MetricAggregationInfoResultsIntSerializer {
<<JsonSerializer~MetricAggregationInfoResults~~>>
+ JsonElement serialize(MetricAggregationInfoResults metricAggregationInfoResults, Type type, JsonSerializationContext jsonSerializationContext)
}
class UpdateRecommendations {
+ boolean shouldSkipClass(Class clazz)
+ void someMethodBuildingGson()
}
class KruizeConstants_JSONKeys {
<<static>>
+ String METRICS_INFO
+ String POD_COUNT
+ String PODS_COUNT
}
class AnalyzerConstants_MetricName {
<<enum>>
cpuUsage
memoryUsage
podCount
namespaceTotalPods
jvmInfo
jvmInfoTotal
}
class KruizeObject {
+ static double getTermThresholdInDays(String term, Double measurement_duration)
}
ContainerRecommendationProcessor --> RecommendationEngineService : uses
NamespaceRecommendationProcessor --> RecommendationEngineService : uses
RecommendationEngineService <|.. RecommendationEngine
ContainerRecommendationProcessor --> Terms : uses
NamespaceRecommendationProcessor --> Terms : uses
ContainerRecommendationProcessor --> TermRecommendations : creates
NamespaceRecommendationProcessor --> TermRecommendations : creates
TermRecommendations --> MetricAggregationInfoResults : contains
ContainerRecommendationProcessor --> MetricAggregationInfoResults : computes
ContainerRecommendationProcessor --> AnalyzerConstants_MetricName : uses
NamespaceRecommendationProcessor --> AnalyzerConstants_MetricName : uses
UpdateRecommendations --> MetricAggregationInfoResultsIntSerializer : registers
UpdateRecommendations --> MetricAggregationInfoResults : serializes
MetricAggregationInfoResultsIntSerializer --> MetricAggregationInfoResults : serializes
KruizeObject --> KruizeConstants_JSONKeys : uses
TermRecommendations --> KruizeConstants_JSONKeys : uses
ContainerRecommendationProcessor --> KruizeConstants_JSONKeys : uses
Terms --> KruizeConstants_JSONKeys : uses
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The new filtering for
filteredResultsMapusesafter(monitoringStartTime)andbefore(monitoringEndTime), whereas the previous logic used>=/<=; if you still intend to include samples exactly at the boundaries, you should switch to!before(start)/!after(end)or usecompareToas before. - Removing
numPodsfrompopulateRecommendationand the associated validation while introducingmetrics_info.pod_countchanges where pod count is enforced and surfaced; consider either wiring the new pod count aggregation back intoMappedRecommendationForModel.setPodsCountor clearly consolidating on one representation to avoidpods_countsilently remaining unset.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new filtering for `filteredResultsMap` uses `after(monitoringStartTime)` and `before(monitoringEndTime)`, whereas the previous logic used `>=` / `<=`; if you still intend to include samples exactly at the boundaries, you should switch to `!before(start)` / `!after(end)` or use `compareTo` as before.
- Removing `numPods` from `populateRecommendation` and the associated validation while introducing `metrics_info.pod_count` changes where pod count is enforced and surfaced; consider either wiring the new pod count aggregation back into `MappedRecommendationForModel.setPodsCount` or clearly consolidating on one representation to avoid `pods_count` silently remaining unset.
## Individual Comments
### Comment 1
<location path="src/main/java/com/autotune/analyzer/recommendations/engine/ContainerRecommendationProcessor.java" line_range="336-345" />
<code_context>
+ // 2. Calculate from 'cpuUsage' datapoints using formulae avg of 'sum/avg', min of 'sum/avg', max of 'sum/avg'
</code_context>
<issue_to_address>
**issue (bug_risk):** Division by zero / Infinity risk when computing pod counts from cpuUsage metrics.
In the `cpuUsageMetrics` branch, `aggInfo.getAvg()` is only checked for non-null, so `aggInfo.getSum() / aggInfo.getAvg()` can divide by zero or near-zero values, causing exceptions or `Infinity`/`NaN` that propagate into `avg`, `min`, and `max`. The same risk exists in the memory branch. Please guard against `aggInfo.getAvg() == 0` (or effectively zero) before dividing, e.g., by filtering those entries or routing them to the error path instead of including them in the aggregates.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Bhaktavatsal Reddy <bhamaram@in.ibm.com>
Signed-off-by: Bhaktavatsal Reddy <bhamaram@in.ibm.com>
Signed-off-by: Bhaktavatsal Reddy <bhamaram@in.ibm.com>
Author
|
@sourcery-ai review |
khansaad
requested changes
Apr 6, 2026
…ce and containers Signed-off-by: Bhaktavatsal Reddy <bhamaram@in.ibm.com>
Signed-off-by: Bhaktavatsal Reddy <bhamaram@in.ibm.com>
Signed-off-by: Bhaktavatsal Reddy <bhamaram@in.ibm.com>
Author
|
@sourcery-ai review |
Contributor
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- Registering MetricAggregationInfoResultsIntSerializer globally in UpdateRecommendations means all MetricAggregationInfoResults in the response are coerced to ints; if only pod_count needs integer output, consider scoping the serializer to that field or using a dedicated DTO to avoid truncating other metric aggregations.
- getPodCountAggrInfo logs full metric contents at INFO level and returns null on failure; consider downgrading these logs to DEBUG and returning a non-null MetricAggregationInfoResults with zeroed values so callers don’t need to handle null and logs stay less noisy in normal operation.
- The new Terms.checkIfMinDataAvailableForTerm(Map...) and checkIfMinDataAvailableForTermForNamespace(Map...) duplicate logic with the existing container/namespace versions; it would be cleaner to centralize the common datapoint-count calculation in a single helper to ensure future threshold changes stay consistent across all callers.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Registering MetricAggregationInfoResultsIntSerializer globally in UpdateRecommendations means *all* MetricAggregationInfoResults in the response are coerced to ints; if only pod_count needs integer output, consider scoping the serializer to that field or using a dedicated DTO to avoid truncating other metric aggregations.
- getPodCountAggrInfo logs full metric contents at INFO level and returns null on failure; consider downgrading these logs to DEBUG and returning a non-null MetricAggregationInfoResults with zeroed values so callers don’t need to handle null and logs stay less noisy in normal operation.
- The new Terms.checkIfMinDataAvailableForTerm(Map...) and checkIfMinDataAvailableForTermForNamespace(Map...) duplicate logic with the existing container/namespace versions; it would be cleaner to centralize the common datapoint-count calculation in a single helper to ensure future threshold changes stay consistent across all callers.
## Individual Comments
### Comment 1
<location path="src/main/java/com/autotune/analyzer/recommendations/engine/ContainerRecommendationProcessor.java" line_range="313-322" />
<code_context>
+ private static MetricAggregationInfoResults getPodCountAggrInfo(Map<Timestamp, IntervalResults> filteredResultsMap) {
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Return-null behavior from getPodCountAggrInfo can cause confusing logs and makes callers responsible for dealing with failure.
Currently this returns `null` and logs an error when no pod count can be derived, while callers log `podCountAggrInfo` (possibly `null`) and `addMetricsInfo` silently ignores it. This yields an error log with no clear API-level signal that pod-count data is unavailable and increases the risk of future NPEs if a caller forgets the null-check. Prefer returning a `MetricAggregationInfoResults` instance with explicit sentinel values or a flag to indicate “no data,” and consider downgrading the log level if this is an expected condition.
Suggested implementation:
```java
*
* @param filteredResultsMap
* @return aggregated results like min, max, avg of pods from the results supplied. Never {@code null}.
* Use {@link MetricAggregationInfoResults#hasData()} to detect when pod-count data could not be derived.
*/
private static MetricAggregationInfoResults getPodCountAggrInfo(Map<Timestamp, IntervalResults> filteredResultsMap) {
MetricAggregationInfoResults metricAggregationInfoResults = new MetricAggregationInfoResults();
// Default to "no data"; set to true only when we successfully derive pod-count metrics.
metricAggregationInfoResults.setHasData(false);
Double avg = 0.0, min = 0.0, max = 0.0;
LOGGER.info("filteredResultsMap: size = {}", filteredResultsMap.size());
```
To fully implement the behavior you described, the following additional changes are needed elsewhere in the codebase:
1. **Update `MetricAggregationInfoResults` to carry an explicit “has data” flag:**
- In the `MetricAggregationInfoResults` class, add:
- A private boolean field, e.g. `private boolean hasData;`
- Public getter/setter: `public boolean hasData()` and `public void setHasData(boolean hasData)`
- Initialize `hasData` to `false` by default (either in the field declaration or in constructors).
- Ensure `MetricAggregationInfoResults` instances created in other places set `hasData(true)` when they contain valid aggregation data.
2. **Mark successful pod-count aggregation in `getPodCountAggrInfo`:**
- In the body of `getPodCountAggrInfo`, wherever you currently:
- Successfully compute `avg`, `min`, `max` for pod count and populate `metricAggregationInfoResults`,
- Add `metricAggregationInfoResults.setHasData(true);` right after setting the computed values.
3. **Remove null-return behavior and adjust logging in `getPodCountAggrInfo`:**
- Locate any branches in `getPodCountAggrInfo` that currently:
- Log an error (likely `LOGGER.error(...)`) because no pod count could be derived and then `return null;`.
- Replace those with:
- A downgraded log level (`LOGGER.info` or `LOGGER.debug`) with a message like: `"No pod-count aggregation data could be derived; returning MetricAggregationInfoResults with hasData=false"`.
- `return metricAggregationInfoResults;` (which will have `hasData` already set to `false`).
- After these changes, **ensure the method never returns `null`**.
4. **Update callers of `getPodCountAggrInfo`:**
- Replace any null checks like:
- `if (podCountAggrInfo != null) { ... }`
- With explicit flag checks:
- `if (podCountAggrInfo.hasData()) { ... } else { /* handle "no data" case if needed */ }`
- In `addMetricsInfo` (and any similar methods), instead of silently ignoring a `null` result, they should:
- Always accept a non-null `MetricAggregationInfoResults`.
- Optionally log or branch behavior based on `hasData()` if they need to distinguish “no data” from “zero-valued data.”
5. **Align logging expectations:**
- Since “no pod-count data” is expected in some situations, ensure that:
- There are no remaining `LOGGER.error` calls solely for the absence of pod-count data.
- Such conditions, if logged, use `info` or `debug` and clearly state that the API will return a non-null object with `hasData=false`.
These additional changes will remove the confusing null-return behavior, provide a clear API-level signal when pod-count data is unavailable, and reduce the risk of future NPEs from missing null checks.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR contains changes to include following section for each recommendation term.
At high level, following changes are made part of this PR
podCountto Enum MetricNamecheckIfMinDataAvailableForTermto check min data-points in efficient way. Our current implementation assumes 15 mins measurement duration would never change in ROS. So, new method also is inline with that assumption.com.autotune.analyzer.recommendations.objects.TermRecommendationsto support nee sectionmetrics_info.com.autotune.analyzer.recommendations.engine.NamespaceRecommendationProcessoris updated to remove computation of no.of pods as it is not applicable for namespace recommendations as confirmed in previous comments.com.autotune.analyzer.recommendations.engine.ContainerRecommendationProcessoris updated to compute avg, min and max pod_count.MetricAggregationInfoResultsto hold the pod count. Note that this model define aggregations as Double. So, in response, we see “2.0” instead of “2”. If this has to be sent as “2”, we need to create new model object that holds aggregations as int instead of double. - Fixed by using custom serializer.com.autotune.analyzer.recommendations.engine.ContainerRecommendationProcessor#generateRecommendationBasedOnModelmethod signature is modified to remove unused parametersmonitoringStartTimeandmonitoringEndTimeFixes # (issue)
Type of change
How has this been tested?
Please describe the tests that were run to verify your changes and steps to reproduce. Please specify any test configuration required.
Test Configuration
Checklist 🎯
Additional information
Include any additional information such as links, test results, screenshots here
Summary by Sourcery
Add per-term metrics metadata for pod count and refactor recommendation generation to use pre-filtered metric windows and metric-based data-availability checks.
New Features:
Enhancements:
Summary by Sourcery
Add per-term metrics metadata for pod count and refactor recommendation processing to use pre-filtered metric windows and metric-based data availability checks.
New Features:
Enhancements: