# Pull Request: Deep Finance Judge System Enhancement#10

Merged
TaoShuchang merged 65 commits into main from dev/shuchang_newjudge
Feb 24, 2026

Conversation

@TaoShuchang
Collaborator

Summary

This PR introduces a comprehensive upgrade to the Deep Finance evaluation system, adding multiple new grading modules and enhancing the existing infrastructure for better citation compliance, traceability, and quality assessment.

Branch Information

  • Source Branch: dev/shuchang_newjudge
  • Target Branch: origin/main
  • Commits Ahead: 65 commits
  • Files Changed: 33 files (+4,268 lines / -111 lines)

Key Changes

1. New Judge Modules

| Module | Description |
| --- | --- |
| Audit | Citation audit grader for validating reference compliance |
| CGCV | Comprehensive Grounding & Citation Verification scoring |
| EBTU | Evidence-Based Traceability & Understanding grader |
| Traceability | Source traceability evaluation for generated content |

2. Enhanced Existing Modules

  • Grounding Grader: Improved citation compliance evaluation with updated prompts
  • Presentation Quality: Upgraded to 1/3/5 scoring system with markdown cleanup support
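The 1/3/5 scoring upgrade can be illustrated with a small sketch: each of the eight presentation criteria gets a score of 1, 3, or 5, and the total is normalized by the maximum attainable score (8 × 5 = 40). The function name and validation below are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of the 1/3/5 presentation-quality scoring described above.
# Eight criteria are each scored 1, 3, or 5; the sum is normalized by the
# maximum attainable score (8 * 5 = 40) to yield a value in [0.2, 1.0].
def normalize_presentation_score(item_scores: list[int]) -> float:
    allowed = {1, 3, 5}
    if len(item_scores) != 8 or any(s not in allowed for s in item_scores):
        raise ValueError("expected eight scores, each one of 1/3/5")
    return sum(item_scores) / 40.0

print(normalize_presentation_score([5] * 8))                   # 1.0
print(normalize_presentation_score([1, 3, 5, 5, 3, 1, 5, 3]))  # 0.65
```

Compared with the old pass/fail scheme, the graded scale lets the judge distinguish a report that is merely acceptable on a criterion from one that excels at it.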

3. Configuration & Infrastructure

  • Added LFS patterns in .gitattributes for large file handling
  • Parameterized training configuration for flexible experiment setup
  • New YAML templates: deep_finance_template_maxlen.yaml, infer.yaml
  • Updated environment service URL configuration

4. Documentation

  • Comprehensive refactoring of the Financial Deep Research Agent Training Tutorial (deep_finance.md)

5. Bug Fixes

  • Fixed metric helper trajectory save path and tool call metrics
  • Corrected environment variable settings for multi-machine training
  • Improved MultiAgent message content parsing logic
  • Suppressed httpx AsyncClient.aclose() exception warnings

New Files Added

tutorial/example_deep_finance/judge/
├── audit/
│   ├── __init__.py
│   ├── grader.py
│   ├── json_utils.py
│   └── prompt.py
├── cgcv/
│   ├── __init__.py
│   ├── grader.py
│   ├── json_utils.py
│   └── prompt.py
├── ebtu/
│   ├── __init__.py
│   ├── grader.py
│   ├── json_utils.py
│   └── prompt.py
└── traceability/
    ├── __init__.py
    ├── grader.py
    ├── json_utils.py
    └── prompt.py

Testing

  • All new graders follow the existing _aevaluate interface pattern
  • Compatible with current reward metric helper system
  • Tested with Deep Finance training pipeline
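The `_aevaluate` interface pattern mentioned above can be sketched roughly as follows. The class, result type, and method signature are stand-ins for illustration; the repository's actual definitions may differ.

```python
# Minimal sketch of an async grader following an `_aevaluate`-style interface.
# `GraderResult`, `ExampleGrader`, and the signature are hypothetical stand-ins.
import asyncio
from dataclasses import dataclass

@dataclass
class GraderResult:
    score: float
    reason: str

class ExampleGrader:
    weight: float = 1.0

    async def _aevaluate(self, query: str, report: str) -> GraderResult:
        # A real grader would call an LLM judge here; this stub simply
        # checks that the report is non-empty.
        score = 1.0 if report.strip() else 0.0
        reason = "non-empty report" if score else "empty report"
        return GraderResult(score=score, reason=reason)

result = asyncio.run(ExampleGrader()._aevaluate("q", "final report"))
print(result.score)  # 1.0
```

Keeping every grader behind one async method with a uniform result type is what lets the judge run them concurrently and merge their scores by weight.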

Breaking Changes

  • aevaluate method renamed to _aevaluate for internal consistency (internal API only)

Related Issues

N/A

Checklist

  • Code follows project style guidelines
  • New modules include proper __init__.py exports
  • Configuration templates updated
  • Documentation updated
  • No merge conflicts with main branch

Reviewers: @team-lead
Labels: enhancement, deep-finance, judge-system

TaoShuchang and others added 30 commits January 16, 2026 14:53
…uation functionality to the FinWorld task.

- Added the ExampleAgentScopeLearnProtocol class to implement the AgentScope execution flow for multi-turn interactions.

- Integrated semaphore control to manage the parallelism of environment calls, improving environment stepping performance.

- Implemented a mechanism for detecting context overflows and quickly terminating during environment interactions to prevent blocking.

- Added a finworld.yaml configuration file to define project training and rollout parameters.

- Added the FinWorldJudgeByOpenJudge class, integrating multiple evaluators including RM Gallery and OpenJudge (@Haoran).

- Implemented a mechanism for converting task output, asynchronous calls, and retrying to ensure evaluation stability.

- Weight normalization manages the contributions of each evaluator, merging them to calculate the final reward and success determination.
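The weight-normalization step described in this commit could look roughly like the following. The function name, dict shapes, and success threshold are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of merging per-evaluator scores into a single reward
# via weight normalization, as described above.
def merge_rewards(scores: dict[str, float], weights: dict[str, float],
                  success_threshold: float = 0.5) -> tuple[float, bool]:
    # Normalize by the total weight of the evaluators that actually ran,
    # so disabled (zero-weight) graders do not dilute the reward.
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight <= 0:
        return 0.0, False
    reward = sum(scores[name] * weights.get(name, 0.0) for name in scores) / total_weight
    return reward, reward >= success_threshold

reward, success = merge_rewards(
    {"grounding": 0.8, "presentation_quality": 0.6},
    {"grounding": 3.0, "presentation_quality": 1.0},
)
print(round(reward, 3), success)  # 0.75 True
```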
* fix end of files

* autoflake import fix

* add mypy check
…dates

- Renamed ExampleAgentScopeLearnProtocol to ExampleDeepResearchProtocol and modified the execute method signature.
- Unified the parameter name of the model tuner to `tuner` and its related attribute references.
- Optimized the multi-turn interaction step configuration, changing it to use `tuner.config.ajet.rollout.multi_turn.max_steps`.
- Modified the context overflow judgment logic to prevent tool call blocking.
- Updated the finworld.yaml configuration, replacing astune with ajet-related configurations, and adjusted the workflow protocol and environment parameters.
- Modified the default environment variable values and log saving paths in finworld_judge.py.
- Added and improved multi-machine and single-machine startup scripts, supporting dynamic generation of MCP configuration and environment variable loading.
- Added the finworld_single.yaml template to adapt to single-machine training configurations.
- Adjusted the key reference for multi-turn step configuration in ma_deepresearch.py, using the ajet configuration path.
…ipts and templates

- Added bash startup scripts for multi-machine, multi-GPU training, supporting dynamic configuration generation and environment variable import.
- Implemented training configuration file templates, supporting automatic injection of various weight parameters and model paths.
- Adjusted the default request timeout of EnvClient from 30 seconds to 300 seconds to accommodate long training requests.
- Added a new finworld example directory and related documentation, improving the example project structure.
…uation configuration and scripts

- Replaced model initialization in FinWorldJudgeByOpenJudge with the `_init_openjudge_model` method
- Read Judge model parameters from the configuration file first, using environment variables as a fallback
- Optimized RM Gallery initialization, using configuration-first logic, and improved exception stack trace printing
- Cleaned up and removed the old `_init_model` singleton method and related code
- Updated the example startup script `ajet_finworld.sh`, adding OPENJUDGE_LLM and RM_LLM configurations
- Modified YAML templates and configuration files to unify the structure and field naming of Judge configuration items
- Deleted the outdated `cc_rm4_res2cit2fai2_30b.sh` script
- Adjusted the `env_service` startup path to improve environment activation compatibility
- Adjusted script log output format and content to enhance the clarity of configuration parameter printing
- Added the jsonl_with_env_service type, which allows loading data from jsonl files while calling tools via env_service.
- Extended ResourceKeeper to handle the creation and release logic of environment instances for jsonl_with_env_service.
- Maintained the env_service type logic, calling create_instance to register instances and initializing them using init_messages from the jsonl file.
- Added an example protocol, ExampleDeepResearchProtocol, to implement multi-turn interaction and environment call coordination.
- Provided training scripts and YAML configuration templates for finworld, supporting the jsonl_with_env_service mode training environment.
- Optimized scripts to support multi-node multi-GPU training, including environment variables and Ray cluster configuration.
…of the metrics update logic

- Modified the `update_metrics` function, adding a `prefix` parameter to distinguish between training and validation metrics.
- Adjusted the data source for extracting `reward_stats` and `tool_stats`, migrating from `workflow_metadata` to `log_metrics`.
- Added debug printing to output the `log_metrics` content and metric key names at key steps for easier troubleshooting.
- Used the appropriate prefix when calling `update_metrics` in `trainer_verl.py`, and added multiple debug prints.
- Modified `WorkflowOutput` to place `tool_stats` and `reward_stats` into the `log_metrics` field.
- Removed redundant and deprecated code for extracting `reward_stats` and calculation functions.
- Added debug information output to the `finworld` and `finworld_judge` modules to track log metrics and scoring data.
- Removed debug print statements before and after the `update_metrics` call in `trainer_verl.py`
- Removed debug print statements related to the `log_metrics` key in `finworld.py`
- Removed debug print statements before updating `metadata_stats` in `finworld_judge.py`
- Added logic in `general_runner.py` to synchronize `reward_stats` from `metadata` to `log_metrics` after the judge calculation
- Cleaned up debug print statements within `update_metrics` in `metric_helper`, improving code readability.
feat(tutorial): Added FinWorld multi-machine multi-GPU training startup script
…ations

- Switched the example directory from example_finworld to example_deep_finance
- Modified startup parameters and logic to support deep_finance, replacing the finworld option
- Replaced finworld_reader with deep_finance_reader in the task reader
- Adjusted environment client configuration in resource management, using deep_finance instead of finworld-related checks
- Updated reward metric tool documentation to support deep_finance
- Deleted finworld-related configuration files, scripts, code, and evaluation modules, cleaning up leftover files and scripts
- Replaced the keyword "finworld" with "deep_finance" in comments and logs
… references

- Replace all "finworld" and "deep_finance" names with the unified "deepfinance" format.
- Modify command-line arguments to `--with-deepfinance` for consistency.
- Adjust the class name in `task_reader` from `deep_financeReader` to `DeepFinanceReader`.
- Update the documentation description and file name of the `metric_helper` module to DeepFinance.
- Modify environment variables and configuration paths in the example script `deep_finance.sh` to use the `DEEPFINANCE` prefix.
- Update `judge_protocol` to `DeepFinanceJudgeByOpenJudge` in the `deep_finance.yaml` configuration.
- Refactor the `FinWorldJudgeByOpenJudge` class in `deep_finance_judge.py` to `DeepFinanceJudgeByOpenJudge`.
- Rename the `FinworldReader` class in `deep_finance_reader.py` to `DeepFinanceReader`.
- Modify the debug log identifier and corresponding environment variable name to `DEEPFINANCE_DEBUG`.
- Update the evaluation protocol in the `deep_finance_template.yaml` template to `DeepFinanceJudgeByOpenJudge`.
- Ensure that internal references and comments in all modules are updated to use DeepFinance and deepfinance-related names.
…urning environment state

- Corrected the `env_output` return value structure in `BaseGymEnv` to ensure correct assignment of `reward` and `info` fields.
- Removed `RefJudge` and `StructureJudge` related metric calculations and statistics from `reward_metric_helper`.
- Cleaned up redundant code in `reward_metric_helper`, removing invalid comments and statistical items.
- Modified `save_trajectory_as_json` to always print trajectory saving confirmation information.
- Corrected log comments in `example_deep_finance` to avoid meaningless log output.
- Added the `save_trajectory_as_json_file` configuration item to `deep_finance_template.yaml` to support trajectory saving functionality.
… files

- Added a new ignore rule for config file paths in .gitignore
- Deleted the automatically generated mcp_finance_tool_generated.json file in example_deep_finance
- Refactored the deep_finance.yaml configuration file, adjusting project and experiment names
- Reorganized Judge configuration, clarifying openjudge_llm and rm_llm models
- Optimized model paths and training parameter configurations, adding parallel and batch processing settings
- Adjusted data reading methods and training/validation set path placeholders
- Reduced GPU memory usage ratio for rollout to 0.8
- Updated the default save directory path for the trainer to a placeholder variable
- Cleaned up unused and commented-out code to improve configuration file conciseness
- Corrected the data source field for timeline data used during trajectory saving.
- Removed redundant fields in tool execution time, cache hit rate, and error rate statistics.
- Updated .gitignore to add ignore rules for the example script directory.
- Removed unnecessary debugging information from logs to reduce log noise.
- Adjusted log printing in the multi-round interaction execution process to simplify output content.
- Streamlined log code for environment observation and termination checks to improve code readability.
TaoShuchang and others added 27 commits January 22, 2026 18:10
…tric

- Change trajectory save directory from "ctx_trackers" to "trajectory" to organize files better
- Add recording of tool call counts alongside error rates in tool metrics
- Update experiment suffix in deep finance example script for clearer naming convention
…yGrader

- Remove legacy graders and integrate PresentationQualityGrader and GroundingGrader
- Update grader weights and disable unused graders in config and code
- Simplify grader configuration creation with new mappers for report content and traj
- Refactor DeepFinanceJudgeByOpenJudge to support new grading scheme
- Add PresentationQualityGrader implementation with strict JSON output format
- Include utilities for JSON parsing and validation in presentation quality grader
- Add prompt templates for presentation quality grading criteria and instructions
- Provide example script to run PresentationQualityGrader with OpenAIChatModel
- Add traj_adapter utilities to normalize and extract user query and final report
- Update YAML template to replace old grader weights with presentation quality weight
- Create init files to expose PresentationQualityGrader in judge package
…valuation

- add GroundingGrader class to evaluate citation coverage and truthfulness based on dialogue traj
- provide default OpenAIChatModel creation with deterministic options
- implement prompt construction and JSON parsing utilities for model interaction
- calculate scores including coverage, grounding, and invalid citation penalties
- add detailed json_utils module for strict JSON extraction and validation
- introduce prompt templates defining citation auditing rules and user prompts
- supply reference.py with related grounding evaluation logic and RefJudgeEvaluator class
- create __init__.py to expose GroundingGrader module
- add presentation_quality module __init__.py with PresentationQualityGrader export
- Add populate_reward_metadata_from_stats to copy reward stats into reward metadata
- Populate reward metadata in GeneralRunner if reward_stats present in workflow output
- Refine compute_reward_metrics with updated OpenJudge graders: presentation_quality, grounding, planning
- Add _save_zero_score_debug method in DeepFinanceJudgeByOpenJudge to save debug info for zero grader scores
- Remove deprecated RewardStats usage in deep_finance_judge
- Update judge __init__ to export GroundingGrader alongside PresentationQualityGrader
- Clean up debug print statements and logging in deep_finance_judge.py
- Update .gitignore to exclude prepare_data and judge/analytical_sufficiency folders in example_deep_finance tutorial
…ith markdown cleanup

- Add function to strip markdown code block fences in grounding and presentation_quality modules
- Change presentation quality grader to score each of 8 criteria on a 1/3/5 scale instead of pass/fail
- Normalize total score by dividing sum of item scores by max (40), improving granularity
- Update reasoning output to list lowest scoring items with notes for focused feedback
- Revise presentation quality prompt to reflect new 1/3/5 scoring rubric with detailed instructions
- Adjust JSON output schema accordingly, replacing boolean pass with numeric score fields
- Add get_score utility in JSON utils to extract and validate scores from graded items
- Clean report input by removing markdown fences before grading to avoid markup noise
- Add grounding weight configuration in YAML template for improved modular judge weighting
- Added various file extensions to .gitattributes for Git LFS tracking
- Added dataset_gsm8k/ to .gitignore to exclude dataset files
- Introduced ENV_SERVICE_URL variable in deep_finance.sh and deep_finance_single.sh
- Updated configuration file generation to include ENV_SERVICE_URL substitution
- Commented out invalid reference penalty calculation in grounding grader logic
- Added the AuditGrader citation-logic audit module, implementing strict validation of citation compliance
- Added CGCVGrader, supporting detailed verification and scoring of citation-anchored claims
- Added TraceabilityRewardGrader, implementing evidence-anchor traceability checks for report claims
- Integrated the three new graders into DeepFinanceJudgeByOpenJudge and configured their weights
- Extended reward_metric_helper with 'audit', 'traceability', and 'cgcv' scoring items
- Updated dependencies and imports so the three graders are accessible directly via the judge package
- Added the associated JSON utilities and prompt templates to support accurate evaluation and result parsing by the graders
- Added an ignore rule for the example_deep_finance/output_report path to .gitignore
- Added EBTUTraceabilityGrader and integrated it into the DeepFinanceJudge weight configuration
- Adjusted the maximum model length in deep_finance.yaml to 40960
- Added EBTU and related weight configuration to the deep_finance.sh and deep_finance_single.sh scripts
- Improved single-machine debugging logs and directory structure in deep_finance_single.sh
- Substantially improved JSON parsing in the audit, cgcv, and traceability modules, adding automatic repair of common JSON format errors
- Removed the audit grader's dependence on the model-reported integrity_score, computing it manually instead
- Disabled some tool-statistics log output in ExampleDeepResearchProtocol and added a thread semaphore limit
- Adjusted prompts and YAML templates, added an EBTU weight placeholder, and improved configuration-generation log output
… used `_aevaluate`.

- Renames the `aevaluate` async method in multiple grader classes to `_aevaluate` to identify it as an internal method.

- Updates the method call names in `CGCVGrader` to match the renamed `_aevaluate`.

- Maintains asynchronous call logic, enhancing code encapsulation.

- Affected graders include the `audit`, `cgcv`, `ebtu`, `grounding`, `presentation_quality`, and `traceability` modules.
…ng Tutorial Documentation

- Removed the original detailed tutorial file "How to Train a Finance Deep Research Agent" to avoid redundancy.

- Added a new document, deep_finance.md, with a clearer structure and more complete content.

- Reorganized the system architecture and pipeline module descriptions to improve readability.

- Provided detailed explanations of the two-stage deep research workflow design and citation specifications.

- Systematically listed and categorized 19 financial tools and their functions.

- Standardized reward design specifications, refining multi-dimensional scoring indicators and weight adjustments.

- Supplemented key technical points such as training loop, engineering details, tool calls, and caching design.

- Standardized document format, using tables and code blocks to display information, improving the document's professionalism and usability.
@TaoShuchang TaoShuchang merged commit 3d23e1c into main Feb 24, 2026
0 of 2 checks passed
@gemini-code-assist

Summary of Changes

Hello @TaoShuchang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the Deep Finance evaluation system by integrating several new and enhanced grading modules. These changes aim to improve the rigor and reliability of financial research reports generated by the AI agent, focusing on citation accuracy, evidence traceability, and overall report quality. The update also includes infrastructure improvements and bug fixes to support the expanded grading capabilities, ensuring a more robust and flexible training and evaluation pipeline.

Highlights

  • New Judge Modules: Introduced four new grading modules: Audit, CGCV, EBTU, and Traceability, to enhance citation compliance, traceability, and quality assessment within the Deep Finance evaluation system.
  • Enhanced Existing Modules: Improved the Grounding Grader with updated prompts and upgraded the Presentation Quality grader to a 1/3/5 scoring system with markdown cleanup support.
  • Configuration and Infrastructure Updates: Added Git LFS patterns for handling large files, parameterized training configurations for flexibility, and introduced new YAML templates (deep_finance_template_maxlen.yaml, infer.yaml) for experiment setup.
  • Documentation Refactoring: Comprehensively rewrote and expanded the Financial Deep Research Agent Training Tutorial (deep_finance.md) to provide detailed guidance.
  • Bug Fixes and Optimizations: Addressed issues related to metric helper trajectory save paths, tool call metrics, environment variable settings for multi-machine training, improved MultiAgent message content parsing, and suppressed httpx AsyncClient.aclose() exception warnings.


Changelog
  • .gitattributes
    • Added LFS patterns for various large file types, including archives, models, and audio files.
  • .gitignore
    • Updated to ignore new output report and dataset directories (tutorial/example_deep_finance/output_report/*, dataset_gsm8k/*).
  • ajet/utils/metric_helper/reward_metric_helper.py
    • Extended the list of openjudge_graders to include new audit, traceability, and CGCV modules.
  • tutorial/example_deep_finance/deep_finance.md
    • Completely rewritten to provide a comprehensive tutorial for the DeepFinance agent training solution, detailing its pipeline, workflow, tool system, and reward design.
  • tutorial/example_deep_finance/deep_finance.py
    • Increased the semaphore limit for concurrent threads from 30 to 60.
    • Commented out tool statistics logging for cleaner output.
  • tutorial/example_deep_finance/deep_finance.sh
    • Added new grader weight variables (CGCV_WEIGHT, AUDIT_WEIGHT, TRACEABILITY_WEIGHT, EBTU_WEIGHT) with default zero values.
    • Included ENV_SERVICE_URL configuration.
    • Updated the parameter confirmation log to display all new grader weights.
  • tutorial/example_deep_finance/deep_finance.yaml
    • Decreased the max_model_len parameter from 50000 to 40960.
  • tutorial/example_deep_finance/deep_finance_judge.py
    • Imported new grader classes: CGCVGrader, AuditGrader, TraceabilityRewardGrader, and EBTUTraceabilityGrader.
    • Updated reward weight configuration to include new graders.
    • Added new grader configurations to the _setup_graders method for CGCV, Audit, Traceability, and EBTU.
  • tutorial/example_deep_finance/deep_finance_single.sh
    • Added new grader weight variables and the environment service URL.
    • Exported MODEL_PATH for single-machine debugging.
    • Updated the parameter confirmation log to display all new grader weights.
    • Modified the ajet/launcher.py command to include --with-deepfinance and --with-ray flags.
  • tutorial/example_deep_finance/judge/__init__.py
    • Imported and exposed new grader classes: CGCVGrader, AuditGrader, TraceabilityRewardGrader, and EBTUTraceabilityGrader.
  • tutorial/example_deep_finance/judge/audit/__init__.py
    • Added the AuditGrader export.
  • tutorial/example_deep_finance/judge/audit/grader.py
    • Added the AuditGrader class for citation integrity auditing, including JSON parsing, schema validation, and score computation logic.
  • tutorial/example_deep_finance/judge/audit/json_utils.py
    • Added utility functions for JSON parsing, repair, and validation specific to the AuditGrader, along with trajectory helper functions.
  • tutorial/example_deep_finance/judge/audit/prompt.py
    • Added system and user prompts for the AuditGrader, defining its role as an 'Evidence Logic Analyst' and specifying output format.
  • tutorial/example_deep_finance/judge/cgcv/__init__.py
    • Added the CGCVGrader export.
  • tutorial/example_deep_finance/judge/cgcv/grader.py
    • Added the CGCVGrader class for citation-grounded claim verification, including prompt construction, model interaction, and score computation.
  • tutorial/example_deep_finance/judge/cgcv/json_utils.py
    • Added utility functions for JSON parsing, repair, validation, and trajectory processing specific to the CGCVGrader.
  • tutorial/example_deep_finance/judge/cgcv/prompt.py
    • Added system and user prompts for the CGCVGrader in both Chinese and English, detailing the verification process and output format.
  • tutorial/example_deep_finance/judge/ebtu/__init__.py
    • Added the ebtu_reward package init.
  • tutorial/example_deep_finance/judge/ebtu/grader.py
    • Added the EBTUTraceabilityGrader class for evidence-backed traceability, including score computation based on various metrics.
  • tutorial/example_deep_finance/judge/ebtu/json_utils.py
    • Added utility functions for JSON parsing, validation, and trajectory processing specific to the EBTUTraceabilityGrader.
  • tutorial/example_deep_finance/judge/ebtu/prompt.py
    • Added system and user prompts for the EBTUTraceabilityGrader, defining its role as an 'Evidence-first Auditor' and specifying output format.
  • tutorial/example_deep_finance/judge/grounding/grader.py
    • Renamed the aevaluate method to _aevaluate for internal consistency.
    • Removed the invalid reference penalty from the score computation.
  • tutorial/example_deep_finance/judge/grounding/prompt.py
    • Updated the system prompt with revised citation guidelines, input instructions, and output format for the Grounding Grader.
  • tutorial/example_deep_finance/judge/presentation_quality/grader.py
    • Renamed the aevaluate method to _aevaluate for internal consistency.
  • tutorial/example_deep_finance/judge/traceability/__init__.py
    • Added the TraceabilityRewardGrader export.
  • tutorial/example_deep_finance/judge/traceability/grader.py
    • Added the TraceabilityRewardGrader class for traceability and verifiability, including score computation logic.
  • tutorial/example_deep_finance/judge/traceability/json_utils.py
    • Added utility functions for JSON parsing, validation, and trajectory processing specific to the TraceabilityRewardGrader.
  • tutorial/example_deep_finance/judge/traceability/prompt.py
    • Added system and user prompts for the TraceabilityRewardGrader, defining its role as a 'Traceability Auditor' and specifying output format.
  • tutorial/example_deep_finance/prompt/tool_prompt_builder.py
    • Removed the crawl_url tool from the prompt template.
  • tutorial/example_deep_finance/yaml_template/deep_finance_template.yaml
    • Added new grader weight placeholders for CGCV, Audit, Traceability, and EBTU.
  • tutorial/example_deep_finance/yaml_template/deep_finance_template_maxlen.yaml
    • Added a new YAML template for deep finance configuration with max length settings, including new grader weights.
  • tutorial/example_deep_finance/yaml_template/infer.yaml
    • Added a new YAML template for inference configuration, including new grader weights and adjusted max_prompt_length and max_response_length.
Activity
  • The pull request includes 65 commits from user TaoShuchang.
  • A total of 33 files were changed, with 4,268 lines added and 111 lines deleted.
  • The author has completed all checklist items, including code style, module exports, configuration updates, and documentation.


@gemini-code-assist (bot) left a comment


Code Review

This PR is a significant enhancement to the Deep Finance evaluation system, introducing several new, well-structured grading modules (audit, cgcv, ebtu, traceability). The configuration has been updated to support these new modules, and the documentation has been extensively rewritten. The changes are comprehensive and align with the goal of improving citation compliance and quality assessment.

My review includes a few suggestions for improvement:

  • Documentation: The new deep_finance.md contains images and links pointing to internal resources, which will be inaccessible to external users.
  • Code Maintainability: There is significant code duplication across the new judge modules, particularly in the json_utils.py files. Consolidating these utilities into a shared module would improve maintainability.
  • Configuration: A hardcoded model path in the single-machine script could be made more flexible.
  • Logging: I've suggested a more configurable approach to logging instead of commenting out code.

Overall, this is a great contribution. Addressing these points will further enhance the quality and maintainability of the codebase.

Comment on lines +354 to +356
![img](https://intranetproxy.alipay.com/skylark/lark/0/2026/png/107756372/1771843906200-9dd35ac4-f71e-40dc-b130-f03e3e6bae6a.png)

![img](https://intranetproxy.alipay.com/skylark/lark/0/2026/png/107756372/1771843940824-4e3637d7-a16e-4994-8878-242effc2c0d7.png)![img](https://intranetproxy.alipay.com/skylark/lark/0/2026/png/107756372/1771843950142-09def779-5521-41f0-a457-a7715a819cc7.png)


Severity: high

The images in the "实验结果" (Experiment Results) section are hosted on an internal intranetproxy.alipay.com URL. These images will be broken for anyone outside of the internal network. To ensure the documentation is accessible to all contributors, please consider either:

  1. Committing the images to the repository (and using Git LFS if they are large).
  2. Uploading them to a public image hosting service.

Comment on lines +1 to +262
"""JSON Utilities for Audit Grader"""
from __future__ import annotations

import json
import re
from typing import Any, Dict, List, Tuple

_JSON_RE = re.compile(r"\{.*\}", re.DOTALL)

def extract_first_json_object(text: str) -> str | None:
if not text:
return None
m = _JSON_RE.search(text.strip())
if not m:
return None
return m.group(0)


def _repair_json(js: str) -> str:
    """
    Attempt to repair common JSON formatting errors:
    1. Unescaped newlines inside strings
    2. Trailing commas
    3. Missing commas
    4. Incomplete (truncated) JSON
    """
    # 1. Escape raw newlines inside string values.
    # This is the most common issue: the LLM emits a literal newline instead of \n.
    def escape_newlines_in_strings(s: str) -> str:
        result = []
        in_string = False
        escape_next = False
        i = 0
        while i < len(s):
            c = s[i]
            if escape_next:
                result.append(c)
                escape_next = False
            elif c == '\\':
                result.append(c)
                escape_next = True
            elif c == '"':
                result.append(c)
                in_string = not in_string
            elif in_string and c == '\n':
                result.append('\\n')
            elif in_string and c == '\r':
                result.append('\\r')
            elif in_string and c == '\t':
                result.append('\\t')
            else:
                result.append(c)
            i += 1
        return ''.join(result)

    js = escape_newlines_in_strings(js)

    # 2. Remove trailing commas: ",}" -> "}" and ",]" -> "]"
    js = re.sub(r',\s*}', '}', js)
    js = re.sub(r',\s*]', ']', js)

    # 3. Try to repair truncated JSON by completing missing brackets.
    # Count the brackets first.
    open_braces = js.count('{')
    close_braces = js.count('}')
    open_brackets = js.count('[')
    close_brackets = js.count(']')

    # If the counts do not match, try to close what is still open.
    if open_braces > close_braces:
        # First close a possibly unterminated string:
        # scan to check whether the text ends inside a string.
        in_string = False
        escape_next = False
        for c in js:
            if escape_next:
                escape_next = False
            elif c == '\\':
                escape_next = True
            elif c == '"':
                in_string = not in_string
        if in_string:
            js += '"'

        # Complete the missing brackets.
        js += ']' * (open_brackets - close_brackets)
        js += '}' * (open_braces - close_braces)

    return js


def strict_load_json(text: str) -> Tuple[Dict[str, Any] | None, str | None]:
    js = extract_first_json_object(text)
    if js is None:
        return None, "No JSON object found"

    # First attempt: parse as-is.
    try:
        obj = json.loads(js)
        if not isinstance(obj, dict):
            return None, f"Root is not dict: {type(obj)}"
        return obj, None
    except json.JSONDecodeError:
        pass  # fall through and attempt a repair

    # Second attempt: parse after repair.
    try:
        repaired = _repair_json(js)
        obj = json.loads(repaired)
        if not isinstance(obj, dict):
            return None, f"Root is not dict: {type(obj)}"
        return obj, None
    except json.JSONDecodeError as e:
        return None, f"JSONDecodeError: {str(e)}"

def validate_integrity_shape(obj: Dict[str, Any]) -> Tuple[Dict[str, Any] | None, str | None]:
    """
    Validate the output structure of the Evidence Logic Analyst.
    Schema:
    {
        "audit_trail": [
            {"citation_id": int, "verdict": str, ...}, ...
        ],
        "qualitative_summary": str,
        "integrity_score": float
    }
    """
    # 1. Check top-level fields
    required_fields = ["audit_trail", "qualitative_summary", "integrity_score"]
    for f in required_fields:
        if f not in obj:
            return None, f"Missing field: {f}"

    # 2. Validate integrity_score
    try:
        score = float(obj["integrity_score"])
        if not (0.0 <= score <= 1.0):
            # Tolerance: clamp values slightly out of range.
            score = max(0.0, min(1.0, score))
        obj["integrity_score"] = score
    except (TypeError, ValueError):
        return None, "integrity_score must be a float"

    # 3. Validate audit_trail
    if not isinstance(obj["audit_trail"], list):
        return None, "audit_trail must be a list"

    valid_verdicts = {"Supported", "Overstated", "Contradicted", "Hallucinated", "Irrelevant"}

    for idx, item in enumerate(obj["audit_trail"]):
        if not isinstance(item, dict):
            return None, f"audit_trail[{idx}] is not a dict"

        # Check required item fields
        if "citation_id" not in item:
            return None, f"audit_trail[{idx}] missing 'citation_id'"
        if "verdict" not in item:
            return None, f"audit_trail[{idx}] missing 'verdict'"

        # Normalize verdict
        v = str(item["verdict"]).strip()
        # Simple case-insensitive compatibility
        v_cap = v.capitalize()
        if v not in valid_verdicts and v_cap in valid_verdicts:
            item["verdict"] = v_cap
        elif v not in valid_verdicts:
            # If the model emits an unexpected verdict, we could either downgrade
            # it to Irrelevant or fail; fail here to stay strict.
            return None, f"Invalid verdict '{v}' in item {idx}"

    return obj, None


# =============================================================================
# Trajectory Helpers
# =============================================================================

def _extract_text_content(content) -> str:
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Handle OpenAI multi-part content
        parts = []
        for p in content:
            if isinstance(p, dict) and p.get("type") == "text":
                parts.append(p.get("text", ""))
            elif isinstance(p, str):
                parts.append(p)
        return "\n".join(parts)
    return str(content)


def _strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.S).strip()


def _strip_markdown_fences(text: str) -> str:
    text = text.strip()
    text = re.sub(r'^```(?:markdown|md)?\s*\n?', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\n?```\s*$', '', text)
    return text.strip()


def _extract_tool_call_json(text: str) -> str:
    # Try to extract a ```json ... ``` fenced block first
    m = re.search(r"```json\s*(\[[\s\S]*?\])\s*```", text)
    if m:
        return m.group(1).strip()
    # Simple fallback: a bare JSON array
    if text.strip().startswith("[") and text.strip().endswith("]"):
        return text.strip()
    return ""

def construct_reward_prompt(trajectory: List[Dict[str, Any]], template: str) -> str:
    """
    Extract the User Query, Evidence (tool outputs), and Final Report.
    """
    user_query = ""
    evidence_parts = []
    final_report = ""

    # Helper to clean text
    def clean(c):
        return _strip_think(_extract_text_content(c))

    # 1. Identify components.
    # Search backwards for the Final Report (an assistant message containing
    # References or TASK_COMPLETED).
    for i in range(len(trajectory) - 1, -1, -1):
        msg = trajectory[i]
        if msg.get("role") == "assistant":
            txt = clean(msg.get("content"))
            # Loose heuristic: the last long text is usually the report.
            if "References" in txt or "[TASK_COMPLETED]" in txt or len(txt) > 600:
                final_report = _strip_markdown_fences(txt)
                break

    # If no explicit report was found, fall back to the last assistant message.
    if not final_report and trajectory:
        last = trajectory[-1]
        if last.get("role") == "assistant":
            final_report = _strip_markdown_fences(clean(last.get("content")))

    for idx, msg in enumerate(trajectory):
        role = msg.get("role")
        content_raw = clean(msg.get("content"))

        # User Query: first user message
        if role == "user" and not user_query:
            user_query = content_raw
            continue  # do not treat the query as evidence

        # Evidence: tool calls and tool outputs
        if role == "assistant":
            # Check for tool calls
            tool_json = _extract_tool_call_json(content_raw)
            if tool_json:
                evidence_parts.append(f"--- Step {idx} Tool Call ---\n{tool_json}")

        elif role == "tool":
            evidence_parts.append(f"--- Step {idx} Tool Result ---\n{content_raw}")

    evidence_text = "\n\n".join(evidence_parts)

    return template.format(
        user_query=user_query,
        evidence_text=evidence_text,
        final_report=final_report
    )


high

This file contains several utility functions for JSON parsing and trajectory processing (e.g., _repair_json, strict_load_json, construct_reward_prompt). These functions are nearly identical to those found in the json_utils.py files of the other new judge modules (cgcv, ebtu, traceability). This code duplication makes maintenance difficult, as a bug fix or improvement in one file would need to be manually replicated in all others.

To improve maintainability, I strongly recommend refactoring these common utilities into a shared module, for example, tutorial/example_deep_finance/judge/utils/json_helpers.py. Each grader module could then import these functions from the central location. This would centralize the logic and make the codebase cleaner and easier to manage.
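As a minimal sketch of what the shared module could look like (the `judge/utils/json_helpers.py` path and re-export layout are assumptions; function bodies would be moved verbatim from one of the duplicated `json_utils.py` files):

```python
# Hypothetical shared module, e.g. tutorial/example_deep_finance/judge/utils/json_helpers.py
# (path and layout are assumptions, not part of this PR).
import re

_JSON_RE = re.compile(r"\{.*\}", re.DOTALL)


def extract_first_json_object(text: str):
    """Return the first {...} span found in text, or None."""
    if not text:
        return None
    m = _JSON_RE.search(text.strip())
    return m.group(0) if m else None


# Each grader module would then import from the single location instead of
# carrying its own copy, e.g.:
#   from judge.utils.json_helpers import extract_first_json_object, strict_load_json

print(extract_first_json_object('noise {"integrity_score": 0.8} noise'))
```

With this layout, a fix to the repair heuristics lands in one file and every grader picks it up automatically.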


```bash
cd /path/to/AgentJet
bash install.sh # TODO: condense this into a single install step: https://yuque.alibaba-inc.com/bayotg/wxz7sb/qdesuu33621x2yhi
```


medium

This TODO comment contains a link to an internal Yuque document (https://yuque.alibaba-inc.com/...). This link is not accessible to external contributors. For better collaboration, it's recommended to either provide a public link or summarize the context of the TODO directly in the comment.

Comment on lines +128 to +130
# if latest_tool_stats.get('total_calls', 0) > 0:
#     logger.info(f"Step {step + 1} tool stats: calls={latest_tool_stats.get('total_calls', 0)}, "
#                 f"success rate={latest_tool_stats.get('success_rate', 0):.1f}%")


medium

This logging statement for tool statistics has been commented out. While this reduces log verbosity, it's generally better to control logging output through configuration or log levels rather than commenting out code. This allows for easily re-enabling detailed logs for debugging without code changes.

For example, you could use a logger with a configurable level, like logger.debug(...), or wrap the log statement in a conditional check based on a configuration flag.

Suggested change
# if latest_tool_stats.get('total_calls', 0) > 0:
#     logger.info(f"Step {step + 1} tool stats: calls={latest_tool_stats.get('total_calls', 0)}, "
#                 f"success rate={latest_tool_stats.get('success_rate', 0):.1f}%")
if self.config.get('enable_step_logging', False) and latest_tool_stats.get('total_calls', 0) > 0:
    logger.info(f"Step {step + 1} tool stats: calls={latest_tool_stats.get('total_calls', 0)}, "
                f"success rate={latest_tool_stats.get('success_rate', 0):.1f}%")

echo -e "\033[31mWarning: .env file not found: $ENV_FILE\033[0m"
fi

export MODEL_PATH="/mnt/data_cpfs/taoshuchang.tsc/models/Qwen3-8B"


medium

The MODEL_PATH is hardcoded in this script. This can override the value set in the .env file, which is sourced earlier. This reduces the script's flexibility and might lead to confusion if a developer expects the .env file to be the single source of truth for this path.

It would be more robust to rely on the .env file for this path. Simply removing this line and ensuring MODEL_PATH is defined in .env would be the cleanest approach.

@@ -0,0 +1 @@
# ebtu_reward package


medium

This __init__.py file is nearly empty. For consistency with the other new judge modules (audit, cgcv, traceability), it should export the EBTUTraceabilityGrader class. This makes the grader easily importable from the package level.

Suggested change
# ebtu_reward package
# ebtu_reward package
from .grader import EBTUTraceabilityGrader
__all__ = ["EBTUTraceabilityGrader"]
