PaperSurveyor MVP Blueprint

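Part 1. Product Definition

1.1 What the MVP Is

PaperSurveyor is a research engine and Agent workflow platform for literature-survey tasks. Its goal is not to replace a scholar's reading of full papers, but to help users quickly complete the following preparatory work:

Identify which field, or which intersecting fields, a research question belongs to
Find the representative papers in that direction that most deserve to be read first
Automatically produce topic clusters, method lineages, hotspots, and gaps
Generate an editable survey report from a set of papers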

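1.2 Target Users

Master's and PhD students: get into a new direction quickly
PIs / research group members: organize the literature for group meetings and project proposals
Industry researchers / algorithm engineers: quickly understand an interdisciplinary technical direction
Research tool developers: extend the open-source framework to more domains and data sources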

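1.3 Core Pain Points

General search tools return many results but do not make clear what to read first
New and interdisciplinary directions lack a systematic entry point
The survey process is fragmented: retrieval, screening, classification, and summarization are spread across multiple tools
It is hard to combine "domain authority" and "topical relevance" into a single explainable ranking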

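1.4 How It Differs from General Paper Search Tools

General tools solve "finding papers"; PaperSurveyor solves "completing a survey"
General tools rank by relevance; PaperSurveyor ranks by importance
General tools do not distinguish per-domain strategies; PaperSurveyor ships with domain-aware configuration
General tools usually have no workflow; PaperSurveyor has a built-in multi-Agent pipeline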

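1.5 MVP Scope

Included:

Domain configuration
Keyword and cross-domain retrieval
Explainable importance ranking
Paper detail analysis
Report generation
Agent pipeline definition

Not included for now:

Large-scale full-text crawling and PDF parsing
Private knowledge base upload
Real-time collaborative editing
Complex permission system
Production-grade distributed scheduling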

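Part 2. Core User Flows

2.1 Single-Domain Survey

The user enters graph neural networks for recommendation on the home page
Selects Computer Science
The Query Understanding Agent parses the topic, time range, and survey intent
The Domain Router Agent matches computer_science
The Source Strategy Agent selects a strategy that prioritizes CCF-A venues, top conferences, and top journals
The Retrieval Agent fetches candidate papers
The Ranking Agent produces the importance score and recommendation reasons
On the results page, the user views the results tiered as "introductory / advanced / frontier follow-up"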

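2.2 Cross-Domain Survey

The user enters multimodal foundation models in clinical decision support
Selects Computer Science + Medicine
The Source Strategy Agent enables both the top AI conference strategy and the top medical journal strategy
The Ranking Agent adds a cross-domain coverage factor
The Clustering Agent forms clusters such as clinical QA, medical imaging, and EHR reasoning
The Insight Agent outputs the cross-domain lineage and gaps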

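2.3 Report Generation

The user selects 20 papers on the search results page
Clicks Generate Survey
The system creates a report_task
The Report Agent generates a Markdown report from a template
The report page shows the topic overview, representative papers, method routes, hotspots, gaps, and recommended reading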

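2.4 Expanding from a Key Paper

The user opens the paper detail page
Reviews the core contributions, applicable problems, position in the research lineage, and related papers
Clicks Expand Similar Papers
The system continues retrieval based on paper keywords, domain tags, venue, and citation context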
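
One way the expansion step could score candidates is sketched below. It is a minimal illustration, not the project's method: the `PaperMeta` type, the `similarity` and `expand_similar` helpers, and the 0.6 / 0.3 / 0.1 weights are all assumptions, and citation context is left out because it needs data the sketch does not model.

```python
from dataclasses import dataclass, field


@dataclass
class PaperMeta:
    """Hypothetical minimal view of a paper used for expansion."""
    paper_id: str
    keywords: set[str] = field(default_factory=set)
    domain_tags: set[str] = field(default_factory=set)
    venue: str = ""


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(seed: PaperMeta, other: PaperMeta) -> float:
    # Assumed weights: keywords 0.6, domain tags 0.3, same venue 0.1.
    same_venue = 1.0 if seed.venue and seed.venue == other.venue else 0.0
    return (
        0.6 * jaccard(seed.keywords, other.keywords)
        + 0.3 * jaccard(seed.domain_tags, other.domain_tags)
        + 0.1 * same_venue
    )


def expand_similar(seed: PaperMeta, candidates: list[PaperMeta], top_k: int = 10):
    """Return (score, paper_id) pairs, best first, excluding the seed itself."""
    scored = [
        (similarity(seed, c), c.paper_id)
        for c in candidates
        if c.paper_id != seed.paper_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]
```

Sorting by score and then by `paper_id` keeps the output deterministic when several candidates tie.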

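Part 3. Functional Architecture

Frontend Modules

Home page: query input, domain selection, survey template entry
Search results page: filters, importance ranking, cluster sidebar
Paper detail page: structured analysis and further reading
Report page: Markdown report display and copy
Workspace: search history, saved papers, report tasks
Domain configuration page: view built-in domain profiles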

Backend Modules

Search API
Domain Profile API
Paper Detail API
Analyze API
Report API
Recommendation API

Retrieval Module

query normalize
domain-aware source strategy
external provider adapters
candidate merge and dedup

Ranking Module

relevance score
venue authority score
citation score
recency score
survey/foundation score
cross-domain boost
explainable reason generator

Agent Module

query understanding
domain routing
source strategy
retrieval orchestration
ranking
clustering
insight extraction
report generation

Data Management Module

built-in domain profiles
paper metadata cache
ranking feature logs
report output storage

Part 4. Agent Workflow Design

4.1 Query Understanding Agent

Responsibilities:

Identify the user's survey goal, keywords, task type, and implicit filter conditions

Input:

{
  "query": "multimodal models for clinical decision support",
  "domains": ["computer_science", "medicine"],
  "time_range": {"from": 2021, "to": 2026}
}

Output:

{
  "normalized_query": "multimodal foundation models clinical decision support",
  "intent": "survey",
  "subtopics": ["medical VLM", "decision support", "clinical reasoning"],
  "constraints": {
    "paper_types": ["survey", "benchmark", "foundational", "recent_frontier"]
  }
}

4.2 Domain Router Agent

Responsibilities:

Determine the primary and intersecting domains and produce domain weights

Output:

{
  "primary_domain": "medicine",
  "secondary_domains": ["computer_science"],
  "domain_weights": {
    "medicine": 0.55,
    "computer_science": 0.45
  }
}
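
The output shape above can be derived from per-domain scores by simple normalization. The sketch below assumes the router already has a non-negative score per domain (for example from a classifier); `route_domains` and its input are hypothetical names, not part of the blueprint.

```python
def route_domains(raw_scores: dict[str, float]) -> dict:
    """Turn non-negative per-domain scores into the router output shape."""
    positive = {key: value for key, value in raw_scores.items() if value > 0}
    if not positive:
        raise ValueError("at least one domain needs a positive score")
    total = sum(positive.values())
    # Rounded for display; the rounded weights may not sum to exactly 1.
    weights = {key: round(value / total, 2) for key, value in positive.items()}
    # Highest weight first, ties broken alphabetically for determinism.
    ranked = sorted(weights, key=lambda key: (-weights[key], key))
    return {
        "primary_domain": ranked[0],
        "secondary_domains": ranked[1:],
        "domain_weights": weights,
    }


result = route_domains({"medicine": 11, "computer_science": 9})
```

With those illustrative scores, `result` has medicine as the primary domain with weight 0.55 and computer_science as the secondary domain with weight 0.45.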

4.3 Source Strategy Agent

Responsibilities:

Select preferred paper sources, venue weights, time window, and recall quotas based on the domain profile

Output:

{
  "sources": ["openalex", "crossref", "arxiv"],
  "preferred_venues": ["Nature Medicine", "The Lancet Digital Health", "NeurIPS", "ICML"],
  "recall_limits": {
    "medicine_top_journals": 30,
    "cs_top_conferences": 40,
    "broad_recall": 60
  }
}
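
A strategy like the one above could be assembled by merging the profiles of the selected domains. This is only a sketch: the `DOMAIN_PROFILES` seed data, including which sources belong to which domain, and the `build_strategy` helper are assumptions made for illustration.

```python
# Hypothetical seed profiles; real ones would come from the domain config.
DOMAIN_PROFILES = {
    "computer_science": {
        "sources": ["openalex", "arxiv"],
        "preferred_venues": ["NeurIPS", "ICML"],
        "recall_limits": {"cs_top_conferences": 40},
    },
    "medicine": {
        "sources": ["openalex", "crossref"],
        "preferred_venues": ["Nature Medicine", "The Lancet Digital Health"],
        "recall_limits": {"medicine_top_journals": 30},
    },
}


def build_strategy(domains: list[str], broad_recall: int = 60) -> dict:
    """Merge the profiles of the selected domains, preserving first-seen order."""
    sources: list[str] = []
    venues: list[str] = []
    limits: dict[str, int] = {}
    for key in domains:
        profile = DOMAIN_PROFILES[key]
        for source in profile["sources"]:
            if source not in sources:
                sources.append(source)
        for venue in profile["preferred_venues"]:
            if venue not in venues:
                venues.append(venue)
        limits.update(profile["recall_limits"])
    limits["broad_recall"] = broad_recall
    return {
        "sources": sources,
        "preferred_venues": venues,
        "recall_limits": limits,
    }


strategy = build_strategy(["medicine", "computer_science"])
```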

4.4 Retrieval Agent

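Responsibilities:

Call external providers, fetch candidate papers, and deduplicate them

Failure fallback:

If a provider times out, fall back to the other providers
When citation data is missing, mark the citation factor as unavailable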
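
The fallback rules above can be sketched with plain callables standing in for provider adapters. The `retrieve` and `dedup` names, the record fields, and the choice of DOI-then-title as the dedup key are assumptions for illustration; real adapters would wrap HTTP clients.

```python
from typing import Callable

# A provider takes a query and returns a list of paper records.
Provider = Callable[[str], list[dict]]


def dedup(papers: list[dict]) -> list[dict]:
    """Keep the first record per DOI, falling back to the lowercased title."""
    seen: set[str] = set()
    unique: list[dict] = []
    for paper in papers:
        key = (paper.get("doi") or paper["title"]).strip().lower()
        if key in seen:
            continue
        seen.add(key)
        # Flag missing citation data instead of guessing a value.
        available = paper.get("citation_count") is not None
        unique.append({**paper, "citation_available": available})
    return unique


def retrieve(query: str, providers: dict[str, Provider]) -> dict:
    """Call every provider; skip those that time out and record them."""
    candidates: list[dict] = []
    failed: list[str] = []
    for name, fetch in providers.items():
        try:
            candidates.extend(fetch(query))
        except TimeoutError:
            failed.append(name)  # degrade to the remaining providers
    return {"papers": dedup(candidates), "failed_providers": failed}
```

Returning `failed_providers` alongside the papers gives downstream agents the fallback route to log.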

4.5 Ranking Agent

Responsibilities:

Compute the importance score
Output explanation items

Output:

{
  "paper_id": "paper-gat-2018",
  "importance_score": 82.4,
  "explanations": [
    "High relevance to graph neural networks for recommendation",
    "Published in a high-authority venue profile",
    "Strong citation signal",
    "Foundational method frequently referenced in the domain"
  ]
}
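
Explanation strings like those above can be produced by rules over the normalized features rather than by free-form generation. The sketch below assumes features in [0, 1]; the 0.7 thresholds, the `EXPLANATION_RULES` table, and the `explain` helper are illustrative assumptions.

```python
# (feature name, assumed threshold, explanation template)
EXPLANATION_RULES = [
    ("relevance_score", 0.7, "High relevance to {query}"),
    ("venue_score", 0.7, "Published in a high-authority venue profile"),
    ("citation_score", 0.7, "Strong citation signal"),
    (
        "survey_foundation_score",
        0.7,
        "Foundational method frequently referenced in the domain",
    ),
]


def explain(features: dict, query: str) -> list[str]:
    """Emit one reason per feature that clears its threshold."""
    reasons: list[str] = []
    for name, threshold, template in EXPLANATION_RULES:
        value = features.get(name)
        # Missing or unavailable features produce no reason.
        if value is not None and value >= threshold:
            reasons.append(template.format(query=query))
    return reasons
```

Because each reason maps to one feature and one threshold, a reviewer can trace any explanation back to the feature breakdown.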

4.6 Clustering Agent

Responsibilities:

Cluster the results into topic buckets for reuse by the results page and the report page
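
As a placeholder for real clustering, topic buckets can be assigned by keyword overlap. Everything here is an assumption made for illustration: the `CLUSTERS` vocabulary, the `assign_cluster` and `cluster_papers` helpers, and the "other" bucket. An embedding-based method could replace it later without changing the output shape.

```python
# Hypothetical bucket vocabularies, all lowercase.
CLUSTERS = {
    "clinical QA": {"question answering", "clinical qa"},
    "medical imaging": {"radiology", "medical imaging"},
    "EHR reasoning": {"ehr", "electronic health records"},
}


def assign_cluster(keywords: set[str]) -> str:
    """Pick the bucket with the largest keyword overlap, else 'other'."""
    lowered = {keyword.lower() for keyword in keywords}
    best, best_overlap = "other", 0
    for name, vocabulary in CLUSTERS.items():
        overlap = len(lowered & vocabulary)
        if overlap > best_overlap:
            best, best_overlap = name, overlap
    return best


def cluster_papers(papers: list[dict]) -> dict[str, list[str]]:
    """Group paper ids by bucket so both pages can reuse the result."""
    buckets: dict[str, list[str]] = {}
    for paper in papers:
        bucket = assign_cluster(set(paper["keywords"]))
        buckets.setdefault(bucket, []).append(paper["paper_id"])
    return buckets
```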

4.7 Insight Agent

Responsibilities:

Extract research lineage, method evolution, hotspots, and gaps

4.8 Report Agent

Responsibilities:

Generate the survey report from a Markdown template
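
Template-based generation can be as simple as rendering fixed sections in a fixed order. In this sketch the section names follow the report page described in 2.3; the `render_report` helper and the placeholder text for empty sections are assumptions.

```python
# Section order mirrors the report page; names are English renderings.
SECTION_ORDER = [
    "Topic Overview",
    "Representative Papers",
    "Method Routes",
    "Hotspots",
    "Gaps",
    "Recommended Reading",
]


def render_report(title: str, sections: dict[str, list[str]]) -> str:
    """Render a Markdown report with every section present, even if empty."""
    lines = [f"# {title}", ""]
    for heading in SECTION_ORDER:
        lines.append(f"## {heading}")
        items = sections.get(heading) or ["(no content yet)"]
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)


markdown = render_report(
    "GNN for Recommendation",
    {"Hotspots": ["contrastive learning"]},
)
```

Emitting every heading, including empty ones, keeps the outline stable for users who continue editing the report.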

4.9 Call Graph

flowchart LR
  A["User Query"] --> B["Query Understanding"]
  B --> C["Domain Router"]
  C --> D["Source Strategy"]
  D --> E["Retrieval"]
  E --> F["Ranking"]
  F --> G["Clustering"]
  G --> H["Insight"]
  H --> I["Report"]
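
Since the call graph is a linear chain, a first orchestrator could be a loop over named stages that pass a shared state dict along. The sketch below is a minimal illustration; `run_pipeline`, the `Stage` type, and the stub stages are hypothetical, and real stages would call the agents.

```python
from typing import Callable

# Each stage reads the accumulated state and returns new fields as a dict.
Stage = Callable[[dict], dict]


def run_pipeline(query: str, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order, merging each structured output into the state."""
    state: dict = {"query": query}
    trace: list[str] = []
    for name, stage in stages:
        output = stage(state)
        if not isinstance(output, dict):
            raise TypeError(f"{name} must return a dict, not free text")
        state = {**state, **output}
        trace.append(name)
    state["agent_trace"] = trace
    return state


# Stub stages standing in for two of the agents.
stages = [
    ("query_understanding", lambda s: {"normalized_query": s["query"].lower()}),
    ("ranking", lambda s: {"ranked": [s["normalized_query"]]}),
]
result = run_pipeline("GNN", stages)
```

The recorded `agent_trace` lines up with the `agent_trace` column on `report_tasks`.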

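4.10 Explainability Guarantees

The ranking formula is explicit and auditable
Every result returns a feature breakdown
Agents exchange structured JSON rather than opaque free-text results
On degradation, missing features and the fallback route are recorded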
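
The structured-JSON guarantee can be enforced with a small check at each agent boundary. The sketch below validates a Ranking Agent payload against the fields shown in 4.5; the `REQUIRED_RANKING_KEYS` table and the `validate_ranking_output` helper are assumptions, and a schema library could serve the same purpose.

```python
# Required keys and the Python types accepted for each.
REQUIRED_RANKING_KEYS = {
    "paper_id": (str,),
    "importance_score": (int, float),
    "explanations": (list,),
}


def validate_ranking_output(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is valid."""
    problems: list[str] = []
    for key, accepted in REQUIRED_RANKING_KEYS.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], accepted):
            found = type(payload[key]).__name__
            problems.append(f"wrong type for {key}: {found}")
    return problems


problems = validate_ranking_output(
    {"paper_id": "paper-gat-2018", "importance_score": 82.4}
)
```

Here `problems` contains a single entry reporting the missing `explanations` key, which can be logged with the fallback route.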

Part 5. Data Model / Database Design

papers

id UUID PK
external_id VARCHAR
source VARCHAR
title TEXT
abstract TEXT
year INT
doi VARCHAR NULL
venue_id UUID NULL
citation_count INT NULL
publication_type VARCHAR
url TEXT
pdf_url TEXT NULL
keywords JSONB
embedding VECTOR NULL
created_at TIMESTAMP
updated_at TIMESTAMP

authors

id UUID PK
name VARCHAR
orcid VARCHAR NULL
affiliation VARCHAR NULL
created_at TIMESTAMP

paper_authors

paper_id UUID FK
author_id UUID FK
author_order INT

venues

id UUID PK
name VARCHAR
venue_type VARCHAR
publisher VARCHAR NULL
authority_tier VARCHAR
domain_key VARCHAR
issn VARCHAR NULL
url TEXT NULL

domains

id UUID PK
key VARCHAR UNIQUE
name VARCHAR
description TEXT
color_token VARCHAR
is_active BOOLEAN

domain_source_profiles

id UUID PK
domain_key VARCHAR
source_key VARCHAR
priority INT
strategy_json JSONB
created_at TIMESTAMP

source_priority_rules

id UUID PK
domain_key VARCHAR
venue_name VARCHAR
venue_weight NUMERIC(5,2)
tier VARCHAR
notes TEXT NULL

paper_domain_tags

paper_id UUID FK
domain_key VARCHAR
tag VARCHAR
confidence NUMERIC(4,3)

search_history

id UUID PK
query TEXT
normalized_query TEXT
selected_domains JSONB
filters JSONB
result_count INT
created_at TIMESTAMP

report_tasks

id UUID PK
query TEXT
paper_ids JSONB
status VARCHAR
agent_trace JSONB
created_at TIMESTAMP
completed_at TIMESTAMP NULL

report_outputs

id UUID PK
report_task_id UUID FK
format VARCHAR
title VARCHAR
content_markdown TEXT
summary_json JSONB
created_at TIMESTAMP

ranking_features

id UUID PK
paper_id UUID FK
search_id UUID FK NULL
relevance_score NUMERIC(5,2)
venue_score NUMERIC(5,2)
citation_score NUMERIC(5,2)
recency_score NUMERIC(5,2)
survey_foundation_score NUMERIC(5,2)
cross_domain_score NUMERIC(5,2)
domain_profile_boost NUMERIC(5,2)
final_score NUMERIC(5,2)
explanation_json JSONB
created_at TIMESTAMP

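Part 6. Importance Ranking Design

6.1 Scoring Dimensions

relevance_score: how well the query matches the title, abstract, and keywords
venue_score: authority of the source
citation_score: citation strength, normalized per domain
recency_score: how recent the paper is
survey_foundation_score: whether the paper is a survey, benchmark, or foundational work
cross_domain_score: whether the paper covers several of the selected domains at once
domain_profile_boost: whether the paper hits the built-in high-priority venue/paper pool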
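
Two of these dimensions lend themselves to simple closed-form normalizers. The sketch below shows one possible choice for each: the linear 10-year decay, the log scaling, and the per-domain 95th-percentile reference (`domain_p95`) are assumptions, not values from this blueprint.

```python
import math
from typing import Optional


def recency_score(year: int, current_year: int, window: int = 10) -> float:
    """1.0 for the current year, decaying linearly to 0.0 after `window` years."""
    age = max(0, current_year - year)
    return max(0.0, 1.0 - age / window)


def citation_score(citations: Optional[int], domain_p95: int) -> Optional[float]:
    """Log-scaled citations relative to an assumed per-domain reference count."""
    if citations is None:
        return None  # unavailable, which is different from zero citations
    if domain_p95 <= 0:
        raise ValueError("domain_p95 must be positive")
    return min(1.0, math.log1p(citations) / math.log1p(domain_p95))
```

Dividing by a per-domain reference is what makes citation counts comparable across fields with different citation habits, and the log damps the effect of a few extremely cited papers.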

6.2 Suggested Weights

relevance: 0.30
venue: 0.22
citation: 0.16
recency: 0.10
survey_foundation: 0.10
cross_domain: 0.07
domain_profile_boost: 0.05

6.3 MVP Formula

importance_score =
100 * (
  0.30 * relevance_score +
  0.22 * venue_score +
  0.16 * citation_score +
  0.10 * recency_score +
  0.10 * survey_foundation_score +
  0.07 * cross_domain_score +
  0.05 * domain_profile_boost
)

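All sub-scores are first normalized to [0, 1]. Because the weights sum to 1.00, importance_score falls in [0, 100].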
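
The formula translates directly into code. The sketch below uses the weights from 6.2; the `importance_score` function name, the range check, the rounding to one decimal, and the example feature values are illustrative assumptions.

```python
WEIGHTS = {
    "relevance_score": 0.30,
    "venue_score": 0.22,
    "citation_score": 0.16,
    "recency_score": 0.10,
    "survey_foundation_score": 0.10,
    "cross_domain_score": 0.07,
    "domain_profile_boost": 0.05,
}


def importance_score(features: dict[str, float]) -> float:
    """Weighted sum of normalized features, scaled to [0, 100]."""
    for name in WEIGHTS:
        value = features[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return round(100 * total, 1)


# Made-up feature values for a single-domain query.
example = {
    "relevance_score": 0.9,
    "venue_score": 0.9,
    "citation_score": 0.8,
    "recency_score": 0.4,
    "survey_foundation_score": 1.0,
    "cross_domain_score": 0.0,
    "domain_profile_boost": 1.0,
}
score = importance_score(example)  # 78.6
```

The terms are 0.27 + 0.198 + 0.128 + 0.04 + 0.10 + 0 + 0.05 = 0.786, which scales to 78.6.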

6.4 Explainable Output

For each paper, return:

final_score
feature_breakdown
recommendation_reason
reading_level

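6.5 Balancing Authority and Relevance

relevance remains the largest single term (0.30), which keeps results on topic
venue + citation + domain_profile_boost together (0.43) outweigh relevance, which ensures the most valuable papers are read first
For cross-domain queries the cross_domain score is boosted so that single-domain results do not drown out the cross-domain topic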
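
One way to implement the cross-domain boost while keeping the two balance properties is to scale the cross_domain weight and renormalize. The blueprint does not specify the size of the boost, so the factor of 2.0 and the `weights_for` helper below are assumptions.

```python
BASE_WEIGHTS = {
    "relevance": 0.30,
    "venue": 0.22,
    "citation": 0.16,
    "recency": 0.10,
    "survey_foundation": 0.10,
    "cross_domain": 0.07,
    "domain_profile_boost": 0.05,
}


def weights_for(selected_domains: int, boost: float = 2.0) -> dict[str, float]:
    """Scale cross_domain for multi-domain queries, then renormalize to sum to 1."""
    weights = dict(BASE_WEIGHTS)
    if selected_domains > 1:
        weights["cross_domain"] *= boost
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


def is_balanced(weights: dict[str, float]) -> bool:
    """Check both properties: relevance is largest, authority terms outweigh it."""
    authority = (
        weights["venue"] + weights["citation"] + weights["domain_profile_boost"]
    )
    largest = max(weights, key=weights.get)
    return largest == "relevance" and authority > weights["relevance"]
```

With the assumed factor, a two-domain query raises cross_domain from 0.07 to about 0.13 after renormalization, and `is_balanced` still holds.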

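Part 7. Pages and Interaction Design

Home Page

Components:

hero search box
domain chips
time/source filters
example survey cards
recent reports preview

Interactions:

Entering a query jumps directly to the search page
Multiple domains can be selected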

Search Results Page

Components:

query summary
filter bar
cluster sidebar
result list
ranking explanation drawer
save to workspace

Paper Detail Page

Components:

paper header
importance score card
abstract summary
core contributions
methods
related topics
related papers

Survey Report Page

Components:

report metadata
Markdown viewer
outline navigation
copy / export

Domain Configuration Page

Components:

domain profile cards
top venue tables
source strategy preview

Workspace Page

Components:

saved papers
search history
reports
agent tasks

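Part 8. Recommended Tech Stack

Rationale

Next.js and Tailwind are well suited to building an open-source web product quickly
FastAPI is simple and direct for data APIs, documentation, and async tasks
PostgreSQL is sufficient for the MVP's metadata + FTS
Ranking starts with explicit rules; embedding recall can be layered on later
A custom workflow is lighter and avoids introducing heavy framework complexity at the MVP stage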

Part 9. API Design

`GET /domains`

Returns the built-in domains and a summary of their top-source profiles

`GET /domain-profiles`

Returns the source strategy and top venues per domain

`GET /search`

Parameters:

q
domains
year_from
year_to
sources
sort=importance
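
The parameters above could be validated before they reach the pipeline. The sketch below uses only the standard library so it stays framework-neutral; the `SearchParams` type, the `parse_search_params` helper, the comma-separated encoding for `domains` and `sources`, and the rule that `q` is required are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs


@dataclass
class SearchParams:
    q: str
    domains: list[str]
    year_from: Optional[int] = None
    year_to: Optional[int] = None
    sources: Optional[list[str]] = None
    sort: str = "importance"


def parse_search_params(query_string: str) -> SearchParams:
    """Parse and validate a raw query string such as 'q=...&domains=a,b'."""
    raw = parse_qs(query_string)

    def first(name: str, default: str = "") -> str:
        return raw.get(name, [default])[0]

    def csv(name: str) -> list[str]:
        # Assumed encoding: comma-separated values in one parameter.
        return [part for part in first(name).split(",") if part]

    def year(name: str) -> Optional[int]:
        return int(first(name)) if name in raw else None

    q = first("q").strip()
    if not q:
        raise ValueError("q is required")
    year_from, year_to = year("year_from"), year("year_to")
    if year_from is not None and year_to is not None and year_from > year_to:
        raise ValueError("year_from must not exceed year_to")
    return SearchParams(
        q=q,
        domains=csv("domains"),
        year_from=year_from,
        year_to=year_to,
        sources=csv("sources") or None,
        sort=first("sort", "importance"),
    )
```

In a FastAPI handler the same checks would more likely be expressed as typed query parameters; the standalone function just makes the validation rules explicit.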

`GET /paper/{id}`

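Returns paper details and structured analysis

`POST /analyze`

Takes a batch of paper ids and returns clusters and insights

`POST /report/generate`

Creates a report task

`GET /report/{id}`

Returns the report result

`POST /workspace/save`

Saves a paper or report to the workspace

`GET /recommendations`

Returns further reading based on a query / paper id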

Part 10. GitHub Open-Source Project Structure

apps/
  web/                 # Next.js frontend
  api/                 # FastAPI backend
packages/
  agents/              # prompt templates + pipeline config
  config/              # domain/source/venue seed data
  core/                # ranking and shared domain logic
docs/                  # blueprint, architecture, roadmap
examples/              # future API and report samples
scripts/               # seeding and maintenance scripts

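Part 11. Development Priority Roadmap

Phase 1: Minimal Runnable Version

Static domain configuration
mock search + ranking
Search page / detail page / report page
Basic FastAPI endpoints
explainable ranking

Phase 2: Enhanced Version

PostgreSQL persistence
provider adapters
Real report task queue
workspace save
cluster visualization

Phase 3: Advanced Agent Orchestration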

LangGraph orchestration
citation graph
PDF parsing
long-running research workspace
personalized ranking

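Part 12. Initial Skeleton Notes

The repository already includes:

README draft
.env.example
Example domain configuration JSON
Example API response formats
Example prompt templates
Example agent pipeline definition
Next.js page skeletons
FastAPI mock API

Notes on the Initial Built-in Sources

The first version of the built-in source configuration takes a "maintainable seed library" approach rather than claiming to provide the single authoritative ranking list. The MVP gives priority to built-in high-consensus sources, which makes it easy for later community PRs to fill in the rest.

References:

These links mainly document where the first version's built-in profiles come from; going forward, a community-maintained domain profiles registry is recommended.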
