InsightFlow is an AI-native web app for career-content search and grounded summarization. It ingests structured job-market content, retrieves evidence with hybrid search, and returns structured summaries with citations, evidence snippets, and source cards.
The project is intentionally narrow: it focuses on retrieval and summarization for job, internship, and career research, rather than being a general-purpose chat product or social platform.
- Imports CSV and JSON datasets into a local SQLite store
- Preserves source metadata across normalization, chunking, indexing, and retrieval
- Builds hybrid retrieval with SQLite keyword search and a FAISS vector index
- Produces evidence-backed structured summaries through a FastAPI backend
- Shows summary sections, source cards, and evidence snippets in a React frontend
- Supports fake providers for local development and OpenAI-compatible providers for real inference
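The metadata-preservation point above can be illustrated with a minimal sketch of chunking with parent-document traceability. The `Chunk` fields, the `chunk_document` helper, the `"<document_id>:<index>"` id format, and the window sizes are hypothetical and not taken from the InsightFlow codebase; the sketch only shows one way each chunk could carry its parent document id, source label, and character offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str     # hypothetical id format: "<document_id>:<index>"
    document_id: str  # parent document the chunk was cut from
    source: str       # source label carried over from import
    start: int        # character offset of the chunk in the parent text
    text: str


def chunk_document(document_id: str, source: str, text: str,
                   size: int = 200, overlap: int = 50) -> list[Chunk]:
    """Split text into overlapping windows, keeping parent metadata on each chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[Chunk] = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + size]
        chunks.append(Chunk(f"{document_id}:{index}", document_id, source, start, piece))
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Because every chunk stores `document_id` and `start`, a retrieved chunk can be mapped back to the exact span of its parent document when building citations.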
- Backend: FastAPI, SQLAlchemy, SQLite
- Retrieval: SQLite keyword search, FAISS vector index, deterministic merge and rerank pipeline
- Frontend: React, TypeScript, Vite
- Testing: pytest, Vitest
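The "deterministic merge and rerank" step in the retrieval stack can be sketched as follows. This README does not say which fusion method InsightFlow uses, so the sketch swaps in reciprocal rank fusion (RRF) as one common choice; the `merge_ranked` name, the `k = 60` constant, and the tie-break on chunk id are assumptions made for illustration.

```python
def merge_ranked(keyword_ids: list[str], vector_ids: list[str],
                 k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse two ranked id lists with reciprocal rank fusion.

    Each list is assumed to hold unique chunk ids, best match first.
    """
    scores: dict[str, float] = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, then by id, so equal scores always
    # come out in the same order (the "deterministic" part).
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]
```

RRF needs only ranks, not raw scores, which sidesteps the problem that keyword-search scores and vector distances are not on a comparable scale.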
insightflow/
  backend/    FastAPI app, retrieval pipeline, ingestion logic, tests, and scripts
  frontend/   React client for search and grounded summary display
  data/       Local raw and processed data directories
  docs/       Validation notes, experiments, and iteration history
  .planning/  GSD planning artifacts and project state
Implemented in the current MVP:
- Content import
- Chunking with parent-document traceability
- Hybrid retrieval
- Structured summary API
- Citation linking and evidence snippet extraction
- Source-card rendering in the UI
Explicitly out of scope for v1:
- User accounts
- Social/community features
- Personalized recommendation
- Multi-turn memory chat
- Mobile app
.\bootstrap.ps1

This script prepares the backend virtual environment and installs dependencies.

.\bootstrap.ps1 -RunApp

Core backend endpoints:
- GET /healthz
- POST /imports
- POST /retrieval/index/build
- POST /retrieval/search
- POST /summary
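A small client helper shows how the endpoints listed above might be called from Python using only the standard library. The README does not document request or response schemas, so the `query` payload field in the usage note is purely hypothetical, as is the `build_json_request` helper itself; check the backend schemas before relying on any field name.

```python
import json
from typing import Optional
from urllib.request import Request


def build_json_request(base_url: str, method: str, path: str,
                       payload: Optional[dict] = None) -> Request:
    """Build (but do not send) a JSON request for a backend endpoint."""
    url = base_url.rstrip("/") + path
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        # Encode the body as UTF-8 JSON and declare its content type.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)
```

With the backend running locally, `urllib.request.urlopen(build_json_request("http://127.0.0.1:8001", "GET", "/healthz"))` would send the health check. A search call could look like `build_json_request(base, "POST", "/retrieval/search", {"query": "data analyst internship"})`, where the payload shape is an assumption.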
Set-Location frontend
npm install
$env:VITE_API_BASE_URL = "http://127.0.0.1:8001"
npm run dev -- --host 127.0.0.1 --port 5173

Recommended local pairing:
- Backend: 127.0.0.1:8001
- Frontend: 127.0.0.1:5173
Import sample data:

& .\backend\.venv\Scripts\python.exe -m app.cli.import_batch `
  --input backend\tests\fixtures\sample_job_dataset.csv `
  --source sample_dataset `
  --source-type dataset `
  --db data\processed\insightflow.db

Build retrieval indexes:

Set-Location backend
$env:CHAT_MODEL_PROVIDER = "fake"
$env:FORCE_REBUILD = "true"
& .\.venv\Scripts\python.exe scripts\build_retrieval_index.py `
  --db sqlite:///../data/processed/insightflow.db `
  --vector-index-dir ../data/processed/vector_index

Run the backend with fake providers:
Set-Location backend
$env:CHAT_MODEL_PROVIDER = "fake"
$env:EMBEDDING_MODEL_PROVIDER = "fake"
$env:FORCE_REBUILD = "true"
& .\.venv\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8001

InsightFlow supports fake providers for local testing and OpenAI-compatible providers for real inference.

Example real-provider setup:

$env:CHAT_MODEL_PROVIDER = "openai_compatible"
$env:CHAT_API_BASE_URL = "https://api.example.com/v1"
$env:CHAT_API_KEY = "your-chat-key"
$env:CHAT_MODEL_NAME = "your-chat-model"
$env:EMBEDDING_MODEL_PROVIDER = "openai_compatible"
$env:EMBEDDING_API_BASE_URL = "https://api.example.com/v1"
$env:EMBEDDING_API_KEY = "your-embedding-key"
$env:EMBEDDING_MODEL_NAME = "your-embedding-model"

Important default paths:
- Database: data/processed/insightflow.db
- Vector index: data/processed/vector_index
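The provider environment variables shown above follow a regular `CHAT_*` / `EMBEDDING_*` pattern, which lends itself to a small settings loader. The following is a sketch under stated assumptions, not the project's actual configuration code: `ProviderSettings` and `load_provider_settings` are hypothetical names, and falling back to `fake` when no provider is set is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderSettings:
    provider: str
    base_url: Optional[str] = None
    api_key: Optional[str] = None
    model_name: Optional[str] = None


def load_provider_settings(prefix: str,
                           env: Mapping[str, str] = os.environ) -> ProviderSettings:
    """Read <prefix>_MODEL_PROVIDER and, for real providers, its companions."""
    # Assumption: an unset provider falls back to the fake provider.
    provider = env.get(f"{prefix}_MODEL_PROVIDER", "fake")
    if provider == "fake":
        return ProviderSettings(provider="fake")
    if provider != "openai_compatible":
        raise ValueError(f"unknown provider for {prefix}: {provider!r}")
    names = {
        "base_url": f"{prefix}_API_BASE_URL",
        "api_key": f"{prefix}_API_KEY",
        "model_name": f"{prefix}_MODEL_NAME",
    }
    # Fail early, naming every missing variable at once.
    missing = [name for name in names.values() if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return ProviderSettings(provider, **{field: env[name] for field, name in names.items()})
```

Calling `load_provider_settings("CHAT")` and `load_provider_settings("EMBEDDING")` at startup would surface a misconfigured real provider before the first request reaches it.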
Backend:

& .\backend\.venv\Scripts\python.exe -m pytest backend\tests -q

Frontend:

Set-Location frontend
npm test

Build frontend:

Set-Location frontend
npm run build

- Evidence first: summaries should stay within retrieved evidence boundaries
- Source transparency: each claim should be traceable to citations and snippets
- Modular backend: API, services, retrieval, db, schemas, and prompts remain clearly separated
- Narrow scope: solve career-information retrieval and synthesis well before expanding
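The "evidence first" and "source transparency" principles above can be expressed as a simple check: every claim in a summary must cite at least one chunk, and every cited chunk must be among the retrieved evidence. The data shapes below (claims as dicts with `text` and `citations`, evidence as a mapping from chunk id to snippet) and the `find_unsupported_claims` name are hypothetical, chosen for illustration rather than taken from the summary API.

```python
def find_unsupported_claims(claims: list[dict], evidence: dict[str, str]) -> list[str]:
    """Return the text of each claim that is not fully backed by retrieved evidence.

    claims:   [{"text": str, "citations": [chunk_id, ...]}, ...]
    evidence: {chunk_id: snippet_text} for the chunks actually retrieved
    """
    unsupported = []
    for claim in claims:
        cited = claim.get("citations", [])
        # A claim fails if it cites nothing, or cites a chunk that was not retrieved.
        if not cited or any(chunk_id not in evidence for chunk_id in cited):
            unsupported.append(claim["text"])
    return unsupported
```

A check like this only verifies that citations point at retrieved chunks; whether a snippet actually supports the claim's wording is a separate, harder question.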
- Project context: .planning/PROJECT.md
- Requirements: .planning/REQUIREMENTS.md
- Roadmap: .planning/ROADMAP.md
- Current state: .planning/STATE.md
- Iteration log: docs/iteration_log.md
The repository has completed the initial MVP implementation path and is currently in a verification, retrieval-quality tuning, and release-readiness phase.
This project is currently released under the MIT License.
The first public release notes are available at docs/release_notes_v0.1.0.md.