A local-first Python CLI for indexing, searching, enriching, and exporting topic-specific knowledge packets from TESC Google Drive history.
TESC has decades of institutional knowledge spread across Google Drive artifacts: Google Docs, Slides, Sheets, PDFs, Forms, shared folders, shared drives, old officer-owned files, and files shared directly with individual accounts. This project helps preserve and search that history without reorganizing, moving, renaming, or modifying the original Drive structure.
The goal is simple:
When someone asks, “What do we have from past SD Hacks work?” or “What resources exist for banquet planning?”, TESC should be able to quickly generate a useful resource packet instead of manually searching through years of scattered Drive files.
tesc-knowledge-index indexes Google Drive files accessible to one or more authenticated TESC accounts, stores searchable metadata and extracted text in a local SQLite database, and generates topic-specific exports.
Example workflows:
tesc-drive search "SD Hacks"
tesc-drive drive-search "SD Hacks" --account contact --links --limit 75
tesc-drive extract --account contact --query "SD Hacks" --limit 50
tesc-drive report "SD Hacks" --out exports/sd_hacks_report.md
tesc-drive packet "SD Hacks" --out exports/sd_hacks_packet
A generated packet can include:
- An advisor-ready Markdown summary
- A CSV of relevant files
- A Markdown list of direct Drive links
- File names, types, owners, modification dates, and source accounts
- Text-extraction status
- A timeline-style view of relevant resources
- Suggested manual review order
The app does not move, rename, or modify Google Drive files. It reads metadata, optionally exports/downloads readable content where permissions allow, and stores a local index.
TESC has a large institutional archive, but much of it is difficult to access because resources are spread across:
- Shared Google Drives
- Folders shared directly with officers
- Files owned by past board members
- Old planning documents
- Event-specific decks, forms, budgets, invoices, and postmortems
- Multiple TESC accounts and aliases
This project is designed to make that history searchable and useful for future boards, advisors, and event leads.
Instead of manually cleaning the entire Drive, this project creates a searchable index over the existing archive.
- Authenticate one or more Google accounts using OAuth
- Crawl accessible Google Drive files
- Support shared-drive-aware indexing
- Store file metadata in SQLite
- Track which account had access to each file
- Merge duplicate files by Google Drive file ID
- Search local indexed files by topic/event/process
- Run live Google Drive name/fullText search
- Save live Drive search results back into the local index
- Extract text from supported file types
- Rebuild the local SQLite FTS search index after schema changes
- Rank results using title, path, extracted text, file type, account overlap, recency, and copy penalties
- Generate Markdown/CSV knowledge packets
- Generate advisor-ready Markdown reports
- Preserve original Drive links
- Avoid modifying Drive contents
The indexer stores metadata for all visible files, but text extraction is intentionally limited to useful/readable formats.
Currently supported for text extraction:
- Google Docs
- Google Slides
- Google Sheets
- PDFs
- DOCX
- XLSX
- TXT
- CSV
Skipped or metadata-only by default:
- Google Forms
- Folders
- Images
- Videos
- ZIP files
- Photoshop/HEIF/SVG and other asset formats
- HTML files by default, because old Drive HTML exports can be huge/noisy
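The split between extractable and metadata-only files can be pictured as a lookup on the Drive MIME type. This is a minimal sketch, not the project's actual code: the MIME strings are the standard Google Drive and IANA values, while `EXTRACTABLE_MIME_TYPES`, `extraction_kind`, and the short labels are hypothetical names chosen for illustration.

```python
from typing import Optional

# Standard MIME types for the formats listed above. The labels on the right
# are illustrative only; the real extractor names are not documented here.
EXTRACTABLE_MIME_TYPES = {
    "application/vnd.google-apps.document": "google_doc",
    "application/vnd.google-apps.presentation": "google_slides",
    "application/vnd.google-apps.spreadsheet": "google_sheet",
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "text/plain": "txt",
    "text/csv": "csv",
}


def extraction_kind(mime_type: Optional[str]) -> Optional[str]:
    """Return a short extractor label, or None for metadata-only files."""
    if not mime_type:
        return None
    return EXTRACTABLE_MIME_TYPES.get(mime_type)


print(extraction_kind("application/pdf"))
print(extraction_kind("application/vnd.google-apps.form"))
```

Anything that is not in the mapping (Forms, folders, images, videos, ZIP archives, HTML) falls through to `None`, which corresponds to the metadata-only behavior described above.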
tesc-knowledge-index/
README.md
pyproject.toml
.gitignore
credentials/
.gitkeep
tokens/
.gitkeep
data/
.gitkeep
exports/
.gitkeep
src/
tesc_knowledge_index/
__init__.py
cli.py
auth.py
drive_client.py
crawler.py
database.py
search.py
extractors.py
report.py
packet.py
tests/
test_search.py
test_ranking.py
git clone https://github.com/YOUR_USERNAME/tesc-knowledge-index.git
cd tesc-knowledge-index
On Windows PowerShell:
py -3.11 -m venv .venv
.\.venv\Scripts\Activate.ps1
On macOS/Linux:
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .
If you are installing dependencies manually during development:
pip install google-api-python-client google-auth google-auth-oauthlib google-auth-httplib2
pip install typer rich pydantic python-dotenv pypdf pandas python-docx openpyxl
pip install -e .
Example dependency section:
[project]
name = "tesc-knowledge-index"
version = "0.1.0"
description = "Index, search, enrich, and export topic-specific knowledge packets from TESC Google Drive history."
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"google-api-python-client>=2.0.0",
"google-auth>=2.0.0",
"google-auth-oauthlib>=1.0.0",
"google-auth-httplib2>=0.2.0",
"typer>=0.12.0",
"rich>=13.0.0",
"pydantic>=2.0.0",
"python-dotenv>=1.0.0",
"pypdf>=4.0.0",
"pandas>=2.0.0",
"python-docx>=1.1.0",
"openpyxl>=3.1.0",
]
[project.scripts]
tesc-drive = "tesc_knowledge_index.cli:app"
[tool.setuptools.packages.find]
where = ["src"]
This project uses the Google Drive API with OAuth desktop authentication.
Create a Google Cloud project, for example:
TESC Knowledge Index
In the project, enable:
Google Drive API
Create OAuth client credentials with:
Application type: Desktop app
Name: TESC Knowledge Index Desktop Client
Download the credentials JSON file and save it as:
credentials/client_secret.json
Do not commit this file to GitHub.
Authenticate each Google account that has useful TESC Drive access.
Example:
tesc-drive auth add --account rohan
tesc-drive auth add --account contact
OAuth tokens are saved locally in:
tokens/
Do not commit token files to GitHub.
Run:
tesc-drive doctor
tesc-drive init
tesc-drive auth add --account rohan
tesc-drive auth add --account contact
Then test indexing:
tesc-drive index --account rohan --max-pages 2
tesc-drive index --account contact --max-pages 2
tesc-drive stats
If that works, run full indexing:
tesc-drive index --account rohan
tesc-drive index --account contact
tesc-drive stats
tesc-drive doctor
Checks for:
- credentials/client_secret.json
- tokens/
- data/
- exports/
- SQLite database
tesc-drive init
This creates required tables and ensures the FTS search table has the expected schema.
tesc-drive rebuild-fts
Use this after changing search schema, adding text extraction, or fixing FTS-related issues.
This rebuilds only the derived FTS table. It does not delete indexed files.
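As a rough illustration of why rebuilding is non-destructive, the sketch below drops and repopulates only the FTS table from the two source tables. It assumes the `files`, `file_text`, and `files_fts` tables described in this README and a SQLite build with FTS5 enabled; the function name `rebuild_fts` is hypothetical and the real command may work differently.

```python
import sqlite3


def rebuild_fts(conn: sqlite3.Connection) -> int:
    """Recreate files_fts from files + file_text and return its row count."""
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS files_fts")
        conn.execute(
            """
            CREATE VIRTUAL TABLE files_fts USING fts5(
                id UNINDEXED, name, mime_type, owners, path_hint,
                extracted_text, tokenize='porter'
            )
            """
        )
        # Files without extracted text still get a row, with empty text.
        conn.execute(
            """
            INSERT INTO files_fts
                (id, name, mime_type, owners, path_hint, extracted_text)
            SELECT f.id, f.name, f.mime_type, f.owners, f.path_hint,
                   COALESCE(t.extracted_text, '')
            FROM files AS f
            LEFT JOIN file_text AS t ON t.file_id = f.id
            """
        )
    return conn.execute("SELECT COUNT(*) FROM files_fts").fetchone()[0]
```

Only `files_fts` is dropped; the `files` and `file_text` rows are read, never written.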
tesc-drive stats
Shows:
- Total unique files indexed
- Number of files with extraction records
- Number of files with successful text extraction
- Files by source account
- Top MIME types
Index one account:
tesc-drive index --account rohan
Index another account:
tesc-drive index --account contact
Test with fewer pages:
tesc-drive index --account contact --max-pages 2
The app merges duplicate files by Google Drive file ID and records which account had access.
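One way to picture the merge is an upsert keyed on the Drive file ID that unions the account names. This sketch assumes `source_accounts` is stored as a JSON list in a TEXT column, which this README does not confirm; `upsert_file` is a hypothetical helper, and only the columns needed for the example are shown.

```python
import json
import sqlite3
from typing import List


def upsert_file(conn: sqlite3.Connection, file_id: str, name: str, account: str) -> List[str]:
    """Insert or update one file row, adding `account` to its source accounts."""
    row = conn.execute(
        "SELECT source_accounts FROM files WHERE id = ?", (file_id,)
    ).fetchone()
    # Assumption: accounts are serialized as a JSON list of strings.
    accounts = set(json.loads(row[0])) if row and row[0] else set()
    accounts.add(account)
    encoded = json.dumps(sorted(accounts))
    with conn:
        conn.execute(
            """
            INSERT INTO files (id, name, source_accounts) VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                name = excluded.name,
                source_accounts = excluded.source_accounts
            """,
            (file_id, name, encoded),
        )
    return sorted(accounts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id TEXT PRIMARY KEY, name TEXT NOT NULL, source_accounts TEXT)")
upsert_file(conn, "file-1", "SD Hacks Master Document", "rohan")
print(upsert_file(conn, "file-1", "SD Hacks Master Document", "contact"))
```

Indexing the same file from two accounts leaves a single row whose account list contains both names.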
Search local metadata and extracted text:
tesc-drive search "SD Hacks"
tesc-drive search "SDHacks"
tesc-drive search "sponsorship"
tesc-drive search "banquet"
tesc-drive search "Decaf"
Show full copy-pasteable links:
tesc-drive search "SD Hacks" --links --limit 25
Show extracted text previews:
tesc-drive search "SD Hacks sponsorship" --links --preview --limit 25
Search results include:
- Score
- File name
- MIME type
- Modified time
- Source account(s)
- Text extraction status
- Drive link
Local search only knows what is already indexed and extracted. Live Drive search asks Google Drive directly using name and fullText.
tesc-drive drive-search "SD Hacks" --account rohan --links --limit 75
tesc-drive drive-search "SD Hacks" --account contact --links --limit 75
Important behavior:
Live Drive search results are saved back into the local database.
This means live search acts as a discovery/enrichment tool. After running live Drive search, local search and packets become stronger.
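For readers unfamiliar with Drive's query language, a live search on name and fullText might be assembled roughly as follows. The `q` operators and the `supportsAllDrives`, `includeItemsFromAllDrives`, `corpora`, `pageSize`, and `fields` parameters belong to the Drive API v3 `files.list` method; the helper names `build_drive_query` and `list_kwargs`, and the exact field selection, are assumptions for illustration.

```python
from typing import Any, Dict


def build_drive_query(term: str) -> str:
    """Build a Drive `q` string matching the term in file names or content."""
    # Drive query strings require backslashes and single quotes to be escaped.
    escaped = term.replace("\\", "\\\\").replace("'", "\\'")
    return f"name contains '{escaped}' or fullText contains '{escaped}'"


def list_kwargs(term: str, page_size: int = 75) -> Dict[str, Any]:
    """Keyword arguments for a shared-drive-aware files.list call."""
    return {
        "q": build_drive_query(term),
        "pageSize": page_size,
        "fields": "nextPageToken, files(id, name, mimeType, webViewLink, modifiedTime)",
        "supportsAllDrives": True,
        "includeItemsFromAllDrives": True,
        "corpora": "allDrives",
    }


print(build_drive_query("SD Hacks"))
```

With an authenticated `service` object from `googleapiclient.discovery.build("drive", "v3", ...)`, the call would look like `service.files().list(**list_kwargs("SD Hacks")).execute()`, and each returned file could then be written back to SQLite.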
Recommended pattern:
tesc-drive drive-search "SD Hacks" --account rohan --links --limit 75
tesc-drive drive-search "SD Hacks" --account contact --links --limit 75
tesc-drive search "SD Hacks" --limit 75
Extract text for top local search results for a topic:
tesc-drive extract --account contact --query "SD Hacks" --limit 50
This is the recommended extraction workflow.
Avoid running global extraction unless you intentionally want to process arbitrary pending files:
tesc-drive extract --account contact --limit 100
If no --query is provided, the CLI warns you that it will extract arbitrary pending files from the whole database.
This command:
tesc-drive extract --account contact --query "SD Hacks" --limit 50
means:
Extract text from the top 50 local results related to SD Hacks.
This command:
tesc-drive extract --account contact --limit 50
means:
Extract the next 50 pending extractable files in the whole database.
Topic-scoped extraction is safer, faster, and more relevant.
Local search uses a weighted score based on:
- Exact title match
- Partial title match
- Folder/path match
- Extracted text match
- Important file type bonus
- Account overlap bonus
- Recency bonus
- Copy/duplicate penalty
Important file types receive a boost:
- Google Docs
- Google Slides
- Google Sheets
- Google Forms
- PDFs
- DOCX/PPTX/XLSX
- TXT/CSV
This helps files like:
SD Hacks Master Document
SD Hacks Operations
SD Hacks Event Planning Guide
Sponsorship Timeline - SD Hacks 2019
SD Hacks 23 Day-of Logistics Company Packet
rank above weaker or less relevant mentions.
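To make the idea of a weighted score concrete, here is a toy scoring function that uses the same signals. Every weight, the `Candidate` fields, and the "Copy of" prefix check are invented for illustration; the project's actual weights and rules are not documented in this README.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    path_hint: str = ""
    extracted_text: str = ""
    is_important_type: bool = False
    account_count: int = 1
    modified_year: int = 2000


def score(query: str, c: Candidate, current_year: int) -> float:
    """Combine the ranking signals into one number (weights are hypothetical)."""
    q = query.lower()
    name = c.name.lower()
    total = 0.0
    if name == q:
        total += 10.0          # exact title match
    elif q in name:
        total += 6.0           # partial title match
    if q in c.path_hint.lower():
        total += 3.0           # folder/path match
    if q in c.extracted_text.lower():
        total += 2.0           # extracted text match
    if c.is_important_type:
        total += 1.0           # important file type bonus
    if c.account_count > 1:
        total += 1.0           # visible to more than one account
    # Recency bonus: fades linearly to zero over ten years.
    total += max(0.0, 1.0 - (current_year - c.modified_year) / 10)
    if name.startswith("copy of "):
        total -= 4.0           # copy/duplicate penalty
    return total


master = Candidate("SD Hacks Master Document", is_important_type=True, modified_year=2023)
copy = Candidate("Copy of SD Hacks Master Document", is_important_type=True, modified_year=2023)
mention = Candidate("Meeting notes", extracted_text="We discussed SD Hacks briefly.")
ranked = sorted([mention, copy, master], key=lambda c: score("SD Hacks", c, 2025), reverse=True)
print([c.name for c in ranked])
```

Under these made-up weights the master document ranks first, its copy second, and the passing mention last, which is the ordering the bonuses and penalties are meant to produce.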
Print simple copy-pasteable links:
tesc-drive links "SD Hacks" --limit 25
Export links to Markdown:
tesc-drive export-links "SD Hacks" --out exports/sd_hacks_links.md --limit 75
The Markdown export includes:
- File name
- Score
- Link
- Type
- Modified time
- Source account(s)
- Owners
- Text extraction status
Generate an advisor-ready Markdown report:
tesc-drive report "SD Hacks" --out exports/sd_hacks_report.md --limit 75
The report includes:
- Executive summary
- What TESC appears to have done
- Most relevant resources
- Resource categories
- Timeline by modified year
- Recommended manual review order
- Notes and limitations
The report is meant to be a strong starting point, not a final official historical statement. Review top files manually before sending it to advisors, university staff, sponsors, or external partners.
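The "Timeline by modified year" section can be thought of as a simple grouping over the `modified_time` values, which Drive returns as RFC 3339 timestamps. The sketch below is one possible way to build it, assuming each result is a dict with `name` and `modified_time` keys; `timeline_by_year` is a hypothetical helper, not the report module's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Optional


def timeline_by_year(files: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Group file names by the year prefix of their modified_time string."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for f in files:
        modified = f.get("modified_time") or ""
        # RFC 3339 timestamps start with a four-digit year, e.g. 2019-10-03T...
        year = modified[:4] if len(modified) >= 4 and modified[:4].isdigit() else "unknown"
        buckets[year].append(f["name"] or "")
    return {year: sorted(names) for year, names in sorted(buckets.items())}


results = [
    {"name": "SD Hacks 23 Day-of Logistics Company Packet", "modified_time": "2023-02-11T04:12:00.000Z"},
    {"name": "Sponsorship Timeline - SD Hacks 2019", "modified_time": "2019-10-03T18:22:11.000Z"},
    {"name": "SD Hacks Operations", "modified_time": None},
]
for year, names in timeline_by_year(results).items():
    print(year, names)
```

Note that the modified year reflects the last edit, not necessarily when the event happened, which is one reason the report needs manual review.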
Create a full packet for a topic:
tesc-drive packet "SD Hacks" --out exports/sd_hacks_packet --limit 75
Example output:
exports/sd_hacks_packet/
README.md
files.csv
links.md
summary.md
README.md
Overview of the generated packet and recommended next step.
files.csv
Spreadsheet-friendly list of relevant files.
links.md
Markdown document with clickable Drive links and metadata.
summary.md
Advisor-ready summary generated from the top local search results.
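As an illustration of how the two tabular outputs relate, the sketch below writes `files.csv` and `links.md` from the same list of result rows. The column selection, the Markdown layout, and the `write_packet` name are assumptions; the real packet contains more metadata and also the README and summary files.

```python
import csv
from pathlib import Path
from typing import Dict, List


def write_packet(out_dir: str, rows: List[Dict[str, str]]) -> List[str]:
    """Write files.csv and links.md into out_dir; return the file names created."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hypothetical column subset; extra keys in a row are ignored.
    fields = ["name", "mime_type", "modified_time", "web_view_link"]
    with (out / "files.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

    # One Markdown bullet per file, linking to the original Drive location.
    lines = ["# Links", ""]
    lines += [f"- [{r['name']}]({r['web_view_link']})" for r in rows]
    (out / "links.md").write_text("\n".join(lines) + "\n", encoding="utf-8")

    return sorted(p.name for p in out.iterdir())
```

Both files point back to the original Drive links, so nothing in the packet replaces the source documents.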
For one topic:
tesc-drive drive-search "SD Hacks" --account rohan --links --limit 75
tesc-drive drive-search "SD Hacks" --account contact --links --limit 75
tesc-drive extract --account contact --query "SD Hacks" --limit 50
tesc-drive search "SD Hacks" --limit 75 --links --preview
tesc-drive report "SD Hacks" --out exports/sd_hacks_report.md --limit 75
tesc-drive packet "SD Hacks" --out exports/sd_hacks_packet --limit 75
Create search-topic.ps1:
param(
[Parameter(Mandatory=$true)]
[string]$Topic,
[int]$Limit = 75,
[int]$ExtractLimit = 50
)
$safeName = $Topic.ToLower().Replace(" ", "_").Replace("/", "_").Replace("\", "_")
Write-Host ""
Write-Host "=== Local search before live enrichment: $Topic ==="
tesc-drive search "$Topic" --limit $Limit
Write-Host ""
Write-Host "=== Live Drive search via rohan: $Topic ==="
tesc-drive drive-search "$Topic" --account rohan --links --limit $Limit
Write-Host ""
Write-Host "=== Live Drive search via contact: $Topic ==="
tesc-drive drive-search "$Topic" --account contact --links --limit $Limit
Write-Host ""
Write-Host "=== Extracting text for topic files via contact ==="
tesc-drive extract --account contact --query "$Topic" --limit $ExtractLimit
Write-Host ""
Write-Host "=== Local search after live enrichment + extraction: $Topic ==="
tesc-drive search "$Topic" --limit $Limit
Write-Host ""
Write-Host "=== Exporting links ==="
tesc-drive export-links "$Topic" --out "exports/${safeName}_links.md" --limit $Limit
Write-Host ""
Write-Host "=== Creating advisor report ==="
tesc-drive report "$Topic" --out "exports/${safeName}_report.md" --limit $Limit
Write-Host ""
Write-Host "=== Creating packet ==="
tesc-drive packet "$Topic" --out "exports/${safeName}_packet" --limit $Limit
Usage:
.\search-topic.ps1 "SD Hacks" -Limit 75 -ExtractLimit 50
.\search-topic.ps1 "Decaf" -Limit 75 -ExtractLimit 50
.\search-topic.ps1 "sponsorship" -Limit 100 -ExtractLimit 50
If an advisor asks:
What resources does TESC have from past SD Hacks involvement?
Run:
tesc-drive drive-search "SD Hacks" --account rohan --links --limit 75
tesc-drive drive-search "SD Hacks" --account contact --links --limit 75
tesc-drive extract --account contact --query "SD Hacks" --limit 50
tesc-drive packet "SD Hacks" --out exports/sd_hacks_packet --limit 75
Then review:
exports/sd_hacks_packet/summary.md
exports/sd_hacks_packet/links.md
exports/sd_hacks_packet/files.csv
This provides a structured starting point instead of a raw Google Drive search dump.
Examples of useful search topics:
SD Hacks
SDHacks
hackathon
E Week
E-Week
DECaF
Decaf
banquet
sponsorship
sponsor packet
budget
GBM
General Body Meeting
retreat
ESC Night
project teams
transition docs
constitution
funding
marketing
outreach
alumni
volunteer form
judging
applications
operations
day-of logistics
Search broad first, then narrow:
tesc-drive search "hackathon"
tesc-drive search "SD Hacks sponsorship"
tesc-drive search "SD Hacks budget"
tesc-drive search "SD Hacks logistics"
tesc-drive search "SD Hacks judging"
This project can be public on GitHub, but the data it indexes may be private.
Never commit:
- OAuth tokens
- Google API client secret files
- Downloaded TESC files
- Exported packets containing private links or content
- Local SQLite databases
- .env files
- Cached Drive content
Recommended .gitignore entries:
.venv/
__pycache__/
*.pyc
.env
credentials/*.json
tokens/*.json
data/
exports/
cache/
*.sqlite
*.db
.DS_Store
Thumbs.db
To keep empty folders in Git, use .gitkeep files:
credentials/.gitkeep
tokens/.gitkeep
data/.gitkeep
exports/.gitkeep
Before pushing to GitHub, run:
git status
Make sure no private data, tokens, credentials, database files, or exports are staged.
The app does not try to clean or move TESC Drive files. Historical Drive structures can be messy for good reasons: ownership, sharing, permissions, and context. Indexing is safer than reorganizing.
The app keeps the original Google Drive links so files remain connected to their actual source of truth.
The first version uses SQLite and local exports. This keeps the system simple, transparent, and easy to transfer to future officers.
Google Drive live search can discover files through fullText even when local extraction has not happened yet. The app saves live results back into SQLite so future local searches and packets improve.
The app helps gather and organize resources, but final summaries should be reviewed by a human before being sent to advisors, university staff, sponsors, or external partners.
The pipeline is:
Google Drive OAuth
↓
Drive API file crawl
↓
SQLite metadata index
↓
Live Drive search enrichment
↓
Topic-scoped text extraction
↓
SQLite FTS search
↓
Ranking
↓
Advisor report / knowledge packet export
The app uses SQLite.
Main metadata table:
CREATE TABLE files (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
mime_type TEXT,
web_view_link TEXT,
created_time TEXT,
modified_time TEXT,
owners TEXT,
parents TEXT,
drive_id TEXT,
source_accounts TEXT,
path_hint TEXT,
can_download INTEGER,
indexed_at TEXT
);
Extracted text table:
CREATE TABLE file_text (
file_id TEXT PRIMARY KEY,
extracted_text TEXT,
extraction_status TEXT,
extracted_at TEXT,
extractor TEXT,
error_message TEXT,
FOREIGN KEY(file_id) REFERENCES files(id)
);
FTS search table:
CREATE VIRTUAL TABLE files_fts
USING fts5(
id UNINDEXED,
name,
mime_type,
owners,
path_hint,
extracted_text,
tokenize='porter'
);
If the FTS schema changes, run:
tesc-drive rebuild-fts
Make sure your virtual environment is activated and the project is installed:
pip install -e .
Download OAuth desktop credentials from Google Cloud Console and save the file as:
credentials/client_secret.json
Run live Drive search to enrich the local database:
tesc-drive drive-search "SD Hacks" --account contact --links --limit 75
tesc-drive search "SD Hacks" --limit 75
Your SQLite database has the old FTS schema. Run:
tesc-drive init
tesc-drive rebuild-fts
Do not delete data/index.sqlite unless you intentionally want to recrawl everything.
Use topic-scoped extraction and a small limit first:
tesc-drive extract --account contact --query "SD Hacks" --limit 10
Then increase:
tesc-drive extract --account contact --query "SD Hacks" --limit 50
Confirm the crawler uses shared-drive-aware options:
supportsAllDrives=True
includeItemsFromAllDrives=True
If credentials, tokens, or private data were committed, immediately remove them from Git history and rotate or revoke the affected credentials or OAuth tokens.
Install in editable mode:
pip install -e .
Run a small crawl:
tesc-drive index --account rohan --max-pages 1
Rebuild local search index:
tesc-drive rebuild-fts
Run a search:
tesc-drive search "SD Hacks"
Run live Drive search:
tesc-drive drive-search "SD Hacks" --account contact --links --limit 75
Extract topic text:
tesc-drive extract --account contact --query "SD Hacks" --limit 25
Generate report:
tesc-drive report "SD Hacks" --out exports/sd_hacks_report.md
Generate packet:
tesc-drive packet "SD Hacks" --out exports/sd_hacks_packet
- OAuth authentication
- Drive metadata crawl
- Shared-drive-aware indexing
- SQLite database
- Account-source tracking
- Packet export
- Local search
- Live Google Drive search
- Live search enrichment into local DB
- Title exact-match boost
- Path/folder boost
- Extracted-text boost
- File type boost
- Account overlap boost
- Recency boost
- Copy penalty
- Better duplicate grouping
- Folder path reconstruction beyond parent IDs
- Export Google Docs to text
- Export Google Slides to text
- Export Google Sheets to XLSX/text
- Parse PDFs with pypdf
- Parse DOCX
- Parse XLSX
- Store extracted text in SQLite
- Topic-scoped extraction
- OCR for important images/scans
- Better handling for very large PDFs/spreadsheets
- Advisor-ready Markdown reports
- Resource categories
- Year-by-year modified timeline
- Recommended manual review order
- Notes and limitations
- Better natural-language summaries using reviewed top files
- Missing/permission-blocked file notes
- People/contact extraction
- Streamlit prototype
- Search page
- Packet generation page
- File preview metadata
- Export controls
- Simple advisor-facing read-only view
- Embedding-based search
- Local vector database
- Hybrid keyword + semantic ranking
- Question-answering over selected packets
When contributing:
- Do not commit private TESC data.
- Do not commit OAuth credentials or tokens.
- Keep the CLI usable before adding a web app.
- Prefer small, reviewable features.
- Keep generated packets out of Git.
- Document any new command in this README.
- Treat advisor-ready reports as draft summaries requiring human review.
Choose a license before making the repository broadly public. For an internal student-organization tool, MIT is usually simple and permissive.
Suggested:
MIT License
Working local-first CLI.
The app can authenticate multiple TESC accounts, index Drive metadata, run live Drive searches, enrich the local database, extract text for topic-specific files, rank results, and generate advisor-ready Markdown reports and knowledge packets.
The next major improvements are duplicate grouping, better folder-path reconstruction, OCR for scanned/image-heavy files, and a lightweight web interface.