Analysis of every YC batch ever. Read the initial blog post here.
Y Combinator is one of the largest startup accelerators in the world, with one of the highest concentrations of technical founders. Companies like Airbnb, Docker, Instacart, and Coinbase all came through the accelerator, but they represent only the top percentile.
YC Vault is my attempt to make sense of the entire Y Combinator directory.
Any language model of your choice is supported through LiteLLM. High-performing models like GPT-4o-mini are recommended for their data-extraction accuracy.
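As a rough sketch of what a LiteLLM-backed extraction call looks like (the model name, prompt, and function name here are illustrative, not the repo's actual code):

```python
def extract_company(page_text: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to pull structured fields out of a scraped YC page."""
    # Lazy import so the sketch only needs `litellm` when actually called.
    from litellm import completion

    response = completion(
        model=model,  # any LiteLLM-supported model id works here
        messages=[{
            "role": "user",
            "content": "Extract the company name, batch, and one-line "
                       "description from this YC page:\n" + page_text,
        }],
    )
    return response.choices[0].message.content
```

Swapping models is a one-string change, which is the main reason to route calls through LiteLLM rather than a provider-specific SDK.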
```shell
git clone https://github.com/lukafilipxvic/YC-Vault.git
uv sync
```
- Set up environment:
  - Create a `.env` file using the `.env.example` file as a template
  - Example `.env` file:

    ```
    [llm]
    OPENAI_API_KEY=your_api_key_here

    [data]
    DATA_DIR=./data
    ```

- Configure your data sources:
  - Update the `YC_Batches.csv` file with all batch IDs
  - This file will need updating as new batches are launched
- Run the pipeline:

  ```shell
  uv run python scraper/run_pipeline.py
  ```
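The example `.env` above uses INI-style sections (`[llm]`, `[data]`), so it can be parsed with Python's built-in `configparser`. A minimal sketch, using the placeholder values from the example (the repo's actual config loading may differ):

```python
import configparser

# The INI-style .env content shown above, with placeholder values.
ENV_EXAMPLE = """
[llm]
OPENAI_API_KEY=your_api_key_here

[data]
DATA_DIR=./data
"""

config = configparser.ConfigParser()
config.read_string(ENV_EXAMPLE)

api_key = config["llm"]["OPENAI_API_KEY"]
data_dir = config["data"]["DATA_DIR"]
print(data_dir)  # → ./data
```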
- `get_yc_urls.py`: ~2.5 minutes to scrape all YC URLs
- `get_yc_data.py`: ~2.52 seconds per company (approximately 4.2 hours to scrape 6,000 YC companies synchronously)
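The ~4.2-hour figure follows directly from the per-company timing:

```python
# Back-of-envelope check of the synchronous runtime estimate above.
seconds_per_company = 2.52
companies = 6_000

total_hours = seconds_per_company * companies / 3600
print(f"{total_hours:.1f} hours")  # → 4.2 hours
```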
- Using GPT-4.1-nano, it costs ~$0.0002 to extract one YC company page.
- Total cost for 6,000 YC companies = ~$1.23
- For comparison, Gumloop costs ~$48.50 for the same data (39.43x more expensive).
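The price comparison works out the same way, using the two estimates above:

```python
# Cost figures taken from the estimates above.
yc_vault_cost = 1.23  # ~$ for 6,000 companies via GPT-4.1-nano
gumloop_cost = 48.5   # ~$ for the same data on Gumloop

ratio = gumloop_cost / yc_vault_cost
print(f"{ratio:.2f}x more expensive")  # → 39.43x more expensive
```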
The scraping pipeline generates 3 CSV files:
- `YC_Companies.csv`: Company profiles and metrics
- `YC_Founders.csv`: Founder information and backgrounds
- `YC_URLs.csv`: Source URLs for all scraped data
Contributions are welcome! Please feel free to submit a Pull Request.
Licensed under AGPL-3.0