A fast, async Python tool to find broken links in your Substack newsletter archive.
Checking broken links in your newsletter archive shouldn't cost $100+/month for tools like Semrush or Ahrefs. This free, open-source tool:
- Works with Substack's bot protection - Uses your session cookie to authenticate as a logged-in user
- Handles large archives efficiently - Async concurrent checking is 10-20x faster than sequential
- Tracks what you've already checked - Incremental scanning means you only check new posts
- Fast: Async concurrent checking (10-20x faster than sequential)
- Smart caching: Same link across multiple posts? Checked once
- Retry logic: Exponential backoff for transient failures
- Incremental scanning: Track checked posts, only scan new ones
- Domain filtering: Skip bot-blocking sites (Wikipedia), auto-flag known broken domains
- Multiple error types: HTTP 404, soft 404s, SSL errors, DNS failures, timeouts
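The caching and retry features above can be illustrated with a short sketch. This is not the tool's actual code: `backoff_delays`, `check_with_retry`, the 0.5-second base delay, and the injected `fetch` coroutine are all hypothetical names and values chosen for illustration.

```python
import asyncio


def backoff_delays(max_retries, base=0.5, factor=2.0):
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor ** attempt for attempt in range(max_retries)]


async def check_with_retry(fetch, url, cache, max_retries=3, base=0.5,
                           sleep=asyncio.sleep):
    """Check one URL, reusing a cached result and retrying transient errors.

    `fetch` is any coroutine function taking a URL (an assumption; the
    real tool's HTTP layer is not shown in this README).
    """
    if url in cache:
        # Same link seen in an earlier post: no second request.
        return cache[url]
    delays = backoff_delays(max_retries, base)
    for attempt in range(max_retries + 1):
        try:
            result = await fetch(url)
            break
        except (ConnectionError, asyncio.TimeoutError) as exc:
            if attempt == max_retries:
                # Out of retries: record the failure instead of raising.
                result = f"ERROR: {type(exc).__name__}"
                break
            await sleep(delays[attempt])
    cache[url] = result
    return result
```

With `max_retries=3` and the assumed base of 0.5 seconds, the waits between attempts are 0.5, 1 and 2 seconds. Passing `sleep` as a parameter keeps the sketch easy to test without real delays.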
# Install (provides the `substack-link-checker` CLI)
pip install git+https://github.com/jcddc83/substack-broken-link-checker.git
# Check all posts from 2024
substack-link-checker check --base-url https://YOUR.substack.com --year 2024
# Check posts from a file
substack-link-checker check --base-url https://YOUR.substack.com --url-file posts.txt

git clone https://github.com/jcddc83/substack-broken-link-checker.git
cd substack-broken-link-checker
pip install -e .

Or directly from GitHub:

pip install git+https://github.com/jcddc83/substack-broken-link-checker.git

This installs the `substack-link-checker` console command. Equivalent invocations:

substack-link-checker check ...
python -m substack_link_checker check ...
Requirements: Python 3.8+
v1.0.0 shipped flat scripts at the repo root. They have been
reorganised into a substack_link_checker package with a subcommand CLI.
| Old (v1.0.0) | New |
|---|---|
| `python substack_link_checker.py ...` | `substack-link-checker check ...` |
| `python compare_posts.py ...` | `substack-link-checker compare ...` |
| `python import_checked_posts.py ...` | `substack-link-checker import ...` |
| `python fetch_archive_urls.py ...` | `substack-link-checker fetch-archive ...` |
| `python demo_link_checker.py` | `substack-link-checker demo` |
The four helper scripts (compare_posts.py, import_checked_posts.py,
fetch_archive_urls.py, demo_link_checker.py) are kept at the root as
thin back-compat shims, so existing python compare_posts.py ...
invocations and the bundled PowerShell scheduled task continue to work.
The main substack_link_checker.py script could not be kept as a shim
because its name collides with the new package — use
substack-link-checker check ... or python -m substack_link_checker check ...
instead.
If Substack blocks your requests or you need to check paywalled content, use your session cookie:
- Log into your Substack in a browser
- Open Developer Tools (F12) → Application → Cookies
- Find the `substack.sid` cookie and copy its value
- Provide it via the `SUBSTACK_COOKIE` environment variable (recommended) or the `--cookie` flag:
# Recommended: env var (keeps cookie out of shell history / ps aux)
export SUBSTACK_COOKIE="your-substack-sid-cookie-value"
substack-link-checker check --base-url https://YOUR.substack.com --year 2024
# Alternative: --cookie flag (visible in process listings)
substack-link-checker check --base-url https://YOUR.substack.com --year 2024 \
  --cookie "your-substack-sid-cookie-value"

Security: Treat the session cookie like a password. Prefer the env var
so it does not end up in your shell history or in ps aux. See
SECURITY.md for full guidance.
Note: Your session cookie expires after a few weeks. If you start getting 403 errors, get a fresh cookie from your browser.
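As a rough illustration of how a cookie value might become a request header, here is a minimal sketch. The `cookie_header` helper is hypothetical, and letting the flag value take precedence over the environment variable is an assumption, not documented behaviour of the tool.

```python
import os


def cookie_header(cli_value=None, environ=os.environ):
    """Build a Cookie header from --cookie or SUBSTACK_COOKIE.

    Assumption: an explicit flag value wins over the environment variable.
    Returns an empty dict when no cookie is configured.
    """
    value = cli_value or environ.get("SUBSTACK_COOKIE")
    if not value:
        return {}
    # substack.sid is the cookie name given in the steps above.
    return {"Cookie": f"substack.sid={value}"}
```

The resulting dict can be merged into the headers of whatever HTTP client is in use; when it is empty, requests go out unauthenticated.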
Common failure modes and how to fix them:
Substack's bot protection is rejecting unauthenticated requests. In order of likelihood:
- Set `SUBSTACK_COOKIE` (see Authentication above) so you're requesting as a logged-in user.
- If you had a cookie set: it has probably expired (Substack rotates session cookies every few weeks). Grab a fresh one from DevTools.
- If both are current: lower `--concurrency` (try `--concurrency 3`) so you look less bot-like.
The year-specific sitemap (e.g. /sitemap-2024.xml) doesn't exist for
your Substack — some accounts only expose a single combined sitemap.
Fall back to scraping the archive page:
substack-link-checker fetch-archive https://YOUR.substack.com 2024
# Produces archive_urls_2024.txt
substack-link-checker check --base-url https://YOUR.substack.com \
  --url-file archive_urls_2024.txt

The target site is rate-limiting or geo-blocking the checker, not
actually broken. Add it to --skip-domains so it's assumed OK:
substack-link-checker check ... --skip-domains rate-limited.example.com

For a recurring list, put one domain per line in a file and pass
--skip-domains-file path/to/file.txt.
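A one-domain-per-line file is simple to parse. The sketch below shows one plausible way to load such a list and match URLs against it. The helper names are hypothetical, and two behaviours are assumptions rather than documented features: lines starting with `#` are ignored, and a listed domain also covers its subdomains.

```python
from urllib.parse import urlsplit


def parse_domains(text):
    """Parse one-domain-per-line text into a set of lowercase domains."""
    domains = set()
    for line in text.splitlines():
        line = line.strip().lower()
        # Assumption: blank lines and '#' comment lines are skipped.
        if line and not line.startswith("#"):
            domains.add(line)
    return domains


def is_listed(url, domains):
    """True if the URL's host equals a listed domain or is a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


skip = parse_domains("wikipedia.org\nrate-limited.example.com\n")
print(is_listed("https://en.wikipedia.org/wiki/Python", skip))  # True
print(is_listed("https://notwikipedia.org/", skip))             # False
```

Matching on a leading dot (`.wikipedia.org`) rather than a bare suffix keeps unrelated hosts such as `notwikipedia.org` from being skipped by accident.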
The target host is using an old TLS version Python's ssl module no
longer accepts by default. Usually the right call is to flag the
domain as broken (it really is unreachable from a modern client):
substack-link-checker check ... --broken-domains old-tls.example.com

The detector matches phrases like "page not found" in the page <title>.
If a legitimate post happens to have one of those phrases in its title,
it'll be misflagged. Open the report, eyeball the URL, and if it's
genuinely live, ignore those rows.
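To see why such false positives happen, here is a minimal sketch of title-based soft-404 detection. It is not the tool's detector: the regex, the function name, and every phrase other than "page not found" are assumptions made for illustration.

```python
import re

# Only "page not found" comes from this README; the second phrase is hypothetical.
SOFT_404_PHRASES = ("page not found", "404 not found")

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_soft_404(html):
    """True if the page <title> contains one of the error phrases."""
    match = TITLE_RE.search(html)
    if not match:
        return False
    # Collapse whitespace and lowercase before the substring match.
    title = " ".join(match.group(1).split()).lower()
    return any(phrase in title for phrase in SOFT_404_PHRASES)


print(looks_like_soft_404("<title>Page Not Found | Example</title>"))       # True
print(looks_like_soft_404("<title>Why 'Page Not Found' Hurts SEO</title>"))  # True (false positive)
print(looks_like_soft_404("<title>My First Post</title>"))                  # False
```

The second call shows the limitation described above: a live article whose title merely mentions the phrase is indistinguishable from a real error page under a plain substring match.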
Either no broken links were found (look for "No broken links found!" in the summary) or the run was interrupted before report generation. The tool writes the CSV only after every post has been processed successfully.
Make sure --history-file points at the same JSON file you used on
the previous run. The history file is the source of truth for which
posts have already been checked; without it --only-new has nothing
to compare against.
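The effect of a missing or mismatched history file can be seen in a small sketch. The history layout assumed here, a flat JSON array of post URLs, is a guess for illustration only; the real file format is not described in this README, and the function names are hypothetical.

```python
import json
from pathlib import Path


def parse_history(text):
    """Parse history JSON (assumed: a flat list of post URLs) into a set."""
    return set(json.loads(text)) if text.strip() else set()


def load_checked(history_path):
    """Load checked URLs; a missing file means nothing has been checked yet."""
    path = Path(history_path)
    if not path.exists():
        return set()
    return parse_history(path.read_text(encoding="utf-8"))


def only_new(post_urls, checked):
    """Keep posts that are not in the history, preserving input order."""
    return [url for url in post_urls if url not in checked]


posts = [
    "https://example.substack.com/p/my-first-post",
    "https://example.substack.com/p/another-post",
]
checked = parse_history('["https://example.substack.com/p/my-first-post"]')
print(only_new(posts, checked))   # only another-post remains
print(only_new(posts, set()))     # empty history: every post is "new"
```

Pointing at the wrong path behaves like the empty-history case: the filter has nothing to exclude, so every post is scanned again.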
# Check posts from a specific year (uses sitemap)
substack-link-checker check --base-url https://example.substack.com --year 2024
# Check posts from a URL file
substack-link-checker check --base-url https://example.substack.com --url-file posts.txt
# Verbose output with custom report name
substack-link-checker check --base-url https://example.substack.com --year 2024 \
  --verbose --output december_report.csv

Track which posts you've already checked to avoid re-scanning:
# First run: checks all posts, saves history
substack-link-checker check --base-url https://example.substack.com --year 2024 \
--history-file checked_posts.json
# Subsequent runs: only check new posts
substack-link-checker check --base-url https://example.substack.com --year 2024 \
  --history-file checked_posts.json --only-new

# Skip domains that block bots (assumed OK)
substack-link-checker check ... --skip-domains wikipedia.org
# Auto-flag domains as broken without checking
substack-link-checker check ... --broken-domains old.defunct-site.com

# Compare your sitemap against history to find unchecked posts
substack-link-checker compare https://example.substack.com checked_posts.json
# Outputs: unchecked_posts.txt
# Then check just those posts
substack-link-checker check --base-url https://example.substack.com \
  --url-file unchecked_posts.txt --history-file checked_posts.json

$ substack-link-checker check --base-url https://example.substack.com --year 2024
Substack Broken Link Checker
==================================================
Base URL: https://example.substack.com
Concurrency: 10
Max retries: 3
Input: Sitemap
Year: 2024
==================================================
Found 45 posts from 2024
[1/45] Processing: https://example.substack.com/p/my-first-post
Checking 12 links (10 new, 2 cached)...
Found 1 broken links in this post
[2/45] Processing: https://example.substack.com/p/another-post
Checking 8 links (6 new, 2 cached)...
Found 0 broken links in this post
...
Completed in 34.2 seconds
==================================================
SUMMARY
==================================================
Total links checked: 234
Links skipped (assumed OK): 8
Links auto-flagged broken: 0
Cache hits: 45
Retries performed: 3
Broken links found: 5
Generating report: broken_links_report.csv
Report generated with 5 broken links
| Option | Short | Description |
|---|---|---|
| `--base-url` | `-b` | Your Substack URL (required) |
| `--year` | `-y` | Year to check (uses sitemap) |
| `--url-file` | `-f` | File with post URLs (one per line) |
| `--output` | `-o` | Output CSV filename (default: broken_links_report.csv) |
| `--concurrency` | `-c` | Parallel requests (default: 10) |
| `--timeout` | `-t` | Request timeout in seconds (default: 10) |
| `--max-retries` | `-r` | Retry attempts for failures (default: 3) |
| `--history-file` | `-H` | JSON file for tracking checked posts |
| `--only-new` | | Only check posts not in history |
| `--skip-domains` | `-S` | Domains to skip (assumed OK) |
| `--skip-domains-file` | | File with domains to skip (one per line) |
| `--broken-domains` | `-B` | Domains to auto-flag as broken |
| `--broken-domains-file` | | File with domains to auto-flag (one per line) |
| `--cookie` | `-C` | Substack session cookie for authentication |
| `--verbose` | `-v` | Show detailed progress |
| `--limit` | `-l` | Max posts to check |
| Command | Purpose |
|---|---|
| `substack-link-checker check` | Main link checker (the command shown throughout this README) |
| `substack-link-checker compare` | Find posts not yet checked (sitemap vs history) |
| `substack-link-checker import` | Import previous results from Excel/CSV into history |
| `substack-link-checker fetch-archive` | Extract URLs from the /archive page (fallback when the sitemap doesn't work) |
| `substack-link-checker demo` | Self-contained demo against a handful of known-good/bad URLs |
| `run_link_checker.ps1` | Windows Task Scheduler automation (PowerShell) |
The tool generates a CSV report with columns:
- Post Title: Title of the post containing the broken link
- Post URL: URL of the post
- Broken Link: The broken URL
- Error Type: What went wrong (HTTP 404, DNS Failure, SSL Error, etc.)
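Because the report is plain CSV, it is easy to post-process. The sketch below tallies broken links per error type. It assumes the header row uses the column names exactly as listed above, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_error_type(report_file):
    """Count report rows per 'Error Type' (assumes that exact header name)."""
    reader = csv.DictReader(report_file)
    return Counter(row["Error Type"] for row in reader)


# Made-up rows in the documented column layout, for illustration only.
SAMPLE = """\
Post Title,Post URL,Broken Link,Error Type
My First Post,https://example.substack.com/p/my-first-post,https://old.example.com/a,HTTP 404
Another Post,https://example.substack.com/p/another-post,https://gone.example.net/,DNS Failure
Another Post,https://example.substack.com/p/another-post,https://old.example.com/b,HTTP 404
"""

counts = count_by_error_type(io.StringIO(SAMPLE))
for error_type, n in counts.most_common():
    print(f"{n:>3}  {error_type}")
```

For a real report, replace the `io.StringIO(SAMPLE)` argument with `open("broken_links_report.csv", newline="", encoding="utf-8")`.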
- `HTTP 404` - Page not found
- `HTTP 4xx/5xx` - Other HTTP errors
- `Soft 404` - Page loads but title indicates error
- `DNS Failure` - Domain doesn't exist
- `SSL Error` - Certificate problems
- `Timeout` - Server didn't respond
- `Connection Error` - Network issues
- `Known broken domain` - Auto-flagged via `--broken-domains`
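One way to think about these categories is as a mapping from low-level failures to report labels. The sketch below uses only standard-library exception types. The function names are hypothetical, the exact label used for non-404 HTTP errors is an assumption, and an HTTP client library may wrap these exceptions in its own types, so the tool's real classification logic may differ.

```python
import socket
import ssl


def classify_status(status):
    """Map an HTTP status code to a label, or None if the link looks fine."""
    if status == 404:
        return "HTTP 404"
    if status >= 400:
        # Assumption: other 4xx/5xx codes are reported with their number.
        return f"HTTP {status}"
    return None


def classify_exception(exc):
    """Map a network exception to one of the error types listed above."""
    # Order matters: all of these are OSError subclasses.
    if isinstance(exc, ssl.SSLError):
        return "SSL Error"
    if isinstance(exc, socket.gaierror):
        return "DNS Failure"
    # socket.timeout is a separate class before Python 3.10.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "Timeout"
    return "Connection Error"
```

Checking the most specific exception types first avoids everything collapsing into the generic `Connection Error` bucket.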
MIT License - see LICENSE file.
Issues and pull requests welcome at github.com/jcddc83/substack-broken-link-checker. See CONTRIBUTING.md for guidelines and SECURITY.md for reporting security issues.