Note: This document has been archived. The recommendations below were implemented as part of the 2025 rewrite. See the main README for current documentation.
Yep — and you can cut maintenance time dramatically by turning this repo into a config-driven build pipeline (instead of "hand-curated files + a pile of scripts").
A couple things I can tell from the repo itself:
- Your alt formats (dnsmasq/adguard/etc) are already expected to be script-generated and not manually edited. (GitHub)
- You’re already publishing daily-ish “Aggregated Lists YYYYMMDD” releases via github-actions, which is the right direction. (GitHub)
The big win now is to standardize the entire build around one pipeline + one source of truth.
Pick a single internal format as the “truth,” e.g.:
- normalized domain lines only (one domain per line)
- lowercased
- punycode normalized
- comments stripped (except metadata header)
- invalid domains removed
Everything else (hosts / dnsmasq / adguard / etc) becomes pure output rendering from that canonical set.
Why this saves time: format bugs and “why is it different between versions?” disappear because you only curate one dataset.
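Here's a rough sketch of what that normalization step could look like. The function name `normalize_domain` and the exact validity rules are my assumptions, not something taken from your scripts; it uses Python's built-in `idna` codec, which follows the older IDNA 2003 rules.

```python
import re
from typing import Optional

# One DNS label: 1-63 letters, digits or hyphens, no leading/trailing hyphen.
_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def normalize_domain(raw: str) -> Optional[str]:
    """Return the canonical form of a domain, or None if it is invalid.

    Hypothetical helper: the rules below are one reasonable reading of
    "lowercased, punycode normalized, comments stripped, invalid removed".
    """
    text = raw.split("#", 1)[0].strip().lower().rstrip(".")
    if not text:
        return None
    try:
        # Built-in codec (IDNA 2003): converts Unicode labels to punycode.
        ascii_name = text.encode("idna").decode("ascii")
    except UnicodeError:
        return None
    labels = ascii_name.split(".")
    if len(ascii_name) > 253 or len(labels) < 2:
        return None  # too long, or a single-label name
    if labels[-1].isdigit():
        return None  # looks like an IPv4 address, not a domain
    if not all(_LABEL.match(label) for label in labels):
        return None
    return ascii_name


print(normalize_domain("  Example.COM.  "))
print(normalize_domain("0.0.0.0"))
```

Note this rejects underscores in labels; if you want to keep entries like `_dmarc.example.com`, the label pattern would need loosening.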
Create something like config/lists.yml:
- list name
- category (ads, malware, phishing, etc)
- upstream sources (URLs)
- local overrides:
  - allowlist exceptions
  - forced blocks
  - regex exclusions (rare, but sometimes needed)
- output formats to generate
Then the build system loops over the config.
Result: Adding a new list becomes “add 15 lines to YAML,” not “copy script X and hope it works.”
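To make the "loop over the config" idea concrete, here's a minimal sketch in Python. All field names (`sources`, `allowlist`, `forced_blocks`, `exclude_regex`, `outputs`) are placeholders I made up, and `RAW_CONFIG` stands in for whatever a YAML parser would return for `config/lists.yml`.

```python
from dataclasses import dataclass, field


@dataclass
class ListConfig:
    """One entry of the (hypothetical) config file."""
    name: str
    category: str
    sources: list[str]
    allowlist: list[str] = field(default_factory=list)
    forced_blocks: list[str] = field(default_factory=list)
    exclude_regex: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=lambda: ["domains"])


def load_lists(raw: dict) -> list[ListConfig]:
    """Turn the parsed mapping into typed entries.

    An unknown key raises TypeError, so config typos fail the build early.
    """
    return [ListConfig(name=name, **body) for name, body in raw["lists"].items()]


# Stand-in for the parsed YAML; the URL is a placeholder.
RAW_CONFIG = {
    "lists": {
        "ads": {
            "category": "ads",
            "sources": ["https://example.com/ads.txt"],
            "allowlist": ["good.example.org"],
            "outputs": ["domains", "hosts", "dnsmasq"],
        },
    },
}

for cfg in load_lists(RAW_CONFIG):
    print(f"{cfg.name}: {len(cfg.sources)} source(s) -> {', '.join(cfg.outputs)}")
```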
Most time waste in list projects is reprocessing everything from scratch.
Do this:
- Cache downloaded sources by ETag/Last-Modified when possible
- Hash each upstream content (sha256)
- Only rebuild a list if any of its upstream hashes changed or overrides changed
- Always regenerate outputs from the canonical file for that list (fast)
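The hash-based rebuild check from the list above could be sketched like this. The state file layout and the function names are assumptions for illustration; the idea is just to compare the current sha256 of every source and override file against what the last build recorded.

```python
import hashlib
import json
from pathlib import Path


def content_hash(data: bytes) -> str:
    """sha256 hex digest of one upstream (or overrides) file."""
    return hashlib.sha256(data).hexdigest()


def needs_rebuild(state_file: Path, current: dict[str, str]) -> bool:
    """True when any hash differs from the last recorded build."""
    if not state_file.exists():
        return True  # never built before
    return json.loads(state_file.read_text()) != current


def record_build(state_file: Path, current: dict[str, str]) -> None:
    """Persist the hashes so the next run can skip unchanged lists."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(current, indent=2, sort_keys=True))
```

A typical call would build `current` as `{source_id: content_hash(body), ...}` plus one entry for the overrides file, then only run the expensive merge when `needs_rebuild(...)` is true.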
- download upstreams in parallel with retries/backoff
- store build/cache/<source_id>.txt and metadata JSON
- parse all known formats (hosts, adblock-ish, dnsmasq, plain domains)
- output canonical domain stream
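For the parsing step, a deliberately conservative sketch might look like the following. The patterns are my guesses at the most common line shapes for each format, not a complete parser: adblock rules with options (for example `$third-party`) and cosmetic rules are skipped rather than interpreted, and validity checks are left to a later step.

```python
import re
from typing import Iterable, Iterator, Optional

_HOSTS = re.compile(r"^(?:0\.0\.0\.0|127\.0\.0\.1)\s+(\S+)$")
_DNSMASQ = re.compile(r"^(?:server|address|local)=/([^/]+)/")
_ADBLOCK = re.compile(r"^\|\|([^\^/$*]+)\^$")
_PLAIN = re.compile(r"^[A-Za-z0-9._-]+$")


def extract_domain(line: str) -> Optional[str]:
    """Pull a lowercase domain out of one source line, or None."""
    text = line.strip()
    # "#" starts comments in hosts/dnsmasq/plain files, "!" in adblock lists.
    if not text or text[0] in "#!":
        return None
    # Drop trailing comments, but only when preceded by whitespace, so
    # cosmetic rules like "example.com##.ad" are not mistaken for domains.
    text = re.sub(r"\s+#.*$", "", text)
    for pattern in (_HOSTS, _DNSMASQ, _ADBLOCK):
        match = pattern.match(text)
        if match:
            return match.group(1).lower()
    if _PLAIN.match(text) and "." in text:
        return text.lower()
    return None


def canonical_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield one extracted domain per parseable line."""
    for line in lines:
        domain = extract_domain(line)
        if domain:
            yield domain
```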
- dedup with a set (or disk-backed sqlite if huge)
- apply:
  - allowlist removals
  - forced additions
  - optional “safety filters” (drop single-character, invalid TLD, etc.)
- ensure no IPs, no spaces, no wildcard junk unless you explicitly support it
- sanity thresholds:
  - if a list drops by 80% in one run -> fail build (prevents upstream breaking you silently)
  - if a list grows by 5x -> warn/fail (prevents poisoning)
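Roughly, the merge and threshold logic above could be sketched as below. The names `merge`, `check_size` and `SanityError` are hypothetical, and applying forced blocks after the allowlist (so a forced block wins if a domain is in both) is my assumption about the precedence you'd want.

```python
from typing import Iterable


class SanityError(RuntimeError):
    """Raised when a build looks too different from the previous one."""


def merge(sources: Iterable[Iterable[str]],
          allowlist: Iterable[str],
          forced: Iterable[str]) -> set[str]:
    """Deduplicate all source domains, then apply local overrides."""
    merged: set[str] = set()
    for domains in sources:
        merged.update(domains)
    merged -= set(allowlist)   # allowlist removals
    merged |= set(forced)      # forced additions win over the allowlist
    return merged


def check_size(previous: int, current: int,
               max_drop: float = 0.8, max_growth: float = 5.0) -> None:
    """Fail the build on a sudden shrink or growth versus the last build."""
    if previous == 0:
        return  # first build, nothing to compare against
    if (previous - current) / previous >= max_drop:
        raise SanityError(f"list shrank from {previous} to {current}")
    if current / previous >= max_growth:
        raise SanityError(f"list grew from {previous} to {current}")
```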
From canonical domains, generate:
- 0.0.0.0 domain (hosts)
- domain (plain domain list)
- server=/domain/ (dnsmasq)
- ||domain^ (adblock-style, e.g. AdGuard)
…and any future formats.
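Because every output is a per-domain template, the rendering step can stay tiny. A minimal sketch, assuming format keys `hosts`, `domains`, `dnsmasq` and `adblock` (names I picked; use whatever your config calls them):

```python
from typing import Callable, Iterable

# One line template per output format; adding a format is one more entry.
RENDERERS: dict[str, Callable[[str], str]] = {
    "hosts": lambda d: f"0.0.0.0 {d}",
    "domains": lambda d: d,
    "dnsmasq": lambda d: f"server=/{d}/",
    "adblock": lambda d: f"||{d}^",
}


def render(domains: Iterable[str], fmt: str) -> str:
    """Render canonical domains in one format, sorted for stable diffs."""
    line = RENDERERS[fmt]
    return "\n".join(line(d) for d in sorted(domains)) + "\n"


print(render({"b.example", "a.example"}, "hosts"), end="")
```

Sorting before writing keeps the day-to-day diffs small, which matters if a human is going to skim them.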
- commit changes to repo
- create release artifact(s)
- optionally publish a manifest.json with counts + sources + build info
Instead of pushing straight to master, have actions open an automated PR:
- Maintainer reviews diffs (quick sanity)
- Merge
- Release auto-publishes
This alone prevents “oops we shipped a bad upstream day.”
Here’s a minimal workflow skeleton:
    name: Build Lists
    on:
      schedule:
        - cron: "12 6 * * *" # daily
      workflow_dispatch: {}
    jobs:
      build:
        runs-on: ubuntu-latest
        permissions:
          contents: write
          pull-requests: write
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          - name: Install deps
            run: |
              pip install -r scripts/requirements.txt
          - name: Build
            run: |
              python scripts/build.py --config config/lists.yml --out .
          - name: Create Pull Request
            uses: peter-evans/create-pull-request@v6
            with:
              title: "Automated list build"
              commit-message: "Automated list build"
              branch: "bot/list-build"
              labels: "automation"

(If you already publish releases via github-actions, this fits right into what you’re doing now. (GitHub))
This is usually the real time sink.
Use GitHub Issue Forms for:
- Add request
- Remove request
- False positive report
Required fields should include:
- domain
- which list(s)
- evidence / reason
- “breaks what” (for removals)
On issue open:
- check if domain is present in repo files
- comment with:
  - “found in X lists”
  - “not found”
- auto-label and route
This reduces back-and-forth to near zero.
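The lookup half of that automation is simple enough to sketch. This assumes the canonical lists live as one-domain-per-line `*.txt` files in a single directory, which is my assumption about the layout; posting the comment and applying labels would be done by whatever runs this in the workflow.

```python
from pathlib import Path


def find_domain(domain: str, list_dir: Path) -> list[str]:
    """Return the names of the list files that contain the domain."""
    wanted = domain.strip().lower().rstrip(".")
    hits = []
    for path in sorted(list_dir.glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            if any(line.strip() == wanted for line in handle):
                hits.append(path.name)
    return hits


def triage_comment(domain: str, hits: list[str]) -> str:
    """Build the text of the automatic issue comment."""
    if hits:
        return f"`{domain}` found in {len(hits)} list(s): {', '.join(hits)}"
    return f"`{domain}` not found in any list"
```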
Generate manifest.json (and optionally manifest.csv) with:
- list -> domain count
- sources used
- build timestamp
- git commit
- diff vs previous build (added/removed counts)
This makes debugging fast when users complain.
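A manifest along those lines could be assembled like this. The key names (`built_at`, `commit`, `lists`, and so on) are placeholders, and the function expects the previous build's domain sets to be loaded by the caller.

```python
import json
from datetime import datetime, timezone


def build_manifest(lists: dict[str, set[str]],
                   previous: dict[str, set[str]],
                   sources: dict[str, list[str]],
                   commit: str) -> dict:
    """Summarize one build: counts, sources and diff vs the previous build."""
    entries = {}
    for name, domains in sorted(lists.items()):
        old = previous.get(name, set())
        entries[name] = {
            "count": len(domains),
            "sources": sources.get(name, []),
            "added": len(domains - old),
            "removed": len(old - domains),
        }
    return {
        "built_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "commit": commit,
        "lists": entries,
    }


manifest = build_manifest(
    {"ads": {"a.example", "b.example"}},
    {"ads": {"b.example", "c.example"}},
    {"ads": ["https://example.com/ads.txt"]},  # placeholder URL
    commit="abc123",                            # placeholder SHA
)
print(json.dumps(manifest, indent=2))
```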
It’s common for upstreams to get compromised or to ship garbage. Add build fails for:
- sudden massive growth
- sudden massive shrink
- too many non-FQDNs
- too many single-label domains
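The "too many single-label domains" check can run on the raw parsed entries before they are filtered. A minimal sketch, where the 1% default is an arbitrary number I picked and "single-label" just means "contains no dot":

```python
from typing import Iterable


def single_label_ratio(entries: Iterable[str]) -> float:
    """Share of entries that have no dot, e.g. "localhost"."""
    items = list(entries)
    if not items:
        return 0.0
    bad = sum(1 for entry in items if "." not in entry.strip("."))
    return bad / len(items)


def enforce_quality(entries: Iterable[str], max_ratio: float = 0.01) -> float:
    """Raise ValueError when too much of a source is not a real FQDN."""
    ratio = single_label_ratio(entries)
    if ratio > max_ratio:
        raise ValueError(f"{ratio:.1%} single-label entries (limit {max_ratio:.1%})")
    return ratio
```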
At the top of each generated list:
- build date
- count
- source count
- link to manifest
This cuts repeated “is this still maintained?” questions. (You already get those. (GitHub))
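Generating that header is a few lines. In this sketch the field labels and the `comment` parameter are my own choices; `comment` lets the same function emit `#` headers for hosts/dnsmasq files and `!` headers for adblock-style files.

```python
from datetime import date


def render_header(name: str, count: int, source_count: int,
                  manifest_url: str, build_date: date,
                  comment: str = "#") -> str:
    """Return the metadata header placed at the top of a generated list."""
    lines = [
        f"Title: {name}",
        f"Build date: {build_date.isoformat()}",
        f"Domains: {count}",
        f"Sources: {source_count}",
        f"Manifest: {manifest_url}",
    ]
    return "\n".join(f"{comment} {line}" for line in lines) + "\n"


# The URL is a placeholder for wherever the manifest ends up being published.
print(render_header("ads", 2, 1, "https://example.com/manifest.json",
                    date(2025, 1, 2)), end="")
```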
Right now, GitHub’s web view for that /scripts folder isn’t loading cleanly in my browsing tool (it returns GitHub’s “Uh oh! error”), so I can’t reliably read your current scripts line-by-line from that link in this chat.
If you paste the main build script(s) here (or upload them), I can:
- refactor them into the config-driven pipeline above
- give you a drop-in build.py + lists.yml
- add the validation + “poisoning protection”
- add the PR-based GitHub Action so you’re not babysitting builds
If you only paste one file, paste the entrypoint script that orchestrates everything (the one you run to generate outputs).