Semantic search over 357k+ camera trap images from the Felidae Conservation Fund dataset, powered by CLIP and running entirely in the browser.
- CLIP image embeddings are precomputed for all 357k images on a GPU cluster and stored as a raw float32 binary (`embeddings.bin`) on HuggingFace Hub.
- On first visit, the browser downloads `embeddings.bin` (~686 MB) and `metadata.csv` and caches them via the Cache API; subsequent visits are instant.
- On each search query, the CLIP text encoder runs in-browser via Transformers.js to encode the query, then a dot product over all image embeddings returns the top-K results.
- Images are served directly from GCS (no backend needed).
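The search step above is just a dot product over L2-normalized vectors. A minimal Python sketch of the same math (the browser does this in `search.js` via Transformers.js; the file layout follows the description above — raw float32, 512 floats per image — while `load_embeddings` and `top_k` are illustrative names, and the query vector here stands in for the CLIP text encoder's output):

```python
import numpy as np

EMBED_DIM = 512  # CLIP ViT-B/32 embedding size

def load_embeddings(path: str, dim: int = EMBED_DIM) -> np.ndarray:
    """Read the raw float32 binary into an (N, dim) matrix."""
    flat = np.fromfile(path, dtype=np.float32)
    return flat.reshape(-1, dim)

def top_k(image_embeds: np.ndarray, text_embed: np.ndarray, k: int = 10) -> np.ndarray:
    """Dot-product similarity over all rows; returns indices of the k best matches."""
    # Both sides are assumed L2-normalized, so dot product == cosine similarity.
    scores = image_embeds @ text_embed
    return np.argsort(scores)[::-1][:k]
```

Because the binary has no header, the row count is implied by file size divided by `4 * 512` bytes, which is why `reshape(-1, dim)` suffices.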
- Source: Felidae Conservation Fund 2020–2025 via LILA Science
- Size: ~357,934 camera trap images
- Labels: 66 wildlife species (bobcat, puma, mule deer, gray fox, …)
- Format: COCO Camera Traps JSON
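A sketch of how the COCO Camera Traps JSON ties images to species labels — the `images`/`annotations`/`categories` field names follow the published COCO Camera Traps convention, while the helper name and the tiny inline record are made up for illustration:

```python
import json

def species_by_image(coco: dict) -> dict:
    """Map each image file name to its species label(s)."""
    cat_name = {c["id"]: c["name"] for c in coco["categories"]}
    file_name = {im["id"]: im["file_name"] for im in coco["images"]}
    labels: dict = {}
    # Each annotation links an image to a category; an image may have several.
    for ann in coco["annotations"]:
        labels.setdefault(file_name[ann["image_id"]], []).append(cat_name[ann["category_id"]])
    return labels

# Tiny made-up record in the COCO Camera Traps shape:
doc = json.loads("""{
  "images": [{"id": "img1", "file_name": "cam01/0001.jpg"}],
  "annotations": [{"image_id": "img1", "category_id": 7}],
  "categories": [{"id": 7, "name": "bobcat"}]
}""")
```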
```
scripts/
  download_images.py     # Download images from Azure blob storage
  compute_embeddings.py  # Compute CLIP embeddings (GPU, resumable)
  upload_to_hf.py        # Upload embeddings.bin + metadata.csv to HuggingFace Hub
  run_pipeline.sh        # Slurm batch script (end-to-end pipeline)
web/
  index.html             # Browser search UI
  search.js              # CLIP search logic (Transformers.js)
```
```sh
wget https://lilawildlife.blob.core.windows.net/lila-wildlife/felidae-conservation-fund/felidae_conservation_fund_2020_2025.zip
unzip felidae_conservation_fund_2020_2025.zip
```

```sh
python scripts/download_images.py \
    --metadata felidae_conservation_fund_2020_2025.json \
    --out-dir data/images/ \
    --workers 16
```

```sh
python scripts/compute_embeddings.py \
    --metadata felidae_conservation_fund_2020_2025.json \
    --mode local \
    --image-dir data/images/ \
    --io-workers 16 \
    --resume
```

Add `--dry-run` to process only the first image as a sanity check.
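One plausible shape for the `--resume` behavior — a sidecar file of finished image IDs that is appended to after each embedding and consulted on restart. These helper names are hypothetical; the real `compute_embeddings.py` may track progress differently:

```python
import os

def pending_images(all_ids: list, done_path: str) -> list:
    """Return image IDs still to embed, given a sidecar file of finished IDs."""
    done = set()
    if os.path.exists(done_path):
        with open(done_path) as f:
            done = {line.strip() for line in f if line.strip()}
    return [i for i in all_ids if i not in done]

def mark_done(image_id: str, done_path: str) -> None:
    """Append one finished ID; append-only writes survive a mid-run crash."""
    with open(done_path, "a") as f:
        f.write(image_id + "\n")
```

The append-only design means a killed job loses at most the image in flight, which is what makes a multi-hour GPU pass over 357k images restartable.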
```sh
huggingface-cli login
python scripts/upload_to_hf.py --repo your-username/felidae-image-search
```

```sh
sbatch scripts/run_pipeline.sh
```

```sh
pip install -r requirements.txt
```

- Python 3.10+
- PyTorch 2.0+
- `transformers`, `Pillow`, `requests`, `numpy`, `huggingface_hub`, `tqdm`
`openai/clip-vit-base-patch32`: 512-dimensional embeddings, with in-browser text encoding via `Xenova/clip-vit-base-patch32`.