Tumblr Tagging Automation

Overview

When posting to platforms like Tumblr (or any other social media), tags and hashtags are important for discoverability. However, choosing them by hand has drawbacks:

- Selecting optimal tags for each image manually is time-consuming
- It adds manual overhead to a step that could be automated
- Quality and consistency can vary between posts

This project explores an automated pipeline that generates relevant, consistent tags for image-based social media posts.

Implementation

Image tagging
- Each image is visually analyzed by wd-swinv2-tagger-v3
- This produces a list of visual tags (and scores) per image
Prompt creation
- A single user prompt is created containing:
  - All wd-swinv2-tagger-v3 generated tags
  - User-authored captions for each image
  - A master list of allowed tags
LLM inference
- An LLM uses the given prompt to output optimal tags per image
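
The prompt creation step above can be sketched roughly as follows. This is a minimal illustration, not the code in `tagger.py`: the `ImageInput` structure, the `build_prompt` function, the prompt wording, and the `min_score` threshold of 0.35 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ImageInput:
    """Inputs for one image (hypothetical structure for this sketch)."""
    name: str
    caption: str
    visual_tags: dict[str, float]  # tagger tag -> confidence score


def build_prompt(images: list[ImageInput], allowed_tags: list[str],
                 min_score: float = 0.35) -> str:
    """Combine tagger output, captions, and the master tag list into one prompt."""
    lines = [
        "Choose the best tags for each image below.",
        "Only use tags from the allowed list.",
        "",
        "Allowed tags: " + ", ".join(sorted(allowed_tags)),
        "",
    ]
    for image in images:
        # Keep visual tags at or above the (assumed) threshold, highest score first.
        kept = sorted(
            (tag for tag, score in image.visual_tags.items() if score >= min_score),
            key=lambda tag: -image.visual_tags[tag],
        )
        lines.append(f"Image: {image.name}")
        lines.append(f"Caption: {image.caption}")
        lines.append("Visual tags: " + ", ".join(kept))
        lines.append("")
    return "\n".join(lines).rstrip()
```

Putting every image into one prompt matches the single-prompt design described above. Restricting the model to the allowed list in the prompt text does not guarantee compliance, which is why the benchmark still checks for tags outside the pool.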

Benchmarking

Different LLMs showed significant variation in output quality and cost per API call. To make an informed choice, a simple benchmark was developed that:

- Rewards high-quality, relevant tags
- Penalizes irrelevant or invalid tags

Scoring methodology

Each tag produced by a model is scored individually based on how relevant it is and how it compares to other models' outputs.

+1.0 – Gold consensus
- This tag was selected by more than 70% of models, indicating strong agreement
+0.5 – Minority but relevant
- This tag was selected by less than 30% of models, but is still relevant
+0.5 – Unique but relevant
- This tag was only selected by this model but is still relevant
-0.5 – Missing gold tag
- This tag is a gold consensus tag, but the model failed to include it
-0.5 – Irrelevant tag from pool
- This tag is from the provided tag pool but is not relevant to the image
-1.0 – Hallucinated tag
- This tag does not exist in the provided tag pool at all

Note: Whether a tag is considered relevant or irrelevant is determined manually by a human reviewer.
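
As a rough illustration of the rules above, the sketch below scores every model's tags for a single image. It is not the code in `benchmark.py`: the function name `score_models` and its inputs are hypothetical, and `relevant` stands in for the human reviewer's judgment. The rules do not say how to score a relevant tag chosen by 30% to 70% of models, so this sketch assumes it scores 0. It also assumes a gold consensus tag earns +1.0 without a separate relevance check.

```python
from collections import Counter


def score_models(outputs: dict[str, set[str]], tag_pool: set[str],
                 relevant: set[str]) -> dict[str, float]:
    """Score each model's tags for one image.

    outputs:  model name -> tags that model selected
    tag_pool: the master list of allowed tags
    relevant: tags a human reviewer marked as relevant to the image
    """
    n_models = len(outputs)
    # Count how many models selected each tag.
    counts = Counter(tag for tags in outputs.values() for tag in tags)
    # Gold consensus: pool tags selected by more than 70% of models.
    gold = {t for t, c in counts.items() if t in tag_pool and c / n_models > 0.7}

    scores: dict[str, float] = {}
    for model, tags in outputs.items():
        total = 0.0
        for tag in tags:
            share = counts[tag] / n_models
            if tag not in tag_pool:
                total -= 1.0  # hallucinated tag
            elif tag in gold:
                total += 1.0  # gold consensus
            elif tag not in relevant:
                total -= 0.5  # irrelevant tag from pool
            elif counts[tag] == 1 or share < 0.3:
                total += 0.5  # unique or minority, but relevant
            # Assumption: relevant tags with a 30-70% share score 0.
        # Penalize each gold consensus tag this model left out.
        total -= 0.5 * len(gold - tags)
        scores[model] = total
    return scores
```

A benchmark run would presumably sum these per-image scores across all test images to get one total per model.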

Results

While models such as Gemini 3 Pro, Grok 4, and o4 Mini High produced excellent results, they were prohibitively expensive for such a simple use case. In contrast, Gemini 3 Flash and Deepseek V3.2 delivered the best results per dollar spent.
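
One simple way to express "results per dollar" is to divide each model's total benchmark score by its total API cost. The helper below is a hypothetical sketch of that ratio. The name `rank_by_value` and the input layout are assumptions, and the figures in the usage example are placeholders rather than benchmark results.

```python
def rank_by_value(results: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Rank models by benchmark score per dollar, best first.

    results maps model name -> (total score, total cost in USD).
    Models with zero or negative cost are skipped to avoid dividing by zero.
    """
    ranked = [
        (model, score / cost)
        for model, (score, cost) in results.items()
        if cost > 0
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Placeholder figures for illustration only.
example = {"model-a": (40.0, 2.0), "model-b": (30.0, 0.5)}
ranking = rank_by_value(example)
```

A model with a lower raw score can still rank first if it is cheap enough, which is the trade-off described above.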

Full benchmark results can be found here:

Use cases

- Automated Tumblr posting
- Social media content pipelines
- LLM prompt testing for tagging tasks

Tech stack

- wd-swinv2-tagger-v3 by SmilingWolf - for image tagging
- OpenRouter - for LLM inference
- Python 3.14 - for orchestration and benchmarking

Legal Disclaimer

This repository contains sample images sourced from Pinterest, purely for testing purposes.

- I do not own these images.
- They are only used for non-commercial benchmarking.
- If you own any of these images and want them removed, please let me know.

This project is focused on tooling and evaluation, not content redistribution.
