This repository implements a Next-Gen Multimodal AI pipeline designed to automate the extraction of structured feature data from raw advertising creatives.
Leveraging Google's Gemini 2.0 Flash, this system acts as an intelligent ETL layer. It "watches" video ads and "views" static images to generate high-fidelity, structured JSON metadata (e.g., mood, pacing, shot angles, text density) with significantly lower latency and stronger temporal reasoning than previous architectures.
The structured data produced by this pipeline serves as the critical upstream input for downstream performance forecasting models (TCN, XGBoost), solving the "unstructured data" problem in AdTech.
- Native Multimodality: Processes video frames natively (not just sampled images) for deep temporal understanding of ad pacing and narrative flow.
- Ultra-Low Latency: Optimized for speed, making it viable for high-volume production pipelines processing thousands of creatives.
- Structured Output Enforcement: Utilizes native Function Calling to guarantee strictly valid JSON outputs, eliminating hallucinated formatting errors.
- Multimodal Ingestion: Native support for both `.mp4` video and static `.jpg`/`.png` image formats.
- Temporal Video Analysis: Captures time-based features such as pacing (Fast/Slow), audio mood, and call-to-action timing.
- Schema-Constrained Generation: Forces the LLM to adhere to a strict business schema, ensuring 100% compatibility with downstream SQL/Pandas pipelines.
- Production Robustness: Includes polling mechanisms for asynchronous video processing states and automatic rate-limit handling.
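The rate-limit handling mentioned above can be sketched as a simple exponential-backoff wrapper. This is an illustrative sketch, not the repository's actual implementation: a real pipeline would wrap the Gemini request and catch the SDK's specific rate-limit exception rather than the generic `RuntimeError` used here.

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on rate-limit errors with exponential backoff.

    `call` is any zero-argument function. The generic RuntimeError stands
    in for the SDK's rate-limit exception for illustration purposes.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
```

Injecting `sleep` keeps the wrapper trivially testable without real delays.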
- Ingestion: Scans the `extraction_visuals/` directory for new creative assets.
- Upload & State Management: Uploads large video files to the Gemini File API and polls for the `ACTIVE` processing state.
- Inference: Sends the asset plus a strict JSON schema definition to Gemini 2.0 Flash.
- Serialization: Saves the extracted metadata as sanitized JSON files in `extraction_results/`.
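The upload-and-poll step can be sketched as follows. This is a minimal illustration: `get_file_state` is an assumed stand-in for the SDK's file-status call, not a helper from this repository.

```python
import time

def wait_until_active(get_file_state, timeout=300, interval=5, sleep=time.sleep):
    """Poll the File API until the uploaded asset reaches ACTIVE.

    `get_file_state` is a zero-argument callable returning the current
    processing state string ("PROCESSING", "ACTIVE", or "FAILED").
    """
    waited = 0
    while waited < timeout:
        state = get_file_state()
        if state == "ACTIVE":
            return True
        if state == "FAILED":
            raise RuntimeError("File processing failed")
        sleep(interval)
        waited += interval
    raise TimeoutError("File never became ACTIVE")
```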
The model is constrained to extract specific dimensions known to impact ad performance:
- Visual Composition: `shot_type` (Close-up, Wide), `color_tone` (Warm, Cool), `camera_angle`.
- Text Analysis: `amount_of_text`, `text_position`, `font_style`.
- Content: `people_presence` (count, emotion), `setting` (Indoor/Outdoor).
- Audio/Pacing: `music_mood`, `voiceover_gender`, `pacing`.
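The business schema can be checked before results are written downstream. The sketch below is illustrative: the field grouping mirrors the example output shown later rather than the repository's exact schema definition.

```python
# Assumed grouping of required fields, based on the example output.
REQUIRED_FIELDS = {
    "creative_format": ["media_type", "shot_type", "text_position"],
    "mood_and_tone": ["pacing", "emotional_appeal", "music_mood"],
}

def validate_extraction(record: dict) -> list[str]:
    """Return dot-paths of missing keys; an empty list means valid."""
    missing = []
    for section, keys in REQUIRED_FIELDS.items():
        block = record.get(section, {})
        missing.extend(f"{section}.{k}" for k in keys if k not in block)
    return missing
```

A guard like this catches any record where the model silently dropped a field, before it reaches SQL/Pandas pipelines.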
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/generative-ad-feature-extraction.git
   cd generative-ad-feature-extraction
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. API Key Setup: Create a `.env` file in the root directory and add your Google Gemini API key:

   ```
   GEMINI_API_KEY=your_api_key_here
   ```

4. Prepare Data: Place your raw video or image files into the `extraction_visuals/` folder.

5. Run the Extractor:

   ```bash
   python main.py
   ```

6. Output: Structured JSON files will appear in `extraction_results/`.

Example Output:

```json
{
  "creative_format": {
    "media_type": "Video",
    "shot_type": "Close-up",
    "text_position": "Center"
  },
  "mood_and_tone": {
    "pacing": "Fast",
    "emotional_appeal": "Excitement",
    "music_mood": "Upbeat/Electronic"
  },
  "source_filename": "nike_summer_campaign.mp4"
}
```
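Downstream forecasting models typically need these nested records flattened into flat columns. A minimal sketch (the dot-delimited column-naming convention is an assumption, not the repository's documented format):

```python
import json

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested JSON into dot-delimited columns for SQL/Pandas."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

example = json.loads("""
{"creative_format": {"media_type": "Video", "shot_type": "Close-up"},
 "source_filename": "nike_summer_campaign.mp4"}
""")
row = flatten(example)
# row == {"creative_format.media_type": "Video",
#         "creative_format.shot_type": "Close-up",
#         "source_filename": "nike_summer_campaign.mp4"}
```

A list of such rows drops straight into `pandas.DataFrame(rows)` for feature engineering.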
This repository is Part 1 of a larger MLOps ecosystem.
- This Repo: Extracts structured features from pixels.
- Forecasting Repos (TCN/XGBoost): Ingest these JSON features to predict the actual Leads/Clicks each ad will generate.
Luciën Tuijp