
Project 2: Due: Monday October 13, Noon. Teaming: You can do this project in a team of 1, 2, or 3.

Please submit your blog post here: submission link

You can make a blog for yourself and post pages from it via the GWU blog service: https://blogs.gwu.edu/, or use public services like GitHub, WordPress, etc.

The goal of this project is to experiment with Vision Language Models and to give you intuition about current methods for computing embeddings. In particular, you will explore one of the biggest challenges in vision-language modeling: the "modality gap" between text and image embeddings.

My CLIP Challenge has two parts.

PART 1:

First, here are 5 images:

For each image, I want you to find the text that matches the image best.

  • For each image I, find the single word W that maximizes the cosine similarity between CLIP(I) and CLIP(W).
  • For each image I, find the one-word "simple structured caption" W that maximizes the cosine similarity between CLIP(I) and CLIP("A photo of a W").
  • For each image I, find the arbitrary caption C that maximizes the cosine similarity between CLIP(I) and CLIP(C).

You can do this with code (recommended) or with manual search (try a bunch of words yourself). If you do it with manual search, you must tell the story of which words you tried and what you learned (e.g., "I tried 'dog', then I tried 'puppy' and it gave a higher score, so I kept trying more and more specific words.").
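If you take the code route, the core loop is simple: embed the image once, embed each candidate word or caption, and keep the argmax of cosine similarity. Here is a minimal sketch of that ranking logic; the `toy_embed` function below is a stand-in for illustration only, and in a real run you would replace it with CLIP's image and text encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    # CLIP embeddings are compared after L2 normalization
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def best_match(image_embedding, candidates, embed_text):
    # Score every candidate string against the (fixed) image embedding
    scores = {c: cosine_similarity(image_embedding, embed_text(c)) for c in candidates}
    return max(scores, key=scores.get), scores

def toy_embed(text):
    # Stand-in encoder: a deterministic random vector per string.
    # Replace with CLIP's text/image towers in a real experiment.
    seed = int.from_bytes(text.encode(), "little") % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(512)

# Pretend this came from the image encoder
image_emb = toy_embed("a photo of a dog")
word, scores = best_match(image_emb, ["dog", "cat", "car"], toy_embed)
```

The same `best_match` loop works for all three bullets above; only the candidate strings change (bare words, "A photo of a W" templates, or free-form captions).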

PART 2:

Find the pair ("regular image", caption) (i.e., not an image of all white pixels or some other super weird artificial image) that gives the largest cosine similarity you can. You are welcome to do this with manual search (again, explain all parts of your process) or with automated searches. Those can be brute force or make use of online tools (e.g., you could generate the images and/or the captions).
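One simple automated strategy for the caption side is greedy local search: start from a short caption, repeatedly mutate or insert one word, and keep the change only if the score improves. The sketch below assumes a `score` callable that would, in practice, return cos(CLIP(image), CLIP(caption)) for your fixed image; the `toy_score` here is a hypothetical stand-in so the sketch runs on its own.

```python
import random

def hill_climb_caption(score, words, start, iters=200, seed=0):
    """Greedy local search over captions: try swapping or inserting one
    word at a time and keep the change whenever the score improves."""
    rng = random.Random(seed)
    best, best_s = list(start), score(" ".join(start))
    for _ in range(iters):
        cand = list(best)
        i = rng.randrange(len(cand) + 1)
        w = rng.choice(words)
        if i < len(cand) and rng.random() < 0.5:
            cand[i] = w          # mutate an existing word
        else:
            cand.insert(i, w)    # insert a new word
        s = score(" ".join(cand))
        if s > best_s:
            best, best_s = cand, s
    return " ".join(best), best_s

# Toy stand-in score: rewards overlap with a hidden target phrase and
# penalizes length. A real run would call CLIP on the image + candidate.
target = set("a photo of a golden retriever".split())
toy_score = lambda c: len(set(c.split()) & target) / (len(c.split()) + 3)

caption, s = hill_climb_caption(toy_score, list(target) + ["car", "blue"],
                                ["a", "photo"])
```

The same loop can drive the image side too if you plug in an image generator and score each (image, caption) pair, though brute-force enumeration over a caption list is often a reasonable first attempt.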

Here is a link to a Google Colab implementation of the most basic approach to computing CLIP features for an image and a text.
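For reference, the basic approach can also be sketched with the Hugging Face `transformers` port of CLIP (the linked Colab may use the original OpenAI package instead; `"openai/clip-vit-base-patch32"` is one commonly used checkpoint, and the first call downloads its weights):

```python
def clip_similarities(image_path, texts, model_name="openai/clip-vit-base-patch32"):
    """Return the cosine similarity between one image and each text.
    Requires: pip install torch transformers pillow."""
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # L2-normalize, then a dot product is exactly cosine similarity
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0).tolist()
```

A call like `clip_similarities("dog.jpg", ["a photo of a dog", "a photo of a cat"])` returns one score per candidate caption, which is all you need for both parts of the challenge.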