# CLIP-Style Text↔Image Similarity

**Track:** Creative ML & AI-in-the-Loop — Advanced Creative Coding — proposed (50)
**Framework / surface:** cross-framework
**Level:** Hard
**Prerequisites:** Semantic Search Mini-Index, Run a Model Client-Side

**In one line:** Embed text and images, rank by cosine, build a semantic sorter.

## Theory, aesthetics & inspiration

CLIP, introduced by Alec Radford and colleagues at OpenAI in 2021, trains an image encoder and a text encoder together until a picture and its caption land near each other in a shared embedding space. Closeness in that space is measured by cosine similarity, and that single number becomes a tool: rank a folder of images against the phrase "loneliness," sort a collection by how "baroque" it reads, build a search that understands description rather than filename. The same shared embedding later became the engine guiding text-to-image diffusion, but on its own it is a semantic lens: language used to measure pictures. Run through transformers.js, the entire sorter lives client-side, no server required.
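The ranking step can be sketched without any model at all. The example below assumes the text and image embeddings have already been computed (for instance by a CLIP model loaded through transformers.js) and uses three-dimensional toy vectors in their place. The names `Embedded`, `cosineSimilarity`, and `rankByText`, the file names, and the vectors are all hypothetical and chosen only for illustration.

```typescript
// Hypothetical shape for an image paired with its precomputed embedding.
type Embedded = { name: string; embedding: number[] };

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero-length vector as having no similarity to anything.
  return denom === 0 ? 0 : dot / denom;
}

// Score every image against the text query, most similar first.
function rankByText(
  query: number[],
  images: Embedded[],
): { name: string; score: number }[] {
  return images
    .map((img) => ({
      name: img.name,
      score: cosineSimilarity(query, img.embedding),
    }))
    .sort((x, y) => y.score - x.score);
}

// Toy stand-ins for real embeddings (assumption: real ones are far larger).
const queryEmbedding = [1, 0, 0]; // pretend this encodes "loneliness"
const images: Embedded[] = [
  { name: "crowd.jpg", embedding: [0, 1, 0] },
  { name: "empty-bench.jpg", embedding: [0.9, 0.1, 0] },
  { name: "street.jpg", embedding: [0.5, 0.5, 0] },
];

const ranked = rankByText(queryEmbedding, images);
console.log(ranked.map((r) => `${r.name}: ${r.score.toFixed(3)}`));
```

In a real sorter, only the two embedding sources change: the query vector comes from the text encoder and each image vector from the image encoder. Image embeddings can be computed once and cached, so re-sorting against a new phrase costs one text embedding plus a pass of dot products.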