# CLIP-Style Text↔Image Similarity

**Track:** Creative ML & AI-in-the-Loop - Advanced Creative Coding - proposed (50)

**Framework / surface:** cross-framework

**Level:** Hard

**Prerequisites:** Semantic Search Mini-Index, Run a Model Client-Side

**In one line:** Embed text and images, rank by cosine, build a semantic sorter.

## Theory, aesthetics & inspiration

CLIP, introduced by Alec Radford and colleagues at OpenAI in 2021, trains an image encoder and a text encoder jointly so that a picture and its caption land near each other in a shared embedding space. Closeness in that space is measured by cosine similarity, and that single number becomes a tool: rank a folder of images against the phrase "loneliness," sort a collection by how "baroque" it reads, or build a search that understands description rather than filename. The shared embedding is the engine that later guided text-to-image diffusion, but on its own it is a semantic lens: language used to measure pictures. Run through transformers.js, the entire sorter lives client-side, with no server required.
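
The ranking step described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the type `Embedded` and the functions `cosineSimilarity` and `rankBySimilarity` are hypothetical names, and the three-dimensional vectors are toy stand-ins for real CLIP embeddings, which would come from the text and image encoders (for example, loaded through transformers.js) and have far more dimensions.

```typescript
// Hypothetical shape for an embedded image: an id plus its embedding vector.
// In a real sorter the vector would come from a CLIP image encoder (assumption).
type Embedded = { id: string; vector: number[] };

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every image against the query embedding and sort best match first.
function rankBySimilarity(
  query: number[],
  images: Embedded[]
): { id: string; score: number }[] {
  return images
    .map(({ id, vector }) => ({ id, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score);
}

// Toy data: pretend this is the text embedding of "loneliness".
const query = [1, 0, 0];
const images: Embedded[] = [
  { id: "crowded-market", vector: [0, 1, 0] },
  { id: "empty-street", vector: [0.9, 0.1, 0] },
  { id: "lone-tree", vector: [0.5, 0.5, 0] },
];

const ranked = rankBySimilarity(query, images);
console.log(ranked.map((r) => `${r.id}: ${r.score.toFixed(3)}`));
```

Because cosine similarity depends only on direction, not vector length, it does not matter whether the encoder outputs are normalized beforehand. If they are already unit-length, the score reduces to a plain dot product.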