
Embeddings

How AI translates meaning into mathematics

vectors · semantics · representation

Overview

An embedding is a dense, fixed-size vector of floating-point numbers that represents a piece of data—a word, sentence, image, or user—in a continuous mathematical space. The key insight is that semantic similarity becomes geometric proximity: words or concepts with related meanings end up near each other in the embedding space. Embeddings are the backbone of RAG systems, recommendation engines, and semantic search.
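The "similarity becomes proximity" idea can be sketched with toy vectors and cosine similarity. The 4-dimensional vectors below are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from a trained model), but the geometry is the same:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-D "embeddings" — hand-picked so related concepts point the same way
cat    = [0.90, 0.80, 0.10, 0.00]
kitten = [0.85, 0.90, 0.15, 0.05]
car    = [0.10, 0.00, 0.90, 0.80]

print(cosine_similarity(cat, kitten))  # high: related meanings, small angle
print(cosine_similarity(cat, car))     # low: unrelated, near-orthogonal
```

With a real embedding model, the vectors come from the model and the comparison logic is identical.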

Key Concepts

  • Word embeddings (Word2Vec, GloVe): map individual words to vectors; "king − man + woman ≈ queen"
  • Sentence/document embeddings: encode entire passages as single vectors for semantic similarity search
  • Image embeddings: CNNs or Vision Transformers encode visual content into the same vector space as text (CLIP)
  • Retrieval: cosine similarity or dot product between query and document embeddings ranks relevance
  • Dimensionality reduction (t-SNE, UMAP): projects high-dimensional embeddings to 2D for visualisation
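The retrieval step above (ranking documents against a query by cosine similarity) reduces to a sort. A minimal sketch, with made-up 3-D vectors standing in for real embeddings of the document titles and query:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical pre-computed document embeddings (toy 3-D vectors)
docs = {
    "refund policy":   [0.9, 0.1, 0.2],
    "return an item":  [0.7, 0.3, 0.1],
    "shipping times":  [0.2, 0.9, 0.1],
}

# Pretend this is the embedding of "how do I get my money back?"
query = [0.85, 0.15, 0.25]

# Rank documents by similarity to the query, most relevant first
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # refund-related docs outrank the shipping one
```

In production the per-document scores come from a vector database rather than an in-memory loop, but the ranking criterion is the same.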

Key Facts

  • Word2Vec (Mikolov et al., Google, 2013) demonstrated that word arithmetic works in vector space
  • OpenAI's text-embedding-3-large produces 3,072-dimensional vectors
  • Vector databases (Pinecone, Weaviate, pgvector) are purpose-built to store and query billions of embeddings at low latency
  • CLIP (OpenAI, 2021) aligned image and text embeddings so you can search images with natural language queries
  • The curse of dimensionality means nearest-neighbour search in millions of vectors requires approximate algorithms (HNSW, IVF)
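To see why approximate algorithms are needed, consider the exact baseline they replace: brute-force search scores every stored vector against the query, which is O(N × d) per query. A sketch with random unit vectors standing in for a stored index (sizes are arbitrary, chosen small enough to run quickly):

```python
import math
import random

random.seed(0)
DIM, N = 64, 10_000  # toy sizes; real indexes hold millions to billions

def random_unit(dim):
    """A random vector on the unit sphere, a stand-in for a stored embedding."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

index = [random_unit(DIM) for _ in range(N)]
query = random_unit(DIM)

def dot(a, b):
    # For unit vectors, dot product equals cosine similarity
    return sum(x * y for x, y in zip(a, b))

# Exact nearest neighbour: visit all N vectors. ANN structures such as
# HNSW or IVF trade exactness for visiting only a small fraction of them.
best = max(range(N), key=lambda i: dot(query, index[i]))
print(best, dot(query, index[best]))
```

Graph- and cluster-based indexes (HNSW, IVF) cut the number of candidates examined per query by orders of magnitude, at the cost of occasionally missing the true nearest neighbour.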