Embedding & Vector Stores: Notes

What are Embeddings?

  • Low-dimensional numerical representation of almost anything (text, audio, images, video).
  • A lossy compression technique: mapping complex data into simpler space while retaining semantic meaning.
  • Enables pattern recognition: embeddings capture semantic similarity and underlying meaning.
  • Useful for efficient large-scale processing.
  • Think of it as: “encoding meaning in vector form.”
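
A toy sketch of the idea, using made-up 4-dimensional vectors (real models produce hundreds or thousands of dimensions): semantically similar items end up close together, which cosine similarity can measure.

```python
import numpy as np

# Invented 4-D vectors purely for illustration; real embedding models learn these.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0, 0.3]),
    "dog": np.array([0.8, 0.2, 0.1, 0.4]),
    "car": np.array([0.1, 0.9, 0.7, 0.0]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high -> similar meaning
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower -> less related
```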

Use Cases

  • Retrieval & Recommendation systems.
  • Search engines: precompute embeddings for billions of pages, convert search query to embedding, find nearest neighbours in vector space.
  • Joint Embeddings:
    • Handle multi-modal data (e.g., text & images).
    • Map different modalities into the same embedding space for cross-modal matching.
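
A minimal sketch of the search-engine flow in NumPy, with a stand-in embed() function in place of a real embedding model (the vectors here are random, so only the mechanics are illustrative):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: deterministic pseudo-random vector per text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

# Precompute document embeddings offline, normalised for cosine similarity.
doc_texts = ["how to train a dog", "best pasta recipes", "puppy obedience tips"]
doc_vecs = np.stack([embed(t) for t in doc_texts])
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(query: str, k: int = 2):
    q = embed(query)
    q /= np.linalg.norm(q)
    scores = doc_vecs @ q                  # cosine similarity against every document
    top = np.argsort(-scores)[:k]          # indices of the k nearest neighbours
    return [(doc_texts[i], float(scores[i])) for i in top]

print(search("teaching a puppy to sit"))   # with a real model, the dog documents would rank first
```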

Evaluation Metrics

How effective is your embedding?

  • Precision: Of retrieved results, how many were relevant?
  • Recall: Of all relevant results, how many did we retrieve?
  • Precision@k / Recall@k: Focused on top-k results.
  • NDCG (Normalized Discounted Cumulative Gain): Prioritizes relevance ordering; higher is better.
  • Evaluate and compare different embedding models based on these.
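
Sketches of these metrics in plain Python; the document IDs and relevance grades below are invented for illustration.

```python
import math

def precision_at_k(retrieved, relevant, k):
    # Of the top-k retrieved results, what fraction is relevant?
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Of all relevant documents, what fraction appears in the top-k?
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    # relevance: dict of doc id -> graded relevance (0 = irrelevant).
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(retrieved, relevant, 3))                 # 1/3 of top-3 are relevant
print(recall_at_k(retrieved, relevant, 3))                    # 1/3 of relevant docs found in top-3
print(ndcg_at_k(retrieved, {"d1": 3, "d2": 2, "d5": 1}, 3))   # rewards relevant docs ranked higher
```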

Embeddings in RAG (Retrieval-Augmented Generation)

  • Embeddings help retrieve relevant knowledge base data.
  • Used to boost LLM prompts with accurate context.
  • Process:
    • Create index.
    • Chunk documents and store in a vector DB.
    • Retrieve via semantic search, not just keyword match.
  • Meaning-based search wins over keyword-based.
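
A minimal sketch of the index-then-retrieve flow, with an in-memory list standing in for the vector DB and a fake embed() in place of a real embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for an embedding model; swap in a real one.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def chunk(document: str, size: int = 200):
    # Naive fixed-size chunking; real pipelines often split on sentences or sections.
    return [document[i:i + size] for i in range(0, len(document), size)]

# Index phase: chunk the knowledge base and store (chunk, vector) pairs.
kb = ["...long policy document...", "...long product manual..."]
index = [(c, embed(c)) for doc in kb for c in chunk(doc)]

# Query phase: embed the question, retrieve the closest chunks, build an augmented prompt.
def retrieve(question: str, k: int = 2):
    q = embed(question)
    scored = sorted(index, key=lambda item: -float(item[1] @ q))
    return [c for c, _ in scored[:k]]

question = "What is the return policy?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```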

Creating Labeled Data

  • LLMs can generate synthetic questions to scale training data.
  • Useful when labeled examples are scarce.
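
A rough sketch of the idea; llm_complete below is a hypothetical helper standing in for whatever LLM API is available.

```python
def llm_complete(prompt: str) -> str:
    # Hypothetical wrapper around an LLM provider; implement with your own client.
    raise NotImplementedError("call your LLM provider here")

def synthetic_questions(chunk: str, n: int = 3) -> list[str]:
    prompt = f"Write {n} questions that the following passage answers, one per line:\n\n{chunk}"
    return [q.strip() for q in llm_complete(prompt).splitlines() if q.strip()]

# Each (question, chunk) pair can serve as a labeled positive example for
# training or evaluating a retrieval/embedding model.
```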

Types of Embeddings

🔤 Text Embeddings

  • Convert words/sentences to vectors.

Tokenization

  • Text → tokens → numerical IDs.
  • Techniques for representing tokens numerically:
    • One-hot encoding (sparse)
    • Word embeddings (dense, learned)
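
A toy illustration of the pipeline (whitespace tokenization here; production models use subword tokenizers such as BPE or WordPiece):

```python
import numpy as np

text = "the cat sat on the mat"
tokens = text.split()                       # toy whitespace tokenizer
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]            # text -> tokens -> numerical IDs

# One-hot encoding: each ID becomes a sparse vector with a single 1.
one_hot = np.eye(len(vocab))[ids]
print(ids)
print(one_hot.shape)                        # (num_tokens, vocab_size)
```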

Word Embeddings

  • Word2Vec: Meaning from context.
    • CBOW (Continuous Bag of Words): predict target from context.
    • Skip-gram: predict context from target.
  • GloVe: Co-occurrence matrix + factorization.
  • Swivel: Efficient for massive corpora.
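
A short Word2Vec sketch, assuming the gensim library is available; sg=1 selects skip-gram, sg=0 selects CBOW.

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real training uses millions of sentences.
sentences = [
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
    ["dogs", "and", "cats", "are", "pets"],
]
model = Word2Vec(sentences, vector_size=50, window=3, sg=1, min_count=1, epochs=50)

vec = model.wv["dog"]                          # 50-dimensional word embedding
print(model.wv.most_similar("dog", topn=3))    # nearest words by cosine similarity
```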

Document Embeddings

  • BOW (Bag of Words) + statistical models:
    • LSA (Latent Semantic Analysis)
    • LDA (Latent Dirichlet Allocation)
    • TF-IDF
  • Doc2Vec: Incorporates word order.
  • BERT: Contextualized embeddings.
  • Multi-vector Embeddings: Multiple views into one document.
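
For contrast with learned models, a TF-IDF sketch using scikit-learn (assumed installed): each document becomes a sparse, weighted word-count vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "embeddings map text to vectors",
    "vector stores index embeddings for fast search",
    "pasta recipes for a quick dinner",
]
X = TfidfVectorizer().fit_transform(docs)   # (num_docs, vocab_size) sparse matrix
print(cosine_similarity(X[0], X[1]))        # related documents score higher
print(cosine_similarity(X[0], X[2]))        # unrelated documents score near zero
```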

🧮 Structured Data Embedding

  • Example: Table → PCA (Principal Component Analysis).
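
A small sketch with scikit-learn: each table row becomes a low-dimensional vector (the numbers below are invented; in practice, features are usually standardised first).

```python
import numpy as np
from sklearn.decomposition import PCA

rows = np.array([
    # e.g. [age, income, num_purchases, avg_basket]
    [25, 40_000, 12, 35.0],
    [31, 52_000, 30, 42.5],
    [47, 90_000,  5, 80.0],
    [52, 95_000,  4, 78.0],
])
embeddings = PCA(n_components=2).fit_transform(rows)   # each row -> 2-D vector
print(embeddings.shape)                                # (4, 2); similar rows land near each other
```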

๐ŸŒ Graph Embedding

  • Encodes relational/social structures into vector space.
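
One common approach, sketched DeepWalk-style with networkx and gensim (both assumed installed): random walks over the graph are treated as sentences, and Word2Vec embeds the nodes.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()   # small built-in social graph

def random_walk(node, length=10):
    walk = [str(node)]
    for _ in range(length - 1):
        node = random.choice(list(G.neighbors(node)))
        walk.append(str(node))
    return walk

walks = [random_walk(n) for n in G.nodes() for _ in range(20)]
model = Word2Vec(walks, vector_size=32, window=4, sg=1, min_count=1, epochs=10)

node_vec = model.wv["0"]                     # embedding for node 0
print(model.wv.most_similar("0", topn=3))    # structurally nearby nodes
```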

Training Embedding Models

  • Dual Encoder Architecture
  • Pretraining on large corpora
  • Fine-tuning on domain-specific datasets
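
A minimal dual-encoder sketch in PyTorch (assumed installed): a query tower and a document tower are trained so that matching (query, document) pairs score higher than the other documents in the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size=10_000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)   # toy text encoder: mean of token embeddings
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return F.normalize(self.proj(self.emb(token_ids)), dim=-1)

query_tower, doc_tower = Encoder(), Encoder()
optimizer = torch.optim.Adam(
    list(query_tower.parameters()) + list(doc_tower.parameters()), lr=1e-3
)

# Fake batch of token IDs: row i of `queries` matches row i of `docs`.
queries = torch.randint(0, 10_000, (8, 12))
docs = torch.randint(0, 10_000, (8, 64))

q, d = query_tower(queries), doc_tower(docs)
scores = q @ d.T                          # (batch, batch) similarity matrix
labels = torch.arange(len(q))             # diagonal entries are the positive pairs
loss = F.cross_entropy(scores, labels)    # in-batch softmax contrastive loss
loss.backward()
optimizer.step()
```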

Vector Search

  • Find the stored embeddings closest to a query embedding (nearest-neighbour search).
  • More powerful than plain text search.

Techniques

  • LSH (Locality-Sensitive Hashing): like assigning zip codes to vectors.
  • HNSW (Hierarchical Navigable Small World)
  • ScaNN (Scalable Nearest Neighbors)
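
A toy random-hyperplane LSH sketch in NumPy, following the "zip code" intuition: vectors that hash to the same bit signature share a bucket, and only that bucket is scanned at query time.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 64, 8
planes = rng.normal(size=(n_planes, dim))        # random hyperplanes define the hash

def lsh_signature(v):
    return tuple((planes @ v > 0).astype(int))   # one bit per hyperplane

vectors = rng.normal(size=(1000, dim))
buckets = defaultdict(list)
for i, v in enumerate(vectors):
    buckets[lsh_signature(v)].append(i)

query = rng.normal(size=dim)
candidates = buckets[lsh_signature(query)]       # approximate neighbours: same bucket only
print(len(candidates), "candidates instead of 1000")
```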

โš ๏ธ Note: Embeddings arenโ€™t always perfect.
Sometimes, traditional text-based methods are better depending on the task.