Mastering Agentic AI with Java: Live Course

Vector Embeddings and Tokens


AI models operate only on numbers; they cannot process text, images, or audio directly. To bridge this gap, input text is first broken into tokens, which are then converted into vector embeddings: numerical representations that capture meaning, context, and relationships between words.

What Are Vector Embeddings?

Vector embeddings are a way to represent words (or text) as numerical values in a high-dimensional space. They allow AI models to understand relationships, context, and semantic meaning between words.

Core Idea: Words with similar meanings are mapped close together in vector space, while dissimilar words remain far apart.

Example

  • "cat", "kitty", "feline" → clustered close together
  • "cat" and "automobile" → far apart in vector space

This enables searching for "kitty" and finding results about "cat" even without an exact text match.
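
Closeness in vector space is commonly measured with cosine similarity. The sketch below is illustrative only: the three-dimensional vectors are invented values rather than output from a real embedding model, and the class name is hypothetical.

```java
// Toy illustration: cosine similarity between hand-picked vectors.
// The numbers are invented for this example; real embeddings have
// hundreds or thousands of dimensions and come from a model.
final class CosineSimilarityDemo {

    // Hypothetical 3-dimensional "embeddings".
    static final double[] CAT = {0.90, 0.80, 0.10};
    static final double[] KITTY = {0.85, 0.75, 0.15};
    static final double[] AUTOMOBILE = {0.10, 0.20, 0.95};

    // Cosine similarity = dot(a, b) / (|a| * |b|); ranges from -1 to 1.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.printf("cat vs kitty:      %.3f%n", cosine(CAT, KITTY));
        System.out.printf("cat vs automobile: %.3f%n", cosine(CAT, AUTOMOBILE));
    }
}
```

With these made-up values, "cat" and "kitty" score close to 1, while "cat" and "automobile" score much lower.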


How AI Processes Text

The pipeline from text input to AI response:


  1. Tokenization - Text is broken into tokens (subword units)
  2. Embedding - Each token is converted into a numerical vector
  3. Processing - The model operates on these vectors to generate output

Tokenization

What Are Tokens?

A token is the smallest unit of text processed by an AI model.

A token is not always a complete word.

Depending on the tokenizer, a token can be:

  • A word
  • Part of a word
  • A character
  • A punctuation symbol

Understanding Tokenization

Tokenization is the process of breaking text into smaller units called tokens.

Example:

Artificial Intelligence

May be tokenized as:

["Artificial", "Intelligence"]

or

["Art", "ificial", "Intelligence"]

Example:

googling

May become:

["go", "ogling"]
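
To make the splitting concrete, here is a minimal sketch of a greedy longest-match subword tokenizer. It is a toy, not the algorithm used by any production model (those typically rely on learned schemes such as byte-pair encoding), and the vocabulary and class name are assumptions chosen to reproduce the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy greedy longest-match subword tokenizer.
// NOT a production tokenizer; the vocabulary is made up for illustration.
final class ToySubwordTokenizer {

    private final Set<String> vocabulary;

    ToySubwordTokenizer(Set<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    // At each position, take the longest vocabulary entry that matches;
    // fall back to a single character when nothing matches.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = text.length();
            while (end > start + 1 && !vocabulary.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToySubwordTokenizer tokenizer = new ToySubwordTokenizer(
                Set.of("go", "ogling", "Art", "ificial", "Intelligence", " "));
        System.out.println(tokenizer.tokenize("googling"));
        System.out.println(tokenizer.tokenize("Artificial Intelligence"));
    }
}
```

Changing the vocabulary changes the split, which is one reason the same text can yield different token counts under different tokenizers.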

Key Facts

| Aspect | Detail |
|---|---|
| Token-to-word ratio | 1 token ≈ 3/4 of a word |
| 100 tokens | ≈ 75 words on average |
| Word vs token count | A 7-word sentence can produce ~11 tokens |
| Cost implication | More tokens = higher API cost |

How Tokenization Works

  • Common words may be a single token (e.g., "hello" → 1 token)
  • Unfamiliar or compound words are split into parts (e.g., "googling" → "go" + "ogling")
  • Different models use different tokenization algorithms (GPT-4, for example, has its own tokenizer), so the same text can produce different token counts

Why It Matters

  • Cost: OpenAI charges per token (both input and output)
  • Context limits: Models have maximum token limits per request
  • Optimization: Understanding tokenization helps optimize prompts for efficiency
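
The cost and context-limit points can be turned into a quick budget check. This is a rough sketch: it uses the 3/4-word rule of thumb rather than a real tokenizer, the price and context limit are placeholder values and not actual rates, and it assumes the limit covers input and output tokens together.

```java
// Back-of-the-envelope token, cost, and context-window estimates.
// Uses the rule of thumb that 1 token is about 3/4 of a word.
// Prices and limits passed in are placeholders, not real rates.
final class TokenBudget {

    static final double WORDS_PER_TOKEN = 0.75;

    // Estimate tokens from a word count, rounding up.
    static long estimateTokens(long wordCount) {
        return (long) Math.ceil(wordCount / WORDS_PER_TOKEN);
    }

    // Cost = tokens / 1,000,000 * price per million tokens.
    static double estimateCost(long tokens, double pricePerMillionTokens) {
        return tokens / 1_000_000.0 * pricePerMillionTokens;
    }

    // True when the prompt plus the expected output fit within the limit.
    static boolean fitsContext(long inputTokens, long outputTokens, long contextLimit) {
        return inputTokens + outputTokens <= contextLimit;
    }

    public static void main(String[] args) {
        long inputTokens = estimateTokens(75);
        double placeholderPrice = 1.00; // hypothetical price per million tokens
        System.out.println("Estimated tokens for 75 words: " + inputTokens);
        System.out.println("Estimated cost: " + estimateCost(inputTokens, placeholderPrice));
        System.out.println("Fits a 128-token limit with 50 output tokens: "
                + fitsContext(inputTokens, 50, 128));
    }
}
```

For exact counts, use the tokenizer that matches the target model; the estimate here is only suitable for rough planning.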

Vector Embeddings

Why Not Single Numbers?

Representing each word as a single number is insufficient because:

  • Vocabularies are massive (hundreds of thousands of words)
  • Multiple languages exist
  • Words have nuanced relationships that a single dimension cannot capture

Solution: Use multi-dimensional vectors (e.g., 1536 or 3072 dimensions) to represent each token, capturing rich semantic relationships.

Dimensionality

The number of dimensions determines how much nuance and context the embedding can capture.

| Model | Dimensions | Use Case |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Cost-effective, general purpose |
| OpenAI text-embedding-3-large | 3072 | Higher accuracy, more detail |
| Other models (e.g., some open-source) | 1024 | Varies by provider |

Higher dimensionality generally means better accuracy, but also higher storage cost and more computation.
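
The storage side of this trade-off is simple arithmetic. The sketch below assumes each dimension is stored as a 32-bit float and ignores any index or metadata overhead a vector store would add; the class and method names are hypothetical.

```java
// Rough storage footprint of embeddings, assuming 32-bit floats
// and no index or metadata overhead.
final class EmbeddingStorage {

    static final int BYTES_PER_FLOAT32 = Float.BYTES; // 4 bytes

    static long bytesPerVector(int dimensions) {
        return (long) dimensions * BYTES_PER_FLOAT32;
    }

    static long totalBytes(int dimensions, long vectorCount) {
        return bytesPerVector(dimensions) * vectorCount;
    }

    public static void main(String[] args) {
        for (int dims : new int[] {1024, 1536, 3072}) {
            double totalMiB = totalBytes(dims, 1_000_000L) / (1024.0 * 1024.0);
            System.out.printf("%d dims: %d bytes per vector, about %.0f MiB per million vectors%n",
                    dims, bytesPerVector(dims), totalMiB);
        }
    }
}
```

Under these assumptions a 3072-dimension vector takes twice the space of a 1536-dimension one, and similarity computations scale in the same proportion.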

Dimensionality Reduction

High-dimensional embeddings can be reduced to lower dimensions (e.g., 2D or 3D) for visualization and analysis purposes, though some information is lost.


Semantic Understanding with Embeddings

Context-Aware Relationships

Embeddings capture that words can have different meanings based on context:

  • "Python" (programming language) vs "Python" (snake) → different vector positions based on surrounding context

Word Grouping

Words with similar meanings appear closer together.

Words with different meanings appear farther apart.

Example: animals cluster together:

Dog
Cat
Lion
Tiger

Technology terms form a different cluster:

Software
Code
Programming
Algorithm

Search Example

User searches:

kitty

Document contains:

cat

Traditional search:

No match

Embedding-based search:

Match found

because the vectors are semantically similar.
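
The contrast between the two search styles can be sketched as follows. The document vectors are invented three-dimensional values standing in for real embeddings, and the class and method names are hypothetical; a real system would obtain the vectors from an embedding model and usually query a vector store.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy comparison of exact keyword lookup and embedding-based lookup.
// All vectors are made-up stand-ins for real embeddings.
final class ToySemanticSearch {

    // Hypothetical document embeddings keyed by document text.
    static Map<String, double[]> sampleDocuments() {
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("cat", new double[] {0.90, 0.80, 0.10});
        docs.put("automobile", new double[] {0.10, 0.20, 0.95});
        docs.put("algorithm", new double[] {0.05, 0.30, 0.60});
        return docs;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the document whose vector is most similar to the query vector.
    static String mostSimilar(double[] query, Map<String, double[]> documents) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : documents.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> docs = sampleDocuments();
        double[] kitty = {0.85, 0.75, 0.15}; // hypothetical embedding of the query "kitty"
        System.out.println("Keyword match: " + docs.containsKey("kitty"));
        System.out.println("Semantic match: " + mostSimilar(kitty, docs));
    }
}
```

The keyword lookup finds nothing because no document is literally "kitty", while the similarity ranking selects "cat".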

Analogical Reasoning

Vector embeddings enable mathematical operations on word meanings:

king - man + woman ≈ queen

This demonstrates that embeddings capture gender relationships, hierarchies, and other semantic patterns through arithmetic operations on vectors.
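
The arithmetic can be illustrated with a deliberately tiny example. The two-dimensional vectors below are invented so that one axis loosely stands for royalty and the other for gender; real embeddings are learned, have far more dimensions, and only approximate this behavior. Excluding the input words from the candidates is a common convention, adopted here as an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy analogy solver: finds the word closest to (a - b + c).
// Vectors are hand-made: axis 0 ~ "royalty", axis 1 ~ "gender".
final class ToyAnalogy {

    static Map<String, double[]> vocabulary() {
        Map<String, double[]> vocab = new LinkedHashMap<>();
        vocab.put("king", new double[] {1.0, 1.0});
        vocab.put("queen", new double[] {1.0, -1.0});
        vocab.put("man", new double[] {0.0, 1.0});
        vocab.put("woman", new double[] {0.0, -1.0});
        vocab.put("apple", new double[] {-0.5, 0.0});
        return vocab;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Computes a - b + c and returns the most similar word, skipping the inputs.
    static String solve(String a, String b, String c, Map<String, double[]> vocab) {
        double[] va = vocab.get(a), vb = vocab.get(b), vc = vocab.get(c);
        double[] target = new double[va.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = va[i] - vb[i] + vc[i];
        }
        List<String> inputs = List.of(a, b, c);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : vocab.entrySet()) {
            if (inputs.contains(entry.getKey())) {
                continue;
            }
            double score = cosine(target, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("king - man + woman is closest to: "
                + solve("king", "man", "woman", vocabulary()));
    }
}
```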


Applications of Embeddings

| Application | Description |
|---|---|
| Search | Rank results by semantic relevance to a query |
| Clustering | Group text strings by similarity |
| Recommendations | Suggest items with related text content |
| Anomaly Detection | Identify outliers with low relatedness |
| Diversity Measurement | Analyze similarity distributions |
| Classification | Classify text by most similar label |

Creating Embeddings via API

You can generate embeddings using API clients (curl, Insomnia, Postman) or programmatically (Python, Java, JavaScript).

API Request Structure

{
  "input": "Your text here",
  "model": "text-embedding-3-large"
}

Required Components

| Component | Purpose |
|---|---|
| Endpoint | Embedding API URL |
| Authorization | API Key |
| Input Text | Text to convert |
| Model | Embedding model |
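
A minimal Java sketch of assembling these four components with the JDK's built-in `java.net.http` API (Java 11 or later) might look like this. It assumes OpenAI's embeddings endpoint and a Bearer-token header; the class name, helper methods, and the `YOUR_API_KEY` placeholder are illustrative, and the JSON escaping handles only quotes and backslashes. The example builds the request without sending it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds (but does not send) an embeddings request.
// Endpoint and header format are assumptions based on OpenAI's API;
// other providers use different URLs and authentication schemes.
final class EmbeddingRequestSketch {

    static final String ENDPOINT = "https://api.openai.com/v1/embeddings";

    // Minimal escaping for this sketch: backslash and double quote only.
    // Real code should build JSON with a proper JSON library.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String buildBody(String input, String model) {
        return "{\"input\": \"" + escapeJson(input) + "\", \"model\": \""
                + escapeJson(model) + "\"}";
    }

    static HttpRequest buildRequest(String apiKey, String input, String model) {
        return HttpRequest.newBuilder()
                .uri(URI.create(ENDPOINT))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(input, model)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder key: nothing is sent over the network here.
        HttpRequest request =
                buildRequest("YOUR_API_KEY", "Your text here", "text-embedding-3-large");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(buildBody("Your text here", "text-embedding-3-large"));
    }
}
```

To actually call the API, the request could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, with the key read from an environment variable rather than hard-coded.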

Response

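The API returns a vector (an array of floating-point numbers) that represents the input text; its length equals the model's dimensionality. The response below is simplified: in the OpenAI API, for instance, the array is nested under data[0].embedding, alongside model and usage fields.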

{
  "embedding": [
    0.123,
    -0.456,
    0.789,
    ...
  ]
}

The array length depends on the selected model: for example, 1536 numbers for text-embedding-3-small or 3072 numbers for text-embedding-3-large.
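
Once the response arrives, the vector has to be pulled out of the JSON. The sketch below is a deliberately naive string-based extraction to keep the example dependency-free; it assumes the first array after the "embedding" key contains only plain numbers. The class and method names are hypothetical, and production code should use a JSON library instead.

```java
// Naive extraction of an "embedding" array from a JSON response string.
// Assumes the first '[' after the "embedding" text opens an array of plain numbers.
// A sketch only: prefer a real JSON parser in production code.
final class EmbeddingResponseSketch {

    static double[] extractEmbedding(String json) {
        int key = json.indexOf("\"embedding\"");
        if (key < 0) {
            throw new IllegalArgumentException("No \"embedding\" field found");
        }
        int open = json.indexOf('[', key);
        int close = open < 0 ? -1 : json.indexOf(']', open);
        if (open < 0 || close < 0) {
            throw new IllegalArgumentException("Malformed embedding array");
        }
        String body = json.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return new double[0];
        }
        String[] parts = body.split(",");
        double[] vector = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            vector[i] = Double.parseDouble(parts[i].trim());
        }
        return vector;
    }

    public static void main(String[] args) {
        String sample = "{\"embedding\": [0.123, -0.456, 0.789]}";
        double[] vector = extractEmbedding(sample);
        System.out.println("Dimensions: " + vector.length);
    }
}
```

Checking `vector.length` against the expected dimensionality of the chosen model is a cheap sanity check before storing the vector.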




Tokens vs Embeddings

| Aspect | Tokens | Embeddings |
|---|---|---|
| Purpose | Break text into units | Represent meaning numerically |
| Output | Text fragments | Numerical vectors |
| Used For | Model input processing | Semantic understanding |
| Example | "Artificial" | [0.23, -0.41, 0.88, ...] |
| Impact | Cost and context limits | Search and reasoning quality |

Summary

  • AI models cannot process raw text directly; text must first be converted into numerical representations.

  • Tokenization breaks text into smaller units called tokens, which are the fundamental inputs consumed by language models.

  • Token count affects API cost, processing time, and context window limitations.

  • Vector embeddings convert tokens, words, or entire documents into high-dimensional numerical vectors that capture semantic meaning.

  • Similar concepts are positioned close together in vector space, enabling semantic search and intelligent retrieval.

  • Higher-dimensional embeddings generally provide richer semantic understanding but require additional storage and computation.

  • Embeddings power critical AI capabilities such as semantic search, recommendations, clustering, anomaly detection, and Retrieval-Augmented Generation (RAG).

  • Understanding tokens and embeddings is fundamental for designing efficient, scalable, and intelligent AI-powered systems.

Official Document Reference: OpenAI Embeddings Guide

Written By: Muskan Garg
