Vector Tables 101: Understanding vector and PGVector Once and for All

You've definitely heard about pgvector, vectors, and how all of this sounds pretty complicated. Well... it actually is! But you don't need to know how to build a car engine from scratch just to drive it. And in case you've spent all this time playing questionable games on a PC-88: vectors are heavily used by artificial intelligence, so this article is obviously about that.

In this post, I'll try to explain, in an extremely didactic and practical way, what on earth this "vector" thing is and why it will become increasingly important in the future. To understand pgvector, you don't need to understand linear algebra or machine learning.

The basics of `vector`

A vector is basically an ordered array of numbers:

[1, 2, 3, 4, 5]
[0.002, 0.887, -0.134, 0.552, ...]

Nothing new so far, but mathematically, we can imagine this vector as a point in space:

With 2 numbers, it exists in a 2D space
With 3 numbers, it exists in a 3D space
With 768 numbers, it exists in a 768-dimensional space

And so on. This is where things get tricky.

Why not just use `TEXT`?

Because plain text doesn't allow for search by meaning, and that's where vector shines. Imagine with me that your database has:

"How to configure PostgreSQL"

And the user searches for:

"Postgres database setup tutorial"

Embeddings position (or should position) these two phrases close to each other in the vector space. Even though they share almost no words, that proximity lets us state:

"These texts are talking about practically the same thing."

With a plain TEXT column, you can only match characters or words (LIKE, full-text search), not meaning. Given the amount of stored information, making this correlation in a practical and fast way would be unfeasible.

Understanding Embedding and Proximity

An embedding is the numerical representation of some stored information. This information can be text, image, audio, video... whatever is necessary. The AI model will take this content and convert it entirely into a vector. For example: the word "cat" will become [0.12, -0.44, 0.88, ...] in the database, while "dog" might become [0.09, -0.41, 0.85, ...].

What matters here aren't the numbers themselves, but the relative position of these vectors. And how does the AI understand this? Well, my dear reader, it doesn't! The model learns to position semantically similar concepts in nearby regions of the vector space. It's not magic, it's not witchcraft—it's extremely complex mathematics that I don't have (at the time of writing) the time or the will to meddle with.

Visually, and in an extremely simplified way, picture a 2D plane where related concepts such as "cat" and "dog" sit close together, while unrelated concepts sit far apart.

Now take this example and multiply it by millions. That's how it's stored.

How is Proximity measured?

Since everything is a vector, we need to understand what is closest to what. There are three common metrics used—this is the only part of this post where you'll find different (or not so different) technical terms:

Euclidean Distance measures traditional straight-line geometric distance; the smaller the distance, the more similar the vectors are.
Inner Product measures the alignment between vectors, taking both direction and magnitude into account; the larger the value, the more similar. It is commonly used in search and recommendation systems.
Cosine Similarity measures the angle between vectors and ignores their magnitude; semantically similar embeddings tend to point in similar directions, giving a value close to 1.
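To make the three metrics concrete, here is a minimal sketch in plain Python. The function names and the tiny 2D vectors are made up for illustration; real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Sum of element-wise products: larger means more aligned.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Inner product divided by both lengths: only the angle matters.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0]  # perpendicular to a

print(euclidean_distance(a, b))  # 1.0
print(inner_product(a, b))       # 2.0
print(cosine_similarity(a, b))   # 1.0
print(cosine_similarity(a, c))   # 0.0
```

Notice that a and b are 1.0 apart by Euclidean distance, yet their cosine similarity is a perfect 1.0, because they point in exactly the same direction.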

The Importance of Dimensionality

You already know that dimensionality is defined by the number of values in each array: [1, 2] has two dimensions, [1, 2, 3, 4] has four. Modern embeddings can exceed 3,000 dimensions.

The higher the dimensionality, the greater the capacity to represent semantic nuances. On the other hand, dimensionality also drives the requirements for storage, memory, and operational cost, since every extra dimension is one more number to store and compare for every row. Real-world examples: OpenAI's text-embedding-3-small has 1,536 dimensions; text-embedding-3-large has up to 3,072 dimensions; several open-source models have 384 or 768 dimensions.
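As a rough back-of-the-envelope sketch of what that cost looks like: the pgvector documentation states that each stored vector takes 4 * dimensions + 8 bytes. The helper names below are hypothetical, and the estimate ignores row overhead, other columns, and indexes, so treat the numbers as a lower bound.

```python
def vector_bytes(dimensions):
    # Per the pgvector docs: 4 bytes per dimension plus 8 bytes of overhead.
    return 4 * dimensions + 8

def embeddings_megabytes(rows, dimensions):
    # Embedding column only; everything else in the table is ignored here.
    return rows * vector_bytes(dimensions) / (1024 * 1024)

for dims in (384, 768, 1536, 3072):
    mb = embeddings_megabytes(1_000_000, dims)
    print(f"{dims} dims: {vector_bytes(dims)} bytes per vector, "
          f"about {mb:,.0f} MB per million rows")
```

Going from 384 to 3,072 dimensions makes every stored vector roughly eight times larger, which is worth keeping in mind before reaching for the biggest model available.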

Normalization

Normalization rescales each vector so that its length (magnitude) equals 1, without changing its direction. For example, [67, 0] would become [1, 0], and [42, 0] would also become [1, 0]. This approach is widely used in current embedding pipelines, making cosine similarity more efficient and predictable: once every vector has length 1, cosine similarity is just the inner product, so we compare only the vector's direction, not its magnitude. Simple cases can have simple solutions.
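A minimal sketch of normalization in plain Python, assuming a hypothetical helper called normalize. It divides every component by the vector's length, and the last lines check that the inner product of two normalized vectors matches their cosine similarity.

```python
import math

def normalize(v):
    # Divide each component by the vector's length so the result has length 1.
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

print(normalize([67, 0]))  # [1.0, 0.0]
print(normalize([42, 0]))  # [1.0, 0.0]
print(normalize([3, 4]))   # [0.6, 0.8]

# For unit-length vectors, the inner product IS the cosine similarity.
u = normalize([3, 4])
w = normalize([4, 3])
dot = sum(x * y for x, y in zip(u, w))
print(dot)  # approximately 0.96
```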

No, you don't need to be a math whiz to learn about vectors.

Vector Search

In a real-world situation, imagine the table:

id | content                     | embedding
1  | How to use Docker           | [0.12, ...]
2  | Introduction to PostgreSQL  | [0.98, ...]
3  | Kubernetes for beginners    | [-0.31, ...]

When the user asks "How to get started with containers?", the same model generates an embedding for this query, and the database then looks for the stored vectors nearest to it, returning row 1 | How to use Docker, even though the word "container" doesn't appear in the original content. This is what we call semantic search.
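The lookup itself can be sketched as a brute-force scan in plain Python. The 2D embeddings below are invented for illustration (they do not come from a real model, and they are not the truncated values in the table above); a real system would get them from an embedding model and let the database do the sorting.

```python
import math

# Toy rows: (id, content, embedding). The embeddings are made-up 2D values.
rows = [
    (1, "How to use Docker", [0.9, 0.1]),
    (2, "Introduction to PostgreSQL", [0.1, 0.9]),
    (3, "Kubernetes for beginners", [0.7, 0.3]),
]

def nearest(query_embedding, rows, k=2):
    # Sort every row by Euclidean distance to the query and keep the top k.
    ranked = sorted(rows, key=lambda row: math.dist(query_embedding, row[2]))
    return ranked[:k]

# Pretend this is the embedding of "How to get started with containers?"
query = [0.85, 0.15]

for row_id, content, _ in nearest(query, rows):
    print(row_id, content)
# 1 How to use Docker
# 3 Kubernetes for beginners
```

The scan compares the query against every single row, which is fine for three rows and becomes the bottleneck at scale.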

The PGVector Extension

Phew! Finally, let's talk about something familiar. PostgreSQL has an extension called PGVector, which adds a vector column type along with distance operators and indexes that work directly inside the database. You get the best of both worlds: your trusty old table on steroids (the good stuff!), ready to store embeddings directly in it.

It's easier than it looks:

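-- Enable the extension once per database (it registers itself as "vector")
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE articles (
    id BIGSERIAL PRIMARY KEY,
    title TEXT,
    embedding VECTOR(1536)  -- must match the embedding model's dimensions
);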

In SQL, querying vectors looks like this:

SELECT *
FROM articles
ORDER BY embedding <-> '[...]'
LIMIT 10;

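The <-> operator calculates the Euclidean distance. There are operators for cosine distance (<=>) and inner product (<#>) as well, each with its own purpose. Note that <#> returns the negative inner product, so an ascending ORDER BY still puts the best match first.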
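From application code, the '[...]' part is just the query embedding written out as text. Here is a small sketch of how that string and the query could be assembled in Python. The helper names are hypothetical, and the %s placeholder assumes a driver that uses that parameter style (psycopg does); the embedding should be passed as a query parameter, not pasted into the SQL.

```python
# Maps each metric discussed above to its pgvector operator.
OPERATORS = {
    "euclidean": "<->",
    "cosine": "<=>",
    "inner_product": "<#>",  # negative inner product, so ascending order works
}

def to_vector_literal(values):
    # pgvector's text format is a bracketed, comma-separated list: [1,2,3]
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

def build_query(metric, limit=10):
    # Returns SQL with a %s placeholder where the vector literal is bound.
    op = OPERATORS[metric]
    return (
        "SELECT id, title FROM articles "
        f"ORDER BY embedding {op} %s::vector LIMIT {int(limit)}"
    )

print(to_vector_literal([1, 2.5, -0.25]))  # [1.0,2.5,-0.25]
print(build_query("cosine", limit=5))
```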

When does it start to fail?

Comparing vectors is expensive. Without an index, PGVector performs an exact search, comparing the query against every row, so the cost grows with the size of the table. For a normal use case, we'll hardly feel it, but in situations with something like 10 million embeddings, performance starts to drop drastically. In that case, dive deeper into PGVector's specialized indexes, IVFFlat and HNSW. Both make the search approximate: much faster, at the cost of occasionally missing a true nearest neighbor.

I don't need to say much, but PGVector is not a replacement for dedicated vector databases at very large scale. It remains a great option for bringing features like RAG, semantic search, recommendations, agent memory, and smart FAQs into your system, all without adding a new component to your infrastructure if you already run PostgreSQL. This simplicity is why it's so popular nowadays: for most cases, it serves very well.

Summary

In practice, RAG and semantic search revolve around this simple transformation:

Content
  └── Embedding
    └── Vector
      └── Vector Comparison

When you understand that an embedding is just a point in a high-dimensional space, almost all the witchcraft of modern AI systems starts to look much more like a matter of geometry than artificial intelligence.