2023-06

2023-06-26 Vectors are the new JSON in PostgreSQL

Vectors are the new JSON.

That in itself is an interesting statement, given vectors are a well-studied mathematical structure, and JSON is a data interchange format.

And yet in the world of data storage and retrieval, both of these data representations have become the lingua franca of their domains and are either essential, or soon-to-be-essential, ingredients in modern application development.

And if current trends continue (I think they will), vectors will be as crucial as JSON is for building applications.

Generative AI and all the buzz around it has caused developers to look for convenient ways to store and run queries against the outputs of these systems, with PostgreSQL being a natural choice for a lot of reasons.

But even with the hype around generative AI, this is not a new data pattern. Vectors, as a mathematical concept, have been around for hundreds of years. Machine learning has over a half-century worth of research. The array – the fundamental data structure for a vector – is taught in most introductory computer science classes. Even PostgreSQL has had support for vector operations for over 20 years (more on that later)!

So, what is new? It’s the accessibility of these AI/ML algorithms and how easy it is to represent some “real world” structure (text, images, video) as a vector and store it for some future use by an application.

And again, while folks may point to the fact it’s not new to store the output of these systems (“embeddings”) in data storage systems, the emergent pattern is the accessibility of being able to query and return this data in near real-time in almost any application.

What does this have to do with PostgreSQL? Everything!

Efficient storage and retrieval of a data type used in a common pattern greatly simplifies app development, lets people to keep their related data in the same place, and can work with existing tooling. We saw this with JSON over 10 years ago, and now we’re seeing this with vector data.

To understand why vectors are the new JSON, let’s rewind and look back at what happened as JSON emerged as the de facto data type for web communications.

The next steps for better support for vectors in PostgreSQL

At PGCon 2023, which is a PostgreSQL conference where many internals developers gather, I presented a lightning talk called Vectors are the new JSON where I shared use-cases and some upcoming challenges with improving PostgreSQL and pgvector performance for querying vector data.

Some problems to tackle (many of which are in progress!) involve adding more parallelism to pgvector, adding support for indexing for vectors with more than 2,000 dimensions, and leverage hardware acceleration where possible to speed up calculations.

The good news is that some of these things are not too hard to add, they just require open source contributions!

There is a lot of excitement around using PostgreSQL as a vector database (emphasis on database ;-), and I expect that, as history has shown with JSON, the PostgreSQL community will find a way to support this emergent workload in a way that’s scalable and safe.

I do encourage you to provide feedback – both on PostgreSQL itself and pgvector – on how you’re working with vector data in PostgreSQL, or how you want to work with data in PostgreSQL, as that will help guide the community on providing optimal support for vector queries.

2023-06-17 sqllean.py import sqlean as sqlite3” works as a drop-in replacement for the module from the standard library

sqlean.py: Python’s sqlite3 with extensions. Anton Zhiyanov built a new Python package which bundles a fresh, compiled copy of SQLite with his SQLean family of C extensions built right in.

Installing it gets you the latest SQLite—3.42.0—with nearly 200 additional functions, including things like define() and eval(), fileio_read() and fileio_write(), percentile_95() and uuid4() and many more.

“import sqlean as sqlite3” works as a drop-in replacement for the module from the standard library.

2023-06-08 Hack’PG #1 : Comment créer une extension pour PostgreSQL

Reviers, le 8 juin 2023

Après quelques discussions en interne, nous sommes tombés d’accord pour mettre en place une journée mensuelle sur le codage dans PostgreSQL pour les consultants et développeurs de Dalibo.

Le séminaire fut l’occasion d’une première journée, qui a été consacrée à la création d’une extension. Pour cela, nous avons créé deux fonctions en SQL, puis une fonction en C, et ajouté un opérateur. Cela nous a permis de comprendre comment créer et mettre à jour une extension, et voir ainsi tout son cycle de vie.

Le dernier point a concerné la mise en place de tests unitaires pour cette extension. Cet article détaille tout ce que nous avons appris lors de cette journée.

Pour aller plus loin

Écrire des fonctions en C est plus dangereux, notamment quand on manipule des pointeurs. Cependant, cela a aussi des avantages.

Nous avons eu le cas d’un client qui avait une fonction mathématique écrite en PL/pgsql. Cette fonction codée en C est bien plus rapide. Il n’est pas étonnant que beaucoup des fonctions de PostGIS soient écrites en C.

Un autre intérêt est l’accès à des fonctions internes de PostgreSQL, comme les hooks.

Sur cette journée du séminaire, nous avons eu le temps d’aller plus loin en examinant un peu le concept des hooks, mais cet article est déjà assez long. Le système des hooks fera l’objet d’un deuxième article.

Pour les curieux, le code de l’extension se trouve sur le dépôt GitHub des journées de hacking PostgreSQL .