Full Text Search in PostgreSQL is Good Enough
Searching text is non-trivial, but for most use cases we can get 80% of the way there with a little Django + PostgreSQL magic. Our goal is to enable accurate and fast full text searching without setting up any new packages, databases, services, headaches, etc. besides our existing Django (2.x) project with an existing PostgreSQL (>=9.6) database.
We're going to design a simple blog. First, a model:
from django.db import models
class Blog(models.Model):
title = models.TextField()
text = models.TextField()
Each blog post is stored in our Blog model with its title and text. Likely, you've already got a model with some text fields like the Blog model above. We're going to do a few things now:
- Add fields to store vectorized versions of title and text
- Enable indexing on our vectorized fields
- Generate a new migration file
- Add a trigger to automatically populate our vector fields
- Explore searching with our newly enabled full text search
Add Vector Fields
We're going to keep our title and text vectors separate to enable searching through one or both of the fields. It is also possible to vectorize multiple fields into one vector field if you know you're only ever going to search through everything.
Our new model looks like this:
from django.contrib.postgres.search import SearchVectorField
from django.db import models
class Blog(models.Model):
title = models.TextField()
text = models.TextField()
title_vector = SearchVectorField(null=True)
text_vector = SearchVectorField(null=True)
You will note the "null=True
" kwarg passed into SearchVectorField
. We need to populate the vectors after their source fields have data in them, so we cannot create the vectors until after a Blog instance is saved. We will be adding a trigger later to automatically populate the vector fields, but first we need to set up a few more things.
Enable indexing on vector fields
We have two options for vector indexes in Postgres: GIN and GiST. It's worth reading The Documentation on the differences, but we're just going to go ahead and use GIN and not worry too much about the details.
Adding indexing to our vector fields is straightforward:
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models
class Blog(models.Model):
title = models.TextField()
text = models.TextField()
title_vector = SearchVectorField(null=True)
text_vector = SearchVectorField(null=True)
class Meta:
indexes = [GinIndex(fields=[
'text_vector',
'title_vector',
])]
With the GinIndex in place, we have everything setup on the model that we will need, but we still need to reflect these changes in our database.
Generate a New Migration
The next two parts assume you have a django project, you have an app called search containing your Blog model, and the search app is in your INSTALLED_APPS.
Now that we've made our changes, we can make our migration file. My ins and outs look something like this:
rjauquet (master) → python manage.py makemigrations search
Migrations for 'search':
examples/search/migrations/0001_initial.py
- Create model Blog
- Create index search_blog_text_ve_fb62bc_gin on field(s) text_vector, title_vector of model blog
I am starting with a fresh Blog model so all of the fields were added in the first step, and the indexes were added in the second. Before doing anything else, let's take a look at the generated migration file:
import django.contrib.postgres.indexes
import django.contrib.postgres.search
from django.db import migrations, models
class Migration(migrations.Migration):
initial = True
dependencies = []
operations = [
migrations.CreateModel(
name='Blog',
fields=[
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('title', models.TextField()),
('text', models.TextField()),
('title_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
('text_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
],
),
migrations.AddIndex(
model_name='blog',
index=django.contrib.postgres.indexes.GinIndex(fields=['text_vector', 'title_vector'], name='search_blog_text_ve_fb62bc_gin'),
),
]
We can see this initial migration creates our model and adds an index to 'text_vector'
and 'title_vector'
.
Add Triggers
Before we apply this migration we need to add a little something extra to it to sweeten the deal. We're going to add a trigger to automatically populate our vector fields so never have to worry about it again. The migration file will now look like this:
import django.contrib.postgres.indexes
import django.contrib.postgres.search
from django.db import migrations, models
class Migration(migrations.Migration):
initial = True
dependencies = []
operations = [
migrations.CreateModel(
name='Blog',
fields=[
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('title', models.TextField()),
('text', models.TextField()),
('title_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
('text_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
],
),
migrations.AddIndex(
model_name='blog',
index=django.contrib.postgres.indexes.GinIndex(fields=['text_vector', 'title_vector'], name='search_blog_text_ve_fb62bc_gin'),
),
migrations.RunSQL(
'''
CREATE TRIGGER title_vector_update BEFORE INSERT OR UPDATE
ON search_blog FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('title_vector', 'pg_catalog.english', 'title');
''',
),
migrations.RunSQL(
'''
CREATE TRIGGER text_vector_update BEFORE INSERT OR UPDATE
ON search_blog FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('text_vector', 'pg_catalog.english', 'title');
''',
),
]
We add a bit of SQL to populate the vector fields anytime a row in our blog table is inserted or updated. "tsvector" is the Postgres text search vector type, but we don't need to know too much more about it at this point other than that it will be the type of data stored in the two SearchVectorField
s in our model.
At this point we are ready to apply the migration. Again, my ins and outs:
rjauquet (master) → python manage.py migrate search
Operations to perform:
Apply all migrations: search
Running migrations:
Applying search.0001_initial... OK
Our database and blog model are now both ready to handle full text searching!
Searching
So you'd probably like to actually search through some data now. This is also relatively easy, but first we need to add some Blog instances.
Run this to enter into the django shell (I'm using django-extensions with iPython):
rjauquet (master) → python manage.py shell_plus --ipython
If you're using extensions and ipython, everything will be imported and ready to go. Let's add some Blog instances to our database:
In [1]: Blog.objects.create(title="First Post", text="This is my first blog post. It's a good post.")
Out[1]: <Blog: Blog object (1)>
In [2]: Blog.objects.create(title="Second Post", text="This is not first blog post. It's a bad post.")
Out[2]: <Blog: Blog object (2)>
Our vector fields and index will all be updated automatically. We're now ready to search through our blogs. Let's start with just the title:
In [1]: from django.contrib.postgres.search import SearchQuery, SearchRank
In [2]: results = Blog.objects.annotate(
rank=SearchRank(
F('title_vector'),
SearchQuery('first'),
)
)
In [3]: results = results.filter(
rank__gt=0.0,
).order_by('-rank')
In [4]: results.count(), results.first().title, results.first().rank
Out[4]: (1, 'First Post', 0.0607927)
Okay we've got a lot going on here already. Let's break it down by each input.
In[1]: Import
The first thing we're doing is importing the SearchQuery
and SearchRank
classes from the django+postgres tools.
SearchQuery
generates a postgres "tsquery" object from an input string. These can be logically combined as well, but we're going to keep it simple and just use one SearchQuery call.
SearchRank
performs a query on a SearchVector
with a SearchQuery
and calculates a rank. We want to do this for every row in our Blog table.
Take a look at the django docs for more information on SearchQuery
and SearchRank
.
In[2]: Annotate
On our second step, we do the actual interesting work of assigning a rank value to every row in our Blog table. We're going to search for the word 'first'
through the vector 'title_vector'
.
There is a tricky bit here. Normally SearchRank
takes a SearchVector
as its first argument, but we've luckily already told our database to build (and index!) our search vectors. All we need to do is tell Django to use our vector field instead of building a new SearchVector
; F('title_vector')
will do the trick.
In[3]: Filter
This part can be chained with the previous step, but I've separated it out so we can talk about is separately. We've already applied our annotation to assign each record a rank, so we can now filter our results down to just those records that have a score greater than zero. We of course also want to sort those results so the highest ranks appear first.
In[4]: Result
Of our two titles "First Post"
and "Second Post"
, only one had a rank greater than zero when searching with the word 'first'
. The rank value is a bit hard to reason about out of context, but you can read about the algorithm used in the Postgres docs. You shouldn't really have to futz with it to start getting good-enough results.
Final Thoughts
This is obviously a simplified example but hopefully it's enough ground work to build more complex queries. It would be good to add a bunch more Blog instances to start exploring what happens to the result set with different queries. For example, you can build two complex SearchQuery
objects, one for each vector on our model:
In[1]: from django.contrib.postgres.search import SearchQuery, SearchRank
In[2]: title_query = ~SearchQuery('first') & SearchQuery('post')
In[3]: text_query = SearchQuery('good') | SearchQuery('great')
In[4]: results = Blog.objects.annotate(
title_rank=SearchRank(
F('title_vector'),
title_query,
)
).annotate(
text_rank=SearchRank(
F('title_vector'),
text_query,
)
).annotate(
rank=F('title_rank') + F('text_rank')
).filter(
title_rank__gt=0.0,
rank__gt=0.0,
).order_by('-rank')
This will give results that have the word 'post'
but do not have the word 'first'
in the title, and contain either 'good'
or 'great'
in the text. What fun.
Good luck! Feel free to send questions my way: rjauquet@gmail.com
from django.db import models
class Blog(models.Model):
title = models.TextField()
text = models.TextField()
from django.contrib.postgres.search import SearchVectorField
from django.db import models
class Blog(models.Model):
title = models.TextField()
text = models.TextField()
title_vector = SearchVectorField(null=True)
text_vector = SearchVectorField(null=True)
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models
class Blog(models.Model):
title = models.TextField()
text = models.TextField()
title_vector = SearchVectorField(null=True)
text_vector = SearchVectorField(null=True)
class Meta:
indexes = [GinIndex(fields=[
'text_vector',
'title_vector',
])]
rjauquet (master) → python manage.py makemigrations search
Migrations for 'search':
examples/search/migrations/0001_initial.py
- Create model Blog
- Create index search_blog_text_ve_fb62bc_gin on field(s) text_vector, title_vector of model blog
import django.contrib.postgres.indexes
import django.contrib.postgres.search
from django.db import migrations, models
class Migration(migrations.Migration):
initial = True
dependencies = []
operations = [
migrations.CreateModel(
name='Blog',
fields=[
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('title', models.TextField()),
('text', models.TextField()),
('title_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
('text_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
],
),
migrations.AddIndex(
model_name='blog',
index=django.contrib.postgres.indexes.GinIndex(fields=['text_vector', 'title_vector'], name='search_blog_text_ve_fb62bc_gin'),
),
]
import django.contrib.postgres.indexes
import django.contrib.postgres.search
from django.db import migrations, models
class Migration(migrations.Migration):
initial = True
dependencies = []
operations = [
migrations.CreateModel(
name='Blog',
fields=[
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('title', models.TextField()),
('text', models.TextField()),
('title_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
('text_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
],
),
migrations.AddIndex(
model_name='blog',
index=django.contrib.postgres.indexes.GinIndex(fields=['text_vector', 'title_vector'], name='search_blog_text_ve_fb62bc_gin'),
),
migrations.RunSQL(
'''
CREATE TRIGGER title_vector_update BEFORE INSERT OR UPDATE
ON search_blog FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('title_vector', 'pg_catalog.english', 'title');
''',
),
migrations.RunSQL(
'''
CREATE TRIGGER text_vector_update BEFORE INSERT OR UPDATE
ON search_blog FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('text_vector', 'pg_catalog.english', 'title');
''',
),
]
rjauquet (master) → python manage.py migrate search
Operations to perform:
Apply all migrations: search
Running migrations:
Applying search.0001_initial... OK
rjauquet (master) → python manage.py shell_plus --ipython
In [1]: Blog.objects.create(title="First Post", text="This is my first blog post. It's a good post.")
Out[1]: <Blog: Blog object (1)>
In [2]: Blog.objects.create(title="Second Post", text="This is not first blog post. It's a bad post.")
Out[2]: <Blog: Blog object (2)>
In [1]: from django.contrib.postgres.search import SearchQuery, SearchRank
In [2]: results = Blog.objects.annotate(
rank=SearchRank(
F('title_vector'),
SearchQuery('first'),
)
)
In [3]: results = results.filter(
rank__gt=0.0,
).order_by('-rank')
In [4]: results.count(), results.first().title, results.first().rank
Out[4]: (1, 'First Post', 0.0607927)
In[1]: from django.contrib.postgres.search import SearchQuery, SearchRank
In[2]: title_query = ~SearchQuery('first') & SearchQuery('post')
In[3]: text_query = SearchQuery('good') | SearchQuery('great')
In[4]: results = Blog.objects.annotate(
title_rank=SearchRank(
F('title_vector'),
title_query,
)
).annotate(
text_rank=SearchRank(
F('title_vector'),
text_query,
)
).annotate(
rank=F('title_rank') + F('text_rank')
).filter(
title_rank__gt=0.0,
rank__gt=0.0,
).order_by('-rank')