Rob Ervin Jauquet

Code, Coffee, Books, Board Games

Full Text Search in PostgreSQL is Good Enough

Searching text is non-trivial, but for most use cases we can get 80% of the way there with a little Django + PostgreSQL magic. Our goal is to enable accurate and fast full text searching without setting up any new packages, databases, services, headaches, etc. besides our existing Django (2.x) project with an existing PostgreSQL (>=9.6) database.

We're going to design a simple blog. First, a model:
    
        from django.db import models

        class Blog(models.Model):

            title = models.TextField()
            text = models.TextField()
    
Each blog post is stored in our Blog model with its title and text. Likely, you've already got a model with some text fields like the Blog model above. We're going to do a few things now:

Add Vector Fields

We're going to keep our title and text vectors separate to enable searching through one or both of the fields. It is also possible to vectorize multiple fields into one vector field if you know you're only ever going to search through everything.

Our new model looks like this:
    
        from django.contrib.postgres.search import SearchVectorField
        from django.db import models

        class Blog(models.Model):

            title = models.TextField()
            text = models.TextField()

            title_vector = SearchVectorField(null=True)
            text_vector = SearchVectorField(null=True)
    
You will note the "null=True" kwarg passed into SearchVectorField. We need to populate the vectors after their source fields have data in them, so we cannot create the vectors until after a Blog instance is saved. We will be adding a trigger later to automatically populate the vector fields, but first we need to set up a few more things.

Enable indexing on vector fields

We have two options for vector indexes in Postgres: GIN and GiST. It's worth reading The Documentation on the differences, but we're just going to go ahead and use GIN and not worry too much about the details.

Adding indexing to our vector fields is straightforward:
    
        from django.contrib.postgres.indexes import GinIndex
        from django.contrib.postgres.search import SearchVectorField
        from django.db import models

        class Blog(models.Model):

            title = models.TextField()
            text = models.TextField()

            title_vector = SearchVectorField(null=True)
            text_vector = SearchVectorField(null=True)

            class Meta:
                indexes = [GinIndex(fields=[
                    'text_vector',
                    'title_vector',
                ])]
    
With the GinIndex in place, we have everything setup on the model that we will need, but we still need to reflect these changes in our database.

Generate a New Migration

The next two parts assume you have a django project, you have an app called search containing your Blog model, and the search app is in your INSTALLED_APPS.

Now that we've made our changes, we can make our migration file. My ins and outs look something like this:
    
        rjauquet (master) → python manage.py makemigrations search
            Migrations for 'search':
                examples/search/migrations/0001_initial.py
                - Create model Blog
                - Create index search_blog_text_ve_fb62bc_gin on field(s) text_vector, title_vector of model blog
    
I am starting with a fresh Blog model so all of the fields were added in the first step, and the indexes were added in the second. Before doing anything else, let's take a look at the generated migration file:
    
        import django.contrib.postgres.indexes
        import django.contrib.postgres.search
        from django.db import migrations, models

        class Migration(migrations.Migration):

            initial = True

            dependencies = []

            operations = [
                migrations.CreateModel(
                    name='Blog',
                    fields=[
                        ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                        ('title', models.TextField()),
                        ('text', models.TextField()),
                        ('title_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
                        ('text_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
                    ],
                ),
                migrations.AddIndex(
                    model_name='blog',
                    index=django.contrib.postgres.indexes.GinIndex(fields=['text_vector', 'title_vector'], name='search_blog_text_ve_fb62bc_gin'),
                ),
            ]
    
We can see this initial migration creates our model and adds an index to 'text_vector' and 'title_vector'.

Add Triggers

Before we apply this migration we need to add a little something extra to it to sweeten the deal. We're going to add a trigger to automatically populate our vector fields so never have to worry about it again. The migration file will now look like this:
    
        import django.contrib.postgres.indexes
        import django.contrib.postgres.search
        from django.db import migrations, models

        class Migration(migrations.Migration):

            initial = True

            dependencies = []

            operations = [
                migrations.CreateModel(
                    name='Blog',
                    fields=[
                        ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                        ('title', models.TextField()),
                        ('text', models.TextField()),
                        ('title_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
                        ('text_vector', django.contrib.postgres.search.SearchVectorField(null=True)),
                    ],
                ),
                migrations.AddIndex(
                    model_name='blog',
                    index=django.contrib.postgres.indexes.GinIndex(fields=['text_vector', 'title_vector'], name='search_blog_text_ve_fb62bc_gin'),
                ),
                migrations.RunSQL(
                    '''
                    CREATE TRIGGER title_vector_update BEFORE INSERT OR UPDATE
                    ON search_blog FOR EACH ROW EXECUTE PROCEDURE
                    tsvector_update_trigger('title_vector', 'pg_catalog.english', 'title');
                    ''',
                ),
                migrations.RunSQL(
                    '''
                    CREATE TRIGGER text_vector_update BEFORE INSERT OR UPDATE
                    ON search_blog FOR EACH ROW EXECUTE PROCEDURE
                    tsvector_update_trigger('text_vector', 'pg_catalog.english', 'title');
                    ''',
                ),
            ]
    
We add a bit of SQL to populate the vector fields anytime a row in our blog table is inserted or updated. "tsvector" is the Postgres text search vector type, but we don't need to know too much more about it at this point other than that it will be the type of data stored in the two SearchVectorFields in our model.

At this point we are ready to apply the migration. Again, my ins and outs:
    
        rjauquet (master) → python manage.py migrate search
            Operations to perform:
                Apply all migrations: search
            Running migrations:
                Applying search.0001_initial... OK
    
Our database and blog model are now both ready to handle full text searching!

Searching

So you'd probably like to actually search through some data now. This is also relatively easy, but first we need to add some Blog instances.

Run this to enter into the django shell (I'm using django-extensions with iPython):
    
        rjauquet (master) → python manage.py shell_plus --ipython
    
If you're using extensions and ipython, everything will be imported and ready to go. Let's add some Blog instances to our database:
    
        In [1]: Blog.objects.create(title="First Post", text="This is my first blog post. It's a good post.")
        Out[1]: <Blog: Blog object (1)>

        In [2]: Blog.objects.create(title="Second Post", text="This is not first blog post. It's a bad post.")
        Out[2]: <Blog: Blog object (2)>
    
Our vector fields and index will all be updated automatically. We're now ready to search through our blogs. Let's start with just the title:
    
        In [1]: from django.contrib.postgres.search import SearchQuery, SearchRank

        In [2]: results = Blog.objects.annotate(
                rank=SearchRank(
                    F('title_vector'),
                    SearchQuery('first'),
                )
            )
        In [3]: results = results.filter(
                rank__gt=0.0,
            ).order_by('-rank')

        In [4]: results.count(), results.first().title, results.first().rank
        Out[4]: (1, 'First Post', 0.0607927)
    
Okay we've got a lot going on here already. Let's break it down by each input.

In[1]: Import

The first thing we're doing is importing the SearchQuery and SearchRank classes from the django+postgres tools.

SearchQuery generates a postgres "tsquery" object from an input string. These can be logically combined as well, but we're going to keep it simple and just use one SearchQuery call.

SearchRank performs a query on a SearchVector with a SearchQuery and calculates a rank. We want to do this for every row in our Blog table.

Take a look at the django docs for more information on SearchQuery and SearchRank.

In[2]: Annotate

On our second step, we do the actual interesting work of assigning a rank value to every row in our Blog table. We're going to search for the word 'first' through the vector 'title_vector'. There is a tricky bit here. Normally SearchRank takes a SearchVector as its first argument, but we've luckily already told our database to build (and index!) our search vectors. All we need to do is tell Django to use our vector field instead of building a new SearchVector; F('title_vector') will do the trick.

In[3]: Filter

This part can be chained with the previous step, but I've separated it out so we can talk about is separately. We've already applied our annotation to assign each record a rank, so we can now filter our results down to just those records that have a score greater than zero. We of course also want to sort those results so the highest ranks appear first.

In[4]: Result

Of our two titles "First Post" and "Second Post", only one had a rank greater than zero when searching with the word 'first'. The rank value is a bit hard to reason about out of context, but you can read about the algorithm used in the Postgres docs. You shouldn't really have to futz with it to start getting good-enough results.

Final Thoughts

This is obviously a simplified example but hopefully it's enough ground work to build more complex queries. It would be good to add a bunch more Blog instances to start exploring what happens to the result set with different queries. For example, you can build two complex SearchQuery objects, one for each vector on our model:
    
        In[1]: from django.contrib.postgres.search import SearchQuery, SearchRank

        In[2]: title_query = ~SearchQuery('first') & SearchQuery('post')

        In[3]: text_query = SearchQuery('good') | SearchQuery('great')

        In[4]: results = Blog.objects.annotate(
                title_rank=SearchRank(
                    F('title_vector'),
                    title_query,
                )
            ).annotate(
                text_rank=SearchRank(
                    F('title_vector'),
                    text_query,
                )
            ).annotate(
                rank=F('title_rank') + F('text_rank')
            ).filter(
                title_rank__gt=0.0,
                rank__gt=0.0,
            ).order_by('-rank')
    
This will give results that have the word 'post' but do not have the word 'first' in the title, and contain either 'good' or 'great' in the text. What fun.

Good luck! Feel free to send questions my way: rjauquet@gmail.com