
Adding Search to a Hugo Site

2023-11-11 10 min read Site Programming marco

One feature that is hard to provide without a database backend, which static sites don't have, is search. With WordPress, search is baked into the core, and all a template has to do is stay out of the way. On a static site, though, there is no backend to call, no infrastructure to invoke, and no built-in user interface to show.

Static sites get around this by pulling in third-party search code: you add it to your site, and then magic happens. There are two options for how to do it:

  1. You use an external provider, usually in the cloud. Your site makes calls to the provider and receives results
  2. You use a library that indexes the documents and returns results

Neither approach is intrinsically better, and they both have their advantages and disadvantages.

1. External Providers

A lot of Hugo sites (and themes) use Algolia as their provider. That's a cloud service that takes the content of your site and creates a persistent index; your site's search function then queries it to get the most relevant results.

Other providers work in a similar fashion: whenever you update your site, you generate an index and submit it to the service, and the service does its magic without you having to do anything more.

The advantage is obvious: the service has to figure out how to handle the search. Your site doesn't have to have the code, it doesn't have to have the resources, and it doesn't have to run the service. All it does is add JavaScript to the front end to load the results, and the external service does its magic. Not only do you offload the heavy lifting of search, you also get incremental improvements in the search engine at no additional cost (usually). Maybe the search engine added AI search? Awesome, and you have to do exactly nothing to benefit from it!

2. Use a Library

For whatever reason, you may want to do things yourself. The downsides are obvious, since I listed the upsides of using an external provider above. The upsides of doing it this way are less obvious, but also important.

For one, you can just add content, update a local index, and all is good in the world. You don't have to push the new index to an external service, you don't have to deal with the service not understanding your documents, and you don't have to wait for the index to update. Basically, you can set up a pipeline that gets you from posting a new article to having it searchable in an instant.

Also important: you don’t depend on the third party to update their service to get what you want. Maybe you want to allow partial search results; maybe you want to emphasize the title more than the content of an article: whatever it is, with a library you have a better chance of getting what you want than with a fixed external service provider.

The Pipeline I Chose

The process I wanted was obviously the simplest one possible:

  1. Add content to the site
  2. Update the index
  3. Done!

What I wanted to avoid was generating the index in the browser. The idea that anyone searching the site would first have to load every single page was awful. But that seemed to be the default for every engine I looked at.

Instead, I wanted something that took the existing Markdown files, generated a searchable index from them, and spit it out somewhere into the generated site.

While that is normally easy to do, it's surprisingly hard to find instructions for. I eventually ended up with a set of technologies that were (a) well described and (b) easy to implement.

I didn’t want to end up with a third party service. Part of it was cost - even if they are free today, tomorrow they might start charging. But I also hated the added complexity of having to deal with someone external that made decisions without consulting me.

At the same time, the idea of generating the index on the fly was daunting. I didn't want to deal with hundreds of requests for content from searchers! So I needed some way to do the work ahead of time, which meant creating an index.

I finally found an article about the combination of Hugo, Grunt, and Lunr. It seemed to do all that I wanted, so I tried using it. And it really worked!

Setup

You start out by building the index file. That's a single file containing all the content of your site (usually written in Markdown) in searchable form. The browser downloads this index, which means it doesn't have to download each content file individually.

The bad news is that building the index requires new technology. I haven't found any lunr-compatible indexing tool written in Go (which is the language Hugo is built in). Instead, the indexing is done in JavaScript outside the browser. To make it work, the author of the gist uses Grunt, a make analog for Node.

I could talk about my dislike for Node forever, but it boils down to this: there are too many packages, each of them too small, and the end result is that even a simple package like Grunt has hundreds of dependencies. That's a problem twice over. First, the dependencies have strong ideas about what they want for their own dependencies, which can easily cause a versioning nightmare where one package wants one version of a dependency and another package wants a different one, and the build ends up crashing. Second, and potentially worse, the sheer number of packages results in many potential security flaws. And, in fact, no matter what I tried, I ended up with Grunt requiring at least one package with a flaw marked severe.

I sighed, but since the software would run in a virtual machine on my development laptop I didn’t really care about security flaws. Maybe I should.

Code Changes

Even after getting the code and Grunt installed, things didn't work as expected. Some of the changes were trivial, like the location the files were expected to be in: 'site/content'. In my case, the content was simply in content, and site seems to be where the author put the generated files (which go to public by default in Hugo). I am not sure what the difference is due to, but it was easy enough to fix.

Next was the version of lunr I used. I don't know whether it's older or newer than the one the author used, but I got mine straight from unpkg. In particular, my version didn't like adding items to an index that had already been created, so I had to move the code that adds documents into the initializer.
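In practice that means all documents get added inside the builder callback. A minimal sketch of what the initializer looks like, assuming the index JSON is an array of objects with title, href, and content fields (the field names and the title boost are my own choices, not necessarily the gist's):

```javascript
// Sketch of the client-side initializer with a lunr 2.x-style builder,
// where documents can only be added inside the builder callback.
var pagesIndex;   // the raw array loaded from PagesIndex.json
var searchIndex;  // the lunr index built from it

function buildIndex(pages) {
  pagesIndex = pages;
  searchIndex = lunr(function () {
    this.ref("href");
    this.field("title", { boost: 10 }); // weigh titles more heavily than body text
    this.field("content");

    pages.forEach(function (page) {
      this.add(page);
    }, this);
  });
}
```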

Finally, all my files use YAML as the front matter format, while the author expected TOML. The difference is mostly trivial: YAML uses colons where TOML uses equals signs, but the processor didn't like the format, so I had to add the corresponding parsing code and a yaml package to the dependencies.
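Roughly, the change amounts to picking a parser based on the front matter delimiter. Here's a sketch, assuming the js-yaml and toml packages; the package choices, the helper, and the file name are mine, not necessarily what the gist uses:

```javascript
// frontmatter.js (sketch): pick a front matter parser based on the delimiter.
var yaml = require("js-yaml");
var toml = require("toml");

function parseFrontMatter(raw) {
  if (raw.indexOf("---") === 0) {        // YAML front matter: key: value
    return yaml.load(raw.split("---")[1]) || {};
  }
  if (raw.indexOf("+++") === 0) {        // TOML front matter: key = "value"
    return toml.parse(raw.split("+++")[1]);
  }
  return {};
}

module.exports = parseFrontMatter;
```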

With all that done, I ran the grunt file again and found myself the proud owner of an index file!
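Put together, the indexing task ends up looking roughly like the sketch below. The content layout, output path, and crude Markdown stripping are my assumptions rather than the gist verbatim, and it reuses the parseFrontMatter helper from above:

```javascript
// Gruntfile.js (sketch): walk the Markdown content, pull out title and body,
// and write one JSON file that the browser can load and index with lunr.
var parseFrontMatter = require("./frontmatter"); // helper from the sketch above

module.exports = function (grunt) {
  grunt.registerTask("lunr-index", function () {
    var pages = [];

    grunt.file.expand({ filter: "isFile" }, "content/**/*.md").forEach(function (file) {
      var raw = grunt.file.read(file);
      var front = parseFrontMatter(raw);
      var body = raw.split(/^(?:---|\+\+\+)\s*$/m).slice(2).join(" ");

      pages.push({
        title: front.title || file,
        href: "/" + file.replace(/^content\//, "").replace(/\.md$/, "/"),
        content: body.replace(/[#*`\[\]()]/g, " ") // crude Markdown stripping
      });
    });

    grunt.file.write("public/js/PagesIndex.json", JSON.stringify(pages));
    grunt.log.ok("Indexed " + pages.length + " pages");
  });
};
```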

Integration

My theme (Bilberry) looks great and has a bunch of features I like, such as the integrated gallery. It does use Algolia for search, though, which I didn't want. So the question was what needed to happen to make the lunr-based search work.

There are basically four tasks to accomplish:

  1. Remove Algolia-specific code
  2. Load the lunr dependencies
  3. Load the index into lunr
  4. Tie the search box to the search result display

Remove Algolia Code

This one was easy, as all I needed to do was override the theme's layouts with my own. Hugo prefers top-level layouts/partials over theme-specific ones, so all I had to do was grep through the theme for algolia and copy the matching files into the top-level hierarchy. There were three files that needed to be edited:

  • _default/baseof.html - the main template
  • partials/topnav.html - the top navigation bar
  • partials/algolia-search.html - just the configuration for algolia, API keys and such

The only one of these that required real modification was baseof.html, where I had to add both the style for the search results and a new partial that loads the lunr-specific code. No such code is needed for Algolia, since that part is provided server-side by them.

Load lunr Dependencies

The glue code's only other dependency is jQuery, so all there was to it was loading the two of them. Since lunr itself can be loaded on demand through jQuery, and since jQuery is (still) a fairly common library, I only load jQuery up front; lunr is loaded once the search box is made visible. This may change in the future, since lunr is a really small library and wouldn't add much load.

I decided to download both lunr and jQuery and serve them from my machine to have consistent results. I could have left them on unpkg and pinned a specific version, of course, and may do so in the future.
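As a sketch, the on-demand loading looks something like this; the element ID and the file paths are hypothetical rather than the theme's real names, and buildIndex is the initializer from the sketch above:

```javascript
// Load lunr and the index only once the search box is opened.
var searchLoaded = false;

$("#search-toggle").on("click", function () {
  if (searchLoaded) { return; }
  searchLoaded = true;

  // Fetch the lunr library first, then the prebuilt index JSON, then build the index.
  $.getScript("/js/lunr.min.js", function () {
    $.getJSON("/js/PagesIndex.json", function (pages) {
      buildIndex(pages);
    });
  });
});
```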

Load the Index into lunr

The glue code from the gist does that just fine. I created a partial called lunr-search.html which is, in essence, that code, modified as described above. The index is loaded first thing and adds a delay to the opening of the search box. I'm not fond of that, of course, and maybe I should look into compressing the index.

Tie Search Input to Search Result Display

This was the only thing that required any real amount of work, on the order of an hour or so. Basically, there is an input on the page whose id/class/… identifies it to the lunr search code. As the input changes, lunr spits out new results, and these then have to be displayed somehow.

The whole thing is not complicated. You just add a div into your code that is unhidden when there is something to display. You can then format it as you want, styling it to your heart’s content. Thinking of mobile devices and limited screen real estate, I opted for a flex box and it worked marvelously well.
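The wiring itself is only a handful of lines. Here's a sketch with hypothetical selectors ("#search-input", "#search-results") rather than the theme's real ones, building on the earlier snippets:

```javascript
// Tie the search input to the results pane: query lunr on every keystroke,
// look the matches up in the page list, and render them as links.
$("#search-input").on("keyup", function () {
  var query = $(this).val();
  var $results = $("#search-results");

  if (!query || !searchIndex) {
    $results.hide().empty();
    return;
  }

  // lunr returns matches as { ref, score }; look each ref up in the page list.
  var hits = searchIndex.search(query).map(function (hit) {
    return pagesIndex.filter(function (page) { return page.href === hit.ref; })[0];
  });

  $results.empty().show();
  hits.forEach(function (page) {
    $results.append('<a href="' + page.href + '">' + page.title + "</a>");
  });
});
```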

For simplicity’s sake, I added the style to baseof.html instead of putting it into a separate CSS file. Editing the theme was not an option, as the theme.css was minified. Since there is only one search box, it’s easy enough to give it a distinctive marker and make it float above all else.

Updating

Each time a new article is posted (like this one), the index file needs to be updated. I could opt for an incremental update that simply adds the new entry to the existing JSON file, but the process is fast enough that regenerating the entire index is not a hassle at all. At least at the current size of this blog.
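For the record, the incremental version would look something like the sketch below; the addToIndex helper, the index path, and the example entry are all hypothetical, since I simply re-run the full Grunt task instead:

```javascript
// Sketch of the incremental update I decided against: add just the new
// article's entry to the existing JSON instead of re-indexing everything.
var fs = require("fs");

function addToIndex(indexPath, newEntry) {
  var pages = JSON.parse(fs.readFileSync(indexPath, "utf8"));

  // Drop any existing entry for the same URL, then append the fresh one.
  pages = pages.filter(function (page) { return page.href !== newEntry.href; });
  pages.push(newEntry);

  fs.writeFileSync(indexPath, JSON.stringify(pages));
}

addToIndex("public/js/PagesIndex.json", {
  title: "Adding Search to a Hugo Site",
  href: "/adding-search-to-a-hugo-site/",
  content: "..." // the stripped article body would go here
});
```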

I added an SVN hook on check-in of a file into the content subtree that regenerates the index using grunt. The index file is also version controlled, so I just need to check in the update. Then the server gets the check-in and regenerates the entire site transparently. As a result, whenever I post a new article, the index is regenerated automatically. Unless I forget to check in the new index, all is good.

Contrast that with Algolia et al., which require you to upload the index file to their service. Then they process it at their leisure.

It's probably not a big deal, but my process is much quicker for a site of this size.

Conclusion

There are still a few issues I need to work out. For instance, the loading time of the index slows down the display of the top navigation bar. That's bad both because users don't know something is happening (I should add a loading indicator) and because the top nav also has navigation links that shouldn't be slowed down by the search box. I guess I could make the loading happen only when the user interacts with the search box.

But aside from that, I am really happy with the way it works. It’s easy to update the index, the search results are super zippy and accurate, and I like the display of results. All of it works well thanks to lunr, but could be easily replaced by a different engine if needed.