In my day job, I'm responsible for managing, amongst other things, our project's search engine. It's based on the popular apache solr search indexer, and relies on a lot of custom code to parse our HTML content. I quite enjoy working with solr, and have added various faceting, filtering and boosts to improve its effectiveness. It's now a 'solved problem' at work though, and aside from the occasional failure to index, there's no new work required, so I don't get to work on it any more.
I have been wondering how I might implement a site search pipeline for my bascule static site generator. Not because this website particularly needs a search engine, but because it might be an interesting problem to solve. I first looked at hosted search solutions in the cloud, using solr or its major competitor elasticsearch. None of the cloud hosted solutions made much financial sense. The upfront and monthly costs are just too high for a website which receives essentially no traffic. The web hosting package this site runs on doesn't allow me to deploy any major Java servlet applications, and search indexing/search querying doesn't seem to me to be a good fit for serverless solutions such as AWS Lambda or other function-as-a-service platforms.
Some googling led me to an unexpected place - write the search in Javascript. But I dislike Javascript, and a good search engine needs to do more than a text.contains(query) string comparison, so I started looking for libraries. I first found lunr.js and then a more extensive fork of lunr called elasticlunr. I've been experimenting with both, but ultimately decided that lunr.js is the better solution. It's simpler to get started with, well documented, and still maintained. elasticlunr looks good, but doesn't seem to be under active development.
I had been worried about the performance of a search engine written in Javascript and running in the user's browser. A recent training course I attended made me realise that browsers these days run much more complex apps, written in Javascript frameworks like React and Angular, than my little search engine, so I don't think I need to worry.
Implementation
lunr.js can't scan the HTML of this website; instead, it must be provided with a JSON object containing the text to be indexed. That's an easy job for bascule - I merely had to write a new GeneratorPipeline class to parse the website's static HTML and return the relevant content as a JSON array. The new class is in the bascule-extras repository. The JSON array is uploaded alongside the website HTML, and a call to the lunr.js library starts the indexing process in the browser.
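var lunrIndex;

console.log("Fetching...");
// Ask for the JSON array that is uploaded alongside the site's HTML.
// This is a GET with no body, so the header says what we accept back.
const request = new Request('./lunrindex.json', {
    headers: new Headers({
        'Accept': 'application/json'
    })
});

fetch(request)
    .then(response => response.json())
    .then(data => lunrIndex = lunr(function () {
        console.log("Indexing...");
        // 'id' identifies each document; the other fields are searchable.
        this.ref('id');
        this.field('body');
        this.field('title');
        this.field('date');
        this.field('headings');
        data.forEach(function (doc) {
            this.add(doc);
        }, this);
    }))
    .catch(error => console.error("Could not build the search index:", error));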
(This is the first time I've ever written Javascript which uses Promises in the form of the fetch call above - as I say, I'm not a fan of Javascript!)
I have added a search box to the home page of this website, but there's no functioning UI yet. It is plugged in to lunr.js though, and if you open your browser's console logs and then search for a word, the logs will show that search is working.
//Searching for bascule
//search results:
(3) […]
0: Object { ref: "new-website-bascule", score: 5.654999999999999, matchData: {…} }
1: Object { ref: "pdf-generation-from-bascule", score: 5.252, matchData: {…} }
2: Object { ref: "whats-next-bascule", score: 4.677, matchData: {…} }
length: 3
<prototype>: Array []
The basic UI will come next, hopefully before Christmas.
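For anyone curious, hooking a search box up to lunr.js needs very little code. The following is a minimal sketch rather than the code running on this site: runSearch and the #search element id are hypothetical names, and the only lunr.js call it relies on is index.search(query), which returns an array of objects with ref, score and matchData properties like the ones in the console output above.

```javascript
// Hypothetical helper: run a query against a built lunr index and
// return only the refs and scores, in the order lunr returned them.
function runSearch(index, query) {
  const trimmed = query.trim();
  if (trimmed === "") {
    return []; // nothing typed, nothing to search for
  }
  // index.search() is lunr's query entry point.
  return index.search(trimmed).map(function (result) {
    return { ref: result.ref, score: result.score };
  });
}

// Browser-only wiring, guarded so the helper can also be loaded outside a page.
if (typeof document !== "undefined") {
  const box = document.querySelector("#search"); // assumed element id
  if (box) {
    box.addEventListener("change", function () {
      // lunrIndex stays undefined until the fetch-and-index step has finished.
      if (typeof lunrIndex !== "undefined" && lunrIndex) {
        console.log("search results:", runSearch(lunrIndex, box.value));
      }
    });
  }
}
```

Keeping the query logic in its own function, separate from the DOM wiring, should make it easier to swap the console.log for a real results list later.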
One obvious flaw with the current approach is that lunr.js is asked to reindex the entire JSON array every time the home page is loaded, which costs both processor time and network traffic as the array is fetched. lunr.js allows me to save the index so that it can be reloaded on demand, skipping the reindexing step. But where to save it? It's far too big to stuff into a cookie. I'm going to explore the Web Storage API to store the index in the user's browser. But there is another possibility.
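As a starting point for that exploration, here is a minimal sketch of caching the index in the browser. It is untested against this site and rests on a few assumptions: loadCachedIndex, cacheIndex and the storage key are hypothetical names, the storage argument is expected to behave like window.localStorage, and the index is assumed to fit within the browser's storage quota. The lunr.js pieces it leans on are lunr.Index.load(), which rebuilds an index from its serialised form, and the fact that a lunr index can be passed straight to JSON.stringify().

```javascript
// Hypothetical helper: return a lunr index rebuilt from storage,
// or null if nothing usable is stored under `key`.
// `lunrLib` is the lunr function; `storage` is e.g. window.localStorage.
function loadCachedIndex(lunrLib, storage, key) {
  const cached = storage.getItem(key);
  if (cached === null) {
    return null; // nothing saved yet
  }
  try {
    return lunrLib.Index.load(JSON.parse(cached));
  } catch (e) {
    return null; // corrupt or incompatible data: fall back to reindexing
  }
}

// Hypothetical helper: try to save the index, reporting whether it worked.
function cacheIndex(storage, key, index) {
  try {
    // JSON.stringify() uses the index's toJSON() method to serialise it.
    storage.setItem(key, JSON.stringify(index));
    return true;
  } catch (e) {
    return false; // e.g. the storage quota was exceeded
  }
}
```

The idea would be to call loadCachedIndex(lunr, window.localStorage, "lunrIndex") first, and only fetch and index the JSON array when it returns null. A real version would also need some way to throw the cached copy away when the site is regenerated, perhaps by putting a build date in the key.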
What if I can ask Bascule to build the lunr index directly? The format of the index is documented, though I fear not comprehensively. Or could I run the lunr indexing step as part of Bascule's generator pipeline? But to do that I'd need a Javascript execution environment running on my computer. This is beginning to sound like a job for node.js. I wonder if there's a Java JAR available which contains a Javascript runtime environment?
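If a Javascript runtime does turn out to be available at build time, the indexing step itself is small. The sketch below is an assumption about how it might look, not something Bascule does today: serializeIndex is a hypothetical name, and the lunr function is passed in as a parameter so that the same code works wherever lunr happens to be loaded from. The field names are the ones used in the browser code above.

```javascript
// Hypothetical build-time helper: index the documents once and return the
// serialised index as a JSON string, ready to be written next to the site.
// `lunrLib` is the lunr function, e.g. the result of require("lunr") in node.js.
function serializeIndex(lunrLib, docs) {
  const index = lunrLib(function () {
    this.ref("id");
    this.field("body");
    this.field("title");
    this.field("date");
    this.field("headings");
    docs.forEach(function (doc) {
      this.add(doc);
    }, this);
  });
  // JSON.stringify() uses the index's toJSON() method to serialise it.
  return JSON.stringify(index);
}
```

Under node.js the result could be written out with fs.writeFileSync(), and the browser would then fetch that file and hand the parsed JSON to lunr.Index.load() instead of indexing from scratch. The output file name and where it sits in the generator pipeline are left open here.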
Overall, I'm pleased with how little effort I've needed to implement a full-site search solution. lunr.js is ideally suited to a static site generator project, and it seems fast, configurable and easy to write with. Next step is the UI, and to explore saving or pre-generating the lunr index.