Brainstorming a new indie web search engine | James' Coffee Blog

Published under the IndieWeb category.

I would like for there to be more tiny search engines that are focused on a particular topic. It would be cool if I could type "coffee" into an indie web search engine and see what other bloggers have written on the topic.

A few years ago, I built an IndieWeb search engine that crawled ~1,000 sites once per week. In total, the tool indexed more than 500,000 pages. It took many months of my free time to build. TL;DR: web search is hard.

I was thinking about what a minimal indie web search engine would look like. My idea was a daily script (a rough sketch follows the list) that:

  1. Polls a list of RSS feeds.
  2. Retrieves the content from each feed and saves it in a database.
  3. Uses full-text search in a database like PostgreSQL for ranking.
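To make this concrete, here is a rough sketch of what that daily script could look like, assuming a feeds.txt file of feed URLs, a local PostgreSQL database, and the feedparser and psycopg2 libraries. All of the names here are illustrative placeholders rather than anything decided.

```python
# Rough sketch: poll feeds, store entries, and rank with PostgreSQL full-text search.
# Assumes a feeds.txt file of feed URLs and a local PostgreSQL database; all names
# here are placeholders.
import feedparser
import psycopg2

conn = psycopg2.connect("dbname=indieweb_search")

with conn, conn.cursor() as cur:
    # One table holding every post seen so far, with a generated tsvector column
    # so PostgreSQL can handle ranking (requires PostgreSQL 12+).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            url TEXT PRIMARY KEY,
            title TEXT,
            content TEXT,
            search tsvector GENERATED ALWAYS AS (
                to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''))
            ) STORED
        );
        CREATE INDEX IF NOT EXISTS posts_search_idx ON posts USING GIN (search);
    """)

    for feed_url in open("feeds.txt").read().split():
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            if not entry.get("link"):
                continue
            # Prefer full post content if the feed advertises it; fall back to the summary.
            content = entry.get("content", [{}])[0].get("value", "") or entry.get("summary", "")
            cur.execute(
                """INSERT INTO posts (url, title, content) VALUES (%s, %s, %s)
                   ON CONFLICT (url) DO UPDATE
                   SET title = EXCLUDED.title, content = EXCLUDED.content""",
                (entry["link"], entry.get("title", ""), content),
            )

# A search query is then a single SQL statement.
with conn, conn.cursor() as cur:
    cur.execute(
        """SELECT url, title,
                  ts_rank(search, websearch_to_tsquery('english', %s)) AS rank
           FROM posts
           WHERE search @@ websearch_to_tsquery('english', %s)
           ORDER BY rank DESC
           LIMIT 20""",
        ("coffee", "coffee"),
    )
    for url, title, rank in cur.fetchall():
        print(f"{rank:.3f}  {title}  {url}")
```

A cron entry (or similar scheduler) running this once a day would complete the picture.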

Optionally, there may be a link graph that is calculated after all feeds are polled.
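The link graph step could work, purely as one illustration, by scanning the stored post content for outbound links after polling finishes and counting site-to-site links as a crude ranking signal:

```python
# Sketch: build a domain-level link graph from stored post HTML after polling.
# Assumes the posts table from the sketch above; the inbound-link counts could be
# blended into ranking later.
from collections import Counter
from urllib.parse import urlparse

from bs4 import BeautifulSoup
import psycopg2

conn = psycopg2.connect("dbname=indieweb_search")
inbound = Counter()

with conn, conn.cursor() as cur:
    cur.execute("SELECT url, content FROM posts")
    for post_url, content in cur.fetchall():
        source = urlparse(post_url).netloc
        for anchor in BeautifulSoup(content or "", "html.parser").find_all("a", href=True):
            target = urlparse(anchor["href"]).netloc
            if target and target != source:
                inbound[target] += 1  # one more site-to-site link pointing at target

# Domains with the most inbound links from other indexed sites.
for domain, count in inbound.most_common(10):
    print(count, domain)
```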

The list of feeds could be in a GitHub repository, allowing community contribution. This should be opt-in.
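The repository could hold something as simple as a plain-text feeds.txt with one feed URL per line; the entries below are made up for illustration:

```
https://example.com/feed.xml
https://blog.example.org/rss
https://another-site.example/atom.xml
```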

The choice of working from a list of feeds is an intentional one. This way, the search engine can build its index without having to either crawl a sitemap or spider through an entire website.

In both cases, you end up crawling many URLs that are not relevant to search (e.g. date archives, category archives, and more). Crawling so many URLs introduces many hard-to-solve ranking challenges. There are probably clever ways to infer if a page is or is not an archive, for example, but that is more logic to write. (Side note: If you can think of a solution to this, please send me an email at readers [at] jamesg [dot] blog. I love thinking about this sort of thing!)
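To make the side note concrete, one naive heuristic (a guess, not a tested approach) is to flag URLs whose paths look like date, category, or pagination archives before indexing them:

```python
# Naive sketch: guess whether a URL is a date/category archive rather than a post.
# The patterns here are assumptions about common blog URL structures.
import re

ARCHIVE_PATTERNS = [
    r"/\d{4}/\d{2}/?$",          # /2024/06/  (month archive)
    r"/\d{4}/?$",                # /2024/     (year archive)
    r"/(category|categories|tag|tags|archive|archives)/",
    r"/page/\d+/?$",             # paginated archive pages
]

def looks_like_archive(url: str) -> bool:
    path = url.split("://", 1)[-1]
    return any(re.search(pattern, path) for pattern in ARCHIVE_PATTERNS)

print(looks_like_archive("https://example.com/2024/06/"))             # True
print(looks_like_archive("https://example.com/2024/06/05/my-post/"))  # False
```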

Spidering through websites also carries a greater risk of slowing down someone's site. Spidering should only be done when you are fully confident your spider can work through many of the weird parts of the web. In building any search engine, you quickly realise how much of a wild west the web is, technically -- there are so many points where things can go wrong. There are likely out-of-the-box tools to solve this problem, but there is still the mental burden of running a search engine that is crawling the open web.

If you only poll specific feeds, you don't have to worry much about rate limiting: you only request each feed once per run. This only works if a site advertises post contents in its feed: if not, you may want to retrieve the contents of each page directly. This is the moment where rate limiting needs to be engineered (i.e. ensuring you don't request more than X pages in a given time frame, respecting Retry-After headers, etc.).
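For the pages that do have to be fetched directly, the rate limiting could start as simply as a fixed delay per request plus honouring Retry-After. A sketch under those assumptions:

```python
# Sketch: polite page fetching with a fixed delay between requests and support for
# Retry-After headers. The one-second delay and retry cap are arbitrary choices.
import time
import requests

def fetch_politely(url: str, delay: float = 1.0, max_retries: int = 3) -> str | None:
    for _ in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "tiny-indieweb-search/0.1"})
        if response.status_code in (429, 503):
            # Respect the server's Retry-After header when it gives one (seconds form).
            retry_after = response.headers.get("Retry-After", "")
            time.sleep(float(retry_after) if retry_after.isdigit() else delay)
            continue
        time.sleep(delay)  # pause between requests so a site is never hammered
        return response.text if response.ok else None
    return None
```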

I don't plan on working on this project right now, but I did want to share my thoughts in case anyone finds it appealing.

The idea above is almost like a planet (a site that aggregates many feeds into one page), with two differences: (i) there is full-text search over the data you have retrieved, and (ii) you could ask people to provide archival feeds if they want, allowing you to build a greater index of a website.
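There is no particular format named above for those archival feeds, but one existing convention that would fit is RFC 5005 ("Feed Paging and Archiving"), where each feed document links to the previous one with rel="prev-archive". A crawler could walk those links back through a site's history, roughly like this:

```python
# Sketch: walk RFC 5005-style archived feeds by following rel="prev-archive" links.
# Whether a site offers these at all is up to the site owner; this is an assumption,
# not something decided above.
import feedparser

def walk_archive(feed_url: str, max_pages: int = 50):
    seen = set()
    while feed_url and feed_url not in seen and len(seen) < max_pages:
        seen.add(feed_url)
        feed = feedparser.parse(feed_url)
        yield from feed.entries
        # Look for a link pointing at the previous archive page, if the feed has one.
        feed_url = next(
            (link.get("href") for link in feed.feed.get("links", [])
             if link.get("rel") == "prev-archive"),
            None,
        )
```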

Web search is fascinating, but the more you can reduce the amount of code you need to think about -- and the number of pages you need to retrieve, and how often you retrieve them -- the better. Search engines can quickly go from being an interesting side project to something that feels like a full-time job.

Also posted on IndieNews

Respond to this post by sending a Webmention.

Have a comment? Email me at readers@jamesg.blog.