Indie Search Engines
What is Teclis?
Kagi is a privacy-focused, user-centric search engine. Great search experience starts with Kagi!
Wiby is a search engine for older style pages, lightweight and based on a subject of interest. Building a web more reminiscent of the early internet.
Find a web page made by an IndieWeb community member.
At Mojeek we like to do things differently, that's why we're building a search engine that respects your privacy whilst providing unique and unbiased results.
Ecosia uses the ad revenue from your searches to plant trees where they are needed the most. By searching with Ecosia, you’re not only reforesting our planet, but you’re also empowering the communities around our planting projects to build a better future for themselves. Give it a try!
spot ecloud global, powered by searx
Presearch is a decentralized search engine that provides search choice, quality results, privacy and rewards to those who want to end the search monopoly and take back the web.
Tools
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments). GitHub: adbar/trafilatura
Headless Chrome Node.js API. GitHub: puppeteer/puppeteer
A standalone version of the readability lib. GitHub: mozilla/readability
Lightning-fast, open source search engine for everyone
You can install it using pip:
FastAPI framework, high performance, easy to learn, fast to code, ready for production
Google Research. GitHub: google-research/google-research
A motivating factor is the search engine has sort of grown to a scale where it's becoming increasingly difficult to productively work on as a personal solo project. It needs more structure. What's kept me from open sourcing it so far has also been the need for more structure. The needs of the marginalia project, and the needs of an open source project have effectively aligned.
YaCy P2P - Decentralized Search Engine
Parse And Create Web ARChive (WARC) files with node.js. GitHub: N0taN3rd/node-warc
Enterprise Tools
Amazon Kendra offers an intelligent enterprise search solution that increases employee productivity and improves customer satisfaction.
Enterprises and developers use Algolia’s AI search infrastructure to understand users and show them what they’re looking for.
Search for Static Sites
Elasticlunr.js is a lightweight full-text search engine in JavaScript for browser search and offline search. It is based on Lunr.js but is more flexible, providing query-time boosting and field search. A bit like Solr, but much smaller and not as bright, while still offering flexible configuration and query-time boosting.
LunrSearch made simple
Pagefind is a fully static search library that aims to perform well on large sites, while using as little of your users’ bandwidth as possible, and without hosting any infrastructure.
Impossibly fast web search, built for static sites.
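The Elasticlunr.js description above mentions field search and query-time boosting. As a rough illustration of what those two terms mean (not Elasticlunr's actual API or scoring), here is a minimal Python sketch. The function names, the sample documents, and the raw term-count scoring are all assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a real engine would also stem and drop stop words."""
    return text.lower().split()

def build_index(docs, fields):
    """Build one inverted index per field: {field: {term: {doc_id: term_count}}}."""
    index = {field: defaultdict(dict) for field in fields}
    for doc_id, doc in docs.items():
        for field in fields:
            for term in tokenize(doc.get(field, "")):
                postings = index[field][term]
                postings[doc_id] = postings.get(doc_id, 0) + 1
    return index

def search(index, query, boosts):
    """Score documents by term counts, weighted per field by boosts chosen at query time."""
    scores = defaultdict(float)
    for field, boost in boosts.items():
        for term in tokenize(query):
            for doc_id, count in index[field].get(term, {}).items():
                scores[doc_id] += boost * count
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical documents for illustration only.
docs = {
    "a": {"title": "Static search", "body": "Search a static site without a server."},
    "b": {"title": "Gardening notes", "body": "Notes on static electricity and search lights."},
}
index = build_index(docs, ["title", "body"])
results = search(index, "static search", {"title": 2.0, "body": 1.0})
```

Because the boosts are passed to `search` rather than baked into the index, the same index can serve a title-heavy query and a body-heavy query without being rebuilt.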
Specific Search & Recommendation Platforms
Blog Surf is the internet's only search engine for blogs. Explore the best writing on the internet.
An open index of well-known resources.
TinyGem is a bookmarking service that automatically uses the links you save to surface other related content from manually curated sources. If you are intellectually curious, have a selective news diet and enjoy reading places like Hacker News, TinyGem might be for you.
Corpuses
Us
The HTTP Archive tracks how the web is built by periodically crawling the top sites on the web and recording detailed information about fetched resources, used web platform APIs and features, and execution traces of each page.
Crawl Techniques
Stealth mode: Applies various techniques to make detection of headless puppeteer harder. Latest version: 2.11.1, last published: 3 months ago. Start using puppeteer-extra-plugin-stealth in your project by running `npm i puppeteer-extra-plugin-stealth`. There are 334 other projects in the npm registry using puppeteer-extra-plugin-stealth.
I want to share lists of links, but make them readable and archived
Other languages:
In my blog post brainstorming a new indie web search engine, I noted that running a web search engine is hard. With that in mind, I started to think that I haven't written too much about what I learned about web crawling when running IndieWeb Search, a search engine for the indie web. IndieWeb Search crawled a whitelist of websites, searching for pages, and indexed them for use in the search engine.
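To make the "crawl a whitelist of websites" idea concrete, here is a minimal Python sketch of a breadth-first crawl that only follows links to allowlisted hosts. It is not IndieWeb Search's code. The `crawl` function, the injected `fetch` callable, and the in-memory `FAKE_WEB` pages are assumptions for illustration, and the sketch covers only the crawl step, not indexing, robots.txt handling, or rate limiting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, allowed_hosts, fetch, max_pages=100):
    """Breadth-first crawl that only follows links to hosts on the allowlist."""
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # fetch returns the page HTML, or None on failure
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links against the current page
            if urlparse(target).hostname in allowed_hosts and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages

# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.test/">Elsewhere</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://other.test/": "<p>Not on the allowlist</p>",
}

pages = crawl(["https://example.com/"], {"example.com"}, FAKE_WEB.get)
```

Passing `fetch` in as a parameter keeps the crawl logic testable offline; a real crawler would swap in an HTTP client that also respects robots.txt and crawl delays.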
Search Techniques
A cursory review of all the non-metasearch, indexing search engines I have been able to find.
Thanks to the multi billion dollar advertisement industry, searching for something on the internet …
Code
GitLab Enterprise Edition
Search without being tracked.
The source code and instructions to create your own version of Wiby.
Community search engine. GitHub: cblgh/lieu
Search as a service with YaCy Searchlab: Web Crawling and Data Science Apps for Web Content
Why Should We Care?
With a landmark antitrust trial under way, a giant of the modern web is buckling under its own weight.
A look at the new Tiptoe encrypted search system
Testimony during Google’s antitrust case revealed that the company may be altering billions of queries a day to generate search results that will get you to buy more stuff.
“The Three-Legged Stool” is the Initiative for Digital Public Infrastructure’s banner white paper: the culmination of our work here at the lab so far and our roadmap for our efforts in the coming years. It was written primarily by Chand Rajendra-Nicolucci and Michael Sugarman under the editorial direction of Ethan Zuckerman. Access “The Three-Legged Stool” […]
Talking About Search
.@vladquant has a new noncommercial search engine that looks pretty neat. I like this trend. https://t.co/6Y5UpmpaYj
— Ernie Smith (@ShortFormErnie) March 23, 2022
I chatted with him back in January: https://t.co/nWWxHPadRb
— Ernie Smith (@ShortFormErnie) March 23, 2022
Is there an open source object definition for search engine indexes that are willing to work together like Teclis, Kagi and Marginalia seem to do? Could I build my own index and federate in?
— Aram Zucker-Scharff (@Chronotope) March 23, 2022
Let’s invent a term: “The Oggoverse”
— Ernie Smith (@ShortFormErnie) March 23, 2022
The idea behind said term is that it’s basically the opposite of what Google does, so rather than starting with goog, it starts with oggo.
@robinberjon & @braedon were discussing this very thing and now it's in the back of my brain b/c I find it fascinating. Looking at the bottom of page documentation it looks like something I could roll into another effort I'm working on...
— Aram Zucker-Scharff (@Chronotope) March 23, 2022
I'm sick of dead links so building a tool to basically create an index of things I link to in a blog and archive them. I hadn't thought to make it searchable, but now I'm thinking it might make sense to AND it might be something I could offer up to use by indie search engines.
— Aram Zucker-Scharff (@Chronotope) March 23, 2022
Since I'm building an accompanying plugin that lets it work together w/11ty sites (& potentially other static site builders later) I think it would be cool to create a system where passionate bloggers build indexes that they can offer to search engines. https://t.co/LgIeYtsV26
— Aram Zucker-Scharff (@Chronotope) March 23, 2022
And @braedon's work indexing .well-known could be used as a ranking factor if he wanted to share it. It would be easy to see how these things could become mutually beneficial with partnered search engines providing functionality back to replace "Search with Google" embeds.
— Aram Zucker-Scharff (@Chronotope) March 23, 2022
I've been thinking about that a lot and I reckon we can fix it. Building a good index and building a good search UI are two very different things (as Google keeps demonstrating). The only reason to have them together is that the ads in the UI pay for the index. But... https://t.co/UloXHlkWle
— Robin Berjon (@robinberjon) March 18, 2022
...we can split that. There are many proposals for interoperability in social, which is useful but hard, we need to look at interoperability in search, which is a lot easier and comes with great benefits.
— Robin Berjon (@robinberjon) March 18, 2022
Competition in UI, built-in multihoming, integration into browsers and more, diversity of business models (ads or pay), build your own, merge results, no AMP, a return to media pluralism...
— Robin Berjon (@robinberjon) March 18, 2022
It's pure upside. We are *choosing* to live with shitty search by not doing this.
I'm very interested in how GitHub Actions (and the like) can make building rich indexes cheap and easy if handled properly. It's interesting to think how that might come together with a federation model and a Wikipedia editors approach to topical maintenance and interests.
— Aram Zucker-Scharff (@Chronotope) March 18, 2022
I've been wondering about distributed indices but I don't know if it can be done with potentially adversarial participants?
— Robin Berjon (@robinberjon) March 18, 2022
I think there's a good model in how Wikipedia editors handle refining and working on a piece and while, mechanically it is adversarial, I don't think it is philosophically. A mechanism for merging indices with particular priorities or outcome data seems like it would be useful.
— Aram Zucker-Scharff (@Chronotope) March 18, 2022
Maybe I don't have a clear view of what you have in mind, but that's not the adversarial I have in mind. I'm worried about malicious actors deliberately decreasing the value.
— Robin Berjon (@robinberjon) March 18, 2022
Indexing has to be automated, there's too much drudgery. How do you only get good indexing?
That's what I was thinking, you need good indexes and sure they have to be automated for collection, but if you have people engaged with topical expertise, you can still have editors who own and maintain indexes of subsets of the web with topical focuses. Then, join em together.
— Aram Zucker-Scharff (@Chronotope) March 18, 2022
Thinking along Mastodon lines here, for example, if you want to search for code tips you federate with the folks who have expertise in code and have selected the best sites to index in that space. It could incentivize good citizens to act well, like with Wikipedia editors.
— Aram Zucker-Scharff (@Chronotope) March 18, 2022
Hmm, an interesting idea. I wonder how effectively you could integrate heterogeneous indices into a cohesive general search experience?
— Braedon (@braedon) March 19, 2022
Wikipedia and Mastodon bring together content from many different sources, but all following a very restricted form.
But I suspect there's a lot of potential variation in how different indices are best structured, and queried.
— Braedon (@braedon) March 19, 2022
I run a very niche index in Well-Known, and would love to contribute that to a project like this, but it's built very differently to a general purpose search engine.
If you're meaning people only contribute on *what* gets indexed, with most of the technical indexing implementation common/fixed, that seems more clearly doable. But also suggests to me a centralised platform like Wikipedia, which needs a large entity to host it.
— Braedon (@braedon) March 19, 2022
I think it's possible to be more distributed than Wikipedia, but I agree that providing infrastructure is key. You need a lot of common interfaces and the ability to develop bureaucracy where needed.
— Robin Berjon (@robinberjon) March 19, 2022
Yeah, for sure! I think core concepts include being able to join & participate in an indexing project in some way, being able to request & refuse peering, & being able to merge similar structured indices w/different data, & allowing any indexing project to choose merge priorities
— Aram Zucker-Scharff (@Chronotope) March 19, 2022
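The "merge similar structured indices with different data, with merge priorities chosen by each project" idea from the thread can be sketched in a few lines of Python. Nothing here comes from an existing federation spec. The record shape (`{url: {...}}`), the `merge_indices` name, and the numeric priorities are assumptions made for illustration.

```python
def merge_indices(indices, priorities):
    """Merge {url: record} indices; when two sources list the same URL,
    the record from the source with the higher priority wins."""
    merged = {}
    # Apply sources from lowest to highest priority so later writes override earlier ones.
    for name in sorted(indices, key=lambda n: priorities.get(n, 0)):
        for url, record in indices[name].items():
            merged[url] = {**record, "source": name}
    return merged

# Hypothetical peer indices: a topical "code" index and a broader "general" one.
code_index = {
    "https://example.com/tips": {"title": "Code tips", "topic": "code"},
}
general_index = {
    "https://example.com/tips": {"title": "Tips page", "topic": "misc"},
    "https://example.com/news": {"title": "News", "topic": "news"},
}

merged = merge_indices(
    {"code": code_index, "general": general_index},
    {"code": 10, "general": 1},  # this project trusts the topical index more
)
```

Each indexing project could hold its own `priorities` table, so two projects peering with the same sources can still resolve conflicts differently.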
Indexing has to be automated, there's too much drudgery. How do you only get good indexing?
You hire librarians.
— melody joy kramer (@mkramer) March 18, 2022
Yeah, this too! It isn't accidental that Wikipedia editors and librarians tend to overlap. I think the thing we need is a way that empowers people with passion and expertise to make the best indexes and represent themselves as candidates to join your search tool to.
— Aram Zucker-Scharff (@Chronotope) March 18, 2022
I love this model (and librarians)! So, this would need shared access to crawl (so as not to duplicate infrastructure, à la common crawl), an API to query federated indices, and ILP so people can be paid and infra costs covered. It's not crazy hard!
— Robin Berjon (@robinberjon) March 18, 2022
I don't know if you'd need a shared crawler, crawling is easy and these days fairly cheap. You just need a shared crawler standard and an agreement to use a common user agent I think?
— Aram Zucker-Scharff (@Chronotope) March 19, 2022
I think this could get really interesting if it came together with Hyperdrive, your browsing then stands in for the crawler and your storage hosts an index which gets auto joined via connections, with larger servers using Hyper agents to pull federation into the standard web.
— Aram Zucker-Scharff (@Chronotope) March 19, 2022
I considered that, but how do you prevent this from including super private stuff that shows up in page?
— Robin Berjon (@robinberjon) March 19, 2022
Yeah that part is a little messy, but maybe start with 'click to add', build a blocklist, evolve the process. Private URLs are a relatively small set compared to the rest of the web I'd bet, and many have common characteristics.
— Aram Zucker-Scharff (@Chronotope) March 19, 2022
I don't know, that would not be foolproof enough. Like, your name would be caught up with news content if you're signed in.
— Robin Berjon (@robinberjon) March 19, 2022
How about: sites *push* their content to be indexed, along with IP terms (to prevent the hostile mining that Google does but also independent repub), and Hyper agents are used to verify that real site content matches the index, to catch cheaters.
— Robin Berjon (@robinberjon) March 19, 2022
Plus, detection of interstitials, notification prompts, bad cookie banners, perf...
— Robin Berjon (@robinberjon) March 19, 2022
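The push-and-verify idea in the thread above (sites push content, independent agents check that the live site matches) could be approximated by comparing content fingerprints. This is only a sketch under stated assumptions: `fingerprint`, `verify_push`, the injected `fetch` callable, and the sample data are hypothetical, and exact hash matching is stricter than a real system would likely want.

```python
import hashlib

def fingerprint(text):
    """Hash whitespace-normalised, lowercased text so trivial formatting changes are ignored."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def verify_push(pushed, fetch):
    """Return the URLs whose pushed text differs from what the verifier fetched itself."""
    mismatches = []
    for url, text in pushed.items():
        live = fetch(url)  # fetch returns the live page text, or None if unavailable
        if live is None or fingerprint(live) != fingerprint(text):
            mismatches.append(url)
    return sorted(mismatches)

# Hypothetical pushed content and the "live" content a verifying agent saw.
pushed = {
    "https://example.com/a": "Honest  page text",
    "https://example.com/b": "Keyword keyword keyword",
}
live_pages = {
    "https://example.com/a": "honest page text",
    "https://example.com/b": "Something else entirely",
}

cheaters = verify_push(pushed, live_pages.get)
```

A production verifier would more likely use a similarity measure than an exact hash, since legitimate pages change between push and check.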
A guide for how to discover cool things on the internet.
Hello. I was going to write a post about how to surf the web only I remembered it had already been written, in a far more comprehensive format, by another person. So I'm just going to link to it and…
Meilisearch is neat, together with the tokenizer lib they use. More practically, DocSearch is a great plug-and-use solution. Tantivy, Quickwit & Edgesearch are interesting too.
This article is a stub. You can help the IndieWeb wiki by expanding it.
Hey nerds: I recently stumbled across “Marginalia Search”. It’s a search engine with a fascinating design — rather than give you exactly what you’re looking for, it tries to surprise you.
Indie Map is a complete crawl of 2300 of the most active IndieWeb sites as of June 2017, sliced and diced and rolled up in a few useful ways:
🍵️
The way to improve search is not to mimic Google, but instead to build boutique search engines that index, curate, and organize things in new ways.
bookmark
Kyle Chayka writes about the evolution of Google Search, which has become the runaway favorite Internet search engine despite many users’ misgivings about how the company monetizes the data it collects and how its algorithms determine the search results that a user is shown.
I would like for there to be more tiny search engines that are focused on a particular topic. It would be cool if I could type in