Self Hosted Search

I made a big deal out of switching to DuckDuckGo for the site search here at Macdrifter. I was very happy with the results on the WordPress version of the site. However, when I switched to Pelican, DDG presented both the old cached links and the new links. I was generating a new site map a couple of times a day but, unlike Google, DDG cannot be pinged with the new site map. So I waited. And waited. After several weeks, the results have gotten better but a lot of the old links still show up. After multiple emails with DDG and a final reply from them saying there is nothing further they will do, I decided to shop around.

If you have done a search here in the past week you probably noticed I was using Google. Google worked fine but I still didn’t like the results or the look. So I continued to shop until I found Sphider.

Sphider

Sphider is a PHP web app with a MySQL backend. Sphider will crawl a site, including external links if necessary, and generate a profile of each page. Sphider is also a search engine and web front-end with some nice features like spelling suggestions, word completion and boolean logic.

Go ahead, try it

It’s blazing fast and I think the results are much better than either Google or DuckDuckGo.

Sphider is also ridiculously easy to install and configure.

Installation

The Documentation doesn’t lie. It’s as simple as it looks. Here’s how I did it.

  1. I created a new sub-domain named “search.macdrifter.com”
  2. I added a new PHP app on the domain
  3. I dragged the Sphider files into the new web app directory on my host using Coda 2.
  4. Using phpMyAdmin I created a nice, new, clean database for the site.
  5. I edited the database.php file to add my DB connection settings.
  6. I visited the install page on the site to have Sphider set up the DB
  7. Using the Sphider admin page, I configured the crawler to crawl Macdrifter.com
  8. I smoothed off the edges of the design and automation

Since I am using a subdomain, I configured Sphider to crawl the main site but to skip some directories that would clutter results. The exclusion list can include regex but this works well for me:

:::text
/uploads/
/tag/
/category/
/author/
http://www.macdrifter.com/tags.html 
www.macdrifter.com/index

I want to avoid duplicate results from summary pages like “tags” and “author” pages. I also don’t want the index pages to show up.
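
To sanity-check a list like this before running a crawl, here is a minimal Python sketch. It assumes each entry is treated as a plain substring of the URL and ignores the regex option entirely. The helper name `is_excluded` and the sample URLs are made up for illustration, and this is not Sphider's own matching code.

```python
# Hypothetical helper: treats each exclusion entry as a plain substring test.
# This is an assumption for illustration, not Sphider's actual matching code.
EXCLUDES = [
    "/uploads/",
    "/tag/",
    "/category/",
    "/author/",
    "http://www.macdrifter.com/tags.html",
    "www.macdrifter.com/index",
]


def is_excluded(url, excludes=EXCLUDES):
    """Return True if any exclusion string occurs anywhere in the URL."""
    return any(pattern in url for pattern in excludes)


if __name__ == "__main__":
    # Sample URLs are invented for the demonstration.
    samples = [
        "http://www.macdrifter.com/tag/python/",
        "http://www.macdrifter.com/index2.html",
        "http://www.macdrifter.com/2012/10/some-post.html",
    ]
    for url in samples:
        print("skip " if is_excluded(url) else "crawl", url)
```

Running a handful of real post URLs through a check like this is a quick way to confirm that a short entry such as `/tag/` doesn’t accidentally match posts you want indexed.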

I don’t want search engines crawling my search engine because it might create Inception-like dream worlds. I simply created a new robots.txt file for the Sphider web app:

:::text
User-agent: *
Disallow: /
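
A quick way to confirm that these two lines really block everything is Python’s standard `urllib.robotparser`. This is only a sketch: the `allowed` helper is my own name, and the rules are pasted in as a string rather than fetched from the live site.

```python
from urllib.robotparser import RobotFileParser

# The same two rules as the robots.txt above, inlined for the check.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The crawler name is just an example; any user agent falls under "*".
    print(allowed("Googlebot", "http://search.macdrifter.com/search.php"))
```

Keep in mind that robots.txt is only a request. Well-behaved crawlers honor it, but nothing enforces it.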

I also added an .htaccess file so that the main URL serves the advanced search form:

:::text
DirectoryIndex search.php

Reindexing

This was tricky for me. I want the site to reindex throughout the day. Luckily, Sphider can be called from the command line with a number of useful parameters. But there’s a catch: parameters given on the command line override the settings stored in the admin console. That doesn’t matter much for me. I include everything I need in one command.

I set up a cron job to run every 10 minutes throughout the day to reindex the site. Indexing is very fast. However, since a Pelican site is regenerated every time a new post is added, Sphider completely rebuilds the index every time it runs. It still only takes a few seconds.

Here’s the cron line I added:

:::text
*/10 * * * * cd ~/webapps/sphyder/admin && php spider.php -u http://www.macdrifter.com -r -f -n '/uploads/\n/tag/\n/category/\n/author/\nhttp://www.macdrifter.com/tags.html\nwww.macdrifter.com/index'

Notice that I have included the exclusion directories right in the command. I also needed to change directory for some reason. The spider.php command fails every other way I tried from cron. It works from the command line at the shell, but not in cron. I’d love to hear a better solution.
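
One possible alternative, offered only as a sketch, is to move the whole thing into a small wrapper script and let cron call that. Cron starts jobs in the home directory, so my guess (an assumption, not something I have confirmed) is that spider.php loads its other files by relative path and needs to start in the admin directory. The wrapper below sets the working directory explicitly. The function names are made up, and the exclusions are joined with a literal backslash-n only because that is what the cron line passes.

```python
import subprocess
import sys
from pathlib import Path

# Path and options are copied from the cron line; adjust for your own host.
ADMIN_DIR = Path.home() / "webapps" / "sphyder" / "admin"
SITE_URL = "http://www.macdrifter.com"
EXCLUDES = [
    "/uploads/",
    "/tag/",
    "/category/",
    "/author/",
    "http://www.macdrifter.com/tags.html",
    "www.macdrifter.com/index",
]


def build_command(url=SITE_URL, excludes=EXCLUDES):
    """Build the same argument list the cron line passes to spider.php."""
    # The cron line separates exclusions with a literal backslash + "n",
    # so join with those two characters rather than a real newline.
    return ["php", "spider.php", "-u", url, "-r", "-f", "-n", "\\n".join(excludes)]


def reindex(admin_dir=ADMIN_DIR):
    """Run the crawler with admin_dir as the working directory."""
    # cwd= does the job of the "cd ... &&" prefix in the cron line.
    return subprocess.run(build_command(), cwd=admin_dir, check=True)


if __name__ == "__main__":
    if ADMIN_DIR.is_dir():
        reindex()
    else:
        print(f"Sphider admin directory not found: {ADMIN_DIR}", file=sys.stderr)
```

The cron entry then only needs to call the script with `python3`, and the exclusion list lives in one readable place instead of a long quoted string.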

Doing it with Style

The basic Sphider installation is plain. It looks like Google, which is ugly. It’s easy to touch up Sphider. I edited the CSS file in the /sphider/template/standard directory. I also modified the header.html file in the same location.

Conclusion

Right now, I’m very happy with the new search option. There are no ads, no tracking cookies and most important, the results are very accurate. I also like that I have complete control over the presentation layout and styling. Sphider provides access to keyword weighting which will allow me to adjust how top result ranking is determined. I haven’t needed to adjust it yet. I still have a couple more tweaks to apply but it looks consistent. The entire process from beginning to end required about an hour of my time and that includes tweaking CSS.