WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
========================================================
The progam ht://Dig is a free (GPL) internet indexing and search program. The
ht://Dig documentation describes itself as follows:
The ht://Dig system is a complete world wide web indexing and
searching system for a small domain or intranet. This system
is *not* meant to replace the need for powerful internet-wide
search systems like Lycos, Infoseek, Webcrawler and AltaVista.
Instead it is meant to cover the search needs for a single
company, campus, or even a particular sub section of a web site.
As opposed to some WAIS-based or web-server based search
engines, ht://Dig can span several web servers at a site. The
type of these different web servers doesn't matter as long as
they understand the HTTP 1.0 protocol.
ht://Dig was developed at San Diego State University as a way
to search the various web servers on the campus network.
I have written WWWOFFLE so that ht://Dig can be used with it to allow the
entire cache of pages can be indexed. There are three stages to using the
program that are described in this document; installation, digging and
searching.
Getting ht://Dig
----------------
ht://Dig is available from the web site
http://www.htdig.org/
You need to have version 3.1.0b4 or later of htdig.
No special compile-time configuration of htdig is required to be able to use it
with WWWOFFLE.
I tested with version 3.1.6 using the official Debian package.
Configure ht://Dig to run with WWWOFFLE
---------------------------------------
If you have already got ht://Dig installed on your system, for example as part
of a Linux distribution, then you may need to make some changes to the
configuration files. The problem is that ht://Dig sets some of the parameters
at compile time in the HTML templates that is uses. This makes it impossible to
use the same "common files" with more than one search path on the same system.
Using WWWOFFLE to run ht://Dig will mean that the base URL for the htsearch form
and all images is '/search/htdig/'. If the configuration file has a different
compiled in variable (often '/htdig/') then the images will not be found. The
changes that you need to make are the following:
In htsearch.conf (in /var/spool/wwwoffle/search/htdig/conf) you need to add the
following lines (I have already done it in the default config file):
allow_in_form: image_url_prefix
image_url_prefix: /search/htdig
The HTML template files that htsearch uses are called footer.html, header.html,
nomatch.html, syntax.html and wrapper.html. They will be installed in different
places depending where your version of ht://Dig came from (for Debian GNU/Linux
they are in /etc/htdig) you need to replace image references like:
with
Unfortunately making this change will mean that the template files will no
longer work with any other ht://Dig database that uses them. You could make a
copy of the template files somewhere else and modify the config file to
reference them.
Obviously if you are going to make this change then you may as well just edit
the files and not bother with the $(IMAGE_URL_PREFIX) variable but use
'/search/htdig' instead.
Configure WWWOFFLE to run with ht://Dig
---------------------------------------
The configuration files for the ht://Dig programs as used with WWWOFFLE will
have been installed in /var/spool/wwwoffle/search/htdig/conf when WWWOFFLE was
installed. The scripts used to run the htdig programs will have been installed
in /var/spool/wwwoffle/search/htdig/scripts when WWWOFFLE was installed. In
both these cases the directory /var/spool/wwwoffle can be changed at compile
time with options to the configure script.
These files should be correct if the information at the time of running
configure was set correctly. Check them, they should have the spool directory
and the proxy hostname and port set correctly.
Also they should be checked to ensure that the ht://Dig programs are on the path
(you can edit the PATH variable here if they are not in /usr/local/bin). The
merging process can use a lot of disk space when the sort program is run, you
can change the location of the temporary directory used for this with the TMPDIR
variable.
The Fuzzy Database
------------------
The ht://Dig programs use a database of fuzzy word endings and synonyms. This
needs to be created just once, there is a script provided with WWWOFFLE that
does this.
/var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htfuzzy
If you have an existing ht://Dig installation then this step will probably have
already been performed and is not required again.
Note: When you do this it will take a *long* time since it produces two
databases that htsearch uses to help in matching words.
Digging and Merging
-------------------
Digging is the name that is given to the process of searching through the
web-pages to make the list of words. Merging is the process of converting the
raw list of words into a database that can be searched.
The ht://Dig installation will include a script called 'rundig' that
demonstrates how digging and merging is supposed to work. To work with WWWOFFLE
I have produced my own scripts that should be used instead.
/var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htdig-full
/var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htdig-incr
/var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htdig-lasttime
The first of these scripts will do a full search and index all of the URLs in
the cache. The second one will do an incremental search and will only index
those that have changed since the last full search was done. The third will add
in the files in the lasttime index into the database.
Unfortunately due to the way that the htmerge program works, it will take almost
as long to do an incremental search or a lasttime search as to do a full search.
The only differnce is that for the incremental search and lasttime search the
WWWOFFLE cache is only accessed for the files that have changed.
If you cannot get htdig version 3.1.6 or 3.2.0 to index any pages then try
removing the line in the file /var/spool/wwwoffle/html/en/robots.txt that says
'Disallow: /index' since it triggers a bug in htdig that stops it searching
properly.
Searching
---------
The search page for ht://Dig is located at http://localhost:8080/search/htdig/
and is linked to from the "Welcome Page". The word or words that you want to
search for should be entered here.
This form actually calls the script
/var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htsearch
to do the searching so it is possible to edit this to modify it if required.
Thanks to
---------
I would like to thank the htdig maintainer (Geoffrey.R.Hutchison@williams.edu)
for the help that he has provided to get me started with htdig and the patches
and comments that he has accepted from me into the htdig program.
Andrew M. Bishop
6th Aug 2001