Welcome to TR-WWW
Feedback to Adrian Vanzyl at adrian.vanzyl@med.monash.edu.au. The most
recent beta version (1.2.1beta) is available for downloading from here.
TR-WWW has been
developed at Monash Medical Informatics (MMI).
Having problems? See the
troubleshooting section.
Note - you MUST have
the latest version of machttp for TR-WWW to work (ie version 1.3 beta 8 or
later). This can be retrieved from Chuck Shotton's machttp home page.
You must also be using the latest versions of MacWeb and Mosaic (and version 0.93
or later of Netscape).
What is TR and TR-WWW?
TR is Total Research, the complete research system for dealing with unstructured
textual information, searching it, extracting information, generating reports and
integrating this with a database. It has been developed by Chris Priestley, and
is a commercial, fully supported product with a large user base. Contact
chris@woody.apana.org.au for more details.
TR-WWW is a modified version of TR designed to work with MacHTTP. It has all the
search strengths of the normal version of TR, but requires a web client to access
its functionality and to act as its user interface. It has been written by Adrian
Vanzyl at MMI. It has been
released as shareware, and is distributed through the Internet. See the payment and order forms at the end of this document.
Special features
-- Fat Binary TR-WWW runs in native mode on
both the PowerPC and 68K based Macintoshes.
-- No Preindexing
TR
uses a unique algorithm to provide for rapid real time searching of text
containing documents. This is ideal for rapidly changing document sets.
Depending on the processor power available, it provides acceptable performance
for individual document sets up to around the 15 megabyte size. By breaking
large document collections into sets of less than 15M (and allowing the user to
limit the search to the selected document set), the actual document collection
can be any arbitrary size.
-- Provides context or relevance finding
Search results can be returned either using relevance ranking (similar to the way
that WAIS returns matches), or via a keyword in context system. The latter is
most appropriate for searching when you want more information about the matches
that were found before retrieving the whole document, or for large documents (eg
the CIA world factbook), where it is not necessary or desirable to retrieve the
entire document as a single file.
-- Record Delimiters
Record
delimiters may be defined so that searching a large file of archived mail
messages for example will return individual messages that match, extracted from
the larger document. This is ideal for searching digests, exported database
records and other similar documents that have some record structure.
-- Low
maintenance
Simply add files to the document sets folder to make them
instantly available for searching
-- Dynamic searching
If one of the
documents in the loaded document set is constantly changing (eg a newsfeed that
is being appended to a file), the search when performed searches the document as
it exists at the instant of the search, ie it always searches the latest version
of the document. It also automatically searches any new documents that appear in
a folder.
-- Reads any text containing documents
Will correctly
search text, Word and HTML documents.
-- Boolean searches
Supports
OR, AND, NEAR, NOTNEAR and PHRASE searches.
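As a rough illustration of the record delimiter feature listed above, here is a minimal Python sketch. It is not TR's actual code. It assumes a record runs from one line that begins with the delimiter string up to the next such line; the function name extract_record and the sample mailbox text are made up for this example.

```python
def extract_record(text, match_offset, delimiter):
    """Return the record that contains match_offset.

    A record starts at a line beginning with `delimiter` and ends just
    before the next such line (or at the end of the text).
    """
    # Collect the offset of every line that begins with the delimiter.
    starts = []
    pos = 0
    for line in text.splitlines(keepends=True):
        if line.startswith(delimiter):
            starts.append(pos)
        pos += len(line)

    # Last delimiter at or before the match, else the start of the file.
    begin = max((s for s in starts if s <= match_offset), default=0)
    # First delimiter after the match, else the end of the file.
    end = min((s for s in starts if s > match_offset), default=len(text))
    return text[begin:end]


mailbox = (
    "From alice\nLunch on Friday?\n"
    "From bob\nThe server is down again.\n"
    "From carol\nMinutes attached.\n"
)
offset = mailbox.find("server")
record = extract_record(mailbox, offset, "From ")
print(record)
```

Only the message from bob is returned, not the whole mailbox. A file with no delimiter lines comes back whole in this sketch; the source says TR limits the amount returned with the MARGINS setting instead.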
Setting up
tr-www.cgi is the TR-WWW application program.
tr-www.prompt is the file returned when the TR-WWW application is first accessed.
It defines the form interface that is used to access TR-WWW.
tr-www.config
is the configuration file.
(These can be renamed to anything as long as the
three file extensions match up as above).
Docs folder is the name of the
folder that holds all available document sets (this can be changed in the .config
file).
You can also create multiple custom forms of your own based on the
.prompt file, all having their submit action linked to the tr-www.cgi file.
Place these files in the same folder as MacHTTP (this is critically important -
the tr-www.cgi program must be in the same folder as machttp).
Place
some files, folders or aliases in the Docs folder (aliases can only be to other
files that exist in the Docs folder or its subfolders).
Run TR-WWW.
From
your web client, which must be capable of viewing forms (eg mosaic 2 or netscape), open
the url
http://your.server/tr-www.cgi
This will activate TR-WWW, and since
there is no search string, it will return its introductory message (which is the
.prompt file), with a pop up list of available document sets appended at the end.
TR will be correctly launched and set up the default document set as specified
in the .config file if it is not already running when a client requests a
search.
Searching
After opening the above URL, you can type in a
search term (or multiple terms separated by spaces), then click on the search
button. The default is to treat the words as joined by a boolean OR
(override this with the popup boolean list). Note - if directly sending a
search string with more than one word from the open url menu, place plus '+'
signs or '%20' between each word, NOT SPACES!!
Configuring
You can have as many different copies of TR running as you
wish, each with a different default set of documents. Each can have its own
default message by simply adding the word .prompt to an appropriate file that
starts with the name of the TR application with which you wish to associate it,
eg if you have a copy of TR called TRDictionary.cgi then its default reply file
should be called TRDictionary.prompt (and likewise for its .config file). It can
also have its own documents set folder, eg TRDictionarySets (simply specify this
in its config file). In general however, you only need to run one copy of TR,
since by using custom forms you can have different interfaces, specifying
different document sets and search modes.
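The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention, and the helper name companion_files is hypothetical.

```python
from pathlib import PurePath


def companion_files(app_name):
    """Given the application file name, return the names TR-WWW is
    described as looking for: same base name, different extension."""
    stem = PurePath(app_name).stem
    return {"prompt": stem + ".prompt", "config": stem + ".config"}


files = companion_files("TRDictionary.cgi")
print(files)
```

For TRDictionary.cgi this gives TRDictionary.prompt and TRDictionary.config, matching the example in the text.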
.prompt file
This is the default file returned when the TR-WWW
application is accessed without any search parameters, eg
http://your.server/tr-www.cgi
It must be a file containing HTML, with the following provisos:
- Must have a FORM tag
- Must have HTML and BODY tags at its start, but not /HTML, /BODY or /FORM
tags at its end (these will be added by TR-WWW)
- Can use either a GET or POST method to submit the query (POST is preferred).
Customise this file for your own site, change the form
parameters, and use inline images as you wish. TR usually lists the currently
loaded document set in a pop up list after this prompt message. You can override
this in the .config file.
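The provisos for a .prompt file can be checked mechanically before installing it. The following Python sketch is an assumption about how one might do that, not part of TR-WWW. The function name check_prompt and the sample form are invented for the example, and the check does not verify that the HTML and BODY tags come at the very start of the file.

```python
import re


def check_prompt(html):
    """Return a list of problems found in the text of a .prompt file."""
    problems = []
    for tag in ("HTML", "BODY", "FORM"):
        # The opening tag must be present somewhere in the file.
        if not re.search(r"<\s*%s\b" % tag, html, re.IGNORECASE):
            problems.append("missing <%s> tag" % tag)
        # The closing tag must be absent, because TR-WWW appends it.
        if re.search(r"<\s*/\s*%s\s*>" % tag, html, re.IGNORECASE):
            problems.append("remove </%s>, TR-WWW adds it" % tag)
    return problems


sample = (
    '<HTML><BODY><FORM ACTION="tr-www.cgi" METHOD="POST">'
    'Find: <INPUT NAME="__FIND"><INPUT TYPE="submit">'
)
print(check_prompt(sample))
print(check_prompt(sample + "</FORM></BODY></HTML>"))
```

The first call reports no problems; the second reports the three closing tags that should be left for TR-WWW to add.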
.config file
This sets up the defaults for TR-WWW. A client can
override many of these for a particular search. They revert back to the defaults
as specified in the .config file when that search finishes (ie clients can only
make a temporary change for their own search; it does not affect other clients'
searches). See the .config file.
Here is a sample copy of a .config
file.
# This is the TR-WWW configuration file
# This file must have the same name as the TR application to which it applies,
# with .config appended, eg tr-www.config for application tr-www.cgi
# NOTE - these are all defaults that will be used unless the client overrides it
# To change the strings returned by TR-WWW to clients (to customise them or to
# change to a non english language), see the manual.
###################################
# adrian vanzyl, adrian.vanzyl@med.monash.edu.au
# http://informatics.med.monash.edu.au
###################################
# To have TR return the currently available document set after listing the .prompt file, set to TRUE
SHOWDOCUMENTSET TRUE
# To return context lists, set to CONTEXT, or for relevance finds, set to RELEVANCE
# This would usually be overridden by the client's form
RESULTSMODE RELEVANCE
# The number of hits to return when doing a context find, around 40 is a good number
# This would usually be overridden by the client's form
MAXHITS 50
# Set to TRUE to disallow clients from being able to specify a $path set of files
# Use this when a preloaded document set is provided, and you don't want clients to override it
IGNOREPATH FALSE
# Set to PHRASE, AND, OR, NEAR, NOTNEAR to specify the default search mode
SEARCHMODE OR
# Default number of chars between words when doing a NEAR or NOTNEAR search
NEARVALUE 40
# Specify the name of the default document set, can be a filename or foldername
# remember that documents have to be at or below machttp in the hierarchy
# folders and aliases are OK
DOCUMENTSET Docs
# Quit set to true forces tr-www to quit after completing each request
# Leave it set to false to avoid wasted time in reloading tr-www for each search
QUIT FALSE
# EntireFile set to true allows the user to retrieve the entire file in its native
# format when viewing an extracted chunk from a context search
# if set to false, disallows this by not showing the link to the whole file
# use FALSE when you don't want people to accidentally download large files
ENTIREFILE TRUE
# Default number of chars to return when a context line has been selected.
# For small files, where you want the entire file to be returned, make this a large nr,
# eg 999999. Suggested value is 1500 chars either side of match
# This margin is used when:
# - no delimiters are found or specified (see next section)
# - it also determines the maximum number of chars either that may be returned if
# delimiters are used.
# Thus - if you are using delimiters, set MARGINS to the max size of each record (eg
# several kilobytes). And if not using delimiters, set MARGINS to a couple of hundred bytes
MARGINS 16000
# Delimiters specify how a large document with context matches is broken down into
# smaller segments, eg for a large file full of mail messages, each 'record' is anything
# that begins with 'From: '. Other delimiters may be eg a line of dashes, or two carriage
# returns etc. You may have up to 8 delimiter strings. Just remember that each
# one is used in the order given
# The actual delimiter string is the text starting after the word and space 'DELIMITER ',
# up to but not including the terminating carriage return
# This string must always occur at the start of a line
# REPEAT - the string MUST OCCUR AT THE START OF A LINE!!
# As long as the line begins with the delimiter string, it is considered a match
# For small documents or those without delimiters, leave the following lines commented
#DELIMITER -----------------------
#DELIMITER ********************
# For my mail files I use MARGINS of 16000 and a Delimiter of 'From '
#DELIMITER From
# For digest files I use MARGINS 16000 and a Delimiter of 'Topic No. '
DELIMITER Topic No.
# The next two options specify how context lines are displayed
# MATCHPOSITION specifies the position of the match from the left margin
# suggested position is 35
# WIDTH specifies the total width of each context line
# suggested width is 75 characters
MATCHPOSITION 35
WIDTH 75
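To make the file format concrete, here is a small Python sketch of a reader for a file laid out like the sample above. It assumes one directive per line, a '#' at the start of a line marking a comment, a single space between keyword and value, and up to 8 DELIMITER lines. The function parse_config is hypothetical and says nothing about how TR-WWW itself parses the file.

```python
def parse_config(text):
    """Parse config text into (settings, delimiters)."""
    settings = {}
    delimiters = []
    for raw in text.splitlines():
        line = raw.lstrip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (including commented-out directives)
        key, _, value = line.partition(" ")
        if key == "DELIMITER":
            # Keep the value untouched: a trailing space is part of the delimiter.
            delimiters.append(value)
        else:
            settings[key] = value.strip()
    if len(delimiters) > 8:
        raise ValueError("at most 8 DELIMITER lines are allowed")
    return settings, delimiters


sample = "\n".join([
    "# comment lines start with a hash",
    "RESULTSMODE RELEVANCE",
    "MAXHITS 50",
    "#DELIMITER From ",
    "DELIMITER Topic No. ",
    "WIDTH 75",
])
settings, delimiters = parse_config(sample)
print(settings, delimiters)
```

Delimiters are kept in a list rather than in the settings dictionary because the source says they are applied in the order given.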
Customising the text returned by TR-WWW
To fully customise the text
returned by TR-WWW (eg to change it to a foreign language), you need to edit two
things:
the .prompt file (this is the easy part)
and the string STR
resources in the application (this is not too difficult). Use resedit on a copy
of the file. All the STR resources will need to be edited. Their names indicate
what they should say. Have a look at the TR-WWW response screens using a web
client, and see how each string is put together to get the final result. I would
be very interested in distributing internationalised versions of TR-WWW. If you
convert it, please send me a copy of the application back, and I will incorporate
the string resources into the next version, with a flag for which language you
wish to run it under.
Adding documents
All searchable documents have to be in a documents set
folder. The default folder name is Docs, but this can be changed in the .config
file. As an example, you can create multiple folders (or files, or aliases of
files or folders) in the Docs folder, each dealing with a specific category of
information. When a client chooses that document set from the pop up documents
sets list, all the files (including all files in folders and subfolders etc) in
the folder will be searched.
Clients can run searches on any document or
folder by specifying its path in the __DOCUMENTSET forms parameter. All TR does
whenever it gets the __DOCUMENTSET parameter is to load all the files within it,
and search them for the __FIND term. You can thus set up your own pop up list of
document sets, without relying on the one created by TR (just change the .config
file to not return the document set). IMPORTANT - see the note below on the
__DOCUMENTSET parameter. You can ONLY specify files that are within the Docs
folder.
Aliases are acceptable, provided that the following rule is adhered to.
GOLDEN
RULE - All files to be searched (and resolved aliases of such files) must be in
the Docs folder or below, or clients will get 'Unable to access' errors (this is
for security and simplicity reasons).
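The golden rule amounts to a containment test on the resolved path. The Python sketch below shows the general idea on a modern file system, using symbolic link resolution as a stand-in for Macintosh aliases. It is an illustration under those assumptions, not TR-WWW's implementation, and the name is_inside is invented.

```python
import os
import tempfile


def is_inside(docs_root, candidate):
    """True if candidate, after resolving links, is docs_root or below it."""
    root = os.path.realpath(docs_root)
    target = os.path.realpath(candidate)
    return os.path.commonpath([root, target]) == root


with tempfile.TemporaryDirectory() as top:
    docs = os.path.join(top, "Docs")
    os.makedirs(os.path.join(docs, "f2"))
    # A file in a subfolder of Docs is acceptable.
    inside = is_inside(docs, os.path.join(docs, "f2", "notes.txt"))
    # A path that climbs out of Docs is not.
    escaped = is_inside(docs, os.path.join(docs, "..", "secret.txt"))

print(inside, escaped)
```

Resolving the path before comparing matters: a link stored inside Docs that points elsewhere would pass a naive string prefix check.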
Setting Options from the Client
Standard forms controls provide
access to all the features of TR. Most of these are given default values in
the .config file.
PLEASE see the .prompt file for details.
The keywords
are:
__FIND=find+terms
This specifies the words to be searched. Clients
can type in spaces (which are converted to pluses), but if you create a URL
directly, don't use spaces! The booleans (below) specify the relationship
between the first and subsequent words.
__DOCUMENTSET=File_or_folder
This sets the document set to be searched. It can be:
A single file, a
folder name, or an alias.
NOTE - this file or folder name must be that of a file or folder in your
documents directory, and its path is specified without the documentset folder
preceding it, eg
- folders f1, f2 and f3 are in my default Docs directory
- I want to search only the f2 folder
- use __DOCUMENTSET=f2, and NOT __DOCUMENTSET=Docs/f2
Again:
- all files and folders to be searched must be in your documents folder (default 'Docs')
- do not put the documents folder name in front of document names when using __DOCUMENTSET=
TR-WWW usually constructs this list for you automatically by listing all the
files/folders in the specified document sets folder.
__searchmode=type
This sets the search mode. Type can be:
AND - all words in the following search string must occur in a file
OR - any word in the following search string may occur in a file
PHRASE - the exact phrase in the search string must occur in a file (word for
word, including punctuation but ignoring case)
NEAR - each word in the search string must occur in a file near the first word
in the search string. The nearness can also be specified (see below).
NOTNEAR - the opposite of NEAR.
__resultsmode=mode
Mode can be either CONTEXT to return results as a
context list, or RELEVANCE to return only a relevance list. eg
__resultsmode=context
__maxhits=number
Specifies the number of hits (result lines) returned by
a context find. eg __maxhits=100
__nearvalue=number
For NEAR or NOTNEAR searches, specifies the number of
characters within which each match must/must not occur. eg __nearvalue=40
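When a search URL is built by hand or by a script rather than by a form, the keywords above have to be encoded correctly, with spaces turned into pluses. The Python sketch below uses the standard urllib.parse.urlencode for this. The helper build_query is hypothetical, and attaching the result to the program URL after a '?' (the usual GET form convention) is an assumption, not something the text spells out.

```python
from urllib.parse import urlencode

SEARCH_MODES = {"AND", "OR", "PHRASE", "NEAR", "NOTNEAR"}
RESULTS_MODES = {"CONTEXT", "RELEVANCE"}


def build_query(find, documentset=None, searchmode="OR",
                resultsmode="RELEVANCE", maxhits=None, nearvalue=None):
    """Encode TR-WWW form keywords as a query string."""
    if searchmode.upper() not in SEARCH_MODES:
        raise ValueError("unknown search mode: %r" % searchmode)
    if resultsmode.upper() not in RESULTS_MODES:
        raise ValueError("unknown results mode: %r" % resultsmode)
    params = [("__FIND", find)]
    if documentset is not None:
        # Relative to the documents folder: 'f2', not 'Docs/f2'.
        params.append(("__DOCUMENTSET", documentset))
    params.append(("__searchmode", searchmode))
    params.append(("__resultsmode", resultsmode))
    if maxhits is not None:
        params.append(("__maxhits", str(maxhits)))
    if nearvalue is not None:
        params.append(("__nearvalue", str(nearvalue)))
    return urlencode(params)  # spaces become '+'


query = build_query("heart attack", documentset="f2", searchmode="NEAR",
                    resultsmode="context", maxhits=100, nearvalue=40)
url = "http://your.server/tr-www.cgi?" + query
print(url)
```

The same encoded string can serve as the body of a POST request, which the source says is the preferred method.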
Philosophy
TR uses a realtime search engine. This has several
important implications. Each document is searched in its entirety each time a
search request is received. Searching times are optimal when the files are held
on a locally (non network) mounted fast hard drive, connected to a fast (Quadra
or PowerPC) machine. Constantly changing documents are ideally suited to this
method. Small to medium sized document collections are ideally suited to this
method. Large, non changing collections with many regular accesses require an
indexed search engine - AppleSearch can do the job if you can afford it.
Known bugs
Spaces in URLs - this is not a bug, but a 'feature'
inherent in the www system - NO SPACES ARE ALLOWED IN A URL STRING!!!!! Take
note. This will cause you much wasted time. NO SPACES!!!! Use '%20' instead.
This problem commonly occurs when you use the open URL menu command from a client
directly. No '%', '~' or '/' characters are allowed in file or folder names.
TR converts all '+' characters to spaces (to undo the fact the clients convert
spaces into pluses). TR converts slashes to colons (to convert URLs to Macintosh
file paths which have colons in them). If you accidentally have one of these
characters in your path, TR will convert it, and you will get unexpected
results.
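The two conversions described above are easy to reproduce, which helps when working out why a particular file name misbehaves. This Python sketch simply mirrors the description ('+' to space, '/' to ':'). The function names are made up, and the real program may do more than this.

```python
def url_to_mac_path(url_path):
    """Apply the two conversions described in the text."""
    return url_path.replace("+", " ").replace("/", ":")


def risky_characters(name):
    """List characters in a file or folder name that will cause trouble.

    '%', '~' and '/' are named as forbidden in the text; '+' is included
    here because it would be converted to a space.
    """
    return [ch for ch in name if ch in "%~/+"]


print(url_to_mac_path("f2/annual+report"))
print(risky_characters("50%/off"))
```

A folder f2 containing a file called "annual report" is reached as f2:annual report. A file whose name really contains a '+' or '/' cannot be addressed, because the character is rewritten before the file is looked up.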
Nothing works
-->
Make sure you are using a CGI enabled copy of MacHTTP. This means version
1.3.1beta or later. If using the POST rather than GET methods, you must use
version 1.3.9 beta or later.
--> make sure your client is capable of dealing
with forms (latest versions of netscape, macweb or mosaic)
--> reread the
instructions carefully, particularly the section that says all files must be in
the Docs folder to be searchable
--> mail me with the exact details
(including your .config file, and how you have things set up).
How to search changing documents or folders?
--> For a given loaded document set, it does
not matter if the file contents are constantly changing, or if new information is
being appended to any of the files. Each file will be searched in the form it
exists at the time of the search request.
TR is not returning the documents (Unable to access requested document)
--> This happens when the file is not in the same directory tree as TR and
MacHTTP. For security reasons, tr-www will only serve files that exist in the
folder structure at its level or below (never higher). This is true of aliases
also - once an alias is resolved to its actual file, that file must obey the
above rule.
Additional matches in context list
--> What does "Number of additional
matches not shown above is xxx" mean at the bottom of the context list? When many
hundreds or thousands of matches are found, only the first 40 (by default - see
the .config file) are returned. This number is an arbitrary figure that I have
chosen to limit large amounts of network traffic. It gives the client some
indication of the result space, and subtly suggests that a different more
discriminatory search term may be appropriate.
Context results highlight
the entire word
--> When doing a search on the word 'the' for example,
anything that starts with the word 'the' is highlighted from the letters 'the' up
to the end of the word, eg 'thers' in 'mothers' rather than just 'the'. The reason
for this is speed related (the underlying philosophy of the engine as previously
mentioned is speed). The searching is done in two steps - first find all the
matches and their offsets, then extract and return the context lines. When
multiple search terms are given, this would require comparing each extracted
context line with each search term, and then highlighting only the correct part.
This wastes time (currently I simply highlight to the end of the word, which is
low in overhead), and would get confusing when search terms start with the same
letters, eg 'the' and 'there' with a match of 'therefore' should have just the
'the' or 'there' highlighted at its start?
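The highlight-to-end-of-word behaviour and the MATCHPOSITION and WIDTH settings can be pictured with a short Python sketch. It is a guess at the general shape of a keyword in context line, not the program's code. Square brackets stand in for whatever highlighting TR-WWW really emits, and the function name context_line is invented.

```python
def context_line(text, offset, matchposition=35, width=75):
    """Build one context line for the match that starts at offset."""
    # Highlight from the start of the match to the end of the word,
    # without checking which search term produced the match.
    end = offset
    while end < len(text) and text[end].isalnum():
        end += 1
    left = text[max(0, offset - matchposition):offset].replace("\n", " ")
    right = text[end:offset + (width - matchposition)].replace("\n", " ")
    # Right-align the left context so every match starts in the same column.
    return left.rjust(matchposition) + "[" + text[offset:end] + "]" + right


text = "Ask your mothers and fathers"
line = context_line(text, text.find("the"), matchposition=12, width=30)
print(line)
```

The match on 'the' inside 'mothers' is marked as mo[thers]. Finding the end of the word needs only a scan forward from the match offset, which is the low overhead the author refers to.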
Too many matches found
error
--> This happens when too many matches have been found, and TR has run
out of memory to store the information about them (usually after about 30000
matches). Either the client can try a more specific search, or you can increase
the memory allocated to it in the Finder (ie quit TR-WWW, select it in the
Finder, do Get Info, and change the Preferred Size to eg 1500k).
Why do I
get too many matches errors when doing boolean searches, and not otherwise?
--> For boolean searches, a separate search is done for every single word, and
the offset information for every single match in that file is held in memory.
These offsets are then compared (eg for a NEAR search, which ones are near to
each other), and the results are then returned. This means that an enormous
amount of memory can be required to hold all these offsets. Solution is to give
TR-WWW much more application memory, or try the search with less common words.
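The memory cost is easier to see in code. The Python sketch below follows the description above: gather every offset for every word, then compare the lists. It assumes nearness is measured between the starting offsets of the matches, which the text does not specify. The function names are hypothetical.

```python
import re


def offsets(text, word):
    """Offsets of every case-insensitive occurrence of word."""
    return [m.start() for m in re.finditer(re.escape(word), text, re.IGNORECASE)]


def near_matches(text, words, nearvalue=40):
    """Offsets of the first word that have every other word within
    nearvalue characters."""
    # One full offset list per word is held in memory at the same time.
    first, *rest = [offsets(text, w) for w in words]
    return [
        a for a in first
        if all(any(abs(a - b) <= nearvalue for b in other) for other in rest)
    ]


text = "The heart pumps blood. " + "x" * 100 + " A heart is a muscle."
hits = near_matches(text, ["heart", "blood"], nearvalue=40)
print(hits)
```

Only the first 'heart' is reported, since the second is more than 40 characters from any 'blood'. A very common word produces a very long offset list, which is why less common words ease the memory pressure.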
What are the numbers in the TR-WWW files list?
-->In the TR-WWW
application's window is a list of all the currently loaded files. As searches
are done the number of hits in each file will be shown to the right of the file
name.
Why can't the doc set folder be the parent folder of machttp and TR-WWW?
--> Because the document sets would then include all the config, log
and other confusing files also. You can make a copy of eg your Default.html file
and put it in a folder in the docs folder if you want it to be searchable.
Can TR-WWW live somewhere other than the machttp directory?
--> No. The
reason for this is that relative paths have to be returned to the client (rather
than the path beginning with the name of your hard disk). The only way to know
where to start the path from is to do it from where machttp has been run from.
The easiest way to do this is for TR-WWW to be run from the same folder, since it
can then work out where it is in the folder hierarchy, and chop the beginning of
the file path off up to where it itself lives (if that makes sense).
When does a context result have PRE tags around it?
--> If the file name ends in
.html, the extracted context is returned as is. For all other files, a PRE tag
is put around the extraction, thus causing text only files to display properly on
screen.
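In code the rule is a one-line test on the file name. This Python sketch assumes the comparison is an exact match on the '.html' ending, and the function name wrap_extract is invented.

```python
def wrap_extract(filename, extract):
    """Return the extract as is for .html files, inside PRE tags otherwise."""
    if filename.endswith(".html"):
        return extract
    return "<PRE>" + extract + "</PRE>"


print(wrap_extract("factbook.txt", "Area: 7,686,850 km2"))
print(wrap_extract("index.html", "<H1>Welcome</H1>"))
```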
For bug reports and any queries, please mail me at
adrian@medlan.med.monash.edu.au
Enjoy,
Adrian Vanzyl.
Version History
Beta release 1
- Stripped all non server code (user interface) out of TR to improve
network performance
- Less HandleEvents to improve performance
- Returns file sizes in relevance searching to discourage accidental download
of large files
Beta release 2
- Added .prompt and .config files
- Added booleans
- Added client control of some features
Beta release 3
- Fixed nasty bug when hundreds of small files are opened
- Improved speed 5%
Beta release 4 and 5
- Internal or limited release
Beta release 6
- Fixed memory bug when dealing with large files, many small files,
thousands of matches
- forms support
- document sets.
Beta release 7
- Improved context returns by adding PRE tags around extractions from non
.html files
- Number of characters returned (usually 3000 around match) can be
set in the .config file
Beta release 8
- Now searches html files correctly, and shows context lists correctly
Version 1
- PowerPC Native, fat binary for both platforms
- QUIT option in the config file to quit after each search
- ENTIREFILE option in the config file to dis/allow entire file downloads
for context matches
- Handles filenames and folder names with spaces correctly.
Version 1.1
- Fixed bug that resulted in 'click here to get entire file in its native
format' to not work properly
- Added sample files to distribution.
Version 1.2
- Added support for record structures in large files
- Added matchposition and width config options for controlling display of
context search results
- Put all text produced by tr-www into STR resources so customisation and
internationalisation are easier
- Fixed bug with multiple files where last file was never closed - this would
cause a system crash after several searches.
Version 1.3
- Fixed bug with loading enormous document sets. Very large document sets now
search correctly.
For Australian sites, price is in Australian dollars; for other sites, in US
dollars.
Educational - single license $50, site license $500.
Non Profit or Internal use - single license $50, site license $500.
For Profit/commercial use - single license $300, site license $3000.
For profit/commercial use is subject to an annual license renewal fee of 30% of
the original license cost. This includes all updates and upgrades.
Please download and print the order form, which
includes the disclaimer and conditions.
PLEASE give me feedback (bug reports, ideas etc) at
adrian.vanzyl@med.monash.edu.au.