Welcome to TR-WWW

Feedback to Adrian Vanzyl at adrian.vanzyl@med.monash.edu.au.

The most recent beta version (1.2.1beta) is available for downloading from here.

TR-WWW has been developed at Monash Medical Informatics (MMI).

Having problems? See the troubleshooting section.

Note - you MUST have the latest version of MacHTTP for TR-WWW to work (ie version 1.3 beta 8 or later). This can be retrieved from Chuck Shotton's MacHTTP home page. You must also be using the latest versions of MacWeb and Mosaic (and version 0.93 or later of Netscape).

What is TR and TR-WWW?

TR is Total Research, the complete research system for dealing with unstructured textual information, searching it, extracting information, generating reports and integrating this with a database. It has been developed by Chris Priestley, and is a commercial, fully supported product with a large user base. Contact chris@woody.apana.org.au for more details.

TR-WWW is a modified version of TR designed to work with MacHTTP. It has all the search strengths of the normal version of TR, but requires a web client to access its functionality and to act as its user interface. It has been written by Adrian Vanzyl at MMI. It has been released as shareware, and is distributed through the Internet. See the payment and order forms at the end of this document.

Special features

-- Fat Binary

TR-WWW runs in native mode on both the PowerPC and 68K based Macintoshes.

-- No Preindexing

TR uses a unique algorithm to provide for rapid real time searching of text containing documents. This is ideal for rapidly changing document sets. Depending on the processor power available, it provides acceptable performance for individual document sets up to around the 15 megabyte size. By breaking large document collections into sets of less than 15M (and allowing the user to limit the search to the selected document set), the actual document collection can be any arbitrary size.

-- Provides context or relevance finding

Search results can be returned either using relevance ranking (similar to the way that WAIS returns matches), or via a keyword in context system. The latter is most appropriate for searching when you want more information about the matches that were found before retrieving the whole document, or for large documents (eg the CIA world factbook), where it is not necessary or desirable to retrieve the entire document as a single file.

-- Record Delimiters

Record delimiters may be defined so that searching a large file of archived mail messages for example will return individual messages that match, extracted from the larger document. This is ideal for searching digests, exported database records and other similar documents that have some record structure.
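
As a rough illustration of the record delimiter idea (not TR's actual code), the Python sketch below starts a new record at every line that begins with a delimiter string and returns only the records containing the search term. The function names and the sample mailbox are hypothetical, and the sketch ignores the MARGINS size limit described in the .config file.

```python
def split_records(text, delimiters):
    """Split text into records; a new record starts at any line that
    begins with one of the delimiter strings."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if current and any(line.startswith(d) for d in delimiters):
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records


def matching_records(text, delimiters, term):
    """Return only the records that contain the term (ignoring case)."""
    term = term.lower()
    return [r for r in split_records(text, delimiters) if term in r.lower()]


# Hypothetical archive of two mail messages, each starting with 'From '.
mailbox = (
    "From alice\nSubject: lunch\nSee you at noon.\n"
    "From bob\nSubject: search engines\nTR-WWW needs no index.\n"
)
hits = matching_records(mailbox, ["From "], "index")
```

With no delimiters given, the whole text is treated as a single record.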

-- Low maintenance

Simply add files to the document sets folder to make them instantly available for searching.

-- Dynamic searching

If one of the documents in the loaded document set is constantly changing (eg a newsfeed that is being appended to a file), the search when performed searches the document as it exists at the instant of the search, ie it always searches the latest version of the document. It also automatically searches any new documents that appear in a folder.

-- Reads any text containing documents

Will correctly search text, Word and HTML documents.

-- Boolean searches

Supports OR, AND, NEAR, NOTNEAR and PHRASE searches.


Setting up

tr-www.cgi is the TR-WWW application program.
tr-www.prompt is the file returned when the TR-WWW application is first accessed. It defines the form interface that is used to access TR-WWW.
tr-www.config is the configuration file.
(These can be renamed to anything as long as the three file extensions match up as above).
Docs folder is the name of the folder that holds all available document sets (this can be changed in the .config file).
You can also create multiple custom forms of your own based on the .prompt file, all having their submit action linked to the tr-www.cgi file.

Place these files in the same folder as MacHTTP (this is critically important - the tr-www.cgi program must be in the same folder as machttp).
Place some files, folders or aliases in the Docs folder (aliases can only be to other files that exist in the Docs folder or its subfolders).
Run TR-WWW.
From your web client, which must be capable of viewing forms (eg Mosaic 2 or Netscape), open the url

http://your.server/tr-www.cgi

This will activate TR-WWW, and since there is no search string, it will return its introductory message (which is the .prompt file), with a pop up list of available document sets appended at the end. TR will be correctly launched and set up the default document set as specified in the .config file if it is not already running when a client requests a search.

Searching

After opening the above URL, you can type in a search term (or multiple terms separated by spaces), then click on the search button. The default is to treat the words as if joined by a boolean OR (override this with the popup boolean list).

Note - if directly sending a search string with more than one word from the open url menu, place plus '+' signs or '%20' between each word, NOT SPACES!!


Configuring

You can have as many different copies of TR running as you wish, each with a different default set of documents. Each can have its own default message: give the message file the same name as the TR application it belongs to, with .prompt in place of .cgi. For example, if you have a copy of TR called TRDictionary.cgi then its default reply file should be called TRDictionary.prompt (and likewise for its .config file). It can also have its own document sets folder, eg TRDictionarySets (simply specify this in its .config file).

In general however, you only need to run one copy of TR, since by using custom forms you can have different interfaces, specifying different document sets and search modes.

.prompt file

This is the default file returned when the TR-WWW application is accessed without any search parameters, eg http://your.server/tr-www.cgi

It must be an html containing file, with the following provisos:

Customise this file for your own site, change the form parameters, and use inline images as you wish.

TR usually lists the currently loaded document set in a pop up list after this prompt message. You can override this in the .config file.

.config file

This sets up the defaults for TR-WWW. A client can override many of these for a particular search. They revert to the defaults specified in the .config file when that search finishes (ie a client can only make a temporary change for its own search, which does not affect other clients' searches).

See the .config file.

Here is a sample copy of a .config file.

# This is the TR-WWW configuration file
# This file must have the same name as the TR application to which it applies,
#    with .config appended, eg tr-www.config for application tr-www.cgi
# NOTE - these are all defaults that will be used unless the client overrides it
# To change the strings returned by TR-WWW to clients (to customise them or to
# change to a non english language), see the manual.
###################################
# adrian vanzyl, adrian.vanzyl@med.monash.edu.au
# http://informatics.med.monash.edu.au
###################################

# To have TR return the currently available document set after listing the
# .prompt file, set to TRUE
SHOWDOCUMENTSET TRUE

# To return context lists, set to CONTEXT, or for relevance finds, set to RELEVANCE
# This would usually be overridden by the client's form
RESULTSMODE RELEVANCE

# The number of hits to return when doing a context find, around 40 is a good number
# This would usually be overridden by the client's form
MAXHITS 50

# Set to TRUE to disallow clients from being able to specifiy a $path set of files
# Use this when a preloaded document set is provided, and you don't want
# clients to override it
IGNOREPATH FALSE

# Set to PHRASE, AND, OR, NEAR, NOTNEAR to specify the default search mode
SEARCHMODE OR

# Default number of chars between words when doing a NEAR or NOTNEAR search
NEARVALUE 40

# Specify the name of the default document set, can be a filename or foldername
# remember that documents have to be at or below machttp in the hierarchy
# folders and aliases are OK
DOCUMENTSET Docs

# Quit set to true forces tr-www to quit after completing each request
# Leave it set to false to avoid wasted time in reloading tr-www for each search
QUIT FALSE

# EntireFile set to true allows the user to retrieve the entire file in its native
# format when viewing an extracted chunk from a context search
# if set to false, disallows this by not showing the link to the whole file
# use FALSE when you don't want people to accidentally download large files
ENTIREFILE TRUE

# Default number of chars to return when a context line has been selected.
# For small files, where you want the entire file to be returned, make this a large nr,
# eg 999999.  Suggested value is 1500 chars either side of match
# This margin is used when:
#  -  no delimiters are found or specified (see next section)
#  -  it also determines the maximum number of chars either that may be returned if
#     delimiters are used.
# Thus - if you are using delimiters, set MARGINS to the max size of each record (eg
#  several kilobytes).  And if not using delimiters, set MARGINS to a couple of hundred bytes
MARGINS 16000

# Delimiters specify how a large document with context matches is broken down into
# smaller segments, eg for a large file full of mail messages, each 'record' is anything
# that begins with 'From: '.  Other delimiters may be eg a line of dashes, or two carriage
# returns etc.  You may have up to 8 delimiter strings.  Just remember that each
# one is used in the order given
# The actual delimiter string is the text starting after the word and space 'DELIMITER ',
# up to but not including the terminating carriage return
# This string must always occur at the start of a line
# REPEAT - the string MUST OCCUR AT THE START OF A LINE!!
# As long as the line begins with the delimiter string, it is considered a match
# For small documents or those without delimiters, leave the following lines commented
#DELIMITER -----------------------
#DELIMITER ********************
# For my mail files I use MARGINS of 16000 and a Delimiter of 'From '
#DELIMITER From 
# For digest files I use MARGINS 16000 and a Delimiter of 'Topic No. '
DELIMITER Topic No. 

# The next two options specify how context lines are displayed
# MATCHPOSITION specifies the position of the match from the left margin
# suggested position is 35
# WIDTH specifies the total width of each context line
# suggested width is 75 characters
MATCHPOSITION 35
WIDTH 75
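
To make the layout of the sample concrete, here is a minimal Python sketch of how a file in this format could be read. It assumes each setting sits on its own line as a keyword, a space, then the value, that lines starting with '#' are comments, and that DELIMITER may appear several times. The function name parse_config is hypothetical and this is not TR-WWW's own parser.

```python
def parse_config(text):
    """Read 'KEYWORD value' lines into a dict, skipping blank lines and
    '#' comments. DELIMITER may repeat, so its values are kept in a list
    in the order given."""
    settings = {"DELIMITER": []}
    for raw in text.splitlines():
        line = raw.lstrip()
        if not line or line.startswith("#"):
            continue
        keyword, _, value = line.partition(" ")
        keyword = keyword.upper()
        if keyword == "DELIMITER":
            # Keep the value untouched: a trailing space is part of the string.
            settings["DELIMITER"].append(value)
        else:
            settings[keyword] = value.strip()
    return settings


# A few lines in the style of the sample file.
sample = "\n".join([
    "# The number of hits to return when doing a context find",
    "MAXHITS 50",
    "SEARCHMODE OR",
    "#DELIMITER -----------------------",
    "DELIMITER Topic No. ",
])
config = parse_config(sample)
```

Values are left as strings here; a real reader would still need to convert numbers and TRUE/FALSE flags and enforce the limit of 8 delimiters.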




Customising the text returned by TR-WWW

To fully customise the text returned by TR-WWW (eg to change it to a foreign language), you need to edit two things:
the .prompt file (this is the easy part)
and the string (STR) resources in the application (this is not too difficult). Use ResEdit on a copy of the file. All the STR resources will need to be edited. Their names indicate what they should say. Have a look at the TR-WWW response screens using a web client, and see how each string is put together to get the final result. I would be very interested in distributing internationalised versions of TR-WWW. If you convert it, please send me a copy of the application, and I will incorporate the string resources into the next version, with a flag for which language you wish to run it under.


Adding documents

All searchable documents have to be in a documents set folder. The default folder name is Docs, but this can be changed in the .config file.

As an example, you can create multiple folders (or files, or aliases of files or folders) in the Docs folder, each dealing with a specific category of information. When a client chooses that document set from the pop up documents sets list, all the files (including all files in folders and subfolders etc) in the folder will be searched.

Clients can run searches on any document or folder by specifying its path in the __DOCUMENTSET forms parameter. All TR does whenever it gets the __DOCUMENTSET parameter is to load all the files within it, and search them for the __FIND term. You can thus set up your own pop up list of document sets, without relying on the one created by TR (just change the .config file to not return the document set). IMPORTANT - see the note below on the __DOCUMENTSET parameter. You can ONLY specify files that are within the Docs folder.

Aliases are acceptable, provided that the following rule is adhered to.

GOLDEN RULE - All files to be searched (and resolved aliases of such files) must be in the Docs folder or below, or clients will get 'Unable to access' errors (this is for security and simplicity reasons).
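
The golden rule amounts to a containment check on the resolved path. The Python sketch below shows one way such a check can be written. It is only an illustration under assumptions: it uses slash-separated paths and symbolic links as a stand-in for Macintosh aliases, and the function name is_inside_docs is hypothetical.

```python
import os


def is_inside_docs(docs_root, requested):
    """True when the requested path, after resolving any links, is the
    documents folder itself or something below it."""
    root = os.path.realpath(docs_root)
    target = os.path.realpath(os.path.join(root, requested))
    return os.path.commonpath([root, target]) == root


# A file in a subfolder passes; a path that climbs out of the folder does not.
allowed = is_inside_docs("/srv/Docs", "f2/notes.txt")
refused = is_inside_docs("/srv/Docs", "../machttp.log")
```

Resolving the link before comparing is what makes an alias pointing outside the folder fail the check, which matches the 'Unable to access' behaviour described above.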


Setting Options from the Client

Standard forms controls provide access to all the features of TR.

Most of these are given default values in the .config file.

PLEASE see the .prompt file for details.

The keywords are:

__FIND=find+terms

This specifies the words to be searched. Clients can type in spaces (which are converted to pluses), but if you create a URL directly, don't use spaces! The booleans (below) specify the relationship between the first and subsequent words.

__DOCUMENTSET=File_or_folder

This sets the document set to be searched. It can be:
A single file, a folder name, or an alias.
NOTE - this file or folder name must be that of a file or folder in your documents directory, and its path is specified without the documentset folder preceding it, eg
- folder f1, f2 and f3 are in my default Docs directory
- I want to search only the f2 folder
- use __DOCUMENTSET=f2, and NOT __DOCUMENTSET=Docs/f2
Again:
- all files and folders to be searched must be in your documents folder (default 'Docs')
- do not put the documents folder name in front of document names when using __DOCUMENTSET=
TR-WWW usually constructs this list for you automatically by listing all the files/folders in the specified document sets folder.

__searchmode=type

This sets the search mode. Type can be:
AND - all words in the following search string must occur in a file
OR - any word in the following search string may occur in a file
PHRASE - the exact search string must occur in a file as a phrase (word for word, including punctuation but ignoring case)
NEAR - each word in the search string must occur in a file near the first word in the search string. The nearness can also be specified (see below).
NOTNEAR - the opposite of NEAR.

__resultsmode=mode

Mode can be either CONTEXT to return results as a context list, or RELEVANCE to return only a relevance list.

eg __resultsmode=context

__maxhits=number

Specifies the number of hits (result lines) returned by a context find.

eg __maxhits=100

__nearvalue=number

For NEAR or NOTNEAR searches, specifies the number of characters within which each match must/must not occur.

eg __nearvalue=40
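
When building a search URL by hand rather than through a form, the keywords above have to be joined and encoded correctly. The Python sketch below shows one way to do that. It assumes TR-WWW accepts the keywords as ordinary GET form pairs after a '?', as a forms-capable client would send them; the function name build_search_url is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base, find, documentset=None, searchmode=None,
                     resultsmode=None, maxhits=None, nearvalue=None):
    """Join the search keywords into a query string. urlencode turns
    spaces into '+', so the finished URL contains no spaces."""
    params = [("__FIND", find)]
    optional = [
        ("__DOCUMENTSET", documentset),
        ("__searchmode", searchmode),
        ("__resultsmode", resultsmode),
        ("__maxhits", maxhits),
        ("__nearvalue", nearvalue),
    ]
    # Leave out anything not given so the .config defaults apply.
    params += [(key, value) for key, value in optional if value is not None]
    return base + "?" + urlencode(params)


url = build_search_url("http://your.server/tr-www.cgi", "heart attack",
                       documentset="f2", searchmode="NEAR", nearvalue=40)
# url == "http://your.server/tr-www.cgi?__FIND=heart+attack&__DOCUMENTSET=f2&__searchmode=NEAR&__nearvalue=40"
```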


Philosophy

TR uses a realtime search engine. This has several important implications.

Each document is searched in its entirety each time a search request is received. Searching times are optimal when the files are held on a locally (non network) mounted fast hard drive, connected to a fast (Quadra or PowerPC) machine. Constantly changing documents are ideally suited to this method. Small to medium sized document collections are ideally suited to this method. Large, non changing collections with many regular accesses require an indexed search engine - AppleSearch can do the job if you can afford it.


Known bugs

Spaces in URLs - this is not a bug, but a 'feature' inherent in the www system - NO SPACES ARE ALLOWED IN A URL STRING!!!!! Take note. This will cause you much wasted time. NO SPACES!!!! Use '%20' instead. This problem commonly occurs when you use the open URL menu command from a client directly.

No '%', '~' or '/' characters are allowed in file or folder names. TR converts all '+' characters to spaces (to undo the fact the clients convert spaces into pluses). TR converts slashes to colons (to convert URLs to Macintosh file paths which have colons in them). If you accidentally have one of these characters in your path, TR will convert it, and you will get unexpected results.
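
The two conversions just described are easy to picture in code. The following Python sketch applies them to a hypothetical document path, and shows why a slash inside a folder name gives unexpected results. The function name is an assumption, and the sketch does not cover '%20' decoding or anything else TR may do.

```python
def url_path_to_mac_path(url_path):
    """Apply the two conversions described in the text:
    '+' becomes a space and '/' becomes ':'."""
    return url_path.replace("+", " ").replace("/", ":")


# A path as a client would send it, and the Macintosh path it turns into.
mac_path = url_path_to_mac_path("f2/mail+archive/1995.txt")
# mac_path == "f2:mail archive:1995.txt"

# A folder actually named 'in/out' is split into two levels by the conversion.
broken = url_path_to_mac_path("in/out/notes.txt")
# broken == "in:out:notes.txt"
```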


Troubleshooting

Nothing works

--> Make sure you are using a CGI enabled copy of MacHTTP. This means version 1.3.1beta or later. If using the POST rather than GET methods, you must use version 1.3.9 beta or later.
--> make sure your client is capable of dealing with forms (latest versions of netscape, macweb or mosaic)
--> reread the instructions carefully, particularly the section that says all files must be in the Docs folder to be searchable
--> mail me with the exact details (including your .config file, and how you have things set up).

How to search changing documents or folders?

--> For a given loaded document set, it does not matter if the file contents are constantly changing, or if new information is being appended to any of the files. Each file will be searched in the form it exists at the time of the search request.

TR is not returning the documents (Unable to access requested document)

--> This happens when the file is not in the same directory tree as TR and MacHTTP. For security reasons, tr-www will only serve files that exist in the folder structure at its level or below (never higher). This is true of aliases also - once an alias is resolved to its actual file, that file must obey the above rule.

Additional matches in context list

--> What does "Number of additional matches not shown above is xxx" mean at the bottom of the context list? When many hundreds or thousands of matches are found, only the first 40 (by default - see the .config file) are returned. This number is an arbitrary figure that I have chosen to limit large amounts of network traffic. It gives the client some indication of the result space, and subtly suggests that a different more discriminatory search term may be appropriate.

Context results highlight the entire word

--> When doing a search on the word 'the' for example, any match is highlighted from the letters 'the' up to the end of the word containing it, eg in 'mothers' the letters 'thers' are highlighted rather than just 'the'. The reason for this is speed related (the underlying philosophy of the engine as previously mentioned is speed). The searching is done in two steps - first find all the matches and their offsets, then extract and return the context lines. When multiple search terms are given, highlighting only the matched letters would require comparing each extracted context line with each search term, and then highlighting only the correct part. This wastes time (currently I simply highlight to the end of the word, which is low in overhead), and would get confusing when search terms start with the same letters: with the terms 'the' and 'there' and a match of 'therefore', should 'the' or 'there' be highlighted at its start?
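
The two steps can be sketched in Python as follows. This is an illustration of the approach described, not TR's code: the function names are hypothetical, the use of B tags for highlighting is an assumption, and the default values are taken from the MATCHPOSITION and WIDTH settings in the sample .config file.

```python
import re


def find_offsets(text, term):
    """Step 1: offsets of every occurrence of term, ignoring case."""
    return [m.start() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]


def context_line(text, offset, match_position=35, width=75):
    """Step 2: build one context line with the match starting
    match_position characters from the left, highlighted from the
    match offset up to the end of the word."""
    end = offset
    while end < len(text) and text[end].isalnum():
        end += 1
    left = text[max(0, offset - match_position):offset].replace("\n", " ")
    right = text[end:offset + (width - match_position)].replace("\n", " ")
    return left.rjust(match_position) + "<b>" + text[offset:end] + "</b>" + right


text = "All mothers know the answer."
offsets = find_offsets(text, "the")   # one hit inside 'mothers', one for 'the'
line = context_line(text, offsets[0])
```

Because step 2 only needs the offset, it never has to look at the search terms again, which is the saving the answer above refers to.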

Too many matches found error

--> This happens when too many matches have been found, and TR has run out of memory to store the information about them (usually after about 30000 matches). Either the client can try a more specific search, or you can increase the memory allocated to it in the Finder (ie quit TR-WWW, select it in the Finder, do Get Info, and change the Preferred Size to eg 1500k).

Why do I get too many matches errors when doing boolean searches, and not otherwise?

--> For boolean searches, a separate search is done for every single word, and the offset information for every single match in that file is held in memory. These offsets are then compared (eg for a NEAR search, which ones are near to each other), and the results are then returned. This means that an enormous amount of memory can be required to hold all these offsets. Solution is to give TR-WWW much more application memory, or try the search with less common words.
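
To see why the offsets add up, here is a small Python sketch of the approach described: every word is searched separately, all offsets are kept, and the lists are then compared. It covers only AND, OR and NEAR, measures nearness between the starting offsets of the words (an assumption), and uses hypothetical function names. It is not TR's implementation.

```python
import re


def word_offsets(text, word):
    """Offsets of every occurrence of one word, ignoring case."""
    return [m.start() for m in re.finditer(re.escape(word), text, re.IGNORECASE)]


def near_matches(text, words, nearvalue=40):
    """Offsets of the first word that have every other word starting
    within nearvalue characters of them."""
    first, others = words[0], words[1:]
    other_offsets = [word_offsets(text, w) for w in others]  # all held in memory
    return [
        pos for pos in word_offsets(text, first)
        if all(any(abs(pos - o) <= nearvalue for o in offs) for offs in other_offsets)
    ]


def file_matches(text, words, mode="OR", nearvalue=40):
    """Decide whether a file matches under the AND, OR or NEAR mode."""
    if mode == "NEAR":
        return bool(near_matches(text, words, nearvalue))
    found = [bool(word_offsets(text, w)) for w in words]
    return all(found) if mode == "AND" else any(found)


doc = "The heart pumps blood. A heart attack is serious."
close = near_matches(doc, ["heart", "attack"], nearvalue=10)
```

A common word such as 'the' in a large document set produces one stored offset per occurrence, so the lists grow quickly.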

What are the numbers in the TR-WWW files list?

-->In the TR-WWW application's window is a list of all the currently loaded files. As searches are done the number of hits in each file will be shown to the right of the file name.

Why can't the doc set folder be the parent folder of machttp and TR-WWW?

--> Because the document sets would then include all the config, log and other confusing files also. You can make a copy of eg your Default.html file and put it in a folder in the docs folder if you want it to be searchable.

Can TR-WWW live somewhere other than the machttp directory?

--> No. The reason for this is that relative paths have to be returned to the client (rather than the path beginning with the name of your hard disk). The only way to know where to start the path from is to do it from where machttp has been run from. The easiest way to do this is for TR-WWW to be run from the same folder, since it can then work out where it is in the folder hierarchy, and chop the beginning of the file path off up to where it itself lives (if that makes sense).

When does a context result have PRE tags around it?

--> If the file name ends in .html, the extracted context is returned as is. For all other files, a PRE tag is put around the extraction, thus causing text only files to display properly on screen.
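
In code terms the rule is a simple test on the file name, sketched here in Python. The function name wrap_context is hypothetical, and the sketch does only what the answer above states.

```python
def wrap_context(file_name, extracted):
    """Return HTML extractions unchanged; wrap everything else in PRE
    tags so plain text keeps its line breaks and spacing on screen."""
    if file_name.endswith(".html"):
        return extracted
    return "<PRE>" + extracted + "</PRE>"


html_result = wrap_context("page.html", "<p>hi</p>")
text_result = wrap_context("notes.txt", "line one\nline two")
```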




For bug reports and any queries, please mail me at

adrian@medlan.med.monash.edu.au

Enjoy,

Adrian Vanzyl.


Version History

Beta release 1
Beta release 2
Beta release 3
Beta release 4 and 5
Beta release 6
Beta release 7
Beta release 8
Version 1
Version 1.1
Version 1.2
Version 1.3

Licence costs:

For Australian sites, price is in Australian dollars, for other sites, in US dollars.

Educational - single license $50, site license $500.

Non Profit or Internal use - single license $50, site license $500.

For Profit/commercial use - single license - $300, site license $3000.

For profit/commercial use is subject to an annual license renewal fee of 30% of the original licence cost. This includes all updates and upgrades.

Please download and print the order form, which includes the disclaimer and conditions.


PLEASE give me feedback (bug reports, ideas etc) at adrian.vanzyl@med.monash.edu.au.