


extract(1)		  User Commands		       extract(1)



NAME
     extract - SWISH++ text extractor

SYNOPSIS
     extract [ options ] directory... file...

DESCRIPTION
     extract is	the SWISH++ text extractor, a utility to  extract
     what  text	 there is from a (mostly) binary file (similar to
     the strings(1) command) prior to indexing.	  Original  files
     are untouched.

     Text is extracted from the	specified files	and files in  the
     specified directories; text from files in subdirectories of
     specified directories is also extracted.  Text is extracted
     from files	only if	their filename extension is among the set
     specified with the	-e option.
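
     The selection rule above can be sketched in Python.  This is
     only an illustration, not the actual implementation: the
     function name candidate_files is hypothetical, matching is
     assumed to be case-sensitive, and compressed files are not
     considered here.

```python
import os

def candidate_files(paths, extensions):
    """Yield files under `paths` whose extension is in `extensions`.

    `extensions` holds extensions without the dot, e.g. ["doc", "xls"].
    (Hypothetical helper; case-sensitive matching is an assumption.)
    """
    wanted = set(extensions)

    def matches(name):
        ext = os.path.splitext(name)[1]  # ".doc", or "" if there is none
        return ext[1:] in wanted

    for path in paths:
        if os.path.isdir(path):
            # os.walk descends into subdirectories and, by default,
            # does not follow symbolic links to directories.
            for dirpath, _dirnames, filenames in os.walk(path):
                for name in sorted(filenames):
                    if matches(name):
                        yield os.path.join(dirpath, name)
        elif matches(os.path.basename(path)):
            yield path
```

     For example, list(candidate_files(["/home/www/htdocs"],
     ["doc", "ppt", "xls"])) would list the Microsoft Office files
     under that directory.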

     Extracted files have the same  filename  with  the	 ``.txt''
     extension	   appended,	 e.g.,	   ``foo.doc''	  becomes
     ``foo.doc.txt'' after extraction.	 However,  extraction  is
     not performed if the extracted text file exists.

  File Compression
     Files compressed with either compress(1) or gzip(1) are
     uncompressed before extraction and recompressed afterward.
     The original filename extension must be present, however,
     e.g., ``file.doc.gz'' for a compressed Microsoft Word file.
     Text extracted from a compressed file has the compression
     extension replaced by the ``.txt'' extension, e.g.,
     ``foo.doc.gz'' becomes ``foo.doc.txt'' after extraction.
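
     The naming rules for extracted text files can be summarized
     in a short Python sketch.  The names extracted_name and
     needs_extraction are hypothetical, and the suffix list
     (``.gz'' for gzip(1), ``.Z'' for compress(1)) is an
     assumption about which compression extensions are
     recognized.

```python
import os

# Assumed compression suffixes: gzip(1) writes ".gz", compress(1) writes ".Z".
COMPRESSION_SUFFIXES = (".gz", ".Z")

def extracted_name(filename):
    """Return the name of the text file extracted from `filename`."""
    for suffix in COMPRESSION_SUFFIXES:
        if filename.endswith(suffix):
            # The compression extension is replaced, not kept.
            filename = filename[:-len(suffix)]
            break
    return filename + ".txt"

def needs_extraction(filename):
    """Extraction is skipped when the extracted text file already exists."""
    return not os.path.exists(extracted_name(filename))
```

     Under these assumptions, both ``foo.doc'' and ``foo.doc.gz''
     map to ``foo.doc.txt''.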

  Word Determination
     extract performs the same character entity conversions and
     word determination heuristics as index(1) and additionally:

     1.	 Considers all PostScript Level	2 operators that are  not
	 also  English	words  to be stop words.  Such words in	a
         file usually indicate an encapsulated PostScript (EPS)
         file and, as such, should not ultimately be indexed.

     2.  Looks specifically for encapsulated PostScript (EPS)
         data, that is, everything between one of %%BeginSetup,
         %%BoundingBox, %%Creator, %%EndComments, or %%Title and
         %%Trailer, and discards it.

     3.	 Discards strings of  ASCII  hex  data	Word_Hex_Min_Size
	 characters  or	 longer, e.g., ``7F454C46.''  (Default is
	 5.)
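
     Heuristics 2 and 3 above can be illustrated with a minimal
     Python sketch.  It is not the actual implementation: the
     function names are hypothetical, the EPS filter is assumed
     to work line by line, and the hex test is a plain pattern
     match.

```python
import re

WORD_HEX_MIN_SIZE = 5  # default stated above

# Comments that open an EPS region, and the one that closes it.
EPS_START = ("%%BeginSetup", "%%BoundingBox", "%%Creator",
             "%%EndComments", "%%Title")
EPS_END = "%%Trailer"

def is_hex_data(word, min_size=WORD_HEX_MIN_SIZE):
    """True if `word` is a run of ASCII hex digits of at least `min_size`."""
    return (len(word) >= min_size
            and re.fullmatch(r"[0-9A-Fa-f]+", word) is not None)

def strip_eps(lines):
    """Drop every line from an EPS start comment through %%Trailer.

    Assumes line-oriented input; an unterminated region is dropped
    through the end of the input.
    """
    kept = []
    in_eps = False
    for line in lines:
        if not in_eps and line.startswith(EPS_START):
            in_eps = True          # start comment itself is discarded
        elif in_eps and line.startswith(EPS_END):
            in_eps = False         # trailer itself is discarded
        elif not in_eps:
            kept.append(line)
    return kept
```

     Note that a pure pattern match like this one also flags
     English words made up only of the letters a-f, such as
     ``decade''; a real implementation may need to treat those
     differently.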




SWISH++		 Last change: February 27, 1998			1









  Motivation
     extract was developed to be able to index non-text	files  in
     proprietary  formats  such	 as  Microsoft	Office documents.
     There are a couple	 of  reasons  why  the	functionality  of
     extract isn't simply built	into index(1):

     1.	 Users who do not need to index	such documents	shouldn't
         have to pay the performance penalty for doing the extra
	 checks	for PostScript and hex data.

     2.	 If files are compressed on a server,  uncompressing  and
	 recompressing	them  every time indexing is performed is
	 excessive.  Text extraction, on the other hand, is  done
	 only  once  per  file;	if the file is updated,	the text-
	 extracted version should be deleted and recreated.

     A good way	to perform extraction is via a cron job.
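
     Because extraction is skipped when the extracted text file
     already exists, a cron job may want to delete stale text
     files before running extract.  The following Python sketch
     shows one way to do that; the function name
     remove_stale_text is hypothetical, and comparing
     modification times is an assumption about how "updated" is
     detected.

```python
import os

def remove_stale_text(source, text):
    """Delete `text` if `source` was modified more recently.

    Returns True if the text file was deleted, False otherwise.
    (Hypothetical helper; uses modification times as the staleness test.)
    """
    if os.path.exists(text) and os.path.getmtime(source) > os.path.getmtime(text):
        os.remove(text)
        return True
    return False
```

     Running such a cleanup first lets the next extract run
     recreate the text for any file that has changed.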

OPTIONS
     -eextension   A filename extension, without the leading dot,
                   of files to extract text from.  Multiple -e
                   options may be specified.

     -l		   Follow symbolic links during	extraction.   The
		   default is not to follow them.

     -vverbosity   The verbosity level,	0-3:

		   0   No  output  is	generated   (except   for
		       errors).
		   1   Only run	statistics (elapsed time,  number
		       of files, word count) are printed.
		   2   Directories  are	 printed  as   extraction
		       progresses.
		   3   Directories and files are printed  with	a
		       word-count for each file.

     -V		   Print the version number of SWISH++ and exit.

EXAMPLE
     To	extract	text from all Microsoft	Office	files  on  a  web
     server:

	  extract -v3 -e doc -e	ppt -e xls /home/www/htdocs


EXIT STATUS
     Exits with a value of zero only if extraction completed
     successfully; non-zero otherwise.
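
     A calling script can rely on this convention.  The Python
     sketch below is one hedged way to do so; the function name
     extraction_succeeded is hypothetical, and the command is
     passed in as an argument list rather than hard-coded.

```python
import subprocess

def extraction_succeeded(command):
    """Run `command` (an argument list) and report whether it exited with zero."""
    return subprocess.run(command).returncode == 0
```

     For instance, extraction_succeeded(["extract", "-v0", "-e",
     "doc", "/home/www/htdocs"]) would be true only when
     extraction completed successfully.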

CAVEATS
     Text extraction is not perfect, nor can it be.






SEE ALSO
     compress(1),  crontab(1),	gunzip(1),   gzip(1),	index(1),
     search(1),	strings(1), uncompress(1)

     Adobe Systems Incorporated.  ``PostScript Language Reference
     Manual, 2nd ed.''  Addison-Wesley, Reading, MA.  pp. 346-359.

AUTHOR
     Paul J. Lucas <pjl@best.com>
















































