Advanced Configuration
Using external data supply
What's meant by external Datasupply?
External datasupply was introduced in version 0.927. To build the index, the indexer needs an algorithm that finds the files to be indexed. The file-finding algorithm integrated in TSEP reads each (sub)directory, beginning at the given starting directory, and collects the filenames found there.
In addition to this integrated algorithm, TSEP can build the index for files whose filenames (URLs) are supplied from outside TSEP (e.g. a simple file (URL) list, or a file list returned by a crawler/spider process).
Also see:
How to use external Datasupply:
How to send data to TSEP
The external datasupply has to be a PHP script that communicates with TSEP in the following way:
On the TSEP admin page "build-new-index", the (fully qualified) name of this PHP script has to be given.
URLs and messages are returned to TSEP via
call_user_func("TSEP_ExternalCallBack", $returnstring);
$returnstring has to be one of the following:
- "URL>an_url"
- an_url will be indexed by TSEP
- "ERR>an_errormessage"
- an_errormessage will be echoed to the browser as an error
- "INF>an_infomessage"
- an_infomessage will be echoed to the browser as information
- "ALL>an_url<tsepcontent>content_of_the_file"
- an_url will be indexed by TSEP, but TSEP does not read the file. The file-content is taken from content_of_the_file.
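The four message types above can be sketched in a minimal external datasupply script. This is an illustration under assumptions, not one of the scripts shipped with TSEP: the helper tsep_send(), the URLs contact.php and virtual.php, and the stub callback are hypothetical. The stub exists only so the sketch also runs outside TSEP; inside TSEP the indexer provides TSEP_ExternalCallBack itself.

```php
<?php
// Stand-in for standalone runs only (assumption): inside TSEP the indexer
// defines TSEP_ExternalCallBack. This stub just records what was sent.
if (!function_exists('TSEP_ExternalCallBack')) {
    $GLOBALS['sent'] = array();
    function TSEP_ExternalCallBack($returnstring) {
        $GLOBALS['sent'][] = $returnstring;
    }
}

// Hypothetical helper: build "PREFIX>payload" and hand it to TSEP.
function tsep_send($prefix, $payload) {
    call_user_func('TSEP_ExternalCallBack', $prefix . '>' . $payload);
}

// Hypothetical list of pages to be indexed.
$urls = array(
    'http://www.mydomain.de/index.php',
    'http://www.mydomain.de/contact.php',
);

// INF> is echoed to the browser as information.
tsep_send('INF', 'Supplying ' . count($urls) . ' URLs');

// URL> asks TSEP to read and index the page itself.
foreach ($urls as $url) {
    tsep_send('URL', $url);
}

// ALL> supplies the content directly, so TSEP does not read the file.
tsep_send(
    'ALL',
    'http://www.mydomain.de/virtual.php' . '<tsepcontent>'
        . 'Text that TSEP should index for this URL.'
);
```

An ERR> message would be sent the same way, for example tsep_send('ERR', 'Could not open the URL list').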
The admin/examples directory contains examples of how this external datasupply feature can be used. For your first tests, try "examples/urllist.php" as external datasupply (it is a very simple datasupply and makes it easy to understand how this feature works).
examples/phpcrawl4tsep.php is a working sample script that communicates with an installed PHPCrawl. Just place phpcrawl4tsep.php in the same directory as PHPCrawl (where PHPCrawl's example.php is located) and call it from the TSEP indexer.
Attention:
- Do not forget to enter the page where the crawling process should start in the text field labeled "Enter parameter to be sent to datasupply-script" on the "Create new index" page!
- Also make sure your page is displayed correctly in your browser, otherwise the crawler will have problems following links. Example: We encountered a problem in a PHP file which required another file that could not be found. Therefore nothing showed up in the browser, and the crawler could not follow any URLs because it simply did not find any.
- You cannot give a URL to the script. This is for security reasons!
Example of a correct configuration for running phpcrawl4tsep.php from within TSEP:
Assume:
- www.mydomain.de/index.php: entry page of your site
- www.mydomain.de/php/tsepsearch: installation directory of TSEP
- www.mydomain.de/php/tsepsearch/admin/indexer.php: TSEP indexer.php script ("build-new-index"-startpage)
- www.mydomain.de/php/phpcrawl/: installation directory of PHPCrawl
- www.mydomain.de/php/phpcrawl/phpcrawl4tsep.php: our sample script
The picture of the installation shows our example with its settings.
Definitions, made available by TSEP
Parameters entered on the TSEP admin page "build-new-index" are made available to the external datasupply script via public variables:
- $TSEPdirname
- Directory given as first entry in the "build-new-index"-startpage
- $TSEPwebdir
- WebDirectory given as first entry in the "build-new-index"-startpage
- $TSEPdirexclude
- directory-excludes given in the "build-new-index"-startpage
- $TSEPfileexclude
- file-excludes given in the "build-new-index"-startpage
- $TSEPlistFilenamesOnly
- "1", if the '"would-be-indexed"-filelist only'-checkbox is checked
- $TSEPparmsexternalphp
- The value entered on the "build-new-index"-startpage in the field "Enter parameter to be sent to datasupply-script". This can be any string that the external datasupply script needs.
For example, if the external datasupply script is a crawler, it normally needs an entry HTML filename where the search has to start. This filename can be passed to the script via the field "Enter parameter to be sent to datasupply-script" on TSEP's "build-new-index"-startpage and can be read by the external datasupply script via the variable $TSEPparmsexternalphp.
- $TSEPextinclude
- The value entered on the "build-new-index"-startpage in the field "Fileextensions to be included". In $TSEPextinclude, whitespace is removed and the extension list is pipe-separated (e.g. "htm|html|php")
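As a rough sketch of how a datasupply script might use these variables: the example below takes the start page from $TSEPparmsexternalphp and filters candidate URLs against the pipe-separated list in $TSEPextinclude. The helper has_included_extension(), the candidate URLs and the fallback values are assumptions for illustration; it is also assumed that the variables are visible in the script's global scope.

```php
<?php
// Stand-in values for a standalone run (assumption); inside TSEP these
// variables are filled from the "build-new-index" page.
if (!isset($TSEPextinclude)) {
    $TSEPextinclude = 'htm|html|php';
}
if (!isset($TSEPparmsexternalphp)) {
    $TSEPparmsexternalphp = 'http://www.mydomain.de/index.php';
}

// Hypothetical helper: true if the path of $url ends in one of the
// extensions of the pipe-separated list $extinclude.
function has_included_extension($url, $extinclude) {
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $allowed = explode('|', strtolower($extinclude));
    return $ext !== '' && in_array($ext, $allowed, true);
}

// Hypothetical candidates, e.g. links found by a crawler.
$candidates = array(
    $TSEPparmsexternalphp,
    'http://www.mydomain.de/logo.png',
    'http://www.mydomain.de/faq.html?x=1',
);

// Keep only URLs whose extension is in the include list.
$accepted = array();
foreach ($candidates as $url) {
    if (has_included_extension($url, $TSEPextinclude)) {
        $accepted[] = $url;
    }
}
```

Checking the extension on the URL path only (not on the query string) is a design choice of this sketch; whether TSEP itself does the same is not stated here.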
TSEP Tags for your code
The first tags were introduced in version 0.938.
You can add them to the pages to be indexed to give TSEP instructions.
At this time there are 3 different tags:
- <!-- tsep:cmd:start/ -->
- Ignore (do not index) everything before this tag.
- <!-- tsep:cmd:end/ -->
- Ignore (do not index) everything after this tag.
- <!-- tsep:cmd:noindex --> and <!-- /tsep:cmd:noindex -->
- Ignore everything between these two tags (in the line above, that is the word "and").
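The effect of the three tags can be illustrated with a small sketch. The page markup is hypothetical, and indexable_part() only mimics the behaviour described above so that the result can be inspected; it is not TSEP's own parser.

```php
<?php
// Hypothetical page that uses all three tags.
$page = <<<HTML
<html><body>
<div id="nav">Home | Products</div>
<!-- tsep:cmd:start/ -->
<h1>Widget manual</h1>
<!-- tsep:cmd:noindex -->
<p>Advertisement</p>
<!-- /tsep:cmd:noindex -->
<p>How to install the widget.</p>
<!-- tsep:cmd:end/ -->
<div id="footer">Copyright</div>
</body></html>
HTML;

// Illustration only: return the text that remains after applying the tags.
function indexable_part($html) {
    $start = '<!-- tsep:cmd:start/ -->';
    $end   = '<!-- tsep:cmd:end/ -->';

    // Drop everything before the start tag.
    $pos = strpos($html, $start);
    if ($pos !== false) {
        $html = substr($html, $pos + strlen($start));
    }
    // Drop everything after the end tag.
    $pos = strpos($html, $end);
    if ($pos !== false) {
        $html = substr($html, 0, $pos);
    }
    // Drop everything between each noindex pair.
    $html = preg_replace(
        '#<!-- tsep:cmd:noindex -->.*?<!-- /tsep:cmd:noindex -->#s',
        '',
        $html
    );
    // Remove the markup and collapse whitespace.
    return trim(preg_replace('/\s+/', ' ', strip_tags($html)));
}

echo indexable_part($page), "\n";
```

For this page only "Widget manual How to install the widget." remains; the navigation, the advertisement and the footer are left out.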
Scheduling: cron / at
TSEPautoIndexing.sh should be placed in the admin/examples directory. This shell script should be called by cron (Linux). The equivalent for Windows systems is a new script, TSEPautoIndexing.cmd. Some detailed instructions:
How to initiate indexing via the Unix command curl (intended to be combined with cron):
- launching indexer using current IndexingProfile:
- curl http://.../admin/indexer.php -d startindexing=startindexing -o <out.htm>
- launching indexer using specific IndexingProfile:
- curl http://.../admin/indexer.php -d startindexing=startindexing -d profile=<name-of-profile> -o <out.htm>
Examples:
- curl http://.../admin/indexer.php -d startindexing=startindexing -d profile=demo -o <out.htm>
or
- curl http://.../admin/indexer.php -d startindexing=startindexing -d profile="my demoprofile" -o <out.htm>
Important Notes:
- Do not forget to enclose <name-of-profile> in quotes if it contains blanks!
- curl writes the indexer.php-generated output into this file: <out.htm>
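Where curl is not available, the same POST request could also be sent from a PHP command-line script. The following is only a sketch under assumptions: the function names are hypothetical, allow_url_fopen must be enabled, and the indexer URL is taken from the command line because it depends on your installation.

```php
<?php
// Hypothetical helper: build the POST body that the curl calls above
// send with -d (startindexing, plus profile if one is given).
function indexer_post_body($profile = null) {
    $fields = array('startindexing' => 'startindexing');
    if ($profile !== null && $profile !== '') {
        $fields['profile'] = $profile;
    }
    return http_build_query($fields);
}

// Hypothetical helper: POST to indexer.php and return the generated
// HTML output, or false if the request failed.
function launch_indexer($indexerUrl, $profile = null) {
    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => indexer_post_body($profile),
    )));
    return @file_get_contents($indexerUrl, false, $context);
}

// Command-line use: php launch_indexer.php <indexer-url> [profile] > out.htm
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $html = launch_indexer($argv[1], isset($argv[2]) ? $argv[2] : null);
    if ($html === false) {
        fwrite(STDERR, "Request to indexer failed\n");
        exit(1);
    }
    echo $html;
}
```

Because http_build_query() encodes the values, a profile name containing blanks needs no extra handling in the request body; on the command line it still has to be quoted.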
Hint:
Use the shell-script admin/examples/TSEPautoIndexing.sh
Please adjust the two variables within that script to your needs first (see there). The script can be called without a parameter to launch the indexer using the current IndexingProfile. Examples:
- .../admin/examples/TSEPautoIndexing.sh
or
simply use the name of the IndexingProfile to be indexed as parameter:
- .../admin/examples/TSEPautoIndexing.sh "my demoprofile"
This script runs the indexing process and stores the resulting HTML output file in the TSEP subdirectory "bgindexing.log", using a filename that contains the current date/time and the name of the IndexingProfile. You can later open the desired file in your favorite browser to check the results.
ContentImages
In general
Usually, search results are shown in text format: the page title, part of the content and the link to the page. ContentImages can be shown in addition. ContentImages are images of your web pages: more or less tiny screenshots, which you might know as thumbnails, for example from Thumbshots.org ( http://www.thumbshots.org/ ).
- Each indexed page can have an associated ContentImage.
- A default ContentImage can be defined, which is shown if a page does not have its own ContentImage.
- The maximum display width and height can be defined.
- Images can be created and uploaded automatically (via the creation of "ContentImage File Lists" in conjunction with the indexer).
- ContentImage filenames consist of the MD5 hash of the page URL plus the defined "Image-Filename-Extension".
- ContentImages are removed if the indexer no longer finds a URL that had previously been defined.
- Within "Edit the data stored in the index" you can maintain the image associated with the page.
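The filename rule from the list above can be sketched as follows. The function name is hypothetical, and it is an assumption that the URL is hashed exactly as it is stored in the index (no normalization is applied here).

```php
<?php
// Hypothetical helper: MD5 hash of the page URL plus the configured
// "Image-Filename-Extension" (assumed here to include the leading dot).
function content_image_filename($url, $extension = '.jpg') {
    return md5($url) . $extension;
}

// 32 hexadecimal characters followed by the extension.
echo content_image_filename('http://www.mydomain.de/index.php'), "\n";
```

A script that creates screenshots outside TSEP could use the same rule to name its output files so that they match the indexed pages.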
ToDo:
After deleting an image or a ContentImage File List, or after uploading an image, you currently have to refresh the window manually (F5). We are working on a solution for this "user unfriendliness".
Configure ContentImages
- Use ContentImages
- Switches the use of ContentImages in your TSEP installation on or off
- Images-Path for Web-Access
- Path where the ContentImages are located (used by the HTML img tag to show the images)
- Images-Path for PHPscript-Access
- Path where the ContentImages are located (used by the PHP scripts' file access)
- Root path for ContentImage File Lists for Web-Access
- Path where the ContentImage File Lists are located (used by HTML within "Configure/Manage ContentImages")
- Root path for ContentImage File Lists for PHPscript-Access
- Path where the ContentImage File Lists are located (used by the PHP scripts' file access)
- Image-Filename-Extension
- FileExtension to be used for ContentImages: preferably use ".jpg" or ".png"
- Default image
- Filename (name only, no path and no extension!). You may upload the default image via the button on the right side, but you have to enter the name of the file first (it does not have to match the name of the file on your PC that you are uploading). If you want to upload a file, all paths (above) have to be defined AND saved (via the "update values above" button).
- maximal display-height
- Maximal height of the image to be shown on the result-pages (aspect ratio is kept in association with the "maximal display-width")
- maximal display-width
- Maximal width of the image to be shown on the result-pages (aspect ratio is kept in association with the "maximal display-height")
- The indexer should create ContentImage File Lists
- If the indexer is run and this option is switched on, a ContentImage File List (associated to the indexing profile) is created.
- Only for pages having no ContentImage
- If "The indexer should create ContentImage File Lists" is switched on, a file list entry is written into the ContentImage File List only if no ContentImage exists for the page.
- Automatically run transformation
- Transformation is run automatically after the indexer.
- Transformations
- ContentImage File List entries can be transformed using a transformation-template into .bat-files, .shell-scripts,... This output can e.g. be used to run an external program for building the screenshots or upload the screenshots.
There are three template examples delivered with TSEP (located in
<tsepinstalldir>/contentimages/filelists/transformation_templates).
1. toWebswoon.bat
creates a .bat file that runs Webswoon (http://www.intellitamper.com/webswoon/) to create a screenshot of each page
2. WebswoonCopy2Host.bat
creates a .bat file that copies the created screenshots (Webswoon results) into the directory where the ContentImages reside ("Images-Path for PHPscript-Access")
3. WebswoonFtp2Host.bat
creates a .bat file that uploads the created screenshots (Webswoon results) into the directory where the ContentImages reside ("Images-Path for PHPscript-Access")
These templates are examples that have to be adjusted to your needs before use. They are intended for use with Webswoon and are designed for Windows.
Please have a look into the directory <tsepinstalldir>/admin/examples, where you can find two shell-scripts to be used to create screenshots under *nix systems: wwws.sh and wwwshot.sh.
We will add an example template for creating wwws.sh in the future.
TSEP currently supports up to two transformations.
- Templatefilename
- Filename + extension. Do not enter a path.
Currently, template files have to be located under <Root path for ContentImage File Lists for PHPscript-Access>/transformation_templates.
The extension of the generated output file is taken from this template filename.
- Active
- You may deactivate a template expansion here.
- Commentline starts with
- Because the transformation writes additional comments into the generated output files, you have to define the prefix that starts a comment line (e.g. '@REM' for .bat files, '#' for .sh scripts).
Manage ContentImages
- ContentImage File Lists
- In this area, all existing ContentImage File Lists and all associated transformation output files are shown.
You may open, download or delete every file.
On ContentImage File Lists you may launch a transformation.
- Manually create ContentImage File List, from currently indexed Pages
- In this area you may select an existing Indexing Profile and create a ContentImage File List manually.