1 Licence

This file is part of SECONDO.

Copyright (C) 2005, University in Hagen, Department of Computer Science, Database Systems for New Applications.

SECONDO is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

SECONDO is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with SECONDO; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

2 createPdfRelation

This script creates a SECONDO object. It is a relation with the following schema:

    (rel (tuple ((filename text) (pdf text) (theText text))))

The pdf attribute is embedded in a tag so that the SECONDO parser loads the content of the file as a base64-coded text atom. The text contained in the pdf is put into the theText attribute using the pdftotext tool. pdftotext is part of the xpdf project. You can download it from http://www.foolabs.com/xpdf/download.html

The script needs the name of the object as its only argument. It then waits for input, which has to consist of file names of pdf documents. After a blank line is entered, the script finishes its output. You can use the script in a pipe, e.g.

    find -name "*.pdf" | createPdfRelation MyPdfFiles > MyPdfFilesobj

3 createTextRelation

3.1 Standard Use

The script createTextRelation stores a set of files in a relation of the following type:

    (rel (tuple ((filename text) (theText text))))

The script requires a single argument giving the name of the created object. It reads file names from standard input and writes one tuple to standard output for each input. The input ends with a blank line.
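The input convention described above (one file name per line, terminated by a blank line) can be sketched in shell as follows. This is only an illustration of the convention, not the code of createTextRelation itself: the function name read_filenames is hypothetical, and the real script additionally wraps each file into a tuple of the relation.

```shell
#!/bin/bash
# Hypothetical helper, NOT taken from the createTextRelation script.
# It reads file names from standard input, one per line, and stops at
# the first blank line or at end of input, echoing each accepted name.
read_filenames() {
  local name
  while IFS= read -r name; do
    # A blank line terminates the input, as described in the text.
    if [ -z "$name" ]; then
      break
    fi
    printf '%s\n' "$name"
  done
}

# Example call: c.txt is ignored because it follows the blank line.
printf 'a.txt\nb.txt\n\nc.txt\n' | read_filenames
```

When the script is fed from a pipe such as find or ls, no blank line is needed, because the end of the piped input stops the loop in the same way.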
If you want to use this tool frequently, add the path to the script to your PATH variable or create an alias for it:

    alias createTextRelation="$SECONDO_BUILD_DIR/Tools/Generators/TextRelations/createTextRelation"

Combined with the standard tools, you can use this script to collect all files with the desired properties into a single relation.

Example 1: You want to collect all text files located directly in your $HOME/Documents directory. The call is:

    ls $HOME/Documents/*.txt | createTextRelation MyDocuments > MyDocumentsObj

Example 2: Collecting all html files in your home directory or its subdirectories. The call is:

    find $HOME -iname "*.html" | createTextRelation MyWebpages > MyWebpagesObj

3.2 Changing the script for special purposes

Binary data are frequently stored in text atoms. You can change the script to handle such data. To do this, you only have to change the three variables CONTENTTYPE, CONTENT, and CONTENTNAME in the script.

CONTENTTYPE is the name of the attribute type, e.g. text, binfile, or jpg.

Some objects are stored as base64-coded text atoms. The nested list parser provides a tag for automatically encoding a file into such a text. Set the variable CONTENT to 'file' if your type uses base64 encoding. Note that the encoding is performed by the nested list parser, not by this script. For this reason, use absolute path names with this script, e.g. use

    find $PWD -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj

instead of

    find -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj

With the variable CONTENTNAME you can control the attribute name of the file content.

4 Creating single-page relations

You can also create a relation that contains each page, as well as each double page, as a single tuple. For this purpose, the tool pdf2SecondoPages can be used.
The tool is called as follows:

    pdf2SecondoPages [pdffiles] [>outfile]

It creates a single relation with the following attributes:

    FileName (string)   : the file name
    IsDoublePage (bool) : whether this tuple represents a single or a double page
    FirstPage (int)     : the number of the first page
    ThePdf (text)       : the content as pdf
    Content (text)      : the content as plain text

If no pdf files are given, the script reads the file names from stdin. This is required when a large number of pdf files (more than 1000 with a standard bash) is to be converted. In this case, the tool is called as:

    find -type f -name "*.pdf" | pdf2SecondoPages

For each pdf file, the script creates a subdirectory containing the split pdf files. Remember to copy these subdirectories when moving the relation.

5 The file createText.cpp

Run make to build a simple tool called ~createText~, which creates a simple relation containing an attribute of type "text". This is useful for generating synthetic data with attributes that use FLOBs.