1 Licence
This file is part of SECONDO.
Copyright (C) 2005, University in Hagen, Department of Computer Science,
Database Systems for New Applications.
SECONDO is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
SECONDO is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with SECONDO; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
2 createPdfRelation
This script creates a SECONDO object: a relation with the following schema:
(rel (tuple ( (filename text) (pdf text) (theText text))))
The pdf attribute is embedded in a <file> tag so that the SECONDO parser
loads the content of the file as a base64-encoded text atom. The text contained
in the pdf is extracted with the pdftotext tool and stored in the theText
attribute. pdftotext is part of the xpdf project. You can download it from
http://www.foolabs.com/xpdf/download.html
The script takes the name of the object as its only argument. It then reads
the file names of pdf documents from standard input, one per line. A blank
line ends the input and the script stops. You can use the script in a pipe, e.g.
find $PWD -name "*.pdf" | createPdfRelation MyPdfFiles > MyPdfFilesobj
3 createTextRelation
3.1 Standard Use
The script createTextRelation stores a set of files in a relation
of the following type:
( rel (tuple ((filename text)(theText text))))
The script requires a single argument: the name of the object to be
created. It reads file names from standard input, one per line, and writes
one tuple per file name to standard output. The input ends with a
blank line.
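The input protocol described above (one file name per line, terminated by a
blank line) can be sketched as a small shell loop. This is only an
illustration and not the code of createTextRelation; the function name
read_filenames is hypothetical.

```shell
# Hypothetical helper: echo file names read from stdin until the first
# blank line (or the end of input), mirroring the protocol described above.
read_filenames() {
  while IFS= read -r name; do
    # A blank line terminates the input.
    [ -z "$name" ] && break
    printf '%s\n' "$name"
  done
}

# Demo: the name after the blank line is ignored.
printf 'a.txt\nb.txt\n\nc.txt\n' | read_filenames
```

When the script is fed from a pipe, the end of the pipe's output ends the
input as well, so no explicit blank line is needed in that case.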
If you want to use this tool frequently, add the directory of this script
to your PATH variable or create an alias for the script:
alias createTextRelation="$SECONDO_BUILD_DIR/Tools/Generators/TextRelations/createTextRelation"
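For the other option, extending PATH, lines like the following could go into
your shell startup file. This is a minimal sketch, assuming that
SECONDO_BUILD_DIR is set as in the alias; the variable name tools_dir is used
for illustration only.

```shell
# Directory that holds createTextRelation (same location as in the alias).
tools_dir="${SECONDO_BUILD_DIR:-}/Tools/Generators/TextRelations"

# Append the directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$tools_dir:"*) ;;              # already present, nothing to do
  *) PATH="$PATH:$tools_dir" ;;
esac
export PATH
```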
Combined with the standard tools, you can use this script to collect
all files with the desired properties into a single relation.
Example 1:
You want to collect all text files directly located in your $HOME/Documents
directory. The call is:
ls $HOME/Documents/*.txt | createTextRelation MyDocuments > MyDocumentsObj
Example 2:
You want to collect all html files in your home directory and its
subdirectories. The call is:
find $HOME -iname "*.html" | createTextRelation MyWebpages > MyWebpagesObj
3.2 Changing the script for special purposes
Binary data are frequently stored in text atoms. You can adapt the script to
handle such data by changing the three variables CONTENTTYPE, CONTENT, and
CONTENTNAME in the script.
CONTENTTYPE is the name of the attribute type, e.g. text, binfile, or jpg.
Some objects are stored as base64-encoded text atoms. The nested list parser
provides a <file> tag that automatically encodes a file into such a text.
Set the variable CONTENT to 'file' if your type uses base64 encoding.
Note that the encoding is performed by the nested list parser, not by this
script. For this reason, pass absolute pathnames to this script, e.g. use
find $PWD -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj
instead of
find -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj
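If the file names come from a source that produces relative paths, they can be
made absolute before they reach the script. The following is a sketch of one
way to do this; the helper name to_absolute is hypothetical and not part of
SECONDO.

```shell
# Hypothetical helper: turn each path read from stdin into an absolute one
# by prefixing relative paths with the current working directory.
to_absolute() {
  while IFS= read -r path; do
    case "$path" in
      '')  printf '\n' ;;                               # keep a blank terminator line
      /*)  printf '%s\n' "$path" ;;                     # already absolute
      ./*) printf '%s/%s\n' "$PWD" "${path#./}" ;;      # strip the leading ./
      *)   printf '%s/%s\n' "$PWD" "$path" ;;
    esac
  done
}

# Demo with one relative and one absolute name.
printf './pics/a.jpg\n/tmp/b.jpg\n' | to_absolute
```

The helper would be placed between the producer and the script, for example
between find and createTextRelation in the pipe shown above.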
The variable CONTENTNAME controls the name of the attribute that holds the
file content.
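As an illustration, the three variables might be set as follows for a relation
of JPEG pictures. The values are assumptions chosen to match the description
above: jpg is one of the type names mentioned, and the attribute name picture
is made up. The printed type assumes that the two names are substituted into
the schema pattern shown in the Standard Use section; check the script itself
for the exact form it expects.

```shell
# Sketch of the three settings inside createTextRelation (example values).
CONTENTTYPE="jpg"      # attribute type of the content, e.g. text, binfile, jpg
CONTENT="file"         # 'file': the nested list parser base64-encodes the file
CONTENTNAME="picture"  # hypothetical attribute name for the file content

# Presumed resulting relation type under the assumption stated above.
echo "(rel (tuple ((filename text)($CONTENTNAME $CONTENTTYPE))))"
```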
4 Creating a single page relation
You can also create a relation that contains each page, as well as each double
page, of a pdf file as a separate tuple. The tool pdf2SecondoPages serves this
purpose. It is called as follows:
pdf2SecondoPages <relname> [pdffiles] [>outfile]
It creates a single relation with attributes:
FileName (string) : The file name
IsDoublePage (bool) : whether this tuple represents a double page (otherwise a single page)
FirstPage (int) : the number of the first page
ThePdf (text) : the content as pdf
Content (text) : the content as plain text
If no pdf files are given, the script reads the file names from stdin.
This is required when too many pdf files are to be converted for them to be
passed as arguments (in a standard bash, more than about 1000). In this case,
the tool is called:
find -type f -name "*.pdf" | pdf2SecondoPages <relname>
For each pdf file, the script creates a subdirectory containing the split pdf
files. Remember to copy these subdirectories when moving the relation.
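The two calling conventions described above (file names as arguments, or on
stdin when none are given) follow a common shell pattern. The sketch below
illustrates that pattern only; it is not taken from pdf2SecondoPages, and the
function name list_inputs is hypothetical. In the real tool the relation name
comes first, so the pattern would apply to the remaining arguments.

```shell
# Hypothetical helper: list the files to process, taking them from the
# arguments if any are given and from stdin (one per line) otherwise.
list_inputs() {
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$@"
  else
    cat
  fi
}

list_inputs a.pdf b.pdf                 # names taken from the arguments
printf 'c.pdf\nd.pdf\n' | list_inputs   # names taken from stdin
```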
5 The file createText.cpp
Run make to build a simple tool called ~createText~, which creates a
simple relation containing an attribute of type "text". This is useful for
generating synthetic data whose attributes use FLOBs.