1 Licence

This file is part of SECONDO.

Copyright (C) 2005, University in Hagen, Department of Computer Science,
Database Systems for New Applications.

SECONDO is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

SECONDO is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with SECONDO; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

2 createPdfRelation

This script creates a SECONDO object: a relation with the following schema:
(rel (tuple ( (filename text) (pdf text) (theText text))))
The pdf attribute is embedded in a <file> tag, so that the SECONDO parser
loads the content of the file as a base64-coded text atom. The text contained in
the pdf is put into the theText attribute using the pdftotext tool.
pdftotext is part of the xpdf project. You can download it from
http://www.foolabs.com/xpdf/download.html

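As a rough illustration of what a base64-coded text atom holds, the following sketch encodes a few bytes on the command line. It assumes that the base64 utility (GNU coreutils or similar) is installed. It is not part of the SECONDO tools; the SECONDO parser performs this coding itself.

```shell
# Illustration only: base64 turns arbitrary bytes into plain ASCII text.
# Assumes the base64 utility is available on the system.
encoded=$(printf 'abc' | base64)
printf '%s\n' "$encoded"   # prints YWJj
```
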
The script needs the name of the object as its only argument. It then waits
for input, which must consist of file names of pdf documents, one per line. After
a blank line is entered, the script finishes its output and stops. You can use the
script in a pipe, e.g.
find -name "*.pdf" | createPdfRelation MyPdfFiles > MyPdfFilesobj

3 createTextRelation

3.1 Standard Use

The script createTextRelation stores a set of files in a relation
of the following type:
( rel (tuple ((filename text)(theText text))))

The script requires a single argument giving the name of the
created object. It reads file names from standard input and writes
a tuple to standard output for each of them. The input ends with a
blank line.

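The input convention described above (one file name per line, terminated by a blank line) can be sketched as a small shell loop. The function read_names is hypothetical and only mimics the input handling; it is not taken from createTextRelation and writes no tuples.

```shell
# Hypothetical helper, not part of createTextRelation: reads file names
# from standard input and prints each one until the first blank line.
read_names() {
    while IFS= read -r name; do
        # A blank line terminates the input.
        [ -z "$name" ] && break
        printf '%s\n' "$name"
    done
}

# The name after the blank line is ignored.
result=$(printf 'a.txt\nb.txt\n\nc.txt\n' | read_names)
printf '%s\n' "$result"
```

When the script is used in a pipe, the end of the piped input plays the same role as the blank line in this sketch.
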
If you want to use this tool frequently, add the directory containing this script
to your PATH variable or create an alias for the script:
alias createTextRelation="$SECONDO_BUILD_DIR/Tools/Generators/TextRelations/createTextRelation"

Combined with the standard tools, you can use this script to collect
all files with the desired properties into a single relation.

Example 1:
You want to collect all text files located directly in your $HOME/Documents
directory. The call is:

ls $HOME/Documents/*.txt | createTextRelation MyDocuments > MyDocumentsObj

Example 2:
You want to collect all html files in your home directory or any of its subdirectories.
The call is:

find $HOME -iname "*.html" | createTextRelation MyWebpages > MyWebpagesObj

3.2 Changing the script for special purposes

Binary data are frequently stored in text atoms. You can change the script to handle
such data. To do this, you only have to change the three variables CONTENTTYPE,
CONTENT, and CONTENTNAME in the script.

CONTENTTYPE is the name of the attribute type, e.g. text, binfile, or jpg.
Some objects are stored as base64-coded text atoms. The nested list parser
provides a <file> tag for automatically coding a file into such a text atom.
Set the variable CONTENT to 'file' if your type uses base64 encoding.
Note that the coding is done by the nested list parser when the object is read, not by this script.
For this reason, pass absolute path names to this script, e.g.
use
find $PWD -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj
instead of
find -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj

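If the file names come from a source that produces relative paths, they can be made absolute before they reach the script. The following is a minimal sketch of such a filter. The function to_absolute is hypothetical and not part of the SECONDO tools. It assumes that relative names are meant relative to the current directory and that file names contain no newlines.

```shell
# Hypothetical filter, not part of the SECONDO tools: prefixes relative
# file names read from standard input with the current directory.
to_absolute() {
    while IFS= read -r name; do
        case "$name" in
            /*)  printf '%s\n' "$name" ;;                  # already absolute
            ./*) printf '%s/%s\n' "$PWD" "${name#./}" ;;   # drop leading ./
            *)   printf '%s/%s\n' "$PWD" "$name" ;;
        esac
    done
}

abs=$(printf '/tmp/a.jpg\n./b.jpg\nc.jpg\n' | to_absolute)
printf '%s\n' "$abs"
```

Under these assumptions, a call such as find -name "*.jpg" | to_absolute | createTextRelation MyPictures > MyPicturesObj would pass only absolute names to the script.
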
With the variable CONTENTNAME you can control the attribute name of the file content.

4 Creating single page relations

You can also create a relation that contains each page, as well as each double page, of a pdf file as
a separate tuple. The tool pdf2SecondoPages can be used for this purpose.
The tool is called as follows:

pdf2SecondoPages <relname> [pdffiles] [>outfile]

It creates a single relation with the attributes:
FileName (string) : the file name
IsDoublePage (bool) : whether this tuple represents a single or a double page
FirstPage (int) : the number of the first page
ThePdf (text) : the content as pdf
Content (text) : the content as plain text

If no pdf files are given, the script reads the file names from stdin.
This is required when a large number of pdf files (with a standard bash, more than about 1000) is to be
converted. In this case, the tool is called as follows:

find -type f -name "*.pdf" | pdf2SecondoPages <relname>

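How many file names fit on one command line depends on the system's limit for the total length of the arguments. A minimal sketch for inspecting that limit, assuming a POSIX getconf utility is installed (the fallback value is only there for systems without it):

```shell
# Query the maximum combined length (in bytes) of command-line arguments
# and environment. Falls back to "unknown" if getconf is unavailable.
arg_max=$(getconf ARG_MAX 2>/dev/null || echo unknown)
printf 'ARG_MAX: %s\n' "$arg_max"
```

If the file names would exceed this limit, passing them through stdin as in the find call above avoids the problem.
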
The script creates a subdirectory for each pdf file, containing the split pdf files.
Remember to copy these subdirectories when moving the relation.

5 The file createText.cpp

Run make to build a simple tool called ~createText~, which creates a
simple relation containing an attribute of type "text". This is useful for
generating synthetic data with attributes that use FLOBs.