secondo/Tools/Generators/TextRelations/readme

1 Licence

 This file is part of SECONDO.

 Copyright (C) 2005, University in Hagen, Department of Computer Science,
 Database Systems for New Applications.

 SECONDO is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.

 SECONDO is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.

 You should have received a copy of the GNU General Public License
 along with SECONDO; if not, write to the Free Software
 Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


2 createPdfRelation

This script creates a SECONDO object: a relation with the following schema
   (rel (tuple ((filename text) (pdf text) (theText text))))
The pdf attribute is embedded in a <file> tag so that the SECONDO parser
loads the content of the file as a base64-coded text atom. The text contained
in the pdf is extracted with the pdftotext tool and stored in the theText
attribute. pdftotext is part of the xpdf project. You can download it from
    http://www.foolabs.com/xpdf/download.html

The script takes the name of the object as its only argument. It then reads
the filenames of pdf documents from standard input. A blank line ends the
input, and the script finishes its output. You can use the script in a pipe,
e.g.
  find -name "*.pdf" | createPdfRelation MyPdfFiles > MyPdfFilesobj


3 createTextRelation

3.1 Standard Use

The script createTextRelation stores a set of files in a relation
of the following type:
   (rel (tuple ((filename text) (theText text))))

The script requires a single argument, the name of the object to create.
It reads filenames from standard input and writes one tuple per filename
to standard output. The input ends with a blank line.
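
The input convention described above (filenames on standard input, ended by
a blank line) can be sketched as follows. This is only an illustration, not
the code of createTextRelation itself; the function name read_filenames is
hypothetical.

```shell
# Hypothetical sketch of the input loop: echo each filename read from
# standard input and stop at the first blank line (or at end of input).
read_filenames() {
  while IFS= read -r name; do
    if [ -z "$name" ]; then
      break                      # a blank line ends the input
    fi
    printf '%s\n' "$name"
  done
}
```

When the script is used in a pipe, the end of the piped input ends the loop
in the same way, so no trailing blank line is needed in this sketch.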

If you want to use this tool frequently, add the directory containing the
script to your PATH variable or create an alias for it:
alias createTextRelation="$SECONDO_BUILD_DIR/Tools/Generators/TextRelations/createTextRelation"


Combined with the standard tools, you can use this script to collect
all files with the desired properties into a single relation.

Example 1:
You want to collect all text files directly located in your $HOME/Documents 
directory. The call is:

ls $HOME/Documents/*.txt | createTextRelation MyDocuments > MyDocumentsObj

Example 2:
You want to collect all html files in your home directory and its subdirectories.
The call is:

find $HOME -iname "*.html" | createTextRelation MyWebpages > MyWebpagesObj


3.2 Changing the script for special purposes

Binary data are frequently stored in text atoms. You can adapt the script to
handle such data by changing just three variables in the script: CONTENTTYPE,
CONTENT, and CONTENTNAME.

CONTENTTYPE is the name of the attribute type, e.g. text, binfile, or jpg.
Some objects are stored as base64-coded text atoms. The nested list parser
provides a <file> tag that automatically encodes a file into such a text atom.
Set the variable CONTENT to 'file' if your type uses base64 encoding.
Note that the encoding is done by the nested list parser, not by this script.
The parser therefore has to find the files when it reads the object.
For this reason, pass absolute pathnames to the script, e.g. use
  find $PWD -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj
instead of
  find -name "*.jpg" | createTextRelation MyPictures > MyPicturesObj
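
If the filenames come from a source that produces relative paths, they can
be made absolute before they reach the script. The filter below is a minimal
sketch of that idea; the function name to_absolute is hypothetical and not
part of the SECONDO tools.

```shell
# Hypothetical helper: turn each filename read from standard input into an
# absolute pathname by prefixing the current working directory.
to_absolute() {
  while IFS= read -r name; do
    case "$name" in
      /*) printf '%s\n' "$name" ;;                   # already absolute
      *)  printf '%s/%s\n' "$PWD" "${name#./}" ;;    # drop a leading "./"
    esac
  done
}
```

A possible call would be
  find -name "*.jpg" | to_absolute | createTextRelation MyPictures > MyPicturesObj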

With the variable CONTENTNAME you can control the attribute name of the file content.
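
As an illustration only, the three variables might be set as follows for jpg
files. The value picture for CONTENTNAME and the helper relation_type are
hypothetical. The sketch also assumes that CONTENTNAME and CONTENTTYPE take
the places of theText and text in the relation type shown in the standard
use; check the script itself before relying on this.

```shell
# Assumed settings for a relation of jpg files (illustrative values).
CONTENTTYPE=jpg        # attribute type, one of the example names above
CONTENT=file           # let the nested list parser base64-encode the file
CONTENTNAME=picture    # hypothetical attribute name for the file content

# Hypothetical helper: print the relation type these settings would give,
# assuming the layout (filename text) followed by the content attribute.
relation_type() {
  printf '(rel (tuple ((filename text) (%s %s))))\n' \
    "$CONTENTNAME" "$CONTENTTYPE"
}
```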


4 Creating a relation of single pages

You can also create a relation in which each page, as well as each double page,
is stored as a single tuple. For this purpose, use the tool pdf2SecondoPages.
The tool is called as follows:

pdf2SecondoPages <relname> [pdffiles] [>outfile]

It creates a single relation with attributes:
   FileName      (string) : The file name
   IsDoublePage  (bool)   : whether this tuple represents a double page or a single page
   FirstPage     (int)    : the number of the first page
   ThePdf        (text)   : the content as pdf
   Content       (text)   : the content as plain text

If no pdf files are given, the script reads the filenames from stdin.
This is required when a large number of pdf files (more than 1000 with a
standard bash) are to be converted. In this case, the tool is called:

find -type f -name "*.pdf" | pdf2SecondoPages <relname>

For each pdf file, the script creates a subdirectory containing the split pdf files.
Remember to copy these subdirectories when moving the relation.
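
The choice between filenames given as arguments and filenames read from stdin
can be sketched as follows. This is not the code of pdf2SecondoPages; the
function name list_inputs is hypothetical and only shows the convention.

```shell
# Hypothetical sketch: print the filenames given as arguments, one per line,
# or pass standard input through unchanged if no arguments are given.
list_inputs() {
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$@"
  else
    cat
  fi
}
```

Reading from stdin avoids passing a very long list of filenames on the
command line.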


5 The file createText.cpp

Run make to build a simple tool called ~createText~, which creates a
simple relation containing an attribute of type "text". This is useful for
generating synthetic data whose attributes use FLOBs.