/*
----
This file is part of SECONDO.

Copyright (C) 2012, University in Hagen
Faculty of Mathematic and Computer Science,
Database Systems for New Applications.

SECONDO is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

SECONDO is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with SECONDO; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
----

\tableofcontents

\newpage

1 Overview

DBLP is a large collection of bibliographic data. Because the data is well structured,
a subset of it can be transferred to a property graph.

\begin{figure}[h]
\centerline{
\includegraphics[width=1\textwidth]{schema.eps}}
\end{figure}


2 Prepare data

2.1 Create database

Create a SECONDO database named "pgraph2".

*/
SECONDO> create database pgraph2;
SECONDO> open database pgraph2;
/*

Because this database is very large, the available memory has to be raised
in ~SecondoConfig.ini~ to at least 2 GB:

*/

[QueryProcessor]
GlobalMemory=2048

/*

In the script files, the following statement makes additional memory available to the MainMemoryAlgebra:

*/

query meminit (1524);
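/*

The two memory settings above can be cross-checked with a few lines of Python.
This is only a sketch under stated assumptions: it parses the fragment shown above
from a string (whether the complete ~SecondoConfig.ini~ can be read by Python's
~configparser~ is not verified here), it reads both numbers as megabytes, and it
assumes that the ~meminit~ budget should not exceed ~GlobalMemory~.

```python
import configparser

# Fragment copied from the text above; the real SecondoConfig.ini contains
# many more entries that this sketch does not try to handle.
FRAGMENT = """
[QueryProcessor]
GlobalMemory=2048
"""

REQUIRED_GLOBAL_MB = 2048   # "at least 2 GB" from the text, read as MB (assumption)
MEMINIT_MB = 1524           # argument of 'query meminit (1524);'


def global_memory_mb(ini_text: str) -> int:
    """Return the GlobalMemory value of the [QueryProcessor] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint("QueryProcessor", "GlobalMemory")


configured = global_memory_mb(FRAGMENT)
enough_global = configured >= REQUIRED_GLOBAL_MB   # is the global limit large enough?
meminit_fits = MEMINIT_MB <= configured            # assumed relation, see lead-in
print(configured, enough_global, meminit_fits)
```

With the values used in this document, both checks succeed.

*/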
/*

2.2 Import relations

To import the raw data into SECONDO, follow these steps:

  1 Download the raw data from https://dblp.uni-trier.de/xml/dblp.xml.gz

  2 Make sure that SECONDO is (temporarily) compiled without transaction support, as transaction
logging consumes a huge amount of system resources. \\
(See bin/SecondoConfig.ini value "RTFlags += SMI:NoTransactions")

  3 Use the application in Tools/Converter/Dblp2Secondo to generate the
following import files: \\ Document, Authordoc, Author, Keyword. \\
(Follow the instructions in the contained READ.ME file.)
Before (!) importing the relations with the script ~restore\_objs~, rename
the relations in the let statements to Document\_raw, Author\_raw, Authordoc\_raw, and
Keyword\_raw. This keeps the original names free for the node relations created later.


2.3 Transform relations to property graph

Because the complete dataset is very large, this sample converts only a
subset into the property graph. The selection can be adjusted at the beginning of
the script ~createrels~. Currently it takes all publications from 2017
and all documents whose author contains the word "gueting"
(about 320,000 records).

The imported data is split into node and edge relations that represent a graph.
These relations are used to define the graph later.

This structure is created by the script ~createrels~:

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/createrels
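/*

The selection rule described above (publications from 2017, plus documents with an
author matching "gueting") can be illustrated outside SECONDO. The following Python
sketch is not the content of ~createrels~; the record layout, the field names and the
case-insensitive comparison are assumptions made for illustration only.

```python
from typing import Iterable

# Hypothetical record layout; the real attribute names are defined by the
# import files and by the script createrels, which are not reproduced here.
Record = dict


def in_subset(doc: Record) -> bool:
    """True if the document is from 2017 or has an author matching 'gueting'."""
    from_2017 = doc["year"] == 2017
    by_gueting = any("gueting" in name.lower() for name in doc["authors"])
    return from_2017 or by_gueting


def select_subset(docs: Iterable[Record]) -> list:
    """Keep only the documents that satisfy the selection rule."""
    return [doc for doc in docs if in_subset(doc)]


# Toy data, invented for illustration only.
docs = [
    {"title": "A", "year": 2017, "authors": ["Jane Doe"]},
    {"title": "B", "year": 1995, "authors": ["Ralf Hartmut Gueting"]},
    {"title": "C", "year": 2003, "authors": ["John Roe"]},
]
selected = [doc["title"] for doc in select_subset(docs)]
print(selected)
```

The two conditions are combined with a logical or, so a document qualifies if either one holds.

*/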
/*


3 Creating a property graph

A property graph has to be defined before the matching operators can be
used to query it. This is done by registering the node and edge
relations, which can be seen as the schema of the graph.

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/create
/*

First, a PropertyGraph object is created. The argument "p2" is
used as a prefix for objects in the memory catalog, which keeps the data
of multiple graphs separate.

*/

let p2=createpgraph("p2");
/*

To define the schema of the graph, the script ~create~ uses the
following operators:

  * ~createpgraph(name)~ to create a property graph object

  * ~addnodesrel[relname,fromclause,toclause]~ to register node relations

  * ~addedgesrel[relname,propertyname, indexname]~ to register edge relations

  * ~addnodeindex[relname]~ to register node property indexes

This configuration is saved in the database and remains
available between sessions.

To get information about the configuration of a property graph
object, use the ~info~ operator.

*/
SECONDO> query p2 info;

SECONDO> @../Algebras/PropertyGraph/sample-dblp/info
/*

This will print the following information to the console:

*/
PGRAPH Information
 - name : p2
 - node relations
    - Author (Authorid)
    - Collection (Id)
    - Conference (Id)
    - Document (Docid)
    - Keyword (Id)
    - Publisher (Id)
 - edge relations
    - AUTHOR_OF (FROM Authorid=>Author.Authorid; TO Docid=>Document.Docid)
    - KEYWORD_OF (FROM Wordid=>Keyword.Wordid; TO Docid=>Document.Docid)
    - PART_OF (FROM Docid=>Document.Docid; TO Collectionid=>Collection.Id)
    - PUBLISHED_AT (FROM Docid=>Document.Docid; TO Conferenceid=>Conference.Id)
    - PUBLISHED_BY (FROM Docid=>Document.Docid; TO Publisherid=>Publisher.Id)
/*

4 Loading the property graph

Before the property graph can be queried, it needs to be loaded.
Loading reads all configured data into memory and creates
additional structures to support the match operators.
The first call of ~loadgraph~ also gathers some statistics
that are stored in the graph object for later reuse, so
loading is faster the next time.

*/
SECONDO> query p2 loadgraph;
/*

After loading the graph, the following statistics are calculated:

*/

 - statistics:
    - noderelations:
       - Author CARD: 487803
       - Collection CARD: 77
       - Conference CARD: 2633
       - Document CARD: 321126
       - Keyword CARD: 82451
       - Publisher CARD: 172
    - edgerelations:
       - AUTHOR_OF CARDFW: 2.263125 CARDBW: 3.437775
       - KEYWORD_OF CARDFW: 30.364629 CARDBW: 7.796298
       - PART_OF CARDFW: 0.012609 CARDBW: 52.584416
       - PUBLISHED_AT CARDFW: 0.488674 CARDBW: 59.599696
       - PUBLISHED_BY CARDFW: 0.009678 CARDBW: 18.069767
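/*

One way to read these numbers, stated here as an assumption and not taken from the
algebra's documentation: CARDFW is the average number of edges per node of the
source relation, and CARDBW is the average number of edges per node of the target
relation. If that reading is right, multiplying each value by the matching node
cardinality should give (almost) the same edge count from both sides. The following
Python sketch checks this with the figures printed above; the source and target
relations are taken from the FROM and TO clauses of the ~info~ output.

```python
# Node cardinalities and edge statistics copied from the output above.
NODE_CARD = {
    "Author": 487803, "Collection": 77, "Conference": 2633,
    "Document": 321126, "Keyword": 82451, "Publisher": 172,
}
# edge name -> (source relation, target relation, CARDFW, CARDBW)
EDGE_STATS = {
    "AUTHOR_OF": ("Author", "Document", 2.263125, 3.437775),
    "KEYWORD_OF": ("Keyword", "Document", 30.364629, 7.796298),
    "PART_OF": ("Document", "Collection", 0.012609, 52.584416),
    "PUBLISHED_AT": ("Document", "Conference", 0.488674, 59.599696),
    "PUBLISHED_BY": ("Document", "Publisher", 0.009678, 18.069767),
}


def edge_count_estimates(edge: str) -> tuple:
    """Estimate the number of edges from the source side and from the target side."""
    source, target, card_fw, card_bw = EDGE_STATS[edge]
    return NODE_CARD[source] * card_fw, NODE_CARD[target] * card_bw


estimates = {edge: edge_count_estimates(edge) for edge in EDGE_STATS}
for edge, (from_source, from_target) in estimates.items():
    print(f"{edge:13s} {from_source:12.1f} {from_target:12.1f}")
```

For every edge relation the two estimates differ by less than one edge, which is
consistent with the assumed reading. For AUTHOR\_OF, both come out at roughly
1.1 million edges.

*/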
/*

5 Sample Queries

The PropertyGraph Algebra defines three matching operators:

  - ~match1~: Uses a query tree and a stream of input nodes to match subgraphs,
starting from the root node and matching edge by edge and node by node.

  - ~match2~: Takes only a query graph. A query tree is derived automatically by selecting
the optimal start node. The input node relation is opened internally.

  - ~match3~: Queries are written in Cypher [CYP20], a popular graph query language.


5.1 Query 'coauthor'

Queries the top 5 co-authors of publications of "Ralf Hartmut Gueting".
Below, this query is expressed with each of the three match operators.

The results are grouped and show each co-author with the number of joint publications.


5.1.1 match1

The starting nodes for the subgraph match are taken from the
tuple stream (first argument).
Note the direction argument "$<$-", which matches an edge in reverse direction.

*/
query p2
  Document feed
  match1
  ['
  (
    (doc Document)
    p PUBLISHED_AT
    ( (conf Conference) )
    KEYWORD_OF <-
    ( (k Keyword) )
    AUTHOR_OF <-
    ( (a Author ( (Name "Ralf Hartmut Gueting")) ) )
  )',
  '( ((k Word) contains "tempo") )',
  '( ((conf Name) Name) ((p Year) Year) ((doc Title) Title) )'
  ] consume;

/*

Also available as a script file:

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/match1-coauthors
/*

5.1.2 match2

A query graph is given as a list. The optimal start node is
determined automatically and the corresponding tuple stream is
used internally. (Note that a reverse direction for edges
is not needed here.)

*/

query p2
  match2 ['
  (
    ( ( (Name "Ralf Hartmut Gueting") ))
    (doc Document)
    (a AUTHOR_OF doc)
  )',
  '( ((a Name) <> "Ralf Hartmut Gueting") ) ',
  '( ((a Name) Name) )'
  ]
  sortby[Name] groupby[Name; Cnt:group count] sortby[Cnt:desc] head[5] consume;

/*

Also available as a script file:

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/match2-coauthors
/*

NOTE:
An additional script forces the operator to choose an unfavorable strategy.
It takes much longer to complete.

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/match2-coauthors-slow
/*

5.1.3 match3

The query is expressed as a Cypher expression.

*/

query p2
  match3 ['
  MATCH
    (a1 {Name:"Ralf Hartmut Gueting"})-[:AUTHOR_OF]->(doc:Document)<-[:AUTHOR_OF]-(a)
  WHERE a.Name <> "Ralf Hartmut Gueting"
  RETURN a.Name
  '] sortby[Name] groupby[Name; Cnt:group count] sortby[Cnt:desc] head[5] consume;

/*

Also available as a script file:

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/match3-coauthors
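/*

The post-processing used in the match2 and match3 variants above (~sortby~,
~groupby~ with a count, ~sortby~ in descending order, ~head~) can be mirrored in
Python to see what shape the result has. The names below are invented sample data,
and the order among co-authors with equal counts in this sketch says nothing about
the order SECONDO returns.

```python
from itertools import groupby

# Invented sample of co-author names, one entry per matched (author, document)
# pair, standing in for the tuple stream produced by a match operator.
matched_names = [
    "Author B", "Author A", "Author B", "Author C",
    "Author B", "Author A", "Author D", "Author E", "Author F",
]


def top_coauthors(names, limit=5):
    """Count how often each name occurs and return the most frequent ones."""
    ordered = sorted(names)                               # sortby[Name]
    counted = [(name, sum(1 for _ in group))              # groupby[Name; Cnt: group count]
               for name, group in groupby(ordered)]
    ranked = sorted(counted, key=lambda pair: pair[1],    # sortby[Cnt:desc]
                    reverse=True)
    return ranked[:limit]                                 # head[5]


top = top_coauthors(matched_names)
print(top)
```

Sorting by name first matters in the sketch for the same reason as in the query:
grouping only merges neighbouring entries, so equal names have to be adjacent.

*/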
/*

5.2 Query 'keywords'

Queries the conferences and publication titles where "Ralf Hartmut Gueting"
presented a paper that is indexed with a keyword containing "tempo".

5.2.1 match1

The starting nodes for the subgraph match are taken from the
tuple stream (first argument).
Note the direction argument "$<$-", which matches an edge in reverse direction.

*/
query p2
  Document feed
  match1
  ['
  (
    (doc Document)
    p PUBLISHED_AT
    ( (conf Conference) )
    KEYWORD_OF <-
    ( (k Keyword) )
    AUTHOR_OF <-
    ( (a Author ( (Name "Ralf Hartmut Gueting")) ) )
  )',
  '( ((k Word) contains "tempo") )',
  '( ((conf Name) Name) ((p Year) Year) ((doc Title) Title) )'
  ] consume;


/*

Also available as a script file:

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/match1-keywords
/*

5.2.2 match2

A query graph is given as a list. The optimal start node is
determined automatically and the corresponding tuple stream is
used internally. (Note that a reverse direction for edges
is not needed here.)

*/

query p2
  match2 ['
  (
    (doc Document)
    (a ( (Name "Ralf Hartmut Gueting") ))
    (doc p PUBLISHED_AT conf)
    (k KEYWORD_OF doc)
    (a AUTHOR_OF doc)
  )',
  '( ((k Word) contains "tempo") ) ',
  '( ((conf Name) Name) ((p Year) Year) ((doc Title) Title) )'
  ] consume;

/*

Also available as a script file:

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/match2-keywords
/*

5.2.3 match3

The query is expressed as a Cypher expression.
In this sample, the query tree is expressed by two paths that
are joined by the node alias 'doc'. The node types of the aliases
'k', 'a' and 'doc' are derived from the edge types. Note that Year is
an edge property.

*/

query p2
  match3 ['
  MATCH
    (conf)<-[p:PUBLISHED_AT]-(doc:Document)<-[:KEYWORD_OF]-(k),
    (doc)<-[:AUTHOR_OF]-(a{Name:"Ralf Hartmut Gueting"})
  WHERE k.Word contains "tempo"
  RETURN conf.Name, p.Year, doc.Title
  '] consume;
/*

Also available as a script file:

*/
SECONDO> @../Algebras/PropertyGraph/sample-dblp/match3-keywords
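/*

To make the pattern of the 'keywords' query easier to follow, the following Python
sketch evaluates the same two paths on a tiny in-memory graph. Everything in it is
invented for illustration (identifiers, titles, years), and it is not how the match
operators work internally. It also reports each document at most once, whereas a
pattern match may return one row per matching keyword.

```python
# Toy graph, invented for illustration; edge names follow the schema above.
author_of = [("a1", "d1"), ("a1", "d2"), ("a2", "d2")]      # (author, document)
keyword_of = [("k1", "d1"), ("k2", "d2")]                   # (keyword, document)
published_at = [("d1", "c1", 2005), ("d2", "c2", 2010)]     # (document, conference, Year)

authors = {"a1": "Ralf Hartmut Gueting", "a2": "Some Coauthor"}
keywords = {"k1": "spatiotemporal", "k2": "graphs"}
documents = {"d1": "Title One", "d2": "Title Two"}
conferences = {"c1": "Conf X", "c2": "Conf Y"}


def keyword_query(author_name, fragment):
    """Return (conference, Year, title) for documents by 'author_name' that
    have a keyword containing 'fragment' and were published at a conference."""
    results = []
    for doc, conf, year in published_at:                     # path 1: doc to conf
        has_keyword = any(d == doc and fragment in keywords[k]
                          for k, d in keyword_of)            # path 1: k to doc
        has_author = any(d == doc and authors[a] == author_name
                         for a, d in author_of)              # path 2: a to doc
        if has_keyword and has_author:
            results.append((conferences[conf], year, documents[doc]))
    return results


rows = keyword_query("Ralf Hartmut Gueting", "tempo")
print(rows)
```

The year is read from the PUBLISHED\_AT edge, not from the document node, which
mirrors the remark above that Year is an edge property.

*/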
/*

6 References

[CYP20] https://neo4j.com/docs/cypher-manual/current/

*/