/* ---- This file is part of SECONDO. Copyright (C) 2012, University in Hagen Faculty of Mathematic and Computer Science, Database Systems for New Applications. SECONDO is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. SECONDO is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with SECONDO; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA ---- \tableofcontents \newpage 1 Overview DBLP is a large collection of bibligraphy data. As the data is structured well it is possible to transfer a subset to a property graph. \begin{figure}[h] \centerline{ \includegraphics[width=1\textwidth]{schema.eps}} \end{figure} 2 Prepare data 2.1 Create database Create a SECONDO database named "pgraph2". */ SECONDO> create database pgraph2; SECONDO> open database pgraph2; /* As this database is quite huge, it is necessary to adjust the available memory in ~SecondoConfig.ini~ to at least 2GB: */ [QueryProcessor] GlobalMemory=2048 /* In the script files the following statement exposes additional memory to the MainMemoryAlgebra: */ query meminit (1524); /* 2.2 Import relations To import the raw data to SECONDO follow the following steps: 1 Download the raw data from https://dblp.uni-trier.de/xml/dblp.xml.gz 2 Make sure, SECONDO is (temporary) compiled without transaction support as transaction logging will use a huge amount of system resources. \\ (See bin/SecondoConfig.ini value "RTFlags += SMI:NoTransactions") 3 Use the application in Tools/Converter/Dblp2Secondo to generate the following import files: \\ Document, Authordoc, Author, Keyword. \\ (Follow the instructions in the contained READ.ME file) Before importing (!) the relations using the script ~restore\_objs~, rename the relations in the let-statement to Document\_raw, Author\_raw, Authordoc\_raw, Keyword\_raw. This allows to use these names in the node relations later. 2.3 Transform relations to property graph As the complete dataset is very large, this sample converts just a subset to the property graph. This can be adjusted in the beginning of script ~createrels~. Currently it will take all publications from 2017 and all documents where the author contains the word "gueting" (About 320.000 records.) The imported data will be split to nodes and edge relations to represent a graph. These relations will be taken to define the graph later. The above structure will be created by the script ~createrels~: */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/createrels /* 3 Creating a property graph A property graph has to be defined before matching operators can be used to query the graph. This is done be registering the node and edge relations. (This could be seen as the schema of the graph) */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/create /* At first a PropertyGraph object is created. The argument "p2" is used to prefix objects in the memory catalog to keep the data of multiple graphs separated. */ let p2=createpgraph("p2"); /* To define the schema of the graph, the script ~create~ uses the following operators to register the property graph: * ~createpgraph(name)~ to create a property graph object * ~addnodesrel[relname,fromclause,toclause]~ to register node relations * ~addedgesrel[relanme,propertyname, indexname]~ to register edge relations * ~addnodeindex[relanme]~ to register node property indexes This configuration will be saved in the database and will be available between sessions. To get information about the configuration of a property graph objects, use the ~info~ operator. */ SECONDO> query p2 info; SECONDO> @../Algebras/PropertyGraph/sample-dblp/info /* This will print the following information to the console: */ PGRAPH Information - name : p2 - node relations - Author (Authorid) - Collection (Id) - Conference (Id) - Document (Docid) - Keyword (Id) - Publisher (Id) - edge relations - AUTHOR_OF (FROM Authorid=>Author.Authorid; TO Docid=>Document.Docid) - KEYWORD_OF (FROM Wordid=>Keyword.Wordid; TO Docid=>Document.Docid) - PART_OF (FROM Docid=>Document.Docid; TO Collectionid=>Collection.Id) - PUBLISHED_AT (FROM Docid=>Document.Docid; TO Conferenceid=>Conference.Id) - PUBLISHED_BY (FROM Docid=>Document.Docid; TO Publisherid=>Publisher.Id) /* 4 Loading the property graph To be able to query the property graph, it needs to be loaded. This will load all configured data into memory and create additional structures to support the match operators. Using loadgraph the fist time will also gather some statistics that will be stored in the graph object for later reuse, so loading will be faster the next time. */ SECONDO> query p2 loadgraph; /* After loading the graph, the following statistics are calculated: */ - statistics: - noderelations: - Author CARD: 487803 - Collection CARD: 77 - Conference CARD: 2633 - Document CARD: 321126 - Keyword CARD: 82451 - Publisher CARD: 172 - edgerelations: - AUTHOR_OF CARDFW: 2.263125 CARDBW: 3.437775 - KEYWORD_OF CARDFW: 30.364629 CARDBW: 7.796298 - PART_OF CARDFW: 0.012609 CARDBW: 52.584416 - PUBLISHED_AT CARDFW: 0.488674 CARDBW: 59.599696 - PUBLISHED_BY CARDFW: 0.009678 CARDBW: 18.069767 /* 5 Sample Queries The PropertyGraph Algebra defines three matching operators, namely - ~match1~: Uses a query tree and a stream of input nodes to match subgraphs starting from the root node trying to matching edge by edge and node by node - ~match2~: Takes only a query graph. A query tree is derived automatically by selecting the optimal start node. The input node relation is internally opened. - ~match3~: Queries are written in cypher, a popular graph query language. 5.1 Query 'coauthor' Queries the top 5 co-authors of publications of "Ralf Hartmut Gueting". In the following this query will be expressed by the three match-operators. The results will be grouped and show the authors with the sum of joint publications. 5.1.1 match1 The starting nodes for the subgraph match are taken from the tuple stream (first argument). Note the direction argument "$<$-" to match an edge in reverse direction. */ query p2 Document feed match1 [' ( (doc Document) p PUBLISHED_AT ( (conf Conference) ) KEYWORD_OF <- ( (k Keyword) ) AUTHOR_OF <- ( (a Author ( (Name "Ralf Hartmut Gueting")) ) ) )', '( ((k Word) contains "tempo") )', '( ((conf Name) Name) ((p Year) Year) ((doc Title) Title) )' ] consume; /* Also available as sciptfile: */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/match1-coauthors /* 5.1.2 match2 A query graph is given as a list. The optimal start node is determined automatically and the corresponding tuple stream is used internally. (Note that the reverse direction for edges are not necessary here) */ query p2 match2 [' ( ( ( (Name "Ralf Hartmut Gueting") )) (doc Document) (a AUTHOR_OF doc) )', '( ((a Name) <> "Ralf Hartmut Gueting") ) ', '( ((a Name) Name) )' ] sortby[Name] groupby[Name; Cnt:group count] sortby[Cnt:desc] head[5] consume; /* Also available as sciptfile: */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/match2-coauthors /* NOTE: There is an additional script, that forces to choose an adverse strategy. It will take much more time to succeed. */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/match2-coauthors-slow /* 5.1.3 match3 The query is expressed as Cypher expression. */ query p2 match3 [' MATCH (a1 {Name:"Ralf Hartmut Gueting"})-[:AUTHOR_OF]->(doc:Document)<-[:AUTHOR_OF]-(a) WHERE a.Name <> "Ralf Hartmut Gueting" RETURN a.Name '] sortby[Name] groupby[Name; Cnt:group count] sortby[Cnt:desc] head[5] consume; /* Also available as sciptfile: */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/match3-coauthors /* 5.2 Query 'keywords' Queries the conferences and publication titles where "Ralf Hartmut Gueting" presented a paper that is indexed with a keyword containing "tempo" 5.2.1 match1 The starting nodes for the subgraph match are taken from the tuple stream (first argument). Note the direction argument "$<$-" to match an edge in reverse direction. */ query p2 Document feed match1 [' ( (doc Document) p PUBLISHED_AT ( (conf Conference) ) KEYWORD_OF <- ( (k Keyword) ) AUTHOR_OF <- ( (a Author ( (Name "Ralf Hartmut Gueting")) ) ) )', '( ((k Word) contains "tempo") )', '( ((conf Name) Name) ((p Year) Year) ((doc Title) Title) )' ] consume; /* Also available as sciptfile: */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/match1-keywords /* 5.2.2 match2 A query graph is given as a list. The optimal start node is determined automatically and the corresponding tuple stream is used internally. (Note that the reverse direction for edges are not necessary here) */ query p2 match2 [' ( (doc Document) (a ( (Name "Ralf Hartmut Gueting") )) (doc p PUBLISHED_AT conf) (k KEYWORD_OF doc) (a AUTHOR_OF doc) )', '( ((k Word) contains "tempo") ) ', '( ((conf Name) Name) ((p Year) Year) ((doc Title) Title) )' ] consume; /* Also available as sciptfile: */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/match2-keywords /* 5.2.3 match3 The query is expressed as Cypher expression. In this sample, the query tree is expressed by two pathes, that are combined by the node alias 'doc'. Also the node types of the aliases 'k', 'a' and 'doc' are derived from the edge types. Note, The Year is an edge property */ query p2 match3 [' MATCH (conf)<-[p:_PUBLISHED_AT]-(doc:Document)<-[:KEYWORD_OF]-(k), (doc)<-[AUTHOR_OF]-(a{Name:"Ralf Hartmut Gueting"}) WHERE k.Word contains "tempo" RETURN conf.Name, p.Year, doc.Title '] consume; /* Also available as sciptfile: */ SECONDO> @../Algebras/PropertyGraph/sample-dblp/match3-keywords /* 6 References [CYP20] https://neo4j.com/docs/cypher-manual/current/ */