/*
----
This file is part of SECONDO.
Copyright (C) 2004, University in Hagen, Department of Computer Science,
Database Systems for New Applications.
SECONDO is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
SECONDO is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with SECONDO; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
----
//paragraph [1] Title: [{\Large \bf \begin{center}] [\end{center}}]
//paragraph [2] Center: [{\begin{center}] [\end{center}}]
//paragraph [10] Footnote: [{\footnote{] [}}]
//paragraph [44] table4columns: [\begin{quote}\begin{tabular}{llll}] [\end{tabular}\end{quote}]
//characters [20] verbatim: [\verb@] [@]
//characters [21] formula: [$] [$]
//characters [22] capital: [\textsc{] [}]
//characters [23] teletype: [\texttt{] [}]
//[--------] [\hline]
//[TOC] [\tableofcontents]
//[p] [\par]
//[@] [\@]
//[LISTING-SH] [\lstsetSH]
[1] A Quick Introduction to the PostgreSQL DBMS
[2] Database Systems for New Applications [p]
University of Hagen [p]
http://www.informatik.fernuni-hagen.de/secondo [p]
Author: M. Spiekermann, Last Changes: 2007-02-13
[TOC]
1 Introduction
PostgreSQL is a popular open source DBMS and the successor of INGRES and
POSTGRES. Sometimes it is interesting to compare it with Secondo. Hence we
give a short overview of how to install it on a Linux system, how to create
databases, and how to create objects and populate them with data. This is only
a rough introduction; for further details consult the PostgreSQL documentation,
which is available as HTML files below /usr/share/doc/packages/postgresql/html.
2 Installation on Linux
Start the package manager (on SuSE Linux it is called YaST) and select all
packages whose names start with postgres.
3 Environment Setup
Before you can create a database you need to define and initialize a so-called
data storage area or database cluster. The location of this directory should be
defined in the environment variable "PGDATA"[20]. The directory must be
readable and writable only by the Linux user who acts as the database
administrator. In order to set up the storage area run the following commands:
[LISTING-SH]
*/
export PGDATA=/data/postgres-databases
mkdir -p "$PGDATA"
chmod go-rwx "$PGDATA"
initdb -D "$PGDATA"
/*
Afterwards the directory "$PGDATA" contains about 26 MB of data. "PGDATA"
should be defined in the shell's startup script (".bashrc"); otherwise you
have to define it in every new shell. Now we can start
the database server process, which is called "postmaster".
*/
postmaster [-D$PGDATA]
/*
It will print messages to the standard output.
4 Creating Databases
The utility "createdb" can be used to create a database, e.g. the
command
*/
createdb tpch
/*
will create a database called "tpch", which adds 31 MB to the storage area. The
text-based database client is called "psql". Client-internal commands start with
a "\" symbol; for example, "\?" lists all client-internal commands and "\q"
quits the session. The command
*/
psql -dtpch
/*
establishes a connection to the "tpch" database. The command prompt now
includes the name of the connected database. Some useful client commands are:
*/
tpch=# \dt          % display tables
tpch=# \di          % display indexes
tpch=# \q           % disconnect and exit
tpch=# \i <file>    % run the commands stored in a file
tpch=# \s <file>    % save the command history
tpch=# \h select    % explain the syntax of the select statement
/*
5 Creating Objects
Once you are connected to a database, the "create table" command can be used to
define a relation.
*/
create table customer (
  C_CUSTKEY    int4,
  C_NAME       varchar(25),
  C_ADDRESS    varchar(40),
  C_NATIONKEY  int4,
  C_PHONE      char(15),
  C_ACCTBAL    float4,
  C_MKTSEGMENT char(10),
  C_COMMENT    varchar(117)
);
/*
Afterwards you can populate the relation with tuples by importing a text file.
Each line is interpreted as one tuple, and a field separator which marks
the end of an attribute value can be specified. The import is done by a special
client command, e.g.
*/
\copy customer FROM 's05pp/customer.tbl.pg' WITH DELIMITER AS '|';
/*
reads the tuple data from the file "s05pp/customer.tbl.pg". An index can be
created by
*/
create index customer_c_custkey on customer(c_custkey);
/*
Another kind of object is the sequence, which generates consecutive integers.
The commands below create a sequence and fetch its next value.
*/
create sequence serial start 1;
select nextval('serial');  -- the first call returns 1, the next call 2
/*
Sometimes it is necessary to store query results as new relations. This can be
done by the "create table <ident> as" command. Moreover, new attribute values
can be computed from the existing tuple values by writing expressions over the
available functions and operators, e.g.
*/
create table customer_s100
as select C_CUSTKEY, C_NAME, nextval('serial') % 100 as C_NUM
from customer;
/*
6 Investigating Query Plans
If a query is prefixed with "explain" or "explain analyze", the chosen query
plan is printed. The first variant only shows the estimated costs and tuple
cardinalities; the second variant also runs the query and reports the actual
runtimes and cardinalities next to the estimates.
*/
explain <query>
explain analyze <query>
/*
7 Maintenance
The query planner needs accurate statistics about the data. It uses samples
of the data to estimate the frequency distribution of a table attribute's
values. These internal estimates are updated by the command "analyze", which
collects statistics about the contents of the tables in the database and
stores the results in the system table "pg_statistic".
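The effect of sampling on an estimate can be illustrated with a small
simulation. The Python sketch below is not PostgreSQL's actual algorithm, which
builds histograms and lists of most common values from the sample. The table
contents, the sample size of 3000 values and the function name "estimate_rows"
are assumptions made for illustration. It estimates the number of tuples
satisfying "a < 10" from several independent random samples, as repeated
"analyze" runs would draw them.
*/
```python
import random


def estimate_rows(column, predicate, sample_size, rng):
    """Scale the fraction of matching sample values up to the table size."""
    sample = rng.sample(column, sample_size)
    matches = sum(1 for value in sample if predicate(value))
    return round(len(column) * matches / sample_size)


# Hypothetical table: 1,000,000 values of attribute a, uniform in 0..999.
rng = random.Random(42)
column_a = [rng.randrange(1000) for _ in range(1_000_000)]
exact = sum(1 for value in column_a if value < 10)

# Every simulated "analyze" run draws a fresh sample of 3000 values.
estimates = [estimate_rows(column_a, lambda a: a < 10, 3000, rng)
             for _ in range(5)]
print("exact:", exact, "estimates:", estimates)
```
/*
Since every run draws a different sample, the estimates scatter around the
exact count. A larger sample reduces the scatter.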
In normal PostgreSQL operation, tuples that are deleted or made obsolete by an
update are not physically removed from their table; they remain present until
the command "vacuum" is run, which reclaims the storage occupied by such
tuples. Hence the administrator should run
*/
vacuum analyze
/*
after substantial updates.
8 Tuning
By using the "set" command the administrator can change various runtime
parameters. This is useful to force or to disable certain evaluation methods
for relational algebra expressions. For example, the statement below tells the
planner to avoid index scans.
*/
set enable_indexscan = off;
/*
8.1 Adjusting cost factors
SQL statements can be translated into different execution plans which compute
the same result. The planner (or optimizer) module uses data statistics, cost
functions and some basic cost factors to rate such plans. The optimization
algorithm systematically produces subplans and prunes inefficient ones. The
result of this process is the plan with the lowest estimated cost, which is not
necessarily the fastest one. Sources of error are
(1) Imprecise statistics
(2) Imprecise cost functions
(3) Imprecise cost factors
Some important cost factors are:
*/
show cpu_tuple_cost;
show cpu_operator_cost;
/*
These are float values which define the time needed for the respective
operation relative to the time for a sequential page fetch. The cost factors
can be determined by running some queries.
First you need to create relations $R_1, R_2$ with the same number of tuples
but different tuple sizes, so that they occupy different numbers of pages
$P_1, P_2$. The relations should be bigger than the main memory, so that their
pages cannot be served from the cache. The time difference for scanning those
relations can then be used to compute the time $T_{pc}$ for a page fetch:
$|t_{q1} - t_{q2}| = T_{pc} |P_1 - P_2|$, where $t_{qi}$ is the runtime of a
query which scans relation $R_i$.
Afterwards one can measure the time $T_{tc}$ for processing a tuple by
constructing relations with the same number of pages but a different number of
tuples. Again the runtime difference of a scan can be used to compute the
processing overhead for a single tuple.
Finally, queries applying a different number of operators are used to compute
the time $T_{oc}$ needed for a single operator.
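The three measurements can be turned into cost factors with a small
calculation. The Python sketch below is only an illustration: the helper
"unit_time" and all runtimes, page counts and tuple counts are invented, and it
assumes that every operator is evaluated once per tuple. It derives $T_{pc}$,
$T_{tc}$ and $T_{oc}$ from runtime differences and expresses the CPU times
relative to the page fetch time.
*/
```python
def unit_time(runtime_a, runtime_b, count_a, count_b):
    """Time per unit, from two runs that differ only in the unit count."""
    if count_a == count_b:
        raise ValueError("the two runs must differ in the measured quantity")
    return abs(runtime_a - runtime_b) / abs(count_a - count_b)


# Hypothetical measurements in seconds; replace them with your own timings.
# Step 1: same number of tuples, 200,000 versus 100,000 pages.
t_pc = unit_time(40.0, 20.0, 200_000, 100_000)
# Step 2: same number of pages, 2,000,000 versus 1,000,000 tuples.
t_tc = unit_time(21.0, 20.0, 2_000_000, 1_000_000)
# Step 3: same relation with 1,000,000 tuples, 5 versus 1 operator per tuple
# (assumption: each operator is evaluated once per tuple).
t_oc = unit_time(20.5, 20.0, 5 * 1_000_000, 1 * 1_000_000)

# Cost factors are ratios relative to one sequential page fetch.
cpu_tuple_cost = t_tc / t_pc
cpu_operator_cost = t_oc / t_pc

print(f"T_pc = {t_pc:.2e} s, T_tc = {t_tc:.2e} s, T_oc = {t_oc:.2e} s")
print(f"cpu_tuple_cost = {cpu_tuple_cost:.4f}")
print(f"cpu_operator_cost = {cpu_operator_cost:.6f}")
```
/*
The resulting ratios could then be applied with the "set" command, e.g.
"set cpu_tuple_cost = <value>".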
9 Understanding the Postgres Planner
Below there are three similar queries which result in different plans.
*/
Q1: explain select count(*) from m1, m2 where m1.a = m2.a and m1.a = 1;

Aggregate (cost=22128.85..22128.85 rows=1 width=0)
  -> Nested Loop (cost=8543.55..22119.35 rows=949638 width=0)
       -> Seq Scan on m2 (cost=0.00..8542.72 rows=978 width=4)
            Filter: (1 = a)
       -> Materialize (cost=8543.55..8546.12 rows=971 width=4)
            -> Seq Scan on m1 (cost=0.00..8543.29 rows=971 width=4)
                 Filter: (a = 1)

Q2: explain select count(*) from m1, m2 where m1.a = m2.a and m2.a < 10;

Aggregate (cost=99334.22..99334.23 rows=1 width=0)
  -> Merge Join (cost=53549.54..99163.39 rows=17083708 width=0)
       Merge Cond: ("outer".a = "inner".a)
       -> Sort (cost=8547.57..8547.74 rows=17246 width=4)
            Sort Key: m2.a
            -> Seq Scan on m2 (cost=0.00..8542.72 rows=17246 width=4)
                 Filter: (a < 10)
       -> Sort (cost=45001.97..45011.97 rows=1000110 width=4)
            Sort Key: m1.a
            -> Seq Scan on m1 (cost=0.00..8533.29 rows=1000110 width=4)

Q3: explain select count(*) from m1, m2 where m1.a = m2.a and m1.a < 10;

Aggregate (cost=80644.17..80644.17 rows=1 width=0)
  -> Hash Join (cost=8543.39..80543.07 rows=10109754 width=0)
       Hash Cond: ("outer".a = "inner".a)
       -> Seq Scan on m2 (cost=0.00..8532.72 rows=999894 width=4)
       -> Hash (cost=8543.29..8543.29 rows=10208 width=4)
            -> Seq Scan on m1 (cost=0.00..8543.29 rows=10208 width=4)
                 Filter: (a < 10)
/*
Note that in Q1 the planner rewrites the query and adds the predicate "m2.a = 1".
This is possible since an equi-join only matches tuples with equal values.
Moreover, it seems that hash joins and merge joins are ruled out for Q1, since
they are never chosen, not even with the configuration option
"enable_nestloop = off", which merely raises the total cost to about 100,000,000.
Surprisingly, this rewriting is not applied to the queries Q2 and Q3 even
though it could reduce costs. One can also observe that the estimates for
"m1.a < 10" and "m2.a < 10" differ widely although relation "m2"
is a copy of "m1". The estimate changes after each command which refreshes the
statistics samples, e.g. "analyze m1". Note: the statistics about data
distributions can be configured per column or globally by the parameter
"default_statistics_target".
Adding a redundant (totally correlated) predicate "m2.b = 2" misguides the planner: it
chooses a very expensive plan based on the estimate that the scan on "m2" will return only
1 tuple (actually 1000 tuples). This leads to a nested-loop join without materialization
of the intermediate result, hence "m1" is scanned once for each of the 1000 tuples
returned from "m2". This is a good demonstration of the need for robust query
optimization as claimed in [xxx].
*/
Q4: Q1 and m2.b = 2

Aggregate (cost=17099.17..17099.17 rows=1 width=0)
  -> Nested Loop (cost=0.00..17099.16 rows=971 width=0)
       -> Seq Scan on m2 (cost=0.00..8553.29 rows=1 width=4)
            Filter: ((b = 2) AND (1 = a))
       -> Seq Scan on m1 (cost=0.00..8543.29 rows=971 width=4)
            Filter: (a = 1)
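/*
The estimate of a single tuple for the scan on "m2" can be reproduced by
assuming that the planner treats the two predicates as independent and
multiplies their selectivities. The Python sketch below is a simplified model,
not PostgreSQL code: the function names are hypothetical, the row counts are
taken from the plans above, and it is an assumption that "b = 2" has the same
selectivity as "a = 1".
*/
```python
def combined_selectivity(selectivities):
    """Selectivity of a conjunction if all predicates are assumed independent."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def estimated_rows(table_rows, selectivities):
    """Row estimate for a filtered scan; never less than one row."""
    return max(1, round(table_rows * combined_selectivity(selectivities)))


M2_ROWS = 999_894        # row estimate for the unfiltered scan on m2 (plan of Q3)
SEL_A = 978 / M2_ROWS    # "a = 1" on m2: 978 estimated rows in the plan of Q1
SEL_B = SEL_A            # assumption: "b = 2" selects exactly the same tuples

scan_q1 = estimated_rows(M2_ROWS, [SEL_A])          # only "a = 1"
scan_q4 = estimated_rows(M2_ROWS, [SEL_A, SEL_B])   # "a = 1" and "b = 2"

M1_ROWS_A1 = 971         # estimate for "a = 1" on m1 in the plans of Q1 and Q4
print(scan_q1, scan_q1 * M1_ROWS_A1)   # scan and join estimate without "b = 2"
print(scan_q4, scan_q4 * M1_ROWS_A1)   # scan and join estimate with "b = 2"
```
/*
Under these assumptions the model yields 978 and 1 tuples for the scan on "m2"
and 949638 and 971 tuples for the join, which matches the "rows" values of the
nested-loop joins in the plans of Q1 and Q4.
*/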