Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Query Processing in Database Systems: A Comprehensive Guide, Slides of Introduction to Database Management Systems

Query Processing, Query Processing Components, Possible Execution Plan, Pictorial Representation, Query Processing Methodology, Query Normalization, Simplification, Restructuring, Index Selection Guidelines, Clustering and Joins, Selecting Alternatives, Optimization Granularity, Query Optimization, Relation, Cost Calculation

Typology: Slides

2011/2012

Uploaded on 02/13/2012

richiey
richiey 🇨🇦

32 documents

1 / 21

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
7-1
Query Processing
high level user query (SQL)
Query
Processor
low level data manipulation
commands
(execution plan)
Query Compiler
Plan
Generator
Plan
Cost
Estimator
Plan Evaluator
7-2
Query Processing Components
Query language that is used
SQL: “intergalactic dataspeak”
Query execution methodology
The steps that one goes through in executing high-level
(declarative) user queries.
Query optimization
How do we determine a good execution plan?
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15

Partial preview of the text

Download Query Processing in Database Systems: A Comprehensive Guide and more Slides Introduction to Database Management Systems in PDF only on Docsity!

7-

Query Processing

high level user query (SQL)

Query

Processor

low level data manipulation

commands

(execution plan)

Query Compiler

Plan

Generator

Plan

Cost

Estimator

Plan Evaluator

Query Processing Components

 Query language that is used

 SQL: “intergalactic dataspeak”

 Query execution methodology

 The steps that one goes through in executing high-level

(declarative) user queries.

 Query optimization

 How do we determine a good execution plan?

7-

What are we trying to do?

 Consider query

 “For each project whose budget is greater than $250000 and which employs more than two employees, list the names and titles of employees.”

 In SQL

SELECT Ename, Title

FROM Emp, Project, Works

WHERE Budget > 250000

AND Emp.Eno=Works.Eno

AND Project.Pno=Works.Pno

AND Project.Pno IN

(SELECT w.Pno

FROM Works w

GROUP BY w.Pno

HAVING SUM (*) > 2)

 How to execute this query?

A Possible Execution Plan

1. T 1 ← Scan Project table and select all tuples

with Budget value > 250000

2. T 2 ← Join T 1 with the Works relation

3. T 3 ← Join T 2 with the Emp relation

4. T 4 ← Group tuples of T 3 over Pno

5. Scan tuples in each group of T 4 and for groups

that have more than 2 tuples, Project over

Ename, Title

Note: Overly simplified – we’ll detail later.

7-

 Lexical and syntactic analysis

 check validity (similar to compilers)  check for attributes and relations  type checking on the qualification

 Put into (query normal form

 Conjunctive normal form ( p 11 ∨ p 12 ∨…∨ p 1 n ) ∧…∧ ( p (^) m 1 ∨ p (^) m 2 ∨…∨ p (^) mn )  Disjunctive normal form ( p 11 ∧ p 12 ∧…∧ p 1 n ) ∨…∨ ( p (^) m 1 ∧ p (^) m 2 ∧…∧ p (^) mn )  OR's mapped into union  AND's mapped into join or selection

Query Normalization

 Refute incorrect queries

 Type incorrect

 If any of its attribute or relation names are not defined in

the global schema

 If operations are applied to attributes of the wrong type

 Semantically incorrect

 Components do not contribute in any way to the

generation of the result

 Only a subset of relational calculus queries can be tested

for correctness

 Those that do not contain disjunction and negation

 To detect

➠ connection graph (query graph) ➠ join graph

Analysis

7-

SELECT Ename,Resp FROM Emp, Works, Project WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND Pname = ‘CAD/CAM’ AND Dur > 36 AND Title = ‘Programmer’

Query graph (^) Join graph

Analysis – Example

Dur>

Pname=ëCAD/CAMí

Ename

Emp.Eno=Works.Eno Works.Pno=Project.Pno

RESULT

Title = ëProgrammerí

Resp

Works Emp.Eno=Works.Eno Works.Pno=Project.Pno

Emp Project Emp^ Project

Works

If the query graph is not connected, the query

may be wrong.

SELECT Ename,Resp FROM Emp, Works, Project WHERE Emp.Eno = Works.Eno AND Pname = ‘CAD/CAM’ AND Dur > 36 AND Title = ‘Programmer’

Analysis

Pname=ëCAD/CAMí

Ename (^) RESULT

Resp

Works

Emp Project

7-

 Convert SQL to relational algebra

 Make use of query trees

 Example

SELECT Ename FROM Emp, Works, Project WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND Ename <> ‘J. Doe’ AND Pname = ‘CAD/CAM’ AND (Dur = 12 OR Dur = 24)

Restructuring

ΠENAME

σDUR=12 OR DUR=

σPNAME=ìCAD/CAMî

σENAME≠ìJ. DOEî

Project Works Emp

Project

Select

Join

⋈PNO

⋈ENO

How to implement operators

 Selection (assume R has n pages)

 Scan without an index – O( n)

 Scan with index

➠ B +^ index – O(log n ) ➠ Hash index – O(1)

 Projection

 Without duplicate elimination – O( n )

 With duplicate elimination

➠ Sorting-based – O( n log n ) ➠ Hash-based – O( n + t ) where t is the result of hashing phase

7-

How to implement operators

(cont’d)

 Join

 Nested loop join: R ⋈ S

foreach tuple r∈R do

foreach tuple s∈S do

if r==s then add <r,s> to result

 O( n * m )

 Improvements possible by

➠ page-oriented nested loop join ➠ block-oriented nested loop join

How to implement operators

(cont’d)

 Join

 Index nested loop join: R ⋈ S

foreach tuple r∈R do use index on join attr. to find tuples of S foreach such tuple s∈S do add <r,s> to result

 Sort-merge join

➠ Sort R and S on the join attribute ➠ Merge the sorted relations

 Hash join

➠ Hash R and S using a common hash function ➠ Within each bucket, find tuples where r = s

7-

Example 2

SELECT e.Ename, w.Resp

FROM Emp e, Works w

WHERE e.Age BETWEEN 45 AND 60

AND e.Title=‘Programmer’

AND e.Eno=w.Eno

 Clearly, Emp should be the outer relation.

 Suggests that we build a hash index on w.Eno.

 What index should we build on Emp?

 B+ tree on e.Age could be used, OR an index on e.Title could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions. ➠ As a rule of thumb, equality selections more selective than range selections.

 As both examples indicate, our choice of indexes is guided by

the plan(s) that we expect an optimizer to consider for a query.

Have to understand optimizers!

Examples of Clustering

SELECT e.Title

FROM Emp e

WHERE e.Age > 40

 B+ tree index on e.Age can be used to get

qualifying tuples.

 How selective is the condition?

 Is the index clustered?

7-

Clustering and Joins

SELECT e.Ename, p.Pname

FROM Emp e, Project p

WHERE p.Budget=‘350000’

AND e.City=p.City

 Clustering is especially important when accessing inner tuples

in Index Nested Loop join.

 Should make index on e.City clustered.

 Suppose that the WHERE clause is instead:

WHERE e.Title=‘Programmer’ AND e.City=p.City  If many employees are Programmers, Sort-Merge join may be worth considering. A clustered index on p.City would help.

 Summary: Clustering is useful whenever many tuples are to

be retrieved.

SELECT Ename

FROM Emp e,Works w

WHERE e.Eno = w.Eno

AND w.Dur > 37

Strategy 1

ΠENAME (σDUR>37∧EMP.ENO=ASG.ENO (Emp × Works))

Strategy 2

ΠENAME (Emp ⋈ENO (σDUR>37 (Works)))

 Strategy 2 is “better” because

 It avoids Cartesian product  It selects a subset of Works before joining

 How to determine the “better” alternative?

Selecting Alternatives

7-

Query Optimization Issues –

Optimization Timing

 Static

 compilation ⇒ optimize prior to the execution

 difficult to estimate the size of the intermediate results

⇒ error propagation

 can amortize over many executions

 Dynamic

 run time optimization

 exact information on the intermediate relation sizes

 have to reoptimize for multiple executions

 Hybrid

 compile using a static algorithm

 if the error in estimate sizes > threshold, reoptimize at

run time

Query Optimization Issues –

Statistics

 Relation

 cardinality

 size of a tuple

 fraction of tuples participating in a join with another

relation

 Attribute

 cardinality of domain

 actual number of distinct values

 Common assumptions

 independence between different attribute values

 uniform distribution of attribute values within their

domain

7-

Query Optimization Components

 Cost function (in terms of time)

 I/O cost + CPU cost

 These might have different weights

 Can also maximize throughput

 Solution space

 The set of equivalent algebra expressions (query trees).

 Search algorithm

 How do we move inside the solution space?

 Exhaustive search, heuristic algorithms (iterative

improvement, simulated annealing, genetic,…)

Cost Calculation

 Cost function takes CPU and I/O processing into

account

 Instruction and I/O path lengths

 Estimate the cost of executing each node of the

query tree

 Is pipelining used or are temporary relations created?

 Estimate the size of the result of each node

 Selectivity of operations – “reduction factor”

 Error propagation is possible

7-

Join

 Special case: A is a key of R and B is a foreign key

of S;

card ( R ⋈ A=B S ) = card ( S )

 More general:

card ( R ⋈ S ) = SF J ∗ card ( R ) ∗ card ( S )

Intermediate Relation Size

Search Space

 Characterized by “equivalent” query plans

 Equivalence is defined in terms of equivalent query

results

 Equivalent plans are generated by means of

algebraic transformation rules

 The cost of each plan may be different

 Focus on joins

7-

Search Space – Join Trees

 For N relations, there are O( N !)

equivalent join trees that can be

obtained by applying commutativity

and associativity rules

SELECT Ename,Resp FROM Emp, Works, Project WHERE Emp.Eno=Works.Eno AND Works.PNO=Project.PNO

Project

Emp Works

Project (^) Works

Emp

Project

Works

Emp

×

 Commutativity of binary operations

 R × S ⇔ S × R

 R ⋈ S ⇔ S ⋈ R

 R ∪ S ⇔ S ∪ R

 Associativity of binary operations

 ( R × S ) × T ⇔ R × ( S × T )

 ( R ⋈ S ) ⋈ T ⇔ R ⋈ ( S ⋈ T )

 Idempotence of unary operations

 Π A ’(Π A ’’( R )) ⇔ Π A ’( R )

 σ p 1 ( A 1 )(σ p 2 ( A 2 )( R )) = σ p 1 ( A 1 ) ∧ p 2 ( A 2 )( R )

where R [ A ] and A' ⊆ A , A" ⊆ A and A' ⊆ A"

Transformation Rules

7-

Equivalent Query

ΠEname

σPname=CAD/CAMí ∧ (Dur=12 ∨ Dur=24) ∧ Ename<>J. DOEí

×

Works Project Emp

Emp

ΠEname

σEname <> `J. Doeí

Project Works

ΠEno,Ename

σPname = `CAD/CAMí

ΠPno

σDur = 12 ∧ Dur=

ΠPno,Eno

ΠPno,Ename

Another Equivalent Query

7-

Search Strategy

 How to “move” in the search space.

 Deterministic

 Start from base relations and build plans by adding one relation at each step  Dynamic programming: breadth-first  Greedy: depth-first

 Randomized

 Search for optimalities around a particular starting point  Trade optimization time for execution time  Better when > 5-6 relations  Simulated annealing  Iterative improvement

Search Algorithms

 Restrict the search space

 Use heuristics ➠ E.g., Perform unary operations before binary operations  Restrict the shape of the join tree ➠ Consider only linear trees, ignore bushy ones

R 1 R 2

R 3

R 4

Linear Join Tree

R 1 R 2 R 3 R 4

Bushy Join Tree