Download Query Processing in Database Systems: A Comprehensive Guide and more Slides Introduction to Database Management Systems in PDF only on Docsity!
7-
Query Processing
high level user query (SQL)
Query
Processor
low level data manipulation
commands
(execution plan)
Query Compiler
Plan
Generator
Plan
Cost
Estimator
Plan Evaluator
Query Processing Components
Query language that is used
SQL: “intergalactic dataspeak”
Query execution methodology
The steps that one goes through in executing high-level
(declarative) user queries.
Query optimization
How do we determine a good execution plan?
7-
What are we trying to do?
Consider query
“For each project whose budget is greater than $250000 and which employs more than two employees, list the names and titles of employees.”
In SQL
SELECT Ename, Title
FROM Emp, Project, Works
WHERE Budget > 250000
AND Emp.Eno=Works.Eno
AND Project.Pno=Works.Pno
AND Project.Pno IN
(SELECT w.Pno
FROM Works w
GROUP BY w.Pno
HAVING SUM (*) > 2)
How to execute this query?
A Possible Execution Plan
1. T 1 ← Scan Project table and select all tuples
with Budget value > 250000
2. T 2 ← Join T 1 with the Works relation
3. T 3 ← Join T 2 with the Emp relation
4. T 4 ← Group tuples of T 3 over Pno
5. Scan tuples in each group of T 4 and for groups
that have more than 2 tuples, Project over
Ename, Title
Note: Overly simplified – we’ll detail later.
7-
Lexical and syntactic analysis
check validity (similar to compilers) check for attributes and relations type checking on the qualification
Put into (query normal form
Conjunctive normal form ( p 11 ∨ p 12 ∨…∨ p 1 n ) ∧…∧ ( p (^) m 1 ∨ p (^) m 2 ∨…∨ p (^) mn ) Disjunctive normal form ( p 11 ∧ p 12 ∧…∧ p 1 n ) ∨…∨ ( p (^) m 1 ∧ p (^) m 2 ∧…∧ p (^) mn ) OR's mapped into union AND's mapped into join or selection
Query Normalization
Refute incorrect queries
Type incorrect
If any of its attribute or relation names are not defined in
the global schema
If operations are applied to attributes of the wrong type
Semantically incorrect
Components do not contribute in any way to the
generation of the result
Only a subset of relational calculus queries can be tested
for correctness
Those that do not contain disjunction and negation
To detect
➠ connection graph (query graph) ➠ join graph
Analysis
7-
SELECT Ename,Resp FROM Emp, Works, Project WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND Pname = ‘CAD/CAM’ AND Dur > 36 AND Title = ‘Programmer’
Query graph (^) Join graph
Analysis – Example
Dur>
Pname=ëCAD/CAMí
Ename
Emp.Eno=Works.Eno Works.Pno=Project.Pno
RESULT
Title = ëProgrammerí
Resp
Works Emp.Eno=Works.Eno Works.Pno=Project.Pno
Emp Project Emp^ Project
Works
If the query graph is not connected, the query
may be wrong.
SELECT Ename,Resp FROM Emp, Works, Project WHERE Emp.Eno = Works.Eno AND Pname = ‘CAD/CAM’ AND Dur > 36 AND Title = ‘Programmer’
Analysis
Pname=ëCAD/CAMí
Ename (^) RESULT
Resp
Works
Emp Project
7-
Convert SQL to relational algebra
Make use of query trees
Example
SELECT Ename FROM Emp, Works, Project WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND Ename <> ‘J. Doe’ AND Pname = ‘CAD/CAM’ AND (Dur = 12 OR Dur = 24)
Restructuring
ΠENAME
σDUR=12 OR DUR=
σPNAME=ìCAD/CAMî
σENAME≠ìJ. DOEî
Project Works Emp
Project
Select
Join
⋈PNO
⋈ENO
How to implement operators
Selection (assume R has n pages)
Scan without an index – O( n)
Scan with index
➠ B +^ index – O(log n ) ➠ Hash index – O(1)
Projection
Without duplicate elimination – O( n )
With duplicate elimination
➠ Sorting-based – O( n log n ) ➠ Hash-based – O( n + t ) where t is the result of hashing phase
7-
How to implement operators
(cont’d)
Join
Nested loop join: R ⋈ S
foreach tuple r∈R do
foreach tuple s∈S do
if r==s then add <r,s> to result
O( n * m )
Improvements possible by
➠ page-oriented nested loop join ➠ block-oriented nested loop join
How to implement operators
(cont’d)
Join
Index nested loop join: R ⋈ S
foreach tuple r∈R do use index on join attr. to find tuples of S foreach such tuple s∈S do add <r,s> to result
Sort-merge join
➠ Sort R and S on the join attribute ➠ Merge the sorted relations
Hash join
➠ Hash R and S using a common hash function ➠ Within each bucket, find tuples where r = s
7-
Example 2
SELECT e.Ename, w.Resp
FROM Emp e, Works w
WHERE e.Age BETWEEN 45 AND 60
AND e.Title=‘Programmer’
AND e.Eno=w.Eno
Clearly, Emp should be the outer relation.
Suggests that we build a hash index on w.Eno.
What index should we build on Emp?
B+ tree on e.Age could be used, OR an index on e.Title could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions. ➠ As a rule of thumb, equality selections more selective than range selections.
As both examples indicate, our choice of indexes is guided by
the plan(s) that we expect an optimizer to consider for a query.
Have to understand optimizers!
Examples of Clustering
SELECT e.Title
FROM Emp e
WHERE e.Age > 40
B+ tree index on e.Age can be used to get
qualifying tuples.
How selective is the condition?
Is the index clustered?
7-
Clustering and Joins
SELECT e.Ename, p.Pname
FROM Emp e, Project p
WHERE p.Budget=‘350000’
AND e.City=p.City
Clustering is especially important when accessing inner tuples
in Index Nested Loop join.
Should make index on e.City clustered.
Suppose that the WHERE clause is instead:
WHERE e.Title=‘Programmer’ AND e.City=p.City If many employees are Programmers, Sort-Merge join may be worth considering. A clustered index on p.City would help.
Summary: Clustering is useful whenever many tuples are to
be retrieved.
SELECT Ename
FROM Emp e,Works w
WHERE e.Eno = w.Eno
AND w.Dur > 37
Strategy 1
ΠENAME (σDUR>37∧EMP.ENO=ASG.ENO (Emp × Works))
Strategy 2
ΠENAME (Emp ⋈ENO (σDUR>37 (Works)))
Strategy 2 is “better” because
It avoids Cartesian product It selects a subset of Works before joining
How to determine the “better” alternative?
Selecting Alternatives
7-
Query Optimization Issues –
Optimization Timing
Static
compilation ⇒ optimize prior to the execution
difficult to estimate the size of the intermediate results
⇒ error propagation
can amortize over many executions
Dynamic
run time optimization
exact information on the intermediate relation sizes
have to reoptimize for multiple executions
Hybrid
compile using a static algorithm
if the error in estimate sizes > threshold, reoptimize at
run time
Query Optimization Issues –
Statistics
Relation
cardinality
size of a tuple
fraction of tuples participating in a join with another
relation
Attribute
cardinality of domain
actual number of distinct values
Common assumptions
independence between different attribute values
uniform distribution of attribute values within their
domain
7-
Query Optimization Components
Cost function (in terms of time)
I/O cost + CPU cost
These might have different weights
Can also maximize throughput
Solution space
The set of equivalent algebra expressions (query trees).
Search algorithm
How do we move inside the solution space?
Exhaustive search, heuristic algorithms (iterative
improvement, simulated annealing, genetic,…)
Cost Calculation
Cost function takes CPU and I/O processing into
account
Instruction and I/O path lengths
Estimate the cost of executing each node of the
query tree
Is pipelining used or are temporary relations created?
Estimate the size of the result of each node
Selectivity of operations – “reduction factor”
Error propagation is possible
7-
Join
Special case: A is a key of R and B is a foreign key
of S;
card ( R ⋈ A=B S ) = card ( S )
More general:
card ( R ⋈ S ) = SF J ∗ card ( R ) ∗ card ( S )
Intermediate Relation Size
Search Space
Characterized by “equivalent” query plans
Equivalence is defined in terms of equivalent query
results
Equivalent plans are generated by means of
algebraic transformation rules
The cost of each plan may be different
Focus on joins
7-
Search Space – Join Trees
For N relations, there are O( N !)
equivalent join trees that can be
obtained by applying commutativity
and associativity rules
SELECT Ename,Resp FROM Emp, Works, Project WHERE Emp.Eno=Works.Eno AND Works.PNO=Project.PNO
Project
Emp Works
Project (^) Works
Emp
Project
Works
Emp
×
Commutativity of binary operations
R × S ⇔ S × R
R ⋈ S ⇔ S ⋈ R
R ∪ S ⇔ S ∪ R
Associativity of binary operations
( R × S ) × T ⇔ R × ( S × T )
( R ⋈ S ) ⋈ T ⇔ R ⋈ ( S ⋈ T )
Idempotence of unary operations
Π A ’(Π A ’’( R )) ⇔ Π A ’( R )
σ p 1 ( A 1 )(σ p 2 ( A 2 )( R )) = σ p 1 ( A 1 ) ∧ p 2 ( A 2 )( R )
where R [ A ] and A' ⊆ A , A" ⊆ A and A' ⊆ A"
Transformation Rules
7-
Equivalent Query
ΠEname
σPname=CAD/CAMí ∧ (Dur=12 ∨ Dur=24) ∧ Ename<>
J. DOEí
×
Works Project Emp
Emp
ΠEname
σEname <> `J. Doeí
Project Works
ΠEno,Ename
σPname = `CAD/CAMí
ΠPno
σDur = 12 ∧ Dur=
ΠPno,Eno
ΠPno,Ename
Another Equivalent Query
7-
Search Strategy
How to “move” in the search space.
Deterministic
Start from base relations and build plans by adding one relation at each step Dynamic programming: breadth-first Greedy: depth-first
Randomized
Search for optimalities around a particular starting point Trade optimization time for execution time Better when > 5-6 relations Simulated annealing Iterative improvement
Search Algorithms
Restrict the search space
Use heuristics ➠ E.g., Perform unary operations before binary operations Restrict the shape of the join tree ➠ Consider only linear trees, ignore bushy ones
R 1 R 2
R 3
R 4
Linear Join Tree
R 1 R 2 R 3 R 4
Bushy Join Tree