Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Query Optimization - Database Management Systems - Project 3 | CSCI 1270, Study Guides, Projects, Research of Database Management Systems (DBMS)

Material Type: Project; Class: Database Management Systems; Subject: Computer Science; University: Brown University; Term: Fall 2009;

Typology: Study Guides, Projects, Research

2009/2010

Uploaded on 02/24/2010

koofers-user-k0j
koofers-user-k0j 🇺🇸

10 documents

1 / 7

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CS 127
Project 3- Query Optimization
Database Management Systems
Sunday, December 13, 2009, 11:59 P.M.
Project 3
Query Optimization
Out: November, 23, 2009
In: Sunday, December 13, 2009, 11:59 P.M.
1 Silly Premise
Ben Koen runs I SQL YOU!TM, a service which will execute your SQL queries when given
a database. Unfortunately, Ben hasn’t heard of trees as data structures (incidentally, his
family tree doesn’t split).
Luckily for you, Ben hires you not to implement his B-Tree (he made last year’s students
do that), but to work with him in handling the queries. “Some people!” exclaims Ben,
“don’t know the love that comes from a massive, slow join!”
Save Ben’s business before Ben destroys it.
2 Introduction
We task you with the following:
Collecting statistics on the data
Generating a cost estimation for a given query plan
Giving a justification for your cost model
Your code will implement a cost model: a relatively small program that receives queries
and uses information about the database to help the query optimizer pick the best query
plan.
Most DBMS’s have a stage during optimization that comes before the cost model: query
rewriting. Usually, after a query has been parsed into a relational algebra expression from
the SQL, a rewriter creates an equivalent RA expression that will cost less in most cases.
An example of this would be the RA expression:
σamount >2500(borrower ./ loan)
Which can be re-written to the equivalent expression:
borrower ./ (σamount >2500(loan))
which usually costs less. We are not asking you to implement a re-writer.1We only ask
for the cost model and plan generator.
1unless you want grad credit. See ’Grad Credit’.
Project 3- Query Optimization December 11, 2009 1
pf3
pf4
pf5

Partial preview of the text

Download Query Optimization - Database Management Systems - Project 3 | CSCI 1270 and more Study Guides, Projects, Research Database Management Systems (DBMS) in PDF only on Docsity!

Project 3- Query Optimization Sunday, December 13, 2009, 11:59 P.M.

Project 3

Query Optimization

Out: November, 23, 2009

In: Sunday, December 13, 2009, 11:59 P.M.

1 Silly Premise

Ben Koen runs I SQL YOU!TM, a service which will execute your SQL queries when given a database. Unfortunately, Ben hasn’t heard of trees as data structures (incidentally, his family tree doesn’t split). Luckily for you, Ben hires you not to implement his B-Tree (he made last year’s students do that), but to work with him in handling the queries. “Some people!” exclaims Ben, “don’t know the love that comes from a massive, slow join!” Save Ben’s business before Ben destroys it.

2 Introduction

We task you with the following:

  • Collecting statistics on the data
  • Generating a cost estimation for a given query plan
  • Giving a justification for your cost model

Your code will implement a cost model: a relatively small program that receives queries and uses information about the database to help the query optimizer pick the best query plan. Most DBMS’s have a stage during optimization that comes before the cost model: query rewriting. Usually, after a query has been parsed into a relational algebra expression from the SQL, a rewriter creates an equivalent RA expression that will cost less in most cases. An example of this would be the RA expression:

σamount > 2500 (borrower ./ loan) Which can be re-written to the equivalent expression:

borrower ./ (σamount > 2500 (loan)) which usually costs less. We are not asking you to implement a re-writer.^1 We only ask for the cost model and plan generator.

(^1) unless you want grad credit. See ’Grad Credit’.

Project 3- Query Optimization Sunday, December 13, 2009, 11:59 P.M.

3 Plan Generator

3.1 Generating the Cost Model

As mentioned in lecture, there are several ways a DBMS can use its indices to select data as appropriate (see The Plans, below). Given metadata about the database, the plan generator finds the lowest-cost option. This is done with a number of formulas. For your plan generator, you will need to use the following data accessible by the methods outlined:

  • nr: The number of tuples in r. Statistics.getTuples()
  • br: The number of blocks in r. Statistics.getPages()
  • fr: The block size (number of tuples per block) of r (typically br = dnr/fre). Global for the system BufferPool.PAGE SIZE
  • sr: Tuple size (in bytes). Statistics.getTupleSize()
  • V(att, r): Range of values for attribute att in r (called the Attribute Variance). Statistics.getVariance()
  • SC(att, r): Number of possible values for attribute att in r (called the Selection Car- dinality). You may assume for simplicity that we only deal with continuous intervals. Statistics.getCardinality()

Your job for this portion is to implement the calculateStatistics() method in Statis- tics.java. You must scan through the file and collect the variance and cardinality for each field.

3.2 Using the Cost Model for Plan Generation

Each PlanNode takes either a table or some children PlanNodes and estimates its cost based on the data available from each. As an example, take the following query:

Πbranch name(σcustomer city=“Harrison”(customer ./ (account ./ depositor)))

This evaluates to a tree like:

Project 3- Query Optimization Sunday, December 13, 2009, 11:59 P.M.

many different values there are) for each attribute in a table. You will only be writing the code to calculate variance, the cardinality is calculated on the fly in Statistics.getCardinality().

5 The Plans

For each plan node, write the following methods within the class:

  • estimateDiskCost() - estimate how many units of time will be spent fetching from disk for the current node, which should include the cost of the descendants
  • estimateTupleCount() - estimate how many tuples the result of this node will contain
  • estimateTupleSize() - estimate how big an individual tuple will be

We have provided explanations of the following plan nodes to help you figure out a cost model. Remember to write down your justification for your cost model in your README!

5.1 Aggregate

Input: PlanNode child Op o Description: An aggregate operation reads over the child node and applies the operation. The result is just an INT TYPE.

5.2 Joins

Input: PlanNode outer PlanNode inner Description: In general, a join operation can take any two relations to join. In this case, we assume that the outer relation has a foreign key dependency on the primary key of the inner table. So the tuple count of the join result is the same as that of the outer relation. The JoinNode class contains the estimateTupleCount() and estimateTupleSize() methods for all join operations, as these estimates will be the same for all join operations. The tuple size for join operations will merely be the sum of the tuples of both child operators.

5.2.1 Block Nested Loops Join

This is the operator you wrote in Executor. Read all blocks in the inner operator for every block in the outer operator and join.

Project 3- Query Optimization Sunday, December 13, 2009, 11:59 P.M.

5.2.2 Index Nested Inner Join

This is similar to block nested loops join, except you traverse an index of the inner operator rather than the entire relation.

5.2.3 Merge Inner Join

Assume that the sorting has already been done. You receive two sorted operators and are expected to join them.

5.3 Filter

Input: PlanNode child Attribute a Description: You are filtering on a predicate, assume here that it is EQUAL. You will need to look at the entire relation. The resulting relation will be related to the cardinality of the attribute being predicated. The actual size of each tuple will remain the same.

5.4 Primary Index Seek

Input: Table t Attribute selection Description: Seeking an element in a primary index will depend on the size and depth of the index.

5.5 Secondary Index Seek

Input: Table t Attribute selection Description: Seeking an element in a secondary index will also depend on the size and depth of the index.

5.6 Project

Input: PlanNode child Attributes a Description: Projecting merely removes columns, so the read cost and result size should be the same as the child, but the tuple size will only be the length of the attributes being projected.

Project 3- Query Optimization Sunday, December 13, 2009, 11:59 P.M.

With the second operating on much smaller amounts of data. The simplest method to do this is called predicate pushdown. You must find the columns in a query plan which are selected out of the table and pick the ones which are never used in any operator other than the outermost selection (as seen in the above example). It is intuitive to see that the ordering of joins in a multi-join query plan could also affect the running time. If you know that attributes A,B exist in tables R 1 ,R 2 ,R 3 , where |R 1 | ≤ |R 2 | ≤ |R 3 |. You know that any query which joins the smaller two together before it joins with the bigger one will perform better than a query which joins the larger two first.

8 The Code

Retrieve the code like you did last time from https://database.cs.brown.edu/svn/cs127rep/QueryOptimizer.

9 Handin

We expect the following components to be included in your handin:

  • The plannodes package with all estimation functions filled in
  • The simpledb.Statistics class
  • README justifying your formulas for each PlanNode

The code you wrote must pass all the unit tests in plannodes.test. If it does not, make sure you have a solid, reasonable explanation in your README. You can handin your project using the script:

/course/cs127/bin/cs127_handin optimizer

Good luck, and as always, feel free to ask TA’s any questions you like!