449849_2815098_indexing.doc, Study notes of Computer Science

Class: Prog+Problem Solving III; Subject: Computer Engr & Computer Sci; University: California State University - Long Beach; Term: Unknown 1989;

Uploaded on 08/19/2009

INDEXING
1. An index is a tool for finding records in a file. It consists of:
   a) A key field on which the index is searched.
      i) The key field is the portion of an index record that contains the canonical form of the key that is being sought.
   b) A reference field that tells where to find the data file record associated with a particular key.
      i) The reference field is the portion of an index record that contains information about where to find the data record containing the information listed in the associated key field of the index.
2. Advantages of using an index file with a data file:
   a) Since it works by indirection, an index lets you impose order on a file without actually rearranging the file.
   b) Indexing can provide multiple access paths to a file.
      i) You can have multiple sets of indexes, each index set referencing a different key field in the data record.
      ii) For example, if the data file contained student information, one index set might reference social security numbers, another might reference last name, etc.
   c) Indexing gives us keyed access to variable-length record files.
3. An entry-sequenced file is a file in which the records occur in the order that they are entered into the file.
4. Suppose we own a collection of musical recordings and we want to keep track of the collection through the use of computer files (fig. 7.2).
   a) The data file records are variable length.
   b) Suppose we form a primary key for these records consisting of the initials for the record company label combined with the record company's ID number (fig. 7.3).
      i) This will make a good primary key since it provides a unique key for each entry in the file.
      ii) We'll call the key the Label ID.
      iii) The canonical form for the Label ID consists of the uppercase form of the Label field followed immediately by the ASCII representation of the ID number (e.g. RCA2626).
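
The canonical form of the Label ID described above can be sketched in a few lines of Python. This is only an illustration: the function name `canonical_label_id` and the choice to strip surrounding blanks are assumptions, not part of the notes.

```python
def canonical_label_id(label, id_number):
    """Uppercase form of the label followed immediately by the ID number's digits.

    Stripping surrounding blanks is an assumption made for this sketch.
    """
    return str(label).strip().upper() + str(id_number).strip()


print(canonical_label_id("rca", 2626))
```

Putting every key into one canonical form before it is stored or searched means that "rca 2626" and "RCA2626" lead to the same index entry.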
1. Note that the records in the data file are of variable length, so no binary search on the data file itself is possible.
   a) A binary search needs direct access by relative record number, and with variable-length records there is no way to compute where the middle record of any group of records begins.
   b) If you build an index containing keys, you can do the binary search on the index file.
      i) The records in the index file are of fixed length, so we can use binary search (assuming that they are sorted by key).
      ii) The index file is smaller and thus stands a good chance of being kept in RAM, which reduces the number of seeks.
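
A minimal sketch of why fixed-length index records make binary search possible: the start of record number n is simply n times the record size. The record layout (a 12-byte key plus a 4-byte byte offset), the function names, and the byte offsets below are all assumptions chosen for illustration.

```python
import struct

# Assumed layout of one index record: 12-byte key (null padded), 4-byte offset.
ENTRY = struct.Struct("<12sI")


def pack_index(entries):
    """Build sorted, fixed-length index records from a dict of key -> byte offset."""
    return b"".join(
        ENTRY.pack(key.encode("ascii"), offset)
        for key, offset in sorted(entries.items())
    )


def find_offset(index_bytes, key):
    """Binary search by relative record number; returns the byte offset or None."""
    target = key.encode("ascii")[:12].ljust(12, b"\x00")
    lo, hi = 0, len(index_bytes) // ENTRY.size - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        # Fixed-length records: record number mid starts at mid * ENTRY.size.
        stored_key, offset = ENTRY.unpack_from(index_bytes, mid * ENTRY.size)
        if stored_key == target:
            return offset
        if stored_key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Illustrative offsets only; they are not taken from the handout figures.
index_bytes = pack_index({"LON2312": 32, "RCA2626": 77, "ANG3795": 167})
print(find_offset(index_bytes, "RCA2626"))
```

The same arithmetic is not available for the data file, because the position of a variable-length record depends on the lengths of all the records before it.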
2. Procedure to retrieve a record:
   a) Find the key in the index file (probably using a binary search).
   b) Use the index file to determine the byte offset of the corresponding record in the data file.
   c) Use the seek function and the byte offset to move to the data record in the data file.
   d) Read the record from the data file.
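
The four retrieval steps can be sketched as follows. The sketch makes several assumptions that are not in the notes: the data file is simulated with an in-memory `io.BytesIO`, each variable-length record carries a 2-byte length prefix and uses "|" as a field delimiter, and the index is held in memory as a sorted list of (key, byte offset) pairs. The sample records are illustrative.

```python
import bisect
import io
import struct


def append_record(data_file, fields):
    """Append a length-prefixed, '|'-delimited record; return its byte offset."""
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()
    body = "|".join(fields).encode("ascii")
    data_file.write(struct.pack("<H", len(body)) + body)
    return offset


def retrieve(data_file, index, key):
    """index is a sorted list of (key, byte_offset) pairs."""
    # Steps a and b: binary search the index, then take the byte offset.
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key:
        return None
    # Step c: seek to the record in the data file.
    data_file.seek(index[i][1])
    # Step d: read the length prefix, then the record itself.
    (length,) = struct.unpack("<H", data_file.read(2))
    return data_file.read(length).decode("ascii").split("|")


data = io.BytesIO()
sample = [
    ["LON", "2312", "Romeo and Juliet", "Prokofiev"],
    ["RCA", "2626", "Quartet in C Sharp Minor", "Beethoven"],
]
index = sorted((f[0] + f[1], append_record(data, f)) for f in sample)
print(retrieve(data, index, "RCA2626"))
```

Note that the data file stays entry sequenced; only the small index is kept in key order.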
3. Although this retrieval strategy is relatively straightforward, it contains some features that deserve comment:
   a) We are now dealing with two files: the index file and the data file.
      i) The index file is considerably easier to work with than the data file.
         a) It uses fixed-length records (which is why we can search it with a binary search).
         b) It is likely to be much smaller than the data file.
   b) By requiring that the index file have fixed-length records, we impose a limit on the sizes of the keys.
      i) Fixed-length key fields can lead to non-uniqueness if the key must be truncated to fit within the fixed-length field.
   c) In the handout example, the index carries no information other than the keys and the reference fields. We could, for example, also keep the length of each data file record in the index file.
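
A small illustration of the truncation problem noted in 3 b). The 6-character key width and the second Label ID (COL38359) are hypothetical, chosen only to force a collision.

```python
KEY_LEN = 6  # assumed width of the fixed-length key field


def fixed_key(key, width=KEY_LEN):
    """Truncate or blank-pad a key so that it fits the fixed-length field."""
    return key[:width].ljust(width)


# Two distinct Label IDs (the second is hypothetical) become the same stored key.
for label_id in ("COL38358", "COL38359"):
    print(label_id, "->", repr(fixed_key(label_id)))
```

Once two different keys are stored identically, the index can no longer tell their records apart, so the key field has to be wide enough for the longest canonical key.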

   e) There are a number of approaches for deleting records in variable-length record files that allow for the reuse of the space occupied by these records.
      i) Prior lectures discussed a number of approaches to record deletion, including using a linked list that contains open slots.
      ii) Those approaches are viable for the data file because, unlike a sorted data file, the records in this file need not be moved around to maintain an ordering on the file.
         a) This is one of the advantages of an indexed file organization.
         b) We have rapid access to individual records by key without disturbing pinned records.
         c) The index itself pins all the records.
   f) When we delete a record from the data file, we must also delete the corresponding entry from the index file.
      i) Since the data file contains variable-length records, a linked list showing open slots would be easy to maintain.
         a) A new node would be created.
         b) The node would be inserted into the linked list.
      ii) If the index is contained in an array during program execution, deleting the index record and shifting the other records to close up the space may not be an overly expensive operation.
         a) Alternatively, we could simply mark the index record as deleted, just as we might mark the corresponding data record.
   g) Updating records depends on the size of the updated record.
      i) If the new record is smaller than or equal in size to the old one, it can be written directly into its old space.
      ii) If the new record is bigger, you will need to delete the old record, then add the new record.
      iii) Updating also depends on whether a key field is being updated.
         a) If it is, you will need to delete the old index entry, then add a new one.
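
The size rule in g) can be sketched as below. Everything about the representation is an assumption made for illustration: the data file is a `bytearray`, the index is a dict mapping each key to an (offset, slot length) pair, a deleted record is marked with "*" in its first byte, and the freed slot is not put on any list of open slots.

```python
def update_record(data, index, key, new_body):
    """Overwrite in place when the new record fits; otherwise delete and re-add.

    data:  bytearray standing in for the data file
    index: dict of key -> (byte offset, slot length)
    """
    offset, length = index[key]
    if len(new_body) <= length:
        # Fits in the old space: write it there and blank-pad the remainder.
        data[offset:offset + length] = new_body.ljust(length, b" ")
    else:
        # Too big: mark the old record deleted, append the new one at the end.
        data[offset:offset + 1] = b"*"
        offset, length = len(data), len(new_body)
        data.extend(new_body)
    index[key] = (offset, length)  # only the reference field changes
    return offset


data = bytearray(b"LON|2312|Prokofiev")
index = {"LON2312": (0, len(data))}
print(update_record(data, index, "LON2312", b"LON|2312|Sergei Prokofiev"))
```

Because the key itself did not change, the index entry keeps its position and only its reference field is rewritten.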

MULTIPLE KEYS

1. Just as a library card catalog allows us to regard a collection of books in author order, title order, or subject order, so index files allow us to maintain different views of the records in a data file.
   a) We can use secondary indexes to obtain different views of the file.
   b) We can also combine the associated lists of primary key references and thereby combine particular views.
2. Consider the file of musical recordings.
   a) Using primary key access, we could find the record with Label ID COL38358.
   b) Using one secondary key index, we can answer a query such as "Find all recordings with Beethoven as composer."
      i) The secondary key would be composer.
      ii) See figure 7.8 in the handout for an example of a composer index.
   c) We need another secondary key index file to find all recordings of Beethoven's Symphony No. 9.
      i) The second index's secondary key would be title.
      ii) See figure 7.10 in the handout for an example of a title index.
3. The secondary key index record contains a secondary key and a primary key.
4. To find all of the records in the data file that match the secondary key (e.g. Beethoven):
   a) Search the secondary key index, locating all matching secondary key(s) (e.g. Beethoven).
   b) Use the matching index records to obtain the primary key(s). The secondary key Beethoven, for example, may have one or more primary keys associated with it, such as ANG3795 and DG139201.
   c) Search the primary key index file to obtain the data file byte offset of each primary key associated with the secondary key.
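
The two-stage lookup in item 4 can be sketched like this. The in-memory representations are assumptions: the secondary index is a sorted list of (secondary key, primary key) pairs, and a dict stands in for the binary-searched primary key index. The byte offsets are illustrative.

```python
import bisect


def primary_keys_for(secondary_index, secondary_key):
    """Steps a and b: collect every primary key stored under the secondary key."""
    start = bisect.bisect_left(secondary_index, (secondary_key, ""))
    found = []
    for sk, pk in secondary_index[start:]:
        if sk != secondary_key:
            break
        found.append(pk)
    return found


def offsets_for(secondary_index, primary_index, secondary_key):
    """Step c: look each primary key up in the primary index to get its offset."""
    return [primary_index[pk] for pk in primary_keys_for(secondary_index, secondary_key)]


composer_index = sorted([
    ("BEETHOVEN", "ANG3795"),
    ("BEETHOVEN", "DG139201"),
    ("PROKOFIEV", "LON2312"),
])
primary_index = {"ANG3795": 167, "DG139201": 396, "LON2312": 32}  # illustrative offsets
print(offsets_for(composer_index, primary_index, "BEETHOVEN"))
```

The secondary index never stores a byte offset; it always goes through the primary key.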
1. Record updating.
   a) The primary key index serves as a kind of protective buffer, insulating the secondary index from changes in the data file.
   b) Data file updates affect the secondary key index only when they change either the primary or the secondary key. There are three possible situations:
      i) Update changes the secondary key.
         a) If the secondary key is changed, then we may have to rearrange the secondary key index so it stays in sorted order.
         b) This can be a relatively expensive operation.
      ii) Update changes the primary key.
         a) This kind of change has a large impact on the primary key index, but often requires only that we update the primary key reference field in all of the secondary indexes.
         b) This involves searching the secondary indexes (on the unchanged secondary keys) and rewriting the affected fixed-length primary key reference field.
            (1) It does not require reordering the secondary indexes unless the corresponding secondary key occurs more than once in the index.
            (2) If a secondary key does occur more than once, there may be some local reordering, since records having the same secondary key are ordered by the reference field (primary key).
      iii) Update doesn't affect either the secondary or primary key: no change in either the secondary or primary index files.
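
A sketch of case ii), where an update changes the primary key. The function name `rekey`, the list-of-lists representation, and the replacement key "XYZ1" are hypothetical. For brevity the sketch re-sorts the whole in-memory index, where a real implementation would only reorder the entries that share the affected secondary key.

```python
def rekey(secondary_index, old_pk, new_pk):
    """Rewrite references to old_pk, then restore (secondary key, primary key) order.

    secondary_index: sorted list of [secondary_key, primary_key] entries.
    Returns the number of reference fields that were rewritten.
    """
    changed = 0
    for entry in secondary_index:
        if entry[1] == old_pk:
            entry[1] = new_pk  # only the reference field is rewritten
            changed += 1
    # Simplification: a full sort stands in for the local reordering.
    secondary_index.sort()
    return changed


composer_index = [
    ["BEETHOVEN", "ANG3795"],
    ["BEETHOVEN", "DG139201"],
    ["PROKOFIEV", "LON2312"],
]
rekey(composer_index, "ANG3795", "XYZ1")  # "XYZ1" is a hypothetical new Label ID
print(composer_index)
```

Only the two BEETHOVEN entries change places; the PROKOFIEV entry is untouched, which is the local reordering described in (2).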
2. Typical queries of the musical recordings file:
   a) Primary key search (e.g. label = LON, id = 2312)
   b) Secondary key search (e.g. all labels = COL)
   c) Multiple secondary keys (e.g. label = COL and artist = SPRINGSTEEN)
      i) Must search multiple secondary key indexes (e.g. label secondary key index and artist secondary key index) and then match the primary keys using a Boolean AND operation.
      ii) See page 271 in handout.
      iii) Algorithms for performing this kind of match operation will be in future lectures.
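
The notes defer the match algorithm to later lectures. As one possible sketch (not necessarily the method those lectures use), two sorted lists of primary keys can be ANDed in a single merge-style pass. The function name `match_and` and the sample key lists are assumptions.

```python
def match_and(keys_a, keys_b):
    """Return the primary keys present in both sorted lists (Boolean AND)."""
    i = j = 0
    matched = []
    while i < len(keys_a) and j < len(keys_b):
        if keys_a[i] == keys_b[j]:
            matched.append(keys_a[i])
            i += 1
            j += 1
        elif keys_a[i] < keys_b[j]:
            i += 1
        else:
            j += 1
    return matched


# Hypothetical results of the two secondary key searches.
label_col = ["COL31809", "COL38358"]
artist_springsteen = ["COL38358"]
print(match_and(label_col, artist_springsteen))
```

Each list is read once from front to back, which works because the references under a given secondary key are kept in primary key order.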

INVERTED LISTS

1. The secondary index files that we have developed so far have two distinct difficulties:
   a) We must rearrange them every time a new record is added, even if the new record's secondary key is already in the index (e.g. two or more Beethovens).
      i) This may require shifting the array to allow for new secondary keys or rearranging a linked list.
   b) If there are duplicate secondary keys, the secondary key field is repeated for each entry.
      i) This wastes space, making the files larger than necessary.
      ii) Larger index files are less likely to fit in electronic memory.
2. First attempt at a solution:
   a) Use an array of references (primary keys) for each secondary key (see figure 7.11).
   b) Problems are the typical problems associated with arrays:
      i) Fixed length.
      ii) Inflexible.
      iii) Lousy for dynamic data structures.
      iv) By extending the fixed length of each of the secondary records to hold more reference fields, we might easily lose more space to internal fragmentation than we gained by not repeating identical keys.
3. A second (and better) solution:
   a) Use a linked list for the primary key references (see figure 6.10).
      i) This is known as an inverted list because the secondary key leads to a set of one or more primary keys.
      ii) Each secondary key points to a different list of primary key references.
      iii) Each of these lists can grow to be just as long as it needs to be.
      iv) Advantages:
         a) Retains the attractive feature of not requiring reorganization of the secondary indexes for every new entry to the data file.
         b) Allows more than four Label IDs to be associated with each secondary key.

   c) Associating the Secondary Index file with a new file containing linked lists of references provides some advantages over other structures considered up to this point.
      i) The only time we need to rearrange the Secondary Index file is when a new composer's name is added or an existing composer's name is changed (perhaps because of a spelling error).
      ii) Deleting or adding recordings for a composer who is already in the index involves changing only the Label ID List file.
      iii) Deleting all the recordings for a composer could be handled by modifying the Label ID List file.
         a) Leave the entry in the Secondary Index file in place.
         b) Use a value of -1 in its reference field to indicate that the list of entries for the composer is empty.
      iv) In the event that we do need to rearrange the Secondary Index file, the task is quicker now since there are fewer records and each record is smaller.
      v) Since there is less need for sorting, there is less of a penalty associated with keeping the Secondary Index files out on secondary storage, leaving more room in RAM for other data structures.
      vi) The Label ID List file is entry sequenced. That means that it never needs to be sorted.
      vii) Since the Label ID List file is a fixed-length record file, it would be very easy to implement a mechanism for reusing the space from deleted records, as described in earlier lectures.
   d) There is a potentially significant disadvantage to this kind of file organization.
      i) The Label IDs associated with a given composer are no longer guaranteed to be physically grouped together. With a linked, entry-sequenced structure, it is less likely that there will be locality (physical togetherness) associated with the logical groupings of reference fields for a given secondary key.
      ii) This lack of locality means that picking up the references for a composer that has a long list of references could involve a large amount of seeking back and forth on the disk.
      iii) One obvious solution to this seeking problem is to keep the Label ID List file in memory.
         a) This could be expensive and impractical, given many secondary indexes.
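
A minimal in-memory sketch of the inverted list organization. The class and attribute names are assumptions; a dict stands in for the Secondary Index file and a Python list for the entry-sequenced Label ID List file. As a simplification, new references are linked in at the front of a composer's list rather than kept in Label ID order.

```python
EMPTY = -1  # reference value meaning "end of list" or "no entries"


class InvertedIndex:
    def __init__(self):
        self.heads = {}    # secondary key -> record number of first list entry
        self.id_list = []  # entry-sequenced records: [label_id, next_record_number]

    def add(self, secondary_key, label_id):
        """Append to the list file, then link the new record in at the front."""
        self.id_list.append([label_id, self.heads.get(secondary_key, EMPTY)])
        self.heads[secondary_key] = len(self.id_list) - 1

    def lookup(self, secondary_key):
        """Follow the chain of record numbers until the EMPTY marker."""
        rrn = self.heads.get(secondary_key, EMPTY)
        found = []
        while rrn != EMPTY:
            label_id, rrn = self.id_list[rrn]
            found.append(label_id)
        return found


idx = InvertedIndex()
idx.add("BEETHOVEN", "ANG3795")
idx.add("PROKOFIEV", "LON2312")
idx.add("BEETHOVEN", "DG139201")
print(idx.lookup("BEETHOVEN"))
print(idx.id_list)
```

Adding the second Beethoven recording appended one record to the list and changed one reference field; nothing was sorted or shifted. The printed list also shows the locality problem from d): the two Beethoven references are not adjacent.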

BINDING

1. At what point in time is the key bound to the physical address of its associated record?
   a) The binding of the primary keys to an address takes place at the time the files are constructed.
   b) The secondary keys are bound to an address at the time that they are actually used.
2. Binding at the time of the file construction results in faster access.
   a) Once you have found the right index record, you have in hand the byte offset of the data record you are seeking.
   b) If we elected to bind our secondary keys to their associated records at the time of file construction, secondary key retrieval would be simpler and faster.
      i) It would do away entirely with the need to search on the primary key.
      ii) The disadvantage of binding directly in the file (i.e. of binding tightly) is that reorganizations of the data file must result in modifications to all bound index files.
      iii) This reorganization can be very expensive, particularly with simple index files in which modification would often mean shifting records.
   c) By postponing binding until execution time, when the records are actually being used, we are able to develop a secondary key system that involves a minimal amount of reorganization when records are added or deleted.
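
The trade-off can be illustrated with a small sketch. The dict-based indexes, the function name `resolve_late`, the tightly bound variant, and all byte offsets are assumptions made for illustration.

```python
def resolve_late(composer, composer_index, primary_index):
    """Late binding: secondary key -> primary keys -> byte offsets at query time."""
    return [primary_index[pk] for pk in composer_index[composer]]


primary_index = {"ANG3795": 167, "DG139201": 396}        # key -> byte offset
composer_index = {"BEETHOVEN": ["ANG3795", "DG139201"]}  # holds primary keys only
tight_index = {"BEETHOVEN": [167, 396]}                  # hypothetical tightly bound index

# Suppose a reorganization of the data file moves ANG3795 to byte offset 512.
primary_index["ANG3795"] = 512

print(resolve_late("BEETHOVEN", composer_index, primary_index))  # sees the move
print(tight_index["BEETHOVEN"])  # stale until this index is rewritten as well
```

With late binding only the primary index had to change. The price is the extra primary key search on every secondary key retrieval.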