Chapter 1 – File Systems – Introduction to Database Management Systems

Chapter 1






File Systems

Computer files and file systems have great similarities with traditional files and file systems. Both allow records to be inserted, updated, deleted, searched for, sorted, and so on.

Chapter Highlights

  • The Concept of File Systems
  • Files and Records
  • Various Data Structures Used in File Systems, such as Chains, Pointers and Lists
  • Different Types of File Organisations
  • Concept of Record Keys and Indexes

1.1 NEED FOR A FILE

We use a lot of files in carrying out our day-to-day activities. For example, a college student uses notebooks and journals while a workplace contains many files and registers. Why do we need files at all? The simple explanation is that we cannot remember each and every word and number that is important to us. Hence, we store information in the form of files, notebooks and so on as shown in Fig. 1.1.

Fig. 1.1 File

In simple terms, a file is a collection of logical information.

Imagine a clerk working in the payroll department of a company that has 10 employees. Since 10 is a small number, the clerk might remember all the details such as basic pay, various allowances and deductions applicable for each one of them. Hence the clerk may not need a file at all! At the end of a month, the clerk would simply take 10 blank pay-slips and write various details in the appropriate columns directly from his memory. However, even this situation does not guarantee perfect payroll processing. What if the clerk leaves the job or is on leave during the end of the month? The person replacing the clerk would take at least some time to remember all the different names and numbers. Furthermore, if the number of employees increases (say to 100 employees), even the old clerk may not be able to remember all the information correctly.

In view of all these problems, it is essential that all such information is recorded somewhere; the easiest option being files or notebooks. If that is done, the clerk can simply use the recorded information at the end of the month to calculate the pay and prepare the pay-slips.

1.2 FILES

Let us assume that the clerk has decided to store the information in the form of files. For understanding the process of filing better, let us discuss it in detail.

1.2.1 Sample File

Let us imagine that the clerk is about to start preparing the pay-slips for this month. As we had seen, the clerk now uses a file to store information such as name, basic pay, days-worked etc. Suppose the company has 1000 employees. There will be one page of information for each of the 1000 employees, as shown in Fig. 1.2. Thus, there will be 1000 records for the clerk to consult while preparing the pay-slips.

Fig. 1.2 Employee records in a file

A file can contain one or more records.

As the figure shows, the clerk stores the information related to various employees and their pay details (i.e. employee records) in this file prior to the calculation of final pay. Hence, the clerk calls this file employee file. In other words, this is the file name of this particular file. We need to write an algorithm to prepare the pay-slips in the manual system. The creation of a file would not only simplify the clerk's task, but would also help the person who takes over in his absence. The next few sections build up towards that goal.

1.2.2 Records and Fields

We have seen that in the file there is one page or record per employee. Each of these records contains details such as employee number, name etc. Such a detail is called a field or data item of the record. A sample record for Jui lists all the available fields. Table 1.1 shows the skeleton of a record, and outlines the information that can be stored in it. This skeleton is also called a record layout. Generally, it contains sections such as field name, type of data that can be stored (alphabets, numbers or both) and maximum allowable length of each field.

Files were used in 1960s, and are used widely in all serious business applications even today. Database systems are slowly replacing most files systems.

Table 1.1 Record layout for the Employee file

Field name Type of data Maximum length
Employee number N 4
Employee name X 30
Basic rate per day N 4
Days worked N 2 + 1
Allowances per month N 4
Deductions percentage N 3
Activity indicator A 1

Let us note our observations regarding field name, and the type and maximum size of data that can be stored.

  • We have used some symbols for the type of data allowed, as follows:

    N means only numeric data is allowed

    A allows only alphabets

    X allows both

  • In the “maximum length” column, the days worked can contain a fraction. In Table 1.1 the days worked are 2 + 1. This means that it has two integer positions and one decimal, for example, 23.5 or 12.2. The maximum value in this field can be 99.9. However, in real life, can an employee work for 99.9 days in a month? The specification should, therefore, validate that the entry for this field should not exceed a maximum value of, say 31.0 days, depending on the month for which the record is being created. However, as of now we will not check for this condition so that we keep our observations simple.
  • It is always a good idea to make provisions for the future. So, the basic per day section has been designed to allows up to 4 digits. Let us suppose that the basic pays are currently in the range 100 to 500. But sooner or later, an employee might have a basic pay of 1000. Unless we provide for an extra digit, the bigger figure cannot be stored.

1.2.3 Master and Transaction Data

Data can be classified into two types – master and transaction.

Master data does not change with time. Some examples are employee number and employee name. Days worked and allowances are transaction data, which can change from time to time.

Master and transaction data are kept as separate files with reference to Table1.1, if some changes have occurred during the month (e.g. recruitment of new employees resignations or increments) we go through a master maintenance process to keep the master data up-to-date. All the transactions that take place during the month are collected separately in a transaction file.

Finally, both files are read and the payroll is processed. In Table 1.1, we have shown an employee record in which some data is master data and some is transaction data. This kind of mixed data is usually available after both the master and transaction files have been read. The data is extracted after matching the records from both the files for the same employee number, and is further used for the payroll. The process is illustrated in Fig 1.3.

Fig. 1.3 Processing master and transaction files to generate pay-slips

1.3 COMPUTER FILES

Maintaining records and files in the paper format is convenient. It allows easy access to information and does not require any formal training. However, it is far easier and much more convenient to create and maintain files by using computers.

A computer file contains information arranged in an electronic format.

In concept, computer files are not significantly different from paper files. Computer files also facilitate easy storage, retrieval, and manipulation of data. The major difference between paper files and computer files, however, is that while the former is stored on paper the latter is stored in the form of bits and bytes. Computer files also contain information similar to paper files. A computer file has a name. Thus, if we store employee information in a computer file, we may call it the employee file. The computer would recognise the file based on this name. A programmer working with this file can give instructions to the computer to open the file, read from it, write to it, modify its contents, close it, and so on.

For example, if we write a program for payroll processing, the steps carried out by this program (i.e. the algorithm) can be summarised as shown in Fig. 1.4. We have assumed that two files are available:

  • The employee file is the master file, containing records of all the employees.
  • The payroll file is the transaction file, which would contain payroll details such as number of days worked and pay details of all the employees after appropriate calculations.

Fig. 1.4 Payroll processing algorithm

We present the flow chart for this algorithm in Fig. 1.5.

If we look at the payroll-processing program carefully, we will notice that once it starts executing, the program needs very little interaction on the part of the clerk. Suppose there is some sort of mechanism by which the following can be achieved:

  1. The input records are put (read) into the input area one by one as desired from the external medium (in this case, the employee file stored on the disk).
  2. Calculations are done for the record read.
  3. The output pay-slips are put (written) into the output area (that is the payroll file).

Fig. 1.5 Payroll-processing flowchart

Then the clerk would virtually do nothing until all the pay-slips were printed and filed for distribution. This is similar to a factory where things keep on moving over a belt. The entire cycle starts with the input of spare parts and ends with the production of the end product. Everything moves over a belt, with processes involved at many intermediate stages. This type of processing is called batch processing. Batch processing involves a batch of goods manufactured and passed along the conveyer belt. Here, the entire batch of input employee records moves over the payroll program. Calculations are performed in each case. The output of the process is the set of pay-slips. This, in fact, was the motivation behind the earlier batch computer systems, which aimed at reducing the human drudgery of performing repetitive tasks!

In batch processing, no or minimum human interaction is required. One program passes control to another in a sequence.

However, in contrast to batch processing, in many situations the program needs to be conversational. That means, it needs to interact with the user of the program. Take the example of a manual telephone directory service. You tell the operator the name of the person and ask for the corresponding telephone number. The operator on the other side searches for this information and comes back with an answer. These days, the computer performs both the searching and the answering operations in an automated manner.

A search that can take place at any time is called as an online query. When an instantaneous answer is expected, it is called online processing or real-time processing.

An example of real-time processing is seen in the case of airlines reservations. Note that it is random, which means that there could be any query at any time. In such a case, how do we organise information to enable easy, faster access to it? We shall study this in subsequent sections of the book.

1.4 LIBRARY MANAGEMENT – A CASE STUDY

We shall now use another example to illustrate the more advanced concepts of file processing.

Imagine that we have a library of books and have appointed a librarian to maintain it. The librarian has created one card per book, which contains details such as book number, title, author, price and date of purchase. For this, the librarian has used the conceptual record layout as shown in Table1.2.

Table 1.2 Record layout for the Book file

Field name Type of data Maximum length
Book number N 4
Book title X 100
Author X 50
Price N 4
Date of purchase X 10
Category X 20
Publisher X 50

For simplicity, we have not shown the part of the card containing details on borrowing and returning of books. We shall ignore these details throughout the discussion. As soon as a new book comes in to the library, the librarian creates a new card for it. Note that a card is similar to a record and, in technical terms, the entire pile of cards is similar to a file. We shall use the terms interchangeably. The cards arranged by the librarian are shown in Fig. 1.6.

1.4.1 Record Keys

One pertinent question is: why do we need additional data elements such as an employee number in the case of an employee record and a book number in a book record? The reason is quite simple. By using an employee number, we can quickly identify an employee; and similarly, by using a book number, we can uniquely identify a book.

Fig. 1.6 Cards for the books

A field used to identify a record is called as a record key, or just a key.

As and when a new book arrives in the library, the librarian creates a new card for the book and gives it the next number in sequence. Since there can be many books with the same title, author or publisher, none of these can be used to uniquely identify a book. Hence, something additional is needed for this purpose. This is achieved by having book number as an extra field in the record.

A field that can identify a record uniquely is called as primary key of the record.

Hence, here the book number is the primary key. Given a book number, we are sure that to get only one record.

A field that identifies one or more records, not necessarily uniquely, is called a secondary key.

For instance, given an author name, we may or may not be able to identify a book uniquely: the library may have two or more books written by the same author. Hence, the author field identifies a group of records. Hence, it is a secondary key. Other secondary keys in this case can be the book title, publisher and so on.

Thus, as shown in Fig. 1.7 a record key can be of two types:

  • Primary key: Identifies a record uniquely based on a field value.
  • Secondary key: May or may not identify a record uniquely, but can identify one or more records based on a field value.

Fig. 1.7 Types of record keys

1.4.2 Searching Records

For simplicity, let us assume that there are only 10 books in the library. Suppose a member wants a list of all the books written by Mahatma Gandhi. The librarian can perform a quick search. He would simply look at one card after another and check if the author's name on the card is Gandhi. If they match, the librarian would inform the member about the book. The algorithm for this process is shown in Fig. 1.8.

Fig. 1.8 Algorithm for retrieving books written by Gandhi

Figure 1.9 shows the corresponding flow chart.

A similar search can be carried out to find the books published by a particular publisher. As we already know, any such inquiry is called as online query. The process of looking up the cards is called a search. Thus, the records in a file need to be searched in order to respond to an online query.

Many file systems have become popular, but the two most notable ones are Indexed Sequential Access Mechanism (ISAM) and Virtual Storage Access Mechanism (VSAM). VSAM files dominate the IBM mainframe world even today.

Fig. 1.9 Flow chart for retrieving books written by Gandhi

1.5 SEQUENTIAL ORGANISATION

1.5.1 What is Sequential Organisation?

We have noted that the cards are not arranged in any particular order. The natural order of cards is based on the arrival of books: as and when a new book arrives, a new card is made for it. Hence, the cards are arranged sequentially. In computer terms, the main characteristic of this sequential organisation of records is that the records are added as and when they are available. The idea is that a new record is always added to the end of the current file, as depicted in Fig. 1.10.

Note the following:

A file is called as a sequential file when it contains records arranged in sequential fashion.

Fig. 1.10 Concept of sequential files

We have noticed that the sequential file contains n records, numbered 1, 2,…, n. The start and end of the file are clearly identified and marked. If we want to add any new records to the file, it must be done at the end of the file. We cannot, for instance, insert a new record after the 3rd record in the file.

1.5.2 Advantages of Sequential Organisation

Let us summarise the advantages of sequential organisation.

  • Simplicity: As we can see, the sequential organisation of cards is quite simple. All that needs to be done is to create a new card for every new book and add it to the end of the current set of cards. In technical terms too, the sequential organisation of records is quite simple and we just need to create a new records for the new books.
  • Less overheads: We have noticed that in this form of organisation the librarian does not need to maintain any additional information about the books. Maintenance of the cards is good enough. In technical terms, we do not need to keep any key or any other extra information on the books file. The file is enough.

1.5.3 Problems with Sequential Organisation

In spite of its simplicity, sequential organisation does have a number of problems. These are mainly associated with the (lack of) searching capability. Let us summarise them.

  • Difficulties in searching: Since the number of books is quite small in our imaginary library, the current scheme works fine. However, if the number of books goes on increasing, this form of organisation will be clearly inefficient. Suppose our library contains 1000 books instead of 10. If a member now inquires about the books written by Gandhi, the librarian would have to start from the first card (that is, card number 1). As in the previous case, he would then check one card after another to see if the author mentioned on it is Gandhi so that he can come up with a list of books written by Gandhi. Note that, now he would have to look at 1000 cards. Clearly, it would take a long time. If there were many such queries, the librarian would be busy in merely answering queries and would not be able to do anything else! Obviously, the main drawback of this scheme is that the librarian has to physically examine every card.

    In technical terms, searching information in a sequential file can be a very slow process. For any search operation, we need to start reading a sequential file from the beginning and continue till the end, or until the desired record is found, whichever is earlier. This is both time-consuming and cumbersome.

  • Lack of support for queries: Sequential organisation is not suitable for answering some queries such as: Do you have any book written by Gandhi at all? If the library has one such book, but unfortunately it is the last entry in the file, all cards need to be scanned! In technical terms, it means that to even find out whether something is available in the file or not, the entire file has to be read!
  • Problem with record deletion: This problem is perhaps not so clear in our example. In an ordinary situation, if the card associated with a book needs to be removed, the librarian can simply remove it from the deck of cards and tear it off. However, in the technical world, we simply cannot delete records in a sequential file. To understand this problem, let us consider a contrived situation wherein the librarian must maintain a number of cards equivalent to the number of books, regardless of whether the book is available or is lost. Thus, if there are 1000 books in the library, there must be 1000 corresponding cards. Thus the card must be maintained even if a book is lost. Perhaps, the librarian can put a special marker (say ***) on the card to indicate the card is void—meaning that the book is lost. The idea is illustrated in Fig. 1.11 for card number 971 (indicating that book number 971 is lost).

In any case, the librarian must not remove the card. This is exactly what happens with sequential files on computer disks. When we want to remove a record from a sequential file, the file cannot crunch itself automatically. For example suppose that there are 1000 records in a file, each occupying 100 bytes. The file would occupy a total of 1,00,000 bytes on the disk. Now, suppose that we want to delete record number 971 for some reason. We cannot reduce the file size by the size of one record (i.e. 100 bytes) to 99,900 bytes. All we can do is to mark record number 971 in some special fashion to indicate that it should be considered as deleted. The idea is explained in Fig. 1.12.

Earlier file systems used to be dependent on the physical locations of their records. Modern file systems are quite independent of the actual storage locations.

Fig. 1.11 Marking a book as deleted/lost

Fig. 1.12 Marking a record as deleted in a file

Also, we cannot reclaim the space freed by the deletion of this record. For this we would need to have a separate program which reads every record from the file and, if it is not deleted, writes to another file. Thus, record number 971 (and any such records, which have been logically deleted), would not be written to the second file. In other words, they would get physically deleted. Thus, this second file would contain only the records for the available books. This process is used in many cases where sequential files are used, and is called file reorganisation.

In general, sequential search works fine if the number of records is small. However, if the number of records as well as the number of queries is large, it becomes inefficient.

1.6 POINTERS AND CHAINS

In order to overcome the drawbacks of sequential organisation, let us add one more field to the original record layout. Let us call this field Next by same author. This field is a pointer.

In general, a pointer in a record is a special field, whose value is the address/reference of another record in the same file.

The pointer points to the next book written by the same author as shown in Fig. 1.13. We can see that there is a pointer in Record Number 1, pointing to Record Number 3. This means that the same author has written the books numbered 1 and 3.

Fig. 1.13 A pointer to the next record

Note the new field added to the record layout. This field gives the book number for the next book written by the same author. Every card will have this additional field. So, when it comes to looking for the next book written by the same author, every card will actually point to another card. So, conceptually, we will have the following situation, assuming that book numbers 1, 3, 21 and 761 are written by the same author—Joseph Martin. Figure 1.14 ignores the other details and concentrates on only the pointer field of these four cards.

Fig. 1.14 Logical chain of records formed by using pointers

What the additional field does is simple? It forms a chain of records. Pointers maintain the chain.

A chain of records is a logical sequence of records created by the use of pointer fields.

We have shown this by arrows. Using the field Next by same author creates the chain. Effectively, what we are stating is this:

  • In the card for book number 1, we are saying that the next book written by this author is book number 3.
  • In the card for book number 3, we note that the next book written by the same author is book number 21, and so on.
  • The last book written by this author is book number 761.
  • After this, there are no more books written by Joseph Martin in our library. This is indicated by putting an asterisk (*) against the Next (same author) field of the card for book number 761. It signifies the end of the chain for this author.

Now, suppose a member comes up with another query on books written by Gandhi. The librarian's task will be simpler. He will start with record number 1, as before, and read every record until he finds one for Gandhi. Having found that, he will simply follow the pointers in the chain and comes up with the response. The Next by same author field will give him the next record for Gandhi. When he finds a record with an asterisk against the Next by same author field, he will stop the search. The only time-consuming task will be obtaining the first record for Gandhi. Once that is done, the rest of the search will be a lot quicker. The algorithm for this search is shown in Fig. 1.15.

Fig. 1.15 Algorithm for searching books by using pointers

Figure 1.16 shows the corresponding flow chart.

Fig. 1.16 Flow chart for searching books by using pointers

1.6.1 Problems with One-way Chains

What we just saw were one-way chains. The Next by same author field gives us the next record for the same author in the forward direction only. However, imagine that our librarian goofs up one day. While he is adding some new cards for books that have arrived recently as shown in Fig. 1.17, he spills ink on the first card.

Fig. 1.17 Breaking of a chain

As we can see, the ink broke the chain of records for author Joseph Martin. How can the librarian now find out all the books written by Martin? Well, he would need to conduct some of the search manually. Fortunately, in this case, the author has record numbers 1 and 3 as the first two records. So, the manual search would be limited to only record numbers 1, 2 and 3. When the librarian comes to record number 3, he would see the next pointer as before and can use the chain from there onwards. This is shown in Figure 1.18.

Fig. 1.18 A broken chain

However, imagine that the concerned author's first book was book number 1 and the second was 990! That would mean a sequential search of the first 990 records! From 990, the chain would be available.

One-way chains can suffer from the drawback of lost/damaged references.

Another problem with one-way chains is that they are unidirectional by nature. Therefore, if anyone wants the third book from the end written by Joseph Martin, there is no direct way of finding it except traversing the whole list from the end to the beginning of the chain, in a reverse manner. As a result, some of the library members may start complaining that their queries are not getting a quick response. What is the solution?

1.6.2 Two-way Chains

The librarian can improve things by creating another logical chain. Until now, the chain of records for an author was unidirectional. Now, the librarian adds another chain—in the reverse direction—such that we now have two pointer fields: one for the Next by same author and another for the Previous by same author. The former, as before, gives the book number for the next book written by the same author. The Previous by same author field gives the previous book written by the same author. We shall call it as the back-pointer. Now the cards look as shown in Fig. 1.19.

Fig. 1.19 Addition of the back-pointer

Since book number 1 is the first book in our library, there is no book (for this author) prior to it. Therefore, the Previous by same author is marked with an asterisk (*). This is a special value, which indicates that there is no previous book in the library for this author. However, the remaining records would have a back-pointer, corresponding to the Next by same author field as shown in Fig. 1.20.

Fig. 1.20 Two-way chains

As the figure shows, we now have two-way chains. The back-pointer points to the previous record for the same author in the chain. Therefore, any query related to the previous as well as the next records for the same author can be easily answered. Now let us see what happens if ink falls on the Next by same author pointer of record number 21, as shown in Fig. 1.21.

Fig. 1.21 Broken link does not affect two-way chains so much

A broken chain does not cause a major problem in two-way chains. This is for the simple reason that another chain exists in the opposite direction. In this case, the Next by same author pointer for record number 21 is lost, due to which the forward connection between record numbers 21 and 761 is broken. However, the backward pointer from record 761 to record 21 still exists. Hence, the following can be easily established:

  1. When moving in the forward direction, record number 1 points to record number 3 and record number 3 points to record number 21.
  2. We are not sure about the next pointer from record number 21.
  3. However, while moving backwards, record number 761 points to record 21 in the reverse direction.

Therefore, record number 21 must point to record number 761 in the forward direction. Thus, a Next by same author pointer can be reconstructed for record number 21 to point to record number 761. This is called recovery of a lost pointer. Thus, we can state the following:

Two-way chains do not suffer from the drawback of lost/damaged references.

What are the disadvantages of two-way chains? Obviously more effort goes into their maintenance. For every new book that is being added, the clerk has to make a Previous by same author entry in addition to the Next by same author entry. Also, if a book is lost both the previous and next pointers need to be adjusted so that they point to the correct record.

1.6.3 Queries Based on Other Fields

So far, we have been considering queries based only on authors. However, members might be interested in knowing about all the books published by a particular publisher. How do we answer to such queries? Quite simply, in addition to the author field the librarian needs to maintain two-way chains for the publisher field. Thus, we now have four pointers establishing four chains. The first two next and previous author pointers maintain the author chains. A similar set of next and previous pointers maintains the publisher chain. The modified record structure now looks like the one shown in Fig. 1.22.

Fig. 1.22 Pointers for publishers in addition to those for authors

There can be two-way chains for publishers as well. We will not illustrate the chains, as the characteristics of chains must be very clear by now. So, if an inquiry based on the author is made, the librarian would follow the author chains as before. However, if an inquiry is made on the publisher, the new publisher chains would be used. This would facilitate a fast search based on either the author or the publisher. Also, we can have additional chains for other fields such as book title, price and so on, depending on the queries that come up. The downside, as pointed out earlier, is the effort that goes into more maintenance. Every time a book is added to or deleted from the library, the librarian has to adjust upto four or more pointers.

1.7 INDEXED ORGANISATION

1.7.1 Using Indexes

Our librarian is really improving fast. For still faster access to the desired record(s), he devises a novel scheme. Not only does he maintain two-way chains for authors and publishers, but does something more—he maintains indexes.

An index is a table of records arranged in a particular fashion for quick access to data.

In this case, the librarian has listed all authors on a piece of paper in ascending order. Each author appears only once in the index. Against each author, the librarian has written the number of the first book by that author. Therefore, the paper is an author index. The concept is illustrated as shown in Fig. 1.23.

Fig. 1.23 Concept of index and chain

The idea behind this scheme is simple.

  1. The author index points to the first record for every author which acts as the entry pointer for that author.
  2. As before, the actual records still have two-way chains maintained for each author.

Thus, the search based on a particular author becomes even simpler. The first thing that needs to be done now is to look at the author index. Because it is sorted in alphabetical sequence, an author's name can be found quickly. Thus it provides easy access to the first record number for the selected author. Using this number, we can directly go to the first record for this author by following the logical pointer. From there, the chains assist in the remaining search as before. Thus, a combination of an index and chains makes the search even faster.

Figure 1.24 shows the algorithm for searching all the books written by a particular author quickly by using the index and the chains.

Fig. 1.24 Algorithm for searching all books by one author

For example, suppose someone asks: Which books written by Zuse are available in the library? Using the author index, we realise that the first book in our library by this author is number 192. Hence, we can now go directly to record number 192 and then, use the author chain for the remaining records. This saves the time needed for searching the first record for the author. A similar scheme can be arranged for publisher index. We will not go into its details as the same concept is applicable here too.

Can there be further enhancements to our basic scheme of index and chains?

1.7.2 Improvements to Index-chain Method

Just as the chains are two-way, the librarian can make the index two-way too!

That is, rather than keeping an entry for only the first record for an author in the index, he can keep an entry for the first and the last records. If that is done, a two-way chain would be formed. Thus, by using the pointer for the first record in the index, we can traverse the forward chain. For a search from the end, we can use the reverse chain starting with the last record for the author. This would make the search faster in certain situations.

For example, if a member wants to see the latest book by Agatha Christie, we can first look up the new entry in the index (Last book for this author). Then, we can simply go to the chain using that book number and identify the record. Clearly, this adds some more overheads to the process of record maintenance, but gives him more flexibility in terms of searching records. However, the more important reason for maintaining two-way chains is to aid recovery. We had previously seen that when a chain based on the next pointer is broken, information is lost. Use of two-way chains ensures that vital information is not lost.

The concept is illustrated in Fig. 1.25. Due to space constraints, instead of full book cards only book numbers are shown. Also, not all cards and their pointers are shown. But it should be adequate in making the concept of first and last book entries in the author index clear.

Fig. 1.25 Index with entry for the first and the last records

1.7.3 Maintaining a List of All Items in the Index

The index can be constructed in another manner. We can maintain a list of all the books written by an author in the index. In the index described earlier, only details on the first and the last books were maintained. In the new index, we can list all the book numbers for an author in the second column, as shown in Fig. 1.26. The figure shows only the author index and not the chain.

Fig. 1.26 Author index with list of all books

This system would not only work fine but also give faster results. For example, if someone wants to know about all the books written by Joseph Martin, the above author index would quickly provide the answer. Using the book numbers provided (1, 3, 21, 761) for Joseph Martin, the information can be searched quickly. Note that we need not maintain any chains or next/previous pointers now. The index contains all the information.

The algorithm for this search is shown in Fig. 1.27.

Fig. 1.27 Modified algorithm for searching all books by one author

However, the drawback of this scheme is that there is variable number of entries per author. We are not sure how many books per author is allowed in the library: it could be any number. Therefore, maintaining the index could be cumbersome, especially if the number of books written by an author is large. A better scheme is needed for quickly finding such information.

1.7.4 Keeping a Count of Records

Another improvement in the basic scheme could be achieved by keeping record counts. For instance, suppose there are four books for Joseph Martin, we can not only keep a record of the first and last pointers for the author in the author index, but additionally, we can also keep a count of the total number of books by the author available in the library, that is 4. This is shown in Fig. 1.28.

Such counts can help in two ways:

(a) Queries such as: How many books in your library are written by Joseph Martin? can be answered very quickly without even going to the cards or following the chains.

Fig. 1.28 Maintaining the count of records for each author

(b) Errors can be detected early. For instance, since we know that we have four books by Joseph Martin from our index entry, we must obtain the same number when we actually traverse the chains. If we get anything other than four, it means that either the index entry for count is wrong or that the chains are not quite correct. This is called index corruption. Either way, a correction is required and walking through the chains help in implementing it.

1.7.5 Complex Queries and Query Optimisation

Maintaining a count of records for a given key can have significant effect on a search. This is especially true in case of a complex query. For instance, if someone wants a list of books written by Philip Bailey and published by ACS Publications, we will assume that we have two indexes, one for author and another for publisher. Also, both point to the first and last records in the indexes, respectively for author and publisher. These concepts have already been discussed in detail. However, suppose they do not have a count field as yet.

In order to search the books written by an author for a particular publisher, there can be two approaches:

  1. First, get a list of all the books written by the concerned author — in this case, Philip Bailey — using the author index and chains. Having found them, look for each of these books in the publisher index and chains.

    OR

  2. First, get a list of all the books published by the concerned publisher — in this case, ACS Publications — using the publisher index and chains. Having found them, look for each of these books in the author index and chains.

Both schemes look equally good, but which one should we select? There is no fixed rule. We can either go for the first method, or the second. However, our choice could affect the time required to come up with an answer to the query.

For instance, suppose there are 50 books by Philip Bailey in the library and only three of them are published by ACS Publications. In addition, ACS has published two other books not authored by Philip Bailey. If we use the first method, we would walk through 50 entries in the author chain and then three in the publisher chain. Thus, we would need to go through 53 records altogether.

However, by any luck, if we had selected the second method of searching, we would first go to the publisher index and follow the publisher chain. Thus, we would go through only five records in the publisher chain. Using these five records, we would go through only five records in the author chain. This would mean we go through only ten records altogether. Finally, we would select three records in the author chain out of the five read, for Philip Bailey.

As we can see, depending on the selection, there can be significant performance improvement in answering queries. If one opts for the second method, the efforts required for answering the query will be much lesser. This process of resolving queries in a more efficient manner is called query optimisation. This can prove to be extremely important as the number of library members increases. Many of them would come up with different sets of queries and the librarian must be able to respond to those as quickly as possible.

Having count field in both the indexes would greatly help the process of query optimisation. In the example just discussed, if we would have had such a field in the author as well as publisher indexes, we could have decided about choosing either approach 1 or 2 based on the count. From author index count, we get a figure of 50. From the publisher index count, we get five. Clearly, there is enough evidence to start our query evaluation using the second approach. However, in the absence of a count field, there is no guarantee that either of the approaches would be better. Maintaining a count of records in the index thus helps to significantly improve results.

1.7.6 Indexed Organisation in Computer Files

We have studied the concepts of pointers, chains and indexes with a simple example. Computers organise indexed files in a similar manner. That is, one of the fields in the file is the primary key, as we have discussed earlier. This field identifies a record uniquely. We should note that the position of this field is the same in all the records. That is, in every record, the primary key field should occupy the same position. The idea is illustrated in Fig. 1.29.

It is interesting to study how computers maintain index files. We have noted that our librarian had to maintain the index separately. Computers do the same. There are various ways of achieving this objective. However, the basic idea is simple and common to most implementation of indexes.

Sequential files are worse than audio cassettes in concept. You must not only rewind them to go to a specific location, but you must rewind them completely! Indexed files are like CDs/DVDs; you can go to any specific location of the disk directly.

Fig. 1.29 Primary key positions in an indexed file

In order to create and maintain index files, a computer creates a data file and an index file. The data file contains the actual contents (data) of the record, whereas the index file contains the index entries.

The way files are organised in computers is as follows:

  1. The data file is sorted in the order of the primary key field values.
  2. The index file contains two fields: the key value and the pointer to the data area. Note the similarity between this organisation and the index mechanism employed by our librarian.
  3. One record in the index file thus consists of a key value and a pointer to the corresponding data record. The key value is generally the largest primary key value in a given range of records. The pointer points to the first entry within that range of data records.

This is best illustrated than explained, as shown in Fig. 1.30. Note that there are other ways of organising records in a data file.

Let us understand how this works. We will notice that there are two separate files: an index file and a data file. The index file contains two parts: an index value and a pointer to the data area. Every index value is the highest in a range of data values.

  • In the first index entry, the index value is C, which is the highest primary key value in the first data block. Note that the pointer from this index entry points to the start of this range (i.e. A). The address (i.e. memory location) of this on the disk is assumed to be 0, as shown.
  • Similarly, the second index entry contains F as the highest primary key value for that range of records, and a pointer to D, which is the start of the range, and so on. The address (i.e. memory location) of this on the disk is assumed to be 100, as shown.

Fig. 1.30 Indexed file organisation

An example of this is shown in Fig. 1.31.

This arrangement works fine. However, we would need to deal with two problems, as follows:

  1. How do we insert new index values between any two existing values? For example, suppose we now have an index value called as B1. Where do we store it? Clearly, we do not have any space left in the first range of A-C.
  2. What happens if the number of index values becomes too high? Would it not slow down the search?

Fig. 1.31 Searching the index

To answer the first question, a split is created. That is, when we need to add the new index value B1, the first line of index values (i.e. A-B-C) is split into two lines. Perhaps the new first line would now contain A-B, and the second line would contain B1-C. This is shown in Fig. 1.32.

Fig. 1.32 Split in index entries

We can similarly insert any new index entry anywhere. This would necessitate a split in the index, and appropriate adjustments in the address values.

Solving the second problem is equally interesting. We can create an index of indexes. In other words, we can create a multi-level index. In this type of index, the very first line does not point to the data items as before. Instead, it points to another lower-level index. Depending on the need, this lower-level index may point to yet another lower-level index, and so on. Only the final level of index points to the actual data items.

A multi-level index contains multiple lines of indexes, the final line pointing to the data items.

The benefit of this scheme is that we can create multiple levels of indexes, accommodating as many index entries as desired. This leaves enough space for future expansions (insertions). Of course, if even this strategy does not suffice, we would create another split in the index, as shown earlier.

A two-level index is shown in Fig. 1.33. To keep things simple, we have shown only two second-level indexes. There would, of course, be four such indexes in the real copy, corresponding to the four entries in the first-level index. We also show only a couple of pointers from the second-level index to the data area.

Fig. 1.33 Multi-level index

Of course, there are many other ways of arranging index values and providing pointers from there to the data areas. What we have shown is just one way of doing it. In fact, IBM's VSAM databases are based on a similar concept.

1.8 DIRECT ORGANISATION

1.8.1 Basic Concepts

There is a third popular type of file, called direct files. In some technologies, these files are also called relative files. There is no concept of an index (and hence that of a primary key) here. The idea is quite simple. All records in a direct file are of the same size. Every record has an associated record number. The record number serves the same purpose as a primary key in an index file. Thus, we can ask the computer to provide us with record number 13 (instead of asking for a record whose employee-ID equals A17, for example). In relative files we cannot search records based on a value in a field directly, as can be done in the case of an index file.

Direct files can be classified into two main types as shown in Fig. 1.34.

Fig. 1.34 Types of direct files

Let us examine these file types.

1.8.2 Non-hashed Files

Whatever we have discussed in the context of direct files so far applies to non hashed files. In such files, based on its record number, a record is placed in its appropriate slot. So, if a programmer instructs the computer to write record number 1 and then record number 100 to such a file, it would do so, keeping the 98 slots between 1 and 100 on the disk empty. We can treat the record number itself as the primary key. However, the drawback of the non-hashed file approach is the creation of too many empty slots. Hence, better schemes are required.

1.8.3 Hashed Files

In hashed files the record number (or the memory address corresponding to the record number) itself becomes an equivalent of the primary key, as before. However, there is also an intelligent conservation of space, unlike in the earlier approach.

The term hash indicates splitting or chopping of a key into pieces. In effect, the computer chops the key into pieces so as to avoid wastage of space, as illustrated subsequently. There are three primary hashing techniques, as shown in Fig. 1.35.

All data structures are based on human ideas of making information storage and search simpler and friendlier. We use pointers, chains and indexes manually without realising that computer applications do the same.

Fig. 1.35 Hashing techniques

The following discussion requires some mathematical concepts and can be safely skipped without any loss of continuity. The reader may instead choose to move to the example that follows the discussion, which explains the concepts in a simpler manner.

To understand these techniques, we need to be familiar with certain terminologies, as follows.

N = Number of records in the file

K = Set of keys that can uniquely identify all the records in the file

  • Division method: In this method, we choose a number M, such that M > N. It is better to choose a prime number as M. Then, the hash function H is defined as follows:

H (K) = K mod M

Just for clarity, K mod M means we divide K by M, and take the remainder of the division.

For example, if K is 9875, N is 58 and M is 97, then we have:
H (K) = 9875 mod 99 = 78.

  • Mid-square method: In this method, we take the square of K (i.e. K2). We then chop off digits from both the ends of K2. We will call this final value as L. Therefore, the hash function H is defined as follows:

H (K) = L

For example, if K is 9875, N is 58 and M is 97, then we have:
K2 = 97515625
H (K) = Middle two digits of K2 = 15.

  • Folding method: Here, the key, K, is partitioned into a number of parts, such as K1, K2, and so on. The parts are then added together, ignoring the final carry. Thus, the hash function, H, is defined as follows:

H (K) =K1 + K2 +…+ Kn

For example, if K is 9875, N is 58 and M is 97, then we have:
H (K) = 98 + 75 = 173; ignoring the leading 1, we have
H (K) = 73.

Figure 1.36 shows an example to illustrate these techniques.

Fig. 1.36 Examples of hashing techniques

Batch processing

Complex query

Data item

Division method

File

File reorganisation

Hashed file

Index corruption

Key

Master data

Multi-level index

One-way chain

Online query

Physical record deletion

Primary key

Real time processing

Record key

Record number

Secondary key

Sequential organisation

Two-way chain

Chain

Data file

Direct file

Field

File name

Folding method

Index

Index file

Logical record deletion

Mid-square method

Non-hashed file

Online processing

Performance improvement

Pointer

Query optimisation

Record

Record layout

Recovery of lost pointer

Sequential file

Transaction data

  • A file is a collection of logical information. Each file has an associated file name.
  • A file contains many records.
  • One record consists of many fields or data items.
  • The skeleton of a file is called its record layout.
  • Master data changes infrequently. Transaction data changes quite regularly.
  • In batch processing, there is no or little human intervention.
  • In online processing, information needs to be obtained from the computer in real time and provided to the user. Real time processing is a special case of online processing.
  • A field that identifies a record is called the record key, or just the key. A primary key identifies a record uniquely. A secondary key may or may not identify a record uniquely.
  • One file can have at the most one primary key, but zero or more secondary keys.
  • A sequential file contains records arranged in a sequential fashion.
  • A sequential file is simple and has lesser overheads. However, it is not easy to search or delete records from such a file.
  • A pointer is a special field that points to another record in the same or a different file.
  • One-way chains are created to link related information together in the forward direction.
  • Two-way chains are created to link related information together in the forward as well as the backward direction.
  • An index is a table of records arranged in a particular fashion.
  • Query optimisation means trying to improve the performance of the query.
  • An index file helps faster search of data.
  • In a direct file, every record is identified based on its record number.
  • Direct files can be non-hashed or hashed.
  • Hashing techniques include division method, mid-square method, and folding method.

  1. In our daily life, we store information in the form of files and registers.
  2. A file can contain only one record.
  3. The skeleton of a record is called its layout.
  4. Master data keeps changing every month.
  5. A file that identifies a record is called a record key.
  6. In sequential organisation, records are added as and when they are available.
  7. A pointer in a record contains the address of another record.
  8. Two-way chains suffer from lost/damaged references.
  9. The process of resolving queries in a more efficient manner is called performance improvement.
  10. Direct files are also called relative files.

1.  The details of a record are called _________.

(a) field

(b) record layout

(c) description

(d) file name

2.  ________ involves a batch of goods manufactured and passed along the conveyor belt.

(a) Real time processing

(b) Online processing

(c) Batch processb

(d) Online query

3.  A field that can identify a record uniquely is called _________.

(a) secondary key

(b) primary key

(c) record key

(d) data key

4.  A ________ contains the address of another record

(a) field

(b) data item

(c) key

(d) pointer

5.  A chain of records is a ________ created by using pointers.

(a) physical sequence

(b) logical sequence

(c) connection

(d) organisation

6.  A ________ is a table of records arranged in a particular fashion.

(a) file

(b) one-way chain

(c) two-way chain

(d) None of the above

7.  By using the pointer for the first record in the index, we can traverse the __________.

(a) forward chain

(b) reverse chain

(c) opposite chain

(d) two-way chain

8.  In order to create and maintain index files, a computer creates a ________ and a ________.

(a) data file, key file

(b) index file, key file

(c) data file, chain file

(d) data file, index file

9.  Inserting a new index entry necessitates a ____________ in the index.

(a) split

(b) join

(c) connection

(d) none of the above

10. Another word for index of indexes is _______.

(a) single-level index

(b) hashed file

(c) non-hashed file

(d) multi-level index

  1. Explain the concept of master and transaction files.
  2. What is sequential organisation? What are its advantages and disadvantages?
  3. What are pointers and chains?
  4. Explain the difference between one-way chains and two-way chains.
  5. How are indexes used?
  6. How are record counts maintained?
  7. What are complex queries? What is query optimisation?
  8. Explain indexed organisation of computer files.
  9. What are the different types of direct files? Explain with examples.
  10. What is the difference between non-hashed files and hashed files?

  1. Write down the layout for a student file containing the following fields:
         Roll number Number Maximum length 5
         Student name Character Maximum length 30
         Rank Number Maximum length 2
         Total marks Number Maximum length 3
  2. Create the above file as a table in MS-Access.
  3. Assume that the student file contains the following data. Show an index for the same.
    Roll number Student name Rank Total marks
    17 Atul 4 500
    90 Anita 3 712
    27 Jui 1 910
    56 Harsh 2 851
  4. Calculate the hash of the roll numbers by using all the three methods.
  5. Should the Total marks field in the above file be a secondary key? Justify your answer.
  6. Learn more about file organisation packages such as ISAM or VSAM.
  7. If you know a programming language, study its aspects related to file processing.
  8. The C programming language provides for extensive file processing. Learn how to handle files in C.
  9. The COBOL programming language facilitates the usage of three types of files. Investigate than and describe how they differ from one another.
  10. Study how different operating systems treat files. Contrast between the views of UNIX and Windows about files.