UNIX REVIEW December-January 1983 (Part 1 of 3)

UNIX DATABASE MANAGEMENT SYSTEMS

By Jon Roland

JON ROLAND is a consultant with Cyberian Computer Consultants, 206 East Nakoma, San Antonio, TX 78216, 512/340-7641, 340-8443, whose SoftevalTM division specializes in evaluating software and publishes reports to its subscribers. Cyberian is currently involved in porting various software packages between UNIX systems. Roland is also an associate of Cybermart, a computer dealer, and Director of the Vanguard Institute, which does research in several areas, including expert systems and problem-solving theory.

Please send duplicate copies of any comments or suggestions in connection with these articles to the author at the above address and to the Editor of UNIX REVIEW at POB 563, Denville, NJ 07936, 201/625-1797. Software submitted for review should be on Fortune 32:16 format disk.


Database management systems (DBMSs) were first developed to solve the problems faced by programmers and system managers who found themselves writing and using large systems of interacting applications programs, each of which did its own data entry, file management, retrieval, manipulation, and reporting. As much as ninety percent of the code in each application program was devoted to these functions. Much of that code was very similar from one program to the next. Data was often duplicated in files created by different programs. In this situation it was often extremely difficult to modify programs, to update data, to find and correct errors, and to make sure the data in the files of different programs was consistent. It was perceived that this situation wasted system resources, resulted in poor system performance and low programmer productivity, and sometimes gave disastrous results. A better way had to be found.

The DBMS solution was to take data management functions out of each of the many application programs and consolidate them into a single system of programs and data files which can be accessed by any application program through a standard interface. Application programs can then be smaller, more efficient, easier to write, and easier to modify. Data need only be entered once, and once entered, made available to the entire system. Minimizing duplication makes it much easier to keep the data consistent. Simple commands can make extensive transformations of data throughout the system.

With data management thus consolidated, general-purpose query and reporting commands could elicit information interactively in ways not anticipated when the system was set up. This led to the notion of the DBMS as a powerful general-purpose tool for a great many applications, each of which might not have sufficient value to justify writing a program to do it, but whose aggregate value could be greater than that of the original applications which gave rise to the DBMS. Putting DBMS commands into batch or command files with conditional branching and looping gives many DBMSs the characteristics of a programming language, so that they have come to be called "fourth-generation" or "non-procedural" languages.

Starting on large mainframes, several approaches to the design and development of DBMSs were tried. Some worked better for some applications than for others. Some had better performance, others were easier to use. As experience was gained, principles of design emerged and a body of theory developed, but progress was slow as long as most DBMSs were specific to the hardware and operating system and used for only a few high-value applications. The need to reduce programming costs encouraged implementation on minicomputers which expanded the variety of applications, facilitated new approaches, encouraged the development of standard packages, and accelerated theoretical development. Improvement in programmer productivity by factors of ten to fifty became common. The advent of low-cost microcomputers and their widespread use led to attempts to implement standard DBMSs on them for use by non-programmers.

Unfortunately, DBMSs bring performance problems of their own. For many applications, a DBMS can push a system to its limits and give unacceptable performance. The limitations of most single-processor 8- and 16-bit micros have discouraged the use of DBMSs on them for many of the purposes for which DBMSs were originally developed, such as complex transaction processing and accounting applications, and procedural languages like Basic, COBOL, Pascal, and C continue to be favored in this situation. Despite such limitations, DBMSs for 8- and 16-bit micros are increasingly being used for a variety of smaller applications, and are approaching the popularity of spreadsheets and word processors. Such products as Condor, dBase II, and FMS-80 have made the notion familiar to millions.

The advent of low-cost 32-bit micros with abundant RAM and fast hard disk drives has shifted the balance between procedural and non-procedural languages. Such systems now offer the performance needed to make DBMSs a practical alternative to procedural languages for the development of complex or specialized applications. Reduced programming costs on these systems promises to more than offset the greater cost of the hardware.

In many ways, UNIX provides an especially congenial environment for the development of DBMS packages and applications. It provides many facilities which developers of DBMS packages might otherwise have to write. Some DBMS packages, such as /rdb, are simply a collection of database utility programs run using the shell as the command interpreter, and as such function as natural extensions of the UNIX environment. Unfortunately, many of the features of UNIX that make it the favorite of programmers can impede the implementation of several important features of DBMSs and can impair their performance. Some DBMSs try to work around UNIX limitations. Others make modifications to UNIX. We can expect that a major driving force for further modification of UNIX will come from this direction.

Listed at the end of this article are a number of DBMS packages that are available under UNIX. More will appear in the months ahead. They differ greatly in their capabilities, performance, and suitability for various users and uses. Many are still under active development. This is an introduction to several reports on an ongoing evaluation of DBMS and other packages that run under UNIX. An attempt will be made to develop benchmarks which reveal not only quantitative performance, but innovative approaches, hidden flaws, and subjective features which affect their suitability for various purposes. The benchmarks to be used for these published reports will consist of elementary operations like inserts, deletes, sorts, selects, projects, and joins, and some simple, typical applications of general interest. Readers are encouraged to suggest other approaches and applications of particular interest to them, and to report results of their efforts to use such packages. DBMS vendors are encouraged to suggest benchmarks which best reveal their product's strengths and the weaknesses of their competitors. The remainder of this article will focus on some of the factors which affect the performance of DBMS packages and their suitability for various purposes.

A GENTLE INTRODUCTION

The term model denotes any formal structure for the representation and manipulation of information. In this sense, a computer language, a spreadsheet, a word processor, an accounting package, or a DBMS are different kinds of models, designed to serve somewhat different purposes within the constraints of the system on which they are to be used.

Consider how two models, a spreadsheet and a DBMS, might be used to represent the sales figures for various items of merchandise for each of the months of the year, with annual totals for each item, monthly totals for all items, and a grand total for all items for the entire year. In Figure 1, the values have been entered into a spreadsheet. Some cells have been assigned the names of the items and of the months. Others contain the numeric values of the sales figures. Still others contain the values of functions which compute the sums by item, by month, and for the year. In Figure 2, we see the same information displayed in a database report.

In a spreadsheet, we get the monthly totals by entering a formula for summing the sales figures for one month, then enter a command to replicate the formula in the 11 cells to the right, using a relative rather than absolute reference to the cells on which the sum is to be calculated. In a DBMS, we execute a command which lists the data in columns and sums the contents of each column. In a spreadsheet, the information remains spread out on the screen, or at least as much of it as the screen will allow, and the user can see the effects of the changes being made. To see part of the spreadsheet not displayed, the user can scroll horizontally or vertically until the cells of interest come into view. In a DBMS, display of the data after it is entered is not automatic, but requires a query or report command like that shown.

For this example, there would seem to be little basis for choosing one model over the other. But now consider that the sales figures are hourly over a period of five years for 999 items. Now we have a spreadsheet with about 15,000 columns or a database file with a like number of rows or records. Many DBMSs can handle this, but unless the spreadsheet or the system it runs on has a virtual-memory capability, like Q-Calc, so that it can page its contents to and from disk, it will not fit in the available memory of most systems.

Now consider Figure 3, an amortization schedule. The balance on each row is the sum of the value of the interest on the same row and the balance from the previous row, minus the value of the payment on the same row. The interest is the rate of interest multiplied by the value of the balance on the previous row. Most DBMSs can't handle this situation, because they only permit calculations involving fields within the same row or record. A spreadsheet can handle it, provided that the number of payments to be amortized is not too large. One way to partially get around this is to interface a DBMS to a spreadsheet and read into a spreadsheet as much of a database file as will fit. The spreadsheet does the calculations on the values, which are then written back into the database file.

The basis for the two different models is the fact that fast main memory is expensive. If RAM were as cheap as disk storage, we could combine the spreadsheet and DBMS model. As the cost of RAM drops, we can expect to see DBMSs and spreadsheets converge in functionality. Virtual-memory spreadsheets like Q-Calc are leading in that direction.

Because DBMSs are designed to handle large quantities of data, they need some capabilities that spreadsheets usually don't have. A database record may contain more fields than will fit on one line of a video terminal, or even on the entire screen. People who use a DBMS typically want to be able to use a data entry form and to generate reports using a free format that is easier to create than the layout of a spreadsheet. Finding a value on a spreadsheet is usually just a matter of scrolling until it turns up, but that's not feasible in a large database, in which one wants to be able to find and gather together scattered items that satisfy some condition, or to sort items in various ways, or to put together items that have matching fields.

TYPES OF DBMS

There are three kinds of DBMS models: hierarchial, network, and relational. A hierarchical DBMS has a tree-structure that looks like Figure 4, similar to the UNIX directory system. All references from one item to another must trace a unique path up the tree to the common node, then down to the other item. This simplifies storage and retrieval, makes it easier to modify, and often has a better performance than other designs. However, only a few kinds of applications, have data which naturally fit into a rigid hierarchical framework. The main contenders are the network and relational models.

To understand these alternatives, consider the problem of representing a college, its students, faculty, advisors, classrooms, courses, and periods. Each student is registered for one section in each of several courses, and during various periods on various days, they meet in classrooms, each taught by one teacher. Each student also has one advisor, who is also a teacher.

What is the best way to represent this data? One way could be as the first two tables shown in Figure 5. The first and second columns are for the course and section. The third is for the classroom, the fourth for the teacher, and the next thirty columns for the students. However, some sections have as few as five students, and two have more than thirty. If we want a class list with the students in alphabetic order, we find that we don't have a convenient way to sort values in columns, only entire columns. The names of classrooms, teachers, and students are duplicated many times throughout the table. We could also use just one column for student names, as in the third table in Figure 5. Then it wouldn't matter how many students were in a class, but there would be more duplication of courses, sections, classrooms, and teachers. If one of the students gets married and changes his or her name, it has to be changed everywhere it appears. lf you miss one, it is likely to cause trouble later. The second table in Figure 5 is simpler, though. We put the student names in column one, the name of his or her advisor in column two. The student names aren't duplicated, but those of the faculty advisors are.

The basic notion involved in relational database models is the relation. In set theory, a relation between two sets of discrete objects can be represented by a table in which pairs of related objects are entered into columns opposite one another. By extension, relations among more than two things can be represented by tables with more than two columns, or by a collection or table of tables. There is, in general, more than one way to represent a relation in a table, but not all tables properly represent relations. To do that, they must satisfy certain restrictions. The simplest such restriction is that rows in a table not be duplicated. Another is that at least one, and perhaps more than one, of the columns function as a key for indexing (and usually sorting) the rows. There are also some additional restrictions which they must satisfy, called being in normal form. There are several normal form restrictions. The first is that each of the values not be a composite of several values. As we can see from our example above, it is not always easy to choose the best way to put related things into a table, especially if you are trying to minimize the amount of space, the number of duplications, and the time it takes to insert, delete, retrieve, sort, or otherwise manipulate their contents.

HOW TO SPOT A RELATIONAL DBMS

The class of relational DBMSs is not yet well-defined, but in general, to be considered minimally relational, we look for the following features:

  1. Data must appear to the user as values in a time-varying collection of tables with various numbers of columns that are in at least the first normal form.
  2. There are no user-visible links between such tables.
  3. The system has at least three operators: a select operator that produces a new table consisting of selected rows from the original; a project operator that produces a new table consisting of selected columns from the original; and a join operator that puts two tables together by concatenating rows that share the same key.

And to be considered fully relational:

  1. The system has all the operators of the relational algebra, such as union, intersection, and difference, without resorting to iteration or recursion, and without restrictions due to predefined access paths.
  2. The system supports entity integrity: a key value may not be null; and referential integrity: if a value in one table refers to a key value in another table, that other table must have that key value in one of its rows.
  3. The system supports insert, update, and delete functions.

When it comes to network DBMSs, there is little agreement on definitions other than the CODASYL standards, which only cover one-to-many relations. As with relational DBMSs, the foundation notion is the set, but now considered not as a subset of a cartesian cross-product of sets, but as objects connected by one-way links. In Figure 6, we see this represented, with the set teacher linked by arrows to the members of which it is the owner. In Figure 7, the relation teaches between Teachers and Students is represented by a collection of links from each teacher to the students he or she teaches, either in detail, as on the left, or schematically, as on the right. All of the main sets, such as teacher, student, course, and classroom, are considered members of the system or database set. This representation can handle many-to-many relations, and provide for linkages between them. We can, as in Figure 8, derive the relation teaches by linking teachers to course sections in one relation, students to course sections in another, and defining the relation teaches as the composition of the two.

It must be emphasized that the differences between these two models is in the way they are perceived by the user, and therefore in the syntax of the languages used to define and modify the database schemata and manipulate the data The tables of some relational DBMSs are implemented as simple ASCII text files that look just like the result of a list (or cat) command but this is inefficient. Data values in both models are more often represented internally by pointers, that is, by the addresses of the values. Links in the network model are also usually represented by pointers, either direct or indirect (that is, pointers to pointers). The internal structure of such models may be quite different from the way they appear to the user, but they usually reflect the models, so that the choice of model indicates the performance that can be expected.

Each of these models has its proponents and its claimed advantages. Both are still under development, and may ultimately converge. In the meantime, certain generalizations can be attempted with heavy qualifications. Relational DBMSs (rDBMSs) are usually much easier to use, especially for quick, simple, ad hoc applications. Relations are explicit. Schemata are easy to discern. Queries are easy to formulate without reference to some chart showing how the data items are related to one another. They are generally easier to implement (some would dispute this), to restructure, and to recover from system crashes. Access security is usually easier to implement and use. Finally, it has the benefit of a well-developed body of mathematical theory which permits, in many cases, rigorous validation of rDBMS constructs and operations. In other words, if a DBMS satisfies the conditions for being fully relational, it is possible to prove that its operations will work for all values and to predict its performance without having to test it extensively.

On the other hand, network DBMSs (nDBMSs) offer the potential for being more efficient, at least for some applications. An rDBMS tends to have a great deal of redundancy in that a given data value may appear many times in many different tables. Like sorting algorithms, no one is best for all situations. A multi-level, multi-dimensional, indexed, dynamic array of pointers like that used by MDBS III is likely to run faster and use less memory than a typical rDBMS for larger applications, but it is more difficult to use. If the relations and transformations are sufficiently complex, the time and memory requirements of that pointer array may not always be much less than a comparable rDBMS would be, especially when requirements for crash recovery, integrity, concurrency, and security are implemented, or if the DBMS is distributed over a network. As the complexity of the data relations and transformations increases, both the performance of the DBMS and the productivity of the programmer may decrease relative to what they would be if the situation was modeled using a custom-written procedural language.

For evaluation purposes, there are difficulties in comparing rDBMSs and nDBMSs. Both have insert and delete functions, but one can't compare the performance of the join operation of an rDBMS with an nDBMS that doesn't have one. The join operation tends to be a critical bottleneck in the performance of rDBMSs. It is important because it corresponds to what we normally think of as the composition of two relations, as in Figure 8. It is used a great deal, so how well it performs largely governs the overall performance of the rDBMS. Difficulty in implementing it is one of the main factors leading to departures from the pure relational model. Hybrid DBMSs like Unify resemble an rDBMS, but maintain a system of links between relations that function as a kind of "prejoin" in place of the join operation. This improves performance at the cost of formal purity. The tradeoff may or may not be worth it, depending on the particular needs and preferences of the user.

A difficulty of the hierarchical file system of UNIX is the fact that using direct and indirect nodes, a logical access may require as many as four or five physical accesses to accomplish it. This leads to departures from the UNIX file system like that of Unify, which allows you to partition the disk and set up a dedicated disk area for higher performance.

For smaller or short-term applications, or for interactive use by non-programmers, ease-of-use can be a more important consideration than sheer performance. A great deal of thought is being given to the development of a language or languages for database definition, data manipulation, query, and report generation. Most try to use an English-like syntax to ease the transition for non-programmers without sacrificing rigor or functionality. This approach runs the risk of confusing users who bring habits of natural-language usage incompatible with the restricted syntax of these languages. Perhaps the most prominent of these is SQL, a query language developed for use on IBM mainframes. It has been widely imitated, and some DBMSs, such as Unify and Knowledgeman, have attempted to duplicate it with only minor departures. Others have attempted to developed a single language that is both comprehensive and more natural than SQL or its derivatives. Some limit such languages to database management functions, and provide an interface to other procedural languages for those functions not provided by the DBMS which might be needed in a complete application. Others are extending their DBM languages to encompass all of the functionality of previous procedural languages, reducing or eliminating the need for programming in other languages.

Although these languages are sometimes called "non-procedural", this is a misnomer. What are emerging are languages that are actually procedural at a higher level, with far-reaching DBM primitives. A previous generation of programming was done at two levels: The first includes machine-dependent assemblers. The second includes algorithmic/procedural languages like COBOL and Basic. What we now see, especially in the UNIX environment, is the emergence of two different levels, the lowest being C, which bridges the two previous levels, and a new second level consisting of the DBM languages. The level represented by C is still the province of professional programmers, but since it is not machine-specific, most programmers can look forward to being able to write programs that can run on anything. The level represented by the DBM languages, on the other hand, is accessible to non-programmers, and as their power and functionality increases, the need for programmers for applications programming is likely to diminish, and be replaced by the need for broader systems analysis and management skills.

There is a level beyond that of DBMSs. This is the level of artificial intelligence and expert systems. It is now clear that what we call "intelligence", whether human or artificial, involves a great deal of highly-structured knowledge of the world, and that "intelligent" systems are going to have to be built upon sophisticated, efficient DBMSs. The efforts now being made to develop practical DBMSs for common commercial use are likely to furnish the tools needed for the development of the "fifth generation" of "intelligent" systems.

To compare these very diverse DBMS packages one must try to use them to represent the same data entry procedures and generate the same reports. It is difficult to be sure one has chosen the most efficient way to do this. This process of evaluation will go on for a long time. These reports do not pretend to offer any final or canonical answers, only preliminary or propaedeutic ones.

REFERENCES:

commUNIXations, Apr/May 1983. Newsletter of /usr/group, POB 8570, Stanford, CA 94305-0221. Entire issue devoted to DBMS.

Date, C.J., An Introduction to Database Systems, Vol. 1 & 2. Reading, MA.: Addison-Wesley, 1982 & 1983.

Holsapple, Clyde, A Primer on Database Management Systems. Lafayette, IN: Micro Data Base Systems, Inc., 1980.

Kruglinski, David, Data Base Management Systems. Berkeley, CA: Osborne/McGraw-Hill, 1983.

Nierenberg, Nicolas C., "Unix-based data base fits 16-bit systems," Electronics, Aug. 11,1983. McGraw-Hill.

Schmidt, Joachim W., and Brodie, Michael L., ed., Relational Database Systems, Analysis and Comparison. New York: Springer-Verlag, 1983.

FOR MORE ADVANCED STUDY:

Hayes-Roth, Frederick, et al., ed., Building Expert Systems. Reading, MA: Addison-Wesley, 1983.

Mamdani, E.H., and Gaines, B.R., ed., Fuzzy Reasoning and its Applications. New York: Academic Press, 1981.

PRODUCTS AND VENDORS:

IDOL
SMC Systems and Technology, Inc.
1011 Route 22, POB 6800
Bridgewater, NJ 08807
201/685-9000
INFORMIX
Relational Database Systems, Inc.
1208 Apollo Way #503
Sunnyvale, CA 94086
408/746-0982
INGRES
Relational Technology, Inc.
2855 Telegraph Ave. #515
Berkeley, CA 94705
415/845-1700
MDBS III
Micro Data Base Systems, Inc.
POB 248
Lafayette, IN 47902
317/463-4561
MISTRESS
Rhodnius, Inc.
POB 1, Station D, Scarborough
Ontario, Canada MlR-4Y7
416/922-1743
N-GEN
Conetic Systems Inc.
12025 Northeast Marx #5H
Portland, OR 97220
503/253-0074
ORACLE
Relational Software, Inc.
300 Sand Hill Rd.
Menlo Park, CA 94025
415/854-7350
PROGRES
Data Language Corporation
5 Andover Road
Billerica, MA 01821
617/663-6500
Q-CALC
Quality Software Products Company
348 South Clark Drive
Beverly Hills, CA 90211
213/659-1560
/rdb
Multinational Computer Software, Inc.
162 Fourth Avenue
San Francisco, CA 94118
415/387-9407
RUBIX
Infosystems Technology, Inc.
6301 Ivy Lane
Greenbelt, MD 20770
301/345-7800
SEED
International Data Base Systems
2300 Walnut St #734
Philadelphia, PA 19103
215/568-2424
SEQUITUR
Pacific Software Manufacturing Co.
2600 Tenth St.
Berkeley, CA 94710
415/540-0616
TROLL
Medical Information Science
University of California
San Francisco, CA 94143
415/666-2951
Anthony I. Wasserman
UNIBASE
RLG Corporation
YARD1760 Reston Avenue #508
Reston, PA 22090
703/471-6860
UNIFY
Unify Corporation
9570 W. Barbur Blvd.
Portland, OR 97219
503/245-6585

Figure 1

Jan     Feb     Mar     Apr     May      Jun     Jul
Widgets  151.00  152.00  153.00  154.00  155.00   156.00  157.00
Doodads   27.00   30.00   33.00   36.00   39.00    42.00   45.00
Tregobs  383.00  381.00  379.00  377.00  375.00   373.00  371.00
Totals   561.00  563.00  565.00  567.00  569.00   571.00  573.00
      SUM(R[-3]C:R[-1]C)

         Aug     Sep     Oct     Nov     Dec     Totals
         158.00  159.00  160.00  161.00  162.00  1878.00  SUM(RC[-12]:RC[-1])
          48.00   51.00   54.00   57.00   60.00   522.00
         369.00  367.00  365.00  363.00  361.00  4464.00
         575.00  577.00  579.00  581.00  583.00  6864.00

Figure 2

Months   Widgets    Doodads    Tregobs    MonTotal

Subtotals   Jan       151.00      27.00     383.00     561.00
            Feb       152.00      30.00     381.00     563.00
            Mar       153.00      33.00     379.00     565.00
            Apr       154.00      36.00     377.00     567.00
            May       155.00      39.00     375.00     569.00
            Jun       156.00      42.00     373.00     571.00
            Jul       157.00      45.00     371.00     573.00
            Aug       158.00      48.00     369.00     575.00
            Sep       159.00      51.00     367.00     577.00
            Oct       160.00      54.00     365.00     579.00
            Nov       161.00      57.00     363.00     581.00
            Dec       162.00      60.00     361.00     583.00

Totals               1878.00     522.00    4464.00    6864.00

COMPUTE SALESDATA.MONTOTAL = WIDGETS + DOODADS + TREGOBS
LIST SALESDATA FOR YEAR = 1982 SUBTOTAL BY MONTH TOTAL
     WIDGETS, DOODADS, TREGOBS, MONTOTAL TO PRINTER

Figure 3

Yr Mo Da   Payment    Interest   Paydown    Balance

82  2 15       0.00       0.00       0.00   45600.00
82  3 15    1000.00     684.00     316.00   45284.00
82  4 15    1000.00     679.26     320.74   44963.26
82  5 15    1000.00     674.45     325.55   44637.71
82  6 15    1000.00     669.57     330.43   44307.27
82  7 15    1000.00     664.61     335.39   43971.88
82  8 15    1000.00     659.58     340.42   43631.46
82  9 15    1000.00     654.47     345.53   43285.93
82 lO 15    1000.00     649.29     350.71   42935.22
82 11 15    1000.00     644.03     355.97   42579.25

Figure 4

Figure 5

Course    Section   Classroom Teacher   Stud#Ol   Stud#02   ...   Stud#30

  Phys201   01        A403      Smith     Ryan      Adams     ...   Kewitz
  Chem303   01        M888      Parsons   Morrison  Kewitz    ...   Ryan
  CompSclO2 01        C210      Stein     Ryan      Morrison  ...   Adams

  Student   Advisor

  Adams     Stein
  Kewitz    Smith
  Morrison  Parsons
  Ryan      Jones

  Alternatively:

  Course    Section   Classroom Teacher   Student

  Phys201   01        A403      Smith     Ryan
  Phys201   01        A403      Smith     Adams
  ...       ...       ...       ...       ...
  Phys201   01        A403      Smith     Kewitz
  Chem303   01        M888      Parsons   Morrison
  Chem303   01        M888      Parsons   Kewitz
  ...       ...       ...       ...       ...
  Chem303   01        M888      Parsons   Ryan
  CompSclO2 01        C210      Stein     Ryan
  CompSclO2 01        C210      Stein     Morrison
  ...       ...       ...       ...       ...
  CompSclO2 01        C210      Stein     Adams

Figure 6

Figure 7

Figure 8


Text File WordPerfect File Part 2 Part 3