Observations on the ODMG-93 Proposal for an Object-Oriented Database Language Won Kim UniSQL, Inc. 9390 Research Blvd. Austin, Texas 78759 1. What It Is and What It Is Not Although it has many problems, ODMG-93 is an important and positive contribution as a starting point in industry-wide efforts to define a standard object-oriented database language. The ODMG-93 specification may be summarized simply as follows: - Despite its hasty claim, it is NOT a "standard". Rather, it is a work-in-progress proposal for an object-oriented database language and language bindings to it for C++ and Smalltalk. ODMG (Object-Oriented Database Management Group) is not a formal "standards" body. It is a committee formed by five vendors of first-generation object-oriented database systems (OODB). (For several years, some of these vendors have offered products that are not much more than persistent storage managers for object-oriented programming languages, but the misleading label "Object-Oriented Database System" has been stuck on such products in the market.) And now the misleading label "standard" seems to be attached to the ODMG-93 work-in-progress proposals. - It espouses a database architecture that consists of a database management system that supports an object-oriented database language, and language binding layers on top of it for specific object-oriented programming languages. In particular, the specification consists of proposals for C++ language binding and Smalltalk language binding to a database language' and the database language is specified in proposals for a data definition language, a data manipulation language, and a query language. Of these proposals, the C++ language binding is the most mature, as these vendors have a lot of expertise in providing persistent storage for C++. However, in my view, the query language (at least as it is currently presented) is woefully deficient; and the data definition language and data manipulation language have problems, but are a reasonable basis for further work. - In essence, the database language proposal is an Object SQL. It uses the familiar SELECT FROM WHERE clauses of SQL. It includes the ORDER BY and GROUP BY clauses; the aggregation functions MIN, MAX, COUNT, SUM, and AVG; the set UNION, INTERSECTION, and DIFFERENCE queries, the existential quantifier predicate EXISTS, etc. It even has the SQL-2 facility called "derived table" to allow the SELECT clause of a query to contain another query. Besides these, the language includes facilities for defining and manipulating compound data structures (i.e., sets, bags, lists, arrays, and structs) that SQL does not support. - The major problems and deficiencies in the ODMG-93 database language are due to the fact that the database language does not subsume the facilities in SQL (despite the fact that the data model on which the database language is based, the Core Object Model espoused by the Object Management Group (OMG), fully subsumes the relational model). The database language is missing some important elements of SQL, including views, dynamic schema changes, and access authorization. Further, the current language includes many little features that will add up to a "meta-data management" nightmare, such as naming each individual object, maintaining multiple overlapping subsets of all objects in a type, etc. The goal of ODMG, with respect to the database language (but not the language bindings for C++ and Smalltalk), is largely identical to that of the X3H2 Database Standards Committee (i.e., the SQL-3 Committee); namely, the development of a database language for an object-oriented database as a post-relational database. SQL-3 is envisioned in essence as an Object SQL which extends SQL-2 with facilities for defining, manipulating, and querying an object database as a superset of a relational database. The SQL-3 committee includes all major relational database vendors, and some OODB vendors. Although members of ODMG are supposedly to implement the ODMG-93 proposal within 18 months, at this point it is not clear how many of the members will actually do so, and how much of the proposal they will implement. The technical challenge of implementing the full ODMG language, especially the automatic query optimizer and query processor, is one that is unlikely to be met in 18 months. ODMG had five original members: Object Design, Inc., O2 Technology, Versant Object Technology, Objectivity, Inc., and Ontos. However, Versant Object Technology has opted for a joint development and marketing of an Object SQL product based on UniSQL's SQL/X object-oriented SQL. Objectivity, Inc. has announced a plan to deliver an object-oriented SQL product by early 1994 which extends a pure SQL language processor they licensed from an SQL vendor. (Objectivity claims that they will offer both the SQL interface product and an ODMG-based query language product.) Ontos has had their own (limited) Object SQL for some time. O2 Technology claims that it already supports ODMG-93. (Although it is not a member of ODMG, UniSQL also supports most of the features found in ODMG-93, albeit in somewhat different syntax.) The ODMG-93 database language specification, from an academic perspective, represents a much more concrete progress than the current confused state of SQL-3. In fact, despite some problems, its treatment of compound data (sets, bags, arrays, structs) -- including the definition and creation of a compound data, and retrieval of the elements in a compound data -- is a significant accomplishment. The ODMG-93 database language also accounts for nested data (e.g., parts explosion), methods, and inheritance in the language to some extent. However, the ODMG-93 database language has many major problems and deficiencies, and is premature to be considered a "standard" for "anything". The biggest problem with the ODMG-93 database language is that it fails to recognize that the OMG Core Object Model on which it is based to is a superset of the relational model of data, and as such the ODMG database language should be a superset of ANSI SQL. For example, a type may have attributes, relationships, and methods; whereas a table in a relational database may have only attributes -- in other words, a type subsumes a table. Further, the domain of an attribute under the OMG Core Object Model may be a primitive type (e.g., integer, float, char, date, time, money) or a compound type (e.g., set, bag, list, array, struct, enumerated type); whereas the domain of an attribute of a table in a relational database may only be a primitive type -- in other words, the set of types supported under the Core Object Model subsumes the set of types supported in relational databases. The failure to recognize this simple fact has led to the current language which is similar to SQL but not compatible with SQL. Given that ODMG does not include any relational database vendors as members, and that relational database vendors are working to develop the SQL-3 object-oriented SQL, ODMG needs to address the major technical problems in the database language parts of ODMG-93 and work with the ANSI X3H2 SQL-3 Committee to arrive at a single standard for a database language. The C++ and Smalltalk bindings, as specified in ODMG-93, should still be able to work with the database language. 2. ODMG-93 Database Language 2.1 What a Database Language Is Before we proceed, it is important to briefly review what a database language is. A complete database language, such as SQL, consists of three sublanguages: data definition language (DDL), query and data manipulation language (DML), and data control language (DCL). The DDL is used for specifying the structure and integrity conditions on the database schema. In a relational database, the DDL is used to define tables, attributes in a table, the domain of an attribute, and constraints on an attribute or a table. In an OODB, the DDL should be used to define types, attributes in a type, the domain of an attribute, and constraints on an attribute or a type. The DDL for an OODB, however, must be richer than the DDL for a relational database. This is because an OODB is supposed to admit additional types of information which relational databases do not; namely, methods, compound data, nested data (e.g., parts explosion), and inheritance and IS-A relationships among types. The DML is used for creating (inserting), updating, and deleting data that populates the database schema. The query language is used for nonprocedurally retrieving a subset of the database that satisfies user-specified search conditions. In a relational database, the DML is used to populate the database based on the database schema defined using the DDL. Once the database has been populated, the query language and DML are used to retrieve data from the database and to update and delete the contents of the database. In an OODB, the DML and query language must serve the same purposes. Because the DDL of an OODB allows additional types of information to be represented (and therefore stored in the database), the DML and query language must necessarily be more powerful than SQL in order to access and manipulate the additional information stored in the database. The data control language is used for managing transactions (i.e., commit work, rollback work, restart), for controlling the access of the database by multiple users (i.e., grant or revoke access authorization), and for managing database resources (create index, drop index), for enforcing database integrity based on user-specified conditions (create trigger), etc. Of the sublanguages, the query language is the most involved and difficult to design and implement. Efficient evaluation of a query is a major determinant of the performance of a database system. Unfortunately, this is the part of the ODMG-93 database language that is given the most skimpy treatment. 2.2 Technical Contributions of ODMG-93 Database Language In my view, the most important contribution of the ODMG-93 database language is the specification of compound data types in the ODL, OML, and OQL. The ODMG-93 database language provides a broad set of facilities for defining and creating each type of compound data, facilities for accessing elements of a compound data (get first, get last, get i-th element), and facilities for manipulating more than one compound data (e.g., flatten a list, concatenate lists, compute the union of bags, compute the intersection of bags, etc.). SQL has no such facilities. I note, however, that compound data is not one of the primary object-oriented concepts, namely, encapsulation and inheritance. In my view, the ODMG-93 database language proposal on compound data management addresses the technical challenges in making programming languages persistent and also one of the longstanding deficiencies in relational database systems. A second important contribution of the ODMG-93 database language is the introduction of strong typing in the OQL. The user must indicate the type of the result of a query or the database language processor will infer the type of the query result. A strongly typed language makes it possible for invalid operations to be detected at compile time, rather than at run-time. (The ODMG-93 database language proposes a simplified syntax for creating new objects (rather than using the SQL INSERT statement) and accessing named objects (rather than using the SELECT FROM WHERE query). In my view, however, this is not a significant contribution, since relational database systems also offer call-level interfaces to create and access records in much simpler ways than using the SQL INSERT or SELECT statements.) 2.3 Technical Problems of ODMG-93 Database Language The ODMG-93 database language has the following three types of technical problems that need to be addressed. The OQL in its current form is missing almost all of the SQL extensions that are required to account for the additional types of information that the ODL and OML allow to be defined and stored in the database. In particular, the OQL has virtually no mention of queries that involve methods or nested data; and it does not define queries that involve a type hierarchy. (The France-based O2 Technology, who authored the OQL part of the ODMG-93 "specification", claim that they already support all these features. However, the current OQL "specification" merely includes one simple example that shows the use of a method in a query, and one simple example that shows a path expression -- and no additional semantic or syntactic specifications. A specification for each such feature is much more involved than can be described in such a skimpy manner. Below, I provide a brief review of the motivation and issues for these features.) - It proposes a different syntax and terminology from SQL even for those facilities whose semantics are identical or compatible with SQL. For example, it uses different keywords for the set difference operation (it calls it EXCEPT), GROUP BY HAVING (it calls it GROUP BY WITH), ORDER BY (it calls it SORT BY), etc. - It has not adequately accounted for the semantic consequences of object-oriented concepts on a query language. In particular, the semantics of queries involving methods and nested data are woefully deficient, and queries involving an inheritance hierarchy of types are not even included. - It is simply missing some major features found in SQL and relational database systems: views, access authorization, triggers, and dynamic schema changes. It also has no SQL-like statements for creating new objects, updating a set of objects based on query search conditions, etc. The first two problems above arise because the OQL of ODMG-93 is an object-oriented SQL; and because the object-oriented data model on which the ODMG-93 database language is based fully subsumes the relational model, the OQL should simply subsume SQL. In other words, the OQL should simply be designed such that if the users will not use any object-oriented extensions to SQL, it should naturally degenerate into the standard SQL. The second problem above, the seemingly gratuitous syntax and terminology change, may be eliminated by simply accepting the SQL standard. In the remainder of this section, I will discuss the major problems and omissions in the ODMG-93 database language in some detail. Deficiencies of Query and Data Manipulation Language The OML does not support inserts, updates, and deletes that are based on query specifications; that is, OML offers no means of inserting, updating, or deleting more than one objects based on their satisfying arbitrary search conditions. Standard SQL supports such facilities. An attribute that holds a value is a special case of a method; that is, an attribute has a method for reading the value and a method for updating the value. An attribute is an essential element in a search condition (predicate of the form "attribute_name comparison_operator value_expression; e.g., Salary > 50000). It should be possible to allow a method anywhere in a search condition where an attribute name may be used (e.g., RetirementBenefits > 30000, where RetirementBenefits is a method). But it is very difficult to support queries that involve arbitrary user-supplied methods. There are such issues as whether methods will reside on the client or server in a client/server environment, and whether even unsafe methods (methods with unpredictable side effects) should be allowed in queries. Further, arbitrary methods are not amenable to automatic query optimization. These considerations impose practical limitations on allowing queries that involve methods. The OQL provides no such considerations. The notion of a "path query" has been well-developed by database researchers during the past decade. A path query is a query written against nested data, by specifying search conditions against nested data. A path query contains, instead of just an attribute name, a sequence of attribute names (called a path expression). For example, a Person type may have an attribute named Hobby; the domain of Hobby may be an Activity type; and the Activity type may have an attribute named Number_of_Participants. Then it should be possible to issue a single query that says "find all persons whose hobby includes an activity that involve four or more participants." The WHERE clause of the query may contain a predicate "Person.Hobby.Number_of_Participants > 4". The OQL specification, except for one short paragraph in which one simple path query is illustrated, does not define the semantics of a path query. A path expression contains a sequence of attributes. When any of the attributes in a path expression has a compound data as its domain, the meaning of the path expression becomes complicated. Further, a path expression should allow methods as well as attributes. A path expression also makes it difficult for automatic query optimization and processing. The OQL provides no such considerations. The OQL allows a query on only a single type, but does not allow an inheritance-hierarchy query (i.e. a query that is targeted to an entire type hierarchy). An inheritance hierarchy of types implies an IS-A relationship; that is, an entity represented by a subtype "is a kind of" an entity represented by a supertype. For example, an Employee type "is a kind of" a Person type, where Employee is a subtype of Person (and conversely, Person is a supertype of Employee). Sometimes it makes sense to enquire about Person objects; and other times it is useful to be able to enquire about all types of Person (i.e., Person objects and Employee objects collectively). Deficiencies of Data Definition Language The ODL has an important omission, namely view support. Relational database systems all support views as the external schema over the conceptual schema of an integrated database, and are used as a unit of access authorization. A user may define a view over a table by, for example, omitting certain attributes, and authorize another user to access the database only through the view. The utility of views remains the same for object-oriented databases; it is not true at all that somehow object-oriented databases obviate the need for views. In fact, without views, access authorization can only be supported partially. I note that the semantics of views for object-oriented databases are much more involved than their counterpart for relational databases. This is because a view is almost like a table in a relational database. If this semi-equivalence of a view and a table is to be transferred to an object-oriented database, a view should be almost like a type; which means that views may form an inheritance hierarchy, a view may be used as the domain of an attribute, and a view may have methods as well as attributes. The ODL also is missing facilities for making dynamic changes to the database schema, beyond just adding a type as a new subtype of some existing types. Even relational database systems allow a table to be dynamically added or dropped, and an attribute to be added or dropped. Since an object-oriented data model admits additional information, its DDL should include facilities for making such schema changes as adding a method or attribute to an existing type, dropping a method or attribute from a type, adding a new supertype to an existing type, dropping a supertype from a type, etc. Deficiencies of Data Control Language The ODMG-93 database language has (almost) no data control language; in particular, it has no provision for access authorization (granting and revoking authorizations), and no provision for triggers (for automating the enforcement of data integrity with user-specified actions). Again, the facility for granting authorization should include consideration of the object-oriented data model; for example, authorization to run methods, and authorization to access not just one type, but a type hierarchy. The ODMG-93 database language proposes a nested transaction model (a transaction recursively consisting of other transactions) for transaction management. A transaction is simply a sequence of reads and updates against a database (any database -- relational or object-oriented). The purpose of bundling a sequence of reads and updates into a single transaction is simply to define the sequence as an atomic database access. Either all the reads and updates in a transaction finish (i.e., the effects get recorded permanently in the database), or none finish. In this way, the database is not corrupted with data as a result of a partial work. A nested transaction model is sometimes desirable, since it naturally models a situation in which the work of a transaction may be split into multiple transactions (subtransactions) which may be executed in parallel. However, this has nothing whatever to do with object-orientation (i.e., encapsulation, methods, inheritance); that is, object-orientation does not require that transactions be nested. In my view, ODMG should adopt the standard non-nested transaction model as default, and suggest a nested transaction model as an option. Potential Meta-Data Management Nightmare The ODMG-93 database language has various little facilities that will add up to a "meta-data management" nightmare. For example, it proposes to allow a user to specify multiple names for a single object; to specify the lifetime of a single object; to create an index on a subset of an extent of a type (i.e., a subset of all instances of a type -- equivalent to a subset of all records of a table in a relational database); to create multiple such subsets of an extent of a type, such that some objects may belong to more than one subset; to name a query statement and use it in other queries, etc. I note that it is of course desirable, conceptually, to allow the user to manage the database at the object level, that is, to use a single object as the smallest unit of database access and control. For example, in a document management application, it is certainly desirable to be able to name each document object, attach an authorization on the document (although ODMG-93 does not include any consideration of authorization), to fetch the document by its name, etc. However, it must also be possible to manage the database at the type level, that is, to use a type (and its extent) as a unit of database access and control -- relational database systems allow this, but do not allow each tuple (record or row) of a table as a unit of database access and control. Although it is desirable to have a single object as a unit of database access and control, it may sometimes lead to significant additional difficulties with meta data management. Further, it can lead to a performance degradation; a larger system catalog to maintain the status of each named object, a larger catalog to keep track of access authorizations on individual objects, a more complex logic for authorization checking, a more complex logic to screen redundant objects when processing queries against multiple subsets of the extent of the same type, etc. In summary, it should be recognized that an object-level database access and control is merely the next logical step from the traditional type-level access and control, rather than a replacement of the type-level access and control. (Although I do not regard this as a "major" problem, the ODMG-93 database language requires the user to explicitly name the extent of a type. This seems to merely add to the meta-data management problem. It is not clear at all why the extent of a type cannot be implicit. In a relational database, the extent of a table is implicitly all the records that are inserted into the table -- the user need not declare the extent of a table explicitly and separately.) C++ and Smalltalk Bindings Both the C++ and Smalltalk bindings force all objects to be persistent; and this is regarded as totally unacceptable to some users and prospective users of C++ and Smalltalk. It will be necessary to allow C++ or Smalltalk applications to deal with both persistent objects and nonpersistent objects. To do this, it will be necessary to allow the users to declare persistent classes and persistent attributes, to differentiate them from nonpersistent classes and nonpersistent attributes, respectively. Further, the issue of storing persistent objects that reference nonpersistent objects needs to be addressed. 3. Future Course for ODMG In order to understand the relative merit of ODMG and ODMG-93 and the future course of action that ODMG should take, we must bear in mind the fact that relational database vendors are planning to add object management facilities to their relational database products, that is, extend SQL to object-oriented SQL and that they are all working within the ANSI X3H2 (SQL-3) Standards Committee. Given that OODB vendors that are members of ODMG have elected to adopt SQL as the basis of their object-oriented database language, it is clear that ODMG should really join forces with ANSI X3H2 to arrive at a single standard. To this end, ODMG may do the following. 1. ODMG should start developing levels of conformance and certification process if they hope to have their proposals become a standard. The current proposal is "all or nothing"; either a vendor must implement all aspects of all of the proposals or they do not conform at all. 2. ODMG should quickly proceed to implement a certification process for the C++ and Smalltalk language bindings, as these are the most mature parts of its current proposal. 3. ODMG should influence SQL-3 with the two primary technical contributions of the ODMG-93 database language, namely management of compound data and strong typing of query results. 4. ODMG should make a revised proposal that will fully subsume SQL, while preserving its two primary technical contributions. In particular, it should simply accept and extend the major missing features from SQL, namely views, dynamic schema changes, and access authorization; and drop gratuitous differences in syntax and terminology from SQL. In my view, the first step ODMG may take is to reorganize its data model and database language into three tiers. The first tier should consist only of those data modeling concepts and language constructs that are identical to their counterparts in relational databases. The second tier may consist of major new concepts and constructs, such as the concepts of relationships between two types, inheritance hierarchy, methods, and compound data types. The third tier may consist of the "little" things that can lead to meta-data management problems. In practice, the first two tiers should be combined into a single tier; but the artificial separate presentations may make it clear that an object-oriented model subsumes the relational model. In any case, the first two tiers should be the baseline for compliance certification for vendors. The third tier should strictly be an option for compliance. Finally, ODMG proposes to expand ODMG-93 with proposals for additional database features. Before they proceed, they should first seek to broaden its membership to add more experiences in discussions involving some of the really thorny issues ODMG claims to have on its agenda. The small number of members whose primary expertise is in providing persistent storage for C++ should not be so presumptuous as to believe that they will define standards for such far-reaching and thorny issues as versioning, work-group transactions, multimedia data management, etc. These issues are very important to a far wider area of computing industry than just persistent storage for C++ or Smalltalk.