Abstract
During the life cycle of an information system, the original design of the database may be difficult to recover. The database schema may have been modified continuously and have drifted semantically away from the intended design, or perhaps no conceptual modelling method was employed at all. Database reverse engineering (DBRE) is the process of recovering the semantics of an existing relational schema and expressing them in a conceptual schema. The conceptual schema offers a richer description of a domain and is a useful tool for, among other things, maintaining the semantic consistency of the database at runtime. Although DBRE is a well-known research problem, the process itself has not been fully formalised with respect to how information is preserved in the transformation. When a relational schema is transformed into a conceptual schema, the latter should be able to maintain the information captured in the former. The bedrock of such a schema transformation is the presence of well-defined functions that declare the association between instances of both schemas. In addition, the depth of component extraction from the relational schema is at times rudimentary in existing DBRE methods, leaving residual hidden semantics in the schema before the database is transformed. We find that this is perhaps due to the strict assumption, made by existing DBRE methods, that the relational schema is in a higher normal form. In this thesis, we propose a formal framework for information capacity preservation in first-order schema transformations, such as DBRE. The framework is guided by the model-theoretic concepts of schema dominance and schema equivalence (called here losslessness), two desirable properties of any schema transformation. The problem of information capacity preservation is characterised by a pairwise entailment check involving the formulas expressing both schemas and a set of first-order mappings.
The framework is applied to typical DBRE scenarios to check whether all the information in an existing relational schema is completely and correctly realised in the resulting conceptual schema. We also establish a formal association between schema dominance and two essential properties of database design, the lossless join and dependency preservation; database design is itself a crucial part of the reverse engineering process. To address the issue of component extraction, we provide a catalogue of atomic and complex transformation patterns, or rules, which can be used as an executable document that guides the design possibilities for the resulting conceptual schema. The catalogue encompasses database decompositions, the analysis of data and of key/non-key attribute correlations, object identification schemes for the relational schema, and ultimately the lossless transformation into a conceptual output.