Baltic J. Modern Computing, Vol. 4 (2016), No. 1, 5-33

The Database Normalization Theory and the Theory of Normalized Systems: Finding a Common Ground

Erki EESSAAR

Department of Informatics, Tallinn University of Technology, Akadeemia tee 15A, 12618 Tallinn, Estonia

Erki.Eessaar@ttu.ee

Abstract: Database normalization theory offers formalized guidelines on how to reduce data redundancy and thus the problems that it causes in databases. More recently, researchers have started to formalize the idea that the problems caused by unwanted dependencies and redundancy can be observed in any modular system, be it software, hardware, or an organization. To tackle these problems, they have proposed the theory of normalized systems, whose complete application frees systems of combinatorial effects so that the impact of a change in a system does not depend on the system size. At the time of writing this paper, the explanations of the theory of normalized systems do not say much about database normalization. We think that the theories are deeply related. In this paper, we search for a common ground between the database normalization theory and the theory of normalized systems.

Keywords: database normalization, principle of orthogonal design, normalized system (NS), evolvability, separation of concerns, combinatorial effect (CE).

1. Introduction and Related Works

The normalization theory of relational databases dates back to E.F. Codd’s first seminal papers about the relational data model (Codd, 1970). Since then it has been extended considerably (see, for instance, Date (2007, Chap. 8)) and the work is ongoing. There are proposals on how to apply similar principles to other data models, such as the object-oriented data model (Merunka et al., 2009), the (hierarchical) XML data model (Lv et al., 2004), or the (hierarchical) document data model (Kanade et al., 2014). The database normalization process helps database developers to reduce (not to eliminate) data redundancy and thus avoid certain update anomalies that appear because there are combinatorial effects (CEs) between propositions that are recorded in a database. For instance, in SQL databases each row in a base table (table for short) represents a true proposition about some portion of the world. CEs mean in this context that inserting, updating, or deleting one proposition requires the insertion, update, or deletion of additional propositions in the same table or in other tables. The more propositions are recorded, the more such operations a data manager (human and/or software system) has to perform in order to keep the data consistent. Thus, the amount of work needed depends on the data size and increases over time as the data size increases. Failing to make all the needed updates leads to inconsistencies. The update anomalies within a table appear because of certain dependencies between columns of the same table. Vincent (1998) shows how these

dependencies lead to data redundancy. Informally speaking, these anomalies appear if different sets of columns of the same table contain data about different types of real-world entities and their relationships. By rewording a part of the third normal form definition of Merunka et al. (2009), we can say that these sets of columns have an independent interpretation in the modeled system. According to the terminology in (Panchenko, 2012), these sets of columns have different themes. Thus, the table does not completely follow the separation of concerns principle because the database designers have not separated the sets of columns with independent interpretations into different software elements (tables in this case). The update anomalies across different tables within a database may occur because of careless structuring of the database, so that one may have to record the same propositions in multiple tables. The update anomalies across different databases (which may or may not constitute a distributed database) may occur if one has designed the system architecture in a manner that forces duplication of data across different databases. Conceptually similar update (change) anomalies could appear in the functionality of any system or its specification, and these make it more difficult and costly to make changes in the system or its specification. Pizka and Deissenböck (2007) comment that redundancy is a main cost driver in software maintenance. The need to deal with the update anomalies in systems that are not designed to prevent them is inevitable because the systems have to evolve due to changing requirements, just like the value of a database variable changes over time. For instance, some of the changes are caused by changes in the business, legal, or technical environment where the system has to operate, some by the changing goals of the organization, and some by the improved understanding of the system domain and requirements by its stakeholders. The theory of normalized systems (NS) (Mannaert et al., 2012b) reflects an understanding of the dangers of the update anomalies and offers four formalized design theorems whose complete application helps developers to achieve systems that are free of CEs and are thus modular, highly evolvable, and extensible. The NS theory speaks about modules and submodular tasks. The work with the NS theory started after the invention of the database normalization theory. Its proponents see it as a general theory that applies to all kinds of systems, like software, hardware, information systems, or organizations, as well as to the specifications of these systems. Like the database normalization theory, its goal is to improve the design of systems and facilitate their evolution. In our view, it would be useful to bring these two theories together to be able to understand their similarities and differences. Possibly, we can use the ideas that have been worked out for one theory in the context of the other theory as well. Nowadays there is a lot of talk about the object-relational impedance mismatch between highly normalized relational or SQL databases and the object-oriented applications that use them. Thus, researchers and developers look at these as quite distinct domains that require different skills and knowledge and that have different associated problems, theories, methods, languages, and tools. Hence, in addition to the technical impedance mismatch there is a mental one as well.
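
As a hedged illustration of the update anomalies described above, consider a minimal SQL sketch (the table and column names are invented for this illustration and do not come from the paper). The first table bundles two sets of columns with independent interpretations, data about employees and data about their departments, so a proposition about a department is repeated in every row about one of its employees:

    -- A table in a low normal form: columns about two different "themes"
    -- (employees and departments) are bundled together.
    CREATE TABLE Employee_department (
      employee_id   INTEGER PRIMARY KEY,
      employee_name VARCHAR(100) NOT NULL,
      dept_code     VARCHAR(10)  NOT NULL,
      dept_name     VARCHAR(100) NOT NULL  -- depends only on dept_code
    );

    -- The proposition "department D010 is named Sales" is recorded once per
    -- employee of that department. Renaming the department is a combinatorial
    -- effect: the number of rows to change grows with the amount of recorded data.
    UPDATE Employee_department
    SET dept_name = 'Sales and Marketing'
    WHERE dept_code = 'D010';

    -- After normalization each proposition about a department is recorded once,
    -- so the same change touches exactly one row regardless of the data size.
    CREATE TABLE Department (
      dept_code VARCHAR(10)  PRIMARY KEY,
      dept_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE Employee (
      employee_id   INTEGER PRIMARY KEY,
      employee_name VARCHAR(100) NOT NULL,
      dept_code     VARCHAR(10)  NOT NULL REFERENCES Department (dept_code)
    );

The normalized design separates the two concerns into different tables that are glued back together by the foreign key values.
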
We support the view that database design is programming and has the same challenges as programming in the “traditional” sense, like ensuring quality and high evolvability, separating concerns, managing redundancy and keeping it controlled, testing the results, versioning, and creating tools that simplify all this. Database and application developers sometimes have antagonistic views on the normalization topic. Merunka et al. (2009) mention a common myth in the object-oriented development community that normalization is not needed at all. Komlodi (2000)

The database normalization theory has a well-known counterpart technique of denormalization, which reduces the normalization level of one or more tables (if we speak about SQL databases) to achieve pragmatic goals like better performance of some queries in a particular environment (see the sketch below). Whether and how much to use it depends on the context. For instance, Helland (2011) gives the impression that large-scale systems do not need database normalization because users add new facts instead of updating existing facts. Kolahi and Libkin (2010) provide a formal justification that databases with a good third normal form design (there are higher normal forms) offer the best balance between redundancy reduction and efficiency of query answering. Grillenberger and Romeike (2014) argue that computer science has to “rethink the idea of redundancy from something that generally should be avoided to a method that is applied in order to achieve certain specific goals” due to the emergence of NoSQL systems and Big Data. One starts to wonder whether these ideas also apply to the redundancy of functionality. Perhaps understanding the universality of normalization principles, the similarity of concepts, the potential problems of redundancy, and the situations when redundancy is tolerable or even desirable helps us also reduce the mental gap between the application development and database development domains. Fotache (2006) observes that database normalization theory has failed to become a universal practical guide for designing relational databases and points to different reasons for that. Badia and Lemire (2011) mediate a report that in a typical Fortune 100 company database normalization theory is not used. There are even such provocative calls as “normalization is for sissies” (Helland, 2009). If one understands better the relationship between the NS theory and the database normalization theory, then one can learn from the problems of one theory how to make things better for the other. Badia and Lemire (2011) observe with regret that traditional database design (thus, also normalization) “is not a mainstream research topic any more” and is often considered “a solved problem.” One could hope that the interest towards the NS theory will also increase the interest towards database normalization. Grillenberger and Romeike (2014) suggest that in the age of Big Data the topic of strictly normalizing data schemas is no longer a broadly influential fundamental database concept that deserves teaching in general education. They also call for discussion as to whether (data) redundancy is such a general concept any more. However, understanding the fundamental similarities between the NS theory and the database normalization theory would strengthen the understanding that redundancy is an important concept of both data and management. It influences systems in general and thus certainly deserves teaching. Thus, the goal of the paper is to bring the theory of NS and the theory of database normalization together and to search for their common ground. We pointed to some observable similarities and differences between the theories in our earlier paper (Eessaar, 2014). However, the topic deserves a deeper analysis. To our knowledge, there has not been this kind of analysis yet in the literature.
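
A minimal sketch of the denormalization technique mentioned above, reusing the hypothetical Employee and Department tables from the earlier example and assuming a PostgreSQL-style SQL dialect; the redundant column is a conscious design decision whose consistency must afterwards be maintained by applications, triggers, or batch jobs:

    -- Denormalization: copy dept_name into Employee so that a frequent report
    -- no longer needs a join. The price is redundancy that must be kept consistent.
    ALTER TABLE Employee ADD COLUMN dept_name VARCHAR(100);

    UPDATE Employee
    SET dept_name = (SELECT d.dept_name
                     FROM Department d
                     WHERE d.dept_code = Employee.dept_code);

    -- The query that motivated the change becomes a single-table read.
    SELECT employee_name, dept_name
    FROM Employee;
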
Since the NS theory is independent of any programming language, platform, or technology, we want to understand as to whether the theory of database normalization could be seen as a specialization of the NS theory in the domain of databases. We organize the rest of the paper as follows. Firstly, we present an additional explanation of the main principles and reasons behind database normalization and the NS theory. After that, we explore what existing NS literature says about databases and the database normalization theory. Thirdly, we explore a common ground of the normalization theories. We name some problems of SQL databases and database

management systems (DBMS) that are caused by the insufficient separation of concerns, which is a violation of the NS theory. We present a conceptual model that offers a unified view of the theories. Finally, we conclude and point to the further work with the current topic.

2. Normalization Theories

Here, we further explore the theories to explain the context and improve understanding.

2.1. Theory of Database Normalization

The normalization theory provides a formalized theoretical basis for the structuring of databases. The goal of its application is to reduce data redundancy in databases and hence avoid certain update anomalies. Although this is desirable in case of any data model, we refer to the relational data model in our discussions because the theory is probably the best known in this domain. Of course, one may choose to allow redundancy for the sake of improving some other aspects of the system but in this case it is a conscious design decision and one must take into account all of its good and bad results. Technical limitations of data management platforms often cause decisions to permit certain degree of data redundancy. Because of these, allowing the redundancy is the best possible way to speed up certain queries or, in case of data models that do not provide an operation that is similar to relational join, to make possible certain operations in the first place. Our understanding of the relational data model, which is also the basis of the SQL database language, is based on The Third Manifesto (Date and Darwen, 2006). It differs from the underlying data model of SQL in many crucial details. However, in the discussions of the database normalization theory, we will use the terminology of SQL – table, column, and row. We will do it because SQL is well known and we hope that it will make the discussion more understandable. We will point to the differences of the underlying data model of SQL and the relational model where we need it. Data redundancy in a database means that there is at least one proposition that has two or more distinct representations in the database (two or more separately registered propositions) (Date, 2006a). The update anomalies make data modification operations more time consuming and error prone. If one wants to insert, update, or delete a proposition, then the DBMS or its invoker have to make more work (how much more depends on the level of normalization and the number of already registered propositions) to complete the operation and to ensure consistency of propositions. If the system design determines that applications that use data have to be aware of the redundancy and ensure consistency, then it increases coupling between the applications and the database. In addition, data structures that feature data redundancy and update anomalies restrict propositions that one can record in the database because the system treats independent propositions at the database level as dependent propositions. Carver and Halpin (2008) note that NULLs in a fact (proposition) indicate that the fact is not atomic. Contrary to SQL, The Third Manifesto does not permit us to use NULLs to depict missing values. Thus, one cannot bundle together propositions to a row as a composite proposition if some parts of the composite proposition are missing. Normalization is a multi-step process where each step takes the involved tables to a higher normal form that is defined in terms of certain well-formedness rules. In other

someone has to instruct the DBMS how to propagate the updates and thus there is more work for the developers. The principle of orthogonal design addresses data duplication across multiple tables. It is not a part of the normalization theory. It requires that no two tables in a database should have overlapping meanings (Date, 2006a). The definitions of normal forms, except first normal form, depend on the existence of keys in the tables, thus eliminating the problem of repeating rows in the tables. However, the definition of first normal form does not require the existence of keys in tables according to the interpretation of SQL, but it does require the existence of at least one key in each table according to the interpretation of the relational model. Keyless tables may therefore contain repeating rows, which is another form of redundancy. As Date (2006b, Chap. 10) shows, this leads, for instance, to problems with interpreting query results as well as with optimizing the queries by the system. Carver and Halpin (2008) argue that the previously described normalization process is inadequate because a table (as a variable) may have different values (sets of rows), some of which do not have multivalued or join dependencies but still have fact-redundancy. They do not question the need for normalization but the process of achieving fully normalized tables. They argue that if one creates a conceptual model that represents atomic fact types, then one can synthesize fully normalized tables from it. They comment that the process of deciding whether a fact type is atomic or not requires knowledge about the domain and is informal. Fotache (2006) also points out that, alternatively, one could use normalization to check the results of deriving a database structure from a conceptual model.
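
As a hedged sketch of the two problems just mentioned (all names are invented for this illustration), the first pair of tables violates the principle of orthogonal design because their meanings overlap, while the last table is keyless and therefore permits repeating rows:

    -- Violation of the principle of orthogonal design: the predicates of the two
    -- tables overlap, so a person who is both a student and an employee must be
    -- recorded in both tables, i.e. the same proposition is registered twice.
    CREATE TABLE Student_person (
      person_id  INTEGER PRIMARY KEY,
      full_name  VARCHAR(100) NOT NULL,
      birth_date DATE NOT NULL
    );

    CREATE TABLE Employee_person (
      person_id  INTEGER PRIMARY KEY,
      full_name  VARCHAR(100) NOT NULL,
      birth_date DATE NOT NULL
    );

    -- A keyless SQL table: nothing prevents registering the same call twice.
    CREATE TABLE Phone_call (
      caller  VARCHAR(20) NOT NULL,
      callee  VARCHAR(20) NOT NULL,
      started TIMESTAMP   NOT NULL
    );

    -- Declaring a key removes the possibility of duplicate rows.
    ALTER TABLE Phone_call ADD PRIMARY KEY (caller, callee, started);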

2.2. The Theory of Normalized Systems (NS)

Databases are only one, albeit often very important, component of information systems. Intuitively, it is understandable that some design problems that appear in databases can appear in some form in any type of systems. These systems could be technical, sociotechnical, social, or natural. For instance, there could be multiple software modules in a software system that implement the same task, multiple forms in the user interface of the same actor providing access to the same task, multiple process steps, organizational units or organizations that fulfill the same task, or identical or semantically similar models that describe the same tasks. These examples potentially mean unnecessary wasting of resources and more complicated and time-consuming modification of tasks and their models. Being duplicates of each other, the parts have undeclared dependencies, meaning that changing one requires cascading modifications of its duplicates to keep consistency. The more there are such duplicates, the more changes we need to keep consistency. If there are multiple unrelated or weakly related tasks put together to a module, then it is more difficult to understand, explain, and manage the module. Such modules have more dependencies with each other, meaning that changes in one require examination and possible modifications in a big amount of dependent modules. The less the general information hiding design principle is followed, the more cascading changes are needed. For instance, intuitively, one can understand how difficult it would be to understand places of waste and duplication in a big organization and after that reorganize it. In organizations, the more fine-grained are its tasks, the easier it is to distribute these between different parties and in this way achieve separation of duties and reduce the possibility of fraud.

Observers have noticed the same general problems in case of user interface design as well. Cooper et al. (2014, p. 274) write that navigation is “any action that takes the user to a new part of the interface or that requires him or her locate objects, tools, or data elsewhere in the system” and that the need of such actions should be minimized or eliminated. From the end users perspective, one could see it as a call to denormalize by bundling tasks together to one form or page. From the perspective of the developers, it is a warning of dangers because if there are duplicated tasks in different places, then they have to navigate to modify these. The latter demonstrates conflicting interests of different stakeholders and possibly the need to have different decomposition principles at the different system layers. Thus, this knowledge is not new and the authors of the NS theory are not its discoverers. For instance, “Once and only once” software development best practice requires that there should not be duplication of behavior in a system (WEB, a). “Don’t repeat yourself” best practice is a little bit more relaxed, meaning that each data element or action element (functionality) must have single authoritative representation in the system (Wilson et al., 2014). If there is duplication, then it must be controlled (automated) (WEB, b). Similarly, De Bruyn et al. (2012) refer to different code smells. Many of these indicate code duplication. Their analysis shows that avoidance of most of the smells (14 out of 22) contribute towards achieving NS. However, the NS theory tries to offer more formalized approach how to achieve the system that is free of such problems. The Lehman’s laws of software evolution state that E-type evolutionary software degrades over time unless it is rigorously maintained and adapted as well as its functional content is increased to maintain satisfaction of users (Godfrey and German, 2014). The NS theory assumes unlimited system evolution over unlimited time (Mannaert et al., 2012b). Of course, the stakeholders of the system want the evolution process to be as easy and problem-free as possible. However, this is not the case if the system grows larger and more complex over time. The bigger it gets, the more there are dependencies so that changing one part of the system requires changes in other unrelated parts as the ripple effect. Unfortunately Mannaert et al. (2012b) remain vague about what does “unrelated” mean here. The theory of NS calls such dependencies combinatorial effects (CEs) and calls for their complete elimination. Only if this is achieved, then the impact of change will not depend on the size of the system any more but only on the nature of the change itself. Thus, the system becomes stable with respect to a set of anticipated changes. Mannaert et al. (2012b) define a minimal set of such changes. The theory suggests four prescriptive design theorems to constrain the modular structure of systems and to guarantee that the system is free of CEs and thus highly evolvable. The Lehman’s law of increasing complexity states that each E-type software system grows increasingly complex over time, unless stakeholders explicitly work to reduce its complexity (Godfrey and German, 2014). In this case, explicit work means work that is needed during the creation or modification of the system to achieve conformance to the theorems. The theory is generic in the sense that according to the authors one could apply it to any modular system. 
To achieve this, the theory is described in terms of very generic primitives like data elements and action elements, whose configuration forms a system. Anticipated changes of the system mean changes in the configuration of these elements. The design theorems, whose proofs Mannaert et al. (2012b) present in their paper, have the following informal descriptions (Eessaar, 2014).

beginning of March 2015) into the initial set of papers. We also added to the initial set the earlier papers of Mannaert and Verelst that the selected journal paper references. For the review, we selected papers (from the initial set of papers) whose main topic is the NS theory or its application to some system and whose full text was available to us. In total, we reviewed 40 papers from 2006 to 2014 (all in English) that were available as pdf files. We used the following case-insensitive search keywords to search the parts of the papers that are relevant in terms of the research questions: “normal form”, “relational”, “sql”, “denormalize”, “denormalization”, “database”, “database management system”, “update anomaly”, “update anomalies”, “orthogonal design”, “duplicate”, “duplication”, and “redundancy”. In addition, we searched for the case-sensitive word “Codd” as well as “NF” and “DBMS”, which are the abbreviations of “normal form” and “database management system”, respectively. Next, we summarize and discuss our findings. If there are multiple publications that present similar claims, then we will not present all the publications but make a selection. Only Verelst et al. (2013) and our previous work (Eessaar, 2014) mention database normalization theory. Verelst et al. (2013) refer only to the paper of Codd (1970) in this regard. They say that it is a well-documented approach for eliminating many CEs in databases. They correctly point out that the theory does not eliminate all the effects and thus does not eliminate all the redundancy. This is a crucial difference between the NS theory and the database normalization theory because the former requires us to remove all the CEs. None of the reviewed papers explicitly refers to the concept “denormalization”. However, Verelst et al. (2013) speak about the need to eliminate CEs at the software level while relaxing this requirement at the higher levels. This relaxation is nothing else than denormalization in terms of NS. Similarly, in databases it is possible to denormalize views that constitute the external level of a database without denormalizing the tables based on which the views have been defined (Burns, 2011). In software systems, the analogy is, for instance, a user interface where each form/page could offer unrelated or weakly related functionality and thus violate the separation of concerns theorem, which is one of the founding theorems of the NS theory. In documentation, an example is diagrams that could couple unrelated or weakly related model elements. Thus, denormalization is clearly a topic that the NS theory should consider. Verelst et al. (2013) incorrectly claim that the CEs that database normalization eliminates are caused by the “redundant definition of attributes.” The definitions are not redundant; rather, the attributes are grouped together so that there will be CEs between the propositions that are represented by the recorded attribute values. The terminology here is also imprecise because in SQL columns and attributes are structural components of tables and structured types, respectively. Moreover, it is the principle of orthogonal design that addresses the redundant definition of columns in different tables. The principle is related to but not a part of the database normalization theory. Only Eessaar (2014) mentions the principle. Only Eessaar (2014) mentions database normal forms and does so while giving an example of the similarities of the normalization theories.
Thus, we conclude that the database normalization theory has not been an important basis in working out the NS theory and there is a gap in the research, namely search of a common ground of these theories. There are few mentions of relational databases (three papers), object-relational mapping (one paper), database management systems (two papers), and SQL (three papers) as examples of possible implementation technologies of systems. The NS theory is generic and these are just some possible implementation technologies. Data element is

one of the element types of the NS. According to Mannaert et al. (2012b), one can store instances of data elements in corresponding relational database tables. They look at a DBMS as a technology that is external to applications and do not treat it and the data model it provides in terms of the NS theory. The lack of references also shows that the research regarding database normalization has not been an important basis in working out the NS theory. Seventeen papers refer to databases. Mannaert et al. (2012b) explain that if a transaction fails because, for instance, it violated integrity rules, or if a system fails during active transactions, then the system must be aware of all the actions that it has performed for recovery purposes. Thus, the system needs state keeping, and that is what the fourth NS theorem prescribes. Mannaert et al. (2012b) do not mention it, but many DBMSs implement this by using, for instance, rollback/undo segments. Mannaert et al. (2012b) comment that over time programming languages have evolved to be more consistent with the NS theorems. DBMSs are themselves a subject of the NS theory, and these systems implement, or facilitate the independent implementation of, many of the concerns whose separation the NS theory requires. We did not find an analysis of DBMSs in terms of how much support they offer in building NS and how it has evolved over time. Thus, in our view, the systems deserve more attention in this regard (see Section 4.1). At the higher level, applications store in databases the general observable states of real-world systems. Mannaert et al. (2012a) call these states macrostates. This is another manifestation of state keeping, required by the fourth NS theorem. Some of the reviewed publications mention databases as a part of the architecture of information systems. Their authors argue that the architecture would benefit from the application of the NS theory. Mannaert et al. (2012b) note that a multi-tier architecture with a separate layer of database logic is a manifestation of the separation of concerns theorem. In their examples, the papers concentrate on the application layer. Maes et al. (2014) stress that checking, based on a database, whether a user has authorization to use an IT application and its different functions is a separate concern. Developers must implement it in a separate module for the sake of evolvability. Ideally, all the applications will reuse it. Maes et al. (2014) call for reusing such software modules when developing new applications. De Bruyn et al. (2014) refer to many different cross-cutting concerns of systems, like authorization policy, logging, integrity checking, external communication, and a bookkeeping adapter, that one should implement in separate modules. Integrity checking is a part of the underlying data model of a DBMS. The others are services that one can build on top of the model in a DBMS. Modern SQL DBMSs provide more or less separation of concerns between the model and the services by providing means to manage the services separately from managing the elements determined by the data model. A goal of using the NS theory is to reduce dependencies between modules. Coupling is a measure of such dependencies. Van der Linden et al. (2013) list seven different types of couplings. They name external coupling as the third tightest coupling. In this case, two or more modules communicate by using an external database.
Access of this external resource is a concern that all the modules duplicate, meaning that these modules have multiple concerns, which violates the separation of concerns theorem. Fowler (a) calls this kind of approach integration database. The loosest types of couplings are stamp coupling and message coupling in which case modules communicate by passing data structures that they use only partially and messages,

deal with conceptually similar questions but, as we later point out, the use of highly normalized tables makes it easier to implement data version transparency. Each subtype inherits all the properties of its supertype. The database normalization theory does not deal with problems that are conceptually similar to the theorems about encapsulation and versioning. Thus, it is incorrect to say that the NS theory is its generalization. On the other hand, the separation of concerns theorem deals broadly with the same questions as the database normalization theory and therefore one can say that these theories have an overlap. The NS theory describes systems in terms of different primitives, including action element and data element. WEB (c) defines concern as “A canonical solution abstraction that is relevant for a given problem.” However, the separation of concerns, according to the NS theory, applies specifically to the action elements. “Essentially, this theorem describes the required transition of submodular tasks—as identified by the designer—into actions at the modular level” (Mannaert et al., 2012b). On the other hand, users of the database normalization theory apply it to the data elements. We argue that database normalization is a domain-specific application of the separation of concerns theorem to the data elements in the domain of databases. We wanted to validate our impression that the literature does not explain the database normalization theory in terms of separation of concerns (as of October 2015). Firstly, we looked at all the materials (books and papers) about the normalization topic that the current paper mentions and did not find any references to the separation of concerns principle. Secondly, we searched Google Scholar™ with the search phrase “separation of concerns in databases” (one result) and with the combination of “separation of concerns” and “database normalization” (41 results). We also looked at the 357 papers (as of October 2015) that cite the work of Hürsch and Lopes (1995) and searched with the phrases “normalization” (two results) and “normal forms” (four results) within this set of papers by using Google Scholar™. A similar search with the phrase “separation of concerns” among the 216 papers citing Fagin (1979) returned four results. We found only one previous source (Adamus, 2005) that comments on database normalization in terms of the separation of concerns. He mentions database normalization only once, claiming that it causes tangling, which is not good. He applies the principles of aspect-oriented programming to object-oriented databases and defines aspects so broadly that every software feature that “can be isolated, named, described and documented at a high abstraction level” is an aspect (concern). Both Adamus (2005) and WEB (c) see concerns as conceptual abstractions whose implementation involves the creation of one or more elements in the implementation environment. Adamus (2005) gives an example in which an aspect (concern) Person, which one could represent as an entity type in a conceptual data model, could be implemented by using multiple tables in a relational or SQL database. Adamus (2005) thinks that because of the creation of multiple tables the concern is scattered to multiple places of the database. He characterizes such a concern as tangled, meaning that due to the restrictions of an implementation environment (in this case a SQL DBMS) the implementers have no other choice than to scatter the concern. Of course, this does not have to be the case if the DBMS properly supports the definition of new types and their use as column types, as the relational model requires. Each concern is a conceptual abstraction that, according to WEB (c), depends on the problem at hand. The definitions leave it fuzzy as to what is an appropriate abstraction level to consider something as a concern. Hence, in this case, one could change the level of abstraction. Instead of looking at Person as a concern that one must implement with the help of the tables Person_detail and Person_e_mail, one could consider these tables as data elements that bijectively map to

and represent more fine-grained concerns. Based on these two tables, one could create a denormalized view (also a data element) that corresponds to the more coarse-grained concern Person. Database normalization helps us to achieve more fine-grained tables that correspond to more fine-grained concepts (see the work of De Vos (2014)). Thus, we can use the database normalization to find conceptual structure of the system and define its conceptual (data) model. Usually the order of creation is the opposite. Firstly, modelers create a conceptual data model. Based on that they create the database design models by using model transformation. Very informally speaking, in case of tables the separation of concerns means that each table must address the main general concern of a database that is to record data corresponding to some entity type or relationship type. In case of tables that are only in first normal form, each table is a structure that contains data about multiple types of entities and relationships. For instance, it makes it more complex to perform data modification operations and enforce certain integrity constraints. The higher is the normalization level of tables, the more fine-grained the concerns will become and thus the separation of concerns gradually increases. Hürsch and Lopes (1995) note that there must be a gluing mechanism that holds the decoupled concerns together. In case of relational or SQL databases and “traditional” projection-based normalization these are candidate keys and foreign keys of tables. One can use the values of the keys to join the decomposed tables back together in the nonloss manner (recouple concerns). What about constraints to data, which help us to enforce business rules? In our view, one should look these rules as separate concerns as well. Ideally, one could implement each such rule by using one declarative database language statement. However, if due to the restrictions of a DBMS the implementation of a constraint requires the creation of multiple trigger procedures, which have to react to different events and are perhaps attached to different tables, then this is an example of tangled database concern. Declarative statements and triggers are examples of action elements of the NS theory. Implementation platforms of concerns determine what concerns one can and cannot separate. According to the terminology of Tarr et al. (1999), data and functionality would be different dimensions of concerns along of that to decompose the system. For instance, object-oriented systems couple functionality and data because objects contain both methods, which implement behavior and attributes, which hold data. The same is true in case of user-defined structured types in SQL that couple attributes (data) and methods to access the attributes. These methods also have to enforce whatever constraints there are to the values of the attributes in addition to the type of the attribute. On the other hand, the relational data model takes the approach that it is possible to specify separately operators for performing operations with data, constraints to restrict data, data structures, and data types. This distinction follows the spirit of the separation of concerns principle. Hürsch and Lopes (1995) write about the benefits of separating concerns and observe that we must separate the concerns at both the conceptual and the implementation level to achieve these. Section 4.1 explains that unfortunately current SQL DBMSs have many shortcomings in this regard. 
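
A minimal SQL sketch of the Person example discussed above; only the table names Person_detail and Person_e_mail come from the text, while the column definitions are invented for this illustration. The two fine-grained tables are glued together by key values, and a denormalized view recouples them into the coarser-grained concern Person:

    -- Fine-grained data elements: each base table addresses one concern.
    CREATE TABLE Person_detail (
      person_id  INTEGER PRIMARY KEY,
      first_name VARCHAR(50) NOT NULL,
      last_name  VARCHAR(50) NOT NULL
    );

    CREATE TABLE Person_e_mail (
      person_id INTEGER      NOT NULL REFERENCES Person_detail (person_id),
      e_mail    VARCHAR(254) NOT NULL,
      PRIMARY KEY (person_id, e_mail)
    );

    -- A virtual data element that corresponds to the coarser-grained concern
    -- Person; the key values are the gluing mechanism that recouples the concerns.
    CREATE VIEW Person AS
    SELECT d.person_id, d.first_name, d.last_name, e.e_mail
    FROM Person_detail d
    LEFT JOIN Person_e_mail e ON e.person_id = d.person_id;

Readers of the coarse-grained concern can use the view, while data modifications go against the fine-grained base tables, so no proposition is stored twice.
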
Hürsch and Lopes (1995) note that separating concerns makes it possible to reason about and understand concerns in isolation. One could say the same about tables that one creates as the result of decomposition during the normalization process. Each table heading represents a generalized claim (external predicate) about some portion of the world. The lower is the normalization level of a table, the more its predicate contains weakly connected sub-predicates as conjuncts. During the normalization process, these

between the layers. Because each layer (an implementation of a coarse-grained concern) has its own responsibilities, avoiding their tight coupling is a direct application of the separation of concerns principle. An example of such an interface is the virtual data layer, which can contain views that join (recouple) data from different tables (Burns, 2011). There could be duplication of tasks at different layers because at different layers they help us to achieve different goals. For instance, data validation in the user interface gives quick feedback and reduces network load, whereas database constraints, among other things, express the meaning of data and help the system to optimize operations. However, there is a dependency between the tasks, and changes in one must be propagated to the other. Preferably, this must happen automatically to make the redundancy controlled. Hürsch and Lopes (1995) comment that weak coupling of concerns increases their reusability. Burns (2011) notes that reusability of data across different applications and business uses is one of the basic principles of data management. If a table contains data about different general business areas (in other words, about separate concerns), then it could discourage data reuse. For instance, if a program has to register personal data about clients but the table Client also contains information about the contracts with clients and the products or services that the clients consume, then it could increase the temptation to create a separate table just for this program to register some personal details of clients. Database normalization deals with separating concerns at the conceptual database level according to the ANSI/SPARC layered architecture of DBMSs. The layered architecture is itself an example of separation of concerns. DBMSs separate concerns like persistence, checking privileges, speeding up the performance of operations, failure recovery, and replication in the sense that the DBMS together with its human operators deals with these questions in the background, and the users of data ideally do not have to refer to these in data manipulation language statements. If one wants to achieve the highest separation of concerns when designing tables, then one must create tables that are in sixth normal form. If a table has a key (a set of columns) and at most one additional column, then it is in sixth normal form. We cannot decompose such tables in a nonloss manner into tables that have fewer columns than the original. Date (2006a) calls it the ultimate normal form with respect to normalization. The use of such tables offers advantages like better handling of temporal data, better handling of missing information, and simplification of schema evolution. Regarding the last property, Panchenko (2012) notes that the modifiability of a database schema is one of the most important quality criteria of the applications that use the database. Conceptually, this argument is also the driving force behind the NS theory, which offers guidance on how to create highly evolvable systems that retain this characteristic over time. Sixth normal form tables have been popularized by anchor modeling (Rönnbäck et al., 2010), which is a model-driven development approach for creating highly evolvable databases that are well suited for storing temporal data. It results in tables that are mostly in sixth normal form. However, the approach also requires the creation of views and functions that present a more denormalized view of the data.
Thus, there are multiple interfaces for working with the data. One provides direct access to tables that correspond to anchors and attributes to make it possible to register new true propositions about the world. Another contains elements that implement less granular concerns (for instance, the latest view in case of anchors that presents the latest values of historized attributes). Rönnbäck et al. (2010) do not write about separation of concerns in the context of their approach.
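
A hedged and much-simplified sketch in the spirit of anchor modeling; the names and the shape of the historization are invented for this illustration and do not follow the exact conventions of Rönnbäck et al. (2010). Each table has a key and at most one additional column, i.e. it is in sixth normal form, and a "latest" view offers a less granular interface:

    -- Anchor: only the identity of a person.
    CREATE TABLE PE_Person (
      pe_id INTEGER PRIMARY KEY
    );

    -- Historized attribute in sixth normal form: key plus one attribute column.
    CREATE TABLE PE_NAM_Person_Name (
      pe_id      INTEGER      NOT NULL REFERENCES PE_Person (pe_id),
      changed_at TIMESTAMP    NOT NULL,
      name       VARCHAR(100) NOT NULL,
      PRIMARY KEY (pe_id, changed_at)
    );

    -- A new attribute is a new table; existing tables are never altered.
    CREATE TABLE PE_EMA_Person_Email (
      pe_id  INTEGER      NOT NULL REFERENCES PE_Person (pe_id),
      e_mail VARCHAR(254) NOT NULL,
      PRIMARY KEY (pe_id)
    );

    -- "Latest" view: a less granular interface that recouples the concerns and
    -- shows only the current value of the historized attribute.
    CREATE VIEW Person_latest AS
    SELECT p.pe_id, n.name, e.e_mail
    FROM PE_Person p
    LEFT JOIN PE_NAM_Person_Name n
           ON n.pe_id = p.pe_id
          AND n.changed_at = (SELECT MAX(n2.changed_at)
                              FROM PE_NAM_Person_Name n2
                              WHERE n2.pe_id = p.pe_id)
    LEFT JOIN PE_EMA_Person_Email e ON e.pe_id = p.pe_id;

With this structure, the anticipated changes "add an attribute" and "add an entity type" become the creation of new tables and adjusted views, which matches the non-invasive schema evolution discussed later in the text.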

In relational databases, first normal form requires that each field of a row must contain exactly one value that belongs to the type of the corresponding column. Although this is not a part of the database normalization theory, one must be able to use any type (including types that have a complex internal structure; only excluding the type pointer) as a column type. The selection of column types determines the granularity of the values that database users and the DBMS can process (read from and write back to the system) as one value. Thus, one could decide that for the entity type Person there is a table Person with the columns first_name and last_name of type VARCHAR or perhaps of a user-defined type Name_T. It means that one wants to treat the values in these columns as the main units of processing (reading and writing) data about persons. On the other hand, one can decide to create a user-defined type Person_T with the components first_name and last_name in its possible representation and to create a column with this type to be able to record values of this type. In this case, one treats values of the type Person_T as the main units of processing. The relational normalization theory does not look inside the recorded values of the column types and does not deal with possible data redundancy within these values. Mannaert et al. (2012b) state that the identification of the tasks that one should treat as separate concerns and should place in different modules is to some extent arbitrary. In relational databases, the use of tables in sixth normal form together with the possibility to use simple or complex types as column types offers a flexible model for determining the granularity of concerns. One could design tables so that dealing with first names and dealing with last names are separate concerns and the corresponding data is in separate tables. On the other hand, one may decide to consider dealing with data about persons as one concern and register the data of persons in a table that has exactly one column, which has the type Person_T. This column is also the key column of the table. The latter design is less flexible, just as we expect from less-separated concerns. For instance, it is more difficult to implement the recording of historic values for some (but not all) attributes of Person. Difficulties in database evolution, like starting to register data corresponding to new attributes or making constraints on attributes stronger or weaker, depend on the inheritance model of types that the DBMS offers. For instance, suppose a new requirement is to start registering the national identification numbers of persons. It could be that the system does not support type inheritance or that its inheritance model does not permit the definition of subtypes whose values the system cannot represent in terms of the possible representation of their supertype. In other words, we cannot add a new component, national identifier, to the possible representation of the subtype. In this case, we have to create a new type without inheritance and a new table that has a column of the new type. Now there are two places (tables) in the database where the names of persons are registered. Developers could externally couple modules (action elements) by using an integration database (Van der Linden et al., 2013). In this case, they can implement data elements as relational database tables. NSs must exhibit data version transparency, meaning that action elements must function even if there are multiple versions of data elements. Mannaert et al.
(2012b) describe anticipated changes in systems in terms of very generic primitive elements. Two of these changes are addition of a data attribute/field and addition of a data element. If we use the anchor database approach, then both these modifications are non-invasive to existing tables, meaning creating new tables, not altering the existing tables. Thus, every old database conceptual schema version is a proper subset of the latest schema. We can hide such changes in the schema behind the interface of views and functions that encapsulate the database. Anchor

Merunka et al. (2009) address the same problem in the definition of second and fourth normal form for the object-oriented data model and require elimination of such CEs. Aspect-oriented software development describes cross-cutting concerns as concerns that are scattered to multiple other concerns leading to duplication or significant dependencies. They could exist because of the restrictions of the implementation environments that do not allow implementation of such concerns in any other way. Adamus (2005) characterizes such concerns as tangled aspects. However, implementation in a manner that increases dependencies and duplication could also be a choice of designers. This is the case in case of violations of the orthogonal design principle in databases. It means that two or more tables do not have mutually independent meaning in the sense that the same row can satisfy the predicates of multiple tables and thus appear in multiple tables. The choice to violate the principle is not from absolute technical necessity. The reason could be an expectation of better performance of some read operations. In case of the NS systems, denormalization would mean knowing violation of one or more design theorems in at least one part of the system to improve the overall satisfaction with the system. The database normalization theory defines intermediate levels of normalization (normal forms) whereas the NS theory only states the end goal that the system must satisfy all the design theorems. It means that if one wants to reverse the normalization process after its completion or perhaps not to complete it in the first place, then the database normalization theory offers possible levels where to stop but the NS theory does not. The proponents of the NS theory promote full normalization at the software level but allow relaxation of the rules at the higher levels (Verelst et al., 2013) without exact guidelines where to stop. Both theories offer normalization as a tool that one should use according to his/her best understanding and the needs of a particular context. Different levels of database normalization make it possible to do database normalization iteratively in a manner that different iterations take tables to different normal forms. The NS theory also suggest possibility of iterative normalization (Mannaert et al., 2012b) in the same way. Because of the lack of different levels, the authors have to use vague descriptions like “making the elements ever more fine-grained over time.” One could say that denormalization in terms of separation of concerns appears in layered architectures. For instance, Pizka (2005) notes that normalizing (and thus reducing update/change anomalies) on one level of abstraction does not guarantee that higher levels of abstraction are free from these anomalies. For instance, a page or a form in a user interface is a user connector element in terms of the NS theory. This element may combine different tasks (functionality) and thus recouple concerns of the lower system layers. Moreover, it may present data about different entity types and relationship types that one considers separate concerns at the database level and thus recouple these concerns. Another example is that in the database one can implement the virtual data layer (Burns, 2011), which consists of functions and procedures that are both action elements as well as views that are data elements. Again, these elements could recouple concerns of the lower system layers. 
Yet another example are macros in many applications that recouple lower-level actions and considerably improve the usability of the system. Thus, we see that denormalization at the higher system layers in terms of lower layers is even desirable to ensure usability of the system. In this context, we cannot speak about total separation of concerns as required by the NS theory but only

about a goal to increase the separation within and between layers but not necessarily in the interfaces of these layers. “Concern” concept is very flexible. Thus, an interpretation of the previous section is that in case of different layers the completely separated (atomic) concerns have different granularity. It is the general property of the layered architecture that elements at the higher layers “abstract away” and hide elements at the lower layers. Thus, they recouple concerns that are separate at the lower layers. For instance, if we create tables in a SQL database, then it abstracts away from the users of the tables things like internal data storage, indexing to improve query performance, algorithms for checking integrity constraints, and logging data modifications to be able to roll them back. At the internal database level (layer), one would consider these as separate concerns but at the conceptual level of databases one recouples these concerns and works with a higher- level concept that is a table. Moreover, denormalization at one layer may lead to denormalization at the upper layers. Raymond and Tompa (1992) write that data redundancy leads to redundant processing in applications. Because such redundancy violates the canonical form property of concerns it means that CEs in data lead to the CEs in applications. One can mitigate the effect with the help of views, which can give impression of data redundancy to readers while reducing redundant processing because there is no need to update data in multiple places. Tarr et al. (1999) explain the concept of one single, dominant dimension of separating concerns at a time in typical software environments. Data models (like the relational data model) that support creating views and thus implementing virtual data layer break the “tyranny of the dominant decomposition” within the data dimension of concerns. Akşit et al. (2001) describe six “c” properties that each concern should have. The database tables (as implementations of concerns) have all the six properties. Tables correspond to solution domain concerns that describe parts of systems that one creates to solve problems of clients. Nonloss decomposition of tables ensures that the resulting tables have canonical form property, meaning that in the result of decomposition there are no redundant tables, which one does not need to restore the original table. Tables are composable by using join operator. Moreover, the closure property of relational algebra ensures that one could further compose the composed tables by again using join operation because output from one such operation could be an input to another relational algebra operation. Tables are computable , meaning that they are first class abstractions in the implementation language (for instance, SQL) and thus one could create them in the implementation environment (DBMS). Tables have closure property, meaning that both separated and composed tables have all the same general properties (no duplicate rows, each column has a type and unique name within the table, etc.). Concerns must also have certifiability property, meaning that it must be possible to evaluate and control their quality. There are certainly methods for evaluating and improving table design (one of them is the database normalization theory) but it would be a topic of another paper. Hürsch and Lopes (1995) note that redundant system elements help us to achieve fault-tolerance in computing. 
The database normalization theory does not deal with the data redundancy caused by the need to protect data assets by making distinct copies of them (by using replication or by making backups). Similarly, we want to protect source code or documents by making copies of them, thus increasing redundancy and CEs. This is outside the scope of the NS theory as well. In both cases, we need appropriate tools and processes for version control. Please note that this example also shows that there are types of redundancies that are outside the scope of the theories.