Posts tagged Data modeling
Database Inheritance : Subtyping
0If you are familiar with any object oriented programming language, the term inheritance may not be new to you. In OO programming such as in Java or C++ you basically put all the common attributes and behavior in a parent class and then you provide specific attribute and behavior implementation in each subclass. This feature is also supported by all major DBMS; it is called subtyping.
Some examples of subtyping are as follows:
- A School can be broken down into: regular school, charter school, administrative school, etc…
- A person can be broken down into: man or woman.
- A customer can be broken down into: an internal customer or an external customer.
- A product can be broken down into several products, books, music, movies.
- A book can be of different types: Hard Cover, PDF format, kindle format etc…
Let’s take a look at it in action:
Let’s depict the diagram above:
Book Entity
- Is the Parent entity, aka Supertype
- The Supertype is at the one and only one side of the relationship.
- Has all the information that is common among all of its subtypes.
- It’s PK is going to be the PK of the Subtype as well so that we can get to the details and/or generic information back and forth.
Paperback Book and Digital Book Entities
- Are the Child entities, aka Subtypes.
- The Subtypes are at the zero or one side of the relationship.
- They contain information specific to them only.
- Their PK is the Parent entity primary key so that you can easily get to the common information.
Alert: Some people put a char column in the parent entity to be able to determine the subtype of that record without having to look at the subtype. Be very careful with this approach because if a parent entity can be of both types the solution would not be consistent; if you do that make sure that the subtypes are mutually exclusive.
Self-Referencing relationships
1Also known as recursive relationships, these are unary type relationships. These type of relationships occur when an entity can reference itself.
Recursive relationships occur often in hierarchies, and also in objects that are part of a whole. Below are some scenarios that could be modeled using recursive relationships:
- A course at a school or university can have 1 or more prerequisites.
- Organizations have people hierarchies (CEO,VP, Director, Managers, Employees).
- A finished product can be made of many sub products or parts.
- Also biological hierarchies(Dad, mom, siblings, son, etc..)
Let’s look at it in action, grab some Java mug, just an advice
We are going to solve this with the four steps as follows:
- Get a visual of the hierarchy (Extremely important).
- Create the database model.
- Fill it with some data.
- Write some queries.
Step #1: Get a visual of the hierarchy
The University in Illinois has one of the most reputable computer science programs in the world, so I have decided to model a very tiny section of their curriculum to illustrate self-referencing relationships. The following mind map represents two master level computer science classes and its prerequisites:
Step #2: Create the database model
The database table that will support this solution looks as follows:
Alert: Notice how the relationship line is optional at each end, you must do this otherwise the recursion would go in an infinite loop.
Step #3: Fill it with some data
The following screen shot shows all the data in the COURSE table that I have manually entered.
Step #4: Write some queries
You can write the queries in two styles:
1. By doing a query for each level of the hierarchy and combining them together by doing union all.
2. By writing a more generic query and providing an in clause. This step is not recommended if you have many rows in the table as it will drive you more insane than the first step.
Style #1: Query by “union all”:
We are going to do the first query technique to demonstrate what the data looks like for the master level course CS 414 (Refer to the course on the left in the mind map).
select c.id, c.course_code, c.course_name, c.course_category, c.next_course from course c inner join course d on c.next_course = d.id where c.id in (1) union all select c.id, c.course_code, c.course_name, c.course_category, c.next_course from course c inner join course d on c.next_course = d.id where c.id in (2) union all select c.id, c.course_code, c.course_name, c.course_category, c.next_course from course c inner join course d on c.next_course = d.id where c.id in (3) union all select c.id, c.course_code, c.course_name, c.course_category, c.next_course from course c inner join course d on c.next_course = d.id where c.id in (4)
The query above produces the resultset as follows:
Style #2: Query by “in” clause
Now we are going to write a shorter query that will demonstrate what the data looks like for the master level course CS 421 (Refer to the courses hanging to the right of CS 421 in the mind map).
To obtain the dependencies hanging to the left of CS 421 we could write the following query:
select c.id,
c.course_code,
c.course_name,
c.course_category,
c.next_course
from course c
inner join course d
on c.next_course = d.id
where c.id in (5,6,8,9)
This query will produce the resultset as follows:
To obtain the dependencies hanging to the right of CS 421 we could write the following query:
select c.id,
c.course_code,
c.course_name,
c.course_category,
c.next_course
from course c
inner join course d
on c.next_course = d.id
where c.id in (5,7,10,11,12,13)
This query will produce the resultset as follows:
Improvements
Now that I have finished writing the post i realize that I could improve the queries by replacing the next_course column with a next_course_name column. If you want to do that comment out the c.next_course column from the select clause and replace it with d.course_name as next_course_name.
Best Practices
The best practice is to have a collection of views that give you all the data representation that you need to have for each hierarchy. That way you only have to pull your hair once and hope that no reorganization is going to happen (EVIL LAUGH HERE) . If it does, then you will have to go through a similar painful exercise again.
Conclusion: Recursive relationships can be very tedious and difficult to implement. Hence, the bullet points below:
- documentation to understand the recursion is very important.
- A Mind Map is the best way to document a recursive relationship.
- A business group should be responsible for owning that document and passing it along to IT staff. This document should be revised at least once a year (or more) to ensure the hierarchy has not changed.
Without the mind map I would have not had the right state of mind to solve this problem. The tool I use for mind mapping is called Novamind (it is not free though) but very good for learning, brainstorming, and other purposes.
Many-to-Many Relationships
1This topic deserves special attention because many people from database designers to application developers (sometimes one person does both) do not get it right. A Many-to-Many relationship occurs when each instance from the entity on the left can have many instances from the entity on the right, and vice versa.
Some examples of many-to-many relationships are as follows:
- Each order can contain many items. Each item can also belong to many orders as it can be purchased by different customers.
- Each student can be signed up for many coursers. Each course can be assigned to multiple students.
Incorrect implementation of many-to-many relationships
We see this problem when people are trying to use foreign keys to solve this problem. Let’s illustrate it with the second example listed above as follows:
What we are trying to say here is that each student can take many courses and that each course can be taken by many students. So to implement this we have put a FK to the student table which will point to the course table.
Do you see any problems with this?
- Yes!!! we will have repeated student rows for each different course a student is taking.
- At the moment you do this your database would no longer be in 3NF because we would have repeated groups.
- This issue above can potentially yield to update anomalies.
- This relationship is said to be unresolved.
Correct implementation of many-to-many relationships
In order to correct the many-to-many implementation above we have to resolve relationship. This is done as follows:
- By adding another table between the two entities; this table is called associative entity or intersection table.
- By making the PK of the associative entity a composite key that consist of each of the parent’s PK (We are using Identifying relationship).
- By naming the intersection table as follows (not required by DBMS, but suggested): ENTITYA_ENTITYB. In our case STUDENT_COURSE or STUDENT_COURSE_REL.
- By Changing the cardinality of the parent tables to be One-to-Many to the associative entity.
Let’s see it in action:
What improvements do you see?
- The student is defined once so we avoid duplicate data.
- The courses are defined once and we avoid the same problem as if we would have put the FK in the course table in scenario #1.
- We do not run the risk of update anomalies.
- We can add meaningful fields in the associative entity. Those are called the Fixed Intersection data.
Alert: The FID will become your FACT tables and the parent entities would become your DIMENSION tables in OLAP database structures. Dimensional modeling is a different modeling technique which uses a lot of denormalization so that you can yield better query performance as data is only 1 join away at most.
One-to-Many Relationships
0A One-to-Many relationship occurs when an entity can have multiple instances of another entity in a relationship. The cardinality of the relationship between the two entities will dictate the number of those instances: 0, 1, or more. The entity on the many side must reference one and only one record in the parent entity.
Implementation
You have to bring the primary key of the parent entity (the one side of a relationship) as a foreign key to the child entity (the many side of the relationship). If your primary key on the parent entity is composed of more than one attribute then the foreign key of the child entity should also have those attributes.
There are two ways of implementing One-to-Many relationships. They are as follows:
- By using Non-Identifying Foreign key.
- By using Identifying Foreign key.
Approach #1: Using Non-Identifying Foreign key
A non-identifying relation happens when the foreign key in the dependent entity is not part of its own primary key; this is the most common approach. In the example below we have a school/student relationship with the following details:
- Business Rules: Each school can have zero or more students.
Each student must attend one school only. - Parent entity: SCHOOL
- Child or Dependent entity: STUDENT
- 1 to 1 side: SCHOOL
- many side: STUDENT
- Parent to child phrase: Each school can have 1 or more students
- Child to parent phrase: Each student must belong to a school
Approach #2: Using Identifying Foreign key
An identifying foreign key relation happens when the foreign key in the dependent entity is part of its own primary key. This approach is less popular, but perfectly valid. Let’s see in action:
- Business Rules: Each student can take many different tests during the school year.
Each test can only be taken once during a school year, and each test must belong to 1 student only. - Parent entity: STUDENT
- Child or Dependent entity: TESTS
- 1 to 1 side: STUDENT
- many side: TESTS
- Parent to child phrase: Each student can take 1 or more students
- Child to parent phrase: Each test must belong to a student
As you see in this approach the FK is part of the primary key in the dependent entity. Also by using identifying relationship here we ensure that one particular test can only be taken once by a given student each school year. So if the student was allowed to take the same test more than once in a given school year we would have to follow the non identifying foreign key approach.
Note: The students/tests relationship could have been better implemented as a many-to-many relationship since many students can take the same test. This style was just done to illustrate the identifying foreign key concept.
One-to-One Relationships
0A One-to-One relationship is a binary relationship. It occurs when one record in one entity references one record in the second entity.
Some modelers suggest that One-to-One relationships should be combined into a single entity, and that is usually considered best practice.
Implementation
The implementation is a little tricky; in order to enforce the semantics of a One-to-One relationship and avoid implementing a One-to-Many relationship we will have to put a unique index constraint in the foreign key.
In the diagram above, the STUDENT_ID foreign key in the STUDENT_DETAILS entity has a unique constraint which forces the DBMS to have only one row with that value. Without the unique constraint we would be allowing the database to have repeated STUDENT_ID values in the STUDENT_DETAILS entity and the relationship line would be out of synch with the actual implementation.
Alert:
When not to combine both entities:
- If you want better separation of information.
- If the two entities are managed by different owners.
- If the information in the referenced entity is optional and it can be null.
- If you anticipate the One-to-One relationship potentially becoming a One-to-Many.
Alert:
When to combine both entities:
- If you do not want to have an extra join in your queries.
- If the two entities are managed by the same group.
- If the information in the referenced entity is optional and it can be null (You would have come up with default values or allow for nulls).
- If you are completely sure that it will never become a One-to-Many.
Anatomy of a database relationship
0Every relationship is said to have cardinality. The cardinality is the degree of the relationship. There are two types of degrees:
- Optionality: Is the minimum degree of the relationship and it can be of two types: Optional (circle) or Mandatory (perpendicular line).
- Cardinality: Is the maximum degree of the relationship and it can be represented with a crow’s feed for (many) or omitted for one.
Also every relationship has two sentences (or at least you should make an effort of typing them for readability and understanding purposes). The relationship phrases are as follows:
- Parent to child phrase.
- Child to parent phrase.
The following diagram illustrates the anatomy of a database relationship.
As you can see it is very important to determine the cardinality of the relationships in your data model as that will dictate your business rules. One last note, the dependent entity is called the child, and the referenced table is called the parent.
Database Relationships
5In the database world the lines that connect the different entities (tables) together are called relationships; these are the relationships that occur in real life between entities as part of a process. Relationships in the database are very important because can play a significant role in the areas as follows:
- Data quality
- Non repeating groups
- business rules enforcement
- Communication on how the business works (or better said how it has been implemented) to IT staff and business users with some modeling skills.
Every RDBMS out there supports four types of relationships. They are as follows:
- One-to-One relationships.
- One-to-Many relationships.
- Many-to-Many relationships.
- Self referencing relationships (aka recursive relationships)
Relationships can be between 1 or more tables. Here is some nomenclature:
| Number of tables in relationship | Tecnical term | Friendly term |
|---|---|---|
| 1 | Unary | Recursive relationship |
| 2 | Binary | Two table relationship |
| 3 | Ternary | Three table relationship |
| 4 or more | N-ary | Four or more tables relationship |
Update Anomalies and Normalization
3In the database world “update” is a very famous word, one that can make people swallow some saliva in some instances (evil laugh here) . Update anomalies is one of the many computer science paradigms, just so that you can get a feel for the magnitude of the word “update” it also means “insert” and “delete”. Update anomalies in the database space occur as a side effect of having redundant data and a poor normalized data model. The “update” anomalies are as follows:
- Insert anomalies
- Update anomalies
- Delete anomalies
To illustrate these issues we are going to use the database model and its data shown below:
STUDENT_ID STUDENT_FIRST_NAME STUDENT_LAST_NAME
1 Dario Pardo
3 Zinedine Zidane
4 Ronaldinho Gaucho
STUDENT_ID COURSE_CODE COURSE_NAME INSTRUCTOR COURSE_CREDITS
1 1 Data Modeling Graeme Simsion 4
1 2 Analysis Algorithms Weiss 4
3 2 Analysis Algorithms Weiss 4
4 2 Analysis Algorithms Weiss 4
Insert anomalies
What problems do you see?
- Repeated information (course code, course name, instructor, credits).
- Because of the first problem we cannot identify a primary key since every piece of data is repeated across multiple rows.
- Because of the second problem, we have create a composite key that consists of the student id and the course code (This will yield to the insertion anomaly).
The anomaly
The insert anomaly happens when we want to create a course but we do not have a student assigned to it yet. As you can see because of this horrible design we cannot add a new course unless we create a dummy record (in this case a dummy record in the student table) that would be assigned to every new course not available to valid students yet.
Do you see how ugly this solution is?
Do you see how this might be complicated to understand by other developers and business users?
Can you also imagine the ugly code that you will have to do at the application level to filter in/out courses not available yet?
Solution
Normalization. Redesign the database model at least to Third Normal Form.
Update anomalies
What problems do you see?
- Repeated groups of data (Course related and instructor related)
The anomaly
The update anomaly here would happen when we try to update any course related information or any instructor related information information. Say for example that we change the course name “Analysis Algorithms” to “Fundamentals of Algorithms” and that the update only happens on one row instead of on every row possible, we would have left the database in an inconsistent state.
Solution
Normalization. Redesign the database model at least to Third Normal Form.
Delete anomalies
What problems do you see?
- Repeated groups of data (Course related and instructor related)
The anomaly
The delete anomaly here would happen when we try to delete a student . It makes sense that if we delete a student, we also delete any associated rows in the course assignment for that student no? Let’s see what happens if we delete Dario Pardo which is assigned to Data Modeling class among others. Since nobody else is assigned to the Data Modeling class and that was the only row that had information about that class as well as the instructor we would have lost all the data about the course and the instructor. Now…depending on how critical and what type of information that was you or other people above you could loose their job over this; I am not kidding (no evil laugh here)
Solution
Normalization. Redesign the database model at least to Third Normal Form.
A normalized data model
This is a mini solution (it definitely could be much more elaborated, e.g self referencing relationship on courses to handle prerequisites) but for now it will get us away from the update anomalies problem by redesigning the database towards third normal form. Student, Staff, and Courses have been separated out into its own entities and have been connected through associative entities to resolve the many to many relationships to avoid redundant data. With this new design we can do as follows:
- Define new courses without creating a dummy record like we explained in the insertion anomaly.
- Updating course or teacher information now only happens once and touching 1 row only thanks to referential integrity. So we do not have to worry on any possible left outs like explained in the update anomaly.
- Deleting a student and its associated rows in the student_schedule table will not actually delete the course information like explained in the deletion anomaly.

















