Guiding Principles of Relational Data Modeling

Since 1997, when I first wrote a set of documents to capture principles and checklists for good data modeling, I have carried them with me to every employer and client I have worked for. Every one of them needed this set of guides and standards, and their software and data engineers have benefited greatly. Rather than keep them to myself, I’ve decided to share these with the public, in hopes they will help your data group produce better data deliverables. Please comment below if you disagree (and say why), see text that needs improvement, or know of a guiding principle that is missing.

Really Know Your Data

This is the most important principle, period. If you are curious about all aspects of the data you are working with, for which you are designing storage and retrieval mechanisms, you will become indispensable. You will be amazing at modeling, database development, queries and performance tuning.

A good engineer or architect is not content focusing on their little keyhole view of the system landscape. They want to know everything about the system they are a part of. And when it comes to the entity presently being designed, they want to know every nuance of every attribute. For example:

  • What kind of data will be captured? How much data will there be?
  • Is there a minimum or maximum length of the attribute?
  • What will be kept in every column? What values will be expected?
  • Will the users enter the data manually, or is the field fully controlled by the system?
  • Is the column value required or optional?
    • If it is required and the user doesn’t supply it, what is the default value?
  • How long will the data be kept?
  • Are there legal, financial or privacy concerns?
  • And so forth…

This thirst for knowledge about the data under your stewardship — and how it plays with the data in other systems where you are not the steward — is critical. See the Business Analysis doc (future post) for more hints and prompts about the sort of questions that are asked by a curious data designer. Knowing your data will help you make good design decisions now, and later, make you indispensable as a query wizard.
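When the answers are not written down anywhere, the database itself is often the best place to start asking. Here is a minimal profiling sketch; the customer table and email_addr column are hypothetical stand-ins, and FETCH FIRST assumes Oracle 12c or later:

-- Profile a column before modeling around it: volume, nullability,
-- length range and value spread (hypothetical table and column names).
SELECT COUNT(*)                   AS total_rows
      ,COUNT(email_addr)          AS non_null_rows
      ,COUNT(DISTINCT email_addr) AS distinct_values
      ,MIN(LENGTH(email_addr))    AS min_len
      ,MAX(LENGTH(email_addr))    AS max_len
  FROM customer;

-- Eyeball the most frequent values for surprises: magic codes, typos,
-- 'N/A' strings masquerading as real data.
SELECT email_addr, COUNT(*) AS occurrences
  FROM customer
 GROUP BY email_addr
 ORDER BY occurrences DESC
 FETCH FIRST 20 ROWS ONLY;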

Model First. Model Cleanly. Model Thoroughly.

Model cleanly means to follow basic relational design principles and to not compromise or accept technical debt in this area. In essence, “do it right the first time.” It is just not worth taking shortcuts on the system’s foundation for the sake of convenience or deadlines.

If the data foundation starts out compromised, it isn’t even suitable for the initial phases of the project, let alone enterprise-grade usage, expansion and growth of future phases. A poor data foundation will cause nothing but increasing pain and money as it ages.

It is far less costly to experiment, cut corners, and take risks with the exterior treatments, the user-friendly façade of a system. NEVER experiment or cut corners with the data foundation. We have all seen pictures of buildings that lean and crack because the foundation wasn’t properly prepared. The same WILL happen with software systems built hastily on a poor data layer.

On the other hand, if you start with a solid, clean data model, it is easy to build, re-build, extend and re-model on top of it later when needed.

The third level of normalization (3NF or BCNF) is usually sufficient for a clean model. Less than 3NF is lazy and reckless. More than that (4NF, 5NF, 6NF) runs into the law of diminishing returns, and results in models that are too complex to comprehend easily.
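As a quick illustration (table and column names are hypothetical), here is the classic transitive dependency that 3NF removes. The customer’s city depends on the customer, not on the order, so storing it on the order invites contradictions:

-- Before: not 3NF. customer_city depends on customer_id, not on the
-- key of orders, so it will drift out of sync as customers move.
CREATE TABLE orders (
 order_id      INTEGER NOT NULL
,customer_id   INTEGER NOT NULL
,customer_city VARCHAR2(100)
,order_date    DATE    NOT NULL
);

-- After: 3NF. Every non-key attribute depends on the key,
-- the whole key, and nothing but the key.
CREATE TABLE customer (
 customer_id   INTEGER NOT NULL
,customer_city VARCHAR2(100)
);

CREATE TABLE orders (
 order_id      INTEGER NOT NULL
,customer_id   INTEGER NOT NULL  -- foreign key to customer
,order_date    DATE    NOT NULL
);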

Once the model is clean and the database platform has been selected, proceed to the physical model, adding indexes, surrogate keys, constraints, historical tables and audit columns. These items support and optimize data queries, processing, and maintenance. Continue to avoid redundancy as much as possible.
After your base data model is solid, reviewed and tested, use it.

  • If the model is complex enough that it is difficult for developers to write joins, do the joins for them: write views that pre-join the harder parts of the model. Help them with their embedded and generated SQL (see the sketch after this list).
  • If the model is too slow for performance requirements, and other options have been exhausted (optimizing hardware, memory, database configuration, statistics, SQL tuning, etc.), THEN consider denormalization. The best denormalization options are those where the database automatically keeps the duplicates in sync for you, in real time, or with periodic refreshes if real time is not required. Oracle’s materialized views with MV logs and refresh on commit are a good example of auto-managed denormalization. If you are going to duplicate data to improve performance, ensure that robust mechanisms are used to keep the duplicates immediately or eventually up-to-date.
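To make both of those concrete, here is a minimal sketch, assuming hypothetical customer, orders and order_items tables. The view hides the joins; the materialized view is a database-managed duplicate that Oracle keeps in sync at commit time:

-- Pre-join the harder part of the model so developers don't have to
CREATE OR REPLACE VIEW order_detail_v AS
SELECT o.order_id, o.order_date, c.customer_name, i.product_code, i.quantity
  FROM orders o
  JOIN customer c    ON c.customer_id = o.customer_id
  JOIN order_items i ON i.order_id    = o.order_id;

-- Auto-managed denormalization: the MV log captures changes and
-- REFRESH FAST ON COMMIT applies them as each transaction commits.
CREATE MATERIALIZED VIEW LOG ON order_items
  WITH ROWID, SEQUENCE (order_id, quantity) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW order_totals_mv
  REFRESH FAST ON COMMIT
AS
SELECT order_id
      ,COUNT(*)        AS line_count
      ,COUNT(quantity) AS qty_count  -- lets Oracle maintain SUM on deletes
      ,SUM(quantity)   AS total_qty
  FROM order_items
 GROUP BY order_id;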

Model thoroughly implies taking ownership and caring about doing a fantastic job. If you grok the current state of the business, and anticipate where it is going, your data model is much more likely to age gracefully and support the business well into the future.

If this is not done, you may produce a technically clean data model that is nonetheless a dumpster fire for the business. A large part of modeling well is wrapped up in the first principle above, Really Know Your Data. Therefore:

  • Ask lots of the right questions to ensure that the business and data requirements are complete, and your model meets the needs of the system.
  • Capture and document all the entities, attributes, relationships, business rules and “tribal knowledge” mentioned in conversations, meetings, whiteboards, chats, emails and documents.
    • Collaboration spaces like OneNote, OneDrive, Teams and ADO can be used to store this documentation. However, with key players hiring on and leaving all the time, the documents will likely get stale quickly and fall into disrepair. The only place the documentation stands a slight chance of staying updated is in the object comments that are kept with the table, view, and column in the database. So ensure that every table, view and column has a full, well-written comment that is injected into the database with the COMMENT ON {table}.{column} IS '…'; syntax (see the sketch after this list).
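
A minimal sketch of those comment statements (hypothetical table and column):

COMMENT ON TABLE customer IS
 'One row per customer, person or organization, that has registered with us.';

COMMENT ON COLUMN customer.customer_since_dt IS
 'Date the customer relationship began. Set by the system at registration; never entered by users.';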

Model first is a mindset that applies to database change management. It means new and changed data structures all begin in the data modeling tool, in the versioned data model file, not as manual alterations in the database or hand-crafted DDL scripts.

In an ideal world, all database changes would begin with the data architect for your project. These changes would be entered in the modeling tool, validated against existing standards, committed to the enterprise data model repository, and then forward engineered into DDL scripts for your Flyway folder. The DDL scripts can then be tested in development, then committed to the Flyway folder for release to higher environments.
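
For example, a single forward-engineered change might land in the Flyway folder as a versioned migration script like this hypothetical one (Flyway picks it up by its V{version}__{description}.sql file name):

-- V42__add_customer_loyalty_tier.sql
-- Forward engineered from the data model, not hand-crafted in the database.
ALTER TABLE customer ADD loyalty_tier VARCHAR2(20) DEFAULT 'STANDARD' NOT NULL;

COMMENT ON COLUMN customer.loyalty_tier IS
 'Loyalty program tier: STANDARD, SILVER or GOLD. Defaults to STANDARD.';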

When this model-first culture is achieved, we will be able to:

  • generate and validate data glossaries and data lineage
  • build and maintain slick diagrams and documentation from the modeling tool for publishing, reference and discussions
  • compare data models from prior releases to see changes over time
  • generate data model change release notes
  • maintain a much cleaner information model
  • re-use common entity, attribute, and domain definitions across models in the same enterprise
  • perform cross-system dependency analysis
  • drive a data structure change request from initial requirement to production-ready in 10 minutes (data management can be Agile too), and much, much more.

To some, “model first” sounds like a fairy tale, just an unattainable fable. But I have built and lived in a model-first culture twice before, and it was awesome! It is doable, and it is well worth the effort.

Duplication is a Plague

Period. It just plain curses you to unending headaches… unless you have very robust self-healing, self-reporting, automated mechanisms to keep the various copies in sync with the master record.

Each unique fact should not be found more than once in the database.

As I write this at home, there are three digital and analog clocks within view.

Every one of them is showing me a different time! Immediately I distrust all of them and have to check my phone, which talks to a more trusted network-connected master system. In essence the clocks in my home have become unreliable decorations, almost useless approximations of the truth.

When business users encounter inconsistency in the data you manage for them, the same thing happens. If not caught in time, account-ending and even business-ending decisions can be made. Goodwill and trust in you and your data are then lost, and that trust is hard to regain.

The primary enemy of consistency and accuracy is duplication.

Duplicating a piece of data increases the points of failure. By failure I mean the very real possibility of data getting created, updated or removed in one place, but not in another. This likelihood increases as time passes, systems accumulate layers, and tribal memory is lost.

This topic deserves its own white paper. We could discuss scores of examples of well-meaning duplication that ended badly, examples I’ve witnessed in the last 25 years, and that was on fairly clean and well-designed systems!

Avoid duplicating entities and attributes at all costs… until you are forced to duplicate. If you must duplicate, do it well and ensure the copies are kept in sync with the master.

Name Things Well

Whenever I’ve spent time agonizing over the name of an entity or an attribute, I end up questioning the value of the time spent. How can something as seemingly trivial as naming take so much time?

A well-known quote from the late Phil Karlton of Netscape warns us: “There are only two hard things in Computer Science: cache invalidation and naming things.” In the thick of it, giving something a good name seems trivial, but it is not.

Another famous quote by Donald Knuth helps explain why good names matter: “Programs are meant to be read by humans and only incidentally for computers to execute.”

A well-named object pays for itself many times over. A poor name might be unreadable, misleading, opaque, confusing, incomplete or even the opposite of what the object actually contains. Once a database has a handful of these, it starts to feel like you’re living in a shabby neighborhood with absentee landlords. Every time you try to explain the model to someone, you have to punctuate your explanation with caveats, warnings and apologies.

It IS worth it to name your tables and columns well, to accurately represent what they contain.
Good naming is more an art form than a science. It requires creativity, practicality and a good handle on the English language.

I’d like to share a little tip that has helped me: Read the table/column comments out loud to yourself. Obviously if there is no documentation for the object you are naming, that needs to be obtained first. Once the item has been fully understood and documented, then a good name organically materializes. About once a year I’ve had to rename a column which I named poorly during the initial design. Literally every time this has happened, I found the better name hiding in the column comments. If I had simply paid better attention to the very documentation I had gathered, it would have been named right the first time around.

See the Naming Guide (future post) for a better treatment on this subject. It has a single-page naming syntax cheat sheet which can be printed and referenced until it is memorized and part of you.

Keep it Simple

“Simplicity is the most difficult thing to secure in this world; it is the last limit of experience and the last effort of genius.” – George Sand
“Simplicity is prerequisite for reliability.” – Edsger Dijkstra
“If you can’t explain it to a six-year-old, you don’t understand it yourself.” – attributed to Albert Einstein

A data model diagram and the content it depicts should be fairly easy to grasp, even by non-technical people. If you are having a hard time understanding the model, much less explaining it or writing queries against it, the design is not done yet. Difficulty or complexity usually signifies that the requirements were insufficient, the model was over-engineered, or there are modeling mistakes. With enough experience, you’ll keep refining a model until it “feels right.”

See the principle of modeling cleanly above. When a model is not clean, it looks and feels bloated, complex, dirty. Viewing the model in a diagram is an enormous help because it can quickly show you where foreign keys or unique constraints were missed, where columns should have been NOT NULL, where circular relationships lurk, and so on.

Revisit and refine the model until it is simple. Over time I have found that this state is not reached until at least the fifth iteration. It takes time to do it right. If you have to invest some extra time and dollars, do it here; invest in the application’s foundation for short- and long-term paybacks.

Keeping things simple and avoiding duplication are related to the “DRY” principle of software engineering, which states “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” (Hunt and Thomas)

A satirical article about duplication not being “all that bad” referred to the new NoSQL world of spreading data everywhere redundantly as the “WET principle”, which stands for “Write Everything Thrice”, “We Enjoy Typing” or “Waste Everyone’s Time.”

  • Everything should have a home.
  • Put things where they belong.
  • Reduce and re-use.
  • Ensure each entity, attribute, and function does one thing, and does it well.

Bonus Principles

These additional principles have served me well over the last two decades of data architecture and design for large and small companies:

Protect Your Data

One of the primary reasons that databases exist is to protect the integrity and accuracy of critical data. Ignoring integrity constraints is like building a supercar without a dashboard, steering or brakes, leaving you with just the engine and a pretty shell. It’ll go, but it will be hard to steer, stop, and protect the passengers. The sketch following this list pulls all of these protections together.

  • Use foreign key constraints, ensuring a column can only hold values that exist in the parent table.
  • Use primary key constraints for the surrogate keys, and unique constraints on the natural keys.
  • Use check constraints, default values, and NOT NULL constraints.
  • Define the columns with the right data type and constrain the maximum length (if applicable).
  • Secure the data (this could be a book unto itself, so I’ll just leave this here for further expansion).
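
A minimal sketch using a hypothetical payment table (the referenced account table is assumed to exist):

CREATE TABLE payment (
 payment_id   INTEGER       NOT NULL
,account_id   INTEGER       NOT NULL
,amount       NUMBER(12,2)  NOT NULL
,currency_cd  VARCHAR2(3)   DEFAULT 'USD'     NOT NULL
,status_cd    VARCHAR2(10)  DEFAULT 'PENDING' NOT NULL
,external_ref VARCHAR2(50)  NOT NULL
,CONSTRAINT payment_pk PRIMARY KEY (payment_id)          -- surrogate key
,CONSTRAINT payment_uk UNIQUE (external_ref)             -- natural key
,CONSTRAINT payment_account_fk FOREIGN KEY (account_id)
   REFERENCES account (account_id)                       -- only valid accounts
,CONSTRAINT payment_amount_ck CHECK (amount > 0)
,CONSTRAINT payment_status_ck CHECK (status_cd IN ('PENDING','SETTLED','FAILED'))
);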

Unleash Your Database

Too many treat their database like a hole in the ground, only useful for dumping data. This is akin to purchasing a fine sports car but never leaving first gear. Use the full capabilities of your database. Let your database do what it does best. There is no need for an application to write its own mechanisms to store, join, sort, cache, search, validate, or protect the data. That is the job of the database, and it has been coded and optimized by hundreds of software engineers. There is no need to attempt to replicate their efforts on your own.
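
As a small illustration (reusing the hypothetical payment table from the previous section; FETCH FIRST assumes Oracle 12c or later), let the database aggregate, sort and limit in one statement instead of hauling raw rows into the application to loop over:

-- Top ten accounts by spend: one round trip, no app-side loops
SELECT account_id, SUM(amount) AS total_spend
  FROM payment
 GROUP BY account_id
 ORDER BY total_spend DESC
 FETCH FIRST 10 ROWS ONLY;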

Of course, this is not carte blanche to use every new feature the database releases. Put new features through appropriate testing before adopting them.

Design to End Goals from Day One

Determine the security, performance, functionality, flexibility and data quality goals of your system before beginning. These things cannot be stapled onto the data model right before release. They must be the driving forces behind your design sessions from day one. Automated, round-trip, and agile development tools and processes are making late-stage changes easier, but it will still cost less to design properly the first time around.

If you had to choose one phase of your project on which to expend extra time, it should be during data model design. Infuse the model with features that meet system goals from the start, not after trouble comes along in production.

CREATE VIEW AS SELECT * does not dynamically adjust to the underlying object

It never ceases to amaze me how I can be humbled every day.

When you work with a technology for so long, teaching, writing, presenting, training and mentoring… you start to think you’re an expert.
You think by now you know all the basics, advanced aspects and nuances.
And yet, despite all that experience, there it was: an aspect of Oracle that has been there all along, and somehow I missed it! 25 years doing this for a living, and I missed it!

That or I’m starting to go senile.

In every Oracle development environment in which I’ve worked, either I or an architect that preceded me has included in the SQL standard the common rule to always use explicit column lists when building SELECT or INSERT statements. The simplified version of the rule’s justification is that things tend to change and when they do, the generic SELECT * and INSERT with no column list will break your app.

With my current employer’s most recent project, for the first time I consciously chose to break that rule.

I thought my reason was sound: the remote table on another Oracle database was being altered frequently, and I needed our local view of that table to adjust dynamically, rather than having to check yet another script into our database build every time they added a column. Ask most DBAs what a view actually is, and they tend to agree with the Oracle documentation that says views are stored queries. They don’t contain the data; they contain the query definition used to get at the data at runtime when the view is queried.

I figured that if my column list was the taboo “SELECT * FROM {remote table over db link}”, it would automatically see and include the new column(s) as it parsed and executed the SELECT * query.

As I understood it, upon feeding Oracle a view creation statement, it first strips out any SQL comments found at the top or bottom of the query, ensures the syntax is correct, and that it has the privileges it needed to function. After passing validations, it would then take the view’s defining query and store it as-is in the data dictionary. Then when a user queries the view, Oracle retrieves the view SQL out of the data dictionary, merges it with any predicates (WHERE clause criteria) in the user’s query, then handles it from there like any other SQL statement (optimizing, parsing, checking the SQL cache, executing, etc.)

That last assumption, that the defining query is stored as-is, is incorrect. My problem is found in my misunderstanding of how the view’s defining query is stored. If the defining query follows the common rule of using an explicit column list, Oracle will store that SELECT statement as-is, preserving aliases, character case, comments, etc. But if the defining query uses SELECT *, Oracle probes the data dictionary of the underlying objects to get every current column name. It then re-writes your “dynamic” SELECT statement to be static, replacing the “*” with a list of upper-cased, double-quoted column names as they stood at the time of view creation.

With a table defined like this:

DROP VIEW dyn_view;
DROP TABLE test_tab PURGE;

CREATE TABLE test_tab (
 ID  INTEGER NOT NULL
,NAME VARCHAR2(100)
,DESCRIPTION VARCHAR2(500)
)
/

INSERT INTO test_tab VALUES (1, 'First Row', 'I am the first row');
COMMIT;

And a “dynamic” adjustable view foolishly created like this:

CREATE VIEW dyn_view AS
SELECT * FROM test_tab;

We can look at how Oracle processed our view, whether it preserved the SELECT * or exploded it into an explicit, static column list:

SQL> SELECT text FROM user_views WHERE view_name = 'DYN_VIEW';
TEXT
----------------------------------------------
SELECT "ID","NAME","DESCRIPTION" FROM test_tab

As you can see above, the data dictionary now contains the current list of columns in test_tab, not the dynamic SELECT * we hoped for.

We can query the view just to see it in action before the table change:

SQL> SELECT * FROM dyn_view;
ID NAME      DESCRIPTION
-- --------- ------------------
 1 First Row I am the first row

Now let’s alter the table, as if we were the merciless department constantly changing the remote table we depend on:

ALTER TABLE test_tab
  ADD load_date DATE DEFAULT SYSDATE NOT NULL;

Did the content of the view happen to change because we changed the underlying object?

SQL> SELECT text FROM user_views WHERE view_name = 'DYN_VIEW';
TEXT
----------------------------------------------
SELECT "ID","NAME","DESCRIPTION" FROM test_tab

Nope; sadly, as seen above, it is the same definition as before. It is not possible to have a stored query that says “give me whatever you’ve got at the present time.” Instead, Oracle forces us to store a static column list at the time of view creation.

Does the new column exist on the base table?

SQL> SELECT * FROM test_tab;
ID NAME      DESCRIPTION        LOAD_DATE
-- --------- ------------------ -----------
 1 First Row I am the first row 2017Aug17 16:09:39

Just to be doubly certain, has our view adjusted to include it?

SQL> SELECT * FROM dyn_view;
ID NAME      DESCRIPTION
-- --------- ------------------
 1 First Row I am the first row

No, it has not. Well, that’s that.

So it turns out the view has to be re-created with an altered explicit column list to pick up the new column, or the original SELECT * creation script must be re-run:

CREATE OR REPLACE VIEW dyn_view AS
SELECT * FROM test_tab;

Checking the data dictionary once more, to see whether the re-created view picked up the new column:

SQL> SELECT text FROM user_views WHERE view_name = 'DYN_VIEW';
TEXT
----------------------------------------------
SELECT "ID","NAME","DESCRIPTION","LOAD_DATE" FROM test_tab

And there it is. Now the view finally sees the new column.

Note: While researching the Oracle docs for this topic, I finally found that Oracle does spell this out clearly in the Database Administrator’s Guide, in the section on Managing Views, Sequences and Synonyms, under the heading Expansion of Defining Queries at View Creation Time. It says: “When a view is created, Oracle Database expands any wildcard (*) in a top-level view query into a column list.” Whether it said this in the Oracle 7 manuals in 1995, when I first learned about views, is probably not worth pursuing.

It wasn’t until our application broke that I discovered my clever plan to have the view adjust dynamically was fundamentally flawed and would never work. It made me feel like it was my first week on the job with Oracle all over again. Because I had always followed the best practice of explicit column lists, I’d never run into this gap in my knowledge. Quite disappointing, but I learned something new, so it was still a good day.