Thursday, December 6, 2012

The tricky swamps of database versioning

If a developer has to draw a picture of a typical solution, he or she usually draws the cylinder that represents the database first. Unfortunately, when you start to apply version control to the solution, you tend to version the code first and, more often than not, omit the database from the process entirely. One can ascribe this to the fact that database versioning can become relatively tricky, but also to the fact that few people understand exactly how to version a database.

When it comes to database versioning there are quite a few options and tools out there, each catering for different development and technology scenarios. I have come across two vastly different strategies in my (short) career:

Main repository strategy
In this strategy, you set up a reference database on a central server that all developers can access. This database is widely accepted as the definitive model, and any change that has not been committed to the reference database's structure can be assumed to be non-existent or still in development. When changes are made to the solution, a developer would first have to compare his local database with this reference database before starting a unit of work. When updates are run on external servers - whether on the QA, staging or production environments - a compare is done between the database being updated and the reference database. Most differences identified in this way can be applied by the comparison tool being used, but in some situations they will have to be applied by hand.

Benefits of this strategy:
  • At any given moment, anyone on the team can determine and apply the database structure required to keep the solution working.
  • There are multiple comparison tools on the market, both open source and commercial.
  • A database administrator can copy the reference database in order to obtain a clean-slate solution. The major benefit of this is that you wouldn't have to make sure that any lookup tables are pre-populated, since the reference database should already carry this information.
Challenges of this strategy:
  • Theoretically, two developers might make conflicting changes to the reference database at the same time, which in turn would break the build. If the data layer is accessed through Entity Framework or any other technology with a central access point, one can circumvent this problem using Team Foundation Server. Instate an internal rule that all database changes be committed in a short sprint: the developer first checks out and locks all database access files, commits the database change both in the reference database and in the code, and then checks in the access files' changes. Since no two developers can hold the lock on these files at the same time, this effectively enforces one-at-a-time access to the database. The sprint doesn't have to contain the entire solution for the requested change - only the database scaffolding and any breaking code fixes.
  • Updates to remote servers require VPN access or exposure to the Internet and might also be terribly slow. This can be circumvented by developing a small in-house tool - or finding an existing one - that saves the database structure from the reference database to a file. One can then copy this file to the remote server, which also has a copy of the tool installed, and compare against the file instead of doing a direct database compare.
  • Additional data manipulations would have to be communicated either by email or by checking in scripts together with the solution. Issues such as a wrongly assumed default value for a non-nullable column can get quite out of hand - not to mention columns that should be populated through the transformation of a subset of other data.
  • The version of any given database can only be determined through a comparison. Rolling back to a given version might become a nightmare, since there are no clearly defined version intervals. This makes tracking down specific bugs difficult unless a copy of the database on which they occur can be obtained.
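The file-based compare mentioned above could be sketched roughly as follows - a minimal, hypothetical Python example (the post doesn't name a specific tool), where the schema is exported as a plain dictionary on the central server, saved as JSON, copied to the remote machine, and diffed there:

```python
import json

def diff_schemas(reference: dict, target: dict) -> dict:
    """Compare two schema snapshots (table name -> column definitions)."""
    missing = {t: cols for t, cols in reference.items() if t not in target}
    extra = [t for t in target if t not in reference]
    changed = {
        t: {c: d for c, d in cols.items() if target[t].get(c) != d}
        for t, cols in reference.items()
        if t in target and cols != target[t]
    }
    return {"missing_tables": missing, "extra_tables": extra,
            "changed_columns": {t: c for t, c in changed.items() if c}}

# The reference snapshot would be exported on the central server and
# copied to the remote server as a plain JSON file; the names below
# are purely illustrative.
reference = {"Customer": {"Id": "int NOT NULL", "Name": "nvarchar(100) NOT NULL"}}
target    = {"Customer": {"Id": "int NOT NULL", "Name": "nvarchar(50) NOT NULL"}}
print(json.dumps(diff_schemas(reference, target), indent=2))
```

A real tool would of course read the snapshot from the live database's system catalogs; the point is only that a serialized snapshot removes the need for a direct, slow connection to the reference database.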
Versioning scripts strategy
The general gist of this versioning option is that each change to the database is accompanied by a change script, which is then checked into TFS together with the code of your solution. The first step of this strategy is to create a baseline script that generates your entire database and populates any lookup tables. This baseline can be written by hand or generated with Microsoft SQL Server tools. It can be one big file or consist of a file per table, stored procedure or other object. The preferred method is to generate all the table scripts in one file and split each stored procedure and view out into its own file.

This baseline is then augmented by change scripts, each of which carries a sequential number. Each change script can include any additional data manipulations, such as default values for non-nullable columns, data drawn from other columns, and so forth.

These change scripts should be accompanied by a table carrying versioning information in each database. This not only clearly defines which version the database is on and which change scripts have not been run yet, but also allows tools to apply the change scripts automatically. Whenever the major release number changes, it is advisable to create a new baseline script in order to make fresh roll-outs easier.
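A minimal sketch of such a versioning table and an auto-apply tool might look like this - Python with SQLite standing in for SQL Server, and the table and column names being illustrative assumptions, not a prescribed schema:

```python
import sqlite3

def applied_versions(conn):
    """Ensure the version table exists and return the recorded versions."""
    conn.execute("""CREATE TABLE IF NOT EXISTS SchemaVersion (
                        Version TEXT PRIMARY KEY,
                        AppliedOn TEXT DEFAULT CURRENT_TIMESTAMP)""")
    return {row[0] for row in conn.execute("SELECT Version FROM SchemaVersion")}

def apply_pending(conn, scripts):
    """scripts: ordered list of (version, sql) pairs, baseline first.
    Scripts already recorded in SchemaVersion are skipped."""
    done = applied_versions(conn)
    for version, sql in scripts:
        if version in done:
            continue
        conn.executescript(sql)
        conn.execute("INSERT INTO SchemaVersion (Version) VALUES (?)", (version,))
        conn.commit()

conn = sqlite3.connect(":memory:")
scripts = [
    ("1.0.0", "CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);"),
    ("1.0.1", "ALTER TABLE Customer ADD COLUMN Email TEXT;"),
]
apply_pending(conn, scripts)
print(sorted(applied_versions(conn)))  # both versions are now recorded
```

Because applied versions are skipped, the tool can be run against any database - a fresh one, or one several versions behind - and it will only execute what is pending.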

Benefits of this strategy:
  • Simultaneous updates to the database structure are explicitly handled by TFS. If two developers try to check in change scripts at the same time, TFS will stop the one that is a millisecond later. This only holds, however, if an internal rule is instated that change scripts carry a specific name such as "c[major release].[minor release].[point release].sql".
  • A developer can easily create a clean database simply by running the baseline and the change scripts against it. In the same way, a developer can generate a database only up to a certain version in order to recreate a specific bug thrown by solutions that run on an earlier database version.
  • There might not be a definitive database structure stored somewhere, but one can be easily created.
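Ordering scripts named in the "c[major].[minor].[point].sql" convention needs numeric rather than alphabetical sorting (alphabetically, "c1.0.10.sql" sorts before "c1.0.2.sql"). A hypothetical helper for ordering the scripts and selecting those up to a target version - e.g. to recreate a bug on an earlier database version - could look like this:

```python
import re

def parse_version(filename):
    """Parse a name like 'c1.2.10.sql' into a sortable tuple (1, 2, 10)."""
    m = re.fullmatch(r"c(\d+)\.(\d+)\.(\d+)\.sql", filename)
    if not m:
        raise ValueError(f"not a change script: {filename}")
    return tuple(int(p) for p in m.groups())

def scripts_up_to(filenames, target):
    """Order change scripts numerically and stop at the target version."""
    ordered = sorted(filenames, key=parse_version)
    return [f for f in ordered if parse_version(f) <= target]

files = ["c1.0.10.sql", "c1.0.2.sql", "c1.1.0.sql", "c1.0.1.sql"]
print(scripts_up_to(files, (1, 0, 10)))
# -> ['c1.0.1.sql', 'c1.0.2.sql', 'c1.0.10.sql']
```

Feeding the resulting list into the apply tool, baseline first, yields a database at exactly the requested version.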

Challenges of this strategy:
  • If a tool isn't used to run the change scripts, it might take some time and training before developers get in the habit of running them manually at frequent intervals.
  • When the major release number changes and a new baseline is created, a situation might arise where a certain implementation is still on the previous version but needs a bug fix that instigates a database change. In these situations, one can either generate a change script for the previous version and augment the baseline of the new version, or generate change scripts for both versions. In both options you would have to keep track of which implementations are running which version and apply only the necessary change scripts.
  • The sheer number of change scripts might become a problem on solutions where frequent database changes are committed if tools aren't used to manage this.
A more in-depth look at the change script strategy can be found here. The previous entries regarding the matter are listed at the top of the blog entry.

Database versioning might seem unnecessary in a one-man development, but the moment multiple developers or multiple deployments are involved, it becomes a critical problem if not managed correctly. It is something you should at least consider and start to plan for before writing even a line of code on a new project. If your database is not being versioned at the moment, the best time to start would've been yesterday.
