[By Matthew French]
Choosing a database product to keep all your important data safe can be a very simple task, or a tremendously complicated and fractious ordeal.
It is easy when you don’t have a choice. Many applications are written to work with only one database. When you choose the application, the database is chosen for you.
Sure, you might only use dBase for your data right now, but that will change when the sales department takes a liking to a shiny new application that makes their lives a whole lot easier. Too bad that application only runs on Oracle.
When you have a choice, the issue becomes considerably more complicated. Once you buy into a particular database, you buy into its ecosystem. The longer you work with a particular database vendor, the harder it becomes to escape its grip. It’s no surprise, then, that many large organisations have an installation of every major database product. The only variation is how much of each database is installed.
Standards pundits will no doubt point out that SQL has been around since the 1970s, and that the first ANSI SQL specification was formalised in 1986. Since most major databases support the SQL standard, it should be quite easy to modify an application to use another database engine.
Unfortunately, despite it being almost 25 years old, the SQL standard still doesn’t cover everything most modern databases do. And nor should it. There must be space for innovation and competition. The downside is that some programmer will find and use a feature in the database that is not compatible with any other database. Even seemingly trivial issues, like how the database handles sorting, can introduce strange behavior when the programmer hasn’t anticipated it.
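As a small illustration of the sorting point, here is a minimal sketch using Python's built-in sqlite3 module. The table and function names are hypothetical. SQLite's default BINARY collation compares raw bytes, while its NOCASE collation ignores ASCII case, so the same ORDER BY returns rows in a different order depending on which rule applies. Other engines pick their own defaults, which is where the surprises come from.

```python
import sqlite3


def sorted_names(collation):
    """Return the sample names ordered under the given SQLite collation."""
    if collation not in ("BINARY", "NOCASE"):
        raise ValueError("unsupported collation")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE customer (name TEXT)")
        conn.executemany(
            "INSERT INTO customer (name) VALUES (?)",
            [("apple",), ("Banana",), ("cherry",)],
        )
        rows = conn.execute(
            f"SELECT name FROM customer ORDER BY name COLLATE {collation}"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


# BINARY compares bytes, so the uppercase "B" sorts before the lowercase "a".
binary_order = sorted_names("BINARY")   # ['Banana', 'apple', 'cherry']
# NOCASE folds ASCII case, giving the order most people expect.
nocase_order = sorted_names("NOCASE")   # ['apple', 'Banana', 'cherry']
```

An application that quietly depends on one of these orderings will behave differently the day it is pointed at a database with a different default.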
Over time the lock-in just increases. Converting dates, manipulating strings, using triggers or writing stored procedures will only result in you being tied further into a specific database.
Object-relational tools like Hibernate can make it easier to support multiple database vendors by hiding the underlying database. But this will only hold off the inevitable. Unless the application developers continuously run comprehensive tests against every supported database, chances are the application will be tied to whatever database they developed on.
So, how does one choose a database technology? The first question is whether you need a database at all. That is because these days the term “database” is synonymous with relational database, so we tend to forget there are alternatives. The file system is a free and easily accessible database. It is simple to write programs to use it and even novice users understand how it works. Sure, the indexing is poor and you don’t have transactions, but for some applications it provides everything you need.
Then there are object databases. These essentially store data in the same way that most modern applications access it. In theory, object databases make it possible to retrieve and manipulate related objects faster than any relational database ever could. In practice, projects that use object databases tend to suffer.
The reasons are many and complex: programmers try to use the databases as if they were relational databases, the technology is not widely supported, reporting and business intelligence tools don’t work well with object databases, and querying the database is said to cause performance problems.
Despite this, object databases seem to have found several niches, such as for spatial data and molecular biology. They have also been used with some success as large data caches. So, while the technology is not widely adopted, it is still a good choice for certain problems.
However, the assumption is that relational databases are what we are looking for, so which one do we choose? There is one other important criterion we need to consider: do you need transactions?
The answer is usually either “What are transactions?” or “Why wouldn’t I want transactions?”. Answering the first question is beyond the scope of this article, but a quick summary is that transactions make it possible to undo stupid mistakes, and they make it possible to have many users accessing the database at the same time without strange things happening.
The second question is more relevant. Most modern relational databases support transactions. If we need them they are there, and if we don’t need them we can just ignore the feature.
The problem is that to ensure consistency, a database that supports transactions needs to write the same data many times over. The process can be roughly described as writing a log that says: “This is what I am going to do. I am doing it. Now it is done.” The reason is that if something terrible should happen in the middle of a transaction, like the janitor unplugging the server so they can vacuum the server room, then the data can still be recovered to a consistent state. The last thing you want is all the debits without the credits.
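To make the "debits without the credits" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The account table, the names in it and the transfer function are all hypothetical. It shows only the visible effect of a transaction (both updates happen or neither does), not the logging the database does underneath.

```python
import sqlite3


def transfer(conn, source, target, amount):
    """Move amount between accounts; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            # The CHECK constraint rejects an overdraft here, after the
            # credit above has already been applied inside the transaction.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds: alice 70, bob 80
failed = transfer(conn, "alice", "bob", 500)  # overdraft: the credit is undone too
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed transfer the balances are exactly what they were before it started. Without the transaction, the second account would have kept a credit that was never matched by a debit.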
This process of logging transactions brings a huge performance penalty. For many applications it is one well worth paying. But sometimes you don’t want to pay the penalty. Examples are Web applications that spend most of their time reading data, applications that use the database as a temporary store, and small embedded applications that use the database as a convenient way to manage internal application data.
If you don’t want transactions then you probably want to look at the simpler databases. Microsoft Access databases are dreaded by administrators who have had to deal with corruption and locking issues, but they work well in scenarios where the data is temporary or there is only one user.
For Web applications, MySQL has become a firm favorite. Though it supports transactions, you have the option to turn them off. On Unix there are also lightweight open-source databases like DBM and its successors, which are adequate for small tasks, even if you wouldn’t want to use them for your company accounts.
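For a feel of how little a DBM-style store asks of you, here is a minimal sketch using the dbm package from Python's standard library. It uses dbm.dumb, the pure-Python fallback that is always available. The helper names, file name and keys are hypothetical. There are no tables, no SQL and no transactions, just keys mapped to values in a file.

```python
import dbm.dumb
import os
import tempfile


def remember(path, key, value):
    """Store one key/value pair; the database files are created on first use."""
    with dbm.dumb.open(path, "c") as db:
        db[key] = value


def recall(path, key):
    """Return the stored bytes for key, or None if the key is absent."""
    with dbm.dumb.open(path, "c") as db:
        return db.get(key)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "settings")
    remember(path, "theme", "dark")
    stored = recall(path, "theme")   # values come back as bytes
    missing = recall(path, "font")   # never stored, so None
```

That is adequate for application settings or a small cache, which is the kind of small task these databases suit.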
Then there are dozens of databases where the entire engine is written in a higher-level language like Java or Python. The goal is to make it possible to embed the databases directly into applications. These databases are not going to win any scalability awards, but are quite adequate for what they do. Often their performance is surprisingly good.
If you want a fully transactional database then you have even more choice. If you like open source then PostgreSQL should probably be at the top of your list. Not so long ago MySQL would have been right beside it, but since it was acquired by Oracle there has been some concern around how long it will remain free. Even if you are not dedicated to open source, these are both competent database servers used in many big companies.
If your company is allergic to open source, there are plenty of commercial alternatives. SQL Server is popular among Windows developers and Microsoft is fond of showing off how well its database can match the other contenders. Sybase is popular in many Unix shops and IBM’s DB2 has loyal followers who will never hesitate to point out its obvious (to them) superiority.
Then there is the big cheese of databases: Oracle. Though Oracle CEO Larry Ellison might be brash and obnoxious, he seems to have no problem persuading customers to buy his database. Oracle has a lot to offer — its database is fast and flexible, but it is expensive and needs skilled hands to keep it working properly. Of course, the competition will say that Oracle is not faster.
Finally, if you think Oracle is too cheap, there are always mainframes and dedicated high-performance databases like Teradata.
Here we have a wide variety of choice, and we haven’t even looked at all the options. Choosing a database is not easy.
The simple truth is that the set theory behind relational databases has been well understood for two decades. The basic principles haven’t changed, which means that there is very little to distinguish between the core functionality of different products. Some database implementations might be faster or more scalable than others, but the difference is usually small.
As computers have become faster it has become less important to squeeze those extra milliseconds out of running a query. Obviously, if you are responsible for the retail banking application at a large financial institution, or for itemised billing at a cellphone network, then it is worth your while to do an intensive comparison.
For the rest of us, chances are you already have experience with at least one major database vendor. You probably have the relationships with the community and the vendor, and possibly you have all the tools for the database. Almost certainly you have the scars of experience.
So back to the question: which is the best database? The answer is: the database you know.
- Matthew French is an independent consultant with more than 20 years of experience in the IT industry