Marco's Blog

All content personal opinions or work.
en eo

Web 2.0 Databases

2008-05-04 4 min read Projects marco

One thing I noticed in interviews recently is the shift in focus for the candidates. It’s been a while now that tech has shown itself resilient to the overall downturn in the economy, so we have been living in and with a candidate market (vs. employer market). One of the many reasons cited by people to turn you down is simply lack of technological fit. In particular, people are really fond of their infrastructure.

If you have a Ruby on Rails person in front of you, it’s like the good old days when you had to talk to a Forth Dimension geek. Ruby is perfect, and Rails makes it sublime. PHP people are not happy with PHP alone any longer, now they swear by CakePHP or some other framework. Java folks won’t even talk to you if you don’t use JBoss.

All in all, this is a good development. There has been way too much investment in reinventing the wheel, and a rational approach that reuses concepts and code is of enormous value. At the same time, though, all the frameworks strike me as ill-concieved in some form. They work fine as long as it’s five people using them, but they don’t scale well at all.

Looking at the infrastructure, it seems all of the frameworks make the same set of mistakes:

  • frequent and redundant database calls – the login sequence for Joomla 1.0.x requests the user information no less than 5 times
  • joins used to gain flexibility – what’s great for extendibility is a nightmare from a performance perspective
  • object oriented database design – instead of focusing on processes (what data are people going to request), the focus is on objects (what data belongs together)
  • lack of abstraction between query and request – a problem borne by the structure of database drivers, that make it a snap to forget that you may want to manage a more complex database than the one you have on your developer machine

Thinking about this got me into more ruminating. I ended up realizing the problem is not necessarily in the coding or the framework, but in the underlying repository. I seriously think that relational databases are not ideally suited for web development.

Each of the items brought up above speak to a deficiency of relational databases. Redundant database calls wouldn’t be an issue if the database had a 304 response (use cache, data not modified). Joins wouldn’t be such a nightmare if the database operated on them automatically.

Let’s see what I mean. Currently, when you want to specify a 1:n relationship in a table, you use a foreign key. Say you want to store the credit cards for each individual in your database. Since you know there is going to be only one individual for each card, you can create the table for the individuals and then add a foreign key to the individual into the credit card table.

A more complex relationship might be that of parents and children. Initially, you assume that every child has exactly one father and one mother. Then you realize that’s not really the case, since your parents might divorce and get remarried. So each parent might have many children and each child might have many fathers and mothers.

If we want to model this in our database, we have to create a JOIN table. We have a table with persons, and a join table that has two persons and their relationship. We might have a record like: (“Alice”, “Bob”, “MOTHER”)

The vast majority of people has only one father and one mother – not because the majority of people really have only one, but because we typically wouldn’t want to bother the software with keeping track of our prior relations. When we ask the database for Bob’s mother, it has to go to the JOIN table and figure out who the mother is, despite the fact that in most cases, that should be a simple call to the person table.

Instead of creating tables that know that a relationship is m:n, I posit that databases should be smarter about things. For instance, when we describe a table, we should be able to specify that a particular field is a “multiple” – which means it typically requires only a single allocation, but that in some cases it might actually require multiple results:

CREATE TABLE PERSON (

id AUTO PRIMARY KEY,

name VARCHAR(255),

father MULTIPLE FOREIGN KEY,

mother MULTIPLE FOREIGN KEY);

This, of course, is just an example of how things could be. In general, it surprises me that a lot of energy has been used to create perfect replicas of enterprise databases and tools that work around the issues of such databases, instead of spending the energy to come up with a database design better suited for the web.