Using GUIDs as row identifiers
- 📅
- 📝 387 words
- 🕙 2 minutes
- 📦 .NET
- 💬 13 responses
Wade Wright is preaching that IDs in a database should always be GUIDs and lists four reasons. I commented there with my opinions but it hasn’t shown up — some people like to censor if you don’t agree with them completely. My points addressing each of his four ‘reasons’ were:
- Avoid round-trips when creating new business objects I don’t know about you but I don’t populate my IDs until I persist and not on object creation so this reason is irrelevant.
- Data merging is easy Yes, this is a good reason to use GUIDs if you know merging is likely to happen.
- Type/table ignorance Basically he says here that a foreign key reference might relate to a row in any table which breaks relational concepts big time.
- Some weird anecdote I don’t quite get
He does acknowledge that querying the database by hand is a bit of a pain but underestimates the Guid/uniqueidentifier performance penalty in SQL Server especially with regards to INSERTs (might be able to optimize there by having .NET generate the GUID instead of SQL Server).
He also fails to consider how a page full of GUIDs will bloat out HTML page size especially if they end up used in a number of full drop-down SELECT boxes or collapsible JavaScript tree on a single page although deflate and GZip compression should take care of that if you’ve turned it on.
There are many valid scenarios for using GUIDs as unique identifiers but one size does not fit all using them without consideration will likely lead to somewhere bad ;-)
If you are using them and space is tight such as uncompressed HTML or the GET/POST size down you can write and parse the Guid as Base64 instead of hex format which cuts a Guid of 81187ecf-c452-4550-9ed2-8a51e3c46da1
(36 bytes) to z34YgVLEUEWe0opR48RtoQ==
(22–24 bytes) which soon adds up. To do that in C# and .NET simply:
string base64 = Convert.ToBase64String(guid.ToByteArray());
Guid guid = new Guid(Convert.FromBase64String(base64));
For a further squeeze you could use ASCII85 encoding with Jeff Atwood’s ASCII85 implementation in C#.
[)amien
13 responses to Using GUIDs as row identifiers
I read most of people’s feedback here plus made some tests on some DBMS.
Guid are a 128bit integer. Behind it is an algorithm that obviously will be slower than old standard sequential/smaller integers. But, on the contrary to some here, I believe this actually boost up the relational concept all together if you implement this in large scale.
I’ve tested a similar concept with IBM’s DB2 token datatype on a huge os/390 mainframe. This has proven very powerful in sharing/replicating data in between mainframes across the world. Since it gives you a very high potential of IDs to be globally unique.
This ROW id technique is being used now in between some govs I worked for. It’s perfect to create entities and link them with a bunch of other “sattelite” information coming from other external systems and building up a huge database of links in between entities.
That’s my opinion.
@Robert: I knew I would drag it out of you if I just kept at it! Thanks for the lesson, I get it now :)
@George: No, a Primary Key is not just some unique string. Unless either a) you don’t care if you get duplicates or b) some other unique string Really identifies the thing. The point of the contemplation is that a true RDBMS models, in the fulsome mathematical sense, whereas IMS/XML/OODBMS/etc. do not. With solid state machines, there is no advantage to using clunky non-relational, redundant type datastores; such are thought to be faster since they don’t join. With solid state machines, use the Relational Model as Codd intended, and present the data as redundantly to the user as the user desires. Just know enough not to impose such on the datastore. Or the relational mantra: one fact, one place, one time. Or its lemma: storing a fact twice doesn’t make it twice as true.
And that’s quite the point: IMS/XML style datastores REQUIRE the coder to manage said pointers.
Quote: Whereas Codasyl IDS, 1964 was based on a network model of data, IBM’s database IMS, 1968 used a hierarchical structure. …Both the IBM and the Codasyl products were sometimes called navigational databases because they required the user to program or navigate around a data set. Bachman’s Turing Award lecture in 1973, in fact, was entitled “The Programmer as Navigator” (Bachman, 1973; Cardenas, 1979) -National Research Council/1999
I suggest Celko’s book, then Elmasri, then Date.
@robert: I’m slow again today (2-days in a row! Maybe I should see a doctor…) I read the posts you recommended — twice. Not sure why you directed me over there. It seems to be making my point. Memory, solid state or otherwise, is getting cheaper and cheaper, faster and faster. GUIDs are big, long, and ugly — to humans. Machines don’t care. They use pointers. A primary key is a primary key: a way of uniquely identifying a row in a table. What’s an “Intelligent” primary key? Speaking of referential integrity, I’m old enough to remember the guy in the office next to me getting chewed out by the boss for using a varchar column instead of an int column for his primary key in the Inventory database. This is the same argument today, only its with GUIDs vs. ints. Times have changed… They will continue to change. Ask most 22 year old programmers today if using GUIDs are bad because of performance penalties and they’ll look at you funny. Some of them are even ballsy enough to reply, “You think Google sticks with ints for PKs in their databases?” ;)
I have seen decimals used to get around the merge problem (on oracle) (e.g. all ids on a database are x.1, all on another are x.2) It seemed to work well (I’m not a DBA though)
I commented with pretty much the same view yesterday, which also hasn’t turned up. The only differences were that I said point 4 looked to me like it was the same as point 2 and actually still wasn’t preferable over just bumping up the keys on one machine by X million if you know you’re going to merge data.
Also the possibility of collisions just isn’t good enough in certain domains.
I too posted some counter-arguments to one of his points and never saw my post show up…maybe someone should try posting in favor and see if their comment is allowed in :)
@George:
The contemplation — http://groups.google.com/group/comp.databases.theory/browse_thread/thread/a11dc89d00073757/03fa754c9393f185?hl=en&lnk=raot#03fa754c9393f185
Start with CELKO’s first post.
Surrogate primary keys (whether GUID or otherwise) are consumate evil; they turn a RDBMS into just a filesystem datastore. The reason being that without intelligent PKs, there is no hope of defining RI. And so on.
@Robert — I must be slow today… I’ve comtemplated for about a minute and I still don’t get your “True RDBMS” approach… insert confused look here
@steve et al: I wanted to point out that one thing that GUID’s work great for is in mobile/disconnected apps where you need to create rows on the local disconnected database (SQL CE) and then add them to the server store when synchronizing later. This is a case where you want to create your new ID’s because you are persisting them locally. The GUIDs allow you to easily insert/merge records from many occasionally connected mobile devices with the server.
I tried submitting a number of counter points to his post as well and they never appeared. Being DBA with warehousing experience, Steve had it right, I would rip his nuts off.
Read the OP. He’s obviously a coder, not a database architect. Ambler and Fowler wannabe. When solid state machines (Apple is rumoured to be close), esp. solid state discs (the main point, of course), become available; the True RDBMS approach will blow all the IMS/XML/OODBMS/etc. out of the water. Only a minute’s contemplation is needed to see why.
Actually if your DB can store GUIDs as a native column type rather than as strings the IDs will only be twice the size of a 64-bit number I guess. Even so for something that will impact just about every DB access, particularly those you wanted to be fast anyway via indexes, it’s far from justifiable.
Load of crap. Sure there might be reasons for using them sometimes, but his justifications don’t stack up.
Reason 1: If he thinks the best way to avoid round-tripping to the database to get a new ID is GUIDs, he’s naive. There are plenty of ways to do this; if you don’t care about contiguous keys (which he obviously doesn’t with GUIDs), a far simpler way is to just use an appserver-side block key allocators which pull a range of keys from the central pool in one DB access and then hand them out to client objects as they request them. Sure you lose any unused keys from that block when you restart your appserver, but you can still have more keys that you could ever use in a single, efficient 64-bit number and not a bloatware 36 bytes.
Reason 2: Store-and-forward systems have been functioning for years without using GUIDs universally. It’s a simple case of converting local to global references. The cost of resolving that at the merge point versus clogging the entire system with GUIDs is negligible.
Reason 3: Suggests application developers shouldn’t necessarily be database designers. Most DB architects I know would want to shoot him for that approach.
Reason 4: Using test data in a live environment is just incompetence. And besides, it’s not exactly hard to validate data using a number of fields instead of just one in rare screw-up cases like this, GUIDs are far from the unique life-saver he claims.
Basically the whole article screams lack of practical DB experience to me, and selling it as ‘Gospel’ and making stupid comments like ‘Mac developers could even do it’ doesn’t help his case. He’s clearly never had to do any serious performance analysis on a database app otherwise he wouldn’t be claiming the difference between using 64-bit identifiers and 288-bit identifiers for the primary mechanism in which a database optimises queries was justifiable. Oh sure, your indexes being 4.5 times as big won’t make any difference at all. Right. Tell that to a professional DBA and they’ll rip your nuts off.