UUID For Database Primary Key

May 1, 2023

Why, and when do you use a universally unique identifier (UUID) as a primary key in a database?

  1. Can create records concurrently because it does not have to wait for the next ID like in sequential ID. So multiple machines can generate them independently which means potentially faster write.
  2. Scaling is easier. We generally do not have to worry about ID clashing between databases held in different servers.
  3. Leak fewer information on HTTP response or URL compared to integer primary key.

What are UUID?

They are 32 hexadecimal characters consisting of 16 octets, thus 16*8 = 128 bits. It has an 8-4-4-4-12 format length separated by dashes where each section corresponds to a specific way of encoding.

They look something like this

c50ab83b-b632-419d-91b9-626895e705ec

Figure 1: An example of a UUID

Version of UUIDs

There are many versions ranging from V1 to V5. V4 is what almost everyone uses. It is native to postgres and mysql. It has 6 bits predetermined that contains a version (V4) and a variant. That leaves 122 bits for other random bits for a total of 2122 possibilities.

V1 and V2 include hardware MAC for easier scaling between nodes in a sharded database. So, there’s no collision of primary keys between the shards. These include timestamps of 100 nanosecond precision in the leading portion of UUID, so they are k-sortable. More details on what k-sortable is in the later section.

V3 and V5 are namespace-based with different hash, MD5 and SHA1 respectively.

Collision

Collision happens when generated results have an identical value. This is a significant issue when used as a primary key which requires uniqueness. Generally, collisions have a low chance of happening. To find out, we use the Birthday Paradox formula. We know V4 has 122 bits left for random part, thus:

/images/uuid-birthday-paradox.png Figure 2: Birthday Paradox calculation https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

“This number is equivalent to generating 1 billion UUIDs per second for about 85 years. A file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes.” https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

One in 2.71 * 1018 is an incomprehensible small number. To visualise this, the chance of getting two identical UUID is less than a combination of getting hit by an asteroid (1 in 1,600,000 https://www.nationalgeographic.com/science/article/160209-meteorite-death-india-probability-odds ), and hit by a tsunami (1 in 5000 years https://www.abc.net.au/news/2018-10-16/what-would-happen-if-a-tsunami-hit-sydney/10376680 ) (1-12 ) every single day for a year.

Locality

The reason against using UUID instead of serial ID is that is bad for indexex. Serial integer has no issue with indexes because the way they are indexed in a database are next to each other.

img.png Figure 3: Where the numbers are stored in a tree https://www.programiz.com/dsa/b-tree

Here we can see that the integers in the leaf nodes are close to each other.

If a random UUID like V4 is used, a sequence of them will be all over the place, making lookup longer. For example, take a look at the following sequence eight UUID V4 where the top comes first and the bottom is the last item in a table.

4ab20c83-d608-479b-9e3a-9a4d3c68b814
62135145-2ccd-4796-a43a-0d8d4e985e20
6e2b46f6-1f13-46da-8405-9e098c46682c
1a42dd94-6f8c-4d37-a5d9-5f35d3cb8646
5fa03e57-060a-464f-bd98-cec94dad3cf0
2c24a9a5-7638-41fb-814b-5a8ad91e6a69
89d1cde2-52f0-4e9a-87c1-c2ce0ed22c60
948bbc46-92b1-4611-a46b-41055bfdf249

When we index using b-tree, it will look like this

UUID V4 b-tree Figure 4: An Example of UUID V4 b-tree index

The items above only use the first eight characters because only the first few characters are important to demonstrate locality. As you can see, going from the first to the second UUID (4ab20c83 to 62135145) is relatively close, but going to the third one which is 6e2b46f6 is relatively further away. It keeps jumping between nodes which means more pages (memory) and cpu cycles needed to find the nest item. In other words, new items are far away from each other.

UUID new drafts

To combat this locality issue, there are new UUID draft that include version V6, V7, and V8. Both V6 and V7 include timestamps just like V1 and V2, but with an option of a nanosecond precision instead of 100ms precision. The rest are pretty much random numbers. As a result, these UUIDs are k-sortable because they are located near each other, providing near-optimum for a b-tree index.

Anatomy of UUID V7

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           unix_ts_ms                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          unix_ts_ms           |  ver  |       rand_a          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|                        rand_b                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            rand_b                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 

Figure 5: Anatomy of draft UUID V7 https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#name-uuid-version-7

Examples of UUID V7 look like this:

# a sequence of 10 UUID V7 with nanosecond precision
061e25e4-1006-74fe-8081-00da525240d0
061e25e4-1006-74fe-ba51-0003f230c720
061e25e4-1006-74ff-8e5b-000923405452
061e25e4-1006-74ff-985b-00c1ce4ad579
061e25e4-1006-74ff-a201-004757a8a95d
061e25e4-1006-74ff-ab75-0078d72c6d62
061e25e4-1006-74ff-b54d-00a551e8bf84
061e25e4-1006-74ff-bed5-00bd7cbafc48
061e25e4-1006-7500-888f-001ea722617d
061e25e4-1006-7500-920d-00a5a9687d6a

Broken down (4 bit is 1 character):

061e25e4-1006    : We can see the first 12 characters are the timestamp. All are created in the same nanosecond
7                : 4 bits for version are all 7 for version.
4fe              : 3 random characters
8081-00da525240d0: 2 bits for variant and the rest are 62 bits of random characters

UUID V7 vs Serial ID

In terms of the difference between UUID V7 and a serial ID, one major advantage of UUID V7 is that generation is independent of nodes or shards. However, it leaks timestamps so an attacker can extract a record’s creation date.

Nevertheless, it leaks no other stuff found in serial ID such as:

  1. Size: If a client receives a record with id=10004578, they can guess that 4578 orders have been made.
  2. Rate of growth: Receiving two different orders means they can track the growth rate of record insertion.
  3. Iteration attack: If your API endpoints do not have authorization, an attacker can try to access with GET /api/users/1, GET /api/users/2, GET /api/users/3, etc. UUID makes this next to impossible.

As we will see later in benchmark section, querying makes no difference. Insertion will be longer because we needed to create the UUIDs client-side — albeit done concurrently.

Difference between UUID V7 and Serial ID

The table below summarises the difference between the two.

UUID V7 (Draft) Serial ID
Status Draft n/a
Generate No need of internal function to calculate next ID but client-side generation can take longer Internal function to calculate next ID
Insertion Time Longer because it has to be indexed. But should not be a bottleneck Fast
Package Not native to mysql/postgres. Extra time to create MySQL: BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY; Postgres: BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY;
No Collision False. But pretty much non-existent True*
Sortable Yes, k-sortable Yes
Random Yes No
Timestamp Precision Up to nanosecond n/a
Btree Index Good Excellent
Shard between nodes Yes Have to coordinate between nodes so IDs do not clash
Safe for replication MySql: No. Might only be applicable to UUIDv4 Have to take care of duplicate IDs between nodes
Potential expose information to client Yes, because timestamp is viewable Yes. Size, rate of growth, iteration API call attack: user ID = 1, 2, 3, …)

Table 1: Comparison between UUID V7 and Serial ID

* In postgres, if you use bigserial unsigned (serial) instead of bigint generated always as identity (identity), there can be a case when ID clashes - https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/

Benchmark

A quick unscientific benchmark shows there are not much difference between using UUID V7 and serial ID

Simple schemas were used:

CREATE TABLE serial (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    counter bigint
);

CREATE TABLE uuid (
    id uuid PRIMARY KEY,
    counter bigint
);

Code 1: Schema for both UUID and serial ID

UUID V7 was generated using https://github.com/gofrs/uuid library.

The amount of data is 100 million. And we measured record retrieval in milliseconds (switched on using \timing true;) by using their respective primary key. We pick 10 arbitrary records by taking them somewhere in the middle. A sequence of them are used to prevent query cache.

select * from uuid where counter = 50000000; -- 50 million
select * from uuid where counter = 50000001;
select * from uuid where counter = 50000002;
...

Then use them to query the records on our commodity hardware.

select * from uuid where id = '0187bf9d-6a93-7c42-873b-c566cee2d749'; -- and the next 9 
select * from serial where id = 50000000; -- and 50000001, 50000002, ...

An average is calculated for each of the UUID and serial methods.

Benchmark Average Query Time Between UUID V7 and Serial ID Figure 6: A query time benchmark of 10 separate SELECT from 10 million records

As you can see, both returns a sub-millisecond response at 0.292ms and 0.288ms for UUID and Serial method respectively. In the grand scheme of things, an average difference of ~4 microseconds is hardly material and falls within the margin of error.

Alternatives

UUID is not the only random-ish primary key you can use. Some alternatives are

ULID

KSUID

MongoID

Conclusion

Using UUID or other k-sortable alternatives eliminates many problems with using serial ID for example sensitive data exposure to client. Randomness is bad for b-tree index but a leading time component makes it k-sortable and thus more index-friendly. Insert operations can be done concurrently and sharding becomes easier. However, the timestamp is visible to the clients.