UUID For Database Primary Key

May 1, 2023

Why, and when do you use a universally unique identifier (UUID) as a primary key in a database?

  1. Can create records concurrently because it does not have to wait for the next ID like in sequential ID. So multiple machines can generate them independently which means potentially faster write.
  2. Scaling is easier. We generally do not have to worry about ID clashing between databases held in different servers.
  3. Leak fewer information on HTTP response or URL compared to integer primary key.

What are UUID?

They are 32 hexadecimal characters consisting of 16 octets, thus 16*8 = 128 bits. It has an 8-4-4-4-12 format length separated by dashes where each section corresponds to a specific way of encoding.

They look something like this


Figure 1: An example of a UUID

Version of UUIDs

There are many versions ranging from V1 to V5. V4 is what almost everyone uses. It is native to postgres and mysql. It has 6 bits predetermined that contains a version (V4) and a variant. That leaves 122 bits for other random bits for a total of 2122 possibilities.

V1 and V2 include hardware MAC for easier scaling between nodes in a sharded database. So, there’s no collision of primary keys between the shards. These include timestamps of 100 nanosecond precision in the leading portion of UUID, so they are k-sortable. More details on what k-sortable is in the later section.

V3 and V5 are namespace-based with different hash, MD5 and SHA1 respectively.


Collision happens when generated results have an identical value. This is a significant issue when used as a primary key which requires uniqueness. Generally, collisions have a low chance of happening. To find out, we use the Birthday Paradox formula. We know V4 has 122 bits left for random part, thus:

/images/uuid-birthday-paradox.png Figure 2: Birthday Paradox calculation https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

“This number is equivalent to generating 1 billion UUIDs per second for about 85 years. A file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes.” https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

One in 2.71 * 1018 is an incomprehensible small number. To visualise this, the chance of getting two identical UUID is less than a combination of getting hit by an asteroid (1 in 1,600,000 https://www.nationalgeographic.com/science/article/160209-meteorite-death-india-probability-odds ), and hit by a tsunami (1 in 5000 years https://www.abc.net.au/news/2018-10-16/what-would-happen-if-a-tsunami-hit-sydney/10376680 ) (1-12 ) every single day for a year.


The reason against using UUID instead of serial ID is that is bad for indexex. Serial integer has no issue with indexes because the way they are indexed in a database are next to each other.

img.png Figure 3: Where the numbers are stored in a tree https://www.programiz.com/dsa/b-tree

Here we can see that the integers in the leaf nodes are close to each other.

If a random UUID like V4 is used, a sequence of them will be all over the place, making lookup longer. For example, take a look at the following sequence eight UUID V4 where the top comes first and the bottom is the last item in a table.


When we index using b-tree, it will look like this

UUID V4 b-tree Figure 4: An Example of UUID V4 b-tree index

The items above only use the first eight characters because only the first few characters are important to demonstrate locality. As you can see, going from the first to the second UUID (4ab20c83 to 62135145) is relatively close, but going to the third one which is 6e2b46f6 is relatively further away. It keeps jumping between nodes which means more pages (memory) and cpu cycles needed to find the nest item. In other words, new items are far away from each other.

UUID new drafts

To combat this locality issue, there are new UUID draft that include version V6, V7, and V8. Both V6 and V7 include timestamps just like V1 and V2, but with an option of a nanosecond precision instead of 100ms precision. The rest are pretty much random numbers. As a result, these UUIDs are k-sortable because they are located near each other, providing near-optimum for a b-tree index.

Anatomy of UUID V7

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|                           unix_ts_ms                          |
|          unix_ts_ms           |  ver  |       rand_a          |
|var|                        rand_b                             |
|                            rand_b                             |

Figure 5: Anatomy of draft UUID V7 https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#name-uuid-version-7

Examples of UUID V7 look like this:

# a sequence of 10 UUID V7 with nanosecond precision

Broken down (4 bit is 1 character):

061e25e4-1006    : We can see the first 12 characters are the timestamp. All are created in the same nanosecond
7                : 4 bits for version are all 7 for version.
4fe              : 3 random characters
8081-00da525240d0: 2 bits for variant and the rest are 62 bits of random characters

UUID V7 vs Serial ID

In terms of the difference between UUID V7 and a serial ID, one major advantage of UUID V7 is that generation is independent of nodes or shards. However, it leaks timestamps so an attacker can extract a record’s creation date.

Nevertheless, it leaks no other stuff found in serial ID such as:

  1. Size: If a client receives a record with id=10004578, they can guess that 4578 orders have been made.
  2. Rate of growth: Receiving two different orders means they can track the growth rate of record insertion.
  3. Iteration attack: If your API endpoints do not have authorization, an attacker can try to access with GET /api/users/1, GET /api/users/2, GET /api/users/3, etc. UUID makes this next to impossible.

As we will see later in benchmark section, querying makes no difference. Insertion will be longer because we needed to create the UUIDs client-side — albeit done concurrently.

Difference between UUID V7 and Serial ID

The table below summarises the difference between the two.

UUID V7 (Draft) Serial ID
Status Draft n/a
Generate No need of internal function to calculate next ID but client-side generation can take longer Internal function to calculate next ID
Insertion Time Longer because it has to be indexed. But should not be a bottleneck Fast
No Collision False. But pretty much non-existent True*
Sortable Yes, k-sortable Yes
Random Yes No
Timestamp Precision Up to nanosecond n/a
Btree Index Good Excellent
Shard between nodes Yes Have to coordinate between nodes so IDs do not clash
Safe for replication MySql: No. Might only be applicable to UUIDv4 Have to take care of duplicate IDs between nodes
Potential expose information to client Yes, because timestamp is viewable Yes. Size, rate of growth, iteration API call attack: user ID = 1, 2, 3, …)

Table 1: Comparison between UUID V7 and Serial ID

* In postgres, if you use bigserial unsigned (serial) instead of bigint generated always as identity (identity), there can be a case when ID clashes - https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/


A quick unscientific benchmark shows there are not much difference between using UUID V7 and serial ID

Simple schemas were used:

    counter bigint

    id uuid PRIMARY KEY,
    counter bigint

Code 1: Schema for both UUID and serial ID

UUID V7 was generated using https://github.com/gofrs/uuid library.

The amount of data is 100 million. And we measured record retrieval in milliseconds (switched on using \timing true;) by using their respective primary key. We pick 10 arbitrary records by taking them somewhere in the middle. A sequence of them are used to prevent query cache.

select * from uuid where counter = 50000000; -- 50 million
select * from uuid where counter = 50000001;
select * from uuid where counter = 50000002;

Then use them to query the records on our commodity hardware.

select * from uuid where id = '0187bf9d-6a93-7c42-873b-c566cee2d749'; -- and the next 9 
select * from serial where id = 50000000; -- and 50000001, 50000002, ...

An average is calculated for each of the UUID and serial methods.

Benchmark Average Query Time Between UUID V7 and Serial ID Figure 6: A query time benchmark of 10 separate SELECT from 10 million records

As you can see, both returns a sub-millisecond response at 0.292ms and 0.288ms for UUID and Serial method respectively. In the grand scheme of things, an average difference of ~4 microseconds is hardly material and falls within the margin of error.


UUID is not the only random-ish primary key you can use. Some alternatives are





Using UUID or other k-sortable alternatives eliminates many problems with using serial ID for example sensitive data exposure to client. Randomness is bad for b-tree index but a leading time component makes it k-sortable and thus more index-friendly. Insert operations can be done concurrently and sharding becomes easier. However, the timestamp is visible to the clients.