Hi All,
Apologies for a long e-mail.
TLDR: We need to define format of primary keys (UUID) so that it is possible to transform
the primary and foreign keys from VARCHAR(36) columns into database-native binary format.
This is in particular important to document in 2.5 in Storage Ids section of the new User
Storage SPI [1]
Long version:
I have looked at current database model and while generally it looks well, there is an
interesting issue with primary / foreign keys that causes performance degradation on both
Java and - most importantly - database side, causing even deadlocks for some databases.
The issue comes from database handling of IDs. IDs are in fact UUIDs, i.e. series of bytes
that are represented by Strings in KC JPA classes. Why this causes performance degradation
is due to various representations conversions (byte array vs String in Java) and - most
importantly for database - character set conversions. In the worst case, The conversions
occur both in JDBC driver and the database. The consequences are demonstrated by Jira
issue KEYCLOAK-3210 [2] when several simultaneous requests lead to deadlocking the
database.
When JDBC driver obtains a string, it converts its representation into a character set
understood by database. Database might need to convert the string to a character set
specified for the column. This is nicely illustrated in MSSQL which makes distinction
between VARCHAR (8-bit codepages) and NVARCHAR (UCS-2 Unicode charset). IDs are VARCHARs
which is indeed an efficient way to use strings that consist of ASCII-only characters
(though not optimal for UUID, read below). However, if Unicode characters are to be
supported, MSSQL JDBC driver sends all character parameters as Unicode Strings [3].
Database then performs a conversion from Unicode to 8-bit charset which generally loses
some data. To account for this loss, instead of performing an index scan that directly
points to a requested row, it returns a range where the requested record should be. This
has fatal impact on performance. For more detailed analysis of the resulting plans, see
comment in [2].
Clearly, the scan by id should be fast and the format of IDs in database matters. It
should avoid conversions as much as possible. Hence the following plan came:
* In the result, all primary keys and corresponding keys have to be represented by binary
UUID data type (where supported, some databases represent UUID as e.g. VARBINARY(16)),
i.e. 16 bytes instead of 36 bytes
* All keys in the JPA classes should be of type UUID, not String
As a result, database indices get smaller (16 bytes of indexed data per record vs 36 bytes
as it is now in case of 8-bit storage of characters), and no character conversions are in
place, hence the overall performance increases.
This task is a slighty big one so it won't fit into KC 2.5 timeframe, but we should
definitely aim for 3.0.
This has several preconditions:
* The String keys in keycloak JPA classes, wherever used, are restricted to UUID format
* This format is documented and respected by all custom implementations, namely User
Storage implementations.
* There exists conversion from String to native UUID for used databases (this is certainly
possible for PostgreSQL MSSQL, DB2, and MySQL, most likely others)
Similarly to JPA, Infinispan classes should be revisited and optimized to save some bytes
that might be important for cluster replication by replacing String with UUIDs
Thoughts?
--Hynek
[1]
https://github.com/keycloak/server_development_guide/blob/6b82f0868c0d7a1...
[2]
https://issues.jboss.org/browse/KEYCLOAK-3210
[3]
https://msdn.microsoft.com/en-us/library/ms378857(v=sql.110).aspx