Creating a Unique MySQL Collation

There are some cases when we need something unique solution on database level. Most of the case, it can be solved by a well-built query, but sometimes we really need to dig deeper to bring the best solution. Let’s take a look at unique collations with MySQL.

Prologue

Recently, I was working on a project, that required a proper MySQL sorting order. The project’s language – also, the database record’s language – is Hungarian. It means, there are some not basic characters: á, é, í, ó, ö, ő, ú, ü, ű, gy, ly, ny, ty dz, dzs, sz, zs .

We use the utf8mb4 charset by default.

So I had the following problems: the letter a and the letter á is not the same character and if I want MySQL to handle them differently, I had to use the utf8mb4_bin collation, because the bitwise comparison can make the difference. But, when it’s about sorting, the characters with an accent was at the end of the results. It means after the letter á was after letter z, which is totally wrong since in Hungarian á follows the letter a. Then I tried with the utf8mb4_hungarian_ci, and the unicode_ci, but both had the same issue. The did not make any difference between a and á. Also, as you see, we have consonants that consist of two or three letters. It means, we somehow need to tell MySQL that it’s one piece of letter and not three and calculate it when sorting.

Some of my colleagues gave a tip, to install MySQL 8+ or MariaDB 10.3+ since they have the proper collations. I installed MySQL 8 in local, and I can tell you it messed up everything, so I decided to leave that. Then I installed MariaDB on the forge server I have and none of the collations were there what I needed. Maybe I made something wrong I don’t know, but at least it forced me to solve this problem and let me share what I’ve found.

Adding a Unique Collation to MySQL

MySQL and MariaDB stores their charsets and collations in a directory. Referring to the documentation, we can retrieve the path of this directory running the following query:

show variables like 'character_sets_dir';
+--------------------+----------------------------------------------------------+
| Variable_name      | Value                                                    |
+--------------------+----------------------------------------------------------+
| character_sets_dir | /usr/local/Cellar/[email protected]/5.7.24/share/mysql/charsets/ |
+--------------------+----------------------------------------------------------+
The result depends on your OS also, how you installed MySQL. This example is installed by homebrew.

If you navigate to the directory, you can find an Index.xml file, that we need to modify. Here we can refer to a charset and add a custom collation. As it was mentioned before, we will add a collation to the utf8mb4 charset. Like in the documentation, in the XML file, we need to add an LDML implementation of the collation rules.

<charsets max-id="99">
...
  <charset name="utf8mb4">
    <collation name="utf8mb4_hungarian_as_ci" id="1029" order="Hungarian">
      <rules>
        <reset>A</reset>
        <p>\u00c1</p>
        <t>\u00e1</t>
        <reset>C</reset>
        <p>Cs</p>
        <t>cs</t>
        ...
      </rules>
    </collation>
  </charset>
...
</charsets>

Few things to notice: The collation id is 1029 because we need to give a non-existing one. This way we can make it sure it’s not used.

Also, we named the collation to utf8mb4_hungarian_as_ci, because this way, the collation is accent sensitive (as) but case-insensitive (ci). Like the id, this name has to be unique as well.

The <reset> rule “resets” the ordering for subsequent shift rules to cause them to be taken in relation to a given character. The <p> and <t> shift rules define primary and tertiary differences of a character from another character.

If the new rules were added, we need to restart the MySQL services and it will take care of the rest. From now, the new charset is available and orders the records with the new rules we configured.

Summary

It’s a nice way to extend MySQL with our unique collations. It can be very handy in some situations, especially if you need to work with a language different to English.

Another advantage is that this solution works the same way with MySQL and MariaDB, so you don’t need to worry about the compatibility issue.

Special thanks for the following recource(s): Icon made by Freepik from www.flaticon.com