Skip to main content

Command Palette

Search for a command to run...

Understanding Database Collation in Rails

Updated
3 min read
Understanding Database Collation in Rails

You've likely seen collation settings in your schema.rb or database.yml many times, but do you know exactly how they impact your application?

In a standard Rails application, you might see this in a migration:

add_column :users, :name, :string, collation: "utf8mb4_unicode_ci"

Or in your database.yml:

encoding: utf8mb4
collation: utf8mb4_0900_ai_ci

Let's break down what collation actually is, why it matters, and which settings you should use in a modern Rails environment.

1. The Core Concept: Character Set vs. Collation

To understand collation, you must distinguish it from character sets.

  • Character Set (The Container): Defines which characters you can store.

  • Collation (The Rulebook): Defines how those characters are compared and sorted.

Example Scenario: Imagine you have a list of names: ["Amit", "amit", "Ámit"].

  • utf8mb4_unicode_ci: Case-insensitive. It sees all three as identical matches for a search.

  • utf8mb4_bin: Binary comparison. It sees three completely different values.

2. Why Collation Matters

Collation directly dictates how your database engine (and by extension, Rails) handles data retrieval and organization.

  1. Search Behavior (WHERE clauses): If a user searches for name = 'AMIT', the result depends entirely on collation.

  2. Sorting Order (ORDER BY): Sorting multilingual content requires language-aware rules (e.g., how to sort accented characters like ñ or ö).

  3. Data Integrity (Joins): Joining two tables with different collations can cause SQL errors or severe performance degradation because the database cannot reliably compare the keys.

3. The "Types" of Collation

When looking at a MySQL collation string (e.g., utf8mb4_0900_ai_ci), the suffixes tell you the rules used:

  • _ci (Case-Insensitive)

  • _cs (Case-Sensitive)

  • _ai (Accent-Insensitive)

  • _as (Accent-Sensitive)

  • _bin (Binary)

4. Best Practices: Which to Use When?

For a modern Rails application (Rails 6/7+) running on MySQL 8, here are the recommended strategies:

✅ The Modern Default

utf8mb4_0900_ai_ci

  • What it is: The modern Unicode standard. It is accent-insensitive and case-insensitive.

  • Use case: General storage for names, descriptions, and comments where you want "cafe" to be found even if the user types "café".

🔍 Searchable Text

utf8mb4_unicode_ci or utf8mb4_0900_ai_ci

  • Use case: Usernames, emails, and addresses.

  • Why: Ensures a smooth user experience. A user searching for "STEPHEN" should generally find "Stephen."”

🔐 Strict Security & Tokens

utf8mb4_bin

  • Use case: Password hashes, API tokens, invite codes, or case-sensitive identifiers (e.g., a URL shortener code where AbC is different from abc).

  • Why: You need an exact, byte-for-byte comparison.

5. Implementation in Rails

Global Default (database.yml)

This sets the baseline for any new table or column created.

production:
  adapter: mysql2
  encoding: utf8mb4
  collation: utf8mb4_0900_ai_ci

Column-Level Override (Migrations)

Use this when a specific column requires different behavior than the database default (e.g., a strict token field).

class AddTokenToUsers < ActiveRecord::Migration[7.1]
  def change
    add_column :users, :api_token, :string, collation: "utf8mb4_bin"
  end
end

Fixing Existing Data

If you have "broken" sorting or emoji support, you cannot just change the config file. You must migrate the actual data table.

ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Summary

  • Character Set is for storing data; Collation is for comparing data.

  • Inconsistent collation leads to join errors and weird search bugs.

  • Recommendation: Start your projects with utf8mb4_0900_ai_ci as the default.

  • Exception: Use _bin (binary) collation only for fields that require strict exact matching (tokens, hashes).