Understanding Database Collation in Rails

You've likely seen collation settings in your schema.rb or database.yml many times, but do you know exactly how they impact your application?
In a standard Rails application, you might see this in a migration:
add_column :users, :name, :string, collation: "utf8mb4_unicode_ci"
Or in your database.yml:
encoding: utf8mb4
collation: utf8mb4_0900_ai_ci
Let's break down what collation actually is, why it matters, and which settings you should use in a modern Rails environment.
1. The Core Concept: Character Set vs. Collation
To understand collation, you must distinguish it from character sets.
Character Set (The Container): Defines which characters you can store.
Collation (The Rulebook): Defines how those characters are compared and sorted.
Example Scenario: Imagine you have a list of names: ["Amit", "amit", "Ámit"].
utf8mb4_unicode_ci: Case-insensitive. It sees all three as identical matches for a search.utf8mb4_bin: Binary comparison. It sees three completely different values.
2. Why Collation Matters
Collation directly dictates how your database engine (and by extension, Rails) handles data retrieval and organization.
Search Behavior (WHERE clauses): If a user searches for name = 'AMIT', the result depends entirely on collation.
Sorting Order (ORDER BY): Sorting multilingual content requires language-aware rules (e.g., how to sort accented characters like ñ or ö).
Data Integrity (Joins): Joining two tables with different collations can cause SQL errors or severe performance degradation because the database cannot reliably compare the keys.
3. The "Types" of Collation
When looking at a MySQL collation string (e.g., utf8mb4_0900_ai_ci), the suffixes tell you the rules used:
_ci(Case-Insensitive)_cs(Case-Sensitive)_ai(Accent-Insensitive)_as(Accent-Sensitive)_bin(Binary)
4. Best Practices: Which to Use When?
For a modern Rails application (Rails 6/7+) running on MySQL 8, here are the recommended strategies:
✅ The Modern Default
utf8mb4_0900_ai_ci
What it is: The modern Unicode standard. It is accent-insensitive and case-insensitive.
Use case: General storage for names, descriptions, and comments where you want "cafe" to be found even if the user types "café".
🔍 Searchable Text
utf8mb4_unicode_ci or utf8mb4_0900_ai_ci
Use case: Usernames, emails, and addresses.
Why: Ensures a smooth user experience. A user searching for "STEPHEN" should generally find "Stephen."”
🔐 Strict Security & Tokens
utf8mb4_bin
Use case: Password hashes, API tokens, invite codes, or case-sensitive identifiers (e.g., a URL shortener code where AbC is different from abc).
Why: You need an exact, byte-for-byte comparison.
5. Implementation in Rails
Global Default (database.yml)
This sets the baseline for any new table or column created.
production:
adapter: mysql2
encoding: utf8mb4
collation: utf8mb4_0900_ai_ci
Column-Level Override (Migrations)
Use this when a specific column requires different behavior than the database default (e.g., a strict token field).
class AddTokenToUsers < ActiveRecord::Migration[7.1]
def change
add_column :users, :api_token, :string, collation: "utf8mb4_bin"
end
end
Fixing Existing Data
If you have "broken" sorting or emoji support, you cannot just change the config file. You must migrate the actual data table.
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Summary
Character Set is for storing data; Collation is for comparing data.
Inconsistent collation leads to join errors and weird search bugs.
Recommendation: Start your projects with
utf8mb4_0900_ai_cias the default.Exception: Use
_bin(binary) collation only for fields that require strict exact matching (tokens, hashes).


