<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[ClearYourDoubt]]></title><description><![CDATA[ClearYourDoubt]]></description><link>https://clearyourdoubt.in</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 06:27:35 GMT</lastBuildDate><atom:link href="https://clearyourdoubt.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding Multi-Peer Video Conferencing: From P2P to SFU]]></title><description><![CDATA[What is a Peer and How Do They Connect?
So when we talk about conference apps like Zoom, Google Meet, or WhatsApp group calls, we're essentially talking about multiple devices connecting to each other in real-time. Each device is called a peer. A pee...]]></description><link>https://clearyourdoubt.in/understanding-multi-peer-video-conferencing-from-p2p-to-sfu</link><guid isPermaLink="true">https://clearyourdoubt.in/understanding-multi-peer-video-conferencing-from-p2p-to-sfu</guid><category><![CDATA[WebRTC]]></category><category><![CDATA[Video conferencing]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Wed, 14 Jan 2026 07:03:59 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-what-is-a-peer-and-how-do-they-connect">What is a Peer and How Do They Connect?</h3>
<p>So when we talk about conference apps like Zoom, Google Meet, or WhatsApp group calls, we're essentially talking about multiple devices connecting to each other in real-time. Each device is called a peer. A peer is basically any participant in the call—your laptop, mobile phone, or any client that's part of the video conference.</p>
<p>How do peers connect? At the basic level, peers need to establish a connection to send and receive video and audio streams. In the simplest case, when there are just two peers (let's say Atish calling Rohit), they can establish a direct connection to each other. But when there are multiple peers, things get more complex. The key challenge is that each peer needs to somehow transmit its data to all other peers and receive data from all of them.</p>
<p>This is where the architecture matters. The way peers connect depends entirely on which architecture you choose—whether it's P2P, Mesh, MCU, or SFU.</p>
<hr />
<h3 id="heading-what-is-webrtc-and-why-use-udp-instead-of-tcp">What is WebRTC and Why Use UDP Instead of TCP?</h3>
<p>Before diving into different architectures, we need to understand WebRTC. WebRTC (Web Real-Time Communication) is an open standard (a collection of protocols and browser APIs) that enables real-time communication directly in web browsers and native applications. It allows peers to establish connections and exchange audio, video, and data directly.</p>
<p>Now, here's the critical part: normal HTTP uses TCP (Transmission Control Protocol), but for real-time video conferencing, this becomes a problem.</p>
<p>Why not use TCP? TCP is designed to ensure that every packet of data reaches its destination in order and without loss. It has built-in error correction and retransmission. While this sounds good for reliability, it's terrible for real-time communication. Here's why:</p>
<p>Imagine you're in a video call and a few packets of video data get lost. TCP will pause the entire stream and wait for those lost packets to be resent. This causes buffering, freezing, and latency. A 200ms delay might not seem like much, but in a real-time conversation, it becomes extremely noticeable and frustrating.</p>
<p>So instead, we use UDP (User Datagram Protocol). UDP is connectionless and doesn't guarantee delivery or order. If a few packets get lost? UDP doesn't care—it just keeps sending. This is perfect for video because:</p>
<ol>
<li><p>Losing a few video frames is acceptable. Your eye won't even notice if one or two frames are missing.</p>
</li>
<li><p>Speed is prioritized over perfection. A fast, slightly degraded stream is better than a pristine one that keeps pausing to buffer.</p>
</li>
<li><p>Low latency is achieved. UDP has minimal overhead, so data travels much faster.</p>
</li>
</ol>
<p>WebRTC uses UDP to transmit media streams. This is the foundation that makes real-time video calls possible with acceptable latency.</p>
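<p>To make UDP's fire-and-forget behavior concrete, here is a minimal Ruby sketch using the standard library's UDPSocket. The loopback address and "frame" payloads are placeholders for illustration; real WebRTC media travels over SRTP on top of UDP, with far more machinery.</p>

```ruby
require 'socket'

# A receiver bound to an OS-assigned free port on loopback.
receiver = UDPSocket.new
receiver.bind('127.0.0.1', 0)
port = receiver.addr[1]

sender = UDPSocket.new
# Fire-and-forget: each datagram is sent immediately. If one were lost,
# the sender would never know and would never retransmit -- unlike TCP.
3.times { |i| sender.send("frame-#{i}", 0, '127.0.0.1', port) }

payload, _addr = receiver.recvfrom(64)
puts payload # the first datagram that happened to arrive

sender.close
receiver.close
```

<p>Notice there is no handshake, no acknowledgement, and no ordering guarantee in the sender's code path, which is exactly why the latency stays minimal.</p>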
<hr />
<h3 id="heading-architectures-p2p-mesh-mcu-and-why-sfu">Architectures: P2P, Mesh, MCU, and Why SFU?</h3>
<p>Now that we understand peers and UDP, let's talk about how multiple peers can connect using different architectures.</p>
<h2 id="heading-1-peer-to-peer-p2p">1. Peer-to-Peer (P2P)</h2>
<p>In P2P, two peers connect directly to each other. Atish sends his video stream directly to Rohit, and Rohit sends his stream directly to Atish. No server involvement for media, no processing overhead.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768371636079/1dd18022-e5a9-4935-a2aa-d24dd627f308.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li><p>No server cost for media handling</p>
</li>
<li><p>Direct connection means lowest possible latency</p>
</li>
<li><p>Simple to implement for 1:1 calls</p>
</li>
</ul>
<p>Problems:</p>
<ul>
<li><p>Doesn't scale. As soon as you add a third peer, it becomes complicated.</p>
</li>
<li><p>Each peer would need to connect to every other peer individually.</p>
</li>
</ul>
<h2 id="heading-2-mesh-network">2. Mesh Network</h2>
<p>A mesh extends P2P to multiple peers. Here, every peer connects to every other peer. With 3-4 peers, it might seem okay. But let's imagine 10-12 peers.</p>
<p>Peer 1 has to:</p>
<ul>
<li><p>Send its video stream to peers 2, 3, 4, 5... up to 12 (11 outgoing connections)</p>
</li>
<li><p>Receive video streams from all other 11 peers (11 incoming streams to decode)</p>
</li>
<li><p>Process and potentially display 11 different video feeds</p>
</li>
</ul>
<p>Multiply this across all 12 peers, and every client is uploading and downloading massive amounts of data while decoding close to a dozen streams at once. CPU and bandwidth usage grow quadratically with the participant count: an n-peer mesh needs n(n-1)/2 connections. It's a mess.</p>
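<p>The arithmetic above is easy to sketch. In this toy calculation, the per-stream bitrate of 1500 kbps is an illustrative assumption, not a WebRTC constant:</p>

```ruby
# Back-of-the-envelope load for a full-mesh call: every peer sends one
# stream to every other peer. stream_kbps is an invented example bitrate.
def mesh_load(peers, stream_kbps = 1500)
  per_peer = peers - 1
  {
    total_links:        peers * (peers - 1) / 2, # unique peer-to-peer links
    upload_kbps_each:   per_peer * stream_kbps,  # streams each client encodes/sends
    download_kbps_each: per_peer * stream_kbps   # streams each client receives/decodes
  }
end

load12 = mesh_load(12)
puts load12[:total_links]        # => 66 unique connections
puts load12[:upload_kbps_each]   # => 16500 kbps (~16.5 Mbps) up, per client
```

<p>Going from 4 peers to 12 peers takes the mesh from 6 links to 66, which is why this topology falls over so quickly.</p>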
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768371648736/c13e139e-3b64-4810-bd0e-e18a58fc33f6.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li>Still no server cost for media</li>
</ul>
<p>Problems:</p>
<ul>
<li><p>Bandwidth consumption is massive</p>
</li>
<li><p>Doesn't scale beyond a handful of participants</p>
</li>
</ul>
<h2 id="heading-3-mcu-multipoint-control-unit">3. MCU (Multipoint Control Unit)</h2>
<p>To solve the mesh problem, MCU was introduced. Here's how it works:</p>
<p>Each peer sends a single stream to a central server (MCU). The MCU then:</p>
<ul>
<li><p>Receives all streams from all peers</p>
</li>
<li><p>Decodes them</p>
</li>
<li><p>Mixes/composites them into one combined video layout (like a grid showing all participants)</p>
</li>
<li><p>Re-encodes the mixed video</p>
</li>
<li><p>Sends this single combined stream back to each peer</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768373789523/1c4aaa62-edb0-4405-ac4c-9d75ae227472.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li><p>Each client only uploads one stream and receives one stream. Much simpler on the client side.</p>
</li>
<li><p>Can handle many more participants than mesh</p>
</li>
</ul>
<p>Problems:</p>
<ul>
<li><p>The server is doing heavy work: decoding multiple streams, compositing them, and re-encoding. This is CPU-intensive.</p>
</li>
<li><p>High latency: All this processing takes time. You have encoding latency, processing latency, and transmission latency stacked together.</p>
</li>
<li><p>Very expensive server infrastructure needed</p>
</li>
<li><p>Not ideal for modern scalable applications</p>
</li>
</ul>
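<p>To see why those latencies stack, here is a toy model. The per-stage costs are invented for illustration and are not measurements of any real MCU or SFU:</p>

```ruby
# Every frame must pass through the full MCU pipeline before any participant
# sees it, so the per-stage delays add up. All numbers below are made up.
mcu_stages = { decode: 20, composite: 30, reencode: 40, network: 50 } # ms
mcu_total  = mcu_stages.values.sum

# A forwarding server skips decode/composite/reencode entirely and only
# pays a routing cost on top of the network. Equally illustrative numbers.
sfu_total = 5 + 50

puts "MCU added latency: ~#{mcu_total} ms per frame"
puts "SFU added latency: ~#{sfu_total} ms per frame"
```

<p>The exact figures vary wildly in practice, but the structural point holds: the MCU pays a full transcode pipeline per frame, while a pure forwarder does not.</p>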
<h2 id="heading-4-sfu-selective-forwarding-unit-the-smart-choice">4. SFU (Selective Forwarding Unit) — The Smart Choice</h2>
<p>SFU is where we get the best of both worlds. Here's how it works:</p>
<p>Each peer sends its stream(s) to the SFU server. The server doesn't mix or compose anything. Instead, it:</p>
<ul>
<li><p>Inspects which streams are relevant</p>
</li>
<li><p>Decides which streams should go to which peers</p>
</li>
<li><p>Forwards each stream separately, without combining them</p>
</li>
</ul>
<p>As a result, each peer receives multiple individual streams, and the peer's client decides what to do with them: which ones to display, which to hide, which quality to accept, and so on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768374038865/9846e327-a5ad-4de3-97a7-b0410dcc19d6.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li><p>Far less CPU overhead than MCU. The server isn't mixing; it's just routing.</p>
</li>
<li><p>Better latency. No heavy composition work means faster processing.</p>
</li>
<li><p>Peers control their own experience. They can choose to watch high-quality streams from speakers and lower quality from others.</p>
</li>
<li><p>Bandwidth management. Peers can decide to pause certain streams or request lower resolutions.</p>
</li>
<li><p>Highly scalable. One SFU server can handle many more participants than an MCU because it's doing less work.</p>
</li>
<li><p>Significantly lower server costs compared to MCU</p>
</li>
</ul>
<p>This is why modern platforms like Zoom, Google Meet, and others use SFU-based architectures. It's the sweet spot between server cost, scalability, and latency.</p>
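<p>Putting the three multi-party options side by side, a small helper makes the per-client stream counts explicit. This sketches the trade-off only; it is not a model of any particular platform's implementation:</p>

```ruby
# Streams a single client must send (up) and receive/decode (down)
# in an n-participant call, under each architecture.
def client_streams(arch, n)
  case arch
  when :mesh then { up: n - 1, down: n - 1 } # everyone talks to everyone
  when :mcu  then { up: 1,     down: 1     } # server mixes one combined feed
  when :sfu  then { up: 1,     down: n - 1 } # server forwards individual feeds
  end
end

[:mesh, :mcu, :sfu].each do |arch|
  s = client_streams(arch, 12)
  puts "#{arch}: #{s[:up]} up / #{s[:down]} down"
end
```

<p>For a 12-person call this prints 11 up / 11 down for mesh, 1 / 1 for MCU, and 1 / 11 for SFU: the SFU keeps the client's upload cheap while leaving layout and quality decisions on the receiving side.</p>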
<hr />
<p><strong>Conclusion</strong></p>
<p>When you build a multi-party video conferencing application, your architecture choice directly impacts scalability, latency, and costs.</p>
<p>P2P works for 1:1 calls. Mesh doesn't scale. MCU is expensive and slow. SFU is the architecture driving most modern real-time platforms because it balances all these concerns. It's efficient, scalable, and cost-effective.</p>
<p>Understanding these differences helps you make informed decisions when designing your own real-time communication systems.</p>
]]></content:encoded></item><item><title><![CDATA[How to Use Multiple GitHub Accounts]]></title><description><![CDATA[If you have multiple GitHub accounts, it can be a challenge to manage them on the same computer. Fortunately, with a few configuration changes, you can easily use multiple GitHub accounts.
Here are the steps to set up multiple GitHub accounts:
This g...]]></description><link>https://clearyourdoubt.in/how-to-use-multiple-github-accounts</link><guid isPermaLink="true">https://clearyourdoubt.in/how-to-use-multiple-github-accounts</guid><category><![CDATA[GitHub]]></category><category><![CDATA[Git]]></category><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Wed, 10 Dec 2025 11:32:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765366033222/8919009c-2634-46a8-8edb-fee4aeb31617.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have multiple GitHub accounts, it can be a challenge to manage them on the same computer. Fortunately, with a few configuration changes, you can easily use multiple GitHub accounts.</p>
<p>Here are the steps to set up multiple GitHub accounts:</p>
<h2 id="heading-this-guide-is-useful-for-developers-who">This guide is useful for developers who:</h2>
<ul>
<li><p>Work on both personal projects and company-owned repositories</p>
</li>
<li><p>Contribute to open-source projects with a personal account while working at a company</p>
</li>
<li><p>Maintain multiple projects under different GitHub accounts</p>
</li>
<li><p>Want to keep work and personal contributions separate</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, ensure you have:</p>
<ul>
<li><p>Two or more GitHub accounts already created</p>
</li>
<li><p>Access to your terminal/command line</p>
</li>
<li><p>Basic familiarity with Git and SSH</p>
</li>
<li><p>macOS, Linux, or WSL on Windows (Git and SSH pre-installed)</p>
</li>
<li><p>Git already installed and configured on your machine</p>
</li>
</ul>
<h2 id="heading-1-generate-ssh-keys-for-each-account">1. Generate SSH keys for each account</h2>
<p>First, you need to generate SSH keys for each GitHub account you want to use. You can use the following command to generate an SSH key for your "Account 1" GitHub account:</p>
<pre><code class="lang-ruby">ssh-keygen -t rsa -b <span class="hljs-number">4096</span> -C <span class="hljs-string">"your-email@example.com"</span>
</code></pre>
<p>Be sure to replace "your-email@example.com" with the email address associated with your "Account 1" GitHub account. When prompted for a file in which to save the key, don't accept the default (~/.ssh/id_rsa); give the key a unique path instead, such as ~/.ssh/id_rsa_account1.</p>
<p>Repeat this process for each GitHub account you want to use, choosing a unique name for each SSH key (e.g., id_rsa_account1 for "Account 1", id_rsa_account2 for "Account 2", etc.).</p>
<h2 id="heading-2-add-ssh-keys-to-github">2. Add SSH keys to GitHub</h2>
<p>Next, you need to add the SSH keys to each GitHub account. Log in to your "Account 1" GitHub account and go to the "Settings" page. Click on the "SSH and GPG keys" tab, then click the "New SSH key" button. Paste the contents of the <code>id_rsa_account1.pub</code> file (located in ~/.ssh/) into the "Key" field and give the key a descriptive name. Click "Add SSH key" to save the key.</p>
<p>Repeat this process for each GitHub account you want to use, using the appropriate SSH key for each account.</p>
<h2 id="heading-3-create-a-configuration-file">3. Create a configuration file</h2>
<p>Next, you need to create a configuration file for SSH to specify the different hosts and SSH keys to use for each GitHub account. Use the following command to create a new configuration file:</p>
<pre><code class="lang-ruby">nano ~<span class="hljs-regexp">/.ssh/config</span>
</code></pre>
<p>In the configuration file, add the following lines for each GitHub account:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Account 1</span>
Host github.com-account1
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account1

<span class="hljs-comment"># Account 2</span>
Host github.com-account2
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account2
</code></pre>
<p>Replace "account1" and "account2" with unique names for each GitHub account.</p>
<h2 id="heading-4-clone-repositories-and-set-up-the-remote-origin">4. Clone repositories and set up the remote origin</h2>
<p>Next, clone the repositories using Git. For each GitHub account:</p>
<ol>
<li><p>Open a terminal and navigate to the directory where you want to clone the repository.</p>
</li>
<li><p>Use the following command to clone:</p>
</li>
</ol>
<pre><code class="lang-ruby">git clone git@github.com-<span class="hljs-symbol">account1:</span>username/repository.git
</code></pre>
<p>Replace <code>username</code> with the username of your GitHub account, and <code>repository</code> with the name of the repository.</p>
<blockquote>
<p>Important: I made a mistake initially by not using the Git hostname correctly. I was using the incorrect command:</p>
<pre><code class="lang-ruby">git clone git@github.<span class="hljs-symbol">com:</span>username/repository.git
</code></pre>
<p>By omitting the correct hostname, I experienced "access denied" errors when attempting to push or pull.</p>
<p>To avoid this, make sure to use the correct format:</p>
<pre><code class="lang-ruby">git clone git@github.com-<span class="hljs-symbol">account1:</span>username/repository.git
</code></pre>
</blockquote>
<p>Repeat this process for each GitHub account and repository.</p>
<h2 id="heading-5-associate-existing-local-code-with-the-remote-origin">5. Associate existing local code with the remote origin</h2>
<p>If you have existing code on your local machine:</p>
<ul>
<li>Navigate to the local repository directory:</li>
</ul>
<pre><code class="lang-ruby">cd /path/to/repository
</code></pre>
<ul>
<li>View the current remote URLs:</li>
</ul>
<pre><code class="lang-ruby">git remote -v
</code></pre>
<ul>
<li>Update the remote origin URL:</li>
</ul>
<pre><code class="lang-ruby">git remote set-url origin git@github.com-<span class="hljs-symbol">account1:</span>username/repository.git
</code></pre>
<ul>
<li>Verify the update:</li>
</ul>
<pre><code class="lang-ruby">git remote -v
</code></pre>
<p>By following these steps, you can associate your existing local code with the appropriate remote origin URL for each GitHub account.</p>
<h2 id="heading-6-set-ssh-key-permissions">6. Set SSH Key Permissions</h2>
<p>For security reasons, your SSH keys should have specific file permissions. Once you've generated your SSH keys, ensure they have the correct permissions:</p>
<pre><code class="lang-basic"># Set permissions <span class="hljs-keyword">for</span> SSH directory
chmod <span class="hljs-number">700</span> ~/.ssh

# Set permissions <span class="hljs-keyword">for</span> private keys
chmod <span class="hljs-number">600</span> ~/.ssh/id_rsa_account1
chmod <span class="hljs-number">600</span> ~/.ssh/id_rsa_account2

# Set permissions <span class="hljs-keyword">for</span> public keys
chmod <span class="hljs-number">644</span> ~/.ssh/id_rsa_account1.pub
chmod <span class="hljs-number">644</span> ~/.ssh/id_rsa_account2.pub
</code></pre>
<p>These permissions ensure:</p>
<ul>
<li><p><strong>700 for ~/.ssh</strong>: Only you can read, write, and execute the directory</p>
</li>
<li><p><strong>600 for private keys</strong>: Only you can read and write your private keys</p>
</li>
<li><p><strong>644 for public keys</strong>: Only you can write, but everyone can read</p>
</li>
</ul>
<h2 id="heading-7-configure-git-user-identity-per-repository">7. Configure Git User Identity Per Repository</h2>
<p>When working with multiple accounts, you may want to configure your Git user identity per repository to ensure commits are attributed to the correct account:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Navigate to your repository</span>
cd /path/to/repository

<span class="hljs-comment"># Set user name for this repository</span>
git config user.name <span class="hljs-string">"Your Name"</span>

<span class="hljs-comment"># Set user email for this repository</span>
git config user.email <span class="hljs-string">"your-email@example.com"</span>

<span class="hljs-comment"># Verify the configuration</span>
git config --list
</code></pre>
<p>If you want to set this globally for all repositories, use the <code>--global</code> flag:</p>
<pre><code class="lang-ruby">git config --global user.name <span class="hljs-string">"Your Name"</span>
git config --global user.email <span class="hljs-string">"your-email@example.com"</span>
</code></pre>
<h2 id="heading-71-important-fixing-commits-with-wrong-account">7.1 Important: Fixing Commits with Wrong Account</h2>
<p>If you notice that your commits are still being attributed to your old/work account instead of the correct one, you need to fix the repository configuration:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Navigate to your repository</span>
cd /path/to/repository

<span class="hljs-comment"># Set the correct user email for this repository</span>
git config user.email <span class="hljs-string">"your_email@example.com"</span>

<span class="hljs-comment"># Verify the configuration was updated</span>
git config user.email
</code></pre>
<p>Make sure to:</p>
<ol>
<li><p>Use the <strong>correct email address</strong> for your intended account</p>
</li>
<li><p>Run this command <strong>inside the repository directory</strong> (not globally)</p>
</li>
<li><p>Verify the output shows your intended email address</p>
</li>
<li><p>Test with a new commit to ensure it's attributed correctly</p>
</li>
</ol>
<p>If commits were already made with the wrong account, you'll need to rewrite the commit history to fix the author information. This is beyond the scope of this guide, but tools like <code>git filter-repo</code> or <code>git filter-branch</code> can help.</p>
<h2 id="heading-8-verify-ssh-connection">8. Verify SSH Connection</h2>
<p>Before cloning or pushing to a repository, verify that your SSH connection is working correctly:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Test connection to GitHub with account1</span>
ssh -T git@github.com-account1

<span class="hljs-comment"># Test connection to GitHub with account2</span>
ssh -T git@github.com-account2
</code></pre>
<p>You should see a response like:</p>
<pre><code class="lang-ruby">Hi username! You<span class="hljs-string">'ve successfully authenticated, but GitHub does not provide shell access.</span>
</code></pre>
<p>If you encounter any issues, use the verbose flag to debug:</p>
<pre><code class="lang-ruby">ssh -vT git@github.com-account1
</code></pre>
<h2 id="heading-troubleshooting-common-issues">Troubleshooting Common Issues</h2>
<h3 id="heading-issue-1-permission-denied-publickey">Issue 1: Permission Denied (publickey)</h3>
<p><strong>Problem</strong>: You get "Permission denied (publickey)" error when trying to push or pull.</p>
<p><strong>Solution</strong>:</p>
<ol>
<li><p>Verify your SSH key is added to the GitHub account</p>
</li>
<li><p>Check that the SSH key file has correct permissions (chmod 600)</p>
</li>
<li><p>Ensure you're using the correct hostname (e.g., <code>git@github.com-account1</code>)</p>
</li>
<li><p>Test the connection: <code>ssh -T git@github.com-account1</code></p>
</li>
</ol>
<h3 id="heading-issue-2-using-wrong-account">Issue 2: Using Wrong Account</h3>
<p><strong>Problem</strong>: Git is using the wrong account for push/pull operations.</p>
<p><strong>Solution</strong>:</p>
<ol>
<li><p>Verify the remote URL with <code>git remote -v</code></p>
</li>
<li><p>Ensure the hostname matches your SSH config (e.g., <code>github.com-account1</code>, not <code>github.com</code>)</p>
</li>
<li><p>Check git config: <code>git config --local user.email</code></p>
</li>
<li><p>Update remote if needed: <code>git remote set-url origin git@github.com-account1:username/repo.git</code></p>
</li>
</ol>
<h3 id="heading-issue-3-ssh-key-not-found">Issue 3: SSH Key Not Found</h3>
<p><strong>Problem</strong>: "Could not open a connection to your authentication agent" or SSH key not found.</p>
<p><strong>Solution</strong>:</p>
<ol>
<li><p>Start SSH agent: <code>eval "$(ssh-agent -s)"</code></p>
</li>
<li><p>Add your SSH key: <code>ssh-add ~/.ssh/id_rsa_account1</code></p>
</li>
<li><p>Verify key is added: <code>ssh-add -l</code></p>
</li>
</ol>
<h3 id="heading-issue-4-multiple-ssh-keys-in-agent">Issue 4: Multiple SSH Keys in Agent</h3>
<p><strong>Problem</strong>: SSH agent tries wrong key first, causing "too many authentication failures".</p>
<p><strong>Solution</strong>: Modify your ~/.ssh/config to specify which key to use first:</p>
<pre><code class="lang-ruby">Host github.com-account1
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account1
  IdentitiesOnly yes

Host github.com-account2
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account2
  IdentitiesOnly yes
</code></pre>
<p>The <code>IdentitiesOnly yes</code> option ensures only the specified key is used.</p>
<h2 id="heading-best-practices-and-tips">Best Practices and Tips</h2>
<ol>
<li><p><strong>Use Descriptive Hostnames</strong>: Make your SSH config hostnames descriptive (e.g., <code>github.com-work</code>, <code>github.com-personal</code>)</p>
</li>
<li><p><strong>Keep Keys Secure</strong>: Never share your private SSH keys and consider using a passphrase for added security</p>
</li>
<li><p><strong>Regularly Rotate Keys</strong>: Periodically generate new SSH keys for security</p>
</li>
<li><p><strong>Use SSH Agent</strong>: Add your keys to SSH agent for seamless authentication:</p>
<pre><code class="lang-ruby"> ssh-add ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account1
 ssh-add ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account2
</code></pre>
</li>
<li><p><strong>Document Your Setup</strong>: Keep a record of which key belongs to which account</p>
</li>
<li><p><strong>Test Before Critical Operations</strong>: Always test SSH connection before important push/pull operations</p>
</li>
<li><p><strong>Monitor SSH Activity</strong>: Regularly check active SSH keys on your GitHub accounts (Settings &gt; SSH and GPG keys)</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Managing multiple GitHub accounts on the same machine is a common workflow for developers who maintain both personal and work projects. By following these steps, you can:</p>
<ul>
<li><p>Generate and manage multiple SSH keys securely</p>
</li>
<li><p>Configure SSH to use different keys for different accounts</p>
</li>
<li><p>Associate repositories with the correct GitHub accounts</p>
</li>
<li><p>Debug and troubleshoot common authentication issues</p>
</li>
</ul>
<p>The key to success is ensuring that:</p>
<ul>
<li><p>SSH keys have correct permissions</p>
</li>
<li><p>SSH config maps hostnames to the right keys</p>
</li>
<li><p>Git remotes use the correct custom hostnames</p>
</li>
<li><p>Each repository is configured with the right user identity</p>
</li>
</ul>
<p>With this setup, you can seamlessly switch between multiple GitHub accounts without authentication headaches. Remember to keep your SSH keys secure and regularly review your active keys on GitHub's settings page.</p>
]]></content:encoded></item><item><title><![CDATA[Inside the Engine: How NoSQL Optimizes for Massive Writes]]></title><description><![CDATA[Most developers have a mental model of how a database works: You insert a row, the database finds the right spot on the hard drive (usually using a B-Tree), and slots it in perfectly. It’s neat, organized, and reliable.
But what happens when you need...]]></description><link>https://clearyourdoubt.in/inside-the-engine-how-nosql-optimizes-for-massive-writes</link><guid isPermaLink="true">https://clearyourdoubt.in/inside-the-engine-how-nosql-optimizes-for-massive-writes</guid><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Sun, 07 Dec 2025 11:13:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765105787590/94f27ad8-4c4f-4e26-ac9a-968071a80d38.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most developers have a mental model of how a database works: You insert a row, the database finds the right spot on the hard drive (usually using a B-Tree), and slots it in perfectly. It’s neat, organized, and reliable.</p>
<p>But what happens when you need to handle <strong>10 million writes per second</strong>?</p>
<p>At that scale, the "neat and organized" B-Tree falls apart. The constant disk seeking and rebalancing would bring your server to its knees.</p>
<p>Enter the <strong>LSM Tree (Log-Structured Merge Tree)</strong>—the engine under the hood of Cassandra, RocksDB, and HBase. It achieves speed by doing something that sounds counter-intuitive: <strong>It stops trying to be organized on disk immediately.</strong></p>
<p>Here is how the magic works.</p>
<hr />
<h2 id="heading-the-lazy-genius-append-only-storage">The "Lazy" Genius: Append-Only Storage</h2>
<p>To make writes instant, NoSQL databases adopt a simple philosophy: <strong>Never modify a file. Always append.</strong></p>
<p>Imagine keeping a diary. If you wanted to change an entry from last Tuesday, you wouldn't get an eraser, find the page, and scrub it out. You’d just write a new entry today saying, <em>"Update regarding last Tuesday..."</em></p>
<p>This is <strong>Sequential I/O</strong>, and disks love it. It is orders of magnitude faster than hopping around the disk (Random I/O) to update old records.</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># The "Lazy" Write</span>
<span class="hljs-comment"># Instead of finding a specific index on disk, we just tack it to the end.</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simple_append</span><span class="hljs-params">(key, value)</span></span>
  timestamp = Time.now.to_i
  entry = <span class="hljs-string">"<span class="hljs-subst">#{timestamp}</span>:<span class="hljs-subst">#{key}</span>:<span class="hljs-subst">#{value}</span>\n"</span>

  <span class="hljs-comment"># 'a' mode opens the file for appending, ensuring O(1) write speed</span>
  File.open(<span class="hljs-string">'database.log'</span>, <span class="hljs-string">'a'</span>) <span class="hljs-keyword">do</span> <span class="hljs-params">|file|</span>
    file.write(entry)
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>But if we just dump data into a file, reading it back becomes a nightmare. We’d have to scan the whole file to find anything. We need a compromise.</p>
<hr />
<h2 id="heading-the-architecture-ram-is-the-new-disk">The Architecture: RAM is the New Disk</h2>
<p>The LSM Tree uses a dual-layer approach. It treats Memory (RAM) as the staging area and Disk as the permanent archive.</p>
<h3 id="heading-write-operation">Write Operation</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765105092404/8ddc3225-621e-4e28-9e78-a2ebc09264c7.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-read-operation">Read Operation</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765105149580/fb8e7992-5677-440d-ac20-c29928928a64.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-1-the-memtable-the-vip-lounge">1. The MemTable (The VIP Lounge)</h3>
<p>When a write request hits your database, it doesn't touch the hard drive yet. It goes straight into the <strong>MemTable</strong> (Memory Table).</p>
<ul>
<li><p><strong>What is it?</strong> A sorted data structure (like a Red-Black Tree or Skip List) living entirely in RAM.</p>
</li>
<li><p><strong>Why?</strong> Because RAM is lightning fast. We can insert and sort data here in microseconds.</p>
</li>
</ul>
<pre><code class="lang-ruby"><span class="hljs-comment"># Inside the Database Engine</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MemTable</span></span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">initialize</span></span>
    <span class="hljs-comment"># In a real DB, this would be a Red-Black Tree or Skip List </span>
    <span class="hljs-comment"># to keep keys sorted automatically.</span>
    @data = {} 
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">put</span><span class="hljs-params">(key, value)</span></span>
    <span class="hljs-comment"># O(1) here with a plain Hash; a real sorted MemTable (Red-Black</span>
    <span class="hljs-comment"># Tree or Skip List) inserts in O(log N) -- still microseconds in RAM</span>
    @data[key] = value
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get</span><span class="hljs-params">(key)</span></span>
    @data[key]
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<h3 id="heading-2-the-sstable-the-immutable-archive">2. The SSTable (The Immutable Archive)</h3>
<p>Once the MemTable gets full (say, it hits 64MB), the database freezes it. It flushes the entire sorted list to the disk in one go.</p>
<p>This file is called an SSTable (Sorted String Table).</p>
<ul>
<li><strong>Crucial Rule:</strong> SSTables are <strong>Immutable</strong>. Once written, they are never changed. This means no lock contention and no complex disk management.</li>
</ul>
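<p>A minimal sketch of that flush, assuming the MemTable's contents are handed over as a plain Hash (a real engine streams an already-sorted tree):</p>
<pre><code class="lang-ruby"># Hypothetical flush: freeze the MemTable and write it to disk
# in one sequential, sorted pass -- this new file is the SSTable.
def flush_to_sstable(mem_table_data, path)
  File.open(path, "w") do |f|
    # .sort stands in for the Red-Black Tree / Skip List that keeps
    # a real MemTable ordered at all times.
    mem_table_data.sort.each do |key, value|
      f.puts("#{key}\t#{value}")  # append-only, no seeking
    end
  end
end
</code></pre>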
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765103155493/df018bdf-f9f8-4841-a207-e931de87052c.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-the-needle-in-a-haystack-problem">The "Needle in a Haystack" Problem</h2>
<p>Now we have a problem. We have data scattered across RAM and multiple immutable files on disk. How do we find "User: 123" without opening every single file?</p>
<h3 id="heading-the-sparse-index">The Sparse Index</h3>
<p>We cheat. We don't create an index for <em>every</em> key. We create a <strong>Sparse Index</strong>.</p>
<p>Imagine an encyclopedia. You don't have a bookmark for every word. You have a guide that says <em>"Aardvark starts on page 1, Apple starts on page 50."</em></p>
<p>Because our SSTables are sorted, we only need to store the offset of the <strong>first key</strong> of every 64KB block.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765103298961/d07053fe-26ae-4e3d-ad1f-22fa717eb3e7.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-ruby"><span class="hljs-comment"># We don't store every key. We only store the "signposts".</span>
sparse_index = [
  { <span class="hljs-symbol">key:</span> <span class="hljs-string">"ID-001"</span>, <span class="hljs-symbol">offset:</span> <span class="hljs-number">0</span> },      <span class="hljs-comment"># Start of Block 1</span>
  { <span class="hljs-symbol">key:</span> <span class="hljs-string">"ID-500"</span>, <span class="hljs-symbol">offset:</span> <span class="hljs-number">65536</span> },  <span class="hljs-comment"># Start of Block 2</span>
  { <span class="hljs-symbol">key:</span> <span class="hljs-string">"ID-900"</span>, <span class="hljs-symbol">offset:</span> <span class="hljs-number">131072</span> }  <span class="hljs-comment"># Start of Block 3</span>
]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">find_block_for</span><span class="hljs-params">(target_key, index)</span></span>
  <span class="hljs-comment"># If we want ID-750, we know it MUST be in Block 2</span>
  <span class="hljs-comment"># because 750 is between 500 and 900.</span>

  candidate_block = index.select { <span class="hljs-params">|entry|</span> entry[<span class="hljs-symbol">:key</span>] &lt;= target_key }.last

  <span class="hljs-keyword">return</span> candidate_block ? candidate_block[<span class="hljs-symbol">:offset</span>] : <span class="hljs-literal">nil</span>
<span class="hljs-keyword">end</span>
</code></pre>
<ul>
<li><p>If you are looking for <code>ID-250</code>, and the index says <code>ID-001</code> is at Block A and <code>ID-500</code> is at Block B, you know for a fact that <code>ID-250</code> <strong>must</strong> be in Block A.</p>
</li>
<li><p>We load just that small block, scan it, and find our data.</p>
</li>
</ul>
<hr />
<h2 id="heading-the-edge-cases-dealing-with-reality">The Edge Cases: Dealing with Reality</h2>
<p>The architecture above is fast, but it creates some unique engineering challenges. Here is how NoSQL solves them.</p>
<h3 id="heading-1-how-do-i-delete-data-if-files-are-immutable">1. "How do I delete data if files are immutable?"</h3>
<p>If we can't edit the file, how do we delete a row?</p>
<p>We don't.</p>
<p>We write a Tombstone.</p>
<p>A Tombstone is a new record that literally says: "ID-123 is dead."</p>
<pre><code class="lang-ruby">TOMBSTONE = <span class="hljs-string">"##DELETED##"</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">delete_key</span><span class="hljs-params">(key)</span></span>
  <span class="hljs-comment"># We don't delete. We write a new record marking it as dead.</span>
  <span class="hljs-comment"># This is treated exactly like a normal Write operation.</span>
  @db.write(key, TOMBSTONE)
<span class="hljs-keyword">end</span>
</code></pre>
<p>When the database reads data, it sees the Tombstone in the recent file and ignores any older versions of <code>ID-123</code> found in older files. The Tombstone shadows the old data until it's eventually cleaned up.</p>
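<p>Under the same assumptions as the sketches above, that read path might look like this: scan segments newest-first, and stop the moment you hit either a live value or a Tombstone.</p>
<pre><code class="lang-ruby">TOMBSTONE = "##DELETED##"

# Hypothetical read path: sstables is an Array of Hashes,
# ordered newest first, so recent writes shadow older ones.
def read_key(key, sstables)
  sstables.each do |table|
    next unless table.key?(key)
    return nil if table[key] == TOMBSTONE  # the key is dead; stop looking
    return table[key]                      # newest live version wins
  end
  nil  # the key never existed
end
</code></pre>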
<h3 id="heading-2-what-if-the-server-crashes">2. "What if the server crashes?"</h3>
<p>The MemTable is in RAM. If the power plug is pulled, that data is gone forever.</p>
<p>To prevent this, we use a WAL (Write-Ahead Log).</p>
<p>Every write is simultaneously appended to a "dumb" log file on disk <em>before</em> it hits memory. We never read this file—unless the server crashes. On reboot, the database replays the WAL to reconstruct the MemTable. It’s the "Black Box" flight recorder of the database.</p>
<pre><code class="lang-ruby"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">robust_write</span><span class="hljs-params">(key, value)</span></span>
  <span class="hljs-comment"># 1. Safety First: Write to disk log</span>
  @wal.append(key, value)

  <span class="hljs-comment"># 2. Speed Second: Write to Memory</span>
  @mem_table.put(key, value)
<span class="hljs-keyword">end</span>
</code></pre>
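<p>And on reboot, replaying that log to rebuild the MemTable might look like this (a hypothetical sketch, assuming the WAL stores one tab-separated record per line):</p>
<pre><code class="lang-ruby"># Hypothetical crash recovery: replay every WAL record, oldest to
# newest, so later writes to the same key overwrite earlier ones.
def replay_wal(wal_path)
  mem_table = {}
  File.foreach(wal_path) do |line|
    key, value = line.chomp.split("\t", 2)
    mem_table[key] = value
  end
  mem_table
end
</code></pre>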
<h3 id="heading-3-the-not-found-penalty-bloom-filters">3. "The 'Not Found' Penalty" (Bloom Filters)</h3>
<p>The most expensive query in an LSM Tree is looking for a key that doesn't exist.</p>
<p>The database checks the MemTable (Miss). Then the newest SSTable (Miss). Then the next one (Miss)... all the way to the oldest file. You might scan 50 files just to realize the data isn't there.</p>
<p>To stop this wasted effort, we use <strong>Bloom Filters</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765103403550/a08b163a-e2ad-4151-a1e9-6f0a9518961f.png" alt class="image--center mx-auto" /></p>
<p>A Bloom Filter is a space-efficient probabilistic data structure. Think of it as a very smart, very small in-memory bit array.</p>
<p><strong>How it works:</strong></p>
<ol>
<li><p><strong>Multiple Hashes:</strong> When you write a key (e.g., "ID-123"), the system runs it through multiple hash functions.</p>
</li>
<li><p><strong>Bit Flipping:</strong> Each hash function points to a specific bit in an array, and we flip those bits to <code>1</code>.</p>
</li>
<li><p><strong>The Check:</strong> When you want to read "ID-123", we run the hashes again. If <strong>ALL</strong> the corresponding bits are <code>1</code>, the key <em>might</em> exist. If <strong>ANY</strong> bit is <code>0</code>, the key <strong>definitely does not exist</strong>.</p>
</li>
</ol>
<p>This gives us two powerful guarantees:</p>
<ul>
<li><p><strong>False Negatives are Impossible:</strong> If the filter says "No", the data is absolutely not there.</p>
</li>
<li><p><strong>False Positives are Possible:</strong> It might say "Maybe", but the key isn't actually there (rare, but possible).</p>
</li>
</ul>
<pre><code class="lang-ruby"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">BloomFilter</span></span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">initialize</span><span class="hljs-params">(size)</span></span>
    @bit_array = Array.new(size, <span class="hljs-number">0</span>)
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add</span><span class="hljs-params">(key)</span></span>
    <span class="hljs-comment"># Hash the key multiple times and flip bits to 1</span>
    indices = hash_functions(key)
    indices.each { <span class="hljs-params">|i|</span> @bit_array[i] = <span class="hljs-number">1</span> }
  <span class="hljs-keyword">end</span>

  <span class="hljs-comment"># Toy stand-in for k independent hash functions, built by salting</span>
  <span class="hljs-comment"># Ruby's built-in #hash (real filters use functions like MurmurHash).</span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">hash_functions</span><span class="hljs-params">(key)</span></span>
    [<span class="hljs-string">"a:<span class="hljs-subst">#{key}</span>"</span>, <span class="hljs-string">"b:<span class="hljs-subst">#{key}</span>"</span>].map { <span class="hljs-params">|s|</span> s.hash.abs % @bit_array.size }
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">might_contain?</span><span class="hljs-params">(key)</span></span>
    indices = hash_functions(key)
    <span class="hljs-comment"># If ANY bit is 0, the key is definitely not here.</span>
    <span class="hljs-comment"># We can skip the expensive disk read!</span>
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span> <span class="hljs-keyword">if</span> indices.any? { <span class="hljs-params">|i|</span> @bit_array[i] == <span class="hljs-number">0</span> }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span> <span class="hljs-comment"># It might be here, go check the disk.</span>
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_with_optimization</span><span class="hljs-params">(key)</span></span>
  <span class="hljs-comment"># 1. Check RAM</span>
  <span class="hljs-keyword">return</span> @mem_table.get(key) <span class="hljs-keyword">if</span> @mem_table.contains?(key)

  <span class="hljs-comment"># 2. Ask the Bouncer (Bloom Filter)</span>
  <span class="hljs-keyword">unless</span> @bloom_filter.might_contain?(key)
    <span class="hljs-keyword">return</span> <span class="hljs-string">"Key Definitely Not Found"</span> <span class="hljs-comment"># We skip the disk entirely!</span>
  <span class="hljs-keyword">end</span>

  <span class="hljs-comment"># 3. Only check disk if the Bloom Filter says "Maybe"</span>
  <span class="hljs-keyword">return</span> check_sstables_on_disk(key)
<span class="hljs-keyword">end</span>
</code></pre>
<p>This simple trick saves massive amounts of I/O for missing keys.</p>
<h3 id="heading-4-compaction-taking-out-the-trash">4. Compaction: Taking Out the Trash</h3>
<p>Over time, you might have 1,000 SSTables, and <code>ID-123</code> might exist in 50 of them (49 stale versions and 1 current value).</p>
<p>A background process called Compaction runs quietly. It wakes up, merges the 1,000 files into 100 larger files, tosses out the duplicates and Tombstones, and goes back to sleep.</p>
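<p>A compaction pass can be sketched in a few lines (hypothetical, reusing the newest-first Hash-per-segment convention from the sketches above): keep only the newest version of each key, then drop the Tombstones for good.</p>
<pre><code class="lang-ruby">TOMBSTONE = "##DELETED##"

# Hypothetical compaction: merge segments newest-first, so the first
# version seen for each key wins; dead keys finally disappear here.
def compact(sstables_newest_first)
  merged = {}
  sstables_newest_first.each do |table|
    table.each do |key, value|
      merged[key] = value unless merged.key?(key)
    end
  end
  merged.reject { |_, value| value == TOMBSTONE }
end
</code></pre>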
<hr />
<h2 id="heading-summary">Summary</h2>
<p>NoSQL systems aren't magic; they are a masterclass in tradeoffs. By accepting that <strong>files should be immutable</strong> and <strong>RAM should be the primary workspace</strong>, they achieve write speeds that traditional databases can't touch.</p>
<p><strong>The Recipe for Speed:</strong></p>
<ol>
<li><p><strong>Append Only:</strong> Never seek, just write.</p>
</li>
<li><p><strong>MemTable:</strong> Sort in RAM first.</p>
</li>
<li><p><strong>Tombstones:</strong> Don't delete, just mark as dead.</p>
</li>
<li><p><strong>Bloom Filters:</strong> Use math to avoid searching for what isn't there.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[How Apache Lucene Makes Searching Super Fast]]></title><description><![CDATA[Today, I want to talk about something really cool: how Apache Lucene stores and retrieves data so efficiently.

We're not diving into Elasticsearch (it's built on top of Apache Lucene) but into the magic that makes Lucene so powerful for full-text se...]]></description><link>https://clearyourdoubt.in/how-apache-lucene-makes-searching-super-fast</link><guid isPermaLink="true">https://clearyourdoubt.in/how-apache-lucene-makes-searching-super-fast</guid><category><![CDATA[lucene]]></category><category><![CDATA[elasticsearch]]></category><category><![CDATA[search]]></category><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Sat, 06 Dec 2025 09:49:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765014181374/754bdaac-fa15-47ec-80be-fe3e637c21ec.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, I want to talk about something really cool: how <strong>Apache Lucene</strong> stores and retrieves data so efficiently.</p>
<p><img src="https://media.licdn.com/dms/image/v2/D4D12AQFP23RCH302gA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1732213655710?e=1766620800&amp;v=beta&amp;t=M3_IN1fVDn15gTXZRGYwFvnxmZNLMXWlZhJCi_ILI0A" alt /></p>
<p>We're not diving into <strong>Elasticsearch</strong> (it's built on top of Apache Lucene) but into the magic that makes Lucene so powerful for full-text search.</p>
<p>Let's say you're building an app where users search for courses like "<strong>Ruby on Rails</strong>" or "<strong>Java</strong>," and you need to return matching results fast. Sounds simple, right?</p>
<p>But when you look under the hood, it's not that straightforward. Let's explore why traditional databases fall short and how Lucene solves the problem.</p>
<h2 id="heading-why-sql-isnt-great-for-full-text-search">Why SQL Isn't Great for Full-Text Search</h2>
<h3 id="heading-whats-the-problem-with-sql">What's the Problem with SQL?</h3>
<p>Okay, let's talk about SQL databases like MySQL. They're great for structured data, but they kind of struggle when it comes to searching text.</p>
<p><strong>Why</strong>? Because SQL stores data in tables and uses something called <strong>B+tree indexes</strong> to speed things up.</p>
<blockquote>
<p>Now, B+trees are cool—they organize data hierarchically so you can look stuff up faster. But when it comes to searching through a lot of text, they aren't up to the task.</p>
</blockquote>
<p>Imagine you run this query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> courses <span class="hljs-keyword">WHERE</span> description <span class="hljs-keyword">LIKE</span> <span class="hljs-string">'%microservice%'</span>;
</code></pre>
<p>Here's what happens:</p>
<ul>
<li><p>SQL has to check every row in the table for the word "microservice."</p>
</li>
<li><p>If you've got thousands (or millions!) of rows, this is like asking someone to flip through a book page by page looking for one word. Painful, right?</p>
</li>
</ul>
<blockquote>
<p>On top of that, if you've got <strong>n rows</strong> and <strong>m characters</strong> in each course description, a naive pattern scan can shoot up to roughly O(N×M²) in the worst case; that's not good.</p>
</blockquote>
<p>Now imagine you're running something huge like LinkedIn. If users are searching for stuff and your database is doing this kind of heavy lifting every time, it's only a matter of time before everything slows down—or worse, crashes.</p>
<p>Clearly, SQL isn't built for this kind of work.</p>
<h3 id="heading-what-about-nosql-is-it-any-better">What About NoSQL? Is It Any Better?</h3>
<p>So, maybe you're thinking, "What about NoSQL? Isn't it made for scaling?" Well, yes and no.</p>
<ul>
<li><p><strong>Key-value stores</strong> like Redis are great for quick lookups, but if you want to search text, you still need to scan every row for matches.</p>
</li>
<li><p><strong>Document stores</strong> like MongoDB do better, but even they aren't optimized for advanced text search.</p>
</li>
</ul>
<p>Bottom line? Neither SQL nor NoSQL gives you the speed and precision you need for full-text search.</p>
<h2 id="heading-the-game-changer-the-inverted-index">The Game-Changer: The Inverted Index</h2>
<p>Now here's where things get interesting. Instead of trying to force traditional databases to do something they're not built for, Lucene takes a completely different approach. It uses something called an <strong>inverted index</strong>.</p>
<h3 id="heading-what-is-an-inverted-index">What Is an Inverted Index?</h3>
<p>Before diving into the explanation, let me ask you a question:</p>
<p><strong>What data structure could we use to solve this problem?</strong> Two possibilities are a <strong>HashMap</strong> and a <strong>Trie</strong>.</p>
<p>However, there's a catch. If we use a HashMap as the data structure, the key would be the <strong>record ID/document ID</strong>, and the value would contain all the words present in that record or document. Sounds reasonable, right? But let's see how that works in practice.</p>
<h3 id="heading-traditional-hashmap-structure">Traditional HashMap Structure</h3>
<p>Here's how data might look if stored in a traditional HashMap-like structure:</p>
<pre><code class="lang-ruby">documents = {
  <span class="hljs-number">1</span> =&gt; [<span class="hljs-string">"Ruby"</span>, <span class="hljs-string">"Programming"</span>, <span class="hljs-string">"Beginners"</span>, <span class="hljs-string">"Learning"</span>],
  <span class="hljs-number">2</span> =&gt; [<span class="hljs-string">"Java"</span>, <span class="hljs-string">"Advanced"</span>, <span class="hljs-string">"OOP"</span>],
  <span class="hljs-number">3</span> =&gt; [<span class="hljs-string">"Python"</span>, <span class="hljs-string">"Data"</span>, <span class="hljs-string">"Science"</span>]
}
</code></pre>
<p>If you wanted to search for "Ruby," you'd have to go through each record to find where it appears. Slow, right?</p>
<h3 id="heading-now-what-if-we-reverse-the-structure">Now, what if we reverse the structure?</h3>
<p>Instead of storing records as keys and their words as values, we can <strong>flip the structure</strong>. This gives us something called an <strong>Inverted Index</strong>.</p>
<h3 id="heading-inverted-index-structure">Inverted Index Structure</h3>
<p>Here's how the data would look in an inverted index:</p>
<pre><code class="lang-ruby">inverted_index = {
  <span class="hljs-string">"ruby"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"programming"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"beginners"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"learning"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"java"</span> =&gt; [<span class="hljs-number">2</span>],
  <span class="hljs-string">"advanced"</span> =&gt; [<span class="hljs-number">2</span>],
  <span class="hljs-string">"oop"</span> =&gt; [<span class="hljs-number">2</span>],
  <span class="hljs-string">"python"</span> =&gt; [<span class="hljs-number">3</span>],
  <span class="hljs-string">"data"</span> =&gt; [<span class="hljs-number">3</span>],
  <span class="hljs-string">"science"</span> =&gt; [<span class="hljs-number">3</span>]
}
</code></pre>
<p>With this structure, searching for "Ruby" takes you straight to document ID 1. No more flipping through pages—just instant results!</p>
<ul>
<li><p><strong>Keys:</strong> The words (terms) from your dataset.</p>
</li>
<li><p><strong>Values:</strong> Lists of document IDs where those words appear.</p>
</li>
</ul>
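<p>Once you have posting lists, even a multi-word query is cheap: it's just the intersection of each term's list. A minimal sketch (the AND semantics and the <code>search</code> helper are my assumption for illustration, not Lucene's actual API):</p>
<pre><code class="lang-ruby"># Hypothetical AND query: a document matches only if it appears
# in the posting list of every term in the query.
def search(query, inverted_index)
  terms = query.downcase.split
  return [] if terms.empty?
  lists = terms.map { |t| inverted_index.fetch(t, []) }
  lists.reduce(:intersection)  # intersect the posting lists
end
</code></pre>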
<h2 id="heading-how-lucene-uses-an-inverted-index-step-by-step">How Lucene Uses an Inverted Index (Step-by-Step)</h2>
<p>Lucene doesn't just use an inverted index—it improves it with extra steps to make searching even faster and more accurate.</p>
<h3 id="heading-1-tokenization">1. Tokenization</h3>
<p>First, Lucene takes the text and splits it into individual words, or <strong>tokens</strong>. For example:</p>
<ul>
<li><p><strong>Input:</strong> "Learn Ruby programming for beginners."</p>
</li>
<li><p><strong>Tokens:</strong> ["Learn", "Ruby", "programming", "for", "beginners"]</p>
</li>
</ul>
<h3 id="heading-2-lowercasing">2. Lowercasing</h3>
<p>Next, all tokens are converted to lowercase. This way, "Ruby" and "ruby" are treated the same.</p>
<ul>
<li><strong>Tokens:</strong> ["learn", "ruby", "programming", "for", "beginners"]</li>
</ul>
<h3 id="heading-3-removing-stop-words">3. Removing Stop Words</h3>
<p>Common words like "for," "a," or "the" are removed. These words don't add much meaning to searches, so Lucene skips them.</p>
<ul>
<li><strong>Tokens:</strong> ["learn", "ruby", "programming", "beginners"]</li>
</ul>
<h3 id="heading-4-stemming">4. Stemming</h3>
<p>Lucene reduces words to their root form (this is called <strong>stemming</strong>). For example:</p>
<ul>
<li><p>"learn" and "learning" → "learn"</p>
</li>
<li><p>"programming" and "programs" → "program"</p>
</li>
</ul>
<p>Now, we've got: ["learn", "ruby", "program", "beginner"]</p>
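<p>The four steps above chain together into a tiny analyzer. Here's a toy sketch (the stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins for Lucene's real analyzers):</p>
<pre><code class="lang-ruby">STOP_WORDS = %w[a an the for and of on in].freeze

# Toy stemmer: chops a few common suffixes. Real stemmers
# (like the Porter stemmer) are far more careful.
def stem(token)
  token.sub(/(ming|ing|s)\z/, "")
end

# Tokenize, lowercase, drop stop words, stem.
def analyze(text)
  text.scan(/[a-zA-Z]+/)
      .map { |t| t.downcase }
      .reject { |t| STOP_WORDS.include?(t) }
      .map { |t| stem(t) }
end
</code></pre>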
<h3 id="heading-5-building-the-inverted-index">5. Building the Inverted Index</h3>
<p>Finally, Lucene maps these tokens to the inverted index.</p>
<ul>
<li><p>Each <strong>token</strong> becomes a key in the index.</p>
</li>
<li><p>Each <strong>key</strong> points to a list of documents (and positions) where the word appears.</p>
</li>
</ul>
<pre><code class="lang-ruby">inverted_index = {
  <span class="hljs-string">"learn"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">0</span>] }],
  <span class="hljs-string">"ruby"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">1</span>] }],
  <span class="hljs-string">"program"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">2</span>] }],
  <span class="hljs-string">"beginner"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">3</span>] }]
}
</code></pre>
<p>Now, if you search for "Ruby," Lucene can instantly tell you it's in document 1. Easy, right?</p>
<h2 id="heading-why-is-lucene-used-so-widely">Why Is Lucene Used So Widely?</h2>
<p>So why does Lucene beat traditional databases for full-text search?</p>
<ol>
<li><p><strong>Speed</strong> - With the inverted index, Lucene doesn't need to scan entire datasets. It goes straight to the relevant documents.</p>
</li>
<li><p><strong>Ranking and Scoring</strong> - Lucene doesn't just find matches; it ranks them by relevance using scoring algorithms such as TF-IDF and BM25.</p>
</li>
<li><p><strong>Scalability</strong> - Lucene powers systems like Elasticsearch, OpenSearch, MongoDB Atlas, Solr, etc. that handle billions of documents.</p>
</li>
<li><p><strong>Customizability</strong> - You can tweak Lucene to fit your needs—custom tokenizers, stop-word lists, analyzers, etc.</p>
</li>
</ol>
<h2 id="heading-but-wait-doesnt-mysql-do-full-text-search">But Wait… Doesn't MySQL Do Full-Text Search?</h2>
<p>You're right—MySQL does support full-text search using inverted indexes. But here's the thing:</p>
<ul>
<li><p>It's not as fast or scalable as Lucene.</p>
</li>
<li><p>It doesn't have advanced ranking features like TF-IDF or BM25.</p>
</li>
<li><p>It's harder to customize for specific use cases.</p>
</li>
</ul>
<p>For more, check out <a target="_blank" href="https://dev.mysql.com/doc/refman/8.4/en/innodb-fulltext-index.html">MySQL's Full-Text Search Docs</a>.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<p>So, here's the takeaway:</p>
<ul>
<li><p>SQL and NoSQL databases are great, but they're not built for full-text search.</p>
</li>
<li><p>Lucene's <strong>inverted index</strong> makes searching lightning-fast by flipping how data is stored.</p>
</li>
<li><p>It takes things even further with features like stemming, tokenization, and ranking.</p>
</li>
</ul>
<p>That's why tools like Elasticsearch (which is built on Lucene) are so popular. They take Lucene's speed and scalability and make it even easier to use.</p>
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a target="_blank" href="https://www.geeksforgeeks.org/inverted-index/">GeeksforGeeks: Inverted Index</a></p>
</li>
<li><p><a target="_blank" href="https://lucene.apache.org/core/">Apache Lucene Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/apache/lucene/blob/main/lucene/core">Lucene GitHub Repository</a></p>
</li>
<li><p><a target="_blank" href="http://opensearchlab.otago.ac.nz/paper_10.pdf">Research Paper: Inverted Index</a></p>
</li>
<li><p><a target="_blank" href="https://whimsical.com/">Whimsical</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Understanding Database Collation in Rails]]></title><description><![CDATA[You've likely seen collation settings in your schema.rb or database.yml many times, but do you know exactly how they impact your application?
In a standard Rails application, you might see this in a migration:
add_column :users, :name, :string, colla...]]></description><link>https://clearyourdoubt.in/understanding-database-collation-in-rails</link><guid isPermaLink="true">https://clearyourdoubt.in/understanding-database-collation-in-rails</guid><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Fri, 05 Dec 2025 16:35:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764952202159/e441e5fb-927d-4635-a866-2e9f063658dd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You've likely seen <strong>collation</strong> settings in your schema.rb or database.yml many times, but do you know exactly how they impact your application?</p>
<p>In a standard Rails application, you might see this in a migration:</p>
<pre><code class="lang-ruby">add_column <span class="hljs-symbol">:users</span>, <span class="hljs-symbol">:name</span>, <span class="hljs-symbol">:string</span>, <span class="hljs-symbol">collation:</span> <span class="hljs-string">"utf8mb4_unicode_ci"</span>
</code></pre>
<p>Or in your database.yml:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">encoding:</span> <span class="hljs-string">utf8mb4</span>
<span class="hljs-attr">collation:</span> <span class="hljs-string">utf8mb4_0900_ai_ci</span>
</code></pre>
<p>Let's break down what collation actually is, why it matters, and which settings you should use in a modern Rails environment.</p>
<h2 id="heading-1-the-core-concept-character-set-vs-collation">1. The Core Concept: Character Set vs. Collation</h2>
<p>To understand collation, you must distinguish it from character sets.</p>
<ul>
<li><p><strong>Character Set (The Container):</strong> Defines which characters you can store.</p>
</li>
<li><p><strong>Collation (The Rulebook):</strong> Defines how those characters are <strong>compared</strong> and <strong>sorted</strong>.</p>
</li>
</ul>
<p><strong>Example Scenario:</strong> Imagine you have a list of names: ["Amit", "amit", "Ámit"].</p>
<ul>
<li><p><code>utf8mb4_unicode_ci</code>: Case-insensitive. It sees all three as identical matches for a search.</p>
</li>
<li><p><code>utf8mb4_bin</code>: Binary comparison. It sees three completely different values.</p>
</li>
</ul>
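<p>You can get a feel for the difference in plain Ruby. This is only a rough analogy of the comparison rules, not how MySQL actually implements collations:</p>
<pre><code class="lang-ruby"># Rough analogy only: _bin compares exact bytes, while _ai_ci
# compares after stripping accents and ignoring case.
def collation_fold(s)
  # NFD decomposition separates base letters from combining accents,
  # then \p{Mn} removes the accent marks.
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

def bin_equal?(a, b)
  a == b               # like utf8mb4_bin: byte-for-byte
end

def ai_ci_equal?(a, b)
  collation_fold(a) == collation_fold(b)  # like utf8mb4_0900_ai_ci, roughly
end
</code></pre>
<p>With this analogy, the list <code>["Amit", "amit", "Ámit"]</code> contains one binary match for "amit" but three accent- and case-insensitive matches.</p>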
<h2 id="heading-2-why-collation-matters">2. Why Collation Matters</h2>
<p>Collation directly dictates how your database engine (and by extension, Rails) handles data retrieval and organization.</p>
<ol>
<li><p><strong>Search Behavior (WHERE clauses):</strong> If a user searches for name = 'AMIT', the result depends entirely on collation.</p>
</li>
<li><p><strong>Sorting Order (ORDER BY):</strong> Sorting multilingual content requires language-aware rules (e.g., how to sort accented characters like ñ or ö).</p>
</li>
<li><p><strong>Data Integrity (Joins):</strong> Joining two tables with different collations can cause SQL errors or severe performance degradation because the database cannot reliably compare the keys.</p>
</li>
</ol>
<h2 id="heading-3-the-types-of-collation">3. The "Types" of Collation</h2>
<p>When looking at a MySQL collation string (e.g., <code>utf8mb4_0900_ai_ci</code>), the suffixes tell you the rules used:</p>
<ul>
<li><p><code>_ci</code> (Case-Insensitive)</p>
</li>
<li><p><code>_cs</code> (Case-Sensitive)</p>
</li>
<li><p><code>_ai</code> (Accent-Insensitive)</p>
</li>
<li><p><code>_as</code> (Accent-Sensitive)</p>
</li>
<li><p><code>_bin</code> (Binary)</p>
</li>
</ul>
<h2 id="heading-4-best-practices-which-to-use-when">4. Best Practices: Which to Use When?</h2>
<p>For a modern Rails application (Rails 6/7+) running on MySQL 8, here are the recommended strategies:</p>
<h3 id="heading-the-modern-default">✅ The Modern Default</h3>
<p><strong>utf8mb4_0900_ai_ci</strong></p>
<ul>
<li><p><strong>What it is:</strong> The modern Unicode standard. It is accent-insensitive and case-insensitive.</p>
</li>
<li><p><strong>Use case:</strong> General storage for names, descriptions, and comments where you want "cafe" to be found even if the user types "café".</p>
</li>
</ul>
<h3 id="heading-searchable-text">🔍 Searchable Text</h3>
<p><strong>utf8mb4_unicode_ci or utf8mb4_0900_ai_ci</strong></p>
<ul>
<li><p><strong>Use case:</strong> Usernames, emails, and addresses.</p>
</li>
<li><p><strong>Why:</strong> Ensures a smooth user experience. A user searching for "STEPHEN" should generally find "Stephen."</p>
</li>
</ul>
<h3 id="heading-strict-security-amp-tokens">🔐 Strict Security &amp; Tokens</h3>
<p><strong>utf8mb4_bin</strong></p>
<ul>
<li><p><strong>Use case:</strong> Password hashes, API tokens, invite codes, or case-sensitive identifiers (e.g., a URL shortener code where AbC is different from abc).</p>
</li>
<li><p><strong>Why:</strong> You need an exact, byte-for-byte comparison.</p>
</li>
</ul>
<h2 id="heading-5-implementation-in-rails">5. Implementation in Rails</h2>
<h3 id="heading-global-default-databaseyml">Global Default (database.yml)</h3>
<p>This sets the baseline for any new table or column created.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">production:</span>
  <span class="hljs-attr">adapter:</span> <span class="hljs-string">mysql2</span>
  <span class="hljs-attr">encoding:</span> <span class="hljs-string">utf8mb4</span>
  <span class="hljs-attr">collation:</span> <span class="hljs-string">utf8mb4_0900_ai_ci</span>
</code></pre>
<h3 id="heading-column-level-override-migrations">Column-Level Override (Migrations)</h3>
<p>Use this when a specific column requires different behavior than the database default (e.g., a strict token field).</p>
<pre><code class="lang-ruby"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AddTokenToUsers</span> &lt; ActiveRecord::Migration[7.1]</span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">change</span></span>
    add_column <span class="hljs-symbol">:users</span>, <span class="hljs-symbol">:api_token</span>, <span class="hljs-symbol">:string</span>, <span class="hljs-symbol">collation:</span> <span class="hljs-string">"utf8mb4_bin"</span>
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<h3 id="heading-fixing-existing-data">Fixing Existing Data</h3>
<p>If you have broken sorting or broken emoji support, you cannot just change the config file. You must migrate the actual data table.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">CONVERT</span> <span class="hljs-keyword">TO</span> <span class="hljs-built_in">CHARACTER</span> <span class="hljs-keyword">SET</span> utf8mb4 <span class="hljs-keyword">COLLATE</span> utf8mb4_0900_ai_ci;
</code></pre>
<h2 id="heading-summary">Summary</h2>
<ul>
<li><p><strong>Character Set</strong> is for storing data; <strong>Collation</strong> is for comparing data.</p>
</li>
<li><p>Inconsistent collation leads to join errors and weird search bugs.</p>
</li>
<li><p><strong>Recommendation:</strong> Start your projects with <code>utf8mb4_0900_ai_ci</code> as the default.</p>
</li>
<li><p><strong>Exception:</strong> Use <code>_bin</code> (binary) collation only for fields that require strict exact matching (tokens, hashes).</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>