<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[ClearYourDoubt]]></title><description><![CDATA[ClearYourDoubt]]></description><link>https://clearyourdoubt.in</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 06:27:35 GMT</lastBuildDate><atom:link href="https://clearyourdoubt.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding Multi-Peer Video Conferencing: From P2P to SFU]]></title><description><![CDATA[What is a Peer and How Do They Connect?
So when we talk about conference apps like Zoom, Google Meet, or WhatsApp group calls, we're essentially talking about multiple devices connecting to each other in real-time. Each device is called a peer. A pee...]]></description><link>https://clearyourdoubt.in/understanding-multi-peer-video-conferencing-from-p2p-to-sfu</link><guid isPermaLink="true">https://clearyourdoubt.in/understanding-multi-peer-video-conferencing-from-p2p-to-sfu</guid><category><![CDATA[WebRTC]]></category><category><![CDATA[Video conferencing]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Wed, 14 Jan 2026 07:03:59 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-what-is-a-peer-and-how-do-they-connect">What is a Peer and How Do They Connect?</h3>
<p>So when we talk about conference apps like Zoom, Google Meet, or WhatsApp group calls, we're essentially talking about multiple devices connecting to each other in real-time. Each device is called a peer. A peer is basically any participant in the call—your laptop, mobile phone, or any client that's part of the video conference.</p>
<p>How do peers connect? At the basic level, peers need to establish a connection to send and receive video and audio streams. In the simplest case, when there are just two peers (let's say Atish calling Rohit), they can establish a direct connection to each other. But when there are multiple peers, things get more complex. The key challenge is that each peer needs to somehow transmit its data to all other peers and receive data from all of them.</p>
<p>This is where the architecture matters. The way peers connect depends entirely on which architecture you choose—whether it's P2P, Mesh, MCU, or SFU.</p>
<hr />
<h3 id="heading-what-is-webrtc-and-why-use-udp-instead-of-tcp">What is WebRTC and Why Use UDP Instead of TCP?</h3>
<p>Before diving into different architectures, we need to understand WebRTC. WebRTC (Web Real-Time Communication) is an open standard (a collection of protocols and browser APIs) that enables real-time communication directly in web browsers and native applications. It allows peers to establish connections and exchange audio, video, and data directly.</p>
<p>Now, here's the critical part: normal HTTP uses TCP (Transmission Control Protocol), but for real-time video conferencing, this becomes a problem.</p>
<p>Why not use TCP? TCP is designed to ensure that every packet of data reaches its destination in order and without loss. It has built-in error correction and retransmission. While this sounds good for reliability, it's terrible for real-time communication. Here's why:</p>
<p>Imagine you're in a video call and a few packets of video data get lost. TCP will pause the entire stream and wait for those lost packets to be resent. This causes buffering, freezing, and latency. A 200ms delay might not seem like much, but in a real-time conversation, it becomes extremely noticeable and frustrating.</p>
<p>So instead, we use UDP (User Datagram Protocol). UDP is connectionless and doesn't guarantee delivery or order. If a few packets get lost? UDP doesn't care—it just keeps sending. This is perfect for video because:</p>
<ol>
<li><p>Losing a few video frames is acceptable. Your eye won't even notice if one or two frames are missing.</p>
</li>
<li><p>Speed is prioritized over perfection. A fast, slightly degraded stream is better than a pristine one that keeps pausing to buffer.</p>
</li>
<li><p>Low latency is achieved. UDP has minimal overhead, so data travels much faster.</p>
</li>
</ol>
<p>WebRTC uses UDP to transmit media streams. This is the foundation that makes real-time video calls possible with acceptable latency.</p>
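<p>To make UDP's fire-and-forget behavior concrete, here is a minimal Ruby sketch using the standard library's UDPSocket. The loopback address and "frame" payloads are placeholders for illustration; real WebRTC media travels over SRTP on top of UDP, with far more machinery.</p>

```ruby
require 'socket'

# A receiver bound to an OS-assigned free port on loopback.
receiver = UDPSocket.new
receiver.bind('127.0.0.1', 0)
port = receiver.addr[1]

sender = UDPSocket.new
# Fire-and-forget: each datagram is sent immediately. If one were lost,
# the sender would never know and would never retransmit -- unlike TCP.
3.times { |i| sender.send("frame-#{i}", 0, '127.0.0.1', port) }

payload, _addr = receiver.recvfrom(64)
puts payload # the first datagram that happened to arrive

sender.close
receiver.close
```

<p>Notice there is no handshake, no acknowledgement, and no ordering guarantee in the sender's code path, which is exactly why the latency stays minimal.</p>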
<hr />
<h3 id="heading-architectures-p2p-mesh-mcu-and-why-sfu">Architectures: P2P, Mesh, MCU, and Why SFU?</h3>
<p>Now that we understand peers and UDP, let's talk about how multiple peers can connect using different architectures.</p>
<h2 id="heading-1-peer-to-peer-p2p">1. Peer-to-Peer (P2P)</h2>
<p>In P2P, two peers connect directly to each other. Atish sends his video stream directly to Rohit, and Rohit sends his stream directly to Atish. No server involvement for media, no processing overhead.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768371636079/1dd18022-e5a9-4935-a2aa-d24dd627f308.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li><p>No server cost for media handling</p>
</li>
<li><p>Direct connection means lowest possible latency</p>
</li>
<li><p>Simple to implement for 1:1 calls</p>
</li>
</ul>
<p>Problems:</p>
<ul>
<li><p>Doesn't scale. As soon as you add a third peer, it becomes complicated.</p>
</li>
<li><p>Each peer would need to connect to every other peer individually.</p>
</li>
</ul>
<h2 id="heading-2-mesh-network">2. Mesh Network</h2>
<p>A mesh extends P2P to multiple peers. Here, every peer connects to every other peer. With 3-4 peers, it might seem okay. But let's imagine 10-12 peers.</p>
<p>Peer 1 has to:</p>
<ul>
<li><p>Send its video stream to peers 2, 3, 4, 5... up to 12 (11 outgoing connections)</p>
</li>
<li><p>Receive video streams from all other 11 peers (11 incoming streams to decode)</p>
</li>
<li><p>Process and potentially display 11 different video feeds</p>
</li>
</ul>
<p>Multiply this across all 12 peers, and every client is uploading and downloading massive amounts of data while decoding close to a dozen streams at once. CPU and bandwidth usage grow quadratically with the participant count: an n-peer mesh needs n(n-1)/2 connections. It's a mess.</p>
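<p>The arithmetic above is easy to sketch. In this toy calculation, the per-stream bitrate of 1500 kbps is an illustrative assumption, not a WebRTC constant:</p>

```ruby
# Back-of-the-envelope load for a full-mesh call: every peer sends one
# stream to every other peer. stream_kbps is an invented example bitrate.
def mesh_load(peers, stream_kbps = 1500)
  per_peer = peers - 1
  {
    total_links:        peers * (peers - 1) / 2, # unique peer-to-peer links
    upload_kbps_each:   per_peer * stream_kbps,  # streams each client encodes/sends
    download_kbps_each: per_peer * stream_kbps   # streams each client receives/decodes
  }
end

load12 = mesh_load(12)
puts load12[:total_links]        # => 66 unique connections
puts load12[:upload_kbps_each]   # => 16500 kbps (~16.5 Mbps) up, per client
```

<p>Going from 4 peers to 12 peers takes the mesh from 6 links to 66, which is why this topology falls over so quickly.</p>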
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768371648736/c13e139e-3b64-4810-bd0e-e18a58fc33f6.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li>Still no server cost for media</li>
</ul>
<p>Problems:</p>
<ul>
<li><p>Bandwidth consumption is massive</p>
</li>
<li><p>Doesn't scale beyond a handful of participants</p>
</li>
</ul>
<h2 id="heading-3-mcu-multipoint-control-unit">3. MCU (Multipoint Control Unit)</h2>
<p>To solve the mesh problem, MCU was introduced. Here's how it works:</p>
<p>Each peer sends a single stream to a central server (MCU). The MCU then:</p>
<ul>
<li><p>Receives all streams from all peers</p>
</li>
<li><p>Decodes them</p>
</li>
<li><p>Mixes/composites them into one combined video layout (like a grid showing all participants)</p>
</li>
<li><p>Re-encodes the mixed video</p>
</li>
<li><p>Sends this single combined stream back to each peer</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768373789523/1c4aaa62-edb0-4405-ac4c-9d75ae227472.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li><p>Each client only uploads one stream and receives one stream. Much simpler on the client side.</p>
</li>
<li><p>Can handle many more participants than mesh</p>
</li>
</ul>
<p>Problems:</p>
<ul>
<li><p>The server is doing heavy work: decoding multiple streams, compositing them, and re-encoding. This is CPU-intensive.</p>
</li>
<li><p>High latency: All this processing takes time. You have encoding latency, processing latency, and transmission latency stacked together.</p>
</li>
<li><p>Very expensive server infrastructure needed</p>
</li>
<li><p>Not ideal for modern scalable applications</p>
</li>
</ul>
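<p>To see why those latencies stack, here is a toy model. The per-stage costs are invented for illustration and are not measurements of any real MCU or SFU:</p>

```ruby
# Every frame must pass through the full MCU pipeline before any participant
# sees it, so the per-stage delays add up. All numbers below are made up.
mcu_stages = { decode: 20, composite: 30, reencode: 40, network: 50 } # ms
mcu_total  = mcu_stages.values.sum

# A forwarding server skips decode/composite/reencode entirely and only
# pays a routing cost on top of the network. Equally illustrative numbers.
sfu_total = 5 + 50

puts "MCU added latency: ~#{mcu_total} ms per frame"
puts "SFU added latency: ~#{sfu_total} ms per frame"
```

<p>The exact figures vary wildly in practice, but the structural point holds: the MCU pays a full transcode pipeline per frame, while a pure forwarder does not.</p>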
<h2 id="heading-4-sfu-selective-forwarding-unit-the-smart-choice">4. SFU (Selective Forwarding Unit) — The Smart Choice</h2>
<p>SFU is where we get the best of both worlds. Here's how it works:</p>
<p>Each peer sends its stream(s) to the SFU server. The server doesn't mix or compose anything. Instead, it:</p>
<ul>
<li><p>Inspects which streams are relevant</p>
</li>
<li><p>Decides which streams should go to which peers</p>
</li>
<li><p>Forwards each stream separately, without combining them</p>
</li>
</ul>
<p>As a result, each peer receives multiple individual streams, and the peer's client decides what to do with them: which ones to display, which to hide, which quality to accept, and so on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768374038865/9846e327-a5ad-4de3-97a7-b0410dcc19d6.png" alt class="image--center mx-auto" /></p>
<p>Advantages:</p>
<ul>
<li><p>Far less CPU overhead than MCU. The server isn't mixing; it's just routing.</p>
</li>
<li><p>Better latency. No heavy composition work means faster processing.</p>
</li>
<li><p>Peers control their own experience. They can choose to watch high-quality streams from speakers and lower quality from others.</p>
</li>
<li><p>Bandwidth management. Peers can decide to pause certain streams or request lower resolutions.</p>
</li>
<li><p>Highly scalable. One SFU server can handle many more participants than an MCU because it's doing less work.</p>
</li>
<li><p>Significantly lower server costs compared to MCU</p>
</li>
</ul>
<p>This is why modern platforms like Zoom, Google Meet, and others use SFU-based architectures. It's the sweet spot between server cost, scalability, and latency.</p>
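<p>Putting the three multi-party options side by side, a small helper makes the per-client stream counts explicit. This sketches the trade-off only; it is not a model of any particular platform's implementation:</p>

```ruby
# Streams a single client must send (up) and receive/decode (down)
# in an n-participant call, under each architecture.
def client_streams(arch, n)
  case arch
  when :mesh then { up: n - 1, down: n - 1 } # everyone talks to everyone
  when :mcu  then { up: 1,     down: 1     } # server mixes one combined feed
  when :sfu  then { up: 1,     down: n - 1 } # server forwards individual feeds
  end
end

[:mesh, :mcu, :sfu].each do |arch|
  s = client_streams(arch, 12)
  puts "#{arch}: #{s[:up]} up / #{s[:down]} down"
end
```

<p>For a 12-person call this prints 11 up / 11 down for mesh, 1 / 1 for MCU, and 1 / 11 for SFU: the SFU keeps the client's upload cheap while leaving layout and quality decisions on the receiving side.</p>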
<hr />
<p><strong>Conclusion</strong></p>
<p>When you build a multi-party video conferencing application, your architecture choice directly impacts scalability, latency, and costs.</p>
<p>P2P works for 1:1 calls. Mesh doesn't scale. MCU is expensive and slow. SFU is the architecture driving most modern real-time platforms because it balances all these concerns. It's efficient, scalable, and cost-effective.</p>
<p>Understanding these differences helps you make informed decisions when designing your own real-time communication systems.</p>
]]></content:encoded></item><item><title><![CDATA[How to Use Multiple GitHub Accounts]]></title><description><![CDATA[If you have multiple GitHub accounts, it can be a challenge to manage them on the same computer. Fortunately, with a few configuration changes, you can easily use multiple GitHub accounts.
Here are the steps to set up multiple GitHub accounts:
This g...]]></description><link>https://clearyourdoubt.in/how-to-use-multiple-github-accounts</link><guid isPermaLink="true">https://clearyourdoubt.in/how-to-use-multiple-github-accounts</guid><category><![CDATA[GitHub]]></category><category><![CDATA[Git]]></category><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Wed, 10 Dec 2025 11:32:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765366033222/8919009c-2634-46a8-8edb-fee4aeb31617.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have multiple GitHub accounts, it can be a challenge to manage them on the same computer. Fortunately, with a few configuration changes, you can easily use multiple GitHub accounts.</p>
<p>Here are the steps to set up multiple GitHub accounts:</p>
<h2 id="heading-this-guide-is-useful-for-developers-who">This guide is useful for developers who:</h2>
<ul>
<li><p>Work on both personal projects and company-owned repositories</p>
</li>
<li><p>Contribute to open-source projects with a personal account while working at a company</p>
</li>
<li><p>Maintain multiple projects under different GitHub accounts</p>
</li>
<li><p>Want to keep work and personal contributions separate</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, ensure you have:</p>
<ul>
<li><p>Two or more GitHub accounts already created</p>
</li>
<li><p>Access to your terminal/command line</p>
</li>
<li><p>Basic familiarity with Git and SSH</p>
</li>
<li><p>macOS, Linux, or WSL on Windows (Git and SSH pre-installed)</p>
</li>
<li><p>Git already installed and configured on your machine</p>
</li>
</ul>
<h2 id="heading-1-generate-ssh-keys-for-each-account">1. Generate SSH keys for each account</h2>
<p>First, you need to generate SSH keys for each GitHub account you want to use. You can use the following command to generate an SSH key for your "Account 1" GitHub account:</p>
<pre><code class="lang-ruby">ssh-keygen -t rsa -b <span class="hljs-number">4096</span> -C <span class="hljs-string">"your-email@example.com"</span>
</code></pre>
<p>Be sure to replace "your-email@example.com" with the email address associated with your "Account 1" GitHub account. When prompted for a file in which to save the key, don't accept the default (~/.ssh/id_rsa); give the key a unique path instead, such as ~/.ssh/id_rsa_account1.</p>
<p>Repeat this process for each GitHub account you want to use, choosing a unique name for each SSH key (e.g., id_rsa_account1 for "Account 1", id_rsa_account2 for "Account 2", etc.).</p>
<h2 id="heading-2-add-ssh-keys-to-github">2. Add SSH keys to GitHub</h2>
<p>Next, you need to add the SSH keys to each GitHub account. Log in to your "Account 1" GitHub account and go to the "Settings" page. Click on the "SSH and GPG keys" tab, then click the "New SSH key" button. Paste the contents of the <code>id_rsa_account1.pub</code> file (located in ~/.ssh/) into the "Key" field and give the key a descriptive name. Click "Add SSH key" to save the key.</p>
<p>Repeat this process for each GitHub account you want to use, using the appropriate SSH key for each account.</p>
<h2 id="heading-3-create-a-configuration-file">3. Create a configuration file</h2>
<p>Next, you need to create a configuration file for SSH to specify the different hosts and SSH keys to use for each GitHub account. Use the following command to create a new configuration file:</p>
<pre><code class="lang-ruby">nano ~<span class="hljs-regexp">/.ssh/config</span>
</code></pre>
<p>In the configuration file, add the following lines for each GitHub account:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Account 1</span>
Host github.com-account1
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account1

<span class="hljs-comment"># Account 2</span>
Host github.com-account2
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account2
</code></pre>
<p>Replace "account1" and "account2" with unique names for each GitHub account.</p>
<h2 id="heading-4-clone-repositories-and-set-up-the-remote-origin">4. Clone repositories and set up the remote origin</h2>
<p>Next, clone the repositories using Git. For each GitHub account:</p>
<ol>
<li><p>Open a terminal and navigate to the directory where you want to clone the repository.</p>
</li>
<li><p>Use the following command to clone:</p>
</li>
</ol>
<pre><code class="lang-ruby">git clone git@github.com-<span class="hljs-symbol">account1:</span>username/repository.git
</code></pre>
<p>Replace <code>username</code> with the username of your GitHub account, and <code>repository</code> with the name of the repository.</p>
<blockquote>
<p>Important: I made a mistake initially by not using the Git hostname correctly. I was using the incorrect command:</p>
<pre><code class="lang-ruby">git clone git@github.<span class="hljs-symbol">com:</span>username/repository.git
</code></pre>
<p>By omitting the correct hostname, I experienced "access denied" errors when attempting to push or pull.</p>
<p>To avoid this, make sure to use the correct format:</p>
<pre><code class="lang-ruby">git clone git@github.com-<span class="hljs-symbol">account1:</span>username/repository.git
</code></pre>
</blockquote>
<p>Repeat this process for each GitHub account and repository.</p>
<h2 id="heading-5-associate-existing-local-code-with-the-remote-origin">5. Associate existing local code with the remote origin</h2>
<p>If you have existing code on your local machine:</p>
<ul>
<li>Navigate to the local repository directory:</li>
</ul>
<pre><code class="lang-ruby">cd /path/to/repository
</code></pre>
<ul>
<li>View the current remote URLs:</li>
</ul>
<pre><code class="lang-ruby">git remote -v
</code></pre>
<ul>
<li>Update the remote origin URL:</li>
</ul>
<pre><code class="lang-ruby">git remote set-url origin git@github.com-<span class="hljs-symbol">account1:</span>username/repository.git
</code></pre>
<ul>
<li>Verify the update:</li>
</ul>
<pre><code class="lang-ruby">git remote -v
</code></pre>
<p>By following these steps, you can associate your existing local code with the appropriate remote origin URL for each GitHub account.</p>
<h2 id="heading-6-set-ssh-key-permissions">6. Set SSH Key Permissions</h2>
<p>For security reasons, your SSH keys should have specific file permissions. Once you've generated your SSH keys, ensure they have the correct permissions:</p>
<pre><code class="lang-basic"># Set permissions <span class="hljs-keyword">for</span> SSH directory
chmod <span class="hljs-number">700</span> ~/.ssh

# Set permissions <span class="hljs-keyword">for</span> private keys
chmod <span class="hljs-number">600</span> ~/.ssh/id_rsa_account1
chmod <span class="hljs-number">600</span> ~/.ssh/id_rsa_account2

# Set permissions <span class="hljs-keyword">for</span> public keys
chmod <span class="hljs-number">644</span> ~/.ssh/id_rsa_account1.pub
chmod <span class="hljs-number">644</span> ~/.ssh/id_rsa_account2.pub
</code></pre>
<p>These permissions ensure:</p>
<ul>
<li><p><strong>700 for ~/.ssh</strong>: Only you can read, write, and execute the directory</p>
</li>
<li><p><strong>600 for private keys</strong>: Only you can read and write your private keys</p>
</li>
<li><p><strong>644 for public keys</strong>: Only you can write, but everyone can read</p>
</li>
</ul>
<h2 id="heading-7-configure-git-user-identity-per-repository">7. Configure Git User Identity Per Repository</h2>
<p>When working with multiple accounts, you may want to configure your Git user identity per repository to ensure commits are attributed to the correct account:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Navigate to your repository</span>
cd /path/to/repository

<span class="hljs-comment"># Set user name for this repository</span>
git config user.name <span class="hljs-string">"Your Name"</span>

<span class="hljs-comment"># Set user email for this repository</span>
git config user.email <span class="hljs-string">"your-email@example.com"</span>

<span class="hljs-comment"># Verify the configuration</span>
git config --list
</code></pre>
<p>If you want to set this globally for all repositories, use the <code>--global</code> flag:</p>
<pre><code class="lang-ruby">git config --global user.name <span class="hljs-string">"Your Name"</span>
git config --global user.email <span class="hljs-string">"your-email@example.com"</span>
</code></pre>
<h2 id="heading-71-important-fixing-commits-with-wrong-account">7.1 Important: Fixing Commits with Wrong Account</h2>
<p>If you notice that your commits are still being attributed to your old/work account instead of the correct one, you need to fix the repository configuration:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Navigate to your repository</span>
cd /path/to/repository

<span class="hljs-comment"># Set the correct user email for this repository</span>
git config user.email <span class="hljs-string">"your_email@example.com"</span>

<span class="hljs-comment"># Verify the configuration was updated</span>
git config user.email
</code></pre>
<p>Make sure to:</p>
<ol>
<li><p>Use the <strong>correct email address</strong> for your intended account</p>
</li>
<li><p>Run this command <strong>inside the repository directory</strong> (not globally)</p>
</li>
<li><p>Verify the output shows your intended email address</p>
</li>
<li><p>Test with a new commit to ensure it's attributed correctly</p>
</li>
</ol>
<p>If commits were already made with the wrong account, you'll need to rewrite the commit history to fix the author information. This is beyond the scope of this guide, but tools like <code>git filter-repo</code> or <code>git filter-branch</code> can help.</p>
<h2 id="heading-8-verify-ssh-connection">8. Verify SSH Connection</h2>
<p>Before cloning or pushing to a repository, verify that your SSH connection is working correctly:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Test connection to GitHub with account1</span>
ssh -T git@github.com-account1

<span class="hljs-comment"># Test connection to GitHub with account2</span>
ssh -T git@github.com-account2
</code></pre>
<p>You should see a response like:</p>
<pre><code class="lang-ruby">Hi username! You<span class="hljs-string">'ve successfully authenticated, but GitHub does not provide shell access.</span>
</code></pre>
<p>If you encounter any issues, use the verbose flag to debug:</p>
<pre><code class="lang-ruby">ssh -vT git@github.com-account1
</code></pre>
<h2 id="heading-troubleshooting-common-issues">Troubleshooting Common Issues</h2>
<h3 id="heading-issue-1-permission-denied-publickey">Issue 1: Permission Denied (publickey)</h3>
<p><strong>Problem</strong>: You get "Permission denied (publickey)" error when trying to push or pull.</p>
<p><strong>Solution</strong>:</p>
<ol>
<li><p>Verify your SSH key is added to the GitHub account</p>
</li>
<li><p>Check that the SSH key file has correct permissions (chmod 600)</p>
</li>
<li><p>Ensure you're using the correct hostname (e.g., <code>git@github.com-account1</code>)</p>
</li>
<li><p>Test the connection: <code>ssh -T git@github.com-account1</code></p>
</li>
</ol>
<h3 id="heading-issue-2-using-wrong-account">Issue 2: Using Wrong Account</h3>
<p><strong>Problem</strong>: Git is using the wrong account for push/pull operations.</p>
<p><strong>Solution</strong>:</p>
<ol>
<li><p>Verify the remote URL with <code>git remote -v</code></p>
</li>
<li><p>Ensure the hostname matches your SSH config (e.g., <code>github.com-account1</code>, not <code>github.com</code>)</p>
</li>
<li><p>Check git config: <code>git config --local user.email</code></p>
</li>
<li><p>Update remote if needed: <code>git remote set-url origin git@github.com-account1:username/repo.git</code></p>
</li>
</ol>
<h3 id="heading-issue-3-ssh-key-not-found">Issue 3: SSH Key Not Found</h3>
<p><strong>Problem</strong>: "Could not open a connection to your authentication agent" or SSH key not found.</p>
<p><strong>Solution</strong>:</p>
<ol>
<li><p>Start SSH agent: <code>eval "$(ssh-agent -s)"</code></p>
</li>
<li><p>Add your SSH key: <code>ssh-add ~/.ssh/id_rsa_account1</code></p>
</li>
<li><p>Verify key is added: <code>ssh-add -l</code></p>
</li>
</ol>
<h3 id="heading-issue-4-multiple-ssh-keys-in-agent">Issue 4: Multiple SSH Keys in Agent</h3>
<p><strong>Problem</strong>: SSH agent tries wrong key first, causing "too many authentication failures".</p>
<p><strong>Solution</strong>: Modify your ~/.ssh/config to specify which key to use first:</p>
<pre><code class="lang-ruby">Host github.com-account1
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account1
  IdentitiesOnly yes

Host github.com-account2
  HostName github.com
  User git
  IdentityFile ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account2
  IdentitiesOnly yes
</code></pre>
<p>The <code>IdentitiesOnly yes</code> option ensures only the specified key is used.</p>
<h2 id="heading-best-practices-and-tips">Best Practices and Tips</h2>
<ol>
<li><p><strong>Use Descriptive Hostnames</strong>: Make your SSH config hostnames descriptive (e.g., <code>github.com-work</code>, <code>github.com-personal</code>)</p>
</li>
<li><p><strong>Keep Keys Secure</strong>: Never share your private SSH keys and consider using a passphrase for added security</p>
</li>
<li><p><strong>Regularly Rotate Keys</strong>: Periodically generate new SSH keys for security</p>
</li>
<li><p><strong>Use SSH Agent</strong>: Add your keys to SSH agent for seamless authentication:</p>
<pre><code class="lang-ruby"> ssh-add ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account1
 ssh-add ~<span class="hljs-regexp">/.ssh/id</span>_rsa_account2
</code></pre>
</li>
<li><p><strong>Document Your Setup</strong>: Keep a record of which key belongs to which account</p>
</li>
<li><p><strong>Test Before Critical Operations</strong>: Always test SSH connection before important push/pull operations</p>
</li>
<li><p><strong>Monitor SSH Activity</strong>: Regularly check active SSH keys on your GitHub accounts (Settings &gt; SSH and GPG keys)</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Managing multiple GitHub accounts on the same machine is a common workflow for developers who maintain both personal and work projects. By following these steps, you can:</p>
<ul>
<li><p>Generate and manage multiple SSH keys securely</p>
</li>
<li><p>Configure SSH to use different keys for different accounts</p>
</li>
<li><p>Associate repositories with the correct GitHub accounts</p>
</li>
<li><p>Debug and troubleshoot common authentication issues</p>
</li>
</ul>
<p>The key to success is ensuring that:</p>
<ul>
<li><p>SSH keys have correct permissions</p>
</li>
<li><p>SSH config maps hostnames to the right keys</p>
</li>
<li><p>Git remotes use the correct custom hostnames</p>
</li>
<li><p>Each repository is configured with the right user identity</p>
</li>
</ul>
<p>With this setup, you can seamlessly switch between multiple GitHub accounts without authentication headaches. Remember to keep your SSH keys secure and regularly review your active keys on GitHub's settings page.</p>
]]></content:encoded></item><item><title><![CDATA[Inside the Engine: How NoSQL Optimizes for Massive Writes]]></title><description><![CDATA[Most developers have a mental model of how a database works: You insert a row, the database finds the right spot on the hard drive (usually using a B-Tree), and slots it in perfectly. It’s neat, organized, and reliable.
But what happens when you need...]]></description><link>https://clearyourdoubt.in/inside-the-engine-how-nosql-optimizes-for-massive-writes</link><guid isPermaLink="true">https://clearyourdoubt.in/inside-the-engine-how-nosql-optimizes-for-massive-writes</guid><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Sun, 07 Dec 2025 11:13:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765105787590/94f27ad8-4c4f-4e26-ac9a-968071a80d38.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most developers have a mental model of how a database works: You insert a row, the database finds the right spot on the hard drive (usually using a B-Tree), and slots it in perfectly. It’s neat, organized, and reliable.</p>
<p>But what happens when you need to handle <strong>10 million writes per second</strong>?</p>
<p>At that scale, the "neat and organized" B-Tree falls apart. The constant disk seeking and rebalancing would bring your server to its knees.</p>
<p>Enter the <strong>LSM Tree (Log-Structured Merge Tree)</strong>—the engine under the hood of Cassandra, RocksDB, and HBase. It achieves speed by doing something that sounds counter-intuitive: <strong>It stops trying to be organized on disk immediately.</strong></p>
<p>Here is how the magic works.</p>
<hr />
<h2 id="heading-the-lazy-genius-append-only-storage">The "Lazy" Genius: Append-Only Storage</h2>
<p>To make writes instant, NoSQL databases adopt a simple philosophy: <strong>Never modify a file. Always append.</strong></p>
<p>Imagine keeping a diary. If you wanted to change an entry from last Tuesday, you wouldn't get an eraser, find the page, and scrub it out. You’d just write a new entry today saying, <em>"Update regarding last Tuesday..."</em></p>
<p>This is <strong>Sequential I/O</strong>, and disks love it. It is orders of magnitude faster than hopping around the disk (Random I/O) to update old records.</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># The "Lazy" Write</span>
<span class="hljs-comment"># Instead of finding a specific index on disk, we just tack it to the end.</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simple_append</span><span class="hljs-params">(key, value)</span></span>
  timestamp = Time.now.to_i
  entry = <span class="hljs-string">"<span class="hljs-subst">#{timestamp}</span>:<span class="hljs-subst">#{key}</span>:<span class="hljs-subst">#{value}</span>\n"</span>

  <span class="hljs-comment"># 'a' mode opens the file for appending, ensuring O(1) write speed</span>
  File.open(<span class="hljs-string">'database.log'</span>, <span class="hljs-string">'a'</span>) <span class="hljs-keyword">do</span> <span class="hljs-params">|file|</span>
    file.write(entry)
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>But if we just dump data into a file, reading it back becomes a nightmare. We’d have to scan the whole file to find anything. We need a compromise.</p>
<hr />
<h2 id="heading-the-architecture-ram-is-the-new-disk">The Architecture: RAM is the New Disk</h2>
<p>The LSM Tree uses a dual-layer approach. It treats Memory (RAM) as the staging area and Disk as the permanent archive.</p>
<h3 id="heading-write-operation">Write Operation</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765105092404/8ddc3225-621e-4e28-9e78-a2ebc09264c7.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-read-operation">Read Operation</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765105149580/fb8e7992-5677-440d-ac20-c29928928a64.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-1-the-memtable-the-vip-lounge">1. The MemTable (The VIP Lounge)</h3>
<p>When a write request hits your database, it doesn't touch the hard drive yet. It goes straight into the <strong>MemTable</strong> (Memory Table).</p>
<ul>
<li><p><strong>What is it?</strong> A sorted data structure (like a Red-Black Tree or Skip List) living entirely in RAM.</p>
</li>
<li><p><strong>Why?</strong> Because RAM is lightning fast. We can insert and sort data here in microseconds.</p>
</li>
</ul>
<pre><code class="lang-ruby"><span class="hljs-comment"># Inside the Database Engine</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MemTable</span></span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">initialize</span></span>
    <span class="hljs-comment"># In a real DB, this would be a Red-Black Tree or Skip List </span>
    <span class="hljs-comment"># to keep keys sorted automatically.</span>
    @data = {} 
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">put</span><span class="hljs-params">(key, value)</span></span>
    <span class="hljs-comment"># O(1) here with a plain Hash; a real sorted MemTable (Red-Black</span>
    <span class="hljs-comment"># Tree or Skip List) inserts in O(log N) -- still microseconds in RAM</span>
    @data[key] = value
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get</span><span class="hljs-params">(key)</span></span>
    @data[key]
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<h3 id="heading-2-the-sstable-the-immutable-archive">2. The SSTable (The Immutable Archive)</h3>
<p>Once the MemTable gets full (say, it hits 64MB), the database freezes it. It flushes the entire sorted list to the disk in one go.</p>
<p>This file is called an SSTable (Sorted String Table).</p>
<ul>
<li><strong>Crucial Rule:</strong> SSTables are <strong>Immutable</strong>. Once written, they are never changed. This means no lock contention and no complex disk management.</li>
</ul>
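<p>A minimal sketch of that flush, assuming the MemTable's contents are handed over as a plain Hash (a real engine streams an already-sorted tree):</p>
<pre><code class="lang-ruby"># Hypothetical flush: freeze the MemTable and write it to disk
# in one sequential, sorted pass -- this new file is the SSTable.
def flush_to_sstable(mem_table_data, path)
  File.open(path, "w") do |f|
    # .sort stands in for the Red-Black Tree / Skip List that keeps
    # a real MemTable ordered at all times.
    mem_table_data.sort.each do |key, value|
      f.puts("#{key}\t#{value}")  # append-only, no seeking
    end
  end
end
</code></pre>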
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765103155493/df018bdf-f9f8-4841-a207-e931de87052c.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-the-needle-in-a-haystack-problem">The "Needle in a Haystack" Problem</h2>
<p>Now we have a problem. We have data scattered across RAM and multiple immutable files on disk. How do we find "User: 123" without opening every single file?</p>
<h3 id="heading-the-sparse-index">The Sparse Index</h3>
<p>We cheat. We don't create an index for <em>every</em> key. We create a <strong>Sparse Index</strong>.</p>
<p>Imagine an encyclopedia. You don't have a bookmark for every word. You have a guide that says <em>"Aardvark starts on page 1, Apple starts on page 50."</em></p>
<p>Because our SSTables are sorted, we only need to store the offset of the <strong>first key</strong> of every 64KB block.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765103298961/d07053fe-26ae-4e3d-ad1f-22fa717eb3e7.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-ruby"><span class="hljs-comment"># We don't store every key. We only store the "signposts".</span>
sparse_index = [
  { <span class="hljs-symbol">key:</span> <span class="hljs-string">"ID-001"</span>, <span class="hljs-symbol">offset:</span> <span class="hljs-number">0</span> },      <span class="hljs-comment"># Start of Block 1</span>
  { <span class="hljs-symbol">key:</span> <span class="hljs-string">"ID-500"</span>, <span class="hljs-symbol">offset:</span> <span class="hljs-number">65536</span> },  <span class="hljs-comment"># Start of Block 2</span>
  { <span class="hljs-symbol">key:</span> <span class="hljs-string">"ID-900"</span>, <span class="hljs-symbol">offset:</span> <span class="hljs-number">131072</span> }  <span class="hljs-comment"># Start of Block 3</span>
]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">find_block_for</span><span class="hljs-params">(target_key, index)</span></span>
  <span class="hljs-comment"># If we want ID-750, we know it MUST be in Block 2</span>
  <span class="hljs-comment"># because 750 is between 500 and 900.</span>

  candidate_block = index.select { <span class="hljs-params">|entry|</span> entry[<span class="hljs-symbol">:key</span>] &lt;= target_key }.last

  <span class="hljs-keyword">return</span> candidate_block ? candidate_block[<span class="hljs-symbol">:offset</span>] : <span class="hljs-literal">nil</span>
<span class="hljs-keyword">end</span>
</code></pre>
<ul>
<li><p>If you are looking for <code>ID-250</code>, and the index says <code>ID-001</code> is at Block A and <code>ID-500</code> is at Block B, you know for a fact that <code>ID-250</code> <strong>must</strong> be in Block A.</p>
</li>
<li><p>We load just that small block, scan it, and find our data.</p>
</li>
</ul>
<hr />
<h2 id="heading-the-edge-cases-dealing-with-reality">The Edge Cases: Dealing with Reality</h2>
<p>The architecture above is fast, but it creates some unique engineering challenges. Here is how NoSQL solves them.</p>
<h3 id="heading-1-how-do-i-delete-data-if-files-are-immutable">1. "How do I delete data if files are immutable?"</h3>
<p>If we can't edit the file, how do we delete a row?</p>
<p>We don't.</p>
<p>We write a Tombstone.</p>
<p>A Tombstone is a new record that literally says: "ID-123 is dead."</p>
<pre><code class="lang-ruby">TOMBSTONE = <span class="hljs-string">"##DELETED##"</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">delete_key</span><span class="hljs-params">(key)</span></span>
  <span class="hljs-comment"># We don't delete. We write a new record marking it as dead.</span>
  <span class="hljs-comment"># This is treated exactly like a normal Write operation.</span>
  @db.write(key, TOMBSTONE)
<span class="hljs-keyword">end</span>
</code></pre>
<p>When the database reads data, it sees the Tombstone in the recent file and ignores any older versions of <code>ID-123</code> found in older files. The Tombstone shadows the old data until it's eventually cleaned up.</p>
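<p>Under the same assumptions as the sketches above, that read path might look like this: scan segments newest-first, and stop the moment you hit either a live value or a Tombstone.</p>
<pre><code class="lang-ruby">TOMBSTONE = "##DELETED##"

# Hypothetical read path: sstables is an Array of Hashes,
# ordered newest first, so recent writes shadow older ones.
def read_key(key, sstables)
  sstables.each do |table|
    next unless table.key?(key)
    return nil if table[key] == TOMBSTONE  # the key is dead; stop looking
    return table[key]                      # newest live version wins
  end
  nil  # the key never existed
end
</code></pre>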
<h3 id="heading-2-what-if-the-server-crashes">2. "What if the server crashes?"</h3>
<p>The MemTable is in RAM. If the power plug is pulled, that data is gone forever.</p>
<p>To prevent this, we use a WAL (Write-Ahead Log).</p>
<p>Every write is simultaneously appended to a "dumb" log file on disk <em>before</em> it hits memory. We never read this file—unless the server crashes. On reboot, the database replays the WAL to reconstruct the MemTable. It’s the "Black Box" flight recorder of the database.</p>
<pre><code class="lang-ruby"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">robust_write</span><span class="hljs-params">(key, value)</span></span>
  <span class="hljs-comment"># 1. Safety First: Write to disk log</span>
  @wal.append(key, value)

  <span class="hljs-comment"># 2. Speed Second: Write to Memory</span>
  @mem_table.put(key, value)
<span class="hljs-keyword">end</span>
</code></pre>
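<p>And on reboot, replaying that log to rebuild the MemTable might look like this (a hypothetical sketch, assuming the WAL stores one tab-separated record per line):</p>
<pre><code class="lang-ruby"># Hypothetical crash recovery: replay every WAL record, oldest to
# newest, so later writes to the same key overwrite earlier ones.
def replay_wal(wal_path)
  mem_table = {}
  File.foreach(wal_path) do |line|
    key, value = line.chomp.split("\t", 2)
    mem_table[key] = value
  end
  mem_table
end
</code></pre>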
<h3 id="heading-3-the-not-found-penalty-bloom-filters">3. "The 'Not Found' Penalty" (Bloom Filters)</h3>
<p>The most expensive query in an LSM Tree is looking for a key that doesn't exist.</p>
<p>The database checks the MemTable (Miss). Then the newest SSTable (Miss). Then the next one (Miss)... all the way to the oldest file. You might scan 50 files just to realize the data isn't there.</p>
<p>To stop this wasted effort, we use <strong>Bloom Filters</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765103403550/a08b163a-e2ad-4151-a1e9-6f0a9518961f.png" alt class="image--center mx-auto" /></p>
<p>A Bloom Filter is a space-efficient probabilistic data structure. Think of it as a very smart, very small in-memory bit array.</p>
<p><strong>How it works:</strong></p>
<ol>
<li><p><strong>Multiple Hashes:</strong> When you write a key (e.g., "ID-123"), the system runs it through multiple hash functions.</p>
</li>
<li><p><strong>Bit Flipping:</strong> Each hash function points to a specific bit in an array, and we flip those bits to <code>1</code>.</p>
</li>
<li><p><strong>The Check:</strong> When you want to read "ID-123", we run the hashes again. If <strong>ALL</strong> the corresponding bits are <code>1</code>, the key <em>might</em> exist. If <strong>ANY</strong> bit is <code>0</code>, the key <strong>definitely does not exist</strong>.</p>
</li>
</ol>
<p>This gives us two powerful guarantees:</p>
<ul>
<li><p><strong>False Negatives are Impossible:</strong> If the filter says "No", the data is absolutely not there.</p>
</li>
<li><p><strong>False Positives are Possible:</strong> It might say "Maybe", but the key isn't actually there (rare, but possible).</p>
</li>
</ul>
<pre><code class="lang-ruby"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">BloomFilter</span></span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">initialize</span><span class="hljs-params">(size)</span></span>
    @bit_array = Array.new(size, <span class="hljs-number">0</span>)
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add</span><span class="hljs-params">(key)</span></span>
    <span class="hljs-comment"># Hash the key multiple times and flip bits to 1</span>
    indices = hash_functions(key)
    indices.each { <span class="hljs-params">|i|</span> @bit_array[i] = <span class="hljs-number">1</span> }
  <span class="hljs-keyword">end</span>

  <span class="hljs-comment"># Toy stand-in for k independent hash functions, built by salting</span>
  <span class="hljs-comment"># Ruby's built-in #hash (real filters use functions like MurmurHash).</span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">hash_functions</span><span class="hljs-params">(key)</span></span>
    [<span class="hljs-string">"a:<span class="hljs-subst">#{key}</span>"</span>, <span class="hljs-string">"b:<span class="hljs-subst">#{key}</span>"</span>].map { <span class="hljs-params">|s|</span> s.hash.abs % @bit_array.size }
  <span class="hljs-keyword">end</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">might_contain?</span><span class="hljs-params">(key)</span></span>
    indices = hash_functions(key)
    <span class="hljs-comment"># If ANY bit is 0, the key is definitely not here.</span>
    <span class="hljs-comment"># We can skip the expensive disk read!</span>
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span> <span class="hljs-keyword">if</span> indices.any? { <span class="hljs-params">|i|</span> @bit_array[i] == <span class="hljs-number">0</span> }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span> <span class="hljs-comment"># It might be here, go check the disk.</span>
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_with_optimization</span><span class="hljs-params">(key)</span></span>
  <span class="hljs-comment"># 1. Check RAM</span>
  <span class="hljs-keyword">return</span> @mem_table.get(key) <span class="hljs-keyword">if</span> @mem_table.contains?(key)

  <span class="hljs-comment"># 2. Ask the Bouncer (Bloom Filter)</span>
  <span class="hljs-keyword">unless</span> @bloom_filter.might_contain?(key)
    <span class="hljs-keyword">return</span> <span class="hljs-string">"Key Definitely Not Found"</span> <span class="hljs-comment"># We skip the disk entirely!</span>
  <span class="hljs-keyword">end</span>

  <span class="hljs-comment"># 3. Only check disk if the Bloom Filter says "Maybe"</span>
  <span class="hljs-keyword">return</span> check_sstables_on_disk(key)
<span class="hljs-keyword">end</span>
</code></pre>
<p>This simple trick saves massive amounts of I/O for missing keys.</p>
<h3 id="heading-4-compaction-taking-out-the-trash">4. Compaction: Taking Out the Trash</h3>
<p>Over time, you might have 1,000 SSTables, and <code>ID-123</code> might exist in 50 of them (49 stale versions and 1 current value).</p>
<p>A background process called Compaction runs quietly. It wakes up, merges the 1,000 files into 100 larger files, tosses out the duplicates and Tombstones, and goes back to sleep.</p>
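<p>A compaction pass can be sketched in a few lines (hypothetical, reusing the newest-first Hash-per-segment convention from the sketches above): keep only the newest version of each key, then drop the Tombstones for good.</p>
<pre><code class="lang-ruby">TOMBSTONE = "##DELETED##"

# Hypothetical compaction: merge segments newest-first, so the first
# version seen for each key wins; dead keys finally disappear here.
def compact(sstables_newest_first)
  merged = {}
  sstables_newest_first.each do |table|
    table.each do |key, value|
      merged[key] = value unless merged.key?(key)
    end
  end
  merged.reject { |_, value| value == TOMBSTONE }
end
</code></pre>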
<hr />
<h2 id="heading-summary">Summary</h2>
<p>NoSQL systems aren't magic; they are a masterclass in tradeoffs. By accepting that <strong>files should be immutable</strong> and <strong>RAM should be the primary workspace</strong>, they achieve write speeds that traditional databases can't touch.</p>
<p><strong>The Recipe for Speed:</strong></p>
<ol>
<li><p><strong>Append Only:</strong> Never seek, just write.</p>
</li>
<li><p><strong>MemTable:</strong> Sort in RAM first.</p>
</li>
<li><p><strong>Tombstones:</strong> Don't delete, just mark as dead.</p>
</li>
<li><p><strong>Bloom Filters:</strong> Use math to avoid searching for what isn't there.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[How Apache Lucene Makes Searching Super Fast]]></title><description><![CDATA[Today, I want to talk about something really cool: how Apache Lucene stores and retrieves data so efficiently.

We're not diving into Elasticsearch (it's built on top of Apache Lucene) but into the magic that makes Lucene so powerful for full-text se...]]></description><link>https://clearyourdoubt.in/how-apache-lucene-makes-searching-super-fast</link><guid isPermaLink="true">https://clearyourdoubt.in/how-apache-lucene-makes-searching-super-fast</guid><category><![CDATA[lucene]]></category><category><![CDATA[elasticsearch]]></category><category><![CDATA[search]]></category><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Sat, 06 Dec 2025 09:49:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765014181374/754bdaac-fa15-47ec-80be-fe3e637c21ec.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, I want to talk about something really cool: how <strong>Apache Lucene</strong> stores and retrieves data so efficiently.</p>
<p><img src="https://media.licdn.com/dms/image/v2/D4D12AQFP23RCH302gA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1732213655710?e=1766620800&amp;v=beta&amp;t=M3_IN1fVDn15gTXZRGYwFvnxmZNLMXWlZhJCi_ILI0A" alt /></p>
<p>We're not diving into <strong>Elasticsearch</strong> (it's built on top of Apache Lucene) but into the magic that makes Lucene so powerful for full-text search.</p>
<p>Let's say you're building an app where users search for courses like "<strong>Ruby on Rails</strong>" or "<strong>Java</strong>," and you need to return matching results fast. Sounds simple, right?</p>
<p>But when you look under the hood, it's not that straightforward. Let's explore why traditional databases fall short and how Lucene solves the problem.</p>
<h2 id="heading-why-sql-isnt-great-for-full-text-search">Why SQL Isn't Great for Full-Text Search</h2>
<h3 id="heading-whats-the-problem-with-sql">What's the Problem with SQL?</h3>
<p>Okay, let's talk about SQL databases like MySQL. They're great for structured data, but they kind of struggle when it comes to searching text.</p>
<p><strong>Why</strong>? Because SQL stores data in tables and uses something called <strong>B+tree indexes</strong> to speed things up.</p>
<blockquote>
<p>Now, B+trees are cool—they organize data hierarchically so you can look stuff up faster. But when it comes to searching through a lot of text, they aren't up to the task.</p>
</blockquote>
<p>Imagine you run this query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> courses <span class="hljs-keyword">WHERE</span> description <span class="hljs-keyword">LIKE</span> <span class="hljs-string">'%microservice%'</span>;
</code></pre>
<p>Here's what happens:</p>
<ul>
<li><p>SQL has to check every row in the table for the word "microservice."</p>
</li>
<li><p>If you've got thousands (or millions!) of rows, this is like asking someone to flip through a book page by page looking for one word. Painful, right?</p>
</li>
</ul>
<blockquote>
<p>On top of that, if you've got <strong>n rows</strong> and <strong>m characters</strong> in each course description, a naive pattern scan can shoot up to roughly O(N×M²) in the worst case; that's not good.</p>
</blockquote>
<p>Now imagine you're running something huge like LinkedIn. If users are searching for stuff and your database is doing this kind of heavy lifting every time, it's only a matter of time before everything slows down—or worse, crashes.</p>
<p>Clearly, SQL isn't built for this kind of work.</p>
<h3 id="heading-what-about-nosql-is-it-any-better">What About NoSQL? Is It Any Better?</h3>
<p>So, maybe you're thinking, "What about NoSQL? Isn't it made for scaling?" Well, yes and no.</p>
<ul>
<li><p><strong>Key-value stores</strong> like Redis are great for quick lookups, but if you want to search text, you still need to scan every row for matches.</p>
</li>
<li><p><strong>Document stores</strong> like MongoDB do better, but even they aren't optimized for advanced text search.</p>
</li>
</ul>
<p>Bottom line? Neither SQL nor NoSQL gives you the speed and precision you need for full-text search.</p>
<h2 id="heading-the-game-changer-the-inverted-index">The Game-Changer: The Inverted Index</h2>
<p>Now here's where things get interesting. Instead of trying to force traditional databases to do something they're not built for, Lucene takes a completely different approach. It uses something called an <strong>inverted index</strong>.</p>
<h3 id="heading-what-is-an-inverted-index">What Is an Inverted Index?</h3>
<p>Before diving into the explanation, let me ask you a question:</p>
<p><strong>What data structure could we use to solve this problem?</strong> Two possibilities are a <strong>HashMap</strong> and a <strong>Trie</strong>.</p>
<p>However, there's a catch. If we use a HashMap as the data structure, the key would be the <strong>record ID/document ID</strong>, and the value would contain all the words present in that record or document. Sounds reasonable, right? But let's see how that works in practice.</p>
<h3 id="heading-traditional-hashmap-structure">Traditional HashMap Structure</h3>
<p>Here's how data might look if stored in a traditional HashMap-like structure:</p>
<pre><code class="lang-ruby">documents = {
  <span class="hljs-number">1</span> =&gt; [<span class="hljs-string">"Ruby"</span>, <span class="hljs-string">"Programming"</span>, <span class="hljs-string">"Beginners"</span>, <span class="hljs-string">"Learning"</span>],
  <span class="hljs-number">2</span> =&gt; [<span class="hljs-string">"Java"</span>, <span class="hljs-string">"Advanced"</span>, <span class="hljs-string">"OOP"</span>],
  <span class="hljs-number">3</span> =&gt; [<span class="hljs-string">"Python"</span>, <span class="hljs-string">"Data"</span>, <span class="hljs-string">"Science"</span>]
}
</code></pre>
<p>If you wanted to search for "Ruby," you'd have to go through each record to find where it appears. Slow, right?</p>
<h3 id="heading-now-what-if-we-reverse-the-structure">Now, what if we reverse the structure?</h3>
<p>Instead of storing records as keys and their words as values, we can <strong>flip the structure</strong>. This gives us something called an <strong>Inverted Index</strong>.</p>
<h3 id="heading-inverted-index-structure">Inverted Index Structure</h3>
<p>Here's how the data would look in an inverted index:</p>
<pre><code class="lang-ruby">inverted_index = {
  <span class="hljs-string">"ruby"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"programming"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"beginners"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"learning"</span> =&gt; [<span class="hljs-number">1</span>],
  <span class="hljs-string">"java"</span> =&gt; [<span class="hljs-number">2</span>],
  <span class="hljs-string">"advanced"</span> =&gt; [<span class="hljs-number">2</span>],
  <span class="hljs-string">"oop"</span> =&gt; [<span class="hljs-number">2</span>],
  <span class="hljs-string">"python"</span> =&gt; [<span class="hljs-number">3</span>],
  <span class="hljs-string">"data"</span> =&gt; [<span class="hljs-number">3</span>],
  <span class="hljs-string">"science"</span> =&gt; [<span class="hljs-number">3</span>]
}
</code></pre>
<p>With this structure, searching for "Ruby" takes you straight to document ID 1. No more flipping through pages—just instant results!</p>
<ul>
<li><p><strong>Keys:</strong> The words (terms) from your dataset.</p>
</li>
<li><p><strong>Values:</strong> Lists of document IDs where those words appear.</p>
</li>
</ul>
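<p>Once you have posting lists, even a multi-word query is cheap: it's just the intersection of each term's list. A minimal sketch (the AND semantics and the <code>search</code> helper are my assumption for illustration, not Lucene's actual API):</p>
<pre><code class="lang-ruby"># Hypothetical AND query: a document matches only if it appears
# in the posting list of every term in the query.
def search(query, inverted_index)
  terms = query.downcase.split
  return [] if terms.empty?
  lists = terms.map { |t| inverted_index.fetch(t, []) }
  lists.reduce(:intersection)  # intersect the posting lists
end
</code></pre>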
<h2 id="heading-how-lucene-uses-an-inverted-index-step-by-step">How Lucene Uses an Inverted Index (Step-by-Step)</h2>
<p>Lucene doesn't just use an inverted index—it improves it with extra steps to make searching even faster and more accurate.</p>
<h3 id="heading-1-tokenization">1. Tokenization</h3>
<p>First, Lucene takes the text and splits it into individual words, or <strong>tokens</strong>. For example:</p>
<ul>
<li><p><strong>Input:</strong> "Learn Ruby programming for beginners."</p>
</li>
<li><p><strong>Tokens:</strong> ["Learn", "Ruby", "programming", "for", "beginners"]</p>
</li>
</ul>
<h3 id="heading-2-lowercasing">2. Lowercasing</h3>
<p>Next, all tokens are converted to lowercase. This way, "Ruby" and "ruby" are treated the same.</p>
<ul>
<li><strong>Tokens:</strong> ["learn", "ruby", "programming", "for", "beginners"]</li>
</ul>
<h3 id="heading-3-removing-stop-words">3. Removing Stop Words</h3>
<p>Common words like "for," "a," or "the" are removed. These words don't add much meaning to searches, so Lucene skips them.</p>
<ul>
<li><strong>Tokens:</strong> ["learn", "ruby", "programming", "beginners"]</li>
</ul>
<h3 id="heading-4-stemming">4. Stemming</h3>
<p>Lucene reduces words to their root form (this is called <strong>stemming</strong>). For example:</p>
<ul>
<li><p>"learn" and "learning" → "learn"</p>
</li>
<li><p>"programming" and "programs" → "program"</p>
</li>
</ul>
<p>Now, we've got: ["learn", "ruby", "program", "beginner"]</p>
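<p>The four steps above chain together into a tiny analyzer. Here's a toy sketch (the stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins for Lucene's real analyzers):</p>
<pre><code class="lang-ruby">STOP_WORDS = %w[a an the for and of on in].freeze

# Toy stemmer: chops a few common suffixes. Real stemmers
# (like the Porter stemmer) are far more careful.
def stem(token)
  token.sub(/(ming|ing|s)\z/, "")
end

# Tokenize, lowercase, drop stop words, stem.
def analyze(text)
  text.scan(/[a-zA-Z]+/)
      .map { |t| t.downcase }
      .reject { |t| STOP_WORDS.include?(t) }
      .map { |t| stem(t) }
end
</code></pre>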
<h3 id="heading-5-building-the-inverted-index">5. Building the Inverted Index</h3>
<p>Finally, Lucene maps these tokens to the inverted index.</p>
<ul>
<li><p>Each <strong>token</strong> becomes a key in the index.</p>
</li>
<li><p>Each <strong>key</strong> points to a list of documents (and positions) where the word appears.</p>
</li>
</ul>
<pre><code class="lang-ruby">inverted_index = {
  <span class="hljs-string">"learn"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">0</span>] }],
  <span class="hljs-string">"ruby"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">1</span>] }],
  <span class="hljs-string">"program"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">2</span>] }],
  <span class="hljs-string">"beginner"</span> =&gt; [{ <span class="hljs-symbol">doc:</span> <span class="hljs-number">1</span>, <span class="hljs-symbol">positions:</span> [<span class="hljs-number">3</span>] }]
}
</code></pre>
<p>Now, if you search for "Ruby," Lucene can instantly tell you it's in document 1. Easy, right?</p>
<h2 id="heading-why-is-lucene-used-so-widely">Why Is Lucene Used So Widely?</h2>
<p>So why does Lucene beat traditional databases for full-text search?</p>
<ol>
<li><p><strong>Speed</strong> - With the inverted index, Lucene doesn't need to scan entire datasets. It goes straight to the relevant documents.</p>
</li>
<li><p><strong>Ranking and Scoring</strong> - Lucene doesn't just find matches; it ranks them by relevance using scoring algorithms such as TF-IDF and BM25.</p>
</li>
<li><p><strong>Scalability</strong> - Lucene powers systems like Elasticsearch, OpenSearch, MongoDB Atlas, Solr, etc. that handle billions of documents.</p>
</li>
<li><p><strong>Customizability</strong> - You can tweak Lucene to fit your needs—custom tokenizers, stop-word lists, analyzers, etc.</p>
</li>
</ol>
<h2 id="heading-but-wait-doesnt-mysql-do-full-text-search">But Wait… Doesn't MySQL Do Full-Text Search?</h2>
<p>You're right—MySQL does support full-text search using inverted indexes. But here's the thing:</p>
<ul>
<li><p>It's not as fast or scalable as Lucene.</p>
</li>
<li><p>It doesn't have advanced ranking features like TF-IDF or BM25.</p>
</li>
<li><p>It's harder to customize for specific use cases.</p>
</li>
</ul>
<p>For more, check out <a target="_blank" href="https://dev.mysql.com/doc/refman/8.4/en/innodb-fulltext-index.html">MySQL's Full-Text Search Docs</a>.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<p>So, here's the takeaway:</p>
<ul>
<li><p>SQL and NoSQL databases are great, but they're not built for full-text search.</p>
</li>
<li><p>Lucene's <strong>inverted index</strong> makes searching lightning-fast by flipping how data is stored.</p>
</li>
<li><p>It takes things even further with features like stemming, tokenization, and ranking.</p>
</li>
</ul>
<p>That's why tools like Elasticsearch (which is built on Lucene) are so popular. They take Lucene's speed and scalability and make it even easier to use.</p>
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a target="_blank" href="https://www.geeksforgeeks.org/inverted-index/">GeeksforGeeks: Inverted Index</a></p>
</li>
<li><p><a target="_blank" href="https://lucene.apache.org/core/">Apache Lucene Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/apache/lucene/blob/main/lucene/core">Lucene GitHub Repository</a></p>
</li>
<li><p><a target="_blank" href="http://opensearchlab.otago.ac.nz/paper_10.pdf">Research Paper: Inverted Index</a></p>
</li>
<li><p><a target="_blank" href="https://whimsical.com/">Whimsical</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Understanding Database Collation in Rails]]></title><description><![CDATA[You've likely seen collation settings in your schema.rb or database.yml many times, but do you know exactly how they impact your application?
In a standard Rails application, you might see this in a migration:
add_column :users, :name, :string, colla...]]></description><link>https://clearyourdoubt.in/understanding-database-collation-in-rails</link><guid isPermaLink="true">https://clearyourdoubt.in/understanding-database-collation-in-rails</guid><dc:creator><![CDATA[Atish Maske]]></dc:creator><pubDate>Fri, 05 Dec 2025 16:35:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764952202159/e441e5fb-927d-4635-a866-2e9f063658dd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You've likely seen <strong>collation</strong> settings in your schema.rb or database.yml many times, but do you know exactly how they impact your application?</p>
<p>In a standard Rails application, you might see this in a migration:</p>
<pre><code class="lang-ruby">add_column <span class="hljs-symbol">:users</span>, <span class="hljs-symbol">:name</span>, <span class="hljs-symbol">:string</span>, <span class="hljs-symbol">collation:</span> <span class="hljs-string">"utf8mb4_unicode_ci"</span>
</code></pre>
<p>Or in your database.yml:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">encoding:</span> <span class="hljs-string">utf8mb4</span>
<span class="hljs-attr">collation:</span> <span class="hljs-string">utf8mb4_0900_ai_ci</span>
</code></pre>
<p>Let's break down what collation actually is, why it matters, and which settings you should use in a modern Rails environment.</p>
<h2 id="heading-1-the-core-concept-character-set-vs-collation">1. The Core Concept: Character Set vs. Collation</h2>
<p>To understand collation, you must distinguish it from character sets.</p>
<ul>
<li><p><strong>Character Set (The Container):</strong> Defines which characters you can store.</p>
</li>
<li><p><strong>Collation (The Rulebook):</strong> Defines how those characters are <strong>compared</strong> and <strong>sorted</strong>.</p>
</li>
</ul>
<p><strong>Example Scenario:</strong> Imagine you have a list of names: ["Amit", "amit", "Ámit"].</p>
<ul>
<li><p><code>utf8mb4_unicode_ci</code>: Case-insensitive. It sees all three as identical matches for a search.</p>
</li>
<li><p><code>utf8mb4_bin</code>: Binary comparison. It sees three completely different values.</p>
</li>
</ul>
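<p>You can get a feel for the difference in plain Ruby. This is only a rough analogy of the comparison rules, not how MySQL actually implements collations:</p>
<pre><code class="lang-ruby"># Rough analogy only: _bin compares exact bytes, while _ai_ci
# compares after stripping accents and ignoring case.
def collation_fold(s)
  # NFD decomposition separates base letters from combining accents,
  # then \p{Mn} removes the accent marks.
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

def bin_equal?(a, b)
  a == b               # like utf8mb4_bin: byte-for-byte
end

def ai_ci_equal?(a, b)
  collation_fold(a) == collation_fold(b)  # like utf8mb4_0900_ai_ci, roughly
end
</code></pre>
<p>With this analogy, the list <code>["Amit", "amit", "Ámit"]</code> contains one binary match for "amit" but three accent- and case-insensitive matches.</p>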
<h2 id="heading-2-why-collation-matters">2. Why Collation Matters</h2>
<p>Collation directly dictates how your database engine (and by extension, Rails) handles data retrieval and organization.</p>
<ol>
<li><p><strong>Search Behavior (WHERE clauses):</strong> If a user searches for name = 'AMIT', the result depends entirely on collation.</p>
</li>
<li><p><strong>Sorting Order (ORDER BY):</strong> Sorting multilingual content requires language-aware rules (e.g., how to sort accented characters like ñ or ö).</p>
</li>
<li><p><strong>Data Integrity (Joins):</strong> Joining two tables with different collations can cause SQL errors or severe performance degradation because the database cannot reliably compare the keys.</p>
</li>
</ol>
<h2 id="heading-3-the-types-of-collation">3. The "Types" of Collation</h2>
<p>When looking at a MySQL collation string (e.g., <code>utf8mb4_0900_ai_ci</code>), the suffixes tell you the rules used:</p>
<ul>
<li><p><code>_ci</code> (Case-Insensitive)</p>
</li>
<li><p><code>_cs</code> (Case-Sensitive)</p>
</li>
<li><p><code>_ai</code> (Accent-Insensitive)</p>
</li>
<li><p><code>_as</code> (Accent-Sensitive)</p>
</li>
<li><p><code>_bin</code> (Binary)</p>
</li>
</ul>
<h2 id="heading-4-best-practices-which-to-use-when">4. Best Practices: Which to Use When?</h2>
<p>For a modern Rails application (Rails 6/7+) running on MySQL 8, here are the recommended strategies:</p>
<h3 id="heading-the-modern-default">✅ The Modern Default</h3>
<p><strong>utf8mb4_0900_ai_ci</strong></p>
<ul>
<li><p><strong>What it is:</strong> The modern Unicode standard. It is accent-insensitive and case-insensitive.</p>
</li>
<li><p><strong>Use case:</strong> General storage for names, descriptions, and comments where you want "cafe" to be found even if the user types "café".</p>
</li>
</ul>
<h3 id="heading-searchable-text">🔍 Searchable Text</h3>
<p><strong>utf8mb4_unicode_ci or utf8mb4_0900_ai_ci</strong></p>
<ul>
<li><p><strong>Use case:</strong> Usernames, emails, and addresses.</p>
</li>
<li><p><strong>Why:</strong> Ensures a smooth user experience. A user searching for "STEPHEN" should generally find "Stephen."</p>
</li>
</ul>
<h3 id="heading-strict-security-amp-tokens">🔐 Strict Security &amp; Tokens</h3>
<p><strong>utf8mb4_bin</strong></p>
<ul>
<li><p><strong>Use case:</strong> Password hashes, API tokens, invite codes, or case-sensitive identifiers (e.g., a URL shortener code where AbC is different from abc).</p>
</li>
<li><p><strong>Why:</strong> You need an exact, byte-for-byte comparison.</p>
</li>
</ul>
<h2 id="heading-5-implementation-in-rails">5. Implementation in Rails</h2>
<h3 id="heading-global-default-databaseyml">Global Default (database.yml)</h3>
<p>This sets the baseline for any new table or column created.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">production:</span>
  <span class="hljs-attr">adapter:</span> <span class="hljs-string">mysql2</span>
  <span class="hljs-attr">encoding:</span> <span class="hljs-string">utf8mb4</span>
  <span class="hljs-attr">collation:</span> <span class="hljs-string">utf8mb4_0900_ai_ci</span>
</code></pre>
<h3 id="heading-column-level-override-migrations">Column-Level Override (Migrations)</h3>
<p>Use this when a specific column requires different behavior than the database default (e.g., a strict token field).</p>
<pre><code class="lang-ruby"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AddTokenToUsers</span> &lt; ActiveRecord::Migration[7.1]</span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">change</span></span>
    add_column <span class="hljs-symbol">:users</span>, <span class="hljs-symbol">:api_token</span>, <span class="hljs-symbol">:string</span>, <span class="hljs-symbol">collation:</span> <span class="hljs-string">"utf8mb4_bin"</span>
  <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<h3 id="heading-fixing-existing-data">Fixing Existing Data</h3>
<p>If you have broken sorting or broken emoji support, you cannot just change the config file. You must migrate the actual data table.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">CONVERT</span> <span class="hljs-keyword">TO</span> <span class="hljs-built_in">CHARACTER</span> <span class="hljs-keyword">SET</span> utf8mb4 <span class="hljs-keyword">COLLATE</span> utf8mb4_0900_ai_ci;
</code></pre>
<h2 id="heading-summary">Summary</h2>
<ul>
<li><p><strong>Character Set</strong> is for storing data; <strong>Collation</strong> is for comparing data.</p>
</li>
<li><p>Inconsistent collation leads to join errors and weird search bugs.</p>
</li>
<li><p><strong>Recommendation:</strong> Start your projects with <code>utf8mb4_0900_ai_ci</code> as the default.</p>
</li>
<li><p><strong>Exception:</strong> Use <code>_bin</code> (binary) collation only for fields that require strict exact matching (tokens, hashes).</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>