{"id":8146,"date":"2025-10-09T09:18:25","date_gmt":"2025-10-09T09:18:25","guid":{"rendered":"https:\/\/www.hostinger.com\/blog\/?p=8146"},"modified":"2025-10-09T09:56:36","modified_gmt":"2025-10-09T09:56:36","slug":"email-outage-october-2025","status":"publish","type":"post","link":"https:\/\/www.hostinger.com\/blog\/email-outage-october-2025","title":{"rendered":"Deep dive into our October 7 email outage: Causes, fixes, and what\u2019s next"},"content":{"rendered":"<p>On October 7, 2025, we faced a disruption in our email service: some users were unable to receive mail and access mailboxes due to an unexpected technical problem in our storage system. Throughout the incident, <strong>your data was never at risk<\/strong> &ndash; our engineers prioritized data protection while carefully restoring full functionality.<\/p><p>We know how important a reliable email service is and sincerely apologize for the disruption. The rest of this post explains what happened, how we resolved it, and what steps we&rsquo;re taking to make our systems stronger.<\/p><h2 class=\"wp-block-heading\" id=\"h-what-happened\">What happened<\/h2><p>We use <strong>CEPH<\/strong>, a distributed storage system trusted by leading organizations <a href=\"https:\/\/ceph.io\/en\/news\/blog\/2017\/new-luminous-scalability\/\" target=\"_blank\" rel=\"noreferrer noopener\">such as CERN<\/a>, and designed for high availability and data safety.&nbsp;<\/p><p>The root cause was related to <strong>BlueFS allocator fragmentation<\/strong>, triggered by an unusually high volume of small-object operations and metadata writes under heavy load.<\/p><p>In other words, the <strong>internal metadata space within CEPH became fragmented<\/strong>, which caused some object storage daemons (OSDs) to stop functioning correctly even though the system had plenty of free space available.<\/p><h2 class=\"wp-block-heading\" id=\"h-incident-timeline\">Incident timeline<\/h2><p>All times are on <strong>October 7, 2025 
(UTC)<\/strong>:<\/p><ul class=\"wp-block-list\">\n<li><strong>09:17<\/strong> &ndash; Monitoring systems alerted us about abnormal behavior in one OSD node, and the engineering team immediately began investigating.<\/li>\n\n\n\n<li><strong>09:25<\/strong> &ndash; More OSDs began showing instability (&ldquo;flapping&rdquo;). The cluster was temporarily configured <strong>not to automatically remove unstable nodes<\/strong>, preventing unnecessary data rebalancing that could worsen performance.<\/li>\n\n\n\n<li><strong>09:30<\/strong> &ndash; OSDs repeatedly failed to start, entering crash loops. Initial diagnostics ruled out hardware and capacity issues &ndash; disk usage was below recommended thresholds.<\/li>\n\n\n\n<li><strong>10:42<\/strong> &ndash; Debug logs revealed a failure in the BlueStore allocator layer, confirming an issue within the RocksDB\/BlueFS subsystem.<\/li>\n\n\n\n<li><strong>10:45<\/strong> &ndash; Engineers began conducting multiple recovery tests, including filesystem checks and tuning resource limits. The checks confirmed there were <strong>no filesystem errors<\/strong>, but OSDs continued crashing during startup. 
<strong>Up to this point, there were no problems for email service users.<\/strong><\/li>\n\n\n\n<li><strong>11:00<\/strong> &ndash; An incident was published on our <a href=\"https:\/\/statuspage.hostinger.com\/incidents\/ylrrx953mgxl\" target=\"_blank\" rel=\"noreferrer noopener\">Statuspage<\/a>.<\/li>\n\n\n\n<li><strong>13:12<\/strong> &ndash; The team hypothesized that the internal metadata space had become <strong>too fragmented<\/strong> and decided to <strong>extend the RocksDB metadata volume<\/strong> to provide additional room for compaction.<\/li>\n\n\n\n<li><strong>13:55<\/strong> &ndash; <strong>Additional NVMe drives<\/strong> were first installed in one OSD server to test whether adding more space would remediate the <strong>fragmentation<\/strong> issue.<\/li>\n\n\n\n<li><strong>15:02<\/strong> &ndash; After validating the solution, <strong>additional NVMe drives were installed<\/strong> on the remaining affected servers to expand metadata capacity.<\/li>\n\n\n\n<li><strong>15:10<\/strong> &ndash; Engineers began performing <strong>on-site migrations of RocksDB metadata<\/strong> to the newly installed NVMe drives.<\/li>\n\n\n\n<li><strong>16:30<\/strong> &ndash; The first OSD <strong>successfully started after migration<\/strong> &ndash; confirming the fix &ndash; and we performed the same migration and verification process across the remaining OSDs.<\/li>\n\n\n\n<li><strong>19:17<\/strong> &ndash; The storage cluster stabilized, and we started gradually bringing the infrastructure back online.<\/li>\n\n\n\n<li><strong>20:07<\/strong> &ndash; <strong>All email systems became fully operational<\/strong>, and cluster performance normalized (see the image below). 
Crucially, incoming emails began flowing from the queue to users&rsquo; inboxes, allowing users to access and read their email.<\/li>\n<\/ul><div class=\"wp-block-image\">\n<figure data-wp-context='{\"imageId\":\"69df77395d0a9\"}' data-wp-interactive=\"core\/image\" class=\"aligncenter size-large wp-lightbox-container\"><img decoding=\"async\" width=\"1557\" height=\"807\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on-async--click=\"actions.showLightbox\" data-wp-on-async--load=\"callbacks.setButtonStyles\" data-wp-on-async-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2025\/10\/email-downtime-october-2025.png\/public\" alt=\"October 7 email downtime data.\" class=\"wp-image-8148\" srcset=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2025\/10\/email-downtime-october-2025.png\/w=1557,fit=scale-down 1557w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2025\/10\/email-downtime-october-2025.png\/w=300,fit=scale-down 300w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2025\/10\/email-downtime-october-2025.png\/w=1024,fit=scale-down 1024w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2025\/10\/email-downtime-october-2025.png\/w=768,fit=scale-down 768w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2025\/10\/email-downtime-october-2025.png\/w=1536,fit=scale-down 1536w\" sizes=\"(max-width: 1557px) 100vw, 1557px\" \/><button class=\"lightbox-trigger\" type=\"button\" aria-haspopup=\"dialog\" aria-label=\"Enlarge\" data-wp-init=\"callbacks.initTriggerButton\" data-wp-on-async--click=\"actions.showLightbox\" data-wp-style--right=\"state.imageButtonRight\" 
data-wp-style--top=\"state.imageButtonTop\">\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewbox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\"><\/path>\n\t\t\t<\/svg>\n\t\t<\/button><\/figure><\/div><ul class=\"wp-block-list\">\n<li><strong>00:29 (October 8)<\/strong> &ndash; All queued incoming emails were delivered to the corresponding users&rsquo; mailboxes, and users regained full mailbox access.<\/li>\n<\/ul><h2 class=\"wp-block-heading\" id=\"h-full-technical-background\">Full technical background<\/h2><p>The issue was caused by BlueFS allocator exhaustion, influenced by the default parameter <code>bluefs_shared_alloc_size = 64K<\/code> and triggered by an unusually high volume of small-object operations and metadata writes under heavy load.<\/p><p>Under these <strong>metadata-heavy workloads<\/strong>, the internal metadata space within Ceph became <strong>fragmented<\/strong> &ndash; the allocator ran out of contiguous blocks to assign, even though the drive itself still had plenty of free space. This caused some object storage daemons (OSDs) to stop functioning correctly.<\/p><p>Because Ceph is designed to protect data through replication and journaling, <strong>no data loss occurred<\/strong> &ndash; your data remained completely safe throughout the incident. The recovery process focused on <strong>migrating and compacting metadata<\/strong> rather than rebuilding user data.<\/p><h2 class=\"wp-block-heading\" id=\"h-our-response-and-next-steps\">Our response and next steps<\/h2><p>Once we identified the cause of the issue, our engineers focused on restoring service safely and quickly. 
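<\/p><p>For readers who want a concrete picture, the <strong>on-site RocksDB metadata migration<\/strong> described in the timeline can be sketched with the standard <code>ceph-bluestore-tool<\/code> workflow. This is an illustrative outline only &ndash; the OSD id and device paths below are placeholders, not our actual runbook:<\/p><pre class=\"wp-block-code\"><code># Illustrative sketch: move one OSD's RocksDB\/BlueFS metadata to a dedicated NVMe device.\n# The OSD must be stopped before its store is modified.\nsystemctl stop ceph-osd@12\n\n# Attach a new, dedicated block.db device to hold RocksDB metadata.\nceph-bluestore-tool bluefs-bdev-new-db --path \/var\/lib\/ceph\/osd\/ceph-12 --dev-target \/dev\/nvme0n1\n\n# Migrate the existing BlueFS data from the main device onto the new block.db.\nceph-bluestore-tool bluefs-bdev-migrate --path \/var\/lib\/ceph\/osd\/ceph-12 --devs-source \/var\/lib\/ceph\/osd\/ceph-12\/block --dev-target \/var\/lib\/ceph\/osd\/ceph-12\/block.db\n\nsystemctl start ceph-osd@12<\/code><\/pre><p>Each OSD was migrated and verified one at a time, which is why the first successful start at 16:30 mattered &ndash; it validated the approach before it was rolled out cluster-wide.<\/p><p>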
Our team prioritized <strong>protecting your data first<\/strong>, even if that meant the recovery took longer. Every recovery step was handled with care and thoroughly validated before execution.<\/p><p>Thanks to our resilient architecture, <strong>all incoming emails were successfully delivered once the storage system was restored, and no emails were lost.<\/strong><\/p><p>Our work doesn&rsquo;t stop with restoring service &ndash; we&rsquo;re committed to making our infrastructure stronger for the future.<\/p><p>To improve performance and resilience, we installed <strong>dedicated NVMe drives on every OSD server<\/strong> to host RocksDB metadata. This significantly <strong>boosted I\/O speed<\/strong> and <strong>reduced metadata-related load<\/strong>.<\/p><p>We also <strong>strengthened our monitoring and alerting systems<\/strong> to track fragmentation levels and allocator health more effectively, enabling us to <strong>detect similar conditions earlier<\/strong>.<\/p><p>In addition, we captured detailed logs and metrics, and we&rsquo;re <strong>collaborating closely with the Ceph developers<\/strong> to share our findings and contribute improvements that can help the broader community avoid similar issues and <strong>make the system even more resilient<\/strong>.<\/p><p>We appreciate your patience and understanding as we worked through this incident. Thank you for trusting us &ndash; we&rsquo;ll keep learning, improving, and ensuring that your services stay fast, reliable, and secure. 
And if you need any help, our Customer Success team is here for you 24\/7.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>On October 7, 2025, we faced a disruption in our email service: some users were unable to receive mail and access mailboxes due to an unexpected technical problem in our storage s\u2026<\/p>\n","protected":false},"author":41,"featured_media":8152,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[82],"tags":[],"hashtags":[],"class_list":["post-8146","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-engineering"],"hreflangs":[],"_links":{"self":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/8146","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/comments?post=8146"}],"version-history":[{"count":2,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/8146\/revisions"}],"predecessor-version":[{"id":8151,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/8146\/revisions\/8151"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/media\/8152"}],"wp:attachment":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/media?parent=8146"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/categories?post=8146"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/tags?post=8146"},{"taxonomy":"hashtags","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2
\/hashtags?post=8146"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}