\u003C/figure>\n\u003C!-- /wp:image -->\n\n\u003C!-- wp:paragraph -->\n\u003Cbr>With those performance numbers and a sense of scale in mind, let’s add some numbers that matter every day. Let’s say our data source is \u003Ccode>X\u003C/code>, where what \u003Ccode>X\u003C/code> is doesn’t matter. It could be SQL, or a microservice, or a macroservice, or a leftpad service, or Redis, or a file on disk, etc. The key here is that we’re comparing that source’s performance to that of RAM. Let’s say our source takes…\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\n\u003Cul>\u003Cli>100ns (from RAM - fast!)\u003C/li>\u003Cli>1ms (10,000x slower)\u003C/li>\u003Cli>100ms (1,000,000x slower)\u003C/li>\u003Cli>1s (10,000,000x slower)\u003C/li>\u003C/ul>\n\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:paragraph -->\nI don’t think we need to go further to illustrate the point: \u003Cstrong>even things that take only 1 millisecond are way, \u003Cem>way\u003C/em> slower than local RAM\u003C/strong>. Remember: millisecond, microsecond, nanosecond – just in case anyone else forgets that 1000ns != 1ms like I sometimes do…\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nBut not all cache is local. For example, we use Redis for shared caching behind our web tier (\u003Ca href=\"https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/#redis\">which we’ll cover in a bit\u003C/a>). Let’s say we’re going across our network to get it. For us, that’s a 0.17ms roundtrip and you need to also send some data. For small things (our usual), that’s going to be around 0.2–0.5ms total. Still 2,000–5,000x slower than local RAM, but also a lot faster than most sources. Remember, these numbers are because we’re in a small local LAN. 
Cloud latency will generally be higher, so measure your own.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nWhen we get the data, maybe we also want to massage it in some way. Probably Swedish. Maybe we need totals, maybe we need to filter, maybe we need to encode it, maybe we need to fudge with it randomly just to trick you. That was a test to see if you’re still reading. You passed! Whatever the reason, the commonality is generally \u003Cem>we want to do \u003Ccode>&lt;x&gt;\u003C/code> once\u003C/em>, and not \u003Cem>every time we serve it\u003C/em>.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nSometimes we’re saving latency and sometimes we’re saving CPU. One or both of those are generally why a cache is introduced. Now let’s cover the flip side…\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading {\"level\":3} -->\n\n\u003Ch3 id=\"why-wouldnt-we-cache\">\u003Ca href=\"https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/#why-wouldnt-we-cache\">Why Wouldn’t We Cache?\u003C/a>\u003C/h3>\n\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\nFor everyone who hates caching, this is the section for you! Yes, I’m totally playing both sides.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nGiven the above and how drastic the wins are, why \u003Cem>wouldn’t\u003C/em> we cache something? Well, because \u003Cstrong>\u003Cem>every single decision has trade-offs\u003C/em>\u003C/strong>. Every. Single. One. 
It could be as simple as time spent or opportunity cost, but there’s still a trade-off.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nAdding a cache comes with some costs:\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\n\u003Cul>\u003Cli>Purging values if and when needed (cache invalidation – \u003Ca href=\"https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/#cache-invalidation\">we’ll cover that in a few\u003C/a>)\u003C/li>\u003Cli>Memory used by the cache\u003C/li>\u003Cli>Latency of access to the cache (weighed against access to the source)\u003C/li>\u003Cli>Additional time and mental overhead spent debugging something more complicated\u003Cbr>\u003Cbr>\u003C/li>\u003C/ul>\n\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:paragraph -->\nWhenever a candidate for caching comes up (usually with a new feature), we need to evaluate these things…and that’s not always an easy thing to do. Although caching is an exact science, much like astrology, it’s still tricky.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nHere at Stack Overflow, our architecture has one overarching theme: keep it as simple as possible. Simple is easy to evaluate, reason about, debug, and change if needed. Only make it more complicated if and when it \u003Cstrong>\u003Cem>needs\u003C/em>\u003C/strong> to be more complicated. That includes cache. Only cache if you need to. It adds more work and \u003Ca href=\"https://shouldiblamecaching.com/\">more chances for bugs\u003C/a>, so unless it’s needed: don’t. 
At least, not yet.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nLet’s start by asking some questions.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\n\u003Cul>\u003Cli>Is it that much faster to hit cache?\u003C/li>\u003Cli>What are we saving?\u003C/li>\u003Cli>Is it worth the storage?\u003C/li>\u003Cli>Is it worth the cleanup of said storage (e.g. garbage collection)?\u003C/li>\u003Cli>Will it go on the large object heap immediately?\u003C/li>\u003Cli>How often do we have to invalidate it?\u003C/li>\u003Cli>How many hits per cache entry do we think we’ll get?\u003C/li>\u003Cli>Will it interact with other things that complicate invalidation?\u003C/li>\u003Cli>How many variants will there be?\u003C/li>\u003Cli>Do we have to allocate just to calculate the key?\u003C/li>\u003Cli>Is it a local or remote cache?\u003C/li>\u003Cli>Is it shared between users?\u003C/li>\u003Cli>Is it shared between sites?\u003C/li>\u003Cli>Does it rely on quantum entanglement or does debugging it just make you think that?\u003C/li>\u003Cli>What color is the cache?\u003Cbr>\u003Cbr>\u003C/li>\u003C/ul>\n\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:paragraph -->\nAll of these are questions that come up and affect caching decisions. I’ll try to cover them throughout this post.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading {\"level\":3} -->\n\n\u003Ch3 id=\"layers-of-cache-at-stack-overflow\">\u003Ca href=\"https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/#layers-of-cache-at-stack-overflow\">Layers of Cache at Stack Overflow\u003C/a>\u003C/h3>\n\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\nWe have our own “L1”/“L2” caches here at Stack Overflow, but I’ll refrain from referring to them that way to avoid confusion with the CPU caches mentioned above. What we have is several types of cache. 
Let’s first quickly cover local and memory caches here for terminology before a deep dive into the common bits used by them:\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\n\u003Cul>\u003Cli>\u003Cstrong>“Global Cache”\u003C/strong>: In-memory cache (global, per web server, and backed by Redis on miss)\u003Cul>\u003Cli>Usually things like a user’s top bar counts, shared across the network\u003C/li>\u003Cli>This hits local memory (shared keyspace), and then Redis (shared keyspace, using Redis database 0)\u003C/li>\u003C/ul>\u003C/li>\u003Cli>\u003Cstrong>“Site Cache”\u003C/strong>: In-memory cache (per site, per web server, and backed by Redis on miss)\u003Cul>\u003Cli>Usually things like question lists or user lists that are per-site\u003C/li>\u003Cli>This hits local memory (per-site keyspace, using prefixing), and then Redis (per-site keyspace, using Redis databases)\u003C/li>\u003C/ul>\u003C/li>\u003Cli>\u003Cstrong>“Local Cache”\u003C/strong>: In-memory cache (per site, per web server, backed by \u003Cem>nothing\u003C/em>)\u003Cul>\u003Cli>Usually things that are cheap to fetch, but huge to stream and the Redis hop isn’t worth it\u003C/li>\u003Cli>This hits local memory only (per-site keyspace, using prefixing)\u003Cbr>\u003Cbr>\u003C/li>\u003C/ul>\u003C/li>\u003C/ul>\n\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:paragraph -->\nWhat do we mean by “per-site”? Stack Overflow and the Stack Exchange network of sites run on \u003Ca href=\"https://nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/\">a multi-tenant architecture\u003C/a>. Stack Overflow is just one of \u003Ca href=\"https://stackexchange.com/sites#traffic\">many hundreds of sites\u003C/a>. This means one process on the web server hosts all the sites, so we need to split up the caching where needed. 
And we’ll have to purge it (\u003Ca href=\"https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/#cache-invalidation\">we’ll cover how that works too\u003C/a>).\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading {\"level\":3} -->\n\n\u003Ch3 id=\"redis\">\u003Ca href=\"https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/#redis\">Redis\u003C/a>\u003C/h3>\n\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\nBefore we discuss how servers and shared cache work, let’s quickly cover what the shared bits are built on: Redis. So what is \u003Ca href=\"https://redis.io/\">Redis\u003C/a>? It’s an open source key/value data store with many useful data structures, additional publish/subscribe mechanisms, and rock solid stability.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nWhy Redis and not \u003Ccode>&lt;something else&gt;\u003C/code>? Well, because it works. And it works well. It seemed like a good idea when we needed a shared cache. It’s been \u003Cem>incredibly\u003C/em> rock solid. We don’t wait on it – it’s incredibly fast. We know how it works. We’re very familiar with it. We know how to monitor it. We know how to spell it. We maintain one of the most used open source libraries for it. We can tweak that library if we need.\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nIt’s a piece of infrastructure we \u003Cem>just don’t worry about\u003C/em>. We basically take it for granted (though we still have an HA setup of replicas – we’re not \u003Cem>completely\u003C/em> crazy). When making infrastructure choices, you don’t just change things for perceived possible value. Changing takes effort, takes time, and involves risk. If what you have works well and does what you need, why invest that time and effort and take a risk? Well…you don’t. There are thousands of better things you can do with your time. 
Like debating which cache server is best!\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nWe have a few Redis instances to separate concerns of apps (but on the same set of servers). Here’s an example of what one looks like:\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:image -->\n\u003Cfigure class=\"wp-block-image\">\u003Cimg src=\"https://nickcraver.com/blog/content/SO-Caching/SO-Cache-Opserver.png\" alt=\"Opserver: Redis View\"/>\u003C/figure>\n\u003C!-- /wp:image -->\n\n\u003C!-- wp:paragraph -->\nFor the curious, some quick stats from last Tuesday (2019-07-30). This is across all instances on the primary boxes (because we split them up for organization, not performance…one instance could handle everything we do quite easily):\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\n\u003Cul>\u003Cli>Our Redis physical servers have 256GB of memory, but less than 96GB used.\u003C/li>\u003Cli>1,586,553,473 commands processed per day (3,726,580,897 commands and 86,982 per second peak across all instances – due to replicas)\u003C/li>\u003Cli>Average of 2.01% CPU utilization (3.04% peak) for the entire server (&lt; 1% even for the most active instance)\u003C/li>\u003Cli>124,415,398 active keys (422,818,481 including replicas)\u003C/li>\u003Cli>Those numbers are across 308,065,226 HTTP hits (64,717,337 of which were question pages)\u003Cbr>\u003Cbr>\u003C/li>\u003C/ul>\n\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:paragraph -->\n\u003Csub>Note: None of these are Redis limited – we’re far from any limits. 
It’s just how much activity there is on our instances.\u003C/sub>\u003Cbr>\u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\nThere are also non-cache reasons we use Redis, namely the pub/sub mechanism \u003Ca href=\"https://nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/#websockets-httpsgithubcomstackexchangenetgain\">for our websockets\u003C/a> that provide realtime updates on scores, rep, etc. Redis 5.0 \u003Ca href=\"https://redis.io/topics/streams-intro\">added Streams\u003C/a>, which are a perfect fit for our websockets, and we’ll likely migrate to them when some other infrastructure pieces are in place (mainly limited by Stack Overflow Enterprise’s version at the moment).\u003Cbr>\u003Cbr>\u003Cbr>\u003Cbr>To read the rest of this post, head over to \u003Ca rel=\"noreferrer noopener\" aria-label=\"Nick's blog (opens in a new tab)\" href=\"https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/\" target=\"_blank\">Nick's blog\u003C/a>. \u003Cbr>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cem>It is also #5 in a \u003C/em>\u003Ca href=\"https://nickcraver.com/blog/2016/02/03/stack-overflow-a-technical-deconstruction/\">\u003Cem>very long series of posts\u003C/em>\u003C/a>\u003Cem> on Stack Overflow’s architecture. 
Previous post (#4): \u003C/em>\u003Ca href=\"https://nickcraver.com/blog/2018/11/29/stack-overflow-how-we-do-monitoring/\">\u003Cem>Stack Overflow: How We Do Monitoring - 2018 Edition\u003C/em>\u003C/a>\u003Cem> \u003C/em>\n\u003C!-- /wp:paragraph -->","html","2019-08-06T19:32:07.000Z",{"current":1131},"how-stack-overflow-caches-apps-for-a-multi-tenant-architecture",[1133,1141,1144,1149,1153],{"_createdAt":1134,"_id":1135,"_rev":1136,"_type":1137,"_updatedAt":1134,"slug":1138,"title":1140},"2023-05-23T16:43:21Z","wp-tagcat-bulletin","9HpbCsT2tq0xwozQfkc4ih","blogTag",{"current":1139},"bulletin","Bulletin",{"_createdAt":1134,"_id":1142,"_rev":1136,"_type":1137,"_updatedAt":1134,"slug":1143,"title":90},"wp-tagcat-cache",{"current":90},{"_createdAt":1134,"_id":1145,"_rev":1136,"_type":1137,"_updatedAt":1134,"slug":1146,"title":1148},"wp-tagcat-engineering",{"current":1147},"engineering","Engineering",{"_createdAt":1134,"_id":1150,"_rev":1136,"_type":1137,"_updatedAt":1134,"slug":1151,"title":1152},"wp-tagcat-redis",{"current":1152},"redis",{"_createdAt":1134,"_id":1154,"_rev":1136,"_type":1137,"_updatedAt":1134,"slug":1155,"title":1157},"wp-tagcat-stack-overflow",{"current":1156},"stack-overflow","stack overflow","How Stack Overflow Caches Apps for a Multi-Tenant Architecture",[1160,1166,1172,1178],{"_id":1161,"publishedAt":1162,"slug":1163,"sponsored":12,"title":1165},"370eca08-3da8-4a13-b71e-5ab04e7d1f8b","2025-08-28T16:00:00.000Z",{"_type":10,"current":1164},"moving-the-public-stack-overflow-sites-to-the-cloud-part-1","Moving the public Stack Overflow sites to the cloud: Part 1",{"_id":1167,"publishedAt":1168,"slug":1169,"sponsored":1120,"title":1171},"e10457b6-a9f6-4aa9-90f2-d9e04eb77b7c","2025-08-27T04:40:00.000Z",{"_type":10,"current":1170},"from-punch-cards-to-prompts-a-history-of-how-software-got-better","From punch cards to prompts: a history of how software got 
better",{"_id":1173,"publishedAt":1174,"slug":1175,"sponsored":12,"title":1177},"65472515-0b62-40d1-8b79-a62bdd2f508a","2025-08-25T16:00:00.000Z",{"_type":10,"current":1176},"making-continuous-learning-work-at-work","Making continuous learning work at work",{"_id":1179,"publishedAt":1180,"slug":1181,"sponsored":12,"title":1183},"1b0bdf8c-5558-4631-80ca-40cb8e54b571","2025-08-21T14:00:25.054Z",{"_type":10,"current":1182},"research-roadmap-update-august-2025","Research roadmap update, August 2025",{"count":1185,"lastTimestamp":1186},3,"2023-05-25T09:46:48Z",["Reactive",1188],{"$sarticleModal":1189},false,["Set"],["ShallowReactive",1192],{"sanity-Y0Y6sDBguKoyPoPS6a0EPVstkgIf3Ov7wmOT5S5ylRw":-1,"sanity-comment-wp-post-12076-1756387747014":-1},"/2019/08/06/how-stack-overflow-caches-apps-for-a-multi-tenant-architecture"]