You can click the "Source Code" button on the app to see the R code that built it. Notice the shapes of the curves, and how they change when you move the sliders. We need bigger sample sizes to measure small effect sizes, or to achieve low significance levels. If the baseline rate is higher to start with, the sample size needed for a given power goes down. These complicated interactions affect our A/B tests at Stack Overflow.

"We realized that we couldn’t standardize power calculations across all tests. Some parts of our funnel were highly optimized and converted well, which meant we needed smaller sample sizes to detect the same effect we would want to see in an area that didn’t convert as well," Des says. "Other areas had higher volume, like page views, but did not convert as well. While higher volume helps us reach the needed sample size faster, we needed a larger effect size for the change to make an impact."

<h2>Analyzing results</h2>

What happens after the test? Once we have collected enough events to meet our sample size requirements, it's time to analyze the results. At Stack Overflow, we have testing infrastructure that lets teams automatically see an analysis of results, or if I am performing an analysis myself, I might use a statistical test like a <a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prop.test.html">proportion test using R</a>. "We know we can end a test when we’ve reached the sample size we set out to collect, and then we check out the p-value," Des says. The p-value of an A/B test is the probability that we would see the observed difference between the A and B groups (or a more extreme difference) by random chance. When the p-value is high, the difference we observed could easily be due to sampling noise alone. When the p-value is low enough (below our threshold), the probability of seeing such a difference randomly is low, and we can feel confident about moving from our original version to the new alternative.
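For a concrete sense of what that analysis can look like, here is a minimal sketch of a two-sample proportion test in R; the conversion counts below are invented for illustration, not real Stack Overflow numbers.

<pre><code># Hypothetical A/B test results: 500 conversions from 10,000 users
# in the A (control) group, and 570 conversions from 10,000 users
# in the B (treatment) group.
conversions <- c(500, 570)
users <- c(10000, 10000)

# Two-sample test for equality of proportions; the output includes
# the p-value we compare against our significance threshold.
prop.test(conversions, users)
</code></pre>

With these made-up numbers, the reported p-value comes out around 0.03, below the 0.05 threshold discussed below, so we would call the difference significant.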
If you pay attention to the world of statistics, you may have seen some hubbub about changing the threshold for p-values; a <a href="https://cos.io/blog/we-should-redefine-statistical-significance/">recent paper claimed that moving from a threshold of 0.05 to 0.005</a> would solve the reproducibility crisis in science and fix, well, lots of things. It's true that using a threshold of p < 0.05 means being <a href="https://xkcd.com/882/">fooled 1 in 20 times</a>, but ultimately, the problem with using statistics and measurement isn't p-values. <a href="https://twitter.com/thosjleeper/status/888806305103785985">The problem is us.</a> We can't apply these kinds of thresholds without careful consideration of context and domain knowledge, and a commitment to honesty (especially to ourselves!) when it comes to p-values. We are sticking with a p-value threshold of 0.05 for our A/B tests, but these tests must always be interpreted holistically by human beings with an understanding of our data and our business.

<h2>When to JUST SAY NO to an A/B test</h2>

Tests like the ones Des and I have talked about in this post are a powerful tool, but sometimes the best choice is knowing when not to run an A/B test. We at Stack Overflow have encountered this situation when considering a feature used by a small number of users, along with a potential change to that feature that we have other reasons to prefer over the status quo. The length of a test needed to achieve adequate statistical power in such a situation is impractically long, and the best choice for us is to forgo a test and make the decision based on non-statistical considerations.
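To get a feel for the numbers involved, here is a rough sketch using R's built-in power.prop.test(); the baseline rate, lift, and traffic figures are hypothetical, chosen only to show how quickly a test becomes impractical.

<pre><code># Sample size per group needed to detect a lift from a hypothetical
# 10% baseline conversion rate to 10.5%, at a 0.05 significance level
# with 80% power.
power.prop.test(p1 = 0.10, p2 = 0.105, sig.level = 0.05, power = 0.80)
# n comes out to roughly 58,000 users per group

# If the feature only sees a few hundred users a day (again
# hypothetical), filling two groups of that size takes many months.
</code></pre>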
"Product thinking is critical here. Sometimes a change is obviously better UX, but the test would take months to reach statistical significance. If we are confident that the change aligns with our product strategy and creates a better experience for users, we may forgo an A/B test. In these cases, we may take qualitative approaches to validate ideas, such as running usability tests or user interviews to get feedback from users," says Des. "It’s a judgement call. If A/B tests aren’t practical for a given situation, we’ll use another tool in the toolbox to make progress. Our goal is continuous improvement of the product. In many cases, A/B testing is just one part of our approach to validating a change."

Along the same lines, sometimes the results of an A/B test can be inconclusive, with no measurable difference, positive or negative, between the baseline and the new version. What should we do then? Often we stay with the original version of our feature, but in some situations, we still decide to make the change, depending on other product considerations.

Dealing with data means becoming comfortable with uncertainty, and A/B tests make this reality extremely apparent. Handling uncertainty wisely and using statistical tools like A/B tests well gives us the ability to make better decisions. Des and her team have used extensive testing to make <a href="https://stackoverflow.com/jobs?utm_source=so-owned&utm_medium=blog&utm_campaign=gen-blog&utm_content=blog-link">Stack Overflow Jobs</a> a great tool for developers in the market for a new opportunity, so be sure to check it out!