From Power Calculations to P-Values: A/B Testing at Stack Overflow

If you hang out on Meta Stack Overflow, you may have noticed news from time to time about A/B tests of various features here at Stack Overflow. We use A/B testing to compare a new version to a baseline for a design, a machine learning model, or practically any feature of what we do here at Stack Overflow; these tests are part of our decision-making process. Which version of a button, predictive model, or ad is better? We don’t have to guess blindly, but instead we can use tests as part of our decision-making toolkit.

I get excited about A/B tests because tests like these harness the power of statistics and data to impact the day-to-day details of our business choices. Des Navadeh is the product manager of the Jobs team here at Stack Overflow, and she has used testing extensively on her team to guide decisions. Des says, “A/B testing helps us gain confidence in the change we’re making. It helps us validate new ideas and guides decision making. Without A/B testing, we’re leaving much of what we do up to chance.”

At the same time, there can be confusion about how to approach an A/B test, what the statistical concepts involved in such a test are, and what you do before a test vs. after a test. Des and her team have learned a lot by implementing many tests, but also have had some stumbles. “We didn’t realize it at the time, but when we started A/B testing, we took a very strict approach in the calculations to determine sample size. As a result, we were running tests for an unnecessary length of time and most were deemed inconclusive. We basically set up our tests to be almost 100% confident which isn’t very realistic or productive!” Des says.

To start testing off on the right foot, we need to plan for an A/B test and perform a power calculation. This requires defining a hypothesis and test groups, and then considering two questions.

  • How sure do we need to be that we are measuring a real change?
  • How big is the change we expect to see because of the new version, compared to the baseline?

Let’s start with the first question.

How sure do you need to be?

I am sad to have to break this to you all, but the answer to that first question can’t be 100%. When we measure something in the real world, we never measure with exact accuracy and precision. (That’s basically why I have a job, I think!) There are two main quantities that statisticians use to talk about how much and in what way we can be wrong in measuring.

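  • What percentage of the time are we willing to miss a real effect? This is measured by power. (Strictly speaking, power is the probability of detecting a real effect, so the chance of missing one is 100% minus the power.)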
  • What percentage of the time are we willing to be fooled into seeing an effect by random chance? This is called significance level, and more precisely, we would state this as the probability of rejecting the null hypothesis when the null hypothesis is actually true.

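We also talk about these kinds of errors as the false negative rate and false positive rate, which can be very easy to understand given the right example. In an A/B test, a false negative is concluding that a genuinely better version made no difference, and a false positive is concluding that a new version is better when the difference we saw was only noise.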

Typical statistical standards for these quantities are 80% for power (i.e., 20% chance of a false negative) and 5% for significance level. Why are these standards used in practice? That’s a great question with a fair amount of baggage and tradition behind it. If we choose standards that are too strict, perhaps 95% for power and 1% for significance level, all our A/B tests will need to run longer and we will have to invest more time and resources into testing. We won’t be able to iterate quickly to solve our business problems. On the other hand, we’re not curing cancer here, right?! What if we relaxed these statistical standards? Then we risk making change after change in our product that does not improve anything, and investing work from our developers and other team members in changes that do not move us forward toward our goals. We want to be Goldilocks-just-right when it comes to these standards for our purposes. For us at Stack Overflow, that means consistently using 80% for power and 5% for significance level in our power calculations before an A/B test.
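To get a feel for how much stricter standards cost, here is a minimal sketch in Python (chosen for illustration; it is not what our teams run). It assumes a two-sided test and the usual normal approximation for comparing two proportions, under which the required sample size is roughly proportional to the square of the sum of two normal quantiles, one for the significance level and one for the power. The function name `relative_sample_size` is hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(power, alpha):
    """Factor that the required sample size is roughly proportional to.

    Assumes a two-sided test and the normal approximation, where
    n scales with (z_{1 - alpha/2} + z_{power}) ** 2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) ** 2


# Compare the standards used in this post with the stricter example.
standard = relative_sample_size(power=0.80, alpha=0.05)
strict = relative_sample_size(power=0.95, alpha=0.01)
ratio = strict / standard
print(f"Stricter standards need about {ratio:.1f}x the sample size")
```

Under these assumptions, moving to 95% power and a 1% significance level costs a bit more than twice the sample size, and so roughly twice the waiting time for the same feature.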

How big is your change?

Our second question here is not about statistical standards, but instead is about how big of a difference we expect to see with the proposed change compared to the status quo. Some phrases that people use to talk about this concept are effect size, expected improvement, and improvement threshold. Effect size can be different in different contexts and different parts of our business.

Estimating effect size requires strategic product thinking. Des says, “You need to first understand how different areas of your product perform. Understanding how each part of your funnel converts today helps you decide how big of an effect you’d need to see for the new change to be worth it. We use different questions to help estimate the effect size. How much development work is required to graduate the test? How strategically important is it? Does this feature support future plans? What is the size of the audience or action we are optimizing for? These answers are detailed as success criteria in our test plans.” Some of the factors Des takes into account when estimating effect size are the volume of events that enter the funnel being considered, the baseline conversion rate of the feature, and how the expected improvement impacts overall product metrics.

Power calculations

Once we have estimated an effect size for our test and know the statistical standards we are going to use in planning, we can do a power calculation. The point of a power calculation is to find out what sample size we need for our A/B test: how many views or users or form submissions or other interactions we need in each group to achieve the necessary power. Then we can finally start our test! Time to wait for those events to roll in.

How do we calculate how big of a sample we need, to measure the change we expect with the statistical standards we’ve chosen? For most tests, our product teams use online calculators to find the sample size. I’m an R developer, so I would use a function in R for such a test. For more complicated tests, we on the data team sometimes run simulations for power calculations.
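As a rough illustration of what such a calculator or function does, here is a minimal Python sketch (not our actual tooling, and in Python rather than R). It assumes a two-sided test of two proportions, equal group sizes, the normal approximation, and an effect size expressed as a relative improvement over the baseline. The function name and the example numbers are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, relative_lift, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided two-proportion test.

    baseline      -- conversion rate of the current version (e.g. 0.10)
    relative_lift -- expected relative improvement (e.g. 0.10 for +10%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Normal-approximation formula: the first term uses the pooled variance
    # under "no difference", the second the variances under the alternative.
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical example: 10% baseline conversion, hoping for a 10% relative lift.
n = sample_size_per_group(baseline=0.10, relative_lift=0.10)
print(f"Need about {n} events in each group")
```

With these made-up inputs the answer is on the order of 15,000 events per group. Online calculators may use slightly different approximations, so small differences between tools are expected.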

When we calculate power, we see first-hand how power, significance level, and effect size interact with sample size and the baseline conversion rate we are starting from. I built a Shiny app to demonstrate how these factors are related for a proportion test, which is typically applicable in our A/B tests.

You can click the “Source Code” button on the app to see the R code that built this app. Notice the shapes of the curves, and how they change when you move the sliders. We need bigger sample sizes to measure small effect sizes, or to achieve low significance levels. If the baseline rate is higher to start with, the sample size needed for a given power goes down. These complicated interactions affect our A/B tests at Stack Overflow.
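For readers without the app at hand, the same relationships can be sketched in a few lines of Python. This is not the app's R code; it is a simplified stand-in that assumes a two-sided two-proportion test, the normal approximation, equal group sizes, and an effect size expressed as a relative improvement. The function name `power_for_n` and the grid of values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_for_n(n, baseline, relative_lift, alpha=0.05):
    """Approximate power with n events per group (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    signal = sqrt(n) * abs(p2 - p1)
    noise_null = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    noise_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    # Ignores the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf((signal - noise_null) / noise_alt)


# One row per baseline rate, one column per sample size, 10% relative lift.
sample_sizes = [1000, 5000, 20000, 80000]
for baseline in (0.02, 0.10, 0.30):
    row = [power_for_n(n, baseline, relative_lift=0.10) for n in sample_sizes]
    cells = "  ".join(f"n={n}: {p:.2f}" for n, p in zip(sample_sizes, row))
    print(f"baseline {baseline:.0%}: {cells}")
```

Reading across a row shows power climbing with sample size; reading down a column shows that, for the same relative lift, a higher baseline rate reaches a given power with fewer events.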

“We realized that we couldn’t standardize power calculations across all tests. Some parts of our funnel were highly optimized and converted well, which meant we needed smaller sample sizes to detect the same effect we would want to see in an area that didn’t convert as well,” Des says. “Other areas had higher volume, like page views, but did not convert as well. While higher volume helps us reach the sample size needed faster, we needed a larger effect size for the change to make an impact.”

Analyzing results

What happens after the test? After we have collected enough events to meet our sample size requirements, it’s time to analyze the results. At Stack Overflow, we have testing infrastructure for teams to automatically see analysis of results, or if I am performing an analysis myself, I might use a statistical test like a proportion test using R. “We know we can end a test when we’ve reached the sample size we set out to collect, and then we check out the p-value,” Des says.

The p-value of an A/B test is the probability that we would get the observed difference between the A and B groups (or a more extreme difference) by random chance alone, if there were really no difference between the two versions. When the p-value is high, a difference like the one we observed could easily arise from sampling noise. When the p-value of our A/B test is low enough (below our threshold), we can say that the probability of seeing such a difference randomly is low and we can feel confident about making the change to the new alternative from our original version.
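To make the p-value concrete, here is a minimal Python sketch of a two-sided test for two proportions. It is an assumption-laden stand-in, not our testing infrastructure: it uses a pooled z-test with no continuity correction, so a proportion test in R with default settings may give slightly different numbers. The function name and all counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups converted at exactly 0% or 100%: no evidence of a difference.
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical results: 10,000 events per group.
p_clear = two_proportion_p_value(conv_a=1000, n_a=10000, conv_b=1150, n_b=10000)
p_noise = two_proportion_p_value(conv_a=1000, n_a=10000, conv_b=1010, n_b=10000)
print(f"10.0% vs 11.5%: p = {p_clear:.4f}")
print(f"10.0% vs 10.1%: p = {p_noise:.4f}")
```

In the first made-up comparison the p-value falls well below 0.05; in the second it is far above, so a gap that small is entirely consistent with sampling noise.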

If you pay attention to the world of statistics, you may have seen some hubbub about changing the threshold for p-values; a recent paper claimed that moving from a threshold of 0.05 to 0.005 would solve the reproducibility crisis in science and fix, well, lots of things. It’s true that using a threshold of p < 0.05 means being fooled 1 in 20 times, but ultimately, the problem with using statistics and measurement isn’t p-values. The problem is us. We can’t apply these kinds of thresholds without careful consideration of context and domain knowledge, and a commitment to honesty (especially to ourselves!) when it comes to p-values. We are sticking with a p-value threshold of 0.05 for our A/B tests, but these tests must always be interpreted holistically by human beings with an understanding of our data and our business.
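What "being fooled" looks like can be checked with a small simulation. The Python sketch below runs many A/A tests, where both groups share the same true conversion rate, and counts how often the p-value still falls below the threshold. It assumes a pooled two-proportion z-test; the rate, group size, number of tests, and seed are arbitrary illustrative choices, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all in either group
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(rate=0.10, n=1000, n_tests=1000, threshold=0.05, seed=42):
    """Share of simulated A/A tests (no real difference) with p < threshold."""
    rng = random.Random(seed)
    fooled = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        if p_value(conv_a, n, conv_b, n) < threshold:
            fooled += 1
    return fooled / n_tests


fp_rate = false_positive_rate()
print(f"Share of A/A tests with p < 0.05: {fp_rate:.3f}")
```

The share should land near 5%, and it applies only to tests where there is truly nothing to find. How often a "significant" result is wrong overall also depends on how many of the ideas being tested have a real effect.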

When to JUST SAY NO to an A/B test

Tests like the ones Des and I have talked about in this post are a powerful tool, but sometimes the best choice is knowing when not to run an A/B test. We at Stack Overflow have encountered this situation when considering a feature used by a small number of users and a potential change to that feature that we have other reasons for preferring to the status quo. The length of a test needed to achieve adequate statistical power in such a situation is impractically long, and the best choice for us in our real-life situation is to forgo a test and make the decision based on non-statistical considerations.
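A back-of-the-envelope sketch in Python shows why low-traffic features run into this problem. It assumes that every event enters the test and that traffic is split evenly between groups; the required sample size and the daily volumes are hypothetical numbers chosen only for illustration.

```python
from math import ceil


def days_to_finish(n_per_group, daily_events, n_groups=2):
    """Days needed to collect n_per_group events in each of n_groups groups."""
    if daily_events <= 0:
        raise ValueError("daily_events must be positive")
    return ceil(n_per_group * n_groups / daily_events)


required_per_group = 15_000  # hypothetical result of a power calculation
busy_feature = days_to_finish(required_per_group, daily_events=5_000)
quiet_feature = days_to_finish(required_per_group, daily_events=100)
print(f"High-traffic feature: {busy_feature} days")
print(f"Low-traffic feature: {quiet_feature} days")
```

With these made-up volumes, the same test finishes in under a week on the busy feature but would need most of a year on the quiet one.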

“Product thinking is critical here. Sometimes a change is obviously better UX but the test would take months to be statistically significant. If we are confident that the change aligns with our product strategy and creates a better experience for users, we may forgo an A/B test. In these cases, we may take qualitative approaches to validate ideas such as running usability tests or user interviews to get feedback from users,” says Des. “It’s a judgement call. If A/B tests aren’t practical for a given situation, we’ll use another tool in the toolbox to make progress. Our goal is continuous improvement of the product. In many cases, A/B testing is just one part of our approach to validating a change.”

Along the same lines, sometimes the results of an A/B test can be inconclusive, with no measurable difference between the baseline and new version, either positive or negative. What should we do then? Often we stay with the original version of our feature, but in some situations, we still decide to make a change to a new version, depending on other product considerations.

Dealing with data means becoming comfortable with uncertainty, and A/B tests make this reality extremely apparent. Handling uncertainty wisely and using statistical tools like A/B tests well can give us the ability to make better decisions. Des and her team have used extensive testing to make Stack Overflow Jobs a great tool for developers in the market for a new opportunity, so be sure to check it out!

Comments

  • Thomas Haslam

    Thanks for making the source code available for the shiny app. Btw, a good general intro to A/B.

  • Clifford Anderson-Bergman

    *It’s true that using a threshold of p < 0.05 means being fooled 1 in 20 times*

    I hate to be that guy…but that's not entirely correct. You would be fooled 1 in 20 times *if you were just looking at null effects*.

  • lebeau10

    I enjoyed the post, but the following definition of power from your post is misleading/inaccurate. “What percentage of the time are we willing to miss a real effect? This is measured by power.”

    Power is the ability to detect a non-zero effect (i.e. reject a false null hypothesis). What is discussed is beta, the inverse to power, or a type II error. I agree, to what I think is suggested, that power and alpha are discussed more often (in my opinion, for important reasons), but why not define power more directly for readers?

    • That’s a reasonable question. We run into issues like these a lot in an organization like ours. How can we explain these statistical concepts with the best balance of accuracy and… understand-ability for our stakeholders? What do we focus on? I am not sure it’s a settled question.

  • Daniel

    Great post. Occasionally you might want to do an A/A test on an existing feature to see how much natural variation you have anyway. This could help with setting thresholds.

  • Jon

    Can we please get rid of these huge images you guys put on the blog posts and on the so site when you go anon.

  • Bartosz P

    Have you considered that the A/B testing does not only measure, but also affects your users? In particular, in some cases I saw a label changing its content/position in a matter of minutes – honestly, sometimes this was really irritating… 🙂

    • Marc Gravell

      Hi; that is not the expected outcome and I know from the data that what you describe is not common. Two options, then: 1) there’s something odd about your connection – I’d be happy to take a look, but would need a user id (there are a lot of “Bartosz P*” folks in the system), or 2) it wasn’t actually anything to do with an A/B test, and was moving for some other content reason

      • Bartosz P

        My SO user id is 2642204 if you want to look into it. I don’t care that much, especially because it doesn’t happen any more, but if you do, an explanation would be interesting.

  • justina colmena

    P-values belong to the so-called “frequentist” domain of pre-WWI Austrian (Freudian / Jungian) *psychology*. Mathematically, they are missing one step of application, namely to take into account the prior probabilities of the null and alternative hypotheses, and use Bayes’ Rule to calculate the “posterior probability” of the alternative hypothesis given the power of the statistical test to reject the null hypothesis, assuming a positive result.

    • We’ve explored whether to switch to a Bayesian approach to A/B testing and have decided against it for the time being: http://varianceexplained.org/r/bayesian-ab-testing/
      FREQUENTIST ALL THE WAY 😉

      • justina colmena

        I did not mean to get into the specious “debate” among statisticians of using likelihoods conditioned on prior events of unknown probabilities versus baying at the moon to fix those prior probabilities. Of course you may ask yourself, is a 90% confidence level “enough” — or 95%, or even 98.5% or 99%? — but again it depends on just how strong is the standing evidence for the null hypothesis which you intend to overcome with the evidence from your experiment — and then we’re right back at Bayes’ Rule, but we’re making implicit assumptions of prior probabilities rather than the explicit assumptions that “Bayesians” use. And I’m not saying those explicit assumptions would necessarily be any better than the implicit assumptions of a so-called “frequentist” approach in any particular case. It’s just that I believe it’s important to understand and explain to the public the results of your frequentist experiments within a Bayesian paradigm. All real-world experiments are by necessity “frequentist,” after all.

        The Bayesian approach: we had no idea whether or not A or B was true. It was 50/50 to us. Then we did an experiment, and found that the results we obtained from our experiment are much more likely in case A than in case B. So we conclude that A is in fact much more probable than B. But we had to apply Bayes’ Rule with the 50/50 prior probabilities to arrive at that conclusion, whether we realized it or not.

  • Daniel Temkin

    I thoroughly enjoyed the article, thanks for writing it up. I did have a couple questions though, about the sampling method and the “effect size”.

    1) I read your Meta Stack Overflow post (https://meta.stackoverflow.com/questions/337745/were-soon-to-be-a-b-testing-some-changes-to-the-top-nav) and if I am reading it correctly, you select the test group at random from all visitors to the site? Would it not be better to restrict the sample population to those who have accounts so that you can test with knowledge of past behavior? Otherwise, how do you determine the effect the change had, since you never really knew the intent of the visitor?

    2) The misuse of “effect size” notwithstanding (I blame the Marketing majors), I found it curious that your process of selecting changes seemed almost entirely qualitative. I was curious: have you guys ever tried to do ICA? I was also interested in how you determined the minimal effective change; it seemed to me that if you changed the size and color of a button you would have no way of knowing which, if any, change is causing any resulting difference in behavior?

    Thanks again for the post.
