Research Update: A/B Testing the New Question Form
Welcome to November’s installment of Stack Overflow research updates! This month marks one year since my colleagues in UX research and I started sharing bite-size updates about the quantitative and qualitative research we use to understand our communities and make decisions.
In recent months, we have invested time and energy in improving the question-asking experience on Stack Overflow, one of the most fundamental interactions on our site. In August, I outlined what we learned from the question wizard, our first major change to the question-asking workflow in a decade. In September, Lisa shared the results of her qualitative research that has informed our next steps. Today, I want to present the results of A/B testing for the changes currently live on the site.
Wizards and a unified experience
The question wizard represented a move in the right direction in terms of question quality and interactions via comments, but some of the decisions made for the wizard turned out to be brittle and ill-suited to our scale. For both technical and design reasons, we have chosen to pursue a single question design with modals specific to different kinds of users, rather than a two-mode workflow based on reputation.
To measure the impact of changes to the question workflow, we use A/B testing. People in the baseline arm of the test saw the old question workflow, unchanged. People in the experiment arm experienced a new workflow that we shipped iteratively; this iterative rollout was necessary because the changes we wanted to test against the old workflow were so extensive and had such complex dependencies. For simplicity, we can summarize the changes in two “steps”:
- Step 1: The first group of changes launched in September and included pretty dramatic UI changes, along with a welcome modal and what-to-expect modal for new users.
- Step 2: The next group of changes launched in October and focused mostly on a review interface, consolidating and organizing validation warnings.
People in the baseline arm saw none of these changes and used only the old question workflow throughout the test.
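As an aside, for readers curious about the mechanics of splitting traffic: one common approach is deterministic hash-based bucketing, so a given user always lands in the same arm across sessions. Here is a minimal sketch of that idea; it is an illustration only, not our production assignment code, and the experiment name is a made-up placeholder:

```python
import hashlib

def assign_arm(user_id: str, experiment: str = "new-ask-flow") -> str:
    """Deterministically bucket a user into a test arm.

    Hashing the experiment name together with the user id keeps a
    user's assignment stable across sessions, and keeps assignments
    for different experiments independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash onto 0..99
    return "experiment" if bucket < 50 else "baseline"  # 50/50 split

print(assign_arm("user-12345"))
```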
Posting your question
One of the most important metrics for us when we work on the question workflow is the conversion rate from clicking the “Ask Question” button to finally posting a question. Compared to the old workflow, the new one allows users to be more successful at this task, with a 3% increase in this conversion sustained throughout the entire process (both Step 1 and Step 2). Adding the review interface did not impact the ease of use of the new question form, as measured by this conversion from initial click to final post.

It may be difficult to see in this graph because they are so small, but the gray error bars show the uncertainty in our measurement of this proportion.
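For intuition, here is one common way such error bars are computed: a normal-approximation confidence interval for a binomial proportion. This is a sketch with made-up counts, not our actual data or measurement pipeline:

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical counts: questions posted out of "Ask Question" clicks
low, high = proportion_ci(successes=5_300, n=10_000)
print(f"conversion: 53.0% (95% CI {low:.1%} to {high:.1%})")
```

With tens of thousands of observations per arm, the interval becomes narrow, which is why the error bars in the graph are hard to see.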
Another important metric for the question workflow is question quality, which we define and explain here. More questions are being asked with the new workflow, but what are these questions like?

During Step 1 (the major UI changes plus modals for new question askers), we saw a 1.5% decrease in the proportion of good quality questions. Not great news! During that part of this major revamp of the question-asking experience, we had increased the number of questions (and the overall number of good questions), but the proportion of questions that were good was down slightly.
Fortunately, one of the main reasons we are redesigning the question workflow is that our new approach is more flexible and easier to iterate on. In fact, that’s exactly what we did next. Step 2 of our rollout focused on consolidating and organizing validation warnings, and during this step, the quality gap between the baseline and experiment groups decreased to virtually zero. We fixed the regression in question quality by iterating in this more flexible framework. We see similar results during the test if we measure bad quality instead of good quality.
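For readers wondering how to tell whether a gap like this is meaningfully different from zero, one standard tool is a two-proportion z-test. Here is a minimal sketch with hypothetical counts, not our real traffic:

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: both arms share the same true proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled estimate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical: good-quality questions out of all questions per arm
z = two_proportion_z(x1=6_950, n1=10_000, x2=7_000, n2=10_000)
print(f"z = {z:.2f}")  # |z| > 1.96 would suggest a real gap (p < 0.05)
```

A |z| value well below 1.96, as in this toy example, is what “the quality gap decreased to virtually zero” looks like in statistical terms.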
Next steps
As of today, the new question workflow performs better in terms of task success (people who intend to ask a question successfully posting their question) and the same in terms of question quality. From a technical perspective, the new workflow is easier to maintain and build on moving forward. Our next steps will include more iteration to continue improving question quality, along with addressing the concerns of all kinds of users, from the most to the least experienced. We have graduated this new question workflow, and in the future we’ll test any further changes against this new baseline. The next time you ask a question on Stack Overflow, look for the results of these carefully planned and tested changes!
Tags: bulletin, community, insights, stackoverflow
16 Comments
> task success: people who intend to ask a question successfully posting their question
Don’t forget that the real success is finding a solution to their problem, not posting a question. (Having more content might be a success for the site, though.) People might click the “Ask a question” button but then get good duplicate suggestions or be encouraged to formulate their problem clearly, which leads them to the solution.
It might be useful to collect data on that as well, e.g. by tracking whether the user visited similar questions instead of posting a question, or even having a dialog pop up when they leave the page, asking for the reason.
Good to see this work being done. I have two questions:
1. When users find that their question is answered already in the process of asking it, they won’t successfully post their question. I would still consider that a success though. How do you take that into account? It seems to me that optimizing that percentage might be even more important than optimizing the percentage of successful posts.
2. I’m confused by the second graph. Question quality seems to decline both in step 1 and step 2. Did you iterate more after step 2, and eventually improve the question quality above the baseline value of step 1? Or should I understand your explanation to read that the decline in quality is negligible?
Thank you.
It’s a shame that the wizard was scrapped. It introduced a structure to asking questions that’s completely missing from the new dialog. By contrast, the new dialog is pretty much the same as the old one, and all the “advice” is easily ignorable. With the wizard it was in your face, and that’s a good thing.
As an example of what I mean, there was a time when, using the wizard, I ended up not asking a question because the structure of the wizard led me to answer my own question. I’m confident that the same wouldn’t happen with the current question UI since it’s missing crucial guidance. If this is true even for experienced users it’s probably even more so for new users.
That’s probably why the proportion of bad questions went up: the people who would otherwise have posted good questions (i.e. people who are applying brainpower to the process) realised their solution as a result of this process, then did not need to ask.
This is a win for them, and a win for the world.
But it’s not a win for Stack Overflow, which is looking for more posts and hits.
So TL;DR: more people who started writing a question went on to post it, but the quality of the questions is worse.
Yes, I’d noticed!
What’s going on with making it so that experienced users do not see the pre-school-ish robot and question balloons?
“As of today, the new question workflow performs better in terms […] of question quality.”
Maybe I just don’t understand the graph (and its description) but to me it seems like question quality went down (about 2%). Could you clarify where the better question quality can be seen?
Now I read “[…]and the same in terms of question quality” – was that always there and I just missed it or did you edit the article?
Thank you for sharing useful information
Are the results actually statistically significant? Your bar graphs aren’t very convincing.
When users find that their question is answered already in the process of asking it, they won’t successfully post their question. I would still consider that a success though. How do you take that into account? It seems to me that optimizing that percentage might be even more important than optimizing the percentage of successful posts.
It’s a shame that the wizard was scrapped. It introduced a structure to asking questions that’s completely missing from the new dialog. By contrast, the new dialog is pretty much the same as the old one, and all the “advice” is easily ignorable.
Seems you just copied Konrad’s comment. Why?
“When a measure becomes a target, it ceases to be a good measure.”
The target for StackOverflow should be helping people. Observing whether people dare, and are able, to ask a question was a good measure. Now that it has become a target, we might as well remove the similarity search and we’ll get a super high conversion rate. As you noticed yourself, increasing the number of questions led to an increase in bad questions more than to an increase in good questions, and that is a bad thing.
I’d rather have 10k questions in 72 hours as 5100/4900 good/bad
than 10k questions in just 66 hours but as 4750/5250 good/bad
equaling 5181 good (+1.6%) and 5727 (+16.9%) bad questions in 72 hours.
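Spelled out, that normalization scales the 66-hour scenario to the same 72-hour window at a constant question rate:

```python
# Scale the faster scenario (10,000 questions in 66 hours) to 72 hours
total_72h = 10_000 * 72 / 66            # ~10,909 questions
good = total_72h * 0.475                # ~5,182 good (5181 if truncated)
bad = total_72h * 0.525                 # ~5,727 bad
print(f"good: {good / 5100 - 1:+.1%}")  # +1.6% vs. 5,100
print(f"bad:  {bad / 4900 - 1:+.1%}")   # +16.9% vs. 4,900
```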
Jeff Atwood explained last year that SO is a form of wiki, and that clarified so much for me. Thinking of SO in that light, the number of questions is not as important as the quality of questions and answers. There will be many added questions over the years, but the site could decline in value if it is snowed under with low quality content. At the same time, we need to welcome the newcomers asking “should I use Python or C++ to scan a directory and add to my Excel spreadsheet”. A well-meaning youngster who can embrace the value of SO is an investment in the future.
SO the company is looking for more visitors, more questions, more page views.
Yes, this is at odds with SO being a form of wiki with quality content.
But someone’s got to pay the bills I guess.