community April 29, 2009

Handling Duplicate Questions

Jeff Atwood
Co-Founder (Former)

As we get more and more questions on Stack Overflow, the issue of duplicate questions becomes more pressing. The odds of any question being a duplicate, however small, increase with the total number of questions in the system. So it’s worth considering: what makes a question an exact duplicate? As I see it, there are three classes of duplicate questions, from most clear to least clear.

  1. Cut-and-paste duplicate questions. These questions are the very definition of exact duplicates; they are typically from users who willfully take the very same question and post it again. Either they’re not satisfied with how quickly answers arrive, or they just don’t know what they’re doing. We rely on Stack Overflow users to vote down these “questions” and flag them for moderator attention. These sorts of duplicates are typically deleted as soon as we see them, as they’re borderline abuse of the system. They often don’t get answers, so this is fairly easy to deal with. No grey area here.

  2. Accidental duplicates. These questions aren’t copy and paste, but they cover the exact same ground as an earlier Stack Overflow question. The overlap is not ambiguous; the question uses the same words and asks the same fundamental question, with no variation at all. This is a failing on several levels: of the asker to do due diligence before asking, of our internal ask page title search, and possibly of Google search as well. We rely on Stack Overflow users to link these questions together by closing them as “exact duplicate” and posting the URL (as a comment, or edit) of the question this is a duplicate of. These sometimes have multiple good answers attached to each question. We will use our new moderator question merge function to merge them together without losing any answers or comments.

  3. Borderline duplicates. These questions are ambiguous; they’re in the same ballpark as a previous question, but have subtle differences that may make them legitimately standalone questions. These are subject to interpretation. We rely on Stack Overflow users to tag these questions appropriately so they naturally “group” with the questions they’re related to. The more tags the questions have in common, the more likely they are to show up together on the related questions sidebar. You can also edit in links to the possibly duplicated posts, if appropriate, but be sure to make the tags match so the system can figure out the relationship without as much manual effort. There’s often benefit to having multiple subtle variants of a question around, as people tend to ask and search using completely different words, and the better our coverage, the better the odds that our fellow programmers can find the answer they’re looking for.
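To make the tagging point concrete, here is a minimal sketch of how shared tags could drive a related-questions list. It is an illustration only, not the actual Stack Overflow implementation: the function name `rank_related`, the sample questions, and the simple shared-tag count are all assumptions made for the example.

```python
def rank_related(target_tags, candidates):
    """Rank candidate questions by how many tags they share with the target.

    Hypothetical helper: a plain shared-tag count, not the real site's algorithm.
    """
    target = set(target_tags)
    scored = []
    for title, tags in candidates.items():
        shared = len(target & set(tags))
        if shared > 0:  # questions with no tags in common are never related
            scored.append((shared, title))
    # Most shared tags first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


# Made-up sample data for illustration.
questions = {
    "How do I parse JSON in C#?": ["c#", "json", ".net"],
    "Serialize an object to JSON": ["c#", "json", "serialization"],
    "Center a div with CSS": ["css", "html"],
}
related = rank_related(["c#", "json", "serialization"], questions)
print(related)
```

Under this assumption, a borderline duplicate that is retagged to match the earlier question moves up that question's related list, which is why matching tags reduces the manual linking effort.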

The impending launch of the serverfault.com private beta has interrupted work on this slightly, but better handling of accidental duplicate questions is currently very high on our priority list. We’d like to streamline this so it’s easier, with a friendly UI. (If you have ideas about what UI makes sense in this scenario, we’d love to hear them.) That said, we have implemented a moderator-level function to merge duplicate questions, so if we determine two questions are accidental duplicates, we can merge them together without losing anything except the text of one of the questions; all comments and answers are preserved.
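As a rough sketch of what such a merge does, consider the following. The `Question` class and `merge_duplicate` function are hypothetical names invented for this example and do not reflect our real data model; the only behavior taken from the description above is that answers and comments survive while the text of one question is dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    # Hypothetical, simplified model of a question.
    title: str
    body: str
    answers: list = field(default_factory=list)
    comments: list = field(default_factory=list)


def merge_duplicate(original, duplicate):
    """Move every answer and comment from `duplicate` onto `original`.

    Only the duplicate's own title and body text are discarded.
    """
    original.answers.extend(duplicate.answers)
    original.comments.extend(duplicate.comments)
    # Leave the duplicate empty so nothing is counted twice.
    duplicate.answers = []
    duplicate.comments = []
    return original


first = Question("Parse JSON in C#?", "How do I...", answers=["Use a library"])
second = Question("C# JSON parsing", "What is the way...",
                  answers=["Try the serializer"], comments=["Possible duplicate"])
merged = merge_duplicate(first, second)
print(merged.answers, merged.comments)
```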

Thanks to everyone who helps us find and eliminate duplicate questions. We appreciate it, as do future visitors who hopefully will be able to find their answers a bit faster without excessive duplicate questions cluttering up the system. As you have time, please keep doing what I have highlighted in red, above, to help keep duplication in check!
