Welcome Wagon: Community and Comments on Stack Overflow
This past summer, we wrote our first blog post about comments on Stack Overflow, focusing on our initial work rating comments internally at Stack Overflow and what we learned. Since then, we’ve fielded this comment rating task more broadly in our community. This blog post shares some of what we are learning.
Engaging our community
I (Jason) wrote a web application that presents a user with a comment thread from a post on Stack Overflow and asks the user to rate each comment in the thread as fine, unwelcoming, or abusive. Our first blog post shared results from when we asked employees at Stack Overflow, including developers, product managers, and executives, to rate comments. In August, we rolled out our new Code of Conduct, along with new flags for comments that align with these categories, one flag for rude/abusive and one flag for unfriendly/unkind. This fall, we extended our comment classification task beyond our employees to our larger community. We invited individuals from three groups to rate comments.
- Moderators on Stack Overflow and other Stack Exchange sites
- Individuals who responded to our blog post in April, indicating they want to help make Stack Overflow more welcoming
- A sample of registered users from our general research list (you can opt in/out of our research list in your Stack Overflow email settings)
To log in to this web app and record data, each user needed a Stack Overflow account, so users had to make an account if they didn’t have one already. We asked participants to invest at least one hour in rating comments, and to not work for more than 20 minutes at one sitting.
What kind of response did we get? Overall, 525 users spent at least 15 minutes rating comments. They made 253,807 ratings of 40,358 distinct comments. How many users and comment ratings did we have, for each kind of user?
The moderators demonstrated their enormous commitment to our community through this project, as they do consistently day in, day out; moderators who participated in this project rated an average of over 1,000 comments each. Folks who responded to our blog post expressing interest in welcome/inclusion on Stack Overflow also invested a great deal of time, rating over 500 comments each.
We can see this visually by looking at the cumulative distribution functions for each kind of user; this kind of plot shows, for each number of comment ratings, the percentage of users who rated that many comments or fewer.
If you’re not used to interpreting this kind of graph, take a look at x = 1,000, the location on the x-axis that corresponds to 1,000 comment ratings. The line for the moderators is the lowest at that point, meaning the smallest share of moderators stopped at or below 1,000 ratings; in other words, moderators tended to rate more comments than the other groups.
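To make the idea concrete, the percentage plotted at each x-value is just an empirical CDF. Here is an illustrative sketch in Python; this is not the code behind the plot, and the function name and sample numbers are invented for the example:

```python
def ecdf_percent(values, x):
    """Empirical CDF at x: the percentage of values that are <= x."""
    return 100.0 * sum(v <= x for v in values) / len(values)

# Hypothetical per-user rating counts for one group of raters
counts = [100, 500, 1000, 2000]
print(ecdf_percent(counts, 1000))  # 75.0: three of the four users rated 1,000 comments or fewer
```

A lower curve at a given x therefore means fewer members of that group had stopped by that many ratings, which is why the moderators’ line sits below the others.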
Different kinds of people experience Stack Overflow in different ways. If we aggregate all the ratings made by each type of user, how do the different groups perceive these comments on Stack Overflow?
The highest rates of unwelcoming comments were identified by the internal employees at Stack Overflow, followed by Stack Exchange moderators. We trust and support our moderators, and in this specific project, moderators demonstrated their understanding of unfriendly and unwelcoming behavior in comments.
Regular registered users from our research list perceived the next lowest rate of unfriendly comments, and users who responded to our blog post about making Stack Overflow more welcoming found the lowest rates of unfriendly comments of all. How can we interpret this? We specifically invited users who may not consider themselves active participants in our community in order to gain outside perspective, but then these users saw the lowest rates of unwelcoming behavior.
A possible explanation is that we are seeing a real effect of deep experience with our site; it appears the more invested an individual is here at Stack Overflow, the more sensitive they are to problematic behavior. What do these unfriendly comments look like? The following combine elements of real comments to show typical examples.
- “Why do you want to do this? You have conflated at least three problems here.”
- “It will be very hard to help you with such a trivial bug. It could come from any line in your code, and we have to guess.”
- “How exactly is this going to solve my problem?!”
- “You don’t understand how to use this site. Here nobody codes for you; read the docs and then show us.”
- “What are you actually trying to achieve? Please learn how to use a debugger.”
Our project showed that the more deeply an individual is connected to Stack Overflow (as an employee, or a moderator), the more they are likely to see problems in comments like these. This effect is robust to comparing groups who were shown the same comments, who rated the same number of comments, and other analytical approaches.
What do the distributions of ratings for each individual look like?
Individuals did not all rate the same set of comments and worked for different lengths of time, so we expect variability in the results for each individual. Overall, the median percentage of perceived unwelcoming comments per individual was 3.5%, quite a bit lower than the 6.5% median for employees.
To understand how much agreement there is between raters, we can again look at Krippendorff’s alpha, a measure that ranges from zero (agreement no better than chance) to one (perfect agreement). This measure accounts for the number of raters, so we can compare agreement among employees to the groups with more raters. What is Krippendorff’s alpha, for comments that were rated by at least three people?
These values for alpha are low compared to what social scientists would use to draw reliable conclusions based on the ratings; social scientists look for values close to 0.8 or more. Notice that Stack Overflow employees rated more comments as unwelcoming than the other groups, yet at the same time agreed with each other more about what is unwelcoming and abusive. The rate of agreement among moderators and registered users was lowest (although still much higher than for people unfamiliar with Stack Overflow), and the rate of agreement for the users who volunteered to help make Stack Overflow more welcoming was a bit higher. Remember that these were the users who rated the lowest overall levels of unfriendliness; some spot-checking indicates these users identified only the clearest examples of problematic text.
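For readers curious how such a number is computed, here is a minimal implementation of Krippendorff’s alpha for nominal data (treating fine/unwelcoming/abusive as unordered categories). This is a sketch for intuition only, not the code used in our analysis; for real work you would use a vetted library:

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list where each element holds the labels assigned to one
    comment by its raters, e.g. [["fine", "fine"], ["fine", "abusive"]].
    """
    o = Counter()  # coincidence matrix: o[(c, k)] counts pairings of labels c and k
    for unit in units:
        m = len(unit)
        if m < 2:
            continue  # a unit rated only once contributes no pairings
        counts = Counter(unit)
        for c in counts:
            for k in counts:
                same = 1 if c == k else 0
                o[(c, k)] += counts[c] * (counts[k] - same) / (m - 1)
    n_c = Counter()  # marginal total for each category
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    if n <= 1:
        return float("nan")
    # alpha = 1 - observed disagreement / expected-by-chance disagreement
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

With perfect agreement (every rater of a comment choosing the same label) alpha is 1; when raters agree no more often than chance, alpha falls toward 0, which is the regime our low observed values approach.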
Another factor that shapes interactions on Stack Overflow is reputation. Do raters’ perceptions of unwelcoming behavior vary with their own reputation? This can help us understand whether “power users” (as distinguished from moderators) may be driving problems with site culture.
There is no clear evidence in this plot for a relationship, indicating that high-reputation users perceive unfriendly behavior at about the same rate as low-reputation users. We have seen effects similar to this before, for example, in our annual Developer Survey. When asked what the worst or most annoying thing about Stack Overflow is, developers of all experience levels and self-reported activity levels on Stack Overflow mentioned issues with harsh interactions and site culture.
Altogether, this begins to paint a complex and interesting picture of who understands unwelcoming behavior and in what ways. Moderators and high-reputation users are just as likely, or even more likely, to identify unwelcoming comments compared to new users. Stack Overflow employees identify more comments as problematic and agree with each other more about what is a problem compared to the other kinds of users in this project.
So where do we go from here? For starters, we as employees learned that we don’t always perceive problems in the same way as other members of our community. We will keep this in mind as we move forward with plans to make Stack Overflow a better place for developers to learn and share knowledge.
We plan to use this dataset to investigate how comments are used on questions and answers, toward users of different experience levels, in different communities, and more. Look for more blog posts on these issues in upcoming months. We will continue to use the results from this project in product changes on Stack Overflow, as well as directly using appropriate subsets of this data in machine learning models. Also, in 2019 we will release this dataset (comment IDs, comment ratings, and anonymized/randomized rater ID) upon request so that other people in our community and beyond can explore this data for themselves.
All of this would not be possible without the investment of time and energy of the individuals who participated in this project, and we want to acknowledge each of you who volunteered to help us understand this aspect of our site better. Thank you for your care and time. Community is central to our identity at Stack Overflow, and we are committed to making Stack Overflow a healthy, inclusive place for developers to learn and share knowledge.
I would be interested in seeing the data from the other direction: For users who had their comments marked abusive or unfriendly, what is their overall reputation score? Or better yet, what was their reputation at the time they made the comment?
It would be interesting to know whether high-rep users are more polite (or more rude!), and whether as users gain rep, they get more polite, and other such things.
Perhaps the next step is to consider and test for a burn-in effect: users who have been here so long as to have 100k+ reputation have probably steeled themselves to a great deal more than users who are not as new (though who have been here long enough to see comments across a wider range of posts, e.g. users with rep of 1k or more).
Likewise, that last plot should have each dot weighted with the user’s *types* of contributions… do those 100k+ rep users regularly participate in moderation action or do they spend all their time answering questions? That seems like an obvious potential source of skewing to me–many users hover around the five-digit mark because they shift a significant portion (even a majority portion) of their time and effort into cleaning up the site/moderating, including flagging comments, downvoting, close voting/delete voting, etc. If the 100k+ users spend all their time answering, it is clear then that they wouldn’t be spending much time looking at anything other than comments under their own posts (e.g. their own comments and comments of users who tend to be appreciative of receiving typically expert help for free), so they are less likely to perceive a problem with the tone or content of comments overall.
Where were these comments drawn from? Was it a true random sample, and from what timeframe? There might be some interesting effects based on how old the comments were when you pulled them (since older comments are more likely to have been already moderated)
Great question, and I left out this part of the analysis for brevity here. These comments were randomly chosen from months over the past year, and then also sampled from years in the past during specific months, so we could compare without seasonal effects (since we have more students during certain months). I can’t rule out the seasonal effects but we can see that there are no long-term trends.
Interesting point/concern, Undo.
I understand that you’re wondering if the pool from which the comments were chosen included comments that had already been moderated (such as those hidden/’deleted’ by a moderator or the author). If the selection is only from comments that have survived, then it’s not really random/representative.
Julia, your reply addressed the https://en.wikipedia.org/wiki/Eternal_September phenomenon, but as I see it, that’s a bit different from Undo’s question, and I’m curious about it too.
I don’t see any measure of statistical significance; all of this could just be random noise, nicely colored. It’s missing an analysis of other causes that could have led to those effects, the representativeness of the study group, etc.
What I’m saying is, this is just voodoo.
Your employees seem to be a little detached from the other groups. What exactly does “employees” mean? Are all those that participated under that tag actually people that code for a living, or does that include all your staff?
I’m asking because SO is supposed to be developer-to-developer communication. When I talk to a fellow developer, I try to cut out the unnecessary noise aka pleasantries that I would include talking to my boss or the sales people or even the cleaner. Please make sure that our level of judging comments is still developer-to-developer-for-problem-solving, not what the populace in general perceives as welcoming.
You can read the first blog post for more details on this, but in short: mostly developers, plus product managers, community managers, and executives (many of whom are ex-developers, in our case):
The number of employees who do not or have not coded for a living is so small that this seems unlikely to explain the difference. Could be something for me to check, though!
Many of the rude comments are identifying real problems with a post, and are saying things that do need to be said. They’re just phrased badly. Could you suggest alternative phrasings for such instances?
This is not trivial. In English, politeness is often conveyed by hedging and extra verbiage. For ESL speakers, this may make comments more difficult to understand. Furthermore, some cultures (e.g., Israeli, American Deaf) value directness, and would see excessive couching of comments as rude.
This is a great question, and one that we as employees and as a community as a whole are dealing with. You can check out some follow-up discussion here:
Julia Silge I’m a developer and I like StackOverflow and its community head and shoulders above Reddit groups for the reasons above – please keep up the good work.
I think it’s interesting first of all that although you see differences, the differences are not large (every group’s median falls within the box of every other group). But that doesn’t mean they are not meaningful. I think you’ve found an explanation for some of the disconnect that seems to happen between employees and the most active users when there are meta and blog posts on these issues. These groups do have differences in perceptions. I don’t think it’s fair or helpful to characterize this as investment in SO, though I know what you mean — the employees don’t have other jobs that they are taking breaks from when they look at SO, plus we know from other posts that you’ve had lots of deep internal discussions on this topic. I bet there are also probably some demographic differences between the groups.
I think it is a quite interesting and smart piece of research. I was one of the people who did ratings, and I found it very challenging. For me it made it clearer that I was not sure what terms like “unwelcoming” mean in practice. The low inter-rater reliability indicates that maybe what is needed is to build a collective understanding of that term (abusive too, but I think that’s a different issue). I’m sure the same goes for “welcoming.” I think some of the “I’m just not going to comment any more” posts that you see are really saying “I’m not sure what is considered okay.”
Agreed. The difference between groups is not significant
The difference with the employees is a potential symptom of a real problem, as said in the blog post about disconnect with users, but could also be a statistical glitch/sampling error. Perhaps the participants had some previous bias, e.g. the employees who participated had probably taken part in previous discussions of comments similar to those and were more trigger-happy when they saw them again.
The big finding is the low level of agreement. It’s going to be hard to apply such rules consistently. It would be useful to identify some class of comments where there happened to be more agreement, if any.
I was one of the participants in the study. One thing I struggled a bit with was the definition of “unwelcoming.” This especially comes up around comments like “Why do you want to do this? You have conflated at least three problems here.”
I think we can agree a comment like that is not going out of its way to be welcoming and that it could be phrased much better. I would not want to receive that comment on one of my questions. But it’s also not really dismissive as most of the unwelcoming comments are; it’s asking for the context of the question in order to help generate an appropriate answer, presumably because there are signs of an XY problem going on.
While we’ve had a community discussion about how comments like “Please learn how to use a debugger” are unwelcoming and unhelpful, I don’t feel like we’ve had much of a discussion on how to frame the common “Why do you want to do this?” question in a way that’s welcoming, or what exactly we consider to be unwelcoming.
I disagree 90% with your spelling of Krippendorff 🙂
HA thanks so much. Fixed now.
I think it would be interesting to see how this differs when looking at the reputation of the OP at the time of posting their question where these comment threads originate from. i.e. Are we more/less hostile to new users.
Something like this was considered “unfriendly”: “*Why* do you want to do this? You have conflated at least three problems here.” The first part is a quite common inquiry, often made by users who, in my experience, treat all users in a friendly way. “What are you *actually* trying to achieve?” is another one. It’s followed by “Please learn how to use a debugger,” which, besides being a change of subject, sounds like good advice. If you want to read a snide tone into such comments, then either there is a lack of context for us readers of the blog or the rater was in an unfriendly frame of mind. One reason I think such comments are common is users often post code thinking the problem is with their code and not their approach. To help a user, you often need to know what they were actually trying to achieve. This observation is aside from the fact that for many of the users I see, English is not their first language, and what may seem blunt is simply the best they can do to get their point across.
> “Stack Overflow employees identify more comments as problematic and agree with each other more about what is a problem compared to the other kinds of users in this project.”
What? That’s the conclusion you drew from that data? The conclusion *I* drew from that data is that SO employees are *more likely to misidentify a comment as problematic* compared to the other groups. That is: there’s a self-selection bias that you, as a Stack Overflow Employee, have made about the data.
In addition the line, “social scientists look for values close to 0.8 or more,” along with a graph that shows no alpha above 0.37 tells me that there’s large amounts of disagreement about which comments are problematic, even among the most like-minded group (and that like-mindedness is likely an artifact of the sample *size*).
Salt, and grains thereof, should be taken when reading the conclusions.
Unsurprisingly, I don’t agree with your overall characterization here, except that yes, there is a lot of disagreement about which comments are problematic. That is in fact what I said.
I would like to point out, though, that measures of inter-rater reliability like Krippendorff’s alpha account for sample size. You can directly compare the reliability measure for a small group of raters with that for a larger group because it is normalized.
Julia, unfortunately you are ignoring the elephant in the room. The low values of Krippendorff’s alpha suggest the different groups only agree on extreme cases, for example where comments are outright rude or argumentative. You should therefore be extremely careful when using these results to recommend “product changes on Stack Overflow, as well as directly using appropriate subsets of this data in machine learning models”. As highlighted in your article, the only valid conclusion you can draw from this study is that there is a lot of disagreement about what an unwelcoming comment is.
Would love to see how this affected the site’s chat. I saw a significant personal drop in all chat activity since the original welcoming posts.
Is the survey data available somewhere for download, like the annual surveys are? If not, are there plans to make it available?
Yes, I talk a little about this at the end of the post. It’s going to be available upon request later in 2019.
You write that “it appears the more invested an individual is here at Stack Overflow, the more sensitive they are to problematic behavior” and “the more deeply an individual is connected to Stack Overflow…the more they are likely to see problems in comments like these.”
Based on my assumptions about the roles these different groups play, it looks like: the more likely someone is to have to deal with problems arising from comments, the more sensitive they become to the potential for comments to create a problem. It’s not so much that regular users care less about the community; it’s that we’re far less likely to have to deal with any kerfuffle that arises when someone (either commenter or commentee) gets their nose out of joint.
Besides the fact that this has really _no_ statistical significance and doesn’t say much, have you considered whether the cultural background of people might have an influence on what they think is OK or not? Americans, for example, are known (in the rest of the world) to be overly polite.
I would once again invite you to help us who comment on the site: tell us how we should say these things. If it’s “unfriendly” to ask what the poster is actually trying to achieve, how should we ask about it? If we don’t understand it how can we help?
And if you think saying politely that they should learn to use a debugger is also unfriendly, what should be said then? “Oh, I noticed you didn’t even try to debug this partial code you posted, let me take some time to make it into a working program and debug it for you”? I know, exaggeration, but please do provide actual helpful content here. How do we tell people, who clearly don’t realize they could use a debugger, to use one? It’s one of the basic tools of the trade and very often the problem is easily solved by using one.
I’m sure I’m not the only one who feels these things aren’t unfriendly, and are very helpful actually. So I’m not surprised that employees find more of these “unfriendly” things. But after you explain how we should phrase these helpful things if this is not acceptable I’m sure we’ll be friendly in no time, in addition to being helpful.
I don’t think I can quite go with “All together, this begins to paint a complex and interesting picture of who understands unwelcoming behavior and in what ways.” On the contrary, I think it paints a fairly simple picture: a low level of “unwelcoming” behavior is perceived in aggregate by all categories of participants, however sliced, without much agreement about which behaviors those are. Perhaps a complex picture is lurking under there somewhere, waiting to be discovered, but I don’t yet see a sign of it. I find that a little surprising, in fact. A nice, simple explanation is that everybody agrees that sometimes others say things that they don’t like.
Indeed, I think the low inter-rater agreement is the most interesting result of this study. It suggests to me that the perceived problem of unwelcoming behavior may be largely intractable: if people do not agree about what is unwelcoming, then what are we supposed to eliminate? In this vein, I would be interested in an analysis oriented toward judging the disparity of opinion on a comment-by-comment basis. That is, what proportion of the comments studied were rated unwelcoming (or worse) by almost all raters, what proportion were rated unwelcoming by a majority, what proportion were rated unwelcoming by a substantial minority, by at least one person, by nobody.
The 6-ish ratings per comment on average may not be enough to support such a study, but I’d like to see what proportion of comments fall into each of those buckets.
Echoing another comment, I’d wonder about how representative this sample was compared to those who use Stack Overflow as a whole, as well as to the greater programming community. 300 users is a decent number, but previous survey data shows that Stack Overflow survey respondents skew significantly in several ways (including gender) compared to the entire programming community. If a similar skew was also seen in this data, it might be painting a different picture than what is really occurring.
With a Krippendorff’s alpha below 0.67, one would conclude that Stack Overflow employees, moderators, and other users surveyed can’t reliably distinguish offensive comments from non-offensive comments. Therefore, the role of moderators in responding to flags is a key ingredient in achieving the goal of respectful communication on the site.
There are three big issues that I can see with this.
1) Moderators are not grouped by experience moderating. This is quite important because the judgement of whether something is problematic is deeply influenced by past experience of what leads to drama and what not, and this is not as trivial as a non-experienced Stack Overflow employee might think. Instead, groups with wildly different experiences are grouped together.
2) Incentives in answering correctly to the moderation requests vary wildly across your groups. For employees, this stuff is their livelihood. For moderators, their self-image. For other users, just a thing. Perhaps this is a confounding factor that needs to be excluded.
3) As a moderator, I would never, ever moderate a comment on its own. This seems to be a big misunderstanding of what moderating is about. It’s not a bucketing exercise of dividing good comment from bad comments. It’s an exercise in reducing drama (“moderation” literally means that) in _interactions_. Two loud people having a very direct exchange may not be a problem at all. Put some very direct comments next to a quiet person and it might become abusive. Comments are not units of moderation, interactions — threads — are.
“So where do we go from here? For starters, we as employees learned that we don’t always perceive problems in the same way as other members of our community. We will keep this in mind as we move forward with plans to make Stack Overflow a better place for developers to learn and share knowledge.”
I can understand that rude/unfriendly replies are an issue when trying to make SO more welcoming.
Then again, the percentage of rude/unfriendly replies that was determined by all the groups in your research was not *that* high, in my opinion.
Maybe I missed some other efforts SO makes in making the community more welcoming, but what I am really missing is this:
I have been using SO for years now and I am a very happy consumer of the knowledge it offers.
Consumer and not participant, because SO *does not allow* users with a low reputation to participate. And you can only get a good reputation by participating. A real chicken & egg problem.
In my opinion, not having a good reputation does not mean the user is not able to give good answers to questions!
I can understand the need for some form of filtering, but why is this not done in the same way spam is detected? + and – points determine spam.
At SO a user starts at 0, and only after gaining enough credits (I still do not know how to gain those credits) is the user allowed to ‘speak’. This muzzles a large group of potentially great contributors, in my opinion.
Maybe something to consider for WelcomeWagon 2.0?
Abusive responses are an important topic, but so is the usefulness of responses. I’ve used the StackOverflow Mathematica site for a few years after being encouraged by a helpful moderator to join. I very soon noticed many of the comments posted by amateurs received unhelpful responses such as “This question should be removed/moved as previously answered” or “Your question/code is not clear enough to answer.” I also noticed that many of the answers were presented in code that seemed to me to use the most arcane formats and notations and to provide so much of this that the time spent in studying the answer caused additional delay. A great many answers were incomprehensible to most amateur coders. This is still the case. I formed the impression that the site was being run by professional coders for other professionals, so I use the site as seldom as possible, finding that the Wolfram Community Forum provided friendlier and more useful answers. As my skills improve, I find StackOverflow more useful for quick reference than I used to, but it still seems best left to the experts. Perhaps there should be a separate site for amateurs. I guess that most people come to the Mathematica site after encountering a code failure. They are seeking a quick and transparent answer so they can get on with their work.
“in 2019 we will release this dataset (comment IDs, comment ratings, and anonymized/randomized rater ID) upon request so that other people in our community and beyond can explore this data for themselves.”
Is it available now?