Do you use tabs or spaces for code indentation?
This is a bit of a "holy war" among software developers; one that's been the subject of many debates and in-jokes. I use spaces, but I never thought it was particularly important. But today we're releasing the raw data behind the Stack Overflow 2017 Developer Survey, and some analysis suggests this choice matters more than I expected.
Spaces make more money than tabs
There were 28,657 survey respondents who provided an answer to tabs versus spaces and who considered themselves a professional developer (as opposed to a student or former programmer). Within this group, 40.7% use tabs and 41.8% use spaces (with 17.5% using both). Of them, 12,426 also provided their salary.
Analyzing the data leads us to an interesting conclusion. Coders who use spaces for indentation make more money than ones who use tabs, even if they have the same amount of experience:
Indeed, the median developer who uses spaces had a salary of $59,140, while the median tabs developer had a salary of $43,750. (Note that all the results were converted into US dollars from each respondent's currency). Developers who responded "Both" were generally indistinguishable from ones who answered "Tabs": I'll leave them out of many of the remaining analyses.
This is an amusing result, but of course it's not conclusive by itself. When I first discovered this effect, I assumed that it was confounded by a factor such as country or programming language. For example, it's conceivable that developers in low GDP-per-capita countries could be more likely to use tabs, and therefore such developers tend to have lower salaries on average.
We could examine this by considering whether the effect occurs within each country, for several of the countries that had the most survey respondents.
The effect is smaller in Europe and especially large in India, but it does appear within each country, suggesting this isn't the sole confounding factor.
Did we see the same tabs/spaces gap within each of these groups?
Yes, the effect existed within every subgroup of developers. (This gave a similar result even when filtering for developers only in a specific country, or for ones with a specific range of experience). Note that respondents could select multiple languages, so each of these groups are overlapping to some degree.
I did several other visual examinations of possible confounding factors (such as level of education or company size), and found basically the same results: spaces beat tabs within every group. Now that the raw data is available, I encourage other statisticians to check other confounders themselves.
Estimating the effect
If we control for all of the factors that we suspect could affect salary, how much effect does the choice of tabs/spaces have?
To answer this, I fit a linear regression, predicting salary based on the following factors.
- Tabs vs spaces
- Years of programming experience
- Developer type and language (for the 49 responses with at least 200 "yes" answers)
- Level of formal education (e.g. bachelor's, master's, doctorate)
- Whether they contribute to open source
- Whether they program as a hobby
- Company size
The model estimated that using spaces instead of tabs is associated with an 8.6% higher salary (confidence interval (6%, 10.4%), p-value < 10^-10). (By predicting the logarithm of the salary, we were able to estimate the % change each factor contributed to a salary rather than the dollar amount). Put another way, using spaces instead of tabs is associated with as high a salary difference as an extra 2.4 years of experience.
So... this is certainly a surprising result, one that I didn't expect to find when I started exploring the data. And it is impressively robust even when controlling for many confounding factors. As an exercise I tried controlling for many other confounding factors within the survey data beyond those mentioned here, but it was difficult to make the effect shrink and basically impossible to make it disappear.
Correlation is not causation, and we can never be sure that we've controlled for all the confounding factors present in a dataset, or indeed that the confounders were measured in the survey at all. If you're a data scientist, statistician, or analyst, I encourage you to download the raw survey data and examine it for yourself. You can find the code behind this blog post here if you'd like to reproduce the analysis. In any case we'd be interested in hearing hypotheses about this relationship.
Though for the sake of my own salary, I'm sticking with spaces for now.