The Impressive Growth of R
We found in a previous post that Python has a solid claim to being the fastest-growing programming language in terms of Stack Overflow visits.
The same analysis showed that the R programming language has shown remarkable growth in the last five years as well. In fact, R is growing at a similar rate to Python in terms of a year-over-year percentage, though this growth is “easier” because it started from a smaller share of traffic.
In another post, we found that much of the growth of Python can be explained by the expansion of data science. Since R is primarily used for statistical analysis, it’s likely that R is part of the same trend. In this post, we’ll analyze how quickly R has grown, examine how its growth differs across industries, and look at what R packages are popular and growing within the ecosystem.
As with Python, a disproportionate amount of traffic to R questions comes from high-income countries (R is visited about three times as often in those countries as in the rest of the world), so this post considers only Stack Overflow traffic from high-income countries, such as the United States, United Kingdom, Germany, Canada and France. As a disclaimer, we’ll note that the Data Team at Stack Overflow works primarily in R (in fact, we use it to generate almost all of the graphs and results for Insights posts like this one).
Growth of R
There is no sense in which R is “competing” with any of these other languages, which aren’t typically used for data analysis. This comparison is shown only to demonstrate that the kind of sustained growth R has shown is rare among languages of comparable size.
Traffic to C questions shows a strong seasonal pattern (since it’s one of the most common choices for undergraduate programming classes), and R has roughly caught up with that level of traffic. Visits to Swift questions grew rapidly after Apple introduced the language in 2014, but have since leveled off. TypeScript, though it’s still a smaller source of traffic, has been showing quite remarkable growth, and will be the subject of some future analyses. As we saw in a previous post, traffic to Ruby and especially to Objective C have been declining over time.
What industries visit R questions the most? (This analysis is restricted to the United States and United Kingdom, the countries in which we can segment our traffic by industry).
R is most visited from universities, where it’s a common choice for academic research, especially in the social sciences and biology. Indeed, in June-July 2017, when most classes aren’t in session, R was the second-most visited tag from universities, behind only Python.
The industry with the second-highest share of R visitors, by a close margin, is healthcare. That probably won’t come as a surprise to biostatisticians, since R is the tool of choice for many statistical methods necessary for clinical studies and bioinformatics.
One industry that doesn’t visit a lot of R, relative to other technologies, is tech: software and web companies. (The Data Team here at Stack Overflow is one exception!) This is partly because data analysis makes up a relatively small portion of the industry’s Stack Overflow visits, compared to software and web development. We separately found that pandas, a data science framework for Python, was less visited in tech than it was in almost all other industries. But it does suggest that the way we use R on our team is not the typical use case for the language.
We saw that Python’s rate of growth has been roughly equal across all industries we can measure. In what industries is R growing the fastest?
R isn’t shrinking within any industry, but visits to R are generally growing faster in industries where it was already more heavily visited, with very rapid growth in academia and healthcare. This graph also confirms what we saw in a previous analysis, that R is both disproportionately visited and fast-growing in the government sector. We also see that it’s relatively widely used, and expanding, in consulting and insurance. Each of these is an industry where data analysis and visualization play a disproportionate role, relative to software and web development.
One of the areas where we don’t see much growth is tech, confirming that most of R’s expansion appears to be happening outside of the software and web industry. Since in that industry we did see an increase in visits to Python data science frameworks like pandas and NumPy, it’s a reasonable conclusion that Python is becoming a more popular choice for data science within those companies.
In the case of Python, we were interested in what particular applications of the language had been driving its growth, such as data science, web development, and system administration. R is less of a mystery: its primary purpose has always been statistical analysis, machine learning, and data visualization. But we’re still interested in what trends are happening within the R ecosystem.
To examine this, we extracted what R packages were used in particular questions and answers. We extracted this from our public R Questions dataset hosted on Kaggle, containing all (non-deleted) questions and answers with the R tag. This Kaggle kernel shows how we parsed the data, including examining uses of the
What were the most commonly used packages, in all existing R questions and answers?
Many of the most commonly mentioned packages were written by Hadley Wickham, with his packages making up 7 of the top 10 (the others being data.table, shiny, and zoo). It’s worth noting that this metric may be tilted towards the most confusing packages rather than simply the most widely used. However, running this on the most common packages mentioned in answers, not just questions, leads to a very similar list (you can try it yourself!), meaning this is a reasonably faithful representation of the packages R developers find useful in their work.
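As a rough illustration of the counting described above, here is a minimal Python sketch (the analysis itself was done in R; see the Kaggle kernel for the actual code). It assumes that a package "mention" means a `library(pkg)` call, a `require(pkg)` call, or a `pkg::` prefix in the text of a post. The regular expression and the function names `extract_packages` and `most_common_packages` are hypothetical, chosen only for this example.

```python
import re
from collections import Counter

# Assumed patterns for a package mention: library(pkg), require(pkg), pkg::fun.
# This is an illustrative guess, not the parsing used in the original kernel.
PACKAGE_PATTERN = re.compile(
    r"(?:library|require)\(\s*['\"]?([A-Za-z][A-Za-z0-9.]*)['\"]?\s*[,)]"
    r"|\b([A-Za-z][A-Za-z0-9.]*)::"
)


def extract_packages(text):
    """Return the set of package names mentioned in one post body."""
    found = set()
    for lib_name, ns_name in PACKAGE_PATTERN.findall(text):
        # Exactly one of the two groups is non-empty for each match.
        found.add(lib_name or ns_name)
    return found


def most_common_packages(posts, n=10):
    """Count each package at most once per post and return the top n."""
    counts = Counter()
    for body in posts:
        counts.update(extract_packages(body))
    return counts.most_common(n)
```

Counting a package once per post, rather than once per call, keeps a single long script from dominating the totals. A pattern this simple will also pick up non-R text such as `std::vector`, so a real analysis would need to filter against a list of known packages.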
This data can also give us insight into the fastest growing packages. We’ll measure this over time in terms of the percentage of questions where either the question mentions the package, or one of its answers does. Since R questions in general are becoming more common, we’re examining the changes only as a share of the R ecosystem: most of these packages are growing in terms of raw numbers.
(Note that in some very rare cases packages were edited into older questions or answers, which allows them to “appear” before the package was released).
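The share metric described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each question is represented as a hypothetical tuple of its year, the set of packages mentioned in the question, and a list of package sets for its answers. The function name `package_share_by_year` is made up for this example.

```python
from collections import Counter, defaultdict


def package_share_by_year(questions):
    """Percentage of each year's questions that involve each package.

    questions: iterable of (year, question_packages, answers_packages),
    where answers_packages is a list of sets, one per answer.
    """
    totals = Counter()
    mentions = defaultdict(Counter)
    for year, question_pkgs, answers_pkgs in questions:
        totals[year] += 1
        # A question counts once for a package if the question itself
        # or any one of its answers mentions that package.
        mentioned = set(question_pkgs).union(*answers_pkgs)
        for pkg in mentioned:
            mentions[year][pkg] += 1
    return {
        year: {pkg: 100.0 * n / totals[year] for pkg, n in mentions[year].items()}
        for year in totals
    }
```

Dividing by the number of R questions in the same year is what turns raw counts into a share of the ecosystem, so a package can grow in raw numbers while its share stays flat or declines.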
We can observe some trends in the use of R packages. For example:
- ggplot2 has always been involved in a substantial portion of questions and answers, though its frequency has been slightly declining since the early years of the site.
- The data.table and especially dplyr packages showed rapid growth during Stack Overflow’s lifetime, which has leveled off in the last two years. The interactive web framework Shiny has also shown some substantial growth since its introduction in 2012.
- We can see changes in common tools for solving problems. The plyr and reshape2 packages rose in frequency from about 2009 to 2013, then declined afterwards when Wickham replaced them with the newer dplyr and tidyr packages.
- Older packages like zoo, xml, and grid have been steady or slowly declining as a share of questions.
Another way to visualize growth is to lay out R packages in a network, based on what pairs of packages tended to be used in Stack Overflow answers on the same questions. This gives a sense of what groups of packages tend to solve similar problems. What areas of R package development have recently grown in their share of questions and answers?
This lays out the ecosystem of R packages based on a few smaller subnetworks. Visualization packages generally ended up on the lower left, largely splitting into three clusters: grid graphics (centered around ggplot2), geographical visualization (including the sp, maps and maptools packages) and interactive visualization (with shiny, plotly, DT and htmlwidgets making up some of the more notable nodes). In the center of the ecosystem we see a cluster for data transformation, including dplyr, data.table, and purrr. Other clusters are characterized by text manipulation (stringr, tm), performance optimization (Rcpp, microbenchmark) and time series (lubridate, zoo).
By the definition we chose, most “growth” is centered in newer packages that have plenty of room to grow, such as the tidyverse package (introduced only last year). That means blue regions of the ecosystem don’t represent “stagnant” areas, but rather regions that have already had their share of questions asked. Still, it’s interesting to see that by this definition, two major areas of growth in the ecosystem are data transformation and interactivity. We’d generally agree from our experience in the R community that these are two areas with lots of recent innovation.
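The pairing step behind a network like the one described above could look roughly like this in Python. It is a sketch, not the code used for the graph: it assumes each question is given as a list of package sets (one per answer), and it only produces weighted edges, leaving layout and coloring to a graph library. The name `cooccurrence_edges` and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(questions, min_count=1):
    """Count how many questions have answers that use both packages in a pair.

    questions: iterable of lists of per-answer package sets.
    Returns {(pkg_a, pkg_b): count} with pkg_a < pkg_b alphabetically.
    """
    pair_counts = Counter()
    for answers_pkgs in questions:
        # Pool every package used in any answer to this question.
        pkgs = set().union(*answers_pkgs)
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(pkgs), 2):
            pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}
```

Raising `min_count` drops pairs that co-occur only by chance, which tends to make the resulting clusters easier to read.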
Since we use R on the Stack Overflow Data Team, we certainly enjoyed examining how the R ecosystem is changing, and seeing that it’s been a part of the rapid expansion of the data science field. In general, the number of users of a language isn’t directly related to its quality. But the large and fast-growing community around the R language has certainly contributed to its value as a programming language and as a data analysis environment.
If you use R, and are looking to take the next step in your career, here are some companies hiring R programmers and data scientists right now on Stack Overflow Jobs.
Says so by an R developer.
As always, excellent post! Are you able to share your code for the R Ecosystems plot? Mine never seem to be nicely spaced out like that!
Here’s the code! https://www.kaggle.com/drobinson/analysis-of-r-packages-on-stack-overflow-over-time The data is public within the same Kaggle kernel.
Thanks! That’s epic that it’s a notebook, still working on my integration of R Notebook or even R Markdown into WordPress.
Check out Kaggle kernels, but also take a look at Yihui Xie’s blogdown! It’s set up for building blogs with R Markdown.
People like you, David Robinson, add value to this world. There is no excuse. It’s you, and your keen, kind and professional engagement, that make it possible for mere mortals like myself to sit and wallow in quality writing like yours!!
Thanks, David Robinson, warm warm thanks. May your spirited writings spread all over the world. Bravo!
I am going to share your exquisite link with my 30,000 LinkedIn followers.
What on earth is causing the weird repeating pattern in C usage in the Stack Overflow Traffic to Programming Languages graph?
Probably because the dip is when schools are off and many colleges/schools teach C during the school year, hence the increase.
and a large portion of C use is in schools (because any industries that can have moved on to more modern languages)?
That was more what I find interesting. The TIOBE index shows C to still be very much within the top 3 languages used. So it’s a very popular language, just one that isn’t well represented on SO. I wonder if it’s more that grizzled old seasoned C coders don’t use SO much? Certainly, as an older embedded coder, I don’t see much on SO for my field. I mostly spend my time answering questions for people starting with C.
This is mentioned in the following text: “Traffic to C questions shows a strong seasonal pattern (since it’s one of the most common choices for undergraduate programming classes)”
R has been supplanting SAS and SPSS in undergrad stats classes, wouldn’t you expect to see a similar pattern emerge for it? And as those students graduate, I would expect the industry mix to shift from academia to other industries as well.
Interestingly I might have expected to see such a pattern, but we don’t; the vast majority of R use from universities is by researchers, which we can tell since it doesn’t dip in the summer. Check out this post for more: https://stackoverflow.blog/2017/02/15/how-do-students-use-stack-overflow/
My hypothesis is that there are more and larger classes teaching introductory programming than teaching statistical computing. (Consider the typical size of an intro CS class, and consider that most large intro stats courses probably teach only a little programming if they do at all).
I think there is a hint of a seasonal pattern in the R curve from 2015 onwards, mirroring the C curve to some extent but masked by the upward trend. Some time series analysis might tease it out.
my thought was the cadence of CS classes.
OK, now I know that SO blog is just biased. I’ve heard this guy on the podcast talk about how much he likes R. (this comment is mostly tongue-in-cheek)
You caught me! I was so careful, too; the only places I’ve ever admitted I like R are on that podcast, and in the introduction and conclusion to this blog post. 😉
You mentioned earlier about switching from Python to R and writing a blog post about it. Did you publish that post yet?
Not yet! It’s on my agenda for this month. (it’ll be on my personal blog, Variance Explained).
This doesn’t end up making a difference in the numbers, both because it’s rare compared to language-specific tags and since they’re generally balanced across languages.
For example, consider the three biggest PHP-related tags (by a pretty large margin), laravel, symfony, and codeigniter. Each does have a share of questions that don’t have the PHP tag. But if you added up all their questions without PHP and counted them towards PHP, it would increase PHP’s numbers by about 6%. This wouldn’t come close to catching up to Python in high-income countries, even with a few smaller tags like yii and drupal added and even keeping Python where it is.
And roughly the same increase would happen for Python, which also has widespread usage and a large ecosystem. Django by itself has enough questions without the Python tag to increase Python’s count of questions by 9%. There’s no particular reason to believe that one language has far more frameworks that are missing the language tag than another.
I moved from R to Python/Pandas this year and am very happy. R has way too many grouping functions: apply, tapply, lapply, sapply, etc.
Can you please add “most mentioned Python modules” to your Python article, just as you did with the R packages?
I recently started a course on Machine Learning (Udemy). Both R and Python are explained. I noticed that Python requires far more preparations and far more steps compared to R. In R the same can be achieved with far less code. What is your opinion on this?
The list of packages is neat, and I am willing to bet the vast majority of plyr references are from over a year ago, with dplyr taking over. I would love to see package reference by date.
It’s in there, second to last graph!
I think RStudio played an important role in the popularity of R. Sorry to say this, but I have tried many well-known Python IDEs, including Spyder, PyCharm, Jupyter and Rodeo, and none of them is anywhere near as good as RStudio. Hands down, RStudio is the best IDE I have used so far.
This is exactly what I thought after reading the (excellent) article. It’s missing a view at an important piece of the ecosystem – the IDEs. I believe RStudio (as a company) made R much more accessible and desirable with all their efforts in past years. But that’s probably a little more difficult to capture in the SO questions…
I don’t think the high numbers in Health, Academia and Government necessarily have to do with analytics trumping development. The top three industries are, by and large, non-profit industries. Having worked in all three, I can say you are constantly confronted with having no money to do “that”. As such, a free solution like R becomes very attractive. There are other free analytic tools out there for sure, and at times better commercial offerings. But none has an ecosystem as developed as R’s for performing analytics, for free. I suspect that’s the largest driver.
Can you recommend me some substitute for R? In academia, for multivariate analysis.
I understand that hypothesis, but I’m not sure it fits. Most notably, we found in our earlier analysis of government programmers (https://stackoverflow.blog/2017/07/12/trends-government-software-developers/ ) that most of the disproportionately visited technologies weren’t free: VBA, Excel, MATLAB, and C# were all more disproportionately visited than R, as were Oracle and SQL Server among database technologies. The presence of MATLAB definitely suggests that it’s more about analysis than price.
Generally, I think that paid approaches to data analysis are used more in the R-heavy industries than elsewhere. SAS is very popular in healthcare, and Stata in academia. The tech industry is for-profit and uses little R, but our Python analysis suggests the more common alternative is Python/Pandas, which are free (certainly SAS and Stata are used very rarely in that industry).
For instance, I’ve set up a separate industry graph for MATLAB: academia, government and healthcare are all among the heaviest visitors. (Manufacturing is too, since MATLAB is used in engineering as well as in data analysis.)
Possibly but in England, I’m not sure that’s quite the picture as R is classed as ‘open source’. I recently attended a course on SPC methods and found I was the only one using R in a room full of health analysts. I know quite a lot of colleagues in the NHS and other areas of government/health care are not (as yet) allowed to use R due to the open source issue. So from my perspective, this result is interesting.
Go’s growth stands out.
On phone but would like to see what % is used for Machine Learning or Data Science for Go.
Link is broken on:
“As we saw in a previous post, traffic to Ruby and especially to Objective C have been declining over time”
The tidyverse package includes dplyr, ggplot2, and several other packages. The growth of tidyverse helps to explain some of the dropoff in those packages.
I find the industry assumption used here and by the TIOBE index that “visits to websites for help with language” == “popularity of language for development” very problematic. I use Go on a daily basis and have for almost a year. But I visit Stack Overflow for help on it an order of magnitude less than I did when using C++ for any year, let alone my first year of development with it.
Why? Because Go is an elegantly designed language that I don’t need much help with. And C++ is, in comparison, a monstrous language with too much complexity for its own good for which I needed so much help that Stack Overflow often didn’t even have an answer. I would have to write new (ultimately unanswered) questions for it myself. The result? Much higher Stack Overflow traffic for a language that I ultimately abandoned and much lower Stack Overflow traffic for a language I deeply enjoy.
How certain are you that you don’t visit many Go questions, relative to other technologies? I’ve found people sometimes under- or over-estimate how much they visit particular tags, and that it’s often more representative of one’s language use than people expect.
If you have a Stack Overflow account (or even if you don’t, but are visiting from a browser with a cookie), note that you can get a count of the number of questions you’ve visited from each tag, in the last year or so, from this page:
This could let you get data to back up your suspicion.
I appreciate your data-focused approach to my question. Here are my personal top results, with Go in context:
I’d say there are aspects of both of our points here. The first is that Python is far over-represented compared to how much I use it. I think this is a result of Python’s poor core documentation compared to Go’s. When I need to get data on a Go package, I go directly to its github page. When I need Go documentation, I usually use the documentation found directly on golang.org. This is more supportive of the TIOBE index criteria because I often google it.
But there is no denying my personal experience of spending hours and hours coding Go without ever needing to reference any documentation at all. Compared to when I use Python, I need to reference documentation probably once every 5 or 10 minutes. I think this is also reflected in the numbers above.
Dave, the data is property of Stack Overflow, but it would be great if you could give us a look at the generic code — particularly that behind the Ecosystem of R packages mapping. Alternatively, let’s have R and network analysis course at DataCamp! Thanks, TJH
You can find both the code and the data (for the second half of the post, including the network) here!
How did you make that awesome network plot?!
I cant unsee the ‘heartbeat’ graph C shows us :aaa
Very informative post related to clinical research. It is a great help for my new project. By reading this article, I learned some important things that I need to improve. I continuously check this site for regular updates in the field. Thanks for putting top-notch content in the article. I would like to be here again to find another masterpiece article.
Thanks for posting such great article. I know the worth of R language.
This is the most-asked question, and after studying both Python and R and using them in data analysis, I am glad to comment on it:
Have you ever tried to run a 100-meter race in formal shoes?
I believe not. Sports shoes are well suited to that purpose, while formal shoes serve another purpose.
The example above illustrates the advantage of R over Python: R was created specifically for data analysis.
Nowadays, Python is booming in data science because many programmers are shifting to this field for its high demand and lucrative career path. Since they are programmers and already well versed in programming languages (especially Python, because it is easy), it is easy for them to pick it up.
However, R was, is and will be the market leader for data analysis, EDA and ML as well. Here are the reasons:
1) When doing real-time work, everyone prefers the work to be done efficiently and quickly. That is where R stands firmer than Python, since implementation and analysis are a cakewalk in R compared to Python. With Python, you have to think more about the programmatic solution than about real-time feature selection and the analytics paradigm.
Since R is made for this purpose only, it makes it very easy to perform various tasks.
2) R has a much bigger library of statistical packages.
3) R is very efficient for data visualization, with a complete Grammar of Graphics. I have tried data visualization in Python using Matplotlib and seaborn.
You need to look at the documentation each time before implementing a plot, but ggplot2 in R, with the Grammar of Graphics, makes it so easy that you do not need to memorize the implementation or look it up in the documentation again and again.
4) Packages like tidyverse make data wrangling and data preparation a cakewalk. You can use SQL-like queries while working with data. Making data ready is much easier in R than in Python.
5) R has thousands of packages to perform the best possible task in an efficient manner.
6) R has the concept of the pipe operator via the dplyr package, which makes code really easy to maintain. It improves the look and feel of the code and makes it easy to understand.
7) The tidyverse package is an all-in-one solution for complete analysis. You can do an efficient analysis very quickly. Here is a list of its useful packages:
ggplot2, for data visualisation.
dplyr, for data manipulation.
tidyr, for data tidying.
readr, for data import.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames
stringr and forcats, for data transformation and cleaning.
8) Keras and TensorFlow are available in R as well, and R is at the forefront of deep learning too.
R is simply awesome
This is a great Article articulating the growth of R