The Impressive Growth of R

We found in a previous post that Python has a solid claim to being the fastest-growing programming language in terms of Stack Overflow visits.

The same analysis showed that the R programming language has shown remarkable growth in the last five years as well. In fact, R is growing at a similar rate to Python in terms of a year-over-year percentage, though this growth is “easier” because it started from a smaller share of traffic.
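To make the "easier from a smaller share" point concrete, here is a minimal arithmetic sketch. The traffic shares below are hypothetical placeholders chosen for illustration, not Stack Overflow figures, and `yoy_growth_pct` is a helper name introduced here.

```python
def yoy_growth_pct(previous_share, current_share):
    """Year-over-year growth, as a percentage of the previous year's value."""
    return (current_share - previous_share) / previous_share * 100.0


# Hypothetical shares of question traffic (in percent); NOT real figures.
small_tag_growth = yoy_growth_pct(1.0, 1.25)   # gained 0.25 percentage points
large_tag_growth = yoy_growth_pct(8.0, 10.0)   # gained 2.0 percentage points

print(small_tag_growth, large_tag_growth)
```

Both hypothetical tags grow by the same relative percentage, but the smaller one needs only an eighth as many additional percentage points of traffic to do so.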

In another post, we found that much of the growth of Python can be explained by the expansion of data science. Since R is primarily used for statistical analysis, it’s likely that R is part of the same trend. In this post, we’ll analyze how quickly R has grown, examine how its growth differs across industries, and look at what R packages are popular and growing within the ecosystem.

As with Python, a disproportionate amount of traffic to R questions comes from high-income countries (R is visited about three times as often in those countries as in the rest of the world), so this post will consider Stack Overflow traffic from high-income countries, such as the United States, United Kingdom, Germany, Canada, and France. As a disclaimer, we’ll note that the Data Team at Stack Overflow works primarily in R (in fact, we use it to generate almost all of the graphs and results for Insights posts like this one).

Growth of R

Our previous post about Python’s growth considered visits to Python compared to five other major programming languages (Java, JavaScript, C#, PHP, and C++) that each make up a substantial share of Stack Overflow question visits. R makes up a smaller share than those tags, so we’ll compare its growth to languages of a similar size.

There is no sense in which R is “competing” with any of these other languages, which aren’t typically used for data analysis. This comparison is shown only to demonstrate that the kind of sustained growth R has shown is rare among languages of comparable size.

Traffic to C questions shows a strong seasonal pattern (since it’s one of the most common choices for undergraduate programming classes), and R has roughly caught up with that level of traffic. Visits to Swift questions grew rapidly after Apple introduced the language in 2014, but have since leveled off. TypeScript, though it’s still a smaller source of traffic, has been showing quite remarkable growth, and will be the subject of some future analyses. As we saw in a previous post, traffic to Ruby and especially to Objective-C has been declining over time.

Projecting future growth of Stack Overflow visits can be tricky, but according to an STL model it’s reasonable to predict that R will be the seventh most visited programming language tag from high-income countries in 2018, after Python, Java, JavaScript, C#, PHP, and C++.
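The post does not show its STL model, so the following is not STL and not the authors' method. It is a much cruder stand-in that illustrates the same underlying idea: separating a repeating seasonal pattern (like the one in C traffic) from the long-run trend before projecting anything. The series is synthetic, and `trailing_mean` is a helper name introduced here.

```python
def trailing_mean(series, window=12):
    """Average each full run of `window` consecutive points.

    A pattern that repeats every `window` points contributes the same
    total to every run, so it cancels out of the differences between runs.
    """
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Synthetic monthly data: a linear trend plus a summer dip and a year-end
# bump. The seasonal offsets sum to zero over a year. NOT real traffic.
seasonal = [0, 0, 0, 0, 0, -3, -3, 0, 0, 0, 0, 6]
monthly_visits = [100 + 2 * m + seasonal[m % 12] for m in range(36)]

trend = trailing_mean(monthly_visits)
monthly_change = [later - earlier for earlier, later in zip(trend, trend[1:])]
print(monthly_change[:3])
```

On this synthetic series the smoothed values rise by the same amount every month, even though the raw values dip and spike. A real STL decomposition additionally estimates the seasonal component itself and is more robust to noise and outliers.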

By industry

Which industries visit R questions the most? (This analysis is restricted to the United States and United Kingdom, the countries in which we can segment our traffic by industry.)

R is most visited from universities, where it’s a common choice for academic research, especially in the social sciences and biology. Indeed, in June-July 2017, when most classes aren’t in session, R was the second-most visited tag from universities, behind only Python.

The industry with the second-highest share of R visitors, by a close margin, is healthcare. That probably won’t come as a surprise to biostatisticians, since R is the tool of choice for many statistical methods necessary for clinical studies and bioinformatics.

One industry that visits relatively few R questions, compared to other technologies, is tech: software and web companies. (The Data Team here at Stack Overflow is one exception!) This is partly because data analysis makes up a relatively small portion of the industry’s Stack Overflow visits, compared to software and web development: we separately found that pandas, a data science framework for Python, was less visited in tech than in almost all other industries. But it does suggest that the way we use R on our team is not the typical use case for the language.

We saw that Python’s rate of growth has been roughly equal across all industries we can measure. In what industries is R growing the fastest?

R isn’t shrinking within any industry, but visits to R are generally growing faster in industries where it was already more heavily visited, with very rapid growth in academia and healthcare. This graph also confirms what we saw in a previous analysis: R is both disproportionately visited and fast-growing in the government sector. We also see that it’s relatively widely used, and expanding, in consulting and insurance. Each of these is an industry where data analysis and visualization play a disproportionate role, relative to software and web development.

One of the areas where we don’t see much growth is tech, confirming that most of R’s expansion appears to be happening outside of the software and web industry. Since in that industry we did see an increase in visits to Python data science frameworks like pandas and NumPy, it’s a reasonable conclusion that Python is becoming a more popular choice for data science within those companies.

R Packages

In the case of Python, we were interested in what particular applications of the language had been driving its growth, such as data science, web development, and system administration. R is less of a mystery: its primary purpose has always been statistical analysis, machine learning, and data visualization. But we’re still interested in what trends are happening within the R ecosystem.

To examine this, we extracted what R packages were used in particular questions and answers. We extracted this from our public R Questions dataset hosted on Kaggle, containing all (non-deleted) questions and answers with the R tag. This Kaggle kernel shows how we parsed the data, including examining uses of the library() and require() functions.
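As a rough illustration of detecting library() and require() calls in a post body, here is a minimal regular-expression sketch. It is an assumption about how such parsing could work, not the code from the Kaggle kernel; the pattern, the `extract_packages` helper, and the sample post are all invented for this example.

```python
import re

# Matches library(pkg), require(pkg), and the quoted forms library("pkg") /
# require('pkg'), optionally followed by further arguments.
PACKAGE_CALL = re.compile(
    r"""\b(?:library|require)\s*\(\s*["']?([A-Za-z][A-Za-z0-9.]*)["']?\s*[,)]"""
)


def extract_packages(text):
    """Return the sorted, de-duplicated package names loaded in a block of R code."""
    return sorted(set(PACKAGE_CALL.findall(text)))


# An invented post body, for illustration only.
example_post = '''
library(dplyr)
require("ggplot2")
library( data.table )
suppressMessages(library(tidyr, quietly = TRUE))
requireNamespace("jsonlite")
'''

packages = extract_packages(example_post)
print(packages)
```

This sketch deliberately ignores other ways a package can appear, such as `pkg::function()` calls or `requireNamespace()`, so it would undercount relative to a more thorough parser.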

What were the most commonly used packages, in all existing R questions and answers?

Many of the most commonly mentioned packages were written by Hadley Wickham, with his packages making up 7 of the top 10 (the others being data.table, shiny, and zoo). It’s worth noting that this metric may be tilted towards the most confusing packages rather than simply the most widely used. However, counting only the packages mentioned in answers, rather than in questions and answers together, leads to a very similar list (you can try it yourself!), which suggests this is a reasonably faithful representation of the packages R developers find useful in their work.

This data can also give us insight into the fastest growing packages. We’ll measure this over time in terms of the percentage of questions where either the question mentions the package, or one of its answers does. Since R questions in general are becoming more common, we’re examining the changes only as a share of the R ecosystem: most of these packages are growing in terms of raw numbers.

(Note that in some very rare cases packages were edited into older questions or answers, which allows them to “appear” before the package was released.)
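The share metric described above (a question counts if either it or any of its answers mentions the package) can be sketched as follows. The record layout, field names, and sample data are assumptions made for this example, not the structure of the Kaggle dataset.

```python
from collections import defaultdict


def package_share_by_year(questions, package):
    """Percent of each year's questions where the question or any answer mentions `package`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for question in questions:
        year = question["year"]
        totals[year] += 1
        # Pool the packages from the question and from all of its answers.
        mentioned = set(question["packages"])
        for answer_packages in question["answer_packages"]:
            mentioned.update(answer_packages)
        if package in mentioned:
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Made-up records, for illustration only.
questions = [
    {"year": 2013, "packages": ["plyr"], "answer_packages": [["plyr"]]},
    {"year": 2013, "packages": [], "answer_packages": [["ggplot2"]]},
    {"year": 2016, "packages": [], "answer_packages": [["dplyr"], ["data.table"]]},
    {"year": 2016, "packages": ["dplyr"], "answer_packages": []},
    {"year": 2016, "packages": ["ggplot2"], "answer_packages": [["ggplot2"]]},
    {"year": 2016, "packages": ["shiny"], "answer_packages": []},
]

dplyr_share = package_share_by_year(questions, "dplyr")
print(dplyr_share)
```

Dividing by each year's total is what makes this a share of the R ecosystem rather than a raw count, so a package can decline on this measure while still gaining questions in absolute terms.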

We can observe some trends in the use of R packages. For example:

  • ggplot2 has always been involved in a substantial portion of questions and answers, though its frequency has been slightly declining since the early years of the site.
  • The data.table and especially dplyr packages showed rapid growth during Stack Overflow’s lifetime, which has leveled off in the last two years. The interactive web framework Shiny has also shown some substantial growth since its introduction in 2012.
  • We can see changes in common tools for solving problems. The plyr and reshape2 packages rose in frequency from about 2009 to 2013, then declined afterwards when Wickham replaced them with the newer dplyr and tidyr packages.
  • Older packages like zoo, xml, and grid have been steady or slowly declining as a share of questions.

Another way to visualize growth is to lay out R packages in a network, based on what pairs of packages tended to be used in Stack Overflow answers on the same questions. This gives a sense of what groups of packages tend to solve similar problems. What areas of R package development have recently grown in their share of questions and answers?
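One simple way to build the edges of such a network is to count how often each pair of packages appears in answers to the same question. The sketch below uses raw co-occurrence counts with a cutoff, which is an assumption: the post does not say which measure of "tended to be used together" was applied. The function name, the `min_count` parameter, and the sample data are invented here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(packages_per_question, min_count=2):
    """Count package pairs seen on the same question; keep pairs seen at least `min_count` times."""
    pair_counts = Counter()
    for packages in packages_per_question:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(set(packages)), 2):
            pair_counts[pair] += 1
    return {pair: count for pair, count in pair_counts.items() if count >= min_count}


# Each inner list: packages used in the answers to one (made-up) question.
answers_by_question = [
    ["dplyr", "tidyr", "ggplot2"],
    ["dplyr", "tidyr"],
    ["ggplot2", "gridExtra"],
    ["dplyr", "ggplot2"],
]

edges = cooccurrence_edges(answers_by_question)
print(edges)
```

The resulting pairs and counts could then be handed to a graph layout tool as weighted edges. Raw counts favor very common packages, so a normalized measure such as a correlation may give cleaner clusters.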

This lays out the ecosystem of R packages based on a few smaller subnetworks. Visualization packages generally ended up on the lower left, largely splitting into three clusters: grid graphics (centered around ggplot2), geographical visualization (including the sp, maps and maptools packages) and interactive visualization (with shiny, plotly, DT and htmlwidgets making up some of the more notable nodes). In the center of the ecosystem we see a cluster for data transformation, including dplyr, data.table, and purrr. Other clusters are characterized by text manipulation (stringr, tm), performance optimization (Rcpp, microbenchmark) and time series (lubridate, zoo).

By the definition we chose, most “growth” is centered in newer packages that have plenty of room to grow, such as the tidyverse package (introduced only last year). That means blue regions of the ecosystem don’t represent “stagnant” areas, but rather regions that have already had their share of questions asked. Still, it’s interesting to see that by this definition, two major areas of growth in the ecosystem are data transformation and interactivity. We’d generally agree from our experience in the R community that these are two areas with lots of recent innovation.

Conclusion

Since we use R on the Stack Overflow Data Team, we certainly enjoyed examining how the R ecosystem is changing, and seeing that it’s been a part of the rapid expansion of the data science field. In general, the number of users of a language isn’t directly related to its quality. But the large and fast-growing community around the R language has certainly contributed to its value as a programming language and as a data analysis environment.

If you use R, and are looking to take the next step in your career, here are some companies hiring R programmers and data scientists right now on Stack Overflow Jobs.

Author

David Robinson
Data Scientist

Comments

  • Jon

    Says so by an R developer.

  • As always, excellent post! Are you able to share your code for the R Ecosystems plot? Mine never seem to be nicely spaced out like that!

  • Joe Rutledge

    What on earth is causing the weird repeating pattern in C usage in the Stack Overflow Traffic to Programming Languages graph?

    • Naftali Lubin

      Probably because the dip is when schools are off and many colleges/schools teach C during the school year, hence the increase.

      • Nathan Harvey

        and a large portion of C use is in schools (because any industries that can have moved on to more modern languages)?

        • Joe Rutledge

          That was more what I find interesting. The TIOBE index shows C to still be very much within the top 3 languages used. So it’s a very popular language, just one that isn’t well represented on SO. I wonder if it’s more that grizzled old seasoned C coders don’t use SO much? Certainly, as an older embedded coder, I don’t see much on SO for my field. I mostly spend my time answering questions for people starting with C.

    • David Robinson

      This is mentioned in the following text: “Traffic to C questions shows a strong seasonal pattern (since it’s one of the most common choices for undergraduate programming classes)”

      • Neal Fultz

        R has been supplanting SAS and SPSS in undergrad stats classes, wouldn’t you expect to see a similar pattern emerge for it? And as those students graduate, I would expect the industry mix to shift from academia to other industries as well.

        • David Robinson

          Interestingly I might have expected to see such a pattern, but we don’t; the vast majority of R use from universities is by researchers, which we can tell since it doesn’t dip in the summer. Check out this post for more: https://stackoverflow.blog/2017/02/15/how-do-students-use-stack-overflow/

          My hypothesis is that there are more and larger classes teaching introductory programming than teaching statistical computing. (Consider the typical size of an intro CS class, and consider that most large intro stats courses probably teach only a little programming, if they teach any at all.)

          • I think there is a hint of a seasonal pattern in the R curve from 2015 onwards, mirroring the C curve to some extent but masked by the upward trend. Some time series analysis might tease it out.

    • Grady

      my thought was the cadence of CS classes.

  • Nathan Harvey

    OK, now I know that SO blog is just biased. I’ve heard this guy on the podcast talk about how much he likes R. (this comment is mostly tongue-in-cheek)

    • David Robinson

      You caught me! I was so careful, too; the only places I’ve ever admitted I like R are on that podcast, and in the introduction and conclusion to this blog post. 😉

  • You mentioned earlier about switching from Python to R and writing a blog post about it. Did you publish that post yet?

    • David Robinson

      Not yet! It’s on my agenda for this month. (it’ll be on my personal blog, Variance Explained).

  • Franz Josef Kaiser

    The “Python is great” statement was already wrong. Lots of tag communities exist without tagging the language. Languages with widespread usage and a large ecosystem, like JavaScript and PHP, use framework tags instead of language tags. And those are not in the stats.

    • David Robinson

      This doesn’t end up making a difference in the numbers, both because it’s rare compared to language-specific tags and since they’re generally balanced across languages.

      For example, consider the three biggest PHP-related tags (by a pretty large margin), laravel, symfony, and codeigniter. Each does have a share of questions that don’t have the PHP tag. But if you added up all their questions without PHP and counted them towards PHP, it would increase PHP’s numbers by about 6%. This wouldn’t come close to catching up to Python in high-income countries, even with a few smaller tags like yii and drupal added and even keeping Python where it is.

      And roughly the same increase would happen for Python, which also has widespread usage and a large ecosystem. Django by itself has enough questions without the Python tag to increase Python’s count of questions by 9%. There’s no particular reason to believe that one language has far more frameworks that are missing the language tag than another.

  • TechUser2011

    I moved from R to Python/Pandas this year and am very happy. R has way too many grouping functions: apply, tapply, lapply, sapply, etc.

    Can you please add “most mentioned Python modules” to your Python article, just as you did with the R packages?

    • Zarif Atai

      I recently started a course on Machine Learning (Udemy). Both R and Python are explained. I noticed that Python requires far more preparations and far more steps compared to R. In R the same can be achieved with far less code. What is your opinion on this?

  • The list of packages is neat, and I am willing to bet the vast majority of plyr references are from over a year ago, with dplyr taking over. I would love to see package reference by date.

    • David Robinson

      It’s in there, second to last graph!

  • TechUser2010

    I think RStudio played an important role in the popularity of R. Sorry to say this, but I have tried many famous Python IDEs, including Spyder, PyCharm, Jupyter, and Rodeo, and none of them is anywhere near as good as RStudio. Hands down, RStudio is the best IDE I have used so far.

    • liborm

      This is exactly what I thought after reading the (excellent) article. It’s missing a view at an important piece of the ecosystem – the IDEs. I believe RStudio (as a company) made R much more accessible and desirable with all their efforts in past years. But that’s probably a little more difficult to capture in the SO questions…

  • Jeff

    I don’t think the high numbers in Health, Academia and Government necessarily have to do with analytics trumping development. The top three industries are, by and large, non-profit industries. Having worked in all three, I can say you are constantly confronted with “no money to do that”. As such, a free solution like R becomes very attractive. There are other free analytic tools out there for sure, and at times better commercial offerings. But none has an ecosystem as developed as R’s for performing analytics, for free. I suspect that’s the largest driver.

    • Simoné S Simon

      Can you recommend a substitute for R? In academia, for multivariate analysis.

    • David Robinson

      I understand that hypothesis, but I’m not sure it fits. Most notably, we found in our earlier analysis of government programmers (https://stackoverflow.blog/2017/07/12/trends-government-software-developers/ ) that most of the disproportionately visited technologies weren’t free: VBA, Excel, MATLAB, and C# were all more disproportionately visited than R, as were Oracle and SQL Server among database technologies. The presence of MATLAB definitely suggests that it’s more about analysis than price.

      Generally, I think that paid approaches to data analysis are used more in the R-heavy industries than elsewhere. SAS is very popular in healthcare, and Stata in academia. The tech industry is for-profit and uses little R, but our Python analysis suggests the more common alternative is Python/Pandas, which are free (certainly SAS and Stata are used very rarely in that industry).

    • David Robinson

      For instance, I’ve set up a separate industry graph for MATLAB: academia, government and healthcare are all among the heaviest visitors. (Manufacturing also is, since MATLAB is used in engineering as well as in data analysis.)

      https://uploads.disquscdn.com/images/0b4910eab1e19a2aceeeeb6aba1705f041c6fd960acbcc19b0ed2959786684b6.png

    • Louis Thackray

      Possibly but in England, I’m not sure that’s quite the picture as R is classed as ‘open source’. I recently attended a course on SPC methods and found I was the only one using R in a room full of health analysts. I know quite a lot of colleagues in the NHS and other areas of government/health care are not (as yet) allowed to use R due to the open source issue. So from my perspective, this result is interesting.

  • Arthur Bugorski

    To me the most interesting news is the continued growth of Scala and the decline for Clojure. My casual impression had been the reverse. There is a quiet battle for the runner-up language on the JVM and both have good JavaScript stories as well.

  • Charles_F

    Go’s growth stands out.
    On phone but would like to see what % is used for Machine Learning or Data Science for Go.

  • Zé Silva

    Link is broken on:
    “As we saw in a previous post, traffic to Ruby and especially to Objective C have been declining over time”

  • Andrew Breza

    The tidyverse package includes dplyr, ggplot2, and several other packages. The growth of tidyverse helps to explain some of the dropoff in those packages.

  • I find the industry assumption used here and by the TIOBE index that “visits to websites for help with language” == “popularity of language for development” very problematic. I use Go on a daily basis and have for almost a year. But I visit Stack Overflow for help on it an order of magnitude less than I did when using C++ for any year, let alone my first year of development with it.

    Why? Because Go is an elegantly designed language that I don’t need much help with. And C++ is, in comparison, a monstrous language with too much complexity for its own good for which I needed so much help that Stack Overflow often didn’t even have an answer. I would have to write new (ultimately unanswered) questions for it myself. The result? Much higher Stack Overflow traffic for a language that I ultimately abandoned and much lower Stack Overflow traffic for a language I deeply enjoy.

    • David Robinson

      How certain are you that you don’t visit many Go questions, relative to other technologies? I’ve found people sometimes under- or over-estimate how much they visit particular tags, and that it’s often more representative of one’s language use than people expect.

      If you have a Stack Overflow account (or even if you don’t, but are visiting from a browser with a cookie), note that you can get a count of the number of questions you’ve visited from each tag, in the last year or so, from this page:

      https://stackoverflow.com/users/prediction-data

      This could let you get data to back up your suspicion.

      • I appreciate your data-focused approach to my question. Here are my personal top results, with Go in context:


        python 712
        go 232
        javascript 153
        c++ 115

        I’d say there are aspects of both of our points here. The first is that Python is far over-represented compared to how much I use it. I think this is a result of Python’s poor core documentation compared to Go’s. When I need to get data on a Go package, I go directly to its github page. When I need Go documentation, I usually use the documentation found directly on golang.org. This is more supportive of the TIOBE index criteria because I often google it.

        But there is no denying my personal experience of spending hours and hours coding Go without ever needing to reference any documentation at all. Compared to when I use Python, I need to reference documentation probably once every 5 or 10 minutes. I think this is also reflected in the numbers above.

  • Thomas Haslam

    Dave, the data is property of Stack Overflow, but it would be great if you could give us a look at the generic code — particularly that behind the Ecosystem of R packages mapping. Alternatively, let’s have R and network analysis course at DataCamp! Thanks, TJH

  • Adam Black

    How did you make that awesome network plot?!

  • I cant unsee the ‘heartbeat’ graph C shows us :aaa