On the data team here at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. We use these kinds of relationships all over the place, from suggesting relevant content that improves the experience of everyone who comes to Stack Overflow to helping our clients understand how to hire developers. One way to get at this idea of relationships between technologies is tag correlations. Correlation between tags measures how often tags appear together relative to how often they appear separately. You can check out one of the chapters of my book (written with fellow Stack Overflow data scientist Dave Robinson) for more detailed discussion of this.
Together vs. Apart
We have a number of data sources that we could use to measure tag correlations. For instance, Matt Sherman, an engineering manager here at Stack Overflow, built a tool that measures how often tags appear together on Stack Overflow questions. We could also use our traffic data, and see how often users visit pairs of tags. For this analysis, though, I’m going to use a different data set: the “liked tags” on Developer Stories here at Stack Overflow. If you haven’t made your own Developer Story or explored them, feel free to check out mine. Notice that I have identified some Stack Overflow tags that I want to work with in my professional life; for me, it’s R, dplyr, ggplot2, Shiny, and so forth. You can tell from those tags that I have a specific set of skills and do a certain kind of work (if you’re familiar with those technologies, anyway). There is similar signal in other Developer Stories here on Stack Overflow, and we can use the distribution of these tags to learn about how technologies are interrelated. The reason I like using Developer Stories for this kind of tag analysis is that they have a high signal-to-noise ratio. I am interested in how technologies are connected and how they are used together, and developers’ own descriptions of their work and careers are a great place to get that.
To start with here, we are just looking at which tags are used most often. We see the usual suspects here, some of the most common languages used by developers today.
Notice that these are still many of the same common, important languages that we saw in the first plot. Languages like Java and C are commonly used by developers on their Developer Stories together with these four languages, but that is because they are simply the most common technologies in general. To explore tag correlations, we want to ask a slightly different question. We want to find tags that are more likely to occur together than with other tags in this dataset. Which tags are most correlated with these four languages?
We see a different set of technologies now. These are tags that are more likely to be used by a developer on her or his Developer Story with these four languages than with other tags, and now we are using the aggregate data of developers here on Stack Overflow to gain insight into how technologies are used together. We see here, for example, more evidence about how developers are using Python both for data science, along with R (another language used for data science), Pandas, and NumPy, and for web development, with Django and Flask. We are able to find these related technologies because we calculated tag correlations.
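To make the difference between "used together often" and "correlated" concrete, here is a minimal Python sketch. It assumes the phi coefficient as the correlation measure, a common choice for binary data like "this Developer Story has this tag or not"; the post itself does not say which measure was used. The function names and the toy `stories` data are made up for illustration.

```python
from math import sqrt


def phi_correlation(tag_sets, tag_a, tag_b):
    """Phi coefficient between two tags across a list of per-developer tag sets."""
    # Cells of the 2x2 contingency table: has/has not tag_a vs. has/has not tag_b.
    n11 = sum(1 for tags in tag_sets if tag_a in tags and tag_b in tags)
    n10 = sum(1 for tags in tag_sets if tag_a in tags and tag_b not in tags)
    n01 = sum(1 for tags in tag_sets if tag_a not in tags and tag_b in tags)
    n00 = len(tag_sets) - n11 - n10 - n01
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # one of the tags is on every story or on none of them
    return (n11 * n00 - n10 * n01) / denom


def most_correlated(tag_sets, target, top_n=5):
    """Rank all other tags by their phi correlation with `target`."""
    other_tags = set().union(*tag_sets) - {target}
    scored = [(tag, phi_correlation(tag_sets, target, tag)) for tag in other_tags]
    # Highest correlation first; ties are broken alphabetically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Toy data (hypothetical): each set is the liked tags on one Developer Story.
stories = [
    {"python", "pandas", "numpy"},
    {"python", "django", "javascript"},
    {"python", "pandas", "r"},
    {"javascript", "html", "css"},
    {"javascript", "html", "java"},
    {"java", "android"},
]

for tag, score in most_correlated(stories, "python", top_n=3):
    print(f"{tag}: {score:.2f}")
```

In this toy data, JavaScript appears alongside Python on one story but also appears on two stories without it, so its correlation with Python is negative, while Pandas, which only ever appears with Python, ranks first.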
Network of Correlations
We are not restricted to looking at one tag at a time. We can extend this correlation calculation to many more tags, and then build a network of tags based on how they are correlated with each other.
In this interactive network visualization (you can zoom, scroll, and click), the size of each circle represents how often that tag is used; tags with larger circles are used more often. The circles are colored based on their subgroup membership within the network as a whole, which is calculated with a community detection algorithm based on many short random walks (walktrap clustering). This network includes tags that are used more than 800 times on Developer Stories and have correlations greater than 0.1 with other tags.
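The filtering step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the input shapes, and the toy counts and correlations are hypothetical, and only the two thresholds (more than 800 uses, correlation greater than 0.1) come from the text. It does not perform the subgroup detection itself.

```python
def build_network(tag_counts, correlations, min_count=800, min_corr=0.1):
    """Return (nodes, links) for tags that pass both thresholds.

    tag_counts:   {tag: number of Developer Stories using the tag}
    correlations: {(tag_a, tag_b): correlation between the two tags}
    """
    frequent = {tag for tag, count in tag_counts.items() if count > min_count}
    links = [
        (a, b, corr)
        for (a, b), corr in correlations.items()
        if a in frequent and b in frequent and corr > min_corr
    ]
    # Keep only tags that still have at least one link after filtering.
    connected = {tag for a, b, _ in links for tag in (a, b)}
    nodes = {tag: tag_counts[tag] for tag in sorted(connected)}
    return nodes, links


# Toy inputs (made-up numbers, not real Developer Story data).
tag_counts = {"python": 5000, "django": 1200, "pandas": 900, "cobol": 40, "excel": 1000}
correlations = {
    ("python", "django"): 0.35,
    ("python", "pandas"): 0.30,
    ("python", "cobol"): 0.20,   # dropped: cobol is used too rarely
    ("django", "pandas"): 0.02,  # dropped: correlation too weak
    ("excel", "python"): 0.05,   # dropped: correlation too weak
}

nodes, links = build_network(tag_counts, correlations)
print(nodes)  # node size in the plot would be driven by these counts
print(links)
```

The resulting nodes and links could then be handed to a graph library for layout and community detection; igraph, for example, provides a walktrap implementation.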
There is so much we can see by exploring this network! One thing we can notice is subgroups within the network that show us tech ecosystems, some of them densely interconnected. We see some groups made up of:
- Microsoft-related technologies including C#, .NET, and SQL Server
- DevOps technologies like AWS and Docker (Go is in this cluster!)
- Mobile technologies including Android and Objective-C
Where are the technologies that you use, and how are they connected? You can explore this network for yourself; the network data structure is publicly available as a dataset at Kaggle. You can check out the Kaggle kernel I created to show how to use the network nodes and links to create a network graph.
Another thing we can notice in this network is that some technologies act as bridges between tech ecosystems. Python, one of the most commonly used languages on Developer Stories, connects to the front-end cluster (through Django), to a Linux/systems administration cluster, to a C/C++/embedded cluster, and to R and machine learning. We see time and again how unique a language Python is becoming in today’s technology landscape. Java, git, and JSON are other “bridge” technologies that connect parts of this network.
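One rough way to put a number on this "bridge" idea is to count how many other subgroups each tag links to. The Python sketch below does that; it is an assumption about how one might quantify bridging, not the method used to draw the conclusions above, and the links and group labels are hypothetical toy values.

```python
from collections import defaultdict


def bridge_scores(links, group_of):
    """Count, for each tag, the distinct other groups its neighbors belong to."""
    neighbor_groups = defaultdict(set)
    for a, b in links:
        if group_of[a] != group_of[b]:
            neighbor_groups[a].add(group_of[b])
            neighbor_groups[b].add(group_of[a])
    return {tag: len(groups) for tag, groups in neighbor_groups.items()}


# Toy network (hypothetical links and group labels).
group_of = {
    "python": "data", "r": "data",
    "django": "web", "javascript": "web",
    "linux": "sysadmin", "bash": "sysadmin",
}
links = [
    ("python", "r"),
    ("python", "django"),
    ("python", "linux"),
    ("django", "javascript"),
    ("linux", "bash"),
]

scores = bridge_scores(links, group_of)
for tag, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(f"{tag} links to {score} other group(s)")
```

Tags that only connect within their own group get no score at all, so in this toy network Python stands out as the tag reaching the most other groups.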
This analysis used the liked tags on Developer Stories to explore the rich, complex network of technologies that we work within. When developers share who we are as professionals in ways that we actually care about, like with the technologies we want to use, we can all learn more about the developer community. You can make your own Developer Story today and highlight your career, interests, and what technologies you want to work with.