Mapping Ecosystems of Software Development
On the data team here at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. We use these kinds of relationships all over the place, from improving the experience of everyone who comes to Stack Overflow by suggesting relevant content to helping our clients understand how to hire developers. One way to get at this idea of relationships between technologies is tag correlations. Correlation between tags measures how often tags appear together relative to how often they appear separately. You can check out one of the chapters of my book (written with fellow Stack Overflow data scientist Dave Robinson) for a more detailed discussion of this.
Together vs. Apart
We have a number of data sources that we could use to measure tag correlations. For instance, Matt Sherman, an engineering manager here at Stack Overflow, built a tool that measures how often tags appear together on Stack Overflow questions. We could also use our traffic data, and see how often users visit pairs of tags. For this analysis, though, I’m going to use a different data set: the “liked tags” on Developer Stories here at Stack Overflow. If you haven’t made your own Developer Story or explored them, feel free to check out mine. Notice that I’ve identified some Stack Overflow tags that I want to work with in my professional life; for me, it’s R, dplyr, ggplot2, Shiny, and so forth. You can tell from those tags that I have a specific set of skills and do a certain kind of work (if you’re familiar with those technologies, anyway). There is similar signal in other Developer Stories here on Stack Overflow, and we can use the distribution of these tags and how they are related to learn about how technologies are interrelated. The reason I like using Developer Stories for this kind of tag analysis is that they have a high signal-to-noise ratio. I am interested in how technologies are connected and how they are used together, and developers’ own descriptions of their work and careers are a great place to get that.
To start with here, we are just looking at which tags are used most often. We see the usual suspects here, some of the most common languages used by developers today.
Next, let’s count up co-occurrences of tags and find which tags are commonly used together. For example, what are the tags that occur most often with a few important languages like C#, C++, JavaScript, and Python on Developer Stories?
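As a rough illustration of this counting step (not the code behind this post), here is a minimal Python sketch. The `stories` list is made-up toy data standing in for the liked tags on each Developer Story, and the helper names `count_pairs` and `top_cooccurring` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set stands in for the liked tags on one Developer Story.
stories = [
    {"python", "pandas", "numpy"},
    {"python", "django", "javascript"},
    {"javascript", "html", "css"},
    {"python", "pandas", "r"},
    {"c#", ".net", "sql-server"},
]


def count_pairs(stories):
    """Count how many stories contain each unordered pair of tags."""
    pair_counts = Counter()
    for tags in stories:
        # sorted() gives each pair a canonical order, so (a, b) and (b, a) match.
        for pair in combinations(sorted(tags), 2):
            pair_counts[pair] += 1
    return pair_counts


def top_cooccurring(pair_counts, tag, n=3):
    """Return the n tags that appear most often alongside `tag`."""
    partners = Counter()
    for (a, b), count in pair_counts.items():
        if a == tag:
            partners[b] += count
        elif b == tag:
            partners[a] += count
    return partners.most_common(n)


pair_counts = count_pairs(stories)
print(top_cooccurring(pair_counts, "python"))
```

Raw counts like these favor whichever tags are most popular overall, which is why counting alone is not enough to say two technologies are specifically related.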
Notice that these are still many of the same common, important languages that we saw in the first plot. Languages like Java and C, along with these four languages themselves, often appear alongside C#, C++, JavaScript, and Python on Developer Stories, but that is largely because they are the most common technologies in general. To explore tag correlations, we want to ask a slightly different question. We want to find tags that are more likely to occur together than with other tags in this dataset. Which tags are most correlated with these four languages?
We see a different set of technologies now. These are tags that are more likely to be used by a developer on her or his Developer Story with these four languages than with other tags, and now we are using the aggregate data of developers here on Stack Overflow to gain insight into how technologies are used together. We see here, for example, more evidence that developers are using Python both for data science, along with R (another language used for data science), Pandas, and NumPy, and for web development, with Django and Flask. We are able to find these related technologies because we calculated tag correlations.
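The post does not spell out the exact correlation formula, so the sketch below assumes one common choice for present/absent data, the phi coefficient. It compares how often two tags appear together with how often each appears without the other. The toy `stories` data and the function name `phi` are hypothetical.

```python
from math import sqrt

# Hypothetical toy data: each set stands in for the liked tags on one Developer Story.
stories = [
    {"python", "pandas", "javascript"},
    {"python", "pandas"},
    {"python", "javascript"},
    {"javascript", "html"},
    {"javascript", "html"},
    {"javascript", "java"},
]


def phi(stories, x, y):
    """Phi coefficient between the presence of tag x and tag y across stories."""
    n11 = sum(1 for tags in stories if x in tags and y in tags)          # both
    n10 = sum(1 for tags in stories if x in tags and y not in tags)      # x only
    n01 = sum(1 for tags in stories if x not in tags and y in tags)      # y only
    n00 = sum(1 for tags in stories if x not in tags and y not in tags)  # neither
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denominator == 0:
        # One of the tags is on every story or on none, so correlation is undefined.
        return 0.0
    return (n11 * n00 - n10 * n01) / denominator


print(round(phi(stories, "python", "pandas"), 3))
print(round(phi(stories, "python", "javascript"), 3))
```

In this toy data, "python" appears on two stories with "pandas" and on two stories with "javascript", so the raw co-occurrence counts are tied. The correlation separates them: it is positive for "pandas", which never appears without "python", and negative for "javascript", which mostly appears on stories without it.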
Network of Correlations
We are not restricted to looking at one tag at a time. We can extend this correlation calculation to many more tags, and then build a network of tags based on how they are correlated with each other.
In this interactive network visualization (you can zoom, scroll, and click), the size of each circle represents how often that tag is used; tags with larger circles are used more often. The circles are colored based on their subgroup membership within the network as a whole, which is calculated via many random walks (the walktrap clustering algorithm). This network includes tags that are used more than 800 times on Developer Stories and have correlations greater than 0.1 with other tags.
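To make the filtering step concrete, here is a small Python sketch that applies the two thresholds mentioned above (more than 800 uses, correlation greater than 0.1) to made-up numbers. All counts, correlations, and function names are hypothetical. For grouping, it uses plain connected components as a much simpler stand-in; it is not the random-walk clustering used for the visualization.

```python
from collections import defaultdict

# Hypothetical inputs: tag usage counts and pairwise correlations (made-up numbers).
tag_counts = {
    "python": 5000, "pandas": 1200, "django": 1500,
    "javascript": 9000, "html": 7000, "cobol": 40,
}
correlations = {
    ("pandas", "python"): 0.30,
    ("django", "python"): 0.25,
    ("html", "javascript"): 0.45,
    ("django", "javascript"): 0.05,
    ("cobol", "python"): 0.20,
}


def build_network(tag_counts, correlations, min_count=800, min_cor=0.1):
    """Keep frequently used tags and the sufficiently strong links between them."""
    frequent = {tag for tag, count in tag_counts.items() if count > min_count}
    links = [
        (a, b, r)
        for (a, b), r in correlations.items()
        if a in frequent and b in frequent and r > min_cor
    ]
    # Only tags that still have at least one link become nodes.
    nodes = {tag for a, b, _ in links for tag in (a, b)}
    return nodes, links


def connected_components(nodes, links):
    """Group nodes that are reachable from each other (a stand-in for clustering)."""
    adjacency = defaultdict(set)
    for a, b, _ in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


nodes, links = build_network(tag_counts, correlations)
groups = connected_components(nodes, links)
print(groups)
```

Connected components can only separate groups that share no links at all, so on a densely connected network like the real one they would lump most tags together. A random-walk method such as walktrap can find subgroups even when clusters are joined by a few links.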
There is so much we can see by exploring this network! One thing we can notice is subgroups within the network that show us tech ecosystems, some of them densely interconnected. We see some groups made up of:
- Front-end web development technologies from HTML to JavaScript to Bootstrap
- Microsoft-related technologies including C#, .NET, and SQL Server
- DevOps technologies like AWS and Docker (Go is in this cluster!)
- Mobile technologies including Android and Objective-C
Where are the technologies that you use, and how are they connected? You can explore this network for yourself; the network data structure is publicly available as a dataset at Kaggle. You can check out the Kaggle kernel I created to show how to use the network nodes and links to create a network graph.
Another thing we can notice in this network is that some technologies act as bridges between tech ecosystems. Python, one of the most commonly used languages on Developer Stories, connects to the front-end cluster (through Django), to a Linux/systems administration cluster, to a C/C++/embedded cluster, and to R and machine learning. We see time and again how unique a language Python is becoming in today’s technology landscape. Java, git, and JSON are other “bridge” technologies that connect parts of this network.
This analysis used the liked tags on Developer Stories to explore the rich, complex network of technologies that we work within. When developers share who we are as professionals in ways that we actually care about, like with the technologies we want to use, we can all learn more about the developer community. You can make your own Developer Story today and highlight your career, interests, and what technologies you want to work with.
20 Comments
Is this going to be a daily ritual where we use R and come up with useless posts about languages and such? This is way too common from now on.
The Network of Correlation graph drags Firefox to a crawl, almost a freeze, for quite some time. Could you offload that to its own webpage and link to it, maybe including a simple screenshot here instead?
What happened with Java? I guess it is somewhere on the network, but I cannot find it.
Yep, it’s there! Biggest pale blue dot
I guess it is just a little shy :p https://uploads.disquscdn.com/images/ef61582131a95bc1cfda23e287d00445a6d3f3dfff6f5fcefdc3645e2ca97509.png
That interactive network is too sluggish (Firefox). Can’t it be made to work faster? If the Java applet “Graph Layout” (http://www.oracle.com/technetwork/java/example1-136039.html) worked fine in 1996, why not 20 years later?
Could you put out a full screen version of that graph? At any zoom wide enough to see even a single full cluster the tag names are an overlapping, unreadable, mess.
Why is this blog post on all Stack Exchange sites? I got here from Movies.SE. This has nothing at all to do with movies.
You’re absolutely right. We’ll set something up to only push posts like this out to technology-related sites.
Love this stuff, keep it coming!
So you are allowed to post a comment on this blog but not the “3 Ways You Can Be an Ally to Women in Tech”. How are we as programmers going to be a part of fighting sexism in our field and at our workplaces if we don’t get a say? Where absenting opinions are labeled “tirades”. I’m not surprised radical feminism has seeped into SO. The site just gets worse every year.
Doesn’t sound like you’re interested in fighting sexism, my man.
“The site just gets worse every year. Ratings are going down the tubes. Terrible website!”
My dislike of SO the recent years has nothing to do with feminism. But thanks for judging me based on absolutely nothing. I would have been more than happy to explain to you why if you had just asked. Nor did you address any of my arguments related to my post.
sexism is precisely what this blogger represents. She is dividing men and women by stating they’re different and thus women need special treatment or otherwise they cannot manage in tech industry.
Why is the comments section for https://stackoverflow.blog/2017/10/04/3-ways-can-ally-women-tech/?cb=1 disabled?
Where is freedom of speech?!
well because then someone could point that if we are the same, women and men, why do women need to be treated in a special way? And if not, they would need to acknowledge that MAYBE women are in fact different and they don’t feel like going in tech industry.
I am mystified by some of the comments below, having actually read the post. 1. It does not implicitly or explicitly promote radical feminism. 2. It is not a “useless post about languages and such” — it is an exploration of current patterns in the industry and what they might mean. This will interest some people; and not interest others. Here’s a reading strategy for busy people: look at the title, the headings, and quickly skim the opening section. In 10 seconds or less you can decide whether or not to read the entire post. If the post doesn’t add value for you, don’t read it. Ok? But advancing off-topic agendas in the comments is not as helpful as you might think. If hater-ade is your energy drink of choice, be aware of the side-effects and after-effects: which include isolation, alienation, decreased productivity, burnout, and the perceived inability to work well with others or function as part of a team.
Such an ecosystem was developed a long time ago, like https://graphofknowledge.appspot.com