\n\nFirst look at topic 5. That topic is all English words, not terms from code; the topic model has fit one topic that is not specific to any tag, programming language, or technology used on Stack Overflow but instead aligns with the text people use to talk about their questions. Next, look at topic 3; most of those words look very general to me and applicable to almost all technologies (\"file\", \"error\", \"server\", and so forth). Last, look through some of the other collections of terms. For some tech ecosystems that I am familiar with, these collections of terms make sense together.\n\nWhat if there are words you are interested in, but that you don't see in these plots? We can use tidy data principles to find which topic any word has the highest probability of being generated from. For example, \"git\" and \"docker\" are most likely to be generated from topic 3, \"boost\" is most likely to be generated from topic 10, and \"ggplot2\" (my own personal favorite data visualization tool!) is most likely to be generated from topic 4.\n\n \n\n\u003Ch2>Connecting to tags\u003C/h2>\n\nWe can look at this from a different angle because each question on Stack Overflow has a tag, like \"r\" or \"c#\" or \"sql\". The topic model estimates a probability that each document belongs to each topic; it's the estimated proportion of words from that document that are generated from that topic. We know the tags for each document, so let's examine which tags are associated with each topic.\n\n\u003Cimg class=\"aligncenter size-large wp-image-7369\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/top_tags-1-1024x819.png\" alt=\"\" width=\"1024\" height=\"819\" />\n\nRemember that topic 5 was the one that corresponded to English words where users discuss and describe their problem, so that is a measure of something different than the other topics. Topic 1 looks like front-end web development, topic 4 is databases, topic 10 is C and low-level programming, and so forth. 
Remember, the tags did \u003Cem>not\u003C/em> go into the unsupervised modeling process; we are just looking at them after the fact. The topic modeling process has taken the raw text of Stack Overflow questions and discovered underlying patterns and structure. This is what topic modeling does, whether you are looking at \u003Ca href=\"http://tidytextmining.com/nasa.html#topic-modeling\">NASA metadata\u003C/a> or \u003Ca href=\"http://tidytextmining.com/topicmodeling.html#library-heist\">classic literature\u003C/a>.\n\nLet's look at a few real examples from this dataset so you can see how this worked out. Each of the following questions is part of the \u003Ca href=\"https://www.kaggle.com/stackoverflow/stacksample/\">StackSample\u003C/a> dataset and this particular topic model.\n\n\u003Ca href=\"https://stackoverflow.com/questions/24049020/nsnotificationcenter-addobserver-in-swift\">\u003Cimg class=\"aligncenter wp-image-7367 size-large\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/ios_question-1024x442.png\" alt=\"\" width=\"1024\" height=\"442\" />\u003C/a>\n\nThis \u003Ca href=\"https://stackoverflow.com/questions/24049020/nsnotificationcenter-addobserver-in-swift\">first example question\u003C/a> is relatively short, and the topic model estimates that it is 91% topic 12 and 6% topic 3. Looks good! 
I don't see many of the top 10 terms from the first plot in this blog post for topic 12 here, but the topic model has classified it into the topic that is dominated by iOS, Objective-C, iPhone, and Swift.\n\n\u003Ca href=\"https://stackoverflow.com/questions/30216000/why-is-faster-than-list\">\u003Cimg class=\"aligncenter wp-image-7368 size-large\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/python_question-1024x669.png\" alt=\"\" width=\"1024\" height=\"669\" />\u003C/a>\n\nOur \u003Ca href=\"https://stackoverflow.com/questions/30216000/why-is-faster-than-list\">second example question\u003C/a> is longer, and the topic model estimates that it is 82% topic 5 and 18% topic 7. This question has a lot of English text and not much code, and that is reflected by the modeling. The model has chosen topic 7, dominated by Python and Django, for this question.\n\n\u003Ca href=\"//stackoverflow.com/questions/17247880/getting-associated-type-synonyms-with-template-haskell\">\u003Cimg class=\"aligncenter wp-image-7366 size-large\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/haskell_question-1024x953.png\" alt=\"\" width=\"1024\" height=\"953\" />\u003C/a>\n\nLast, let's look at this \u003Ca href=\"https://stackoverflow.com/questions/17247880/getting-associated-type-synonyms-with-template-haskell\">Haskell question\u003C/a>. Haskell is a sparsely used tag, and did not show up in the plot of top tags for topics at all. Where did this question land? The model estimates that this question is 63% topic 5 and 36% topic 10, with a tiny smidge of topic 7. I actually really like that the model has done this, putting Haskell in with low-level tags like C++/C, arrays, and pointers.\n\nA model like this is not just for analysis; it can be used to make predictions or implement new ideas. For example, one idea for Stack Overflow would be to automatically suggest a list of possible tags for new questions based on the text of a question. 
It looks like such a feature would work best for questions with at least some code and would be less accurate suggesting tags for questions that are almost all English words, or for very unusual tags. If there are any particular questions or tags \u003Cem>you\u003C/em> would like to explore yourself, fork the \u003Ca href=\"https://www.kaggle.com/juliasilge/topic-modeling-of-questions/\">kernel on Kaggle\u003C/a> and build a topic model yourself!\n\n\u003Cspan style=\"font-weight: 400;\">An approach like topic modeling can provide a way to get from raw text to a deeper understanding of unstructured data, even when we don’t know ahead of time what kind of organization or topics there may be in our text.\u003C/span> In \u003Ca href=\"http://amzn.to/2tZkmxG\">our book\u003C/a>, we discuss these and other text mining tasks, from the nitty-gritty of converting back and forth between common text data structures to \u003Ca href=\"https://www.kaggle.com/juliasilge/tf-idf-of-stack-overflow-questions/\">measuring tf-idf\u003C/a> to sentiment analysis. \u003Cspan style=\"font-weight: 400;\">Adopting text mining practices like these allows us to quantitatively handle and understand text, and I put these same practices to the test in my daily work as a data scientist here at Stack Overflow. 
I love working with text data, and I apply tools exactly like these to real-world data from the developer community, learning about developers worldwide and helping clients make decisions about hiring and engaging with developers.\u003C/span>\n\nIf you also love working with data, discover new opportunities in our \u003Ca href=\"https://stackoverflow.com/jobs/data-scientist-jobs?utm_source=so-owned&utm_medium=blog&utm_campaign=dev-c4al&utm_content=c4al-link\" target=\"_blank\" rel=\"noopener\">data scientist job\u003C/a> listings.