[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"sanity-MenMTstF-Iu-L1ebvgsGzddyZdjyECfUOQCLPOfKclo":3,"sanity-dPrTrGis_ry-T_l3BLitK1NxEnqmjh2BZdzySA9IcPM":458},{"data":4,"sourceMap":-1},{"latestPodcast":5,"latestReleases":14,"post":39,"recent":433},[6],{"_id":7,"publishedAt":8,"slug":9,"sponsored":12,"title":13},"50f4509c-4f55-4f11-8adc-5556e821ea77","2026-06-30T07:40:00.000Z",{"_type":10,"current":11},"slug","why-intent-prediction-needs-more-than-an-llm",null,"Why intent prediction needs more than an LLM",[15,21,27,33],{"_id":16,"publishedAt":17,"slug":18,"title":20},"eb5b66eb-9410-4329-83bb-22bbff39402a","2026-04-28T13:00:00.000Z",{"_type":10,"current":19},"turn-scattered-knowledge-into-trusted-intelligence","Turning scattered knowledge into trusted intelligence: Stack Internal 2026.3",{"_id":22,"publishedAt":23,"slug":24,"title":26},"369c2401-b62e-4a37-8ff8-bf603023ecad","2026-03-02T15:03:00.988Z",{"_type":10,"current":25},"what-s-new-at-stack-overflow-march-2026","What’s new at Stack Overflow: March 2026",{"_id":28,"publishedAt":29,"slug":30,"title":32},"5e9053a4-07ea-447c-91ea-29e0b6228537","2026-02-02T15:00:00.000Z",{"_type":10,"current":31},"what-s-new-at-stack-overflow-february-2026","What’s new at Stack Overflow: February 2026",{"_id":34,"publishedAt":35,"slug":36,"title":38},"a1b538eb-a8a6-46d0-80a1-ac70ec9bb935","2026-01-05T10:00:00.000-05:00",{"_type":10,"current":37},"what-s-new-at-stack-overflow-january-2026","What’s new at Stack Overflow: January 
2026",{"_createdAt":40,"_id":41,"_rev":42,"_type":43,"_updatedAt":44,"author":45,"body":62,"comments":396,"dateUrl":397,"excerpt":398,"image":399,"legacyBody":402,"product":12,"publishedAt":405,"slug":406,"sponsored":12,"tags":408,"title":432,"visible":396},"2023-05-25T09:39:10Z","wp-post-7365","9HpbCsT2tq0xwozQfkfs6x","blogPost","2023-07-13T14:55:11Z",[46],{"_createdAt":47,"_id":48,"_rev":49,"_type":50,"_updatedAt":51,"avatar":52,"employee":57,"name":58,"role":59,"slug":60},"2023-05-23T16:27:18Z","wp-author-125","9HpbCsT2tq0xwozQflrfOh","blogAuthor","2023-08-30T13:15:26Z",{"_type":53,"asset":54},"image",{"_ref":55,"_type":56},"image-8c4101ec5a80bd817bd18a920b025b97dff07164-1022x1024-jpg","reference","former","Julia Silge","Data Scientist",{"current":61},"juliasilge",[63,131,140,148,158,166,174,182,200,204,212,220,228,232,279,283,302,306,325,329],{"_key":64,"_type":65,"children":66,"markDefs":118,"style":130},"c1b9a0b28d85","block",[67,72,78,82,87,91,96,100,105,109,114],{"_key":68,"_type":69,"marks":70,"text":71},"c1b9a0b28d850","span",[],"This week, my fellow Stack Overflow data scientist David Robinson and I are happy to announce the publication of our book ",{"_key":73,"_type":69,"marks":74,"text":77},"c1b9a0b28d851",[75,76],"87675dca0c8e","em","Text Mining with R",{"_key":79,"_type":69,"marks":80,"text":81},"c1b9a0b28d852",[]," with ",{"_key":83,"_type":69,"marks":84,"text":86},"c1b9a0b28d853",[85],"dc415698b15f","O'Reilly",{"_key":88,"_type":69,"marks":89,"text":90},"c1b9a0b28d854",[],". We are so excited to see this project out in the world, and so relieved to finally be finished with it! Text data is being generated all the time around us, in healthcare, finance, tech, and beyond; text mining allows us to transform that unstructured text data into real insight that can increase understanding and inform decision-making. In our book, we demonstrate how using tidy data principles can make text mining easier and more effective. 
Let's mark this happy occasion with an exploration of Stack Overflow text data, and show how natural language processing techniques we cover in our book can be applied to real-world data to gain insight. For this analysis, I'll use Stack Overflow questions from ",{"_key":92,"_type":69,"marks":93,"text":95},"c1b9a0b28d855",[94],"e64ac0347257","StackSample",{"_key":97,"_type":69,"marks":98,"text":99},"c1b9a0b28d856",[],", a dataset of text from 10% of Stack Overflow questions and answers on programming topics that is freely available on ",{"_key":101,"_type":69,"marks":102,"text":104},"c1b9a0b28d857",[103],"54d56ce737ac","Kaggle",{"_key":106,"_type":69,"marks":107,"text":108},"c1b9a0b28d858",[],". The code that I'm using in this post is available as a ",{"_key":110,"_type":69,"marks":111,"text":113},"c1b9a0b28d859",[112],"61ecea7b75a7","kernel on Kaggle",{"_key":115,"_type":69,"marks":116,"text":117},"c1b9a0b28d8510",[],", so you can fork it for your own exploration. This analysis focuses only on questions posted on Stack Overflow, and uses topic modeling to dig into the text.",[119,122,124,126,128],{"_key":75,"_type":120,"href":121,"reference":12},"link","http://amzn.to/2tZkmxG",{"_key":85,"_type":120,"href":123,"reference":12},"http://www.jdoqocy.com/click-4428796-11290546?url=http%3A%2F%2Fshop.oreilly.com%2Fproduct%2F0636920067153.do%3Fcmp%3Daf-strata-books-video-product_cj_0636920067153_%25zp&cjsku=0636920067153",{"_key":94,"_type":120,"href":125,"reference":12},"https://www.kaggle.com/stackoverflow/stacksample/",{"_key":103,"_type":120,"href":127,"reference":12},"https://www.kaggle.com/",{"_key":112,"_type":120,"href":129,"reference":12},"https://www.kaggle.com/juliasilge/topic-modeling-of-questions/","normal",{"_key":132,"_type":65,"children":133,"markDefs":138,"style":139},"d6569c755425",[134],{"_key":135,"_type":69,"marks":136,"text":137},"d6569c7554250",[],"What is topic 
modeling?",[],"h2",{"_key":141,"_type":65,"children":142,"markDefs":147,"style":130},"628fbf64af36",[143],{"_key":144,"_type":69,"marks":145,"text":146},"628fbf64af360",[],"Topic modeling is a machine learning method for discovering \"topics\" that occur in a collection of documents. It is a powerful tool for organizing large collections of raw text. Topic modeling is an unsupervised method, which means that I as the analyst don't decide ahead of time what the topics will be about; we can find topics within text even if we're not sure what we're looking for ahead of time. Topic modeling can be used to discover underlying structure within text. In the context of the kind of topic model I'll implement (LDA topic modeling),",[],{"_key":149,"_type":65,"children":150,"level":155,"listItem":156,"markDefs":157,"style":130},"083be619b946",[151],{"_key":152,"_type":69,"marks":153,"text":154},"083be619b9460",[],"every document is a mixture of topics and",1,"bullet",[],{"_key":159,"_type":65,"children":160,"level":155,"listItem":156,"markDefs":165,"style":130},"da305db83415",[161],{"_key":162,"_type":69,"marks":163,"text":164},"da305db834150",[],"every topic is a mixture of words.",[],{"_key":167,"_type":65,"children":168,"markDefs":173,"style":130},"28abe77f49b9",[169],{"_key":170,"_type":69,"marks":171,"text":172},"28abe77f49b90",[],"Documents can share topics, and topics can share words, in any proportions. In our case for this analysis, each Stack Overflow question is a document. Let's imagine (for the sake of explanation) that there are two topics, one that is made up of the three words \"table\", \"select\", and \"join\" and a second that is made up of the three words \"function\", \"print\", and \"return.\" One question might be 100% topic 2, and another question might be 50% topic 1 and 50% topic 2. 
The statistical modeling process of topic modeling finds the topics in the text dataset we are dealing with, which words contribute to the topics, and which topics contribute to which documents.",[],{"_key":175,"_type":65,"children":176,"markDefs":181,"style":139},"44d4f2fedbee",[177],{"_key":178,"_type":69,"marks":179,"text":180},"44d4f2fedbee0",[],"Modeling Stack Overflow questions",[],{"_key":183,"_type":65,"children":184,"markDefs":198,"style":130},"4726e5ff3d1c",[185,189,194],{"_key":186,"_type":69,"marks":187,"text":188},"4726e5ff3d1c0",[],"For this blog post, I fit a model with 12 topics to this dataset. The question of how to choose the number of topics in topic modeling is a complicated one, but in this case, 12 topics gives us a good result for exploration. The process of building this topic model also involves cleaning text, removing stop words, and building a document-term matrix, all considerations covered in ",{"_key":190,"_type":69,"marks":191,"text":193},"4726e5ff3d1c1",[192],"b1259fbfb4ad","our book",{"_key":195,"_type":69,"marks":196,"text":197},"4726e5ff3d1c2",[],". One of the most compelling reasons to adopt tidy data principles when doing topic modeling is that we can easily explore which words contribute the most to which topics, and which topics contribute the most to which documents (questions on Stack Overflow, in this case). This is how we find out what kind of content corresponds to the topics fit by the model. Let's look at that for these specific questions. Which words are most important for each topic, in this model with 12 topics?",[199],{"_key":192,"_type":120,"href":121,"reference":12},{"_key":201,"_type":53,"alt":12,"asset":202,"markDefs":12},"ceabc06d8c8e",{"_ref":203,"_type":56},"image-cb7be42b8f3581dd98c57c311527d51aed2a6727-1024x922-png",{"_key":205,"_type":65,"children":206,"markDefs":211,"style":130},"5ca63027c522",[207],{"_key":208,"_type":69,"marks":209,"text":210},"5ca63027c5220",[],"First look at topic 5. 
That topic is all English words, not terms from code; the topic model has fit one topic that is not specific to any tag, programming language, or technology used on Stack Overflow but instead aligns with the text people use to talk about their questions. Next, look at topic 3; most of those words look very general to me and applicable to almost all technologies (\"file\", \"error\", \"server\", and so forth). Last, look through some of the other collections of terms. For some tech ecosystems that I am familiar with, these collections of terms make sense together. What if there are words you are interested in, but that you don't see in these plots? We can use tidy data principles to find which topic any word has the highest probability of being generated from. For example, \"git\" and \"docker\" are most likely to be generated from topic 3, \"boost\" is most likely to be generated from topic 10, and \"ggplot2\" (my own personal favorite data visualization tool!) is most likely to be generated from topic 4.",[],{"_key":213,"_type":65,"children":214,"markDefs":219,"style":139},"d1646f2b8277",[215],{"_key":216,"_type":69,"marks":217,"text":218},"d1646f2b82770",[],"Connecting to tags",[],{"_key":221,"_type":65,"children":222,"markDefs":227,"style":130},"d5f1d3e52fe0",[223],{"_key":224,"_type":69,"marks":225,"text":226},"d5f1d3e52fe00",[],"We can look at this from a different angle because each question on Stack Overflow has a tag, like \"r\" or \"c#\" or \"sql\". The topic model estimates a probability that each document belongs to each topic; it's the estimated proportion of words from that document that are generated from that topic. 
We know the tags for each document, so let's examine which tags are associated with each topic.",[],{"_key":229,"_type":53,"alt":12,"asset":230,"markDefs":12},"f9192af24b0d",{"_ref":231,"_type":56},"image-262e11b18b5d43b71999f9ebddf411f97d997aaf-1024x819-png",{"_key":233,"_type":65,"children":234,"markDefs":273,"style":130},"099ed4ccef82",[235,239,243,247,252,256,261,265,269],{"_key":236,"_type":69,"marks":237,"text":238},"099ed4ccef820",[],"Remember that topic 5 was the one that corresponded to English words where users discuss and describe their problem, so that is a measure of something different than the other topics. Topic 1 looks like front-end web development, topic 4 is databases, topic 10 is C and low-level programming, and so forth. Remember, the tags did ",{"_key":240,"_type":69,"marks":241,"text":242},"099ed4ccef821",[76],"not",{"_key":244,"_type":69,"marks":245,"text":246},"099ed4ccef822",[]," go into the unsupervised modeling process; we are just looking at them after the fact. The topic modeling process has taken the raw text of Stack Overflow questions and discovered underlying patterns and structure. This is what topic modeling does, whether you are looking at ",{"_key":248,"_type":69,"marks":249,"text":251},"099ed4ccef823",[250],"6ca67c657c44","NASA metadata",{"_key":253,"_type":69,"marks":254,"text":255},"099ed4ccef824",[]," or ",{"_key":257,"_type":69,"marks":258,"text":260},"099ed4ccef825",[259],"7249bca61656","classic literature",{"_key":262,"_type":69,"marks":263,"text":264},"099ed4ccef826",[],". Let's look at a few real examples from this dataset so you can see how this worked out. 
Each of the following questions is part of the ",{"_key":266,"_type":69,"marks":267,"text":95},"099ed4ccef827",[268],"7743a5f9002a",{"_key":270,"_type":69,"marks":271,"text":272},"099ed4ccef828",[]," dataset and this particular topic model.",[274,276,278],{"_key":250,"_type":120,"href":275,"reference":12},"http://tidytextmining.com/nasa.html#topic-modeling",{"_key":259,"_type":120,"href":277,"reference":12},"http://tidytextmining.com/topicmodeling.html#library-heist",{"_key":268,"_type":120,"href":125,"reference":12},{"_key":280,"_type":53,"alt":12,"asset":281,"markDefs":12},"72978344bf36",{"_ref":282,"_type":56},"image-57b8bf5fa4fbfeae2b1037e493a92427306a92d3-1024x442-png",{"_key":284,"_type":65,"children":285,"markDefs":299,"style":130},"55c6ee4398e1",[286,290,295],{"_key":287,"_type":69,"marks":288,"text":289},"55c6ee4398e10",[],"This ",{"_key":291,"_type":69,"marks":292,"text":294},"55c6ee4398e11",[293],"116578409993","first example question",{"_key":296,"_type":69,"marks":297,"text":298},"55c6ee4398e12",[]," is relatively short, and the topic model estimates that it is 91% topic 12 and 6% topic 3. Looks good! 
I don't see many of the top 10 terms from the first plot in this blog post for topic 12 here, but the topic model has classified it into the topic that is dominated by iOS, Objective-C, iPhone, and Swift.",[300],{"_key":293,"_type":120,"href":301,"reference":12},"https://stackoverflow.com/questions/24049020/nsnotificationcenter-addobserver-in-swift",{"_key":303,"_type":53,"alt":12,"asset":304,"markDefs":12},"e01354ea2906",{"_ref":305,"_type":56},"image-e6f592fb081445012cc14fc2bfa9b2c4071f49a9-1024x669-png",{"_key":307,"_type":65,"children":308,"markDefs":322,"style":130},"22c68af7fc8b",[309,313,318],{"_key":310,"_type":69,"marks":311,"text":312},"22c68af7fc8b0",[],"Our ",{"_key":314,"_type":69,"marks":315,"text":317},"22c68af7fc8b1",[316],"7c53446626ba","second example question",{"_key":319,"_type":69,"marks":320,"text":321},"22c68af7fc8b2",[]," is longer, and the topic model estimates that it is 82% topic 5 and 18% topic 7. This question has a lot of English text and not much code, and that is reflected by the modeling. The model has chosen topic 7, dominated by Python and Django, for this question.",[323],{"_key":316,"_type":120,"href":324,"reference":12},"https://stackoverflow.com/questions/30216000/why-is-faster-than-list",{"_key":326,"_type":53,"alt":12,"asset":327,"markDefs":12},"667c06cc7002",{"_ref":328,"_type":56},"image-48a3d6cba036b7b53e2c6c7471ff04e687956896-1024x953-png",{"_key":330,"_type":65,"children":331,"markDefs":387,"style":130},"1b7c120d247b",[332,336,341,345,349,353,357,361,365,369,374,378,383],{"_key":333,"_type":69,"marks":334,"text":335},"1b7c120d247b0",[],"Last, let's look at this ",{"_key":337,"_type":69,"marks":338,"text":340},"1b7c120d247b1",[339],"05f8a3c699e6","Haskell question",{"_key":342,"_type":69,"marks":343,"text":344},"1b7c120d247b2",[],". Haskell is a sparsely used tag, and did not show up in the plot of top tags for topics at all. Where did this question land? 
The model estimates that this question is 63% topic 5 and 36% topic 10, with a tiny smidge of topic 7. I actually really like that the model has done this, putting Haskell in with low-level tags like C++/C, arrays, and pointers. A model like this is not just for analysis; it can be used to make predictions or implement new ideas. For example, one idea for Stack Overflow would be to automatically suggest a list of possible tags for new questions based on the text of a question. It looks like such a feature would work best for questions with at least some code and would be less accurate suggesting tags for questions that are almost all English words, or for very unusual tags. If there are any particular questions or tags ",{"_key":346,"_type":69,"marks":347,"text":348},"1b7c120d247b3",[76],"you",{"_key":350,"_type":69,"marks":351,"text":352},"1b7c120d247b4",[]," would like to explore yourself, fork the ",{"_key":354,"_type":69,"marks":355,"text":113},"1b7c120d247b5",[356],"d8c05fb17a8d",{"_key":358,"_type":69,"marks":359,"text":360},"1b7c120d247b6",[]," and build a topic model yourself! An approach like topic modeling can provide a way to get from raw text to a deeper understanding of unstructured data, even when we don’t know ahead of time what kind of organization or topics there may be in our text. In ",{"_key":362,"_type":69,"marks":363,"text":193},"1b7c120d247b7",[364],"f052ae9dc5d9",{"_key":366,"_type":69,"marks":367,"text":368},"1b7c120d247b8",[],", we discuss these and other text mining tasks, from the nitty gritty of converting back and forth between common text data structures to ",{"_key":370,"_type":69,"marks":371,"text":373},"1b7c120d247b9",[372],"4d4ca79816ba","measuring tf-idf",{"_key":375,"_type":69,"marks":376,"text":377},"1b7c120d247b10",[]," to sentiment analysis. 
Adopting text mining practices like these allows us to quantitatively handle and understand text, and I put these same practices to the test in my daily work as a data scientist here at Stack Overflow. I love working with text data, and I apply tools exactly like these to real-world data from the developer community, learning about developers worldwide and helping clients make decisions about hiring and engaging with developers. If you also love working with data, discover new opportunities in our ",{"_key":379,"_type":69,"marks":380,"text":382},"1b7c120d247b11",[381],"d36c74939dd1","data scientist job",{"_key":384,"_type":69,"marks":385,"text":386},"1b7c120d247b12",[]," listings.",[388,390,391,392,394],{"_key":339,"_type":120,"href":389,"reference":12},"https://stackoverflow.com/questions/17247880/getting-associated-type-synonyms-with-template-haskell",{"_key":356,"_type":120,"href":129,"reference":12},{"_key":364,"_type":120,"href":121,"reference":12},{"_key":372,"_type":120,"href":393,"reference":12},"https://www.kaggle.com/juliasilge/tf-idf-of-stack-overflow-questions/",{"_key":381,"_type":120,"href":395,"reference":12},"https://stackoverflow.com/jobs/data-scientist-jobs?utm_source=so-owned&utm_medium=blog&utm_campaign=dev-c4al&utm_content=c4al-link",true,"2017/07/06","",{"_type":53,"asset":400},{"_ref":401,"_type":56},"image-4e4c25532303172775b6cf3715e6ddb906205b8f-1200x675-png",{"code":403,"language":404},"This week, my fellow Stack Overflow data scientist David Robinson and I are happy to announce the publication of our book \u003Ca href=\"http://amzn.to/2tZkmxG\">\u003Cem>Text Mining with R\u003C/em>\u003C/a> with \u003Ca href=\"http://www.jdoqocy.com/click-4428796-11290546?url=http%3A%2F%2Fshop.oreilly.com%2Fproduct%2F0636920067153.do%3Fcmp%3Daf-strata-books-video-product_cj_0636920067153_%25zp&amp;cjsku=0636920067153\">O'Reilly\u003C/a>. We are so excited to see this project out in the world, and so relieved to finally be finished with it! 
Text data is being generated all the time around us, in healthcare, finance, tech, and beyond; text mining allows us to transform that unstructured text data into real insight that can increase understanding and inform decision-making. In our book, we demonstrate how using tidy data principles can make text mining easier and more effective. Let's mark this happy occasion with an exploration of Stack Overflow text data, and show how natural language processing techniques we cover in our book can be applied to real-world data to gain insight.\n\nFor this analysis, I'll use Stack Overflow questions from \u003Ca href=\"https://www.kaggle.com/stackoverflow/stacksample/\">StackSample\u003C/a>, a dataset of text from 10% of Stack Overflow questions and answers on programming topics that is freely available on \u003Ca href=\"https://www.kaggle.com/\">Kaggle\u003C/a>. The code that I'm using in this post is available as a \u003Ca href=\"https://www.kaggle.com/juliasilge/topic-modeling-of-questions/\">kernel on Kaggle\u003C/a>, so you can fork it for your own exploration.\n\nThis analysis focuses only on questions posted on Stack Overflow, and uses topic modeling to dig into the text.\n\n\u003Ch2>What is topic modeling?\u003C/h2>\n\nTopic modeling is a machine learning method for discovering \"topics\" that occur in a collection of documents. It is a powerful tool for organizing large collections of raw text. Topic modeling is an unsupervised method, which means that I as the analyst don't decide ahead of time what the topics will be about; we can find topics within text even if we're not sure what we're looking for ahead of time. Topic modeling can be used to discover underlying structure within text. 
In the context of the kind of topic model I'll implement (LDA topic modeling),\n\n\u003Cul>\n\u003Cli>every document is a mixture of topics and\u003C/li>\n\u003Cli>every topic is a mixture of words.\u003C/li>\n\u003C/ul>\n\nDocuments can share topics, and topics can share words, in any proportions. In our case for this analysis, each Stack Overflow question is a document. Let's imagine (for the sake of explanation) that there are two topics, one that is made up of the three words \"table\", \"select\", and \"join\" and a second that is made up of the three words \"function\", \"print\", and \"return.\" One question might be 100% topic 2, and another question might be 50% topic 1 and 50% topic 2. The statistical modeling process of topic modeling finds the topics in the text dataset we are dealing with, which words contribute to the topics, and which topics contribute to which documents.\n\n\u003Ch2>Modeling Stack Overflow questions\u003C/h2>\n\nFor this blog post, I fit a model with 12 topics to this dataset. The question of how to choose the number of topics in topic modeling is a complicated one, but in this case, 12 topics gives us a good result for exploration. The process of building this topic model also involves cleaning text, removing stop words, and building a document-term matrix, all considerations covered in \u003Ca href=\"http://amzn.to/2tZkmxG\">our book\u003C/a>.\n\nOne of the most compelling reasons to adopt tidy data principles when doing topic modeling is that we can easily explore which words contribute the most to which topics, and which topics contribute the most to which documents (questions on Stack Overflow, in this case). \u003Cspan style=\"font-weight: 400;\">This is how we find out what kind of content corresponds to the topics fit by the model. \u003C/span>Let's look at that for these specific questions. 
Which words are most important for each topic, in this model with 12 topics?\n\n\u003Cimg class=\"aligncenter size-large wp-image-7370\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/top_terms-1-1024x922.png\" alt=\"\" width=\"1024\" height=\"922\" />\n\nFirst look at topic 5. That topic is all English words, not terms from code; the topic model has fit one topic that is not specific to any tag, programming language, or technology used on Stack Overflow but instead aligns with the text people use to talk about their questions. Next, look at topic 3; most of those words look very general to me and applicable to almost all technologies (\"file\", \"error\", \"server\", and so forth). Last, look through some of the other collections of terms. For some tech ecosystems that I am familiar with, these collections of terms make sense together.\n\nWhat if there are words you are interested in, but that you don't see in these plots? We can use tidy data principles to find which topic any word has the highest probability of being generated from. For example, \"git\" and \"docker\" are most likely to be generated from topic 3, \"boost\" is most likely to be generated from topic 10, and \"ggplot2\" (my own personal favorite data visualization tool!) is most likely to be generated from topic 4.\n\n&nbsp;\n\n\u003Ch2>Connecting to tags\u003C/h2>\n\nWe can look at this from a different angle because each question on Stack Overflow has a tag, like \"r\" or \"c#\" or \"sql\". The topic model estimates a probability that each document belongs to each topic; it's the estimated proportion of words from that document that are generated from that topic. 
We know the tags for each document, so let's examine which tags are associated with each topic.\n\n\u003Cimg class=\"aligncenter size-large wp-image-7369\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/top_tags-1-1024x819.png\" alt=\"\" width=\"1024\" height=\"819\" />\n\nRemember that topic 5 was the one that corresponded to English words where users discuss and describe their problem, so that is a measure of something different than the other topics. Topic 1 looks like front-end web development, topic 4 is databases, topic 10 is C and low-level programming, and so forth. Remember, the tags did \u003Cem>not\u003C/em> go into the unsupervised modeling process; we are just looking at them after the fact. The topic modeling process has taken the raw text of Stack Overflow questions and discovered underlying patterns and structure. This is what topic modeling does, whether you are looking at \u003Ca href=\"http://tidytextmining.com/nasa.html#topic-modeling\">NASA metadata\u003C/a> or \u003Ca href=\"http://tidytextmining.com/topicmodeling.html#library-heist\">classic literature\u003C/a>.\n\nLet's look at a few real examples from this dataset so you can see how this worked out. Each of the following questions is part of the \u003Ca href=\"https://www.kaggle.com/stackoverflow/stacksample/\">StackSample\u003C/a> dataset and this particular topic model.\n\n\u003Ca href=\"https://stackoverflow.com/questions/24049020/nsnotificationcenter-addobserver-in-swift\">\u003Cimg class=\"aligncenter wp-image-7367 size-large\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/ios_question-1024x442.png\" alt=\"\" width=\"1024\" height=\"442\" />\u003C/a>\n\nThis \u003Ca href=\"https://stackoverflow.com/questions/24049020/nsnotificationcenter-addobserver-in-swift\">first example question\u003C/a> is relatively short, and the topic model estimates that is 91% topic 12 and 6% topic 3. Looks good! 
I don't see many of the top 10 terms from the first plot in this blog post for topic 12 here, but the topic model has classified it into the topic that is dominated by iOS, Objective-C, iPhone, and Swift.\n\n\u003Ca href=\"https://stackoverflow.com/questions/30216000/why-is-faster-than-list\">\u003Cimg class=\"aligncenter wp-image-7368 size-large\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/python_question-1024x669.png\" alt=\"\" width=\"1024\" height=\"669\" />\u003C/a>\n\nOur \u003Ca href=\"https://stackoverflow.com/questions/30216000/why-is-faster-than-list\">second example question\u003C/a> is longer, and the topic model estimates that it is 82% topic 5 and 18% topic 7. This question has a lot of English text and not much code, and that is reflected by the modeling. The model has chosen topic 7, dominated by Python and Django, for this question.\n\n\u003Ca href=\"//stackoverflow.com/questions/17247880/getting-associated-type-synonyms-with-template-haskell\">\u003Cimg class=\"aligncenter wp-image-7366 size-large\" src=\"https://stackoverflow.blog/wp-content/uploads/2017/06/haskell_question-1024x953.png\" alt=\"\" width=\"1024\" height=\"953\" />\u003C/a>\n\nLast, let's look at this \u003Ca href=\"https://stackoverflow.com/questions/17247880/getting-associated-type-synonyms-with-template-haskell\">Haskell question\u003C/a>. Haskell is a sparsely used tag, and did not show up in the plot of top tags for topics at all. Where did this question land? The model estimates that this question is 63% topic 5 and 36% topic 10, with a tiny smidge of topic 7. I actually really like that the model has done this, putting Haskell in with low-level tags like C++/C, arrays, and pointers.\n\nA model like this is not just for analysis; it can be used to make predictions or implement new ideas. For example, one idea for Stack Overflow would be to automatically suggest a list of possible tags for new questions based on the text of a question. 
It looks like such a feature would work best for questions with at least some code and would be less accurate suggesting tags for questions that are almost all English words, or for very unusual tags. If there are any particular questions or tags \u003Cem>you\u003C/em> would like to explore yourself, fork the \u003Ca href=\"https://www.kaggle.com/juliasilge/topic-modeling-of-questions/\">kernel on Kaggle\u003C/a> and build a topic model yourself!\n\n\u003Cspan style=\"font-weight: 400;\">An approach like topic modeling can provide a way to get from raw text to a deeper understanding of unstructured data, even when we don’t know ahead of time what kind of organization or topics there may be in our text.\u003C/span> In \u003Ca href=\"http://amzn.to/2tZkmxG\">our book\u003C/a>, we discuss these and other text mining tasks, from the nitty gritty of converting back and forth between common text data structures to \u003Ca href=\"https://www.kaggle.com/juliasilge/tf-idf-of-stack-overflow-questions/\">measuring tf-idf\u003C/a> to sentiment analysis. \u003Cspan style=\"font-weight: 400;\">Adopting text mining practices like these allow us to quantitatively handle and understand text, and I put these same practices to the test in my daily work as a data scientist here at Stack Overflow. 
I love working with text data, and I apply tools exactly like these to real-world data from the developer community, learning about developers worldwide and helping clients make decisions about hiring and engaging with developers.\u003C/span>\n\nIf you also love working with data, discover new opportunities in our \u003Ca href=\"https://stackoverflow.com/jobs/data-scientist-jobs?utm_source=so-owned&amp;utm_medium=blog&amp;utm_campaign=dev-c4al&amp;utm_content=c4al-link\" target=\"_blank\" rel=\"noopener\">data scientist job\u003C/a> listings.","html","2017-07-06T12:00:13.000Z",{"current":407},"text-mining-stack-overflow-questions",[409,417,422,427],{"_createdAt":410,"_id":411,"_rev":412,"_type":413,"_updatedAt":410,"slug":414,"title":416},"2023-05-23T16:43:21Z","wp-tagcat-announcements","9HpbCsT2tq0xwozQfkc4ih","blogTag",{"current":415},"announcements","Announcements",{"_createdAt":410,"_id":418,"_rev":412,"_type":413,"_updatedAt":410,"slug":419,"title":421},"wp-tagcat-background",{"current":420},"background","Background",{"_createdAt":410,"_id":423,"_rev":412,"_type":413,"_updatedAt":410,"slug":424,"title":426},"wp-tagcat-engineering",{"current":425},"engineering","Engineering",{"_createdAt":410,"_id":428,"_rev":412,"_type":413,"_updatedAt":410,"slug":429,"title":431},"wp-tagcat-insights",{"current":430},"insights","Insights","Text Mining of Stack Overflow Questions",[434,440,446,452],{"_id":435,"publishedAt":436,"slug":437,"sponsored":12,"title":439},"28e560af-f0aa-4d46-bd90-f435ad604aa7","2026-06-26T14:00:27.102Z",{"_type":10,"current":438},"paging-charity-how-can-engineering-leaders-avoid-becoming-bond-villains","Paging Charity! How can engineering leaders avoid becoming Bond villains?",{"_id":441,"publishedAt":442,"slug":443,"sponsored":12,"title":445},"4b22c2a3-3779-4966-93eb-5230391dbdce","2026-06-23T14:08:58.595Z",{"_type":10,"current":444},"your-ai-shipped-a-backend-that-boots-that-is-the-whole-problem","Your AI shipped a backend that boots. 
That is the whole problem.",{"_id":447,"publishedAt":448,"slug":449,"sponsored":12,"title":451},"5cf362e1-fe7b-45af-b69c-914731c6a052","2026-06-23T14:00:00.000Z",{"_type":10,"current":450},"the-2026-developer-survey-is-now-open-for-human-developers-only","The 2026 Developer Survey is now open (for human developers only)!",{"_id":453,"publishedAt":454,"slug":455,"sponsored":12,"title":457},"30b995f7-7cb9-4dd8-bf71-d0685940a32b","2026-06-19T14:00:00.000Z",{"_type":10,"current":456},"dispatches-from-o-reilly-from-capabilities-to-responsibilities","Dispatches from O'Reilly: From capabilities to responsibilities",{"data":459,"sourceMap":-1},{"count":460,"lastTimestamp":461},15,"2023-05-25T09:46:19Z"]