Making Sense of the Metadata: Clustering 4,000 Stack Overflow tags with BigQuery k-means

How would you group more than 4,000 active Stack Overflow tags into meaningful groups? This is a perfect task for unsupervised learning and k-means clustering — and now you can do all this inside BigQuery. Let’s find out how.

Visualizing a universe of clustered tags.
Felipe Hoffa is a Developer Advocate for Google Cloud. In this post he works with BigQuery – Google’s serverless data warehouse – to run k-means clustering over Stack Overflow’s published dataset, which is refreshed and uploaded to Google’s Cloud once a quarter. You can check out more about working with Stack Overflow data and BigQuery here and here.

4,000+ tags are a lot

These are the most active Stack Overflow tags since 2018 — they’re a lot. In this picture I only have 240 tags — how would you group and categorize 4,000+ of them?
# Tags with >180 questions since 2018
SELECT tag, COUNT(*) questions
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`,
  UNNEST(SPLIT(tags, '|')) tag
WHERE creation_date > '2018-01-01'
GROUP BY 1
HAVING questions>180
ORDER BY 2 DESC
Top Stack Overflow tags by number of questions.

Hint: Co-occurring tags

Let’s find tags that usually go together:
Co-occurring tags on Stack Overflow questions
These groupings make sense:
  • ‘javascript’ is related to ‘html’.
  • ‘python’ is related to ‘pandas’.
  • ‘c#’ is related to ‘.net’.
  • ‘typescript’ is related to ‘angular’.
  • etc…
So I’ll take these relationships and I’ll save them on an auxiliary table — plus a percentage of how frequently a relationship happens for each tag.
CREATE OR REPLACE TABLE `deleting.stack_overflow_tag_co_ocurrence`
AS
WITH data AS (
  SELECT * 
  FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
  WHERE creation_date > '2018-01-01'
), active_tags AS (
  SELECT tag, COUNT(*) c
  FROM data, UNNEST(SPLIT(tags, '|')) tag
  GROUP BY 1
  HAVING c>180
)

SELECT *, questions/questions_tag1 percent
FROM (
    SELECT *, MAX(questions) OVER(PARTITION BY tag1) questions_tag1
    FROM (
        SELECT tag1, tag2, COUNT(*) questions
        FROM data, UNNEST(SPLIT(tags, '|')) tag1, UNNEST(SPLIT(tags, '|')) tag2
        WHERE tag1 IN (SELECT tag FROM active_tags)
        AND tag2 IN (SELECT tag FROM active_tags)
        GROUP BY 1,2
        HAVING questions>30
    )
)

One-hot encoding

Now get ready for some SQL magic. BigQuery ML does a good job of hot-encoding strings, but it doesn’t handle arrays as I wish it did (stay tuned). So I’m going to create a string first that will define all the columns where I want to find co-occurrence. Then I can use that string to get a huge table, with a 1 for every time a tag co-occurs with the main one at least certain % of time. Let’s see first a subset of these results:
What you see here is a co-occurrence matrix:
  • ‘javascript’ shows a relation to ‘php’, ‘html’, ‘css’, ‘node.js’, and ‘jquery’.
  • ‘android’ shows a relation to ‘java’.
  • ‘machine-learning’ shows a relation to ‘python’, but not the other way around.
  • ‘multi-threading’ shows a relation to ‘python’, ‘java’, ‘c#’, and ‘android`
  • ‘unit-testing’ a relation to almost every column here, except to ‘php’, ‘html’, ‘css’, and ‘jquery’.
You can reduce or augment the sensibility of these relations with the percent threshold:
SELECT tag1 
,IFNULL(ANY_VALUE(IF(tag2='javascript',1,null)),0) Xjavascript
,IFNULL(ANY_VALUE(IF(tag2='python',1,null)),0 ) Xpython
,IFNULL(ANY_VALUE(IF(tag2='java',1,null)),0) Xjava
,IFNULL(ANY_VALUE(IF(tag2='c#',1,null)),0) XcH
,IFNULL(ANY_VALUE(IF(tag2='android',1,null)),0) Xandroid
,IFNULL(ANY_VALUE(IF(tag2='php',1,null)),0) Xphp
,IFNULL(ANY_VALUE(IF(tag2='html',1,null)),0) Xhtml
,IFNULL(ANY_VALUE(IF(tag2='css',1,null)),0) Xcss
,IFNULL(ANY_VALUE(IF(tag2='node.js',1,null)),0) XnodeDjs
,IFNULL(ANY_VALUE(IF(tag2='jquery',1,null)),0) Xjquery
,SUM(questions) questions_tag1
FROM `deleting.stack_overflow_tag_co_ocurrence`
WHERE percent>0.03
GROUP BY tag1
ORDER BY questions_tag1 DESC 
LIMIT 100

K-means clustering time

Now — instead of using this small table, let’s use the whole table to compute k-means with BigQuery. With this line, I’m creating a one-hot encoding string that I can use later to define the 4,000+ columns I’ll use for k-means:
one_hot_big = client.query("""
SELECT STRING_AGG(
FORMAT("IFNULL(ANY_VALUE(IF(tag2='%s',1,null)),0)X%s", tag2, REPLACE(REPLACE(REPLACE(REPLACE(tag2,'-','_'),'.','D'),'#','H'),'+','P'))
) one_hot
FROM (
SELECT tag2, SUM(questions) questions
FROM `deleting.stack_overflow_tag_co_ocurrence`
GROUP BY tag2
# ORDER BY questions DESC
# LIMIT 10
)
""").to_dataframe().iloc[0]['one_hot']
And training a k-means model in BigQuery is really easy:
CREATE MODEL `deleting.kmeans_tagsubtag_50_big_a_01`
OPTIONS ( 
    model_type='kmeans',
    distance_type='COSINE',
    num_clusters=50 )
AS
WITH tag_and_subtags AS (
    SELECT tag1, %s
    FROM `deleting.stack_overflow_tag_co_ocurrence`
    WHERE percent>0.03
    GROUP BY tag1
)
SELECT * EXCEPT(tag1) 
FROM tag_and_subtags
Now we wait — while BigQuery shows us the progress of our training:
And when it’s done, we even get an evaluation of our model:

Davies–Bouldin index: 1.8530
Mean squared distance: 0.8174

Performance note

Do we really need 4,000 one-hot encoded dimensions to obtain better clusters? Turns out that 500 are enough — and I like the results better. It also reduces the time for training the model in BigQuery from 24 minutes to 3. The same with only 30 dimensions lowers the time to 90 seconds — but I like the results better with 500. More on hyper-parameter tuning below.
500 one-hot encoded dimensions reduces time per iteration to 30 seconds, and a lower loss.
Davies–Bouldin index: 1.6910
Mean squared distance: 0.52332

Get ready for the results: The 50 clusters are…

Now it’s time to see our results. I’ll even look out for some tags I’m interested in: How are googleamazon, and azure represented in each cluster? These are the 50 groups that k-means clustering found — given the 1-hot encoding of related tags we did earlier in this post. Some results make a lot of sense — while others give great insight into what are the prevalent surrounding technologies to any Stack Overflow tag. Naming each centroid is always a challenge. Here I used the top 5 centroid weight vectors — see how below.

centroid 45: amazon-web-services, aws-lambda, amazon-s3, amazon-ec2, python
—–
amazon-web-services, amazon-s3, aws-lambda, amazon-ec2, amazon-dynamodb, terraform, aws-sdk, amazon-cloudformation, amazon-redshift, aws-api-gateway, amazon-cognito, boto3, cloud, alexa, amazon-rds, amazon-elastic-beanstalk, amazon-ecs, alexa-skills-kit, amazon-cloudfront, serverless, aws-cli, amazon-iam, amazon-cloudwatch, elastic-beanstalk, amazon-sqs, serverless-framework, amazon-athena, aws-amplify, aws-appsync, amazon-sns, alexa-skill, amazon-route53, amazon, amazon-kinesis, amazon-sagemaker, autoscaling, amazon-elb, amazon-ses, aws-cognito, aws-iot, terraform-provider-aws, api-gateway, amazon-vpc, aws-serverless, aws-codepipeline, aws-codebuild, amazon-rds-aurora, bitnami, amazon-lex, aws-step-functions, aws-code-deploy, aws-iam, aws-fargate, dynamodb-queries, boto

amazon: amazon-cognito, amazon-ses, amazon-redshift, aws-lambda, amazon-ecs, amazon-s3, amazon-web-services, amazon-athena, aws-api-gateway, amazon-rds, amazon, amazon-cloudfront, amazon-lex, aws-iot, amazon-elb, aws-code-deploy, amazon-cloudwatch, aws-cli

centroid 17: android, java, android-layout, android-recyclerview, kotlin
—–
android, json, xml, kotlin, android-studio, android-recyclerview, android-layout, android-fragments, xslt, serialization, android-intent, retrofit2, android-activity, android-room, nullpointerexception, retrofit, gson, android-volley, textview, android-viewpager, xml-parsing, recycler-adapter, android-edittext, android-sqlite, protocol-buffers, xsd, deserialization, android-constraintlayout, android-asynctask, fragment, android-architecture-components, android-livedata, imageview, scrollview, android-databinding, android-glide, android-animation, xquery, xslt-1.0, android-jetpack, android-manifest, navigation-drawer, adapter, bottomnavigationview, xslt-2.0, android-toolbar, onclicklistener, android-tablayout, android-cardview, android-spinner, android-adapter, picasso, android-linearlayout, transformation, android-drawable, android-architecture-navigation, android-imageview, android-custom-view, json-deserialization, android-view, android-actionbar, searchview, biztalk, android-coordinatorlayout, android-lifecycle, android-softkeyboard, floating-action-button, recyclerview-layout, swipe, android-relativelayout, android-xml, android-collapsingtoolbarlayout, android-button, android-scrollview, saxon, android-nestedscrollview, android-styles, xml-namespaces, xsl-fo, android-fragmentactivity, android-dialogfragment, android-viewholder, xml-serialization

centroid 7: android, java, javascript, ios, python
—–
android-gradle, bluetooth, rx-java2, build.gradle, dependencies, rx-java, google-play, sdk, android-ndk, corda, video-streaming, android-emulator, libgdx, android-webview, apk, location, java-native-interface, google-play-services, dagger-2, adb, codenameone, android-8.0-oreo, google-places-api, android-notifications, android-studio-3.0, broadcastreceiver, speech-recognition, arcore, sharedpreferences, streaming, gps, android-service, version, coordinates, androidx, native, sms, here-api, android-camera, android-permissions, uri, android-mediaplayer, locale, vert.x, exoplayer, google-maps-markers, settings, alarmmanager, spinner, proguard, okhttp3, text-to-speech, okhttp, updates, android-camera2, android-source, whatsapp, nfc, share, inputstream, google-fabric, xmpp, calculator, manifest, wifi, mpandroidchart, android-9.0-pie, rx-android, call, android-workmanager, mp4, hls, video-processing, release, barcode, android-support-library, alertdialog, android-viewmodel, dji-sdk, barcode-scanner, filepath, sip, google-cloud-messaging, gradle-plugin, android-arrayadapter, screen, payment, toolbar, google-play-console, dagger, mp3, indexoutofboundsexception, ejabberd, httpurlconnection, libraries, android-proguard, coroutine, h.264, simpledateformat, jacoco, background-process, rtsp, offline, root, sensor, splash-screen, android-bluetooth, android-testing, android-resources, android-tv, emulation, android-bitmap, android-listview, multipart, chromecast, android-broadcastreceiver, video-capture, google-maps-android-api-2, pojo, android-canvas, visibility, broadcast, google-play-games, dao, kotlin-android-extensions, avd, lint, android-jobscheduler, android-library, kotlinx.coroutines, firebase-mlkit, expandablelistview, obfuscation, android-contentprovider, appcelerator, mvp, live-streaming, in-app-billing, android-context, audio-streaming, arabic, android-alertdialog, kotlin-coroutines, zxing, android-videoview, fingerprint, braintree, audio-recording, deprecated, job-scheduling, android-wifi, wear-os, bottom-sheet, android-things, device, marker, right-to-left, google-login, mobile-application, media-player, countdowntimer, opengl-es-2.0, nullable, face-detection, exoplayer2.x, android-8.1-oreo, beacon, drawable, gradlew, mapbox-android, classnotfoundexception, parcelable, android-keystore, voice-recognition, toast, aar, google-places, android-theme, android-progressbar, paging, accelerometer, playback, gradle-kotlin-dsl, samsung-mobile, photo, ibeacon, android-appcompat, noclassdeffounderror, branch.io, rtmp, sceneform, foreground-service, google-cast, appcelerator-titanium, android-widget, logcat, android-pendingintent, android-fileprovider, android-gps, sha1, jodatime, android-sensors, android-appbarlayout, surfaceview, mpeg-dash, android-mvvm

google: google-maps-android-api-2, google-play-services, google-fabric, google-places, google-places-api, google-play-console, google-cast, google-login, google-maps-markers, google-play-games, google-play, google-cloud-messaging

centroid 28: angular, typescript, javascript, angular6, angular5
—–
angular, typescript, angular6, angular5, rxjs, angular-material, angular7, service, observable, routing, angular-cli, components, angular-reactive-forms, karma-jasmine, primeng, ag-grid, angularfire2, angular-material2, ngrx, reactive-programming, httpclient, angular-ui-router, angular-routing, lazy-loading, rxjs6, ngfor, angular-forms, angular-httpclient, angular2-routing, angular-components, angular-router, rx-swift, ng-bootstrap, angular2-forms, angular-universal, angular2-template, angular-services, angular-directive, material, angular4-forms, ngx-bootstrap, typescript2.0, angular2-services, ngrx-store, subscription, rxjs5, angular2-directives, ngrx-effects, angular-material-6, angular-cli-v6, angular-http-interceptors, redux-observable, subscribe, angular-pipe, angular-promise, angular2-observables, reactivex, angular-ngmodel, angular-cdk, tsconfig

centroid 20: apache-spark, java, scala, hadoop, python
—–
apache-spark, scala, pyspark, apache-kafka, hadoop, apache-spark-sql, hive, cassandra, apache-flink, jupyter, hdfs, bigdata, playframework, spark-streaming, sbt, apache-nifi, hiveql, apache-kafka-streams, akka, hbase, pyspark-sql, mapreduce, kafka-consumer-api, rdd, spark-dataframe, amazon-emr, yarn, user-defined-functions, parquet, cassandra-3.0, cluster-computing, avro, databricks, apache-kafka-connect, aws-glue, spark-structured-streaming, flink-streaming, kerberos, apache-zookeeper, sqoop, confluent, presto, kafka-producer-api, impala, akka-stream, hadoop2, apache-spark-mllib, traits, apache-zeppelin, cloudera, datastax, apache-storm, distributed-computing, akka-http, data-modeling, apache-spark-dataset, guice, google-cloud-dataproc, gatling, jmx, hortonworks-data-platform, apache-pig, apache-spark-ml, oozie, azure-databricks, scalatest, cql, playframework-2.6, datastax-enterprise, phoenix, confluent-schema-registry, mesos, implicit, implicit-conversion, ksql, scala-collections, spark-submit, hdinsight, ambari

google: google-cloud-dataproc
amazon: amazon-emr, aws-glue
azure: azure-databricks

centroid 12: bash, python, linux, shell, java
—–
macos, ubuntu, ansible, ssh, raspberry-pi, terminal, vim, raspberry-pi3, subprocess, environment-variables, centos, console, command-line-interface, pipe, arguments, jq, homebrew, iot, applescript, printf, escaping, sftp, windows-subsystem-for-linux, raspbian, exec, redhat, stdout, zsh, alias, wget, eval, paramiko, filenames, glob, command-line-arguments, stdin, remote-access, sudo, file-permissions, slurm, putty, gpio, tar, tmux, rsync, expect, ksh, jsch, scp, ssh-tunnel, cat, portforwarding, openssh

centroid 8: c#, .net, asp.net-core, .net-core, java
—–
logging, exception, error-handling, azure-functions, reflection, configuration, azure-cosmosdb, f#, logstash, exception-handling, azure-web-sites, elastic-stack, azure-service-fabric, console-application, try-catch, wix, azure-application-insights, nunit, core, autofac, xunit, dapper, nest, nhibernate, dotnet-httpclient, nlog, httpwebrequest, serilog, log4net, inversion-of-control, identity, unity-container, webclient, blazor, .net-standard-2.0, system.reactive, clr, type-inference, asp.net-core-signalr, asp.net-core-identity, app-config, masstransit, fluentd, google-cloud-stackdriver, winston, .net-framework-version, ninject, .net-core-2.0, claims-based-identity, syslog, kestrel-http-server

google: google-cloud-stackdriver
azure: azure-cosmosdb, azure-web-sites, azure-functions, azure-application-insights, azure-service-fabric

centroid 18: c#, asp.net, asp.net-mvc, asp.net-core, entity-framework
—–
c#, asp.net, .net, asp.net-mvc, asp.net-core, .net-core, entity-framework, linq, asp.net-web-api, model-view-controller, iis, entity-framework-core, dependency-injection, razor, asp.net-core-2.0, asp.net-core-mvc, wcf, entity-framework-6, json.net, kendo-ui, asp.net-core-webapi, webforms, asp.net-mvc-5, identityserver4, asp.net-identity, asp.net-web-api2, asp.net-core-2.1, asp.net-mvc-4, signalr, odata, kendo-grid, automapper, c#-4.0, web-config, razor-pages, windows-services, moq, ado.net, aspnetboilerplate, ef-code-first, ef-core-2.0, asp.net-core-2.2, linq-to-sql, asp-classic, umbraco, ef-core-2.1, asp.net-ajax, ef-migrations, iis-8, connection-string, windows-authentication, linq-to-entities, repository-pattern, swashbuckle, npgsql, iis-7, dbcontext, hangfire, iis-10, iis-express, model-binding, ef-core-2.2, windows-server-2012, signalr-hub, iis-7.5, linq-to-xml

centroid 10: c#, azure, python, java, javascript
—–
azure, botframework, azure-sql-database, bots, azure-storage, chatbot, timeout, azure-storage-blobs, report, ssrs-2012, azure-data-factory, telegram-bot, azure-web-app-service, expression, azure-logic-apps, ibm-watson, refactoring, domain-driven-design, azureservicebus, gzip, azure-resource-manager, azure-iot-hub, twilio-api, azure-data-factory-2, azure-data-lake, vpn, azure-virtual-machine, microsoft-teams, luis, string-formatting, game-physics, google-assistant-sdk, ssrs-2008, game-development, ads, mesh, windows-7, virtual-reality, vuforia, microsoft-cognitive, azure-webjobs, azure-keyvault, azure-api-management, credentials, directx, facebook-messenger, collision-detection, arm-template, sprite, rdlc, game-engine, azure-search, azure-eventhub, physics, azure-blob-storage, desktop-application, factory, software-design, hololens, u-sql, installer, collision, azure-cli, reporting, google-home, azure-table-storage, azure-cognitive-services, unityscript, startup, azure-bot-service, mysql-connector, ienumerable, qnamaker, instantiation, builder, ssrs-tablix, azure-cosmosdb-sqlapi, azure-stream-analytics, quaternions, reportviewer, skype, azure-machine-learning-studio, azure-servicebus-queues, skype-for-business, ssrs-2008-r2, azure-virtual-network, win-universal-app, azure-log-analytics, unity5, csom, dialogflow-fulfillment

google: google-assistant-sdk, google-home
azure: azure-webjobs, azure-log-analytics, azure-virtual-machine, azure-sql-database, azure-api-management, azure-iot-hub, azure-web-app-service, azure-data-factory-2, azure-table-storage, azure-servicebus-queues, azure-bot-service, azure-virtual-network, azure-data-factory, azure-cognitive-services, azure-blob-storage, azure-storage-blobs, azure-logic-apps, azure-resource-manager

centroid 44: c#, javascript, java, oauth-2.0, php
—–
api, authentication, security, facebook, oauth-2.0, spring-security, cookies, jwt, azure-active-directory, cors, postman, microsoft-graph, login, oauth, microservices, active-directory, ldap, jhipster, authorization, passport.js, keycloak, token, azure-ad-b2c, single-sign-on, passwords, sharepoint-online, telegram, spring-security-oauth2, single-page-application, linkedin, google-signin, openid-connect, session-cookies, owin, csrf, auth0, google-oauth2, saml, access-token, linkedin-api, laravel-passport, saml-2.0, google-authentication, xss, azure-powershell, adal, basic-authentication, azure-ad-graph-api, session-variables, msal, oidc, openid, express-session, bearer-token, logout, refresh-token

google: google-authentication, google-signin, google-oauth2
azure: azure-ad-graph-api, azure-active-directory, azure-powershell, azure-ad-b2c

centroid 19: c#, visual-studio, visual-studio-2017, .net, xamarin.forms
—–
visual-studio, xamarin, xamarin.forms, visual-studio-2017, xaml, uwp, azure-devops, xamarin.android, build, reporting-services, tfs, ssis, xamarin.ios, nuget, visual-studio-2015, msbuild, crystal-reports, windows-installer, nuget-package, mono, cross-platform, azure-pipelines-release-pipeline, .net-standard, mvvmcross, visual-studio-2019, visual-studio-2010, c++-cli, visual-studio-2013, sql-server-data-tools, roslyn, resharper, publish, mstest, .net-assembly, visual-studio-2012, azure-devops-rest-api, tfs2017, azure-pipelines-build-task, visual-studio-mac, tfs2018, tfsbuild, visual-studio-extensions, visual-studio-debugging, visual-studio-app-center, csproj, tfs2015, vsix, azure-mobile-services, picker, tfvc, xamarin.uwp

azure: azure-devops, azure-devops-rest-api, azure-pipelines-release-pipeline, azure-mobile-services, azure-pipelines-build-task

centroid 36: c#, wpf, winforms, javascript, vb.net
—–
wpf, vb.net, winforms, user-interface, listview, charts, events, mvvm, datatable, checkbox, data-binding, datagridview, timer, sapui5, gridview, combobox, binding, drag-and-drop, menu, datagrid, knockout.js, popup, window, textbox, styles, treeview, listbox, telerik, uwp-xaml, devexpress, resources, vb6, user-controls, prism, viewmodel, controls, datetimepicker, webbrowser-control, cefsharp, panel, contextmenu, windows-10-universal, wpfdatagrid, windows-forms-designer, custom-controls, wpf-controls, richtextbox, clickonce, observablecollection, picturebox, mvvm-light, gdi+, menuitem, backgroundworker

centroid 2: c++, c, python, linux, java
—–
c++, c, c++11, templates, assembly, cmake, gcc, memory, opengl, arduino, makefile, visual-c++, boost, c++17, lua, compiler-errors, x86, linux-kernel, memory-management, compilation, memory-leaks, operating-system, io, c++14, fortran, arm, serial-port, cuda, char, language-lawyer, segmentation-fault, clang, linker, stack, gdb, garbage-collection, macros, stl, g++, kernel, embedded, byte, malloc, shared-libraries, out-of-memory, nodes, processing, usb, x86-64, stm32, double, cython, buffer, pthreads, mips, signals, operator-overloading, runtime, gtk, llvm, driver, include, opengl-es, cygwin, operators, bit-manipulation, structure, overloading, nasm, precision, gnu-make, ros, gstreamer, mingw, const, variadic-templates, eigen, heap, gtk3, embedded-linux, esp8266, linux-device-driver, compiler-construction, warnings, cpu, cross-compiling, clion, qt-creator, profiling, ctypes, std, codeblocks, intel, return-value, system, newline, sdl-2, microcontroller, system-calls, pass-by-reference, valgrind, boost-asio, reverse-engineering, dynamic-memory-allocation, move, linker-errors, googletest, c-preprocessor, heap-memory, static-libraries, function-pointers, sdl, template-meta-programming, benchmarking, arduino-uno, libcurl, interrupt, vtk, x86-16, compiler-optimization, constants, stdvector, 64-bit, binaryfiles, bit, swig, quicksort, shared-memory, eclipse-cdt, constexpr, primes, bitwise-operators, x11, shared-ptr, clang++, glfw, binary-search, header-files, singly-linked-list, arduino-esp8266, ld, i2c, main, multiple-inheritance, gnu, smart-pointers, ram, simd, declaration, esp32, preprocessor, elf, undefined-behavior, bison, qtquick2, sfinae, variadic-functions, mingw-w64, unique-ptr, avr, masm, free, typedef, doubly-linked-list, generic-programming, compiler-warnings, glibc, kernel-module, move-semantics, auto, bootloader, c-strings, inline-assembly, ncurses, mmap, stdmap, glm-math, qmake, bit-shift, endianness, cpu-registers, template-specialization, pid, operator-precedence, memory-address

google: googletest

centroid 24: css, html, javascript, bootstrap-4, angular
—–
html, css, jquery, html5, css3, bootstrap-4, twitter-bootstrap, flexbox, sass, datatables, highcharts, html-table, twitter-bootstrap-3, layout, frontend, datepicker, drop-down-menu, css-grid, bootstrap-modal, momentjs, responsive-design, modal-dialog, dropdown, grid, responsive, tabs, font-awesome, navbar, carousel, media-queries, themes, tooltip, alignment, overflow, less, css-position, react-bootstrap, border, accordion, dt, css-transforms, angular-ui-bootstrap, nav, z-index, grid-layout, utc, reactstrap, vertical-alignment, pseudo-element, linear-gradients, mixins, collapse, popover, angular-datatables, angular-flex-layout, centering

centroid 31: delphi, c++, c#, winapi, windows
—–
delphi, winapi, dll, mfc, com, firemonkey, firebird, c++builder, delphi-10.2-tokyo, pinvoke, pascal, indy

centroid 25: django, python, django-models, django-rest-framework, python-3.x
—–
django, django-models, django-rest-framework, django-views, django-forms, django-templates, django-admin, django-queryset, django-orm, django-urls, django-2.0, django-serializer, django-allauth, django-filter, django-class-based-views, django-migrations, serializer

centroid 34: docker, kubernetes, python, google-cloud-platform, java
—–
docker, go, kubernetes, elasticsearch, google-cloud-platform, docker-compose, google-app-engine, dockerfile, deployment, rabbitmq, google-cloud-storage, yaml, airflow, google-kubernetes-engine, containers, kibana, google-compute-engine, google-cloud-dataflow, drupal, virtual-machine, apache-beam, openshift, prometheus, ibm-cloud, grpc, docker-swarm, kubernetes-helm, grafana, gcloud, traefik, google-cloud-datastore, kubectl, load-balancing, google-cloud-sql, kubernetes-ingress, monitoring, google-cloud-pubsub, istio, minikube, nginx-reverse-proxy, speech-to-text, alpine, docker-machine, filebeat, google-translate, google-speech-api, iptables, stackdriver, docker-volume, docker-container, azure-aks, nginx-ingress, daemon, google-cloud-vision, google-vision, azure-kubernetes, google-cloud-composer, openshift-origin, kubeadm, dataflow, amazon-eks, service-accounts, rancher, docker-registry, docker-image, nfs, google-cloud-speech, kops, jupyterhub, rbac, google-cloud-endpoints, standard-sql, google-cloud-build, docker-networking

google: google-cloud-build, google-cloud-datastore, google-speech-api, google-cloud-dataflow, google-cloud-storage, google-cloud-platform, google-kubernetes-engine, google-cloud-sql, google-cloud-speech, google-compute-engine, google-cloud-composer, google-cloud-endpoints, google-cloud-vision, google-vision, google-cloud-pubsub, google-translate, google-app-engine
amazon: amazon-eks
azure: azure-aks, azure-kubernetes

centroid 21: excel, vba, excel-vba, c#, python
—–
excel, vba, excel-vba, google-sheets, ms-access, excel-formula, powerbi, outlook, sharepoint, ms-word, access-vba, office365, apache-poi, sap, dax, formatting, pivot-table, office-js, runtime-error, match, powerpoint, outlook-vba, formula, access, row, vsto, ssas, multiple-columns, powerquery, spreadsheet, vlookup, average, extract, userform, unique, ms-office, excel-2010, word-vba, ms-access-2010, excel-2016, rows, ms-access-2016, onedrive, openxml, powerbi-desktop, copy-paste, transpose, conditional-formatting, office-addins, office-interop, epplus, xlsxwriter, interop, phpspreadsheet, paste, shapes, offset, powerpivot, win32com, powerbi-embedded, export-to-excel, python-docx, countif, array-formulas, autofill, powerpoint-vba, activex, ms-access-2013, solver, m, business-intelligence, xls, ado, ssas-tabular, adodb, xlrd, delete-row, sumifs, openxml-sdk, excel-interop, add-in, excel-2013, worksheet, excel-addins

google: google-sheets

centroid 4: git, github, jenkins, python, docker
—–
git, jenkins, github, jenkins-pipeline, gitlab, continuous-integration, sonarqube, bitbucket, jenkins-plugins, gitlab-ci, devops, azure-pipelines, svn, version-control, phpstorm, webhooks, repository, travis-ci, jekyll, artifactory, atom-editor, jira, pipeline, github-pages, teamcity, slack, push, gitlab-ci-runner, git-bash, github-api, continuous-deployment, branch, nexus, circleci, workflow, git-merge, diff, jenkins-groovy, clone, git-submodules, gitignore, atlassian-sourcetree, pull-request, bitbucket-pipelines, patch, commit, git-branch, versioning, mercurial, gnupg, git-commit, gerrit, rebase, open-source, githooks, allure, ssh-keys, git-push

azure: azure-pipelines

centroid 33: java, hibernate, spring, spring-boot, jpa
—–
hibernate, jpa, spring-data-jpa, spring-data, solr, orm, transactions, spring-batch, wildfly, mapping, many-to-many, lucene, h2, entity, liquibase, jpql, ejb, hql, multi-tenant, persistence, spring-data-rest, hibernate-mapping, eclipselink, querydsl, one-to-many, spring-transactions, hikaricp, criteria, hibernate-criteria, gorm, ehcache, jpa-2.0, criteria-api, entitymanager, transactional

centroid 39: java, python, c#, node.js, android
—–
http, ssl, sockets, curl, server, networking, websocket, encryption, https, socket.io, proxy, openssl, tcp, ssl-certificate, cryptography, http-headers, certificate, localhost, udp, mqtt, connection, client, ip, reverse-proxy, network-programming, aes, rsa, port, tls1.2, chat, client-server, lets-encrypt, cloudflare, haproxy, real-time, virtualhost, wireshark, keystore, ipc, x509certificate, zeromq, bouncycastle, django-channels, tcpclient, http-proxy, serversocket, telnet, cryptojs, public-key-encryption, private-key, tcp-ip, x509, client-certificates, certbot, tor, multicast

centroid 30: java, rest, spring, spring-boot, javascript
—–
java, spring-boot, spring, rest, maven, eclipse, spring-mvc, tomcat, jsp, jdbc, web-services, soap, servlets, swagger, jackson, java-ee, thymeleaf, netbeans, web-applications, apache-camel, architecture, salesforce, spring-webflux, jersey, httprequest, jax-rs, wsdl, http-post, multipartform-data, tomcat8, soapui, response, resttemplate, cxf, httpresponse, rest-assured, api-design, struts2, soap-client, jstl, restsharp, spring-rest, spring-test, spring-restcontroller, jax-ws, put, endpoint, http-status-codes, struts, spring-web

centroid 22: java, spring-boot, spring, maven, eclipse
—–
gradle, intellij-idea, neo4j, jsf, jar, grails, jboss, spring-cloud, spring-integration, jaxb, internationalization, log4j, eclipse-plugin, swagger-ui, jms, websphere, ant, activemq, vaadin, spring-kafka, pom.xml, log4j2, project-reactor, weblogic, java-11, netty, jetty, maven-3, javamail, crud, java-9, spring-cloud-stream, couchbase, ibm-mq, weblogic12c, datasource, glassfish, hazelcast, logback, osgi, hybris, openapi, project, maven-plugin, netflix-eureka, mybatis, cloudfoundry, reactive, eclipse-rcp, swagger-2.0, slf4j, netflix-zuul, quartz-scheduler, jetbrains-ide, spring-boot-actuator, lombok, spring-jdbc, sonarqube-scan, war, cdi, javabeans, tomcat7, interceptor, swt, java-10, freemarker, spring-boot-test, aop, jdbctemplate, dependency-management, spring-cloud-config, aspectj, spring-aop, flyway, amqp, classpath, spring-jms, jackson-databind, spring-cloud-dataflow, spring-websocket, spring-amqp, pivotal-cloud-foundry, spring-cloud-netflix, cucumber-jvm, executable-jar, spring-integration-dsl, swagger-codegen, tomcat9, spring-data-redis, javadoc, jndi, consul, intellij-plugin, dto, maven-surefire-plugin, hystrix, stomp, bean-validation, gateway, mapstruct, birt, spring-tool-suite, properties-file, jackson2, camunda, springfox, spring-data-neo4j, spring-rabbitmq, spring-session, pydev, xtext, servlet-filters, payara, code-formatting, spring-cloud-gateway

google: google-maps

centroid 15: javascript, angular, android, typescript, visual-studio-code
—–
ionic-framework, npm, visual-studio-code, ionic3, google-maps, cordova, debugging, electron, nativescript, ionic4, ionic2, meteor, autocomplete, geolocation, cordova-plugins, markdown, vscode-settings, phonegap, ide, webstorm, vscode-extensions, decorator, editor, ionic-native, onesignal, keyboard-shortcuts, angular2-nativescript, xdebug, hybrid-mobile-app, intellisense, nativescript-angular, tslint, remote-debugging, html-framework-7, breakpoints, syntax-highlighting, vscode-debugger, pylint, ibm-mobilefirst, jsdoc, nativescript-vue, prettier, windbg, phonegap-plugins, code-snippets, inappbrowser, phonegap-build

centroid 42: javascript, html, c#, android, python
—–
unity3d, image, tkinter, svg, animation, canvas, three.js, chart.js, scroll, 3d, html5-canvas, camera, geometry, css-animations, bitmap, rotation, aframe, icons, glsl, shader, fabricjs, rendering, webgl, css-transitions, transform, png, scrollbar, transition, textures, blender, html2canvas, drawing, mask, linechart, draw, jquery-animate, angular-animations, konvajs, raycasting, webvr

centroid 43: javascript, html, jquery, css, reactjs
—–
d3.js, leaflet, google-maps-api-3, magento, magento2, ethereum, jquery-ui, cakephp, jinja2, primefaces, extjs, ember.js, fullcalendar, maps, ckeditor, mapbox, solidity, materialize, jquery-select2, dialog, angularjs-directive, recaptcha, openlayers, coldfusion, polymer, gis, react-apollo, styled-components, tinymce, progress-bar, geojson, javascript-events, position, apollo-client, undefined, addeventlistener, sharepoint-2013, semantic-ui, amcharts, web-deployment, joomla, settimeout, render, p5.js, openstreetmap, element, instagram-api, parent-child, liquid, mapbox-gl-js, focus, angularjs-ng-repeat, setinterval, react-admin, alert, polygon, dom-events, web-component, textarea, react-select, href, web3, contact-form-7, zoom, facebook-javascript-sdk, refresh, graphql-js, overlay, height, slick, content-security-policy, jqgrid, html5-audio, delay, anchor, blogger, html-select, child-process, jquery-plugins, width, html-lists, kendo-asp.net-mvc, loading, zurb-foundation, id, bind, whitespace, dropzone.js, onchange, semantic-ui-react, owl-carousel, media, video.js, hide, netlify, background-color, sticky, geocoding, native-base, webpage, inline, tampermonkey, slick.js, form-data, padding, ternary-operator, event-listener, facebook-messenger-bot, underscore.js, formik, jquery-validate, quill, dc.js, highlight, react-table, electron-builder, bulma, jquery-selectors, fullscreen, multi-select, innerhtml, slideshow, parallax, draggable, footer, styling, react-component, jquery-mobile, selector, swiper, infinite-scroll, contenteditable, sidebar, jquery-ui-datepicker, mobx-react, form-submit, shadow-dom, backbone.js, each, margin, html-form, mathjax, jquery-ui-autocomplete, viewport, c3.js, adsense, sweetalert2, web-development-server, keypress, jquery-ui-sortable, facebook-php-sdk, sweetalert, center, font-awesome-5, react-proptypes, placeholder, summernote, font-face, react-context, web-frontend, ref, css-float, parent, wysiwyg, getelementbyid, font-size, higher-order-functions, lifecycle, dropzone, partial-views, asyncstorage, wai-aria, spfx, custom-element, jstree, bootstrap-datepicker, line-breaks, react-dom, fancybox, css-tables, stylesheet, react-google-maps, react-leaflet, timepicker, option, facebook-marketing-api, gsap, crossfilter, draftjs, directive, fixed, show-hide

google: react-google-maps, google-maps-api-3

centroid 1: javascript, python, html, java, android
—–
wordpress, woocommerce, javafx, swing, google-bigquery, button, google-api, video, google-analytics, plugins, kivy, youtube, onclick, calendar, dynamics-crm, youtube-api, widget, background, slider, radio-button, event-handling, save, amp-html, compression, google-tag-manager, javafx-8, click, hover, fxml, analytics, display, firebase-analytics, mouseevent, resize, background-image, size, listener, google-analytics-api, cross-domain, toggle, jpeg, google-data-studio, google-adwords, rgb, submit, gif, pixel, crop, gallery, tiff, thumbnails, image-resizing, exif, user-experience, src, pillow

google: google-adwords, google-bigquery, google-api, google-tag-manager, google-data-studio, google-analytics-api, google-analytics

centroid 26: javascript, reactjs, node.js, typescript, react-native
—–
javascript, node.js, reactjs, firebase, react-native, flutter, firebase-realtime-database, webpack, dart, redux, google-cloud-firestore, ecmascript-6, react-redux, firebase-authentication, google-cloud-functions, jestjs, promise, async-await, react-router, firebase-cloud-messaging, import, push-notification, dialogflow, material-ui, module, react-navigation, react-native-android, expo, fetch, gulp, mocha, firebase-storage, actions-on-google, create-react-app, enzyme, material-design, jsx, flutter-layout, lodash, babel, navigation, es6-promise, node-modules, apollo, state, npm-install, babeljs, eslint, gatsby, react-native-ios, javascript-objects, react-router-v4, next.js, yarnpkg, webpack-4, react-hooks, redux-form, prestashop, firebase-security-rules, flowtype, react-router-dom, typescript-typings, webpack-dev-server, antd, redux-saga, redux-thunk, router, package.json, react-native-flatlist, firebase-admin, react-props, react-native-navigation, mobx, firebaseui, es6-modules, firebase-hosting, aurelia, pm2, immutability, serverside-rendering, action, setstate, gruntjs, requirejs, ecmascript-5, flutter-dependencies, ssr, react-native-firebase, angular-dart, es6-class, require, laravel-mix, arrow-functions, react-native-maps, npm-scripts, flutter-animation, workbox, firebase-cli, destructuring, babel-loader, immutable.js, minify, browserify, node-sass, nodemon, firebase-security, angularfire, reducers, loader, bower

google: google-cloud-functions, actions-on-google, google-cloud-firestore

centroid 13: javascript, selenium, google-chrome, html, selenium-webdriver
—–
selenium, google-chrome, selenium-webdriver, xpath, dom, google-chrome-extension, firefox, caching, automation, iframe, selenium-chromedriver, mobile, browser, internet-explorer, safari, webdriver, css-selectors, google-chrome-devtools, webrtc, progressive-web-apps, service-worker, local-storage, internet-explorer-11, html5-video, microsoft-edge, ignite, phantomjs, chromium, cross-browser, webdriverwait, capybara, screenshot, cpu-architecture, mobile-safari, firefox-addon, webkit, geckodriver, firefox-webextensions, google-chrome-headless, browser-cache, webdriver-io, v8, specflow, selenium-grid, html-agility-pack, memcached, guava, devtools, cucumber-java, headless, domdocument, extentreports, selenium-ide, watir, cache-control, google-chrome-app, browser-automation, rselenium, mozilla

google: google-chrome, google-chrome-headless, google-chrome-devtools, google-chrome-app, google-chrome-extension

centroid 6: javascript, vue.js, vuejs2, vuex, webpack
—–
vue.js, express, vuejs2, axios, vue-component, vuex, vuetify.js, xmlhttprequest, vue-router, nuxt.js, fetch-api, vue-cli, vue-cli-3, store, nuxt, bootstrap-vue, vue-test-utils, element-ui, vee-validate

centroid 40: mongodb, node.js, javascript, python, express
—–
mongodb, qt, mongoose, graphql, pyqt5, pyqt, mongodb-query, aggregation-framework, discord, qml, qt5, nosql, discord.js, discord.py, aggregate, ejs, handlebars.js, pymongo, backend, schema, mongoose-schema, mean-stack, sails.js, pug, nestjs, loopbackjs, pyqt4, spring-data-mongodb, multer, geospatial, mean, fs, aggregation, lookup, loopback, apollo-server, parse-server, pyside2, bcrypt, mongodb-.net-driver, document, pyside, qt-designer, mongoengine, body-parser, discord.py-rewrite, projection, mern, mongoose-populate, qthread, mlab, joi, passport-local, pyqtgraph, bson, sharding, express-handlebars

centroid 5: multithreading, java, python, concurrency, c++
—–
multithreading, asynchronous, parallel-processing, concurrency, multiprocessing, callback, queue, celery, jvm, task, python-asyncio, synchronization, dask, python-multiprocessing, thread-safety, locking, mpi, openmp, singleton, pickle, python-multithreading, opencl, future, threadpool, mutex, task-parallel-library, tornado, deadlock, aiohttp, atomic, wait, executorservice, semaphore, completable-future, handler, goroutine, channel, race-condition, volatile, pool, runnable, java-threads, synchronous, async.js, producer-consumer, synchronized, java.util.concurrent, blocking

centroid 11: oracle, sql, plsql, oracle11g, database
—–
oracle, plsql, stored-procedures, oracle11g, triggers, db2, oracle12c, sql-server-2014, oracle-sqldeveloper, oracle-apex, cursor, database-trigger, apex, oracle10g, sqlplus, oracle-apex-5.1, procedure, dynamic-sql, cx-oracle, oracle-adf, oracleforms, hierarchical-data, plsqldeveloper, oracle-apex-5

centroid 38: pdf, html, python, php, c#
—–
pdf, merge, webview, printing, fonts, r-markdown, download, base64, puppeteer, itext, hyperlink, latex, export, blob, imagemagick, ocr, pdf-generation, adobe, jspdf, pdfbox, itext7, embed, docx, digital-signature, wkhtmltopdf, fpdf, mpdf, tcpdf, dompdf, ghostscript, acrobat, pypdf2, reportlab

centroid 9: php, laravel, laravel-5, mysql, javascript
—–
php, laravel, ajax, laravel-5, codeigniter, validation, eloquent, session, file-upload, model, pagination, codeigniter-3, laravel-5.6, controller, laravel-5.5, migration, upload, laravel-5.7, laravel-blade, laravel-5.4, relational-database, laravel-5.2, php-7, relationship, lumen, middleware, php-7.2, octobercms, guzzle, image-uploading, laravel-5.8, algolia, query-builder, phpexcel, jobs, laravel-query-builder, laravel-5.3, roles, artisan, laravel-nova, php-carbon, laravel-4, laravel-5.1, homestead, pusher, laravel-eloquent, blade, laravel-routing, laravel-dusk, relation, shared-hosting, eager-loading

centroid 47: php, wordpress, javascript, woocommerce, python
—–
google-apps-script, email, notifications, google-drive-api, paypal, stripe-payments, gmail, smtp, attributes, wordpress-theming, google-calendar-api, google-visualization, google-oauth, google-sheets-api, product, phpmailer, youtube-data-api, advanced-custom-fields, gmail-api, hook-woocommerce, outlook-addin, metadata, google-sheets-formula, cart, html-email, google-app-maker, hook, exchangewebservices, payment-gateway, custom-post-type, exchange-server, checkout, e-commerce, sendgrid, field, content-management-system, nodemailer, categories, gsuite, google-form, admin, mailchimp, comments, web-hosting, outlook-web-addins, orders, wordpress-rest-api, customization, imap, google-docs, rss, custom-wordpress-pages, custom-taxonomy, shortcode, outlook-restapi, email-attachments, google-sheets-query, mailgun, google-api-php-client, woocommerce-rest-api, wordpress-gutenberg, google-apis-explorer, attachment, gmail-addons, price, sendmail, icalendar, blogs, registration, custom-fields, multisite, google-admin-sdk, shipping, gravity-forms-plugin, google-api-client, archive, pagespeed, smtplib, mime, meta, google-api-nodejs-client, contact-form, taxonomy, google-api-python-client, account, stock

google: google-sheets-api, google-docs, google-oauth, google-apps-script, google-sheets-query, google-sheets-formula, google-api-php-client, google-api-nodejs-client, google-visualization, google-admin-sdk, google-calendar-api, google-apis-explorer, google-api-python-client, google-form, google-app-maker, google-api-client, google-drive-api

centroid 23: postgresql, sql, python, javascript, mysql
—–
sqlite, flask, sqlalchemy, hyperledger-fabric, odoo, hyperledger, elixir, blockchain, hyperledger-composer, flask-sqlalchemy, sequence, couchdb, psycopg2, psql, postgis, pyodbc, knex.js, plpgsql, jsonb, jooq, typeorm, postgresql-9.5, ecto, postgresql-10, marshalling, connection-pooling, postgresql-9.6, database-replication, unmarshalling, pgadmin-4, pgadmin, recursive-query, postgresql-9.4, crosstab, go-gorm, database-backups, postgresql-9.3, rds, heroku-postgres

centroid 35: python, java, c++, c#, windows
—–
windows, powershell, batch-file, ffmpeg, audio, cmd, windows-10, path, stream, directory, process, vbscript, prolog, ftp, time-complexity, copy, command, zip, file-io, ocaml, scheduled-tasks, storage, big-o, binary-search-tree, registry, scheme, binary-tree, text-files, dynamic-programming, rename, exe, echo, command-prompt, hashtable, powershell-v3.0, filereader, batch-processing, wmi, stack-overflow, file-handling, windows-server-2016, windows-server-2012-r2, bufferedreader, taskscheduler, fstream, hyper-v, readfile, depth-first-search, fibonacci, ifstream, backtracking

centroid 3: python, java, javascript, arrays, c#
—–
arrays, string, list, function, loops, csv, algorithm, dictionary, performance, for-loop, file, sorting, class, object, if-statement, oop, haskell, recursion, pointers, variables, generics, java-8, matrix, rust, filter, optimization, indexing, math, lambda, arraylist, inheritance, input, multidimensional-array, search, random, vector, data-structures, time, struct, types, methods, while-loop, foreach, design-patterns, dynamic, functional-programming, collections, java-stream, sas, parameters, enums, nested, interface, constructor, linked-list, syntax, casting, tree, hashmap, binary, properties, scope, reference, type-conversion, floating-point, iterator, null, tuples, static, format, set, conditional, iteration, range, switch-statement, return, append, numbers, boolean, output, int, concatenation, polymorphism, hex, compare, initialization, namespaces, integer, pattern-matching, grouping, logic, key, filtering, parameter-passing, list-comprehension, apply, subset, global-variables, slice, vectorization, conditional-statements, combinations, scanf, java.util.scanner, character, lapply, comparison, this, 2d, override, counter, numpy-ndarray, permutation, user-input, rounding, nested-loops, abstract-class, instance, reduce, prototype, itertools, global, reverse, subclass, comparator, key-value, increment, min, infinite-loop, contains, do-while, associative-array, mergesort, abstract, indexof, break, bubble-sort

centroid 48: python, java, javascript, r, php
—–
typo3, wso2, clojure, sublimetext3, drupal-8, acumatica, jframe, docusignapi, teradata, emacs, netsuite, karate, verilog, jasper-reports, marklogic, sparql, vhdl, sympy, autodesk-forge, knitr, erlang, yocto, tcl, odoo-11, cakephp-3.0, mule, phoenix-framework, drupal-7, integration, racket, netlogo, autohotkey, uml, drools, node-red, stata, magento-1.9, common-lisp, aem, opencart, abap, line, python-sphinx, jtable, yii, pentaho, wagtail, coq, regex-lookarounds, slack-api, bioinformatics, openstack, perl6, antlr4, awt, rcpp, upgrade, tweepy, jpanel, macos-high-sierra, documentation, jsonschema, actionscript-3, vmware, wildcard, microsoft-dynamics, prestashop-1.7, typo3-8.x, zend-framework, lisp, gwt, elasticsearch-5, distance, smartcontracts, talend, wso2-am, slim, sfml, message-queue, computer-science, scenebuilder, yii2-advanced-app, rdf, inno-setup, fpga, flask-wtforms, bitcoin, clipboard, special-characters, unreal-engine4, hana, preg-replace, gdal, flash, nginx-config, wso2esb, system-verilog, arangodb, wso2is, netbeans-8, uuid, graphviz, liferay, omnet++, spatial, encode, powershell-v4.0, paypal-sandbox, http2, dynamics-365, local, ip-address, servicestack, hdf5, firewall, kdb, executable, linear-programming, add, orientdb, angularjs-scope, cplex, pymysql, xpages, phaser-framework, maya, powershell-v2.0, nginx-location, adfs, limit, abstract-syntax-tree, variable-assignment, elementtree, wav, mouse, splunk, asterisk, pandoc, publish-subscribe, simulink, webassembly, packages, complexity-theory, ansible-2.x, python-decorators, preg-match, regex-negation, minecraft, spotfire, nested-lists, pcre, gfortran, percentage, matching, monads, jbutton, gsub, numba, sitecore, watson-conversation, dropwizard, edit, frequency, vulkan, cdn, mime-types, wrapper, php-curl, jersey-2.0, scapy, converters, zapier, attributeerror, elm, console.log, web3js, kentico, moodle, intervals, anylogic, multilingual, logical-operators, glm, vlc, owl, autodesk-viewer, dojo, z3, arcgis, classloader, pyomo, sybase, antlr, md5, indexeddb, powerapps, data-conversion, http-status-code-403, web-worker, external, alfresco, airflow-scheduler, finance, typo3-9.x, cpu-usage, pouchdb, ibm-midrange, ping, truffle, qemu, dask-distributed, selection, apache-httpclient-4.x, jsonpath, sha256, coding-style, polymer-2.x, zipfile, vps, symbols, mediawiki, calculation, okta, smarty, blueprism, netcdf, zend-framework3, genetic-algorithm, snmp, repeat, plesk, signature, doxygen, qgis, dlib, jena, ckeditor4.x, tortoisesvn, face-recognition, message, number-formatting, dropbox, lm, default, spss, perforce, assert, uart, acl, symlink, suitescript2.0, kendo-ui-angular2, getter-setter, jira-rest-api, crm, sleep, indentation, prompt, analysis, lme4, ironpython, latitude-longitude, grammar, dotnetnuke, cqrs, gaussian, regex-greedy, koa, yum, numerical-methods, haskell-stack, scala-cats, hugo, rpc, priority-queue, openldap, popen, shapefile, var, partition, record, combinatorics, ada, nio, pentaho-data-integration, appium-ios, scheduling, pyautogui, snakemake, production-environment, lwjgl, computational-geometry, scaling, tabulator, lifetime, naming-conventions, jsf-2, lotus-notes, javac, numeric, odoo-8, modeling, salesforce-lightning, silverstripe, bookdown, websphere-liberty, virtualization, wxwidgets, ramda.js, hapijs, reload, marklogic-9, snapshot, hierarchy, server-side, cakephp-3.x, prestashop-1.6, schedule, unix-timestamp, difference, ontology, readline, configuration-files, opendaylight, block, wolfram-mathematica, rpm, logstash-grok, currency, mount, remote-server, destructor, nsis, captcha, feathersjs, code-generation, hardware, django-2.1, suitescript, typoscript, jython, trim, distributed-system, zabbix, vaadin8, nodemcu, magento2.2, string-matching, shuffle, ckeditor5, mixed-models, fedora, ipv6, new-operator, ember-data, llvm-clang, exit, webcam, str-replace, large-data, simplexml, rules, elasticsearch-aggregation, rhel, dsl, ethernet, event-sourcing, vimeo, hashset, date-formatting, zlib, standards, bamboo, converter, liferay-7, file-get-contents, solrcloud, servicenow, logstash-configuration, xhtml, virtual, pywin32, equals, intersection, micronaut, production, cs50, fopen, elasticsearch-6, lazy-evaluation, server-sent-events, extbase, translate, python-module, tabular, libreoffice, sml, private, apostrophe-cms, tostring, bitbake, actionlistener, restore, activiti, mypy, opencart-3, janusgraph, rank, multiplication, keyword, archlinux, optaplanner, imagick, informix, flex-lexer, photoshop, pyaudio, openlayers-3, reset, sentry, umbraco7, messaging, lotus-domino, fiddler, interactive, jlabel, folium, bigcommerce, transparency, nullreferenceexception, operator-keyword, tracking, keyboard-events, twitter-oauth, static-methods, polymer-3.x, prisma, mule-esb, string-comparison, counting, layout-manager, gherkin, inner-classes, docker-for-windows, checksum, imagemagick-convert, connect, php-7.1, sublimetext, wso2carbon, python-telegram-bot, react-native-router-flux, desktop, paypal-rest-sdk, php-5.6, division, typeclass, identityserver3, mapbox-gl, wix-react-native-navigation, sample, point-clouds, web-audio, vertica, java-time, mosquitto, dllimport, dump, covariance, cytoscape.js, cgal, r-package, installshield, eigen3, point-cloud-library, ngx-datatable, sampling, vapor, clojurescript, auto-increment, overlap, ibm-cloud-infrastructure, echarts, web-audio-api, hp-uft, office365api, ngx-translate, shortcut, paho, ember-cli, rfid, applet, swap, predicate, host, detox, cas, jsf-2.2, trigonometry, sandbox, beagleboneblack, filestream, numpy-broadcasting, xsd-validation, fgets, ghc, serenity-bdd, spi, duration, jna, unzip, fiware, rhel7, long-integer, pentaho-spoon, cpython, crystal-lang, assertion, string-concatenation, compatibility, java-module, fluid, meta-tags, wildfly-10, hashicorp-vault, apache-karaf, roblox, gurobi, install4j, development-environment, static-analysis, ffi, h5py, django-authentication, urlencode, directx-11, salt-stack, r-raster, pseudocode, hsqldb, django-celery, fatal-error, pywinauto, peewee, p2p, tinymce-4, mysql-5.7, openlayers-5, vis.js, palindrome, angular-template, resteasy, kable, quickbooks, monaco-editor, rdp, solrj, file-transfer, language-agnostic, mustache, java-7, naudio, velocity, wikidata, copy-constructor, countdown, wtforms, montecarlo, wso2ei, sitemap, stringbuilder, geoserver, joomla3.0, symbolic-math, bytecode, high-availability, sharepoint-2010, assign, semantic-web, rtf, vmware-clarity, odoo-12, informatica, volume, jit, monogame, super, eof, syncfusion, rust-cargo, dataweave, stack-trace, browser-sync, arduino-ide, blogdown, communication, rider, favicon, fill, processbuilder, biginteger, labview, resampling, normal-distribution, linear, angular4-router, hierarchical-clustering, drag, modbus, diagram, word, org-mode, admin-on-rest, equation, webrequest, tinkerpop3, restful-authentication, lib, tizen, user-agent, survival-analysis, point, fragment-shader, tableau-server, ansible-inventory, delete-file, code-injection, clickhouse, text-editor, kettle, angularjs-material, date-range, rpy2, complex-numbers, graphene-python, coded-ui-tests, midi, programming-languages, web-push, pine-script, equality, holoviews, sapply, quotes, jtextfield, emscripten, sas-macro, angular-http, varnish, phalcon, typing, freeze, opentok, password-protection, anonymous-function, resolution, remote-desktop, cryptocurrency, hpc, default-value, tsc, multiline, chromium-embedded, treemap, substitution, arm64, shutil, supervisord, at-command, interpreter, packet, google-search, dynamics-crm-online, can-bus, neo4j-apoc, ranking, httpserver, gsm, freebsd, centos6, yield, c++-winrt, fread, anypoint-studio, jboss7.x, type-hinting, wixcode, epoch, uninstall, autoit, smartcard, wikipedia, angular-service-worker, cosine-similarity, protege, schema.org, typescript-generics, dropbox-api, verification, composition, windows-server, using, hmac, dry, ag-grid-angular, median, messenger, rethinkdb, thingsboard, xilinx, named-pipes, office-ui-fabric, dynamics-crm-365, heroku-cli, date-format, imputation, jfreechart, wiremock, packaging, outliers, target, typo3-7.6.x, ngrok, audit, models, jboss-eap-7, moving-average, 32bit-64bit, strapi, views, silverstripe-4, rasa-core, content-type, event-loop, textinput, smoothing, surface, explode, getusermedia, 7zip, confidence-interval, watch, freertos, zend-framework2, circular-dependency, qliksense, repeater, cjk, clock, django-testing, pic, csrf-protection, shopping-cart, encapsulation, paypal-ipn, json-ld, cobol, key-bindings, relative-path, hashcode, thrift, bing-maps, localdate, dicom, netezza

google: google-search

centroid 37: python, machine-learning, tensorflow, python-3.x, keras
—–
tensorflow, numpy, keras, machine-learning, opencv, matlab, deep-learning, scikit-learn, image-processing, neural-network, scipy, nlp, pytorch, computer-vision, conv-neural-network, lstm, data-science, regression, classification, dataset, gpu, nltk, google-colaboratory, linear-regression, artificial-intelligence, object-detection, cluster-analysis, generator, spacy, interpolation, logistic-regression, svm, data-analysis, tesseract, random-forest, tensorflow-datasets, bazel, signal-processing, recurrent-neural-network, opencv3.0, fft, tensorboard, gensim, sparse-matrix, word2vec, cv2, h2o, cross-validation, tensor, caffe, xgboost, reshape, reinforcement-learning, stanford-nlp, linear-algebra, k-means, tensorflow-estimator, prediction, rnn, keras-layer, pca, probability, curve-fitting, decision-tree, text-mining, tensorflow-serving, loss-function, object-detection-api, google-cloud-ml, metrics, mathematical-optimization, image-segmentation, matrix-multiplication, autoencoder, tensorflow-lite, r-caret, text-classification, sentiment-analysis, scikit-image, convolution, training-data, categorical-data, tensorflow.js, knn, mnist, valueerror, sklearn-pandas, gradient-descent, weka, yolo, python-tesseract, word-embedding, feature-extraction, tokenize, emgucv, feature-selection, predict, coreml, similarity, normalization, one-hot-encoding, backpropagation, convolutional-neural-network, ode, lda, tf-idf, openai-gym, image-recognition, theano, naivebayes, rasa-nlu, detection, grid-search, data-mining, differential-equations, layer, torch, roc, mxnet, camera-calibration, topic-modeling, cudnn, kaggle, confusion-matrix, tensorflow2.0, doc2vec, distributed, embedding, tfrecord, recommendation-engine, multilabel-classification, ner

google: google-colaboratory, google-cloud-ml

centroid 46: python, php, apache, python-3.x, ubuntu
—–
apache, nginx, .htaccess, web, url, pip, anaconda, redirect, pycharm, dns, ubuntu-16.04, permissions, url-rewriting, package, mod-rewrite, installation, conda, centos7, ubuntu-18.04, debian, virtualenv, python-import, install, apache2, url-redirection, vagrant, webserver, virtualbox, cpanel, gunicorn, digital-ocean, http-status-code-404, config, hosting, subdomain, systemd, uwsgi, wamp, nvidia, setuptools, ubuntu-14.04, mod-wsgi, pipenv, url-routing, mamp, httpd.conf, environment, query-string, seo, wsgi, subdirectory, setup.py, wampserver, permalinks, pypi, apache2.4, lamp, apt, http-status-code-301, miniconda, http-redirect

centroid 50: python, python-3.x, python-2.7, pandas, javascript
—–
python, python-3.x, pandas, dataframe, python-2.7, web-scraping, datetime, parsing, beautifulsoup, post, python-requests, scrapy, request, pygame, python-3.6, pandas-groupby, encoding, unicode, get, utf-8, twitter, character-encoding, header, python-3.7, web-crawler, python-imaging-library, pyinstaller, spyder, tags, export-to-csv, typeerror, odoo-10, openpyxl, jsoup, regex-group, python-3.5, ascii, series, urllib, wxpython, lxml, turtle-graphics, xlsx, rvest, nan, argparse, python-2.x, mysql-python, datetime-format, kivy-language, emoji, importerror, screen-scraping, scrapy-spider, pyserial, html-parsing, python-xarray, flask-restful, tkinter-canvas, cheerio, frame, odoo-9, cx-freeze, twisted, tk, python-datetime, python-unicode, decoding, httr, tkinter-entry, timedelta

centroid 16: python, unix, bash, sed, linux
—–
regex, linux, bash, shell, unix, perl, awk, sed, text, replace, split, cron, command-line, grep, scripting, sh, find, substring, filesystems, notepad++, fork, posix, scheduler, cgi, delimiter, text-processing, aix, cut, solaris

centroid 29: r, python, ggplot2, plot, matplotlib
—–
r, matplotlib, ggplot2, dplyr, shiny, jupyter-notebook, plot, graph, time-series, statistics, colors, plotly, julia, rstudio, graphics, seaborn, cypher, tidyverse, bokeh, data-visualization, bar-chart, label, networkx, histogram, ipython, purrr, tidyr, visualization, gremlin, gnuplot, influxdb, igraph, shinydashboard, legend, heatmap, octave, graph-theory, simulation, data-manipulation, correlation, raster, plotly-dash, statsmodels, data-cleaning, na, matlab-figure, mutate, scatter-plot, boxplot, multi-index, stringr, dashboard, forecasting, scale, lubridate, r-plotly, missing-data, graph-databases, arima, pie-chart, graph-algorithm, jupyter-lab, axis, geopandas, shiny-server, distribution, breadth-first-search, sparklyr, bayesian, lag, sf, tibble, subplot, matplotlib-basemap, xts, contour, plyr, anova, axis-labels, shortest-path, shiny-reactivity, ggmap, dijkstra, figure, rlang, facet, ggplotly, zoo

centroid 49: ruby, ruby-on-rails, ruby-on-rails-5, javascript, activerecord
—–
ruby-on-rails, postgresql, ruby, heroku, redis, ruby-on-rails-5, yii2, hash, activerecord, routes, shopify, ruby-on-rails-4, rspec, rubygems, devise, chef, ruby-on-rails-3, bundle, rails-activestorage, associations, rails-activerecord, puppet, ruby-on-rails-5.2, metaprogramming, rspec-rails, activeadmin, bundler, coffeescript, shopify-app, sidekiq, carrierwave, passenger, sinatra, simple-form, nokogiri, cloudinary, capistrano, puma, webpacker, mongoid, actioncable, erb, haml, rake, paperclip, factory-bot

centroid 41: sql, sql-server, mysql, database, postgresql
—–
sql, mysql, sql-server, database, tsql, date, mysqli, join, select, group-by, mariadb, sequelize.js, pdo, phpmyadmin, sql-server-2008, data.table, count, sql-server-2012, xampp, database-design, duplicates, view, sum, timestamp, pivot, foreign-keys, sql-update, mysql-workbench, timezone, tableau, insert, ssms, odbc, constraints, sql-server-2016, etl, subquery, max, syntax-error, left-join, case, database-connection, decimal, inner-join, prepared-statement, full-text-search, database-migration, sql-order-by, query-optimization, sql-server-2008-r2, sql-server-2017, sql-insert, union, backup, aggregate-functions, where, partitioning, common-table-expression, distinct, query-performance, concat, oledb, innodb, replication, window-functions, sql-injection, mdx, primary-key, greatest-n-per-group, where-clause, database-performance, sql-delete, data-warehouse, rdbms, sql-like, database-administration, ddl, entity-relationship, bulkinsert, ssis-2012, calculated-columns, resultset, derby, database-schema, sql-server-2005, create-table, database-normalization, temp-tables

centroid 14: swift, ios, xcode, objective-c, android
—–
ios, swift, xcode, objective-c, uitableview, swift4, iphone, uicollectionview, facebook-graph-api, realm, bluetooth-lowenergy, twilio, core-data, cocoa, admob, cocoapods, alamofire, arkit, swift3, crash, uiview, tableview, localization, keyboard, autolayout, sprite-kit, frameworks, uiviewcontroller, augmented-reality, uikit, wkwebview, scenekit, accessibility, closures, in-app-purchase, instagram, xcode10, uinavigationcontroller, avfoundation, uibutton, apple-push-notifications, uiscrollview, uitextfield, delegates, crashlytics, app-store, uicollectionviewcell, ios11, protocols, macos-mojave, ios12, xcode9, storyboard, uinavigationbar, qr-code, uilabel, uiimageview, uitabbarcontroller, ipad, optional, decode, uiimage, cell, metal, segue, mapkit, deep-linking, swift4.2, parse-platform, codable, spotify, uitextview, avplayer, itunesconnect, facebook-login, gradient, touch, audiokit, cocoa-touch, textfield, assets, ios-simulator, uistackview, uisearchbar, uiwebview, grand-central-dispatch, firebase-dynamic-links, watchkit, voip, nslayoutconstraint, fastlane, uiimagepickercontroller, iphone-x, nsattributedstring, interface-builder, viewcontroller, uipickerview, nsuserdefaults, core-bluetooth, decodable, xctest, extension-methods, apple-watch, nsurlsession, core-location, code-signing, navigationbar, uialertcontroller, orientation, uigesturerecognizer, objectmapper, swift-playground, keychain, core-graphics, xib, uisearchcontroller, lldb, swift-protocols, avaudioplayer, tvos, cloudkit, shadow, appdelegate, contacts, statusbar, swift4.1, testflight, mkmapview, ios-autolayout, uibezierpath, gesture, uipageviewcontroller, health-kit, uitabbar, cllocationmanager, calayer, xcuitest, provisioning-profile, icloud, uicollectionviewlayout, geofire, uistoryboard, metalkit, pdfkit, plist, uiviewanimation, swifty-json, collectionview, uibarbuttonitem, ipa, xcode-ui-testing

centroid 32: symfony, php, symfony4, javascript, mysql
—–
forms, symfony, symfony4, composer-php, twig, doctrine, annotations, doctrine-orm, phpunit, symfony-3.4, translation, autowired, symfony-forms, sonata-admin, api-platform.com, fosuserbundle, swiftmailer, data-annotations, sonata

centroid 27: testing, java, javascript, unit-testing, automated-tests
—–
angularjs, unit-testing, testing, groovy, junit, jmeter, protractor, mockito, automated-tests, jasmine, mocking, cucumber, appium, pytest, testng, robotframework, integration-testing, cypress, performance-testing, android-espresso, code-coverage, e2e-testing, junit5, chai, python-unittest, junit4, sinon, karma-runner, ui-automation, load, appium-android, nightwatch.js, load-testing, tdd, katalon-studio, testcafe, jest, bdd, spock, powermockito, powermock, codeception, jmeter-plugins, rpa, qa, cucumberjs, jmeter-4.0, angular-test, robolectric

Full code

The code I used to create and display the previous results:
clusters = 50
percent = '01'
one_hot_dimensions = 50

model = 'deleting.kmeans_tagsubtag_%s_big_a_%s_%s' % (clusters, percent,one_hot_dimensions)
clusters_temp_table = 'deleting.clusters_%s_result_a_%s_%s' % (clusters, percent, one_hot_dimensions)

one_hot_p = client.query("""
SELECT STRING_AGG(
  FORMAT("IFNULL(ANY_VALUE(IF(tag2='%%s',1,null)),0)X%%s", tag2, REPLACE(REPLACE(REPLACE(REPLACE(tag2,'-','_'),'.','D'),'#','H'),'+','P')) 
) one_hot
FROM (
  SELECT tag2, SUM(questions) questions 
  FROM `deleting.stack_overflow_tag_co_ocurrence`
  GROUP BY tag2
  ORDER BY questions DESC
  LIMIT %s
)""" % one_hot_dimensions).to_dataframe().iloc[0]['one_hot']

client.query("""
CREATE OR REPLACE MODEL `%s` 
OPTIONS ( 
    model_type='kmeans',
    distance_type='COSINE',
    num_clusters=%s )
AS
WITH tag_and_subtags AS (
    SELECT tag1, %s
    FROM `deleting.stack_overflow_tag_co_ocurrence`
    WHERE percent>0.%s
    GROUP BY tag1
)
SELECT * EXCEPT(tag1)
FROM tag_and_subtags
""" % (model, clusters, one_hot_p, percent)).to_dataframe()

df = client.query("""
CREATE OR REPLACE TABLE %s
AS
WITH tag_and_subtags AS (
    SELECT tag1, MAX(questions) questions, %s
    FROM `deleting.stack_overflow_tag_co_ocurrence`
    WHERE percent>0.%s
    GROUP BY tag1
)

SELECT centroid_id
 , STRING_AGG(tag1, ', ' ORDER BY questions DESC) tags
 , ARRAY_TO_STRING(ARRAY_AGG(IF(tag1 LIKE '%%google%%', tag1, null)  IGNORE nulls LIMIT 18), ', ') google_tags
 , ARRAY_TO_STRING(ARRAY_AGG(IF(tag1 LIKE '%%amazon%%' OR tag1 LIKE '%%aws%%', tag1, null)  IGNORE nulls LIMIT 18), ', ') amazon_tags
 , ARRAY_TO_STRING(ARRAY_AGG(IF(tag1 LIKE '%%azure%%', tag1, null)  IGNORE nulls LIMIT 18), ', ') azure_tags
 , COUNT(*) c
FROM ML.PREDICT(MODEL `%s` 
  , (SELECT * FROM tag_and_subtags )
)
GROUP BY 1
ORDER BY c DESC
""" % (clusters_temp_table, one_hot_p, percent, model)).to_dataframe()

df = client.query("""
SELECT centroid_id, c, yep, tags
    , IFNULL(google_tags, '') google_tags
    , IFNULL(amazon_tags, '') amazon_tags
    , IFNULL(azure_tags, '') azure_tags
FROM `%s` 
JOIN (
  SELECT centroid_id
    , STRING_AGG(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(feature,'_','-'),'D','.'),'H','#'),'P','+'), 'X', ''), ', ' ORDER BY numerical_value DESC  LIMIT 5) yep
  FROM ML.CENTROIDS(MODEL `%s`
  )
  GROUP BY 1
) 
USING(centroid_id)
ORDER BY yep
""" % (clusters_temp_table, model)).to_dataframe()

for index, row in df.iterrows():
    print('centroid %s: %s\n-----' % (row['centroid_id'],  row['yep']))
    print(row['tags'])
    print()
    if row['google_tags']:
        print('google: %s' % row['google_tags'])
    if row['amazon_tags']:
        print('amazon: %s' % row['amazon_tags'])
    if row['azure_tags']:
        print('azure: %s' % row['azure_tags'])
    print('\n')
Note that I joined my results with ML.CENTROIDS() — which gives me the top differentiating occurring tags for each centroid — even if they are not part of each cluster.

And that’s all it takes

Are you surprised by the results? Or surprised by how easy was to run our k-means modeling with BigQuery? Or curious for why I chose distance-type: 'COSINE' for this problem? Now it’s your turn to play :). Check the BigQuery docs for creating models and the k-means tutorial.

Lak’s thoughts

How would you choose the best parameters? Check Lak Lakshmanan post for hyper-parameter tuning. Lak also had some interesting ideas about dimensionality reduction for clustering via matrix factorization — but we’ll leave that for a future post. As an untested preview: Once you have the co-occurrence of tag1 and tag2, treat it as a recommendation problem, i.e. tag2 is what tag1 liked percent of the time. Then, you can do matrix factorization:
CREATE OR REPLACE MODEL deleting.tag1_tag2
OPTIONS (
model_type='matrix_factorization',
user_col='tag1', item_col='tag2', rating_col='percent10' )
AS
SELECT tag1, tag2, percent*10 AS percent10
FROM advdata.stack_overflow_tag_co_ocurrence
From this matrix factorization, you will have tag1_factors and tag2_factors which are essentially an embedding of the tags learned from the data. Concatenate these and cluster it instead …
SELECT feature, factor_weights
FROM ML.WEIGHTS( MODEL deleting.tag1_tag2 )
WHERE processed_input = 'tag1' and feature LIKE '%google%'

Want more?

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Tweet me @felipehoffa, my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery – including predicting when will Stack Overflow reply.

Author

Felipe Hoffa
Developer Advocate
Developer Advocate at Google Cloud. Originally from Chile, now in San Francisco and around the world. #BigQuery #bigdata #opendata.

Related Articles

Comments

  1. That centroid 48 (python, java, javascript, r, php) contains a huge lot of totally unrelated tags.

  2. Jakub Narębski says:

    k-Means is not actually a *clustering* algorithm; it is a *partitioning* algorithm. That is to say K-means doesn’t ‘find clusters’ it partitions your dataset into as many (assumed to be globular – this depends on the metric/distance used) chunks as you ask for by attempting to minimize intra-partition distances.

    I guess it was used because it is fast, and possible to implement in BigQuery ML. Better solution might be to use (almost) parameter-less HDBSCAN algorithm (a density based algorithm). This algorithm also allows for points to not be in any cluster, i.e. for outliers. It is slower than k-Means, though.

    See e.g. https://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb

  3. Apples, Apricots, Avocados says:

    Dear Felipe, Developer Advocate for Google Cloud:
    What explains that blockchain/ethereum is in the top visualization, but not ios/swift? Is there a bias somewhere?

  4. Great writeup, I really like how you shared the queries!
    As an analyst myself, I’d report out in somewhat more condensed fashion. For fellow analysts it is interesting to understand why you made the decision on more than 180 posts per keyword. My guess is frequency analysis and a clear cutoff pattern for it. That’s where you’ve lost the management person in the audience normally 😀

    For the 50 centroids as an outcome for k-means analysis, I’d summarize with the centroid name and would ‘collapse’ with a spoiler or other type of content hiding feature the full list of centroid tags.

    I’m missing a remark on the centroid tags; as for example you mention that python is related to pandas. However, pandas is a smaller subset related to python development, but there is a hierarchy in this. Pandas does not exist without python, python can exist without Pandas. Sometimes we use in our analysis an exclusion, so all items with only one tag will be central. There still might be some ‘pollution’ in that set, as not all question posters have included all related tags that might be relevant.

    In any case, thanks for making such an elaborate and open presentation, it really helps the research field forward!

  5. grady player says:

    So… learning algs will always make mistakes… on computer and in meat… but these are fun, but not useful.

    `centroid 7: android, java, javascript, ios, python` -> have fun parsing out what the what version of java to run on you python ios device…

    `c++, c, python, linux, java` -> yup those are popular things…

    I suspect that the methodology allows something like `linux` to bridge unrelated C/Java … as there is probably large co-occurrence with linux/C and linux/Java, but probably very low co-occurrence with C/Java.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.