A deeper dive into our May 2019 security incident

Back in May 2019, we had a security incident that was reported on this blog. It’s been quite some time since our last update but, after consultation with law enforcement, we’re now in a position to give more detail about what happened, how it happened, and what we did to address the underlying issues that allowed the incident to occur.

On May 12th, 2019, at around 00:00 UTC, multiple members of the community alerted us to an unexpected privilege escalation for a new user account. A user that nobody recognised had gained moderator and developer level access across all of the sites in the Stack Exchange Network. Our immediate response was to revoke privileges and suspend this account, and then to set in motion a process to identify and audit the actions that led to the event.

After initial discovery, we found that the escalation of privilege was just the tip of the iceberg and the attack had actually resulted in the exfiltration of our source code and the inadvertent exposure of the PII (email, real name, IP addresses) of 184 users of the Stack Exchange Network (all of whom were notified). Thankfully, none of the databases—neither public (read: Stack Exchange content) nor private (Teams, Talent, or Enterprise)—were exfiltrated. Additionally, there has been no evidence of any direct access to our internal network infrastructure, and at no time did the attacker ever have access to data in Teams, Talent, or Enterprise products.

In order to understand how the privilege escalation and subsequent exfiltration of source code occurred, we needed to be able to trace the attacker’s accesses to our sites prior to the culmination of the attack. Fortunately, we have a database containing a log of all traffic to our publicly accessible properties—this proved invaluable in identifying activity associated with the attacker. Using the account identifier that had been escalated, we were able to use the IP address and other identifying information to correlate traffic to a candidate set of rows. This amounted to well over 75,000 rows of data that we then set out to categorise. Based upon that categorisation, we were able to further filter out rows to those that were deemed “interesting.” Coupled with other information from our customer support team and various other sources of log data, we came up with a timeline of events. This is quite detailed but it’s here because we’d like to bring attention to the amount of time the attacker took to understand our infrastructure and gradually escalate their privilege level to the point at which they could exfiltrate our source code.
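The correlation step described above can be pictured with a small sketch: start from the known account identifier, collect the fingerprints (here, IP address plus user agent) seen with it, then pull every row that shares one of those fingerprints. The `LogRow` fields, the use of the user agent as "other identifying information," and the `correlate` function are all assumptions made for illustration; this is not the schema or tooling used in the investigation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class LogRow:
    # Hypothetical traffic-log columns, not the real schema.
    account_id: Optional[int]
    ip: str
    user_agent: str
    path: str


def correlate(rows: Iterable[LogRow], account_id: int) -> List[LogRow]:
    """Return the candidate rows that may belong to the same actor."""
    rows = list(rows)
    # Step 1: fingerprints (IP + user agent) seen with the known account.
    fingerprints: Set[Tuple[str, str]] = {
        (r.ip, r.user_agent) for r in rows if r.account_id == account_id
    }
    # Step 2: every row logged in as the account or sharing a fingerprint,
    # which also catches anonymous requests from the same client.
    return [
        r for r in rows
        if r.account_id == account_id or (r.ip, r.user_agent) in fingerprints
    ]


# Example data using documentation-reserved IP ranges.
rows = [
    LogRow(42, "203.0.113.7", "ua-1", "/users/login"),
    LogRow(None, "203.0.113.7", "ua-1", "/admin/site-settings"),
    LogRow(None, "198.51.100.9", "ua-2", "/questions/1"),
]
candidates = correlate(rows, account_id=42)
```

A candidate set built this way will include unrelated traffic that happens to share a fingerprint (shared IPs, common user agents), which is why the rows still needed to be categorised and filtered afterwards.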

Tuesday April 30th, 2019

The attacker starts probing our infrastructure, in particular parts of our build/source control systems and web servers hosting some of our development environments.

Wednesday May 1st

The attacker continues probing our public network and attempts to access employee-only rooms in Stack Exchange Chat—notably our SRE room. They get “access denied.”

Additionally a person claiming to be one of our Enterprise customers submits a support request to obtain a copy of source code for auditing purposes. This request is rejected because we don’t give out source code and, additionally, the email cannot be verified as coming from one of our customers. It is flagged for further investigation by our support team.

Next, the attacker creates a Team on Stack Overflow on one device and sends an email invite to another account, which is accepted using another device.

Thursday May 2nd

The attacker visits a number of Meta posts associated with our Teams product and publicly available case studies related to Enterprise published on stackoverflow.com.

Another support ticket is raised following up on the previous one, but this time with a spoofed email address of an actual customer. Details were apparently harvested from the case studies above. An automated reply is sent to the customer (because of the spoofed address), and we are quickly notified that it is not a legitimate request.

Friday May 3rd

Support request is denied. Attacker continues probing public facing infrastructure, including viewing user profiles of support personnel dealing with their support ticket.

Saturday May 4th

Attacker accesses a URL to attempt to download a ZIP file containing Stack Overflow source code from our GitHub Enterprise instance but is redirected to login. We later discover that the repository URL is inadvertently referenced in a public GitHub repo containing some of our open source code.

Sunday May 5th

This is a busy day. Attacker starts with further probing of our dev environments and a little later a login request is crafted to our dev tier that is able to bypass the access controls limiting login to those users with an access key. The attacker is able to successfully log in to the development tier.

Next, the attacker starts probing a number of internal URLs that we later find are documented in our Enterprise support portal but is unable to access many of them due to an insufficient level of privilege.

Our dev tier was configured to allow impersonation of all users for testing purposes, and the attacker eventually finds a URL that allows them to elevate their privilege level to that of a Community Manager (CM). This level of access is a superset of the access available to site moderators.

After attempting to access some URLs that this level of access does not permit, they use account recovery to attempt to recover access to a developer’s account (a higher privilege level again) but are unable to intercept the email that was sent. However, there is a route on dev that can show email content to CMs and they use this to obtain the magic link used to reset credentials. This is used and the attacker gains developer-level privileges in the dev environment. Here they are also able to access “site settings,” a central repository of settings (feature flags) that configure a lot of functionality within the site.

Monday May 6th

Another busy day—the attacker resumes with access to the dev tier. While they are browsing around and understanding what their new found privilege gives access to, they are also browsing production to harvest more information about the Stack employees that work on the Teams product. In addition, they use their freshly created Teams instance in production to try some admin-level functionality in a “safe” environment as well as trying to use impersonation on production. Impersonation is not compiled into production binaries, so all of these requests result in HTTP 404 responses.

After some time spent investigating URLs on dev, the attacker accesses site settings again and stumbles upon a setting that contains credentials of a service account with access to our TeamCity instance. Historically, this was used for accessing the TeamCity REST API from within the Stack Exchange Network’s code base and, while the functionality was removed a long time ago, the credentials remained and were still valid.

Using the credentials, the attacker attempts to log in to TeamCity (which at the time was accessible from the internet) and is granted access. This user has never logged in interactively before, and a misconfiguration with role assignments means the user is immediately granted administrative privileges to the build server.

A significant period of time is spent investigating TeamCity—the attacker is clearly not overly familiar with the product so they spend time looking up Q&A on Stack Overflow on how to use and configure it. This act of looking up things (visiting questions) across the Stack Exchange Network becomes a frequent occurrence and allows us to anticipate and understand the attacker’s methodology over the coming days.

While browsing TeamCity, the attacker is able to download build artifacts containing binaries and setup configuration scripts for our Enterprise product. From here, they browse more questions on Stack Overflow—including configuring IIS and migrating data from our Teams product to an Enterprise instance, among many other questions related to our Teams products.

For their final act of the day, the attacker attempts to gain access to a build agent within our data center but is unable to connect because they would need VPN access to do so.

Tuesday May 7th

Attacker browses questions related to setting up and configuring Enterprise, and they investigate more site settings.

Wednesday May 8th

Attacker immediately logs into TeamCity and continues browsing the administrative sections of the site. They stumble across a diagnostics section that allows browsing of the build server’s file system—this yields a plaintext SSH key used by build agents to obtain source code from GitHub Enterprise. Within minutes, this was used to clone several key repositories (gathered from VCS roots configured in TeamCity).

This process of gradually cloning continues throughout the day, and the attacker browses questions on Stack Overflow related to the build of .NET projects. Over the course of the day, they manage to get their hands on our local development setup scripts (called `DevLocalSetup`) and wikis related to operating our production environment.

Later, they attempt to log in directly to GitHub Enterprise using the web interface and the service account credentials that were used to access TeamCity. Fortunately, login to our GitHub instance is protected by 2FA and the service account is not in the relevant AD groups that permit access to it.

Thursday May 9th

Attacker pulls latest copies of all the repositories they had previously cloned.

The attacker attempts to use an Azure-based VM to connect to our VPN using the service account credentials that were used to access TeamCity, but the attempts to log in fail because service accounts are not permitted to access the network in this way.

They continue to browse Stack Overflow for details on building and running .NET applications under IIS as well as running SQL scripts in an Azure environment.

Friday May 10th

Attacker pulls latest copies of all repositories they had previously cloned and browses configuration settings in our TeamCity builds.

They continue viewing Stack Overflow and Server Fault questions around IIS and .NET applications.

Saturday May 11th

This is the day that the attacker manages to escalate their privileges across the Stack Exchange Network. Once again, the attacker pulls the latest copies of source code and immediately starts to investigate build logs and configurations.

From an Azure VM, the attacker changes Git username mappings in TeamCity to make it look like the build service account (which the SSH key is mapped to) is another account in audit logs. They then create a project in TeamCity, switch off versioned settings for that project (which audits changes to Git), and start to configure the project with build configurations.

Several different builds are attempted. Initially, they attempt to create a copy of databases that are used to configure our local environments from a network share but are unable to upload to any externally accessible locations. After that they attempt the same thing with some of our internal NuGet packages, but this attempt also fails.

After failing to upload files they create a public gist on GitHub that contains the SQL needed to elevate permissions across the Stack Exchange Network. After several attempts, they are able to craft a build that executes this as a SQL migration against the production databases housing data for the Stack Exchange Network.

Immediately they attempt to clean up any evidence of the attack by removing builds and history. Fortunately, we have a “trash can” for TeamCity configuration and are later able to recover these builds to understand what aspects of the attack each build was responsible for.

Sunday May 12th

Shortly after execution of the SQL, we were notified of the odd activity by the community and our incident response team started investigating.

At this point, we did not know the extent of the attack so initial remediation focused on removal of privileges and credentials. Further investigation led us to the builds that ran on TeamCity and the compromised TeamCity account which was immediately disabled, followed by bringing TeamCity offline entirely.

Once we discovered that the escalation path involved dev and the use of site settings to acquire credentials, we committed code to remove those paths—notably, the tool used to view an account recovery email and the site settings used to compromise the TeamCity service account. Additionally, all affected accounts were removed or had their credentials reset. At this point, the initial incident is considered to be dealt with and emails are sent to engage a secondary response team for forensics and further fixes on Monday.

Meanwhile, the attacker attempts to access user impersonation on production (functionality isn’t present in production builds) and also tries to access TeamCity (which is now offline) but both attempts fail. However, they continue to be able to pull source code (at this time we were not yet aware they had access to Stack’s source code).

Monday May 13th

Attacker pulls source code again. Whilst doing this, they are viewing questions on Stack Overflow on how to publish and consume NuGet packages, how to build .NET codebases, and how to delete repositories on GitLab.

They attempt to access TeamCity repeatedly, but it is still offline.

Secondary response team starts work. We start by pulling traffic logs to understand what happened. Given that the attacker had access to dev, we rotated the access keys used for that environment and began rotating any secrets that were exposed from site settings or TeamCity. In order to allow us to build fixes to production, the TeamCity service is brought online but, this time, only inside the firewall. We ascertained that dev access checks were missing from some login routes, allowing the attacker to replay a login from prod and successfully gain access to dev—these checks are then added. Additionally, we add write-only access to secrets on site settings, closing the vulnerability that enabled the attacker to retrieve credentials from this UI. We also discover that many secrets stored in the build server are not marked as secrets (this obfuscates them in the UI)—so anybody could view them. These are updated appropriately as part of the rotation process.
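The write-only behaviour for secrets mentioned above can be sketched as a settings store that accepts a secret value but never echoes it back through the display path. The `SiteSettings` class, its method names, and the setting keys below are hypothetical and only illustrate the idea; they are not the actual site settings implementation.

```python
class SiteSettings:
    """Hypothetical settings store where secret values are write-only."""

    MASK = "********"

    def __init__(self):
        self._values = {}
        self._secret_keys = set()

    def set(self, key, value, secret=False):
        """Store a value; once a key is marked secret it stays secret."""
        self._values[key] = value
        if secret:
            self._secret_keys.add(key)

    def display(self, key):
        """Value as shown in an admin UI: secrets are never echoed back."""
        if key in self._secret_keys:
            return self.MASK
        return self._values[key]

    def resolve(self, key):
        """Real value, intended only for application code at runtime."""
        return self._values[key]


settings = SiteSettings()
settings.set("teamcity.password", "example-secret", secret=True)
settings.set("feature.new_nav", "on")
shown = settings.display("teamcity.password")   # masked in the UI
used = settings.resolve("teamcity.password")    # available to the app
```

In a real system the `resolve` path would also need its own access control; the sketch only shows why a privileged UI user can no longer read a credential back out of a settings screen.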

Meanwhile the attacker continues to pull source code—at this point, we’re still unaware that they have a valid SSH key for GitHub. They keep trying to access TeamCity—we can see this because the traffic is still hitting our load balancer and landing in our logs. They continue viewing Stack Overflow questions, this time related to refreshing many Git repos programmatically (surprise!) and creating SQL databases.

Secondary response team investigation continues into the following day.

Tuesday May 14th

Early in the day, the attacker pulls the latest source code again and continues trying to access TeamCity (now available from inside our network only and inaccessible to the attacker).

During our continued analysis of the attacker’s traffic, we find evidence of access to the SSH key used by the build server. We immediately revoke the key and investigate any attempts to use it. We were able to dump all SSH and HTTPS traffic logs from GitHub and found indications that the key was used outside of our network. We make the decision to immediately move GitHub behind the firewall in case the attacker has more than just SSH access. It’s worth noting that once an SSH key has been added to a GitHub account, if the raw key material is exposed, the key can be used without any form of 2FA involved. The same is true for personal access tokens which are provisioned when cloning using HTTPS + 2FA.

We begin auditing for any commits that were not pushed by a Stack employee, but find nothing untoward. We also begin an audit of all repositories for any secrets that are not injected at build time and instruct the team that owns each repository to move secrets into the build process and to rotate them. Additionally, we begin auditing all third party systems for unauthorised access (we find no evidence of access).
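As a rough illustration of what an audit for secrets in repositories might look like, the sketch below scans text for two patterns: private key headers and passwords embedded in connection strings. The patterns and the `scan_text` and `scan_repo` helpers are assumptions chosen for the example and will miss many kinds of secrets; they do not represent the audit process or tooling actually used.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real audit would need a much broader set.
PATTERNS = {
    "private key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "connection string password": re.compile(r"(?i)\bpassword\s*=\s*[^;\s]+"),
}


def scan_text(text):
    """Return (line number, label) pairs for lines that look like secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


def scan_repo(root):
    """Scan every file in a working tree, skipping Git's own metadata."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and ".git" not in path.parts:
            hits = scan_text(path.read_text(errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results


sample = (
    "Server=db;Database=app;User Id=svc;Password=example-secret;\n"
    "name = value\n"
    "-----BEGIN OPENSSH PRIVATE KEY-----\n"
)
findings = scan_text(sample)
```

A scan like this only covers the current working tree. Secrets that were committed and later deleted remain in Git history, which is one reason rotation matters as much as removal.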

We contract an external security vendor to audit and double check our methodology and all available data as well as assist us in the investigation.

Thursday May 16th

Auditing continues and secret rotation propagates across the various affected systems. We start to categorise traffic that resulted in inadvertent PII access—we don’t see any evidence that the attacker was actively seeking personal data, but want to be able to notify affected users if we can.

A blog post is published notifying our community of the breach.

Attacker has little activity today, limited to viewing Q&A around SQL databases in Azure.

Friday May 17th

Attacker continues viewing Q&A, this time around SQL and certificates. No other activity of note.

We continue to categorise the traffic to find any PII access.

A blog post is published with an update, including our intent to notify an estimated 250 users who had their PII accessed.

Saturday May 18th

Attacker continues viewing Q&A, this time around removing git remotes. No other activity of note.

Wednesday May 22nd

Users affected by PII leaks are notified.

Thursday May 23rd

Our secondary investigation concludes, and we produce a set of remediations to address the underlying issues that led to the attack; more details on that below.

This incident brought to light shortcomings in how we’d structured access to some of our systems and how we managed secrets, both at build time and in source code. During the investigation, we made a number of short-term changes to address some of these shortcomings:

Move build and source control systems behind the firewall, requiring valid VPN credentials to access. Most of our engineers are remote so these systems were originally on the Internet for ease of access. Our GitHub Enterprise instance was already using 2FA.
Removal of default group assignments from TeamCity. This was an accidental misconfiguration from many years ago that was never reverted and meant all newly created users inherited administrative privileges. Administrative privileges in TeamCity can browse the server’s file system where things like SSH keys are stored in plain text.
Bad secret hygiene—we had secrets sprinkled in source control, in plain text in build systems and available through settings screens in the application. We made sure all secrets were removed from source code, rotated, and then securely added to build systems or made available at runtime to the application. All secrets are now write-only—once set, they cannot be read except by the application or specifically authorised employees behind the firewall.
Access to support documentation for our Enterprise product was limited to authorised users of that product.
Metrics and alerting around privilege escalation in production. Any escalation to developer level access now notifies a group of people that can validate the escalation is legitimate.
Hardening code paths that allow access into our dev tier. We cannot take our dev tier off of the internet because we have to be able to test integrations with third-party systems that send inbound webhooks, etc. Instead, we made sure that access can only be gained with access keys obtained by employees and that features such as impersonation only allow de-escalation, i.e. the currently authenticated user can only impersonate users with a lower or equal privilege level. We also removed functionality that allowed viewing emails, in particular account recovery emails.
All employees were made to change their passwords in case of a compromise of those credentials. This was a “just in case” safety measure—no employee credentials were found to be compromised or used during the incident.
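
The de-escalation rule for impersonation in the dev tier item above can be sketched as a comparison of privilege levels. The ordering below follows the levels mentioned in the timeline (moderator, Community Manager, developer), but the `Role` enum, its numeric values, and the `can_impersonate` function are assumptions made for illustration, not the actual implementation.

```python
from enum import IntEnum


class Role(IntEnum):
    # Hypothetical ordering: a higher value means more privilege.
    USER = 0
    MODERATOR = 1
    COMMUNITY_MANAGER = 2
    DEVELOPER = 3


def can_impersonate(actor: Role, target: Role) -> bool:
    """Allow impersonation only of users at or below the actor's own level."""
    return target <= actor


# A regular user can no longer step up to Community Manager...
blocked = can_impersonate(Role.USER, Role.COMMUNITY_MANAGER)
# ...but a Community Manager can still test as a moderator.
allowed = can_impersonate(Role.COMMUNITY_MANAGER, Role.MODERATOR)
```

With a check like this, impersonation stays useful for testing lower-privilege views without acting as a privilege escalation path.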

We also made plans for some slightly longer term projects to address larger issues that were brought to light by the incident:

We gave higher priority to an existing project to replace our VPN with a system that mandates 2FA and restricts access to secure zones within our network based upon role membership. Although this was not an attack vector during the incident we wanted to further harden ingress points into our network. This has been put in place since last year.
Moving away from secrets that are injected at build time to using a runtime secret store. This is part of a larger project to better handle configuration management in our data center environments. In our Azure environments, we already use KeyVault and AppConfig to do this.
Migrating our CI/CD pipeline to break apart the build and deploy components. This is an on-going effort to migrate away from a TeamCity process that builds and immediately deploys our code to using GitHub Actions to create artifacts and Octopus Deploy to manage deployment. This allows us to have deterministic builds and better manage deployment permissions.
Moving to use SSO and 2FA on as many third-party systems as we can. We’re actively moving to Okta Workplace Identity and using it for any third-party systems that our employees need to access.
Better role-based access control. One of the things we discovered was that diagnosing why a particular user gained privileges in a specific system was made difficult by years of accumulated group nesting and difficult-to-understand security group assignments. We are building tooling to better manage this process.
On-going training to ensure employees are aware of and can identify phishing techniques. This is especially important for our Customer Success team that routinely deals with support emails.

This incident reminded us about some fundamental security practices that everyone should follow:

Log all your inbound traffic. We keep logs on all in-bound connections. This enabled all of our investigations. You can’t investigate what you don’t log.
Use 2FA. That remaining system that still uses legacy authentication can be your biggest vulnerability.
Guard secrets better. TeamCity has a way to protect secrets, but we found we weren’t using it consistently. Educate engineers that “secrets aren’t just passwords.” Protect SSH keys and database connection strings too. When in doubt, protect it. If you must store secrets in a Git repo, protect them with git-crypt or Blackbox.
Validate customer requests. The more unusual a request from a customer, the more important it is to verify whether or not the request is legitimate.
Take security reports seriously. We're grateful that our community reported suspicious activity so quickly. Thank you!

There’s a lot to digest here, but we wanted to give a good overview of how an incident like this starts out and use it as an opportunity to inform other companies and sites on how these things unfold. The more that we are prepared for and anticipate events like this, the better protected we all will be against future events of this nature.

We are not able to comment on any other details related to the attacker due to ongoing investigations. If there is anything that you would like to discuss please do so on our Meta Stack Exchange feedback post.

Thanks for listening!

Stack Engineering Team <3
