May 16, 2022

Stack under attack: what we learned about handling DDoS attacks

When the bots came for us, we strengthened our defenses. Here's what we learned about parrying a few DDoS attacks.
Josh Zhang
Staff Site Reliability Engineer

As a very popular website, stackoverflow.com gets a lot of attention. Some of it is good, like the time we were nominated for a Webby Award. Other times, that attention is distinctly less good, as when we get targeted for distributed denial of service (DDoS) attacks. 

For a few months, we’ve been the target of ongoing DDoS attacks. They have come in two forms: application layer attacks against our API and volume-based attacks against the main site. Both take advantage of the surfaces that we expose to the internet. 

Volume-Based Attacks: Includes UDP floods, ICMP floods, and other spoofed-packet floods. The attack’s goal is to saturate the bandwidth of the attacked site, and magnitude is measured in bits per second (bps).

Protocol Attacks: Includes SYN floods, fragmented packet attacks, Ping of Death, Smurf DDoS and more. This type of attack consumes actual server resources, or those of intermediate communication equipment, such as firewalls and load balancers, and is measured in packets per second (Pps).

Application Layer Attacks: Includes low-and-slow attacks, GET/POST floods, attacks that target Apache, Windows, or OpenBSD vulnerabilities, and more. Composed of seemingly legitimate and innocent requests, these attacks aim to crash the web server, and their magnitude is measured in requests per second (Rps).
Caption: DDoS attacks fall into three categories. 

We’re still getting hit regularly, but thanks to our SRE and DBRE teams, along with some code changes made by our Public Platform Team, we’ve been able to minimize the impact that they have on our users’ experience. Some of these attacks are now only visible through our logs and dashboards. 

We wanted to share some of the general tactics that we’ve used to dampen the effect of DDoS attacks so that others under the same assaults can minimize them. 

Botnet attacks on expensive SQL queries

Between two application-layer attacks, an attacker leveraged a very large botnet to trigger a very expensive query. Some backend servers hit 100% CPU utilization during this attack. What made it extra challenging was that the attack was distributed over a huge pool of IP addresses; some IPs sent only two requests, so rate limiting by IP address would have been ineffective.

We had to create a filter that separated the malicious requests from the legitimate ones so we could block those specific requests. Initially, the filter was a bit overzealous, but over time we refined it to identify only the malicious requests.
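A filter like this amounts to a predicate over request attributes. The actual fingerprint we used isn’t something we can share, so the route, header, and parameter checks below are purely illustrative; the point is that botnet requests often share traits that legitimate traffic lacks, and the predicate gets tightened as you learn them:

```python
# Illustrative request fingerprint -- the route name, header check, and
# parameter check are hypothetical, not the real filter's criteria.
EXPENSIVE_ROUTE = "/search/excerpts"  # assumed name for the costly API route

def is_malicious(path: str, headers: dict, params: dict) -> bool:
    if path != EXPENSIVE_ROUTE:
        return False  # only guard the route under attack
    # The botnet's clients omitted a header real browsers always send...
    if "Accept-Language" not in headers:
        return True
    # ...and always requested the maximum page size.
    return params.get("pagesize") == "100"
```

In practice a predicate like this runs at the load balancer or in early request middleware, so the expensive query is never executed for matching requests.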

After we mitigated the attack, the attacker regrouped and targeted user pages by requesting extremely high page counts, incrementing the page number their bots requested to avoid detection or bans. This subverted our previous controls by attacking a different area of the site while still exploiting the same vulnerability. In response, we put a filter in place to identify and block the malicious traffic. 
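The page-count variant suggests a cheap guard: validate pagination parameters before they ever reach the database. A minimal sketch, where the cap is an assumed value you would set from your real usage patterns:

```python
MAX_PAGE = 500  # assumed cap; set it from what real users actually request

def valid_page(raw_page: str) -> bool:
    """Reject malformed or absurdly high page numbers before querying the DB."""
    try:
        page = int(raw_page)
    except ValueError:
        return False
    return 1 <= page <= MAX_PAGE
```

Requests that fail the check can be rejected with a 400 before any expensive query runs, which removes the incentive to iterate over page numbers in the first place.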

These API routes, like any API that pulls data from a database, are necessary to the day-to-day functioning of Stack Overflow. To protect routes like these from DDoS, here’s what you can do:

  • Insist that every API call be authenticated. This will help identify malicious users. If having only authenticated API calls is not possible, set stricter limits for anonymous / unauthenticated traffic.
  • Minimize the amount of data a single API call can return. When we build our front page question list, we don’t retrieve all of the data for every question. We paginate, lazy load only the data in the viewport, and request only the data that will be visible (that is, we don’t request the text for every answer until loading the question page itself). 
  • Rate-limit all API calls. This goes hand-in-hand with minimizing data per call; to get large amounts of data, the attacker will need to call the API multiple times. Nobody needs to call your API a hundred times per second. 
  • Filter malicious traffic before it hits your application. HAProxy load balancers sit between all requests and our servers to balance the amount of traffic across our servers. But that doesn’t mean all traffic has to go to one of those servers. Implement thorough and easily queryable logs so malicious requests can be easily identified and blocked.
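To illustrate the rate-limiting point, here is a minimal per-client token bucket. It is a single-process, in-memory sketch; a real deployment would hold this state at the load balancer or in a shared store so all servers see the same counts:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: each request spends one token, and
    tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: float(capacity))
        self.last = defaultdict(time.monotonic)

    def allow(self, client: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client]
        self.last[client] = now
        # Refill tokens earned since the last request, capped at capacity.
        self.tokens[client] = min(self.capacity,
                                  self.tokens[client] + elapsed * self.rate)
        if self.tokens[client] >= 1:
            self.tokens[client] -= 1
            return True
        return False
```

A burst up to `capacity` is allowed, after which the client is throttled to `rate` requests per second, which is exactly the property that forces an attacker to spread expensive calls over time.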

Whack-a-mole on malicious IPs

We were also subject to some volume-based attacks. A botnet sent a large number of `POST` requests to `stackoverflow.com/questions/`. This one was easy: since we don’t use a trailing slash on that URL, we blocked all traffic to that specific path. 

The attacker figured it out, dropped the trailing slash, and came back at us. Instead of just reactively blocking every route the attacker hit, we collected the botnet IPs and blocked them through our CDN, Fastly. This attacker took three swings at us: the first two caused us some difficulties, but once we collected the IPs from the second attack, we could block the third attack instantly. The malicious traffic never even made it to our servers. 

A new volume-based attack—possibly from the same attacker—took a different approach. Instead of throwing the entire botnet at us, they activated just enough bots to disrupt the site. We’d put those IPs on our CDN’s blocklist, and the attacker would send the next wave at us. It was like a game of Whack-a-mole, except not fun and we didn’t win any prizes at the end. 

Instead of having our incident teams scramble and ban IPs as they came in, we automated it like good little SREs. We created a script that would check our traffic logs for IPs behaving a specific way and automatically add them to the ban list. Our response time improved on every attack. The attacker kept going until they got bored or ran out of IPs to throw at us. 
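A sketch of that kind of automation: scan recent access logs for IPs matching the attack pattern and push offenders to the CDN blocklist. The log format, threshold, and the commented-out `push_to_cdn_blocklist()` call are placeholders, not our actual tooling:

```python
import re
from collections import Counter

# Hypothetical attack signature and threshold -- tune both to the
# pattern you actually observe in your logs.
ATTACK_PATH = re.compile(r"POST /questions/\s")
THRESHOLD = 50  # requests per scan window before an IP is banned

def find_attack_ips(log_lines):
    """Count matching requests per IP; return IPs over the threshold.
    Assumes each log line starts with the client IP, space-separated."""
    hits = Counter()
    for line in log_lines:
        ip, _, rest = line.partition(" ")
        if ATTACK_PATH.search(rest):
            hits[ip] += 1
    return {ip for ip, count in hits.items() if count >= THRESHOLD}

# Run on a schedule, e.g. every minute:
# for ip in find_attack_ips(tail_access_log()):
#     push_to_cdn_blocklist(ip)  # e.g. via your CDN's ACL API
```

The scheduling and blocklist push depend entirely on your stack; the valuable part is closing the loop so no human has to copy IPs by hand during an attack.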

Volume-based attacks can be more insidious. They look like regular traffic, just more of it. Even if a botnet is focusing on a single URL, you can’t always just block the URL. Legitimate traffic hits that page, too. Here are a few takeaways from our efforts:

  • Block weird URLs. If you start seeing trailing slashes where you don’t use them, `POST` requests to invalid paths, flag and block those requests. If you have other catch-all pages and start seeing strange URLs coming in, block them. 
  • Block malicious IPs even if legitimate traffic can originate from them. This does cause some collateral damage but it’s better to block some legitimate traffic than be down for all traffic.
  • Automate your blocklist. The problem with blocking a botnet manually is the toil involved in identifying a bot and sending its IPs to your blocklist. If you can recognize the pattern of a bot and automate blocking based on it, your response time will go down and your uptime will go up.
  • Tar pitting is a great way to slow down botnets and mitigate volume-based attacks. The idea is to reduce the number of requests the botnet can send by increasing the time between requests.
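As a minimal tar-pit sketch (an assumed design, not our actual implementation): before answering a suspected bot, stall for a delay that doubles with each repeat request from the same IP, up to a cap. The bot’s connection sits idle instead of hammering you:

```python
import time

# Per-IP escalating delay -- base and cap values are assumptions.
_delays: dict[str, float] = {}

def tarpit_delay(ip: str, base: float = 0.5, cap: float = 30.0) -> float:
    """Return the current delay for this IP and double it for next time."""
    delay = _delays.get(ip, base)
    _delays[ip] = min(cap, delay * 2)
    return delay

def handle_suspect_request(ip: str) -> None:
    time.sleep(tarpit_delay(ip))  # stall the client before responding
    # ...then serve a minimal response, or drop the connection
```

With synchronous workers this would tie up your own threads too, so a real tar pit belongs in an async server or at the load balancer, where holding a connection open is cheap.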

Other things we learned

By having to deal with a lot of DDoS attacks back-to-back, we were able to learn and improve our overall infrastructure and resiliency. We’re not about to say thank you to the botnets, but nothing teaches better than a crisis. Here are a few of the big overall lessons we learned. 

Invest in monitoring and alerting. We identified a few gaps in our monitoring protocols that could have alerted us to these attacks sooner. The application layer attacks in particular had telltale signs that we could add to our monitoring portfolio. In general, improving our tooling overall has helped us respond and maintain site uptime. 

Automate all the things. Because we were dealing with several DDoS attacks in a row, we could spot the patterns in our workflow better. When an SRE sees a pattern, they automate it, which is exactly what we did. By letting our systems handle the repetitive work, we reduced our response time. 

Write it all down. If you can’t automate it, record it for future firefighters. It can be hard to step back during a crisis and take notes, but we managed to take some time and create runbooks for future attacks. The next time a botnet floods us with traffic, we’ve got a head start on handling it. 

Talk to your users. Tor exit nodes were the source of a significant amount of traffic during one of the volume attacks, so we blocked them. That didn’t sit well with legitimate users who happened to use the same IPs. Users started a bit of wild speculation, blaming Chinese Communists for preventing anonymous access to the site (to be fair, that’s half right: I’m Chinese). We had no intention of blocking Tor access permanently, but it was preventing other users from reaching the site, so we got on Meta to explain the situation before the pitchforks came out en masse. We’re now adding communication tasks and tooling into our incident response runbooks so we can be more proactive about informing users. 

DDoS attacks often come with success on the internet. We’ve gotten a lot of attention over the last 12 years, and some of it is bound to be negative. If you find yourselves on the receiving end of a botnet’s attention, we hope the lessons that we’ve learned can help you out as well. 
