Stack under attack: what we learned about handling DDoS attacks
As a very popular website, stackoverflow.com gets a lot of attention. Some of it is good, like the time we were nominated for a Webby Award. Other times, that attention is distinctly less good, as when we get targeted for distributed denial of service (DDoS) attacks.
For a few months, we’ve been the target of ongoing DDoS attacks. These have come in two forms: our API has been hit by application-layer attacks, while the main site has been subject to volume-based attacks. Both take advantage of the surfaces that we expose to the internet.
We’re still getting hit regularly, but thanks to our SRE and DBRE teams, along with some code changes made by our Public Platform Team, we’ve been able to minimize the impact that they have on our users’ experience. Some of these attacks are now only visible through our logs and dashboards.
We wanted to share some of the general tactics that we’ve used to dampen the effect of DDoS attacks so that others facing the same assaults can minimize their impact.
Botnet attacks on expensive SQL queries
In two application-layer attacks, an attacker leveraged a very large botnet to trigger a very expensive query. Some back-end servers hit 100% CPU utilization during these attacks. What made this extra challenging is that the attack was distributed over a huge pool of IP addresses; some IPs sent only two requests, so rate limiting by IP address would have been ineffective.
We had to create a filter that separated the malicious requests from the legitimate ones so we could block those specific requests. Initially, the filter was a bit overzealous but, over time, we slowly refined the filter to identify only the malicious requests.
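To illustrate the shape of that kind of filter, here’s a minimal sketch. The signals and thresholds below are hypothetical examples, not the ones we actually used; a real filter is tuned iteratively against observed attack traffic, which is how ours went from overzealous to precise.

```python
def looks_malicious(req: dict) -> bool:
    """Score a request against several weak signals of bot traffic.

    The signals here (no cookies, a generic user agent, an expensive
    sort parameter) are illustrative stand-ins for whatever the real
    attack signature turns out to be.
    """
    no_cookies = not req.get("cookies")
    generic_agent = req.get("user_agent", "") in ("", "Mozilla/5.0")
    expensive_sort = req.get("query", {}).get("sort") == "all_time"
    # Require multiple weak signals together to limit false positives
    # on legitimate traffic.
    return sum([no_cookies, generic_agent, expensive_sort]) >= 2
```

Combining weak signals rather than matching a single attribute is what lets a filter like this be refined over time: each signal’s weight can be adjusted as legitimate requests get caught in the net.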
After we mitigated the attack, the attacker regrouped and targeted user pages by requesting extremely high page counts. To avoid detection or bans, they incremented the page number their bots requested. This subverted our previous controls by attacking a different area of the website while still exploiting the same vulnerability. In response, we put a filter in place to identify and block the malicious traffic.
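A simple defense against that page-count trick is to reject page numbers no human would ever browse to. A sketch, with an assumed cap (the constant below is illustrative, not our production value):

```python
MAX_REASONABLE_PAGE = 100  # hypothetical cap on how deep a human paginates

def page_request_allowed(page: int, total_pages: int) -> bool:
    """Reject out-of-range or implausibly deep page requests."""
    return 1 <= page <= min(total_pages, MAX_REASONABLE_PAGE)
```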
These API routes, like any API that pulls data from a database, are necessary to the day-to-day functioning of Stack Overflow. To protect routes like these from DDoS, here’s what you can do:
- Insist that every API call be authenticated. This will help identify malicious users. If having only authenticated API calls is not possible, set stricter limits for anonymous / unauthenticated traffic.
- Minimize the amount of data a single API call can return. When we build our front page question list, we don’t retrieve all of the data for every question. We paginate, lazy load only the data in the viewport, and request only the data that will be visible (that is, we don’t request the text for every answer until loading the question page itself).
- Rate-limit all API calls. This goes hand-in-hand with minimizing data per call; to get large amounts of data, the attacker will need to call the API multiple times. Nobody needs to call your API a hundred times per second.
- Filter malicious traffic before it hits your application. HAProxy load balancers sit between all requests and our servers to balance the amount of traffic across our servers. But that doesn’t mean all traffic has to go to one of those servers. Implement thorough and easily queryable logs so malicious requests can be easily identified and blocked.
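To make the rate-limiting advice above concrete, here’s a minimal per-key token-bucket sketch. The rate and burst values are illustrative, and in practice you’d key on an API key or authenticated user rather than just an IP:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-key token-bucket rate limiter.

    Each key earns `rate` tokens per second up to `burst`; a request
    is allowed only if a whole token is available to spend.
    """
    def __init__(self, rate: float, burst: int):
        self.rate = rate    # tokens replenished per second
        self.burst = burst  # maximum tokens a key can accumulate
        self.state = defaultdict(lambda: (float(burst), time.monotonic()))

    def allow(self, key: str) -> bool:
        tokens, last = self.state[key]
        now = time.monotonic()
        # Replenish tokens for the time elapsed since the last request.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[key] = (tokens - 1, now)
            return True
        self.state[key] = (tokens, now)
        return False
```

In production this kind of state usually lives in a shared store (or in the load balancer itself) rather than in application memory, so that limits hold across servers.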
Whack-a-mole on malicious IPs
We were also subject to some volume-based attacks. A botnet sent a large number of `POST` requests to `stackoverflow.com/questions/`. This one was easy: since we don’t use a trailing slash on that URL, we blocked all traffic to that specific path.
The attacker figured it out, dropped the trailing slash, and came back at us. Instead of just reactively blocking every route the attacker hit, we collected the botnet IPs and blocked them through our CDN, Fastly. This attacker took three swings at us: the first two caused us some difficulties, but once we collected the IPs from the second attack, we could block the third attack instantly. The malicious traffic never even made it to our servers.
A new volume-based attack—possibly from the same attacker—took a different approach. Instead of throwing the entire botnet at us, they activated just enough bots to disrupt the site. We’d put those IPs on our CDN’s blocklist, and the attacker would send the next wave at us. It was like a game of Whack-a-mole, except not fun and we didn’t win any prizes at the end.
Instead of having our incident teams scramble and ban IPs as they came in, we automated it like good little SREs. We created a script that would check our traffic logs for IPs behaving a specific way and automatically add them to the ban list. Our response time improved on every attack. The attacker kept going until they got bored or ran out of IPs to throw at us.
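A sketch of that kind of auto-ban script follows. The log format, path, and threshold here are assumptions for illustration, not our actual tooling; the real script matched whatever behavior pattern the current wave of bots exhibited:

```python
from collections import Counter

BAN_THRESHOLD = 100  # hypothetical: requests to the targeted path per window

def ips_to_ban(log_lines, targeted_path="/questions/"):
    """Count per-IP requests against the attacked path; return offenders.

    Assumes each log line looks like "ip method path ...".
    """
    hits = Counter()
    for line in log_lines:
        ip, method, path = line.split()[:3]
        if path.startswith(targeted_path):
            hits[ip] += 1
    return {ip for ip, count in hits.items() if count >= BAN_THRESHOLD}
```

The output set would then be pushed to the CDN’s blocklist on a schedule, closing the loop without a human in it.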
Volume-based attacks can be more insidious. They look like regular traffic, just more of it. Even if a botnet is focusing on a single URL, you can’t always just block the URL. Legitimate traffic hits that page, too. Here are a few takeaways from our efforts:
- Block weird URLs. If you start seeing trailing slashes where you don’t use them, or `POST` requests to invalid paths, flag and block those requests. If you have other catch-all pages and start seeing strange URLs coming in, block them.
- Block malicious IPs even if legitimate traffic can originate from them. This does cause some collateral damage but it’s better to block some legitimate traffic than be down for all traffic.
- Automate your blocklist. The problem with blocking a botnet manually is the toil involved in identifying a bot and sending the IPs to your blocklist. If you can recognize the patterns of a bot and automate blocking based on those patterns, your response time will go down and your uptime will go up.
- Tar pitting is a great way to slow down botnets and mitigate volume-based attacks. The idea is to reduce the number of requests the botnet can send by increasing the time between its requests.
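The tar-pitting idea above can be sketched in a few lines of async code. The IPs and delay are illustrative (the address uses a documentation range), and a real tar pit would sit at the load balancer or CDN rather than in the application:

```python
import asyncio

SUSPECT_IPS = {"203.0.113.7"}  # example: IPs flagged by your detection logic
TARPIT_DELAY = 10.0            # hypothetical delay, in seconds

async def handle_request(ip: str, respond):
    """Serve normal clients immediately; make suspected bots wait.

    Holding each suspect connection open throttles the botnet's overall
    request rate without revealing that it has been identified.
    """
    if ip in SUSPECT_IPS:
        await asyncio.sleep(TARPIT_DELAY)
    return await respond()
```

Because the delay consumes almost no server resources while it runs, the cost of waiting falls entirely on the attacker.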
Other things we learned
Dealing with a string of back-to-back DDoS attacks forced us to learn and improve our overall infrastructure and resiliency. We’re not about to say thank you to the botnets, but nothing teaches better than a crisis. Here are a few of the big overall lessons we learned.
Invest in monitoring and alerting. We identified a few gaps in our monitoring protocols that could have alerted us to these attacks sooner. The application layer attacks in particular had telltale signs that we could add to our monitoring portfolio. In general, improving our tooling overall has helped us respond and maintain site uptime.
Automate all the things. Because we were dealing with several DDoS attacks in a row, we could spot the patterns in our workflow better. When an SRE sees a pattern, they automate it, which is exactly what we did. By letting our systems handle the repetitive work, we reduced our response time.
Write it all down. If you can’t automate it, record it for future firefighters. It can be hard to step back during a crisis and take notes. But we managed to take some time and create runbooks for future attacks. The next time a botnet floods us with traffic, we’ve got a head start on handling it.
Talk to your users. Tor exit nodes were the source of a significant amount of traffic during one of the volume attacks, so we blocked them. That didn’t sit well with legitimate users who happened to use the same IPs. Users started a bit of wild speculation, blaming Chinese Communists for preventing anonymous access to the site (to be fair, that’s half right: I’m Chinese). We had no intention of blocking Tor access permanently, but it was preventing other users from reaching the site, so we got on Meta to explain the situation before the pitchforks came out en masse. We’re now adding communication tasks and tooling into our incident response runbooks so we can be more proactive about informing users.
DDoS attacks can often come with success on the internet. We’ve gotten a lot of attention over the last 12 years, and some of it is bound to be negative. If you find yourselves on the receiving end of a botnet’s attention, we hope the lessons that we’ve learned can help you out as well.

Tags: DDoS, devops, security
Thank you for a very interesting and enlightening article. This is a very valuable lesson for all website administrators. May I know the number of distinct IPs involved in the volumetric attack that you mentioned in your article? How did you recognize a bad IP? Thanks again for sharing your experience and lessons.
We love HAProxy and use it in our OpenShift cluster, but we have a VMware NSX-ALB (Avi Networks) in front, and it helps a lot to block malicious bots and traffic. It saves us some automation, and together they do a great job. We’re thinking of testing Envoy as well.
Did you search https://security.stackexchange.com/ to find solutions to these kinds of attacks? I’ve heard that site is pretty reliable.
Wait, is that the one with a trailing slash?
Why are they DDOS attacking the site? Are you being extorted to pay to stop the attack? Or is it simply for the lulz of having taken down a major site?
Is there any way you can share how you were able to identify that you were experiencing a volume attack rather than just heavy real traffic? Was it because you were already alerted that botnets were targeting you from the application layer attack?
In “Volume based attacks” Bps would indicate bytes rather than bits. Suggest changing to “bps”.
Today you need at least one WAF, if not two layers of WAF, online, along with good monitoring for the web servers and databases. The WAF can be open source, like ModSecurity, which has effective anti-DDoS features. Some sophisticated attackers know how to get past it, but adding a commercial WAF on the reverse proxies can improve the response to these attacks. There’s no complete defense against DDoS, but it’s possible to stop several kinds of attacks with anti-DDoS filters and get alerts about the problem in real time. It’s also very important to configure your applications to be efficient with resources; sometimes the problem is that the applications are inefficient, and a DoS kills them easily.