If trees could scream, would we be so cavalier about cutting them down? We might, if they screamed all the time, for no good reason.
A big part of scaling up an engineering team is getting serious about monitoring and alerts. A good monitoring system collects data from all of your various systems -- for example, how fast pages are loading, or server CPU usage, or emails being sent -- and alerts you when something isn't working correctly. When everything is running smoothly, you can sleep easy knowing that you'll hear about it the moment something breaks.
That's the theory, anyway. About a year ago we realized that our monitoring system needed some serious upgrading. Instead of proactively alerting us before something broke, it mostly alerted us that something was already down. When we did get an alert, it wasn't obvious what exactly was breaking or who needed to fix it. If you're a developer or sysadmin, an inbox overflowing with alert emails probably looks a bit familiar.
So we set out to fix it. We weren't happy with any of the tools available, so we decided to build our own. Since we are big fans of giving back to the community, we decided to make it open source as well.
The new system is called Bosun (because naming is hard) and was developed by our own Kyle Brandt and Matt Jibson. It's still very much in development, but we've been using it internally for a few months and are really happy with the results. We can measure much more intelligently and build complex alerts based on those metrics. Some of the things it makes easy are:
- Push data into Bosun from anywhere via a simple JSON API, or use scollector to collect common metrics from lots of different systems
- Test alerts against older data and see when they would have gone off
- Reduce email clutter with scope-aware alerts, so when e.g. Redis goes down we get one email, not twelve (one for each instance)
- Forecast and alert against future data, like when we’re about to run out of disk space
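To make the forecasting idea a bit more concrete, here is a minimal Python sketch of one way to predict when a disk will fill up: fit a straight line to recent free-space samples and see where it crosses zero. This is not Bosun's implementation or its alert syntax; the function names, the 24-hour threshold, and the sample numbers are all made up for illustration.

```python
from typing import Optional, Sequence


def seconds_until_full(
    timestamps: Sequence[float],
    free_bytes: Sequence[float],
) -> Optional[float]:
    """Fit a least-squares line to free space over time and return how many
    seconds after the last sample the line reaches zero.

    Returns None when free space is flat or growing (no fill-up predicted).
    Hypothetical helper for illustration only.
    """
    n = len(timestamps)
    if n < 2 or n != len(free_bytes):
        raise ValueError("need at least two matching samples")

    mean_t = sum(timestamps) / n
    mean_f = sum(free_bytes) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")

    # Ordinary least-squares slope, in bytes per second.
    slope = sum(
        (t - mean_t) * (f - mean_f) for t, f in zip(timestamps, free_bytes)
    ) / var_t
    if slope >= 0:
        return None

    intercept = mean_f - slope * mean_t
    zero_at = -intercept / slope  # time at which the fitted line hits zero
    return max(0.0, zero_at - timestamps[-1])


def should_alert(remaining: Optional[float], threshold_s: float = 24 * 3600) -> bool:
    """Alert if the disk is predicted to fill within the threshold (assumed 24h)."""
    return remaining is not None and remaining < threshold_s


# Made-up hourly samples: 10 GB free, shrinking by 1 GB per hour.
ts = [0.0, 3600.0, 7200.0, 10800.0]
free = [10e9, 9e9, 8e9, 7e9]
remaining = seconds_until_full(ts, free)
print(remaining is not None and round(remaining / 3600, 1))  # hours left
print(should_alert(remaining))
```

With these samples the fitted line reaches zero about seven hours after the last data point, so the alert would fire well before the disk is actually full. A real system would also want to handle noisy data and non-linear growth, which a straight-line fit ignores.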
If you're interested, read the full announcement (with a lot more detail) on the Server Fault blog or go straight to bosun.org to check it out. There's a link on the Getting Started page to a Docker image that populates itself with some data for you to experiment with. And if you happen to be at LISA this week, you can check out Kyle Brandt’s talk on Thursday.
As with any open source project, we're looking for a few brave souls to join us. You can grab the source on GitHub and start submitting issues and pull requests today.