\u003C/figure>\n\u003C!-- /wp:image -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>M3’s out-of-the-box data handling is also a big plus: the ability to define aggregations on top of raw data makes downsampling just a matter of a terminal command. For metrics use cases specifically, this is a welcome feature, enabling us to store raw data (30-second samples) for the most recent period and aggregated data (2-5 minutes) for the long term. The data can be compressed at rest, but we also implemented in-transit compression of traffic to optimize the usage of network resources; M3 perhaps naively assumes that traffic to `m3db` and `m3aggregator` is in the same cluster and therefore free. To ensure high availability, we ran these distributed across availability zones, but that meant a 20-50x increase in bandwidth, which we found quite unsatisfactory, as we had to pay for it in some of the cloud providers we support. The raw data amounts were large enough to cause lag: the service could not keep up with the data ingestion as we ran the cluster connectivity inside IPsec tunnels. We hope compression in transit will be available upstream in the near future.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>The last point in our evaluation was trust: we knew we wanted a proven open source option for our platform, and M3 is actively maintained by \u003Ca href=\"https://chronosphere.io/\">Chronosphere\u003C/a> and other third-party developers, including Aiven. We knew that companies like Uber, Walmart, and LinkedIn were using M3 successfully. These factors gave us the confidence that M3 was the right option for us.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading -->\n\u003Ch2 id=\"h-pain-points-of-migration\">Pain points of migration\u003C/h2>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>We didn’t have to change the overall architecture when moving from InfluxDB to M3. 
We still monitor the metrics of every internal and customer node. However, we had to change a few things elsewhere in our pipeline to successfully transition from InfluxDB to M3. Just because we could keep the existing architecture in place doesn’t mean there weren’t struggles.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>The biggest task was migrating hundreds of our Grafana dashboards (each with multiple panels) from the InfluxDB query language to PromQL (the Prometheus query language). This was a painful experience (we developed a tool to automate it, but that took over a year of on-and-off work), but it was worth it in the end. There were other, smaller tasks as well. In particular, we had a lot of hard-coded assumptions in our admin tools about using InfluxDB either directly or as a data source, so quite a few updates were needed there too.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Another pain point was the lack of great tooling for backing up and restoring M3 databases. We developed our own tool for this, called \u003Ca href=\"https://github.com/aiven/astacus\">Astacus\u003C/a>, and open-sourced it. Astacus supports cluster-wide backup and restore of all M3DB data on disk, and a subset of \u003Ca href=\"https://etcd.io/\">etcd\u003C/a> state for the cluster containing the M3 metadata about nodes and shards.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading -->\n\u003Ch2 id=\"h-lessons-learned\">Lessons learned\u003C/h2>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>There are a few lessons we’d like to share from our journey, starting with advice about metrics. Having a clear focus on the metrics you really care about is key. A year ago we tracked more time series metrics than we do now, despite the business growing. But as part of the migration to M3, we took the time to review and refine which metrics we track. 
We cut out half the metrics per node and can now handle twice the nodes at the same cost. \u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Our architecture helped with that as well, since every node pushes metrics to Kafka, which gives us a common location to filter metrics. This decoupling of metrics sources and consumers via Kafka helped us a lot, especially in managing the transition, since we could easily run M3 in parallel with InfluxDB—and we did, for over a year before finally switching. We also learned that compressing network traffic is critical if you want to save on bandwidth (and cloud bills) in large clustered services like ours. M3 version 1.3 includes one of our compression patches.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Aggregation in M3 is surprisingly resource-heavy, especially regarding memory. We initially used daily and weekly aggregations in addition to our unaggregated data, and it tripled the memory requirements of the nodes. We now run without the aggregator and instead use unaggregated metrics with a long retention period. This makes queries over longer time periods slower, but is less of a problem than the memory impact that we saw with aggregation in place.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Happily, some things have improved since we trod this path. M3 configuration is pretty complex, and because documentation was sparse when we implemented M3, we learned a lot through experimentation. 
Fortunately, documentation has since improved.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading -->\n\u003Ch2 id=\"h-some-interesting-aiven-m3-production-deployment-facts\">Some interesting Aiven M3 production deployment facts\u003C/h2>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>A few facts we want to share with you about our M3 setup:\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Almost everything we use in our own stack is open source, so that was one of the main criteria in choosing M3 back in early 2019. Once we found something that could handle the size of our own metrics, and once we checked that it really did work well for us in production, we started packaging it so our customers could use it too.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>We communicate with M3 using the compressed InfluxDB line protocol, since (at least at the time of evaluation) it was more performant than the native Prometheus write. 
We contributed to the M3 implementation of the InfluxDB write protocol, something we needed for ourselves and were proud to share.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\u003Cul>\u003Cli>Currently, we have ~20 million unique time series (with replication factor 3) at any time\u003C/li>\u003Cli>We use two clusters for M3:\u003Cul>\u003Cli>6-node `m3coordinator` (64GB RAM each)\u003C/li>\u003Cli>9-node `m3db` (150GB RAM each)\u003C/li>\u003C/ul>\u003C/li>\u003Cli>Uncompressed raw backup size is ~8TB for two weeks’ worth of data\u003C/li>\u003C/ul>\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>At the risk of tempting fate, the production monitoring cluster hasn’t had any downtime in nearly two years that wasn’t caused by humans misconfiguring things (oops!).\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading -->\n\u003Ch2 id=\"h-open-source-at-aiven\">Open source at Aiven\u003C/h2>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>At Aiven, we live and breathe open source solutions. Our Aiveners publish and maintain half a dozen projects and connectors through our Open Source Program Office. We aim to make every service we use internally available to our customers as well. We avoid non-OSS solutions where possible, as relying on them would preclude making these solutions available to our customers too. We also see ourselves as “insiders” in the open source projects we use: we contribute to and advocate for the projects as well as deploy and operate the projects on our own platforms.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Implementing M3 for our own monitoring led us to contribute to the implementation of the InfluxDB protocol in the project. We built Astacus, the backup and restore tool that we needed for M3, and made it open source for others to use too. 
We implemented compression on data in transit, and are working on adding that to the upstream project, with the first of the patches available in the 1.3 release.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>If you’re interested in integrating M3 into your project, read more about Aiven for M3 in \u003Ca href=\"https://developer.aiven.io/docs/products/m3db/index.html?utm_source=stackoverflow&utm_medium=blog&utm_campaign=blog_m3_so_devportal&utm_content=post\">our developer portal\u003C/a>.\u003C/p>\n\u003C!-- /wp:paragraph -->","html","2021-12-01T15:00:00.000Z",{"current":553},"migrating-metrics-from-influxdb-to-m3",[555,563,567,571,576],{"_createdAt":556,"_id":557,"_rev":558,"_type":559,"_updatedAt":556,"slug":560,"title":562},"2023-05-23T16:43:21Z","wp-tagcat-code-for-a-living","9HpbCsT2tq0xwozQfkc4ih","blogTag",{"current":561},"code-for-a-living","Code for a Living",{"_createdAt":556,"_id":564,"_rev":558,"_type":559,"_updatedAt":556,"slug":565,"title":566},"wp-tagcat-database",{"current":566},"database",{"_createdAt":556,"_id":568,"_rev":558,"_type":559,"_updatedAt":556,"slug":569,"title":570},"wp-tagcat-migration",{"current":570},"migration",{"_createdAt":556,"_id":572,"_rev":558,"_type":559,"_updatedAt":556,"slug":573,"title":575},"wp-tagcat-partner-content",{"current":574},"partner-content","Partner Content",{"_createdAt":556,"_id":572,"_rev":558,"_type":559,"_updatedAt":556,"slug":577,"title":575},{"current":574},"Migrating metrics from InfluxDB to M3",[580,586,592,597],{"_id":581,"publishedAt":582,"slug":583,"sponsored":12,"title":585},"9fd8968d-abaa-4253-b14b-3129c6e85408","2025-09-10T17:00:00.000Z",{"_type":10,"current":584},"ai-vs-gen-z","AI vs Gen Z: How AI has changed the career pathway for junior 
developers",{"_id":587,"publishedAt":588,"slug":589,"sponsored":12,"title":591},"1d082483-6dc6-424b-8b09-9c84b54779da","2025-09-02T17:00:00.000Z",{"_type":10,"current":590},"back-to-school-developers-at-stack-overflow-have-some-advice-for-you","Back to school? Developers at Stack Overflow have some advice for you",{"_id":593,"publishedAt":588,"slug":594,"sponsored":12,"title":596},"5cd91820-9515-4be5-87ae-e919fd443c18",{"_type":10,"current":595},"getting-started-on-stack-overflow-a-step-by-step-guide-for-students","Getting started on Stack Overflow: a step-by-step guide for students",{"_id":598,"publishedAt":588,"slug":599,"sponsored":12,"title":601},"614538a9-c352-4024-adf1-fa44a9f911b6",{"_type":10,"current":600},"stack-overflow-is-helping-you-learn-to-code-with-new-resources","Stack Overflow is helping you learn to code with new resources",{"count":603,"lastTimestamp":604},4,"2023-05-25T09:47:43Z",["Reactive",606],{"$sarticleModal":607},false,["Set"],["ShallowReactive",610],{"sanity-nIYX3TphlJBehzFPSVs6C_l3yUd8KGDd0GGcWqoeRFs":-1,"sanity-comment-wp-post-19174-1757618497232":-1},"/2021/12/01/migrating-metrics-from-influxdb-to-m3"]