A while back I met Lee Yanco to discuss Google’s managed Prometheus service (GMP). Since then I’ve discussed GMP with quite a few companies, I’ve thought about it a lot, and I’ve come to really appreciate what Google’s put together.
Scaling Prometheus is not an O(n) operation: operational complexity grows faster than linearly over time, and I think Google has an offering that’s worth the price of admission.
The Problem
Prometheus is often not a problem if you only have one environment. One Kubernetes cluster, one instance of Prometheus storing one month of metrics, and one instance of Grafana to visualize your metrics is straightforward. This is hardly ever the case long-term. As your application grows, so does your compute footprint: new tenants of your application, another region, or an expansion to multiple clouds are all common reasons you’ll add more Prometheus instances.
And often, it’s not just your compute footprint. Workloads themselves have a tendency to grow over time: new features drive new custom metrics to observe, tracking metric seasonality stretches your retention periods, and the number of monitored endpoints grows with your services and how you scale them.
Once you’ve determined the need to expand, what can you do? You could scale the single instance vertically, but you’ll likely waste resources in the pursuit of satisfying a single constrained metric. If you choose to scale horizontally instead, how do you query across your fleet of Prometheus instances? You could set up remote writes or federated queries to consolidate your query access patterns at the expense of more management effort, but that ultimately adds even more Prometheus instances to your monitoring stack.
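To make the federation option concrete: it means running a “global” Prometheus that scrapes the /federate endpoint of each downstream instance. A minimal sketch of that scrape config, where the target hostnames and the match[] selector are placeholders you’d tune for your own fleet:

```yaml
# Scrape config on the "global" Prometheus. Hostnames are placeholders.
scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true        # keep the downstream instances' labels intact
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'       # which series to pull from each downstream instance
    static_configs:
      - targets:
          - "prometheus-region-a:9090"
          - "prometheus-region-b:9090"
```

Note that the global instance is itself another Prometheus to size, patch, and monitor.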
So if Google’s offering a managed solution that appears to reduce this operational toil, is it worth it?
The Economics of It
Let’s use some napkin math to figure out our requirements.
From the Prometheus documentation:
Prometheus stores an average of only 1-2 bytes per sample. Thus, to plan the capacity of a Prometheus server, you can use the rough formula:
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
Let’s assume we want to keep our data for two years (63,072,000 seconds), we ingest on average 500 samples per second, and each sample is approximately 1.5 bytes. We will need 44.06 GiB (or 47.30 GB) of storage. Google Cloud charges a minimum of $0.04/GB/month for persistent disks; over two years, our persistent disk cost comes to about $45.41. This doesn’t take into account the additional space needed to run the underlying OS and dependencies, but bear with me.
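Under those stated assumptions (two years of retention, 500 samples/s, 1.5 bytes/sample, $0.04/GB/month), the disk math can be sanity-checked in a few lines:

```python
# Napkin math for self-hosted Prometheus disk capacity and cost.
# All inputs are the assumptions from the text above.
retention_seconds = 2 * 365 * 24 * 3600   # two years = 63,072,000 s
samples_per_second = 500
bytes_per_sample = 1.5

needed_disk_bytes = retention_seconds * samples_per_second * bytes_per_sample
needed_gb = needed_disk_bytes / 1e9       # decimal GB, as billed
needed_gib = needed_disk_bytes / 2**30    # binary GiB

disk_price_per_gb_month = 0.04            # GCP persistent disk floor
disk_cost_two_years = needed_gb * disk_price_per_gb_month * 24

print(f"{needed_gib:.2f} GiB ({needed_gb:.2f} GB), ${disk_cost_two_years:.2f} over two years")
# → 44.06 GiB (47.30 GB), $45.41 over two years
```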
On the other hand, Google’s Managed Prometheus starts at $0.06/million samples. Using our figures of 63,072,000 seconds in two years and an average of 500 samples per second, we’ll ingest 31,536,000,000 samples over two years. At $0.06 per million samples, this works out to $1,892.16.
Of course, we also need a virtual machine to run Prometheus. The price difference between the persistent disks and Managed Prometheus means our compute capacity must cost us less than $76.95/month (or about $0.107/h) for the two options to break even. With a three-year Committed Use Discount in us-central1, an n2d-highmem-2 instance with 2 vCPUs and 16 GiB of memory comes in around $0.052/h, well under that ceiling, and anything cheaper that still does the job only widens your savings.
So on raw infrastructure cost, self-hosting can come out well ahead. But that assumed a three-year commitment, and we haven’t accounted for the time spent patching and maintaining this instance. You can use Thanos for long-term retention and querying to bring costs down further. It’s a great solution, but you will need to decide whether the engineering effort costs less than vertically scaling Prometheus.
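The managed side of the comparison works the same way. A quick sketch under the same assumptions, using the $0.06-per-million-samples rate:

```python
# Managed Prometheus ingest cost vs. the self-hosted persistent disk.
retention_seconds = 2 * 365 * 24 * 3600              # 63,072,000 s
samples_per_second = 500

total_samples = retention_seconds * samples_per_second   # 31,536,000,000 samples
gmp_cost = (total_samples / 1_000_000) * 0.06            # $0.06 per million samples

disk_cost = 45.41                                        # from the persistent-disk math
break_even_per_month = (gmp_cost - disk_cost) / 24       # VM budget to break even

print(f"${gmp_cost:,.2f} for ingest; a VM must cost under "
      f"${break_even_per_month:.2f}/month to break even")
# → $1,892.16 for ingest; a VM must cost under $76.95/month to break even
```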
And we haven’t even broached the topic of managing multiple Prometheus instances.
Why I like GMP
- It works. I don’t have to think about it.
- Operations Suite is pretty good. Grafana has a lot of community support, but you can import your Grafana dashboards into Operations Suite, and alerts can be written in PromQL. Fewer additional services to maintain.
- The architecture is dramatically more elegant. In traditional Prometheus, you’re pulling metrics into a single source. With Google’s solution, scraping is done by a DaemonSet in Kubernetes and by the Ops Agent on VMs. Each node is responsible for scraping the metrics of its own workload(s) instead of the entire distributed system. Those metrics are then pushed into Operations Suite, leaving the challenges of scaling metric ingestion and storage to Google. This difference immediately erases the challenge of querying multiple, separate Prometheus instances.
How to get started
Google’s quickstarts are great. If you’re scraping Kubernetes workloads, use managed collection if you can.
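With managed collection, you declare scrape targets with a PodMonitoring resource instead of maintaining a Prometheus config. A minimal sketch, where the app label and port name are placeholders for your workload:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app          # placeholder name
  namespace: default
spec:
  selector:
    matchLabels:
      app: example-app       # assumes your pods carry this label
  endpoints:
  - port: metrics            # assumes a named container port serving /metrics
    interval: 30s
```

Apply it alongside your workload and the per-node collectors pick it up; no central scrape config to manage.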
Got more questions? Hit me up.