The Challenge
I recently spoke with a customer who was looking to capture and track DevOps Research and Assessment (DORA) metrics. The DORA metrics are four key indicators of a software development team's performance. These metrics are:
- Deployment Frequency: How often an organization successfully releases to production
- Lead Time for Changes: The amount of time it takes a commit to get into production
- Change Failure Rate: The percentage of deployments causing a failure in production
- Time to Restore Service: How long it takes an organization to recover from a failure in production
This customer could already report on Deployment Frequency, Lead Time for Changes, and Change Failure Rate. However, they didn't have an easy way to measure Time to Restore Service. The customer uses GCP's Operations Suite alerting tools to detect failures and resolve incidents. While Operations Suite generates the incident events, we ultimately want the data in BigQuery so we can build Looker Studio dashboards and analyze it later.
The Solution
To capture these events, we can take advantage of Pub/Sub notification channels. Once your alerting policy is defined, alerts are sent to the notification channels of your choice (e.g. SMS, email, Pub/Sub, etc.). Pub/Sub notification channels are a great, general-purpose option when your preferred destination is not currently supported in Operations Suite: they let us take the generated events and do almost anything we want with them.
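For reference, a Pub/Sub notification channel is just a NotificationChannel resource of type pubsub that points at your topic. A minimal sketch of the channel definition is below; the topic name is a placeholder, and you'd create the channel with something like gcloud beta monitoring channels create --channel-content-from-file=pubsub-channel.yaml (check the current gcloud reference for exact flags). Also remember that the Cloud Monitoring notification service account needs the Pub/Sub Publisher role on the topic.
# pubsub-channel.yaml (sketch): the topic below is a placeholder
type: pubsub
displayName: Alerting incidents to Pub/Sub
labels:
  topic: projects/taylor-lab/topics/alerting-incidents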
We could use a BigQuery subscription to write the Pub/Sub messages directly to BigQuery; however, we’d encounter a few issues.
- If we don't define a topic schema, the event is written as a JSON object in a single column, which limits our ability to query individual fields.
- If we do define a topic schema, every message must match the schema exactly when it's published to the topic. We know that Operations Suite will adjust the schema over time (you'll notice the version field defaults to a value of 1.2), and your organization may alter portions of the schema as well.
- You can't fully define the Pub/Sub event beforehand. Individual resources may carry more labels than you expect, and the alerting policy may have any number of user labels associated with it, as the abbreviated payload below shows. Any unexpected change to the schema will prevent messages from being published to the topic.
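To make the problem concrete, here is an abbreviated sketch of a notification payload. The fields the workflow below relies on (incident_id, scoping_project_id, policy_name, started_at, ended_at) are real, but the values and the surrounding fields are illustrative and will vary by resource type and policy configuration:
{
  "version": "1.2",
  "incident": {
    "incident_id": "0.abc123",
    "scoping_project_id": "taylor-lab",
    "policy_name": "High error rate",
    "policy_user_labels": { "team": "payments" },
    "condition_name": "Error rate above threshold",
    "resource": {
      "type": "gce_instance",
      "labels": { "instance_id": "1234567890", "zone": "us-central1-a" }
    },
    "state": "closed",
    "started_at": 1690000000,
    "ended_at": 1690000900,
    "summary": "..."
  }
}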
But what if we used Workflows? We can unmarshal the JSON object, extract the specific fields we want, and then write them into BigQuery.
main:
  params: [event]
  steps:
    # Decode the base64-encoded Pub/Sub message and parse the JSON payload.
    - decode_pubsub:
        assign:
          - base64: ${base64.decode(event.data.message.data)}
          - message: ${text.decode(base64)}
          - alert: ${json.decode(message)}
    # Only continue when the incident has ended; otherwise finish the execution.
    - if_event_over:
        switch:
          - condition: ${alert.incident.ended_at != null}
            next: insert_record
        next: end
    # Write the completed incident into the reporting table.
    - insert_record:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: taylor-lab
          body:
            useLegacySql: false
            query: ${"INSERT INTO `reporting.alerts` (incident_id, scoping_project_id, policy_name, started_at, ended_at) VALUES ( \"" + alert.incident.incident_id + "\", \"" + alert.incident.scoping_project_id + "\", \"" + alert.incident.policy_name + "\", " + string(alert.incident.started_at) + ", " + string(alert.incident.ended_at) + ")"}
        next: end
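The insert_record step assumes the reporting.alerts table already exists. A minimal sketch of that table is below; the column types are assumptions on my part, with started_at and ended_at stored as the Unix timestamps that arrive in the payload:
CREATE TABLE IF NOT EXISTS reporting.alerts (
  incident_id STRING,
  scoping_project_id STRING,
  policy_name STRING,
  started_at INT64,  -- Unix epoch seconds from the notification payload
  ended_at INT64
);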
One design quirk of this workflow is the if_event_over step. Pub/Sub subscriptions can be filtered by message attributes but not by the data in the message, so we can't keep the incident-opened events from reaching the workflow. Those opening messages contain only a start time, which is incomplete for our reporting purposes; the closing message contains both the start and end times. It's simpler to insert one complete record than to update a partial record later, so we perform a single insert once the incident is done.
This provides some advantages and unique characteristics:
- We don't have to manage any code. Everything is written in YAML, so individual contributors don't need programming experience to contribute to, extend, or maintain the pipeline.
- The event version can change without disrupting the pipeline, as long as the fields you're extracting continue to exist.
- It's more cost-effective for processing these events than heavier tools like Dataflow.
However, you should still watch for schema changes to minimize the chance of disruptions.
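If you want extra protection against a field being renamed or dropped, Workflows' map.get and default functions let you fall back to a placeholder value instead of failing the execution. A small sketch, with arbitrary fallback values, might look like:
- extract_fields:
    assign:
      # Fall back to placeholder values if the payload no longer contains these fields.
      - policy_name: ${default(map.get(alert.incident, "policy_name"), "unknown")}
      - scoping_project_id: ${default(map.get(alert.incident, "scoping_project_id"), "unknown")}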
Next Steps
Optionally, if you expect this workflow to handle high throughput, you could have it publish the parsed fields to a second Pub/Sub topic that has a BigQuery subscription, rather than calling the BigQuery API directly. While we couldn't guarantee the schema of the original events beforehand, the workflow converts that dynamic schema into a predictable output, which lets us take advantage of BigQuery subscriptions downstream.
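As a sketch of that variation, the insert_record step could be replaced with something like the steps below; the alerts-normalized topic name is a placeholder, and it would need a BigQuery subscription (and a matching topic schema) attached:
- build_record:
    assign:
      # Normalize the dynamic alert payload into a fixed set of fields.
      - record:
          incident_id: ${alert.incident.incident_id}
          scoping_project_id: ${alert.incident.scoping_project_id}
          policy_name: ${alert.incident.policy_name}
          started_at: ${alert.incident.started_at}
          ended_at: ${alert.incident.ended_at}
- publish_record:
    call: googleapis.pubsub.v1.projects.topics.publish
    args:
      topic: projects/taylor-lab/topics/alerts-normalized
      body:
        messages:
          - data: ${base64.encode(json.encode(record))}
    next: end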