Adding datadog metric collection
The datadogpy documentation is good, but there's some useful local knowledge that will be helpful when adding metric collection to Release Engineering tools, and some good habits to get into when choosing how to set up your script.
Quickstart
pip install datadog
or
pipenv install datadog
Then a basic example:
#!/usr/bin/env python3
import time

from datadog import initialize, ThreadStats

ddstats = ThreadStats(namespace='releng.sandbox.blogpost')

dd_options = {
    'api_key': 'api key goes here',
}
initialize(**dd_options)
ddstats.start(flush_interval=1)

metric_tags = [
    "tag1:value1",
    "tag2:value2",
]

with ddstats.timer('sleeptimer', tags=metric_tags):
    time.sleep(10)
Explanations
from datadog import initialize, ThreadStats
ThreadStats is an option in the datadogpy module that runs the metric collection and API calls in a separate thread, so any errors or delays in submission won't impact the script's own performance.
ddstats = ThreadStats(namespace='releng.sandbox.blogpost')
Next, create an object that can be used to send metrics, ensuring it's visible in all the scopes that need it. You might want to make it a global.
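A minimal sketch of that pattern (the function name is invented for illustration): a module-level ddstats object that every function in the file can use without it being passed around.

from datadog import ThreadStats

# Module-level 'global' stats object, visible to every function below.
ddstats = ThreadStats(namespace='releng.sandbox.blogpost')


def do_work():
    # No need to pass the stats object in; the module-level one is in scope.
    ddstats.increment('work_started')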
Namespaces
The namespace is important. We share the global metric namespace with everyone else in the same datadog organisation, so we need to make sure that anything we send doesn't conflict with someone else's data.
Adding a namespace to the ThreadStats() object will prepend this value to the name of any metric we send, making the submission calls shorter to read and type, and allowing us to change the namespace globally for testing.
Funsize uses 'releng.releases.partials' as a namespace: release engineering produced the metric, it's about releases, and within releases it's about partial generation. So you might see metrics called 'releng.releases.partials.task_duration'.
For testing, you can change the global namespace to a sandbox one - the name itself isn't special, this is just a convention:
ddstats = ThreadStats(namespace='releng.sandbox.sfraser')
Initialization
Once you call initialize() with the API key and call the start method on your stats object, any metric submission calls will send their data to datadog. If you don't call initialize and start, the metric submission calls will still work, but no data will be submitted.
This is useful for turning off metric collection, for example not having it run on some repositories or with local testing, as you don't need to wrap all the submission calls in conditionals.
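A minimal sketch of that pattern, assuming the API key is delivered in a DATADOG_API_KEY environment variable (covered in the next section):

import os

from datadog import initialize, ThreadStats

ddstats = ThreadStats(namespace='releng.sandbox.blogpost')

# Only enable submission when a key is available; otherwise the
# ddstats.* calls below are harmless no-ops.
dd_api_key = os.environ.get('DATADOG_API_KEY')
if dd_api_key:
    initialize(api_key=dd_api_key)
    ddstats.start(flush_interval=10)

# Safe to call whether or not metrics are enabled.
ddstats.increment('script_started')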
Getting the API key
The python module does not automatically detect any source of API key, so we must find it ourselves.
For local testing it can be simplest to provide it as an environment variable:
dd_api_key = os.environ.get('DATADOG_API_KEY')
For running in taskcluster, you may want to use the Taskcluster Secrets service and pass the location of the secret into the script. The task will need the appropriate scope to read the secret, as will the user or decision task that creates it. Ensure the task has the 'taskclusterproxy' feature enabled, and make a call to the secrets service, such as this example.
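As a rough sketch only (the secret name, its payload layout, and the proxy URL below are assumptions; adjust them to match your task's setup), reading the key through the taskcluster proxy might look like this:

import requests

# Hypothetical secret path, passed into the script via an argument or env var.
secret_name = 'project/releng/sandbox/datadog-api-key'

# With the 'taskclusterproxy' feature enabled, the secrets service is reachable
# at an in-task URL; the exact proxy path can vary between deployments.
url = 'http://taskcluster/secrets/v1/secret/{}'.format(secret_name)

response = requests.get(url)
response.raise_for_status()

# The secrets service wraps the stored payload in a 'secret' key; the inner
# 'api_key' field is an assumption about how this secret was written.
dd_api_key = response.json()['secret']['api_key']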
flush_interval
This is simply how often you'd like the submission thread to submit its metrics, if it has any.
If fewer than flush_interval seconds pass between your script's last metric submission and the script ending, you may lose data.
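One way to reduce that risk, as a sketch (ThreadStats exposes a flush() method; registering it with atexit is just one option):

import atexit

# Push any buffered metrics out before the interpreter exits, so a short-lived
# script doesn't drop its last batch of data points.
atexit.register(ddstats.flush)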
Metric Tags
Tagging a metric is very helpful. For example, if we have a tag of 'branch', we can easily submit metrics for 'mozilla-central', 'mozilla-beta', 'maple' and others, and filter out the ones we're not interested in when we view the data. It might be that something we've done on, say, maple makes things a lot slower, and without the tag we wouldn't be able to tell when looking at the graph. It would just be spiky and we wouldn't be able to say why.
Don't add too many tags - this increases the cardinality of the metric being collected, making it slower to retrieve, and can cause storage problems (not just with datadog, but with any time series database).
Good examples of tags might be:
- branch
- platform
- hostname (for scriptworkers and others where this makes sense, not docker)
- The unit of any collected measurement - bytes, bps, Mbps, gigaparsecs per hour. Essential if we want to compare two metrics and know whether the comparison is meaningful
- script name and version ("There's a spike here when we changed from scriptv1 to scriptv2!")
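A sketch of what a tag list along these lines might look like when submitting a metric (the tag values and metric name here are invented):

# Tag values are invented, purely to show the shape of a tag list.
metric_tags = [
    "branch:mozilla-central",
    "platform:linux64",
    "unit:seconds",
    "script:blogpost-example-v1",
]

# Gauge submissions take the same tags argument as the timer example above.
ddstats.gauge('task_duration', 42.0, tags=metric_tags)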
See also Metrics 2.0 for useful guidance.
with ddstats.timer('sleeptimer', tags=metric_tags):
A straightforward enough example - when entering the context, the time is recorded, and when the context exits the difference in time is submitted. In this example the metric name will be 'releng.sandbox.blogpost.sleeptimer', and it will be tagged with the tags set just above.
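ThreadStats also offers a decorator form, which is handy when a whole function is the thing being timed. This is a sketch (the function name is invented; check the datadogpy docs for the exact timed() signature in your version):

# Decorator variant of the same timing pattern.
@ddstats.timed('sleeptimer', tags=metric_tags)
def do_sleep():
    time.sleep(10)

do_sleep()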
The 'dog' command line tool
Sometimes you may need a shell script to send metrics. If you've installed datadogpy, there will be a 'dog' command available that can send data and make a few other queries. Its basic use looks like this:
dog metric post "releng.sandbox.mytest.metric" 1.0
To add tags:
dog metric post "releng.sandbox.mytest.metric" 1.0 --tags "tag1:value1,tag2:value2"
The current hostname will be automatically added as a tag. If you're running in a docker-worker, or don't care about the hostname of the instance that produced the metric, then you can add the --no_host option to avoid sending the hostname.
The .dogrc file
To get the command-line tool to work, it needs a configuration file containing the API key. The format is configparser compatible:
[Connection]
apikey = api_key_goes_here
appkey =
(Yes, appkey can be left blank)
Funsize has an example of generating this dogrc file from a surrounding Python script.
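As a rough sketch of that approach (this is illustrative, not a copy of the Funsize code; the file path and environment variable are assumptions), the file can be written with configparser:

import configparser
import os

# Write a minimal .dogrc into the home directory so the 'dog' CLI can find it.
config = configparser.ConfigParser()
config['Connection'] = {
    'apikey': os.environ.get('DATADOG_API_KEY', ''),
    'appkey': '',
}

dogrc_path = os.path.expanduser('~/.dogrc')
with open(dogrc_path, 'w') as dogrc:
    config.write(dogrc)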