Automated end-to-end tests for Glean
Jul 24, 2020
7 minute read

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

This is a special guest post by non-Glean-team member Raphael Pierzina! 👨🏻‍💻

Origins

Mozilla All-Hands

Last year at the Mozilla All-Hands in Whistler, Canada, I went for a walk with my colleague Mark Reid, who manages our Data Platform team. We caught up on personal stuff and discussed ongoing projects as well as shared objectives for the next half-year. These in-person conversations with colleagues are my favorite activity at our semi-annual gatherings: they help ensure that my team works on the most impactful projects and that our tests create value for the teams we support. 📈

Glean

For Mozilla, getting reliable data from our products is critical to inform our decision making. Glean is a new product analytics and telemetry solution that provides a consistent experience and behavior across all of our products. Mark and I agreed that it would be fantastic if we had automated end-to-end tests to complement existing test suites and alert us of potential issues with the system as quickly as possible.

Project

We wrote a project proposal, consulted the various stakeholders, and presented it to the Data Engineering and Data Operations teams before scoping out the work for the different teams and getting started on the project. Fast forward to today: I’m excited to share that we have recently reached a major milestone and successfully completed a proof of concept!

The burnham project is an end-to-end test suite that aims to automatically verify that Glean-based products correctly measure, collect, and submit non-personal information to the GCP-based Data Platform, and that the received telemetry data is then correctly processed, stored in the respective tables, and made available in BigQuery. 👩‍🚀📈🤖

The project theme is inspired by Michael Burnham, the fictional protagonist of the web television series Star Trek: Discovery, portrayed by Sonequa Martin-Green. Burnham is a science specialist on the Discovery. She and her crew research spore-drive technology and complete missions in outer space; these themes of scientific exploration and space travel are a perfect fit for this project.

Architecture

burnham

We have developed a command-line application based on the Glean SDK Python bindings for producing test data as part of the automated end-to-end test suite. The burnham application submits custom discovery Glean pings to the Data Platform which validates and stores these pings in the burnham_live.discovery_v1 BigQuery table.
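
For context on how the application talks to Glean: the Python bindings load metric and ping definitions from registry YAML files at startup. Here is a minimal sketch of that setup (the file names and initialization arguments below are assumptions and vary by SDK version; they are not taken from the burnham code):

from glean import Glean, load_metrics, load_pings

# Exact initialization arguments vary across Glean SDK versions.
Glean.initialize(
    application_id="burnham",
    application_version="0.1.0",
    upload_enabled=True,
)

# `metrics` and `pings` expose the definitions from the registry files,
# e.g. metrics.technology.space_travel and pings.discovery.
metrics = load_metrics("metrics.yaml")
pings = load_pings("pings.yaml")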

Every mission identifier that is passed as an argument to the burnham CLI corresponds to a class, which defines a series of actions for the space ship in its complete method:

from typing import ClassVar


class MissionF:
    """Warp two times and jump one time."""

    identifier: ClassVar[str] = "MISSION F: TWO WARPS, ONE JUMP"

    def complete(self, space_ship: Discovery) -> None:
        # Discovery is burnham's space ship class; each call records a metric.
        space_ship.warp("abc")
        space_ship.warp("de")
        space_ship.jump("12345")
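
To make the mapping concrete, here is a hypothetical sketch of how mission identifiers could be resolved to mission classes (the actual lookup lives in the burnham application):

# Hypothetical sketch (not the actual burnham code): map identifiers
# passed on the command line to the mission classes defined above.
missions = [MissionF()]  # ... plus the other mission classes
missions_by_identifier = {mission.identifier: mission for mission in missions}


def complete_missions(identifiers, space_ship):
    """Complete each mission requested via the CLI, in order."""
    for identifier in identifiers:
        missions_by_identifier[identifier].complete(space_ship)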

Each action records to a custom labeled_counter metric, as you can see in the following snippet. After completing a mission, burnham submits a discovery ping that contains the recorded metrics.

import logging

logger = logging.getLogger(__name__)


class WarpDrive:
    """Space-travel technology."""

    def __call__(self, coordinates: str) -> str:
        """Warp to the given coordinates."""

        # `metrics` is generated from the project's metrics.yaml registry.
        metrics.technology.space_travel["warp_drive"].add(1)
        logger.debug("Warp to %s using space-travel technology", coordinates)

        return coordinates
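
To show how these pieces could fit together, here is a hypothetical sketch of the space ship class (the real Discovery class lives in the burnham application; the class shape and the ping submission call are assumptions):

class Discovery:
    """Hypothetical sketch of the space ship (not the actual burnham code)."""

    def __init__(self, warp_drive: WarpDrive, spore_drive) -> None:
        # spore_drive is assumed to be analogous to WarpDrive, recording
        # to the "spore_drive" label of the same labeled_counter metric.
        self.warp_drive = warp_drive
        self.spore_drive = spore_drive

    def warp(self, coordinates: str) -> str:
        return self.warp_drive(coordinates)

    def jump(self, coordinates: str) -> str:
        return self.spore_drive(coordinates)

# After completing a mission, burnham submits the custom ping, e.g.:
# pings.discovery.submit()  # assuming pings = load_pings("pings.yaml")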

The example test scenario in a later section shows how we access this data. 📊

You can find the code for the burnham application in the application directory of the burnham repository. 👩‍🚀

burnham-bigquery

We also developed a test suite based on the pytest framework that dynamically generates tests. Each test runs a specific query on BigQuery to verify a certain test scenario.

The following snippet shows how we generate the tests in a pytest hook:

from google.cloud import bigquery


def pytest_generate_tests(metafunc):
    """Generate tests from test run information."""

    ids = []
    argvalues = []

    for scenario in metafunc.config.burnham_run.scenarios:
        ids.append(scenario.name)
        query_job_config = bigquery.QueryJobConfig(
            # The SQL query is expected to contain a @burnham_test_run parameter
            # and the value is passed in for the --run-id CLI option.
            query_parameters=[
                bigquery.ScalarQueryParameter(
                    "burnham_test_run",
                    "STRING",
                    metafunc.config.burnham_run.identifier,
                ),
            ]
        )
        argvalues.append([query_job_config, scenario.query, scenario.want])

    metafunc.parametrize(
        ["query_job_config", "query", "want"], argvalues, ids=ids
    )
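
The parametrized values are then consumed by a test function along these lines; this is a sketch that assumes a bq_client fixture wrapping google.cloud.bigquery.Client, not the actual test code:

def test_scenario(bq_client, query_job_config, query, want):
    """Run the scenario's query on BigQuery and compare the results."""
    # bq_client is an assumed fixture providing a bigquery.Client instance.
    query_job = bq_client.query(query, job_config=query_job_config)
    got = [dict(row) for row in query_job.result()]
    assert got == want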

The test suite code is located in the bigquery directory of the burnham repository. 📊

telemetry-airflow

We build and push Docker images for both burnham and burnham-bigquery on CI for pushes to the main branch of the burnham repository. The end-to-end test suite is configured as a DAG on telemetry-airflow on the Data Platform and scheduled to run daily (this is the same infrastructure we use to generate all the derived datasets). It runs several instances of a burnham Docker container to produce Glean telemetry, uses an Airflow sensor to wait for the data to be available in the burnham live tables, and then runs burnham-bigquery to verify the results.

The following snippet shows how we call a helper function that returns a GKEPodOperator, which runs the burnham Docker container in a Kubernetes pod. We pass in information about the current test run, which we later use in the SQL queries to filter out rows from other test runs. We also specify the missions burnham runs and whether the spore-drive experiment is active for this client:

client2 = burnham_run(
    task_id="client2",
    burnham_test_run=burnham_test_run,
    burnham_test_name=burnham_test_name,
    burnham_missions=[
        "MISSION A: ONE WARP",
        "MISSION B: TWO WARPS",
        "MISSION D: TWO JUMPS",
        "MISSION E: ONE JUMP, ONE METRIC ERROR",
        "MISSION F: TWO WARPS, ONE JUMP",
        "MISSION G: FIVE WARPS, FOUR JUMPS",
    ],
    burnham_spore_drive="tardigrade-dna",
    owner=DAG_OWNER,
    email=DAG_EMAIL,
)

Please see the burnham DAG for more information. 📋
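
For orientation, the overall shape of the DAG is: the burnham client tasks run first, an Airflow sensor waits for the pings to land in the live tables, and the verifier runs last. A rough sketch of that wiring, with assumed task names:

# Illustrative task wiring only (variable names assumed); the real
# operators are defined in the burnham DAG in telemetry-airflow.
# wait_for_data is an Airflow sensor on the burnham live tables and
# verify runs the burnham-bigquery suite in a GKE pod.
[client1, client2, client3] >> wait_for_data >> verify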

Test scenarios

The burnham Docker image and the burnham-bigquery Docker image support parameters that control their behavior. This means we can modify both the client automation and the test scenarios from the burnham DAG, and the DAG effectively becomes the test runner.

We currently test four different scenarios. Every scenario consists of a BigQuery SQL query and a corresponding list of expected result rows.
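
Judging from the attributes used in pytest_generate_tests above, a scenario could be modeled roughly like this (an illustrative sketch, not the actual burnham-bigquery code):

from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Scenario:
    """One end-to-end test scenario (illustrative sketch)."""

    name: str
    # BigQuery Standard SQL containing a @burnham_test_run query parameter.
    query: str
    # Expected result rows, e.g. [{"key": "warp_drive", "value_sum": 18}].
    want: List[Dict[str, Any]]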

Example test scenario

The test_labeled_counter_metrics test verifies that the labeled_counter metric values reported by the Glean SDK are correct across several documents from three different clients.

The UNNEST operator in the following SQL query takes metrics.labeled_counter.technology_space_travel and returns a table with one row for each element in the ARRAY. The CROSS JOIN then combines these rows with the values from the other columns, which we filter on in the WHERE clause:

SELECT
  technology_space_travel.key,
  SUM(technology_space_travel.value) AS value_sum
FROM
  `project_id.burnham_live.discovery_v1`
CROSS JOIN
  UNNEST(metrics.labeled_counter.technology_space_travel) AS technology_space_travel
WHERE
  submission_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR)
  AND metrics.uuid.test_run = @burnham_test_run
GROUP BY
  technology_space_travel.key
ORDER BY
  technology_space_travel.key
LIMIT
  10

The burnham DAG instructs client1 to perform 5 warps and 5 jumps, client2 to perform 10 warps and 8 jumps, and client3 to perform 3 warps. We therefore expect a total of 18 warps with the warp drive and a total of 13 jumps with the spore drive across the three clients:

WANT_TEST_LABELED_COUNTER_METRICS = [
    {"key": "spore_drive", "value_sum": 13},
    {"key": "warp_drive", "value_sum": 18},
]

Next steps

We currently monitor the test run results via the Airflow dashboard and have set up email alerts for when the burnham DAG fails. Airflow stores logs for every task, which allows us to diagnose failures.

We are now working on storing the test results, along with the test report from burnham-bigquery, in a new BigQuery table. This will enable us to create dashboards and monitor test results over time. We also plan to add more test scenarios to the suite, for example a test to verify that the different naming schemes for pings work as designed in the Glean SDK and on the Data Platform.

It has been amazing to collaborate with folks from various teams at Mozilla on the Glean end-to-end tests project and I’m excited to continue this work with my fellow Mozillians in the next half-year. 👨‍🚀

