Non-Quality Performance Metrics for DM in Operations#

Abstract

A concise set of non-quality performance metrics tracked by the Data Management team and reported to operations management.

Non-Scientific Performance Metrics for Operations#

Scientific quality metrics address the accuracy and reliability of our data products, and have been defined and tracked extensively throughout construction (see [2, 4, 5, 6, 7, 8, 9] and references therein). These quality metrics also include non-scientific metrics that track DM system requirements [1], such as the time to issue alerts after shutter close. See [3] for a summary of how these quality metrics will be used in operations.

This document outlines the operational non-quality metrics for assessing and improving our system’s performance and user experience. Non-quality metrics provide insights into the operational efficiency, usage patterns, infrastructure robustness, development processes, and security measures of the Data Management System. By monitoring these metrics, we aim to ensure that our infrastructure and services remain effective and secure, supporting our scientific objectives and user community.

Non-quality metrics are tracked by the various areas of oversight within DM. Some examples of these metrics are as follows:

Data Production#

Software (Science Pipelines)

  • Deployment Frequency: How often is lsst_distrib released?

  • Lead time for changes: Average time to research, implement, and deploy changes at three different levels: a non-scientific-behavior bug fix, a scientific-behavior change, and swapping in a new state-of-the-art algorithm.

  • Average time to recovery: time to notice and fix a breaking commit on main, for both lsst_distrib itself and the various continuous integration builds.

  • Average time to investigate an issue
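As a rough illustration of how such quantities could be derived, the sketch below computes a release cadence for lsst_distrib and a mean time to recovery from hypothetical timestamp records; the record shapes and values are assumptions for illustration, not an existing DM data source.

    from datetime import datetime
    from statistics import mean

    # Hypothetical lsst_distrib release dates (weekly cadence, for illustration only).
    releases = [datetime(2024, 1, 8), datetime(2024, 1, 15), datetime(2024, 1, 22)]

    # Deployment frequency, expressed as the mean interval between consecutive releases (days).
    release_interval_days = mean((b - a).days for a, b in zip(releases, releases[1:]))

    # Hypothetical (breakage detected, fix merged) timestamp pairs for breaking commits on main.
    incidents = [
        (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 13, 30)),
        (datetime(2024, 1, 18, 22, 0), datetime(2024, 1, 19, 8, 15)),
    ]

    # Average time to recovery, in hours.
    recovery_hours = mean((fixed - broken).total_seconds() / 3600 for broken, fixed in incidents)

    print(f"Mean days between releases: {release_interval_days:.1f}")
    print(f"Mean time to recovery: {recovery_hours:.1f} h")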

Campaign Management (tracked jointly with Data Facility)

These metrics are affected by infrastructure and also provide insights into the health of the storage, compute and networks.

  • Peak number of cores used per workflow/campaign

  • Wall time of a campaign

  • CPU-time of a campaign

  • Average number of concurrent cores used per workflow management system (e.g., HTCondor, PanDA)
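A minimal sketch of how these quantities might be derived, assuming per-job accounting records (start time, end time, cores) can be exported from the workflow management system; the record format and values below are illustrative assumptions.

    from datetime import datetime

    # Hypothetical per-job accounting records: (start, end, cores requested).
    jobs = [
        (datetime(2024, 2, 1, 0, 0), datetime(2024, 2, 1, 6, 0), 8),
        (datetime(2024, 2, 1, 3, 0), datetime(2024, 2, 1, 9, 0), 16),
        (datetime(2024, 2, 1, 8, 0), datetime(2024, 2, 1, 12, 0), 8),
    ]

    # Campaign wall time: span from the first job start to the last job end, in hours.
    wall_hours = (max(e for _, e, _ in jobs) - min(s for s, _, _ in jobs)).total_seconds() / 3600

    # Campaign CPU time: per-job duration weighted by cores, summed, in core-hours.
    cpu_core_hours = sum((e - s).total_seconds() / 3600 * c for s, e, c in jobs)

    # Peak concurrent cores: sweep over start/end events, tracking the running core count.
    events = sorted([(s, c) for s, _, c in jobs] + [(e, -c) for _, e, c in jobs])
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)

    print(f"Wall time: {wall_hours:.1f} h, CPU time: {cpu_core_hours:.1f} core-h, peak cores: {peak}")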

Campaign Management (tracked jointly with Pipelines)

These metrics are affected by the pipeline software and also provide insights into how to best configure pipeline algorithms for the available hardware.

  • Core utilization efficiency (joint metric between pipelines and CM)

  • Memory usage per workflow, task, and quantum
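One plausible definition of core utilization efficiency is the ratio of CPU time consumed to core-time allocated; this definition, and the per-quantum record format in the sketch below, are assumptions for illustration rather than an adopted DM convention.

    # Hypothetical per-quantum records: (task label, CPU s, wall s, cores allocated, peak RSS in GB).
    quanta = [
        ("isr", 540.0, 600.0, 1, 2.1),
        ("calibrate", 1100.0, 1500.0, 1, 3.4),
        ("makeWarp", 2800.0, 4000.0, 1, 6.0),
    ]

    # Core utilization efficiency: total CPU time divided by allocated core-time.
    total_cpu = sum(cpu for _, cpu, _, _, _ in quanta)
    total_allocated = sum(wall * cores for _, _, wall, cores, _ in quanta)
    efficiency = total_cpu / total_allocated

    # Peak memory per task label, in GB.
    peak_memory = {}
    for task, _, _, _, rss in quanta:
        peak_memory[task] = max(peak_memory.get(task, 0.0), rss)

    print(f"Core utilization efficiency: {efficiency:.0%}")
    print(f"Peak memory per task (GB): {peak_memory}")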

Data Services#

The Rubin Science Platform (RSP) metrics include:

  • Number of RSP accounts: total accounts per day

  • Active RSP accounts: number of accounts logged in and active, per hour
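For example, both counts might be derived from authentication logs, as in the minimal sketch below; the log record shape and values are assumptions for illustration, not the RSP's actual logging format.

    from collections import defaultdict
    from datetime import datetime

    # Hypothetical authentication log: (username, login timestamp).
    logins = [
        ("alice", datetime(2024, 3, 1, 9, 5)),
        ("bob", datetime(2024, 3, 1, 9, 40)),
        ("alice", datetime(2024, 3, 1, 10, 15)),
    ]

    # Total distinct accounts seen on the day.
    accounts_per_day = len({user for user, _ in logins})

    # Active accounts per hour: distinct users with at least one login in each hour.
    active_by_hour = defaultdict(set)
    for user, ts in logins:
        active_by_hour[ts.replace(minute=0, second=0, microsecond=0)].add(user)
    hourly_counts = {hour: len(users) for hour, users in active_by_hour.items()}

    print(f"Accounts active today: {accounts_per_day}")
    print(f"Active accounts per hour: {hourly_counts}")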

Data Facility Infrastructure#

Metrics monitoring#

These metrics are monitored using:

References#

[1]

[LSE-61]. Gregory Dubois-Felsmann and Tim Jenness. Data Management System (DMS) Requirements. 2019. Vera C. Rubin Observatory. URL: https://lse-61.lsst.io/

[2]

[SQR-008]. Angelo Fausti. SQUASH QA database. 2016. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-008.lsst.io/

[3]

[RTN-038]. Leanne Guy, ...., and the DM System Science Team. Rubin Science Performance Metrics. 2024. Vera C. Rubin Observatory Technical Note. URL: https://rtn-038.lsst.io/

[4]

[DMTN-211]. Leanne P. Guy. Faro: A framework for measuring the scientific performance of petascale Rubin Observatory data products. 2022. Vera C. Rubin Observatory Data Management Technical Note. URL: https://dmtn-211.lsst.io/

[5]

[PSTN-023]. K. Simon Krughoff. LSST Data Management Quality Assurance and Reliability Engineering. 2019. Vera C. Rubin Observatory Project Science Technical Note. URL: https://pstn-023.lsst.io/

[6]

[LSE-63]. Tony Tyson, DQA Team, and Science Collaboration. LSST Data Quality Assurance Plan. 2017. Vera C. Rubin Observatory. URL: https://lse-63.lsst.io/

[7]

[DMTN-008]. Michael Wood-Vasey. Introducing validate_drp: Calculate SRD Key Performance Metrics for an output repository. 2016. Vera C. Rubin Observatory Data Management Technical Note. URL: https://dmtn-008.lsst.io/

[8]

[DMTN-091]. Michael Wood-Vasey, Eric Bellm, Jim Bosch, Jeff Carlin, Leanne Guy, Zeljko Ivezic, Lauren MacArthur, and Colin Slater. Test Datasets for Scientific Performance Monitoring. 2023. Vera C. Rubin Observatory Data Management Technical Note. URL: https://dmtn-091.lsst.io/

[9]

[LPM-17]. Ž. Ivezić and The LSST Science Collaboration. LSST Science Requirements Document. 2018. URL: https://ls.st/LPM-17