PLT Out-of-Service Post-Mortem

Date

Nov 26, 2019

Participants

Stefan Boniecki, Mark Frost, Bill Xu, Jonathan Bailey, Jiehua Yi, Josh Hevenor, Alvaro, Diego Cesa de Meira, Lawrence Chong

Agenda

  1. Review what happened.

  2. From a technical perspective:

    1. what did we do to resolve the issue?

    2. what were the problems with how we resolved the issue?

    3. what should we do next time?

Problem

PLT was out of service twice; users could not use the system.

Root Causes

  1. A script was running on PLT that filled the hard disk.

  2. The full hard disk caused the backup to fail.

  3. The server was rebuilt.

  4. An issue with the rebuilt configuration caused the configuration script to fail.

  5. The Server machine was re-federated in PLT.

Actions Taken

  • added monitoring of disk usage and other health parameters

  • restored PLT from backup
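
The disk-usage monitoring mentioned above could look roughly like the following. This is only a minimal sketch, not the monitoring that was actually deployed on PLT: the function names, the 80% and 90% thresholds, and the path being checked are all assumptions made for illustration.

```python
import shutil


def classify(percent_used, warn_at=80.0, critical_at=90.0):
    """Map a disk-usage percentage to a status string.

    The thresholds are illustrative assumptions, not values from the incident.
    """
    if percent_used >= critical_at:
        return "CRITICAL"
    if percent_used >= warn_at:
        return "WARNING"
    return "OK"


def check_disk(path="/", warn_at=80.0, critical_at=90.0):
    """Return (status, percent_used) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = 100.0 * usage.used / usage.total
    return classify(percent_used, warn_at, critical_at), percent_used


if __name__ == "__main__":
    # Hypothetical usage: check the current directory's filesystem.
    status, percent_used = check_disk(".")
    print(f"{status}: {percent_used:.1f}% of disk used")
```

A check like this could run on a schedule and raise an alert on WARNING, so that a runaway script is noticed before the disk fills and the backup fails.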

Further Steps

  • need more robust backup / disaster recovery (DR)

  • changes should be tested first in DEV and STG

  • identify multiple possible solutions

    • assess risks of potential fixes

    • ask vendors for advice

  • use DevOps (CI/CD) process

  • shorten time to discovery

Action Items

  • Develop test plan @Alvaro @Jiehua Yi

  • Automate tests where possible

  • Integrate tests into CI/CD pipelines

  • Develop DR strategy (focus on restoring service and preserving diagnostics and logs)

  • Harden DevOps pipelines @Jonathan Bailey

  • Harden scripts and templates @Jonathan Bailey @Josh Hevenor @Jiehua Yi