PLT Out-of-Service Post-Mortem

Date

Nov 26, 2019

Stefan Boniecki, Mark Frost, Bill Xu, Jonathan Bailey, Jiehua Yi, Josh Hevenor, Alvaro, Diego Cesa de Meira, Lawrence Chong

Review what happened.
From a technical perspective:
1. What did we do to resolve the issue?
2. What were the problems with how we resolved the issue?
3. What should we do next time?

PLT was out of service twice; users could not use the system.

A script was running on PLT that filled the hard disk.
The full hard disk caused the backup to fail.
The server was rebuilt.
An issue with the rebuilt configuration caused the configuration script to fail.
The Server machine was re-federated in PLT.

need more robust backup / DR
changes should be tested first in DEV, STG
identify many possible solutions
- assess risks of potential fixes
- ask vendors for advice
use DevOps (CI/CD) process
shorten time to discovery
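
One way to shorten time to discovery for a filling disk is a small scheduled check that flags high usage before the disk is full. The following is a minimal sketch only, not what was run on PLT: the function names, the 80% / 90% thresholds, and the path checked are all assumptions for illustration.

```python
import shutil
from typing import NamedTuple


class DiskStatus(NamedTuple):
    path: str
    used_fraction: float
    level: str  # "ok", "warn", or "critical"


def classify_usage(used_fraction: float,
                   warn_at: float = 0.80,
                   critical_at: float = 0.90) -> str:
    """Map a used-space fraction to a level (thresholds are assumed values)."""
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warn"
    return "ok"


def check_disk(path: str = "/",
               warn_at: float = 0.80,
               critical_at: float = 0.90) -> DiskStatus:
    """Read usage for the filesystem that contains `path` and classify it."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    used_fraction = usage.used / usage.total if usage.total else 0.0
    level = classify_usage(used_fraction, warn_at, critical_at)
    return DiskStatus(path, used_fraction, level)


if __name__ == "__main__":
    status = check_disk("/")
    print(f"{status.path}: {status.used_fraction:.0%} used ({status.level})")
```

Run from a scheduler, a check like this could raise an alert at the "warn" level, while there is still room to stop the offending script and before the backup is affected.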

Develop test plan @Alvaro @Jiehua Yi

Automate tests where possible

Integrate tests into CI/CD pipelines

Develop DR strategy (focus on restoring service and preserving diagnostics and logs)

Harden DevOps pipelines @Jonathan Bailey

Harden scripts and templates @Jonathan Bailey @Josh Hevenor @Jiehua Yi
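
As one possible hardening step for scripts that write to disk, output can be routed through a wrapper that enforces a byte budget, so a runaway script fails fast instead of filling the disk. This is a hedged sketch under assumptions: `BoundedWriter` and `WriteBudgetExceeded` are hypothetical names, and the right budget would depend on the script and the server.

```python
import io
from typing import BinaryIO


class WriteBudgetExceeded(RuntimeError):
    """Raised when a script tries to write more than its allotted bytes."""


class BoundedWriter:
    """Wrap a binary stream and refuse writes beyond a fixed byte budget."""

    def __init__(self, stream: BinaryIO, max_bytes: int) -> None:
        self._stream = stream
        self._remaining = max_bytes

    def write(self, data: bytes) -> int:
        # Reject the write up front so no partial data goes past the budget.
        if len(data) > self._remaining:
            raise WriteBudgetExceeded(
                f"write of {len(data)} bytes exceeds remaining budget "
                f"of {self._remaining} bytes"
            )
        written = self._stream.write(data)
        self._remaining -= len(data)
        return written


if __name__ == "__main__":
    buffer = io.BytesIO()  # stands in for a real output file
    writer = BoundedWriter(buffer, max_bytes=10)
    writer.write(b"12345")
    writer.write(b"12345")
    try:
        writer.write(b"x")
    except WriteBudgetExceeded as exc:
        print(f"stopped: {exc}")
```

The wrapper rejects an oversized write before touching the stream, so the script stops with a clear error that can be logged and preserved for diagnostics.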