PLT Out-of-Service Post-Mortem
Date
Nov 26, 2019
Participants
Stefan Boniecki, Mark Frost, Bill Xu, Jonathan Bailey, Jiehua Yi, Josh Hevenor, Alvaro, Diego Cesa de Meira, Lawrence Chong
Agenda
Review what happened.
From a technical perspective:
what did we do to resolve the issue?
what were the problems with how we resolved the issue?
what should we do next time?
Problem
PLT was out of service twice; users could not use the system.
Root Causes
A script was running on PLT that filled the hard disk.
The full hard disk caused the backup to fail.
The server was rebuilt.
An issue with the rebuilt configuration caused the configuration script to fail.
The Server machine was re-federated in PLT.
Actions Taken
added monitoring of disk usage and other health parameters
restored PLT from backup
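As an illustration of the disk-usage monitoring noted above, here is a minimal sketch in Python. It is not the monitoring that was actually deployed on PLT; the function names, the checked path, and the 85% threshold are all assumptions chosen for the example. It uses only the standard-library call shutil.disk_usage.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Defensive guard for unusual filesystems that report zero size.
        return 0.0
    return 100.0 * usage.used / usage.total


def check_disk(path: str = ".", threshold_percent: float = 85.0) -> bool:
    """Return True if usage is below the threshold, False otherwise.

    The 85% default is a hypothetical value, not one taken from the PLT setup.
    """
    percent = disk_usage_percent(path)
    if percent >= threshold_percent:
        print(f"WARNING: disk for {path!r} is {percent:.1f}% full "
              f"(threshold {threshold_percent:.0f}%)")
        return False
    print(f"OK: disk for {path!r} is {percent:.1f}% full")
    return True


if __name__ == "__main__":
    # Exit non-zero on a warning so a scheduler or monitoring agent can alert.
    raise SystemExit(0 if check_disk() else 1)
```

A check like this could be run on a schedule (for example before each backup), so that a filling disk is noticed before it causes the backup to fail.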
Further Steps
need more robust backup / DR
changes should be tested first in DEV, STG
identify several possible solutions before choosing a fix
assess risks of potential fixes
ask vendors for advice
use DevOps (CI/CD) process
shorten time to discovery
Action items