A DevOps Wiki


Monitoring Reliability

Reliability is a function of two measures: mean time to failure (MTTF) and mean time to recovery (MTTR).

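MTTF is the average time a process runs before it fails. It is important in determining when a process is not reliable: the lower the MTTF, the more often the process fails.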


MTTR is the average time it takes to restore a service after a failure. It is important in determining how quickly you can resolve an issue and limit its impact on the service.
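As a rough illustration, both measures can be estimated from an incident log. The sketch below is not from the original wiki: the function name `reliability_metrics` and the incident format are hypothetical. It assumes MTTF is approximated as total uptime divided by the number of failures, and it combines the two measures with the common steady-state availability ratio MTTF / (MTTF + MTTR).

```python
from datetime import datetime, timedelta


def reliability_metrics(incidents, window_start, window_end):
    """Estimate MTTF, MTTR, and availability over an observation window.

    `incidents` is a list of (failed_at, recovered_at) datetime pairs.
    Assumption: incidents do not overlap and all fall inside the window.
    """
    if not incidents:
        raise ValueError("at least one incident is needed to estimate MTTF and MTTR")

    # Total time the service was down across all incidents.
    downtime = sum((recovered - failed for failed, recovered in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    count = len(incidents)

    mttf = uptime / count    # average running time per failure
    mttr = downtime / count  # average time to restore service
    availability = mttf / (mttf + mttr)  # fraction of time the service was up
    return mttf, mttr, availability


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=30)
    log = [
        (start + timedelta(days=5), start + timedelta(days=5, hours=1)),
        (start + timedelta(days=20), start + timedelta(days=20, hours=3)),
    ]
    mttf, mttr, availability = reliability_metrics(log, start, end)
    print(f"MTTF={mttf}, MTTR={mttr}, availability={availability:.4%}")
```

With this ratio, halving MTTR improves availability about as much as doubling MTTF, which is why recovery speed gets as much attention as failure prevention.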

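Humans add latency to MTTR: someone has to notice the alert, diagnose the problem, and apply a fix. Automated systems reduce the time it takes to resolve an issue by removing those manual steps. Playbooks, or runbooks, also help reduce MTTR by documenting the steps to follow when a person does need to respond.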
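A minimal sketch of automated remediation might look like the following. It is an illustration only: `auto_remediate`, `is_healthy`, and `remediate` are hypothetical names, and a real system would plug in its own health check and fix (for example, restarting a process) and would usually alert a human when automation gives up.

```python
import time


def auto_remediate(is_healthy, remediate, max_attempts=3, wait_seconds=0.0):
    """Apply a remediation step until the health check passes.

    Returns (recovered, attempts, elapsed_seconds) so the time to recovery
    can be recorded alongside other MTTR data.
    """
    started = time.monotonic()
    attempts = 0
    while not is_healthy() and attempts < max_attempts:
        remediate()               # hypothetical fix, e.g. restart the service
        attempts += 1
        time.sleep(wait_seconds)  # give the service time to come back up
    recovered = is_healthy()
    return recovered, attempts, time.monotonic() - started


if __name__ == "__main__":
    # Simulated service that becomes healthy after two remediation attempts.
    state = {"fixes_applied": 0}

    def check():
        return state["fixes_applied"] >= 2

    def fix():
        state["fixes_applied"] += 1

    recovered, attempts, elapsed = auto_remediate(check, fix, max_attempts=5)
    print(f"recovered={recovered} after {attempts} attempts in {elapsed:.3f}s")
```

When `recovered` comes back false, the incident would be escalated to an on-call person, who follows the runbook from that point.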
