Skip to content

System Reliability Engineering

Some potential failures to consider for each deployment:

Availability

  • Always ask what are the SLAs
  • Queue full and transaction not being processed
  • API backend not available
  • Network failure
  • DNS problems
  • Update for new version fails to deploy
  • Hitting unknown limits, like number of VPC interfaces
  • Certificate not auto renewed on expiry, and authentication fails

Performance

  • Cold start of the function cause deploy
  • Unreliable response time
  • Consumer lag behind
  • Event backbone not responding
  • Functions lock an external resource which becomes a bottleneck
  • Backend DB starts to throttle at high load
  • Serverless stars to throttle at high volume
  • Front end times out, no response from the back end
  • Latency measurement
  • Meaningful throughput measurement
  • Batch job impacting main real time processing
  • No more in SLA
  • Retry logic missing jitter causes bursts of retries
  • Message Queues are filled too quickly, causing unhandled messages
  • Network latency impacting call / response - and front end doesn't handle
  • Can't scale up fast enough to support demand spikes
  • Transaction chain scales at different rates or hard limits on one link break scalability

Data Loss / Durability

  • Data at rest gets corrupted
  • Error while archiving central logs
  • Every nth transaction fails silently, loses data
  • DR site down
  • Data out of synch between two data sources
  • Can't find a particular orderID in the logs
  • Lack of log files in serverless world
  • Data replicated among active cluster is taking longer than expected
  • Queue exceeds visibility limit
  • Losing messages in kafka
  • Duplicated messages

Quality and correctness

  • Idempotency failure causing multiple transactions in error
  • Dual records created in database
  • See unexpected records
  • Version error after code upgrade
  • Data recovery can cause ID duplicates
  • Error reporting not reporting actual problem
  • Are the runbooks autodated?
  • Does simulated transaction bring clear understanding of the process flow and error flow?
  • Inconsistent view between caller and callee?

Security

  • Privacy laws enforcement
  • GDPR
  • How to identify data breach
  • Data at rest not encrypted and protected
  • Elevation of privileges on PROD accounts
  • Access to private data
  • Cannot trace who make changes
  • Keep up to date with security patches
  • Open source component with security vulnerability goes undetected