Oddsmaker Incident Response Runbook
Overview
This runbook provides procedures for responding to incidents affecting the Oddsmaker Gaming Analytics Platform.
Severity Levels
| Level | Description | Response Time | Example |
|---|---|---|---|
| P1 - Critical | Service completely down, data loss | 15 minutes | Database failure, security breach |
| P2 - High | Major feature unavailable | 30 minutes | API errors, authentication failure |
| P3 - Medium | Degraded performance | 2 hours | Slow queries, high latency |
| P4 - Low | Minor issue, workaround available | 24 hours | UI bug, non-critical feature issue |
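
The response times in the table above can be encoded as a small lookup for use in alerting or ticketing scripts. This is a minimal sketch; the function name `response_minutes` is hypothetical and not part of any existing Oddsmaker tooling.

```bash
#!/usr/bin/env bash
# Map a severity level to its target response time in minutes,
# using the values from the Severity Levels table.
# The function name is hypothetical.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 30 ;;
    P3) echo 120 ;;    # 2 hours
    P4) echo 1440 ;;   # 24 hours
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_minutes P3` prints `120`, and an unrecognized level returns a non-zero exit status.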
Incident Response Process
1. Detection and Alert
Automated Alerts:
- Prometheus/Grafana alerts
- Health check failures
- Error rate spikes
- Performance degradation
Manual Detection:
- Customer reports
- Support tickets
- Monitoring dashboard review
2. Initial Response
First 5 Minutes:
- Acknowledge the alert
- Assess severity level
- Notify on-call team
- Create incident ticket
First 15 Minutes:
- Gather initial information
- Identify affected components
- Determine scope of impact
- Communicate status to stakeholders
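
The "gather initial information" step could be scripted so that the same evidence is captured at the start of every incident. The sketch below uses only `kubectl` subcommands and the `oddsmaker` namespace that appear in this runbook; the function name `collect_snapshot`, the `KUBECTL` override, and the output file layout are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Capture a point-in-time snapshot of the namespace for the incident ticket.
# Assumptions: function name, KUBECTL override, and file names are illustrative.
collect_snapshot() {
  local outdir="$1"
  local kubectl="${KUBECTL:-kubectl}"
  local ns="oddsmaker"

  if [ -z "$outdir" ]; then
    echo "usage: collect_snapshot <output-dir>" >&2
    return 2
  fi
  mkdir -p "$outdir" || return 1

  # Record when the snapshot was taken (UTC) for the incident timeline.
  date -u +"%Y-%m-%dT%H:%M:%SZ" > "$outdir/captured_at.txt"

  # Command errors are written into the files rather than aborting the
  # snapshot, so a partial capture is still kept.
  "$kubectl" get pods -n "$ns" -o wide > "$outdir/pods.txt" 2>&1
  "$kubectl" get events -n "$ns" --sort-by='.lastTimestamp' > "$outdir/events.txt" 2>&1
  "$kubectl" top pods -n "$ns" > "$outdir/top-pods.txt" 2>&1

  echo "snapshot written to $outdir"
}
```

A call such as `collect_snapshot "incident-$(date -u +%Y%m%d-%H%M)"` produces a directory that can be attached to the incident ticket.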
3. Investigation
Information Gathering:
```bash
# Check service status
kubectl get pods -n oddsmaker
kubectl describe pod <pod-name> -n oddsmaker
# View logs
kubectl logs <pod-name> -n oddsmaker --tail=100
kubectl logs <pod-name> -n oddsmaker -f
# Check events
kubectl get events -n oddsmaker --sort-by='.lastTimestamp'
# Check resource usage
kubectl top pods -n oddsmaker
kubectl top nodes
```
Database Investigation:
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;
-- Check long-running queries (idle sessions are excluded, since their
-- query_start refers to a statement that has already finished)
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
-- Check locks that are waiting to be granted
SELECT * FROM pg_locks WHERE NOT granted;
-- Check database size
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database ORDER BY pg_database_size(datname) DESC;
```
4. Mitigation
Common Mitigation Actions:
Service Restart:
```bash
kubectl rollout restart deployment/oddsmaker-control -n oddsmaker
```
Scale Up:
```bash
kubectl scale deployment/oddsmaker-control --replicas=5 -n oddsmaker
```
Rollback:
```bash
kubectl rollout undo deployment/oddsmaker-control -n oddsmaker
```
Database Failover:
```bash
# Promote standby to primary
pg_ctl promote -D /var/lib/postgresql/data
```
5. Resolution
Resolution Steps:
- Implement fix (code change, configuration, infrastructure)
- Test fix in staging environment
- Deploy to production
- Verify resolution
- Monitor for recurrence
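
The "verify resolution" step can be partly automated by waiting for the deployment rollout to complete. The sketch below assumes the fix is shipped through the `oddsmaker-control` deployment used elsewhere in this runbook; the function name, the `KUBECTL` override, and the default timeout of `120s` are illustrative choices.

```bash
#!/usr/bin/env bash
# Wait for the deployment rollout to finish and report the outcome.
# Assumptions: function name, KUBECTL override, and 120s default are illustrative.
verify_rollout() {
  local kubectl="${KUBECTL:-kubectl}"
  local timeout="${1:-120s}"

  if "$kubectl" rollout status deployment/oddsmaker-control \
       -n oddsmaker --timeout="$timeout"; then
    echo "rollout healthy"
  else
    echo "rollout not healthy within $timeout" >&2
    return 1
  fi
}
```

A successful rollout only shows that the new pods became ready. Error rates and latency still need to be watched on the dashboards before the incident is closed.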
6. Post-Incident
Post-Incident Review:
- Document timeline
- Identify root cause
- Determine contributing factors
- Create action items
- Update runbooks
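
When documenting the timeline, the incident duration can be computed from the detection and resolution timestamps instead of by hand. This sketch assumes timestamps are recorded in UTC as `YYYY-MM-DD HH:MM:SS` and that `date` supports the `-d` option (GNU coreutils or BusyBox); the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Print the elapsed time between two UTC timestamps as "<hours>h <minutes>m".
# Assumptions: "YYYY-MM-DD HH:MM:SS" input and a date(1) that supports -d.
incident_duration() {
  local start end secs
  start="$(date -u -d "$1" +%s)" || return 1
  end="$(date -u -d "$2" +%s)" || return 1
  secs=$(( end - start ))
  if [ "$secs" -lt 0 ]; then
    echo "end time is before start time" >&2
    return 1
  fi
  printf '%dh %02dm\n' $(( secs / 3600 )) $(( (secs % 3600) / 60 ))
}
```

For example, `incident_duration "2024-03-01 10:05:00" "2024-03-01 11:47:00"` prints `1h 42m`, which can be used for the Duration field of the incident report.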
Post-Incident Template:
```markdown
# Incident Report: [Title]
## Summary
- **Date**: [Date]
- **Duration**: [Duration]
- **Severity**: [P1/P2/P3/P4]
- **Impact**: [Description of impact]
## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Fix implemented
- [Time] - Incident resolved
## Root Cause
[Description of root cause]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## Resolution
[Description of resolution]
## Action Items
- [ ] [Action 1]
- [ ] [Action 2]
## Lessons Learned
- [Lesson 1]
- [Lesson 2]
```
Common Incidents
Incident: High Error Rate
Symptoms:
- Error rate > 5%
- 5xx responses increasing
- Customer complaints
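
The 5% threshold can be checked mechanically once request and error counts are available. The runbook does not say where those counts come from, so the sketch below takes them as plain arguments; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# error_rate_pct TOTAL ERRORS
# Print the error rate as a percentage with two decimals.
error_rate_pct() {
  LC_ALL=C awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total <= 0) { print "0.00"; exit }
    printf "%.2f\n", (errors / total) * 100
  }'
}

# error_rate_exceeded TOTAL ERRORS [THRESHOLD_PCT]
# Exit 0 when the rate is strictly above the threshold (default 5, as in
# the symptom list), non-zero otherwise.
error_rate_exceeded() {
  local rate
  rate="$(error_rate_pct "$1" "$2")"
  LC_ALL=C awk -v rate="$rate" -v limit="${3:-5}" 'BEGIN { exit !(rate > limit) }'
}
```

With 1000 requests and 60 errors, `error_rate_pct 1000 60` prints `6.00` and `error_rate_exceeded 1000 60` succeeds. With 50 errors the rate is exactly 5.00 and the check does not trigger, matching the "> 5%" wording.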
Investigation:
```bash
# Check error logs
kubectl logs -l app=oddsmaker-control -n oddsmaker --tail=1000 | grep -i error
# Check recent deployments
kubectl rollout history deployment/oddsmaker-control -n oddsmaker
# Check resource usage
kubectl top pods -n oddsmaker
```
Resolution:
- If a recent deployment is the cause: roll back
- If resources are exhausted: scale up
- If it is a code bug: deploy a hotfix
Incident: Database Connection Issues
Symptoms:
- Connection timeout errors
- Connection pool exhaustion
- Slow queries
Investigation:
```sql
-- Check connection count
SELECT count(*) FROM pg_stat_activity;
-- Check connection by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
-- Check waiting queries (active sessions only; idle sessions and
-- background processes also report a wait event)
SELECT * FROM pg_stat_activity
WHERE state = 'active' AND wait_event_type IS NOT NULL;
```
Resolution:
- Kill long-running queries
- Increase connection pool size
- Optimize queries
- Scale database
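
For the "kill long-running queries" step, one cautious approach is to generate the SQL first, review it, and only then run it. The sketch below uses PostgreSQL's `pg_cancel_backend`, which cancels the running statement but keeps the connection; `pg_terminate_backend` is the stronger alternative that closes the session. The function name and the 5-minute default are assumptions.

```bash
#!/usr/bin/env bash
# Print SQL that cancels active queries running longer than N minutes.
# Assumptions: function name and default threshold are illustrative.
long_query_cancel_sql() {
  local minutes="${1:-5}"
  # Accept digits only, so the value is safe to place inside the SQL text.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT pid, pg_cancel_backend(pid) AS cancelled, query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '${minutes} minutes';
SQL
}

# Review the generated statement first, then pipe it to psql, for example:
#   long_query_cancel_sql 10 | psql
```

Excluding `pg_backend_pid()` prevents the statement from cancelling itself.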
Incident: High Memory Usage
Symptoms:
- OOM kills
- Pod restarts
- High garbage collection
Investigation:
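```bash
# Check memory usage
kubectl top pods -n oddsmaker
# Check JVM heap (run inside the pod, or through kubectl port-forward)
curl http://localhost:8086/actuator/metrics/jvm.memory.used
# Check for memory leaks (run inside the container; <pid> is the JVM process ID)
jmap -histo <pid> | head -20
```
Resolution: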
- Increase memory limits
- Optimize JVM settings
- Fix memory leak
- Scale horizontally
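
The "increase memory limits" action can be applied with `kubectl set resources`. This is a sketch only: the `2Gi` default is an assumed value, not a recommendation from this runbook, and the function name and `KUBECTL` override are hypothetical.

```bash
#!/usr/bin/env bash
# Raise the memory limit on the oddsmaker-control deployment.
# Assumptions: function name, KUBECTL override, and the 2Gi default.
raise_memory_limit() {
  local kubectl="${KUBECTL:-kubectl}"
  local limit="${1:-2Gi}"
  "$kubectl" set resources deployment oddsmaker-control \
    -n oddsmaker --limits="memory=$limit"
}
```

Changing the pod template this way triggers a rolling restart of the deployment, so expect pods to be replaced. If the limit is also defined in version-controlled manifests, update it there as well so the next deployment does not revert the change.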
Incident: Security Breach
Symptoms:
- Unauthorized access attempts
- Suspicious audit logs
- Data exfiltration alerts
Investigation:
```bash
# Check audit logs (--tail=-1 is needed because kubectl returns only the
# last 10 lines per pod when a label selector is used)
kubectl logs -l app=oddsmaker-control -n oddsmaker --tail=-1 | grep -iE "security|unauthorized|breach"
# Check failed login attempts
psql -c "SELECT * FROM audit_logs WHERE action = 'LOGIN' AND result = 'FAILURE' ORDER BY created_at DESC LIMIT 100;"
# Check suspicious IPs
psql -c "SELECT client_ip, count(*) AS failures FROM audit_logs WHERE result = 'FAILURE' GROUP BY client_ip ORDER BY failures DESC LIMIT 20;"
```
Resolution:
- Block suspicious IPs
- Revoke compromised credentials
- Enable additional security measures
- Notify security team
- Preserve evidence
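
For the "preserve evidence" step, it helps to copy logs out of the cluster before pods are restarted and to record a checksum of what was captured. The sketch below assumes `sha256sum` is available (GNU coreutils); the function name, the `KUBECTL` override, and the file names are hypothetical.

```bash
#!/usr/bin/env bash
# Save service logs with timestamps and record a checksum of the capture.
# Assumptions: function name, KUBECTL override, file names, sha256sum present.
preserve_evidence() {
  local outdir="$1"
  local kubectl="${KUBECTL:-kubectl}"

  if [ -z "$outdir" ]; then
    echo "usage: preserve_evidence <output-dir>" >&2
    return 2
  fi
  mkdir -p "$outdir" || return 1

  # --tail=-1 requests the full log; with a label selector kubectl would
  # otherwise return only the last 10 lines per pod.
  "$kubectl" logs -l app=oddsmaker-control -n oddsmaker \
    --tail=-1 --timestamps > "$outdir/control.log" 2>&1

  # The checksum lets a later reviewer detect modification or truncation.
  ( cd "$outdir" && sha256sum control.log > SHA256SUMS )
}
```

The capture can be re-verified later with `sha256sum -c SHA256SUMS` from inside the output directory. Follow the security team's own evidence-handling procedure where one exists.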
Escalation Matrix
| Severity | Initial Responder | Escalation (30 min) | Escalation (1 hour) |
|---|---|---|---|
| P1 | On-call Engineer | Engineering Manager | CTO |
| P2 | On-call Engineer | Engineering Manager | CTO |
| P3 | On-call Engineer | Team Lead | Engineering Manager |
| P4 | On-call Engineer | Team Lead | - |
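
The matrix above can be expressed as a lookup keyed on severity and minutes elapsed since the incident started. This sketch reads the column headings as "escalate if still unresolved after 30 minutes / 1 hour", and it treats the "-" for P4 as "remains with the Team Lead"; both readings are assumptions, as is the function name.

```bash
#!/usr/bin/env bash
# escalation_contact SEVERITY ELAPSED_MINUTES
# Print the role that should own the incident at this point.
# Assumption: P4 stays with the Team Lead after 1 hour (table shows "-").
escalation_contact() {
  local severity="$1" elapsed="${2:-0}" tier
  case "$elapsed" in
    ''|*[!0-9]*) echo "elapsed minutes must be a non-negative integer" >&2; return 1 ;;
  esac

  if   [ "$elapsed" -ge 60 ]; then tier=3
  elif [ "$elapsed" -ge 30 ]; then tier=2
  else                             tier=1
  fi

  case "$severity:$tier" in
    P[1-4]:1)   echo "On-call Engineer" ;;
    P1:2|P2:2)  echo "Engineering Manager" ;;
    P1:3|P2:3)  echo "CTO" ;;
    P3:2|P4:2)  echo "Team Lead" ;;
    P3:3)       echo "Engineering Manager" ;;
    P4:3)       echo "Team Lead" ;;
    *)          echo "unknown severity: $severity" >&2; return 1 ;;
  esac
}
```

For example, `escalation_contact P1 45` prints `Engineering Manager` and `escalation_contact P2 60` prints `CTO`.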
Communication Templates
Internal Communication
Initial Notification:
```
🚨 Incident Alert - [Severity]
Service: Oddsmaker Control Service
Impact: [Description]
Status: Investigating
ETA: [Time]
Updates will follow every [15/30/60] minutes.
```
Status Update:
```
📊 Incident Update - [Severity]
Service: Oddsmaker Control Service
Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Current impact]
Progress: [What's been done]
Next Steps: [What's planned]
Next update in [15/30/60] minutes.
```
Resolution Notification:
```
✅ Incident Resolved - [Severity]
Service: Oddsmaker Control Service
Duration: [Duration]
Root Cause: [Brief description]
Resolution: [What was done]
Post-incident review scheduled for [Date/Time].
```
External Communication
Customer Notification:
```
We are currently experiencing issues with [service/feature].
Our team is actively working on resolving this.
We will provide updates every [30/60] minutes.
We apologize for any inconvenience.
```
Tools and Resources
Monitoring
- Grafana: https://grafana.oddsmaker.local
- Prometheus: https://prometheus.oddsmaker.local
- Kibana: https://kibana.oddsmaker.local
Communication
- Slack: #oddsmaker-incidents
- PagerDuty: https://oddsmaker.pagerduty.com
- Status Page: https://status.oddsmaker.local
Documentation
- Architecture: docs/reference/architecture.md
- API Reference: docs/reference/api-reference.md
- Runbooks: docs/operations/
Contacts
| Role | Name | Email | Phone |
|---|---|---|---|
| On-call Engineer | [Name] | [Email] | [Phone] |
| Engineering Manager | [Name] | [Email] | [Phone] |
| DBA | [Name] | [Email] | [Phone] |
| Security Lead | [Name] | [Email] | [Phone] |
| CTO | [Name] | [Email] | [Phone] |