Alerts and Notifications
Effective monitoring and alerting are essential for maintaining optimal cluster performance and cost efficiency. This guide explains how to configure Stackbooster.io's comprehensive alerting system to stay informed about important events and changes in your Kubernetes environment.
Alert Types
Stackbooster.io provides alerts across several categories:
Cost Alerts
- Budget Thresholds: Notifications when spending reaches defined percentages of your budget
- Anomalous Spending: Alerts for unusual cost increases or unexpected resource usage
- Savings Opportunities: Notifications about newly identified cost optimization opportunities
- Reserved Instance Coverage: Alerts when RI coverage drops below target levels
Performance Alerts
- Resource Saturation: Notifications when nodes or clusters approach resource limits
- Scaling Failures: Alerts when auto-scaling operations fail or are delayed
- Pod Scheduling Issues: Notifications about pods failing to schedule due to resource constraints
- Node Performance Problems: Alerts for node-level performance issues
Operational Alerts
- Agent Health: Notifications if the Stackbooster.io agent becomes disconnected
- Configuration Changes: Alerts about changes to cluster or optimization settings
- Optimization Actions: Notifications about significant scaling or optimization actions
- Security Events: Alerts related to permissions or access issues
Configuring Alerts
Alert Policies
Create and manage alert policies through the Alerts dashboard:
- Navigate to "Operations" > "Alerts"
- Click "Create Alert Policy"
- Select an alert category and specific condition
- Configure threshold values and evaluation periods
- Choose notification channels
- Set severity level and automated response options (if applicable)
- Save the policy
Default Alert Policies
Stackbooster.io includes several default alert policies that are enabled automatically:
| Alert | Condition | Default Threshold | Severity |
|---|---|---|---|
| Critical Node Utilization | CPU or memory utilization exceeding threshold | 90% for 15 minutes | High |
| Failed Scaling Operation | Auto-scaling operation fails | Any failure | Medium |
| Agent Connection Lost | Agent stops reporting metrics | 15 minutes | High |
| Budget Threshold | Monthly spending reaches percentage of budget | 80%, 90%, 100% | Medium, High, Critical |
| Optimization Blocked | Optimization action blocked by constraint | 24 hours | Low |
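These defaults can also be recreated or tuned as custom policies through the API. The sketch below shows roughly how the Critical Node Utilization default would look as a policy payload, following the pattern of the example under Programmatic Alert Management further down; the node-level metric name is an assumption, so confirm it against the metrics catalog before using it.
# Illustrative sketch: recreate the Critical Node Utilization default as a custom policy
# (same endpoint and payload shape as the Programmatic Alert Management example below;
# the node-level metric name is an assumption, not a documented identifier)
curl -X POST https://api.stackbooster.io/v1/alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Critical Node Utilization (custom)",
"description": "Node CPU or memory above 90% for 15 minutes",
"metric": "kubernetes.node.cpu.utilization_rate",
"condition": "above",
"threshold": 90,
"duration": "15m",
"severity": "high",
"notification_channels": ["slack-prod-alerts"]
}'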
Alert Severity Levels
Alert policies can be assigned different severity levels:
- Critical: Requires immediate attention, potential service impact
- High: Important issue needing prompt response
- Medium: Issue requiring attention but not immediately urgent
- Low: Informational, may require eventual action
Notification Channels
Configure how and where you receive alert notifications:
Email Notifications
Configure email notifications for individual users or groups:
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add Email Channel"
- Enter email addresses for recipients
- Select which severity levels trigger email notifications
- Choose whether to send daily or weekly summaries
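If you manage many teams, it can be convenient to script channel creation instead of using the UI. The call below is a hypothetical sketch: the /v1/notification-channels endpoint and its field names are assumptions modeled on the alert policy API shown later in this guide, so verify the exact schema against the API reference.
# Hypothetical sketch: create an email notification channel via the API
# (endpoint and field names are assumptions; verify against the API reference)
curl -X POST https://api.stackbooster.io/v1/notification-channels \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"type": "email",
"name": "platform-team-email",
"recipients": ["oncall@example.com", "platform-team@example.com"],
"severities": ["critical", "high"],
"summary_frequency": "daily"
}'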
Integrations
Connect alerts to external systems:
Slack
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add Slack Channel"
- Follow the OAuth flow to authorize Stackbooster.io
- Select the Slack channel to receive notifications
- Configure notification formatting and severity filtering
PagerDuty
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add PagerDuty Service"
- Enter your PagerDuty integration key
- Configure which severity levels create incidents
- Set up automatic incident resolution
Webhook
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add Webhook"
- Enter the webhook URL
- Configure payload format (JSON or form data)
- Set custom headers if needed
- Test the webhook connection
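When building the receiving endpoint, it helps to develop against a representative JSON delivery before wiring up the real integration. The payload below is illustrative only; the actual field names may differ, so use the webhook test step above to capture a genuine sample.
# Illustrative only: simulate a JSON alert delivery to your own receiver for local testing
# (field names in this payload are assumptions, not the documented delivery schema)
curl -X POST https://hooks.example.com/stackbooster-alerts \
-H "Content-Type: application/json" \
-H "X-Custom-Header: my-value" \
-d '{
"alert": "Critical Node Utilization",
"severity": "high",
"status": "active",
"cluster": "prod-us-east-1",
"threshold": 90,
"observed_value": 94,
"triggered_at": "2024-05-01T12:34:56Z"
}'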
Mobile Notifications
Get alerts on your mobile device:
- Download the Stackbooster.io mobile app from the App Store or Google Play
- Log in with your account credentials
- Enable push notifications when prompted
- Configure notification preferences in the app settings
Alert Management
Alert Dashboard
The alert dashboard provides a centralized view of all active and historical alerts:
Navigate to "Operations" > "Alerts" > "Dashboard"
View alerts filtered by:
- Status (Active, Resolved, Acknowledged)
- Severity
- Cluster
- Time period
- Alert type
For each alert, you can:
- View detailed information and context
- Acknowledge receipt
- Add comments for team communication
- Resolve manually if the issue is fixed
- Snooze for a specified duration
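The same actions are often useful from scripts or chatops workflows. The example below sketches how acknowledging an alert with a comment might look via the API; the /v1/alerts/ALERT_ID/acknowledge route is an assumption modeled on the policy API later in this guide, so check the API reference for the actual path.
# Hypothetical sketch: acknowledge an active alert and leave a comment for the team
# (route and fields are assumptions; confirm against the API reference)
curl -X POST https://api.stackbooster.io/v1/alerts/ALERT_ID/acknowledge \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"comment": "Investigating; appears related to the 14:00 deploy"
}'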
Alert History
Review historical alert patterns:
- Navigate to "Operations" > "Alerts" > "History"
- Analyze alert frequency and patterns
- Identify recurring issues
- Review resolution times and effectiveness
- Export alert history for compliance or analysis
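If you export history regularly for compliance, the export can also be scripted. The request below is a hypothetical sketch; the path and query parameters are assumptions, so confirm them against the API reference.
# Hypothetical sketch: export one month of alert history as CSV
# (endpoint and query parameters are assumptions; verify against the API reference)
curl -G https://api.stackbooster.io/v1/alerts/history \
-H "Authorization: Bearer YOUR_API_KEY" \
--data-urlencode "from=2024-04-01" \
--data-urlencode "to=2024-04-30" \
--data-urlencode "format=csv" \
-o alert-history-april.csv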
Best Practices
Setting Appropriate Thresholds
- Start Conservatively: Begin with relatively high thresholds and adjust them as you learn each workload's normal behavior
- Consider Workload Patterns: Set different thresholds for different clusters based on their usage patterns
- Avoid Alert Fatigue: Don't create too many alerts or set thresholds too low
- Review Regularly: Analyze which alerts are actionable and adjust accordingly
Creating Escalation Paths
For critical production environments:
Define tiered response procedures:
- Initial notification to team channel
- Escalation to on-call engineer after X minutes
- Manager notification after Y minutes
- Executive notification for extended issues
Configure PagerDuty or a similar service with:
- Appropriate escalation policies
- Follow-the-sun coverage for global teams
- Backup responders
Grouping and Correlation
Reduce noise by grouping related alerts:
- Enable alert correlation in "Alert Settings"
- Group alerts by:
  - Affected cluster or namespace
  - Root cause when detectable
  - Time proximity
  - Related resources
Documentation and Runbooks
For each critical alert type:
- Navigate to "Operations" > "Alerts" > "Runbooks"
- Create or edit runbooks with:
  - Clear description of the alert condition
  - Potential causes and impact
  - Immediate mitigation steps
  - Long-term resolution actions
  - Links to relevant documentation
Advanced Topics
Custom Alert Conditions
Create sophisticated alert conditions with the custom condition builder:
- Navigate to "Operations" > "Alerts" > "Custom Conditions"
- Use the condition builder to:
  - Combine multiple metrics with AND/OR logic
  - Create rate-of-change conditions
  - Compare current values to historical baselines
  - Reference external metrics or data sources
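As a rough illustration of what a composite condition can express, the payload below extends the policy schema from the Programmatic Alert Management example with an assumed conditions block that combines two metrics with AND logic. The condition builder in the UI is the supported way to author these, and the exact schema it produces may differ, so treat this as a sketch only.
# Illustrative sketch: a composite condition combining node CPU and memory with AND logic
# (the "conditions" structure and metric names are assumptions about the underlying schema)
curl -X POST https://api.stackbooster.io/v1/alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Node CPU and Memory Pressure",
"severity": "high",
"duration": "10m",
"conditions": {
"operator": "AND",
"rules": [
{"metric": "kubernetes.node.cpu.utilization_rate", "condition": "above", "threshold": 85},
{"metric": "kubernetes.node.memory.utilization_rate", "condition": "above", "threshold": 85}
]
},
"notification_channels": ["slack-prod-alerts"]
}'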
Programmatic Alert Management
Use our API to manage alerts programmatically:
# Example: Create a new alert policy via API
curl -X POST https://api.stackbooster.io/v1/alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "High Pod CPU Usage",
"description": "Alert when pod CPU usage exceeds threshold",
"metric": "kubernetes.pod.cpu.utilization_rate",
"condition": "above",
"threshold": 85,
"duration": "10m",
"severity": "medium",
"notification_channels": ["slack-prod-alerts"]
}'
Predictive Alerting
Enable ML-powered predictive alerts to detect issues before they impact services:
- Navigate to "Operations" > "Alerts" > "Predictive Alerts"
- Enable predictions for:
  - Resource exhaustion trends
  - Anomalous behavior detection
  - Performance degradation patterns
  - Cost spike predictions
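If you manage many clusters, you may prefer to toggle these predictions from automation rather than the UI. The call below is purely hypothetical: the endpoint and settings keys are assumptions, so check the API reference before relying on it.
# Purely hypothetical sketch: enable predictive alert categories for a single cluster
# (endpoint and settings keys are assumptions; confirm against the API reference)
curl -X PUT https://api.stackbooster.io/v1/clusters/CLUSTER_ID/predictive-alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"resource_exhaustion_trends": true,
"anomaly_detection": true,
"performance_degradation": true,
"cost_spike_prediction": false
}'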
Automated Remediation
For certain alert conditions, you can configure automated remediation actions:
- Navigate to "Operations" > "Alerts" > "Auto-remediation"
- Create remediation rules for conditions like:
  - Automatic replacement of failed nodes
  - Pod rescheduling for node issues
  - Scaling actions for resource constraints
  - Budget enforcement actions
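The sketch below shows how a remediation rule might look if managed through the API rather than the UI. The /v1/remediation-rules endpoint and its fields are assumptions modeled on the alert policy API above, so verify the actual schema in the API reference before automating anything.
# Hypothetical sketch: replace a node automatically when it has been failing for 10 minutes
# (endpoint and field names are assumptions; confirm against the API reference)
curl -X POST https://api.stackbooster.io/v1/remediation-rules \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Replace failed nodes",
"trigger": {"alert": "Node Performance Problems", "duration": "10m"},
"action": "replace_node",
"require_approval": false,
"notification_channels": ["slack-prod-alerts"]
}'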
By effectively configuring and managing alerts, you ensure that your team remains informed about important events while minimizing unnecessary notifications. This balanced approach helps maintain optimal cluster performance and cost efficiency.
