Alerts and Notifications
Effective monitoring and alerting are essential for maintaining optimal cluster performance and cost efficiency. This guide explains how to configure Stackbooster.io's comprehensive alerting system to stay informed about important events and changes in your Kubernetes environment.
Alert Types
Stackbooster.io provides alerts across several categories:
Cost Alerts
- Budget Thresholds: Notifications when spending reaches defined percentages of your budget
- Anomalous Spending: Alerts for unusual cost increases or unexpected resource usage
- Savings Opportunities: Notifications about newly identified cost optimization opportunities
- Reserved Instance Coverage: Alerts when RI coverage drops below target levels
Performance Alerts
- Resource Saturation: Notifications when nodes or clusters approach resource limits
- Scaling Failures: Alerts when auto-scaling operations fail or are delayed
- Pod Scheduling Issues: Notifications about pods failing to schedule due to resource constraints
- Node Performance Problems: Alerts for node-level performance issues
Operational Alerts
- Agent Health: Notifications if the Stackbooster.io agent becomes disconnected
- Configuration Changes: Alerts about changes to cluster or optimization settings
- Optimization Actions: Notifications about significant scaling or optimization actions
- Security Events: Alerts related to permissions or access issues
Configuring Alerts
Alert Policies
Create and manage alert policies through the Alerts dashboard:
- Navigate to "Operations" > "Alerts"
- Click "Create Alert Policy"
- Select an alert category and specific condition
- Configure threshold values and evaluation periods
- Choose notification channels
- Set severity level and automated response options (if applicable)
- Save the policy
Default Alert Policies
Stackbooster.io includes several default alert policies that are enabled automatically:
| Alert | Condition | Default Threshold | Severity |
|---|---|---|---|
| Critical Node Utilization | CPU or memory utilization exceeding threshold | 90% for 15 minutes | High |
| Failed Scaling Operation | Auto-scaling operation fails | Any failure | Medium |
| Agent Connection Lost | Agent stops reporting metrics | 15 minutes | High |
| Budget Threshold | Monthly spending reaches percentage of budget | 80%, 90%, 100% | Medium, High, Critical |
| Optimization Blocked | Optimization action blocked by constraint | 24 hours | Low |
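These defaults can also be recreated or tuned as custom policies through the API. The sketch below shows roughly how the Critical Node Utilization default would look as a policy payload, following the pattern of the example under Programmatic Alert Management further down; the node-level metric name is an assumption, so confirm it against the metrics catalog before using it.
# Illustrative sketch: recreate the Critical Node Utilization default as a custom policy
# (same endpoint and payload shape as the Programmatic Alert Management example below;
# the node-level metric name is an assumption, not a documented identifier)
curl -X POST https://api.stackbooster.io/v1/alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Critical Node Utilization (custom)",
"description": "Node CPU or memory above 90% for 15 minutes",
"metric": "kubernetes.node.cpu.utilization_rate",
"condition": "above",
"threshold": 90,
"duration": "15m",
"severity": "high",
"notification_channels": ["slack-prod-alerts"]
}'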
Alert Severity Levels
Alert policies can be assigned different severity levels:
- Critical: Requires immediate attention, potential service impact
- High: Important issue needing prompt response
- Medium: Issue requiring attention but not immediately urgent
- Low: Informational, may require eventual action
Notification Channels
Configure how and where you receive alert notifications:
Email Notifications
Configure email notifications for individual users or groups:
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add Email Channel"
- Enter email addresses for recipients
- Select which severity levels trigger email notifications
- Choose whether to send daily or weekly summaries
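If you manage many teams, it can be convenient to script channel creation instead of using the UI. The call below is a hypothetical sketch: the /v1/notification-channels endpoint and its field names are assumptions modeled on the alert policy API shown later in this guide, so verify the exact schema against the API reference.
# Hypothetical sketch: create an email notification channel via the API
# (endpoint and field names are assumptions; verify against the API reference)
curl -X POST https://api.stackbooster.io/v1/notification-channels \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"type": "email",
"name": "platform-team-email",
"recipients": ["oncall@example.com", "platform-team@example.com"],
"severities": ["critical", "high"],
"summary_frequency": "daily"
}'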
Integrations
Connect alerts to external systems:
Slack
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add Slack Channel"
- Follow the OAuth flow to authorize Stackbooster.io
- Select the Slack channel to receive notifications
- Configure notification formatting and severity filtering
PagerDuty
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add PagerDuty Service"
- Enter your PagerDuty integration key
- Configure which severity levels create incidents
- Set up automatic incident resolution
Webhook
- Navigate to "Operations" > "Alerts" > "Notification Channels"
- Click "Add Webhook"
- Enter the webhook URL
- Configure payload format (JSON or form data)
- Set custom headers if needed
- Test the webhook connection
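When building the receiving endpoint, it helps to develop against a representative JSON delivery before wiring up the real integration. The payload below is illustrative only; the actual field names may differ, so use the webhook test step above to capture a genuine sample.
# Illustrative only: simulate a JSON alert delivery to your own receiver for local testing
# (field names in this payload are assumptions, not the documented delivery schema)
curl -X POST https://hooks.example.com/stackbooster-alerts \
-H "Content-Type: application/json" \
-H "X-Custom-Header: my-value" \
-d '{
"alert": "Critical Node Utilization",
"severity": "high",
"status": "active",
"cluster": "prod-us-east-1",
"threshold": 90,
"observed_value": 94,
"triggered_at": "2024-05-01T12:34:56Z"
}'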
Mobile Notifications
Get alerts on your mobile device:
- Download the Stackbooster.io mobile app from the App Store or Google Play
- Log in with your account credentials
- Enable push notifications when prompted
- Configure notification preferences in the app settings
Alert Management
Alert Dashboard
The alert dashboard provides a centralized view of all active and historical alerts:
Navigate to "Operations" > "Alerts" > "Dashboard"
View alerts filtered by:
- Status (Active, Resolved, Acknowledged)
- Severity
- Cluster
- Time period
- Alert type
For each alert, you can:
- View detailed information and context
- Acknowledge receipt
- Add comments for team communication
- Resolve manually if the issue is fixed
- Snooze for a specified duration
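The same actions are often useful from scripts or chatops workflows. The example below sketches how acknowledging an alert with a comment might look via the API; the /v1/alerts/ALERT_ID/acknowledge route is an assumption modeled on the policy API later in this guide, so check the API reference for the actual path.
# Hypothetical sketch: acknowledge an active alert and leave a comment for the team
# (route and fields are assumptions; confirm against the API reference)
curl -X POST https://api.stackbooster.io/v1/alerts/ALERT_ID/acknowledge \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"comment": "Investigating; appears related to the 14:00 deploy"
}'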
Alert History
Review historical alert patterns:
- Navigate to "Operations" > "Alerts" > "History"
- Analyze alert frequency and patterns
- Identify recurring issues
- Review resolution times and effectiveness
- Export alert history for compliance or analysis
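If you export history regularly for compliance, the export can also be scripted. The request below is a hypothetical sketch; the path and query parameters are assumptions, so confirm them against the API reference.
# Hypothetical sketch: export one month of alert history as CSV
# (endpoint and query parameters are assumptions; verify against the API reference)
curl -G https://api.stackbooster.io/v1/alerts/history \
-H "Authorization: Bearer YOUR_API_KEY" \
--data-urlencode "from=2024-04-01" \
--data-urlencode "to=2024-04-30" \
--data-urlencode "format=csv" \
-o alert-history-april.csv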
Best Practices
Setting Appropriate Thresholds
- Start Conservatively: Begin with relatively high thresholds and adjust them as you learn each workload's normal behavior
- Consider Workload Patterns: Set different thresholds for different clusters based on their usage patterns
- Avoid Alert Fatigue: Don't create too many alerts or set thresholds too low
- Review Regularly: Analyze which alerts are actionable and adjust accordingly
Creating Escalation Paths
For critical production environments:
Define tiered response procedures:
- Initial notification to team channel
- Escalation to on-call engineer after X minutes
- Manager notification after Y minutes
- Executive notification for extended issues
Configure PagerDuty or a similar service with:
- Appropriate escalation policies
- Follow-the-sun coverage for global teams
- Backup responders
Grouping and Correlation
Reduce noise by grouping related alerts:
- Enable alert correlation in "Alert Settings"
- Group alerts by:
  - Affected cluster or namespace
  - Root cause when detectable
  - Time proximity
  - Related resources
Documentation and Runbooks
For each critical alert type:
- Navigate to "Operations" > "Alerts" > "Runbooks"
- Create or edit runbooks with:
  - Clear description of the alert condition
  - Potential causes and impact
  - Immediate mitigation steps
  - Long-term resolution actions
  - Links to relevant documentation
Advanced Topics
Custom Alert Conditions
Create sophisticated alert conditions with the custom condition builder:
- Navigate to "Operations" > "Alerts" > "Custom Conditions"
- Use the condition builder to:
  - Combine multiple metrics with AND/OR logic
  - Create rate-of-change conditions
  - Compare current values to historical baselines
  - Reference external metrics or data sources
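As a rough illustration of what a composite condition can express, the payload below extends the policy schema from the Programmatic Alert Management example with an assumed conditions block that combines two metrics with AND logic. The condition builder in the UI is the supported way to author these, and the exact schema it produces may differ, so treat this as a sketch only.
# Illustrative sketch: a composite condition combining node CPU and memory with AND logic
# (the "conditions" structure and metric names are assumptions about the underlying schema)
curl -X POST https://api.stackbooster.io/v1/alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Node CPU and Memory Pressure",
"severity": "high",
"duration": "10m",
"conditions": {
"operator": "AND",
"rules": [
{"metric": "kubernetes.node.cpu.utilization_rate", "condition": "above", "threshold": 85},
{"metric": "kubernetes.node.memory.utilization_rate", "condition": "above", "threshold": 85}
]
},
"notification_channels": ["slack-prod-alerts"]
}'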
Programmatic Alert Management
Use our API to manage alerts programmatically:
# Example: Create a new alert policy via API
curl -X POST https://api.stackbooster.io/v1/alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "High Pod CPU Usage",
"description": "Alert when pod CPU usage exceeds threshold",
"metric": "kubernetes.pod.cpu.utilization_rate",
"condition": "above",
"threshold": 85,
"duration": "10m",
"severity": "medium",
"notification_channels": ["slack-prod-alerts"]
}'
Predictive Alerting
Enable ML-powered predictive alerts to detect issues before they impact services:
- Navigate to "Operations" > "Alerts" > "Predictive Alerts"
- Enable predictions for:
  - Resource exhaustion trends
  - Anomalous behavior detection
  - Performance degradation patterns
  - Cost spike predictions
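If you manage many clusters, you may prefer to toggle these predictions from automation rather than the UI. The call below is purely hypothetical: the endpoint and settings keys are assumptions, so check the API reference before relying on it.
# Purely hypothetical sketch: enable predictive alert categories for a single cluster
# (endpoint and settings keys are assumptions; confirm against the API reference)
curl -X PUT https://api.stackbooster.io/v1/clusters/CLUSTER_ID/predictive-alerts \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"resource_exhaustion_trends": true,
"anomaly_detection": true,
"performance_degradation": true,
"cost_spike_prediction": false
}'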
Automated Remediation
For certain alert conditions, you can configure automated remediation actions:
- Navigate to "Operations" > "Alerts" > "Auto-remediation"
- Create remediation rules for conditions like:
  - Automatic replacement of failed nodes
  - Pod rescheduling for node issues
  - Scaling actions for resource constraints
  - Budget enforcement actions
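The sketch below shows how a remediation rule might look if managed through the API rather than the UI. The /v1/remediation-rules endpoint and its fields are assumptions modeled on the alert policy API above, so verify the actual schema in the API reference before automating anything.
# Hypothetical sketch: replace a node automatically when it has been failing for 10 minutes
# (endpoint and field names are assumptions; confirm against the API reference)
curl -X POST https://api.stackbooster.io/v1/remediation-rules \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Replace failed nodes",
"trigger": {"alert": "Node Performance Problems", "duration": "10m"},
"action": "replace_node",
"require_approval": false,
"notification_channels": ["slack-prod-alerts"]
}'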
By effectively configuring and managing alerts, you ensure that your team remains informed about important events while minimizing unnecessary notifications. This balanced approach helps maintain optimal cluster performance and cost efficiency.
