Alerts and Notifications

Effective monitoring and alerting help you maintain cluster performance and control costs. This guide explains how to configure Stackbooster.io alerts so that you stay informed about important events and changes in your Kubernetes environment.

Alert Types

Stackbooster.io provides alerts across several categories:

Cost Alerts

  • Budget Thresholds: Notifications when spending reaches defined percentages of your budget
  • Anomalous Spending: Alerts for unusual cost increases or unexpected resource usage
  • Savings Opportunities: Notifications about newly identified cost optimization opportunities
  • Reserved Instance Coverage: Alerts when RI coverage drops below target levels

Performance Alerts

  • Resource Saturation: Notifications when nodes or clusters approach resource limits
  • Scaling Failures: Alerts when auto-scaling operations fail or are delayed
  • Pod Scheduling Issues: Notifications about pods failing to schedule due to resource constraints
  • Node Performance Problems: Alerts for node-level performance issues

Operational Alerts

  • Agent Health: Notifications if the Stackbooster.io agent becomes disconnected
  • Configuration Changes: Alerts about changes to cluster or optimization settings
  • Optimization Actions: Notifications about significant scaling or optimization actions
  • Security Events: Alerts related to permissions or access issues

Configuring Alerts

Alert Policies

Create and manage alert policies through the Alerts dashboard:

  1. Navigate to "Operations" > "Alerts"
  2. Click "Create Alert Policy"
  3. Select an alert category and specific condition
  4. Configure threshold values and evaluation periods
  5. Choose notification channels
  6. Set severity level and automated response options (if applicable)
  7. Save the policy
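
To illustrate how the threshold and evaluation period from step 4 can work together, the sketch below treats a condition as breached only when every sample in the evaluation window exceeds the threshold. This is an assumption about the semantics, not a description of how Stackbooster.io evaluates policies internally; `Sample` and `breaches_for_period` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    timestamp: datetime
    value: float  # e.g. CPU utilization in percent


def breaches_for_period(samples, threshold, period, now):
    """Return True if every sample in the last `period` exceeds `threshold`.

    Returns False when the window holds no samples, or when the data
    does not yet reach back to the start of the evaluation period.
    """
    window_start = now - period
    in_window = [s for s in samples if window_start <= s.timestamp <= now]
    if not in_window:
        return False
    # Require a sample at or before the window start so that a short
    # burst of high readings is not mistaken for a sustained condition.
    has_full_coverage = any(s.timestamp <= window_start for s in samples)
    return has_full_coverage and all(s.value > threshold for s in in_window)
```

With a 90% threshold and a 15-minute period, a single reading below 90% inside the window is enough to keep the alert from firing under this interpretation.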

Default Alert Policies

Stackbooster.io includes several default alert policies that are enabled automatically:

| Alert | Condition | Default Threshold | Severity |
|---|---|---|---|
| Critical Node Utilization | CPU or memory utilization exceeding threshold | 90% for 15 minutes | High |
| Failed Scaling Operation | Auto-scaling operation fails | Any failure | Medium |
| Agent Connection Lost | Agent stops reporting metrics | 15 minutes | High |
| Budget Threshold | Monthly spending reaches percentage of budget | 80%, 90%, 100% | Medium, High, Critical |
| Optimization Blocked | Optimization action blocked by constraint | 24 hours | Low |
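
The Budget Threshold policy pairs each percentage with a severity (80% Medium, 90% High, 100% Critical). A minimal sketch of that mapping follows; the function name `budget_severity` is hypothetical, and the assumption is that the highest tier reached determines the severity.

```python
# Tiers from the default Budget Threshold policy, highest first.
BUDGET_TIERS = [(100, "Critical"), (90, "High"), (80, "Medium")]


def budget_severity(spend, budget):
    """Return the severity of the highest budget tier reached, or None."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    for tier_percent, severity in BUDGET_TIERS:
        # Cross-multiply instead of dividing to avoid rounding surprises
        # right at a tier boundary.
        if spend * 100 >= tier_percent * budget:
            return severity
    return None
```

For example, spending 950 against a monthly budget of 1000 falls in the 90% tier and maps to High.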

Alert Severity Levels

Alert policies can be assigned different severity levels:

  • Critical: Requires immediate attention, potential service impact
  • High: Important issue needing prompt response
  • Medium: Issue requiring attention but not immediately urgent
  • Low: Informational, may require eventual action

Notification Channels

Configure how and where you receive alert notifications:

Email Notifications

Configure email notifications for individual users or groups:

  1. Navigate to "Operations" > "Alerts" > "Notification Channels"
  2. Click "Add Email Channel"
  3. Enter email addresses for recipients
  4. Select which severity levels trigger email notifications
  5. Choose whether to send daily or weekly summaries

Integrations

Connect alerts to external systems:

Slack

  1. Navigate to "Operations" > "Alerts" > "Notification Channels"
  2. Click "Add Slack Channel"
  3. Follow the OAuth flow to authorize Stackbooster.io
  4. Select the Slack channel to receive notifications
  5. Configure notification formatting and severity filtering

PagerDuty

  1. Navigate to "Operations" > "Alerts" > "Notification Channels"
  2. Click "Add PagerDuty Service"
  3. Enter your PagerDuty integration key
  4. Configure which severity levels create incidents
  5. Set up automatic incident resolution

Webhook

  1. Navigate to "Operations" > "Alerts" > "Notification Channels"
  2. Click "Add Webhook"
  3. Enter the webhook URL
  4. Configure payload format (JSON or form data)
  5. Set custom headers if needed
  6. Test the webhook connection
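
Step 4 offers two payload formats. The sketch below shows how the same alert might be encoded as JSON or as form data. The alert fields are illustrative only, since the actual webhook payload schema is not documented here, and `encode_payload` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode


def encode_payload(alert, fmt):
    """Return (content_type, body_bytes) for a flat alert dictionary."""
    if fmt == "json":
        body = json.dumps(alert, sort_keys=True)
        return "application/json", body.encode("utf-8")
    if fmt == "form":
        # Form encoding only suits flat key/value data; nested
        # structures would need JSON.
        body = urlencode(sorted(alert.items()))
        return "application/x-www-form-urlencoded", body.encode("utf-8")
    raise ValueError(f"unsupported payload format: {fmt!r}")
```

A receiver that already parses HTML form posts may prefer form data; JSON is usually the better choice when the payload contains nested fields.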

Mobile Notifications

Get alerts on your mobile device:

  1. Download the Stackbooster.io mobile app from the App Store or Google Play
  2. Log in with your account credentials
  3. Enable push notifications when prompted
  4. Configure notification preferences in the app settings
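
Each channel above lets you select which severity levels trigger it. The resulting routing can be pictured as a lookup from severity to channels. In this sketch the channel names and their severity sets are hypothetical examples, not defaults shipped with Stackbooster.io.

```python
ALL_SEVERITIES = ("Low", "Medium", "High", "Critical")

# Hypothetical channels mapped to the severities that trigger them.
CHANNEL_SEVERITIES = {
    "email-team": {"Low", "Medium", "High", "Critical"},
    "slack-prod-alerts": {"Medium", "High", "Critical"},
    "pagerduty-oncall": {"High", "Critical"},
}


def channels_for(severity, config=None):
    """Return the sorted names of channels that receive this severity."""
    if severity not in ALL_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    config = CHANNEL_SEVERITIES if config is None else config
    return sorted(name for name, levels in config.items() if severity in levels)
```

Keeping low-severity alerts out of paging channels in this way is one practical guard against alert fatigue.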

Alert Management

Alert Dashboard

The alert dashboard provides a centralized view of all active and historical alerts:

  1. Navigate to "Operations" > "Alerts" > "Dashboard"
  2. View alerts filtered by:
    • Status (Active, Resolved, Acknowledged)
    • Severity
    • Cluster
    • Time period
    • Alert type
  3. For each alert, you can:
    • View detailed information and context
    • Acknowledge receipt
    • Add comments for team communication
    • Resolve manually if the issue is fixed
    • Snooze for a specified duration
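
The statuses and actions listed above can be modeled as a small state machine. The sketch below is one possible model, assuming that only Active alerts that are not snoozed keep generating notifications; the `Alert` class and its methods are hypothetical and do not mirror a Stackbooster.io API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alert:
    name: str
    status: str = "Active"  # Active, Acknowledged, or Resolved
    snoozed_until: Optional[datetime] = None
    comments: List[str] = field(default_factory=list)

    def acknowledge(self):
        if self.status == "Resolved":
            raise ValueError("cannot acknowledge a resolved alert")
        self.status = "Acknowledged"

    def resolve(self):
        self.status = "Resolved"

    def snooze(self, now, duration):
        self.snoozed_until = now + duration

    def should_notify(self, now):
        """Assumption: notify only for Active alerts that are not snoozed."""
        if self.status != "Active":
            return False
        return self.snoozed_until is None or now >= self.snoozed_until
```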

Alert History

Review historical alert patterns:

  1. Navigate to "Operations" > "Alerts" > "History"
  2. Analyze alert frequency and patterns
  3. Identify recurring issues
  4. Review resolution times and effectiveness
  5. Export alert history for compliance or analysis
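
Once history is exported, alert frequency and resolution times can be summarized with a few lines of code. The record layout used here, `(alert_name, opened_at, resolved_at)`, is an assumption for illustration; the actual export format is not specified in this guide.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def summarize_history(records):
    """Summarize a list of (alert_name, opened_at, resolved_at) records.

    `resolved_at` is None for alerts that are still open. Returns a dict
    mapping each alert name to its count and mean minutes to resolve.
    """
    counts = Counter(name for name, _, _ in records)
    durations = {}
    for name, opened_at, resolved_at in records:
        if resolved_at is not None:
            minutes = (resolved_at - opened_at).total_seconds() / 60
            durations.setdefault(name, []).append(minutes)
    return {
        name: {
            "count": counts[name],
            "mean_minutes_to_resolve": (
                mean(durations[name]) if name in durations else None
            ),
        }
        for name in counts
    }
```

Alert names with a high count and a long mean resolution time are natural candidates for a runbook or a threshold review.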

Best Practices

Setting Appropriate Thresholds

  • Start Conservative: Begin with higher thresholds and adjust based on experience
  • Consider Workload Patterns: Set different thresholds for different clusters based on their usage patterns
  • Avoid Alert Fatigue: Don't create too many alerts or set thresholds too low
  • Review Regularly: Analyze which alerts are actionable and adjust accordingly
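
One way to base thresholds on a cluster's own usage pattern is to derive a starting point from its historical utilization. The heuristic below (95th percentile plus a margin, clamped to a floor and a cap) is an illustrative assumption, not a Stackbooster.io recommendation; `suggest_threshold` and its default values are hypothetical.

```python
from statistics import quantiles


def suggest_threshold(history, margin=5.0, floor=50.0, cap=95.0):
    """Suggest a utilization threshold (percent) from past samples.

    Takes the 95th percentile of `history`, adds `margin`, and clamps
    the result to the range [floor, cap]. Needs at least two samples.
    """
    p95 = quantiles(history, n=20)[-1]  # last of 19 cut points = 95th percentile
    return min(cap, max(floor, p95 + margin))
```

A cluster that routinely runs near 85% would get a threshold around 90%, while a mostly idle cluster would fall back to the floor. Treat the output as a first guess to refine during regular reviews.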

Creating Escalation Paths

For critical production environments:

  1. Define tiered response procedures:
    • Initial notification to team channel
    • Escalation to on-call engineer after X minutes
    • Manager notification after Y minutes
    • Executive notification for extended issues
  2. Configure PagerDuty or similar service with:
    • Appropriate escalation policies
    • Follow-the-sun coverage for global teams
    • Backup responders
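
The tiered procedure in step 1 amounts to a list of delays and recipients. In the sketch below, the delays in `EXAMPLE_STEPS` (15, 30, and 120 minutes) are placeholder values standing in for X, Y, and "extended"; choose values that fit your own response targets. All names are hypothetical.

```python
# (minutes since the alert opened, tier to notify); placeholder delays.
EXAMPLE_STEPS = [
    (0, "team-channel"),
    (15, "on-call-engineer"),
    (30, "manager"),
    (120, "executive"),
]


def notified_tiers(minutes_open, escalation_steps):
    """Return the tiers that should have been notified so far.

    Assumes the alert has stayed unacknowledged for `minutes_open`.
    """
    return [tier for delay, tier in sorted(escalation_steps) if minutes_open >= delay]
```

In practice the escalation itself is usually delegated to PagerDuty or a similar service; a table like this is mainly useful for documenting and reviewing the intended policy.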

Grouping and Correlation

Reduce noise by grouping related alerts:

  1. Enable alert correlation in "Alert Settings"
  2. Group alerts by:
    • Affected cluster or namespace
    • Root cause when detectable
    • Time proximity
    • Related resources
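
To make the idea of grouping by cluster and time proximity concrete, the sketch below merges alerts from the same cluster when each one arrives within a fixed window of the previous one. This is a simplified illustration, not the correlation logic Stackbooster.io uses; `group_alerts` and the tuple layout are assumptions.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window):
    """Group (cluster, timestamp, name) tuples by cluster and time proximity.

    An alert joins the current group when it belongs to the same cluster
    and arrives within `window` of the previous alert in that group.
    Returns a list of (cluster, [alert names]) tuples.
    """
    groups = []
    # Sorting by cluster, then time, keeps each cluster's alerts adjacent.
    for cluster, timestamp, name in sorted(alerts, key=lambda a: (a[0], a[1])):
        current = groups[-1] if groups else None
        if (
            current is not None
            and current["cluster"] == cluster
            and timestamp - current["last_seen"] <= window
        ):
            current["names"].append(name)
            current["last_seen"] = timestamp
        else:
            groups.append({"cluster": cluster, "last_seen": timestamp, "names": [name]})
    return [(g["cluster"], g["names"]) for g in groups]
```

With a five-minute window, a node CPU alert followed three minutes later by a pod scheduling failure on the same cluster is reported as one group instead of two separate notifications.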

Documentation and Runbooks

For each critical alert type, maintain a runbook:

  1. Navigate to "Operations" > "Alerts" > "Runbooks"
  2. Create or edit runbooks with:
    • Clear description of the alert condition
    • Potential causes and impact
    • Immediate mitigation steps
    • Long-term resolution actions
    • Links to relevant documentation

Advanced Topics

Custom Alert Conditions

Create sophisticated alert conditions with the condition builder:

  1. Navigate to "Operations" > "Alerts" > "Custom Conditions"
  2. Use the condition builder to:
    • Combine multiple metrics with AND/OR logic
    • Create rate-of-change conditions
    • Compare current values to historical baselines
    • Reference external metrics or data sources
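
The building blocks listed above (AND/OR logic, rate of change, baseline comparison) compose naturally as small predicate functions. The sketch below shows the idea in plain Python. It is not the condition builder's syntax, and every function and metric name in it is hypothetical.

```python
def above(metric, threshold):
    """Condition: the metric's current value exceeds `threshold`."""
    return lambda metrics: metrics[metric] > threshold


def increased_by(metric, baseline_metric, percent):
    """Condition: the metric exceeds its baseline by more than `percent`."""
    def check(metrics):
        baseline = metrics[baseline_metric]
        if baseline <= 0:
            return False  # no meaningful baseline to compare against
        return metrics[metric] - baseline > baseline * percent / 100
    return check


def all_of(*conditions):
    """AND: every condition must hold."""
    return lambda metrics: all(c(metrics) for c in conditions)


def any_of(*conditions):
    """OR: at least one condition must hold."""
    return lambda metrics: any(c(metrics) for c in conditions)
```

For example, `any_of(all_of(above("cpu", 85), above("memory", 80)), increased_by("cost", "cost_baseline", 50))` fires when both CPU and memory are high, or when cost has risen more than 50% over its baseline.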

Programmatic Alert Management

Use our API to manage alerts programmatically:

```bash
# Example: Create a new alert policy via API
curl -X POST https://api.stackbooster.io/v1/alerts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Pod CPU Usage",
    "description": "Alert when pod CPU usage exceeds threshold",
    "metric": "kubernetes.pod.cpu.utilization_rate",
    "condition": "above",
    "threshold": 85,
    "duration": "10m",
    "severity": "medium",
    "notification_channels": ["slack-prod-alerts"]
  }'
```
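
The same request can be prepared from a script. The sketch below builds, but does not send, a request equivalent to the curl example above using only the Python standard library. The endpoint, headers, and fields are taken from that example; the helper name `build_create_policy_request` is hypothetical.

```python
import json
import urllib.request

API_URL = "https://api.stackbooster.io/v1/alerts"  # endpoint from the curl example


def build_create_policy_request(api_key, policy):
    """Build a POST request that creates an alert policy."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(policy).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


policy = {
    "name": "High Pod CPU Usage",
    "description": "Alert when pod CPU usage exceeds threshold",
    "metric": "kubernetes.pod.cpu.utilization_rate",
    "condition": "above",
    "threshold": 85,
    "duration": "10m",
    "severity": "medium",
    "notification_channels": ["slack-prod-alerts"],
}
```

To send it, pass the request to `urllib.request.urlopen(request, timeout=10)`. The response format is not documented in this guide, so inspect the status code and body before relying on any particular field.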

Predictive Alerting

Enable ML-powered predictive alerts to detect issues before they impact services:

  1. Navigate to "Operations" > "Alerts" > "Predictive Alerts"
  2. Enable predictions for:
    • Resource exhaustion trends
    • Anomalous behavior detection
    • Performance degradation patterns
    • Cost spike predictions
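
As a rough intuition for resource exhaustion trends, the sketch below fits a straight line to recent usage samples and estimates how long until usage reaches capacity. It is a deliberately simple stand-in (ordinary least squares), not the ML model behind predictive alerts; `hours_until_exhaustion` is a hypothetical name.

```python
def hours_until_exhaustion(samples, capacity):
    """Estimate hours until usage reaches `capacity` from (hour, usage) pairs.

    Fits a least-squares line through the samples and extrapolates it.
    Returns None when the trend is flat or decreasing, and 0.0 when the
    fitted line has already reached capacity at the latest sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    latest = max(x for x, _ in samples)
    return max(0.0, (capacity - intercept) / slope - latest)
```

Usage that grows from 50 to 65 units over three hours, with capacity at 100, extrapolates to exhaustion about seven hours after the last sample.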

Automated Remediation

For certain alert conditions, you can configure automated remediation actions:

  1. Navigate to "Operations" > "Alerts" > "Auto-remediation"
  2. Create remediation rules for conditions like:
    • Failed node automatic replacement
    • Pod rescheduling for node issues
    • Scaling actions for resource constraints
    • Budget enforcement actions
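
Conceptually, a remediation rule maps an alert condition to an action. The sketch below mirrors the four examples above with a dry-run default, so that a rule can be reviewed before it is allowed to act. The condition and action identifiers are hypothetical and are not Stackbooster.io rule names.

```python
# Hypothetical condition -> action mapping mirroring the list above.
REMEDIATION_RULES = {
    "node_failed": "replace_node",
    "node_degraded": "reschedule_pods",
    "resource_constrained": "scale_up",
    "budget_exceeded": "enforce_budget",
}


def plan_remediation(condition, dry_run=True):
    """Return the planned action for a condition, or None if no rule matches.

    With dry_run=True the plan is only reported, never marked for execution.
    """
    action = REMEDIATION_RULES.get(condition)
    if action is None:
        return None
    return {"condition": condition, "action": action, "execute": not dry_run}
```

Starting in dry-run mode and comparing planned actions against what an operator would have done is a cautious way to build confidence before enabling automatic execution.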

By effectively configuring and managing alerts, you ensure that your team remains informed about important events while minimizing unnecessary notifications. This balanced approach helps maintain optimal cluster performance and cost efficiency.

Released under the MIT License. Contact us at info@stackbooster.io