Your Metacloud installation sends alerts to the Support team when resource usage exceeds certain thresholds or when other events occur that could cause performance problems in your cloud. The team addresses the cause of each event and contacts your organization only if corrective steps will disrupt your cloud operations or if your organization needs to take some action.
Alerts range in severity, depending on the percentage of resource capacity that has been consumed. In most cases, the Support team can restore usage to normal levels without disrupting operations or involving your organization. In other words, you won't often need to know that an alert has been triggered and the cause remediated.
Resource Issues That Require Your Attention
Other alerts require the Support team to contact your organization, as in the following scenarios:
- Disk usage is close to full capacity, and the Support team considers the remaining space insufficient to meet demand much longer. For example, usage is at 85 percent, and only 32 GB are free.
- Low bandwidth or other performance issues are affecting a network that your organization controls. For example, all available subnets are in use, which prevents you from creating new projects.
- Lost or diminished contact with your environment affects the Support team's ability to provide normal monitoring and administrative services. For example, the Support team detects a loss of its VPN connection to one of your Availability Zones (AZs).
- Hardware that your organization controls has failed. Before notifying you, the Support team may take immediate actions to minimize impact to operations, such as migrating affected instances to another host in the same cluster.
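The disk-usage scenario in the list above can be sketched as a simple check. This is a minimal illustration in Python, not the Support team's actual alerting logic: the function name `disk_needs_attention` is hypothetical, and the default thresholds (85 percent used, 32 GB free) are borrowed from the example figures above, not from documented alert settings.

```python
import shutil

GB = 10 ** 9  # decimal gigabytes, matching the "32 GB" figure in the example


def disk_needs_attention(total_bytes, free_bytes,
                         max_used_pct=85.0, min_free_gb=32.0):
    """Return True when a disk is nearly full AND little space remains.

    The thresholds are illustrative assumptions taken from the example
    in the text; they are not documented Metacloud alert settings.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    free_gb = free_bytes / GB
    return used_pct >= max_used_pct and free_gb <= min_free_gb


if __name__ == "__main__":
    # Check the root filesystem of the machine running this script.
    usage = shutil.disk_usage("/")
    print(disk_needs_attention(usage.total, usage.free))
```

Requiring both conditions avoids flagging a very large disk that is 85 percent full but still has ample free space.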
How Your Organization May Need to Respond to Resource Issues
Sometimes the Support team will contact your organization only to notify you of actions that may temporarily interrupt services, such as restarting a server. In other cases, the team will ask you to take one of the following actions:
- Replace failed hardware, storage, or networking devices that your organization controls.
- Investigate your network for points of failure. The Support team may be able to assist you with this effort.
- Investigate activity in your organization that is causing spikes in network bandwidth usage.
- Increase available storage by deleting old or unnecessary data, instances, or applications.
- Increase storage capacity.
Practicing Resource Management
Emergency issues can be time-consuming for you and can have adverse effects on your cloud tenants. You can reduce their likelihood by managing cloud resources routinely, in the following ways:
- Scan your cloud routinely for old or unused instances, and delete them.
- Scan your instances for old or unused applications or data, and delete them.
- Note when users upload new images or make snapshots of running instances. These actions can consume disk space quickly on Metacloud Control Planes (MCPs).
- Review your flavors and make sure they allocate appropriate resources to your instances. See Managing Flavors for more information.
- Consult the Support team about certain types of system errors, especially if they occur more than once. For example, if the Compute service is unable to create new instances, the cause may be low CPU availability.
- Routinely check usage levels with your Dashboard, and note any upward trends or spikes.
- Plan to scale your capacity with the resource needs of your tenants.
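The routine scan for old or unused instances described above can be sketched as a small filter over an instance inventory. This is a hedged illustration only: the function name `find_stale_instances`, the record fields (`name`, `status`, `updated_at`), the `SHUTOFF` status value, and the 90-day cutoff are all assumptions about how you might export and judge your inventory, not part of any Metacloud tool.

```python
from datetime import datetime, timedelta, timezone


def find_stale_instances(instances, now, max_idle_days=90):
    """Return names of shut-off instances not updated within max_idle_days.

    Each record is assumed (hypothetically) to be a dict with the keys
    "name", "status", and "updated_at" (an ISO 8601 timestamp with offset).
    """
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for inst in instances:
        last_updated = datetime.fromisoformat(inst["updated_at"])
        # Only flag instances that are both powered off and long untouched.
        if inst["status"] == "SHUTOFF" and last_updated < cutoff:
            stale.append(inst["name"])
    return stale


if __name__ == "__main__":
    # Made-up sample inventory for illustration.
    inventory = [
        {"name": "build-01", "status": "SHUTOFF",
         "updated_at": "2024-01-15T00:00:00+00:00"},
        {"name": "web-01", "status": "ACTIVE",
         "updated_at": "2023-11-01T00:00:00+00:00"},
    ]
    print(find_stale_instances(inventory, datetime.now(timezone.utc)))
```

Treat the result as a list of candidates to review with their owners, not as a list to delete automatically.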