This article describes best practices for the most important metrics to track when monitoring the health of Storage Connectors. The guidance and examples focus on three key areas of monitoring:
- Resource utilization
For a list of the health metrics that the Storage Connector emits and their definitions please refer to Storage Connector health metrics definitions.
The metrics that focus on concurrent requests and request throttling are good indicators of how well a given Storage Connector, or a load-balanced pool of Storage Connectors, is scaled to meet the level of file upload and download activity. When thresholds are set to warn of approaching limits and these metrics are constantly above those thresholds, it is often a sign that additional Storage Connector nodes should be added to scale out. Conversely, these metrics can also indicate that the Storage Connectors are over-provisioned for the activity monitored over a period of time, though storage administrators must still plan for peak activity periods and other planned or unplanned spikes in requests.
In this category the following metrics will be exercised through some examples:
The Storage Connector tracks counters for a variety of runtime error conditions that it may encounter. These metrics, when used in conjunction with the information in the Storage Connector logs, can be used to diagnose and find the root cause of problems that impact normal operations and disrupt users accessing their files.
In this article we will focus on tracking file upload and download errors, which are among the most common types of errors that occur due to the dependencies on other subsystems in the Enterprise, the public network, Syncplicity client actions and Syncplicity web services. The examples will focus on the following metrics:
Resource Utilization Monitoring
This category of metrics focuses on the Storage Connector's consumption of system resources such as CPU, physical and virtual memory and heap. These metrics can be used to determine the proper tuning of the virtual machines to scale up or down.
The examples in this document use the Grafana visualization tool. However, these metrics can be consumed and monitored through a variety of popular monitoring tools. To walk through these examples hands-on, first install and configure the tool of your choice to connect with the Storage Connector health metrics feed. Refer to the following articles for instructions on configuring Grafana, Graphite and Splunk with the Storage Connector health metrics.
- How to configure Storage Connector health metrics using Grafana
- How to configure Storage Connector health metrics using Graphite
- How to configure Storage Connector health metrics using Splunk
For the purposes of nomenclature, the examples given in this article use a 2-node cluster of Storage Connector nodes named "acme". The two nodes are named "host1" and "host2" and sit behind a load balancer. The metrics namespace for each node is specified in the Storage Connector configuration file as shown below. The configuration can be found in the /etc/syncp-storage/syncp-storage.conf file.
syncplicity.health.external.prefix = $CLUSTER_NAME.$NODE_NAME
All metrics will bear the prefix "acme.host1" or "acme.host2" depending on which node is returning the metrics.
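The following is a minimal sketch of how the prefix combines with a metric path to form the full metric name seen in the monitoring tool. The helper names `metric_prefix` and `full_metric_name` are hypothetical and exist only for illustration; the metric path is the JVM heap metric used later in this article.

```python
def metric_prefix(cluster_name: str, node_name: str) -> str:
    """Build the per-node prefix, mirroring $CLUSTER_NAME.$NODE_NAME."""
    return f"{cluster_name}.{node_name}"


def full_metric_name(cluster_name: str, node_name: str, metric: str) -> str:
    """Prepend the node prefix to a metric path emitted by the node."""
    return f"{metric_prefix(cluster_name, node_name)}.{metric}"


if __name__ == "__main__":
    # The two nodes of the example "acme" cluster.
    for node in ("host1", "host2"):
        print(full_metric_name("acme", node, "syncp.compute.v1.jvm.heapUsed"))
```

A wildcard in place of the node name, as in acme.*, then matches the same metric on every node of the cluster.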
To monitor your Storage Connector capacity and keep uploads and downloads running smoothly, the concurrent file upload and download request metrics are very useful. These metrics can also help determine peak upload and download times and limits.
Enter the following query:
The Storage Connector applies a limit to the total number of concurrent requests it will process at any given time. This limit is commonly referred to as the throttling limit. This limit is available in the Storage Connector configuration file and is identifiable from the following parameter key:
In our example, the limit for each node has been set to 150 concurrent connections, so for our 2-node "acme" cluster the limit would be 150 x 2 = 300.
Enter the following query:
threshold(300, "Max connections", red)
You might also add a second threshold at 80 per cent of the limit by adding the following query:
threshold(240, "80% of max connections", red)
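The arithmetic behind the two threshold lines can be sketched as follows. The function names and the label text for the warning line are assumptions made for this illustration; the output format simply mirrors the threshold queries given in this article.

```python
def cluster_limit(per_node_limit: int, node_count: int) -> int:
    """Total concurrent requests the cluster accepts before throttling."""
    return per_node_limit * node_count


def warning_level(limit: int, percent: int = 80) -> int:
    """Warning threshold as a whole-number percentage of the limit."""
    return limit * percent // 100


def threshold_target(value: int, label: str, color: str) -> str:
    """Format a threshold query in the same form used in this article."""
    return f'threshold({value}, "{label}", {color})'


if __name__ == "__main__":
    limit = cluster_limit(per_node_limit=150, node_count=2)
    print(threshold_target(limit, "Max connections", "red"))
    # The label below is a hypothetical choice for the warning line.
    print(threshold_target(warning_level(limit), "80% of max connections", "red"))
```

Recomputing both values whenever a node is added or the per-node limit changes keeps the threshold lines in step with the real capacity of the cluster.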
The resulting graph should look like the example given in Figure 1 below.
Analysis of this graph shows the load to be within the limit and not exceeding the 80 per cent threshold set for it.
When the load is increased, the resulting graph should start to look like Figure 2 shown below.
Analysis of this graph shows peak loads exceeding the 80 per cent threshold between 17:10 and 17:30.
As the load increases further the graph will start to look like the example shown in Figure 3 below.
This graph shows peaks consistently remaining above the 80 per cent threshold and touching the 100 per cent limit. When the concurrent request volume stays at or near the throttling limit, the nodes can no longer handle the load and reject some of the incoming requests.
The solution is to increase the number of Storage Connector nodes behind this load balancer so that the load is distributed among more instances.
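The three situations described for Figures 1 to 3 can be expressed as a simple classification of sampled concurrent request counts. This is only a sketch: the function name, the state labels and the rule that "consistently" means at least half of the samples are assumptions, not behaviour of the Storage Connector.

```python
from typing import Sequence


def classify_load(samples: Sequence[float], limit: float,
                  warn_percent: float = 80.0,
                  sustained_fraction: float = 0.5) -> str:
    """Classify cluster load from concurrent-request samples.

    "ok"        : no sample reaches the warning threshold
    "watch"     : some samples reach the warning threshold
    "scale out" : samples are mostly above the warning threshold
                  and at least one touches the throttling limit
    """
    if not samples:
        return "no data"
    warn = limit * warn_percent / 100.0
    above_warn = sum(1 for s in samples if s >= warn)
    at_limit = any(s >= limit for s in samples)
    if at_limit and above_warn / len(samples) >= sustained_fraction:
        return "scale out"
    if above_warn:
        return "watch"
    return "ok"


if __name__ == "__main__":
    print(classify_load([100, 120, 150], limit=300))
    print(classify_load([200, 250, 230, 180], limit=300))
    print(classify_load([260, 300, 290, 250], limit=300))
```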
A small number of file upload and download errors occur in every environment. This happens because network communication is inherently unreliable and some HTTP errors are a part of normal network communication, e.g. a 404 Not Found error. The acceptable error rate should be defined per environment because it depends on the number of clients and other factors, and can therefore vary from one environment to another. It is a recommended best practice to determine a baseline error rate for a set of nodes by monitoring the cluster over a period of days.
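One possible way to derive such a baseline is to summarise the error counts collected per day as a mean and a standard deviation. The sketch below assumes the daily totals have already been exported from the monitoring tool; the function name and the sample numbers are illustrative only.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def error_baseline(daily_error_counts: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of per-day error counts."""
    if not daily_error_counts:
        raise ValueError("at least one day of data is required")
    return mean(daily_error_counts), pstdev(daily_error_counts)


if __name__ == "__main__":
    # Hypothetical error totals for four days of monitoring.
    average, spread = error_baseline([10, 12, 8, 10])
    print(f"baseline: {average} errors/day, spread: {spread:.2f}")
```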
Enter the following query to render the download errors grouped by error type for all the nodes:
Enable the stacking display option so that the graphs for the different error types are drawn on top of each other.
The resulting graph of file download errors should look like Figure 4.
To display the graph of upload errors enter the following query:
The resulting graph should look like Figure 5 below:
The examples shown above are part of normal client-server environments. However, a sudden spike of errors indicates a problem in, or affecting, the Storage Connector. An example of such a problem, as represented by a spike in the error rate, would look like Figure 6.
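A spike can also be flagged programmatically by comparing each sample against the baseline for the environment. The rule used here, baseline mean plus three standard deviations, is an assumed convention chosen for illustration and should be tuned per environment; `find_spikes` is a hypothetical helper.

```python
from typing import List, Sequence


def find_spikes(counts: Sequence[float], baseline_mean: float,
                baseline_sd: float, k: float = 3.0) -> List[int]:
    """Return the indexes of samples above baseline_mean + k * baseline_sd."""
    cutoff = baseline_mean + k * baseline_sd
    return [i for i, count in enumerate(counts) if count > cutoff]


if __name__ == "__main__":
    # Hypothetical error counts per interval; the fourth sample is a spike.
    print(find_spikes([10, 11, 9, 60, 12], baseline_mean=10, baseline_sd=2))
```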
Error 500 is an exception among the errors seen in these graphs: it is returned when the server is unable to handle a request, so it cannot be treated as part of normal operation and can be a sign of a problem that is detrimental to the user experience.
If an Error 500 occurs, it is recommended that Storage Connector logs are checked for other signs of the problem.
An example of repeated Error 500 occurrences is shown below in Figure 7.
Administrators frequently check the CPU and memory utilization of their systems to watch for warning signs of potential outages. This helps them correct potential problems at the system layer that supports the running applications. For the Storage Connector it is important to track the utilization of system resources to ensure that the virtual machines are appropriately provisioned and tuned for the load.
To enable a graphic display of the CPU utilization enter the following query:
The value of **.syncp.compute.v1.system.cpu.current.load.value provides the "recent cpu usage" for a selected node. This value lies in the interval [0.0, 100.0]. A value of 0.0 means that all CPUs were idle during the recent observation period, while a value of 100.0 means that all CPUs were actively running 100 per cent of the time during the recent observation period.
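Outside the graph, the same values can be summarised per node. The sketch below assumes the data was fetched in the JSON format of the Graphite render API, a list of series that each carry a "target" and a list of [value, timestamp] "datapoints" with null for missing values. The sample payload and the helper name `average_load` are made up for illustration.

```python
import json
from typing import Dict


def average_load(render_json: str) -> Dict[str, float]:
    """Average the non-null CPU load values of each series."""
    averages = {}
    for series in json.loads(render_json):
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values:  # skip series that contain only missing values
            averages[series["target"]] = sum(values) / len(values)
    return averages


if __name__ == "__main__":
    # Hypothetical response body for one node of the "acme" cluster.
    sample = json.dumps([{
        "target": "acme.host1.syncp.compute.v1.system.cpu.current.load.value",
        "datapoints": [[20.0, 1700000000], [None, 1700000060], [40.0, 1700000120]],
    }])
    print(average_load(sample))
```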
The graph of a normal CPU load would be similar to the example shown below in Figure 8.
A severely overloaded system will display a graph similar to the example shown in Figure 9 below.
In the case of the severely overloaded system, CPU utilization that stays above the 90 per cent threshold and even peaks at 100 per cent indicates that the Storage Connector cluster is unable to handle the load.
If such a CPU utilization is observed, it is advisable to increase the number of nodes to distribute the load.
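To separate a short burst from sustained overload, one option is to look at how many consecutive samples stay above the 90 per cent threshold. The minimum run length of five samples is an assumption for this sketch, as are the function names.

```python
from typing import Optional, Sequence


def longest_run_above(values: Sequence[Optional[float]],
                      threshold: float = 90.0) -> int:
    """Length of the longest run of consecutive samples above the threshold."""
    longest = current = 0
    for value in values:
        if value is not None and value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a lower or missing sample ends the run
    return longest


def is_overloaded(values: Sequence[Optional[float]],
                  threshold: float = 90.0, min_run: int = 5) -> bool:
    """True when the load stays above the threshold for min_run samples."""
    return longest_run_above(values, threshold) >= min_run


if __name__ == "__main__":
    print(is_overloaded([50, 95, 60, 40, 30, 20]))
    print(is_overloaded([95, 97, 100, 99, 96, 98]))
```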
To render JVM Heap status enter the following query:
aliasByNode(acme.*.syncp.compute.v1.jvm.heapUsed, 1, 6)
This metric will display the heap space used in the JVM on each node. Enable stacking to display this graph.
An example of normal heap usage, where the heap is seldom used beyond 80 per cent, should be similar to the example shown below in Figure 10.
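The 80 per cent figure can be checked numerically once the heap in use is compared with the maximum heap size. The sketch below assumes both values are available in the same unit; the source of the maximum (for example the JVM -Xmx setting) and the helper names are assumptions, since this article only names the heapUsed metric.

```python
def heap_used_percent(heap_used: float, heap_max: float) -> float:
    """Heap in use as a percentage of the maximum heap size."""
    if heap_max <= 0:
        raise ValueError("heap_max must be positive")
    return 100.0 * heap_used / heap_max


def exceeds_heap_threshold(heap_used: float, heap_max: float,
                           threshold_percent: float = 80.0) -> bool:
    """True when heap usage is above the given percentage of the maximum."""
    return heap_used_percent(heap_used, heap_max) > threshold_percent


if __name__ == "__main__":
    # Hypothetical values in bytes for a node with a 2 GB maximum heap.
    print(heap_used_percent(1_500_000_000, 2_000_000_000))
    print(exceeds_heap_threshold(1_800_000_000, 2_000_000_000))
```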
When the heap is consistently used beyond 80 per cent, about 30 per cent of the CPU time on the nodes will be spent on JVM garbage collection. If this condition occurs and is not resolved, an "out of memory" exception will be thrown and the node will begin to restart itself. When this occurs, the node hangs during garbage collection and fails to produce the required statistics. In Figure 11 such hanging of the node is indicated by gaps in the graph.
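Gaps of this kind can also be found in the raw data. The sketch below assumes the series is a list of [value, timestamp] pairs in which a missing sample is represented by None, as in the JSON output of the Graphite render API. The function name and the minimum gap length of two samples are illustrative choices.

```python
from typing import List, Optional, Sequence, Tuple


def find_gaps(datapoints: Sequence[Tuple[Optional[float], int]],
              min_length: int = 2) -> List[Tuple[int, int]]:
    """Return (first_ts, last_ts) for each run of missing samples."""
    gaps = []
    run = []  # timestamps of the current run of missing samples
    for value, timestamp in datapoints:
        if value is None:
            run.append(timestamp)
            continue
        if len(run) >= min_length:
            gaps.append((run[0], run[-1]))
        run = []
    if len(run) >= min_length:  # a gap that runs to the end of the series
        gaps.append((run[0], run[-1]))
    return gaps


if __name__ == "__main__":
    # Hypothetical heapUsed samples at 60-second intervals.
    points = [(1.0, 60), (None, 120), (None, 180), (None, 240), (2.0, 300)]
    print(find_gaps(points))
```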
To correct this problem it is recommended that the Storage Connector JVM memory allocation is increased. An alternative solution is to increase the number of nodes to distribute the load.