Anomaly Detection Service – Modules
The Anomaly Detection Service consists of two modules: a model training (clustering) module and a model application (scoring) module.
Model Training (Clustering)
The model training module clusters the historic training data set specified in the API call, which yields a model of the normal behaviour of the process or asset. Clustering is done using DBSCAN (refer to Ester, Martin, et al.: "A density-based algorithm for discovering clusters in large spatial databases with noise." Kdd. Vol. 96. No. 34. 1996). The clustering assigns each data instance the ID of the cluster to which it belongs, or a noise label if it does not belong to any cluster. Only core points need to be saved in the model, that is, points which are part of a cluster and whose number of neighbors inside this cluster is at least the minimum cluster size.
The algorithm uses the following parameters:
- epsilon: Distance threshold to determine if a point belongs to the cluster
- minPointsPerCluster: Minimum cluster size
- distanceMeasureAlgorithm (optional): Distance measure algorithm to be applied:
  - Euclidean (default)
  - Manhattan
  - Chebyshev
- name (optional): Human-friendly name of the model (default: "model").
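To make the roles of epsilon and minPointsPerCluster concrete, the following is a minimal pure-Python sketch of DBSCAN. It is not the service's implementation. The function name `dbscan`, the `NOISE` label value of -1, and the convention that a point counts itself among its neighbors are assumptions made for illustration.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Point = Sequence[float]
NOISE = -1  # assumed label for points that belong to no cluster


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(
    points: List[Point],
    epsilon: float,
    min_points: int,
    distance: Callable[[Point, Point], float] = euclidean,
) -> Tuple[List[int], List[bool]]:
    """Return (cluster label per point, core-point flag per point)."""
    n = len(points)
    # Brute-force neighborhood search: n * n distance evaluations.
    # Assumption: each point is counted as its own neighbor.
    neighbors = [
        [j for j in range(n) if distance(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    core = [len(nb) >= min_points for nb in neighbors]

    labels: List[Optional[int]] = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue  # only unvisited core points start a new cluster
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] is None:
                labels[j] = cluster_id
                if core[j]:
                    # Only core points expand the cluster further.
                    frontier.extend(neighbors[j])
        cluster_id += 1

    # Anything never reached from a core point is noise.
    return [NOISE if lab is None else lab for lab in labels], core
```

In this sketch, the points flagged in the second return value are the ones a model would need to retain, since scoring only measures distances to core points.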
Models are stored using the Model Management Service, which automatically sets the expiration date of a model to 14 days. This expiration period might be changed in the future.
Benefits
- Unsupervised approach → no labeled training data needed
- Feasible for univariate and multivariate time series as well as time series subsequences
- Robust in the presence of noise in the training data
- Extensible to distributed clustering
- Fast testing phase → extensible to real-time analysis
- Allows for human interaction and integration of domain knowledge (e.g. by labeling of new clusters and/or anomalies)
Remarks
- This approach requires input parameters that are highly dependent on the data being analyzed (clustering parameters, distance measure).
- It is not feasible for data that forms clusters with significantly different density values or no clusters at all.
- The complexity of the model training (clustering) is O(n²).
- Anomalies might form clusters of their own, which results in false negatives.
- Preprocessing of the data (normalization, seasonality, segmentation, etc.) needs to be done before model training and application.
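As an illustration of the preprocessing remark, the sketch below shows z-score normalization, one common choice among several. The helper names `fit_zscore` and `apply_zscore` are hypothetical and not part of the service. The point of the sketch is that the statistics are computed once on the training data and then reused unchanged for the data to be scored, so that distances and epsilon refer to the same scale.

```python
import statistics
from typing import List, Sequence, Tuple

Row = Sequence[float]


def fit_zscore(train_rows: List[Row]) -> Tuple[List[float], List[float]]:
    """Compute per-column mean and standard deviation on the training data."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has standard deviation 0; fall back to 1.0
    # so that the division in apply_zscore stays defined.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(
    rows: List[Row], means: List[float], stds: List[float]
) -> List[Tuple[float, ...]]:
    """Normalize rows with statistics obtained from the training data."""
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]
```

A typical use would call `fit_zscore` on the training set, normalize the training set with `apply_zscore` before model training, and later pass the same `means` and `stds` when normalizing data for model application.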
Model Application (Scoring)
The model application module determines whether each point in a given set of data points is anomalous. For each data point p, it calculates the distance to the nearest core point n of the model. If this distance is less than or equal to epsilon, the data point is not anomalous and is assigned a score of 0. Otherwise, the data point is assigned a score equal to the difference between the distance and epsilon. The higher the score, the more likely it is that the data point is an anomaly.
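The scoring rule described above can be sketched in a few lines. This is an illustration under assumptions, not the service's code: the function name `anomaly_scores` is hypothetical, the model is assumed to be available as a plain list of core points, and the Euclidean distance is used.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def anomaly_scores(
    data: List[Point], core_points: List[Point], epsilon: float
) -> List[float]:
    """Score each point as max(0, distance to nearest core point - epsilon)."""
    if not core_points:
        raise ValueError("the model contains no core points")
    scores = []
    for p in data:
        # Distance from p to its nearest core point.
        nearest = min(math.dist(p, c) for c in core_points)
        # Within epsilon of a core point: score 0 (not anomalous).
        scores.append(max(0.0, nearest - epsilon))
    return scores
```

For example, with core points (0, 0) and (1, 0) and epsilon = 1.0, the point (4, 0) lies at distance 3 from its nearest core point and receives a score of 2.0, while (0.5, 0) and (1, 1) both receive 0.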
To ensure scalability and applicability to different use cases, the Anomaly Detection Service provides a variety of distance functions that can be used in clustering and scoring. Currently, the Anomaly Detection Service supports Euclidean, Manhattan and Chebyshev distance.
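For reference, the three supported distance measures can be written as follows. This is a sketch using the standard definitions. The dictionary `DISTANCES` and its keys are assumptions for illustration, since the exact values accepted for distanceMeasureAlgorithm are not stated here.

```python
import math
from typing import Callable, Dict, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Point, b: Point) -> float:
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a: Point, b: Point) -> float:
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical lookup table; the key spellings are assumed.
DISTANCES: Dict[str, Callable[[Point, Point], float]] = {
    "Euclidean": euclidean,
    "Manhattan": manhattan,
    "Chebyshev": chebyshev,
}
```

The same measure has to be used for clustering and for scoring, because epsilon is only meaningful relative to the measure it was chosen for.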