Isolation forest
A major contributor to this article appears to have a close connection with its subject. (October 2023) |
Isolation Forest is an algorithm for data anomaly detection using binary trees. It was developed by Fei Tony Liu in 2008.[1] It has a linear time complexity and a low memory use, which works well for high-volume data.[2][3] It is based on the assumption that because anomalies are few and different from other data, they can be isolated using few partitions. Like decision tree algorithms, it does not perform density estimation. Unlike decision tree algorithms, it uses only path length to output an anomaly score, and does not use leaf node statistics of class distribution or target value.
Isolation Forest is fast because it splits the data space, randomly selecting an attribute and split point. The anomaly score is inversely associated with the path-length because anomalies need fewer splits to be isolated, because they are few and different.
History
[edit]The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008.[2] In 2012 the same authors showed that iForest has linear time complexity, a small memory requirement, and is applicable to high-dimensional data.[3] In 2010, an extension of the algorithm, SCiforest, was published to address clustered and axis-paralleled anomalies.[4]
Isolation trees
[edit]The premise of the Isolation Forest algorithm is that anomalous data points are easier to separate from the rest of the sample. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute.
An example of random partitioning in a 2D dataset of normally distributed points is shown in the first figure for a non-anomalous point and in the second one for a point that is more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.
Recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point in the first figure is greater than the path length of in the second figure.
Let be a set of d-dimensional points and . An Isolation Tree (iTree) is defined as a data structure with the following properties:
- for each node in the Tree, is either an external-node with no child, or an internal-node with one “test” and exactly two child nodes ( and )
- a test at node consists of an attribute and a split value such that the test determines the traversal of a data point to either or .
In order to build an iTree, the algorithm recursively divides by randomly selecting an attribute and a split value , until either
- the node has only one instance, or
- all data at the node have the same values.
When the iTree is fully grown, each point in is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smaller path length in the tree, where the path length of point is defined as the number of edges traverses from the root node to get to an external node.
A probabilistic explanation of iTree is provided in the original iForest paper.[2]
Anomaly detection
[edit]Anomaly detection with Isolation Forest is done as follows:[4]
- Use the training dataset to build some number of iTrees
- For each data point in the test set:
- Pass it through all the iTrees, counting the path length for each tree
- Assign an “anomaly score” to the instance
- Label the point as “anomaly” if its score is greater than a predefined threshold, which depends on the domain
Anomaly score
[edit]The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.[4] Therefore the estimation of average for external node terminations is the same as that of the unsuccessful searches in BST, that is[5]
where is the test set size, is the sample set size and is the harmonic number, which can be estimated by , where is the Euler-Mascheroni constant.
Above, is the average given , so we can use it to normalize to get an estimate of the anomaly score for a given instance x:
where is the average value of from a collection of iTrees. For any data point :
- if is close to then is very likely an anomaly
- if is smaller than then is likely normal
- if all points in the sample score around , then likely they are all normal
Properties
[edit]- Sub-sampling: Because iForest does not need to isolate normal instances, it can often ignore most of the training set. Thus, it works very well when the sampling size is kept small, unlike most other methods, which benefit from a large sample size.[2][3]
- Swamping: When normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, a phenomenon known as swamping, which makes it more difficult for iForest to discriminate between anomalies and normal points. A main reason for swamping is the presence of too much data; so a possible solution is sub-sampling. Because iForest performs well under sub-sampling, reducing the number of points in the sample is also a good way to reduce the effect of swamping.[2]
- Masking: When there are many anomalies, some of them can aggregate in a dense, large cluster, making it more difficult to separate the single anomalies and so to identify them. This phenomenon is called “masking”, and as with swamping, is more likely when the sample is big and can be alleviated through sub-sampling.[2]
- High-dimensional data: A main limitation of standard, distance-based methods is their inefficiency in dealing with high dimensional data.[6] The main reason is that in a high-dimensional space, every point is equally sparse, so using a distance-based measure of separation is ineffective. Unfortunately, high-dimensional data also affects the detection performance of iForest, but the performance can be vastly improved by using feature selection, like Kurtosis, to reduce the dimensionality of the sample.[2][4]
- Normal instances only: iForest performs well even if the training set contains no anomalous points.[4] This is because iForest describes data distributions such that long tree paths correspond to normal data points. Thus, the presence of anomalies is irrelevant to detection performance.
Open source implementations
[edit]Original implementation by Fei Tony Liu is Isolation Forest in R.
Other implementations (in alphabetical order):
- ELKI contains a Java implementation.
- Isolation Forest - A Spark/Scala implementation.[7]
- Isolation Forest by H2O-3 - A Python implementation.
- Package solitude implementation in R.
- Python implementation with examples in scikit-learn.
- Spark iForest - A distributed Apache Spark implementation in Scala/Python.
- PyOD IForest - Another Python implementation in the popular Python Outlier Detection (PyOD) library.
Other variations of Isolation Forest algorithm implementations:
- Extended Isolation Forest – An implementation of Extended Isolation Forest.[8]
- Extended Isolation Forest by H2O-3 - An implementation of Extended Isolation Forest.
- (Python, R, C/C++) Isolation Forest and variations - An implementation of Isolation Forest and its variations.[9]
See also
[edit]References
[edit]- ^ Liu, Fei Tony. "First Isolation Forest implementation on Sourceforge".
- ^ a b c d e f g Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining. pp. 413–422. doi:10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449.
- ^ a b c Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation-Based Anomaly Detection". ACM Transactions on Knowledge Discovery from Data. 6: 3:1–3:39. doi:10.1145/2133360.2133363. S2CID 207193045.
- ^ a b c d e Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (September 2010). "On Detecting Clustered Anomalies Using SCiForest". Joint European Conference on Machine Learning and Knowledge Discovery in Databases - ECML PKDD 2010: Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. 6322: 274–290. doi:10.1007/978-3-642-15883-4_18. ISBN 978-3-642-15882-7.
- ^ Shaffer, Clifford A. (2011). Data structures & algorithm analysis in Java (3rd Dover ed.). Mineola, NY: Dover Publications. ISBN 9780486485812. OCLC 721884651.
- ^ Dilini Talagala, Priyanga; Hyndman, Rob J.; Smith-Miles, Kate (12 Aug 2019). "Anomaly Detection in High Dimensional Data". arXiv:1908.04000 [stat.ML].
- ^ Verbus, James (13 August 2019). "Detecting and preventing abuse on LinkedIn using isolation forests". LinkedIn Engineering Blog. Retrieved 2023-07-02.
- ^ Hariri, Sahand; Kind, Matias Carrasco; Brunner, Robert J. (April 2021). "Extended Isolation Forest". IEEE Transactions on Knowledge and Data Engineering. 33 (4): 1479–1489. arXiv:1811.02141. doi:10.1109/TKDE.2019.2947676. ISSN 1558-2191. S2CID 53236735.
- ^ Cortes, David (2019). "Distance approximation using Isolation Forests". arXiv:1910.12362 [stat.ML].