Automatic PAM Clustering Algorithm for Outlier Detection

doi:10.4304/jsw.7.5.1045-1051

Journal of Software, Vol 7, No 5 (2012), 1045-1051, May 2012

Dajiang Lei, Qingsheng Zhu, Jun Chen, Hai Lin, Peng Yang

Abstract

In this paper, we propose an automatic PAM (Partitioning Around Medoids) clustering algorithm for outlier detection. The method has two phases: clustering and outlier scoring. In the clustering phase, we determine the number of clusters automatically by combining the PAM clustering algorithm with a specific cluster validation metric. This step is vital for finding a clustering solution that best fits the given data set, especially for PAM, which needs the number of clusters as an input. In the scoring phase, we assign each data instance an outlier score according to the cluster structure. Experiments on several data sets show that the proposed algorithm achieves a higher detection rate together with a lower false alarm rate than state-of-the-art outlier detection techniques, and that it can be an effective solution for detecting outliers.
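The two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract names neither the validation metric nor the scoring formula. The sketch assumes the mean silhouette width as the validation metric, a simplified first-improvement swap search with random initialization in place of full PAM, and an outlier score equal to the distance from a point to the nearest medoid of a sufficiently large cluster. All function names and the `min_frac` and `k_max` parameters are hypothetical.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def assign(points, medoids):
    """Label each point with the position (0..k-1) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: math.dist(p, points[medoids[j]]))
            for p in points]


def pam(points, k, seed=0):
    """Simplified PAM-style search: accept any medoid swap that lowers the cost."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # random start, not PAM's BUILD step
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    return medoids, assign(points, medoids)


def silhouette(points, labels, k):
    """Mean silhouette width (assumed validation metric); singletons contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        sums, counts = [0.0] * k, [0] * k
        for j, q in enumerate(points):
            if j != i:
                sums[labels[j]] += math.dist(p, q)
                counts[labels[j]] += 1
        own = labels[i]
        others = [sums[c] / counts[c] for c in range(k) if c != own and counts[c] > 0]
        if counts[own] == 0 or not others:
            continue
        a, b = sums[own] / counts[own], min(others)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def auto_pam(points, k_max=6):
    """Phase 1: try k = 2..k_max and keep the clustering with the best silhouette."""
    best = None
    for k in range(2, k_max + 1):
        medoids, labels = pam(points, k)
        score = silhouette(points, labels, k)
        if best is None or score > best[0]:
            best = (score, k, medoids, labels)
    return best[1], best[2], best[3]


def outlier_scores(points, medoids, labels, min_frac=0.2):
    """Phase 2 (assumed rule): distance to the nearest medoid of a large cluster."""
    sizes = [labels.count(c) for c in range(len(medoids))]
    large = [m for m, s in zip(medoids, sizes) if s >= min_frac * len(points)]
    large = large or medoids  # fall back to all medoids if no cluster is large
    return [min(math.dist(p, points[m]) for m in large) for p in points]


# Toy data: two tight groups and one isolated point (the last one).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
          (30, -20)]
k, medoids, labels = auto_pam(points, k_max=4)
scores = outlier_scores(points, medoids, labels)
print(k, max(range(len(points)), key=scores.__getitem__))
```

Scoring against large-cluster medoids only is a design choice of this sketch: an isolated point that forms its own cluster would otherwise sit at distance zero from its own medoid and receive the lowest score.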

Keywords

outlier detection; PAM clustering algorithm; subtractive clustering; cluster validation

Journal of Software (JSW, ISSN 1796-217X)