Text Clustering Using a Suffix Tree Similarity Measure

doi:10.4304/jcp.6.10.2180-2186

Journal of Computers, Vol 6, No 10 (2011), 2180-2186, Oct 2011

Chenghui HUANG, Jian YIN, Fang HOU

Abstract

In text mining, popular methods use bag-of-words models, which represent a document as a vector. These methods ignore word sequence information, and they give good clustering results only in some special domains. This paper proposes a new similarity measure based on the suffix tree model of text documents. It analyzes word sequence information and then computes the similarity between the text documents of a corpus by applying a suffix tree similarity combined with the TF-IDF weighting method. Experimental results on the standard document benchmark corpora Reuters and BBC indicate that the new text similarity measure is effective. Compared with the results of two other frequent-word-sequence-based methods, the proposed method achieves an improvement of about 15% in average F-Measure score.
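
The idea of combining word sequence information with TF-IDF weighting can be illustrated with a minimal sketch. This is not the authors' implementation: instead of a compressed suffix tree, it enumerates the nodes of a depth-limited word-level suffix trie (every word sequence of up to `max_len` words), and the names `phrase_counts`, `tfidf_vectors`, `cosine`, and `max_len`, as well as the particular weighting formula `(1 + log tf) * log(N / df)`, are assumptions made for illustration only.

```python
import math
from collections import Counter


def phrase_counts(text, max_len=3):
    """Count every word sequence of up to max_len words in text.

    Each sequence is a prefix of some word-level suffix, so it corresponds
    to a node of a depth-limited suffix trie of the document (assumption:
    a naive trie stands in for a real suffix tree here).
    """
    words = text.lower().split()
    counts = Counter()
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            counts[tuple(words[start:end])] += 1
    return counts


def tfidf_vectors(docs, max_len=3):
    """Weight each phrase by (1 + log tf) * log(N / df) (assumed formula)."""
    per_doc = [phrase_counts(d, max_len) for d in docs]
    df = Counter()
    for counts in per_doc:
        df.update(counts.keys())  # each document counts once per phrase
    n = len(docs)
    return [
        {p: (1 + math.log(tf)) * math.log(n / df[p]) for p, tf in counts.items()}
        for counts in per_doc
    ]


def cosine(u, v):
    """Cosine similarity between two sparse phrase-weight vectors."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus: documents 0 and 1 use the same words in a different order.
docs = ["cat ate cheese", "cheese ate cat", "dog chased ball"]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

A plain bag-of-words model would treat the first two toy documents as identical, because they contain the same words. Here they share only their single-word phrases, so their similarity falls strictly between 0 and 1, which is the kind of word-order sensitivity the abstract refers to.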

Keywords

clustering algorithm; suffix tree; document model; similarity measure

Journal of Computers (JCP, ISSN 1796-203X)
