Emerging Patterns and Classification Algorithms for DNA Sequence

doi:10.4304/jsw.6.6.985-992

Journal of Software, Vol 6, No 6 (2011), 985-992, Jun 2011

Xiaoyun Chen, Jinhua Chen

Abstract

Existing machine learning methods for DNA sequence classification achieve good results, but they represent each DNA sequence as a discrete multi-dimensional vector. When the sequences in a DNA sequence database do not have a fixed length, or when some characters are omitted, these methods cannot be applied directly. In this paper, we define a new support measure and a growth rate of support to find frequent emerging patterns in a DNA sequence database, and we present FESP, a classification algorithm based on frequent emerging sequence patterns. These patterns preserve the information carried by the order of bases in gene sequences and can capture interactions among bases. FESP classifies new DNA sequences by applying classification rules constructed from the frequent emerging sequence patterns of each class. The method works on sequences of different lengths or with omitted characters and shows good performance.
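The abstract does not spell out the new support definition or how FESP builds its rules, so the following is only a minimal sketch of the general idea, using the conventional emerging-pattern notions as stand-ins. It assumes that a pattern is a contiguous run of bases of fixed length `k`, that support is the fraction of sequences in a class containing the pattern, and that the growth rate is the ratio of the supports in two classes. All function names, the thresholds, and the simple vote-counting classifier are hypothetical, and omitted characters are not handled here.

```python
def support(pattern, sequences):
    """Fraction of sequences that contain `pattern` as a contiguous substring.

    Assumption: this is the conventional support, not the paper's new definition.
    """
    if not sequences:
        return 0.0
    return sum(pattern in s for s in sequences) / len(sequences)


def growth_rate(pattern, target, background):
    """Ratio of the pattern's support in `target` to its support in `background`."""
    s_target = support(pattern, target)
    s_background = support(pattern, background)
    if s_background == 0:
        # Pattern never occurs in the other class: unbounded growth if it occurs at all.
        return float("inf") if s_target > 0 else 0.0
    return s_target / s_background


def candidate_patterns(sequences, k):
    """All distinct length-k substrings; works for sequences of any length."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def emerging_patterns(target, background, k, min_support, min_growth):
    """Patterns that are frequent in `target` and grow sharply relative to `background`."""
    found = {}
    for p in candidate_patterns(target, k):
        if support(p, target) >= min_support:
            g = growth_rate(p, target, background)
            if g >= min_growth:
                found[p] = g
    return found


def classify(sequence, patterns_by_class):
    """Hypothetical rule: pick the class whose patterns match the sequence most often."""
    scores = {
        label: sum(1 for p in patterns if p in sequence)
        for label, patterns in patterns_by_class.items()
    }
    return max(scores, key=scores.get)


# Toy data with unequal sequence lengths (made up for illustration only).
class_a = ["ACGTAC", "TACGT", "ACGA"]
class_b = ["TTGCA", "GGTTG", "TTGA"]

eps_a = emerging_patterns(class_a, class_b, k=3, min_support=0.6, min_growth=2.0)
eps_b = emerging_patterns(class_b, class_a, k=3, min_support=0.6, min_growth=2.0)
rules = {"A": eps_a, "B": eps_b}
```

Because patterns are matched as substrings, a new sequence such as `"GACGTT"` can be scored with `classify("GACGTT", rules)` without padding or truncating it to a fixed vector length, which is the property the abstract emphasizes.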

Keywords

emerging sequence pattern; classification rule; feature selection; DNA

Journal of Software (JSW, ISSN 1796-217X)
