MACHINE LEARNING METHODS FOR SOLVING TEXT AUTHOR IDENTIFICATION PROBLEMS
Keywords:
Machine learning, text author identification, natural language processing, feature extraction, supervised learning, unsupervised learning, deep learning.
Abstract
Text author identification is a central problem in natural language processing (NLP), with applications ranging from forensic analysis to copyright enforcement and literary studies. Machine learning (ML) has emerged as a powerful tool for addressing this challenge, offering algorithms capable of analyzing stylistic, lexical, and syntactic features in text. This paper explores state-of-the-art ML methods for text author identification, including supervised, unsupervised, and deep learning techniques. A comprehensive review of the literature is presented, highlighting the effectiveness of the various approaches. The discussion also outlines challenges such as data sparsity, feature selection, and ethical considerations. Experimental results demonstrate the impact of advanced ML models on classification accuracy and scalability. The findings emphasize the growing importance of machine learning in author attribution research.