Result: Regularization of document-term matrices using singular value decomposition.
Further information
This paper seeks to identify appropriate regularization methods for document-term matrices. Regularization is essential in machine learning for reducing overfitting and improving model generalization. The evaluation uses a document-term matrix generated from a vectorized wine review dataset, with cross-validation performed using random forest and gradient boosting regression algorithms. Using these evaluation scores, we identify the proper regularization methods for document-term matrices. All implementations are written in Python, and the source code is provided to ensure reproducibility. While L1 (lasso) and L2 (ridge or Tikhonov) regularization are widely used, this paper investigates Singular Value Decomposition (SVD) as an alternative approach, particularly suited for high-dimensional and noisy datasets. Three SVD-based regularization techniques are explored for document-term matrices in the context of natural language processing: low-rank approximation, feature orthogonalization, and principal component analysis. Each method is implemented and assessed based on execution time, memory consumption, and predictive performance measured by R² scores. [ABSTRACT FROM AUTHOR]
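The abstract names three SVD-based techniques but does not define them, so the following is only a minimal sketch of one common reading of each, not the authors' implementation. The function names, the toy matrix `dtm`, and the rank `k` are all hypothetical, and only NumPy's `np.linalg.svd` is assumed.

```python
import numpy as np


def low_rank_approximation(X, k):
    """Rank-k reconstruction of X; the result has the same shape as X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular triplets and multiply them back out.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def orthogonalize_features(X, k):
    """Project X onto its top-k right singular vectors.

    The k resulting columns are mutually orthogonal (assumed reading of
    "feature orthogonalization"; the paper may define it differently).
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k, :].T


def pca_scores(X, k):
    """Principal component scores: SVD of the column-centered matrix."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


# Hypothetical toy document-term matrix: 20 documents, 50 terms, sparse counts.
rng = np.random.default_rng(0)
dtm = rng.poisson(0.3, size=(20, 50)).astype(float)
k = 5

X_lr = low_rank_approximation(dtm, k)    # shape (20, 50), rank at most k
X_orth = orthogonalize_features(dtm, k)  # shape (20, k), orthogonal columns
X_pca = pca_scores(dtm, k)               # shape (20, k), zero-mean columns
```

Under this reading, the low-rank approximation keeps the original term columns but removes low-variance directions, while the other two replace the term columns with `k` derived features. Any of the three outputs could then be passed to a regressor for cross-validated R² scoring, as the abstract describes.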