Improving Requirements Classification with SMOTE-Tomek Preprocessing

arXiv – CS AI|Barak Or|May 28, 2026 at 04:00 AM

🤖AI Summary

Researchers applied SMOTE-Tomek preprocessing to address class imbalance in requirements engineering classification, achieving 76.16% accuracy with logistic regression compared to a 58.31% baseline. The technique combines synthetic minority oversampling with Tomek link removal and stratified K-fold validation on the PROMISE dataset of 969 categorized requirements.

Analysis

This research addresses a common challenge in machine learning applied to software engineering: imbalanced datasets in which minority classes are underrepresented. The PROMISE dataset contains functional and non-functional requirements in unequal proportions, a common real-world scenario that can bias classifiers toward majority classes. The SMOTE-Tomek hybrid tackles this in two steps. SMOTE generates synthetic minority samples by interpolating between existing minority samples and their nearest minority neighbors. Tomek link removal then discards pairs of opposite-class samples that are each other's nearest neighbors, which cleans up overlap along the decision boundary.
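The two steps can be sketched in plain Python to make the mechanics concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles only two classes, uses Euclidean distance on numeric feature tuples, and drops both members of each Tomek link. The function names (`smote`, `tomek_links`, `smote_tomek`) and the defaults `k=3` and `seed=0` are hypothetical choices made for this sketch.

```python
import math
import random


def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to `point` by Euclidean distance."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.
    Assumes `minority` holds at least two samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbour = rng.choice(nearest_neighbors(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different labels."""
    def nn(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nn(i)
        if i < j and y[i] != y[j] and nn(j) == i:
            links.append((i, j))
    return links


def smote_tomek(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class to parity with the majority class,
    then drop both members of every Tomek link (binary labels assumed)."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(X) - len(minority)) - len(minority)
    X_res = list(X) + smote(minority, n_new, k, seed)
    y_res = list(y) + [minority_label] * n_new
    drop = {i for pair in tomek_links(X_res, y_res) for i in pair}
    keep = [i for i in range(len(X_res)) if i not in drop]
    return [X_res[i] for i in keep], [y_res[i] for i in keep]
```

For requirements text, the feature tuples would come from a vectorizer such as TF-IDF. The summary does not say which features the study used, so that part is left open here.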

Class imbalance is a persistent obstacle in software requirements classification, where non-functional requirements often appear less frequently than functional ones. Models trained on such data can report misleadingly high accuracy while recalling few minority-class examples. Stratified K-fold cross-validation keeps each fold's class distribution close to that of the full dataset, so every fold contains minority examples and performance estimates are more stable across splits. Stratification does not by itself prevent data leakage; that also requires fitting any resampling step on the training folds only.
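A stratified split can be sketched without any library by dealing each class's indices round-robin across the folds. This is an illustrative sketch rather than the study's code: the function name `stratified_k_fold`, the defaults `k=5` and `seed=0`, and the 60/40 class counts in the usage lines are all assumptions, and the summary does not state how many folds the authors used.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep roughly
    the same class proportions as `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin so every fold gets a share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical class counts, not the PROMISE distribution.
labels = ["F"] * 60 + ["NF"] * 40
for train_idx, test_idx in stratified_k_fold(labels, k=5):
    # Any resampling would be fitted on train_idx only, so the held-out
    # test_idx samples are scored at their original class distribution.
    nf_share = sum(labels[i] == "NF" for i in test_idx) / len(test_idx)
```

Keeping synthetic samples out of the test fold is a design choice of this sketch. It means the reported score reflects performance on real, still-imbalanced data.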

The 17.85 percentage-point improvement over the baseline (58.31% to 76.16%) shows how strongly preprocessing choices can affect model performance. Logistic regression's result is notable because linear models can struggle with complex patterns, yet proper data preparation was enough here to produce a clear gain. For software development teams building automated requirements classification systems, the finding suggests a way to reduce manual categorization effort and improve consistency.

Future work should examine whether these results generalize to other requirements datasets and whether more complex classifiers benefit further from the same preprocessing. The scalability and interpretability of logistic regression make it well suited to production environments where stakeholders need to understand classification decisions. Practitioners should consider similar preprocessing strategies when deploying machine learning in domains with naturally imbalanced categorical data.

Key Takeaways

→SMOTE-Tomek preprocessing improved requirements classification accuracy from 58.31% to 76.16% using logistic regression
→The technique combines synthetic oversampling with Tomek link removal to balance minority and majority classes
→Stratified K-fold cross-validation preserved class distribution across validation folds for reliable performance estimates
→Simple models like logistic regression achieved strong results when properly paired with data preprocessing techniques
→The approach provides scalable, interpretable solutions for automated requirements engineering classification