| Abstract: |
AI-driven prediction models face significant challenges in low-resource educational contexts, where cold-start conditions and severe class imbalance are prevalent. This study addresses the problem by proposing a methodological framework that integrates NLP-based preprocessing for feature extraction, advanced imbalance-aware resampling, and a comprehensive multi-model comparison. We evaluated the framework on a small, authentic dataset of learners from Mali, applying ten classifiers, including SVM, Random Forest, XGBoost, and ensemble methods. Model performance was assessed using stratified 10-fold cross-validation, with and without resampling via Random Oversampling (ROS) and SMOTE, and measured by accuracy, F1-score, and AUC-ROC. Results show that the ensemble methods, specifically Random Forest, Gradient Boosting, and XGBoost, achieved over 90% accuracy and superior F1-scores after SMOTE, significantly outperforming both the baselines and ROS. |
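
The evaluation protocol described above (stratified 10-fold cross-validation combined with resampling) can be illustrated with a minimal sketch in plain Python. It is not the study's implementation: the toy labels, the helper names `stratified_folds` and `random_oversample`, and the choice to resample only the training portion of each fold are all assumptions made for illustration. Only Random Oversampling is shown; SMOTE would instead synthesize new minority samples by interpolating between neighbors.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until every class matches the majority."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical toy labels (not the study's data): 90 majority, 10 minority.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=10)

for fold_number, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Assumption: resample the training portion only, so the held-out
    # fold keeps its natural class imbalance.
    train_balanced = random_oversample(train_idx, labels, seed=fold_number)
    train_counts = Counter(labels[i] for i in train_balanced)
    test_counts = Counter(labels[i] for i in test_idx)
```

Resampling inside each fold, rather than once before splitting, is a common way to keep duplicated or synthetic minority samples out of the test data; whether the study followed this order is not stated in the abstract.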