Kaggle Competition: Child Mind Institute – Problematic Internet Use
Kaggle is a platform where anyone can host a data science competition. This competition was about the following:
Can you predict the level of problematic internet usage exhibited by children and adolescents, based on their physical activity? The goal of this competition is to develop a predictive model that analyzes children’s physical activity and fitness data to identify early signs of problematic internet use. Identifying these patterns can help trigger interventions to encourage healthier digital habits.
Available was the Healthy Brain Network dataset, consisting of five thousand 5-22 year-olds who had undergone clinical and research screenings. Much of the data was categorical, which made regression methods a good contender for a solid model. Additionally, a dataset of actigraphy files was provided; these were recorded by participants who wore an accelerometer for 30 days. However, after some feature analysis, it turned out that these files did not significantly improve the model.
Things I was doing:
I was running a machine learning pipeline using multiple models (LightGBM, XGBoost, and CatBoost) in an ensemble configuration to predict the target variable (sii) in the dataset. I did significant preprocessing, feature engineering, and tuning, and used the quadratic weighted kappa (QWK) as the evaluation metric. The code also handles missing values with KNN imputation, encodes categorical features, and optimizes the thresholds used for the final predictions.
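To make that setup concrete, here is a minimal sketch of the imputation, metric, and thresholding pieces. The column handling and the default threshold values (0.5, 1.5, 2.5) are illustrative assumptions, not the exact values from my notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import cohen_kappa_score

def impute_numeric(df: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Fill missing numeric values using KNN imputation."""
    df = df.copy()
    num_cols = df.select_dtypes(include=np.number).columns
    df[num_cols] = KNNImputer(n_neighbors=n_neighbors).fit_transform(df[num_cols])
    return df

def quadratic_weighted_kappa(y_true, y_pred) -> float:
    """Competition metric: quadratic weighted kappa between ordinal labels."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

def apply_thresholds(raw_preds: np.ndarray, thresholds=(0.5, 1.5, 2.5)) -> np.ndarray:
    """Cut continuous regressor output into the four ordinal sii classes."""
    return np.digitize(raw_preds, thresholds)
```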
Using sii as the target variable was a mistake, because sii is determined by binning PCIAT-PCIAT_Total, and predicting that continuous score directly (then binning it afterwards) would have been easier and would have kept more information during training.
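For reference, sii is just PCIAT-PCIAT_Total cut into four severity bands. The cut-offs below (None up to 30, Mild 31-49, Moderate 50-79, Severe 80 and above) are the commonly cited PCIAT thresholds and should be treated as an assumption here:

```python
import numpy as np

def pciat_total_to_sii(pciat_total: np.ndarray) -> np.ndarray:
    # Assumed bin edges: <=30 -> 0 (None), 31-49 -> 1 (Mild),
    # 50-79 -> 2 (Moderate), >=80 -> 3 (Severe).
    return np.digitize(pciat_total, bins=[31, 50, 80])
```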
Voting Regressor:
I have created an ensemble model by combining LightGBM, XGBoost, and CatBoost using a VotingRegressor. This approach leverages the strengths of each model to produce a more robust final prediction.
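A sketch of what such an ensemble looks like; the hyperparameters here are placeholders, not my tuned values:

```python
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import VotingRegressor

# Each base model is trained on the same data; their predictions are averaged.
ensemble = VotingRegressor(
    estimators=[
        ("lgbm", LGBMRegressor(n_estimators=500, learning_rate=0.05)),
        ("xgb", XGBRegressor(n_estimators=500, learning_rate=0.05)),
        ("cat", CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0)),
    ]
)
# ensemble.fit(X_train, y_train)
# raw_preds = ensemble.predict(X_valid)
```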
Cross-validation:
Using Stratified K-Fold cross-validation, I split the data into five parts to train and validate the model, ensuring a more stable evaluation of performance.
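Sketched out, the loop looks roughly like this, reusing the ensemble and helper functions from the snippets above (X and y are assumed to be the prepared features and the sii target):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, valid_idx in skf.split(X, y):
    X_tr, X_va = X.iloc[train_idx], X.iloc[valid_idx]
    y_tr, y_va = y.iloc[train_idx], y.iloc[valid_idx]
    ensemble.fit(X_tr, y_tr)                       # refit the ensemble on this fold
    preds = apply_thresholds(ensemble.predict(X_va))
    fold_scores.append(quadratic_weighted_kappa(y_va, preds))

print(f"Mean QWK over 5 folds: {np.mean(fold_scores):.4f}")
```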
Things I could have done better:
Model Tuning
While I tried to fine-tune the models' hyperparameters using Optuna, I did not try other methods like RandomizedSearchCV or GridSearchCV, which might have achieved better performance.
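For illustration, a RandomizedSearchCV pass over the LightGBM component might have looked like this; the parameter ranges are made up for the example:

```python
from lightgbm import LGBMRegressor
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(200, 1000),
    "learning_rate": uniform(0.01, 0.1),   # samples from [0.01, 0.11)
    "num_leaves": randint(16, 128),
    "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0)
}

search = RandomizedSearchCV(
    LGBMRegressor(),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
# search.fit(X, y); search.best_params_ could then feed back into the ensemble.
```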
Feature Selection
Although I applied some feature weights, I might have benefited from feature selection techniques like Recursive Feature Elimination or feature importance ranking to remove less important features and further improve model performance.
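One way this could have been slotted in is recursive feature elimination with cross-validation on the LightGBM model; the step size and scoring choice below are assumptions for the sketch:

```python
from lightgbm import LGBMRegressor
from sklearn.feature_selection import RFECV

selector = RFECV(
    estimator=LGBMRegressor(n_estimators=300),
    step=5,                          # drop 5 features per elimination round
    cv=5,
    scoring="neg_mean_squared_error",
)
# selector.fit(X, y)
# X_reduced = X.loc[:, selector.support_]  # keep only the selected features
```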