The College of Administration and Economics at the University of Baghdad discussed a PhD dissertation in the field of Statistics by the student Afiaa Raheem Khudauir, titled "Performance of Classification and Variables Selection in Penalized Logistic Regression Model for High-Dimensional Data with Application", under the supervision of Prof. Dr. Saja Mohammad Hussien.
In many fields, notably medicine, the social sciences, and finance, the volume of data is growing sharply, driven by rapid advances in technology. This growth has led to the emergence of high-dimensional data (where the number of variables exceeds the sample size), which creates challenges for precision and target identification. The study used data that follow a logistic model. Classifying the binary response variable is therefore difficult because of multicollinearity among the explanatory variables.
To address this, the study uses penalization techniques, which reduce the number of variables and select the best variables for the model. This simplifies the model and helps it classify the binary outcome (0, 1).
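As a general illustration of what penalization means here, a penalized logistic regression estimate can be written as the minimizer of the negative log-likelihood plus a penalty term. The notation below (y_i, x_i, beta, lambda, P, w_j) is assumed for illustration and is not taken from the thesis; the lasso and weighted lasso penalties shown are the standard textbook forms, and the specific weights and correlation-based penalties used in the thesis are not reproduced.

```latex
% Generic penalized logistic regression (assumed notation):
% y_i in {0,1} is the response, x_i the p-vector of predictors,
% lambda >= 0 the tuning parameter, P(beta) the penalty.
\hat{\beta} = \arg\min_{\beta}
  \left\{ -\sum_{i=1}^{n} \left[ y_i\, x_i^{\top}\beta
  - \log\!\left(1 + e^{x_i^{\top}\beta}\right) \right]
  + \lambda\, P(\beta) \right\}

% Standard lasso penalty:
P_{\text{lasso}}(\beta) = \sum_{j=1}^{p} |\beta_j|

% Standard weighted lasso penalty with nonnegative weights w_j:
P_{\text{wlasso}}(\beta) = \sum_{j=1}^{p} w_j\, |\beta_j|
```

Because the absolute-value penalty can shrink some coefficients exactly to zero, estimation and variable selection happen in one step, which is what makes this family of methods usable when p exceeds n.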
In this thesis, several penalization methods were applied: weighted lasso estimation with a weight proposed by the researcher (type V), the Correlation-Based Elastic Net Penalty (CBEP), Correlation-Based Penalized Logistic Regression (CBPLR), and the Adjusted Adaptive Elastic Net Penalty (AAEL), alongside Partial Least Squares (PLS) and Penalized Partial Least Squares (PPLS). This approach was applied for the first time and yielded good results. The application involved two data sets: a large data set (p = 12,600, n = 100) and a data set collected by the researcher (p = 49, n = 41).
The second proposed method, PPLS combined with penalized logistic regression, was applied for the first time. It produced positive classification results on both the large data set and the data set collected by the researcher, and its results were comparable to those of the other methods used in the study.
These methods were compared on criteria including the number of selected variables, classification accuracy, misclassification rate, sensitivity, specificity, and the confusion matrix. Simulations were performed under three settings, (n = 100, p = 2000), (n = 40, p = 50), and (n = 100, p = 1000), each replicated 100 times. The experiments also considered several correlation levels (r = 0.99, 0.95, 0.75, 0.25). Across these cases, the methods performed well, achieving high classification accuracy while selecting a small number of variables. The analysis was carried out using a range of packages and functions in the R programming language.
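The comparison criteria named above all derive from the confusion matrix of a binary classifier. The following is a minimal Python sketch of those standard definitions, not the thesis's own R code; the function name `classification_metrics` and the example labels are hypothetical and chosen only for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute standard binary classification criteria from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    return {
        "confusion_matrix": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        # Sensitivity: share of actual positives classified as positive.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of actual negatives classified as negative.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical labels for illustration only (not data from the thesis).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a simulation study of the kind described, such criteria would typically be computed on each replication and then averaged over the replications for each method.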