The College of Administration and Economics at the University of Baghdad discussed a PhD dissertation in the field of statistics by the student (Noor Ayad Mohammed), titled (Estimating the survival function of the semi-parametric model using machine learning methods with practical application), under the supervision of (Prof. Dr. Entsar Arebe Fadam).
Estimating survival functions is a topic of growing importance in statistics because it combines statistical, medical and biological aspects and is used to study the relationship between time and the probability of survival. Its importance increases when the estimation methods are linked to modern, highly accurate artificial intelligence methods such as machine learning.
The dissertation adopted the Cox model, one of the most widely used statistical models for estimating survival from medical data. It is an important and commonly used semi-parametric model: its nonparametric part is usually estimated using specific iterative nonparametric methods, while the parameters of its parametric part are usually estimated using the partial maximum likelihood method.
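For reference, the standard form of the Cox model and its partial likelihood can be written as below. This is the textbook formulation, not notation taken from the dissertation: the symbols (baseline hazard h_0, coefficient vector \beta, covariates x_i, event indicator \delta_i, risk set R(t_i)) are assumptions introduced here, and tied event times are assumed absent.

```latex
% Hazard: an unspecified (nonparametric) baseline times a parametric factor
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)

% Corresponding survival function
S(t \mid x) = S_0(t)^{\exp(\beta^{\top} x)},
\qquad
S_0(t) = \exp\!\left(-\int_0^t h_0(u)\,du\right)

% Partial likelihood used to estimate \beta (no tied event times)
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp\!\left(\beta^{\top} x_i\right)}
       {\sum_{j \in R(t_i)} \exp\!\left(\beta^{\top} x_j\right)}
```

The baseline hazard h_0 cancels out of the partial likelihood, which is why \beta can be estimated without specifying the nonparametric part.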
However, the traditional iterative methods for estimating the nonparametric part cannot deal with the problems of current survival data, including: the high dimensionality of the variables studied (especially medical image data), the presence of correlations between variables, censored data, and large sample sizes resulting from the development of methods for collecting and storing data. In the presence of these problems, the semi-parametric Cox model cannot be used with those traditional methods.
It was therefore necessary to use modern methods that can deal with the problems of current survival data to estimate the non-parametric part. One of these methods is machine learning, a branch of artificial intelligence.
Because machine learning first requires data on the case under study, the data for the non-parametric part of the Cox model are real mammogram images of breast cancer patients in Iraq. Machine learning was then applied to these data by implementing six algorithms to extract the important features of each image (PCA, KPCA, Fast ICA, NMF, Truncated SVD, CNN), together with five machine learning algorithms for survival estimation (KNN, Decision tree, Random forests, SVM, Gradient Boosting).
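The two-step idea described above (extract features from each image, then fit a learner on those features) can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions, not the dissertation's implementation: the "images" are random arrays, the target is a made-up survival-type score, censoring is ignored, and the choice of 20 components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened grayscale images (200 images of 32x32 pixels).
n_images, n_pixels = 200, 32 * 32
X = rng.random((n_images, n_pixels))

# Hypothetical target: a survival-type score driven by a few pixels plus noise.
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.1, size=n_images)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: reduce each image to 20 features with Truncated SVD.
# Step 2: fit Gradient Boosting on the extracted features.
model = make_pipeline(
    TruncatedSVD(n_components=20, random_state=0),
    GradientBoostingRegressor(random_state=0),
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
print(f"Test MSE: {mse:.4f}")
```

Swapping `TruncatedSVD` for `PCA`, `KernelPCA`, `FastICA` or `NMF` from `sklearn.decomposition`, or replacing the regressor with another estimator, gives other extractor and learner combinations from the lists above.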
Two criteria were adopted for comparing the models, MSE and the c-index. They showed that the best model for estimating survival was the one resulting from applying the Truncated SVD algorithm with the Gradient Boosting algorithm.
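As a rough illustration of the two criteria, the sketch below computes MSE and Harrell's concordance index (c-index) in plain Python. The function names and the toy data are hypothetical, and the source does not say how censored observations were handled in its MSE, so this should be read as one common convention rather than the dissertation's procedure.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def concordance_index(times, events, risk_scores):
    """Harrell's c-index: a higher risk score should go with a shorter survival time.

    times: observed times; events: 1 if the event was observed, 0 if censored.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up for illustration): the third subject is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]

c = concordance_index(times, events, risk)  # 4 of 5 comparable pairs agree: 0.8
err = mse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(c, err)
```

A c-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, while a lower MSE is better, so the two criteria assess ranking and numerical accuracy respectively.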
The five machine learning algorithms were also modified to estimate the survival of the non-parametric part of the Cox model by extracting the important variables in two stages. The model put forward from this modification results from using the CNN algorithm and PCA with the Gradient Boosting algorithm.
The data for the parametric part of the Cox model are a set of variables affecting the disease, recorded for the same group of patients. The parameters of this part were estimated using the penalized maximum likelihood method.
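In general terms, a penalized likelihood estimate of the parametric part maximizes the log partial likelihood minus a penalty. The notation below is an assumption made for illustration: the source does not state which penalty P or tuning parameter \lambda was used.

```latex
% \ell(\beta): log partial likelihood of the Cox model
% P(\beta): penalty, e.g. \|\beta\|_1 (lasso) or \|\beta\|_2^2 (ridge)
% \lambda \ge 0: tuning parameter controlling the amount of shrinkage
\hat{\beta} = \arg\max_{\beta}\;
  \bigl\{\, \ell(\beta) - \lambda\, P(\beta) \,\bigr\}
```

Setting \lambda = 0 recovers the ordinary partial maximum likelihood estimate.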
Researchers in statistics and medicine can estimate the non-parametric part of the Cox model with modern methods such as machine learning algorithms, whose results showed high estimation accuracy, instead of relying on the limited traditional non-parametric methods.
For data consisting of radiological (tomographic) medical images of a part of the body, we recommend that researchers adopt the modified model proposed in the dissertation.