Study to aid in the early detection and treatment of cancer
A Sudanese scientist, Mohanad Mohamed, has used next-generation gene expression data to aid in the early detection and classification of cancer outcomes in South Africa. The implications for public health policy have been taken up by provincial health departments for early intervention and treatment of cancer cases. Mohamed, who is a fellow of the sub-Saharan African Consortium for Advanced Biostatistics (SSACAB), describes his research.
Cancer, a non-communicable disease (NCD), is among the leading causes of death in both developed and developing countries. Gene expression profiling of tumors has improved the accuracy of cancer classification, leading to correct diagnoses and the application of effective therapies. We performed a comparative review of the binary-class predictive ability of seven classification methods: support vector machines with the radial basis kernel (SVM (RK)), the linear kernel (SVM (LK)), and the polynomial kernel (SVM (PK)); artificial neural networks (ANN); random forests (RF); k-nearest neighbor (KNN); and naive Bayes (NB). All seven were applied to publicly available gene expression data from cancer research. NB outperformed the other methods in terms of accuracy, sensitivity, specificity, kappa coefficient, area under the curve (AUC), and balanced error rate (BER), making it the best classifier for our datasets.
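For readers unfamiliar with naive Bayes, the following is a minimal sketch of a Gaussian naive Bayes classifier for two-class expression data. It is an illustration only, not the code used in the study: the function names, the variance floor, and the tiny three-gene dataset are all invented for the example, and the Gaussian form of the per-gene likelihood is an assumption.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior and per-gene mean/variance for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        columns = list(zip(*rows))  # one tuple of values per gene
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "means": [mean(col) for col in columns],
            # The small floor (an assumption) avoids division by zero
            # when a gene is constant within a class.
            "variances": [pvariance(col) + var_floor for col in columns],
        }
    return model


def predict_gaussian_nb(model, x):
    """Return the class label with the highest log posterior for sample x."""
    def log_posterior(params):
        total = params["log_prior"]
        # "Naive" step: genes are treated as independent given the class,
        # so per-gene Gaussian log-likelihoods are simply summed.
        for value, mu, var in zip(x, params["means"], params["variances"]):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mu) ** 2 / (2 * var)
        return total

    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical toy data: rows are samples, columns are three genes.
X = [
    [2.1, 7.9, 5.0], [1.9, 8.2, 4.8], [2.3, 8.0, 5.1],  # normal
    [6.0, 3.1, 5.0], [5.8, 2.9, 4.9], [6.2, 3.3, 5.2],  # tumor
]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [2.0, 8.1, 5.0]))
print(predict_gaussian_nb(model, [6.1, 3.0, 5.0]))
```

The independence assumption keeps the number of estimated parameters proportional to the number of genes, which is one commonly cited reason the method can cope with many genes and few samples.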
The study used ten microarray gene expression datasets covering the four most common cancer types among men and women, with a total sample size of 681. The datasets are publicly available from the Gene Expression Omnibus (GEO) repository. The main objective was to identify the best-performing classification method for gene expression data, which would be helpful in the development of targeted therapies. We did not use data from South Africa, but we hope to apply this methodology to South African cancer data when it becomes available.
The work was completed as my master’s thesis at UKZN. Thereafter, I extended this work using an ensemble approach, which combines different methods in order to improve performance, and published the results in a conference paper entitled “Using stacking ensemble for microarray-based cancer classification”.
Why is this study important?
This study will enhance the capacity to select useful biomarkers needed for accurate cancer classification and prediction. It may help in the early detection of cancer and the application of stage-specific therapies to patients. According to the 2014 report of the National Institute for Communicable Diseases (NICD) of South Africa, breast cancer was the leading cancer type in females, with 8,230 cases (21%), followed by basal cell carcinoma (BCC) with 7,030 cases (18.61%). Among males, BCC was the leading cancer, with 9,322 cases (25.35%), followed by prostate cancer with 7,057 cases (19.18%). These statistics underscore the cancer burden in South Africa and suggest the importance of studies that could reduce incidence and enhance patient prophylaxis.
What were the key findings and why are they important? What are the implications of the findings for the intervention and treatment of cancer?
The use of high-dimensional microarray gene expression data has necessitated the development of powerful statistical classification methods such as support vector machines (SVM), artificial neural networks (ANN), random forests (RF), linear discriminant analysis (LDA), naive Bayes (NB), and k-nearest neighbor (KNN), among others. In this work, we compared the performance of seven such methods (SVM (linear kernel), SVM (polynomial kernel), SVM (radial basis kernel), ANN, RF, NB, and KNN), using seven performance measures (accuracy, kappa, sensitivity, specificity, area under the curve (AUC), receiver operating characteristic (ROC) curve, and balanced error rate (BER)). A key finding was that in seven of the ten datasets, NB performed the best, followed by SVM (LK), ANN, and RF. We therefore recommend the NB method to clinicians for classifying cancer patients into stages or tumor types, so that the right therapies can be administered based on the stage of the disease.
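To make the performance measures named above concrete, here is a minimal sketch of how they can be computed for a binary classifier. It uses the standard textbook definitions rather than the study's own code; the function names and example labels are hypothetical, and the sketch assumes both classes are present in the true labels.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    kappa: float
    balanced_error_rate: float


def binary_metrics(y_true, y_pred, positive=1):
    """Summarise a binary classifier from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_expected) / (1 - p_expected)

    # Balanced error rate: mean of the two per-class error rates.
    ber = 0.5 * (fn / (tp + fn) + fp / (tn + fp))
    return BinaryMetrics(accuracy, sensitivity, specificity, kappa, ber)


def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 4 tumor samples (1) and 6 normal samples (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Kappa and BER are useful alongside accuracy when the classes are unbalanced, because a classifier that always predicts the larger class can still score a high accuracy.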
What are the implications for public policy?
The implications for public health policy are straightforward. If individuals are diagnosed early for any of the NCDs using the techniques to be developed, early intervention and treatment can be put in place, which in turn will save and prolong lives. In addition, if the methods can help identify candidate genes for a disease, that information can be used to estimate an individual's risk, so that preventative measures can be put in place and a given condition can be accurately diagnosed and treated.
What’s the next step after this study?
The computational methods still face methodological challenges, including how to deal with high dimensionality, characterized by a large number of genes or probes and a much smaller number of samples, which requires dimension reduction. In addition, recent technological advances have led to next-generation sequence data, which contain a large number of biomarkers and genes that may or may not be associated with a given disease. Thus, there is an urgent need to develop an integrated approach to gene selection in cancer survival studies that jointly utilizes both sources of information, namely microarray and sequence data. For the classification problem, several parametric, non-parametric and semiparametric statistical methods have been proposed within the last decade, but none has been unanimously accepted as the gold standard. In my recent master's research at UKZN, support vector machines (SVM) proved superior to the k-nearest neighbor, random forests, naive Bayes and artificial neural networks methods, but more research is still needed in this regard.
Are there any new findings?
Currently, I am using negative binomial linear discriminant analysis (NBLDA) to analyze RNA-Seq data, which is discrete in nature. I downloaded RNA-Seq data on breast cancer and liver hepatocellular carcinoma from The Cancer Genome Atlas (TCGA). This type of data contains genes that are noisy or irrelevant for distinguishing cancer classes. Therefore, prior to classification, pre-processing steps are applied to remove them. The data include early, late, and normal samples. NBLDA achieved 71% classification accuracy on this problem. In addition, I applied the seven methods I used previously and found that the support vector machine with polynomial kernel (SVM-PK) had 75% classification accuracy, followed by the support vector machine with linear kernel (SVM-LK) at 73% accuracy. The work to improve methods for cancer classification is still in progress. We have yet to develop a hybrid model that can integrate both types of data (RNA-Seq and microarray).
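The interview does not say which pre-processing steps were used to remove noisy genes. As one possible illustration, the sketch below drops genes with very few reads overall and then keeps the genes whose log-transformed counts vary most across samples. The function name, the thresholds, and the gene names and counts are all hypothetical, and this is a generic filtering recipe rather than the procedure applied to the TCGA data.

```python
import math
from statistics import pvariance


def filter_genes(counts, min_total=10, top_k=2):
    """Select genes from raw RNA-Seq counts.

    counts: dict mapping gene name -> list of read counts, one per sample.
    Returns up to top_k gene names, most variable first.
    """
    # Step 1: discard genes with too few reads to be informative.
    expressed = {g: c for g, c in counts.items() if sum(c) >= min_total}

    # Step 2: rank the remaining genes by the variance of log2(count + 1),
    # which damps the influence of a few very large counts.
    def log_variance(values):
        return pvariance([math.log2(v + 1) for v in values])

    ranked = sorted(expressed, key=lambda g: log_variance(expressed[g]),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical counts for four genes across four samples.
counts = {
    "GENE_A": [0, 1, 0, 2],          # almost unexpressed
    "GENE_B": [100, 110, 95, 105],   # expressed but nearly constant
    "GENE_C": [5, 200, 8, 350],      # varies strongly between samples
    "GENE_D": [50, 400, 60, 45],     # varies moderately
}
print(filter_genes(counts))
```

A filter of this kind ignores the class labels, so it can be applied before splitting the data for training and testing without leaking outcome information into the gene list.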