Research paper
Downregulated transferrin receptor in the blood predicts recurrent MDD in the elderly cohort: A fuzzy forests approach
Introduction
Investigating the biological underpinnings of major depressive disorder (MDD) remains challenging, yet it is paramount for developing reliable diagnostic tools and effective treatments. Despite decades of research, elucidation of the exact molecular mechanisms is still in its infancy (Cai et al., 2015; Okbay et al., 2016; Hek et al., 2013; Wray et al., 2018; Hyde et al., 2016; Jansen et al., 2016). As a heterogeneous disorder, MDD is a complex dynamic system from both clinical (Cramer et al., 2016) and biological (Sibille and French, 2013) perspectives. The biological complexity of MDD can be studied by comparing gene expression patterns in affected and unaffected individuals. These dysregulated patterns can serve as a dynamic marker of the disorder.
At the molecular level, genes do not act in isolation; instead, they interact with each other in complex networks that might be disrupted in depression. In our previous study (Ciobanu et al., 2018), we explored whether genome-wide gene co-expression patterns are associated with depression. We applied Weighted Gene Co-expression Network Analysis (WGCNA) to transcriptomic data from 521 community-dwelling individuals aged over 65. We found that four clusters containing 1241 highly interacting genes were associated with recurrent MDD, but found no cluster associations with single episode depression, current MDD, or lifetime MDD. Using in-silico Enrichment and Signaling Pathway Impact Analysis (SPIA), we found that this gene pool was biologically meaningful for 13 known molecular pathways significantly dysregulated in recurrent MDD in the elderly (Ciobanu et al., 2018). While these findings were consistent with previous observations and provided new insights into the etiology of depression, they were limited by the biostatistical approach used in the analysis. The typical biostatistical approach is to fit a pre-defined linear function between the variables and the outcome. Although this approach is powerful in many scenarios, including candidate gene association studies, it can be suboptimal for whole-genome gene expression data. Transcriptome data are highly multidimensional, and complex non-linear biological processes underlie the gene expression levels measured in transcriptomic experiments. Molecular interactions, which play an important biological role in the observed gene expression levels, are not captured by traditional statistical methods. Machine learning (ML) provides an alternative approach to the analysis of transcriptome data, allowing complex linear and non-linear interactions between genes to be explored.
ML explicitly focuses on learning data-specific statistical functions to make generalizable predictions about affected individuals, which makes it a powerful tool for biomarker discovery.
Random forests (RF) is an established technique for classification and feature selection, owing to its advantages in dealing with relatively small sample sizes, high-dimensional feature spaces, and complex data structures. While RF is able to capture the true importance of features when the features are independent, it is known to be biased when features are correlated with one another and the correlation structure is not known a priori (Nicodemus and Malley, 2009), which is a typical scenario for transcriptome data. Fuzzy forests (FF), an extension of the RF algorithm, is designed to reduce this bias. FF takes advantage of the network structure between features and relies on WGCNA to create relatively uncorrelated clusters of highly correlated features (Zhang and Horvath, 2005). Within each cluster, FF uses recursive feature elimination with RF to select features (Díaz-Uriarte and Alvarez de Andrés, 2006). A final RF is then fit using the surviving features, and the features it selects are used to construct a predictive model (Conn et al., 2015; 2016).
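The module-wise screening structure described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: a greedy correlation grouping stands in for WGCNA, a standardized difference in class means stands in for RF variable importance, and all function names, thresholds and the toy data are hypothetical. In an actual FF analysis the survivors of this screening step would be passed to a final random forest.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    return mean((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def group_into_modules(features, min_abs_corr=0.8):
    """Greedy stand-in for WGCNA (assumption): a feature joins the first module
    whose seed feature it correlates with strongly, else it seeds a new module."""
    modules = []  # each module is a list of feature names; module[0] is the seed
    for name, values in features.items():
        for module in modules:
            if abs(pearson(values, features[module[0]])) >= min_abs_corr:
                module.append(name)
                break
        else:
            modules.append([name])
    return modules


def importance(values, labels):
    """Stand-in for RF variable importance (assumption):
    absolute difference between class means, in units of the feature's SD."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    sd = pstdev(values)
    return 0.0 if sd == 0 else abs(mean(pos) - mean(neg)) / sd


def screen_module(module, features, labels, drop_fraction=0.25, keep=1):
    """Recursive elimination inside one module: rank the survivors, drop the
    least important fraction, and repeat until only `keep` features remain."""
    survivors = list(module)
    while len(survivors) > keep:
        ranked = sorted(survivors,
                        key=lambda f: importance(features[f], labels),
                        reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        survivors = ranked[:max(keep, len(ranked) - n_drop)]
    return survivors


# Hypothetical toy data: 8 samples, 4 "genes", binary outcome.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [1, 2, 1, 2, 5, 6, 5, 6],  # separates the classes
    "g2": [2, 3, 2, 3, 5, 6, 5, 6],  # co-expressed with g1, slightly weaker
    "g3": [3, 1, 4, 1, 4, 1, 3, 1],  # unrelated to the outcome
    "g4": [6, 2, 8, 2, 8, 2, 6, 3],  # co-expressed with g3
}

modules = group_into_modules(features)
selected = [f for m in modules for f in screen_module(m, features, labels)]
print(modules)
print(selected)
```

Screening each module separately is what keeps correlated features from competing with one another across the whole feature space; only the per-module survivors meet in the final model.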
Although FF is based on WGCNA, these methodologies represent two different analytical strategies. WGCNA is primarily concerned with identifying important genes assumed to be involved in the same biological processes, which is useful for understanding the biological underpinnings of depression. However, given that depression is a biologically multifactorial disorder, it is likely that hundreds to thousands of genes are involved in the disease, which makes a WGCNA-derived gene set impractical as a diagnostic tool. RF, in contrast, aims to find a small number of genes sufficient for a good prediction of the response variable. Combining the two strategies (RF and WGCNA) in an FF framework may help to overcome the limitations of each individual method and enrich our understanding of the aetiology of depression.
To distinguish individuals affected by recurrent MDD from unaffected individuals using transcriptomic data, we applied Fuzzy Forests (FF), an ML algorithm that combines two established techniques, WGCNA and Random Forests (RF). FF effectively reduces the dimensionality of transcriptome data and therefore requires a smaller sample size to identify meaningful predictive markers than classic statistical or ML algorithms. To the best of our knowledge, this study is the first to utilize Fuzzy Forests for transcriptome data in psychiatric research.
Sample characteristics
The Sydney Memory and Ageing Study (SMAS) was initiated in 2005 to examine the clinical characteristics and prevalence of mild cognitive impairment and related syndromes, including depression, in a non-demented population aged 70-90 years at recruitment (N = 1037) (Sachdev et al., 2010). The phenotypic data were collected at four time points with 2-year intervals between assessments. Blood samples for gene expression analyses were collected at Wave 4 (N = 521), six years after baseline data
Demographics and clinical characteristics
The basic demographic and clinical characteristics of the cohort are presented in Table 1.
Training and test data
After partitioning the full dataset, the training set consisted of two groups: 19 recurrently depressed individuals (group [1]) and 346 individuals without recurrent MDD (group [0]). These class proportions (0.05 vs 0.95) are highly unbalanced. Using SMOTE, we balanced the training data to 38 observations in each group. The test data consisted of 8 recurrent MDD [1] and 148 non-recurrent MDD [0] individuals.
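To make the balancing step concrete, the following Python sketch shows the core idea of SMOTE (Chawla et al., 2002): each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. It is only an illustration under stated assumptions. The toy data are random, the function name and parameters are hypothetical, and the random under-sampling of the majority group to 38 is one possible way to reach the reported group sizes, not necessarily the procedure used in the study.

```python
import math
import random


def smote(minority, per_sample=1, k=5, seed=0):
    """Create `per_sample` synthetic points for every minority sample by
    interpolating towards one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, base in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position on the segment base -> neighbour
            synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


rng = random.Random(42)
# Hypothetical toy data with the group sizes reported for the training set.
minority = [[rng.gauss(1.0, 1.0) for _ in range(3)] for _ in range(19)]
majority = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(346)]

# Class proportions before balancing.
minority_share = len(minority) / (len(minority) + len(majority))
print(round(minority_share, 2), round(1 - minority_share, 2))

# One synthetic sample per real minority sample doubles the minority group.
synthetic = smote(minority, per_sample=1)
balanced_minority = minority + synthetic
# Assumption: the majority group is randomly under-sampled to the same size.
balanced_majority = rng.sample(majority, len(balanced_minority))
print(len(balanced_minority), len(balanced_majority))
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled group stays inside the region already occupied by the minority class instead of simply duplicating observations.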
Co-expression network and recurrent MDD-relevant features
To determine the power of
Downregulated transferrin receptor, TFRC, as a potential predictive marker for recurrent MDD
While machine learning is a powerful approach in genomic research, applying ML algorithms in psychiatry is challenging because large sample sizes are required to train a model on the massively multivariate structure of transcriptomic data. Using a co-expression network feature reduction technique prior to training an ML model effectively alleviates the p>>n problem without information loss, so that a smaller sample suffices to identify meaningful predictive markers. In this study, we
Limitations
Although we report on the ability of our model to predict recurrently depressed individuals, these results should be treated with caution. While we identified TFRC as the most predictive gene for recurrent depression in the elderly, our sample was small relative to the feature space and could be a source of poor generalizability. While we corrected our data for age, sex, RINs and latent non-biological variables, we were unable to account for medications taken, comorbidities, cognitive status
Conclusions
Using the fuzzy forests framework, we identified that the most predictive gene, TFRC, can predict recurrent depression in the elderly with an accuracy of 63%. This finding, coupled with our previous observation that blood TFRC mRNA is downregulated in individuals with recurrent MDD compared with those without, suggests that TFRC may serve as a recurrent MDD-specific predictive marker and provide some insights into the pathophysiology of depression. Although our study is exploratory in nature providing preliminary
Funding
This work was supported by funding from the National Health and Medical Research Council (NHMRC; ID 1060524 to BTB, SCW, SR, JT) of Australia. The Sydney Memory and Ageing Study (SMAS) was supported by a National Health and Medical Research Council (NHMRC)/Australian Research Council Strategic Award (ID 401162), NHMRC Program Grants (ID 350833 and 568969) and a Project Grant (ID 1045325). The Older Australian Twins Study (OATS) was funded by an NHMRC/ARC Strategic Award Grant of the Ageing Well
Declaration of Competing Interest
The authors declare no conflict of interest.
Acknowledgments
We would like to thank the Sydney MAS and OATS participants and their respective research teams. We also thank all members of the Centre for Healthy Brain Ageing (CHeBA, UNSW) and the Discipline of Psychiatry research group (University of Adelaide) for their invaluable input during data collection and for discussions during manuscript preparation.
References (33)
- et al. Molecular signatures of major depression. Curr. Biol. (2015)
- et al. Interactive big data resource to elucidate human immune pathways and diseases. Immunity (2015)
- et al. Innate and adaptive immunity in the development of depression: an update on current knowledge and technological advances. Prog. Neuropsychopharmacol. Biol. Psychiatry (2016)
- et al. A genome-wide association study of depressive symptoms. Biol. Psychiatry (2013)
- et al. Development and validation of a geriatric depression screening scale: a preliminary report. J. Psychiatr. Res. (1982)
- et al. The soluble transferrin receptor as a marker of iron homeostasis in normal subjects and in HFE-related hemochromatosis. Haematologica (2005)
- Susceptibility genes are enriched in those of the herpes simplex virus 1/host interactome in psychiatric and neurological disorders. Pathog. Dis. (2013)
- et al. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002)
- et al. Co-expression network analysis of peripheral blood transcriptome identifies dysregulated protein processing in endoplasmic reticulum and immune response in recurrent MDD in older adults. J. Psychiatr. Res. (2018)
- et al. Fuzzy Forests: Extending Random Forests for Correlated, High-Dimensional Data (2015)
- Fuzzy forests: a new WGCNA based random forest algorithm for correlated, high-dimensional data. J. Stat. Softw.
- Major depression as a complex dynamic system. PLoS One
- The Neuropsychiatric Inventory. Compr. Assess. Psychopathol. Dement.
- Predicting the naturalistic course of depression from a wide range of clinical, psychological, and biological data: a machine learning approach. Transl. Psychiatry
- Gene selection and classification of microarray data using random forest. BMC Bioinform.
- Identification of 15 genetic loci associated with risk of major depression in individuals of European descent. Nat. Genet.