Elsevier

Journal of Affective Disorders

Volume 267, 15 April 2020, Pages 42-48

Research paper
Downregulated transferrin receptor in the blood predicts recurrent MDD in the elderly cohort: A fuzzy forests approach

https://doi.org/10.1016/j.jad.2020.02.001

Highlights

  • Blood transcriptome is a proxy for studying biomarkers for Major Depressive Disorder (MDD).

  • Machine Learning (ML) is a powerful approach to identify predictive markers of MDD.

  • Fuzzy Forests, a machine learning algorithm, takes into account the network structure of transcriptome data.

  • Transferrin receptor (TFRC), downregulated in blood, is predictive of recurrent MDD, indicating a role of the immune system in MDD.

Abstract

Background

At present, no predictive markers for Major Depressive Disorder (MDD) exist. The search for such markers has been challenging due to clinical and molecular heterogeneity of MDD, the lack of statistical power in studies and suboptimal statistical tools applied to multidimensional data. Machine learning is a powerful approach to mitigate some of these limitations.

Methods

We aimed to identify predictive markers of recurrent MDD in the elderly by applying machine learning methodology to transcriptome data from peripheral whole blood collected in the Sydney Memory and Aging Study (SMAS) (N = 521, aged over 65). Fuzzy Forests is a Random Forests-based classification algorithm that takes advantage of the co-expression network structure between genes; it alleviates the p >> n problem by reducing the dimensionality of the transcriptomic feature space.

Results

By adopting Fuzzy Forests on transcriptome data, we found that the downregulated TFRC (transferrin receptor) can predict recurrent MDD with an accuracy of 63%.

Limitations

Although we corrected our data for several important confounders, we were not able to account for the comorbidities and medication taken, which may be numerous in the elderly and might have affected the levels of gene transcription.

Conclusions

We found that downregulated TFRC is predictive of recurrent MDD, which is consistent with the previous literature indicating a role of the innate immune system in depression. This study is the first to successfully apply the Fuzzy Forests methodology to a psychiatric condition, thereby opening a methodological avenue that can lead to clinically useful predictive markers of complex traits.

Introduction

Investigations into the biological underpinnings of MDD remain challenging; however, they are paramount for developing reliable diagnostic tools and effective treatments. Despite decades of research, elucidation of the exact molecular mechanisms is in its infancy (Cai et al., 2015; Okbay et al., 2016; Hek et al., 2013; Wray et al., 2018; Hyde et al., 2016; Jansen et al., 2016). MDD, as a heterogeneous disorder, is a complex dynamic system from both clinical (Cramer et al., 2016) and biological (Sibille and French, 2013) perspectives. The biological complexity of MDD can be accounted for by studying altered gene expression patterns in affected individuals compared with unaffected individuals. These dysregulated patterns can serve as a dynamic marker of the disorder.

Genes do not act in isolation; instead, they interact with each other in complex networks that might be disrupted in depression. In our previous study (Ciobanu et al., 2018), we explored whether genome-wide gene co-expression patterns are associated with depression. We applied Weighted Gene Co-expression Network Analysis (WGCNA) to transcriptomic data from 521 community-dwelling individuals aged over 65. We found that four clusters containing 1241 highly interacting genes were associated with recurrent MDD, but found no cluster associations with single episode depression, current MDD, or lifetime MDD. Using in-silico Enrichment and Signaling Pathway Impact Analysis (SPIA), we found that this gene pool was biologically meaningful for 13 known molecular pathways significantly dysregulated in recurrent MDD in the elderly (Ciobanu et al., 2018). While these findings were consistent with previous observations and provided new insights into the etiology of depression, they were limited by the biostatistical approach used in the analysis. The typical biostatistical approach is to fit a pre-defined linear function between the variables and the outcome. Although this approach is powerful in many scenarios, including candidate gene association studies, it can be suboptimal for whole-genome gene expression data. Transcriptome data are highly multidimensional, with complex non-linear biological processes underlying the gene expression levels measured by transcriptomic experiments. Molecular interactions, which play an important biological role in the observed gene expression levels, are not captured by traditional statistical methods. Machine learning (ML) provides an alternative approach to the analysis of transcriptome data, allowing complex linear and non-linear interactions between genes to be explored. ML explicitly focuses on learning data-specific statistical functions to make generalizable predictions about affected individuals, which makes it a powerful tool for biomarker discovery.
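
The co-expression idea behind WGCNA can be illustrated with a small sketch. In the formulation of Zhang and Horvath (2005), the connection strength between two genes is the absolute Pearson correlation of their expression profiles raised to a soft-thresholding power, so that weak correlations are pushed toward zero. The Python below is an illustration only: the gene names, the toy expression values, and the power of 6 are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def soft_threshold_adjacency(expression, beta=6):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta for every gene pair.

    `beta` is an assumed value chosen for illustration, not the study's setting.
    """
    genes = list(expression)
    return {
        (g, h): abs(pearson(expression[g], expression[h])) ** beta
        for g in genes
        for h in genes
        if g != h
    }


# Hypothetical expression values for three made-up genes across five samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly co-expressed with geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to geneA
}

adjacency = soft_threshold_adjacency(toy_expression)
print(adjacency[("geneA", "geneB")], adjacency[("geneA", "geneC")])
```

Raising the correlation to a power keeps strongly co-expressed pairs connected while shrinking weak, noisy correlations, which is what allows genes to be grouped into clusters of highly interacting features.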

Random forests (RF) is an established technique for classification and feature selection, owing to its advantages in dealing with relatively small sample sizes, high-dimensional feature spaces, and complex data structures. While RF is able to capture the true importance of features when the features are independent, it is known to be biased when features are correlated with one another and the correlation structure is not known a priori (Nicodemus and Malley, 2009), which is a typical scenario for transcriptome data. Fuzzy forests (FF), an extension of the RF algorithm, is designed to reduce this bias. FF takes advantage of the network structure between features and relies on WGCNA to create relatively uncorrelated clusters of highly correlated features (Zhang and Horvath, 2005). Within each cluster, FF uses recursive feature elimination RF to select features (Díaz-Uriarte and Alvarez de Andrés, 2006). A final RF is then fit using the surviving features, and the selected features are used to construct a predictive model (Conn et al., 2015; 2016).
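
The two-stage screening structure described above (elimination within each cluster, then selection among the pooled survivors) can be sketched as follows. This is a structural illustration under stated assumptions, not the authors' implementation: to stay dependency-free it replaces random forest variable importance with a simple stand-in score (the absolute difference in class means), and the function names, module labels, drop fraction, and toy data are all hypothetical.

```python
from statistics import mean


def mean_gap_importance(data, labels, features):
    """Stand-in importance: absolute difference in class means per feature.

    This is NOT random forest importance; it only keeps the sketch runnable
    without external libraries. A real fuzzy forest refits an RF each round.
    """
    scores = {}
    for f in features:
        cases = [row[f] for row, y in zip(data, labels) if y == 1]
        controls = [row[f] for row, y in zip(data, labels) if y == 0]
        scores[f] = abs(mean(cases) - mean(controls))
    return scores


def recursive_elimination(data, labels, features, keep, drop_fraction=0.25,
                          importance=mean_gap_importance):
    """Repeatedly rank features and drop the weakest until `keep` remain."""
    features = list(features)
    while len(features) > keep:
        scores = importance(data, labels, features)
        ranked = sorted(features, key=scores.get, reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        features = ranked[:max(keep, len(ranked) - n_drop)]
    return features


def fuzzy_forest_screen(data, labels, modules, keep_per_module, keep_final):
    """Stage 1: eliminate within each module. Stage 2: select among survivors."""
    survivors = []
    for genes in modules.values():
        survivors += recursive_elimination(data, labels, genes, keep_per_module)
    return recursive_elimination(data, labels, survivors, keep_final)


# Hypothetical toy data: four samples, five made-up features in two modules.
toy_data = [
    {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0, "e": 0.0},
    {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0, "e": 0.0},
    {"a": 5.0, "b": 2.5, "c": 3.0, "d": 2.0, "e": 0.0},
    {"a": 5.0, "b": 2.5, "c": 3.0, "d": 2.0, "e": 0.0},
]
toy_labels = [1, 1, 0, 0]
toy_modules = {"M1": ["a", "b", "c"], "M2": ["d", "e"]}

selected = fuzzy_forest_screen(toy_data, toy_labels, toy_modules,
                               keep_per_module=1, keep_final=1)
print(selected)
```

Because elimination happens inside each cluster first, correlated features compete only with one another, and the final selection step sees a much smaller pool of candidates than the full feature space.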

Although FF is based on WGCNA, these methodologies represent two different analytical strategies. WGCNA is primarily concerned with identifying important genes assumed to be involved in the same biological processes, which is useful in understanding the biological underpinnings of depression. However, given that depression is a biologically multifactorial disorder, it is likely that hundreds to thousands of genes are involved in the disease, which makes such gene sets diagnostically impractical. RF, in contrast, aims to find a small number of genes sufficient for a good prediction of the response variable. Combining the two strategies (RF and WGCNA) in an FF framework may help to overcome the limitations of each individual method and enrich our understanding of the aetiology of depression.

With the aim of distinguishing individuals affected by recurrent MDD from those unaffected on the basis of transcriptomic data, we applied a novel analytic approach, Fuzzy Forests (FF). FF is an ML algorithm that combines two established techniques, WGCNA and Random Forests (RF), into an algorithm that effectively reduces the dimensionality of the transcriptome data and therefore requires a smaller sample to identify meaningful predictive markers than classic statistical or ML algorithms. To the best of our knowledge, this study is the first to utilize Fuzzy Forests for transcriptome data in psychiatric research.

Sample characteristics

The Sydney Memory and Aging Study (SMAS) was initiated in 2005 to examine the clinical characteristics and prevalence of mild cognitive impairment and related syndromes, including depression, in a non-demented population aged 70-90 years at recruitment (N = 1037) (Sachdev et al., 2010). The phenotypic data were collected at four time points with 2-year intervals between assessments. Blood samples for gene expression analyses were collected at Wave 4 (N = 521), six years after baseline data

Demographics and clinical characteristics

The basic demographic and clinical characteristics of the cohort are presented in Table 1.

Training and test data

After partitioning the full dataset, our training data consisted of two groups: 19 recurrently depressed individuals (group [1]) and 346 individuals without recurrent MDD (group [0]), a highly unbalanced ratio (0.05 vs 0.95). Using SMOTE, we balanced the training data to 38 observations in each group. The test data consisted of 8 recurrent MDD [1] and 148 non-recurrent MDD [0] individuals.
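
The balancing step can be illustrated with a minimal SMOTE-style sketch. SMOTE creates synthetic minority observations by interpolating between a minority observation and one of its nearest minority-class neighbours. The code below doubles the minority class and randomly undersamples the majority class to the same size, which reproduces the 38 vs 38 split reported above. The source does not state the software or parameters used, so the function names, the number of neighbours (k = 5), the two-dimensional toy features, and the random seeds are assumptions made for illustration.

```python
import random
from math import dist


def smote_like(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating toward a near neighbour."""
    rng = rng or random.Random(0)
    synthetic = []
    for i in range(n_new):
        idx = i % len(minority)
        base = minority[idx]
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != idx]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append(tuple(b + gap * (p - b) for b, p in zip(base, partner)))
    return synthetic


def balance(minority, majority, rng=None):
    """Double the minority class, then undersample the majority to match."""
    rng = rng or random.Random(0)
    minority_balanced = list(minority) + smote_like(minority, len(minority), rng=rng)
    majority_balanced = rng.sample(majority, len(minority_balanced))
    return minority_balanced, majority_balanced


# Hypothetical toy features; only the group sizes (19 and 346) come from the text.
toy_rng = random.Random(42)
minority = [(toy_rng.gauss(-1, 1), toy_rng.gauss(0, 1)) for _ in range(19)]
majority = [(toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)) for _ in range(346)]

minority_share = len(minority) / (len(minority) + len(majority))
cases, controls = balance(minority, majority, toy_rng)
print(round(minority_share, 2), len(cases), len(controls))
```

Only the training data are balanced in this way; the test set described above keeps its original class proportions.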

Co-expression network and recurrent MDD-relevant features

To determine the power of

Downregulated transferrin receptor, TFRC, as a potential predictive marker for recurrent MDD

While machine learning is a powerful approach in genomic research, application of ML algorithms in psychiatry is challenging due to the large sample sizes required to train a model on the massively multivariate structure of transcriptomic data. The use of a co-expression network feature reduction technique prior to training an ML model effectively alleviates the p>>n problem without information loss, thereby allowing meaningful predictive markers to be identified with a smaller sample. In this study, we

Limitations

Although we report on the ability of our model to predict recurrently depressed individuals, these results should be treated with caution. While we identified TFRC as the most predictive gene for recurrent depression in the elderly, our sample was small relative to the feature space and could be a source of poor generalizability. While we corrected our data for age, sex, RINs and latent non-biological variables, we were unable to account for medications taken, comorbidities, cognitive status

Conclusions

Using the fuzzy forests framework, we identified that the most predictive gene, TFRC, can predict recurrent depression in the elderly with an accuracy of 63%. This finding, coupled with our previous observation that blood TFRC mRNA is downregulated in individuals with recurrent MDD compared with those without, suggests that TFRC may potentially serve as a recurrent MDD-specific predictive marker and provide some insights into the pathophysiology of depression. Although our study is exploratory in nature providing preliminary

Funding

This work was supported by funding from the National Health and Medical Research Council (NHMRC; ID 1060524 to BTB, SCW, SR, JT) of Australia. The Sydney Memory and Ageing Study (SMAS) was supported by a National Health and Medical Research Council (NHMRC)/Australian Research Council Strategic Award (ID 401162), NHMRC Program Grants (ID 350833 and 568969) and a Project Grant (ID 1045325). The Older Australian Twins Study (OATS) was funded by an NHMRC/ARC Strategic Award Grant of the Ageing Well

Declaration of Competing Interest

The authors declare no conflict of interest.

Acknowledgments

We would like to thank the Sydney MAS and OATS participants and their respective research teams. We also thank all members of the Centre for Healthy Brain Ageing (CHeBA, UNSW) and the Discipline of Psychiatry research group (University of Adelaide) for their invaluable input during data collection and for discussions during manuscript preparation.

References (33)

  • D. Conn et al. Fuzzy forests: a new WGCNA based random forest algorithm for correlated, high-dimensional data. J. Stat. Softw. (2016)

  • A.O.J. Cramer et al. Major depression as a complex dynamic system. PLoS One (2016)

  • J.L. Cummings et al. The Neuropsychiatric Inventory. Compr. Assess. Psychopathol. Dement. (1994)

  • R. Dinga et al. Predicting the naturalistic course of depression from a wide range of clinical, psychological, and biological data: a machine learning approach. Transl. Psychiatry (2018)

  • R. Díaz-Uriarte et al. Gene selection and classification of microarray data using random forest. BMC Bioinform. (2006)

  • C.L. Hyde et al. Identification of 15 genetic loci associated with risk of major depression in individuals of European descent. Nat. Genet. (2016)