

Figure 7.1 - Summary of machine learning approach. AdaB = AdaBoost classifier, CCS = childhood cancer survivor, GrBo = Gradient Boosting classifier, LinR = linear regression, LVEF = left ventricular ejection fraction, NA = not available, PEx = physical examination, Qtn = questionnaire, RFor = Random Forest classifier, Sib = sibling, SN = sensitivity, SP = specificity
Step 2. Data split
The dataset, stratified for CCS versus siblings, was randomly split into four folds, each containing 25% of the participants. One fold (25%) was retained as an unseen set, to test our final ML classifier. Classifiers were trained on the remaining three folds (75%), within which overfitting was prevented using three-fold cross-validation: during three cross-folds (resampling iterations), a classifier was trained on two folds (50%) and one fold (25%) was used for validation.
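The split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's code; the helper name `stratified_folds`, the seed and the toy group sizes are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign each participant index to one of n_folds, stratified by label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for members in by_label.values():
        rng.shuffle(members)
        # Deal shuffled members round-robin so each fold receives
        # roughly 1/n_folds of every group.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds

# Toy labels (hypothetical counts): 1 = CCS, 0 = sibling.
labels = [1] * 40 + [0] * 8
folds = stratified_folds(labels, n_folds=4)

test_fold = folds[0]           # unseen set, about 25% of participants
development_folds = folds[1:]  # remaining three folds, about 75%

# Three cross-folds: each development fold serves once as the validation set,
# while the other two are used for training.
cross_folds = []
for k, validation in enumerate(development_folds):
    training = [i for j, fold in enumerate(development_folds) if j != k for i in fold]
    cross_folds.append((training, validation))
```

Each entry of `cross_folds` pairs 50% of the participants for training with 25% for validation, and the test fold never enters either list.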
Step 3. Missing values
All classifier types needed complete datasets, but missing echocardiographic values are common. Specifically, in our study, image acquisition for right ventricular and left atrial strain was added to the protocol post hoc. We unconditionally imputed missing values in training, validation and test sets with the median values of the training set, separately for each cross-fold. These imputed median values were the same for CCS and siblings, so as not to influence the classifications. As sub-analyses, we compared classifiers including all 75 features (‘all features’ classifier), a subset of features with <500 missing values (‘<500 missings’ classifier) and a subset of features with <300 missing values (‘<300 missings’ classifier) (Supplemental Table 7.1).
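As an illustration of the imputation step, the sketch below computes per-feature medians on a training set only and reuses them for another set. It is a simplified stand-in, assuming missing values are encoded as `None`; the function names and the toy values are hypothetical.

```python
from statistics import median

def training_medians(rows, n_features):
    """Median of each feature over the non-missing (not None) training values."""
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        medians.append(median(observed))
    return medians

def impute(rows, medians):
    """Replace missing values (None) with the supplied per-feature medians."""
    return [[medians[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]

# Toy data with two hypothetical features; None marks a missing measurement.
train = [[60.0, None], [55.0, -20.0], [None, -18.0], [50.0, -22.0]]
validation = [[None, -19.0], [58.0, None]]

# Medians come from the training set only and do not depend on group (CCS or sibling).
medians = training_medians(train, n_features=2)
train_imputed = impute(train, medians)
validation_imputed = impute(validation, medians)  # same training medians reused
```

Repeating this per cross-fold keeps information from the validation and test sets out of the imputed values.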
Step 4. Primary classifiers
We trained four potentially useful classifier types: Linear Regression, Random Forest Classifier, AdaBoost Classifier and Gradient Boosting Classifier (Scikit-Learn v0.19.1 library in Python v3.6.5, Anaconda Inc.). In total, we derived twelve primary classifiers: four classifier types (each still consisting of three cross-fold models), times the three sub-analyses for missing values. The classifier type(s) that performed best (next paragraph explains evaluation criteria) and most consistently over the three cross-folds (to prevent overfitting) were selected.
Step 5. Performance evaluation and comparison of primary classifiers
A detailed rationale of our semi-supervised approach is given in Supplemental Methods 7.1. In plain language, we let ML determine the likelihood of a CCS not being a sibling, and retrospectively evaluated its co-occurrence with an LVEF <45%. In the first, supervised training part, CCS and sibling echocardiograms were assigned the respective ground-truth labels and simultaneously entered for training. In this setting, each primary classifier would aim to optimally separate these two groups. However, for our purpose it was not desirable to discern all CCS from siblings, nor clinically applicable at the late effects clinic, which no siblings visit. We expected partially overlapping CCS and sibling clusters (‘similarity’): CCS and siblings with similar predicted probabilities on a scale where 0 means ‘certainly a sibling’ and 1 means ‘certainly a CCS’ (‘soft’ predictions). In the absence of a gold-standard cardiac function measurement, there were no ground-truth labels for which CCS should be classified ‘(dis)similar’ to siblings, so that this similarity classification remained unsupervised. Evaluating classifier performance for the classification of similarity required us to define, as a minimum requirement, which CCS should not be classified ‘similar’ to the sibling cluster: CCS with known (conventionally measured) cardiac dysfunction. The remaining CCS could be classified either similar or dissimilar to siblings. This resulted in the following performance measures:
1. Sensitivity: the proportion of CCS with an LVEF <45% in a validation or test set that was classified dissimilar to siblings. This convenience cut-off balanced the number of participants with cardiac dysfunction (LVEF <40% was rare) with the assurance of cardiac dysfunction (allowing a measurement error for LVEF and mild abnormalities to be contradicted by other features 74,105). In our evaluation of the primary classifiers in the validation sets, we inspected scatterplots of LVEF versus the predicted probability (soft predictions). Since false-negatives are undesirable in cardiomyopathy surveillance, we manually applied a prediction threshold that assured 100% sensitivity.
2. Specificity: the proportion of siblings in a validation or test set that was classified similar to siblings in the training set. Classifiers with the highest specificity, at 100% sensitivity, during validation were defined to provide the best separation. To conservatively interpret ‘similarity’ of CCS to siblings, we adjusted the prediction threshold towards the sibling cluster if this resulted in only minimal loss of specificity.
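The two performance measures can be illustrated with a small sketch. In the study the threshold was set manually from scatterplots; the code below instead assumes a simple rule (the highest threshold that still flags every CCS with an LVEF <45%) purely to make the definitions concrete. All function names and toy values are hypothetical.

```python
def threshold_for_full_sensitivity(probabilities, is_ccs, lvef, lvef_cutoff=45.0):
    """Highest threshold at which every CCS with LVEF < cutoff is still flagged.

    A participant is classified 'dissimilar' when probability >= threshold.
    """
    positives = [p for p, ccs, ef in zip(probabilities, is_ccs, lvef)
                 if ccs and ef < lvef_cutoff]
    if not positives:
        raise ValueError("no CCS with LVEF below the cut-off in this set")
    return min(positives)

def sensitivity(probabilities, is_ccs, lvef, threshold, lvef_cutoff=45.0):
    """Proportion of CCS with LVEF < cutoff that is classified dissimilar."""
    flagged = [p >= threshold for p, ccs, ef in zip(probabilities, is_ccs, lvef)
               if ccs and ef < lvef_cutoff]
    return sum(flagged) / len(flagged)

def specificity(probabilities, is_ccs, threshold):
    """Proportion of siblings that is classified similar (below the threshold)."""
    similar = [p < threshold for p, ccs in zip(probabilities, is_ccs) if not ccs]
    return sum(similar) / len(similar)

# Toy validation set: four siblings followed by four CCS (values are made up).
probabilities = [0.48, 0.50, 0.52, 0.51, 0.49, 0.55, 0.70, 0.66]
is_ccs = [False, False, False, False, True, True, True, True]
lvef = [60, 58, 62, 57, 59, 50, 40, 43]

threshold = threshold_for_full_sensitivity(probabilities, is_ccs, lvef)
sens = sensitivity(probabilities, is_ccs, lvef, threshold)
spec = specificity(probabilities, is_ccs, threshold)

# A more conservative threshold closer to the sibling cluster (illustrative value).
conservative = 0.535
spec_conservative = specificity(probabilities, is_ccs, conservative)
```

In this toy set, moving the threshold from 0.66 down to 0.535 keeps sensitivity at 100% and costs no specificity, which is the situation in which the more conservative threshold would be preferred.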
Step 6. Secondary classifier (feature selection)
Feature importance plots of the best performing primary classifier(s) were inspected to identify a threshold that marked the steepest increase in relative importance. Features that were above this threshold in all three cross-folds were considered the most important features. These were used to retrain the classifier, in a similar manner as our primary classifiers were trained and validated.
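The selection rule (a feature must exceed the importance threshold in every cross-fold) amounts to a set intersection. The sketch below assumes importances are available as one dictionary per cross-fold; the feature names, importance values and the threshold value are placeholders for illustration.

```python
def consistently_important(importances_per_fold, threshold):
    """Features whose importance exceeds the threshold in every cross-fold."""
    selected = None
    for importances in importances_per_fold:
        above = {name for name, value in importances.items() if value > threshold}
        selected = above if selected is None else selected & above
    return sorted(selected or [])

# Hypothetical importances from three cross-fold models.
importances_per_fold = [
    {"feature_a": 0.10, "feature_b": 0.03, "feature_c": 0.01},
    {"feature_a": 0.08, "feature_b": 0.01, "feature_c": 0.00},
    {"feature_a": 0.12, "feature_b": 0.05, "feature_c": 0.03},
]

selected = consistently_important(importances_per_fold, threshold=0.02)
```

Here only `feature_a` survives, because `feature_b` and `feature_c` each fall below the threshold in at least one cross-fold.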
Step 7. Final model testing
The best secondary classifier was tested on the unseen test set. Since, during training and cross-validation, three versions of the same classifier were developed, the final classifications on that test set were obtained through majority voting. Sensitivity and specificity were tested as described above.
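A minimal sketch of the voting step, assuming each of the three cross-fold models returns a predicted probability and that a participant is classified dissimilar when the probability reaches the prediction threshold. The threshold and probabilities below are illustrative. With an odd number of models, comparing the median probability to the threshold gives the same answer as counting votes.

```python
from statistics import median

def majority_vote(probabilities, threshold):
    """True ('dissimilar') when more than half of the models reach the threshold."""
    votes = sum(p >= threshold for p in probabilities)
    return votes > len(probabilities) / 2

threshold = 0.535  # illustrative prediction threshold

participant_a = [0.52, 0.56, 0.60]  # two of three models vote 'dissimilar'
participant_b = [0.50, 0.51, 0.70]  # only one model votes 'dissimilar'

vote_a = majority_vote(participant_a, threshold)
vote_b = majority_vote(participant_b, threshold)

# Equivalent formulation for three models: threshold the median probability.
median_a = median(participant_a) >= threshold
median_b = median(participant_b) >= threshold
```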
Statistical analysis
We descriptively report the reclassification of CCS in the test set by our final classifier compared to the conventionally used abnormal LVEF (men <52% and women <54% per current chamber quantification guidelines) and abnormal GLS (age-, sex- and vendor specific) 74,121. Characteristics of 10 CCS with the highest predicted probabilities (‘degree’ of abnormality), and of ‘false-positive’ siblings (classified dissimilar to siblings in the training set), in the test set were descriptively summarized. In the CCS from the test set, multivariable logistic regression analysis was performed to associate demographic variables, cardiotoxic exposure and traditional cardiovascular risk factors to the final classifications. Analyses were performed in R (version 3.5.3, R Foundation, Vienna, Austria). Two-sided p-values <0.05 were considered statistically significant.
Results
Full cohort characteristics are shown in Supplemental Table 7.2. Of 1,397 CCS, 49% were female. Median age at cancer diagnosis was 6.1 years [range: 0.1-17.9] and median age at echocardiography was 34 years [range: 16-65]. Median anthracycline dose of the 1,078 (77%) exposed CCS was 180 mg/m² [range: 8-760] and 412 CCS (29%) received RT heart (median prescribed dose 12 Gy [range: 0.4-99]). The fold composition after randomization is presented in Table 7.1.
Table 7.1 - Demographic, clinical and echocardiographic characteristics of survivors and controls
Primary classifiers
Performance on the validation sets of the primary classifiers trained on ‘all features’, ‘<500 missings’ and ‘<300 missings’ is presented in Figure 7.2 A-C. Of the four classifier types, the AdaBoost classifier type yielded the highest specificity at 100% sensitivity (i.e. best distinguished CCS with an LVEF <45% from siblings), consistently in all three cross-folds. Of the three subanalyses, the ‘<300 missings’ classifier performed best. Of note, the Linear Regression classifier type did not provide consistent validation results over three cross-folds.
Figure 7.2 - Training performance of different classifiers during three-fold cross-validation. Y-axes of all figures show specificity at 100% sensitivity. X-axes show the cross-fold number.
(A) All four potential classifiers, entering all echocardiographic features
(B) All four potential classifiers, entering echocardiographic features with <500 missing values
(C) All four potential classifiers, entering echocardiographic features with <300 missing values
(D) AdaBoost classifier, entering 11 most important echocardiographic features from the ‘<300 missings’ classifier.
Secondary classifier (feature selection)
We inspected feature importance plots of this AdaBoost ‘<300 missings’ classifier. A threshold of 0.02 marked the steepest increase in relative importance, generically for all three cross-folds (Supplemental Figure 7.2). Eleven features exceeded this threshold in all three cross-folds (Table 7.2). Retraining the classifier on these most important features slightly improved performance in the validation sets (Figure 7.2 D). We considered this classifier as our final classifier. The feature importance of this final classifier, averaged over the three cross-folds, is summarized in Table 7.2. All eleven features were related to the left ventricle. During validation, specificity of the final classifier was 93-100% at 100% sensitivity (Supplemental Figure 7.3). Importantly, all three cross-folds showed a distinct sibling cluster around a predicted probability of 0.5, distant from the 100% sensitivity threshold. According to our conservative interpretation, we defined our prediction threshold at the edge of this sibling cluster (generically for all three cross-folds at a predicted probability of 0.535) with minimal loss of specificity.
Table 7.2 - Feature importance in the final AdaBoost classifier
* From three cross-folds
IV = interventricular, LV = left ventricle.
Final model testing
When applying the final 11-feature classifier to the test set, this predefined prediction threshold of 0.535 also adequately defined the edge of a distinct sibling cluster, for all three cross-folds (Graphical abstract). The plot depicts the median of the three predicted probabilities (representative of the majority vote) for each participant in the test set. The final sensitivity for classifying CCS with an LVEF <45% as ‘dissimilar’ to siblings was 100%, with a specificity of 86% (i.e.: the sibling cluster contained 86% of siblings). Of the 349 CCS in the test set, 276 (79%) were classified similar to siblings, whereas 73 (21%) were classified dissimilar.
Reclassification
Table 7.3 summarizes the reclassification of CCS in the test set by our final classifier, compared to the conventionally used abnormal LVEF and abnormal GLS. Of 174 CCS with normal LVEF (≥52/54%) and normal GLS, 12 (7%) were classified dissimilar to siblings. Of 40 CCS with both abnormal LVEF and abnormal GLS, 24 (60%) were classified dissimilar. Conversely, of 63 CCS with only mildly abnormal LVEF (<52/54% but ≥45%), 34 (54%) were classified similar to siblings. Comparisons of classifications to continuous LVEF and GLS values are shown in the Graphical Abstract and Figure 7.3.
Clinical and echocardiographic associations of classifications
Clinical characteristics of the 10 CCS with the highest predicted probabilities, who may be considered as ‘most abnormal’, are described in Supplemental Table 7.3. For ten siblings in the test set, the predicted probabilities also exceeded the predefined threshold of 0.535, classifying their echocardiograms as dissimilar/abnormal. Most of them indeed had one or more abnormal echocardiographic or clinical characteristics (Supplemental Table 7.4), but four actually had LVEF values >99th percentile of sibling LVEF values in the training data, and two had GLS values >99th percentile. We hypothesized that siblings were underrepresented in ‘high-normal’ regions of the training data due to the 5:1 CCS/sibling ratio, but a post-hoc weighted sensitivity analysis did not alter the classifications on the test set.
Table 7.3 - Reclassification of CCS in the test set (n=349) by the classifier, compared to abnormal LVEF and abnormal GLS.
a According to age-, sex- and vendor-specific reference values 121. Values are n (% of the group with the respective echo values)
GLS = global longitudinal strain, LVEF = left ventricular ejection fraction.
In multivariable logistic regression of CCS in the test set, female sex (OR 2.0; 95% CI 1.1-3.6), higher RT heart dose (OR 1.4; 95% CI 1.0-1.9 per 10 Gy) and higher diastolic blood pressure (OR 1.5; 95% CI 1.1-1.9 per 10 mmHg) were associated with being classified dissimilar to siblings, whereas cumulative anthracycline and mitoxantrone doses were not (Table 7.4).
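To help read odds ratios reported per 10-unit increment, the sketch below re-expresses one for another increment. It assumes the predictor enters the logistic model as a linear term, so that the log-odds scale proportionally with the increment; the helper name is hypothetical and the input is the rounded estimate from the text, so the outputs are approximate.

```python
import math

def rescale_odds_ratio(odds_ratio, from_increment, to_increment):
    """Re-express an odds ratio for a different increment of a linear predictor."""
    beta_per_unit = math.log(odds_ratio) / from_increment
    return math.exp(beta_per_unit * to_increment)

or_per_10_gy = 1.4  # RT heart dose, rounded estimate as reported in the text
or_per_1_gy = rescale_odds_ratio(or_per_10_gy, 10, 1)
or_per_20_gy = rescale_odds_ratio(or_per_10_gy, 10, 20)
```

Under this assumption, an odds ratio of 1.4 per 10 Gy corresponds to roughly 1.03 per 1 Gy and 1.96 per 20 Gy.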
Discussion
The current study presents a novel, ML-based approach, to identify CCS with comprehensive echocardiographic similarity to sibling controls with low expected heart failure risk. During validation, the AdaBoost classifier type best distinguished CCS with an LVEF <45% from siblings by forming a distinct sibling cluster, and accordingly projected the remaining CCS echocardiograms as similar or dissimilar. Eleven features, related to left ventricular geometry and function, contributed most to the classifier’s predictions and were used to retrain the final classifier. Applying this classifier to the test set, 79% of CCS echocardiograms were classified similar to siblings. CCS with an LVEF <45% were distinguished from siblings with 100% sensitivity. Of 104 CCS with any GLS or LVEF abnormality, 59% were classified similar to siblings. In CCS in the test set, female sex, RT heart dose and diastolic blood pressure were associated with being classified dissimilar.
Figure 7.3 - Comparison of final test set classifications to continuous values of systolic function.
(A) Depicts GLS values (y-axis) to the final classifications for CCS and siblings according to the predicted probability (x-axis) and prediction threshold.
(B) Depicts the final classifications of CCS compared to a combination of continuous values of LVEF (x-axis) and GLS (y-axis). The dark blue and light blue dots represent 79% and 21% of the data, respectively.
Table 7.4 - Multivariable associations of final predictions in CCS in the test set
Variable selection: cardiotoxic exposure variables, corrected for demography, and cardiovascular risk factors with a p-value <0.2 (exploratory) in univariable analysis. Diabetes with medication and use of lipid lowering medication were too infrequent to include in the model. Diastolic blood pressure had a superior correlation over systolic blood pressure. CCS = childhood cancer survivor, CI = confidence interval, OR = odds ratio, TBI = total body irradiation.
The AdaBoost classifier uses boosting (learning sequentially from weaknesses in previous classifiers) to build strong binary classifiers despite relatively weak underlying predictors. Single echocardiographic measurements may be weak discriminators in CCS, owing to heterogeneous cohorts, measurement variability, and cardiac remodeling patterns that depend on cardiotoxic exposures. Our classifier converts eleven-dimensional echocardiography data, which can be collected at the surveillance program, into an easily interpretable probability, to determine whether an echocardiogram is similar to those of same-aged peers. The position of the sibling cluster and overlapping CCS around a probability (‘soft prediction’) of 0.5 supports that most echocardiograms were truly indistinguishable. Our current approach shows analogies with efforts to cross-sectionally identify echocardiographic phenogroups in patients with known hypertension and relate them to healthy subjects. Instead of whole-cardiac cycle curves, we used single-point echocardiographic values as input, and showed that meaningful comparisons can be reached with less complex data.
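To make the boosting idea concrete, the sketch below implements textbook discrete AdaBoost with one-feature threshold rules (decision stumps) as weak learners. It is an illustration only and not the scikit-learn implementation used in this study; the toy data and all function names are assumptions.

```python
import math

def stump_predict(row, feature, cut, sign):
    """Weak learner: predict `sign` when the feature is >= cut, else the opposite."""
    return sign if row[feature] >= cut else -sign

def fit_stump(X, y, weights):
    """Find the threshold rule with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for cut in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                error = sum(w for w, row, t in zip(weights, X, y)
                            if stump_predict(row, feature, cut, sign) != t)
                if best is None or error < best[0]:
                    best = (error, feature, cut, sign)
    return best

def fit_adaboost(X, y, n_rounds):
    """Discrete AdaBoost; labels in y must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, feature, cut, sign = fit_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # say of this weak learner
        ensemble.append((alpha, feature, cut, sign))
        # Misclassified samples gain weight, so the next stump focuses on them.
        weights = [w * math.exp(-alpha * t * stump_predict(row, feature, cut, sign))
                   for w, row, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(row, f, c, s) for alpha, f, c, s in ensemble)
    return 1 if score >= 0 else -1

# Toy one-feature problem that no single threshold rule can solve.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

one_round = [predict(fit_adaboost(X, y, 1), row) for row in X]
three_rounds = [predict(fit_adaboost(X, y, 3), row) for row in X]
```

On this toy problem a single stump misclassifies two samples, whereas the weighted vote of three stumps reproduces all labels, which illustrates how weak predictors can combine into a stronger classifier.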
Since ML is often regarded as a ‘black box’, we kept our ML-based results explainable and controllable, by showing that CCS with the lowest LVEFs were classified dissimilar to siblings, that clinically relevant risk factors were associated with the classification, and that higher ‘degrees of abnormality’ indicated various clinical abnormalities. We reduced human bias by evaluating a markedly lower LVEF threshold than the established normality limits, so that the classifier could still contradict mild single LVEF abnormalities.
The high reclassification rate by our classifier, compared to the conventional interpretation of LVEF or GLS measurements, is to be expected knowing their serious diagnostic uncertainty. Our results highlight the imperfection of LVEF, but also GLS and the need to evaluate multiple parameters (eleven, in this study) in conjunction. In particular, CCS with mild single abnormalities of LVEF or GLS were often considered ‘similar’ to siblings, indicating that the remaining components of the echocardiogram carry useful information. The feature importance of these remaining features indicates that not a single parameter is able to indicate (ab)normality on its own. Our method expedited the discovery of these echocardiographic features of high interest. Since anthracycline-induced cardiotoxicity may be patchy, we assessed longitudinal strain globally as well as for individual cardiac walls. Inferoseptal and anterior wall strain appeared important to the classifier, but strain values of other cardiac walls may be collinear. However, the additional selected features from our analyses, primarily related to left ventricular size and hemodynamics, may provide important context for the selected functional measurements. No right ventricular parameters (conventional and strain) appeared important for our classifier, but the right ventricle may be especially affected after RT heart, which was more frequently administered in other cohorts. Follow-up is ongoing to validate our classifier’s predictions on clinical endpoints. From our results, we expect this comprehensive echocardiographic ‘fingerprint’ to have high potential in stratifying CCS with normal and abnormal echocardiograms beyond abnormal LVEF or GLS. Acknowledging that the background cardiovascular risk in CCS depends on cardiotoxic exposures, future research on prognostic validation should also include stratification for type of cardiotoxic treatment.
Dissimilar echocardiograms in our test set were associated with female sex, RT heart dose and diastolic blood pressure (all of which we expected), and near-significantly with total body irradiation and mitoxantrone dose, but not with anthracycline dose. Although anthracyclines are important predictors of cardiac endpoints, several hypotheses may explain this finding. First, longitudinal strain parameters are associated with RT heart, but less clearly with anthracyclines. Second, CCS exposed to RT heart have relatively smaller, and remodeled, cardiac chambers compared to siblings. Indeed, left ventricular diameter was very important to our classifier. Smaller end-diastolic volumes confound the use of LVEF (which is inflated) as a systolic function index, and the current approach may reveal undiagnosed cardiac abnormalities in such hearts. Third, our classifier may incorporate the consequences of other cardiac manifestations (coronary, epicardial, valvular) of RT heart, which highlights that cardiotoxic treatment variables should be included in longitudinal validation of our classifier. Also, in irradiated hearts, conventional measurements such as LVEF should be used with caution. It should however be noted that this multivariable analysis was only performed in 25% of our total cohort, limiting firm conclusions.
Limitations
Clinical use of the current classifier awaits prognostic validation. Our results were cross-validated and tested in unseen data, which limits the risk of overfitting. However, its external validity depends on cohort characteristics (e.g. cardiotoxic exposures, cardiovascular risk factors and vendors of the strain software used) and we started off with a more comprehensive set of measurements than those obtained in most surveillance clinics. Therefore, instead of external validation, retraining on different cohorts or pooled data may be more appropriate, for which the current study provided a framework. A perceived limitation may be that we used LVEF as an input feature and a validation variable. We emphasize that the learning process of the classifier only involved optimal separation of CCS and sibling clusters, and we manually compared the result with conventional LVEF values. The sample size may also be a perceived limitation, but the stability of the results during training, validation and testing demonstrates that the sample size was sufficient. Although age-, sex- and body size related measurements were not indexed, younger CCS may not be comparable to older siblings, and similar precautions apply for sex. In larger datasets, it would be appropriate to stratify analyses into different age and sex categories. Multicollinearity of input variables does not impact the strength of tree-based predictions as generated by AdaBoost, but may impact accountability of selected features.
Conclusions
In our novel, ML-based approach, the AdaBoost classifier meaningfully identified CCS with echocardiographic similarity to sibling controls with low expected heart failure risk, based on eleven features related to the left ventricle. We highlight the imperfection of single LVEF or GLS values; our classifier reclassified many mild LVEF or GLS abnormalities as ‘false-positive’. The presented analysis of complex echocardiographic data may identify CCS without significant cardiac dysfunction and, if prognostically validated, offer great potential for risk stratification in CCS. Prognostic validation should include clinical parameters such as cardiotoxic doses.
Supplemental material
Supplemental Figure 7.1 - Inclusion flowchart of Dutch Childhood Cancer Survivors Study (DCCSS) LATER cardiology (CARD) echocardiography study.
CCS = childhood cancer survivor. * Study arm closed early after exceeding the predefined limit. † Surveillance echocardiography performed after January 1, 2016 but before outpatient clinic invitation for the study, or under care of a cardiologist.
Supplemental Table 7.1 - Missing values per echocardiographic measurement
Values are n (% of total echocardiograms). Dashed lines denote cut-offs to exclude variables for sensitivity analyses.
* Acquisitions for RV and LA strain measurements were added to the protocol post hoc, approximately halfway through the inclusion period. RV fractional area change was obtained from the speckle tracking analyses.
IV = interventricular, LA = left atrium, LV = left ventricle, RV = right ventricle























