Leigh AbbottMilan MarocchiMatthew FynnYue RongSven Nordholm

###### Abstract

Accurately interpreting cardiac auscultation signals plays a crucial role in diagnosing andmanaging cardiovascular diseases.However, the paucity of labelled data inhibits classification models’ training.Researchers have turned to generative deep learning techniquescombined with signal processing to augment the existing data and improve cardiacauscultation classification models to overcome this challenge.However, the primary focus of prior studies has been on model performance as opposed tomodel robustness.Robustness, in this case, is defined as both the in-distribution and out-of-distributionperformance by measures such as Matthew’s correlation coefficient.This work shows that more robust abnormal heart sound classifiers can be trained using anaugmented dataset.The augmentations consist of traditional audio approaches and the creation of syntheticaudio conditionally generated using the WaveGrad and DiffWave diffusion models.It is found that both the in-distribution and out-of-distribution performance can beimproved over various datasets when training a convolutional neural network-based classification model with thisaugmented dataset.With the performance increase encompassing not only accuracy but also balanced accuracy andMatthew’s correlation coefficient, an augmented dataset significantly contributes to resolving issues of imbalanceddatasets.This, in turn, helps provide a more general and robust classifier.

###### keywords:

Data augmentation , Denoising diffusion probabilistic models , Generative deep learning , Abnormal heart sound classification , Synthetic audio generation

\affiliation

[inst1]organization=School of Electrical Engineering, Computing, and Mathematical Sciences (EECMS), Faculty of Science and Engineering, Curtin University,Department and Organizationcity=Bentley,postcode=6102,state=WA,country=Australia

## 1 Introduction

Cardiovascular disease (CVD) is the primary contributor to mortality worldwide, representing morethan $30\text{\,}\mathrm{\char 37\relax}$ of all global deaths in 2019[1].In addition to the human cost, CVD places an immense economic burden on healthcare systems andsociety[1].To treat CVD effectively, it is necessary to diagnose and evaluate the condition of the heart accurately.

Cardiac auscultation (CA) is the process of listening to sounds generated by theheart[2].Physicians have traditionally performed CA using stethoscopes to detect and monitor heartconditions in a non-invasive manner.However, the difficulty of performing CA leads to uncertainty in diagnosis and poor patientoutcomes.The issue is further complicated by the fact that CA is both difficult to teach and a specialised skill, with studies noting that primary care physicians often lack proficiency in this area[2].

Recently, a wearable multichannel electrophonocardiography (EPCG) device has beendeveloped[3].The premise of this device is to detect CVD utilising synchroised phonocardiogram (PCG) and electrocardiogram (ECG) data. The combination of these signals can result in more accurate and robust classifications.However, there is currently limited synchronised multichannel phonocardiogram and electrocardiogram (SMPECG) data, which creates a need for a technique to aid in creating a larger dataset.

There are current limitations that prevent robust classification results across multiple datasets.These include a lack of quality data and unbalanced datasets, with most data having lots of backgroundnoise, resulting in a low signal-to-noise ratio.There is also a limited amount of synchronised PCG and ECG recordings,which limits the effectiveness of algorithms, despite the large amounts of standalone ECG and some PCG data.Traditional augmentation approaches can help to overcome these issues, with augmentation beingapplied to existing signals[4].This is somewhat lacking, however, as it does not always increase the out-of-distributionperformance, leaving room for further approaches to address this issue.With recent advancements in conditional waveform generation using diffusionmodels[5, 6], it is possible to extend previously ECG-only datasets bygenerating PCG signals conditioned from the ECG in these datasets.

This work explores traditional augmentation approaches alongside the generation of synthetic signals,to create more robust classifiers of abnormal heart sounds.

The main contributions of this work are summarised below:

- 1.
Development of a diffusion model to create PCG signals conditional on existing ECG signals, allowing additional data to be used from ECG datasets once the diffusion model has created the corresponding PCG signal. To the best of our knowledge, this is the first work using diffusion models to generate PCG signals.

- 2.
Traditional augmentation methods synchronised over the PCG and ECG signals and extensive methods beyond those utilised in other studies.

- 3.
Augmentation methods were applied to a previously top-performing model[7] on the training-a dataset[8], resulting in improvements of 2.5% in accuracy, 4.1% in balanced accuracy, 1.9% in $F_{1}^{+}$ score, and 0.066 in Matthew’s Correlation Coefficient (MCC).Additionally, when tested on the training-e dataset—where the model had not been trained on any of the dataset’s data—there were notable improvements of 43.1% in accuracy, 20.2% in balanced accuracy, 27.1% in $F_{1}^{+}$ score, and 0.297 in MCC.

The remainder of the paper is organised as follows. Background in PCG and ECG signals, model robustness, biomedical signal augmentation, andgenerative models are covered in Section 2.Following this, the methods and results are presented in Sections 3 and 4 before a discussion of the results in Section 5 and thefinal conclusions and further work are summarised in Section 6.

## 2 Background

### 2.1 Phonocardiogram and Electrocardiogram Signals

PCG signals comprise multiple sounds from the opening and closing of valves and blood flow inside the heart that causevibrations, which are then recorded from the chest wall[9].The fundamental heart sounds are the first (S1) and second (S2) sounds, which are the most prominent.The S1 occurs during the beginning of the systole and is caused by isovolumetric ventricularcontraction.S2 is caused by the closing of the aortic and pulmonic valves during the beginning of the diastole.Although the S1 and S2 sounds are the most audible, PCG signals consist of many other heart soundssuch as the third (S3) and fourth (S4) heart sounds, systolic ejection clicks, mid-systolicclicks, opening snap and heart murmurs[8].These heart murmurs are produced by turbulent flowing blood, which can indicate the presence of particular CVDs.These various heart sounds all lie within the low frequencies, with S1 from$10\text{\,}\mathrm{Hz}140\text{\,}\mathrm{Hz}$ and the highest energy around $25\text{\,}\mathrm{Hz}45\text{\,}\mathrm{Hz}$.The S2 is from $10\text{\,}\mathrm{Hz}200\text{\,}\mathrm{Hz}$, with most of the energy around $55\text{\,}\mathrm{Hz}75\text{\,}\mathrm{Hz}$.S3 and S4 sounds are from $20\text{\,}\mathrm{Hz}70\text{\,}\mathrm{Hz}$, although they are much less audible, mainly occurring in children and pathological subjects.Murmurs are usually found in slightly higher frequencies and range from $25\text{\,}\mathrm{Hz}$ to$400\text{\,}\mathrm{Hz}$[10], with some being found in frequencies higher than$600\text{\,}\mathrm{Hz}$, but with far less energy[11].

ECG signals represent the heart’s electrical activity[12].An ECG signal consists of the P, QRS complex, and T waves, with a U wave alsooccasionally present[13].These waves can contain information to aid in CVD diagnosis.ECG signals are commonly filtered between $0.5\text{\,}\mathrm{Hz}$ and $40\text{\,}\mathrm{Hz}$ to remove baselinewander and unwanted noise and interference[14].For example, in the case of coronary artery disease patients, studies have documented that symptomssuch as T-wave inversion, ST-T abnormalities, left ventricular hypertrophy, and prematureventricular contractions can be observed[15].

Combining these two signals has produced superior results compared to classification using a singlesignal[7], suggesting that relevant features for classification exist within both signals.The increase in performance suggests that utilising synchronised PCG and ECG data will help tocreate more accurate and robust classifiers.

### 2.2 Model Robustness

Tran etal. (2022) [16] presented a state-of-the-art framework forenhancing model reliability, focusing on robust generalisation.Robust generalisation allows a model to perform well on data outside the trainingset[16], encompassing in-distribution (ID) and out-of-distribution (OOD)generalisation[16].

ID generalisation pertains to a model’s performance on data within the training distribution butoutside the training set, addressing underfitting and overfittingissues[16, 17].OOD generalisation, on the other hand, concerns a model’s ability to handle data distributionsdifferent from the training set, addressing distribution shifts such as subpopulation shifts,covariate shifts, and domain shifts[16, 18].

Perturbation resilience is the ability of a model to handle atypical and significantly differentdata, including corruption, distortion, artifacts, missing data, gaps, spectral masking, extremenoise, and defective inputs, which is critical in clinical settings.

#### 2.2.1 Measuring Model Robustness

Table1 shows formulas for traditional binaryclassification performance measures derived from the confusion matrix inFigure1[19, 20, 21].Sensitivity (recall/true positive rate) and specificity (true negative rate) measure correctclassifications of positive and negative cases,respectively[19].Precision (positive predictive value) and negative predictive value measures correctly classifiedpositive and negative cases among classified cases,respectively[19].Accuracy measures overall correct classifications[19].Ideally, all these measures are unity, indicating no false predictions.

Actual | |||
---|---|---|---|

Positive | Negative | ||

Classified | Positive | TP | FP |

Negative | FN | TN |

While having one target metric is ideal, it is impractical as each metric contains differentinformation and no single measure captures all the information from a confusionmatrix[20].Summary metrics can be biased under certain conditions; for instance, accuracy can be misleading forimbalanced datasets.Matthew’s correlation coefficient (MCC) is a better single metric for classifier performance than Fscores[22].

Metric | Formula | |
---|---|---|

Sensitivity | TPR= | $\frac{{\text{TP}}}{{\text{TP}+\text{FN}}}$ |

Specificity | TNR= | $\frac{{\text{TN}}}{{\text{TN}+\text{FP}}}$ |

Precision | PPV= | $\frac{\text{TP}}{\text{TP}+\text{FP}}$ |

Negative Predictive Value | NPV= | $\frac{\text{TN}}{\text{TN}+\text{FN}}$ |

Accuracy | acc= | $\frac{{\text{TP}+\text{TN}}}{{\text{TP}+\text{TN}+\text{FP}+\text{FN}}}$ |

Balanced Accuracy | $\text{acc}_{\mu}$= | $\frac{\text{TPR}+\text{TNR}}{2}$ |

F1-Positive-Score | $\text{F}^{+}_{1}$= | $\frac{{2\cdot\text{PPV}\cdot\text{TPR}}}{{\text{PPV}+\text{TPR}}}$ |

F1-Negative-Score | $\text{F}^{-}_{1}$= | $\frac{{2\cdot\text{NPV}\cdot\text{TNR}}}{{\text{PNV}+\text{TNR}}}$ |

Matthew’s Correlation Coefficient | MCC= | $\frac{\text{TP}\cdot\text{TN}-\text{FP}\cdot\text{FN}}{\sqrt{(\text{TP}+\text{%FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$ |

This work focuses on ID and OOD performance as the metric for model robustness,focusing on balanced accuracy and MCC in addition to accuracy to present an overall indicator of the performance of theclassification model.

#### 2.2.2 Model Robustness and Augmentation

Data augmentation creates new data from existing data to increase the training set’s size andvariety, typically improving model performance.To improve ID generalisation, providing more training data from the same distribution as theoriginal data helps the model generalise to similar examples[16].To enhance OOD generalisation, extending the training data distribution beyond the original dataset,such as by balancing labels or adding scarce feature combinations, helps the model handledistribution shifts more effectively[23].

### 2.3 Generative Models

Generative models are trained to learn the underlying distribution of the data to generate newsamples.As such, the goal is to train a mapping between the latent space and the data space so that theresulting samples are similar to the original data.One of the important properties of the latent space is that it can enable the creation of new datathrough the manipulation of semantic representations of features and labels.In recent history, three classes of models have advanced the field of generative learning in waves.

These classes are Autoencoders (AEs), Generative Adversarial Networks (GANs) and Diffusion models(DMs).The first class of models, AEs, encode input data to a lower-dimensional latent space and thendecode it back to the data space, often used in denoising models due to their ability toreconstruct the input from the latent space[24].Variational Autoencoders (VAEs), an extension of AEs, regularise the latent distribution, enablingmeaningful sampling from the latent space and removing discontinuities, thus facilitatinggenerative capabilities[25].GANs, the second class, consist of a generator and a discriminator network; the generator createsrealistic samples from random noise, while the discriminator attempts to distinguish betweenreal and synthetic samples, engaging in a zero-sum game to improve bothnetworks[26].DMs, the third class, add random noise to input data and then train the model to reverse thisprocess, learning to denoise data in a structured manner, with models like Latent DiffusionModels (LDMs) performing diffusion in the latent space for computationalefficiency[27, 28, 29].

The “generative learning trilemma” may guide the trade-offs in choosing a generative learningmodel.As Figure2 (adapted from[30]) shows, models often excel at only two ofthree desired goals: high sample quality, fast sample speed, and large sample variety.However, as mentioned earlier, performing the diffusion process in latent space allows LDMs togenerate samples much faster, such that some argue it bypasses the trilemma inpractice[29, 30].For this reason, LDMs have seen recent use in expanding datasets in biomedical projects, where datacollection is prohibitively costly[31].As such, this work aims to use both the WaveGrad and DiffWave diffusion models for the creation of PCGfrom ECG signals.

### 2.4 Biomedical Signal Augmentation

In[4], data augmentation was employed to expand a PCG dataset from $3153\text{\,}$recordings to $53\,601\text{\,}\mathrm{r}$ecordings, an increase by a factor of $17\text{\,}$. The augmentationincluded a random combination of effects such as changes to pitch, speed, tempo, dither, volume,and mixing with audio[4].Despite achieving a sensitivity of $96\text{\,}\mathrm{\char 37\relax}$ and a specificity of $83\text{\,}\mathrm{\char 37\relax}$, theauthors concluded that their approach did not generalise well, with performance varying from$99\text{\,}\mathrm{\char 37\relax}$ on the dataset with the most recordings to $50\text{\,}\mathrm{\char 37\relax}$ on the datasetwith the fewest recordings[4].Consequently, Thomae and Dominik [4] suggested that more training data and furtheraugmentation is necessary to enhance performance on unseen data.

In a subsequent study by Zhou etal. [32], models trained with variousaugmentations were compared against a baseline.Augmentations were applied to both the original and image-transformed data and were categorised by a“physiological constraint”(whether the transform alters or violates physiological possibilities)and/or a“spectrogram constraint”(whether the transform alters the meaning of the spectrogram output)[32].Augmentations that violated the “spectrogram constraint” were linked to decreased modelperformance, while adherence to physiological possibilities was associated with improvedperformance[32].Notably, no single augmentation improved performance across all metrics, though some offered a morefavorable trade-off than others[32].

VAEs have been explored for the generation of synthetic lung auscultation sounds[33],where it was found that the use of VAE-generated signals in the training of classifiers were oftenimproved, but not always, over training on just the original data.

GANs have also found lots of use within biomedicalapplications[34, 35, 36].The introduction of synthetic data helps to overcome data imbalances as well as improve modelperformance.In particular, GANs have been used to generate synthetic heart signals[36].This work found that during early training, the waveform generated resembled a real signalwith added noise[36].Using the Empirical Wavelet Transform (EWT) to reduce this noise, the resulting signal at$2000\text{\,}$ epochs was more realistic than the resulting signal at $12\,000\text{\,}$ epochs, allowingfor a sixfold reduction in training time[36].Further work was performed to show that the generative model had not simply learned the trainingdataset[36].As a result, the classifiers were able to classify the synthetic heart sounds correctly withaccuracy greater than $90\text{\,}\mathrm{\char 37\relax}$[36].

In[37], the general problem of generating synthetic one-dimensionalbiosignals are explored.Both an autoencoder and GAN-based approach were explored.To evaluate their models, the synthetic and real datasets are each used as either the training ortest set for a classifier model that had previously achieved an accuracy of$99\text{\,}\mathrm{\char 37\relax}$[37].The results from this work showed that the synthetic data captured the underlying features anddistributions of the real data and the synthetic data could be used to train classifiers suchthat they perform well on real data[37].In addition to this, it was noted that the generative models were readily able to capture the noiseof the input data[37].

It was found that although GANs have found lots of use traditionally, the number of papers in medicalimaging that utilise VAEs and DMs has increased in recent years.For DMs in particular, there has been a substantial increase in papers, which authors attributed totheir ability to generate high-quality images with good modecoverage[38].Despite the abundance of diffusion models in medical imaging, we could not find, to the best of our knowledge, any use in biomedical audio signals, leaving room for exploration.

### 2.5 Conditional Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic models (DDPM) are a type of diffusion model that follows a Markovprocess that continuously noises the input, with the network learning to reverse this process byestimating the noise that was added.Conditional diffusion models for conditional audio generation can be adaptedfrom the diffusion model setup in [39].This model considers the conditional distribution $p_{\theta}(\mathbb{y}_{0}|\mathbb{x)}$, with $\mathbb{y}_{0}$ being theoriginal waveform and $\mathbb{x}$ the conditioning features that correspond with $\mathbb{y}_{0}$,

$p_{\theta}\left(\mathbb{y}_{0}|\mathbb{x}\right)=\int p_{\theta}\left(\mathbb{%y}_{0:N}|\mathbb{x}\right)d\mathbb{y}_{1:T}$ | (1) |

where $\mathbb{y}_{1},\ldots{},\mathbb{y}_{T}$ is a series of latent variables.The posterior $q\left(\mathbb{y}_{1:T}|\mathbb{y}_{0}\right)$ is the forward diffusion process, which isdefined through the Markov chain:

$q\left(\mathbb{y}_{1:T}|\mathbb{y}_{0}\right)=\prod_{t=1}^{T}q\left(\mathbb{y}%_{t}|\mathbb{y}_{t-1}\right)$ | (2) |

Gaussian noise being added in each iteration is defined as,

$q\left(\mathbb{y}_{t}|\mathbb{y}_{t-1}\right)=\mathcal{N}\left(\mathbb{y}_{t};%\sqrt{1-\beta_{t}}\mathbb{y}_{t-1},\beta_{t}I\right)$ | (3) |

with the noise being defined with a fixed noise schedule for $\beta_{1},\ldots{},\beta_{T}$.Hence, the diffusion process can be computed for any $t$ as

$\mathbb{y}_{t}=\sqrt{\overline{\alpha}_{t}}\mathbb{y}_{0}+\sqrt{1-\overline{%\alpha}_{t}}\epsilon_{t}$ | (4) |

where $\alpha_{t}=1-\beta_{t}$ and $\overline{\alpha_{t}}=\prod_{i=1}^{t}\alpha_{i}$.As the likelihood in Equation1 is intractable, training these models is done bymaximising its variational lower bound (ELBO).Ho etal. found that using a loss as defined in Equation5 leads tohigher quality generation.

$\operatorname{\mathbb{E}}_{t,\epsilon}\left[\left\lVert\epsilon_{\theta}\left(%\mathbb{y}_{t},\mathbb{x},t\right)-\epsilon_{t}\right\rVert_{2}^{2}\right]$ | (5) |

The model estimates the noise added in the forward process, which is written as $\epsilon_{\theta}$and the actual noise added is written as $\epsilon_{t}$, where $\epsilon_{t}\sim\mathcal{N}\left(0,I\right)$.

Generation is then done by first sampling $\mathbb{y}_{T}\sim\mathcal{N}\left(0,I\right)$ and $\mathbb{z}\sim\mathcal{N}\left(0,I\right)$, before following the below equation until for $t=T,\ldots{},1,0$,

$\mathbb{y}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbb{y}_{t}-\frac{1-%\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{\theta}\left(\mathbb{y}_{%t},\mathbb{x},t\right)\right)+\sigma_{t}\mathbb{z}$ | (6) |

where $\sigma_{t}=\tilde{\beta_{t}}$ and $\tilde{\beta_{t}}=\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t}}%\beta_{t}$ is the variance at step $t$ for $t>1$ and $\tilde{\beta_{1}}=\beta_{1}$.

#### 2.5.1 WaveGrad

WaveGrad is a DDPM for audio synthesis using conditional generation.The model utilises the architecture consisting of multiple upsampling blocks (UBlocks) anddownsampling blocks (DBlocks), with the input signal and the conditioning signal as inputs intothe network.The conditioning signal is converted to a mel-spectrogram representation before being input tothe model[6].These UBlocks and DBlocks follow the architecture of the upsampling and downsampling blocks utilised in the Generative Adversarial Network text-to-speech (GAN-TTS) model [40].The feature-wise linear modulation (FilM) modules combine information from the noisy waveform and the conditioning mel-spectrogram [6].The UBlock, DBlock and feature-wise linear modulation (FiLM) modules are shown inFigure3, with Figure4 showing the entire WaveGrad architecture.The loss function is based on the difference between the noise added in each step of the forwarddiffusion process and the noise predicted during the reverse process[6] asdescribed in Equation7, with the Markov process being conditioned on the continuousnoise level instead of the time-step.Also, note that the L1 norm was used over the L2 norm as it was found to provide better trainingstability[6].WaveGrad only includes a local conditioner in the form of a conditioning signal.

$\operatorname{\mathbb{E}}_{\overline{\alpha},\epsilon}\left[\left\lVert%\epsilon_{\theta}\left(\sqrt{\overline{\alpha}}\mathbb{y_{0}}+\sqrt{1-%\overline{\alpha}}\epsilon,\mathbb{x},\sqrt{\overline{\alpha}}\right)-\epsilon%_{t}\right\rVert_{1}\right]$ | (7) |

#### 2.5.2 DiffWave

DiffWave is another DDPM for raw audio synthesis with conditional and unconditional generation.The loss function utilises a single ELBO-based training objective without auxiliarylosses[5], as described in Equation8.One-dimensional convolutions are used on the input and conditioning signals that go through multiplefully connected layers.The model contains a WaveNet[41] backbone, consisting of bi-directional dilated convolutions and residual layers andconnections.The architecture is shown in Figure5.DiffWave can be used for both conditional and unconditional generation.For conditional generation, it uses a local conditioning signal and a global conditioner (discretelabels)[5].

$\operatorname{\mathbb{E}}_{t,\epsilon}\left[\left\lVert\epsilon_{\theta}\left(%\sqrt{\overline{\alpha}_{t}}\mathbb{y_{0}}+\sqrt{1-\overline{\alpha}_{t}}%\epsilon,\mathbb{x},t\right)-\epsilon_{t}\right\rVert_{1}\right]$ | (8) |

## 3 Materials and Methods

To achieve a more robust model, the augmented training dataset must first be created.Figure6 depicts the dataset creation process.Once this dataset is created, various classification models can be trained and evaluated to measurethe increase in ID and OOD performance.

### 3.1 Datasets

#### 3.1.1 PhysioNet and Computing in Cardiology Challenge 2016 Dataset

The PhysioNet and Computing in Cardiology Challenge 2016 (CinC) was an international competition thataimed to encourage the development of heart sound classification algorithms[8].The data was sourced from nine independent databases but excluded a database focused onfetal and maternal heart sounds[8].Across the nine databases, there are $2435\text{\,}$ recordings sourced from $1297\text{\,}\mathrm{p}$atients[8].Excluding the aforementioned database and splitting longer recordings into smaller samples, therewere in total $4430\text{\,}$ samples from $1072\text{\,}\mathrm{p}$atients, equating to $233\,512\text{\,}$ heartsounds, $116\,865\text{\,}$ heart beats, and nearly $30\text{\,}$ hours of recordings used in thecompetition[42].At the time of (their) publication, this amounted to the largest open-access heart sound database inthe world[42].

The recordings were resampled to $2000\text{\,}\mathrm{Hz}$ for the competition and only one PCG lead wasused, with the exception of training-set a, which includes ECG[42].

Database Information | Proportion of Recordings (%) | ||||
---|---|---|---|---|---|

Challenge Use | Dataset | Source Database | Abnormal | Normal | Unsure |

Training | training-a | MITHSDB | 67.5 | 28.4 | 4.2 |

training-b | AADHSDB | 14.9 | 60.2 | 24.9 | |

training-c | AUTHHSDB | 64.5 | 22.6 | 12.9 | |

training-d | UHAHSDB | 47.3 | 47.3 | 5.5 | |

training-e | DLUTHSDB | 7.1 | 86.7 | 6.2 | |

training-f | SUAHSDB | 27.2 | 68.4 | 4.4 | |

Average | 18.1 | 73.0 | 8.8 | ||

Test | test-b | AADHSDB | 15.6 | 48.8 | 35.6 |

test-c | AUTHHSDB | 64.3 | 28.6 | 7.1 | |

test-d | UHAHSDB | 45.8 | 45.8 | 8.3 | |

test-e | DLUTHSDB | 6.7 | 86.4 | 6.9 | |

test-g | TUTHSDB | 18.1 | 81.9 | 0.0 | |

test-i | SSHHSDB | 60 | 34.3 | 5.7 | |

Average | 12.0 | 77.1 | 10.9 |

Recordings were divided into eithernormal (healthy),abnormal (diagnosed with CVD or other cardiac problems),or unsure (low quality signals)[8].A summary of the data, shown in Table2,was adapted from[42] and[8].These datasets also include additional information, such as individual disease diagnoses andannotations of the heart cycles.These can be used to assist with the data augmentation.

#### 3.1.2 Synchronised Multichannel PCG and ECG dataset

Recently, synchronised multichannel PCG and ECG (SMPECG) data has been collected from an EPCG devicethat consists of seven PCG and one lead-I ECG sensors[43].Using this device, data was collected from 105 subjects, of which 46 were diagnosed with coronaryartery disease.Ten seconds of audio were recorded for each subject, during which the subjects were instructed notto breathe to eliminate lung sounds from the recording.This data was collected in a clinical environment with background noise and non-optimal sensor placement as it is designed for ease of use, making it achallenging dataset for classification.As only single channel PCG is available in the other datasets, only a single channel (channel 2) wasused for this dataset.

#### 3.1.3 Incentia Dataset

Along with the training-a dataset used for the inputs for training the generative models, theincentive dataset[44] was utilised to provide unique unseen ECG to generate anaccompanying PCG signal.This data set contains 11,000 patients and 2,774,054,987 labelled heartbeats at a sample rate of$250\text{\,}\mathrm{Hz}$ with 541,794 segments.Each beat was classified with a type from normal, premature atrial contraction, prematureventricular contraction and rhythm from normal sinusal rhythm, atrial fibrillation and atrialflutter.

#### 3.1.4 Further Datasets

To improve the model’s robustness against noise, one of the stages of augmentation introduces noisefrom other PCG and ECG datasets.These are the electro-phono-cardiogram (EPHNOGRAM) dataset[45] for PCG and the Massachusetts Institute of Technology - Beth Israel Hospital (MIT-BIH) dataset[46] for ECG.The EPHNOGRAM dataset comprises 24 healthy adults and contains recordings taken during stress testsand at rest[45].The MIT-BIH dataset contains 12 half-hour ECG recordings and three half-hour recordings of noisetypical in ambulatory ECG recordings, where this noise is used for augmentation[46].

### 3.2 Signal Augmentation

The augmentation procedure of the PCG and ECG signals is shown in Figure7.The time stretching augmentation is synchronised to ensure that they are both stretched the sameamount, with the black lines representing the flow of the ECG data and the white linesrepresenting the flow of PCG data.Augmentation stages have different percentage chances of occurring, where the chances chosen were determined to providethe widest variety of augmented signals after every stage has been completed whilst also resulting in the bestperformance.The augmentations vary slightly between PCG and ECG to best meet the physiological constraints.

The PCG signals are augmented in various ways: harmonic percussive source separation (HPSS) foremphasis on certain parts of the signal, time stretching, emphasis on certain bands of thesignal using a parametric equalisation (EQ) filter and introducing noise from the EPHNOGRAMdataset[45].Before these operations are applied, the signals are normalised to have a zero mean and be between-1 and 1.Shown in Figure7 is the augmentation procedure applied to PCG data, noted with thewhite lines.

The HPSS has a 75% chance of occurring and works by extracting harmonic and percussive componentsof the signal with varying thresholds to extract different parts of the signal.The HPSS implementation is from the librosa v0.1.0 Python library[47, 48].$\mathbb{X}(t,k)$ denotes the short-time Fourier transform (STFT) of the signal $\mathbb{x}(t)$, defined as

$\mathbb{X}(t,k)=\sum_{n=0}^{N-1}\mathbb{w}(n)\mathbb{x}(n+tH)\exp\left(-2\pijkn%/N\right)$ | (9) |

where $\mathbb{w}$ is a sine-window, $H$ represents the hop size and $N$ is the window length and the length ofthe discrete Fourier transform.

Firstly, the STFT of the signal is calculated, with the parameters chosen randomly from a windowlength of 512, 1024 and 2048 with equal probability.A hop length was randomly chosen from 16, 32, 64, and 128 with uniform distribution.

Following this, the harmonic and percussive components are then extracted from the following,

$\tilde{\mathbb{Y}}_{h}(t,k)=median(\mathbb{X}(t-\ell_{h},k),\ldots{},\mathbb{X%}(t+\ell_{h},k))$ | (10) |

$\tilde{\mathbb{Y}}_{p}(t,k)=median(\mathbb{X}(t,k-\ell_{p}),\ldots{},\mathbb{X%}(t,k+\ell_{p}))$ | (11) |

$\mathbb{M}_{h}(t,k)=\begin{cases}1,&\text{if }\frac{\tilde{\mathbb{Y}}_{h}(t,k%)}{\tilde{\mathbb{Y}}_{p}(t,k)+\eta}>\lambda_{h}\\0,&\text{otherwise}\end{cases}$ | (12) |

$\mathbb{M}_{p}(t,k)=\begin{cases}1,&\text{if }\frac{\tilde{\mathbb{Y}}_{p}(t,k%)}{\tilde{\mathbb{Y}}_{h}(t,k)+\eta}\geq\lambda_{p}\\0,&\text{otherwise}\end{cases}$ | (13) |

$\mathbb{X}_{h}(t,k)=\mathbb{X}(t,k)\cdot\mathbb{M}_{h}(t,k)$ | (14) |

$\mathbb{X}_{p}(t,k)=\mathbb{X}(t,k)\cdot\mathbb{M}_{p}(t,k)$ | (15) |

where $\mathbb{X}_{h}(t,k)$ is the harmonic component, $\mathbb{X}_{p}(t,k)$ is the percussive component $\eta$ is asmall number added to avoid a divide by 0 error[48].$\mathbb{x}_{h}(t)$ and $\mathbb{x}_{p}(t)$ are the inverse STFT (ISTFT) of $\mathbb{X}_{h}(t,k)$ and $\mathbb{X}_{p}(t,k)$.If the thresholds, $\lambda_{h}>1$ or $\lambda_{p}>1$, there will be some part of the spectrum that is not a harmonic or percussive component of the signal but a residual component that appears as textured noise. As the abnormalities to be detected are from diseases that produce more percussive or harmonic sounds, these residuals can be ignored without important information loss that would negatively impact the ability of a classifier to classify these sounds.

The first set have parameters $\lambda_{h}=rand(1,2)$, $\lambda_{p}=rand(1,2)$, $\ell_{h}=randint(5,30)$, and $\ell_{p}=randint(5,30)$.$rand$ denotes a random floating point number chosen uniformly between the two bounds, and $randint$is an integer uniformly chosen between those bounds.The second set are then extracted from $\mathbb{X}_{h}(t,k)$ and $\mathbb{X}_{p}(t,k)$.$\mathbb{X}_{hh}(t,k)$ and$\mathbb{X}_{hp}(t,k)$ are the harmonic and percussive components of $\mathbb{X}_{h}(t,k)$ and $\mathbb{X}_{ph}(t,k)$ and $\mathbb{X}_{pp}(t,k)$ the harmonic and percussive components of $\mathbb{X}_{p}(t,k)$.The second stage of decomposition uses parameters of $\lambda_{hh}=rand(1,4)$, $\lambda_{hp}=rand(1,4)$, $\lambda_{ph}=rand(1,4)$, $\lambda_{pp}=rand(1,4)$, $\ell_{hh}=randint(5,30)$,$\ell_{hp}=randint(5,30)$, and $\ell_{ph}=randint(5,30)$, $\ell_{pp}=randint(5,30)$.

The ISTFT is then applied to each component before reconstructing the signal as,

$\begin{split}\mathbb{s}_{HPSS}(t)=a_{hh}\mathbb{x}_{hh}(t)+a_{hp}\mathbb{x}_{%hp}(t)+a_{ph}\mathbb{x}_{ph}(t)+a_{pp}\mathbb{x}_{pp}(t)\end{split}$ | (16) |

where $a_{hh}=rand(0.01,10)$, $a_{hp}=rand(0.01,10)$,$a_{ph}=rand(0.01,10)$, $a_{pp}=rand(0.01,10)$.

This two stage decomposition and reconstruction described in Equation16 is done twice to create $\mathbb{s}_{HPSS_{1}}(t)$ and $\mathbb{s}_{HPSS_{2}}(t)$, which are then combined to get the final augmented signal$\mathbb{s}_{HPSS_{final}}(t)$,

$\mathbb{s}_{HPSS_{final}}(t)=\mathbb{s}_{HPSS_{1}}(t)+a_{HPSS}\mathbb{s}_{HPSS%_{2}}(t)$ | (17) |

where $a_{HPSS}=rand(0.01,0.05)$.The use of these parameters was determined by inspection to ensure the signals remain realistic.

Next, there is a 7.5% chance of introducing noise to the signal, as defined in the equation below,where $\mathbb{s}_{HPSS}(t)$ is the signal after the HPSS augmentation stage, $\mathbb{s}_{SN}(t)$ is the augmented signal and$\mathbb{r}(t)\sim\mathcal{N}(\mu,\sigma I)$, $\sigma=rand\_choice(0.01,0.001,0.0001)$ and $\mu=rand(0,0.1)$.Note that $\mathbb{s}_{HPSS}(t)$ may not have had the HPSS augmentation applied as it depends on the random chance.$rand\_choice()$ denotes a random choice from those numbers with equal probability.

$\mathbb{s}_{SN}(t)=\mathbb{s}_{HPSS}(t)+\mathbb{r}(t)$ | (18) |

Following this, there is a 75% chance of adding in a time warp.This time warp will stretch the signal randomly to either 1.004 times the length or 1.006 timesthe length of the original signal.It is noted that a time warp with the same factor will be applied to both the PCG and ECG.

There is then a 75% chance of adding in amplitude modulation.The modulation is done as described in Equation19, where $b_{AM_{1}}=rand(0.01,0.25)$, $b_{AM_{2}}=rand(0.01,0.25)$, $c_{AM_{1}}=rand(0.05,0.5)$, $c_{AM_{2}}=rand(0.001,0.05)$, $d_{AM_{1}}=rand(0,1)$, $d_{AM_{2}}=rand(0,1)$ and $s_{TS}(t)$ is signal after the time stretch augmentation stage, which depending on the random chance may have been time-stretched.

$\mathbb{s}_{AM}=\mathbb{s}_{TS}(t)\cdot\left(1+b_{AM_{1}}\sin\left(2\pi c_{AM_%{1}}t+d_{AM_{1}}\right)+b_{AM_{2}}\sin\left(2\pi c_{AM_{2}}t+d_{AM_{2}}\right)\right)$ | (19) |

Next, there is another 7.5% chance of introducing the same noise as done in Equation18.Following this, there is a 25% chance of applyingparametric equalisation to boost frequency bands.Given the frequency range of $2\text{\,}\mathrm{Hz}500\text{\,}\mathrm{Hz}$,the bandwidth is randomly selected between 5% and 20% of this range,and the signal is attenuated using a bandpass filter.After repeating this process 5 times,the filtered signal and original signalare summed and normalised.

Lastly, real noise from the EPHNOGRAM dataset is introduced.The introduced noise from the EPHNOGRAM is clinical noise extracted from some of the recordings inthis dataset.This augmentation occurs 50% of the time.

The ECG signals are also augmented in numerous ways; these include introducing random noise, addingbaseline wander, time stretching, adding noise from the MIT-BIH dataset, and emphasising certainsignal bands. Figure7 shows the order of processing on the ECG, indicated withthe black lines.

Random noise is applied the same way as the PCG noise, as defined in Equation18, with thisaugmentation occurring with a probability of 7.5%.Next, a baseline wander is added 30% of this time.This is done as described in Equation20, where $b_{BW_{1}}=rand(0.01,0.2)$, $b_{BW_{2}}=rand(0.01,0.2)$, $c_{BW_{1}}=rand(0.05,0.5)$, $c_{BW_{2}}=rand(0.001,0.05)$, $d_{BW_{1}}=rand(0,1)$, $d_{BW_{2}}=rand(0,1)$. $\mathbb{s}_{SN_{E}}(t)$ is the ECG signal after the random noise augmentation stage, which may include the random noise as per the random chance.

$\mathbb{s}_{BW}(t)=\mathbb{s}_{SN_{E}}(t)+b_{BW_{1}}\sin\left(2\pi c_{BW_{1}}t%+d_{BW_{1}}\right)+b_{BW_{2}}\sin\left(2\pi c_{BW_{2}}t+d_{BW_{2}}\right)$ | (20) |

Following this, there is a 25% chance of a timewarp between 1 and 1.06 times the originalsignal.It is noted that a timewarp with the same factor will be applied to both the PCG and ECG.Then, the same parametric equalisation, as with the PCG, is applied between $0.25\text{\,}\mathrm{Hz}$ and$100\text{\,}\mathrm{Hz}$.

Lastly, noise from the MIT-BIH database is added.This is noise from the ECG sensors taken from recordings in the MIT-BIH database.

### 3.3 Synthetic Audio Generation

Synthetic signals were generated using the mel-spectrogram of the ECG signal as a conditioner for both theWaveGrad[6] and DiffWave[5] diffusion models.They are trained before data is generated for use.These diffusion models generated data for 3200 patients, 800 abnormal and 2400 normal, with three segmentsused to train the classification models.This is done to reduce the effect of overfitting to the synthetic signals.The ECG signals for conditioning were taken from the icentia database[44] to introducenew data, with abnormal ECG used for abnormal PCG.The generative models were trained to create individual conditions and make them more realisticusing additional labels from the dataset.To get around the lack of training data, the order of heart cycles was rearranged to increasetraining diversity.DiffWave and WaveGrad models were trained on an Nvidia RTX 4090 for 24 hours.The parameters for the DiffWave model that differ from the default are shown below inTable3.Parameters used for the WaveGrad model that differ from the default are shown in TableTable4.Both models differ slightly from their base implementations as they use a custom global conditioner.Additional global conditioners were added for specific abnormalities or lack of abnormalities, suchas mitral valve prolapse, innocent or benign murmurs, aortic disease, miscellaneous conditions,and normal.

Parameter | Value |
---|---|

Residual layers | 30 |

Residual channels | 64 |

Dilation cycle length | 10 |

Embedding dimension | 32 |

Batch size | 8 |

Learning rate | 2e-4 |

Noise schedule | T=50, linearly spaced [1e-4, 5e-2] |

Inference noise schedule | {1e-4, 1e-3, 1e-2, 5e-2, 2e-1, 5e-1} |

Parameter | Value |
---|---|

Embedding dimension | 32 |

Batch size | 8 |

Learning rate | 2e-4 |

Noise schedule | T=1000, linearly spaced [1e-6, 1e-2] |

To ensure a diversity of training examples, various heart cycles were occasionallyrearranged for each patient for each minibatch during training.This was done inside a custom collator, with a 75% chance of rearranging the heart cycles.Heart cycles could be rearranged in three ways with equal probability.The first will take groupings of many cycles and then randomly rearrange these large groups.These first groups would have a size of half of the total number of heart cycles within that signal.Secondly, groupings of 1 to 4 heart cycles were chosen randomly and used to rearrange the signal.Finally, the third way involved rearranging each heart cycle.

Although this rearranging can violate physiological constraints, it was found that this helped the model learn a better representation of the data and improved classification results when trained on the synthetic data.

The signals were then bandpass filtered between $2\text{\,}\mathrm{Hz}$ to $500\text{\,}\mathrm{Hz}$ for PCG and$0.25\text{\,}\mathrm{Hz}$ to $100\text{\,}\mathrm{Hz}$ for ECG, the conditioning signal.A mel-spectrogram of the ECG was created as the local conditioning signal.The mel-spectrogram was created using a sample rate of $4\text{\,}\mathrm{kHz}$, window length 1024, hoplength 256, and 80 mel bins.Crossfading was used to ensure minimal audio artifacts when rearranging heart cycles.As the signals are joined when they are both in the same state, the end of the cycle in thediastole phase, they are assumed to be roughly correlated.The crossfade occurs between the last 40 samples of the first signal, $-1\leq t\leq 0$, and thefirst 40 samples from the second signal, $0\leq t\leq 1$.If one of the signals has a low variance, then a simple linear crossfade is used between the two.A linear crossfade can be described from Equations21 and22 below,

$\mathbb{f}(t)=1/2+t/2,\;\;-1<t<1$ | (21) |

$\mathbb{v}(t)=\mathbb{f}(t)\mathbb{y}(t)+\mathbb{f}(-t)\mathbb{x}(t)$ | (22) |

where $f$ is the crossfade function, $v$ is the final spliced signal, $x$ is the last 40 samplesfrom the first signal, and $y$ is the first 40 samples from the second signal.

Otherwise, the following crossfade function will be used to ensure a crossfade is applied thatrepresents how correlated the two signals are.For two fully uncorrelated signals, a constant power crossfade would be desired, and for two fullycorrelated signals, a constant voltage crossfade would be desired and something in between ifnot fully correlated or uncorrelated.Assuming that the crossfade function is deterministic, the two signals are a randomprocess.Along with the assumption, the mean power of the signals at the point of crossfading is equal asthey are being crossfaded when in the same phase of the heart cycle.This allows the following generalised crossfade function[49] to be used to satisfy acrossfade related to the signals’ correlation.The crossfade is defined in Equations23, 24 and25,

$\mathbb{o}(t)=\frac{9}{16}\sin\left(\frac{\pi}{2}t\right)+\frac{1}{16}\sin%\left(\frac{3\pi}{2}t\right),\;\;\;-1<t<1$ | (23) |

$\mathbb{e}(t)=\sqrt{\frac{1}{2(1+r)}-\left(\frac{1-r}{1+r}\right)\mathbb{o}{(t%)}^{2}}$ | (24) |

$\mathbb{f}(t)=\mathbb{o}(t)+\mathbb{e}(t)$ | (25) |

where $e$ is the even component of the crossfade function, and $o$ is the odd component, and $r$ isthe correlation coefficient of the two signals at zero lag and $0\leq r\leq 1$.The crossfade is then interpolated to double the length using a univariate spline, with a degree of 3 and a smoothing factor equal to the length of the signal.The implementation is the scipy implementation of the univariate spline [50].The final signal consists of the first signal before the last 40 samples, the crossfaded andinterpolated signal, and the second signal after the first 40 samples. Figure8demonstrates the effect that this crossfade has on reducing artifacts.Rearranging of the heart cycles can be seen through the rearranging of the chirp in the lastrow.The first column shows the original signal, the second shows the rearranging of all heart cycles, thethird shows the rearranging of a few heart cycles, and the final shows the rearranging of largergroups of heart cycles.

### 3.4 Classification Model

The model used to test the augmented dataset is a convolutional neural network-based model finetuned from ResNet trained onImageNet[7].The purpose of choosing this model is not to show its better performance in classification but to demonstrate the capability of the proposed data augmentation methods.Before the signals are passed into the convolutional neural network (CNN), the PCG signal is bandpass filtered between$45\text{\,}\mathrm{Hz}$ and $400\text{\,}\mathrm{Hz}$.The ECG signal is bandpass filtered between 25Hz and 100Hz.The signals also then undergo normalisation.A spectrogram is created from the signal before being passed to the model, with a window length of 100 and a hop length of 50.This spectrogram is created based on $1.5\text{\,}\mathrm{s}$ of audio, with each being referred to as afragment, with the training objective to maximise accuracy on the fragment level.From the synthetic data, only three fragments of $1.5\text{\,}\mathrm{s}$ audio are taken to ensure reducedoverfitting to the synthetic data.These $1.5\text{\,}\mathrm{s}$ fragments differ from the original model[7] which took in a singleheart cycle.This change has been done to reduce the need for accurate segmentation.For testing the subject level, the outputs from the classification are averaged between allfragments before the classification is made, as was done previously.

The Adam optimiser is used for training along with a cyclic triangular learning rate scheduler withparameters below in Table5.

Parameter | Value |
---|---|

initial learning rate | $0.001$ |

betas | $(0.9,0.999)$ |

epsilon | $10^{-8}$ |

weight decay | $10^{-3}$ |

learning rate step size up | $2$ |

learning rate step size down | $2$ |

max learning rate | $10^{-3}$ |

During the model’s training on the original dataset, as a CNN is being finetuned, only 10 epochs areused in which the best weights are chosen from the highest MCC value from the validation set toreduce overfitting.The model is only updated for each dataset if it performed better on the validation set thanpreviously.A schedule is used to reduce the overfitting of the synthetic data for training on the augmenteddataset.This schedule can be found below in Table6 and was experimentally determinedto provide the best results, where max-mix is all of the data with no augmentations being applied to the original dataset and 3augmentations applied to the DiffWave and WaveGrad data.From the synthetic data, only three random segments were taken to ensure the model does not overfitto the synthetic data.The max-aug data is the original data with 30 augmentations being applied and no synthetic data.

Data | Epochs |
---|---|

max-mix | 8 |

max-aug | 8 |

max-mix | 8 |

max-aug | 8 |

max-mix | 8 |

max-aug | 8 |

max-mix | 16 |

max-aug | 16 |

max-mix | 16 |

max-aug | 16 |

max-mix | 16 |

max-aug | 16 |

As only the training-a dataset contains synchronised PCG and ECG for measuring the OOD performance,a PCG-only model will also be trained and used to be evaluated on training-b-f datasets whilstthe PCG and ECG input model will be evaluated on the SMPECG dataset.

## 4 Results

### 4.1 In-distribution Performance

The ID results are for the datasets on which the models were trained.This shows the increase in performance when training on the augmented dataset compared tothe original dataset.As the only dataset being trained on was training-a, these are the only models presented forin-distribution performance. Table7 displays the ID performance when the models aretrained on the original dataset, with Table8 displaying the ID performance for modelstrained on the augmented dataset.

Dataset | Data | Acc | Acc-mu | TPR | TNR | PPV | NPV | $\text{F1}^{+}$ | $\text{F1}^{-}$ | MCC |
---|---|---|---|---|---|---|---|---|---|---|

training-a | PCG+ECG | 90.10% | 89.40% | 91.20% | 87.50% | 94.50% | 80.80% | 92.90% | 84.0% | 0.770 |

training-a | PCG | 70.40% | 56.00% | 91.20% | 20.80% | 73.20% | 50.00% | 81.20% | 29.40% | 0.167 |

Dataset | Data | Acc | Acc-mu | TPR | TNR | PPV | NPV | $\text{F1}^{+}$ | $\text{F1}^{-}$ | MCC |
---|---|---|---|---|---|---|---|---|---|---|

training-a | PCG+ECG | 92.60% | 93.50% | 91.20% | 95.80% | 98.10% | 82.10% | 94.50% | 88.50% | 0.836 |

training-a | PCG | 84.00% | 80.20% | 89.50% | 70.80% | 87.90% | 73.90% | 88.70% | 72.30% | 0.611 |

### 4.2 Out-of-distribution Performance

The out-of-distribution results are for the datasets the models were not trained on.Hence, this shows an increase in the generalisation of the models to other datasets that were nottrained on.As the dataset being trained on was training-a, all other datasets are presented for theout-of-distribution performance. Table9 shows the OOD performance on the originaldataset, with Table10 showing the OOD performance when trained on the augmenteddataset.

Dataset | Data | Acc | Acc-mu | TPR | TNR | PPV | NPV | $\text{F1}^{+}$ | $\text{F1}^{-}$ | MCC |
---|---|---|---|---|---|---|---|---|---|---|

training-b | PCG | 22.90% | 50.7% | 99.00% | 2.30% | 21.5% | 90.00% | 35.30% | 4.5% | 0.040 |

training-c | PCG | 74.20% | 47.90% | 95.80% | 0.00% | 76.70% | 0.00% | 85.20% | NaN | -0.099 |

training-d | PCG | 49.10% | 48.50% | 82.10% | 14.80% | 50.0% | 44.40% | 62.20% | 22.20% | -0.041 |

training-e | PCG | 40.90% | 65.80% | 96.20% | 35.50% | 12.70% | 99.00% | 22.50% | 52.20% | 0.192 |

training-f | PCG | 52.60% | 58.60% | 73.50% | 43.80% | 35.70% | 79.50% | 48.10% | 56.50% | 0.162 |

SMPECG | PCG+ECG | 56.20% | 50.20% | 98.30% | 2.20% | 56.30% | 50.00% | 71.60% | 4.20% | 0.017 |

SMPECG | PCG | 56.20% | 50.20% | 98.30% | 2.20% | 56.30% | 50.00 | 71.60% | 4.20% | 0.017 |

Dataset | Data | Acc | Acc-mu | TPR | TNR | PPV | NPV | $\text{F1}^{+}$ | $\text{F1}^{-}$ | MCC |
---|---|---|---|---|---|---|---|---|---|---|

training-b | PCG | 33.30% | 53.10% | 87.50% | 18.70% | 22.50% | 84.70% | 35.80% | 30.60% | 0.066 |

training-c | PCG | 83.90% | 74.70% | 91.7% | 57.10% | 88.00% | 66.70% | 89.80% | 61.50% | 0.517 |

training-d | PCG | 52.70% | 52.00% | 92.90% | 11.10% | 52.00% | 60.00% | 66.70% | 18.80% | 0.069 |

training-e | PCG | 84.00% | 86.00% | 88.50% | 83.50% | 34.50% | 98.70% | 49.60% | 90.50% | 0.489 |

training-f | PCG | 73.70% | 60.10% | 26.50% | 93.80% | 64.30% | 75.00% | 37.50% | 83.30% | 0.282 |

SMPECG | PCG+ECG | 61.90% | 57.00% | 96.60% | 17.40% | 60.00% | 80.00% | 71.40% | 28.60% | 0.237 |

SMPECG | PCG | 57.10% | 51.60% | 96.60% | 6.50% | 57.00% | 60.00% | 71.70% | 11.80% | 0.073 |

## 5 Discussion

It was found that the ID performance was improved for all models tested, with a 2.5% improvement inaccuracy for the PECG model and a 13.6% improvement in subject-level accuracy for the PCGmodel.The augmented dataset is also shown to improve the balanced accuracy and hence help to balancebetween sensitivity and specificity, with all these being improved from the original dataset tothe augmented dataset.This was observed through a balanced accuracy improvement of 4.1% and 24.2% for the PECG and PCGmodels, respectively.This is further shown by an increase in the MCC value from 0.77 to 0.836 and 0.167 to 0.611 forthe EPCG and PCG models, respectively.This shows that by augmenting the original data as well as adding synthetic data, and ensuring abalanced dataset, the ID performance can be improved.

The OOD performance was also seen to improve with the augmented dataset.Although the models were not trained on these datasets, the introduction of augmented data improvedall model’s accuracy and overall robustness, as seen by the increase in MCC valuesacross all datasets.In particular, in the CinC datasets, there was an improvement in accuracy of at most 43.1% intraining-e and of at least 3.6% in training-d, with the improvement in accuracy in all otherCinC datasets are between these values.Further, the balanced accuracy in all of these datasets was improved.With the greatest increase in balanced accuracy of 26.8% from training-c and the smallest being 1.5% from training-f.The MCC was also seen to increase in all cases, with the greatest increase of 0.616 occurring intraining-c and the smallest increase of 0.026 in training-b.With all performance metrics increasing, the OOD performance was improved by the use of thisaugmented dataset, which shows that these augmentations help to improve the robustness of modelswhen used on unseen OOD data.

In the SMPECG dataset, there was a much smaller improvement in accuracy, with an increase of 5.7%with the EPCG model and an increase of 0.9% with the PCG model.Also, balanced accuracy for both models increases, with 6.8% and 1.4% for the EPCGand PCG models, respectively.However, there was a much greater improvement in MCC and overall balancing the performance withan increase to the MCC value of 0.22 for the EPCG model and 0.056 for the PCG model.This shows that although a small improvement, this augmentation helps not only improveclassification accuracy but also helps to balance the classifier, improving its balancedaccuracy and MCC values.

As shown, both the ID and OOD performance have been increased by utilising the augmented data,achieving the objective of improving the robustness of the classifier.Better results are found for PCG-only models.This, however, is due to more data to test withthan synchronised PCG and ECG data.However, the OOD for some datasets is still low, showing that there is still room forimprovement in making a truly robust and general abnormal heart sound classifier.Utilising a larger dataset and applying these methods, the classifier is expected to become muchmore general, as seen with classifiers trained on this smaller dataset.

## 6 Conclusion and Further Work

Increasing training data through augmentation has improved ID and OOD performance in classifying abnormal heart sounds.The use of diffusion models to generate synthetic heart sounds conditioned on ECG signals has successfully enabled the generation of synchronised PCG from ECG data, expanding the data distribution and enhancing classifier robustness.This is not limited to classifiers that utilise multimodal PCG and ECG data but also for single-mode classifiers that utilise only PCG, as found from the increase in performance and robustness of PCG-only models.Future work should scale this approach to multichannel PCG signals for use with classifiers that utilise such data.

This study provides evidence that data augmentation, specifically through DDPMs, can significantly enhance the robustness and generalisation of classifiers for abnormal heart sound detection.By conditioning synthetic PCG signals on ECG data, we generated augmented datasets that improved performance in both ID and OOD scenarios, consistently observed across key metrics such as accuracy, balanced accuracy, and MCC.

Our approach increases the size of training datasets and enriches data diversity, which is crucial for developing models resilient to variations in real-world clinical settings.The augmentation process effectively addresses data imbalance and noise, providing a stronger foundation for training machine learning models.

However, while the introduced augmentation techniques have shown promise, certain limitations remain, particularly in generalising models to new datasets.The OOD performance, though improved, suggests that further refinement of these methods is necessary.This could involve optimising diffusion model parameters or exploring alternative generative approaches that better capture the complex patterns in biomedical signals.

Future work should focus on scaling these methods to accommodate multichannel PCG data, enabling more comprehensive heart sound analysis and potentially improving classification accuracy.This study demonstrates a viable strategy for enhancing classifier performance through synthetic data generation, contributing to more reliable cardiovascular disease diagnosis.

## Authors’ contribution

Leigh Abbott: conceptualisation, methodology, software, formal analysis, validation investigation-data collection, writing-original draft, writing-review & editing, visualisation.Milan Marocchi: software, validation, writing-original draft, writing-review & editing, visualisation.Matthew Fynn: writing-review & editing.Yue Rong: resources, project administration, writing-review &editing, supervision.Sven Nordholm:resources, project administration, writing-review & editing, supervision.

## Ethics approval and consent

The study received approval from the ethics committee of Fortis Hospital, Kolkata, India, where the multichannel data collection occurred. Informed consent was obtained from all participating subjects.All other datasets are open-access, so no approval is required.

## Acknowledgement

We thank Ticking Heart Pty Ltd for allowing the use of data collected from their vest design. We also thank Harry Walters for his valuable remarks and feedback on this work.

## Conflict of Interest

We declare that we have no conflicts of interest.

## References

- [1]“Cardiovascular diseases (CVDs).” [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
- Reed etal. [2004]T.R. Reed, N.E. Reed, and P.Fritzson, “Heart sound analysis for symptom detection and computer-aided diagnosis,”
*Simulation Modelling Practice and Theory*, vol.12, no.2, pp. 129–146, May 2004. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1569190X04000206 - Rong etal. [2023a]Y.Rong, M.Fynn, S.Nordholm, S.Siaw, and G.Dwivedi, “Wearable electro-phonocardiography device for cardiovascular disease monitoring,” 2023. [Online]. Available: https://ddfe.curtin.edu.au/yurong/SSP23.pdf
- Thomae and Dominik [2016]C.Thomae and A.Dominik, “Using deep gated RNN with a convolutional front end for end-to-end classification of heart sound,” in
*2016 Computing in Cardiology Conference (CinC)*, 2016, pp. 625–628. - Kong etal. [2021]Z.Kong, W.Ping, J.Huang, K.Zhao, and B.Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” 2021.
- Chen etal. [2020]N.Chen, Y.Zhang, H.Zen, R.J. Weiss, M.Norouzi, and W.Chan, “Wavegrad: Estimating gradients for waveform generation,” 2020.
- Marocchi etal. [2023]M.Marocchi, L.Abbott, Y.Rong, S.Nordholm, and G.Dwivedi, “Abnormal heart sound classification and model interpretability: A transfer learning approach with deep learning,”
*Journal of Vascular Diseases*, vol.2, no.4, pp. 438–459, 2023. [Online]. Available: https://www.mdpi.com/2813-2475/2/4/34 - Liu etal. [2016]C.Liu, D.Springer, Q.Li, B.Moody, R.A. Juan, F.J. Chorro, F.Castells, J.M. Roig, I.Silva, A.E.W. Johnson, Z.Syed, S.E. Schmidt, C.D. Papadaniil, L.Hadjileontiadis, H.Naseri, A.Moukadem, A.Dieterlen, C.Brandt, H.Tang, M.Samieinasab, M.R. Samieinasab, R.Sameni, R.G. Mark, and G.D. Clifford, “An open access database for the evaluation of heart sound algorithms,”
*Physiological Measurement*, vol.37, no.12, p. 2181 – 2213, 2016, cited by: 382; All Open Access, Green Open Access. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85007492920&doi=10.1088%2f0967-3334%2f37%2f12%2f2181&partnerID=40&md5=bf943637cf0ed9217d6f4243debcde9c - Leatham [1975]A.Leatham, “Auscultation of the heart and phonocardiography,”
*(No Title)*, 1975. - Schmidt etal. [2015]S.E. Schmidt, C.Holst-Hansen, J.Hansen, E.Toft, and J.J. Struijk, “Acoustic Features for the Identification of Coronary Artery Disease,”
*IEEE Transactions on Biomedical Engineering*, vol.62, no.11, pp. 2611–2619, Nov. 2015. - Springer etal. [2016]D.B. Springer, L.Tarassenko, and G.D. Clifford, “Logistic regression-HSMM-based heart sound segmentation,”
*IEEE transactions on biomedical engineering*, vol.63, no.4, pp. 822–832, 2016. - Rajni and Kaur [2013]R.Rajni and I.Kaur, “Electrocardiogram signal analysis-an overview,”
*International Journal of Computer Applications*, vol.84, no.7, pp. 22–25, 2013. - Clifford etal. [2006]G.D. Clifford, F.Azuaje, and P.McSharry,
*Advanced Methods and Tools for ECG Data Analysis*.Artech House, 2006. - Xie [2020]C.Xie,
*Biomedical Signal Processing: An ECG Application*.Cham: Springer International Publishing, 2020, pp. 285–303. [Online]. Available: https://doi.org/10.1007/978-3-030-47994-7_17 - DeBacquer etal. [1998]D.DeBacquer, G.DeBacker, M.Kornitzer, K.Myny, Z.Doyen, and H.Blackburn, “Prognostic value of ischemic electrocardiographic findings for cardiovascular mortality in men and women,”
*Journal of the American College of Cardiology*, vol.32, no.3, pp. 680–685, 1998. - Tran etal. [2022]D.Tran, J.Liu, M.W. Dusenberry, D.Phan, M.Collier, J.Ren, K.Han, Z.Wang, Z.Mariet, H.Hu, N.Band, T.G.J. Rudner, K.Singhal, Z.Nado, J.van Amersfoort, A.Kirsch, R.Jenatton, N.Thain, H.Yuan, K.Buchanan, K.Murphy, D.Sculley, Y.Gal, Z.Ghahramani, J.Snoek, and B.Lakshminarayanan, “Plex: Towards reliability using pretrained large model extensions,” 2022. [Online]. Available: https://storage.googleapis.com/plex-paper/plex.pdf
- Briscoe and Feldman [2011]E.Briscoe and J.Feldman, “Conceptual complexity and the bias/variance tradeoff,” p. 2–16, 2011. [Online]. Available: dx.doi.org/10.1016/j.cognition.2010.10.004
- Yao etal. [2022]H.Yao, Y.Wang, S.Li, L.Zhang, W.Liang, J.Zou, and C.Finn, “Improving out-of-distribution robustness via selective augmentation,” 2022. [Online]. Available: https://arxiv.org/abs/2201.00299
- Sokolova and Lapalme [2009]M.Sokolova and G.Lapalme, “A systematic analysis of performance measures for classification tasks,”
*Information Processing & Management*, vol.45, no.4, pp. 427–437, 2009. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306457309000259 - DeDiego etal. [2022]I.M. DeDiego, A.R. Redondo, R.R. Fernández, J.Navarro, and J.M. Moguerza, “General performance score for classification problems,”
*Applied Intelligence*, vol.52, no.10, p. 12049–12063, aug 2022. [Online]. Available: https://doi.org/10.1007/s10489-021-03041-7 - Ren etal. [2023]Z.Ren, Y.Chang, T.T. Nguyen, Y.Tan, K.Qian, and B.W. Schuller, “A comprehensive survey on heart sound analysis in the deep learning era,” 2023.
- Chicco and Jurman [2020]D.Chicco and G.Jurman, “The advantages of the matthews correlation coefficient (MCC) over f1 score and accuracy in binary classification evaluation,”
*BMC Genomics*, vol.21, no.6, 2020. [Online]. Available: https://doi.org/10.1186/s12864-019-6413-7 - Ding etal. [2021]M.Ding, K.Kong, J.Chen, J.Kirchenbauer, M.Goldblum, D.Wipf, F.Huang, and T.Goldstein, “A closer look at distribution shifts and out-of-distribution generalization on graphs,” in
*NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications*, 2021. [Online]. Available: https://openreview.net/forum?id=XvgPGWazqRH - Bank etal. [2021]D.Bank, N.Koenigstein, and R.Giryes, “Autoencoders,” 2021. [Online]. Available: http://arxiv.org/abs/2003.05991
- Kingma and Welling [2019]D.P. Kingma and M.Welling, “An introduction to variational autoencoders,”
*Foundations and Trends® in Machine Learning*, vol.12, no.4, pp. 307–392, 2019. [Online]. Available: https://doi.org/10.1561/2200000056 - Goodfellow etal. [2014]I.J. Goodfellow, J.Pouget-Abadie, M.Mirza, B.Xu, D.Warde-Farley, S.Ozair, A.Courville, and Y.Bengio, “Generative adversarial networks,” 2014. [Online]. Available: http://arxiv.org/abs/1406.2661
- Sohl-Dickstein etal. [2015]J.Sohl-Dickstein, E.A. Weiss, N.Maheswaranathan, and S.Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” 2015. [Online]. Available: http://arxiv.org/abs/1503.03585
- Yang etal. [2023]L.Yang, Z.Zhang, Y.Song, S.Hong, R.Xu, Y.Zhao, W.Zhang, B.Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” 2023. [Online]. Available: https://arxiv.org/abs/2209.00796
- Rombach etal. [2022]R.Rombach, A.Blattmann, D.Lorenz, P.Esser, and B.Ommer, “High-resolution image synthesis with latent diffusion models,” 2022. [Online]. Available: http://arxiv.org/abs/2112.10752
- Xiao etal. [2022]Z.Xiao, K.Kreis, and A.Vahdat, “Tackling the generative learning trilemma with denoising diffusion GANs,” 2022. [Online]. Available: http://arxiv.org/abs/2112.07804
- Pinaya etal. [2022]W.H.L. Pinaya, P.-D. Tudosiu, J.Dafflon, P.F. daCosta, V.Fernandez, P.Nachev, S.Ourselin, and M.J. Cardoso, “Brain imaging generation with latent diffusion models,” 2022. [Online]. Available: http://arxiv.org/abs/2209.07162
- Zhou etal. [2022]G.Zhou, Y.Chen, and C.Chien, “On the analysis of data augmentation methods for spectral imaged based heart sound classification using convolutional neural networks,”
*BMC Medical Informatics and Decision Making*, 2022. - Saldanha etal. [2022]J.Saldanha, S.Chakraborty, S.Patil, K.Kotecha, S.Kumar, and A.Nayyar, “Data augmentation using variational autoencoders for improvement of respiratory disease classification,”
*PLOS ONE*, vol.17, no.8, pp. 1–41, 08 2022. [Online]. Available: https://doi.org/10.1371/journal.pone.0266467 - Kochetov and Filchenkov [2021]K.Kochetov and A.Filchenkov, “Generative adversarial networks for respiratory sound augmentation,” in
*Proceedings of the 2020 1st International Conference on Control, Robotics and Intelligent System*, ser. CCRIS ’20.New York, NY, USA: Association for Computing Machinery, 2021, p. 106–111. [Online]. Available: https://doi.org/10.1145/3437802.3437821 - Zhang etal. [2020]Z.Zhang, J.Han, K.Qian, C.Janott, Y.Guo, and B.Schuller, “Snore-GANs: Improving automatic snore sound classification with synthesized data,”
*IEEE Journal of Biomedical and Health Informatics*, vol.24, no.1, pp. 300–310, 2020. - Narváez and Percybrooks [2020]P.Narváez and W.S. Percybrooks, “Synthesis of normal heart sounds using generative adversarial networks and empirical wavelet transform,”
*Applied Sciences*, vol.10, no.19, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/19/7003 - Dissanayake etal. [2023]T.Dissanayake, T.Fernando, S.Denman, S.Sridharan, and C.Fookes, “Generalized generative deep learning models for biosignal synthesis and modality transfer,”
*IEEE Journal of Biomedical and Health Informatics*, vol.27, no.2, pp. 968–979, 2023. - Kebaili etal. [2023]A.Kebaili, J.Lapuyade-Lahorgue, and S.Ruan, “Deep learning approaches for data augmentation in medical imaging: A review,”
*Journal of Imaging*, vol.9, no.4, 2023. [Online]. Available: https://www.mdpi.com/2313-433X/9/4/81 - Ho etal. [2020]J.Ho, A.Jain, and P.Abbeel, “Denoising diffusion probabilistic models,” 2020.
- Bińkowski etal. [2019]M.Bińkowski, J.Donahue, S.Dieleman, A.Clark, E.Elsen, N.Casagrande, L.C. Cobo, and K.Simonyan, “High fidelity speech synthesis with adversarial networks,” 2019.
- vanden Oord etal. [2016]A.vanden Oord, S.Dieleman, H.Zen, K.Simonyan, O.Vinyals, A.Graves, N.Kalchbrenner, A.Senior, and K.Kavukcuoglu, “WaveNet: A generative model for raw audio,” 2016. [Online]. Available: https://arxiv.org/abs/1609.03499
- Clifford etal. [2016]G.D. Clifford, C.Liu, B.Moody, D.Springer, I.Silva, Q.Li, and R.G. Mark, “Classification of normal/abnormal heart sound recordings: The physionet/computing in cardiology challenge 2016,”
*Computing in Cardiology*, vol.43, p. 609 – 612, 2016, cited by: 128. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85016097943&partnerID=40&md5=7c41089fc1564d3091a220eeaf126379 - Rong etal. [2023b]Y.Rong, M.Fynn, and S.Nordholm,
*A Pre-Screening Technique for Coronary Artery Disease with Multi-Channel Phonocardiography and Electrocardiography*.Taylor & Francis, 2023, ch.9. [Online]. Available: https://www.taylorfrancis.com/books/edit/10.1201/9781003346678/non-invasive-health-systems-based-advanced-biomedical-signal-image-processing-adel-al-jumaily-paolo-crippa-ali-mansour-claudio-turchetti - Tan etal. [2019]S.Tan, G.Androz, A.Chamseddine, P.Fecteau, A.Courville, Y.Bengio, and J.P. Cohen, “Icentia11k: An unsupervised representation learning dataset for arrhythmia subtype discovery,” 2019.
- Kazemnejad etal. [2021]A.Kazemnejad, P.Gordany, and R.Sameni, “EPHNOGRAM: A simultaneous electrocardiogram and phonocardiogram database,” 2021.
- Moody and Mark [2001]G.Moody and R.Mark, “The impact of the MIT-BIH arrhythmia database,”
*IEEE Engineering in Medicine and Biology Magazine*, vol.20, no.3, pp. 45–50, 2001. - McFee [2023]B.McFee, “librosa/librosa: 0.10.1,” https://doi.org/10.5281/zenodo.8252662, Aug 2023, accessed: 2024-03-24.
- Driedger etal. [2014]J.Driedger, M.Muller, and S.Disch, “Extending harmonic-percussive separation of audio signals,” in
*Proceedings of the International Society for Music Information Retrieval Conference*, vol.15, 2014. - Bristow-Johnson [2011]R.Bristow-Johnson, “A theory of optimal splicing of audio in the time domain,” Music-DSP Mailing List, July 2011, accessed: 2024-03-24. [Online]. Available: https://music.columbia.edu/pipermail/music-dsp/2011-July/069971.html
- Virtanen etal. [2020]P.Virtanen, R.Gommers, T.E. Oliphant, M.Haberland, T.Reddy, D.Cournapeau, E.Burovski, P.Peterson, W.Weckesser, J.Bright, S.J. van der Walt, M.Brett, J.Wilson, K.J. Millman, N.Mayorov, A.R.J. Nelson, E.Jones, R.Kern, E.Larson, C.J. Carey, İ.Polat, Y.Feng, E.W. Moore, J.VanderPlas, D.Laxalde, J.Perktold, R.Cimrman, I.Henriksen, E.A. Quintero, C.R. Harris, A.M. Archibald, A.H. Ribeiro, F.Pedregosa, P.van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,”
*Nature Methods*, vol.17, pp. 261–272, 2020.