Generative Deep Learning and Signal Processing for Data Augmentation of Cardiac Auscultation Signals: Improving Model Robustness Using Synthetic Audio (2025)

Leigh Abbott, Milan Marocchi, Matthew Fynn, Yue Rong, Sven Nordholm

Abstract

Accurately interpreting cardiac auscultation signals plays a crucial role in diagnosing and managing cardiovascular diseases. However, the paucity of labelled data inhibits the training of classification models. To overcome this challenge, researchers have turned to generative deep learning techniques combined with signal processing to augment the existing data and improve cardiac auscultation classification models. However, the primary focus of prior studies has been on model performance as opposed to model robustness. Robustness, in this case, is defined as both the in-distribution and out-of-distribution performance by measures such as Matthew's correlation coefficient. This work shows that more robust abnormal heart sound classifiers can be trained using an augmented dataset. The augmentations consist of traditional audio approaches and the creation of synthetic audio conditionally generated using the WaveGrad and DiffWave diffusion models. It is found that both the in-distribution and out-of-distribution performance can be improved over various datasets when training a convolutional neural network-based classification model with this augmented dataset. With the performance increase encompassing not only accuracy but also balanced accuracy and Matthew's correlation coefficient, an augmented dataset significantly contributes to resolving issues of imbalanced datasets. This, in turn, helps provide a more general and robust classifier.

keywords:

Data augmentation, Denoising diffusion probabilistic models, Generative deep learning, Abnormal heart sound classification, Synthetic audio generation

Affiliation: School of Electrical Engineering, Computing, and Mathematical Sciences (EECMS), Faculty of Science and Engineering, Curtin University, Bentley, WA 6102, Australia

1 Introduction

Cardiovascular disease (CVD) is the primary contributor to mortality worldwide, representing more than 30% of all global deaths in 2019 [1]. In addition to the human cost, CVD places an immense economic burden on healthcare systems and society [1]. To treat CVD effectively, it is necessary to diagnose and evaluate the condition of the heart accurately.

Cardiac auscultation (CA) is the process of listening to sounds generated by the heart [2]. Physicians have traditionally performed CA using stethoscopes to detect and monitor heart conditions in a non-invasive manner. However, the difficulty of performing CA leads to uncertainty in diagnosis and poor patient outcomes. The issue is further complicated by the fact that CA is both difficult to teach and a specialised skill, with studies noting that primary care physicians often lack proficiency in this area [2].

Recently, a wearable multichannel electrophonocardiography (EPCG) device has been developed [3]. The premise of this device is to detect CVD utilising synchronised phonocardiogram (PCG) and electrocardiogram (ECG) data, as the combination of these signals can result in more accurate and robust classifications. However, there is currently limited synchronised multichannel phonocardiogram and electrocardiogram (SMPECG) data, which creates a need for a technique to aid in creating a larger dataset.

Several limitations currently prevent robust classification results across multiple datasets. These include a lack of quality data and unbalanced datasets, with much of the available data containing substantial background noise and therefore a low signal-to-noise ratio. There is also a limited amount of synchronised PCG and ECG recordings, which limits the effectiveness of algorithms, despite the large amounts of standalone ECG and some PCG data. Traditional augmentation approaches, applied to existing signals, can help to overcome these issues [4]. This is somewhat lacking, however, as it does not always increase the out-of-distribution performance, leaving room for further approaches to address this issue. With recent advancements in conditional waveform generation using diffusion models [5, 6], it is possible to extend previously ECG-only datasets by generating PCG signals conditioned on the ECG in these datasets.

This work explores traditional augmentation approaches alongside the generation of synthetic signals to create more robust classifiers of abnormal heart sounds.

The main contributions of this work are summarised below:

  • 1.

    Development of a diffusion model to create PCG signals conditional on existing ECG signals, allowing additional data to be used from ECG datasets once the diffusion model has created the corresponding PCG signal. To the best of our knowledge, this is the first work using diffusion models to generate PCG signals.

  • 2.

Traditional augmentation methods synchronised across the PCG and ECG signals, extending beyond the augmentations utilised in other studies.

  • 3.

Augmentation methods were applied to a previously top-performing model [7] on the training-a dataset [8], resulting in improvements of 2.5% in accuracy, 4.1% in balanced accuracy, 1.9% in $F_1^+$ score, and 0.066 in Matthew's Correlation Coefficient (MCC). Additionally, when tested on the training-e dataset, where the model had not been trained on any of the dataset's data, there were notable improvements of 43.1% in accuracy, 20.2% in balanced accuracy, 27.1% in $F_1^+$ score, and 0.297 in MCC.

The remainder of the paper is organised as follows. Background on PCG and ECG signals, model robustness, biomedical signal augmentation, and generative models is covered in Section 2. Following this, the methods and results are presented in Sections 3 and 4, before a discussion of the results in Section 5; the final conclusions and further work are summarised in Section 6.

2 Background

2.1 Phonocardiogram and Electrocardiogram Signals

PCG signals comprise multiple sounds arising from the opening and closing of valves and from blood flow inside the heart, which cause vibrations that are recorded from the chest wall [9]. The fundamental heart sounds are the first (S1) and second (S2) sounds, which are the most prominent. S1 occurs at the beginning of systole and is caused by isovolumetric ventricular contraction. S2 is caused by the closing of the aortic and pulmonic valves at the beginning of diastole. Although the S1 and S2 sounds are the most audible, PCG signals consist of many other heart sounds, such as the third (S3) and fourth (S4) heart sounds, systolic ejection clicks, mid-systolic clicks, the opening snap, and heart murmurs [8]. These heart murmurs are produced by turbulent flowing blood, which can indicate the presence of particular CVDs. These various heart sounds all lie within the low frequencies: S1 spans 10 Hz to 140 Hz, with the highest energy around 25 Hz to 45 Hz; S2 spans 10 Hz to 200 Hz, with most of the energy around 55 Hz to 75 Hz; and S3 and S4 span 20 Hz to 70 Hz, although they are much less audible, mainly occurring in children and pathological subjects. Murmurs are usually found at slightly higher frequencies, ranging from 25 Hz to 400 Hz [10], with some found at frequencies above 600 Hz, but with far less energy [11].

ECG signals represent the heart's electrical activity [12]. An ECG signal consists of the P wave, QRS complex, and T wave, with a U wave also occasionally present [13]. These waves can contain information to aid in CVD diagnosis. ECG signals are commonly filtered between 0.5 Hz and 40 Hz to remove baseline wander and unwanted noise and interference [14]. For example, in the case of coronary artery disease patients, studies have documented that symptoms such as T-wave inversion, ST-T abnormalities, left ventricular hypertrophy, and premature ventricular contractions can be observed [15].

Combining these two signals has produced superior results compared to classification using a single signal [7], suggesting that relevant features for classification exist within both signals. The increase in performance suggests that utilising synchronised PCG and ECG data will help to create more accurate and robust classifiers.

2.2 Model Robustness

Tran et al. (2022) [16] presented a state-of-the-art framework for enhancing model reliability, focusing on robust generalisation. Robust generalisation allows a model to perform well on data outside the training set [16], encompassing in-distribution (ID) and out-of-distribution (OOD) generalisation [16].

ID generalisation pertains to a model's performance on data within the training distribution but outside the training set, addressing underfitting and overfitting issues [16, 17]. OOD generalisation, on the other hand, concerns a model's ability to handle data distributions different from the training set, addressing distribution shifts such as subpopulation shifts, covariate shifts, and domain shifts [16, 18].

Perturbation resilience is the ability of a model to handle atypical and significantly different data, including corruption, distortion, artifacts, missing data, gaps, spectral masking, extreme noise, and defective inputs, which is critical in clinical settings.

2.2.1 Measuring Model Robustness

Table 1 shows formulas for traditional binary classification performance measures derived from the confusion matrix in Figure 1 [19, 20, 21]. Sensitivity (recall/true positive rate) and specificity (true negative rate) measure correct classifications of positive and negative cases, respectively [19]. Precision (positive predictive value) and negative predictive value measure the proportion of correctly classified cases among those classified as positive and negative, respectively [19]. Accuracy measures overall correct classifications [19]. Ideally, all these measures are unity, indicating no false predictions.

Figure 1: Confusion matrix.

                              Actual
                        Positive    Negative
Classified   Positive      TP          FP
             Negative      FN          TN

While having one target metric is ideal, it is impractical, as each metric contains different information and no single measure captures all the information from a confusion matrix [20]. Summary metrics can be biased under certain conditions; for instance, accuracy can be misleading for imbalanced datasets. Matthew's correlation coefficient (MCC) is a better single metric for classifier performance than F scores [22].

Table 1: Binary classification performance measures.

Metric | Formula
Sensitivity | $\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}$
Specificity | $\text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}}$
Precision | $\text{PPV} = \frac{\text{TP}}{\text{TP}+\text{FP}}$
Negative Predictive Value | $\text{NPV} = \frac{\text{TN}}{\text{TN}+\text{FN}}$
Accuracy | $\text{acc} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$
Balanced Accuracy | $\text{acc}_\mu = \frac{\text{TPR}+\text{TNR}}{2}$
F1-Positive-Score | $\text{F}_1^+ = \frac{2\cdot\text{PPV}\cdot\text{TPR}}{\text{PPV}+\text{TPR}}$
F1-Negative-Score | $\text{F}_1^- = \frac{2\cdot\text{NPV}\cdot\text{TNR}}{\text{NPV}+\text{TNR}}$
Matthew's Correlation Coefficient | $\text{MCC} = \frac{\text{TP}\cdot\text{TN}-\text{FP}\cdot\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$
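As a minimal, self-contained sketch (not the authors' code), the measures in Table 1 can be computed directly from the confusion-matrix counts:

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Compute the Table 1 measures from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # sensitivity (recall)
    tnr = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                      # precision
    npv = tn / (tn + fn)                      # negative predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    bal_acc = (tpr + tnr) / 2                 # balanced accuracy
    f1_pos = 2 * ppv * tpr / (ppv + tpr)      # F1 on the positive class
    f1_neg = 2 * npv * tnr / (npv + tnr)      # F1 on the negative class
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv,
            "acc": acc, "bal_acc": bal_acc,
            "F1+": f1_pos, "F1-": f1_neg, "MCC": mcc}

# Example: an imbalanced test set where plain accuracy looks flattering
print(classification_metrics(tp=10, fp=5, fn=20, tn=965))
```

On this deliberately imbalanced example, accuracy is 0.975 while balanced accuracy is about 0.66 and MCC only about 0.46, illustrating why MCC and balanced accuracy are reported alongside accuracy.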

This work focuses on ID and OOD performance as the measure of model robustness, reporting balanced accuracy and MCC in addition to accuracy to present an overall indicator of the performance of the classification model.

2.2.2 Model Robustness and Augmentation

Data augmentation creates new data from existing data to increase the training set's size and variety, typically improving model performance. To improve ID generalisation, providing more training data from the same distribution as the original data helps the model generalise to similar examples [16]. To enhance OOD generalisation, extending the training data distribution beyond the original dataset, such as by balancing labels or adding scarce feature combinations, helps the model handle distribution shifts more effectively [23].

2.3 Generative Models

Generative models are trained to learn the underlying distribution of the data in order to generate new samples. As such, the goal is to train a mapping between the latent space and the data space so that the resulting samples are similar to the original data. One of the important properties of the latent space is that it can enable the creation of new data through the manipulation of semantic representations of features and labels. In recent history, three classes of models have advanced the field of generative learning in waves.

These classes are Autoencoders (AEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs). The first class, AEs, encode input data to a lower-dimensional latent space and then decode it back to the data space; they are often used in denoising models due to their ability to reconstruct the input from the latent space [24]. Variational Autoencoders (VAEs), an extension of AEs, regularise the latent distribution, enabling meaningful sampling from the latent space and removing discontinuities, thus facilitating generative capabilities [25]. GANs, the second class, consist of a generator and a discriminator network; the generator creates realistic samples from random noise, while the discriminator attempts to distinguish between real and synthetic samples, engaging in a zero-sum game to improve both networks [26]. DMs, the third class, add random noise to input data and then train the model to reverse this process, learning to denoise data in a structured manner, with models like Latent Diffusion Models (LDMs) performing diffusion in the latent space for computational efficiency [27, 28, 29].

The “generative learning trilemma” may guide the trade-offs in choosing a generative learning model. As Figure 2 (adapted from [30]) shows, models often excel at only two of three desired goals: high sample quality, fast sampling speed, and large sample variety. However, as mentioned earlier, performing the diffusion process in latent space allows LDMs to generate samples much faster, such that some argue it bypasses the trilemma in practice [29, 30]. For this reason, LDMs have seen recent use in expanding datasets in biomedical projects, where data collection is prohibitively costly [31]. As such, this work uses both the WaveGrad and DiffWave diffusion models for the creation of PCG from ECG signals.

2.4 Biomedical Signal Augmentation

In [4], data augmentation was employed to expand a PCG dataset from 3153 recordings to 53,601 recordings, an increase by a factor of 17. The augmentation included a random combination of effects such as changes to pitch, speed, tempo, dither, volume, and mixing with audio [4]. Despite achieving a sensitivity of 96% and a specificity of 83%, the authors concluded that their approach did not generalise well, with performance varying from 99% on the dataset with the most recordings to 50% on the dataset with the fewest recordings [4]. Consequently, Thomae and Dominik [4] suggested that more training data and further augmentation are necessary to enhance performance on unseen data.

In a subsequent study by Zhou et al. [32], models trained with various augmentations were compared against a baseline. Augmentations were applied to both the original and image-transformed data and were categorised by a “physiological constraint” (whether the transform alters or violates physiological possibilities) and/or a “spectrogram constraint” (whether the transform alters the meaning of the spectrogram output) [32]. Augmentations that violated the “spectrogram constraint” were linked to decreased model performance, while adherence to physiological possibilities was associated with improved performance [32]. Notably, no single augmentation improved performance across all metrics, though some offered a more favourable trade-off than others [32].

VAEs have been explored for the generation of synthetic lung auscultation sounds [33], where it was found that classifiers trained with VAE-generated signals often, but not always, improved over those trained on only the original data.

GANs have also found widespread use within biomedical applications [34, 35, 36]. The introduction of synthetic data helps to overcome data imbalances as well as improve model performance. In particular, GANs have been used to generate synthetic heart signals [36]. This work found that during early training, the generated waveform resembled a real signal with added noise [36]. Using the Empirical Wavelet Transform (EWT) to reduce this noise, the resulting signal at 2000 epochs was more realistic than the resulting signal at 12,000 epochs, allowing for a sixfold reduction in training time [36]. Further work was performed to show that the generative model had not simply learned the training dataset [36]. As a result, the classifiers were able to classify the synthetic heart sounds correctly with accuracy greater than 90% [36].

In [37], the general problem of generating synthetic one-dimensional biosignals is explored using both an autoencoder and a GAN-based approach. To evaluate their models, the synthetic and real datasets are each used as either the training or test set for a classifier model that had previously achieved an accuracy of 99% [37]. The results from this work showed that the synthetic data captured the underlying features and distributions of the real data, and that the synthetic data could be used to train classifiers such that they perform well on real data [37]. In addition, it was noted that the generative models readily captured the noise of the input data [37].

Although GANs have traditionally seen the most use, the number of papers in medical imaging that utilise VAEs and DMs has increased in recent years. For DMs in particular, there has been a substantial increase in papers, which the authors of [38] attributed to their ability to generate high-quality images with good mode coverage. Despite the abundance of diffusion models in medical imaging, we could not find, to the best of our knowledge, any use in biomedical audio signals, leaving room for exploration.

2.5 Conditional Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models (DDPMs) are a type of diffusion model that follows a Markov process that continuously noises the input, with the network learning to reverse this process by estimating the noise that was added. Conditional diffusion models for conditional audio generation can be adapted from the diffusion model setup in [39]. This model considers the conditional distribution $p_\theta(\mathbb{y}_0 \mid \mathbb{x})$, with $\mathbb{y}_0$ being the original waveform and $\mathbb{x}$ the conditioning features that correspond with $\mathbb{y}_0$,

$p_\theta\left(\mathbb{y}_0 \mid \mathbb{x}\right) = \int p_\theta\left(\mathbb{y}_{0:T} \mid \mathbb{x}\right) d\mathbb{y}_{1:T}$  (1)

where $\mathbb{y}_1, \ldots, \mathbb{y}_T$ is a series of latent variables. The posterior $q\left(\mathbb{y}_{1:T} \mid \mathbb{y}_0\right)$ is the forward diffusion process, which is defined through the Markov chain:

$q\left(\mathbb{y}_{1:T} \mid \mathbb{y}_0\right) = \prod_{t=1}^{T} q\left(\mathbb{y}_t \mid \mathbb{y}_{t-1}\right)$  (2)

The Gaussian noise added in each iteration is defined as

$q\left(\mathbb{y}_t \mid \mathbb{y}_{t-1}\right) = \mathcal{N}\left(\mathbb{y}_t; \sqrt{1-\beta_t}\,\mathbb{y}_{t-1}, \beta_t I\right)$  (3)

with the noise defined by a fixed noise schedule $\beta_1, \ldots, \beta_T$. Hence, the diffusion process can be computed for any $t$ as

$\mathbb{y}_t = \sqrt{\bar{\alpha}_t}\,\mathbb{y}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t$  (4)

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. As the likelihood in Equation 1 is intractable, these models are trained by maximising its variational lower bound (ELBO). Ho et al. found that using a loss as defined in Equation 5 leads to higher-quality generation.
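As an illustration of Equation 4, the forward process can be sampled in closed form for any $t$. The sketch below assumes a linear $\beta$ schedule with $T = 1000$ steps; neither value is specified by the paper here:

```python
import numpy as np

T = 1000                                       # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)             # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)                # bar{alpha}_t = prod_{i<=t} alpha_i

def diffuse(y0, t, rng=np.random.default_rng()):
    """Sample y_t directly from y_0 via Equation (4)."""
    eps = rng.standard_normal(y0.shape)        # eps_t ~ N(0, I)
    y_t = np.sqrt(alpha_bars[t]) * y0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return y_t, eps                            # eps is the network's regression target
```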

$\mathbb{E}_{t,\epsilon}\left[\left\lVert \epsilon_\theta\left(\mathbb{y}_t, \mathbb{x}, t\right) - \epsilon_t \right\rVert_2^2\right]$  (5)

The model estimates the noise added in the forward process, written as $\epsilon_\theta$, and the actual noise added is written as $\epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, I)$.

Generation is then done by first sampling $\mathbb{y}_T \sim \mathcal{N}(0, I)$ and $\mathbb{z} \sim \mathcal{N}(0, I)$, before following the equation below for $t = T, \ldots, 1$,

$\mathbb{y}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbb{y}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\left(\mathbb{y}_t, \mathbb{x}, t\right)\right) + \sigma_t \mathbb{z}$  (6)

where $\sigma_t = \tilde{\beta}_t$, with $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ being the variance at step $t$ for $t > 1$, and $\tilde{\beta}_1 = \beta_1$.
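Continuing the sketch above (reusing `T`, `betas`, `alphas`, and `alpha_bars`), ancestral sampling per Equation 6 looks roughly as follows, where `eps_theta` is a placeholder for the trained network and the noise scale is taken as $\sigma_t = \sqrt{\tilde{\beta}_t}$, the usual DDPM choice:

```python
import numpy as np

def sample(eps_theta, x, shape, rng=np.random.default_rng()):
    """Ancestral sampling per Equation (6), from y_T down to y_0."""
    y = rng.standard_normal(shape)                        # y_T ~ N(0, I)
    for t in range(T - 1, -1, -1):                        # t = T, ..., 1 (0-indexed)
        z = rng.standard_normal(shape) if t > 0 else 0.0  # no noise on the final step
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]
        sigma = np.sqrt(beta_tilde)                       # sigma_t^2 = beta_tilde_t (assumed)
        y = (y - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
             * eps_theta(y, x, t)) / np.sqrt(alphas[t]) + sigma * z
    return y
```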

2.5.1 WaveGrad

WaveGrad is a DDPM for audio synthesis using conditional generation. The model utilises an architecture consisting of multiple upsampling blocks (UBlocks) and downsampling blocks (DBlocks), with the input signal and the conditioning signal as inputs to the network. The conditioning signal is converted to a mel-spectrogram representation before being input to the model [6]. These UBlocks and DBlocks follow the architecture of the upsampling and downsampling blocks utilised in the Generative Adversarial Network text-to-speech (GAN-TTS) model [40]. The feature-wise linear modulation (FiLM) modules combine information from the noisy waveform and the conditioning mel-spectrogram [6]. The UBlock, DBlock, and FiLM modules are shown in Figure 3, with Figure 4 showing the entire WaveGrad architecture. The loss function is based on the difference between the noise added in each step of the forward diffusion process and the noise predicted during the reverse process [6], as described in Equation 7, with the Markov process being conditioned on the continuous noise level instead of the time step. Also, note that the L1 norm was used over the L2 norm as it was found to provide better training stability [6]. WaveGrad only includes a local conditioner in the form of a conditioning signal.
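As a rough sketch of the mel-spectrogram front end for the conditioning signal; the exact STFT and mel parameters used by the authors are not stated, so those below are illustrative assumptions:

```python
import numpy as np
import librosa

def conditioning_mel(ecg, sr=2000, n_fft=512, hop_length=128, n_mels=80):
    """Log-mel spectrogram of the conditioning signal (all parameters assumed)."""
    mel = librosa.feature.melspectrogram(
        y=ecg.astype(np.float32), sr=sr,
        n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)   # log compression, typical for vocoder conditioners
```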

[Figure 3: The WaveGrad UBlock, DBlock, and FiLM modules.]
[Figure 4: The complete WaveGrad architecture.]
$\mathbb{E}_{\bar{\alpha},\epsilon}\left[\left\lVert \epsilon_\theta\left(\sqrt{\bar{\alpha}}\,\mathbb{y}_0 + \sqrt{1-\bar{\alpha}}\,\epsilon, \mathbb{x}, \sqrt{\bar{\alpha}}\right) - \epsilon_t \right\rVert_1\right]$  (7)

2.5.2 DiffWave

DiffWave is another DDPM for raw audio synthesis, supporting both conditional and unconditional generation. The loss function utilises a single ELBO-based training objective without auxiliary losses [5], as described in Equation 8. One-dimensional convolutions are used on the input and conditioning signals, which pass through multiple fully connected layers. The model contains a WaveNet [41] backbone, consisting of bi-directional dilated convolutions and residual layers and connections. The architecture is shown in Figure 5. For conditional generation, DiffWave uses a local conditioning signal and a global conditioner (discrete labels) [5].

$\mathbb{E}_{t,\epsilon}\left[\left\lVert \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,\mathbb{y}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \mathbb{x}, t\right) - \epsilon_t \right\rVert_1\right]$  (8)
[Figure 5: The DiffWave architecture.]
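Reusing `T` and `alpha_bars` from the forward-process sketch above, one Monte Carlo evaluation of the Equation 8 objective could look like this, with `eps_theta` again a placeholder network:

```python
import numpy as np

def diffwave_l1_loss(eps_theta, y0, x, rng=np.random.default_rng()):
    """One Monte Carlo sample of the Equation (8) objective."""
    t = rng.integers(T)                        # timestep drawn uniformly
    eps = rng.standard_normal(y0.shape)        # noise injected by the forward process
    y_t = np.sqrt(alpha_bars[t]) * y0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean(np.abs(eps_theta(y_t, x, t) - eps))   # L1 norm on the noise estimate
```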

3 Materials and Methods

To achieve a more robust model, the augmented training dataset must first be created. Figure 6 depicts the dataset creation process. Once this dataset is created, various classification models can be trained and evaluated to measure the increase in ID and OOD performance.

3.1 Datasets

3.1.1 PhysioNet and Computing in Cardiology Challenge 2016 Dataset

The PhysioNet and Computing in Cardiology Challenge 2016 (CinC) was an international competition that aimed to encourage the development of heart sound classification algorithms [8]. The data was sourced from nine independent databases, but a database focused on fetal and maternal heart sounds was excluded [8]. Across the nine databases, there are 2435 recordings sourced from 1297 patients [8]. Excluding the aforementioned database and splitting longer recordings into smaller samples, there were in total 4430 samples from 1072 patients, equating to 233,512 heart sounds, 116,865 heart beats, and nearly 30 hours of recordings used in the competition [42]. At the time of their publication, this amounted to the largest open-access heart sound database in the world [42].

The recordings were resampled to 2000 Hz for the competition, and only one PCG lead was used, with the exception of training-set a, which also includes ECG [42].

Table 2: Database information and proportion of recordings (%).

Challenge Use   Dataset      Source Database   Abnormal   Normal   Unsure
Training        training-a   MITHSDB           67.5       28.4     4.2
                training-b   AADHSDB           14.9       60.2     24.9
                training-c   AUTHHSDB          64.5       22.6     12.9
                training-d   UHAHSDB           47.3       47.3     5.5
                training-e   DLUTHSDB          7.1        86.7     6.2
                training-f   SUAHSDB           27.2       68.4     4.4
                Average                        18.1       73.0     8.8
Test            test-b       AADHSDB           15.6       48.8     35.6
                test-c       AUTHHSDB          64.3       28.6     7.1
                test-d       UHAHSDB           45.8       45.8     8.3
                test-e       DLUTHSDB          6.7        86.4     6.9
                test-g       TUTHSDB           18.1       81.9     0.0
                test-i       SSHHSDB           60.0       34.3     5.7
                Average                        12.0       77.1     10.9

Recordings were divided into either normal (healthy), abnormal (diagnosed with CVD or other cardiac problems), or unsure (low-quality signals) [8]. A summary of the data, shown in Table 2, was adapted from [42] and [8]. These datasets also include additional information, such as individual disease diagnoses and annotations of the heart cycles, which can be used to assist with the data augmentation.

3.1.2 Synchronised Multichannel PCG and ECG dataset

Recently, synchronised multichannel PCG and ECG (SMPECG) data has been collected from an EPCG device that consists of seven PCG sensors and one lead-I ECG sensor [43]. Using this device, data was collected from 105 subjects, of whom 46 were diagnosed with coronary artery disease. Ten seconds of audio were recorded for each subject, during which the subjects were instructed not to breathe, to eliminate lung sounds from the recording. This data was collected in a clinical environment with background noise and non-optimal sensor placement, as the device is designed for ease of use, making it a challenging dataset for classification. As only single-channel PCG is available in the other datasets, only a single channel (channel 2) was used for this dataset.

3.1.3 Icentia11k Dataset

Along with the training-a dataset used as the input for training the generative models, the Icentia11k dataset [44] was utilised to provide unique, unseen ECG from which to generate an accompanying PCG signal. This dataset contains 11,000 patients and 2,774,054,987 labelled heartbeats across 541,794 segments, at a sample rate of 250 Hz. Each beat was labelled with a beat type (normal, premature atrial contraction, or premature ventricular contraction) and a rhythm (normal sinus rhythm, atrial fibrillation, or atrial flutter).

3.1.4 Further Datasets

To improve the model's robustness against noise, one of the stages of augmentation introduces noise from other PCG and ECG datasets. These are the electro-phono-cardiogram (EPHNOGRAM) dataset [45] for PCG and the Massachusetts Institute of Technology - Beth Israel Hospital (MIT-BIH) dataset [46] for ECG. The EPHNOGRAM dataset comprises 24 healthy adults and contains recordings taken during stress tests and at rest [45]. The MIT-BIH dataset contains 12 half-hour ECG recordings and three half-hour recordings of noise typical in ambulatory ECG recordings, where this noise is used for augmentation [46].

3.2 Signal Augmentation

The augmentation procedure for the PCG and ECG signals is shown in Figure 7, with the black lines representing the flow of the ECG data and the white lines representing the flow of the PCG data. The time-stretching augmentation is synchronised to ensure that both signals are stretched by the same amount. Augmentation stages have different percentage chances of occurring; the chances were chosen to provide the widest variety of augmented signals after every stage has been completed whilst also resulting in the best performance. The augmentations vary slightly between PCG and ECG to best meet the physiological constraints.

The PCG signals are augmented in various ways: harmonic percussive source separation (HPSS) for emphasis on certain parts of the signal, time stretching, emphasis on certain bands of the signal using a parametric equalisation (EQ) filter, and the introduction of noise from the EPHNOGRAM dataset [45]. Before these operations are applied, the signals are normalised to have zero mean and lie between -1 and 1. The augmentation procedure applied to the PCG data is shown in Figure 7, noted with the white lines.

The HPSS stage has a 75% chance of occurring and works by extracting harmonic and percussive components of the signal with varying thresholds to extract different parts of the signal. The HPSS implementation is from the librosa v0.1.0 Python library [47, 48]. $\mathbb{X}(t,k)$ denotes the short-time Fourier transform (STFT) of the signal $\mathbb{x}(t)$, defined as

$\mathbb{X}(t,k) = \sum_{n=0}^{N-1} \mathbb{w}(n)\,\mathbb{x}(n+tH) \exp\left(-2\pi jkn/N\right)$  (9)

where $\mathbb{w}$ is a sine window, $H$ represents the hop size, and $N$ is the window length and the length of the discrete Fourier transform.

Firstly, the STFT of the signal is calculated, with the window length chosen randomly from 512, 1024, and 2048 with equal probability, and the hop length chosen randomly from 16, 32, 64, and 128 with uniform distribution.
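A sketch of this randomised STFT step using librosa (scipy's "cosine" window is the sine window named above):

```python
import numpy as np
import librosa

rng = np.random.default_rng()

def random_stft(x):
    """STFT with the window length and hop size drawn as described above."""
    n_fft = int(rng.choice([512, 1024, 2048]))     # window length N
    hop = int(rng.choice([16, 32, 64, 128]))       # hop size H
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop,
                     window="cosine")              # scipy's "cosine" is the sine window
    return X, n_fft, hop
```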

Following this, the harmonic and percussive components are extracted as follows:

$\tilde{\mathbb{Y}}_h(t,k) = \mathrm{median}\left(\mathbb{X}(t-\ell_h,k), \ldots, \mathbb{X}(t+\ell_h,k)\right)$  (10)
$\tilde{\mathbb{Y}}_p(t,k) = \mathrm{median}\left(\mathbb{X}(t,k-\ell_p), \ldots, \mathbb{X}(t,k+\ell_p)\right)$  (11)
$\mathbb{M}_h(t,k) = \begin{cases} 1, & \text{if } \frac{\tilde{\mathbb{Y}}_h(t,k)}{\tilde{\mathbb{Y}}_p(t,k)+\eta} > \lambda_h \\ 0, & \text{otherwise} \end{cases}$  (12)
$\mathbb{M}_p(t,k) = \begin{cases} 1, & \text{if } \frac{\tilde{\mathbb{Y}}_p(t,k)}{\tilde{\mathbb{Y}}_h(t,k)+\eta} \geq \lambda_p \\ 0, & \text{otherwise} \end{cases}$  (13)
$\mathbb{X}_h(t,k) = \mathbb{X}(t,k) \cdot \mathbb{M}_h(t,k)$  (14)
$\mathbb{X}_p(t,k) = \mathbb{X}(t,k) \cdot \mathbb{M}_p(t,k)$  (15)

where $\mathbb{X}_h(t,k)$ is the harmonic component, $\mathbb{X}_p(t,k)$ is the percussive component, and $\eta$ is a small number added to avoid a divide-by-zero error [48]. $\mathbb{x}_h(t)$ and $\mathbb{x}_p(t)$ are the inverse STFTs (ISTFTs) of $\mathbb{X}_h(t,k)$ and $\mathbb{X}_p(t,k)$. If the thresholds satisfy $\lambda_h > 1$ or $\lambda_p > 1$, there will be some part of the spectrum that is neither a harmonic nor a percussive component of the signal but a residual component that appears as textured noise. As the abnormalities to be detected are from diseases that produce more percussive or harmonic sounds, these residuals can be ignored without losing important information that would negatively impact the ability of a classifier to classify these sounds.

The first set of components is extracted with parameters $\lambda_h = \mathrm{rand}(1,2)$, $\lambda_p = \mathrm{rand}(1,2)$, $\ell_h = \mathrm{randint}(5,30)$, and $\ell_p = \mathrm{randint}(5,30)$, where $\mathrm{rand}$ denotes a random floating-point number chosen uniformly between the two bounds and $\mathrm{randint}$ an integer chosen uniformly between those bounds. The second set is then extracted from $\mathbb{X}_h(t,k)$ and $\mathbb{X}_p(t,k)$: $\mathbb{X}_{hh}(t,k)$ and $\mathbb{X}_{hp}(t,k)$ are the harmonic and percussive components of $\mathbb{X}_h(t,k)$, while $\mathbb{X}_{ph}(t,k)$ and $\mathbb{X}_{pp}(t,k)$ are the harmonic and percussive components of $\mathbb{X}_p(t,k)$. The second stage of decomposition uses parameters $\lambda_{hh} = \mathrm{rand}(1,4)$, $\lambda_{hp} = \mathrm{rand}(1,4)$, $\lambda_{ph} = \mathrm{rand}(1,4)$, $\lambda_{pp} = \mathrm{rand}(1,4)$, $\ell_{hh} = \mathrm{randint}(5,30)$, $\ell_{hp} = \mathrm{randint}(5,30)$, $\ell_{ph} = \mathrm{randint}(5,30)$, and $\ell_{pp} = \mathrm{randint}(5,30)$.

The ISTFT is then applied to each component before reconstructing the signal as,

$\mathbb{s}_{HPSS}(t) = a_{hh}\mathbb{x}_{hh}(t) + a_{hp}\mathbb{x}_{hp}(t) + a_{ph}\mathbb{x}_{ph}(t) + a_{pp}\mathbb{x}_{pp}(t)$  (16)

where $a_{hh} = \mathrm{rand}(0.01,10)$, $a_{hp} = \mathrm{rand}(0.01,10)$, $a_{ph} = \mathrm{rand}(0.01,10)$, and $a_{pp} = \mathrm{rand}(0.01,10)$.

This two-stage decomposition and reconstruction described in Equation 16 is performed twice to create $\mathbb{s}_{HPSS_1}(t)$ and $\mathbb{s}_{HPSS_2}(t)$, which are then combined to obtain the final augmented signal $\mathbb{s}_{HPSS_{final}}(t)$,

$\mathbb{s}_{HPSS_{final}}(t) = \mathbb{s}_{HPSS_1}(t) + a_{HPSS}\,\mathbb{s}_{HPSS_2}(t)$  (17)

where $a_{HPSS} = \mathrm{rand}(0.01, 0.05)$. These parameter ranges were determined by inspection to ensure the signals remain realistic.
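A hedged sketch of the full two-stage HPSS augmentation (Equations 10 to 17) using `librosa.decompose.hpss`, whose `margin` and `kernel_size` arguments play the roles of $\lambda$ and $\ell$; note that librosa's default soft masking only approximates the hard masks of Equations 12 and 13:

```python
import numpy as np
import librosa

rng = np.random.default_rng()

def hpss_once(x):
    """One two-stage HPSS decomposition and reconstruction (Equations (10)-(16))."""
    n_fft = int(rng.choice([512, 1024, 2048]))
    hop = int(rng.choice([16, 32, 64, 128]))
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop)

    # First stage: lambda ~ rand(1, 2), ell ~ randint(5, 30)
    Xh, Xp = librosa.decompose.hpss(
        X, margin=(rng.uniform(1, 2), rng.uniform(1, 2)),
        kernel_size=(int(rng.integers(5, 31)), int(rng.integers(5, 31))))

    # Second stage: decompose each component again, lambda ~ rand(1, 4)
    Xhh, Xhp = librosa.decompose.hpss(
        Xh, margin=(rng.uniform(1, 4), rng.uniform(1, 4)),
        kernel_size=(int(rng.integers(5, 31)), int(rng.integers(5, 31))))
    Xph, Xpp = librosa.decompose.hpss(
        Xp, margin=(rng.uniform(1, 4), rng.uniform(1, 4)),
        kernel_size=(int(rng.integers(5, 31)), int(rng.integers(5, 31))))

    # Weighted reconstruction, Equation (16), with a ~ rand(0.01, 10)
    y = sum(rng.uniform(0.01, 10) * librosa.istft(C, hop_length=hop, length=len(x))
            for C in (Xhh, Xhp, Xph, Xpp))
    return y

def hpss_augment(x):
    """Combine two independent passes as in Equation (17)."""
    return hpss_once(x) + rng.uniform(0.01, 0.05) * hpss_once(x)
```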

Next, there is a 7.5% chance of introducing noise to the signal, as defined in Equation 18, where $\mathbb{s}_{HPSS}(t)$ is the signal after the HPSS augmentation stage, $\mathbb{s}_{SN}(t)$ is the augmented signal, and $\mathbb{r}(t) \sim \mathcal{N}(\mu, \sigma I)$, with $\sigma = \mathrm{rand\_choice}(0.01, 0.001, 0.0001)$ and $\mu = \mathrm{rand}(0, 0.1)$. Note that $\mathbb{s}_{HPSS}(t)$ may not have had the HPSS augmentation applied, as that depends on the random chance. $\mathrm{rand\_choice}()$ denotes a random choice from those numbers with equal probability.

$\mathbb{s}_{SN}(t) = \mathbb{s}_{HPSS}(t) + \mathbb{r}(t)$  (18)

Following this, there is a 75% chance of applying a time warp, which randomly stretches the signal to either 1.004 or 1.006 times its original length. A time warp with the same factor is applied to both the PCG and ECG.
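A sketch of this step using librosa [47] is shown below; note that librosa's rate argument is a speed factor, so lengthening the signal by a given factor requires the reciprocal rate (the paper's exact time-stretching method is not specified).

```python
import librosa
import numpy as np

def time_warp_pair(pcg, ecg):
    """Stretch the PCG and ECG by the same randomly chosen factor."""
    factor = np.random.choice([1.004, 1.006])   # target length / original length
    rate = 1.0 / factor                         # rate < 1 lengthens the signal
    return (librosa.effects.time_stretch(pcg, rate=rate),
            librosa.effects.time_stretch(ecg, rate=rate))
```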

There is then a 75% chance of adding amplitude modulation, performed as described in Equation 19, where $b_{AM_1} = \mathrm{rand}(0.01, 0.25)$, $b_{AM_2} = \mathrm{rand}(0.01, 0.25)$, $c_{AM_1} = \mathrm{rand}(0.05, 0.5)$, $c_{AM_2} = \mathrm{rand}(0.001, 0.05)$, $d_{AM_1} = \mathrm{rand}(0, 1)$, $d_{AM_2} = \mathrm{rand}(0, 1)$, and $\mathbb{s}_{TS}(t)$ is the signal after the time-stretch augmentation stage, which, depending on the random chance, may or may not have been time-stretched.

$\mathbb{s}_{AM}(t) = \mathbb{s}_{TS}(t)\left(1 + b_{AM_1}\sin\left(2\pi c_{AM_1}t + d_{AM_1}\right) + b_{AM_2}\sin\left(2\pi c_{AM_2}t + d_{AM_2}\right)\right)$    (19)
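A sketch of Equation 19 is given below; the sampling rate argument fs, used to form the time axis in seconds, is an assumption. The baseline wander of Equation 20 later in this section follows the same pattern, with the two sinusoids added to rather than multiplied with the signal.

```python
import numpy as np

def amplitude_modulate(s_ts, fs):
    """Two-tone amplitude modulation of Equation 19."""
    b1, b2 = np.random.uniform(0.01, 0.25, size=2)   # b_AM1, b_AM2
    c1 = np.random.uniform(0.05, 0.5)                # c_AM1 (Hz)
    c2 = np.random.uniform(0.001, 0.05)              # c_AM2 (Hz)
    d1, d2 = np.random.uniform(0.0, 1.0, size=2)     # d_AM1, d_AM2 (phases)
    t = np.arange(len(s_ts)) / fs
    mod = 1.0 + b1 * np.sin(2 * np.pi * c1 * t + d1) \
              + b2 * np.sin(2 * np.pi * c2 * t + d2)
    return s_ts * mod
```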

Next, there is another 7.5% chance of introducing the same noise as in Equation 18. Following this, there is a 25% chance of applying parametric equalisation to boost frequency bands. Given the frequency range of 2 Hz to 500 Hz, a bandwidth is randomly selected between 5% and 20% of this range, and the signal is attenuated outside the selected band using a bandpass filter. After repeating this process five times, the filtered signals and the original signal are summed and normalised.
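A sketch of this equalisation stage using SciPy [50] follows; the Butterworth filter design, its order, and peak normalisation are assumptions, as the paper does not specify the filter used.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def parametric_eq(s, fs, f_lo=2.0, f_hi=500.0, n_bands=5):
    """Boost random bands by summing bandpass-filtered copies with the original."""
    boosted = np.asarray(s, dtype=float).copy()
    span = f_hi - f_lo
    for _ in range(n_bands):
        bw = np.random.uniform(0.05, 0.20) * span             # 5-20% of the range
        centre = np.random.uniform(f_lo + bw / 2, f_hi - bw / 2)
        sos = butter(2, [centre - bw / 2, centre + bw / 2],
                     btype="bandpass", fs=fs, output="sos")
        boosted += sosfiltfilt(sos, s)                        # boost this band
    return boosted / np.max(np.abs(boosted))                  # normalise
```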

Lastly, real noise from the EPHNOGRAM dataset is introduced; this is clinical noise extracted from some of the recordings in that dataset. This augmentation occurs 50% of the time.

The ECG signals are also augmented in numerous ways; these include introducing random noise, adding baseline wander, time stretching, adding noise from the MIT-BIH dataset, and emphasising certain signal bands. Figure 7 shows the order of processing on the ECG, indicated with the black lines.

Random noise is applied in the same way as for the PCG, as defined in Equation 18, with this augmentation occurring with a probability of 7.5%. Next, baseline wander is added 30% of the time, as described in Equation 20, where $b_{BW_1} = \mathrm{rand}(0.01, 0.2)$, $b_{BW_2} = \mathrm{rand}(0.01, 0.2)$, $c_{BW_1} = \mathrm{rand}(0.05, 0.5)$, $c_{BW_2} = \mathrm{rand}(0.001, 0.05)$, $d_{BW_1} = \mathrm{rand}(0, 1)$, and $d_{BW_2} = \mathrm{rand}(0, 1)$. Here, $\mathbb{s}_{SN_E}(t)$ is the ECG signal after the random-noise augmentation stage, which, as per the random chance, may or may not include the random noise.

$\mathbb{s}_{BW}(t) = \mathbb{s}_{SN_E}(t) + b_{BW_1}\sin\left(2\pi c_{BW_1}t + d_{BW_1}\right) + b_{BW_2}\sin\left(2\pi c_{BW_2}t + d_{BW_2}\right)$    (20)

Following this, there is a 25% chance of a time warp of between 1 and 1.06 times the original signal length; a time warp with the same factor is applied to both the PCG and ECG. Then, the same parametric equalisation as for the PCG is applied, between 0.25 Hz and 100 Hz.

Lastly, noise from the MIT-BIH database is added; this is ECG sensor noise taken from recordings in that database.

3.3 Synthetic Audio Generation

Synthetic signals were generated using the mel-spectrogram of the ECG signal as a conditioner for both the WaveGrad [6] and DiffWave [5] diffusion models, which were trained before any data was generated. The diffusion models generated data for 3200 patients, 800 abnormal and 2400 normal, with three segments per patient used to train the classification models; this limit reduces the effect of overfitting to the synthetic signals. The ECG signals used for conditioning were taken from the Icentia11k database [44] to introduce new data, with abnormal ECG used to condition abnormal PCG. The generative models were trained per condition, using additional labels from the dataset, to make the generated signals more realistic. To work around the lack of training data, the order of heart cycles was rearranged to increase training diversity. The DiffWave and WaveGrad models were each trained on an Nvidia RTX 4090 for 24 hours. The DiffWave parameters that differ from the defaults are shown in Table 3, and the WaveGrad parameters that differ from the defaults are shown in Table 4. Both models differ slightly from their base implementations in that they use a custom global conditioner: global conditioning labels were added for specific abnormalities or their absence, namely mitral valve prolapse, innocent or benign murmurs, aortic disease, miscellaneous conditions, and normal.

Table 3: DiffWave parameters that differ from the defaults.

Parameter                   Value
Residual layers             30
Residual channels           64
Dilation cycle length       10
Embedding dimension         32
Batch size                  8
Learning rate               2e-4
Noise schedule              T=50, linearly spaced [1e-4, 5e-2]
Inference noise schedule    {1e-4, 1e-3, 1e-2, 5e-2, 2e-1, 5e-1}
Table 4: WaveGrad parameters that differ from the defaults.

Parameter              Value
Embedding dimension    32
Batch size             8
Learning rate          2e-4
Noise schedule         T=1000, linearly spaced [1e-6, 1e-2]
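As an illustration of the conditioning input, the sketch below computes the ECG mel-spectrogram local conditioner with librosa [47], using the spectrogram parameters stated later in this section (4 kHz sample rate, window length 1024, hop length 256, 80 mel bins); any log-compression or normalisation applied to the conditioner is unspecified and omitted here.

```python
import librosa

def ecg_conditioner(ecg, fs=4000):
    """Mel-spectrogram local conditioner for the diffusion models."""
    return librosa.feature.melspectrogram(
        y=ecg, sr=fs, n_fft=1024, win_length=1024,
        hop_length=256, n_mels=80)
```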

To ensure a diversity of training examples, the heart cycles of each patient were occasionally rearranged in each minibatch during training. This was done inside a custom collator, with a 75% chance of rearranging the heart cycles. Heart cycles could be rearranged in three ways, chosen with equal probability. The first takes two large groupings of cycles, each containing half of the heart cycles within the signal, and randomly rearranges these groups. The second randomly selects groupings of 1 to 4 heart cycles and uses these to rearrange the signal. The third rearranges each heart cycle individually.

Although this rearranging can violate physiological constraints, it was found to help the model learn a better representation of the data and to improve classification results when training on the synthetic data.
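A sketch of the rearrangement logic inside the collator is shown below, operating on a pre-segmented list of heart-cycle waveforms; the crossfading at the joins (described next) is omitted for brevity.

```python
import numpy as np

def rearrange_cycles(cycles):
    """Rearrange heart cycles in one of three ways with equal probability
    (invoked with 75% probability per minibatch example)."""
    n = len(cycles)
    mode = np.random.randint(3)
    if mode == 0:
        # Two large groups, each holding half of the cycles, in random order.
        groups = [cycles[: n // 2], cycles[n // 2 :]]
    elif mode == 1:
        # Random groupings of 1 to 4 consecutive cycles.
        groups, i = [], 0
        while i < n:
            k = np.random.randint(1, 5)
            groups.append(cycles[i : i + k])
            i += k
    else:
        # Every cycle rearranged individually.
        groups = [[c] for c in cycles]
    np.random.shuffle(groups)
    return [c for g in groups for c in g]
```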

The signals were then bandpass filtered, between 2 Hz and 500 Hz for the PCG and between 0.25 Hz and 100 Hz for the ECG, the conditioning signal. A mel-spectrogram of the ECG was created as the local conditioning signal, using a sample rate of 4 kHz, a window length of 1024, a hop length of 256, and 80 mel bins. Crossfading was used to minimise audio artifacts when rearranging heart cycles. As the signals are joined when they are both in the same state, at the end of the cycle in the diastole phase, they are assumed to be roughly correlated. The crossfade occurs between the last 40 samples of the first signal, $-1 \le t \le 0$, and the first 40 samples of the second signal, $0 \le t \le 1$. If one of the signals has a low variance, then a simple linear crossfade is used between the two, as described in Equations 21 and 22 below,

$\mathbb{f}(t) = 1/2 + t/2, \quad -1 < t < 1$    (21)
$\mathbb{v}(t) = \mathbb{f}(t)\,\mathbb{y}(t) + \mathbb{f}(-t)\,\mathbb{x}(t)$    (22)

where $\mathbb{f}$ is the crossfade function, $\mathbb{v}$ is the final spliced signal, $\mathbb{x}$ is the last 40 samples of the first signal, and $\mathbb{y}$ is the first 40 samples of the second signal.
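A sketch of this linear crossfade over the 40-sample overlap:

```python
import numpy as np

def linear_crossfade(x, y):
    """Linear crossfade of Equations 21-22; x is the last 40 samples of the
    first signal and y the first 40 samples of the second."""
    t = np.linspace(-1.0, 1.0, len(x))
    f = 0.5 + t / 2.0              # f(t) = 1/2 + t/2
    return f * y + f[::-1] * x     # v(t) = f(t) y(t) + f(-t) x(t)
```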

Otherwise, the following crossfade function is used, so that the applied crossfade reflects how correlated the two signals are. For two fully uncorrelated signals, a constant-power crossfade is desired; for two fully correlated signals, a constant-voltage crossfade is desired; and something in between is desired when the signals are neither fully correlated nor fully uncorrelated. The crossfade function is assumed to be deterministic, and the two signals are treated as random processes. In addition, the mean powers of the signals at the point of crossfading are assumed to be equal, as the signals are crossfaded while in the same phase of the heart cycle. Under these assumptions, the generalised crossfade function of [49] can be used to produce a crossfade that reflects the signals' correlation, as defined in Equations 23, 24 and 25,

$\mathbb{o}(t) = \frac{9}{16}\sin\left(\frac{\pi}{2}t\right) + \frac{1}{16}\sin\left(\frac{3\pi}{2}t\right), \quad -1 < t < 1$    (23)
$\mathbb{e}(t) = \sqrt{\frac{1}{2(1+r)} - \left(\frac{1-r}{1+r}\right)\mathbb{o}(t)^{2}}$    (24)
$\mathbb{f}(t) = \mathbb{o}(t) + \mathbb{e}(t)$    (25)

where $\mathbb{e}$ is the even component of the crossfade function, $\mathbb{o}$ is the odd component, and $r$ is the correlation coefficient of the two signals at zero lag, with $0 \le r \le 1$. The crossfade is then interpolated to double its length using a univariate spline of degree 3 with a smoothing factor equal to the length of the signal, as implemented in SciPy [50]. The final signal consists of the first signal up to its last 40 samples, the crossfaded and interpolated segment, and the second signal after its first 40 samples. Figure 8 demonstrates the effect this crossfade has on reducing artifacts; the rearranging of the heart cycles can be seen through the rearranging of the chirp in the last row. The first column shows the original signal, the second shows the rearranging of all heart cycles, the third shows the rearranging of a few heart cycles, and the final column shows the rearranging of larger groups of heart cycles.
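The sketch below implements Equations 23 to 25 and the spline interpolation using SciPy [50]; clipping negative zero-lag correlations to zero is an assumption for signals that fall outside the stated $0 \le r \le 1$ range.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def correlated_crossfade(x, y):
    """Correlation-dependent crossfade (Equations 23-25), then interpolation
    to double length with a degree-3 univariate spline."""
    r = np.clip(np.corrcoef(x, y)[0, 1], 0.0, 1.0)   # zero-lag correlation
    t = np.linspace(-1.0, 1.0, len(x))
    o = (9 / 16) * np.sin(np.pi * t / 2) + (1 / 16) * np.sin(3 * np.pi * t / 2)
    e = np.sqrt(1.0 / (2.0 * (1.0 + r)) - ((1.0 - r) / (1.0 + r)) * o ** 2)
    f = o + e                                        # f(t) = o(t) + e(t)
    v = f * y + f[::-1] * x                          # crossfaded overlap
    # Interpolate to double length; smoothing factor equals the signal length.
    spline = UnivariateSpline(np.arange(len(v)), v, k=3, s=len(v))
    return spline(np.linspace(0, len(v) - 1, 2 * len(v)))
```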

Figure 8: Crossfaded heart-cycle rearrangements, shown for the original signal and the three rearrangement granularities.

3.4 Classification Model

The model used to test the augmented dataset is a convolutional neural network (CNN) based model fine-tuned from a ResNet trained on ImageNet [7]. This model was chosen not to demonstrate superior classification performance, but to demonstrate the capability of the proposed data augmentation methods. Before the signals are passed into the CNN, the PCG signal is bandpass filtered between 45 Hz and 400 Hz, the ECG signal is bandpass filtered between 25 Hz and 100 Hz, and both signals are then normalised. A spectrogram is created from each signal before it is passed to the model, with a window length of 100 and a hop length of 50. Each spectrogram is created from 1.5 s of audio, referred to as a fragment, with the training objective of maximising accuracy at the fragment level. From the synthetic data, only three 1.5 s fragments are taken per recording to reduce overfitting to the synthetic data. These 1.5 s fragments differ from the original model [7], which took in a single heart cycle; this change reduces the need for accurate segmentation. For testing at the subject level, the classification outputs are averaged over all fragments before the final classification is made, as was done previously.
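A sketch of the fragment preprocessing is given below, assuming a PyTorch pipeline; the Butterworth filter order and the magnitude spectrogram are assumptions, as only the passbands, window length, and hop length are specified.

```python
import numpy as np
import torch
from scipy.signal import butter, sosfiltfilt

def pcg_fragment_spectrogram(pcg, fs):
    """Bandpass filter, normalise, and convert a 1.5 s PCG fragment into a
    spectrogram; the ECG path is identical with a 25-100 Hz band."""
    sos = butter(4, [45.0, 400.0], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, pcg)
    filtered = filtered / np.max(np.abs(filtered))            # normalise
    frag = torch.tensor(filtered[: int(1.5 * fs)].copy(), dtype=torch.float32)
    spec = torch.stft(frag, n_fft=100, hop_length=50,
                      window=torch.hann_window(100), return_complex=True)
    return spec.abs()
```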

The Adam optimiser is used for training, along with a cyclic triangular learning rate scheduler, with the parameters listed in Table 5.

Table 5: Optimiser and learning-rate scheduler parameters.

Parameter                       Value
Initial learning rate           0.001
Betas                           (0.9, 0.999)
Epsilon                         1e-8
Weight decay                    1e-3
Learning rate step size up      2
Learning rate step size down    2
Max learning rate               1e-3
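In PyTorch terms, the setup in Table 5 corresponds to something like the sketch below; `net` is a placeholder standing in for the fine-tuned ResNet, and `cycle_momentum` is disabled because Adam has no momentum parameter to cycle.

```python
import torch

net = torch.nn.Linear(10, 2)  # placeholder for the fine-tuned ResNet classifier

optimiser = torch.optim.Adam(net.parameters(), lr=1e-3, betas=(0.9, 0.999),
                             eps=1e-8, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimiser, base_lr=1e-3, max_lr=1e-3,    # initial and max learning rates
    step_size_up=2, step_size_down=2,        # steps up/down, per Table 5
    mode="triangular", cycle_momentum=False)
```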

During the model's training on the original dataset, as a pretrained CNN is being fine-tuned, only 10 epochs are used, and the best weights are chosen by the highest MCC value on the validation set to reduce overfitting. The model is only updated for each dataset if it performs better on the validation set than previously. For training on the augmented dataset, a schedule is used to reduce overfitting to the synthetic data. This schedule, given in Table 6, was experimentally determined to provide the best results. Here, max-mix denotes all of the data, with no augmentations applied to the original dataset and 3 augmentations applied to the DiffWave and WaveGrad data; from the synthetic data, only three random segments were taken to ensure the model does not overfit to it. The max-aug data is the original data with 30 augmentations applied and no synthetic data.

Table 6: Training schedule for the augmented dataset.

Data       Epochs
max-mix    8
max-aug    8
max-mix    8
max-aug    8
max-mix    8
max-aug    8
max-mix    16
max-aug    16
max-mix    16
max-aug    16
max-mix    16
max-aug    16

As only the training-a dataset contains synchronised PCG and ECG, a PCG-only model is also trained for measuring the OOD performance and evaluated on the training-b to training-f datasets, whilst the PCG and ECG input model is evaluated on the SMPECG dataset.

4 Results

4.1 In-distribution Performance

The ID results are for the datasets on which the models were trained and show the increase in performance when training on the augmented dataset compared to the original dataset. As the only dataset trained on was training-a, these are the only models presented for in-distribution performance. Table 7 displays the ID performance when the models are trained on the original dataset, and Table 8 displays the ID performance for models trained on the augmented dataset.

Table 7: In-distribution performance of models trained on the original dataset.

Dataset     Data     Acc     Acc-mu  TPR     TNR     PPV     NPV     F1+     F1-     MCC
training-a  PCG+ECG  90.10%  89.40%  91.20%  87.50%  94.50%  80.80%  92.90%  84.00%  0.770
training-a  PCG      70.40%  56.00%  91.20%  20.80%  73.20%  50.00%  81.20%  29.40%  0.167

Table 8: In-distribution performance of models trained on the augmented dataset.

Dataset     Data     Acc     Acc-mu  TPR     TNR     PPV     NPV     F1+     F1-     MCC
training-a  PCG+ECG  92.60%  93.50%  91.20%  95.80%  98.10%  82.10%  94.50%  88.50%  0.836
training-a  PCG      84.00%  80.20%  89.50%  70.80%  87.90%  73.90%  88.70%  72.30%  0.611

4.2 Out-of-distribution Performance

The out-of-distribution results are for the datasets the models were not trained on; hence, they show the generalisation of the models to unseen datasets. As the dataset trained on was training-a, all other datasets are presented for the out-of-distribution performance. Table 9 shows the OOD performance of models trained on the original dataset, and Table 10 shows the OOD performance of models trained on the augmented dataset.

Table 9: Out-of-distribution performance of models trained on the original dataset.

Dataset     Data     Acc     Acc-mu  TPR     TNR     PPV     NPV     F1+     F1-     MCC
training-b  PCG      22.90%  50.70%  99.00%  2.30%   21.50%  90.00%  35.30%  4.50%   0.040
training-c  PCG      74.20%  47.90%  95.80%  0.00%   76.70%  0.00%   85.20%  NaN     -0.099
training-d  PCG      49.10%  48.50%  82.10%  14.80%  50.00%  44.40%  62.20%  22.20%  -0.041
training-e  PCG      40.90%  65.80%  96.20%  35.50%  12.70%  99.00%  22.50%  52.20%  0.192
training-f  PCG      52.60%  58.60%  73.50%  43.80%  35.70%  79.50%  48.10%  56.50%  0.162
SMPECG      PCG+ECG  56.20%  50.20%  98.30%  2.20%   56.30%  50.00%  71.60%  4.20%   0.017
SMPECG      PCG      56.20%  50.20%  98.30%  2.20%   56.30%  50.00%  71.60%  4.20%   0.017

Table 10: Out-of-distribution performance of models trained on the augmented dataset.

Dataset     Data     Acc     Acc-mu  TPR     TNR     PPV     NPV     F1+     F1-     MCC
training-b  PCG      33.30%  53.10%  87.50%  18.70%  22.50%  84.70%  35.80%  30.60%  0.066
training-c  PCG      83.90%  74.70%  91.70%  57.10%  88.00%  66.70%  89.80%  61.50%  0.517
training-d  PCG      52.70%  52.00%  92.90%  11.10%  52.00%  60.00%  66.70%  18.80%  0.069
training-e  PCG      84.00%  86.00%  88.50%  83.50%  34.50%  98.70%  49.60%  90.50%  0.489
training-f  PCG      73.70%  60.10%  26.50%  93.80%  64.30%  75.00%  37.50%  83.30%  0.282
SMPECG      PCG+ECG  61.90%  57.00%  96.60%  17.40%  60.00%  80.00%  71.40%  28.60%  0.237
SMPECG      PCG      57.10%  51.60%  96.60%  6.50%   57.00%  60.00%  71.70%  11.80%  0.073

5 Discussion

It was found that the ID performance improved for all models tested, with a 2.5% improvement in accuracy for the PCG+ECG model and a 13.6% improvement in subject-level accuracy for the PCG model. The augmented dataset also improved the balanced accuracy, helping to balance sensitivity and specificity, with improvements of 4.1% and 24.2% for the PCG+ECG and PCG models, respectively. This is further shown by an increase in the MCC from 0.770 to 0.836 and from 0.167 to 0.611 for the PCG+ECG and PCG models, respectively. This shows that by augmenting the original data, adding synthetic data, and ensuring a balanced dataset, the ID performance can be improved.

The OOD performance also improved with the augmented dataset. Although the models were not trained on these datasets, the introduction of augmented data improved every model's accuracy and overall robustness, as seen by the increase in MCC values across all datasets. In particular, on the CinC datasets, the improvement in accuracy ranged from 3.6% on training-d to 43.1% on training-e. Further, the balanced accuracy improved on all of these datasets, with the greatest increase of 26.8% on training-c and the smallest of 1.5% on training-f. The MCC also increased in all cases, with the greatest increase of 0.616 on training-c and the smallest of 0.026 on training-b. With all performance metrics increasing, the OOD performance was improved by this augmented dataset, showing that these augmentations help improve the robustness of models on unseen OOD data.

On the SMPECG dataset, there was a much smaller improvement in accuracy: an increase of 5.7% for the PCG+ECG model and 0.9% for the PCG model. Balanced accuracy also increased for both models, by 6.8% and 1.4% for the PCG+ECG and PCG models, respectively. However, there was a much greater improvement in the MCC and in the overall balance of the classifier, with an increase in MCC of 0.220 for the PCG+ECG model and 0.056 for the PCG model. This shows that, although the improvement is small, the augmentation helps not only improve classification accuracy but also balance the classifier, improving its balanced accuracy and MCC.

As shown, both the ID and OOD performance were increased by utilising the augmented data, achieving the objective of improving the robustness of the classifier. Better results were found for the PCG-only models; this, however, is due to there being more data to test with than synchronised PCG and ECG data. The OOD performance on some datasets is still low, showing that there is still room for improvement in making a truly robust and general abnormal heart sound classifier. Given the gains already seen with classifiers trained on this smaller dataset, applying these methods to a larger dataset is expected to yield a much more general classifier.

6 Conclusion and Further Work

Increasing the training data through augmentation has improved ID and OOD performance in classifying abnormal heart sounds. The use of diffusion models to generate synthetic heart sounds conditioned on ECG signals has successfully enabled the generation of synchronised PCG from ECG data, expanding the data distribution and enhancing classifier robustness. This benefit is not limited to classifiers that utilise multimodal PCG and ECG data but extends to single-modality classifiers that utilise only PCG, as found from the increase in performance and robustness of the PCG-only models. Future work should scale this approach to multichannel PCG signals for use with classifiers that utilise such data.

This study provides evidence that data augmentation, specifically through DDPMs, can significantly enhance the robustness and generalisation of classifiers for abnormal heart sound detection.By conditioning synthetic PCG signals on ECG data, we generated augmented datasets that improved performance in both ID and OOD scenarios, consistently observed across key metrics such as accuracy, balanced accuracy, and MCC.

Our approach increases the size of training datasets and enriches data diversity, which is crucial for developing models resilient to variations in real-world clinical settings.The augmentation process effectively addresses data imbalance and noise, providing a stronger foundation for training machine learning models.

However, while the introduced augmentation techniques have shown promise, certain limitations remain, particularly in generalising models to new datasets.The OOD performance, though improved, suggests that further refinement of these methods is necessary.This could involve optimising diffusion model parameters or exploring alternative generative approaches that better capture the complex patterns in biomedical signals.

Future work should focus on scaling these methods to accommodate multichannel PCG data, enabling more comprehensive heart sound analysis and potentially improving classification accuracy.This study demonstrates a viable strategy for enhancing classifier performance through synthetic data generation, contributing to more reliable cardiovascular disease diagnosis.

Authors’ contribution

Leigh Abbott: conceptualisation, methodology, software, formal analysis, validation, investigation (data collection), writing - original draft, writing - review & editing, visualisation. Milan Marocchi: software, validation, writing - original draft, writing - review & editing, visualisation. Matthew Fynn: writing - review & editing. Yue Rong: resources, project administration, writing - review & editing, supervision. Sven Nordholm: resources, project administration, writing - review & editing, supervision.

Ethics approval and consent

The study received approval from the ethics committee of Fortis Hospital, Kolkata, India, where the multichannel data collection occurred. Informed consent was obtained from all participating subjects.All other datasets are open-access, so no approval is required.

Acknowledgement

We thank Ticking Heart Pty Ltd for allowing the use of data collected from their vest design. We also thank Harry Walters for his valuable remarks and feedback on this work.

Conflict of Interest

We declare that we have no conflicts of interest.

References

[1] "Cardiovascular diseases (CVDs)." [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[2] T. R. Reed, N. E. Reed, and P. Fritzson, "Heart sound analysis for symptom detection and computer-aided diagnosis," Simulation Modelling Practice and Theory, vol. 12, no. 2, pp. 129–146, May 2004. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1569190X04000206
[3] Y. Rong, M. Fynn, S. Nordholm, S. Siaw, and G. Dwivedi, "Wearable electro-phonocardiography device for cardiovascular disease monitoring," 2023. [Online]. Available: https://ddfe.curtin.edu.au/yurong/SSP23.pdf
[4] C. Thomae and A. Dominik, "Using deep gated RNN with a convolutional front end for end-to-end classification of heart sound," in 2016 Computing in Cardiology Conference (CinC), 2016, pp. 625–628.
[5] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, "DiffWave: A versatile diffusion model for audio synthesis," 2021.
[6] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, "WaveGrad: Estimating gradients for waveform generation," 2020.
[7] M. Marocchi, L. Abbott, Y. Rong, S. Nordholm, and G. Dwivedi, "Abnormal heart sound classification and model interpretability: A transfer learning approach with deep learning," Journal of Vascular Diseases, vol. 2, no. 4, pp. 438–459, 2023. [Online]. Available: https://www.mdpi.com/2813-2475/2/4/34
[8] C. Liu, D. Springer, Q. Li, B. Moody, R. A. Juan, F. J. Chorro, F. Castells, J. M. Roig, I. Silva, A. E. W. Johnson, Z. Syed, S. E. Schmidt, C. D. Papadaniil, L. Hadjileontiadis, H. Naseri, A. Moukadem, A. Dieterlen, C. Brandt, H. Tang, M. Samieinasab, M. R. Samieinasab, R. Sameni, R. G. Mark, and G. D. Clifford, "An open access database for the evaluation of heart sound algorithms," Physiological Measurement, vol. 37, no. 12, pp. 2181–2213, 2016. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85007492920&doi=10.1088%2f0967-3334%2f37%2f12%2f2181&partnerID=40&md5=bf943637cf0ed9217d6f4243debcde9c
[9] A. Leatham, Auscultation of the Heart and Phonocardiography, 1975.
[10] S. E. Schmidt, C. Holst-Hansen, J. Hansen, E. Toft, and J. J. Struijk, "Acoustic features for the identification of coronary artery disease," IEEE Transactions on Biomedical Engineering, vol. 62, no. 11, pp. 2611–2619, Nov. 2015.
[11] D. B. Springer, L. Tarassenko, and G. D. Clifford, "Logistic regression-HSMM-based heart sound segmentation," IEEE Transactions on Biomedical Engineering, vol. 63, no. 4, pp. 822–832, 2016.
[12] R. Rajni and I. Kaur, "Electrocardiogram signal analysis - an overview," International Journal of Computer Applications, vol. 84, no. 7, pp. 22–25, 2013.
[13] G. D. Clifford, F. Azuaje, and P. McSharry, Advanced Methods and Tools for ECG Data Analysis. Artech House, 2006.
[14] C. Xie, Biomedical Signal Processing: An ECG Application. Cham: Springer International Publishing, 2020, pp. 285–303. [Online]. Available: https://doi.org/10.1007/978-3-030-47994-7_17
[15] D. De Bacquer, G. De Backer, M. Kornitzer, K. Myny, Z. Doyen, and H. Blackburn, "Prognostic value of ischemic electrocardiographic findings for cardiovascular mortality in men and women," Journal of the American College of Cardiology, vol. 32, no. 3, pp. 680–685, 1998.
[16] D. Tran, J. Liu, M. W. Dusenberry, D. Phan, M. Collier, J. Ren, K. Han, Z. Wang, Z. Mariet, H. Hu, N. Band, T. G. J. Rudner, K. Singhal, Z. Nado, J. van Amersfoort, A. Kirsch, R. Jenatton, N. Thain, H. Yuan, K. Buchanan, K. Murphy, D. Sculley, Y. Gal, Z. Ghahramani, J. Snoek, and B. Lakshminarayanan, "Plex: Towards reliability using pretrained large model extensions," 2022. [Online]. Available: https://storage.googleapis.com/plex-paper/plex.pdf
[17] E. Briscoe and J. Feldman, "Conceptual complexity and the bias/variance tradeoff," Cognition, vol. 118, no. 1, pp. 2–16, 2011. [Online]. Available: dx.doi.org/10.1016/j.cognition.2010.10.004
[18] H. Yao, Y. Wang, S. Li, L. Zhang, W. Liang, J. Zou, and C. Finn, "Improving out-of-distribution robustness via selective augmentation," 2022. [Online]. Available: https://arxiv.org/abs/2201.00299
[19] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing & Management, vol. 45, no. 4, pp. 427–437, 2009. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306457309000259
[20] I. M. De Diego, A. R. Redondo, R. R. Fernández, J. Navarro, and J. M. Moguerza, "General performance score for classification problems," Applied Intelligence, vol. 52, no. 10, pp. 12049–12063, Aug. 2022. [Online]. Available: https://doi.org/10.1007/s10489-021-03041-7
[21] Z. Ren, Y. Chang, T. T. Nguyen, Y. Tan, K. Qian, and B. W. Schuller, "A comprehensive survey on heart sound analysis in the deep learning era," 2023.
[22] D. Chicco and G. Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation," BMC Genomics, vol. 21, no. 6, 2020. [Online]. Available: https://doi.org/10.1186/s12864-019-6413-7
[23] M. Ding, K. Kong, J. Chen, J. Kirchenbauer, M. Goldblum, D. Wipf, F. Huang, and T. Goldstein, "A closer look at distribution shifts and out-of-distribution generalization on graphs," in NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications, 2021. [Online]. Available: https://openreview.net/forum?id=XvgPGWazqRH
[24] D. Bank, N. Koenigstein, and R. Giryes, "Autoencoders," 2021. [Online]. Available: http://arxiv.org/abs/2003.05991
[25] D. P. Kingma and M. Welling, "An introduction to variational autoencoders," Foundations and Trends in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019. [Online]. Available: https://doi.org/10.1561/2200000056
[26] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," 2014. [Online]. Available: http://arxiv.org/abs/1406.2661
[27] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," 2015. [Online]. Available: http://arxiv.org/abs/1503.03585
[28] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, "Diffusion models: A comprehensive survey of methods and applications," 2023. [Online]. Available: https://arxiv.org/abs/2209.00796
[29] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," 2022. [Online]. Available: http://arxiv.org/abs/2112.10752
[30] Z. Xiao, K. Kreis, and A. Vahdat, "Tackling the generative learning trilemma with denoising diffusion GANs," 2022. [Online]. Available: http://arxiv.org/abs/2112.07804
[31] W. H. L. Pinaya, P.-D. Tudosiu, J. Dafflon, P. F. da Costa, V. Fernandez, P. Nachev, S. Ourselin, and M. J. Cardoso, "Brain imaging generation with latent diffusion models," 2022. [Online]. Available: http://arxiv.org/abs/2209.07162
[32] G. Zhou, Y. Chen, and C. Chien, "On the analysis of data augmentation methods for spectral imaged based heart sound classification using convolutional neural networks," BMC Medical Informatics and Decision Making, 2022.
[33] J. Saldanha, S. Chakraborty, S. Patil, K. Kotecha, S. Kumar, and A. Nayyar, "Data augmentation using variational autoencoders for improvement of respiratory disease classification," PLOS ONE, vol. 17, no. 8, pp. 1–41, Aug. 2022. [Online]. Available: https://doi.org/10.1371/journal.pone.0266467
[34] K. Kochetov and A. Filchenkov, "Generative adversarial networks for respiratory sound augmentation," in Proceedings of the 2020 1st International Conference on Control, Robotics and Intelligent System (CCRIS '20). New York, NY, USA: Association for Computing Machinery, 2021, pp. 106–111. [Online]. Available: https://doi.org/10.1145/3437802.3437821
[35] Z. Zhang, J. Han, K. Qian, C. Janott, Y. Guo, and B. Schuller, "Snore-GANs: Improving automatic snore sound classification with synthesized data," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 1, pp. 300–310, 2020.
[36] P. Narváez and W. S. Percybrooks, "Synthesis of normal heart sounds using generative adversarial networks and empirical wavelet transform," Applied Sciences, vol. 10, no. 19, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/19/7003
[37] T. Dissanayake, T. Fernando, S. Denman, S. Sridharan, and C. Fookes, "Generalized generative deep learning models for biosignal synthesis and modality transfer," IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 2, pp. 968–979, 2023.
[38] A. Kebaili, J. Lapuyade-Lahorgue, and S. Ruan, "Deep learning approaches for data augmentation in medical imaging: A review," Journal of Imaging, vol. 9, no. 4, 2023. [Online]. Available: https://www.mdpi.com/2313-433X/9/4/81
[39] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," 2020.
[40] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, "High fidelity speech synthesis with adversarial networks," 2019.
[41] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," 2016. [Online]. Available: https://arxiv.org/abs/1609.03499
[42] G. D. Clifford, C. Liu, B. Moody, D. Springer, I. Silva, Q. Li, and R. G. Mark, "Classification of normal/abnormal heart sound recordings: The PhysioNet/Computing in Cardiology Challenge 2016," Computing in Cardiology, vol. 43, pp. 609–612, 2016. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85016097943&partnerID=40&md5=7c41089fc1564d3091a220eeaf126379
[43] Y. Rong, M. Fynn, and S. Nordholm, A Pre-Screening Technique for Coronary Artery Disease with Multi-Channel Phonocardiography and Electrocardiography. Taylor & Francis, 2023, ch. 9. [Online]. Available: https://www.taylorfrancis.com/books/edit/10.1201/9781003346678/non-invasive-health-systems-based-advanced-biomedical-signal-image-processing-adel-al-jumaily-paolo-crippa-ali-mansour-claudio-turchetti
[44] S. Tan, G. Androz, A. Chamseddine, P. Fecteau, A. Courville, Y. Bengio, and J. P. Cohen, "Icentia11k: An unsupervised representation learning dataset for arrhythmia subtype discovery," 2019.
[45] A. Kazemnejad, P. Gordany, and R. Sameni, "EPHNOGRAM: A simultaneous electrocardiogram and phonocardiogram database," 2021.
[46] G. Moody and R. Mark, "The impact of the MIT-BIH arrhythmia database," IEEE Engineering in Medicine and Biology Magazine, vol. 20, no. 3, pp. 45–50, 2001.
[47] B. McFee, "librosa/librosa: 0.10.1," https://doi.org/10.5281/zenodo.8252662, Aug. 2023, accessed: 2024-03-24.
[48] J. Driedger, M. Müller, and S. Disch, "Extending harmonic-percussive separation of audio signals," in Proceedings of the International Society for Music Information Retrieval Conference, vol. 15, 2014.
[49] R. Bristow-Johnson, "A theory of optimal splicing of audio in the time domain," Music-DSP Mailing List, July 2011, accessed: 2024-03-24. [Online]. Available: https://music.columbia.edu/pipermail/music-dsp/2011-July/069971.html
[50] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, "SciPy 1.0: Fundamental algorithms for scientific computing in Python," Nature Methods, vol. 17, pp. 261–272, 2020.