A privacy-preserving platform oriented medical healthcare and its application in identifying patients with candidemia

Data sources and patients

We collected ICU patients older than 14 years from three hospitals: Peking Union Medical College Hospital (Pumch), The Affiliated Hospital of Qingdao University (Qyfy), and The First Affiliated Hospital of Fujian Medical University (Fyyy). This study was approved by the ethics committees of the three hospitals (SK693, QYFYKYLL 601311920, FMU244). Patients who were admitted to these hospitals and had new-onset systemic inflammatory response syndrome (SIRS) from 2013 to 2017 were selected as study subjects. New-onset SIRS had to meet the following criteria: (1) SIRS occurred in the ICU; (2) a blood culture was obtained during the course of SIRS; (3) no previous SIRS within 24 h. These criteria are described in detail in our previous research. The exclusion criteria were: (1) age < 14 years; (2) SIRS occurred outside the ICU; (3) no blood cultures obtained during SIRS. Patients were classified as positive for candidemia when a blood culture taken after ICU admission showed the presence of Candida species.

After reviewing prior research, we identified 22 risk factors with strong clinical relevance to candidemia. They fall into four categories: basic patient information, associated comorbidities, laboratory blood test results, and treatment histories. The datasets obtained from the three participating hospitals share a uniform feature space, so the same variables are available for analysis at every site.

YiDuManda framework

YiDuManda is a framework designed to enable privacy-preserving data mining across data silos. It integrates three key services: Management Service, Computation-Engine Service, and Network Communication Service. The Management Service oversees the entire lifecycle of a data mining task, including initiation and computing resource allocation. The Computation-Engine Service provides the fundamental computational capabilities, supporting secure multi-party computation, differential privacy, and homomorphic encryption. Inter-service communication within YiDuManda uses the gRPC protocol, known for its strong performance29 and security features30. The platform also offers high-level Python interfaces for data scientists. In addition, YiDuManda includes several modules: a Machine Learning Algorithms Module with methods for classification, regression, and ranking; a Statistics Module for data overviews and trend analysis; and a Feature Engineering Module for feature standardization, selection, and transformation.

Model development and performance evaluation

In our experimental design, the dataset at each silo was randomly partitioned into a training set and a testing set at an 8:2 ratio. The testing sets from the three silos were then aggregated on a centralized server. During training, the federated learning (FL) model weights were transferred among the participating entities. Each local model was trained exclusively on its own dataset, while the centralized model was trained on the three training datasets from the participating hospitals. All models were evaluated on the testing sets derived from these hospitals. For performance assessment, we used the area under the receiver operating characteristic (ROC) curve (AUC), the true positive rate (TPR), and the true negative rate (TNR). To ensure robustness, the experimental procedure was repeated 15 times. For feature selection, we adopted the “TPR + TNR” criterion proposed by Yuan et al.19: a hybrid linear searching algorithm was used to identify the configuration with the highest combined TPR and TNR.
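The split and the three metrics can be illustrated with a minimal, standard-library sketch. The function names, the fixed seed, and the 0.5 decision threshold are assumptions made for illustration only; the AUC is computed through its rank interpretation (the probability that a random positive scores higher than a random negative), which is not necessarily how the platform computes it.

```python
import random

def split_80_20(records, seed=0):
    """Shuffle one silo's records and split them 8:2 into train and test."""
    rng = random.Random(seed)      # hypothetical fixed seed for repeatability
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def tpr_tnr(y_true, y_score, threshold=0.5):
    """True positive rate and true negative rate at a given score threshold."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_plus_tnr(y_true, y_score, threshold=0.5):
    """The combined criterion used to compare configurations."""
    tpr, tnr = tpr_tnr(y_true, y_score, threshold)
    return tpr + tnr
```

In a repeated experiment, `split_80_20` would be called with a different seed in each of the 15 runs, and the metrics would be summarized across runs.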

Privacy-preserving boosting tree

Boosting trees are an effective and versatile tool that has been applied in various domains31,32. Secure boosting trees have been adapted to different data partitioning strategies, including vertically partitioned data33 and horizontally partitioned data15. In our research, we extended the secure boosting tree approach by integrating secure multi-party computation sorting25. We run a secure sorting network on shares and obtain the position of each value within each feature. The positions are disclosed to every participant. At each silo, the sums of local gradients and hessians are calculated and protected with pairwise masking. Each participant sends the two masked sums to the manager, which can then find the best split feature. The position of the best feature, together with the participant to whom that position belongs, is kept in the tree structure. This is necessary for making predictions, since the model itself does not know the threshold at each node. The whole process is depicted in Fig. 1.

Figure 1
Best-split calculation for XGBoost.
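The best-split step can be sketched as follows, under several assumptions: each sample is represented by its globally sorted position for one feature plus its gradient and hessian, the pairwise masks are simulated in one place (in a real deployment each pair of participants would agree on its mask privately and the arithmetic would run over a finite ring rather than floats), and the gain is the standard XGBoost split gain with regularization parameters `lam` and `gamma`. All function names are hypothetical.

```python
import random

def pairwise_masks(n_parties, n_values, seed=0):
    """Simulate pairwise masks: party i adds r, party j subtracts r, so all masks sum to zero."""
    rng = random.Random(seed)
    masks = [[0.0] * n_values for _ in range(n_parties)]
    for i in range(n_parties):
        for j in range(i + 1, n_parties):
            for k in range(n_values):
                r = rng.uniform(-1e3, 1e3)
                masks[i][k] += r
                masks[j][k] -= r
    return masks

def local_sums(samples, split_pos):
    """samples: list of (position, gradient, hessian) held by one silo.
    Returns [G_left, H_left, G_total, H_total] for the candidate split position."""
    gl = sum(g for pos, g, h in samples if pos <= split_pos)
    hl = sum(h for pos, g, h in samples if pos <= split_pos)
    g_all = sum(g for _, g, _ in samples)
    h_all = sum(h for _, _, h in samples)
    return [gl, hl, g_all, h_all]

def split_gain(gl, hl, g, h, lam=1.0, gamma=0.0):
    """Standard XGBoost gain of splitting a node into left and right children."""
    gr, hr = g - gl, h - hl
    return 0.5 * (gl * gl / (hl + lam) + gr * gr / (hr + lam) - g * g / (h + lam)) - gamma

def manager_best_split(silos, candidate_positions, lam=1.0):
    """The manager only sees masked vectors; the masks cancel when they are summed."""
    best = None
    for p in candidate_positions:
        masks = pairwise_masks(len(silos), 4, seed=p)
        masked = [[v + m for v, m in zip(local_sums(s, p), masks[i])]
                  for i, s in enumerate(silos)]
        totals = [sum(column) for column in zip(*masked)]
        gain = split_gain(*totals, lam=lam)
        if best is None or gain > best[1]:
            best = (p, gain)
    return best
```

Because the split is stored as a position rather than a raw threshold, the manager never needs to learn the underlying feature values.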

Privacy-preserving SVM and LR

In supervised classification, both Logistic Regression (LR) and the linear Support Vector Machine (SVM) are recognized for their interpretability. McMahan et al.9 introduced the Federated Averaging algorithm (FedAvg) for training federated neural networks. The algorithm allows each client to perform multiple iterations of local updates before synchronizing with the central server, and we adopted it for federated LR in our system. To secure the averaging process on the server, we applied pairwise masking based on the Diffie-Hellman key exchange protocol34. The neighbors used for masking are determined by numbering and sorting the participants. To obtain probabilistic outputs from the federated SVM, we adopted the parametric sigmoid formula suggested by Platt27 and calculated the optimized parameters with FedSGD9. The whole secure workflow is depicted in Fig. 2.

Figure 2
Flow of secure logistic regression and SVM.
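A minimal sketch of the FedAvg loop for logistic regression is given below, together with Platt's parametric sigmoid. It assumes full-batch gradient descent as the local update, weighting of client models by sample count as in McMahan et al., and plain (unmasked) aggregation; the secure masking layer is omitted for brevity. The learning rate, epoch count, round count, and function names are illustrative assumptions, not the platform's settings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_update(weights, data, lr=0.1, epochs=5):
    """Run several epochs of gradient descent on one client's (x, y) pairs."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for k, xk in enumerate(x):
                grad[k] += err * xk
        w = [wi - lr * gk / len(data) for wi, gk in zip(w, grad)]
    return w

def fed_avg(client_weights, client_sizes):
    """Average client models, weighting each by its number of samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
            for k in range(dim)]

def train_federated_lr(clients, dim, rounds=20):
    """Each round: broadcast weights, update locally, average on the server."""
    w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(w, data) for data in clients]
        w = fed_avg(updates, [len(data) for data in clients])
    return w

def platt_probability(f, a, b):
    """Platt's sigmoid: map an SVM decision value f to P(y = 1 | f) = 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * f + b))
```

In Platt's formulation the parameters `a` and `b` are fitted on decision values and labels; here they would be the quantities optimized in the federated setting.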

Privacy-preserving RF

Random Forest (RF) is an ensemble classifier that builds many independent decision trees and derives predictions by aggregating the outcomes of the trees. We developed a privacy-preserving RF model following the methodology advocated by Vaidya35, using the sklearn toolkit in a Python 3.6 environment. This approach assumes that the model’s structure, while sensitive, is less vulnerable than raw data, so it can be disseminated in a controlled way to selected users. In future work, we aim to strengthen the security of RF by minimizing the risk of information leakage.

Feature engineering

Two prevalent hypothesis tests in statistical analysis are the chi-square test and Student’s t-test. The chi-square test is used to assess the association between two categorical variables, while Student’s t-test is used to compare a continuous variable across groups. The correlation coefficient between two variables indicates their mutual dependence. Feature selection often aims to identify a smaller yet more significant subset of variables. In a previous study, Yuan et al.19 identified risk factors associated with candidemia using XGBoost exclusively. Building on this, we developed a hybrid feature selection method tailored for federated settings, which combines hypothesis testing and correlation coefficient analysis to improve the robustness and relevance of the selected features. In this study, we computed the Pearson coefficient between x and y as follows: first, calculate the global mean of each variable, \(\overline{x}\) and \(\overline{y}\), locally; then obtain the sums of \((x-\overline{x})\), \((y-\overline{y})\), \((x-\overline{x})^2\) and \((y-\overline{y})^2\); finally, calculate the Pearson coefficient value.
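The federated Pearson computation can be sketched in two rounds. This is an illustrative reading rather than the platform's implementation: it assumes that each silo first shares its count and raw sums so the global means can be formed, and that the second round exchanges the cross-product sum \(\sum (x-\overline{x})(y-\overline{y})\) together with the squared deviations, since the Pearson formula \(r = \sum (x-\overline{x})(y-\overline{y}) / \sqrt{\sum (x-\overline{x})^2 \sum (y-\overline{y})^2}\) requires it. Function names are hypothetical, and any masking of the exchanged sums is left out.

```python
import math

def local_count_and_sums(xs, ys):
    """Round 1: each silo reports its sample count and raw sums."""
    return len(xs), sum(xs), sum(ys)

def local_centered_sums(xs, ys, mean_x, mean_y):
    """Round 2: each silo reports sums of deviations around the global means."""
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def federated_pearson(silos):
    """silos: list of (xs, ys) pairs, one per participant. Only sums leave a silo."""
    stats = [local_count_and_sums(xs, ys) for xs, ys in silos]
    n = sum(s[0] for s in stats)
    mean_x = sum(s[1] for s in stats) / n
    mean_y = sum(s[2] for s in stats) / n
    centered = [local_centered_sums(xs, ys, mean_x, mean_y) for xs, ys in silos]
    sxy = sum(c[0] for c in centered)
    sxx = sum(c[1] for c in centered)
    syy = sum(c[2] for c in centered)
    return sxy / math.sqrt(sxx * syy)
```

The result equals the Pearson coefficient computed on the pooled data, because every term of the formula is a sum that decomposes across silos.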

The hybrid approach combines statistical analysis with the feature importance of XGBoost. The two methods yield subset-A and subset-B, respectively. Subset-B was obtained with the approach proposed by Yuan et al. The procedure is as follows: (1) calculate the p value of each variable and sort the variables in ascending order; (2) eliminate the feature with the largest p value and build XGBoost on the remaining features; (3) repeat the second step until no more features can be deleted; (4) assign subset-A as the subset that reaches the highest “TPR + TNR”; (5) calculate the intersection of subset-A and subset-B and assign the result as subset-C; (6) calculate the correlation coefficients of the features in subset-C; (7) remove redundant variables whose correlation coefficient exceeds the threshold.
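The seven steps above can be sketched as a small pipeline. Several details are assumptions, because the text does not fix them: `evaluate` is a hypothetical callback that trains a model on a feature list and returns its “TPR + TNR”, `corr` is a hypothetical callback returning the correlation coefficient of two features, the threshold of 0.8 is a placeholder, and when two features are redundant the one with the smaller p value is kept.

```python
def backward_elimination(p_values, evaluate):
    """Steps 1-4: drop the largest-p feature repeatedly, keep the best-scoring subset.

    p_values: dict mapping feature name to p value.
    evaluate: callable taking a feature list and returning TPR + TNR (assumed)."""
    remaining = sorted(p_values, key=p_values.get)   # ascending p value
    best_subset, best_score = list(remaining), evaluate(remaining)
    while len(remaining) > 1:
        remaining = remaining[:-1]                   # remove the largest p value
        score = evaluate(remaining)
        if score > best_score:
            best_subset, best_score = list(remaining), score
    return best_subset

def drop_redundant(features, corr, threshold=0.8):
    """Steps 6-7: keep a feature only if it is not too correlated with one already kept."""
    kept = []
    for f in features:
        if all(abs(corr(f, k)) <= threshold for k in kept):
            kept.append(f)
    return kept

def hybrid_selection(p_values, evaluate, subset_b, corr, threshold=0.8):
    """Combine the statistical subset-A with the importance-based subset-B."""
    subset_a = backward_elimination(p_values, evaluate)
    in_b = set(subset_b)
    subset_c = [f for f in subset_a if f in in_b]    # step 5: intersection
    return drop_redundant(subset_c, corr, threshold)
```

Because subset-A is ordered by ascending p value, the redundancy filter visits the most significant features first, which is what makes the "keep the smaller p value" assumption take effect.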

Ethical approval

The study was approved by the ethics committee of Peking Union Medical College Hospital (Reference Number: S-K693). All patient data were anonymized before being shared among researchers. Informed consent was obtained from all participants involved in the study. The experiment was conducted in adherence to the World Medical Association Declaration of Helsinki Ethical Principles for Medical Research Involving Human Subjects.

Consent for publication

All listed authors consented to the submission and all data were used with the consent of the person generating the data.
