Mapping single-cell data to reference atlases by transfer learning


Summary
Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.

Main
Large single-cell reference atlases1,2,3,4 comprising millions5 of cells across tissues, organs, developmental stages and conditions are now routinely generated by consortia such as the Human Cell Atlas6. These references help to understand the cellular heterogeneity that constitutes natural and inter-individual variation, ageing, environmental influences and disease. Reference atlases provide an opportunity to transform how we currently analyze single-cell datasets: by learning from the appropriate reference, we could automate annotation of new datasets and easily perform comparative analyses across tissues, species and disease conditions.

Learning from a reference atlas requires mapping a query dataset to this reference to generate a joint embedding. Yet query datasets and reference atlases typically comprise data generated in different laboratories with different experimental protocols and thus contain batch effects. Data-integration methods are typically used to overcome these batch effects during reference construction7. This requires access to all relevant datasets, which can be hindered by legal restrictions on data sharing. Furthermore, contextualizing a single dataset requires rerunning the full integration pipeline, presupposing both computational expertise and resources. Finally, conventional data-integration methods treat any perturbation between datasets that affects most cells as a technical batch effect, but biological perturbations can also affect most cells. Thus, standard approaches are insufficient for mapping query data onto references across biological conditions.

Exploiting large reference datasets is a well-established strategy in computer vision8 and natural language processing9. In these fields, commonly used deep learning approaches typically require large numbers of training samples, which are not always available. By leveraging weights learned from large reference datasets to improve learning on a target or query dataset10, transfer-learning (TL) models such as ImageNet11 and BERT12 have revolutionized analysis approaches8,9: TL has improved method performance with small datasets (for example, clustering13, classification and/or annotation14) and enabled model sharing15,16,17,18. Recently, TL has been applied to single-cell RNA-seq (scRNA-seq) data for denoising19, variance decomposition20 and cell type classification21,22. However, current TL approaches in genomics do not account for technical effects within and between the reference and query19 and lack systematic retraining with query data20,21,22,23. These limitations can lead to spurious predictions on query data with no or small overlap in cell types, tissues or species24,25. Nonetheless, deep learning models for data integration in single-cell genomics have demonstrated superior performance7,26,27,28. We propose a TL and fine-tuning strategy to leverage existing conditional neural network models and transfer them to new datasets, called 'architecture surgery', as implemented in the scArches pipeline. scArches is a fast and scalable tool for updating, sharing and using reference atlases trained with a variety of neural network models. Specifically, given a basic reference atlas, scArches enables users to share this reference as a trained network with other users, who can in turn update the reference using query-to-reference mapping and partial weight optimization without sharing their data. Thus, users can build their own extended reference models or perform stepwise analysis of datasets as they are collected, which is often necessary for growing clinical datasets. Furthermore, scArches allows users to learn from reference data by contextualizing new (for example, disease) data with a healthy reference in a shared representation. Owing to the flexible choice of the underlying core model that is transferred using scArches, we can learn references with various base models and also train on multimodal data. We demonstrate the features of scArches using single-cell datasets ranging from pancreas to whole-mouse atlases and immune cells from patients with COVID-19. scArches is able to iteratively update a pancreas reference, transfer labels or unmeasured data modalities between reference atlases and query data and map COVID-19 data onto a healthy reference while preserving disease-specific variation.

Results
scArches enables mapping query data to a reference
Consider the scenario with N 'reference' scRNA-seq datasets of a particular tissue or organism. A common approach to integrate such datasets is to use a conditional variational autoencoder (CVAE) (for example, single-cell variational inference (scVI)29, transfer variational autoencoder (trVAE)30) that assigns a categorical label Si to each dataset corresponding to the study label. These study labels may index conventional batch IDs (that is, samples, experiments across laboratories or sequencing technologies), biological batches (that is, organs or species when used over the set of orthologous genes), perturbations such as disease or a combination of these categorical variables. Training a CVAE model with reference studies S1:N (Fig. 1a) results in a latent space in which the effects of condition labels (that is, batch or technology) are regressed out. We can therefore use this embedding for further downstream analysis such as visualization or identification of cell clusters or subpopulations.

Fig. 1: scArches enables iterative query-to-reference single-cell integration. a, Pre-training of a latent representation using public reference datasets and corresponding reference labels. b, Decentralized model building: users download parameters for the atlas of interest, fine tune the model and optionally upload their updated model for other users. c–e, Illustration of this workflow for a human pancreas atlas across different scArches base models. Training a reference atlas across three human pancreas datasets (CelSeq, InDrop, Fluidigm C1), uniform manifold approximation and projection (UMAP) embedding for the original (c) and the integrated reference for pre-trained reference models (d,e, first column). Second column in d,e, querying a new SS2 dataset to the integrated reference. Updating the cell atlas with a fifth dataset (CelSeq2). Third column in d,e, black dashed circles represent cells absent in the reference data. UMAP plots are based on the model embedding.

Architectural surgery is a TL approach that takes existing reference models and adapts them to enable query-to-reference mapping. After training an existing autoencoder model on one or more reference datasets, architectural surgery is the process of transferring these trained weights with only minor weight adaptation (fine tuning) and adding a condition node to map a new study into this reference. While this approach is broadly applicable to any deep conditional model, here we apply scArches to three unsupervised models (CVAEs, trVAE, scVI), a semi-supervised algorithm (single-cell annotation using variational inference (scANVI))31 and a multimodal algorithm (total variational inference (totalVI))32 (Methods).

To facilitate model sharing, we adapted existing reference-building methods to include them in our scArches package as 'base models'. Reference models built within scArches can be uploaded to a model repository through our built-in application programming interface for Zenodo (Methods). To enable users to map new datasets on top of custom reference atlases, we propose sharing model weights, which one can download from the model repository and fine tune with new query data. This fine tuning extends the model by adding a set of trainable weights per query dataset called 'adaptors'. In classical conditional neural networks, a study corresponds to an input neuron. As a trained network has a rigid architecture, it does not allow for adding new studies within the given network. To overcome this, we implement the architecture surgery approach to incorporate new study labels as new input nodes (Methods). These new input nodes with trainable weights are the aforementioned adaptors. Importantly, adaptors are shareable, allowing users to further customize shared reference models by downloading a reference atlas, choosing a set of available adaptors for that reference and finally incorporating the user's own data by training query adaptors (Fig. 1b). Trainable parameters of the query model are restricted to a small subset of weights for query study labels. Depending on the size of this subset, this restriction acts as an inductive bias that prevents the model from strongly adapting its parameters to the query study. Thus, query data update the reference atlas.
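To make the adaptor idea concrete, the following is a minimal conceptual sketch in PyTorch of widening the first layer of a trained conditional encoder with new query-study input nodes and fine tuning only those columns. The layer sizes, the placeholder loss and the gradient-masking trick are illustrative assumptions, not the scArches implementation.

```python
# Minimal conceptual sketch of architecture surgery on the first encoder layer,
# assuming a toy PyTorch model; sizes, loss and training loop are illustrative only.
import torch
import torch.nn as nn

n_genes, n_ref_studies, n_query_studies, n_hidden = 2000, 3, 1, 128

# First layer of a reference encoder trained on [genes | one-hot study label].
ref_layer = nn.Linear(n_genes + n_ref_studies, n_hidden)

# "Surgery": build a wider layer that accepts extra query-study input nodes,
# copy the trained reference weights into it and keep them fixed.
new_layer = nn.Linear(n_genes + n_ref_studies + n_query_studies, n_hidden)
with torch.no_grad():
    new_layer.weight[:, : n_genes + n_ref_studies] = ref_layer.weight
    new_layer.bias[:] = ref_layer.bias
new_layer.bias.requires_grad_(False)

# Only the adaptor columns (weights of the new query-study input nodes) are updated;
# PyTorch cannot freeze part of a tensor, so we zero the other gradients before stepping.
adaptor_cols = slice(n_genes + n_ref_studies, None)

def mask_frozen_grads():
    grad = new_layer.weight.grad
    if grad is not None:
        frozen = torch.ones_like(grad, dtype=torch.bool)
        frozen[:, adaptor_cols] = False
        grad[frozen] = 0.0

optimizer = torch.optim.Adam([new_layer.weight], lr=1e-3)

# One illustrative fine-tuning step on a random query batch.
x = torch.randn(32, n_genes)
s = torch.zeros(32, n_ref_studies + n_query_studies)
s[:, -1] = 1.0  # one-hot label of the new query study
out = torch.relu(new_layer(torch.cat([x, s], dim=1)))
loss = out.pow(2).mean()  # placeholder loss; scArches uses the base model's objective
loss.backward()
mask_frozen_grads()
optimizer.step()
```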

To illustrate the feasibility of this approach, we applied scArches with trVAE, scVI and scANVI (see Supplementary Tables 1–7 for detailed parameters) to consecutively integrate two studies into a pancreas reference atlas comprising three studies (Fig. 1c). To additionally simulate the scenario in which query data contain a new cell type absent from the reference, we removed all alpha cells from the training reference data. We first trained different existing reference models within the scArches framework to integrate training data and construct a reference atlas (Fig. 1d,e and Supplementary Fig. 1, first column). Once the reference atlas was built, we fine tuned the reference model with the first query data (SMART-seq2 (SS2)) and iteratively updated the reference atlas with this study (Fig. 1d,e, second column) and the second query data (CelSeq2, Fig. 1d,e, third column). After each update, our model overlays data from all shared cell types present in both query and reference while yielding a separate and well-mixed cluster of alpha cells from the query datasets (black dashed circles in Fig. 1d,e). To further assess the robustness of the approach, we held out two cell types (alpha cells and gamma cells) in the reference data while retaining both in the query datasets. Here our model robustly integrated query data while placing unseen cell types into distinct clusters (Supplementary Fig. 2). Additional testing using simulated data showed that scArches is also robust to simultaneously updating the reference atlas with multiple query studies at a time (Supplementary Fig. 3).

Overall, TL with architectural surgery enables users to update learnt reference models by integrating query data while accounting for differences in cell type composition.

Minimal fine tuning performs best for model update
To determine the number of weights to optimize during reference mapping, we evaluated the performance of different fine-tuning strategies. Reference mapping performance was assessed using ten metrics recently established to evaluate data-integration performance7 in terms of removal of batch effects and preservation of biological variation. Batch-effect removal was measured by principal-component regression, entropy of batch mixing, k-nearest neighbor (kNN) graph connectivity and average silhouette width (ASW). Biological conservation was assessed with global cluster matching (adjusted Rand index (ARI), normalized mutual information (NMI)), local neighborhood conservation (kNN accuracy), cell type ASW and rare cell type metrics (isolated label scores). An accurate reference mapping integration should result in both high conservation of biological variation and high batch-removal scores.
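For orientation, a few of these bio-conservation metrics can be computed directly with scikit-learn on an integrated embedding; the sketch below uses toy inputs and is not the benchmarking pipeline of ref. 7, which implements the full metric set.

```python
# Minimal sketch of ARI, NMI and cell type ASW on a toy integrated embedding;
# illustrative only, not the scib benchmarking pipeline used in the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))         # integrated latent embedding (cells x dimensions)
cell_types = rng.integers(0, 4, size=500)   # ground-truth cell type labels

# Global cluster matching: cluster the embedding and compare with cell type labels.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(latent)
ari = adjusted_rand_score(cell_types, clusters)
nmi = normalized_mutual_info_score(cell_types, clusters)

# Cell type ASW: silhouette of cell type labels in the embedding, rescaled to [0, 1].
asw_ct = (silhouette_score(latent, cell_types) + 1) / 2

print(f"ARI={ari:.2f}  NMI={nmi:.2f}  cell-type ASW={asw_ct:.2f}")
```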

In addition to fine tuning only the weights connecting newly added studies as proposed above (adaptors), we also considered (1) training the input layers of both encoder and decoder while the rest of the weights were frozen and (2) fine tuning all weights in the model. We trained a reference model for each base model using 250,000 cells from two mouse brain studies33,34. Next, we compared the integration performance of candidate fine-tuning strategies when mapping two query datasets1,35 onto the reference data. Applying scArches trVAE to the brain atlas, the model with the fewest parameters performed competitively with other approaches in integrating different batches while preserving distinctions between different cell types (Fig. 2a–c). Notably, the strongly regularized scArches reduced trainable parameters by four to five orders of magnitude (Fig. 2d). Overall, comparing integration accuracy for different base models demonstrates the optimal time and integration performance trade-off of using adaptors to incorporate new query datasets compared to that of other approaches (Fig. 2e).

Fig. 2: TL and architecture surgery allow fast and accurate reference mapping. a–c, Comparing different granularity levels in the proposed TL strategy by mapping data from two brain studies to a reference brain atlas. The reference model was trained on a subset of 250,000 cells from two brain studies and then updated with data from Zeisel et al.35 and the TM brain subset. Fine-tuning strategies vary from training only query study label weights (a) to input layers of both the encoder and decoder (b) and retraining the full network (c). d, Number of trained weights across these three granularity levels. e, Comparison of integration accuracy for different fine-tuning strategies when mapping data from two query studies onto a brain reference atlas across various base models. Individual scores are minimum–maximum scaled between 0 and 1. Overall scores were computed using a 40:60-weighted mean of batch correction and bio-conservation scores, respectively (see Methods for further visualization details). EBM, entropy of batch mixing; PCR, principal-component regression.

Architectural surgery allows for efficient data integration
To use scArches, one requires a reference atlas model. The quality of reference mapping performed by scArches depends on the parameterization and architecture chosen for the base model as well as on the quality and quantity of the reference data. To determine the sensitivity of scArches reference mapping to the reference model used, we investigated how much reference data are needed to enable successful reference mapping. Therefore, we leveraged a human immune cell dataset composed of bone marrow36 and peripheral blood mononuclear cells (PBMCs)37,38,39. We built reference models of increasing quality by incrementally including more studies in reference building while using the rest of the studies as query data. To further challenge the model, we included a unique cell type for each study while removing it from the rest of the studies. In our experiments, the reference mapping accuracy of scArches scANVI increased considerably until at least 50% (~10,000 cells) of the data were used as reference (Fig. 3a–c). Specifically, we observed distinct clusters of megakaryocyte progenitors, human pluripotent stem cells, CD10+ B cells and erythroid progenitors only at higher reference ratios (Fig. 3b,c), whereas these were mixed at the lowest reference fraction (Fig. 3a). This observation held true across other base models (Fig. 3d and Supplementary Fig. 4). We repeated similar experiments on brain and pancreas datasets (Supplementary Figs. 5 and 6). Overall, while performance is both model and data dependent, we observed robust performance when at least 50% of the data, including multiple study batches, were used in reference training (Fig. 3d and Supplementary Figs. 7–10).

Fig. 3: scArches enables efficient reference mapping compared to a full integration workflow with existing data-integration methods. a–d, Evaluating the effect of the reference (ref) size for immune data (n = 20,522) on the quality of reference mapping. a–c, UMAP plots show the latent representation of the integrated query and reference data together for scArches scANVI. Cell types highlighted with dashed circles represent cells unique to a specific study denoted by the batch legend. Reference ratio refers to the fraction of cells in the reference compared to all data. The studies used as reference are indicated at the bottom of each panel. HSPCs, hematopoietic stem and progenitor cells. d, Quantitative metrics for the performance of different base models across various reference ratios in immune data. e, Comparison of different scArches base models trained with a reference dataset containing ~66% of batches in the whole data against de novo full integration methods across immune (n = 20,522), pancreas (n = 15,681) and brain (n = 332,129) datasets. Conos and mnnCorrect were not able to integrate the brain data owing to excessive memory usage and time requirements, respectively. MSE, mean squared error.

Reference mapping is designed to generate an integrated dataset without sharing raw data and with limited computational resources. Thus, it must be evaluated against the gold standard of de novo data integration, for which these restrictions are not present. To assess this, we performed scArches reference mapping using a reference model containing approximately two-thirds of the batches and compared this to existing full integration autoencoder methods and other existing approaches22,40,41,42,43,44. The overall score for the scArches reference mapping model is comparable to de novo integration performance (Fig. 3e and Supplementary Figs. 13–15).

We also evaluated the speed of scArches reference mapping compared to full integration strategies. In an scArches pipeline, the reference model either has to be built only once and can then be shared, or it can be downloaded directly to map query datasets. Therefore, we consider the time spent by the user to map query datasets as the relevant basis for our comparisons. The running time also depends on the base model type. For example, trVAE was much slower than other base models because of the maximum mean discrepancy term, whereas scVI and scANVI were the fastest (Supplementary Fig. 11a). Overall, scArches can offer a speed-up of up to roughly fivefold and eightfold for scVI and scANVI, respectively, compared to running a de novo autoencoder-based integration for these methods (Supplementary Fig. 16a). This allows mapping of 1 million query cells in less than 1 h (Supplementary Fig. 11b).

scArches is sensitive to nuanced cell states
We further evaluated scArches under a series of challenging circumstances. A particular challenge for deep learning methods with many trainable parameters is the small data regime. Thus, we first examined the ability of scArches to map rare cell types. For this purpose, we subsampled a specific cell type in our pancreas and immune integration tasks (delta cells and CD16+ monocytes, respectively), such that this population constituted between ~0.1% and ~1.0% of the whole data. Next, we integrated one study as query data and evaluated the quality of reference mapping for the rare cell type. While in all cases the query cells were integrated with reference cells, rare cluster cells could be mixed with other cell types when the fraction was smaller than ~0.5%, and we only observed a distinct cluster at higher fractions (Supplementary Fig. 12).

Second, we evaluated our method on data with continuous trajectories. We trained a reference model using a pancreatic endocrinogenesis dataset45 from three early time points (embryonic day (E)12.5, E13.5 and E14.5). We integrated the latest time point (E15.5) as query data. Here the query data integrated well with the reference data, and our velocity46 analysis on the integrated data confirmed the known differentiation trajectory toward the main alpha, beta, delta and epsilon fates (Supplementary Fig. 13).

Finally, we evaluated how well scArches resolves nuanced, transcriptionally similar cell types in the query. We therefore trained a reference model excluding natural killer (NK) cells, while the reference data contained highly similar NKT cells. Integrated query and reference cells resulted in a separate NK cluster in proximity to NKT cells (Supplementary Fig. 14a). Repeating a similar experiment with both NK and NKT cells absent from the reference reproduced distinct clusters for both populations in the neighborhood of each other (Supplementary Fig. 14b).

scArches enables information transfer from reference to query
The ultimate goal of query-to-reference mapping is to leverage and transfer information from the reference. This information transfer can be transformative for analyzing new query datasets, either by transferring discrete cell type labels that facilitate annotation of query data47,48 or by imputing continuous information such as unmeasured modalities that are present in the reference but absent from query measurements32,48,49.

We first studied transferring discrete information (for example, cell type labels) to query data. We used the recently published Tabula Senis3 as our reference, which includes 155 distinct cell types across 23 tissues and 5 age groups ranging from 1 month to 30 months, profiled with plate-based (SS2) and droplet-based (10x Genomics) assays. As query data, we used cells from the 3-month time point (equivalent to Tabula Muris (TM)).

The query data consist of 90,120 cells from 24 tissues including a previously unseen tissue, trachea, which we excluded from the reference data. scArches trVAE accurately integrates query and reference data across time points and sequencing technologies and creates a distinct cluster of tracheal cells (n = 9,330) (Fig. 4a,b and Supplementary Fig. 15a; see Supplementary Fig. 16 for tissue-level data).

Fig. 4: scArches successfully transfers information from reference to query. a,b, Querying TM (n = 90,120) to the larger reference atlas Tabula Senis (n = 264,287) using scArches trVAE (a) across different tissues (b). Tissues were correctly grouped across the two datasets (a,b). c, Location of misclassified and unknown cells after transferring labels from the reference to the query data. The highlighted tissue represents tracheal cells, which we removed from the reference data. d, Reported uncertainty of the transferred labels, which was low in correctly classified cells and high in the incorrect and unknown ones, particularly in the trachea. Box plots indicate the median (center lines) and interquartile range (hinges), and whiskers represent minimum and maximum values. Numbers of cells (n) are denoted above each box plot. e, Numbers of correct, incorrect and unknown cells across different tissues. The purple dashed line represents tracheal cells only present in TM. f, Construction of a reference CITE-seq atlas using two PBMC datasets (n = 10,849 cells). g, Integration of scRNA-seq data (n = 10,315) into the CITE-seq reference. h, Imputation of missing proteins for the query dataset using the reference. BAT, brown adipose tissue; GAT, gonadal adipose tissue; MAT, mesenteric adipose tissue; SCAT, subcutaneous adipose tissue.

We then investigated the transfer of cell type labels from the reference dataset. Each cell in the query TM was annotated using its closest neighbors in the reference dataset. Additionally, our classification pipeline provides an uncertainty score for each cell and reports cells with more than 50% uncertainty as unknown (Methods). scArches achieved ~84% accuracy across all tissues (Fig. 4c). Moreover, most of the misclassified cells and cells from the unseen tissue received high uncertainty scores (Fig. 4d and Supplementary Fig. 15b). Overall, classification results across tissues indicated robust prediction accuracy for most tissues (Fig. 4e), while highlighting cells that were not mappable to the reference. Therefore, scArches can successfully merge large and complex query datasets into reference atlases. Notably, we used scArches to map a large query (the mouse cell atlas2) onto TM and further onto a recently published human cell landscape (HCL)4 reference, demonstrating applicability to studying the similarity of cell types across species (Supplementary Note 1 and Supplementary Figs. 18–21). Overall, scArches-based label projection performs competitively compared with state-of-the-art methods such as SVM rejection47,50, Seurat version 3 (ref. 22) and logistic regression classifiers50 (Supplementary Fig. 17).
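The label transfer described above can be illustrated with a simple kNN vote in the shared latent space; the sketch below follows the 50% uncertainty cutoff stated in the text, while the choice of k, the vote-based uncertainty and the toy data are assumptions rather than the exact scArches classifier.

```python
# Minimal sketch of kNN label transfer with an uncertainty cutoff in the shared latent
# space; details (k, distance weighting) are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
ref_latent = rng.normal(size=(1000, 10))                  # reference cells in latent space
ref_labels = rng.choice(["B cell", "T cell", "monocyte"], size=1000)
query_latent = rng.normal(size=(200, 10))                 # query cells mapped to the same space

knn = KNeighborsClassifier(n_neighbors=15).fit(ref_latent, ref_labels)
proba = knn.predict_proba(query_latent)                   # fraction of neighbors per label
pred = knn.classes_[proba.argmax(axis=1)]
uncertainty = 1.0 - proba.max(axis=1)

# Cells whose best label receives less than half of the neighbor votes are flagged as unknown.
pred = np.where(uncertainty > 0.5, "unknown", pred)
print(dict(zip(*np.unique(pred, return_counts=True))))
```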

In addition to label transfer, one can use reference atlases to impute continuous information in the query data, such as missing antibody panels in RNA-seq-only assays. Indeed, one can combine scArches with existing multimodal integration architectures such as totalVI32, a model for joint modeling of RNA expression and surface protein abundance in single cells. Leveraging scArches totalVI, we built a cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq)51 reference using two publicly available PBMC datasets (Fig. 4f). Next, we integrated query scRNA-seq data into the reference atlas (Fig. 4g) and used the multimodal reference atlas to impute missing protein data for the query dataset. Using imputed protein abundances, we can distinguish the expected major populations such as T cells (CD3+, CD4+ and CD8+), B cells (CD19+) and monocytes (CD14+) (Fig. 4h) (see Supplementary Fig. 22 for all proteins).

Preserving COVID-19 cell states after reference mapping
In the study of disease, contextualization with healthy reference data is essential. A successful disease-to-healthy data integration should fulfill three criteria: (1) preservation of biological variation of healthy cell states; (2) integration of matching cell types between healthy reference and disease query; and (3) preservation of distinct disease variation, such as the emergence of new cell types that are unseen during healthy reference building. To showcase how one can perform disease contextualization with scArches, we created a reference aggregated from bone marrow36, PBMCs37,38,39 and normal lung tissue52,53,54 (n = 154,723; Fig. 5a–c) and then mapped onto it a dataset containing alveolar macrophages and other immune cells collected by bronchoalveolar lavage from (1) healthy controls and patients with (2) moderate and (3) severe COVID-19 (n = 62,469)55. As described by Liao and colleagues, this dataset contains immune cells found in the normal lung (for example, tissue-resident alveolar macrophages, TRAMs) as well as unique populations that are absent from the normal lung and emerge only during inflammation (for example, monocyte-derived alveolar macrophages, MoAMs)55. We used a negative binomial (NB) CVAE base model for this experiment (Methods).

Fig. 5: scArches resolves severity in COVID-19 query data mapped to a healthy reference and reveals emergent cell states. a–c, Integration of query data from immune and epithelial cells from patients with COVID-19 on top of a healthy immune atlas across multiple tissues (a), cell types (b) and cell states (c). BALF, bronchoalveolar lavage fluid; DC, dendritic cell; Treg cells, regulatory T cells. d, Comparison of various macrophage subpopulations across both healthy and COVID-19 states. Top, TRAMs are characterized by expression of FABP4, whereas monocyte-derived inflammatory macrophages (MoMs) are characterized by expression of CCL2. Upregulation of C1QA illustrates maturation of MoMs as they differentiate from monocytes to macrophages. Middle, CXCL5, IFI27 and CXCL10 illustrate context-dependent activation of TRAMs. Bottom, scArches correctly maps TRAMs from the query to TRAMs from the reference while preserving MoMs, unseen in the reference, as a distinct cell type. e, Separation of activated query CD8+ T cells from patients with COVID-19 from the rest of the CD8+ T cells in the reference. AT, alveolar type; mDC, myeloid dendritic cells; pDC, plasmacytoid dendritic cells.

We first evaluated the integration of query batches into the reference. scArches successfully integrated alveolar macrophages from different datasets and preserved biological variability between them, although some ambient RNA signals remained (Supplementary Note 2 and Supplementary Fig. 23). For example, activated TRAMs (FABP4+IL1B+CXCL5+) that originate from a single individual (donor 2 in the Travaglini et al.52 dataset) formed a distinct subcluster within TRAMs (Fig. 5a–d). We then evaluated the projection of COVID-19 query data onto the reference model. The dataset from Liao and colleagues contains the following cell types: airway epithelial cells, plasma cells and B cells, CD4+ and CD8+ T cells, NK cells, neutrophils, mast cells, dendritic cells, monocytes and alveolar macrophages (Fig. 5b,c and Supplementary Fig. 24)55. Within the macrophage cluster (characterized by the expression of C1QA), two distinct populations dominated the structure of the embedding (Fig. 5c,d): TRAMs (FABP4+C1Q+CCL2−) and inflammatory MoAMs (FABP4−C1Q+CCL2+). As expected, query TRAMs from healthy controls integrated well with TRAMs from the reference dataset. While TRAMs from patients with moderate COVID-19 integrated with TRAMs from control lung tissue, they did not mix with normal TRAMs completely, as they were activated and characterized by elevated expression of IFI27 and CXCL10. MoAMs are predominantly found in samples from patients with severe COVID-19 and to a lesser extent in samples from patients with moderate COVID-19. MoAMs originate from monocytes that are recruited to sites of infection (as illustrated by the gradient of C1QA expression) and thus do not appear in healthy reference tissue. Indeed, MoAMs were embedded in closer proximity to monocytes than to TRAMs in our embedding, reflecting their ontological relationship (see Supplementary Fig. 25 for partition-based graph abstraction56 proximity analysis).

We then evaluated CD8+ T cells. While the reference bone marrow and blood cells predominantly contained naive CD8+ T cells (CCL5−), lung and bronchoalveolar lavage fluid contained cytotoxic memory CD8+ T cells (GZMA+GZMH+; Fig. 5e and Supplementary Fig. 26). Moreover, cytotoxic memory CD8+ T cells from patients with COVID-19 were characterized by the expression of interferon-response genes ISG15, MX1 and others, in agreement with a recent report that the interferon response is a feature separating severe acute respiratory syndrome coronavirus 2 pneumonia from other viral and non-viral pneumonias57,58 (Fig. 5e, Supplementary Note 2 and Supplementary Fig. 26).

Overall, the scArches joint embedding was dominated by nuanced biological variation, for example, macrophage subtypes, even when these subtypes were not annotated in the reference datasets (for example, activated TRAMs from patients with moderate COVID-19 or a patient with a lung tumor). Although disease states were absent from the reference data, scArches separated these states from the healthy reference and even preserved biological variation patterns. Hence, disease-to-healthy integration with scArches met all three criteria for successful integration.

Discussion
We introduced architectural surgery, an easy-to-implement approach for TL, reusing neural network models by adding input nodes and weights (adaptors) for new studies and then fine tuning only these parameters. Architectural surgery can extend any conditional neural network-based data-integration method to enable decentralized reference updating, facilitate model reuse and provide a framework for learning from reference data.

In applications, we demonstrated how integration of whole-species atlases enables the transfer of cell type annotations from a reference to a query atlas. We further showed that COVID-19 query data can be mapped on top of a healthy reference while retaining variation among both disease and healthy states, which we promote in scArches by avoiding showing the method the disease effect during training. In general, different effects such as disease states are assumed to be orthogonal in high-dimensional space43; thus, if a batch-confounded effect (for example, any donor-level covariate when donor is used as batch) is not seen during training, we would not expect it to be removed. We observe this phenomenon in our COVID-19 example and in several experiments: biologically meaningful variations from held-out alpha cells in the pancreas (Fig. 1d,e) or unseen nuanced cell identities in immune cell data (Supplementary Fig. 14) are mapped to a new location when they are unseen during training.

The reduction in model training complexity achieved by training adaptors moreover results in an increase in speed while preserving integration accuracy compared to full integration methods. It also improves usability and interpretability, because mapping a query dataset to a reference requires no further hyperparameter optimization and keeps the reference representation intact. Adaptors only affect the first network layer and therefore 'commute': the order of application is irrelevant when iteratively expanding a reference, always arriving at the same result owing to the frozen nature of the network and the independence of adaptor weights. With scArches, one can therefore use pre-trained neural network models without computational expertise or graphics processing unit power to map, for example, disease data onto saved reference networks prepared from unbiased atlases. We make use of these features by providing a model database on Zenodo (Methods).

Model sharing together with reference mapping through scArches allows users to create custom reference atlases by updating public ones and paves the way for automated and standardized analyses of single-cell studies. Especially for human data, sharing expression profiles is often difficult because of data-protection regulations, size, complexity and other organizational hurdles. With scArches, users can obtain an overview of the whole dataset to validate harmonized cell type annotation. By sharing a pre-trained neural network model that can be locally updated, international consortia can generate a joint embedding without requiring access to the full gene expression data. In turn, users can quickly build upon this by mapping their own, often much smaller, data into the reference, acquiring robust latent spaces, cell type annotation and identification of subtle state-specific differences with respect to the reference.

scArches is a tool that leverages existing conditional autoencoder models to perform reference mapping. Thus, by design, it inherits both the benefits and limitations of the underlying base models. For example, a limitation of these models is that the integrated output is a low-dimensional latent space instead of a corrected feature matrix as provided by mnnCorrect or Scanorama. While producing a batch-corrected expression matrix is feasible30, this may result in spurious and false signals, similar to denoising approaches59. Similarly, imputation of modalities not measured in the query data (for example, via scArches totalVI) performs better for more abundant features, as already outlined in the original totalVI publication32. A further limitation is the need for a sufficiently large and diverse set of samples for reference building. Deep learning models typically have more trainable parameters than other integration methods and thus often require more data. This constraint translates directly into the performance of scArches reference mapping (Fig. 3a–d): using a small reference comprising a low number of studies leads to poor integration of query data and removal of biological variation such as nuanced cell types. Furthermore, even with equal training data, reference model performance will differ, affecting reference mapping via scArches. As robust and scalable reference building is still ongoing research in the scRNA-seq field7, the choice of reference model is a central challenge when using scArches. Yet we demonstrate that even imperfect reference models (Supplementary Note 3) can be used for meaningful analyses, as shown by our analysis of data from patients with COVID-19. Finally, one must consider the limitations of the base model on batch-effect removal during reference mapping, as it is unlikely to remove batch effects stronger than those seen in the training data. In our cross-species experiments, reference mapping performs well mostly in immune cell populations, which appear to contain the smallest batch effect across species (Supplementary Figs. 18, 20 and 21).

While scArches is applicable in many scenarios, it is best suited when the query data consist of cell types and experimental protocols similar to those of the reference data. The query data may then still contain new cell types or states such as disease or other kinds of perturbations, which are preserved after mapping. Additionally, we advise against using scArches for integrating query data with a reference created from a single study and recommend integration with full sample access instead. Further, the number of overlapping genes between query and reference data can also influence integration quality. We generally recommend using a larger set of highly variable genes (HVGs) in the reference-building step to guarantee a larger feature overlap between reference and query, which increases the robustness of reference mapping in the presence of missing genes (Methods and Supplementary Fig. 28).

We envision two main directions for further applications and development. First, scArches can be applied to generate context-specific large-scale disease atlases. Large disease reference datasets are increasingly becoming available60,61,62. By mapping between disease references, we can assess the similarity of these diseases at the single-cell level and thus inform the discovery of mechanisms, the reversal of disease states or the study of perturbations, for example, for drug repurposing. The suitability of model organisms for disease research can be directly translated into the human context: for example, projecting mouse single-cell tumor data onto a reference human patient tumor atlas may help to identify accurate tumor models that include desired molecular and cellular properties of a patient's microenvironment. Incorporating additional covariates as conditional neurons in the reference model will allow modeling of treatment response to a certain perturbation or drug63,64. Second, we envision assembling multimodal single-cell reference atlases that include epigenomic65, chromosome conformation66, proteome51 and spatially resolved measurements.

In summary, with the availability of reference atlases, we expect scArches to accelerate the use of these atlases for analyzing query datasets.

Methods
Architecture surgery
Our method relies on a concept known as TL. TL is an approach in which weights from a model trained on one task are taken and used as weight initialization or for fine tuning on another task. We introduce architecture surgery, a method to apply TL in the context of conditional generative models and single-cell data. Our proposed method is general and can be used to perform TL on both CVAEs and conditional generative adversarial nets67.

Let us assume that we want to train a reference CVAE model with a d-dimensional dataset (x ∈ R^d) from n different studies (s ∈ R^n), where R denotes the real number space. We further assume that the bottleneck z has layer size k (z ∈ R^k). Then, the input for a single cell i will be x′ = x · s, where x and s are the d-dimensional gene expression profile and the n-dimensional one-hot encoding of study labels, respectively, and the · symbol denotes the row-wise concatenation operation. Therefore, the model receives (d + n)-dimensional and (k + n)-dimensional vectors as inputs for the encoder and decoder, respectively. Assuming m query datasets, the target model will be initialized with all of the parameters from the reference model. To incorporate the m new study labels, we add m new dimensions to s in both the encoder and decoder networks. We refer to these newly added study labels as s′. Next, m new randomly initialized weight vectors are also added to the first layer of the encoder and decoder. Finally, we fine tune the new model by training only the weights associated with the last m dimensions of x′ that correspond to the condition labels. Let us assume that p and q are the numbers of neurons in the first layer of the encoder and decoder, respectively; then, during fine tuning, only m × (p + q) parameters will be trained. Let us parameterize the first layer of the encoder and decoder part of scArches as f1 and g1, respectively, and further assume that ReLU activations are used in these layers. The equations for f1 and g1 are then

$$\begin{array}{l}f_1(x,s,s';\phi_x,\phi_s,\phi_{s'})={\textrm{max}}(0,\phi_x^Tx + \phi_s^Ts + \phi_{s'}^Ts')\\ g_1(z,s,s';\theta_z,\theta_s,\theta_{s'})={\textrm{max}}(0,\theta_z^Tz + \theta_s^Ts + \theta_{s'}^Ts'),\end{array}$$

where ϕ and θ are the parameters of the encoder and decoder, respectively, and T denotes the transpose operation. Therefore, the gradients of f1 and g1 with respect to ϕ_{s′} and θ_{s′} are

$$start{array}{l}nabla_{phi_{s’}}f_1=left{start{array}{lr} 0 & {mathrm{if}} {phi_x^Tx + phi_s^Ts + phi_{s’}^Ts’} {le} 0 s’ & {mathrm{in any other case}} finish{array} proper. nabla_{theta_{s’}}g_1=left{start{array}{lr} 0 & {mathrm{if}} {theta_z^Tz + theta_s^Ts + theta_{s’}^Ts’} {le} 0 s’ & {mathrm{in any other case}} finish{array} proper. finish{array}$$

Finally, because all other weights except ϕ_{s′} and θ_{s′} are frozen, we only compute the gradient of the scArches cost function with respect to ϕ_{s′} and θ_{s′}:

$$\begin{array}{l}\nabla_{\phi_{s'}}L_{\textrm{scArches}}(x,s,s';\theta,\phi)=\nabla_{f_1}L_{\textrm{scArches}}(x,s,s';\theta,\phi) \cdot \nabla_{\phi_{s'}}f_1(x,s,s';\phi_x,\phi_s,\phi_{s'})\\ \nabla_{\theta_{s'}}L_{\textrm{scArches}}(x,s,s';\theta,\phi)=\nabla_{g_1}L_{\textrm{scArches}}(z,s,s';\theta,\phi) \cdot \nabla_{\theta_{s'}}g_1(z,s,s';\theta_z,\theta_s,\theta_{s'}).\end{array}$$
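For concreteness, a worked numerical example of the trainable-parameter count m × (p + q) is given below; the layer sizes are illustrative assumptions rather than values from this work.

$$\begin{aligned} m &= 1,\quad p = q = 256,\quad d = 2000,\quad n = 3,\quad k = 10\\ \text{trainable adaptor weights} &= m\,(p + q) = 512\\ \text{full first-layer weights} &\approx (d + n)\,p + (k + n)\,q = 512{,}768 + 3{,}328 \approx 5 \times 10^5, \end{aligned}$$

so fine tuning adaptors alone touches roughly three orders of magnitude fewer weights than retraining just the first layers, and even fewer relative to the full network.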

scArches base models
Conditional variational autoencoders
Variational autoencoders (VAEs)68 have been shown to learn the underlying complex structure of data. VAEs were proposed for generative modeling of the underlying data, leveraging variational inference and neural networks to maximize the following equation:

$$p_\theta(X\mid S)=\int p_\theta(X\mid Z,S)\,p_\theta(Z\mid S)\,dZ,$$

where X is a random variable representing the model's input, S is a random variable indicating the various conditions, θ denotes the neural network parameters, and p_θ(X | Z, S) is the output distribution, for which we sample Z to reconstruct X. In the following, we use notation from ref. 29 and a tutorial from ref. 69. We approximate the posterior distribution p_θ(Z | X, S) using the variational distribution q_ϕ(Z | X, S), which is approximated by a deep neural network parameterized with ϕ:

$$\begin{aligned}L_{\textrm{CVAE}}(X,S;\phi,\theta) &= \log p_\theta(X\mid S) - \alpha \cdot D_{\textrm{KL}}(q_\phi(Z|X,S)\,\|\,p_\theta(Z|X,S))\\ &= \mathbb{E}_{q_\phi(Z\mid X,S)}[\log p_\theta(X\mid Z,S)] - \alpha \cdot D_{\textrm{KL}}(q_\phi(Z|X,S)\,\|\,p_\theta(Z|S)),\end{aligned}$$

where θ = {θ′, θ_z, θ_s} and ϕ = {ϕ′, ϕ_x, ϕ_s} are the parameters of the decoder and encoder, respectively, E denotes the expectation and D_KL is the Kullback–Leibler divergence scaled by the parameter α. On the left-hand side, we have the log likelihood of the data and an error term that depends on the capacity of the model. The right-hand side of the above equation is also known as the evidence lower bound. CVAE70 is an extension of the VAE framework in which S ≠ ∅.
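The objective above is typically optimized as a reconstruction term plus a weighted KL term. The following is a minimal PyTorch sketch of such a CVAE loss; the Gaussian reconstruction term, layer sizes and α value are assumptions for illustration only, not the configuration used in this work.

```python
# Minimal CVAE sketch: encoder q_phi(Z|X,S), decoder p_theta(X|Z,S) and the negative
# evidence lower bound (reconstruction + alpha * KL). Toy sizes; Gaussian reconstruction
# term is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_studies, k, alpha = 2000, 3, 10, 0.005

encoder = nn.Sequential(nn.Linear(d + n_studies, 128), nn.ReLU())
enc_mu, enc_logvar = nn.Linear(128, k), nn.Linear(128, k)
decoder = nn.Sequential(nn.Linear(k + n_studies, 128), nn.ReLU(), nn.Linear(128, d))

def cvae_loss(x, s):
    h = encoder(torch.cat([x, s], dim=1))
    mu, logvar = enc_mu(h), enc_logvar(h)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    recon = decoder(torch.cat([z, s], dim=1))
    recon_loss = F.mse_loss(recon, x, reduction="sum") / x.shape[0]
    # KL(q(z|x,s) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.shape[0]
    return recon_loss + alpha * kl

x = torch.randn(32, d)
s = F.one_hot(torch.randint(0, n_studies, (32,)), n_studies).float()
print(cvae_loss(x, s))
```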

scArches trVAE
trVAE30 builds upon the VAE68 with an additional regularization to further match the distributions between conditions. Following the approach proposed by Lotfollahi et al.30, we use the representation of the first layer in the decoder, which is regularized by maximum mean discrepancy71. For the implementation, we use multi-scale radial basis function (RBF) kernels defined as

$$k(x,x')=\sum_{i=1}^{l} k(x,x',\gamma_i),$$

where k(x, x′, γ_i) = e^{−γ_i‖x − x′‖²}, γ_i is a hyperparameter, and l denotes the maximum number of RBF kernels.

We parameterize the encoder and decoder parts of scArches as f_ϕ and g_θ, respectively, so the networks f_ϕ and g_θ accept inputs (x, s) and (z, s), respectively. Let us distinguish the first layer g_{θ_z,θ_s}^{(1)} and the remaining layers g_{θ′}^{(2)} of the decoder network, g_θ = g_{θ′}^{(2)} ∘ g_{θ_z,θ_s}^{(1)}. We can therefore define the following maximum mean discrepancy (MMD) cost function:

$$L_{\textrm{MMD}}(X,S;\phi,\theta_z,\theta_s)=\sum_{i \ne j}^{\textrm{no. studies}} l_{\textrm{MMD}}\big(g_{\theta_z,\theta_s}^{(1)}(f_\phi(X_{S=i},i),i),\,g_{\theta_z,\theta_s}^{(1)}(f_\phi(X_{S=j},j),j)\big),$$

where

$$\begin{array}{rcl}l_{\textrm{MMD}}(X,X') &=& \dfrac{1}{N_0^2}\displaystyle\sum_{n=1}^{N_0}\sum_{m=1}^{N_0} k(x_n,x_m)\\ && +\, \dfrac{1}{N_1^2}\displaystyle\sum_{n=1}^{N_1}\sum_{m=1}^{N_1} k(x'_n,x'_m) - \dfrac{2}{N_0 N_1}\displaystyle\sum_{n=1}^{N_0}\sum_{m=1}^{N_1} k(x_n,x'_m).\end{array}$$

We use the notation X_{S=i} for samples drawn from the ith study distribution in the training data. Finally, the trVAE cost function is

$$L_{\textrm{trVAE}}(X,S;\phi,\theta)=L_{\textrm{CVAE}}(X,S;\phi,\theta) - \beta \cdot L_{\textrm{MMD}}(X,S;\phi,\theta_z,\theta_s),$$

where β is a regularization scale parameter. The gradients of the trVAE cost function with respect to ϕ_s and θ_s are

$$\begin{array}{l}\nabla_{\phi_s}L_{\textrm{trVAE}}(X,S;\theta,\phi)=\nabla_{\phi_s}L_{\textrm{CVAE}}(X,S;\theta,\phi) - \beta \cdot \nabla_{\phi_s}L_{\textrm{MMD}}(X,S;\phi,\theta_z,\theta_s),\\ \nabla_{\theta_s}L_{\textrm{trVAE}}(X,S;\theta,\phi)=\nabla_{\theta_s}L_{\textrm{CVAE}}(X,S;\theta,\phi) - \beta \cdot \nabla_{\theta_s}L_{\textrm{MMD}}(X,S;\phi,\theta_z,\theta_s).\end{array}$$

Therefore, L_trVAE can be optimized using stochastic gradient ascent with respect to ϕ_s and θ_s, as all the other parameters are frozen.
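The l_MMD statistic above can be computed directly from two batches of first-layer decoder activations. The following is a minimal NumPy sketch; the γ grid is an illustrative assumption, not the kernel scales used by trVAE.

```python
# Minimal sketch of the multi-scale RBF MMD statistic l_MMD between two sets of
# decoder first-layer activations; the gamma grid and toy data are assumptions.
import numpy as np

def rbf_kernel(a, b, gammas=(1e-3, 1e-2, 1e-1, 1.0)):
    # Multi-scale RBF kernel: sum_i exp(-gamma_i * ||a - b||^2) over all pairs.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return sum(np.exp(-g * sq_dists) for g in gammas)

def mmd(x0, x1):
    # l_MMD(X, X') = mean k(x, x) + mean k(x', x') - 2 * mean k(x, x')
    return rbf_kernel(x0, x0).mean() + rbf_kernel(x1, x1).mean() - 2 * rbf_kernel(x0, x1).mean()

rng = np.random.default_rng(0)
act_study_a = rng.normal(0.0, 1.0, size=(100, 64))   # activations for study i
act_study_b = rng.normal(0.5, 1.0, size=(120, 64))   # activations for study j
print(mmd(act_study_a, act_study_b))                 # larger when the two studies differ more
```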

scArches scVI
Lopez et al.27 developed a fully probabilistic approach, called scVI, for normalization and analysis of scRNA-seq data. scVI is also based on a CVAE, described in detail above. In contrast to the trVAE architecture, however, the decoder assumes a zero-inflated negative binomial (ZINB) distribution, and therefore the reconstruction loss differs from the MSE loss of trVAE. Another major difference is that scVI explicitly models the library size, which is required for the ZINB loss calculation, with another shallow neural network called the library encoder. Therefore, with notation similar to that above, we have the output distribution p(X | Z, S, L), where L is the scaling factor that is sampled using the outputs of the library encoder, namely the empirical mean L_μ and variance L_σ of the log library size per batch:

$$L \sim \textrm{lognormal}(L_\mu, L_\sigma^2).$$

When we now separate the outputs of the decoder g_θ into g_θ^x, the decoded mean proportion of the expression data, and g_θ^d, the decoded dropout effects, we can write the ZINB mass function for p(X | Z, S, L) in the following closed form:

$$\left\{\begin{array}{l} p(X=0 \mid Z,S,L)= g_{\theta}^d(Z,S) + \left(1 - g_{\theta}^d(Z,S)\right)\left(\dfrac{\Sigma}{\Sigma + L \cdot g_{\theta}^x(Z,S)}\right)^{\Sigma}\\[2ex] p(X=Y \mid Z,S,L)= \left(1 - g_{\theta}^d(Z,S)\right)\dfrac{\Gamma(Y + \Sigma)}{\Gamma(Y+1)\,\Gamma(\Sigma)}\left(\dfrac{\Sigma}{\Sigma + L \cdot g_{\theta}^x(Z,S)}\right)^{\Sigma}\left(\dfrac{L \cdot g_{\theta}^x(Z,S)}{\Sigma + L \cdot g_{\theta}^x(Z,S)}\right)^{Y},\end{array}\right.$$

where Σ is the gene-specific inverse dispersion, Γ is the gamma function, and Y represents non-zero entries drawn from a ZINB distribution. Because the evidence lower bound, and therefore the optimization objective, can be calculated by applying the reparameterization trick and assuming Gaussians, which is possible here owing to the proposed ZINB distribution, we can write the scVI cost function as follows:

$$L_{\textrm{scVI}}(X,S;\phi,\theta)=L_{\textrm{CVAE}}(X,S;\phi,\theta) - \alpha \cdot D_{\textrm{KL}}(q_\phi(L|X,S)\,\|\,p_\theta(L)).$$

Furthermore, because of the applied reparameterization trick, automatic differentiation can be used, and the cost function can be optimized by applying stochastic gradient descent. For the application in scArches, we removed the library encoder and computed the library size for each batch in closed form by summing up the counts. This does not decrease the performance of the model and accelerates the surgery step. The resulting network can then be used similarly to the trVAE network by simply retraining only the condition weights corresponding to the new batch annotations in S.
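For reference, the closed-form ZINB mass function above can be evaluated directly; the sketch below does this in NumPy for a single gene and cell, with arbitrary illustrative values and without any of the scVI machinery.

```python
# Minimal sketch of the ZINB probability mass function defined above for one gene in one
# cell; pi is the decoded dropout probability g^d, mu = L * g^x is the scaled mean and
# sigma is the gene-specific inverse dispersion. All numeric values are illustrative.
import numpy as np
from scipy.special import gammaln

def zinb_pmf(y, mu, sigma, pi):
    nb_zero = (sigma / (sigma + mu)) ** sigma          # NB probability of observing zero
    if y == 0:
        return pi + (1.0 - pi) * nb_zero
    log_nb = (gammaln(y + sigma) - gammaln(y + 1.0) - gammaln(sigma)
              + sigma * np.log(sigma / (sigma + mu))
              + y * np.log(mu / (sigma + mu)))
    return (1.0 - pi) * np.exp(log_nb)

library_size, mean_proportion = 5000.0, 2e-4            # L and g^x for this gene
mu = library_size * mean_proportion
print(zinb_pmf(0, mu, sigma=2.0, pi=0.1))                # zero-inflated zero probability
print(zinb_pmf(3, mu, sigma=2.0, pi=0.1))                # probability of observing 3 counts
```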

scArches scANVI
scANVI is a semi-supervised method that builds on the scVI model and was described in detail by Xu et al.31. By constructing a mixture model, it is able to use any cell type annotations during autoencoder training to improve the latent representation of the data. In addition, scANVI is capable of labeling datasets with only a few marker gene labels as well as transferring labels from a labeled dataset to an unlabeled dataset. For the training of scANVI, the authors proposed an alternating optimization of the cost function L_scANVI(X, S; ϕ, θ) and the classification loss C, which results from a shallow neural network that serves as a classifier with a cross-entropy loss after the final softmax layer. In more detail, the cost function can be formulated in the following manner:

$$L_{\textrm{scANVI}}(X,S;\phi,\theta)=L_{\textrm{labeled}}(X,S,C;\phi,\theta) + L_{\textrm{unlabeled}}(X,S;\phi,\theta),$$

where C denotes the cell types in the annotated datasets, and both cost function summands L_labeled and L_unlabeled are obtained by calculations similar to those in the case of scVI. The main difference here, however, is that the Kullback–Leibler divergence is applied to an additional latent encoder that takes cell type annotations into account. For the unlabeled case, each sample is broadcast into every available cell type. As scANVI builds on scVI, we use the same adjustments here to apply surgery. On top of that, we also freeze the classifier even for semi-supervised query data, because we want an unchanging reference performance for building a cell atlas and also to force cells in the query data with the same cell type annotation to be close to the corresponding reference cells in the latent representation.

scArches totalVI
For the purpose of combining paired measurements of RNA and surface proteins from the same cells, such as CITE-seq data, Gayoso et al.32 presented a deep generative model called totalVI. totalVI learns a joint low-dimensional probabilistic representation of RNA and protein measurements. For the RNA portion of the data, totalVI uses an architecture similar to that of scVI, which we discussed in detail above; for proteins, however, a new model is introduced that separates protein data into background and foreground components. With the surgery functionality of scArches added to totalVI, it is now possible to learn a joint latent space of RNA and protein data on a CITE-seq reference dataset and perform surgery on a query dataset with only RNA data to impute protein data for that query dataset as well. To accomplish this goal, we again only retrain the weights that correspond to the new batch labels.

CVAEs for single-cell genomics
CVAEs were first applied to scRNA-seq data in scVI29 for data integration and differential testing. Here we focus on how CVAEs perform data integration and on potential pitfalls. These models receive a matrix of gene expression profiles for cells (X) and a label (condition) matrix (S). The condition matrix encodes a nuisance variable that we want to regress out of the data. Labels can encode batch, technologies, disease state or other discrete variables. The CVAE model seeks to infer a low-dimensional latent space (Z) for the cells that is free of variation explained by the label variable. For example, if the labels are the experimental batches, then similar cell types separated by batch effects in the original gene expression space will be aligned together. Importantly, variation attributed to the labels is merely regressed out in the latent space while still being present in the output of the CVAE. Therefore, the reconstructed output will still contain batch effects. Additionally, while autoencoder-based data-integration methods have been shown to perform best when outputting integrated embeddings, these methods can also output corrected expression matrices. This is achieved by forcing all batches to be transformed to a specific batch, as previously shown in scGen.

scArches builds upon existing CVAEs. The results of the integration strongly depend on the type of labels used as batch covariates for the condition inputs. If the dataset is the batch covariate, within-dataset donor effects will not be removed, but donors become more similar across datasets. In our COVID-19 example, the disease is used as a query and is thus not fully captured in the encoder, which is trained on data from healthy individuals. Adaptor training removes the donor- and/or dataset-specific batch effect from a disease sample but does not remove variation unseen during network training. Thus, the choice of training data and the choice of batch covariate are essential for assessing whether variation from disease is removed during training or not.

Overall, the choice and design of the label matrix is an important step for an optimal outcome. The label matrix can encode one covariate (for example, batch), several covariates (for example, technology, cell type, disease, species) or a combination of covariates (for example, technology and species). However, interpretability of the latent space becomes more difficult in the presence of a complex label design and requires additional caution.
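A minimal sketch of how such a label (condition) matrix S can be constructed from covariates stored in adata.obs; the column names are illustrative.

```python
# Minimal sketch of building a one-hot condition matrix S from one or several covariates.
import pandas as pd

# Single covariate: each batch becomes one indicator column.
S_batch = pd.get_dummies(adata.obs["batch"])

# Combination of covariates: concatenate the values first (for example,
# technology and species), then one-hot encode the combined label.
combined = adata.obs["technology"].astype(str) + "_" + adata.obs["species"].astype(str)
S_combined = pd.get_dummies(combined)

# Several covariates side by side: one block of indicator columns per covariate.
S_multi = pd.concat(
    [pd.get_dummies(adata.obs[c]) for c in ["technology", "disease"]], axis=1)
```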

Model sharing
We currently support an application programming interface to upload and download model weights and data (if available) using Zenodo. Zenodo is a general-purpose open-access repository developed to enable researchers to share datasets and software. We have provided step-by-step guides for the whole pipeline, from training and uploading models to downloading, updating and further sharing models. These tutorials can be found in the scArches GitHub repository (https://github.com/theislab/scarches).

Feature overlap between reference and query
An important practical challenge for reference mapping using scArches is the number of features (genes) that are shared between the query and the reference model and/or dataset. It is important to note that, with the current pipeline, the query data must have the same gene set as the reference model. Therefore, the user has to substitute reference genes missing in the query with zeros. We investigated the effect of zero filling and observed that integration performance was robust when 10% (of 2,000 genes) were missing from the query data. However, performance will deteriorate with larger differences between query and reference (Supplementary Fig. 28a). We further observed good integration with 4,000 HVGs even when 25% of genes were missing from the query data, indicating that the model can be robust if the overall number of shared genes is large (for example, 4,000 HVGs; Supplementary Fig. 28b).
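A minimal sketch of this zero-filling step, assuming the reference gene list is available as a plain Python list; variable names are illustrative.

```python
# Minimal sketch: pad a query AnnData to the reference gene set with zeros and
# reorder columns so the query matches the reference gene order exactly.
import numpy as np
import pandas as pd
import scipy.sparse as sp
import anndata as ad

ref_genes = list(reference_model_genes)          # genes expected by the reference model (placeholder)
shared = [g for g in adata_query.var_names if g in set(ref_genes)]
query = adata_query[:, shared]

# Dense zero matrix over the full reference gene set; copy over the shared genes.
X = np.zeros((query.n_obs, len(ref_genes)), dtype=np.float32)
Xq = query.X.toarray() if sp.issparse(query.X) else np.asarray(query.X)
col = {g: i for i, g in enumerate(ref_genes)}
for j, g in enumerate(query.var_names):
    X[:, col[g]] = Xq[:, j]

adata_padded = ad.AnnData(X=X, obs=adata_query.obs.copy(),
                          var=pd.DataFrame(index=ref_genes))
```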

Evaluation metrics
Evaluation metrics and their definitions in the present paper were taken from the work of Luecken et al.7, unless specifically stated otherwise.

Entropy of batch mixing
This metric43 works by constructing a fixed similarity matrix for cells. The entropy of mixing in a region of cells with c batches is defined as

$$E=\mathop{\sum}\limits_{i=1}^{c} p_i\log_c(p_i),$$

where pi is defined below as

$$p_i=\frac{\textrm{no. cells with batch }i\textrm{ in the region}}{\textrm{no. cells in the region}}.$$

Next, we define U, a uniform random variable on the cell population. Let BU be the frequencies of the 15 nearest neighbors of cell U in each batch x. We report the entropy of this variable and then average across T = 100 draws of U. To normalize the entropy of the batch-mixing score between 0 and 1, we set the base of the logarithm to the number of batches c.
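A minimal sketch of this metric; the entropy is computed with the conventional negative sign so that the score is positive, and the neighborhood size, T and variable names follow the description above.

```python
# Minimal sketch of the regional batch-mixing entropy on a latent matrix.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def entropy_batch_mixing(latent, batches, n_neighbors=15, T=100, seed=0):
    rng = np.random.default_rng(seed)
    batches = np.asarray(batches)
    c = len(np.unique(batches))
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(latent)
    scores = []
    for cell in rng.choice(latent.shape[0], size=T, replace=False):
        _, idx = nn.kneighbors(latent[cell][None, :])
        _, counts = np.unique(batches[idx.ravel()], return_counts=True)
        p = counts / counts.sum()
        # Entropy with logarithm base c so that the score lies between 0 and 1.
        scores.append(-(p * (np.log(p) / np.log(c))).sum())
    return float(np.mean(scores))
```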

Average silhouette width
Silhouette width measures the relationship between the within-cluster distances of a cell and the between-cluster distances of that cell to the closest cluster. In general, an ASW score of 1 implies clusters that are well separated, an ASW score of 0 implies overlapping clusters, and an ASW score of −1 implies strong misclassification. When we use the ASW score as a measure of biological variance, we calculate it on cell types in the following way:

$${\textrm{ASW}}_c=\frac{\textrm{ASW}+1}{2},$$

where the final score is already scaled between 0 and 1. Therefore, larger values correspond to denser clusters. In contrast to the ASWc score, we also calculate an ASW score on batches within cell clusters to obtain a measure of batch-effect removal. In this case, we again scale but also invert the ASW score to have a consistent metric comparison:

$${\textrm{ASW}}_b=1-{\textrm{abs}}({\textrm{ASW}}).$$

A higher final score here implies better mixing and therefore better batch-effect removal.
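A minimal sketch of both ASW variants, assuming scikit-learn's silhouette_score, a latent matrix Z and arrays of cell type and batch labels.

```python
# Minimal sketch of the cell type ASW (bio-conservation) and batch ASW (mixing).
import numpy as np
from sklearn.metrics import silhouette_score

def asw_cell_type(Z, cell_types):
    # Silhouette on cell types, rescaled from [-1, 1] to [0, 1].
    return (silhouette_score(Z, cell_types) + 1) / 2

def asw_batch(Z, batches, cell_types):
    # Silhouette on batches within each cell type cluster, inverted so that
    # higher values mean better batch mixing.
    batches, cell_types = np.asarray(batches), np.asarray(cell_types)
    scores = []
    for ct in np.unique(cell_types):
        mask = cell_types == ct
        if len(np.unique(batches[mask])) > 1:
            scores.append(1 - abs(silhouette_score(Z[mask], batches[mask])))
    return float(np.mean(scores))
```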

Normalized mutual information
We use NMI to compare the overlap of two different cell type clusterings. In detail, we computed a Louvain clustering on the latent representation of the data and compared it with the cell type labels. To obtain scores between 0 and 1, the overlap was scaled using the mean of the entropy terms for cell type and cluster labels. Therefore, an NMI score of 1 corresponds to a perfect match and good conservation of biological variance, whereas an NMI score of 0 corresponds to uncorrelated clustering.

Adjusted Rand index
This metric considers correct clustering overlaps as well as correct disagreements between two clusterings. Again, similar to NMI, cell type labels in the integrated dataset are compared with a Louvain clustering. The adjusted Rand index score is normalized between 0 and 1, where 1 corresponds to perfect conservation of biological variance and 0 corresponds to random labeling.
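A minimal sketch of both comparisons, assuming a Louvain clustering computed with scanpy on the integrated latent representation (stored under the illustrative key "X_latent") and cell type labels in adata.obs.

```python
# Minimal sketch of NMI and ARI between cell type labels and Louvain clusters.
import scanpy as sc
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

sc.pp.neighbors(adata, use_rep="X_latent")   # kNN graph on the integrated latent space
sc.tl.louvain(adata)                         # cluster labels in adata.obs["louvain"]

nmi = normalized_mutual_info_score(adata.obs["cell_type"], adata.obs["louvain"],
                                   average_method="arithmetic")
ari = adjusted_rand_score(adata.obs["cell_type"], adata.obs["louvain"])
```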

Principal-component regression
In contrast to principal-component analysis (PCA), we calculate a linear regression R with respect to the batch label onto each principal component. The total variance (Var) explained by the batch variable can then be formulated as follows:

$${\textrm{Var}}(X|B)=\mathop{\sum}\limits_{i=1}^{N} {\textrm{Var}}(X|{\textrm{PC}}_i)\cdot R^2({\textrm{PC}}_i|B),$$

where X is the data matrix, B is the batch label, and N is the number of principal components (PCs).
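A minimal sketch of principal-component regression; here the weighted sum is additionally normalized by the total variance so that the score lies between 0 and 1, which is a presentation choice on top of the formula above.

```python
# Minimal sketch: regress a one-hot batch encoding onto every PC and weight each
# R^2 by the variance explained by that PC.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pc_regression(X, batch, n_comps=50):
    pcs = PCA(n_components=n_comps).fit(X)
    scores = pcs.transform(X)                      # cells x PCs
    B = pd.get_dummies(batch).to_numpy()           # one-hot batch covariate
    var_explained = 0.0
    for i in range(n_comps):
        r2 = LinearRegression().fit(B, scores[:, i]).score(B, scores[:, i])
        var_explained += pcs.explained_variance_[i] * r2
    return var_explained / pcs.explained_variance_.sum()
```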

Graph connectivity
For this metric, we calculate a subset kNN graph \(G(N_c,E_c)\) for each cell type label c, such that each subset only contains cells with the given label. The total graph connectivity score can then be calculated as follows:

$$gc=\frac{1}{|C|}\mathop{\sum}\limits_{c\in C}\frac{|{\textrm{LCC}}(G(N_c,E_c))|}{|N_c|},$$

where C is the set of cell type labels, \(|{\textrm{LCC}}()|\) is the number of nodes in the largest connected component of the graph, and |Nc| is the number of nodes with the given cell type label. This means that we check whether the graph built on the latent representation connects all cells with the same cell type label. Therefore, a score of 1 implies that all cells with the same cell type label are connected, which further indicates good batch mixing. A graph in which no cells are connected would result in a score of 0.
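A minimal sketch of this metric, assuming scanpy's neighbors graph and SciPy's connected components; the representation key is illustrative.

```python
# Minimal sketch of graph connectivity: per cell type, take the kNN subgraph and
# measure the fraction of its cells in the largest connected component.
import numpy as np
import scanpy as sc
from scipy.sparse.csgraph import connected_components

def graph_connectivity(adata, label_key="cell_type", rep="X_latent"):
    scores = []
    for ct in adata.obs[label_key].unique():
        sub = adata[adata.obs[label_key] == ct].copy()
        sc.pp.neighbors(sub, use_rep=rep)
        _, labels = connected_components(sub.obsp["connectivities"], directed=False)
        _, counts = np.unique(labels, return_counts=True)
        scores.append(counts.max() / sub.n_obs)
    return float(np.mean(scores))
```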

Isolated label F1
We defined isolated labels as the cell type labels that are present in the smallest number of batches. If there are several isolated labels, we simply take the mean of the individual scores. To determine how well these cell types are separated from other cell types in the latent representation, we first determine the cluster containing the largest number of cells with an isolated label. Subsequently, an F1 score of the isolated label against all other labels within that cluster is computed, where the F1 score is defined as follows:

$$F_1=2\,\frac{{\textrm{precision}}\cdot{\textrm{recall}}}{{\textrm{precision}}+{\textrm{recall}}}.$$

This again results in a score between 0 and 1, where 1 means that all cells with the isolated label are captured in the cluster.
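A minimal sketch of the isolated label F1 computation, following the description above (pick the cluster with the most cells of the isolated label, then score isolated-versus-rest within a binary labeling).

```python
# Minimal sketch of the isolated label F1 score.
import numpy as np
from sklearn.metrics import f1_score

def isolated_label_f1(clusters, cell_types, isolated_label):
    clusters = np.asarray(clusters)
    y_true = np.asarray(cell_types) == isolated_label
    # Cluster containing the largest number of cells with the isolated label.
    best_cluster = max(np.unique(clusters),
                       key=lambda c: (y_true & (clusters == c)).sum())
    y_pred = clusters == best_cluster
    return f1_score(y_true, y_pred)
```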

Isolated label silhouette
For this metric, we use ASWc, defined above, but only on the isolated-label subset of the latent representation. Scaling and meaning of the score are the same as described for ASW. If there are several isolated labels, we average over the individual scores, analogous to the isolated label F1 score.

kNN accuracy
We first compute the 15 nearest neighbors of each cell in the data. We then compute the ratio of correct cell type annotations within these 15 neighbors. This cell-wise score is then averaged within each cell type separately and subsequently averaged over all cell type scores to obtain a single kNN-accuracy score between 0 and 1. A higher kNN-accuracy score corresponds to better preservation of local cell type purity. This metric was inspired by a similar metric used in scANVI.
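A minimal sketch of the kNN accuracy, assuming scikit-learn's NearestNeighbors; the extra neighbor removed in the code is the cell itself.

```python
# Minimal sketch of the kNN accuracy on a latent matrix Z with cell type labels.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_accuracy(Z, cell_types, n_neighbors=15):
    cell_types = np.asarray(cell_types)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(Z)
    _, idx = nn.kneighbors(Z)
    idx = idx[:, 1:]                                     # drop the cell itself
    per_cell = (cell_types[idx] == cell_types[:, None]).mean(axis=1)
    per_type = [per_cell[cell_types == ct].mean() for ct in np.unique(cell_types)]
    return float(np.mean(per_type))
```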

Visualization of integration scores
To compare the performance of different models, we designed an overview table (inspired by Saelens et al.72) that displays individual integration scores as circles and aggregated scores as bars. Each individual score is min–max scaled to improve visual comparison between models and is then averaged into aggregated scores per category (batch correction and biological conservation). Finally, an overall score is calculated as a weighted sum of batch correction and bio-conservation, with a ratio of 40:60, respectively. When shown, reference and query times are not considered in the calculation of aggregated scores. Moreover, these time values are scaled together to allow direct comparison. The overall ranking of each model, for each score, is represented by the color scheme.
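A minimal sketch of the aggregation step, assuming two models-by-metrics tables as input; the 40:60 weighting follows the description above.

```python
# Minimal sketch of the score aggregation used in the overview table.
import pandas as pd

def overall_scores(df_batch, df_bio):
    # df_batch / df_bio: models x metrics tables of batch-correction and
    # bio-conservation scores (illustrative inputs).
    scale = lambda df: (df - df.min()) / (df.max() - df.min())
    batch = scale(df_batch).mean(axis=1)
    bio = scale(df_bio).mean(axis=1)
    return 0.4 * batch + 0.6 * bio
```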

Datasets
All cell type labels and metadata were obtained from the original publications unless specifically stated otherwise below.

Brain data
The mouse brain dataset is a collection of four publicly available scRNA-seq mouse brain studies1,33,34,35, for which additional information on cerebral regions was provided. We obtained the raw count matrix from Rosenberg et al.34 under GEO accession ID GSE110823, the annotated count matrix from Zeisel et al.35 from http://mousebrain.org (file name L5_all.loom, downloaded on 9 September 2019) and count matrices per cell type from Saunders et al.33 from http://dropviz.org (DGE by region section, downloaded on 30 August 2019). Data from mouse brain tissue sorted by flow cytometry (myeloid and non-myeloid cells, including the annotation file annotations_FACS.csv) from TM were obtained from https://figshare.com (retrieved 14 February 2019). We harmonized cluster labels via fuzzy string matching and tried to preserve the original annotation as far as possible. Specifically, we annotated ten major cell types (neuron, astrocyte, oligodendrocyte, oligodendrocyte precursor cell, endothelial cell, brain pericyte, ependymal cell, olfactory ensheathing cell, macrophage and microglia). In the case of Saunders et al.33, we used the additional annotation information table for 585 reported cell types (annotation.BrainCellAtlas_Saunders_version_2018.04.01.txt, retrieved from http://dropviz.org on 30 August 2019). Among these, some cell types were annotated as 'endothelial tip', 'endothelial stalk' and 'mural'. We examined this subset of the Saunders et al.33 dataset as follows: we used Louvain clustering (default resolution parameter, 1.0) to cluster, followed by gene expression profiling via the rank_genes_groups function in scanpy. Using marker gene expression, we assigned microglia (C1qa), oligodendrocytes (Plp1), astrocytes (Gfap, Clu) and endothelial cells (Flt1) to the subset. Finally, we applied scran73 normalization and log(counts + 1) transformation to the count matrices. In total, the dataset consists of 978,734 cells.

Pancreas
Five publicly available pancreatic islet datasets74,75,76,77,78, with a total of 15,681 cells in raw count matrix format, were obtained from the Scanorama42 dataset, which had already assigned cell types using batch-corrected gene expression from Scanorama. The Scanorama dataset was downloaded from http://scanorama.csail.mit.edu/data.tar.gz. In the preprocessing step, the raw count datasets were normalized and log transformed using scanpy preprocessing methods. The preprocessed data were used directly in the scArches pipeline. One thousand HVGs were selected for training the model.
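A minimal scanpy sketch of this preprocessing (normalization, log transformation and HVG selection); the file path, batch key and HVG number are illustrative and were adapted per dataset as described in this section.

```python
# Minimal sketch of the preprocessing applied before training scArches models.
import scanpy as sc

adata = sc.read("pancreas_raw_counts.h5ad")            # placeholder path
sc.pp.normalize_total(adata, target_sum=1e4)           # size factor normalization
sc.pp.log1p(adata)                                      # log(counts + 1)
sc.pp.highly_variable_genes(adata, n_top_genes=1000,    # 1,000 HVGs for the pancreas model
                            batch_key="study")          # illustrative batch column
adata = adata[:, adata.var["highly_variable"]].copy()
```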

The human cell landscape
The HCL dataset was obtained from https://figshare.com/articles/HCL_DGE_Data/7235471. Raw count matrix data for all tissues were aggregated. A total of 277,909 cells were selected and processed using the scanpy Python package. Data were normalized using size factor normalization such that every cell had 10,000 counts and were then log transformed. Finally, 5,000 HVGs were selected according to their average expression and dispersion. We used the processed data directly for training scArches in the pre-training phase.

The mouse cell atlas
The mouse cell atlas dataset was obtained from https://figshare.com/articles/HCL_DGE_Data/7235471. Raw count matrix data for all tissues were aggregated. A total of 150,126 cells were selected and processed using the scanpy Python package. Homologous genes were selected using BioMart 100 before merging with the HCL data. Data were normalized together with HCL as explained above.

Immune data
The immune dataset consists of ten human samples from two different tissues: bone marrow and peripheral blood. Data from bone marrow samples were retrieved from Oetjen et al.36, while data from peripheral blood samples were obtained from 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3), Freytag et al.37, Sun et al.38 and Villani et al.79. Details on the retrieval location of the datasets, the different protocols used and the way samples were selected for analysis can be found in Luecken et al.7. We performed quality control separately for each sample but followed a common strategy for normalization: all samples for which count data were available were individually normalized by scran pooling73. This excludes data from Villani et al.79, which included only TPM values. All datasets were log+1 transformed in scanpy80. Cell type labels were harmonized starting from existing annotations (Oetjen et al.36) to create a consistent set of cell identities. Well-known cell type markers were collected and used to extend the annotation to samples for which it was not previously available. When necessary, subclustering was performed to derive more precise labeling. Finally, cell populations were removed if no label could be assigned. Four thousand HVGs were selected for training.

Endocrine pancreas
The raw dataset of pancreatic endocrinogenesis (n = 22,163)45 is available at the GEO under accession number GSE132188. We considered a subset of 2,000 HVGs for training. Cell type labels were obtained from an adata object provided by the authors of scVelo46.

CITE-seq
We obtained three publicly available datasets from 10x Genomics, already curated and preprocessed as described in the totalVI study32. These data include '10k PBMCs from a Healthy Donor—Gene Expression and Cell Surface Protein' (PBMC 10k (CITE-seq)81), '5k PBMCs from a healthy donor with cell surface proteins (v3 chemistry)' (PBMC 5k (CITE-seq)82) and '10k PBMCs from a Healthy Donor (v3 chemistry)' (PBMC 10k (RNA-seq)57,83,84). The reference data included 14 proteins, and 4,000 HVGs were selected for training.

COVID-19
The COVID-19 dataset and its metadata were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926 and https://github.com/zhangzlab/covid_balf. The dataset used in this paper consists of n = 62,469 cells. Data from lungs52,53,54, PBMCs37,38,39 and bone marrow36 were later merged with the COVID-19 samples. Data were normalized using scanpy, and 2,000 HVGs were selected for training the model. Cell type labels were obtained from the original studies.

Tabula Muris Senis
The TM Senis dataset with GEO accession number GSE132042 is publicly available at https://figshare.com/projects/Tabula_Muris_Senis/64982. The dataset contains 356,213 cells with cell type, tissue and method annotations. We normalized the data using size factor normalization with 10,000 counts per cell. Next, we log+1 transformed the dataset and selected 5,000 HVGs according to their average expression and dispersion. All preprocessing steps were performed using the scanpy Python package. In this study, we used the combination of sequencing technology and time point as the batch covariate.

Benchmarks
Full integration methods
We ran PCA with 20 principal components on the final outputs of Seurat, Scanorama and mnnCorrect to make them comparable (an approach similar to that described in ref. 31) when computing metrics against the deep learning models, which had latent representations of dimension 10–20.

Harmony: we used the HarmonyMatrix function from the Harmony package. We provided the function with a PCA matrix with 20 principal components computed on the gene expression matrix.




Scanorama: we used the correct_scanpy function from the Scanorama package with default parameters.




Seurat: we applied Seurat as in the walkthrough (https://satijalab.org/seurat/v3.1/integration.html) with default parameters.




Liger: we used the Liger method as in the walkthrough (https://github.com/welch-lab/liger/blob/master/vignettes/walkthrough_pbmc.pdf). We used k = 20, λ = 5 and resolution = 0.4 with otherwise default parameters. We only scaled the data, as it had already been preprocessed.




Conos: we followed the Conos tutorial at https://htmlpreview.github.io/?https://raw.githubusercontent.com/kharchenkolab/conos/master/doc/walkthrough.html. Unlike the tutorial, we used our own preprocessed data for better comparisons. We used PCA space with parameters k = 30, k.self = 5, ncomps = 30, matching.method = 'mNN' and metric = 'angular' to build the graph. We set the resolution to 1 to find communities. Finally, we kept the corrected pseudo-PCA space with 20 components.




mnnCorrect: we used the mnnCorrect function from the scran package with default parameters.



Cell type-classification methods


Seurat: we followed the walkthrough (https://satijalab.org/seurat/v3.1/integration.html) and used reciprocal PCA for dimension reduction. As described in the original publication48, we examined projection scores and assigned cells with the lowest 20% of values as 'unknown'.




SVM: we fitted an SVM model from the scikit-learn library to the reference data and classified query cells. We assigned cells with an uncertainty probability greater than 0.7 as 'unknown'.




Logistic regression: we fitted a logistic regression model from the scikit-learn library to the reference data and predicted query labels.
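A minimal sketch of the two scikit-learn baselines, assuming reference and query latent matrices Z_ref/Z_query and reference labels y_ref as placeholders.

```python
# Minimal sketch of the SVM and logistic-regression baselines with an 'unknown' cutoff.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

svm = SVC(probability=True).fit(Z_ref, y_ref)
probs = svm.predict_proba(Z_query)
pred = svm.classes_[probs.argmax(axis=1)]
uncertainty = 1 - probs.max(axis=1)
pred = np.where(uncertainty > 0.7, "unknown", pred)   # threshold described above

logreg = LogisticRegression(max_iter=1000).fit(Z_ref, y_ref)
pred_lr = logreg.predict(Z_query)
```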




All these methods were run on a machine with one eight-core Intel i7-9700K CPU addressing 32 GB RAM and one Nvidia GTX 1080 Ti GPU with 12 GB VRAM.

Model output
Throughout this paper, all low-dimensional representations were obtained using the latent space of scArches models. The output of scArches models will be confounded with condition variables and is therefore not fit for data-integration purposes, but it is best suited for imputation or denoising scenarios.

Cell type annotation
To classify labels for the query dataset, we trained a weighted kNN classifier on the latent-space representation of the reference dataset. For each query cell c, we extracted its k nearest neighbors (Nc). We computed the standard deviation of the nearest-neighbor distances:

$${\textrm{s.d.}}_{c,N_c}=\sqrt{\frac{\mathop{\sum}\nolimits_{n\in N_c}({\textrm{dist}}(c,n))^2}{k}},$$

where dist(c, n) is the Euclidean distance between the query cell c and its neighbor n in the latent space. Next, we applied a Gaussian kernel to the distances using

$$D_{c,n,N_c}=e^{-\frac{{\textrm{dist}}(c,n)}{(2/{\textrm{s.d.}}_{c,N_c})^2}}.$$

Next, we computed the probability of assigning each label y to the query cell c by normalizing across all adjusted distances using

$$p(Y=y|X=c,N_c)=\frac{\mathop{\sum}\nolimits_{i\in N_c} I(y^{(i)}=y)\cdot D_{c,n_i,N_c}}{\mathop{\sum}\nolimits_{j\in N_c} D_{c,n_j,N_c}},$$

where y(i) is the label of the ith nearest neighbor and I is the indicator function. Finally, we calculated the uncertainty u for each cell c in the query dataset using its set of closest neighbors in the reference dataset (Nc). We defined the uncertainty \(u_{c,y,N_c}\) for a query cell c with label y and Nc as its set of nearest neighbors as

$$u_{c,y,N_c}=1-p(Y=y|X=c,N_c).$$

We reported cells with more than 50% uncertainty as unknown, to detect out-of-distribution cells with new labels that do not exist in the training data. Therefore, we labeled each cell c in the query dataset as follows:

$$\begin{array}{l}\hat{y}_c^{\prime}={\textrm{argmin}}_y\, u_{c,y,N_c}\\ \hat{y}_c=\left\{\begin{array}{ll}\hat{y}_c^{\prime} & {\textrm{if }}\,u_{c,\hat{y}_c^{\prime},N_c}\le 0.5\\ {\textrm{unknown}} & {\textrm{otherwise}}\end{array}\right.\end{array}$$
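A minimal sketch of this weighted kNN label transfer, implementing the formulas above; Z_ref/Z_query are placeholder latent matrices, y_ref are reference labels and k is illustrative.

```python
# Minimal sketch of weighted kNN label transfer with an uncertainty threshold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_transfer(Z_ref, y_ref, Z_query, k=50, threshold=0.5):
    y_ref = np.asarray(y_ref)
    nn = NearestNeighbors(n_neighbors=k).fit(Z_ref)
    dist, idx = nn.kneighbors(Z_query)
    std = np.sqrt((dist ** 2).sum(axis=1) / k)              # s.d. of neighbor distances
    weights = np.exp(-dist / (2.0 / std[:, None]) ** 2)     # Gaussian kernel from the text
    labels, uncertainties = [], []
    for i in range(Z_query.shape[0]):
        neighbour_labels = y_ref[idx[i]]
        probs = {}
        for lab in np.unique(neighbour_labels):
            probs[lab] = weights[i][neighbour_labels == lab].sum() / weights[i].sum()
        best = max(probs, key=probs.get)                     # argmax p = argmin u
        u = 1 - probs[best]
        labels.append(best if u <= threshold else "unknown")
        uncertainties.append(u)
    return np.array(labels), np.array(uncertainties)
```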

Protein imputation
For scArches totalVI, missing proteins for RNA-seq-only data were imputed by conditioning query cells on the other batches in the reference that contain protein data. It is possible to impute based on a specific batch or to average across all batches. In the example in the paper, the averaged version was used.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All datasets used in the paper are public, referenced and downloadable at https://github.com/theislab/scarches-reproducibility.


References
1. Schaum, N., Karkanias, J., Neff, N. & Pisco, A. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
2. Han, X. et al. Mapping the mouse cell atlas by Microwell-seq. Cell 172, 1091–1107 (2018).
3. The Tabula Muris Consortium et al. A single-cell transcriptomic atlas characterizes ageing tissues in the mouse. Preprint at bioRxiv https://doi.org/10.1101/661728 (2020).
4. Han, X. et al. Construction of a human cell landscape at single-cell level. Nature 581, 303–309 (2020).
5. 10x Genomics. 10x Datasets Single Cell Gene Expression, Official 10x Genomics Support. https://www.10xgenomics.com/resources/datasets/
6. Regev, A. et al. Science forum: the human cell atlas. eLife 6, e27041 (2017).
7. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Preprint at bioRxiv https://doi.org/10.1101/2020.05.22.111161 (2020).
8. Zheng, H. et al. Cross-domain fault diagnosis using knowledge transfer strategy: a review. IEEE Access 7, 129260–129290 (2019).
9. Ruder, S., Peters, M. E., Swayamdipta, S. & Wolf, T. Transfer learning in natural language processing. in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics 15–18 (ACL, 2019).
10. Yang, L., Hanneke, S. & Carbonell, J. A theory of transfer learning with applications to active learning. Mach. Learn. 90, 161–189 (2013).
11. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. in Proceedings of the 25th International Conference on Neural Information Processing Systems 1097–1105 (NIPS, 2012).
12. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805v2 (2018).
13. Hsu, Y.-C., Lv, Z. & Kira, Z. Learning to cluster in order to transfer across domains and tasks. Preprint at https://arxiv.org/abs/1711.10125 (2017).
14. Shin, H.-C. et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016).
15. Dahl, G. E., Yu, D., Deng, L. & Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20, 30–42 (2011).
16. Ker, J., Wang, L., Rao, J. & Lim, T. Deep learning applications in medical image analysis. IEEE Access 6, 9375–9389 (2017).
17. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
18. Gayoso, A. et al. scvi-tools: a library for deep probabilistic analysis of single-cell omics data. Preprint at bioRxiv https://doi.org/10.1101/2021.04.28.441833 (2021).
19. Wang, J. et al. Data denoising with transfer learning in single-cell transcriptomics. Nat. Methods 16, 875–878 (2019).
20. Stein-O'Brien, G. L. et al. Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species. Cell Syst. 8, 395–411 (2019).
21. Lieberman, Y., Rokach, L. & Shay, T. CaSTLe—classification of single cells by transfer learning: harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS ONE 13, e0205499 (2018).
22. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
23. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
24. Wang, X., Huang, T.-K. & Schneider, J. Active transfer learning under model shift. in Proceedings of the 31st International Conference on Machine Learning 1305–1313 (PMLR, 2014).
25. Arjovsky, M., Bottou, L., Gulrajani, I. & Lopez-Paz, D. Invariant risk minimization. Preprint at https://arxiv.org/abs/1907.02893 (2019).
26. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
27. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
28. Litvinukova, M. et al. Cells and gene expression programs in the adult human heart. Preprint at bioRxiv https://doi.org/10.1101/2020.04.03.024075 (2020).
29. Lopez, R., Regier, J., Jordan, M. I. & Yosef, N. Information constraints on auto-encoding variational Bayes. in Advances in Neural Information Processing Systems 6114–6125 (NIPS, 2018).
30. Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36, i610–i617 (2020).
31. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
32. Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).
33. Saunders, A. et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell 174, 1015–1030 (2018).
34. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).
35. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014 (2018).
36. Oetjen, K. A. et al. Human bone marrow assessment by single-cell RNA sequencing, mass cytometry, and flow cytometry. JCI Insight 3, e124928 (2018).
37. Freytag, S., Tian, L., Lönnstedt, I., Ng, M. & Bahlo, M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res. 7, 1297 (2018).
38. Sun, Z. et al. A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies. Nat. Commun. 10, 1649 (2019).
39. 10x Genomics. 10x Datasets Single Cell Gene Expression, Official 10x Genomics Support https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
40. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
41. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
42. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
43. Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
44. Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16, 695–698 (2019).
45. Bastidas-Ponce, A. et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 146, dev173849 (2019).
46. Bergen, V., Lange, M., Peidli, S., Wolf, F. A. & Theis, F. J. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 38, 1408–1414 (2020).
47. Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 20, 194 (2019).
48. Stuart, T. et al. Comprehensive integration of single cell data. Cell 177, 1888–1902 (2019).
49. Zhou, Z., Ye, C., Wang, J. & Zhang, N. R. Surface protein imputation from single cell transcriptomes by deep neural networks. Nat. Commun. 11, 651 (2020).
50. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
51. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
52. Travaglini, K. J. et al. A molecular cell atlas of the human lung from single cell RNA sequencing. Nature 587, 619–625 (2020).
53. Reyfman, P. A. et al. Single-cell transcriptomic analysis of human lung provides insights into the pathobiology of pulmonary fibrosis. Am. J. Respir. Crit. Care Med. 199, 1517–1536 (2019).
54. Madissoon, E. et al. scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol. 21, 1 (2020).
55. Liao, M. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat. Med. 26, 842–844 (2020).
56. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
57. Grant, R. A. et al. Circuits between infected macrophages and T cells in SARS-CoV-2 pneumonia. Nature 590, 635–641 (2021).
58. Muus, C. et al. Integrated analyses of single-cell atlases reveal age, gender, and smoking status associations with cell type-specific expression of mediators of SARS-CoV-2 viral entry and highlights inflammatory programs in putative target cells. Preprint at bioRxiv https://doi.org/10.1101/2020.04.19.049254 (2020).
59. Andrews, T. S. & Hemberg, M. False signals induced by single-cell imputation. F1000Res. 7, 1740 (2019).
60. Schulte-Schrepping, J. et al. Severe COVID-19 is marked by a dysregulated myeloid cell compartment. Cell 182, 1419–1440 (2020).
61. Wen, W. et al. Immune cell profiling of COVID-19 patients in the recovery stage by single-cell sequencing. Cell Discov. 6, 31 (2020).
62. Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 26, 1070–1076 (2020).
63. Lotfollahi, M. et al. Compositional perturbation autoencoder for single-cell response modeling. Preprint at bioRxiv https://doi.org/10.1101/2021.04.14.439903 (2021).
64. Lotfollahi, M., Dony, L., Agarwala, H. & Theis, F. Out-of-distribution prediction with disentangled representations for single-cell RNA sequencing data. in ICML 2020 Workshop on Computational Biology 37 (ICML, 2020).
65. Kelsey, G., Stegle, O. & Reik, W. Single-cell epigenomics: recording the past and predicting the future. Science 358, 69–75 (2017).
66. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
67. Mirza, M. & Osindero, S. Conditional generative adversarial nets. Preprint at https://arxiv.org/abs/1411.1784 (2014).
68. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at http://arxiv.org/abs/1312.6114 (2013).
69. Doersch, C. Tutorial on variational autoencoders. Preprint at https://arxiv.org/abs/1606.05908 (2016).
70. Sohn, K., Lee, H. & Yan, X. Learning structured output representation using deep conditional generative models. in Advances in Neural Information Processing Systems (eds. Cortes, C. et al.) 28, 3483–3491 (Curran Associates, 2015).
71. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012).
72. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
73. Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
74. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).
75. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).
76. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
77. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2016).
78. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
79. Villani, A.-C. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).
80. Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
81. 10x Genomics. 10k PBMCs from a Healthy Donor, Gene Expression and Cell Surface Protein https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3 (2018).
82. 10x Genomics. 5k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor with Cell Surface Proteins (v3 Chemistry) https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3 (2019).
83. 10x Genomics. 10k PBMCs from a Healthy Donor (v3 Chemistry) https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
84. Mould, K. J. et al. Airspace macrophages and monocytes exist in transcriptionally distinct subsets in healthy adults. Am. J. Respir. Crit. Care Med. 203, 946–956 (2021).

Acknowledgements
We are grateful to all members of the Theis laboratory. M.L. is grateful for valuable feedback from A. Wolf and for financial support from the Joachim Herz Stiftung. This work was supported by the BMBF (01IS18036A and 01IS18036B), by the European Union's Horizon 2020 research and innovation program (grant 874656) and by the Helmholtz Association's Initiative and Networking Fund through Helmholtz AI (ZT-I-PF-5-01), sparse2big (ZT-I-0007) and Discovair (grant 874656), all to F.J.T. For the purpose of open access, the authors have applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission.

Author information
Affiliations
Helmholtz Center Munich—German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany

Mohammad Lotfollahi, Mohsen Naghipourfar, Malte D. Luecken, Matin Khajavi, Maren Büttner, Marco Wagenstetter, Sergei Rybakov & Fabian J. Theis

School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany

Mohammad Lotfollahi & Fabian J. Theis

Department of Computer Science, Technical University of Munich, Munich, Germany

Žiga Avsec

Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA

Adam Gayoso & Nir Yosef

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA

Nir Yosef

Chan Zuckerberg Biohub, San Francisco, CA, USA

Nir Yosef

Ragon Institute of MGH, MIT and Harvard, Cambridge, MA, USA

Nir Yosef

Institute of Medical Informatics, University of Münster, Münster, Germany

Marta Interlandi

Department of Mathematics, Technical University of Munich, Munich, Germany

Sergei Rybakov & Fabian J. Theis

Division of Pulmonary and Critical Care Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA

Alexander V. Misharin

Contributions
M.L. conceived the project with contributions from F.J.T. and Z.A. M.L., M.N., M.W., S.R. and M.K. implemented the models and analyzed data. M.B. curated the mouse brain dataset. M.I. designed visualizations and curated the immune dataset. M.L. and M.D.L. analyzed the COVID-19 dataset with help from A.V.M. A.G. and N.Y. contributed by adapting scArches to scvi-tools. F.J.T. supervised the research. All authors wrote the manuscript.

Corresponding author
Correspondence to Fabian J. Theis.

Ethics declarations

Competing interests
F.J.T. reports ownership interest in Cellarity. N.Y. is an advisor and/or has equity in Celsius Therapeutics and Rheos Medicines. The remaining authors declare no competing interests.



Additional information
Peer review information Nature Biotechnology thanks Dana Pe’er and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article
Lotfollahi, M., Naghipourfar, M., Luecken, M. D. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. (2021). https://doi.org/10.1038/s41587-021-01001-7


Received: 30 July 2020

Accepted: 28 June 2021

Published: 30 August 2021

DOI: https://doi.org/10.1038/s41587-021-01001-7
