FIGURE SUMMARY
Title

Mapping single-cell atlases throughout Metazoa unravels cell type evolution

Authors
Tarashansky, A.J., Musser, J.M., Khariton, M., Li, P., Arendt, D., Quake, S.R., Wang, B.
Source
Full text @ Elife

(A) Schematic showing the phylogenetic relationships among seven species analyzed. (B) Challenges in mapping single-cell transcriptomes. Gene duplications cause large numbers of homologs per gene, determined by reciprocal BLAST (cut-off: E-value <10−6), and frequent gene losses and the acquisition of new genes result in large fractions of transcriptomes lacking homology, which limits the amount of information comparable across species. (C) SAMap workflow. Homologous gene pairs initially weighted by protein sequence similarity are used to align the manifolds, low dimensional representations of the cell atlases. Gene-gene correlations calculated from the aligned manifolds are used to update the edge weights in the bipartite graph, which are then used to improve manifold alignment. (D) Mutual nearest neighborhoods improve the detection of cross-species mutual nearest neighbors by connecting cells that target one other’s within-species neighborhoods. (E) Convergence of SAMap is evaluated by the root mean square error (RMSE) of the alignment scores between mapped clusters in adjacent iterations for all 21 pairwise comparisons of the seven species.

Scalability of SAMap.

The runtime (A) and memory usage (B) of all mappings performed in this study are plotted versus the total number of cells from both datasets. For this study, SAMap was run on a standard desktop computer running Ubuntu 18.04, with an 8-core i7 Intel processor and 64 Gb of RAM.

(A) UMAP projection of the combined zebrafish (yellow) and Xenopus (blue) manifolds, with example cell types circled. (B) Sankey plot summarizing the cell type mappings. Edges with alignment score <0.1 are omitted. Edges that connect developmentally distinct secretory cell types are highlighted in black, with connections across germ layers highlighted in red. (C) Heatmaps of alignment scores between developmental time points for ionocyte, forebrain/midbrain, placodal, and neural crest lineages. X-axis: Xenopus. Y-axis: zebrafish. (D) Expressions of orthologous gene pairs linked by SAMap are overlaid on the combined UMAP projection. Expressing cells are color-coded by species, with those connected across species colored cyan. Cells with no expression are shown in gray. The mapped secretory cell types are highlighted with circles. (E) SAMap alignment scores compared to those of benchmarking methods using one-to-one vertebrate orthologs as input. Each dot represents a cell type pair supported by ontogeny annotations.

Existing methods failed to map <italic>D. rerio</italic> and <italic>X. tropicalis</italic> atlases.

(A) UMAP projections of the integration results from SAMap using the full homology graph, compared to LIGER, BBKNN, Scanorama, Seurat, Harmony, and SAMap using 1–1 orthologs. For fair comparisons, all methods were run on the D. rerio and X. torpicalis atlases subsampled to approximately 15,000 cells to satisfy the computational constraints of Seurat and LIGER. (B) Histograms of alignment scores between individual cells.

(A) Expression of orthologous (top) and paralogous (bottom) gene pairs overlaid on the combined UMAP projection. Expressing cells are color-coded by species, with those that are connected across species colored cyan. Cells with no expression are shown in gray. Paralogs are ordered by the evolutionary time when they are inferred to have duplicated. (B) Paralog substitution scores of all cell types. The substitution score counts the number of substituting paralogs that are differentially expressed in a particular cell type while normalizing for the number of differentially expressed genes in a cell type and the number of paralogs of a gene (see Materials and methods). (C) The percentage of paralogs from each phylogenetic age that were substituted for orthologs in frog or zebrafish lineages.

Paralog substitution analysis yields similar results using the SAMap manifold constructed from one-to-one orthologs.

Comparison of substitution rates for different paralog ages (A) and cell type substitution scores (B) calculated from the original frog-zebrafish manifold versus the manifold generated using only one-to-one orthologs. (C) Histogram showing the distribution of correlation differences for paralog substitutions specific to the original (teal) and one-to-one ortholog based analyses (orange), along with those identified in both mappings (blue). Note that the majority of substitution events, especially in the large correlation difference regime, are present in both mapped manifolds.

(A) UMAP projection of the combined manifolds. Tissue type annotations are adopted from the S. mediterranea atlas (Fincher et al., 2018). The schistosome atlas was collected from juvenile worms, which we found to contain neoblasts with an abundance comparable to that of planarian neoblasts (Li et al., 2021). (B) Overlapping expressions of selected tissue-specific TFs with expressing cell types circled. (C) UMAP projection of the aligned manifolds showing planarian and schistosome stem cells, with homologous subpopulations circled. Planarian neoblast data is from Zeng et al., 2018, and cNeoblasts correspond to the Nb2 population, which are pluripotent cells that can rescue neoblast-depleted planarians in transplantation experiments. (D) Distributions of conserved TF expressions in each neoblast subpopulation. Expression values are k-nearest-neighbor averaged and standardized, with negative values set to zero. Blue: planarian; yellow: schistosome.

SAMap-linked gene pairs that are enriched in cell type pairs between <italic>S. mediterranea</italic> and <italic>S. mansoni</italic>.

(A) Rows: linked cell types. Schistosome cell types correspond to Leiden clusters. Columns: genes linked by SAMap with overlapping eukaryotic eggNOG orthology groups. We calculate the average standardized expression of each gene in an orthology group for its corresponding cell type in a particular pair and report the highest expression. A selected set of orthology groups corresponding to transcriptional regulators are labeled. (B) Fluorescence in situ hybridization shows the co-expression of wnt11 (Smp_156540) and a panel of muscle markers (collagen, troponin, myosin and tropomyosin) in S. mansoni juveniles. The body wall muscles are expected to be located close to the parasite surface (dashed outline). The images are maximum intensity projections constructed from ~10 confocal slices with optimal axial spacing recommended by the Zen software collected on a Zeiss LSM 800 confocal microscope using a 40× (N.A. = 1.1, working distance = 0.62 mm) water-immersion objective (LD C-Apochromat Corr M27). (C) Whole mount in situ hybridization images showing that the expression of wnt11 and frizzled (Smp_174350) are concentrated in the parasite tail (arrows) with decreasing gradients extending anteriorly. In planarian muscles, Wnt genes provide the positional cues for setting up the body plan during regeneration (Scimone et al., 2017; Reddien, 2018). The presence of an anterior-posterior expression gradient of wnt11 and frizzled in muscles of schistosome juveniles suggests that they may have similar functional roles in patterning during development.

Schistosome muscle progenitors express canonical muscle markers.

UMAP projections of schistosome stem cells with gene expressions overlaid. μ and μ’ cells are circled. Colormap: expression in units of log2(D+1). For visualization, expression was smoothed via nearest-neighbor averaging using SAM. Note that myod1 and cabp are expressed in both presumptive muscle progenitor populations, whereas all other markers are enriched in μ’ cells. All genes displayed are also expressed in fully differentiated muscle tissues.

(A) Schematic illustrating edge (left) and node (right) transitivities, defined as the fraction of triads (set of three connected nodes) in closed triangles. (B) The percentage of cell type pairs that are topologically equivalent to the green edge in each illustrated motif. (C) Network graphs showing highly connected cell type families. Each node represents a cell type, color-coded by species (detailed annotations are provided in Supplementary file 7). Mapped cell types are connected with an edge. (D) Boxplot showing the median and interquartile ranges of node transitivities for highly connected cell type groups. For all box plots, the whiskers denote the maximum and minimum observations. The average node transitivity per group is compared to a bootstrapped null transitivity distribution, generated by repeatedly sampling subsets of nodes in the cell type graph and calculating their transitivities. **p<5×10−5, ***p<5×10−7. (E) Boxplot showing the median and interquartile ranges of the number of enriched gene pairs in highly connected cell type groups. All cell type connections in these groups have at least 40 enriched gene pairs (dashed line).

Number of enriched gene pairs are mostly independent of edge transitivity.

(A) Box plot showing the median and interquartile ranges of the number of enriched gene pairs in cell type mappings from all 21 pairwise mappings between the seven species. The whiskers denote the maximum and minimum observations. Of cell type mappings, 87% have greater than 40 enriched gene pairs (dashed line). Species acronyms are the same as in Figure 1A. (B) Top left: The edge transitivity is plotted against the number of enriched gene pairs for all cell type pairs in the connectivity graph. Dashed line: the linear best fit, with the Pearson correlation coefficient reported at the top. Top right: magnified view of the mapped cell type pairs supported by small numbers of gene pairs (<40) to show that those edges have low transitivity scores (<0.4). The sublots below show the number of enriched gene pairs and edge transitivity for individual species pairs.

Alignment scores are mostly independent of edge transitivity.

Top left: alignment scores and edge transitivity for all cell type pairs in the connectivity graph including the seven species. Dashed line: the linear best fit, with the Pearson correlation coefficient reported at the top. Alignment scores and edge transitivity for individual species pairs are shown in the remaining subplots.

(A) Enrichment of KOG functional annotations calculated for genes shared in contractile cell types. For each species, genes enriched in individual contractile cell types are combined. (B) Expression and enrichment of conserved muscle genes in contractile cell types. Color: mean standardized expression. Symbol size: the fraction of cells each gene is expressed in per cell type. Homologs are grouped based on overlapping eukaryotic eggNOG orthology groups. If multiple genes from a species are contained within an orthology group, the gene with highest standardized expression is shown. Genes in blue: core transcriptional program of bilaterian muscles; red: transcriptional regulators conserved throughout Metazoa. (C) Enrichment of KOG functional annotations for genes shared by stem cell types. (D) Top: boxplot showing the median and interquartile ranges of the mean standardized expressions of stem cell-enriched genes in multipotent stem cells (MSCs), lineage-committed stem cells (LSCs), and differentiated cells (DCs). MSCs include sponge archaeocytes (Musser et al., 2019), hydra interstitial stem cells (Siebert et al., 2019), planarian neoblasts cluster 0 defined in Fincher et al., 2018, schistosome ε-cells (Tarashansky et al., 2019). LSCs include sponge transition cells, hydra ecto- and endo-epithelial stem cells; planarian piwi+ cells that cluster with differentiated tissues, and schistosome tissue-specific progenitors. Bottom: dot plot showing the mean standardized expressions of selected transcriptional regulators. The transcript IDs corresponding to each gene are listed in Supplementary file 6.

Phylogenetic reconstruction of animal contractile cell transcriptional regulators.

Trees depict Csrp/Crip (A) and Fox group I (B) gene families. Genes labeled red are enriched in at least one contractile gene pair identified via SAMap. Support values indicate bootstrap support from 1000 nonparametric (Csrp) or ultrafast (Fox) bootstrap replicates. Besides these two transcriptional regulators, contractile cells in all seven species were found to be also enriched for transcription factors from the C2H2 Zinc Finger, Lim Homeobox, and Paired Homeobox families, though in different cell types we found enrichment of a number of distinct orthologs. Whether this reflects an ancestral role for these transcription factor families in regulating contractility or their independent evolution will require additional taxonomic sampling and broader coverage of muscle cell diversity to resolve.

Acknowledgments
This image is the copyrighted work of the attributed author or publisher, and ZFIN has permission only to display this image to its users. Additional permissions should be obtained from the applicable author or publisher of the image. Full text @ Elife