Talks

Speakers listed in order of presentation

Saturday, November 4

1. Ksenia Krasileva – Assistant Professor, PMB
Genomics for immunity: how organisms with innate immunity recognize pathogens
Abstract: Innate immune receptors from the NLR protein family control basic organism-organism interactions across kingdoms, including incompatibility within a species. Current availability of genomic biodiversity allows us to examine patterns of innate immune receptor evolution. We examine rapidly evolving genes across plants and fungi, including their genomic and epigenomic context to see common trends in innate immune gene evolution.

2. Chandler Sutherland – PMB (Lab: Krasileva)
High intraspecies allelic diversity in Arabidopsis immune receptors is associated with distinct genomic and epigenomic features
Abstract: Plants, lacking the adaptive immune systems of vertebrates, rely exclusively on germline-encoded innate immunity. Despite the inability to create antibodies, wild populations of plants remain incredibly successful against rapidly evolving pathogens. Nucleotide-binding, leucine-rich repeat receptors (NLRs) are the sensors of the plant immune system, recognizing pathogen-secreted disease proteins called effectors and initiating defense responses. A subset of NLRs show remarkable intraspecies diversity, due to both the genomic processes that generate variation and selection that promotes its maintenance. To unravel the relative contributions of each process, we examined the genomic features and signatures of selection associated with high allelic diversity. Our investigation of NLRs in Arabidopsis Col-0 has revealed that highly variable NLRs show higher expression, less gene body cytosine methylation, and closer proximity to transposable elements than their conserved paralogs. Highly variable NLRs show elevated synonymous and nonsynonymous nucleotide diversity and an increased probability of mutation. Diversifying selection maintains variability at a subset of codons of highly variable NLRs, while purifying selection maintains conservation at non-highly variable NLRs. How these features are established and maintained, and whether they contribute to the observed diversity of NLRs is key to understanding the evolution of plant innate immune receptors.

3. Neo Yin – Statistics (Lab: Carmichael)
LeukoLocator: A Comprehensive Leukocyte Detection Pipeline for Peripheral Blood Whole Slide Images
Abstract: Peripheral blood smears (PBSs) are critical to diagnosing hematological neoplasms, including acute and chronic leukemias. An automated process for analyzing 400x-equivalent whole slide images (WSIs) will enable surgical pathology scanners to be used for hematopathology sign out. Due to the large size of the WSIs (~5GB), a combination of statistical and deep learning methods is ideal for analyzing these specimens to ensure a fast and efficient diagnosis. Design: We developed an automated workflow (Fig. 1) employing classical and deep computer vision to automate the tallying of PBS differentials from 400x WSIs. 402 slides from the hematopathology service were scanned, representing a broad diversity of pathologies. The slides were cropped into 2048×2048 pixel tiles, yielding roughly 2.3M regions. Low-resolution thresholding based on blue channel signal intensity was used to filter out regions without leukocytes. White mask proportion (WMP) was used to filter out regions with high red cell density. The Variance of Laplacian (VoL) was used to filter out regions with excessive blur and bubbles. To further select regions of high image quality, we trained a ResNet50 convolutional neural network (CNN) to separate regions adequate for analysis from regions of poor quality. 24,479 regions were labeled (adequate=19,263; inadequate=5,216). We then trained a YOLOv8-based neural network using 2,222 regions and 24,479 cells. This network was run on adequate regions to identify the location of leukocytes, which were cropped and classified by the DeepHeme algorithm to produce a full cell differential. Results: The pipeline for statistical region selection filtered ~5,239 regions per slide to 639 regions per slide (~88% reduction), prior to deep learning-based region classification. The region classifier achieved an AUC of 0.987 (F1=0.938, Acc=0.925). YOLO cell detection achieved a mAP50 of 0.993 and a mAP50-95 of 0.936. WSI analysis counted and classified an average of 2,957 cells per slide at 14.2 cells/second. Conclusion: This fully automated pipeline allows the creation of differential cell counts based on PBS WSIs, effectively handling large file sizes and image artifacts such as blur and bubbles. When paired with pathologist review, we believe this process will improve clinical efficiency and diagnostic accuracy by complementing and augmenting the work of the pathologist.
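The Variance of Laplacian blur filter mentioned in this abstract can be illustrated with a minimal sketch. It is not the LeukoLocator implementation: the 4-neighbour Laplacian stencil, the function names, and the threshold of 100.0 are assumptions chosen for illustration, and the tiles are synthetic.

```python
import numpy as np


def variance_of_laplacian(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour discrete Laplacian over interior pixels."""
    g = gray.astype(np.float64)
    # Laplacian = up + down + left + right - 4 * centre, on interior pixels only.
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(lap.var())


def is_sharp(gray: np.ndarray, threshold: float = 100.0) -> bool:
    """Keep a tile only if its VoL reaches the (hypothetical) threshold."""
    return variance_of_laplacian(gray) >= threshold


def box_blur_3x3(gray: np.ndarray) -> np.ndarray:
    """3x3 mean filter on interior pixels, used here to simulate a blurry tile."""
    g = gray.astype(np.float64)
    h, w = g.shape
    shifted = [g[i:h - 2 + i, j:w - 2 + j] for i in range(3) for j in range(3)]
    return sum(shifted) / 9.0


# Synthetic stand-ins for grayscale tiles (assumption: real tiles would be
# 2048x2048 crops converted to grayscale).
rng = np.random.default_rng(0)
sharp_tile = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
blurred_tile = box_blur_3x3(sharp_tile)
flat_tile = np.full((64, 64), 128.0)

vol_sharp = variance_of_laplacian(sharp_tile)
vol_blurred = variance_of_laplacian(blurred_tile)
vol_flat = variance_of_laplacian(flat_tile)
```

Blurring suppresses high-frequency detail, so the blurred tile scores lower than the original, and a featureless tile scores zero. In practice the threshold would need to be tuned on labeled tiles.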

4. Pooja Kathail – Center for Computational Biology (Lab: Ioannidis/Ye (UCSF))
Characterizing uncertainty in predictions of genomic sequence-to-activity models
Abstract: Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is limited for predicting the effects of variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 models. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence, albeit often incorrect, predictions on reference sequences and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.

5. William Torous – Statistics (Lab: Purdom)
Visualizing scRNA-Seq Data at Population Scale with GloScope
Abstract: Recent advances in single-cell RNA sequencing (scRNA-seq) technology have enabled researchers to collect transcriptional measurements more efficiently and at larger scale. Moving beyond projects which aim to quantify cell-level heterogeneity in relatively few patients, studies utilizing this technology increasingly target human health outcomes with cells sequenced from larger cohorts of patients. Of particular interest are how cell-level heterogeneity varies between health-related phenotypes and whether there are transcriptional differences associated with particular patient outcomes. Despite the plethora of methodological advancements in scRNA-seq, most current tools were designed to quantify information at the cellular level and lack appropriate strategies for the population-scale questions now being asked. The most common analysis approach for this data, which treats individual cells as the statistical unit of observation, risks conflating cell-level variability with the patient-level differences of interest. We propose a statistical framework to represent, summarize, and compare the entire single-cell profile of each patient. Our approach characterizes each patient as a continuous probability distribution over the space of gene-expression vectors. We assume the vectors which characterize each sequenced cell are independently drawn from that patient-level distribution, and we utilize this assumption to estimate each patient’s underlying distribution. For downstream analysis of this population-scale data we propose using the matrix of pairwise statistical divergences between patients’ distributions. A scalar summary of similarity for each pair enables researchers to perform important bioinformatic analyses at the patient level, including visualization and quality control. We have applied our method to a number of scRNA-seq datasets with study designs ranging from 12 to 336 patients and up to 7,000 cells per patient. Our representation is able to differentiate among patients from varied phenotype groups, such as with lung tissue samples from COVID-infected and healthy patients, and has been shown to be a powerful tool to detect potential batch effects.
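The idea of a pairwise divergence matrix between patient-level distributions can be sketched as follows. This is an illustrative sketch, not the GloScope implementation: it assumes each patient's cells have already been embedded in a low-dimensional space, fits a single Gaussian per patient, and uses the symmetrised Kullback-Leibler divergence. The density estimator, the choice of divergence, all function names, and the simulated data are assumptions not stated in the abstract.

```python
import numpy as np


def fit_gaussian(cells: np.ndarray, ridge: float = 1e-6):
    """Estimate mean and covariance from a (n_cells, n_dims) matrix."""
    mean = cells.mean(axis=0)
    # A small ridge keeps the covariance invertible (illustrative choice).
    cov = np.cov(cells, rowvar=False) + ridge * np.eye(cells.shape[1])
    return mean, cov


def kl_gaussian(p, q) -> float:
    """Closed-form KL(p || q) between two multivariate Gaussians."""
    (m0, s0), (m1, s1) = p, q
    k = m0.shape[0]
    s1_inv = np.linalg.inv(s1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(s0)
    _, logdet1 = np.linalg.slogdet(s1)
    return float(
        0.5 * (np.trace(s1_inv @ s0) + diff @ s1_inv @ diff - k + logdet1 - logdet0)
    )


def pairwise_divergence(patients) -> np.ndarray:
    """Symmetrised KL divergence between every pair of patients."""
    fits = [fit_gaussian(cells) for cells in patients]
    n = len(fits)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = kl_gaussian(fits[i], fits[j]) + kl_gaussian(fits[j], fits[i])
    return d


# Simulated embeddings: two patients drawn from the same distribution and one
# shifted patient (hypothetical data, 500 cells x 3 dimensions each).
rng = np.random.default_rng(1)
patient_a = rng.normal(0.0, 1.0, size=(500, 3))
patient_b = rng.normal(0.0, 1.0, size=(500, 3))
patient_c = rng.normal(3.0, 1.0, size=(500, 3))

divergence = pairwise_divergence([patient_a, patient_b, patient_c])
```

The resulting matrix has one row and column per patient, so it can be passed to standard tools such as multidimensional scaling or hierarchical clustering for patient-level visualization and quality control.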

6. Kennedy Agwamba – IB (Lab: Nachman)
On the demographic history of the Western European house mouse, Mus musculus domesticus
Abstract: The human commensal Mus musculus domesticus is native to the European continent, with a species range extending from the Middle East to western Europe by the end of the Iron Age. Wild populations of M. m. domesticus are now distributed across Africa, the Americas, and Oceania, a range notably consistent with the global migration patterns of western Europeans that began in the early 16th century. Despite its standing as the premier mammalian model organism for biomedical, ecological, and evolutionary research, important details surrounding the population history of wild house mice remain a mystery. To investigate patterns of genetic structure and infer the demographic history of the Western European house mouse, we analyze a collection of 183 mice sampled from western Europe and the Americas, including 59 new whole genome sequences from historically relevant regions of western Europe. Unsupervised clustering analysis groups all samples by geographic location, uniquely identifying northern European, Mediterranean, and Atlantic Iberian population clades among European samples. Admixture graphs reveal the Atlantic Iberian clade to be sister to all populations of house mice in the Americas, and a migration edge from the UK to the base of the North America clade indicates a distinct secondary introduction of house mice to the Americas. Demographic models reveal that American populations diverged largely within the last 500 years, consistent with the timing of European colonization history in the Americas. Altogether, these results provide clarity around the recent introduction of Western European house mice to the Americas, highlighting the effects of human migration and global colonization on the concurrent spread of an invasive human commensal.

7. Elise Kerdoncuff – MCB (Lab: Moorjani)
Reconstructing 50,000 years of evolutionary history in India: Insights from 3,000 whole genome sequences
Abstract: India has been underrepresented in whole genome surveys of human genetic variation. We generated ~2,700 whole genome sequences for individuals from the Longitudinal Aging Study in India, which is the first and only nationally representative study of late life cognition and dementia in India. Our study includes diverse ethno-linguistic groups from India, including individuals from most geographic regions, speakers of all major language families, and tribal and caste groups, providing comprehensive coverage of genetic variation in India. We show that most Indians derive ancestry from ancient Near Eastern farmers, Eurasian Steppe pastoralists, and South Asian hunter-gatherers that are distantly related to indigenous groups from the Andaman Islands. Following the admixture, India experienced a major demographic shift towards endogamy, leading to strong founder events. As a consequence, many individuals have extensive identity-by-descent sharing and homozygosity, which in turn increases the rate of homozygous loss-of-function variants and gene knockouts per individual. Our data also provide insights into the history of archaic gene flow into South Asia. Like most non-Africans, Indians derive ~1-3% ancestry from Neanderthals and Denisovans; however, many questions about the genomic distribution and legacy of archaic ancestry in South Asia remain unexplored due to the lack of large representative samples from India. We recover ~1.5 Gb (or 50%) of Neanderthal and ~0.6 Gb (or 20%) of Denisovan surviving fragments from modern Indian genomes, which is substantially higher than previous surveys of Europeans with larger sample sizes. We show that the recent demographic events, including founder events and admixture, have shaped the landscape of archaic ancestry in India, and we map unique South Asian-specific candidates of adaptive introgression. Finally, we show that after accounting for archaic admixture, there is no credible evidence for the early Southern (or coastal) Dispersal of modern human expansion to India. Together, these analyses provide a detailed view of Indian evolutionary history over the past 50,000 years and underscore the potential of expanding genomic surveys to diverse groups outside Europe.

8. Michal Rozenwald – Center for Computational Biology (Lab: Urnov & Streets)
Deciphering human gene control via epigenome editing and machine learning at a single-molecule resolution
Abstract: The human genome, in its living state, carries epigenomic information: chemical modifications occurring on the DNA that can affect gene expression. Research shows that CpG methylation in gene regulatory elements is associated with active and silenced genes. However, how an epigenomic state becomes functional, and how the cell interprets it to produce a gene expression state, is not fully understood. We propose to address these questions using a model system in which we introduce artificial epigenetic perturbations targeting a specific gene and investigate the changes in its expression. This would allow for much more controlled experiments that directly link methylation states and gene expression. My goal is to develop an experimental and computational approach to learning the “language” of methylation. To accomplish this, I will take advantage of two recently developed technologies to (A) perturb the methylation landscape of a target gene and (B) read out the transcriptional state of that gene with single-molecule resolution. Then, I will develop computational models to understand and interpret the perturbation measurements.

9. Helen Sakharova – Bioengineering (Lab: Lareau)
Rules for synonymous codon choice: When do suboptimal codons matter?
Abstract: Genes can be encoded with seemingly equivalent synonymous codons, but codon choice can have dramatic effects on gene output. Naive rules for codon optimization replace slowly translated codons with synonymous, optimally translated codons, with the goal of increasing protein production. However, slowly translated codons can have important functional roles, for instance by facilitating co-translational folding. While it is widely acknowledged that synonymous codons are not exact synonyms, the basic molecular rules governing codon choice are still poorly understood. We have explored when and where codon choice is most strongly constrained using computational and experimental methods. We conducted a genome-wide screen in yeast that targets positions of conserved slow translation. Using Cas9 retron editing, we created thousands of slow-to-fast synonymous codon substitutions and grew the resulting strains together in a pooled competition. Careful controls, including slow-to-slow substitutions and multiple guides targeting each site, allow confident identification of synonymous variants that significantly decrease or increase fitness. In parallel, we are training large language models on hundreds of thousands of eukaryotic genes in order to identify constraints on codon sequences. Combined with our large-scale experimental data, our model will produce general rules for predicting the rare but important positions where ‘optimal’ codons are detrimental.

10. Milind Jagota – Computer Science (Lab: Song)
Learning antibody sequence constraints from allelic inclusion
Abstract: Antibodies are expressed by B cells and are made of two copies each of a heavy and light chain. Although each B cell could express two different heavy chains and four different light chains, usually only a unique pair of heavy chain and light chain is expressed, a phenomenon known as allelic exclusion. However, a small fraction of naive B cells violate allelic exclusion by expressing two productive light chains, one of which is non-functional; this has been called allelic inclusion. We demonstrate that these B cells can be used to learn constraints on antibody sequence. Using large-scale single-cell sequencing data, we find examples of allelic inclusion in ~5,000 naive B cells expressing two or more light chains, which is an order of magnitude larger than existing datasets. We train machine learning models to identify the non-functional sequences in these cells and are able to predict over 50% more than other antibody modeling approaches. We also investigate constraints on mouse heavy chains by analyzing sequences from B cells in early developmental stages, and observe that pairing with the surrogate light chain significantly restricts heavy chain diversity. Our work provides insight into B cell development and the determinants of antibody functionality.

Sunday, November 5

1. Maya Lemmon-Kishi – Center for Computational Biology (Lab: Nielsen)
ratePlacer: An Efficient Molecular Dating Method for Ancient eDNA Samples
Abstract: Environmental DNA (eDNA) is a rich source of genetic data that captures a broad perspective of biodiversity. Ancient eDNA from lake sediments and permafrost can be recovered and employed to explore patterns of biodiversity across the globe over thousands of years. To discover ecologically relevant patterns, ancient eDNA samples must be accurately dated. While there are many techniques for sample dating, each method has its own limitations, such as location-specific calibration and a restricted time range. Furthermore, most of these techniques date the environmental sample and not the DNA itself. The few existing molecular dating methods use only a small fraction of the sequence data, which can result in low confidence in the age estimate. Here we introduce a phylogenetics-based method, ratePlacer, to estimate the sample age using the eDNA sequences. We date samples directly from the sequence data by maximizing the likelihood of read placement in a phylogenetic tree. Using a birth-death process to simulate modern and ancient reads, we show that ratePlacer is able to accurately estimate the sample age across time. Moving forward, we plan to examine the age estimates from various previously reported ancient eDNA samples.

2. Graham Northrup – Center for Computational Biology (Lab: Boots)
Spatial Effects on Host Parasite Coevolution
Abstract: Host-parasite coevolution has been studied in many contexts in order to determine the rules and guidelines for maintenance of host or parasite diversity, strength of selection, and various other traits of interest. In this work we further the theory of host-parasite coevolution by explicitly implementing spatial organization of hosts on a grid. This spatial organization and the amount of resulting spatial interactions (or lack thereof) can influence the evolutionary outcomes. In this presentation I will detail the computational approach needed to achieve these results accurately and quickly, discuss our findings, and compare them to the results found for models without explicit spatial components.

3. Ruchir Rastogi – EECS (Lab: Ioannidis)
Comparative analysis of single-cell epigenomic and transcriptional response to six vaccines
Abstract: Recent work highlights the importance of epigenomic and transcriptional remodeling of the innate immune system in response to vaccination. Here, we compare the response to six different vaccines (three influenza vaccines, the Pfizer-BioNTech COVID mRNA vaccine, the yellow fever 17D vaccine, and the herpes zoster SHINGRIX vaccine) using longitudinal single-cell ATAC-seq and single-cell RNA-seq in PBMCs of human subjects. Single-cell ATAC-seq confirms persistent accessibility changes, primarily in classical monocytes, in all vaccines. Accessibility of the same transcription factor families (AP-1, IRF/STAT, NF-κB) is modulated in most vaccines, though often in opposite directions. Single-cell RNA-seq reflects similar acute but nonpersistent responses. We show that there is significant heterogeneity in the AP-1 response across subjects, which correlates with the strength of the booster response. Lastly, we integrate scRNA-seq and scATAC-seq data across vaccines and find epigenomically and transcriptionally distinct monocyte subclusters.