Talks

Speakers listed in order of presentation


Saturday, October 26

Session 1:

9:15 – 9:30 AM

Carmelle Catamura (Lareau Lab) Designing de novo splicing using deep learning for allele-specific silencing of disease genes

Abstract: Gene editing has been limited in repeat expansion diseases. The most beneficial therapeutic strategy often involves allele-specificity: targeting the mutant allele while keeping the normal allele intact. Because the repeat sequence is present in both alleles, selective CRISPR targeting of the mutant allele is challenging. Here, we present a computational framework for allele-specific suppression of repeat expansion genes and show its efficacy in suppressing the gene responsible for Huntington’s disease, HTT. Our solution is to engineer a poison exon into only the disease allele: an alternative exon that, when included in an mRNA, will introduce an early stop codon and trigger nonsense-mediated mRNA decay (NMD). We can take advantage of this pathway to induce allele-specific depletion by intentionally engineering new poison exons in the disease allele. To target the mutant copy, we take advantage of benign intronic SNPs that commonly occur with the repeat expansion. CRISPR guides specific to the SNP allele that is in phase with the repeat could recruit an RNA-guided base editor to introduce a few nucleotide substitutions. Using a deep learning model of splicing based on thousands of mRNA sequences, we screen in silico for positions where targeting a base editor could create sequences that form new splice sites. Importantly, the novel exon need not be near the actual disease mutation; any highly heterozygous SNP in an intron could potentially serve this purpose. We conclude that editing of just a few carefully chosen nucleotides in a gene can be sufficient to create a new exon, making it a feasible approach for allele-specific mRNA depletion. Evaluating our pipeline on SNPs from the 1000 Genomes Project, we found that 62% of individuals are viable for this strategy.
Our approach combines the power of CRISPR editing with deep learning models of RNA sequence to engineer poison exons with therapeutic potential for a disease that, at present, has no viable treatments.

9:30 – 9:45 AM

Meaghan Marohn (Moorjani Lab) Revisiting the evolution of lactase persistence: insights from South Asian genomes

Abstract: Lactase persistence (LP), the ability to digest milk into adulthood, is a textbook example of natural selection in humans. Multiple mutations upstream of the LCT gene are associated with LP and have been previously shown to be under selection in Europeans and some pastoralist groups from Africa and the Middle East. Our understanding of the origin and evolution of LP in non-Europeans, however, remains incomplete. South Asia is the highest producer of dairy in the world, and dairy products are widely consumed across the subcontinent. We assembled a genome-wide dataset of ~8,500 present-day and ancient genomes from India, Pakistan, and Bangladesh, including diverse timescales, geographic, ethnolinguistic, and subsistence groups. We find the known Eurasian variants are widespread across South Asia, and exhibit a clinal pattern along north-south and east-west gradients. Remarkably, two pastoralist groups, Toda in the south and Gujjar in the north, have an unexpectedly high frequency of LP variants, comparable to those seen in Europeans. Ancient DNA analysis reveals that the Eurasian variants first appeared in South Asia during the historical and medieval periods. We examine the relationship between ancestry and LP frequency among diverse contemporary individuals and infer that LP likely spread through steppe pastoralist-related gene flow. Following this gene flow, many South Asian groups experienced strong founder events, more extreme than in Ashkenazi Jews and Finns. By measuring the strength of the founder events and performing demographic simulations, we confirm that genetic drift alone cannot fully account for the observed high prevalence of LP in pastoralist groups. Notably, we find long, shared haplotypes around the LCT locus in these pastoralist groups that are identical by descent to those in Europeans.
Together, these findings suggest strong selective pressures favoring LP in South Asian pastoralist groups and underscore striking parallels in the evolutionary history of LP across the two sub-continents.

9:45 – 10:00 AM

Seraphina Shi (Huang Lab) and Aidan McLoughlin (Huang Lab) Integrative Deep Multi-Learning for Predicting and Biclustering Cancer Drug Responses (impaCluster): Leveraging Cancer Omics and Drug Molecular Data

Abstract: Precision medicine in cancer treatment leverages the complex relationship between cancer biology and drug molecules, hindered by the genetic complexity of cancer and diverse drug structures. We introduce integrative deep multi-task prediction and biclustering (impaCluster), merging cancer omics, drug data, and drug response to pinpoint cancer cell lines sensitive to specific drugs based on their molecular profiles. impaCluster uses biclustering to identify these sensitive subsets and predict drug sensitivity with enhanced accuracy by iterating between learning cell line and drug embeddings and their response mappings. This approach helps identify tailored treatment strategies by revealing the molecular signatures driving drug response. Furthermore, impaCluster's capability to group unseen cell lines and compounds facilitates quick screening, marking a significant step towards personalized cancer therapy. Our validation through simulations and diverse datasets underscores impaCluster's potential in identifying precise treatment options.

10:00 – 10:15 AM

Fiona Callahan (Nielsen Lab) Challenges in detecting ecological interactions using sedimentary ancient DNA data

Abstract: With increasing availability of ancient and modern environmental DNA technology, whole-community species occurrence and abundance data over time and space is becoming more available. Sedimentary ancient DNA data can be used to infer associations between species, which can generate hypotheses about biotic interactions, a key part of ecosystem function and biodiversity science. Here, we have developed a realistic simulation to evaluate five common methods from different fields for this type of inference. We find that across all methods tested, false discovery rates of inter-species associations are high under realistic simulation conditions. Additionally, we find that with sample sizes that are currently realistic for this type of data, models are typically unable to detect interactions better than random assignment of associations. We also find that at larger sample sizes, information about species abundance improves performance of these models. Different methods perform differentially well depending on the number of taxa in the dataset. Some methods (SPIEC-EASI, SparCC) assume that there are large numbers of taxa in the dataset, and we find that SPIEC-EASI is highly sensitive to this assumption while SparCC is not. We find that for small numbers of species, no method consistently outperforms logistic and linear regression, indicating a need for further testing and methods development.

Session 2:

10:45 – 11:00 AM

Florica Constantine (Dudoit Lab) Poisson Spatial Models for Multi-Sample Spatial Transcriptomics Data

Abstract: As spatial transcriptomics technology is adopted more broadly, we can now collect data from multiple tissue slices that are not necessarily serial sections or from the same patient. Despite opening the door to new enquiries, multiple samples lead to unique challenges that traditional spatial statistical methods are not equipped to handle, such as different coordinate systems that do not necessarily align, different numbers and types of cells (covariates), different tissue architectures (as they are taken from different sites), etc. In this work, we propose a novel generalized spatial linear model for analyzing multi-sample spatial transcriptomics data. Our method for fitting the model enables the estimation of a single set of fixed effects in these settings, as well as the detection of sample-specific spatial patterning. For example, we might wish to identify which genes are differentially expressed between conditions or phenotypes while controlling for spatial correlation between cells, or find specific sample-gene pairs with non-trivial spatial patterning. Beyond specifying the extension to the multi-sample setting, we provide a mathematical and computational framework for efficient and scalable model-fitting and statistical inference. Our method is competitive with or outperforms existing algorithms in the single-sample setting and provides a hitherto nonexistent extension to the multi-sample setting.

11:00 – 11:15 AM

Hanlun Jiang (Listgarten Lab) Statistical machine learning for understanding protein-protein interactions from synthetic coevolution

Abstract: Emergence of protein-protein interactions marks a key step during the evolution of molecular functions. However, quantitatively investigating this process is challenging, as directed evolution experiments typically start from a pre-existing protein complex where the binding evolution has already begun. We describe a strategy to dissect protein coevolution by integrating a recently developed library-on-library screening platform with statistical machine learning. We developed a general framework that employs a fully connected neural network to jointly model high-throughput screening data from multiple, sequential rounds of selection. This approach enabled us to predict the fitness landscape with high accuracy. Based on the predicted complete fitness landscape, we performed epistasis analysis and simulated coevolutionary trajectories to elucidate the detailed interactions that potentially drive the evolution of protein-protein interactions. Our findings provide new insights into the molecular mechanisms of protein coevolution.

11:15 – 11:30 AM

Antoine Koehl (Song Lab) Deep Models of Protein Evolution in Time

Abstract: Deep models of probability distributions p(y) over protein sequences y have driven remarkable recent advances in protein structure prediction and other applications. These models capture complex interactions between different sites within a protein, but they cannot natively describe how proteins evolve over time while experiencing intricate constraints and adapting new functions. Here, we tackle this challenge by constructing massive amounts of suitable training data and developing a deep model of Protein Evolution IN Time (PEINT) to learn the transition probability p(y|x, t) from a protein sequence x to another protein sequence y in a given amount of time t. We demonstrate that our model greatly outperforms classical evolutionary models while also achieving competitive performance on downstream tasks. Crucially, we are able to model evolutionary transitions on unaligned sequence pairs, gaining the flexibility to learn the full spectrum of protein evolutionary forces including insertions and deletions.

11:30 – 12:30 PM

Keynote Speaker - Dr. Anshul Kundaje, Associate Professor, Genetics and Computer Science, Stanford University - Performance Hall

Deciphering regulatory syntax and genetic variation with deep learning models

Session 2 continued:

4:00 – 4:15 PM

Hunter Nisonoff (Listgarten Lab) Unlocking Guidance for Discrete State-Space Diffusion and Flow Models

Abstract: Generative models on discrete state-spaces have a wide range of potential applications, particularly in the domain of natural sciences. In continuous state-spaces, controllable and flexible generation of samples with desired properties has been realized using guidance on diffusion and flow models. However, these guidance approaches are not readily amenable to discrete state-space models. Consequently, we introduce a general and principled method for applying guidance on such models. Our method depends on leveraging continuous-time Markov processes on discrete state-spaces, which unlocks computational tractability for sampling from a desired guided distribution. We demonstrate the utility of our approach, Discrete Guidance, on a range of applications including guided generation of images, small-molecules, DNA sequences and protein sequences.

4:15 – 4:30 PM

Helen Sakharova (Lareau Lab) Identifying codon constraints in yeast by combining large unsupervised learning models and genome-wide screens

Abstract: Identical proteins can be encoded in DNA using different synonymous codons, but the choice of codons used can have dramatic effects on the amount of protein produced. When choosing codons to optimize protein production, a naive approach is to replace slowly-translated codons with synonymous, quickly-translated codons. However, slowly-translated codons can have important functional roles, such as facilitating co-translational folding. The rules governing codon choice are still poorly understood. We have explored when and where codon choice is most strongly constrained using computational and experimental methods. We have leveraged existing large unsupervised learning models in order to identify positions where codon choice is restricted. In parallel, we have conducted a genome-wide screen in yeast to measure the fitness effects of synonymous mutations. Using Cas9 retron editing, we created thousands of slow-to-fast synonymous codon substitutions, and grew them together in pooled competition. Careful controls, including slow-to-slow substitutions and multiple guides targeting each site, allow confident identification of synonymous variants that significantly decrease or increase fitness. By combining our model with large scale experimental data, we aim to produce general rules for predicting the rare but important positions where ‘optimal’ codons are detrimental.

4:30 – 4:45 PM

Danielle Stevens (Krasileva Lab) Using the past to predict the future: how receptor-epitope variation can inform plant pathogen outcomes

Abstract: The production of plants and their products is threatened by pathogens and climate change, leading to serious economic costs. To limit disease development, plants encode an innate immune system. Plant immune receptor repertoires vary across species and can recognize conserved bacterial epitopes including components of the flagellin and cold shock proteins (csp22). While pathogen recognition is crucial in host-pathogen interactions, most studies use a single pathogen epitope and thus, the impact of diverse, multi-copy epitopes on pathogen outcomes is unknown. Previously, we used comparative genomics to investigate the natural evolution of 30,000+ epitopes from 4,000+ plant-associated bacterial genomes. Overall, we found natural variation was constrained yet experimentally testable. Using standard immune assays, we assessed the immunogenicity of 79.2% of csp22 variants on tomato, revealing differential outcomes and new mechanisms of immune evasion. Other related plants, such as potato, pepper, and eggplant, display a different or no immune response, indicating a potential gene loss. Additionally, the distantly related grape and citrus respond to csp22, indicating convergently evolved receptors. Using csp22 epitope variants and their associated receptors as a model, we aim to characterize receptor evolution and functional immune outcomes, and to build machine learning models from these data to speed up prediction of innate immune receptor utility, discovery, and engineering for emerging pathogens.

4:45 – 5:00 PM

Yulin Zhang (Moorjani Lab) Reconstructing Mutation Patterns over the course of Human Evolution

Abstract: One of the most fundamental discoveries in evolutionary biology is the molecular clock: the notion that mutations occur steadily over time and thus could serve as an "evolutionary clock" for dating past events. Recent whole genome sequencing studies have challenged the validity of the molecular clock by revealing an almost two-fold difference in mutation rate over the course of human evolution. This discordance has been hypothesized to stem from a decrease in mutation rates towards the present in humans. To investigate this hypothesis, we leverage the history of archaic gene flow into modern humans to learn about mutation patterns at deep evolutionary timescales. Using whole genome sequences of ~900 individuals from the Human Genome Diversity Panel, we identify Neanderthal and Denisovan archaic segments in present-day individuals from diverse populations. We compare the number and types of derived mutations on the archaic and non-archaic syntenic fragments. We find similar overall mutation rates across lineages over the past 500,000 years. Comparison of the mutation spectrum (i.e., the composition of different types of mutations) reveals significant differences between archaic and modern human segments. We discuss the implications and mechanisms related to these changes. Future studies that use older calibration points can help to recover the mutation rate at deeper timescales and facilitate the recalibration of the molecular clock.

5:00 – 5:15 PM

Andrew Vaughn (Nielsen Lab) Ancient selective pressures on metabolic traits contribute to obesity and weight-related diseases in modern populations

Abstract: One of the biggest global health issues of the 21st century is the stark increase in the prevalence of obesity and weight-related diseases, such as type 2 diabetes and coronary artery disease. While diet, physical activity level, and other lifestyle factors are known to play a role in the etiology of these diseases, there is growing awareness of the contribution of genetics to these interrelated phenotypes. However, a central question still remains: why are variants that contribute to such adverse phenotypes found at such appreciable frequencies in modern populations, particularly when their associated phenotypes would appear to be under negative selection? To answer this question, we here develop a method to infer selection on polygenic traits, rigorously test for changing selective pressures through time, and concurrently analyze genetically correlated traits to account for the pleiotropic effects of certain alleles. We apply our method to a set of modern and ancient genomes from western Eurasia and find two explanations for the high prevalence of variants that contribute to weight-related diseases. The first is that the selective pressures on weight-related illness phenotypes appear to have changed significantly through time, which we hypothesize was due to the introduction of agriculture to Europe and the huge dietary shifts that accompanied this major lifestyle transition. The second is the pleiotropic effect of these variants on several interrelated metabolic-associated traits, including triglyceride levels, HDL cholesterol, and BMI. We hypothesize that positive selection acting on one of several metabolic traits at ancient times, but not at more modern times, could explain the high prevalence of disease-causing alleles at the present through pleiotropy. This analysis highlights the continued contribution of ancient selective pressures on our genomes and on modern global health.

5:15 – 5:30 PM

Stacy Li (Sudmant Lab) Characterization of de novo retrotransposition events in the aging germline

Abstract: Aging is an emergent phenomenon hallmarked by the deterioration of physiological processes over time. Genomic instability occurs throughout normal and pathological aging, typically driven by unresolved or erroneously remediated DNA damage. De novo germline mutations are of critical importance to reproductive success in aging: previous work using short-read sequencing in pedigree-based studies identified a parental age-associated increase in small de novo variants, with a strong bias from the paternal germline. De novo structural variants (dnSVs) have profound impacts on health, disease, and genome evolution, but remain challenging to characterize due to the technical limitations of short-read sequencing. To address this, we conducted a study using PacBio HiFi long-read sequencing to identify dnSVs in bulk sperm samples from eight donors of varying ages. Under this framework, we anticipate recall of clonal dnSVs (shared amongst sperm descended from a common progenitor), with the potential to recover singleton/doublet dnSVs arising in meiosis. To this end, we developed a method to identify de novo retrotransposition events, which can be characterized within the span of a single read. We identified multiple high-confidence AluYb8, AluYa5, and L1HS retrotransposition events. We estimate between 2.1 and 10.2 events per 100 individual cells, with the oldest donor sample yielding the greatest number of events. Furthermore, several L1 events originally raised as insertions were revealed to be complex clonal dnSVs following manual examination. These results suggest that long-read sequencing is a promising method to evaluate the prevalence and spectra of dnSVs in the aging germline.

ASHG version

Title: Characterization of de novo retrotransposition events in the aging germline

Abstract: Aging is an emergent phenomenon hallmarked by the deterioration of physiological processes over time.
De novo germline mutations are directly transmissible to offspring: increased de novo mutation frequency in the male germline in aging poses significant risk to reproductive success. In particular, de novo structural variants (dnSVs) affect large genomic regions and are known contributors to congenital neurodevelopmental disorders. Notably, autism spectrum disorder is associated with both an elevated dnSV burden and advanced paternal age at conception. Consequently, characterization of dnSV burden in the male germline is critical for understanding increased risk of congenital disease in aging. Advances in highly accurate single-molecule long-read sequencing now enable direct characterization of dnSVs without relying on proxy measures (i.e. read depth). This approach captures native methylation context and improves accuracy by identifying mutations in phase context. We applied PacBio HiFi long-read sequencing to identify dnSVs in bulk sperm samples from twenty donors aged 27 to 62 at an average of 45X coverage. We created highly contiguous phased assemblies (average N50 = 140Mb) for each donor and used personal genome alignment to identify clonal (shared amongst sperm descended from a common progenitor) and unique (private to <4 gametes generated during meiosis) variants. In the first phase of our project, we characterized the frequency and distribution of de novo retrotransposition events, identified by well-defined consensus sequences captured within the span of a single read. The majority of de novo events stem from AluYb8, AluYa5, and L1HS activity, ranging from 2.1 to 10.2 events per 100 individual cells. We observe an increase in de novo AluY subfamily events in individuals of advanced paternal age (40+ years). Remarkably, approximately 50% of events were found in genic and lncRNA regions expressed in the testis per cross-analysis with the GTEx dataset.
Finally, we identify multiple complex clonal dnSVs occurring in regions enriched with ancient L1 sequences, suggesting microhomology-mediated mechanisms. These results suggest that long-read sequencing is a promising method to evaluate the prevalence and spectra of dnSVs in the aging germline.


Sunday, October 27

11:00 – 12:00 PM

Keynote Speaker - Dr. Allon Wagner, Assistant Professor, Center for Computational Biology, EECS & Molecular and Cell Biology, UC Berkeley: Title TBD