Abstracts

TALKS (In order of presentation)

Lynn Yi, UC Berkeley 
Zika infection of neural progenitor cells perturbs transcription in neurodevelopmental pathways

Background: A recent study of the gene expression patterns of Zika virus (ZIKV)-infected human neural progenitor cells (hNPCs) revealed transcriptional dysregulation and identified cell-cycle-related pathways that are affected by infection. However, deeper exploration of the information present in the RNA-Seq data can further elucidate the manner in which Zika infection of hNPCs affects the transcriptome, refining pathway predictions and revealing isoform-specific dynamics.

Methodology/Principal Findings: We analyzed data published by Tang et al. using state-of-the-art tools for transcriptome analysis. By accounting for the experimental design and estimating technical and inferential variance, we were able to pinpoint pathways affected by Zika infection that highlight Zika’s neural tropism. Examination of differentially expressed genes reveals cases of isoform divergence.
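
The abstract does not detail the variance estimation, but the general idea of separating inferential (quantification) uncertainty from biological variability using bootstrap replicates of transcript abundance can be sketched as follows (a minimal illustration with toy numbers, mirroring the spirit of tools like sleuth rather than their actual code):

    import numpy as np

    # rows: RNA-Seq samples; columns: bootstrap estimates of one transcript's abundance
    boot = np.array([[10.1, 9.8, 10.3, 10.0],
                     [14.9, 15.2, 15.1, 14.8],
                     [12.0, 11.7, 12.2, 12.1]])

    point = boot.mean(axis=1)                      # per-sample abundance estimates
    inferential = boot.var(axis=1, ddof=1).mean()  # average quantification uncertainty
    total = point.var(ddof=1)
    biological = max(total - inferential, 0.0)     # variance left for biology to explain
    print(f"total={total:.2f} inferential={inferential:.3f} biological={biological:.2f}")

Downweighting transcripts whose signal is dominated by inferential variance is what allows differential expression calls to remain reliable at isoform resolution.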

Conclusions/Significance: Transcriptome analysis of Zika-infected hNPCs has the potential to identify the molecular signatures of Zika-infected neural cells. These signatures may be useful for diagnostics and for the resolution of infection pathways that can yield specific targets for further study.

Christopher Hann-Soden, UC Berkeley
Climate influences substrate response in a fungal decomposer

Models of climate change that treat microbial decomposition as a static process are inadequate because they fail to account for feedback mechanisms that could mitigate or exacerbate climate change. Warmer temperatures are expected to accelerate microbial metabolism, leading to faster rates of decomposition and, subsequently, more carbon in the atmosphere rather than the soil; however, decomposers may adapt to climate change, or the composition of microbial communities may shift. To better understand the influence of climate on decomposition through acclimation, we studied the temperature-dependent transcriptional response to different carbon sources in a microbial decomposer, Neurospora discreta. To better understand the effect of adaptation, we compared this response between populations of N. discreta from a broad latitudinal range. Using this multi-factorial design, we assayed the interactions between temperature, substrate, and population. This expression-level and population-genomic-scale understanding of a decomposer’s response to climate change will underpin a new model of carbon cycling under climate change.

Shivani Mahajan, UC Berkeley
Convergent evolution of Y chromosome gene content in flies

Y chromosomes are characterized by abundant gene loss. Sex chromosomes have formed repeatedly across Diptera from ordinary autosomes, and X chromosomes mostly conserve their ancestral genes. Yet the nature of the gene repertoire of Y chromosomes is unknown. In this study, we traced the gene content evolution of Y chromosomes across 15 Diptera species using a subtraction pipeline that infers Y genes from male and female genome and transcriptome data. Application to Drosophila melanogaster data shows that our methodology has high sensitivity and specificity for identifying Y genes, and we also discover a novel protein-coding gene on the D. melanogaster Y chromosome. Old Y chromosomes have few genes remaining, but the number of inferred Y genes varies substantially between species: we identify Y-linked genes in some species without morphologically distinguishable sex chromosomes, yet fail to detect Y genes in others with differentiated X and Y sex chromosomes. Young Y chromosomes still show clear evidence of their autosomal origins but, in contrast to mammals, most genes on old Y chromosomes in flies are not simply remnants of genes originally present on the proto-sex chromosome that escaped degeneration; instead, they were recruited to the Y secondarily from autosomes. Despite no overlap in Y-linked gene content between species with independently formed sex chromosomes, we find that genes that have been maintained on or recruited to the Y have evolved convergent functions associated with the testis. Thus, male-specific selection is a dominant force shaping gene content evolution of Y chromosomes.
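
The core of a male/female subtraction step can be illustrated in a few lines (a minimal sketch, not the published pipeline; input tables and thresholds are hypothetical). A candidate Y-linked gene should show read coverage in males but essentially none in females:

    # Assumes dicts mapping gene -> normalized read depth from male and female data.
    def candidate_y_genes(male_depth, female_depth,
                          min_male_depth=5.0, max_female_depth=0.5):
        """Return genes covered in males but essentially absent in females."""
        return [gene for gene, m in male_depth.items()
                if m >= min_male_depth
                and female_depth.get(gene, 0.0) <= max_female_depth]

    male = {"kl-2": 12.3, "RpL32": 40.1}    # kl-2 is a known D. melanogaster Y gene
    female = {"RpL32": 41.7}
    print(candidate_y_genes(male, female))  # -> ['kl-2']

Transcriptome data can be filtered the same way, with male-limited expression standing in for male-limited genomic coverage.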

Govinda Kamath, Stanford University
HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution

Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce mis-assemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that achieves optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding “hinges” to reads for constructing an overlap graph where only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with the repeat-resolution capabilities of de Bruijn graph assemblers. HINGE was evaluated on the long-read datasets from the NCTC project. Besides producing more finished assemblies than the manual NCTC pipeline based on the HGAP assembler and Circlator, HINGE allows us to identify 40 datasets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, while other approaches either fragment the assembly or resolve the ambiguity arbitrarily.
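
The distinction between resolvable and unresolvable repeats comes down to whether the data contain reads that bridge a repeat. A toy illustration (deliberately simplified; HINGE’s actual decision is made on the overlap graph, not on known genome coordinates):

    # A repeat interval can be resolved only if some read spans it entirely,
    # extending past both ends by a safety margin.
    def is_bridged(repeat, reads, margin=100):
        """repeat: (start, end); reads: list of (start, end) alignments."""
        r_start, r_end = repeat
        return any(s <= r_start - margin and e >= r_end + margin
                   for s, e in reads)

    reads = [(0, 9000), (8000, 20000)]
    print(is_bridged((8500, 8800), reads))   # True: short repeat is spanned
    print(is_bridged((1000, 15000), reads))  # False: repeat longer than any read

Unbridged repeats are exactly the ones left merged in the assembly graph rather than resolved arbitrarily.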

Rachel Goldfeder, Stanford University
SCOTCH: A novel method to detect insertions and deletions from NGS data

Clinical-grade genome sequencing and interpretation requires accurate and complete genotype calls across the entire genome. While single nucleotide variant detection is highly accurate and consistent, these variants explain only a small fraction of disease risk. Other types of variation that disrupt the open reading frame, such as insertions and deletions (INDELs), are more likely to be harmful. However, current methods have low sensitivity for larger (>= five bases) INDELs, primarily due to challenges in aligning sequence reads that span INDELs. We present SCOTCH, a novel INDEL detection method that leverages signatures of poor read alignment, read depth information, and machine learning approaches to accurately identify INDELs from next-generation DNA sequencing data. Using biologically realistic simulated genomes and sequence reads with technologically representative error profiles (generated by ART), we evaluate SCOTCH and several currently available INDEL callers. We show that SCOTCH outperforms current methods, particularly for larger INDELs. Finally, we show an application of SCOTCH to clinical genome sequencing datasets to find novel INDELs associated with cardiac disease. This method will enable researchers and clinicians to more accurately identify INDELs associated with previously unexplained genetic conditions.
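
A hedged sketch of the classifier component (illustrative features and toy training data, not the published implementation): per-position alignment signatures such as read depth and the fraction of clipped reads feed a supervised model that labels positions as INDEL or non-INDEL.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Columns: [depth, frac_clipped_reads, frac_reads_with_insertion]
    X_train = np.array([[30, 0.02, 0.00],   # clean position
                        [28, 0.45, 0.30],   # poor alignment near an INDEL
                        [31, 0.05, 0.01],
                        [25, 0.60, 0.40]])
    y_train = np.array([0, 1, 0, 1])        # 1 = INDEL present

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.predict([[29, 0.50, 0.35]]))  # -> [1]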

Nathan Hendel, UCSF
Flagellar size control can be achieved through a simple diffusion-based length measure

An important question in cell biology is how cells know how big to make their organelles. The eukaryotic flagellum is an ideal model for studying size control because its linear geometry makes it essentially one-dimensional, greatly simplifying mathematical modeling. The assembly of flagella is regulated by intraflagellar transport (IFT), in which trains of kinesin motors walk to the tip of the flagellum and deposit the cargo necessary for the flagellum to grow. The competing length control factor is a length-independent decay of the flagellum. In Chlamydomonas reinhardtii flagella, this process results in initial rapid growth followed by convergence to a steady-state length. Curiously, the rate at which motors are recruited to begin transport is inversely proportional to the length, implying some kind of communication between the base and the tip. We propose a model in which motors unbind after cargo delivery, diffuse back to the base, and are reused in IFT. In this model, the diffusion time of the motors serves as a proxy for length measurement. To explore the viability of this diffusion-based length control, we computationally built this model in three different ways. In the first, we built an agent-based model in which we used object-oriented programming to explicitly model flagella and motors, including time dynamics. In the second, we modeled the number density along the flagellum as a vector and built a stochastic matrix to simulate time dynamics and determine a steady state. In the third, we used differential equations to directly solve for the steady-state length. In all three, we found that the diffusion model can achieve a steady-state length and an inverse relationship between length and recruitment rate. This is remarkable because it is perhaps the simplest possible explanation of length control, which lends it credence in light of evolution.
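
The third approach admits a compact worked example (parameter values are hypothetical, not fitted to Chlamydomonas). Delivery slows with length because each motor spends L/v walking to the tip plus roughly L^2/(2D) diffusing back, while decay is length-independent, so the steady state is where the two balance:

    from scipy.optimize import brentq

    v, D = 2.0, 1.75   # motor speed (um/s) and diffusion coefficient (um^2/s)
    a, d = 10.0, 0.5   # delivery capacity and constant decay rate (illustrative)

    def dLdt(L):
        round_trip = L / v + L**2 / (2 * D)  # seconds per motor round trip
        return a / round_trip - d            # assembly minus disassembly

    L_ss = brentq(dLdt, 0.1, 100.0)          # steady state where dL/dt = 0
    print(f"steady-state length: {L_ss:.2f} um")

Because the round-trip time grows with length, the recruitment rate a / round_trip falls as the flagellum elongates, reproducing the inverse length dependence without any dedicated signaling machinery.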

Wei Liu, UC Santa Cruz
Literature-based identification of lncRNA-disease association

Several human lncRNAs have been shown to play key roles in multiple diseases. Here we propose a literature-oriented algorithm, LiWi, for identifying potential lncRNA-disease associations based on the peer-reviewed literature. Evaluation showed that LiWi is effective not only in extracting reported associations but also in identifying novel ones. Based on a network constructed from literature published before 2013, LiWi successfully extracted four known prostate-cancer-related lncRNAs and further identified 27 novel ones, with 16 (59%) validated by subsequently published literature as well as independent expression data. Compared with the manually curated LncRNADisease database (Chen, et al., 2013), LiWi automatically inferred 28 additional prostate cancer-related candidates, 18 (64.28%) of which are validated independently. Follow-up systematic mining further annotated 110 human lncRNAs to 20 common diseases, providing the first literature-based systematic profile of human lncRNA-disease associations.
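
The scoring behind such literature mining is not specified in the abstract, but a standard co-occurrence test, given here purely as an illustration, asks whether a lncRNA and a disease are co-mentioned in more abstracts than expected by chance:

    from scipy.stats import hypergeom

    N = 100000    # abstracts in the corpus (toy counts)
    n_lnc = 120   # abstracts mentioning the lncRNA
    n_dis = 3000  # abstracts mentioning the disease
    k = 25        # abstracts mentioning both

    # P(co-occurrence >= k) if mentions were independent
    p = hypergeom.sf(k - 1, N, n_lnc, n_dis)
    print(f"co-occurrence p-value: {p:.2e}")

Significant pairs can then be assembled into the lncRNA-disease network from which novel candidates are ranked.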

Guhan Venkataraman, Stanford University
De novo mutations in autism implicate the synaptic elimination network

Several hypotheses exist for the genetic etiology of autism; one of note is referred to as the “synaptic elimination hypothesis.” Given that increases in both dendritic spine density and brain weight (both of which are characteristic of autism) can be caused by mutations in genes regulating synaptic elimination, the hypothesis developed that autism could be a disease of abnormal synaptic elimination. Autism has been shown to have a major genetic risk component: documented autism in families has been shown to be passed down for generations. But while inherited risk plays an important role in children’s autism, de novo (germline) mutations have also been implicated in autism risk, especially with regard to neuronal development. We hypothesized that an increased burden of de novo mutations in synaptic elimination genes leads to the synaptic pruning abnormalities observed in autism, such as increased dendritic spine density and increased brain weight. We used the dnenrich package, a network burden analysis tool, to test for enrichment in the synaptic elimination gene set on a set of 3,982 exomes from family-based trios with one autism-affected child. The package has been shown to be particularly powerful for identifying de novo mutations with small individual associations to phenotype but large effects in combination. Here we found that autism de novo variants published in the literature are significantly enriched (after Bonferroni correction) in a gene set implicated in synaptic elimination. Additionally, several of the genes in this synaptic elimination set that were enriched in protein-protein interactions (CACNA1C, SHANK2, SYNGAP1, NLGN3, NRXN1, and PTEN) have previously been confirmed as genes that confer risk for the disorder. The results demonstrate that autism-associated de novo variants are linked to genes governing synaptic pruning and density, hinting at the etiology of autism and suggesting pathophysiology for downstream correction and treatment.
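
A gene-set burden test of this kind is typically a permutation test. The sketch below (illustrative, not dnenrich’s code) redistributes the observed number of de novo mutations across genes in proportion to gene length and compares the hit count in the synaptic elimination set with the observed count:

    import numpy as np

    rng = np.random.default_rng(0)
    genes = ["SHANK2", "SYNGAP1", "NLGN3", "GENE_A", "GENE_B"]
    lengths = np.array([5000, 4000, 3000, 6000, 2000], dtype=float)
    gene_set = {"SHANK2", "SYNGAP1", "NLGN3"}   # synaptic elimination set
    observed_hits, n_mutations = 7, 10          # toy observed data

    probs = lengths / lengths.sum()
    null_hits = [sum(1 for g in rng.choice(genes, size=n_mutations, p=probs)
                     if g in gene_set)
                 for _ in range(10000)]
    p = (1 + sum(h >= observed_hits for h in null_hits)) / (1 + len(null_hits))
    print(f"enrichment p-value: {p:.4f}")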

Stephen Nayfach, UCSF
An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography

We present the Metagenomic Intra-species Diversity Analysis System (MIDAS), an integrated computational pipeline for quantifying bacterial species abundance and strain-level genomic variation, including gene content and single nucleotide polymorphisms (SNPs), from shotgun metagenomes. Our method leverages a database of >30,000 bacterial reference genomes which we clustered into species groups. These cover the majority of abundant species in the human microbiome but only a small proportion of microbes in other environments, including soil and seawater. We applied MIDAS to stool metagenomes from 98 Swedish mothers and their infants over one year and used rare SNPs to track strains between hosts. Using this approach, we found extensive mother-to-infant transmission of early colonizing, non-spore-forming strains. In contrast, late colonizing strains tended to be spore-formers, likely transmitted from the environment. This pattern was missed previously because the species composition of infant and mother microbiomes converged over time. We also applied MIDAS to 198 globally distributed marine metagenomes and used gene content to show that many prevalent bacterial species have population structure that correlates with geographic location. Strain-level genetic variants present in metagenomes clearly reveal extensive structure and dynamics that are obscured when data are analyzed at a higher taxonomic resolution.
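
The strain-tracking logic can be sketched simply (a simplification of the analysis, with toy alleles): if mother and infant carry the same strain of a species, the rare alleles observed in the infant should largely match the mother’s.

    def rare_allele_sharing(mother_alleles, infant_alleles):
        """Each argument: dict of rare-SNP position -> allele for one species."""
        shared_sites = set(mother_alleles) & set(infant_alleles)
        if not shared_sites:
            return 0.0
        matches = sum(mother_alleles[s] == infant_alleles[s] for s in shared_sites)
        return matches / len(shared_sites)

    mother = {101: "A", 250: "T", 999: "G"}
    infant = {101: "A", 250: "T", 400: "C"}
    print(rare_allele_sharing(mother, infant))  # 1.0: consistent with transmission

Because rare SNPs are unlikely to be shared by chance between unrelated strains, high sharing is strong evidence of direct transmission.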

Amit Akula, UCSF
DT-MRI Visualization of the brain’s visual networks to understand MS Pathology

While researchers have not been able to fully characterize the pathology of multiple sclerosis (MS), conventional Magnetic Resonance Imaging (MRI) techniques have been the “gold standard for the diagnosis and monitoring of MS” (Guglielmetti, Lassmann). Using advanced MRI techniques, such as diffusion-weighted imaging, and specifically spherical deconvolution tractography to image the brain’s neural tracts, recent studies have furthered MRI’s predictive capabilities, leveraging the brain’s connectivity to develop a “composite MRI-based measure of motor network integrity” that appears to “predict disability substantially better than conventional non-network based MRI measures” (Pardini). Given the clinical-radiological paradox for diseases like MS, that is, the inability of radiological diagnostics to strongly predict patient outcomes, developing a robust predictor of visual clinical outcomes is highly desirable. This research project conducted an analysis of the brain’s visual network: the white matter tracts underlying the parts of the brain associated with visual processing. Using data from the Human Connectome Project, a brain atlas of the visual pathways was constructed. Connectivity metrics, such as PageRank and other graph metrics, were applied to the constructed visual graph. The constructed visual network, and the computed graph metrics, quantitatively measure the ability of the brain to coordinate on visual tasks. By correlating this analysis of the visual network with two standard measures of visual function, visual evoked potentials (VEP) and optical coherence tomography (OCT), I will describe how to use these visual pathways to better predict visual clinical outcomes for MS patients (Martinez-Lapiscina).
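
Once tractography is reduced to a weighted region-to-region graph, the graph-metric step is straightforward. A minimal example (region names and weights are hypothetical):

    import networkx as nx

    G = nx.Graph()
    G.add_weighted_edges_from([
        ("optic_nerve", "LGN", 0.9),   # weights ~ tract strength
        ("LGN", "V1", 1.0),
        ("V1", "V2", 0.8),
        ("V2", "MT", 0.5),
    ])
    scores = nx.pagerank(G, weight="weight")  # rank regions by connectivity
    for region, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{region}: {score:.3f}")

Per-subject scores like these are the quantities that can then be correlated with VEP and OCT measures.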

Amanda Mok, UC Berkeley
Genome-wide analysis of vitamin D response elements implicates vitamin D regulation of antigen presentation in multiple sclerosis risk

Recent studies have demonstrated a causal role of vitamin D serum levels in multiple sclerosis; however, the mechanisms underlying this association are not yet known. The vitamin D receptor is expressed in nearly all tissue types, but its target genes are specific to cell type and function. Thus, genome-wide analysis of vitamin D signaling may elucidate the downstream factors driving MS pathogenesis. We investigated whether genetic variants in putative vitamin D response elements (VDREs) are associated with MS in non-Hispanic White members of Kaiser Permanente Northern California (1,098 cases; 10,329 controls). Genotypes were obtained through whole-genome profiling and imputation. We identified 6,250 SNPs within 4,764 VDREs from six ChIP-seq datasets for analysis. To detect the effect of VDRE variants on MS susceptibility independent of well-established risk factors, we performed logistic regression for each SNP, controlling for sex, genetic ancestry, a weighted genetic risk score for 110 non-HLA risk alleles, and four independent SNPs in the MHC, the region with the strongest genetic contribution to MS risk. Results show significant disease associations for 19 SNPs in 16 VDREs (FDR q<0.05), including rs2213584 (OR=1.34, p=5.34e-07) and rs2395182 (OR=1.47, p=1.38e-06) in a VDRE 5kb downstream of HLA-DRA. We additionally restricted analyses to VDRE SNPs within 10kb of genes established through recent MS GWAS and found significant associations for rs1335532 (OR=0.83, p=1.97e-04) in CD58 and rs9831894 (OR=0.76, p=5.96e-04) in CD86. As HLA-DRA, CD58, and CD86 all encode cell surface proteins involved in antigen presentation, these findings suggest that vitamin D regulation of antigen presentation may mediate MS risk. A majority (82%) of the MS-associated VDREs demonstrated specificity to B-lymphocytes, and none of the identified VDREs contained the canonical DR3 binding motif. We describe the first results from a comprehensive genomics approach to characterizing the contribution of vitamin D metabolism to MS. Our findings also have important implications for cancer and other autoimmune conditions more broadly, in which vitamin D plays a role.
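
The per-SNP test has a compact form (simulated data below; the study adjusted for the full covariate set described above):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 500
    snp = rng.binomial(2, 0.3, n)   # genotype dosage: 0/1/2 risk alleles
    sex = rng.binomial(1, 0.5, n)
    grs = rng.normal(0, 1, n)       # weighted genetic risk score
    logit_p = -1.0 + 0.4 * snp + 0.5 * grs
    case = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

    X = sm.add_constant(np.column_stack([snp, sex, grs]))
    fit = sm.Logit(case, X).fit(disp=0)
    print(f"SNP OR = {np.exp(fit.params[1]):.2f}, p = {fit.pvalues[1]:.2e}")

Repeating this across the 6,250 VDRE SNPs and applying FDR control yields the associations reported above.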

Jonathan King, UC Berkeley
A Novel Genetic Algorithm for Detecting FLT3 Internal Tandem Duplications in Acute Myeloid Leukemia Patients

Background: The presence of internal tandem duplications (ITDs) in the FLT3 gene has been shown to be a predictor of poor prognosis for patients with Acute Myeloid Leukemia. Though several biological and computational assays exist to identify FLT3-ITD+ individuals, they fail to provide several important molecular details. We propose a novel algorithm for analysis of Next Generation Sequencing (NGS) data to detect ITDs within FLT3 that offers detailed results with improved accuracy over existing assays. The program will soon be available on Illumina’s BaseSpace cloud-computing platform.

Methods: Our algorithm takes advantage of the distinctive nucleotide sequences that occur within ITDs. The program recognizes these by identifying sequencing reads that are aligned to the reference sequence on one side but misaligned on the other. The program reconstructs this region for each ITD, utilizing regular expression pattern matching to account for the high sequence variability between ITDs. An approximately 20-nucleotide search sequence from this region is then used to detect and reconstruct the ITD and to report its size, location, allele frequency (AF), sequence, and protein-coding modifications.
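
A stripped-down sketch of the search-sequence step (the junction sequence and reads below are hypothetical; the production algorithm works on aligned NGS reads and reconstructs the full ITD):

    import re

    search_seq = "TTTCAGCATTTTGACGGCA"  # ~20-nt ITD junction recovered upstream
    pattern = re.compile(search_seq.replace("N", "[ACGT]"))  # N = any base

    reads = [
        "ACGTTTCAGCATTTTGACGGCAACGT",   # carries the junction -> supports the ITD
        "ACGTACGTACGTACGTACGTACGTAC",
    ]
    itd_reads = [r for r in reads if pattern.search(r)]
    print(f"{len(itd_reads)} of {len(reads)} reads support the ITD")
    # AF estimate = supporting reads / total reads covering the locus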

Results: In silico testing was completed using simulated samples that contained ITDs of known size, location, and AF. Each sample was then analyzed with the software to test whether the specifics of each test ITD insert could be recovered. We determined the program’s specificity to be greater than 99.9%, and, when an ITD’s AF is 2% or greater, the program achieves 100% sensitivity. Furthermore, the program demonstrates high accuracy in reporting ITD position and size despite the great variation in these characteristics known to exist in the patient population.

In summary, our novel algorithm provides sensitive and accurate detection of FLT3-ITDs that is both quantitative and detailed in description, thus enabling improved diagnosis and treatment stratification of leukemia patients.

POSTERS (Listed Alphabetically)

Cameron Adams, UC Berkeley

1000 Genomes Project HLA Imputation Panel

The human leukocyte antigen (HLA) region is one of the most polymorphic regions in the human genome. Genes in the HLA are essential to immune system function, and many polymorphisms in this region are associated with increased risk of autoimmune disease. The polymorphic nature of the HLA makes genotyping HLA alleles difficult and expensive. Imputation is often the only method available to obtain HLA genotypes. However, existing reference panels for HLA imputation are quite limited. Currently, only mono-ethnic reference panels for European and Asian populations are publicly available. These panels are of high quality, but are not suitable for imputing samples of ancestries other than European or East Asian. To address this, we created an HLA imputation reference panel using 1000 Genomes Project (GP) samples that will impute HLA types for ancestrally diverse samples.

Our new imputation reference panel was created using samples from the GP. The individuals (n=930) came from four superpopulations: European (34%), East Asian (29%), African (19%), and American (18%). Genotypes were downloaded from the GP website and HLA types were downloaded from dbMHC. Genotypes and HLA types were used to generate the imputation reference panel using MakeReference.

HLA genotypes will be imputed using SNP2HLA, and performance of the imputation panel will be evaluated by testing concordance between assayed and imputed HLA genotypes in three models. First, we will impute HLA genotypes for the 930 GP samples used to create the reference panel and test the concordance between imputed and assayed HLA genotypes. Second, we will create an imputation panel composed of a random subsample of the 930 samples, use it to impute HLA types for the remaining samples, and test concordance. And third, we will compare the performance of the T1DGC panel (n=5,225; European) versus our GP reference panel on European and African American samples. We aim to provide the first ancestrally diverse, publicly available HLA imputation panel.
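
Concordance itself is a simple per-locus calculation (allele names below are illustrative): imputed and assayed genotypes are compared as unordered allele pairs.

    from collections import Counter

    def genotype_concordance(assayed, imputed):
        """Each argument: list of (allele1, allele2) genotypes for one locus."""
        matched = total = 0
        for a, b in zip(assayed, imputed):
            shared = Counter(a) & Counter(b)   # order-free multiset intersection
            matched += sum(shared.values())
            total += 2
        return matched / total

    assayed = [("A*02:01", "A*01:01"), ("A*03:01", "A*24:02")]
    imputed = [("A*01:01", "A*02:01"), ("A*03:01", "A*02:01")]
    print(genotype_concordance(assayed, imputed))  # -> 0.75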

Shaked Afik, UC Berkeley

Targeted reconstruction of T cell receptor sequence from short single cell RNA-sequencing links CDR3 length to differentiation state

Single cell RNA-sequencing (scRNA-seq) is a promising platform to study how differences in the T cell receptor (TCR) contribute to heterogeneity in the T cell phenotype. However, the ability to link antigen specificity of TCRs to single cell transcriptome data has presented a major technical challenge, as the variability of the sequences that encode the antigen recognition region (the CDR3 region) makes TCR identification with standard transcriptome analysis difficult. Current protocols to directly sequence the TCR use long reads, increasing the cost and decreasing the number of cells that can feasibly be analyzed. We present TRAPeS (“TCR Reconstruction Algorithm for Paired-end Single cells”), a fast and efficient algorithm to reconstruct TCR sequences in single cells using short (~25bp) paired-end reads. We apply it to investigate heterogeneity in the CD8+ T cell response in humans and mice, showing that it is accurate and more sensitive than previous approaches. scRNA-seq analysis of T cells specific for an epitope from the yellow fever virus (YFV) vaccine recovers two populations: one with a naive-like and one with an effector-memory-like profile. TCR analysis of the two populations reveals markedly different CDR3 lengths and divergence from germline sequence, suggesting that TCR usage contributes to heterogeneity in differentiation of the CD8+ T cell response to YFV. TRAPeS is publicly available and can readily be used to investigate the relationship between the TCR repertoire and cellular phenotype.
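
The downstream CDR3-length comparison is a standard two-sample test; a toy version (lengths are made up for illustration):

    from scipy.stats import mannwhitneyu

    naive_like = [39, 42, 45, 42, 48, 45, 42]       # CDR3 lengths (nt), toy values
    effector_memory = [36, 39, 36, 33, 39, 36, 33]
    stat, p = mannwhitneyu(naive_like, effector_memory, alternative="two-sided")
    print(f"Mann-Whitney U = {stat}, p = {p:.3f}")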

Daniel Asarnow, UC San Francisco

Signal Subtraction in Electron Microscopy

Single-particle cryo-EM has recently emerged as a tool for protein structure determination at atomic resolution. Freed from crystallographic constraints, proteins visualized by cryo-EM frequently display significant conformational heterogeneity. This heterogeneity permits recovery of multiple states, yet homogeneous structural information may dominate reconstructions, causing biologically interesting differences to become blurred out of the final models. Therefore, heterogeneous zones must be computationally isolated in order to avoid undue influence from neighboring regions. Typically, a “soft mask” is defined in order to zero out signals reconstructed outside the region of interest. Partial signal subtraction from single-particle electron micrographs is an alternate technique for focused classification and refinement that consists of directly subtracting the simulated contribution of a partial density from each particle image. It is conceptually similar to soft masking, but acts at the level of raw particle images rather than that of intermediate reconstructions. Signal subtraction can result in improved resolution and/or classification within a region of interest, as the subtracted component is no longer present to drive alignments prior to application of the soft mask. Signal subtraction is also useful for realistic simulation of hypothetical data, e.g. in which a complex subunit or protein domain is not present. Unfortunately, several practical obstacles have made signal subtraction less accessible than soft masking to the majority of electron microscopists. A key difficulty has been the lack of clear protocols for effective, artifact-free subtraction. Here, we give an explicit recounting of a successful signal subtraction process, including creation of a suitable subtraction map, simulation of particle image contributions, recentering of subtracted particles, and the details of a Fourier-correlation-based per-particle normalization procedure. These procedural details, along with open-source signal subtraction scripts in use at UCSF, should aid researchers in effectively applying the signal subtraction technique to their own work.
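
The per-particle normalization admits a minimal sketch (illustrative; a real pipeline also applies the CTF and recenters particles). The least-squares scale between each particle and the simulated partial projection is computed from a Fourier correlation, then the scaled projection is subtracted:

    import numpy as np

    def subtract_projection(particle, model_proj):
        """Scale the simulated partial-density projection to the particle, subtract."""
        F_img = np.fft.fft2(particle)
        F_mod = np.fft.fft2(model_proj)
        # least-squares scale: argmin_s || F_img - s * F_mod ||^2
        s = np.real(np.vdot(F_mod, F_img)) / np.real(np.vdot(F_mod, F_mod))
        return particle - s * model_proj

    rng = np.random.default_rng(0)
    model = rng.normal(size=(64, 64))
    particle = 1.7 * model + rng.normal(scale=0.1, size=(64, 64))
    residual = subtract_projection(particle, model)
    print(f"residual std: {residual.std():.3f}")  # ~0.1, i.e. noise only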

Daniel Bliss, UC Berkeley

Progress toward a biophysical theory of serial dependence

Daily life often requires us to keep track of visual information that is no longer present in the environment. For example, when watching the opening scenes of a movie, we must remember which facial features belong to which lines of dialog in order to form a representation of each character’s identity. One might expect that forming a veridical representation of each visual stimulus would be optimal to guide future decision making, but evidence suggests that the information we actually store in working memory is distorted as a function of our recent history: Visual features are pulled away from their true values to make them more similar to features that captured our attention in the past. This bias has been termed serial dependence. The mechanisms by which serial dependence emerges in the brain have been a matter of dispute. Some authors have posited a source of the bias in visual cortex and deemed the effect perceptual in nature, whereas others have identified its correlate in association cortex and considered it a phenomenon of working memory. Here we present our progress toward a unified theory of serial dependence that includes explicit biophysical modeling of visual cortex, association cortex, and a motor network. We view the serial dependence of visual representations as emergent from specific interactions between these areas that change over time, accounting for the relationship between the magnitude of the distortion as measured in behavior and the duration of the memory delay between stimulus and response. Our theory offers a detailed framework for future empirical research into serial dependence, points out the potential link between this phenomenon and sensory adaptation, and entails a number of novel testable hypotheses.

Alexander Clark, UC Berkeley

Pericyte Morphology: An examination of cellular characteristics in the Blood-Brain Barrier

The Blood-Brain Barrier (BBB) is instrumental in biomedical research because of its role in tumor metastasis, neurological disease, drug delivery, and more. To study its complexities, a BBB model would provide the ability to test pathogenesis and drug therapy. Our project in the Searson Group of the Institute for NanoBioTechnology at Johns Hopkins combines BBB cellular components in order to investigate effective cell culture conditions for pericytes (PCs) and endothelial cells (ECs) on 3D microvessel glass rods, PC morphology using the rod assay, and the ways EC/PC co-culture influences morphology and function. Human brain PCs derived from induced pluripotent stem cells (iPSCs) are seeded onto small glass rod devices coated with laminin, collagen IV, and fibronectin to mimic capillaries. 3D z-stack images obtained using confocal fluorescence microscopy are converted to 2D images using the Matlab program UNWRAP in order to examine cellular membrane processes and proximity to neighboring PCs. First, we are using UNWRAP to create 2D images to study cell diameter, shape, processes, and direction of wrapping around rods (blood vessels). Second, we are culturing PCs on glass rods of two different diameters, 200 micrometers and 15-20 micrometers; in these experiments we are looking for a correlation between PC morphology and capillary diameter. Third, we are co-culturing ECs and PCs on the same rods to examine how co-culture affects morphology.

Cole Helsell, UC San Francisco

Automated analysis of zebrafish behavioral screening data reveals novel, light-activated noxious compounds

Psychoactive compounds have an enormous influence on our present understanding of both the animal nervous system and human mental health. For example, some of the most compelling evidence that gamma-aminobutyric acid (GABA) is an inhibitory neurotransmitter hinged on the discovery of bicuculline, a GABA-A receptor antagonist used today to induce epilepsy-like behavior in rodents. Almost all presently known psychoactive compounds were discovered through their effects on animal behavior, often incidentally. Intuitively, one might therefore consider animal behavior an important tool for psychoactive drug discovery. However, the cost and timescale of behavioral research in rodents rules out high-throughput discovery of drugs with behavioral phenotypes. Larval zebrafish, on the other hand, show robust behavioral responses to existing psychoactive drugs and are amenable to high-throughput behavioral analysis.

Here, I report on a group of compounds discovered in a high-throughput zebrafish behavioral screen of a 10,000 compound library. These compounds appeared to potentiate dark-adapted fishes’ startle responses to bright lights. The most salient of these compounds is now published under the name ‘optovin’. Optovin proved to be a photo-activated ligand of TRPA1, a known nociceptor. By retesting them on more individuals at variable doses, I confirmed that most of the remaining optovin-related compounds cause novel visual startle phenotypes, including responses to colors of light ignored by untreated fish, hypoactivity after first exposure, and hyperactivity on contact. Chemoinformatic analysis of the compounds suggests some broad differences, but does not capture the phenotypic classification of the compounds. Dimensional reduction and clustering of behaviors based on image analysis successfully recapitulates the phenomenological differences between the compounds, suggesting that bioinformatic approaches may be able to successfully pick out interesting compounds from large behavioral screens without extensive manual analysis of the behavioral data.
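
The dimensionality-reduction and clustering step has a minimal skeleton (feature vectors below are random placeholders for the image-derived behavioral features):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    features = np.vstack([rng.normal(0, 1, (20, 50)),   # e.g. optovin-like wells
                          rng.normal(3, 1, (20, 50))])  # e.g. hypoactive wells

    embedded = PCA(n_components=2).fit_transform(features)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedded)
    print(labels)  # compounds with similar behavioral phenotypes cluster together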

Bernard Kim, UCLA

Inference of the distribution of selection coefficients for new nonsynonymous mutations using large samples

The distribution of fitness effects (DFE) has considerable importance in population genetics. To date, estimates of the DFE come from studies using a small number of individuals. Thus, estimates of the proportion of moderately to strongly deleterious new mutations may be unreliable because such variants are unlikely to be segregating in the data. Additionally, the true functional form of the DFE is unknown, and estimates of the DFE differ significantly between studies. Here we present a flexible and computationally tractable method, called Fit∂a∂i, to estimate the DFE using the site frequency spectrum from a large number of individuals. We apply our approach to the frequency spectrum of 1300 Europeans from the Exome Sequencing Project ESP6400 dataset, 1298 Danes from the LuCamp dataset, and 432 Europeans from the 1000 Genomes Project to estimate the DFE of deleterious nonsynonymous mutations. We infer significantly fewer (0.38-0.84x) strongly deleterious mutations with selection coefficient |s| > 0.01 and more (1.24-1.43x) weakly deleterious mutations with selection coefficient |s| < 0.001 compared to previous estimates. Furthermore, a DFE that is a mixture distribution of a point mass at neutrality plus a gamma distribution fits best to two of the three datasets. Our results suggest that nearly neutral forces play a larger role in human evolution than previously thought.
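
The mixture DFE described above can be written out concretely. In the sketch below (toy spectra; sfs_given_s is a stand-in for the selected spectra that a diffusion solver such as ∂a∂i would supply), the expected site frequency spectrum is a neutral point mass plus a gamma-distributed deleterious component integrated over a grid of selection coefficients:

    import numpy as np
    from scipy.stats import gamma

    def sfs_given_s(s, n_bins=10):
        """Placeholder: more deleterious sites skew toward rare frequency bins."""
        bins = np.arange(1, n_bins + 1, dtype=float)
        weights = bins ** (-(1.0 + 200 * s))   # toy skew, not diffusion theory
        return weights / weights.sum()

    p_neutral, shape, scale = 0.2, 0.2, 0.05   # mixture weight and gamma params
    s_grid = np.linspace(1e-4, 0.1, 200)
    gamma_w = gamma.pdf(s_grid, a=shape, scale=scale)
    gamma_w /= gamma_w.sum()

    deleterious = sum(w * sfs_given_s(s) for w, s in zip(gamma_w, s_grid))
    expected_sfs = p_neutral * sfs_given_s(0.0) + (1 - p_neutral) * deleterious
    print(np.round(expected_sfs, 3))

Fitting then amounts to maximizing a Poisson likelihood of the observed spectrum under expected_sfs as the DFE parameters vary.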

Minseung Kim, UC Davis

Multi-omics integration accurately predicts cellular state in unexplored conditions for Escherichia coli

A significant obstacle in training predictive cell models is the lack of integrated data sources. We developed semi-supervised normalization pipelines and performed experimental characterization (growth, transcriptome, proteome) to create Ecomics, a consistent, quality-controlled multi-omics compendium for Escherichia coli with cohesive meta-data information. We then used this resource to train a multi-scale model that integrates four omics layers to predict genome-wide concentrations and growth dynamics. The genetic and environmental ontology reconstructed from the omics data is substantially different from, and complementary to, the genetic and chemical ontologies. The integration of different layers confers an incremental increase in prediction performance, as does information about known gene regulatory and protein-protein interactions. The predictive performance of the model ranges from 0.54 to 0.87 for the various omics layers, which far exceeds various baselines. This work provides an integrative framework of omics-driven predictive modeling that is broadly applicable to guide biological discovery.
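
The incremental-layer evaluation can be illustrated generically (random placeholder data, not Ecomics): omics blocks are concatenated one at a time while cross-validated performance is tracked.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 200
    layers = {"transcriptome": rng.normal(size=(n, 30)),
              "proteome": rng.normal(size=(n, 20)),
              "metabolome": rng.normal(size=(n, 10))}
    growth = (layers["transcriptome"][:, 0] + 0.5 * layers["proteome"][:, 0]
              + rng.normal(0, 0.3, n))

    X = np.empty((n, 0))
    for name, block in layers.items():
        X = np.hstack([X, block])
        r2 = cross_val_score(Ridge(alpha=1.0), X, growth, cv=5).mean()
        print(f"+{name}: cross-validated R^2 = {r2:.2f}")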

Haleigh Miller, UC San Francisco

Machine Learning Disease Classification with Metagenomic Microbiome Data

Recent studies have shown that shotgun metagenomic data can be used with machine-learning methods to identify disease status, further supporting the view that microbial dysbiosis is associated with several diseases [1,2]. Shotgun metagenomic microbiome data provide a plethora of information that can be used to characterize the microbial environment of an individual. This information can be grouped into functional and taxonomic categories: functional data refer to which genes are present and which pathways they can be annotated to, while taxonomic data describe which species are present. We further investigated the role of dysbiosis in health with machine-learning methods, using both functional and taxonomic features of sequence data and implementing techniques to avoid over-fitting classifiers to within-study noise. Data were obtained from 14 different publicly available datasets totaling 2,148 samples from 4 different continents and encompassing 10 different diseases. Data were first analyzed within their own study to look at within-disease predictability, taxonomic and functional signatures, and potential biomarkers. Samples were then grouped into “healthy” and disease categories in an attempt to characterize a universal unhealthy microbiome. Efficacy of prediction varied across diseases, suggesting varying importance of the microbiome in disease. Predictive capability in separating “healthy” versus diseased individuals was slight but significant, suggesting there are common trends in the microbial communities across diseases. On average, functional classifiers showed slightly lower efficacy than taxonomic classifiers, but when used in unison, functional attributes were shown to have higher feature importance.
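
Guarding against within-study noise typically means cross-study validation: each study is held out in turn and predicted by a model trained on the rest. A sketch with placeholder data:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import LeaveOneGroupOut

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 40))     # samples x microbiome features
    y = rng.binomial(1, 0.5, 300)      # healthy (0) vs diseased (1)
    study = rng.integers(0, 5, 300)    # which of 5 studies each sample came from

    aucs = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=study):
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    print(f"mean held-out-study AUC: {np.mean(aucs):.2f}")  # ~0.5 on random data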

Roshni Patel, UC Berkeley

Analysis toolkit for assessing unique threats posed by novel pathogen strains

Knowledge of the virulence genes in a pathogen’s genome results in a better understanding of the species’ capabilities. This information is essential in order to prepare for and respond to potential acts of bioterrorism. However, as a result of genetic engineering and horizontal transfer, a novel pathogen strain can pose a unique threat. The Pathogen Virulence Analysis Toolkit (PVAT) addresses this issue by identifying the unique and virulent genes in an organism. Utilizing knowledge of protein clustering, PVAT can more accurately compare the genome of the novel strain against reference genomes of other strains belonging to the same species. Results are contextualized by determining the average number of unique and virulent genes found in the given species. Gene data were produced by comparing reference genomes to the Livermore Metagenomics Analysis Toolkit (LMAT) gene database and MvirDB, a virulence database.
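
At its core, the comparison is set arithmetic over protein clusters (cluster IDs below are hypothetical; this is a toy sketch, not PVAT itself):

    # Protein clusters observed in the novel strain vs. same-species references.
    novel_strain = {"clu_001", "clu_002", "clu_777", "clu_999"}
    reference_clusters = {"clu_001", "clu_002", "clu_003"}  # union over references
    virulence_db = {"clu_777", "clu_003"}                   # e.g. from MvirDB

    unique = novel_strain - reference_clusters
    print("unique genes:", unique)                        # {'clu_777', 'clu_999'}
    print("unique and virulent:", unique & virulence_db)  # {'clu_777'}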

Amogh Pathi, UC Berkeley/Children’s Hospital Oakland Research Institute

Senior Author: Damini Jawaheer

Genetic Markers and Anti-TNF Biologic Response: A Precision Medicine Approach

Background: Rheumatoid arthritis (RA) is an incurable inflammatory disease of the joints that leads to significant disability. Anti-TNF biologic medications are effective in stopping the progression of RA into more severe forms. However, up to 30% of patients on anti-TNF medications do not respond to them. Currently there are no tests to predict response; therefore, anti-TNF treatment cannot be targeted to only those patients who will respond. This results in non-responders being given these drugs, which have significant side effects.

Objective: To identify genetic biomarkers that can predict response to anti-TNF drugs.

Methods: Anti-TNF response after 3 months of treatment (ΔDAS-3) was evaluated in 723 Caucasian RA patients as the change in Disease Activity Scores from baseline (before initial therapy). Genotype data from 278,716 genome-wide single nucleotide polymorphisms (SNPs) were cleaned and then phased using SHAPEIT software. Additional SNPs were imputed based on 1000 Genomes reference haplotypes using IMPUTE2 software. Principal Components Analysis was used to test for population stratification. A multivariate linear regression model was run (PLINK software) to test for association between ΔDAS-3 and each SNP (genotyped and imputed), adjusting for baseline DAS28, type of anti-TNF drug, other medications and principal components.
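
A toy version of the per-SNP regression (simulated data; the real analysis runs in PLINK on imputed genotypes with the full covariate set):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 723
    snp = rng.binomial(2, 0.25, n)        # genotype dosage
    baseline_das = rng.normal(5.5, 1.0, n)
    drug = rng.integers(0, 3, n)          # coded anti-TNF drug type
    ddas3 = -1.5 - 0.2 * snp + 0.1 * baseline_das + rng.normal(0, 1, n)

    X = sm.add_constant(np.column_stack([snp, baseline_das, drug]))
    fit = sm.OLS(ddas3, X).fit()
    print(f"SNP beta = {fit.params[1]:.2f}, p = {fit.pvalues[1]:.1e}")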

Results: Currently, SNP genotypes are being phased and we expect to impute approximately 35,000,000 new SNPs into our dataset. We previously identified 3 genomic regions (6q15, 6q27, and 10q25.3) to be associated with response in a Japanese dataset, and anticipate that these associations will be replicated in the current Caucasian dataset. Additionally, novel associations which could be ethnic-specific may be detected at other loci.

Conclusions: The SNP genotypes associated with response can be used to prescreen patients so that anti-TNF therapy can be targeted only to potential responders. Additionally, it will help doctors plan treatment schemes more efficiently, using a precision medicine approach.

Ryan Polischuk, UC Davis

Influence of chromosome proximity in the formation of chromosome aberrations in mixed lineage leukemia

Translocations of the MLL (mixed lineage leukemia) gene trigger acute myeloid or lymphoid leukemias that affect people of all ages and may be linked to some types of chemotherapy. The MLL gene, located at 11q23, regulates gene expression during hematopoiesis and has some 79 known translocation partners (Meyer et al., 2013), creating an elaborate recombinome. Inspection of the entire recombinome shows a bias towards rearrangements in close linear (same-chromosome) proximity to the MLL gene itself, suggesting that local accessibility to MLL is a factor determining rearrangement probability. To test this hypothesis, and to further understand the underlying structural determinants of these mutations, we performed statistical analysis of Hi-C data and of sequence homology, and we developed a polymer-based local 3-D reconstruction of the MLL region.

Our statistical analysis, done by extending the Statistics of Chromosome Interphase Positioning (SCHIP) method (Arsuaga et al., 2004) to the increased information density of Hi-C contact matrices, demonstrated that both the short and long arms of chromosome 6 are significantly closer to 11q than the rest of the genome (6p contains a frequent MLL partner, AF6). Our analysis of sequence homology between the regions containing MLL and AF6 showed frequent similarities between the two; however, these were not necessarily the regions of significant chromosome proximity, which suggests that sequence homology plays a minor role in the formation of the translocation. We therefore suggest that chromosome proximity is a key factor in the formation of translocations between AF6 and MLL. We plan to expand upon this analysis with statistical results from our local 3-D reconstruction of MLL and its proximal translocation partners.
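
A stripped-down version of such a proximity test (toy contact values; the actual work extends SCHIP to full Hi-C matrices): the mean contact between the MLL bin and a candidate region is compared against a permutation null drawn from the rest of the genome.

    import numpy as np

    rng = np.random.default_rng(0)
    mll_row = rng.gamma(2.0, 1.0, size=5000)  # contacts of the MLL bin vs all bins
    chr6_idx = np.arange(1000, 1100)          # bins of the candidate region
    mll_row[chr6_idx] += 1.5                  # toy enrichment

    observed = mll_row[chr6_idx].mean()
    null = [mll_row[rng.choice(5000, size=len(chr6_idx), replace=False)].mean()
            for _ in range(10000)]
    p = (1 + np.sum(np.array(null) >= observed)) / (1 + len(null))
    print(f"proximity p-value: {p:.4f}")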

Maxime Pouokam, UC Davis

Statistical analysis of topological invariants for assessing reproducibility of 3D chromatin configuration reconstructions

It is well known that the three-dimensional (3D) architecture of eukaryotic genomes influences biological processes such as gene regulation and cancer-driving gene fusions. Obtaining 3D reconstructions has been challenging due to the high level of condensation of genomes. Chromosome conformation capture (CCC) data have provided an unprecedented means to analyze the 3D organization of genomes and to obtain candidate reconstructions thereof. While problems of accuracy can be addressed experimentally, questions of reproducibility can be addressed statistically, which is the purpose of this study. This work complements the study by Segal et al., who addressed the latter using four different statistical metrics, which were later used to perform statistical inference. However, these statistical metrics, which included Procrustes analysis and within-configuration inter-point distances, were mainly geometrical in nature and therefore extremely sensitive to certain perturbations of the data. The proposed approach also had difficulties defining an adequate null distribution for the estimation of p-values. Here we present an alternative approach to assessing reproducibility using a different metric based on a topological invariant, namely the linking number. Since topological invariants do not change upon continuous deformations of the data, one expects them to be more robust to noise than metrics of a geometrical nature. Our results lead to the conclusion that the approach presented here refines previous results and therefore provides a better way of assessing the reproducibility of candidate reconstructions. This new approach not only provides a means of assessing reproducibility but also has the potential to address other biological questions, such as the topological complexity of genomes.
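
The linking number of two closed curves can be computed directly from a discretized Gauss double integral; the snippet below is a standard construction given for illustration, not the authors' code:

    import numpy as np

    def linking_number(curve1, curve2):
        """Each curve: (n, 3) array of points on a closed loop."""
        total = 0.0
        for i in range(len(curve1)):
            a = curve1[i]
            da = curve1[(i + 1) % len(curve1)] - a
            for j in range(len(curve2)):
                b = curve2[j]
                db = curve2[(j + 1) % len(curve2)] - b
                r = a - b
                total += np.dot(np.cross(da, db), r) / np.linalg.norm(r) ** 3
        return total / (4 * np.pi)

    t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    ring1 = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]      # unit circle
    ring2 = np.c_[1 + np.cos(t), np.zeros_like(t), np.sin(t)]  # interlocked ring
    print(round(linking_number(ring1, ring2)))  # +/-1 depending on orientation

Comparing pairwise linking numbers between loops of two candidate reconstructions gives a measure of agreement that is insensitive to continuous deformations.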

Arthur Rand, UC Santa Cruz

Mapping DNA Methylation with High Throughput Nanopore Sequencing

Chemical modifications to DNA regulate cellular state and function. The Oxford Nanopore MinION is a portable single-molecule DNA sequencer that can sequence long fragments of genomic DNA. We present a general probabilistic framework for detecting chemical modifications to bases in nanopore signal and mapping them to a reference sequence. Using this framework, we show that the MinION can detect three cytosine variants: cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine, and two adenine variants, adenine and 6-methyladenine. On synthetic DNA, we classify cytosine base modifications with a median accuracy of 80% in a three-way comparison. Our method maps 5-methylcytosine motifs with 96% accuracy and 6-methyladenine at 91% accuracy at roughly 30X coverage. Finally, we show that our model is sensitive enough to detect changes in genomic methylation levels during different growth phases in E. coli.
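
The classification idea can be cartooned in a few lines (toy emission values; real models condition on k-mer context and use trained emission distributions): each cytosine variant shifts the nanopore current, and an observation is assigned to the variant with the highest likelihood.

    from scipy.stats import norm

    # hypothetical mean current (pA) and spread for one k-mer context
    emissions = {"C": (100.0, 2.0), "5mC": (104.0, 2.0), "5hmC": (97.0, 2.0)}

    def classify(current):
        return max(emissions, key=lambda v: norm.logpdf(current, *emissions[v]))

    print(classify(103.2))  # -> '5mC'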

Jacqueline Robinson, UCLA

Genomic flatlining in the endangered island fox

With species declining worldwide due to exploitation and habitat loss, a vital element in mitigating future biodiversity loss will include understanding the genetic consequences of small population size and isolation. Population genetics theory holds that small populations in isolation will suffer from declining diversity, increasing genetic load, and eventual extirpation, in a process dubbed the “extinction vortex.” Island species therefore present an ideal system for empirically studying the impacts of small population size and isolation. Island foxes (Urocyon littoralis) are dwarfed descendants of the mainland gray fox (U. cinereoargenteus) that have persisted on California’s Channel Islands for thousands of generations. We sequenced complete genomes from each island population and the mainland to examine heterozygosity and deleterious variation within individual genomes. The genome of the San Nicolas Island fox (U. l. dickeyi) is almost entirely monomorphic, and exhibits the lowest genomic heterozygosity observed in any outbred species to date. Furthermore, genomes from all island populations exhibit reduced variation and an elevation in the proportion of deleterious alleles, resulting from strong genetic drift and reduced efficacy of selection. Through demographic simulations, we show that the observed patterns of heterozygosity within the island genomes can be attributed to severe bottlenecks and long-term small population size. The case of the island fox, with its persistence despite extreme lack of genetic variability and increased genetic load, raises questions about the generality of the small population paradigm.
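
The underlying genome scan is a windowed heterozygosity calculation; a minimal sketch (toy positions):

    import numpy as np

    def windowed_heterozygosity(het_positions, chrom_length, window=1_000_000):
        """Heterozygous sites per bp in non-overlapping windows."""
        edges = np.arange(0, chrom_length + window, window)
        counts, _ = np.histogram(np.asarray(het_positions), bins=edges)
        return counts / window

    # toy chromosome: almost no heterozygosity outside one small region
    hets = list(range(2_500_000, 2_500_400, 100))
    print(windowed_heterozygosity(hets, 5_000_000))  # flat except the third window

A “flatlined” genome shows near-zero values across almost all windows, as observed for the San Nicolas Island fox.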

Neda Ronaghi, UC Santa Cruz

Transcriptome-Wide Analysis of Alternative Splicing in Primates During Neuronal Development

Alternative splicing (AS) is a major mechanism that contributes to transcriptome diversity in mammals. Here, we carried out an evolutionary comparison of transcriptome-wide changes in mRNA processing between Homo sapiens and one of our distant primate relatives, the rhesus macaque. An embryonic stem cell based cortical neurosphere differentiation assay was developed to model human and rhesus cortical neuron development. We compared gene expression patterns during the early stages of cortical development in order to identify transcriptional networks that may underlie the differences in brain development between the two species. We found that genes whose expression monotonically increases are enriched in biological processes specific to neurodevelopment, whereas genes whose expression monotonically decreases are enriched in cell-signaling ontologies. To further study the differences in post-transcriptional regulation between human and rhesus during neuronal development, we use MISO, a probabilistic method that quantifies the expression levels of AS events.
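
The monotonic-trend classification used for the ontology analysis can be sketched directly (expression values and gene names are illustrative):

    import numpy as np

    def trend(expr):
        diffs = np.diff(expr)
        if np.all(diffs > 0):
            return "increasing"      # enriched for neurodevelopment ontologies
        if np.all(diffs < 0):
            return "decreasing"      # enriched for cell-signaling ontologies
        return "non-monotonic"

    stages = {"GENE_A": [2.1, 5.0, 8.7, 12.2],   # expression across stages
              "GENE_B": [9.5, 6.1, 3.3, 1.0]}
    for gene, expr in stages.items():
        print(gene, trend(expr))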

Cameron Soulette, UC Santa Cruz

Full-length transcript analysis of alternative mRNA processing by nanopore sequencing

Cancer genomics has been revolutionized by the advent of next-generation sequencing. Short-read high-throughput cDNA sequencing (RNA-seq) has led to the discovery of numerous cancer-specific pre-mRNA splicing alterations, such as those caused by mutations in the splicing factors SF3B1, SRSF1, and U2AF1. Although these complex splicing alterations have been characterized across various cancer types, current methods lack the resolution necessary to study these events in the context of full-length isoforms. Our work is aimed at overcoming this limitation by developing a full-length cDNA sequencing approach to study differential expression of cancer-specific transcripts. Our approach utilizes known alternative splicing (AS) alterations observed in SRSF1 overexpression (SRSF1oe) in MCF-10A cells to develop a robust cDNA sequencing approach for the Oxford Nanopore MinION. We will compare transcript isoform abundance between MCF-10A SRSF1oe and MCF-10A control cells. We will also produce short-read RNA-seq data in parallel to compare and quantify the amount of new information gained by full-length cDNA sequencing. Our initial results yielded 20,301 and 6,896 full-length cDNA reads from control and SRSF1oe MCF-10A cells, respectively. Further processing revealed that a substantial number of genes (n=1,719) had moderate read coverage (>10 reads), 57 of which are SRSF1-regulated AS events. Moreover, we identified interesting alternative transcription initiation patterns in SRSF3 and SRSF1, in which the alternative initiation site in SRSF1 may serve to protect it from nonsense-mediated degradation. In sum, our results demonstrate the utility and effectiveness of full-isoform sequencing.
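
Assigning full-length reads to transcripts reduces to matching splice-junction chains; a toy sketch (isoform IDs and coordinates are hypothetical):

    # A nanopore read supports an annotated isoform when its chain of splice
    # junctions matches the annotation exactly.
    isoforms = {"ISO-1": ((100, 200), (300, 400), (500, 600)),
                "ISO-2": ((100, 200), (500, 600))}

    def assign(read_junctions):
        hits = [name for name, j in isoforms.items() if j == tuple(read_junctions)]
        return hits[0] if len(hits) == 1 else None

    print(assign([(100, 200), (500, 600)]))  # -> 'ISO-2'

Counting assigned reads per isoform in each condition then yields the abundance comparison described above.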

Aaron Stern, UC Berkeley

A full-likelihood method for detecting directional selection from SNP data

A major goal of genetics research is to identify where selection has acted on genetic loci and shaped neighboring diversity via “selective sweeps.” Current methods for detecting selective sweeps primarily use summary statistics obtained from sequence data. These statistics often have power to detect only particular classes of sweeps. For example, methods used to detect completed sweeps extend poorly to incomplete sweeps; similarly, methods used to identify sweeps of a standing variant extend poorly to sweeps from a single haplotype. Here, we present the first full-likelihood method for detection of general sweeps from SNP data. Via importance sampling of the posterior on ancestral recombination graphs relating the observed individuals, we obtain an asymptotically consistent estimate of the likelihood ratio of selection vs. neutrality. Our results show that our method offers improved classification across sweep classes despite confounding factors such as population growth and admixture. We also apply our method to an analysis of selection amongst 18 European haplotypes, particularly examining the history of the LCT (lactase) gene.
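
The estimator has the shape of a generic importance-sampling likelihood ratio (the toy densities below stand in for genealogy distributions; this is a schematic, not the ARG machinery itself):

    import numpy as np

    rng = np.random.default_rng(0)
    p_neutral = lambda g: np.exp(-g)               # toy genealogy density, Exp(1)
    p_selected = lambda g: 2.0 * np.exp(-2.0 * g)  # selection shortens genealogies
    lik_data = lambda g: np.exp(-(g - 0.3) ** 2)   # toy P(data | genealogy)

    G = rng.exponential(1.0, 100_000)              # proposal q = neutral model
    w_sel = lik_data(G) * p_selected(G) / p_neutral(G)
    w_neu = lik_data(G)                            # p_neutral / q cancels
    print(f"LR(selection : neutrality) ~ {w_sel.mean() / w_neu.mean():.2f}")

Averaging the data likelihood over genealogies drawn from a tractable proposal, reweighted under each model, yields a consistent estimate of the likelihood ratio as the number of samples grows.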

Jennifer Yip, UC Berkeley

Detection of Virulence Factors and Genetic Engineering in Pathogens

Malicious genetic engineering poses a threat to biosecurity that requires early detection to combat effectively. Additionally, it is becoming more difficult to differentiate between natural horizontal gene transfer and deliberate genetic modification of bacterial DNA. Thus, it is crucial to continue developing tools capable of predicting the functions and virulent capacity of a pathogen. The Pathogen Virulence Analysis Toolkit (PVAT) we built determines the genetic makeup of bacterial strains by consolidating genomic data from multiple databases and comparing gene characteristics. Running the PVAT pipeline results in a comprehensive gene analysis that allows for detection of possible virulence enhancements. The aim was to construct a genetic profile of the observed organism’s virulent and unique genes. The scope of the project spans from increasing the efficiency of identifying potentially harmful genes to designing a straightforward user interface to display the results.