CCB researchers have developed a variety of tools, web servers and databases. Many of these resources are available online or can be downloaded; usage agreements vary.
- ANDY – ANDY is a general, fault-tolerant tool for database searching on computer clusters.
- ASTRAL – The ASTRAL compendium provides databases and tools useful for analyzing protein structures and their sequences. It is partially derived from, and augments, the SCOP (Structural Classification of Proteins) database. Most of the resources provided here depend upon the coordinate files maintained and distributed by the Protein Data Bank.
- cgi-lib.pl – The cgi-lib.pl library has become the de facto standard library for creating Common Gateway Interface (CGI) scripts in the Perl language.
- doublet – doublet is a modified version of the Smith-Waterman algorithm that incorporates patterns of dipeptide covariation to align protein sequences.
- SCOP – Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.
- SCOR – The Structural Classification of RNA (SCOR) is a database designed to provide a comprehensive perspective and understanding of RNA motif structure, function, tertiary interactions and their relationships. SCOR 2.0.3 provides a survey of the three-dimensional motifs contained in 579 NMR and X-ray RNA structures available as of May 15, 2004. This includes 8,270 secondary structural elements, of which 2,920 are hairpin loops and 5,350 are internal loops. The structural elements are organized in a directed acyclic graph (DAG) architecture, allowing multiple parent classes for a motif. Users can browse the database or search by PDB or NDB identifier, keyword or sequence. Descriptions and cartoon representations of each of the classes are available. RNA motifs reported in the literature (e.g. Kink turns, S-turns, GNRA loops) are incorporated into the classification.
- SIFTER – Statistical Inference of Function Through Evolutionary Relationships.
- WebLogo – WebLogo is a web-based application designed to make the generation of sequence logos as easy and painless as possible.
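The doublet entry above builds on Smith-Waterman local alignment. As a rough illustration of that base algorithm only, here is a minimal sketch with a linear gap penalty. The function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example; it does not include doublet's dipeptide covariation scoring and is not taken from the doublet source.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Plain Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions, not doublet's parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; local alignment never lets a cell drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("XXACGTYY", "ZZACGTWW")
print(result)
```

A modified scheme such as doublet's would presumably replace the per-residue `sub` term with a score that also depends on neighboring residues; that part is not sketched here.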
- Microarray Analysis
- ScanAlyze – Process fluorescent images of microarrays. Includes semi-automatic definition of grids and complex pixel and spot analyses. Outputs to tab-delimited text files for transfer to any database. Written by Michael Eisen. For Windows only. Manual. Source Code.
- Cluster Analysis and Visualization Software – Cluster and TreeView are an integrated pair of programs for analyzing and visualizing the results of complex microarray experiments.
- Cluster – Perform a variety of types of cluster analysis and other types of processing on large microarray datasets. Currently includes hierarchical clustering, self-organizing maps (SOMs), k-means clustering, principal component analysis. Hierarchical clustering methods described in Eisen et al. (1998) PNAS 95:14863.
- FuzzyK – A C++ command line program that will perform fuzzy k-means clustering on gene expression data. See Gasch AP and Eisen MB (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 3(11), 1-22.
- TreeView – Graphically browse results of clustering and other analyses from Cluster. Supports tree-based and image-based browsing of hierarchical trees. Multiple output formats for generation of images for publications. For Windows only.
- MapleTree – MapleTree is a Java-based, open source, cross-platform visualization tool to graphically browse the results of clustering analyses from our Cluster and FuzzyK clustering software, and many other clustering and analysis programs.
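Among the methods listed for Cluster above is k-means clustering. The sketch below shows the basic idea on a toy set of expression profiles. It is a plain Lloyd-style k-means written for illustration, with hypothetical names (`kmeans`, `profiles`) and caller-supplied starting centers; it is not the Cluster or FuzzyK implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Basic k-means: alternate assignment and center update until stable.

    points  -- list of equal-length numeric tuples (e.g. expression profiles)
    centers -- initial centers supplied by the caller (an assumption made
               here to keep the example deterministic)
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy data: two genes that go up together, two that go down together.
profiles = [(1.0, 1.1, 0.9), (0.9, 1.0, 1.0), (-1.0, -0.9, -1.1), (-1.1, -1.0, -1.0)]
labels, centers = kmeans(profiles, centers=[profiles[0], profiles[2]])
print(labels)
```

Fuzzy k-means, as used by FuzzyK, differs in that each gene receives a graded membership in every cluster rather than a single hard label.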
- Combined Expression Data and Sequence Analysis Software
- GMEP – Compute genome-mean expression profiles from expression and sequence data.
- Sequence Analysis Software
- WPH-finder – Given a cis-regulatory element, search for putative co-regulated sequences in the genome as sequences with similar word content (word profile hits).
- GATA – (platform independent): Graphic Alignment Tool for Comparative Sequence Analysis.
- TFEM – (ANSI C code): Transcription Factor Expectation Maximization, an algorithm for detecting DNA regulatory motifs by incorporating positional trends in information content.
- MONKEY – MONKEY is a set of programs designed to search alignments of non-coding DNA sequence for matches to matrices representing the sequence specificity of transcription factors.
- EMnEM – Expectation-Maximization on Evolutionary Mixtures
- Berkeley Quantitative Genome Browser – QGB duplicates much of the annotation display functionality typical of a genome browser but with an emphasis on custom and quantitative data such as that produced in high throughput sequencing and genome-wide association studies.
- Tiling Microarray Analysis – The Berkeley Drosophila Transcription Network Project has put together a variety of resources to assist in the processing and interpretation of ChIP-chip tiling microarray experiments, particularly those performed with Affymetrix and Nimblegen chips. These resources include a comprehensive package of Java applications for mapping, normalizing, testing, filtering, printing, plotting, comparing, and annotating results from tiling microarray experiments.
- Web Tools
- Probabilistic sequence modeling
- DART library – includes
- Handel (phylo-alignment/ancestral reconstruction of DNA and proteins)
- xrate (mutation rate measurement, alignment annotation)
- stemloc and Evol Deeds (RNA multiple alignment, structure prediction, ancestral reconstruction)
- simgenome (simulation of genome sequence alignments; requires GSimulator)
- DART library – includes
- Format conversion
- Gene Expression Analyzer (GEA) – A tool for clustering and significance analysis of SAGE and microarray gene expression data.
- Knorm – An appealing statistical method for gene association inference across multiple dependent experimental conditions.
- LMM – A method for predicting transcription factor binding sites by evaluating a candidate site in a local genomic context. Available upon request from the Huang Lab.
- TilingAnalyzer – A method for analyzing multiple RNA tiling arrays. Available upon request from the Huang Lab.
- Berkeley Madonna – Modeling and analysis of dynamic systems.
- AMAP – Protein multiple alignment
- Cufflinks – Transcript assembly and abundance estimation for RNA-Seq
- FSA – Fast Statistical Alignment
- GENEMAPPER – Reference based gene annotation
- MAVID – Multiple alignment of large genomic sequences
- MERCATOR – Homology mapping
- MJOIN – Neighbor joining with subtree weights
- PARALIGN – Alignment polytope construction
- SLAM – Pairwise simultaneous alignment and gene finding
- SLIM – Minimum network design for optimizing the search space for pair hidden Markov models
- TopHat – Splice junction mapper for short RNA-seq reads
- VISTA – Visualization tool for global alignments
Sjölander Group – Phylogenomics
- FAT-CAT – Fast Approximate Tree Classification. FAT-CAT uses HMMs at internal nodes of PhyloFacts trees to derive functional sub-classifications of protein sequences, and simultaneous functional and taxonomic classification of metagenome sequences in microbial community datasets.
- PhyloFacts – Phylogenomic encyclopedias of protein families across the Tree of Life. PhyloFacts release 3.0.2 contains >7.3M protein sequences from >99K unique taxa clustered into ~100K families. Each family has a phylogenetic tree, multiple sequence alignment, hidden Markov model, predicted orthology groups and experimental and annotation data. These data can be downloaded from the website and searched using the FAT-CAT HMM search webserver (described above).
- PHOG – PhyloFacts Orthology Group: phylogenetic orthologs.
- SATCHMO-JS – Simultaneous alignment and tree construction using hidden Markov models.
- (naive)BayesCall – Basecaller for the Illumina platform. Software accompaniment to “Kao, W.C., Stevens, K. and Song, Y.S. BayesCall: A model-based basecalling algorithm for high-throughput short-read sequencing. Genome Research, 19 (2009) 1884-1895.” and “Kao, W.C. and Song, Y.S. naiveBayesCall: An efficient model-based base-calling algorithm for high-throughput sequencing. Proc. 14th Annual Intl. Conf. on Research in Computational Molecular Biology (RECOMB 2010), Lecture Notes in Computer Science 6044, pages 233–247, 2010.”
- ASF – Two-Locus Asymptotic Sampling Formula. Software accompaniment to “Jenkins, P.A. and Song, Y.S. Closed-form two-locus sampling distributions: accuracy and universality. Genetics, 183 (2009) 1087-1103.”
- overpaint – Gene Conversion. Software accompaniment to “Yin, J., Jordan, M.I., and Song, Y.S. Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data, Proceedings of ISMB 2009, Bioinformatics, 25 (2009) i231-i239.”
- Wright_Fisher_MP and Moran_MP – Multi-locus Match Probability. Software accompaniment to “Bhaskar, A. and Song, Y.S. Multi-locus match probability in a finite population: A fundamental difference between the Moran and Wright-Fisher models. Proceedings of ISMB 2009, Bioinformatics, 25 (2009) i187-i195.”
- COB – Estimating Recombination Rates. Software accompaniment to “Lyngsø, R., Song, Y.S., and Hein, J. Accurate computation of likelihoods in the coalescent with recombination via parsimony. Proc. 12th Annual Intl. Conf. on Research in Computational Molecular Biology (RECOMB 2008), Lecture Notes in Computer Science 4955, pages 463-477.” COB is a parsimony-based method of computing likelihoods accurately under the coalescent with recombination.
- BLOSSOC – Whole-Genome Association Mapping. This program combines a recently found linear-time algorithm for phasing genotypes on trees with a tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. Software accompaniment to “Ding, Z., Mailund, T., and Song, Y.S. Efficient whole-genome association mapping using local phylogenies for unphased genotype data. Bioinformatics, 24 (2008) 2215-2221.”
- HapBound and SHRUB – HapBound and SHRUB respectively compute lower and upper bounds on the minimum number of crossover recombinations. SHRUB constructs an ancestral recombination graph for the input data. Software accompaniment to “Song, Y.S., Wu, Y. and Gusfield, D. Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution, Proceedings of ISMB 2005. Bioinformatics, 21, Suppl.1, (2005) i413-i422.”
- Beagle – Beagle computes the minimum number of crossover recombinations. It also produces an ancestral recombination graph. Software accompaniment to “Lyngsø, R., Song, Y.S., and Hein, J. Minimum Recombination Histories by Branch and Bound. Proceedings of WABI 2005, Lecture Notes in Computer Science, 3692, pp. 239-250.”
- HapBound-GC and SHRUB-GC – HapBound-GC and SHRUB-GC respectively compute lower and upper bounds on the minimum combined number of crossover and gene-conversion recombinations. SHRUB-GC constructs a graphical representation of evolutionary history involving coalescent, mutation, crossover and gene-conversion events. Software accompaniment to “Song, Y.S., Ding, Z., Gusfield, D., Langley, C.H., and Wu, Y. Algorithms to Distinguish the Role of Gene-Conversion from Single-Crossover Recombination in the Derivation of SNP Sequences in Populations. Proceedings of RECOMB 2006. Lecture Notes in Computer Science, 3909, (2006) 231-245.”
Speed and Dudoit Groups
- SMA – An R package for the analysis of cDNA microarray data
- RMAExpress – An application for generating RMA expression measures for Affymetrix data (for Windows operating systems).
- Bioconductor – Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data. Many of the software packages were authored by members of the Dudoit and Speed groups.
Theoretical Evolutionary Genomics Group (Huelsenbeck, Nielsen, Slatkin)
- MrBayes – MrBayes is a program that estimates phylogeny from an input alignment. The program uses Markov chain Monte Carlo to approximate the posterior probability distribution of trees.
- Structurama – Structurama infers population structure from genetic data for a set of individuals. It uses a Dirichlet process prior, which allows the number of populations to be a random variable.
- MDIV – MDIV is a program that will simultaneously estimate divergence times and migration rates between two populations under the infinite sites model and under a finite sites model (HKY). This version of the program is only applicable to a single locus and assumes equal population sizes in all populations. A program with enhanced features and better documentation is available from Jody Hey’s web site. A Windows executable version of the program, documentation, and an example infile are coming soon. Please send enquiries regarding source code or executables for other platforms to Rasmus Nielsen.
- MISAT – MISAT is a program for estimating the likelihood surface for θ (4 times the effective population size times the mutation rate) for microsatellite data. A Windows executable version of the program, documentation, and an example infile are coming soon. Please send enquiries regarding source code or executables for other platforms to Rasmus Nielsen.
- SweepFinder – SweepFinder is a program implementing the method described in: Nielsen et al. 2005. Genomic scans for selective sweeps using SNP data. Genome Research 1566-1575. It can be used to detect the location of a selective sweep based on SNP data, as well as estimate the frequency spectrum of observed SNP data in the presence of missing data. Source Code and Instructions are coming soon.
- trueFS – trueFS is a program used for finding the ascertainment corrected frequency spectrum based on ascertained SNP data. For information regarding the method, please see: Nielsen, R., M. J. Hubisz and A. G. Clark. 2004. Reconstituting the frequency spectrum of ascertained SNP data. Genetics 168: 2373-2382. Source Code and Instructions are coming soon.
- codonbias – This program will allow the user to estimate selection coefficients relating to optimal codon usage. Source Code and Instructions are coming soon.
- CodonRecSim – CodonRecSim is an older program written by R. Nielsen for simulating samples under codon-based models with the coalescent with recombination. Source Code and Instructions are coming soon.
- PATRI – PATRI (PaTeRnity Inference) is a program for paternity analysis of genetic data. Windows Executable; Executable for Linux on Sun processor; Executable for Linux on Intel processor; Documentation (Readme file); and Example infile are coming soon.
- SAP – SAP (Statistical Assignment Package) is a program that assigns unknown DNA sequences to species and taxonomic groups using a Bayesian approach to calculate a probability that the unknown DNA sequence groups monophyletically with database sequences belonging to a taxonomic group.
- MIMAR – MIMAR (MCMC estimation of the Isolation-Migration model Allowing for Recombination) is a Markov chain Monte Carlo method to estimate parameters of an isolation-migration model.
- iMCMC – This program was inspired by Paul Lewis’s fantastic Windows program MCROBOT. iMCMC is a Macintosh application that illustrates Markov chain Monte Carlo (MCMC) for a simple landscape.
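Several tools in this group (MrBayes, MIMAR, iMCMC) rely on Markov chain Monte Carlo. As a rough illustration of the idea on a one-dimensional "landscape", the sketch below runs a random-walk Metropolis sampler against a standard normal target. The function name, the target density, and the step size are assumptions made for the example; none of it is drawn from the programs listed above.

```python
import math
import random


def metropolis(log_density, start, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_density -- function returning the log of an unnormalized density
    start       -- initial position of the chain
    Returns the list of visited positions (one per step).
    """
    rng = random.Random(seed)
    x = start
    logp = log_density(x)
    samples = []
    for _ in range(n_steps):
        # Propose a symmetric Gaussian step around the current position.
        proposal = x + rng.gauss(0.0, step_size)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, logp_new - logp))
        if rng.random() < accept_prob:
            x, logp = proposal, logp_new
        samples.append(x)
    return samples


def log_standard_normal(x):
    """Log density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x


samples = metropolis(log_standard_normal, start=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(variance, 2))
```

With enough steps, the sample mean and variance should approach 0 and 1 for this target. Phylogenetic samplers apply the same accept/reject rule to moves over trees and model parameters rather than a single number.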