My Software Research

Hello and welcome to my academic website. Here you can find information about my research, publications, software development, and other news. I (hope to) regularly write posts on topics related to ecology, evolution and bioinformatics with the goal of documenting fun stories and useful Python and R code. I’m currently a post-doc in the Donoghue Lab at Yale University. Before starting here I earned by Bachelors at the University of Minnesota, worked as a research associate at Brown University, and earned my PhD at the University of Chicago. 
 

Primary research interests:

  • Genomics: RADseq bioinformatics and phylogenetics
  • Ecology: Interspecific reproductive interference in hyperdiverse lineages
  • Evolution: Comparative approaches to measuring genomic introgression
  • Botany: Pollen-pistil relationships and floral evolution in Pedicularis
my software research

Topics

Bioinformatics – I created and still actively develop and research the program pyrad, which is used to assemble genomic RADseq data sets. I love learning about Python programming and linux, and I’m a huge advocate for open-source code and data sharing. You can follow active development of my projects on github.

Reproductive Interference – My research software is broadly focused on understanding why some clades of organisms contain more species than others, and within this context I’m especially interested in the long-recognized relationship among flowering plants where families with more specialized pollination mechanisms tend to contain more species. To understand this pattern I study how flower evolution affects rates of speciation, and reciprocally, how interactions among co-occurring species, especially within highly diverse clades, affect flower evolution.

Pedicularis – Much of my research is focused on the hyperdiverse genus Pedicularis in its center of diversity in the Hengduan Mountains of China (the eastern side of the Tibetan plateau), where nearly 300 species are endemic. Pedicularis in this region frequently overlap in their geographic ranges, flower synchronously during the short alpine summer, and share bumblebee pollinators. This combination of factors raises the potential for pollen transfer between different species, which can have negative fitness consequences (termed reproductive interference), and consequently impose strong selection on floral phenotypes. I combine genomic RADseq approaches above and below the species-level with field-based experiments and ecological studies to better understand the evolutionary consequences of these interactions.

Oaks – Coyne and Orr described oaks as “a worst case scenario of the biological species” due to their often blurred species boundaries resulting from hybridization. Whether as a consequence of hybridization, insufficiently variable markers, or sorting of ancestral polymorphisms in large populations of long lived trees, previous phylogenetic studies have shown great difficulty in resolving oak relationships at both shallow and deep evolutionary timescales. Applying genomic RADseq data we have resolved a backbone phylogeny of the American clade oaks (Hipp et al. 2014) and shown evidence of frequent hybridization among the live oaks which affects phylogenetic inference (Eaton et al. 2015), and which has likely persisted over millions of years (Cavender-Bares et al. 2015).

Viburnum – A model system for phylogenetic comparative methods in plants, and in particular historical biogeography and leaf trait evolution. We are using RADseq methods to infer deep-scale phylogenies, infer introgression between lineages, resolve polyploid lineages, and study replicate radiations of leaf forms among Viburnum in central and south America.

Software

pyRAD

Software to assemble de novo RADseq loci from restriction-site associated sequence data (RAD,ddRAD,GBS,PEddRAD,PEGBS). Alignment clustering allows for indel variation to better align highly divergent samples, merge and trim methods canbe employed to rescue overlapping paired end data, and reverse complement clustering is available to improve GBS assemblies of short overlapping contigs.

simrrls

Software to simulate fastQ formatted RAD, ddRAD or GBS data on an input species tree under a coalescent model. See example usage at this post.

Simulating raw RADseq data

The program simrrls can be used to simulate raw (fastq format) sequence data on an input topology under the coalescent in a manner that emulates restriction-site associated DNA, with slight variations for different data types (e.g., RADseq, ddRAD, GBS, paired-end data).


RADseq was originally developed to generate data for bug testing pyrad and to create example RADseq data sets for use in tutorials. However, I can imagine other uses for it, such as investigating the effects of missing data, insufficient sequencing coverage and error rates on assembly results.

To use the program you will need to first install the Egglib Python module, used here to perform coalescent simulations.

Calling the script – simrrls

$ simrrls -h

optional arguments:
  -h, --help      show this help message and exit
  --version       show program's version number and exit
  -o outname      [str] output file name prefix (default 'out')
  -mc dropout     [0/1] allelic dropout from mutation to cut sites (default 0)
  -ms dropout     [0/1] allelic dropout from new cut sites in seq (default 0)
  -e error        [float] sequencing error rate (default 0.0005)
  -f datatype     [str] datatype (default rad) (options: rad, gbs, ddrad,
                  pairddrad, pairgbs)
  -I indels       [float] rate of indel mutations (default 0) ex: 0.001
  -l length       [int] length of simulated sequences (default 100)
  -L nLoci        [int] number of loci to simulate (default 100)
  -n Ninds        [int] N individuals from each taxon (default 1)
  -N Ne           [int] pop size (Ne for all lineages; default 5e5)
  -t tree         [str] file name or newick string of ultrametric tree
                  (default 12 taxon balanced tree w/ bls=1)
  -u mu           [float] per site mutation rate (default 1e-9)
  -df depthfunc   [str] model for sampling copies (default norm, other=exp)
  -dm depthmean   [int] mean sampled copies in norm, 1/m for exp (default 10)
  -ds depthstd    [int] stdev sampled copies, used with norm model (default 0)
  -c1 cut_1       [str] restriction site 1 (default CTGCAG)
  -c2 cut_2       [str] restriction site 1 (default CCGG)
  -i1 min_insert  [int] total frag len = (2*l)+insert (default 100)
  -i2 max_insert  [int] total frag len = (2*l)+insert (default 400)
  -r1 seed_1      [int] random seed 1 (default 1234567)
  -r2 seed_2      [int] random seed 2 (default 7654321)

Datatypes– RADseq is the default type for which a barcode and restriction site overhang are attached to the left side of single end reads. Depending on the data type selected simrrls will put cut sites and barcodes in the proper ends of sequences so that they emulate RAD, GBS, or ddRAD data, single or paired-end. The choice also effects how locus dropout occurs (i.e., one or two cutters present).

Allelic Dropout– There are two ways in which allelic dropout (mutation-disruption) can occur, each of which can be toggled. The first is mutations to the cut site recognition sequence (-mc). If this is turned on then an allele will be dropped if a mutation occurs within the length of the restriction recognition sites determined by the cutters (-c1 and -c2) and the data type (-f) which determines whether one or two cutters are used (e.g., rad is a single cutter, ddrad is double-cutter). The other form of mutation disruption occurs when mutations give rise to new cut sites within the length of a selected fragment (-ms). Here the total fragment length is determined by (-l) and the max insert length (-i2), and the probability of disruption by the cutters (-c1 and -c2), and the substitution rate (-N and -u).

Sequencing coverage and error– Another source of missing data in RADseq data sets is low sequencing coverage. You can set the sampling rate as the number of reads that are sampled from each haplotype. The mean (-dm) and standard deviation (-ds). To examine the effect sequencing rate has on base calls, etc., you could combine low coverage sequencing with a high error rate (-e).

Topology– Data are simulated under a Jukes-Cantor model with equal base frequencies. If there is no user-supplied topology (-t) then simrrls uses a default 12 tip topology, shown below. Branch lengths are in coalescent units. The outgroup taxon in the tree “X” is not included in the output, but is used to polarize mutations relative to an outgroup for creating indels.

RADseq

Output – Two data files are created using the output name prefix (-o), which if it were ‘testrad’ would create testrad_R1_.fastq.gz and testrad_barcodes.txt. The first is a compressed fastq file with sequence data and the latter is a text file mapping barcodes to sample names. If you selected paired-end data it would create two sequence files, one with “_R1_” in the name and the other with “_R2_”. With these fastq data and a barcode map the data can then be assembled in pyrad.

Examples –

Modified population parameters:

$ simrrls -o test2 -N 1e6 -u 2e-8

Modified sequencing parameters:

$ simrrls -o test3 -L 5000 -l 200 -e 0.001 -dm 10 -ds 2

Modified library type (In this case allowing paired-end reads overlap):

$ simrrls -o test4 -f pairddrad -i1 -50 -i2 200

Modified topology:

$ echo "((a:1,b:1):1,c:2);" > treefile
$ simrrls -o test5 -t treefile

More info – Check out the git repository: (simrlls.py)

pyrad

Description

The benefit of pyRAD over most alternative methods for analyzing RADseq-like data comes in its use of an alignment-clustering method (vsearch) that allows for the inclusion of indel variation which improves identification of homology across highly divergent samples. For this reason pyRAD is commonly employed for RADseq studies at deeper phylogenetic scales, however, it works equally well at shallow scales.

Flexibility

pyRAD is intended for use with any type of restriction-site associated DNA. It currently supports RAD, ddRAD, PE-ddRAD, GBS, PE-GBS, EzRAD, PE-EzRAD, 2B-RAD, nextRAD, and can be extended to other types. Below is the download link as well as a number of tutorials. There is a “general use” pyRADtutorial which explains how to install and setup input files, and also tutorials with example data sets for different data types and analyses. pyRAD was initially described in the following publication (preprintjournal)