Gene Expression: RNA-seq
Gene expression is a key determinant of cellular phenotypes. While
high-throughput microarrays have been the predominant technology
for measuring gene expression for years, all hybridization-based
technologies are subject to biases and limitations, such as reliance on
known gene models and potential for cross-hybridization to probes with
similar sequences. Microarrays offer a limited ability to fully catalog
and quantify the diverse spectrum of RNA molecules expressed from
cellular genomes, and are of little use for detecting previously
unknown transcripts, novel exons, or exons with novel splice
Next-generation sequencing has revolutionized transcriptomics by enabling massively parallel sequencing of cDNA (RNA-seq). In addition to overcoming the limitations mentioned above for microarrays, RNA-seq provides a digital readout of gene expression, enabling a higher sensitivity and an essentially unlimited dynamic range.
The core currently offers the following bioinformatics analysis workflow for RNA-seq experiments:
- Pre-alignment quality assessment:
- Per-base sequence quality
- Per-base sequence content
- Per-base GC content
- Search for overrepresented sequences (adapters, primers, etc)
- Alignment to a reference genome using TopHat:
- Homo sapiens
- Mus musculus
- Rattus norvegicus
- Bos taurus
- Canis familiaris
- Gallus gallus
- Drosophila melanogaster
- Arabidopsis thaliana
- Caenorhabditis elegans
- Saccharomyces cerevisiae
- Post-alignment quality assessment:
- Flagging duplicate reads
- Estimation of library complexity
- Insert size distribution (for paired-end sequencing)
- Analysis of coverage over transcript position
- Transcript assembly
- Differential expression testing
- Isoforms: tests of differential expression for each isoform.
- Genes: tests of differential expression summing expression of all transcripts at the gene level.
- Primary transcripts: tests of differential expression summing expression of all transcripts sharing the same transcription start site, regardless of coding sequence.
- Coding sequence: tests of differential expression summing expression of all transcripts sharing the same coding sequence, regardless of transcription start site.
- Differential splicing tests: For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript.
- Differential coding output: For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples.
- Differential promoter use: For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.
- Visualization: assistance with visualization using IGV.
Q: What software packages do you use in your RNA-seq pipeline?
The core uses a combination of custom-built and open-source application-specific software, including the Tuxedo suite of tools (Bowtie, Tophat, Cufflinks, and cummeRbund) for RNA-seq alignment, assembly, transcript abundance estimation, and differential expression testing. The STAR aligner may also be used for alignment. When performing RNA-seq for examining expression at the gene level only, count-based negative binomial methods such as DESeq or edgeR will be used.
Q: How much sequence do I need?
RNA-seq has the unquestionable advantage over current hybridization-based microarrays. However, with the cost per sample of RNA-Seq still higher than microarray, it can be beneficial to multiplex samples in a single sequencing reaction. But how many reads are sufficient?
RNA-seq is a counting application - coverage is estimated in the number of reads (as opposed to genome sequencing, where coverage is measured in gigabases). Most published RNA-seq papers have sequenced from as little as 20-million reads per sample, up to as many as 100-million reads per sample, with an average of about 40-million reads per sample. The number of reads required depends on your research question. At the lower end (20M), you can really only focus on expression at the gene level (although this data is still arguably better than microarrays). At the upper end (100M) you can confidently examine novel and rarely expressed transcripts.
The ENCODE consortium's Standards, Guidelines and Best Practices for RNA-Seq recommends
Estimating the depth of sequencing coverage (i.e., the average number of times each nucleotide is sequenced) is somewhat different for RNA-seq than for genome sequencing, and involves several assumptions. Sequencing depth in RNA-seq is calculated as (NT sequenced / #mRNAs per cell) / (#nucleotides/mRNA). For example, if you have 50-million 100-bp paired end reads, you have 10-billion, or 10^10 NTs sequenced. Assuming you have 2e6 mRNAs per cell and an average transcript length of 1500 bases: (10^10 nt / 2e6 mRNAs-per-cell) / (1500NT/mRNA) = 3X coverage of an mRNA present at one copy per cell.
Q: Which sequencing platform should I use?
See Table 1, "Comparison of Sequencing Technologies" from this paper in the January 2012 issue of Nature Reviews Genetics for a comparison of read length, insert size, run time, reads per run, error rate, and relative cost per nucleotide for several common sequencing platforms, including Roche 454, Illumina GAIIx, Illumina HiSeq 2000, Illumina MiSeq, PacBio, and Ion Torrent. Yongde Bao's DNA Sciences (Sequencing) Core currently has an Illumina GAIIx and will be getting a MiSeq. While the MiSeq has a much shorter run time, its primary applications are for targeted resequencing studies (e.g. following up GWAS hits in a small region), amplicon sequencing, microbial identification (where assembly is not an issue), or targeted sequencing of small regions in a clinical setting. The MiSeq's throughput is too low to make RNA-seq studies in complex genomes cost-effective. If you would like to schedule a joint meeting with directors from both the Bioinformatics Core and the DNA Sciences (Sequencing) Core, please email us and indicate this on your consultation request form.
Q: How much will this cost me?
Bioinformatics costs are separate from sequencing costs, and are the same whether data is generated here at the DNA Sciences Core or elsewhere through an external vendor. Read the about page, and submit a consultation request form.