Gene Expression: RNA-seq
Gene expression is a key determinant of cellular phenotypes. While
high-throughput microarrays have been the predominant technology
for measuring gene expression for years, all hybridization-based
technologies are subject to biases and limitations, such as reliance on
known gene models and potential for cross-hybridization to probes with
similar sequences. Microarrays offer a limited ability to fully catalog
and quantify the diverse spectrum of RNA molecules expressed from
cellular genomes, and are of little use for detecting previously
unknown transcripts, novel exons, or exons with novel splice
junctions.
Next-generation sequencing has revolutionized transcriptomics by
enabling massively parallel sequencing of cDNA (RNA-seq). In addition
to overcoming the limitations mentioned above for microarrays, RNA-seq
provides a digital readout of gene expression, enabling a higher
sensitivity and an essentially unlimited dynamic range.

Services
The core currently offers the following bioinformatics analysis workflow for RNA-seq experiments:
- Pre-alignment quality assessment:
-
- Per-base sequence quality
- Per-base sequence content
- Per-base GC content
- Search for overrepresented sequences (adapters, primers, etc)
- Alignment to a reference genome using TopHat:
-
- Homo sapiens
- Mus musculus
- Rattus norvegicus
- Bos taurus
- Canis familiaris
- Gallus gallus
- Drosophila melanogaster
- Arabidopsis thaliana
- Caenorhabditis elegans
- Saccharomyces cerevisiae
- Post-alignment quality assessment:
-
- Flagging duplicate reads
- Estimation of library complexity
- Insert size distribution (for paired-end sequencing)
- Analysis of coverage over transcript position
- Transcript assembly
- Differential expression testing
-
- Isoforms: tests of differential expression for each isoform.
- Genes: tests of differential expression summing expression of all transcripts at the gene level.
- Primary transcripts: tests of differential expression summing expression of all transcripts sharing the same transcription start site, regardless of coding sequence.
- Coding sequence: tests of differential expression summing expression of all transcripts sharing the same coding sequence, regardless of transcription start site.
- Differential splicing tests: For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript.
- Differential coding output: For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples.
- Differential promoter use: For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.
- Visualization: assistance with visualization using IGV.
FAQs
Q: What software packages do you use in your RNA-seq
pipeline?
The core uses a combination of custom-built and open-source
application-specific software, including the Tuxedo suite of tools (Bowtie, Tophat, Cufflinks, and cummeRbund) for RNA-seq alignment, assembly,
transcript abundance estimation, and differential expression testing.
The
STAR aligner may also be used for alignment. When performing
RNA-seq for examining expression at the gene level only, count-based
negative binomial methods such as DESeq or edgeR
will be used.
Q: How much sequence do I need?
RNA-seq has the unquestionable advantage over current
hybridization-based microarrays. However, with the cost per sample of
RNA-Seq still higher than microarray, it can be beneficial to multiplex
samples in a single sequencing reaction. But how many reads are
sufficient?
RNA-seq is a counting application - coverage is estimated in
the number of reads (as opposed to genome sequencing, where
coverage is measured in gigabases). Most published RNA-seq papers have
sequenced from as little as 20-million reads per sample, up to as many
as 100-million reads per sample, with an average of about 40-million
reads per sample. The number of reads required depends on your research
question. At the lower end (20M), you can really only focus on
expression at the gene level (although this data is still arguably
better than microarrays). At the upper end (100M) you can confidently
examine novel and rarely expressed transcripts.
The ENCODE consortium's Standards,
Guidelines and Best Practices for RNA-Seq recommends
and for
Estimating the depth of sequencing coverage (i.e., the
average number of times each nucleotide is sequenced) is somewhat
different for RNA-seq than for genome sequencing, and involves several
assumptions. Sequencing depth in RNA-seq is calculated as (NT sequenced
/ #mRNAs per cell) / (#nucleotides/mRNA). For example, if you have
50-million 100-bp paired end reads, you have 10-billion, or 10^10 NTs
sequenced. Assuming you have 2e6 mRNAs per cell and an average
transcript length of 1500 bases: (10^10 nt / 2e6 mRNAs-per-cell) /
(1500NT/mRNA) = 3X coverage of an mRNA present at one copy per
cell.
Q: Which sequencing platform should I use?
See Table 1, "Comparison of Sequencing
Technologies" from this paper in the January 2012 issue
of Nature Reviews Genetics for a comparison of read length, insert
size, run time, reads per run, error rate, and relative cost per
nucleotide for several common sequencing platforms, including Roche
454, Illumina GAIIx, Illumina HiSeq 2000, Illumina MiSeq, PacBio, and
Ion Torrent. Yongde Bao's DNA Sciences (Sequencing) Core currently has an
Illumina GAIIx and will be getting a MiSeq. While the MiSeq has a much
shorter run time, its primary applications are for targeted
resequencing studies (e.g. following up GWAS hits in a small region),
amplicon sequencing, microbial identification (where assembly is not an
issue), or targeted sequencing of small regions in a clinical setting.
The MiSeq's throughput is too low to make RNA-seq studies in complex
genomes cost-effective. If you would like to schedule a joint meeting
with directors from both the Bioinformatics Core and the DNA Sciences
(Sequencing) Core, please email us and indicate this on your consultation request form.
Q: How much will this cost me?
Bioinformatics costs are separate from sequencing costs, and are the
same whether data is generated here at the DNA Sciences Core or
elsewhere through an external vendor. Read the about page, and submit a consultation request form.

