This workshop will:
In this workshop we will access (The Cancer Genome Atlas) data available at the NCI's Genomic Data Commons (GDC). Before we start, it is important to know that the data is deposit in two different databases:
The legacy archive (https://portal.gdc.cancer.gov/legacy-archive/search/f) which contains unharmonized legacy data from repositories that predate the GDC (e.g. CGHub). Also, legacy data is not actively maintained, processed, or harmonized by the GDC (Source: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Legacy_Archive/)
Harmonized database (https://portal.gdc.cancer.gov/) which contains data from many cancer projects processed using standardized pipelines and the reference genome GRCh38 (hg38). This gives the advantage of analyzing multiple cancer types or the same cancer type across multiple projects. You can find more information about the pipelines supporting data harmonization at https://gdc.cancer.gov/about-data/gdc-data-harmonization and at https://docs.gdc.cancer.gov/Encyclopedia/pages/Harmonized_Data/.
An overview of the web portal is available at https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started/.
Also, a more in-depth comparison between the legacy and Harmonized data was recently published:
GDC provides the data with two access levels:
You can find more information about those two levels and how to get access to controlled data at: https://gdc.cancer.gov/access-data/data-access-processes-and-tools.
Each TCGA sample has a unique identifier called TCGA barcode, which contains important information about each sample. A description of the barcode is shown below (Source: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/).
You can find a table with all the code and meanings at https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables.
In order to filter the data available in GDC some fields are available such as project (TCGA, TARGET, etc.), data category (Transcriptome Profiling, DNA methylation, Clinical, etc.), data type (Gene Expression Quantification, Isoform Expression Quantification, Methylation Beta Value, etc.), experimental strategy (miRNA-Seq, RNA-Seq, etc.), Workflow Type, platform, access type and others.
In terms of data granularity, a project has data on several categories, each category contains several data types that might have been produced with different workflows, experimental strategy and platforms. In that way, if you select data type "Gene Expression Quantification" the data category will be Transcriptome Profiling.
You can find the entry possibilities for each filter at the repository page of the database at https://portal.gdc.cancer.gov/repository.
Before we start, it is important to know that the R/Bioconductor environment provides a data structure called SummarizedExperiment
, which was created to handle both samples metadata (age, gender, etc), genomics data (i.e. DNA methylation beta value) and genomics metadata information (chr, start, end, gene symbol) in the same object.
You can access samples metadata with colData
function, genomics data with assays
and genomics metadata with rowRanges
.
suppressMessages({
library(TCGAbiolinks)
library(MultiAssayExperiment)
library(maftools)
library(dplyr)
library(ComplexHeatmap)
})
clinical <- GDCquery_clinic("TCGA-COAD")
head(clinical)
query <- GDCquery(project = "TCGA-ACC",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
names(clinical.BCRtab.all)
clinical.BCRtab.all$clinical_drug_acc %>%
head %>%
as.data.frame
The RNA-Seq pipeline produces raw counts, FPKM and FPKM-UQ quantifications and is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/.
The following options are used to search mRNA results using TCGAbiolinks:
Here is the example to download the raw counts, which can be used with DESeq2 (http://bioconductor.org/packages/DESeq2/) for differential expression analysis.
query.exp.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
raw.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)
head(raw.counts)
query.exp.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - FPKM-UQ",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
fpkm.uq.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)
head(fpkm.uq.counts)
TCGAbiolinks has provided a few functions to download mutation data from GDC. There are two options to download the data:
GDCquery_Maf
which will download MAF aligned against hg38.This example will download MAF (mutation annotation files) for variant calling pipeline muse.
Pipelines options are: muse
, varscan2
, somaticsniper
, mutect
. For more information please access
GDC docs.
You can download the data using TCGAbiolinks GDCquery_Maf
function.
maf <- GDCquery_Maf("COAD", pipelines = "muse")
maf %>% head %>% as.data.frame
Then visualize the results using the maftools package.
# create maftools input
maftools.input <- maf %>% read.maf
# Check summary
plotmafSummary(maf = maftools.input,
rmOutlier = TRUE,
addStat = 'median',
dashboard = TRUE)
oncoplot(maf = maftools.input,
top = 10,
removeNonMutated = TRUE)
# classifies Single Nucleotide Variants into Transitions and Transversions
titv = titv(maf = maftools.input,
plot = FALSE,
useSyn = TRUE)
plotTiTv(res = titv)
You can extract sample summary from MAF object.
getSampleSummary(maftools.input) %>% head
The Copy Number Variation Analysis Pipeline is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/
Numeric focal-level Copy Number Variation (CNV) values were generated with "Masked Copy Number Segment" files from tumor aliquots using GISTIC2 on a project level. Only protein-coding genes were kept, and their numeric CNV values were further thresholded by a noise cutoff of 0.3:
You can access "Gene Level Copy Number Scores" from GISTIC with the code below:
query <- GDCquery(project = "TCGA-GBM",
data.category = "Copy Number Variation",
data.type = "Gene Level Copy Number Scores",
access = "open")
GDCdownload(query)
scores <- GDCprepare(query)
scores[1:5,1:5]
You can visualize the data using the R/Bioconductor package complexHeatmap.
scores.matrix <- scores %>%
dplyr::select(-c(1:3)) %>% # Removes metadata from the first 3 columns
as.matrix
rownames(scores.matrix) <- paste0(scores$`Gene Symbol`,"_",scores$Cytoband)
# gain in more than 200 samples
gain.more.than.twohundred.samples <- which(rowSums(scores.matrix == 1) > 200)
# loss in more than 200 samples
loss.more.than.twohundred.samples <- which(rowSums(scores.matrix == -1) > 200)
lines.selected <- c(gain.more.than.twohundred.samples,loss.more.than.twohundred.samples)
Heatmap(scores.matrix[lines.selected,],
show_column_names = FALSE,
show_row_names = TRUE,
row_names_gp = gpar(fontsize = 8),
col = circlize::colorRamp2(c(-1,0,1), colors = c("red","white","blue")))
The processed DNA methylation data measure the level of methylation at known CpG sites as beta values, calculated from array intensities (Level 2 data) as Beta = $M/(M+U)$ [@zhou2017comprehensive] which ranges from 0 being unmethylated and 1 fully methylated.
More information about the DNA methylation pipeline is available at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Methylation_LO_Pipeline/.
We will download two Glioblastoma (GBM) as a summarizedExperiment object.
query_met.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "DNA Methylation",
platform = "Illumina Human Methylation 27",
barcode = c("TCGA-02-0116-01A","TCGA-14-3477-01A-01D"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38,summarizedExperiment = TRUE)
data.hg38
You can access the probes information with rowRanges
.
data.hg38 %>% rowRanges %>% as.data.frame %>% head
You can access the samples metadata with colData
.
data.hg38 %>% colData %>% as.data.frame
You can access the DNA methylation levels with assay
.
data.hg38 %>% assay %>% head %>% as.data.frame
# plot 10 most variable probes
data.hg38 %>%
assay %>%
rowVars %>%
order(decreasing = TRUE) %>%
head(10) -> idx
pal_methylation <- colorRampPalette(c("#000436",
"#021EA9",
"#1632FB",
"#6E34FC",
"#C732D5",
"#FD619D",
"#FF9965",
"#FFD32B",
"#FFFC5A"))(100)
Heatmap(assay(data.hg38)[idx,],
show_column_names = TRUE,
show_row_names = TRUE,
name = "Methylation Beta-value",
row_names_gp = gpar(fontsize = 8),
column_names_gp = gpar(fontsize = 8),
col = circlize::colorRamp2(seq(0, 1, by = 1/99), pal_methylation))
Please, check our ATAC-seq Workshop: http://rpubs.com/tiagochst/atac_seq_workshop
sessionInfo()
All source code used to produce the workshops are available at https://github.com/tiagochst/ELMER_workshop_2019.
We have a set of recorded videos, explaining some of the workshops.