Workshop description¶

Aims¶

This workshop will:

Introduce you to the TCGA (The Cancer Genome Atlas) data available at the NCI's Genomic Data Commons (GDC).
Demonstrate how to access and import TCGA data using the R/Bioconductor package TCGAbiolinks [@colaprico2015tcgabiolinks].
Show how to visualize copy number alteration and mutation data using the R/Bioconductor package maftools.

Useful links¶

The NCI's Genomic Data Commons (GDC): https://gdc.cancer.gov/
TCGAbiolinks package: http://bioconductor.org/packages/TCGAbiolinks/
maftools package: http://bioconductor.org/packages/maftools/

R/Bioconductor packages used¶

Pre-requisites¶

Basic knowledge of R syntax
Understand the pipe operator ("%>%") (help material https://r4ds.had.co.nz/pipes.html)
Understand the SummarizedExperiment data structure (help material http://bioconductor.org/packages/SummarizedExperiment/)

Introduction¶

Understanding GDC (Genomic Data Commons portal)¶

In this workshop we will access TCGA (The Cancer Genome Atlas) data available at the NCI's Genomic Data Commons (GDC). Before we start, it is important to know that the data is deposited in two different databases:

The legacy archive (https://portal.gdc.cancer.gov/legacy-archive/search/f) which contains unharmonized legacy data from repositories that predate the GDC (e.g. CGHub). Also, legacy data is not actively maintained, processed, or harmonized by the GDC (Source: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Legacy_Archive/)
The harmonized database (https://portal.gdc.cancer.gov/), which contains data from many cancer projects processed using standardized pipelines and the reference genome GRCh38 (hg38). Because all projects go through the same pipelines, you can compare multiple cancer types, or the same cancer type across multiple projects. You can find more information about the pipelines supporting data harmonization at https://gdc.cancer.gov/about-data/gdc-data-harmonization and at https://docs.gdc.cancer.gov/Encyclopedia/pages/Harmonized_Data/.

An overview of the web portal is available at https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started/.

Also, a more in-depth comparison between the legacy and Harmonized data was recently published:

Gao, Galen F., et al. “Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons’ Data.” Cell systems 9.1 (2019): 24-34. (https://doi.org/10.1016/j.cels.2019.06.006)

Understanding TCGA data¶

Data access¶

GDC provides the data with two access levels:

Open: includes high-level genomic data that is not individually identifiable, as well as most clinical and all biospecimen data elements.
Controlled: includes individually identifiable data such as low-level genomic sequencing data, germline variants, SNP6 genotype data, and certain clinical data elements.

You can find more information about those two levels and how to get access to controlled data at: https://gdc.cancer.gov/access-data/data-access-processes-and-tools.

TCGA barcode description¶

Each TCGA sample has a unique identifier called a TCGA barcode, which encodes important information about the sample. A description of the barcode is shown below (Source: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/).

You can find a table with all the codes and their meanings at https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables.
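As a rough illustration of how a barcode can be split into its fields, here is a minimal base R sketch. The helper name parse_tcga_barcode is hypothetical (it is not part of TCGAbiolinks), and it assumes a full seven-field aliquot barcode such as the ones used later in this workshop.

```r
# Hypothetical helper: split a full TCGA aliquot barcode into its fields.
# Layout assumed: Project-TSS-Participant-Sample/Vial-Portion/Analyte-Plate-Center
parse_tcga_barcode <- function(barcode) {
  fields <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(fields) == 7)
  list(
    project     = fields[1],                # always "TCGA"
    tss         = fields[2],                # tissue source site
    participant = fields[3],                # study participant
    sample_type = substr(fields[4], 1, 2),  # e.g. "01" primary tumor, "02" recurrent tumor
    vial        = substr(fields[4], 3, 3),
    portion     = substr(fields[5], 1, 2),
    analyte     = substr(fields[5], 3, 3),  # e.g. "D" DNA, "R" RNA
    plate       = fields[6],
    center      = fields[7]
  )
}

parsed <- parse_tcga_barcode("TCGA-14-0736-02A-01R-2005-01")
parsed$sample_type  # "02", a recurrent solid tumor
parsed$analyte      # "R", an RNA analyte
```

The first three fields together ("TCGA-14-0736") identify the patient, which is why clinical tables are keyed by that shorter barcode.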

Data structure¶

GDC data can be filtered by several fields, such as project (TCGA, TARGET, etc.), data category (Transcriptome Profiling, DNA Methylation, Clinical, etc.), data type (Gene Expression Quantification, Isoform Expression Quantification, Methylation Beta Value, etc.), experimental strategy (miRNA-Seq, RNA-Seq, etc.), workflow type, platform and access type, among others.

In terms of data granularity, a project has data in several categories, and each category contains several data types that might have been produced with different workflows, experimental strategies and platforms. For example, the data type "Gene Expression Quantification" belongs to the data category "Transcriptome Profiling", so selecting that data type implies that category.

You can find the entry possibilities for each filter at the repository page of the database at https://portal.gdc.cancer.gov/repository.

The SummarizedExperiment data structure¶

Before we start, it is important to know that the R/Bioconductor environment provides a data structure called SummarizedExperiment, which was created to hold sample metadata (age, gender, etc.), genomics data (e.g. DNA methylation beta values) and genomics metadata (chr, start, end, gene symbol) in the same object. You can access the sample metadata with the colData function, the genomics data with assays (or assay for a single matrix) and the genomics metadata with rowRanges.
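To make the three components concrete, the sketch below builds a tiny SummarizedExperiment by hand. All probe names, coordinates, beta values and sample annotations are made up for illustration; real objects returned by GDCprepare hold many more rows and columns.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges, IRanges and S4Vectors

# Toy genomics data: 3 probes (rows) x 2 samples (columns), values are invented
beta <- matrix(c(0.10, 0.85, 0.40,
                 0.20, 0.80, 0.95),
               nrow = 3,
               dimnames = list(c("probe1", "probe2", "probe3"),
                               c("sampleA", "sampleB")))

# Toy genomics metadata: one genomic range per probe plus a gene symbol column
probes <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
                  ranges = IRanges(start = c(100, 500, 900), width = 2),
                  gene_symbol = c("GENE1", "GENE1", "GENE2"))
names(probes) <- rownames(beta)

# Toy sample metadata: one row per sample, in the same order as the columns
samples <- DataFrame(age = c(54, 61),
                     gender = c("female", "male"),
                     row.names = colnames(beta))

se <- SummarizedExperiment(assays = list(beta = beta),
                           rowRanges = probes,
                           colData = samples)

colData(se)        # sample metadata
assay(se, "beta")  # genomics data
rowRanges(se)      # genomics metadata
```

Because the three tables live in one object, subsetting such as se[, se$gender == "female"] keeps the data and both kinds of metadata in sync.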

Loading required packages¶

suppressMessages({
    library(TCGAbiolinks)
    library(MultiAssayExperiment)
    library(maftools)
    library(dplyr)
    library(ComplexHeatmap)
})

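# Indexed clinical data for the TCGA-COAD project
clinical <- GDCquery_clinic("TCGA-COAD")
head(clinical)

# Clinical supplement files in BCR Biotab format for the TCGA-ACC project
query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query)
# GDCprepare returns a named list with one table per Biotab file
clinical.BCRtab.all <- GDCprepare(query)
names(clinical.BCRtab.all)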

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-ACC
--------------------
oo Filtering results
--------------------
ooo By data.format
ooo By data.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-ACC
Of the 7 files for download 7 already exist.
All samples have been already downloaded

  |========================================                              |  57%

Warning message:
“Duplicated column names deduplicated: 'metastatic_tumor_site' => 'metastatic_tumor_site_1' [39]”

  |======================================================================| 100%

clinical.BCRtab.all$clinical_drug_acc  %>% 
  head  %>% 
  as.data.frame
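In the printed clinical_drug_acc table, the first two data rows are not patients: one repeats alternative column names and the other lists CDE identifiers. Missing values are also encoded as strings such as "[Not Available]". A minimal base R sketch of one way to tidy such a table is shown below; the toy table reuses a few values from that output, and the clean-up steps are a suggestion rather than something TCGAbiolinks does for you.

```r
# Toy stand-in for a BCR Biotab table: two extra header rows, then patient rows
toy.drug <- data.frame(
  bcr_patient_barcode = c("bcr_patient_barcode", "CDE_ID:2003301",
                          "TCGA-OR-A5JM", "TCGA-OR-A5JY"),
  pharmaceutical_therapy_drug_name = c("drug_name", "CDE_ID:2975232",
                                       "sunitinib", "xeloda"),
  pharmaceutical_tx_started_days_to = c("days_to_drug_therapy_start",
                                        "CDE_ID:3392465", "378", "78"),
  total_dose = c("total_dose", "CDE_ID:1515",
                 "[Not Available]", "[Not Available]"),
  stringsAsFactors = FALSE
)

# Drop the two extra header rows and reset the row names
drug.clean <- toy.drug[-c(1, 2), ]
rownames(drug.clean) <- NULL

# Turn the placeholder string into a real missing value
drug.clean[drug.clean == "[Not Available]"] <- NA

# Columns arrive as character; convert the ones that are really numeric
drug.clean$pharmaceutical_tx_started_days_to <-
  as.numeric(drug.clean$pharmaceutical_tx_started_days_to)

drug.clean
```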

RNA-Seq data¶

The RNA-Seq pipeline produces raw counts, FPKM and FPKM-UQ quantifications and is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/.

The following options are used to search mRNA results using TCGAbiolinks:

data.category: "Transcriptome Profiling"
data.type: "Gene Expression Quantification"
workflow.type: "HTSeq - Counts", "HTSeq - FPKM", "HTSeq - FPKM-UQ"

Here is an example that downloads the raw counts, which can be used with DESeq2 (http://bioconductor.org/packages/DESeq2/) for differential expression analysis.

query.exp.hg38 <- GDCquery(project = "TCGA-GBM", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts",
                  barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
raw.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
ooo By barcode
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 2 files for download 2 already exist.
All samples have been already downloaded

|====================================================|100%                      Completed after 1 s

head(raw.counts)
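DESeq2 expects an integer matrix with genes as rows and samples as columns, whereas raw.counts is a table whose first column (X1) holds the gene identifiers. The sketch below shows one possible conversion on a toy table. The first three rows reuse values printed by head(raw.counts); the "__no_feature" row is a hypothetical example of the summary rows that HTSeq appends, included on the assumption that such rows may be present and should be dropped.

```r
# Toy stand-in for raw.counts (real tables have tens of thousands of rows)
raw.counts.toy <- data.frame(
  X1 = c("ENSG00000000003.13", "ENSG00000000005.5",
         "ENSG00000000419.11", "__no_feature"),
  `TCGA-14-0736-02A-01R-2005-01` = c(4282, 34, 899, 1000),
  `TCGA-06-0211-02A-02R-2005-01` = c(3853, 8, 1785, 2000),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only rows that are Ensembl gene identifiers
is.gene <- grepl("^ENSG", raw.counts.toy$X1)

# Build the matrix from the sample columns only
count.matrix <- as.matrix(raw.counts.toy[is.gene, -1])

# Use Ensembl IDs without the version suffix as row names
rownames(count.matrix) <- sub("\\.[0-9]+$", "", raw.counts.toy$X1[is.gene])

# Store the counts as integers
storage.mode(count.matrix) <- "integer"

count.matrix
```

Stripping the version suffix is optional; it is done here only because many annotation resources are keyed by the unversioned identifier.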

query.exp.hg38 <- GDCquery(project = "TCGA-GBM", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ",
                  barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
fpkm.uq.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
ooo By barcode
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 2 files for download 2 already exist.
All samples have been already downloaded

|====================================================|100%                      Completed after 0 s

head(fpkm.uq.counts)

Mutation¶

TCGAbiolinks provides a few functions to download mutation data from GDC. In this workshop we use GDCquery_Maf, which downloads MAF (Mutation Annotation Format) files aligned against hg38.

The example below downloads the MAF file produced by the muse variant calling pipeline. The pipeline options are: muse, varscan2, somaticsniper and mutect. For more information, please see the GDC documentation.

maf <- GDCquery_Maf("COAD", pipelines = "muse")
maf %>% head %>% as.data.frame

============================================================================
 For more information about MAF data please read the following GDC manual and web pages:
 GDC manual: https://gdc-docs.nci.nih.gov/Data/PDF/Data_UG.pdf
 https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/
 https://gdc.cancer.gov/about-gdc/variant-calling-gdc
============================================================================
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-COAD
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-COAD
Of the 1 files for download 1 already exist.
All samples have been already downloaded

|=================================================================| 100%  286 MB

Then visualize the results using the maftools package.

# create maftools input
maftools.input <- maf %>% read.maf

-Validating
-Silent variants: 77997 
-Summarizing
--Mutiple centers found
BCM;WUGSC;WUGSC;BCM;BCM;BI--Possible FLAGS among top ten genes:
  TTN
  MUC16
  SYNE1
  OBSCN
-Processing clinical data
--Missing clinical data
-Finished in 14.3s elapsed (12.8s cpu)

# Check summary
plotmafSummary(maf = maftools.input, 
               rmOutlier = TRUE, 
               addStat = 'median', 
               dashboard = TRUE)

oncoplot(maf = maftools.input, 
         top = 10, 
         removeNonMutated = TRUE)

# classifies Single Nucleotide Variants into Transitions and Transversions
# (the result is stored under a name that does not shadow the titv function)
titv.res <- titv(maf = maftools.input, 
                 plot = FALSE, 
                 useSyn = TRUE)
plotTiTv(res = titv.res)

You can extract the per-sample summary from the MAF object.

getSampleSummary(maftools.input) %>% head
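The sample summary is essentially a count of non-silent variants per sample and variant classification. The base R sketch below reproduces that idea on a toy table. The gene symbols and classifications are taken from the first rows of the MAF shown in this workshop, while the sample names and the list of classes treated as non-silent are assumptions made for illustration.

```r
# Toy MAF-like table; sample1/sample2 are hypothetical sample names
toy.maf <- data.frame(
  Hugo_Symbol = c("ATAD3B", "PLCH2", "CHD5", "IFNLR1", "YTHDF2", "LRP8"),
  Variant_Classification = c("Nonsense_Mutation", "Silent", "Missense_Mutation",
                             "Missense_Mutation", "3'UTR", "Silent"),
  Tumor_Sample_Barcode = c("sample1", "sample1", "sample1",
                           "sample2", "sample2", "sample2"),
  stringsAsFactors = FALSE
)

# Classes counted as non-silent in this sketch (an assumption, adjust as needed)
non.silent <- c("Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
                "Splice_Site", "Translation_Start_Site",
                "Frame_Shift_Del", "Frame_Shift_Ins",
                "In_Frame_Del", "In_Frame_Ins")

# Keep non-silent variants, then tabulate sample x classification
coding <- toy.maf[toy.maf$Variant_Classification %in% non.silent, ]
summary.table <- table(coding$Tumor_Sample_Barcode,
                       coding$Variant_Classification)

summary.table
rowSums(summary.table)  # total non-silent variants per sample
```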

Copy number alteration data¶

The Copy Number Variation Analysis Pipeline is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/

Numeric focal-level Copy Number Variation (CNV) values were generated with "Masked Copy Number Segment" files from tumor aliquots using GISTIC2 on a project level. Only protein-coding genes were kept, and their numeric CNV values were further thresholded by a noise cutoff of 0.3:

Genes with focal CNV values smaller than -0.3 are categorized as a "loss" (-1)
Genes with focal CNV values larger than 0.3 are categorized as a "gain" (+1)
Genes with focal CNV values between and including -0.3 and 0.3 are categorized as "neutral" (0).
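The thresholding rule above can be written in a few lines of base R. This is only a sketch of the rule as stated, not the GDC implementation; the function name threshold_cnv and the input values are made up.

```r
# Map numeric focal CNV values to loss (-1), neutral (0) or gain (+1)
threshold_cnv <- function(focal, cutoff = 0.3) {
  ifelse(focal < -cutoff, -1L,
         ifelse(focal > cutoff, 1L, 0L))
}

focal.values <- c(-0.8, -0.3, 0, 0.3, 0.31)
threshold_cnv(focal.values)
# -1  0  0  0  1
```

Values exactly equal to -0.3 or 0.3 are classified as neutral, matching the "between and including" wording.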

You can access "Gene Level Copy Number Scores" from GISTIC with the code below:

query <- GDCquery(project = "TCGA-GBM",
             data.category = "Copy Number Variation",
             data.type = "Gene Level Copy Number Scores",              
             access = "open")
GDCdownload(query)
scores <- GDCprepare(query)
scores[1:5,1:5]

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 1 files for download 1 already exist.
All samples have been already downloaded
Reading GISTIC file
Reading file: GDCdata/TCGA-GBM/harmonized/Copy_Number_Variation/Gene_Level_Copy_Number_Scores/45e4aef6-2dbf-405d-a033-722241c79565/GBM.focal_score_by_genes.txt

You can visualize the data using the R/Bioconductor package ComplexHeatmap.

scores.matrix <- scores %>% 
  dplyr::select(-c(1:3)) %>%  # Removes metadata from the first 3 columns
  as.matrix

rownames(scores.matrix) <- paste0(scores$`Gene Symbol`,"_",scores$Cytoband)

# gain in more than 200 samples
gain.more.than.twohundred.samples <- which(rowSums(scores.matrix == 1) > 200)

# loss in more than 200 samples
loss.more.than.twohundred.samples <- which(rowSums(scores.matrix == -1) > 200)

lines.selected <- c(gain.more.than.twohundred.samples,loss.more.than.twohundred.samples)

Heatmap(scores.matrix[lines.selected,],
        show_column_names = FALSE, 
        show_row_names = TRUE,
        row_names_gp = gpar(fontsize = 8),
        col = circlize::colorRamp2(c(-1,0,1), colors = c("red","white","blue")))
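The row selection used for the heatmap counts, for each gene, the number of samples with a gain (1) or a loss (-1) and keeps genes above a sample threshold. The toy example below shows the same logic on a small made-up matrix, with a threshold of 2 samples instead of 200; the gene and sample names are hypothetical.

```r
# Toy thresholded score matrix: 3 genes (rows) x 4 samples (columns)
toy.scores <- matrix(c( 1,  1,  1,  0,
                       -1, -1,  0, -1,
                        0,  1,  0,  0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("geneA_7p11.2", "geneB_9p21.3", "geneC_1p36.33"),
                                     paste0("sample", 1:4)))

min.samples <- 2

# Comparing a matrix to a value gives a logical matrix; rowSums counts the TRUEs
n.gain <- rowSums(toy.scores == 1)
n.loss <- rowSums(toy.scores == -1)

# Keep genes gained or lost in more than min.samples samples
selected <- names(which(n.gain > min.samples | n.loss > min.samples))
selected

# drop = FALSE keeps a matrix even if a single gene is selected
toy.scores[selected, , drop = FALSE]
```

Combining both conditions with | avoids listing a gene twice if it passes both thresholds.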

DNA methylation data¶

The processed DNA methylation data measure the level of methylation at known CpG sites as beta values, calculated from the methylated (M) and unmethylated (U) array intensities (Level 2 data) as $\beta = M/(M+U)$ [@zhou2017comprehensive]. Beta values range from 0 (unmethylated) to 1 (fully methylated).

More information about the DNA methylation pipeline is available at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Methylation_LO_Pipeline/.
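As a small worked example of the beta value definition, the sketch below computes M/(M+U) for three made-up probes. The intensities are invented, and the function name beta_value is hypothetical.

```r
# Beta value: methylated intensity divided by total intensity
beta_value <- function(M, U) {
  M / (M + U)
}

M <- c(9000, 500, 3000)   # methylated intensities (made up)
U <- c(1000, 9500, 3000)  # unmethylated intensities (made up)

beta_value(M, U)
# 0.90 0.05 0.50
```

The first probe is mostly methylated, the second mostly unmethylated, and the third is methylated in about half of the signal.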

We will download data for two glioblastoma (GBM) samples as a SummarizedExperiment object.

query_met.hg38 <- GDCquery(project = "TCGA-GBM", 
                           data.category = "DNA Methylation", 
                           platform = "Illumina Human Methylation 27", 
                           barcode = c("TCGA-02-0116-01A","TCGA-14-3477-01A-01D"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38,summarizedExperiment = TRUE)

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By platform
ooo By barcode
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 2 files for download 2 already exist.
All samples have been already downloaded

|====================================================|100%                      Completed after 0 s

Joining, by = c("Composite.Element.REF", "Chromosome", "Start", "End", "Gene_Symbol", "Gene_Type", "Transcript_ID", "Position_to_TSS", "CGI_Coordinate", "Feature_Type")
Starting to add information to samples
 => Add clinical information to samples
Add FFPE information. More information at: 
=> https://cancergenome.nih.gov/cancersselected/biospeccriteria 
=> http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html
 => Adding subtype information to samples
gbm subtype information from:doi:10.1016/j.cell.2015.12.028

data.hg38

class: RangedSummarizedExperiment 
dim: 27578 2 
metadata(1): data_release
assays(1): ''
rownames(27578): cg00000292 cg00002426 ... cg27662877 cg27665659
rowData names(7): Composite.Element.REF Gene_Symbol ... CGI_Coordinate
  Feature_Type
colnames(2): TCGA-02-0116-01A-01D-0199-05 TCGA-14-3477-01A-01D-0915-05
colData names(113): sample patient ...
  subtype_Telomere.length.estimate.in.blood.normal..Kb.
  subtype_Telomere.length.estimate.in.tumor..Kb.

You can access the probe information with rowRanges.

data.hg38 %>% rowRanges %>% as.data.frame %>% head

You can access the sample metadata with colData.

data.hg38 %>% colData %>% as.data.frame

You can access the DNA methylation levels with assay.

data.hg38 %>% assay %>% head %>% as.data.frame

# plot 10 most variable probes
data.hg38 %>% 
  assay %>% 
  rowVars %>% 
  order(decreasing = TRUE) %>% 
  head(10) -> idx

pal_methylation <- colorRampPalette(c("#000436",
                                      "#021EA9",
                                      "#1632FB",
                                      "#6E34FC",
                                      "#C732D5",
                                      "#FD619D",
                                      "#FF9965",
                                      "#FFD32B",
                                      "#FFFC5A"))(100)

Heatmap(assay(data.hg38)[idx,],
        show_column_names = TRUE,
        show_row_names = TRUE,
        name = "Methylation Beta-value",
        row_names_gp = gpar(fontsize = 8),
        column_names_gp = gpar(fontsize = 8),
        col = circlize::colorRamp2(seq(0, 1, by = 1/99), pal_methylation))
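The probe selection above ranks probes by their variance across samples. The base R sketch below reproduces the idea on a toy matrix, using apply with var in place of rowVars so that it runs without extra packages. The beta values are rounded from the assay output shown in this workshop, and the short sample names are placeholders.

```r
# Toy beta matrix: 4 probes x 2 samples, including a probe with missing values
toy.beta <- matrix(c(0.78, 0.11,
                     0.06, 0.03,
                     NA,   NA,
                     0.62, 0.08),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("cg00000292", "cg00002426",
                                     "cg00006414", "cg00005847"),
                                   c("s1", "s2")))

# Variance of each probe across samples (NA for the all-missing probe)
probe.var <- apply(toy.beta, 1, var)

# order() places NA values last, so all-missing probes are never ranked first
top.idx <- head(order(probe.var, decreasing = TRUE), 2)
rownames(toy.beta)[top.idx]
# "cg00000292" "cg00005847"
```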

ATAC-Seq data¶

Please check our ATAC-seq workshop: http://rpubs.com/tiagochst/atac_seq_workshop

Session information¶

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] ComplexHeatmap_2.1.0        dplyr_0.8.3                
 [3] maftools_2.0.20             MultiAssayExperiment_1.8.1 
 [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
 [7] BiocParallel_1.16.6         matrixStats_0.55.0         
 [9] Biobase_2.42.0              GenomicRanges_1.34.0       
[11] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[13] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[15] TCGAbiolinks_2.13.6        

loaded via a namespace (and not attached):
  [1] uuid_0.1-2                  backports_1.1.4            
  [3] circlize_0.4.8              aroma.light_3.12.0         
  [5] NMF_0.21.0                  plyr_1.8.4                 
  [7] selectr_0.4-1               ConsensusClusterPlus_1.46.0
  [9] repr_1.0.1                  lazyeval_0.2.2             
 [11] splines_3.5.1               ggplot2_3.2.1              
 [13] gridBase_0.4-7              sva_3.30.1                 
 [15] digest_0.6.20               foreach_1.4.7              
 [17] htmltools_0.3.6             magrittr_1.5               
 [19] memoise_1.1.0               cluster_2.1.0              
 [21] doParallel_1.0.15           limma_3.38.3               
 [23] Biostrings_2.50.2           readr_1.3.1                
 [25] annotate_1.60.1             wordcloud_2.6              
 [27] R.utils_2.9.0               prettyunits_1.0.2          
 [29] colorspace_1.4-1            blob_1.2.0                 
 [31] rvest_0.3.4                 ggrepel_0.8.1              
 [33] xfun_0.9                    crayon_1.3.4               
 [35] RCurl_1.95-4.12             jsonlite_1.6               
 [37] genefilter_1.64.0           zeallot_0.1.0              
 [39] survival_2.44-1.1           zoo_1.8-6                  
 [41] iterators_1.0.12            glue_1.3.1                 
 [43] survminer_0.4.6             registry_0.5-1             
 [45] gtable_0.3.0                zlibbioc_1.28.0            
 [47] XVector_0.22.0              GetoptLong_0.1.7           
 [49] shape_1.4.4                 scales_1.0.0               
 [51] DESeq_1.34.1                rngtools_1.4               
 [53] DBI_1.0.0                   edgeR_3.24.3               
 [55] bibtex_0.4.2                ggthemes_4.2.0             
 [57] Rcpp_1.0.2                  xtable_1.8-4               
 [59] progress_1.2.2              clue_0.3-57                
 [61] bit_1.1-14                  matlab_1.0.2               
 [63] km.ci_0.5-2                 httr_1.4.1                 
 [65] RColorBrewer_1.1-2          pkgconfig_2.0.2            
 [67] XML_3.98-1.19               R.methodsS3_1.7.1          
 [69] locfit_1.5-9.1              reshape2_1.4.3             
 [71] tidyselect_0.2.5            rlang_0.4.0                
 [73] AnnotationDbi_1.44.0        munsell_0.5.0              
 [75] tools_3.5.1                 downloader_0.4             
 [77] generics_0.0.2              RSQLite_2.1.2              
 [79] broom_0.5.2                 evaluate_0.14              
 [81] stringr_1.4.0               knitr_1.24                 
 [83] bit64_0.9-7                 survMisc_0.5.5             
 [85] purrr_0.3.2                 EDASeq_2.16.3              
 [87] nlme_3.1-141                R.oo_1.22.0                
 [89] xml2_1.2.2                  biomaRt_2.38.0             
 [91] compiler_3.5.1              curl_4.1                   
 [93] png_0.1-7                   ggsignif_0.6.0             
 [95] tibble_2.1.3                geneplotter_1.60.0         
 [97] stringi_1.4.3               GenomicFeatures_1.34.8     
 [99] lattice_0.20-38             IRdisplay_0.7.0            
[101] Matrix_1.2-17               KMsurv_0.1-5               
[103] vctrs_0.2.0                 pillar_1.4.2               
[105] lifecycle_0.1.0             GlobalOptions_0.1.0        
[107] data.table_1.12.2           bitops_1.0-6               
[109] rtracklayer_1.42.2          R6_2.4.0                   
[111] latticeExtra_0.6-28         hwriter_1.3.2              
[113] ShortRead_1.40.0            gridExtra_2.3              
[115] codetools_0.2-16            assertthat_0.2.1           
[117] pkgmaker_0.27               rjson_0.2.20               
[119] withr_2.1.2                 GenomicAlignments_1.18.1   
[121] Rsamtools_1.34.1            GenomeInfoDbData_1.2.1     
[123] mgcv_1.8-28                 hms_0.5.1                  
[125] IRkernel_1.0.2              tidyr_1.0.0                
[127] ggpubr_0.2.3                pbdZMQ_0.3-3               
[129] base64enc_0.1-3

Workshop materials¶

Source code¶

All source code used to produce the workshops is available at https://github.com/tiagochst/ELMER_workshop_2019.

Workshops HTMLs¶

ELMER data Workshop HTML: http://rpubs.com/tiagochst/elmer-data-workshop-2019
ELMER analysis Workshop HTML: http://rpubs.com/tiagochst/ELMER_workshop
ATAC-seq Workshop HTML: http://rpubs.com/tiagochst/atac_seq_workshop

Workshop videos¶

We have a set of recorded videos explaining some of the workshops.

All videos playlist: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt
ELMER algorithm: https://youtu.be/PzC31K9vfu0
ELMER data: https://youtu.be/R00wG--tGo8
ELMER analysis part1: https://youtu.be/bcd4uyxrZCw
ELMER analysis part2: https://youtu.be/vcJ_DSCt4Mo
ELMER summarizing several analyses: https://youtu.be/moLeik7JjLk
ATAC-Seq workshop: https://youtu.be/3ftZecz0lU4

Output of head(clinical) for TCGA-COAD:

submitter_id	year_of_diagnosis	classification_of_tumor	last_known_disease_status	updated_datetime	primary_diagnosis	tumor_stage	age_at_diagnosis	morphology	days_to_last_known_disease_status	⋯	treatments_radiation_treatment_or_therapy	treatments_radiation_days_to_treatment_start	treatments_radiation_treatment_effect	treatments_radiation_initial_disease_status	treatments_radiation_regimen_or_line_of_therapy	treatments_radiation_treatment_anatomic_site	treatments_radiation_treatment_outcome	treatments_radiation_days_to_treatment_end	bcr_patient_barcode	disease
<chr>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>	<int>	<chr>	<lgl>	⋯	<chr>	<lgl>	<lgl>	<lgl>	<lgl>	<lgl>	<lgl>	<lgl>	<chr>	<chr>
TCGA-3L-AA1B	2013	not reported	not reported	2019-08-08T16:33:45.855164-05:00	Adenocarcinoma, NOS	stage i	22379	8140/3	NA	⋯	no	NA	NA	NA	NA	NA	NA	NA	TCGA-3L-AA1B	COAD
TCGA-4N-A93T	2013	not reported	not reported	2019-08-08T16:33:45.855164-05:00	Adenocarcinoma, NOS	stage iiib	24523	8140/3	NA	⋯	no	NA	NA	NA	NA	NA	NA	NA	TCGA-4N-A93T	COAD
TCGA-4T-AA8H	2013	not reported	not reported	2019-08-08T16:33:45.855164-05:00	Mucinous adenocarcinoma	stage iia	15494	8480/3	NA	⋯	no	NA	NA	NA	NA	NA	NA	NA	TCGA-4T-AA8H	COAD
TCGA-5M-AAT4	2009	not reported	not reported	2019-08-08T16:33:45.855164-05:00	Adenocarcinoma, NOS	stage iv	27095	8140/3	NA	⋯	no	NA	NA	NA	NA	NA	NA	NA	TCGA-5M-AAT4	COAD
TCGA-5M-AAT6	2009	not reported	not reported	2019-08-08T16:33:45.855164-05:00	Adenocarcinoma, NOS	stage iv	14852	8140/3	NA	⋯	no	NA	NA	NA	NA	NA	NA	NA	TCGA-5M-AAT6	COAD
TCGA-5M-AATE	2011	not reported	not reported	2019-08-08T16:33:45.855164-05:00	Adenocarcinoma, NOS	stage iia	27870	8140/3	NA	⋯	no	NA	NA	NA	NA	NA	NA	NA	TCGA-5M-AATE	COAD

Output of clinical.BCRtab.all$clinical_drug_acc %>% head %>% as.data.frame:

bcr_patient_uuid	bcr_patient_barcode	bcr_drug_barcode	bcr_drug_uuid	form_completion_date	pharmaceutical_therapy_drug_name	clinical_trial_drug_classification	pharmaceutical_therapy_type	pharmaceutical_tx_started_days_to	pharmaceutical_tx_ongoing_indicator	⋯	regimen_indication	regimen_indication_notes	regimen_number	route_of_administration	stem_cell_transplantation	stem_cell_transplantation_type	therapy_type_notes	total_dose	total_dose_units	tx_on_clinical_trial
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	⋯	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
bcr_patient_uuid	bcr_patient_barcode	bcr_drug_barcode	bcr_drug_uuid	form_completion_date	drug_name	clinical_trail_drug_classification	therapy_type	days_to_drug_therapy_start	therapy_ongoing	⋯	regimen_indication	regimen_indication_notes	regimen_number	route_of_administration	stem_cell_transplantation	stem_cell_transplantation_type	therapy_type_notes	total_dose	total_dose_units	tx_on_clinical_trial
CDE_ID:	CDE_ID:2003301	CDE_ID:	CDE_ID:	CDE_ID:	CDE_ID:2975232	CDE_ID:3378323	CDE_ID:2793530	CDE_ID:3392465	CDE_ID:3103479	⋯	CDE_ID:2793511	CDE_ID:2793516	CDE_ID:2744948	CDE_ID:2003586	CDE_ID:3090688	CDE_ID:2730901	CDE_ID:2001762	CDE_ID:1515	CDE_ID:3088785	CDE_ID:3925111
FB54458D-C373-46C2-841E-82663E13EFAA	TCGA-OR-A5JM	TCGA-OR-A5JM-D49539	B4F66718-B197-4EFB-AF9B-380B238BF6B5	2013-10-3	sunitinib	[Not Available]	Targeted Molecular therapy	378	NO	⋯	[Not Available]	[Not Applicable]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	NO
FB54458D-C373-46C2-841E-82663E13EFAA	TCGA-OR-A5JM	TCGA-OR-A5JM-D49540	2B5DD32A-1F38-4899-A90F-77C16B1F900E	2013-10-3	ketoconazole	[Not Available]	Targeted Molecular therapy	378	NO	⋯	[Not Available]	[Not Applicable]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	NO
344F7EA4-2BD4-4F2D-89F2-9FC7F572A3D6	TCGA-OR-A5JY	TCGA-OR-A5JY-D48310	1125569B-151B-4740-A864-9F8BFC63E2B4	2013-9-11	xeloda	[Not Available]	Chemotherapy	78	NO	⋯	[Not Available]	[Not Applicable]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	NO
6D980941-219E-40A0-9B41-E84C3FB0BD1A	TCGA-OR-A5K2	TCGA-OR-A5K2-D48215	3C5B0FF6-2A7C-456E-8662-D38484AA7F3C	2013-9-10	Adriamycin	[Not Available]	Chemotherapy	118	NO	⋯	[Not Available]	[Not Applicable]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	[Not Available]	NO

Output of head(raw.counts):

X1	TCGA-14-0736-02A-01R-2005-01	TCGA-06-0211-02A-02R-2005-01
<chr>	<dbl>	<dbl>
ENSG00000000003.13	4282	3853
ENSG00000000005.5	34	8
ENSG00000000419.11	899	1785
ENSG00000000457.12	293	551
ENSG00000000460.15	143	378
ENSG00000000938.11	286	683

Output of head(fpkm.uq.counts):

X1	TCGA-14-0736-02A-01R-2005-01	TCGA-06-0211-02A-02R-2005-01
<chr>	<dbl>	<dbl>
ENSG00000242268.2	1532.591	3517.609
ENSG00000270112.3	4253.036	4978.410
ENSG00000167578.15	261186.594	135247.024
ENSG00000273842.1	0.000	0.000
ENSG00000078237.5	224251.000	180269.615
ENSG00000146083.10	117080.711	197064.420

Output of maf %>% head %>% as.data.frame:

Hugo_Symbol	Entrez_Gene_Id	Center	NCBI_Build	Chromosome	Start_Position	End_Position	Strand	Variant_Classification	Variant_Type	⋯	FILTER	CONTEXT	src_vcf_id	tumor_bam_uuid	normal_bam_uuid	case_id	GDC_FILTER	COSMIC	MC3_Overlap	GDC_Validation_Status
<chr>	<int>	<chr>	<chr>	<chr>	<int>	<int>	<chr>	<chr>	<chr>	⋯	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
ATAD3B	83858	BCM	GRCh38	chr1	1485803	1485803	+	Nonsense_Mutation	SNP	⋯	PASS	TCAGTCGACCC	9130f121-b7ce-460f-b90c-8e31add6cd10	7de9d9e2-c4a4-4311-826f-973cb7987c66	861ad835-790a-47ad-8c03-06f7eb7b5710	7a70f061-9a6f-408e-a416-7f5295ceba3b	NA	COSM1333470	True	Unknown
PLCH2	9651	BCM	GRCh38	chr1	2487195	2487195	+	Silent	SNP	⋯	PASS	AGGAGCCCTGC	9130f121-b7ce-460f-b90c-8e31add6cd10	7de9d9e2-c4a4-4311-826f-973cb7987c66	861ad835-790a-47ad-8c03-06f7eb7b5710	7a70f061-9a6f-408e-a416-7f5295ceba3b	NA	COSM1340725;COSM1340726	True	Unknown
CHD5	26038	BCM	GRCh38	chr1	6146395	6146395	+	Missense_Mutation	SNP	⋯	PASS	AGTTGCGATAC	9130f121-b7ce-460f-b90c-8e31add6cd10	7de9d9e2-c4a4-4311-826f-973cb7987c66	861ad835-790a-47ad-8c03-06f7eb7b5710	7a70f061-9a6f-408e-a416-7f5295ceba3b	NA	COSM911251	True	Unknown
IFNLR1	163702	BCM	GRCh38	chr1	24159060	24159060	+	Missense_Mutation	SNP	⋯	PASS	GGCCCGTGGCA	9130f121-b7ce-460f-b90c-8e31add6cd10	7de9d9e2-c4a4-4311-826f-973cb7987c66	861ad835-790a-47ad-8c03-06f7eb7b5710	7a70f061-9a6f-408e-a416-7f5295ceba3b	NA	COSM1340840	True	Unknown
YTHDF2	51441	BCM	GRCh38	chr1	28769136	28769136	+	3'UTR	SNP	⋯	PASS	AAAAAAAGAAA	9130f121-b7ce-460f-b90c-8e31add6cd10	7de9d9e2-c4a4-4311-826f-973cb7987c66	861ad835-790a-47ad-8c03-06f7eb7b5710	7a70f061-9a6f-408e-a416-7f5295ceba3b	NA	NA	True	Unknown
LRP8	7804	BCM	GRCh38	chr1	53289634	53289634	+	Silent	SNP	⋯	PASS	CGTTCGTGGAT	9130f121-b7ce-460f-b90c-8e31add6cd10	7de9d9e2-c4a4-4311-826f-973cb7987c66	861ad835-790a-47ad-8c03-06f7eb7b5710	7a70f061-9a6f-408e-a416-7f5295ceba3b	NA	NA	True	Unknown

Output of getSampleSummary(maftools.input) %>% head:

Tumor_Sample_Barcode	Missense_Mutation	Nonsense_Mutation	Nonstop_Mutation	Splice_Site	Translation_Start_Site	total
<fct>	<int>	<int>	<int>	<int>	<int>	<dbl>
TCGA-AA-A010-01A-01D-A17O-10	6440	691	5	128	3	7267
TCGA-CA-6717-01A-11D-1835-10	5989	773	8	124	5	6899
TCGA-AZ-4315-01A-01D-1408-10	5222	536	2	79	2	5841
TCGA-AA-3984-01A-02D-1981-10	3867	475	5	59	0	4406
TCGA-AA-A00N-01A-02D-A17O-10	3690	431	4	52	4	4181
TCGA-CK-4951-01A-01D-1408-10	3254	205	6	72	5	3542

Output of data.hg38 %>% assay %>% head %>% as.data.frame:

	TCGA-02-0116-01A-01D-0199-05	TCGA-14-3477-01A-01D-0915-05
	<dbl>	<dbl>
cg00000292	0.78495250	0.11314067
cg00002426	0.06028424	0.02843910
cg00003994	0.04285454	0.24091388
cg00005847	0.62490042	0.07579907
cg00006414	NA	NA
cg00007981	0.04020515	0.03530111

Output of scores[1:5,1:5] (a tibble: 5 × 5):
Gene Symbol	Gene ID	Cytoband	TCGA-19-1790-01B-01D-1224-01	TCGA-41-2572-01A-01D-1224-01
<chr>	<dbl>	<chr>	<dbl>	<dbl>
ENSG00000008128.21	0	1p36.33	0	0
ENSG00000008130.14	0	1p36.33	0	0
ENSG00000067606.14	0	1p36.33	0	0
ENSG00000078369.16	0	1p36.33	0	0
ENSG00000078808.15	0	1p36.33	0	0

Output of data.hg38 %>% rowRanges %>% as.data.frame %>% head (a data.frame: 6 × 12):
	seqnames	start	end	width	strand	Composite.Element.REF	Gene_Symbol	Gene_Type	Transcript_ID	Position_to_TSS	CGI_Coordinate	Feature_Type
	<fct>	<int>	<int>	<int>	<fct>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
cg00000292	chr16	28878779	28878780	2	*	cg00000292	ATP2A1;ATP2A1;ATP2A1;ATP2A1;ATP2A1	protein_coding;protein_coding;protein_coding;protein_coding;protein_coding	ENST00000357084.6;ENST00000395503.7;ENST00000536376.4;ENST00000562185.4;ENST00000563975.1	373;290;-1275;-465;-83	CGI:chr16:28879633-28880547	N_Shore
cg00002426	chr3	57757816	57757817	2	*	cg00002426	SLMAP;SLMAP;SLMAP;SLMAP;SLMAP;SLMAP	protein_coding;protein_coding;protein_coding;protein_coding;protein_coding;protein_coding	ENST00000295951.6;ENST00000295952.6;ENST00000383718.6;ENST00000428312.4;ENST00000449503.5;ENST00000467901.1	1585;368;261;257;257;514	CGI:chr3:57756198-57757263	S_Shore
cg00003994	chr7	15686237	15686238	2	*	cg00003994	MEOX2	protein_coding	ENST00000262041.5	576	CGI:chr7:16399497-16399700	.
cg00005847	chr2	176164345	176164346	2	*	cg00005847	AC009336.19;HOXD3;HOXD3;HOXD3;RP11-387A1.5	protein_coding;protein_coding;protein_coding;protein_coding;antisense	ENST00000468418.4;ENST00000249440.4;ENST00000410016.4;ENST00000432796.2;ENST00000608941.1	13259;267;3453;27387;1372	CGI:chr2:176164685-176165509	N_Shore
cg00006414	chr7	149125745	149125746	2	*	cg00006414	RN7SL521P;ZNF398;ZNF425;ZNF425	misc_RNA;protein_coding;protein_coding;protein_coding	ENST00000488398.3;ENST00000426851.5;ENST00000378061.5;ENST00000483014.1	242;-672;602;562	CGI:chr7:149126122-149127136	N_Shore
cg00007981	chr11	94129428	94129429	2	*	cg00007981	PANX1;PANX1	protein_coding;protein_coding	ENST00000227638.6;ENST00000436171.2	499;498	CGI:chr11:94128394-94129607	Island