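Translation - TCGAbiolinks 3 - Downloading and preparing files for analysis

TCGAbiolinks: Downloading and preparing data

Published: March 20, 2019

TCGAbiolinks provides functions for downloading data from the GDC and preparing it for downstream analysis.

1. Introduction

This section first explains the two download methods and the SummarizedExperiment object, which is the default data structure used in TCGAbiolinks, and then walks through some examples.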

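  1. Downloading data: differences between the two methods

    TCGAbiolinks can download GDC data using two methods:

    • client: this method creates a MANIFEST file and downloads the data using the GDC Data Transfer Tool. It is more reliable, but it may be slower than the api method.
    • api: this method uses the GDC Application Programming Interface (API) to download the data. It creates a MANIFEST file, and the downloaded data arrives as a compressed tar.gz file. If the files are large or numerous, the tar.gz file becomes very large and the download is more likely to fail. To work around this, the files.per.chunk argument splits the download into smaller chunks; for example, with files.per.chunk = 10, each tar.gz contains only 10 files.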
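
The chunking idea behind files.per.chunk can be sketched in base R. This is only an illustration of how a list of files splits into groups of a fixed size; the file identifiers are made up, and the code is not taken from the TCGAbiolinks source.

```r
# Hypothetical file identifiers standing in for the files matched by a query
file_ids <- sprintf("file_%02d", 1:25)
files.per.chunk <- 10

# Assign each file to a chunk holding at most files.per.chunk files
chunk_index <- ceiling(seq_along(file_ids) / files.per.chunk)
chunks <- split(file_ids, chunk_index)

# Number of files in each chunk; each chunk would be one smaller download
lengths(chunks)
```

With 25 files and a chunk size of 10, the last chunk simply holds the remainder.
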
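  2. Preparing data: the SummarizedExperiment object

    Using the SummarizedExperiment package, three main pieces of data can be extracted from a SummarizedExperiment object:

    • colData(data): the sample information matrix, which includes clinical data and tumor subtype information obtained from the corresponding TCGA papers
    • assay(data): the assay matrix, for example the expression value of each gene in each sample
    • rowRanges(data): the feature (typically gene) information, including feature metadata such as the genomic range of each gene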
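
To make the three accessors concrete, here is a minimal sketch that builds a toy SummarizedExperiment by hand. All gene names, sample names, coordinates, and values are made up for illustration; real objects returned by GDCprepare carry far more metadata.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges and IRanges

# Toy expression matrix: 2 genes x 2 samples (values are made up)
counts <- matrix(c(10, 20, 30, 40), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("sample1", "sample2")))

# Feature metadata: one genomic range per gene (coordinates are made up)
genes <- GRanges(seqnames = c("chr1", "chr2"),
                 ranges = IRanges(start = c(100, 500), end = c(200, 900)),
                 strand = c("+", "-"))
names(genes) <- rownames(counts)

# Sample metadata: one row per sample
samples <- DataFrame(condition = c("tumor", "normal"),
                     row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = genes,
                           colData = samples)

colData(se)    # sample information
assay(se)      # expression matrix
rowRanges(se)  # gene coordinates
```
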
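  3. Summarized Experiment: annotation information

    GDCprepare has an argument named summarizedExperiment that determines whether the output is a Summarized Experiment object (the default) or a data frame. To create a Summarized Experiment object, the data are annotated with the latest genome annotation: 1) for legacy data (aligned to hg19), TCGAbiolinks uses GRCh37.p13; 2) for harmonized data (aligned to hg38), TCGAbiolinks uses GRCh38.p7 (May 2017).

    Unfortunately, annotation updates such as GRCh38.p7 can rename or remove gene symbols and change gene coordinates, which may cause some TCGA data to be lost. For example, if a gene has been removed, it can no longer be mapped, and its information is dropped from the SummarizedExperiment.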
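
The effect of a retired gene can be illustrated with a small base R sketch. The gene IDs, values, and the annotation table below are all invented, and the join is only an analogy for the mapping step, not the code TCGAbiolinks runs internally.

```r
# Toy expression table keyed by gene ID (made-up IDs and values)
expr <- data.frame(gene_id = c("G1", "G2", "G3", "G4"),
                   value = c(5.1, 0.0, 12.3, 7.8),
                   stringsAsFactors = FALSE)

# Toy annotation; "G3" is assumed to have been retired from this release
annotation <- data.frame(gene_id = c("G1", "G2", "G4"),
                         chrom = c("chr1", "chr1", "chr7"),
                         stringsAsFactors = FALSE)

# Genes that can no longer be mapped to a genomic range
lost <- setdiff(expr$gene_id, annotation$gene_id)

# Inner join: only genes present in both tables survive
annotated <- merge(expr, annotation, by = "gene_id")

lost
nrow(annotated)
```

Checking the set difference before joining is a cheap way to see how much would be dropped.
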

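    If you set summarizedExperiment = FALSE, you get the unmodified data and need to annotate it yourself.

    In addition, no update is available for the DNA methylation data, but the latest metadata can be found here: http://zwdzwd.github.io/InfiniumAnnotation

    For related discussion, see issue 91 and issue 50.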

2. Arguments

2.1. GDCdownload function

| Argument | Description |
|---|---|
| query | A query returned by the GDCquery function |
| token.file | Token file to download controlled data (only for method = “client”) |
| method | Uses the API (POST method) or the gdc client tool. Options: “api”, “client”. The API is faster, but the data might get corrupted during the download, in which case the download needs to be executed again |
| directory | Directory/folder where the data is downloaded. Default: GDCdata |
| files.per.chunk | Makes the API method download only n (files.per.chunk) files at a time. This may reduce download problems when the data size is too large. Expects an integer (example: files.per.chunk = 6) |

2.2. GDCprepare function

| Argument | Description |
|---|---|
| query | A query returned by the GDCquery function |
| save | Save the result as an RData object? |
| save.filename | Name of the file to be saved; if empty, a name is created automatically |
| directory | Directory/folder where the data was downloaded. Default: GDCdata |
| summarizedExperiment | Create a summarizedExperiment? Default: TRUE (if possible) |
| remove.files.prepared | Remove the files that were read? Default: FALSE. This argument is considered only if the save argument is set to TRUE |
| add.gistic2.mut | If a list of genes (gene symbols) is given, columns with GISTIC2 results from GDAC Firehose (hg19) and a column indicating whether there is a mutation in that gene (hg38) (TRUE or FALSE; use the MAF file for more information) are added to the sample matrix in the summarized Experiment object |
| mut.pipeline | If add.gistic2.mut is not NULL, this field is taken into consideration. Four separate variant calling pipelines are implemented for GDC data harmonization. Options: muse, varscan2, somaticsniper, MuTect2. For more information: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/ |
| mutant_variant_classification | List of variant classifications that determine whether a sample is considered mutant. Default: “Frame_Shift_Del”, “Frame_Shift_Ins”, “Missense_Mutation”, “Nonsense_Mutation”, “Splice_Site”, “In_Frame_Del”, “In_Frame_Ins”, “Translation_Start_Site”, “Nonstop_Mutation” |
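
As a rough illustration of what the mutant_variant_classification list decides, the base R sketch below flags samples as mutant for one gene from a toy MAF-like table. The rows are invented, the helper is_mutant is hypothetical, and the logic is an assumption about the general idea rather than the package's actual implementation.

```r
# Toy MAF-like table (made-up rows; column names follow the MAF convention)
maf <- data.frame(
  Hugo_Symbol = c("TP53", "TP53", "EGFR", "TP53"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Missense_Mutation", "Intron"),
  Tumor_Sample_Barcode = c("S1", "S2", "S2", "S3"),
  stringsAsFactors = FALSE
)

# Classifications counted as "mutant" (the default values listed above)
mutant_classes <- c("Frame_Shift_Del", "Frame_Shift_Ins", "Missense_Mutation",
                    "Nonsense_Mutation", "Splice_Site", "In_Frame_Del",
                    "In_Frame_Ins", "Translation_Start_Site", "Nonstop_Mutation")

# Hypothetical helper: TRUE/FALSE per sample for one gene
is_mutant <- function(maf, gene, samples, classes) {
  hits <- maf$Hugo_Symbol == gene & maf$Variant_Classification %in% classes
  samples %in% maf$Tumor_Sample_Barcode[hits]
}

samples <- c("S1", "S2", "S3")
tp53_status <- setNames(is_mutant(maf, "TP53", samples, mutant_classes), samples)
tp53_status
```

Here only S1 counts as TP53-mutant, because the Silent and Intron variants in S2 and S3 are not in the list.
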

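3. Legacy database: searching and downloading data

In this example, we use the GDC API method to download gene expression data for two samples from the legacy database (data aligned against the hg19 reference genome), and then show the object data and metadata.

library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)  # provides datatable()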
query <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Gene expression quantification",
                           platform = "Illumina HiSeq", 
                           file.type  = "normalized_results",
                           experimental.strategy = "RNA-Seq",
                           barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                           legacy = TRUE)
GDCdownload(query, method = "api", files.per.chunk = 10)
data <- GDCprepare(query)
# Gene expression aligned against hg19.
datatable(as.data.frame(colData(data)), 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = FALSE)
## (interactive table output; only the leading columns are shown, the remaining clinical and subtype_* columns are omitted)
## patient      | barcode                      | sample           | shortLetterCode | definition
## TCGA-14-0736 | TCGA-14-0736-02A-01R-2005-01 | TCGA-14-0736-02A | TR              | Recurrent Solid Tumor
## TCGA-06-0211 | TCGA-06-0211-02A-02R-2005-01 | TCGA-06-0211-02A | TR              | Recurrent Solid Tumor
# Show only the first 5 rows to make rendering faster
datatable(assay(data)[1:5,], 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = TRUE)
##              | TCGA-14-0736-02A-01R-2005-01 | TCGA-06-0211-02A-02R-2005-01
## A1BG         | 798.1285                     | 402.2892
## A2M          | 11681.6573                   | 33835.5229
## NAT1         | 164.5998                     | 70.9712
## NAT2         | 5.637                        | 4.2689
## RP11-986E7.7 | 206399.0981                  | 45121.6649
rowRanges(data)
## GRanges object with 100 ranges and 3 metadata columns:
##                seqnames              ranges strand |      gene_id
##                   <Rle>           <IRanges>  <Rle> |  <character>
##           A1BG    chr19   58856544-58864865      - |         A1BG
##            A2M    chr12     9220260-9268825      - |          A2M
##           NAT1     chr8   18027986-18081198      + |         NAT1
##           NAT2     chr8   18248755-18258728      + |         NAT2
##   RP11-986E7.7    chr14   95058395-95090983      + | RP11-986E7.7
##            ...      ...                 ...    ... .          ...
##         ADORA1     chr1 203059782-203136533      + |       ADORA1
##        ADORA2A    chr22   24813847-24838328      + |      ADORA2A
##        ADORA2B    chr17   15848231-15879060      + |      ADORA2B
##         ADORA3     chr1 112025970-112106584      - |       ADORA3
##          ADPRH     chr3 119298115-119308792      + |        ADPRH
##                entrezgene ensembl_gene_id
##                 <numeric>     <character>
##           A1BG          1 ENSG00000121410
##            A2M          2 ENSG00000175899
##           NAT1          9 ENSG00000171428
##           NAT2         10 ENSG00000156006
##   RP11-986E7.7         12 ENSG00000273259
##            ...        ...             ...
##         ADORA1        134 ENSG00000163485
##        ADORA2A        135 ENSG00000128271
##        ADORA2B        136 ENSG00000170425
##         ADORA3        140 ENSG00000121933
##          ADPRH        141 ENSG00000144843
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
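
The patient and sample values in the colData output follow directly from the structure of the TCGA barcode. A minimal base R sketch, assuming the standard fixed-width barcode layout, recovers them by position:

```r
barcodes <- c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")

patient <- substr(barcodes, 1, 12)       # project, tissue source site, participant
sample <- substr(barcodes, 1, 16)        # patient plus sample type and vial
sample_type <- substr(barcodes, 14, 15)  # two-digit sample type code

data.frame(barcode = barcodes, patient, sample, sample_type)
```

Both barcodes carry sample type code 02, which corresponds to the "Recurrent Solid Tumor" definition shown in the output.
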

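4. Harmonized database: searching and downloading data

In this example, we download gene expression data for two samples from the harmonized database (data aligned against the hg38 reference genome), and then show the object data and metadata.

library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)  # provides datatable()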
# Gene expression aligned against hg38
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query)
data <- GDCprepare(query)
datatable(as.data.frame(colData(data)), 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = FALSE)
## (interactive table output; only the leading columns are shown, the remaining clinical and subtype_* columns are omitted)
## patient      | barcode                      | sample           | shortLetterCode | definition
## TCGA-14-0736 | TCGA-14-0736-02A-01R-2005-01 | TCGA-14-0736-02A | TR              | Recurrent Solid Tumor
## TCGA-06-0211 | TCGA-06-0211-02A-02R-2005-01 | TCGA-06-0211-02A | TR              | Recurrent Solid Tumor
datatable(assay(data)[1:5,], 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = TRUE)
##                 | TCGA-14-0736-02A-01R-2005-01 | TCGA-06-0211-02A-02R-2005-01
## ENSG00000000003 | 1027433.82686                | 424382.673556
## ENSG00000000005 | 22979.3388709                | 2481.99005963
## ENSG00000000419 | 810469.937335                | 738697.921797
## ENSG00000000457 | 46320.6174017                | 39986.1652223
## ENSG00000000460 | 26077.4119828                | 31642.5655792
rowRanges(data)
## GRanges object with 100 ranges and 3 metadata columns:
##                   seqnames              ranges strand | ensembl_gene_id
##                      <Rle>           <IRanges>  <Rle> |     <character>
##   ENSG00000000003     chrX 100627109-100639991      - | ENSG00000000003
##   ENSG00000000005     chrX 100584802-100599885      + | ENSG00000000005
##   ENSG00000000419    chr20   50934867-50958555      - | ENSG00000000419
##   ENSG00000000457     chr1 169849631-169894267      - | ENSG00000000457
##   ENSG00000000460     chr1 169662007-169854080      + | ENSG00000000460
##               ...      ...                 ...    ... .             ...
##   ENSG00000005421     chr7   95297676-95324707      - | ENSG00000005421
##   ENSG00000005436     chr2   75652000-75710989      - | ENSG00000005436
##   ENSG00000005448     chr2   74421678-74425755      + | ENSG00000005448
##   ENSG00000005469     chr7   87345681-87399795      + | ENSG00000005469
##   ENSG00000005471     chr7   87401697-87480435      - | ENSG00000005471
##                   external_gene_name original_ensembl_gene_id
##                          <character>              <character>
##   ENSG00000000003             TSPAN6       ENSG00000000003.13
##   ENSG00000000005               TNMD        ENSG00000000005.5
##   ENSG00000000419               DPM1       ENSG00000000419.11
##   ENSG00000000457              SCYL3       ENSG00000000457.12
##   ENSG00000000460           C1orf112       ENSG00000000460.15
##               ...                ...                      ...
##   ENSG00000005421               PON1        ENSG00000005421.7
##   ENSG00000005436              GCFC2       ENSG00000005436.12
##   ENSG00000005448              WDR54       ENSG00000005448.15
##   ENSG00000005469               CROT       ENSG00000005469.10
##   ENSG00000005471              ABCB4       ENSG00000005471.14
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
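
In the rowRanges output, original_ensembl_gene_id carries a version suffix (for example ENSG00000000003.13) while ensembl_gene_id does not. When annotating unversioned tables yourself, a small base R sketch like the following strips the suffix; it is shown only as a convenience and is not claimed to be how TCGAbiolinks derives the column.

```r
versioned <- c("ENSG00000000003.13", "ENSG00000000005.5", "ENSG00000000419.11")

# Remove a trailing ".<digits>" version suffix, if present
unversioned <- sub("\\.[0-9]+$", "", versioned)
unversioned
```
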

5. GDCprepare: outputs

This function is still under development and does not work for all cases; see the tables below for details. In addition, examples of querying, downloading, and preparing data can be found in this Gist.

5.1. Harmonized data

| Data.category | Data.type | Workflow Type | Status |
|---|---|---|---|
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - Counts | Data frame or SE (losing 5% of information when mapping to genomic regions) |
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM-UQ | Returning only a (losing 5% of information when mapping to genomic regions) |
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM | Returning only a (losing 5% of information when mapping to genomic regions) |
| Transcriptome Profiling | Isoform Expression Quantification | Not needed | |
| Transcriptome Profiling | miRNA Expression Quantification | Not needed | Returning only a dataframe for the moment |
| Copy number variation | Copy Number Segment | Not needed | Returning only a dataframe for the moment |
| Copy number variation | Masked Copy Number Segment | Not needed | Returning only a dataframe for the moment |
| Copy number variation | Gene Level Copy Number Scores | Not needed | Returning only a dataframe for the moment |
| Simple Nucleotide Variation | | | |
| Raw Sequencing Data | | | |
| Biospecimen | Slide Image | | |
| Biospecimen | Biospecimen Supplement | | |
| Clinical | | | |

5.2. Legacy data

| Data.category | Data.type | Platform | file.type | Status |
|---|---|---|---|---|
| Transcriptome Profiling | | | | |
| Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg18.seg | Working |
| | - | Affymetrix SNP Array 6.0 | hg18.seg | Working |
| | - | Affymetrix SNP Array 6.0 | nocnv_hg19.seg | Working |
| | - | Affymetrix SNP Array 6.0 | hg19.seg | Working |
| | - | Illumina HiSeq | Several | Working |
| Simple Nucleotide Variation | Simple somatic mutation | | | |
| Raw Sequencing Data | | | | |
| Biospecimen | | | | |
| Clinical | | | | |
| Protein expression | | MDA RPPA Core | - | Working |
| Gene expression | Gene expression quantification | Illumina HiSeq | normalized_results | Working |
| | | Illumina HiSeq | results | Working |
| | | HT_HG-U133A | - | Working |
| | | AgilentG4502A_07_2 | - | Data frame only |
| | | AgilentG4502A_07_1 | - | Data frame only |
| | | HuEx-1_0-st-v2 | FIRMA.txt | Not Preparing |
| | | | gene.txt | Not Preparing |
| | Isoform expression quantification | | | |
| | miRNA gene quantification | | | |
| | Exon junction quantification | | | |
| | Exon quantification | | | |
| | miRNA isoform quantification | | | |
| DNA methylation | | Illumina Human Methylation 450 | Not used | Working |
| | | Illumina Human Methylation 27 | Not used | Working |
| | | Illumina DNA Methylation OMA003 CPI | Not used | Working |
| | | Illumina DNA Methylation OMA002 CPI | Not used | Working |
| | | Illumina Hi Seq | | Not working |
| Raw Microarray Data | | | | |
| Structural Rearrangement | | | | |
| Other | | | | |

6. Examples

6.1. Legacy archive database

6.1.1. DNA methylation: getting all TCGA IDAT files

#-------------------------------------------------------
# Example to download idat files from TCGA projects
#-------------------------------------------------------
library(TCGAbiolinks)
library(SummarizedExperiment)
projects <- TCGAbiolinks::getGDCprojects()$project_id
projects <- projects[grepl('^TCGA', projects, perl = TRUE)]
match.file.cases.all <- NULL
for(proj in projects){
    print(proj)
    query <- GDCquery(project = proj,
                      data.category = "Raw microarray data",
                      data.type = "Raw intensities", 
                      experimental.strategy = "Methylation array", 
                      legacy = TRUE,
                      file.type = ".idat",
                      platform = "Illumina Human Methylation 450")
    match.file.cases <- getResults(query,cols=c("cases","file_name"))
    match.file.cases$project <- proj
    match.file.cases.all <- rbind(match.file.cases.all,match.file.cases)
    tryCatch(GDCdownload(query, method = "api", files.per.chunk = 20),
             error = function(e) GDCdownload(query, method = "client"))
}
# This will create a map between idat file name, cases (barcode) and project
readr::write_tsv(match.file.cases.all, "idat_filename_case.txt")
# Move all downloaded idat files into the current folder
for(file in dir(".", pattern = "\\.idat$", recursive = TRUE)){
    TCGAbiolinks::move(file,basename(file))
}

6.1.2. DNA methylation: aligned against hg19

library(TCGAbiolinks)
library(SummarizedExperiment)
query_meth.hg19 <- GDCquery(project= "TCGA-LGG", 
                            data.category = "DNA methylation", 
                            platform = "Illumina Human Methylation 450", 
                            barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05"), 
                            legacy = TRUE)
GDCdownload(query_meth.hg19)
data.hg19 <- GDCprepare(query_meth.hg19)

6.1.3. Proteomics: protein expression

library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Protein expression",
                  legacy = TRUE, 
                  barcode = c("TCGA-OX-A56R-01A-21-A44T-20","TCGA-08-0357-01A-21-1898-20"))
GDCdownload(query)
data <- GDCprepare(query, save = TRUE, 
                   save.filename = "gbmProteinExpression.rda",
                   remove.files.prepared = TRUE)

6.1.4. Transcriptome: gene expression aligned against hg19

library(TCGAbiolinks)
library(SummarizedExperiment)
# Aligned against Hg19
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)
GDCdownload(query.exp.hg19)
data <- GDCprepare(query.exp.hg19)

6.2. Harmonized database

6.2.1. Copy Number

library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)

6.2.2. GISTIC2

library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-ACC",
             data.category = "Copy Number Variation",
             data.type = "Gene Level Copy Number Scores",              
             access="open")
GDCdownload(query)
data <- GDCprepare(query)

6.2.3. Transcriptome: gene expression aligned against hg38

library(TCGAbiolinks)
library(SummarizedExperiment)
# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
query.exp.hg38 <- GDCquery(project = "TCGA-GBM", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ",
                  barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
expdat <- GDCprepare(query = query.exp.hg38,
                     save = TRUE, 
                     save.filename = "exp.rda")
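
When save = TRUE is used, the result is written as an RData file that can be reloaded later without repeating the preparation. The sketch below uses a toy object and a temporary file as stand-ins; the object name stored inside a file written by GDCprepare is not assumed here, which is why the names returned by load() are inspected first.

```r
# Stand-in for a file written with save = TRUE (toy object, temporary path)
path <- file.path(tempdir(), "exp.rda")
toy <- data.frame(gene = c("A", "B"), value = c(1, 2))
save(toy, file = path)

# Load into a fresh environment to avoid overwriting objects in the workspace
env <- new.env()
loaded_names <- load(path, envir = env)
loaded_names

# Retrieve the first stored object by name
restored <- get(loaded_names[1], envir = env)
```
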

6.2.4. miRNA

library(TCGAbiolinks)
library(SummarizedExperiment)
query.mirna <- GDCquery(project = "TARGET-AML", 
                        experimental.strategy = "miRNA-Seq",
                        data.category = "Transcriptome Profiling", 
                        barcode = c("TARGET-20-PATDNN","TARGET-20-PAPUNR"),
                        data.type = "miRNA Expression Quantification")
GDCdownload(query.mirna)
mirna <- GDCprepare(query = query.mirna,
                     save = TRUE, 
                     save.filename = "mirna.rda")


query.isoform <- GDCquery(project = "TARGET-AML", 
                          experimental.strategy = "miRNA-Seq",
                          data.category = "Transcriptome Profiling", 
                          barcode = c("TARGET-20-PATDNN","TARGET-20-PAPUNR"),
                          data.type = "Isoform Expression Quantification")
GDCdownload(query.isoform)

isoform <- GDCprepare(query = query.isoform,
                    save = TRUE, 
                    save.filename = "mirna-isoform.rda")

6.2.5. DNA methylation: aligned against hg38

library(TCGAbiolinks)
library(SummarizedExperiment)
#-------------------------------------- 
# DNA methylation data
#-------------------------------------- 
# DNA methylation aligned to hg38
query_met.hg38 <- GDCquery(project= "TCGA-LGG", 
                           data.category = "DNA Methylation", 
                           platform = "Illumina Human Methylation 450", 
                           barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38)

6.2.6. DNA methylation: IDAT files

# Using sesame  http://bioconductor.org/packages/sesame/
# Please cite 10.1093/nar/gky691 and doi: 10.1093/nar/gkt090.
library(TCGAbiolinks)
library(SummarizedExperiment)
proj <- "TCGA-ACC"
query <- GDCquery(project = proj,
                  data.category = "Raw microarray data",
                  data.type = "Raw intensities", 
                  experimental.strategy = "Methylation array", 
                  legacy = TRUE,
                  barcode = c("TCGA-OR-A5JT","TCGA-OR-A5LG","TCGA-OR-A5JX"),
                  file.type = ".idat",
                  platform = "Illumina Human Methylation 450")
tryCatch(GDCdownload(query, method = "api", files.per.chunk = 20),
         error = function(e) GDCdownload(query, method = "client"))
betas <- GDCprepare(query)
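Updated: 2019-05-25 17:13:23

Written by AlphaJP.
Licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, articles on this site are original works or translations by this site; please credit the source when reposting.
Original link: https://blog.computsystmed.com/archives/translation-tcgabiolinks-downloading-and-preparing-files-for-analysis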