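Translation - TCGAbiolinks 2 - Searching GDC database

TCGAbiolinks: Searching the GDC database

Published: March 20, 2019

TCGAbiolinks provides several functions for searching the GDC database. This section first introduces the two available sources of GDC data, Harmonized and Legacy Archive, and then walks through examples of how to access and use them.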

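1. Useful information

1.1. Different GDC sources: Harmonized and Legacy Archive

TCGAbiolinks can download GDC data from two sources:

  • GDC Legacy Archive: provides access to an unmodified copy of the data that was previously stored in CGHub and in the TCGA Data Coordinating Center (DCC) and hosted by the TCGA Data Portal. Data in this source use GRCh37 (hg19) and GRCh36 (hg18) as reference genomes.
  • GDC harmonized database: data in this source use GRCh38 (hg38) as the reference genome, and the provided biospecimen and clinical data are harmonized with the GDC Bioinformatics Pipelines.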

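1.2. Understanding the barcode

A TCGA barcode is composed of a set of identifiers, and each barcode corresponds to one TCGA data element. The examples below show how the metadata identifiers make up a barcode. The aliquot barcode contains the largest number of identifiers.

Examples:

  • Aliquot barcode: TCGA-G4-6317-02A-11D-2064-05
  • Participant: TCGA-G4-6317
  • Sample: TCGA-G4-6317-02

For more information, see the TCGA wiki.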
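The relationship between the three identifiers above can be sketched in base R. This is a minimal illustration that assumes barcodes have the fixed-width layout of the example; `parse_barcode` is a hypothetical helper and not part of TCGAbiolinks.

```r
# Hypothetical helper (not a TCGAbiolinks function): cut an aliquot barcode
# into the shorter identifiers, assuming the fixed-width layout shown above.
parse_barcode <- function(barcode) {
  stopifnot(is.character(barcode), length(barcode) == 1)
  list(
    participant = substr(barcode, 1, 12),   # project, site and participant fields
    sample      = substr(barcode, 1, 15),   # participant plus two-digit sample code
    sample_type = substr(barcode, 14, 15)   # the two-digit sample type code alone
  )
}

parts <- parse_barcode("TCGA-G4-6317-02A-11D-2064-05")
print(parts$participant)
print(parts$sample)
print(parts$sample_type)
```

The same `substr(x, 1, 12)` cut is what the Colon Adenocarcinoma example later in this post uses to go from aliquot barcodes to patients.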

2. Search arguments

GDC data can be searched easily with the GDCquery function, which accepts the following arguments:

| Argument | Description |
|---|---|
| project | A list of valid projects (see the table below) |
| data.category | A valid data category (see the list with TCGAbiolinks:::getProjectSummary(project)) |
| data.type | A data type to filter the files to download |
| workflow.type | GDC workflow type |
| legacy | Search in the legacy repository |
| access | Filter by access type. Possible values: controlled, open |
| platform | Platform to filter on; example values are listed after this table |
| file.type | To be used in the legacy database for some platforms, to define which file types to use |
| barcode | A list of barcodes to filter the files to download |
| experimental.strategy | Filter by experimental strategy. Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array. Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, CGH array, VALIDATION, Gene expression array, WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq |
| sample.type | A sample type to filter the files to download |

Example values for platform:

| | |
|---|---|
| CGH-1x1M_G4447A | IlluminaGA_RNASeqV2 |
| AgilentG4502A_07 | IlluminaGA_mRNA_DGE |
| Human1MDuo | HumanMethylation450 |
| HG-CGH-415K_G4124A | IlluminaGA_miRNASeq |
| HumanHap550 | IlluminaHiSeq_miRNASeq |
| ABI | H-miRNA_8x15K |
| HG-CGH-244A | SOLiD_DNASeq |
| IlluminaDNAMethylation_OMA003_CPI | IlluminaGA_DNASeq_automated |
| IlluminaDNAMethylation_OMA002_CPI | HG-U133_Plus_2 |
| HuEx-1_0-st-v2 | Mixed_DNASeq |
| H-miRNA_8x15Kv2 | IlluminaGA_DNASeq_curated |
| MDA_RPPA_Core | IlluminaHiSeq_TotalRNASeqV2 |
| HT_HG-U133A | IlluminaHiSeq_DNASeq_automated |
| diagnostic_images | microsat_i |
| IlluminaHiSeq_RNASeq | SOLiD_DNASeq_curated |
| IlluminaHiSeq_DNASeqC | Mixed_DNASeq_curated |
| IlluminaGA_RNASeq | IlluminaGA_DNASeq_Cont_automated |
| IlluminaGA_DNASeq | IlluminaHiSeq_WGBS |
| pathology_reports | IlluminaHiSeq_DNASeq_Cont_automated |
| Genome_Wide_SNP_6 | bio |
| tissue_images | Mixed_DNASeq_automated |
| HumanMethylation27 | Mixed_DNASeq_Cont_curated |
| IlluminaHiSeq_RNASeqV2 | Mixed_DNASeq_Cont |

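2.1. project options

The options available for project are listed below.

Notes on the table columns:

  • dbgap_accession_number: the associated dbGaP ID
  • disease_type: the type of disease
  • releasable: whether the project can be released
  • released: whether the project has been made public
  • state: whether the current state is open
  • primary_site: the tissue of origin of the samples
  • project_id: the ID of the project
  • id: the data identifier
  • name: the name of the disease
  • tumor: the abbreviation of the disease name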
Within a cell, multiple values are separated by semicolons.

| dbgap_accession_number | disease_type | releasable | released | state | primary_site | project_id | id | name | tumor |
|---|---|---|---|---|---|---|---|---|---|
| | Epithelial Neoplasms, NOS; Adenomas and Adenocarcinomas | false | true | open | Thyroid gland | TCGA-THCA | TCGA-THCA | Thyroid Carcinoma | THCA |
| phs000465 | Acute Myeloid Leukemia | false | true | open | Blood | TARGET-AML | TARGET-AML | Acute Myeloid Leukemia | AML |
| phs000467 | Neuroblastoma | false | true | open | Nervous System | TARGET-NBL | TARGET-NBL | Neuroblastoma | NBL |
| | Gliomas | false | true | open | Brain | TCGA-LGG | TCGA-LGG | Brain Lower Grade Glioma | LGG |
| | Cystic, Mucinous and Serous Neoplasms; Adenomas and Adenocarcinomas; Complex Epithelial Neoplasms; Squamous Cell Neoplasms | false | true | open | Cervix uteri | TCGA-CESC | TCGA-CESC | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | CESC |
| | Lipomatous Neoplasms; Soft Tissue Tumors and Sarcomas, NOS; Fibromatous Neoplasms; Myomatous Neoplasms; Nerve Sheath Tumors; Synovial-like Neoplasms | false | true | open | Retroperitoneum and peritoneum; Bones, joints and articular cartilage of limbs; Other and unspecified parts of tongue; Stomach; Other and unspecified male genital organs; Colon; Connective, subcutaneous and other soft tissues; Meninges; Ovary; Corpus uteri; Peripheral nerves and autonomic nervous system; Uterus, NOS; Kidney | TCGA-SARC | TCGA-SARC | Sarcoma | SARC |
| | Adenomas and Adenocarcinomas | false | true | open | Adrenal gland | TCGA-ACC | TCGA-ACC | Adrenocortical Carcinoma | ACC |
| phs000468 | Osteosarcoma | false | true | open | Bone | TARGET-OS | TARGET-OS | Osteosarcoma | OS |
| | Cystic, Mucinous and Serous Neoplasms; Adenomas and Adenocarcinomas | false | true | open | Rectosigmoid junction; Unknown; Rectum; Colon; Connective, subcutaneous and other soft tissues | TCGA-READ | TCGA-READ | Rectum Adenocarcinoma | READ |
| phs000470 | Rhabdoid Tumor | false | true | open | Kidney | TARGET-RT | TARGET-RT | Rhabdoid Tumor | RT |
| | Adenomas and Adenocarcinomas | false | true | open | Liver and intrahepatic bile ducts | TCGA-LIHC | TCGA-LIHC | Liver Hepatocellular Carcinoma | LIHC |
| | Adenomas and Adenocarcinomas | false | true | open | Kidney | TCGA-KICH | TCGA-KICH | Kidney Chromophobe | KICH |
| | Thymic Epithelial Neoplasms | false | true | open | Heart, mediastinum, and pleura; Thymus | TCGA-THYM | TCGA-THYM | Thymoma | THYM |
| | Cystic, Mucinous and Serous Neoplasms; Adenomas and Adenocarcinomas | false | true | open | Stomach | TCGA-STAD | TCGA-STAD | Stomach Adenocarcinoma | STAD |
| | Squamous Cell Neoplasms | false | true | open | Bronchus and lung | TCGA-LUSC | TCGA-LUSC | Lung Squamous Cell Carcinoma | LUSC |
| | Mesothelial Neoplasms | false | true | open | Heart, mediastinum, and pleura; Bronchus and lung | TCGA-MESO | TCGA-MESO | Mesothelioma | MESO |
| | Cystic, Mucinous and Serous Neoplasms; Epithelial Neoplasms, NOS; Adenomas and Adenocarcinomas; Ductal and Lobular Neoplasms | false | true | open | Pancreas | TCGA-PAAD | TCGA-PAAD | Pancreatic Adenocarcinoma | PAAD |
| | Transitional Cell Papillomas and Carcinomas; Epithelial Neoplasms, NOS; Adenomas and Adenocarcinomas; Squamous Cell Neoplasms | false | true | open | Bladder | TCGA-BLCA | TCGA-BLCA | Bladder Urothelial Carcinoma | BLCA |
| phs000466 | Clear Cell Sarcoma of the Kidney | false | true | open | Kidney | TARGET-CCSK | TARGET-CCSK | Clear Cell Sarcoma of the Kidney | CCSK |
| | Squamous Cell Neoplasms | false | true | open | Other and ill-defined sites in lip, oral cavity and pharynx; Palate; Other and unspecified parts of tongue; Hypopharynx; Lip; Tonsil; Gum; Larynx; Oropharynx; Floor of mouth; Bones, joints and articular cartilage of other and unspecified sites; Other and unspecified parts of mouth; Base of tongue | TCGA-HNSC | TCGA-HNSC | Head and Neck Squamous Cell Carcinoma | HNSC |
| | Adenomas and Adenocarcinomas | false | true | open | Kidney | TCGA-KIRC | TCGA-KIRC | Kidney Renal Clear Cell Carcinoma | KIRC |
| | Not Reported; Gliomas | false | true | open | Brain | TCGA-GBM | TCGA-GBM | Glioblastoma Multiforme | GBM |
| | Nevi and Melanomas | false | true | open | Skin | TCGA-SKCM | TCGA-SKCM | Skin Cutaneous Melanoma | SKCM |
| phs001374 | Epithelial Neoplasms, NOS; Squamous Cell Neoplasms | true | true | open | Bronchus and lung | VAREPOP-APOLLO | VAREPOP-APOLLO | VA Research Precision Oncology Program | APOLLO |
| | Adenomas and Adenocarcinomas | false | true | open | Other and unspecified parts of biliary tract; Gallbladder; Liver and intrahepatic bile ducts | TCGA-CHOL | TCGA-CHOL | Cholangiocarcinoma | CHOL |
| | Not Reported; Cystic, Mucinous and Serous Neoplasms; Epithelial Neoplasms, NOS; Adenomas and Adenocarcinomas | false | true | open | Corpus uteri; Uterus, NOS | TCGA-UCEC | TCGA-UCEC | Uterine Corpus Endometrial Carcinoma | UCEC |
| | Cystic, Mucinous and Serous Neoplasms; Adenomas and Adenocarcinomas; Squamous Cell Neoplasms | false | true | open | Esophagus; Stomach | TCGA-ESCA | TCGA-ESCA | Esophageal Carcinoma | ESCA |
| | Cystic, Mucinous and Serous Neoplasms; Epithelial Neoplasms, NOS; Adenomas and Adenocarcinomas; Complex Epithelial Neoplasms | false | true | open | Rectosigmoid junction; Colon | TCGA-COAD | TCGA-COAD | Colon Adenocarcinoma | COAD |
| | Adnexal and Skin Appendage Neoplasms; Basal Cell Neoplasms; Adenomas and Adenocarcinomas; Cystic, Mucinous and Serous Neoplasms; Epithelial Neoplasms, NOS; Squamous Cell Neoplasms; Fibroepithelial Neoplasms; Ductal and Lobular Neoplasms; Complex Epithelial Neoplasms | false | true | open | Breast | TCGA-BRCA | TCGA-BRCA | Breast Invasive Carcinoma | BRCA |
| | Not Reported; Cystic, Mucinous and Serous Neoplasms | false | true | open | Ovary | TCGA-OV | TCGA-OV | Ovarian Serous Cystadenocarcinoma | OV |
| | Myeloid Leukemias | false | true | open | Hematopoietic and reticuloendothelial systems | TCGA-LAML | TCGA-LAML | Acute Myeloid Leukemia | LAML |
| | Not Reported; Mature B-Cell Lymphomas | false | true | open | Heart, mediastinum, and pleura; Testis; Stomach; Lymph nodes; Bones, joints and articular cartilage of other and unspecified sites; Brain; Thyroid gland; Small intestine; Colon; Connective, subcutaneous and other soft tissues; Other and unspecified major salivary glands; Retroperitoneum and peritoneum; Hematopoietic and reticuloendothelial systems; Breast | TCGA-DLBC | TCGA-DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | DLBC |
| | Nevi and Melanomas | false | true | open | Eye and adnexa | TCGA-UVM | TCGA-UVM | Uveal Melanoma | UVM |
| | Cystic, Mucinous and Serous Neoplasms; Adenomas and Adenocarcinomas; Ductal and Lobular Neoplasms | false | true | open | Prostate gland | TCGA-PRAD | TCGA-PRAD | Prostate Adenocarcinoma | PRAD |
| phs001179 | Germ Cell Neoplasms; Acinar Cell Neoplasms; Miscellaneous Tumors; Thymic Epithelial Neoplasms; Gliomas; Basal Cell Neoplasms; Complex Mixed and Stromal Neoplasms; Ductal and Lobular Neoplasms; Neuroepitheliomatous Neoplasms; Complex Epithelial Neoplasms; Adnexal and Skin Appendage Neoplasms; Mesothelial Neoplasms; Mucoepidermoid Neoplasms; Not Reported; Adenomas and Adenocarcinomas; Cystic, Mucinous and Serous Neoplasms; Specialized Gonadal Neoplasms; Epithelial Neoplasms, NOS; Squamous Cell Neoplasms; Transitional Cell Papillomas and Carcinomas; Paragangliomas and Glomus Tumors; Nevi and Melanomas; Meningiomas | false | true | open | Testis; Gallbladder; Unknown; Other and unspecified parts of biliary tract; Adrenal gland; Thyroid gland; Spinal cord, cranial nerves, and other parts of central nervous system; Peripheral nerves and autonomic nervous system; Stomach; Cervix uteri; Bladder; Small intestine; Breast; Prostate gland; Other and ill-defined sites; Other and unspecified major salivary glands; Rectum; Retroperitoneum and peritoneum; Pancreas; Heart, mediastinum, and pleura; Other and ill-defined digestive organs; Bronchus and lung; Liver and intrahepatic bile ducts; Other and unspecified female genital organs; Thymus; Penis; Nasopharynx; Ovary; Uterus, NOS; Vulva; Other and unspecified urinary organs; Trachea; Ureter; Other endocrine glands and related structures; Not Reported; Colon; Anus and anal canal; Vagina; Skin; Esophagus; Eye and adnexa; Kidney | FM-AD | FM-AD | Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) | AD |
| | Germ Cell Neoplasms | false | true | open | Testis | TCGA-TGCT | TCGA-TGCT | Testicular Germ Cell Tumors | TGCT |
| phs000471 | High-Risk Wilms Tumor | false | true | open | Kidney | TARGET-WT | TARGET-WT | High-Risk Wilms Tumor | WT |
| phs001444 | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | false | true | open | Lymph Nodes | NCICCR-DLBCL | NCICCR-DLBCL | Genomic Variation in Diffuse Large B Cell Lymphomas | DLBCL |
| | Cystic, Mucinous and Serous Neoplasms; Acinar Cell Neoplasms; Adenomas and Adenocarcinomas | false | true | open | Bronchus and lung | TCGA-LUAD | TCGA-LUAD | Lung Adenocarcinoma | LUAD |
| phs001184 | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | false | true | open | Lymph Nodes | CTSP-DLBCL1 | CTSP-DLBCL1 | CTSP Diffuse Large B-Cell Lymphoma (DLBCL) CALGB 50303 | DLBCL1 |
| | Adenomas and Adenocarcinomas | false | true | open | Kidney | TCGA-KIRP | TCGA-KIRP | Kidney Renal Papillary Cell Carcinoma | KIRP |
| | Paragangliomas and Glomus Tumors | false | true | open | Heart, mediastinum, and pleura; Other endocrine glands and related structures; Adrenal gland; Connective, subcutaneous and other soft tissues; Other and ill-defined sites; Spinal cord, cranial nerves, and other parts of central nervous system; Retroperitoneum and peritoneum | TCGA-PCPG | TCGA-PCPG | Pheochromocytoma and Paraganglioma | PCPG |
| | Complex Mixed and Stromal Neoplasms | false | true | open | Uterus, NOS | TCGA-UCS | TCGA-UCS | Uterine Carcinosarcoma | UCS |
| | Myeloid Leukemias; Lymphoid Leukemias | false | true | open | Hematopoietic and reticuloendothelial systems | TARGET-ALL-P3 | TARGET-ALL-P3 | Acute Lymphoblastic Leukemia - Phase III | ALL |
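
Because the project argument takes a vector of project IDs, it can be convenient to pick them by program. The sketch below is only an illustration in base R: the IDs are a hand-copied subset of the table above, and the variable names are made up for this example.

```r
# A hand-copied subset of project_id values from the table above.
project_ids <- c("TCGA-THCA", "TARGET-AML", "TCGA-LGG", "FM-AD", "TARGET-NBL")

# The program is the text before the first hyphen (an assumption that holds
# for the IDs listed in this post).
program <- sub("-.*$", "", project_ids)

# Keep only the TCGA projects, e.g. to pass as the project argument.
tcga_projects <- project_ids[program == "TCGA"]
print(tcga_projects)
```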

2.2. sample.type options

The values available for sample.type are listed below:

| tissue.code | shortLetterCode | tissue.definition |
|---|---|---|
| 01 | TP | Primary solid Tumor |
| 02 | TR | Recurrent Solid Tumor |
| 03 | TB | Primary Blood Derived Cancer - Peripheral Blood |
| 04 | TRBM | Recurrent Blood Derived Cancer - Bone Marrow |
| 05 | TAP | Additional - New Primary |
| 06 | TM | Metastatic |
| 07 | TAM | Additional Metastatic |
| 08 | THOC | Human Tumor Original Cells |
| 09 | TBM | Primary Blood Derived Cancer - Bone Marrow |
| 10 | NB | Blood Derived Normal |
| 11 | NT | Solid Tissue Normal |
| 12 | NBC | Buccal Cell Normal |
| 13 | NEBV | EBV Immortalized Normal |
| 14 | NBM | Bone Marrow Normal |
| 20 | CELLC | Control Analyte |
| 40 | TRB | Recurrent Blood Derived Cancer - Peripheral Blood |
| 50 | CELL | Cell Lines |
| 60 | XP | Primary Xenograft Tissue |
| 61 | XCL | Cell Line Derived Xenograft Tissue |
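
The tissue.code in this table is the two-digit code that follows the participant in a barcode, so a barcode can be mapped to its tissue.definition with a simple lookup. The sketch below is a base R illustration that copies only four rows of the table; it assumes the fixed-width barcode layout used elsewhere in this post, and the variable names are made up for this example.

```r
# Four rows of the table above, as a named lookup vector (code -> definition).
sample_types <- c(
  "01" = "Primary solid Tumor",
  "02" = "Recurrent Solid Tumor",
  "10" = "Blood Derived Normal",
  "11" = "Solid Tissue Normal"
)

barcodes <- c(
  "TCGA-G4-6317-02A-11D-2064-05",
  "TCGA-AA-3712-11A-01D-1721-05",
  "TCGA-AA-3712-01A-21D-1721-05"
)

# Characters 14-15 hold the sample type code in these barcodes.
codes <- substr(barcodes, 14, 15)
definitions <- unname(sample_types[codes])
print(data.frame(barcode = barcodes, code = codes, definition = definitions))
```

The resulting definitions are the strings that sample.type expects in the queries shown later, such as "Recurrent Solid Tumor".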

The other search arguments (data.category, data.type, workflow.type, platform, file.type) are listed below. Note that these tables are not exhaustive.

2.3. Harmonized data options (legacy = FALSE)

# datatable() comes from the DT package
library(DT)
datatable(readr::read_csv("https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=2046985454"),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 40),
          rownames = FALSE)
## Parsed with column specification:
## cols(
##   Data.category = col_character(),
##   Data.type = col_character(),
##   `Workflow Type` = col_character(),
##   Platform = col_character()
## )
| Data.category | Data.type | Workflow Type | Platform |
|---|---|---|---|
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - Counts | |
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM | |
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM-UQ | |
| Transcriptome Profiling | Isoform Expression Quantification | - | |
| Transcriptome Profiling | miRNA Expression Quantification | - | |
| Copy number variation | Copy Number Segment | | |
| Copy number variation | Masked Copy Number Segment | | |
| Copy number variation | Gene Level Copy Number Scores | | |
| Simple Nucleotide Variation | - | | |
| Raw Sequencing Data | - | | |
| Biospecimen | - | | |
| Clinical | - | | |
| DNA Methylation | Methylation Beta Value | | Illumina Human Methylation 450 |
| DNA Methylation | Methylation Beta Value | | Illumina Human Methylation 27 |

2.4. Legacy archive data options (legacy = TRUE)

datatable(readr::read_csv("https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=1817673686"),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 40), 
          rownames = FALSE)
## Parsed with column specification:
## cols(
##   Data.category = col_character(),
##   Data.type = col_character(),
##   Platform = col_character(),
##   file.type = col_character()
## )
| Data.category | Data.type | Platform | file.type |
|---|---|---|---|
| Biospecimen | | | |
| Clinical | | | |
| Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg18.seg |
| Copy number variation | - | Affymetrix SNP Array 6.0 | hg18.seg |
| Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg19.seg |
| Copy number variation | - | Affymetrix SNP Array 6.0 | hg19.seg |
| Copy number variation | - | Illumina HiSeq | - |
| DNA methylation | | Illumina Human Methylation 450 | Not used |
| DNA methylation | | Illumina Human Methylation 27 | Not used |
| DNA methylation | | Illumina DNA Methylation OMA003 CPI | Not used |
| DNA methylation | | Illumina DNA Methylation OMA002 CPI | Not used |
| DNA methylation | | Illumina Hi Seq | |
| DNA methylation | Bisulfite sequence alignment | | |
| DNA methylation | Methylation percentage | | |
| DNA methylation | Aligned reads | | |
| Gene expression | Gene expression quantification | Illumina HiSeq | normalized_results |
| Gene expression | Gene expression quantification | Illumina HiSeq | results |
| Gene expression | Gene expression quantification | HT_HG-U133A | - |
| Gene expression | Gene expression quantification | AgilentG4502A_07_2 | - |
| Gene expression | Gene expression quantification | AgilentG4502A_07_1 | - |
| Gene expression | Gene expression quantification | HuEx-1_0-st-v2 | FIRMA.txt |
| Gene expression | Gene expression quantification | | gene.txt |
| Gene expression | Isoform expression quantification | - | - |
| Gene expression | miRNA gene quantification | - | hg19.mirna |
| Gene expression | miRNA gene quantification | | hg19.mirbase20 |
| Gene expression | miRNA gene quantification | | mirna |
| Gene expression | Exon junction quantification | - | - |
| Gene expression | Exon quantification | - | - |
| Gene expression | miRNA isoform quantification | - | hg19.isoform |
| Gene expression | miRNA isoform quantification | - | isoform |
| Other | | | |
| Protein expression | | MDA RPPA Core | - |
| Raw microarray data | Raw intensities | Illumina Human Methylation 450 | idat |
| Raw Microarray Data | Raw intensities | Illumina Human Methylation 27 | idat |
| Raw sequencing data | | | |
| Simple nucleotide variation | Simple somatic mutation | | |
| Structural Rearrangement | | | |

3. Harmonized database examples

3.1. DNA methylation data: recurrent tumor samples

In this example we access the harmonized database (legacy = FALSE) and search for all DNA methylation data from recurrent glioblastoma multiforme (GBM) and low grade glioma (LGG) samples.

# Load the required packages; they must be loaded before the functions
# in the code that follows can be used
library(TCGAbiolinks)
library(SummarizedExperiment)

# Build the query
query <- GDCquery(project = c("TCGA-GBM", "TCGA-LGG"),
                  data.category = "DNA Methylation",
                  legacy = FALSE,
                  platform = c("Illumina Human Methylation 450"),
                  sample.type = "Recurrent Solid Tumor"
)

# Display the query results
datatable(getResults(query), 
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
First row of the result, shown as field/value pairs:

| Field | Value |
|---|---|
| data_release | 12.0 - 15.0 |
| data_type | Methylation Beta Value |
| updated_datetime | 2018-11-30T04:41:54.596454+00:00 |
| file_name | jhu-usc.edu_GBM.HumanMethylation450.2.lvl-3.TCGA-06-0171-02A-11D-2004-05.gdc_hg38.txt |
| submitter_id | 5978b8ef-dc9a-4a00-9c0e-ec1772bce4cc-beta-value |
| file_id | 9d5e1554-95cd-4ced-9b51-19e0b42d4b31 |
| file_size | 141286194 |
| cases | TCGA-06-0171-02A-11D-2004-05 |
| id | 9d5e1554-95cd-4ced-9b51-19e0b42d4b31 |
| created_datetime | 2016-10-27T21:58:12.297090-05:00 |
| md5sum | 6955a67ab70c840a668b49a42d4dae71 |
| data_format | TXT |
| access | open |
| platform | Illumina Human Methylation 450 |
| state | released |
| version | 1 |
| data_category | DNA Methylation |
| type | methylation_beta_value |
| experimental_strategy | Methylation Array |
| project | TCGA-GBM |
| analysis_id | a22168b5-32a5-48e2-b603-f26c4ad16e95 |
| analysis_updated_datetime | 2018-09-06T13:49:07.196637-05:00 |
| analysis_created_datetime | 2016-10-27T21:58:12.297090-05:00 |
| analysis_submitter_id | 5978b8ef-dc9a-4a00-9c0e-ec1772bce4cc-workflow |
| analysis_state | released |
| analysis_workflow_link | https://github.com/NCI-GDC/met-liftover-tool |
| analysis_workflow_type | Liftover |
| tissue.definition | Recurrent Solid Tumor |

Only part of the data is shown here.

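3.2. DNA methylation and gene expression data: Colon Adenocarcinoma tumor

In this example we access the harmonized database (legacy = FALSE) and search for Colon Adenocarcinoma (TCGA-COAD) patients that have both DNA methylation and gene expression data.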

query.met <- GDCquery(project = "TCGA-COAD",
                      data.category = "DNA Methylation",
                      legacy = FALSE,
                      platform = c("Illumina Human Methylation 450"))
query.exp <- GDCquery(project = "TCGA-COAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - FPKM-UQ")

# Get all patients that have DNA methylation and gene expression.
common.patients <- intersect(substr(getResults(query.met, cols = "cases"), 1, 12),
                             substr(getResults(query.exp, cols = "cases"), 1, 12))

# Only select the first 5 patients
query.met <- GDCquery(project = "TCGA-COAD",
                      data.category = "DNA Methylation",
                      legacy = FALSE,
                      platform = c("Illumina Human Methylation 450"),
                      barcode = common.patients[1:5])
query.exp <- GDCquery(project = "TCGA-COAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - FPKM-UQ",
                      barcode = common.patients[1:5])

datatable(getResults(query.met, cols = c("data_type","cases")),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
| data_type | cases |
|---|---|
| Methylation Beta Value | TCGA-AA-3712-01A-21D-1721-05 |
| Methylation Beta Value | TCGA-AA-3712-11A-01D-1721-05 |
| Methylation Beta Value | TCGA-CK-6747-01A-11D-1837-05 |
| Methylation Beta Value | TCGA-AA-3502-11A-01D-1407-05 |
| Methylation Beta Value | TCGA-AA-3502-01A-01D-1407-05 |
| Methylation Beta Value | TCGA-D5-6536-01A-11D-1721-05 |
| Methylation Beta Value | TCGA-CM-6676-01A-11D-1837-05 |

datatable(getResults(query.exp, cols = c("data_type","cases")), 
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
| data_type | cases |
|---|---|
| Gene Expression Quantification | TCGA-AA-3712-11A-01R-1723-07 |
| Gene Expression Quantification | TCGA-AA-3712-01A-21R-1723-07 |
| Gene Expression Quantification | TCGA-CK-6747-01A-11R-1839-07 |
| Gene Expression Quantification | TCGA-AA-3502-01A-01R-1410-07 |
| Gene Expression Quantification | TCGA-D5-6536-01A-11R-1723-07 |
| Gene Expression Quantification | TCGA-CM-6676-01A-11R-1839-07 |
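
The patient-matching step in this example can be tried without contacting the GDC. The sketch below repeats the `substr` plus `intersect` logic on a few barcodes copied from the two result tables above; `met_cases` and `exp_cases` are illustrative stand-ins for what `getResults(query, cols = "cases")` returned in the original run.

```r
# Barcodes copied from the result tables above (a subset, for illustration).
met_cases <- c("TCGA-AA-3712-01A-21D-1721-05",
               "TCGA-AA-3712-11A-01D-1721-05",
               "TCGA-CK-6747-01A-11D-1837-05",
               "TCGA-AA-3502-11A-01D-1407-05")
exp_cases <- c("TCGA-AA-3712-11A-01R-1723-07",
               "TCGA-CK-6747-01A-11R-1839-07",
               "TCGA-D5-6536-01A-11R-1723-07")

# The first 12 characters identify the patient, so aliquots from different
# assays (and tumor/normal pairs) collapse to the same ID.
common_patients <- intersect(substr(met_cases, 1, 12),
                             substr(exp_cases, 1, 12))
print(common_patients)
```

Because `intersect` drops duplicates, a patient with both a tumor and a normal aliquot appears once in `common_patients`, yet passing that patient ID as barcode returns both aliquots. That is why five patients yield seven methylation files above.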

3.3. Raw sequencing data

This example shows how to search for raw sequencing data ("controlled" access) for breast cancer and check the file names and the barcodes associated with them.

query <- GDCquery(project = c("TCGA-BRCA"),
                  data.category = "Sequencing Reads",  
                  sample.type = "Primary solid Tumor")
# Only first 5 to make render faster
datatable(getResults(query, rows = 1:5,cols = c("file_name","cases")), 
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
| file_name | cases |
|---|---|
| TCGA-A7-A26E-01A-11R-A168-13_mirna_gdc_realn.bam | TCGA-A7-A26E-01A-11R-A168-13 |
| TCGA-E2-A156-01A-11D-A12B-09_IlluminaGA-DNASeq_exome_gdc_realn.bam | TCGA-E2-A156-01A-11D-A12B-09 |
| c399b9e0c9d7b4320262377b2e901557_gdc_realn.bam | TCGA-A7-A5ZX-01A-12D-A29N-09 |
| TCGA-E2-A1LS-01A-12R-A156-13_mirna_gdc_realn.bam | TCGA-E2-A1LS-01A-12R-A156-13 |
| 45ea8a7da00be45d52b0cc712c7c771c_gdc_realn.bam | TCGA-D8-A1JK-01A-11D-A13L-09 |

4. Legacy archive examples

4.1. DNA methylation data

4.1.1. DNA methylation data from array-based assays

This example shows how to search for glioblastoma multiforme (GBM) and low grade glioma (LGG) DNA methylation data from the Illumina Human Methylation 450 and Illumina Human Methylation 27 platforms.

query <- GDCquery(project = c("TCGA-GBM","TCGA-LGG"),
                  legacy = TRUE,
                  data.category = "DNA methylation",
                  platform = c("Illumina Human Methylation 450", "Illumina Human Methylation 27"))
datatable(getResults(query, rows = 1:100), 
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
First row of the result, shown as field/value pairs (empty values were blank in the output):

| Field | Value |
|---|---|
| data_release | |
| data_type | Methylation beta value |
| tags | meth |
| file_name | jhu-usc.edu_GBM.HumanMethylation450.8.lvl-3.TCGA-76-6661-01B-11D-1844-05.txt |
| submitter_id | |
| file_id | 8f3bf221-d738-4850-aa00-ca1e0d10a7e7 |
| file_size | 21285109 |
| cases | TCGA-76-6661-01B-11D-1844-05 |
| state_comment | |
| id | 8f3bf221-d738-4850-aa00-ca1e0d10a7e7 |
| md5sum | b6662864029f2b2f272569128a13371f |
| updated_datetime | 2017-03-05T18:38:56.646240-06:00 |
| data_format | TXT |
| access | open |
| platform | Illumina Human Methylation 450 |
| state | live |
| version | |
| data_category | DNA methylation |
| type | file |
| experimental_strategy | Methylation array |
| project | TCGA-GBM |
| code | 05 |
| center_name | Johns Hopkins / University of Southern California |
| center_short_name | JHU_USC |
| center_center_id | 7ef3885b-37ce-5e16-8ba3-9d75b6690008 |
| center_namespace | jhu-usc.edu |
| center_center_type | CGCC |
| tissue.definition | Primary solid Tumor |

Only part of the data is shown here.

4.1.2. DNA methylation data from whole-genome bisulfite sequencing (WGBS)

query <- GDCquery(project = c("TCGA-LUAD"),
                  legacy = TRUE,
                  data.type = "Methylation percentage",
                  experimental.strategy = "Bisulfite-Seq")

# VCF - controlled data
query <- GDCquery(project = c("TCGA-LUAD"),
                  legacy = TRUE,
                  data.type = "Bisulfite sequence alignment",
                  experimental.strategy = "Bisulfite-Seq")


# WGBS BAM files - controlled data
query <- GDCquery(project = c("TCGA-LUAD"),
                  legacy = TRUE,
                  data.type = "Aligned reads",
                  data.category = "Raw sequencing data",
                  experimental.strategy = "Bisulfite-Seq")

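4.2. Gene expression data

This example shows how to search for glioblastoma multiforme (GBM) gene expression data whose expression values have already been normalized. For more details, see the rnaseqV2 TCGA wiki.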

# Gene expression aligned against hg19.
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Gene expression quantification",
                           platform = "Illumina HiSeq", 
                           file.type  = "normalized_results",
                           experimental.strategy = "RNA-Seq",
                           barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                           legacy = TRUE)
datatable(getResults(query.exp.hg19), 
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
The two rows of the result, shown as field/value pairs (empty values were blank in the output):

| Field | Row 1 | Row 2 |
|---|---|---|
| data_release | | |
| data_type | Gene expression quantification | Gene expression quantification |
| tags | normalized,gene,v2 | normalized,gene,v2 |
| file_name | unc.edu.b469eb7c-723f-4870-b4e4-ebfaae7a118b.1536566.rsem.genes.normalized_results | unc.edu.152afe8c-f67c-4d7c-93ac-e1b7edd56c54.1544649.rsem.genes.normalized_results |
| submitter_id | | |
| file_id | 217d72e9-4d6f-409d-911c-0a70b17a0adc | 973ce0ac-f613-4b99-b2ab-3e2d5548f05f |
| file_size | 437283 | 436272 |
| cases | TCGA-14-0736-02A-01R-2005-01 | TCGA-06-0211-02A-02R-2005-01 |
| state_comment | | |
| id | 217d72e9-4d6f-409d-911c-0a70b17a0adc | 973ce0ac-f613-4b99-b2ab-3e2d5548f05f |
| md5sum | beda9f89f08fc6a892a72e8b704fdbd9 | 84478e78d95e1155019ccb7e0e0fea2f |
| updated_datetime | 2017-03-05T11:34:30.601697-06:00 | 2017-03-05T18:20:31.987895-06:00 |
| data_format | TXT | TXT |
| access | open | open |
| platform | Illumina HiSeq | Illumina HiSeq |
| state | live | live |
| version | | |
| data_category | Gene expression | Gene expression |
| type | file | file |
| experimental_strategy | RNA-Seq | RNA-Seq |
| project | TCGA-GBM | TCGA-GBM |
| code | 07 | 07 |
| center_name | University of North Carolina | University of North Carolina |
| center_short_name | UNC | UNC |
| center_center_id | ee7a85b3-8177-5d60-a10c-51180eb9009c | ee7a85b3-8177-5d60-a10c-51180eb9009c |
| center_namespace | unc.edu | unc.edu |
| center_center_type | CGCC | CGCC |
| tissue.definition | Recurrent Solid Tumor | Recurrent Solid Tumor |

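5. Getting the manifest for a query

To get the manifest (file list) from a query object, use the getManifest function. If its save argument is set to TRUE, a txt file is created. That file can be used with the GDC-client Data Transfer Tool (DTT) or with its GUI version, DTT-UI.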

getManifest(query.exp.hg19,save = FALSE) 
##                                      id
## 40 217d72e9-4d6f-409d-911c-0a70b17a0adc
## 97 973ce0ac-f613-4b99-b2ab-3e2d5548f05f
##                                                                              filename
## 40 unc.edu.b469eb7c-723f-4870-b4e4-ebfaae7a118b.1536566.rsem.genes.normalized_results
## 97 unc.edu.152afe8c-f67c-4d7c-93ac-e1b7edd56c54.1544649.rsem.genes.normalized_results
##                                 md5   size state
## 40 beda9f89f08fc6a892a72e8b704fdbd9 437283  live
## 97 84478e78d95e1155019ccb7e0e0fea2f 436272  live
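
To make the shape of such a manifest concrete, the sketch below rebuilds the two rows printed above as a data frame and writes them to a tab-delimited text file with base R. It is an illustration only: that the transfer tool expects a tab-delimited file with exactly these column names is an assumption here, and in practice `getManifest(query, save = TRUE)` should be used to create the file.

```r
# The two manifest rows printed above, rebuilt by hand for illustration.
manifest <- data.frame(
  id       = c("217d72e9-4d6f-409d-911c-0a70b17a0adc",
               "973ce0ac-f613-4b99-b2ab-3e2d5548f05f"),
  filename = c("unc.edu.b469eb7c-723f-4870-b4e4-ebfaae7a118b.1536566.rsem.genes.normalized_results",
               "unc.edu.152afe8c-f67c-4d7c-93ac-e1b7edd56c54.1544649.rsem.genes.normalized_results"),
  md5      = c("beda9f89f08fc6a892a72e8b704fdbd9",
               "84478e78d95e1155019ccb7e0e0fea2f"),
  size     = c(437283L, 436272L),
  state    = c("live", "live"),
  stringsAsFactors = FALSE
)

# Assumed layout: tab-delimited, header row, no quoting, no row names.
manifest_file <- tempfile(fileext = ".txt")
write.table(manifest, manifest_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm it round-trips.
reloaded <- read.delim(
  manifest_file,
  colClasses = c("character", "character", "character", "integer", "character")
)
print(reloaded[, c("id", "size", "state")])
```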

6. ATAC-seq data

ATAC-seq data is currently available from the GDC publication page. The files are listed below:

datatable(getResults(TCGAbiolinks:::GDCquery_ATAC_seq())[,c("file_name","file_size")], 
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
| file_name | file_size |
|---|---|
| TCGA-ATAC_PanCancer_PeakSet.txt | 37522221 |
| TCGA-ATAC_DataS1_DonorsAndStats_v4.xlsx | 251795 |
| MESO_bigWigs.tgz | 1507969705 |
| TCGA-ATAC_DataS5_GWAS_v2.xlsx | 999661 |
| COAD_bigWigs.tgz | 8939070313 |

We can also filter the file list with the GDCquery_ATAC_seq function and save the filtered data locally with GDCdownload.

query <- TCGAbiolinks:::GDCquery_ATAC_seq(file.type = "rds") 
GDCdownload(query,method = "client")

query <- TCGAbiolinks:::GDCquery_ATAC_seq(file.type = "bigWigs") 
GDCdownload(query,method = "client")

7. Information per patient

Retrieve the number of files under each combination of data_category, data_type, experimental_strategy, and platform. The result is similar to what https://portal.gdc.cancer.gov/exploration shows.

tab <-  getSampleFilesSummary("TCGA-ACC")
datatable(tab,
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
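Last updated: 2019-05-25 17:07:26

Written by 石九流.
Licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or sourced elsewhere, articles on this site are original or translated; please credit the author when reposting.
Original link: https://blog.computsystmed.com/archives/translation-tcgabiolinks-searching-gdc-database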