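Translation - TCGAbiolinks 4 - Clinical data

TCGAbiolinks: Clinical data

Published: March 20, 2019

TCGAbiolinks provides functions to search, download, and parse clinical data. This section first explains the different sources of clinical information in GDC, then presents the functions needed to access each source, and finally shows inconsistencies between the sources.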

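1. Useful information

1.1. Clinical data: different sources

In the GDC database, clinical data can be retrieved from two sources:

  • Clinical indexed: a refined subset of the clinical data, created from the XML files.
  • XML: the complete clinical information files.

1.2. Clinical data: differences between the sources

  • The XML files contain more information: radiation, drugs, follow-ups, biospecimens, and so on. Clinical indexed is only a subset of the XML files.
  • Clinical indexed contains updated data that takes the follow-up information into account. For example, if a patient was alive when the clinical data were first collected and had died by the next follow-up, then: 1) Clinical indexed shows the patient as dead; 2) the XML has two fields, one in the clinical section showing the patient alive at the first time point, and one in the follow-up section showing the patient dead. Section 5.1 shows the XML records of such a patient.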

2. Clinical indexed: retrieving the data

This example retrieves the Clinical indexed data.

library(TCGAbiolinks)
library(DT)  # provides datatable()
clinical <- GDCquery_clinic(project = "TCGA-LUAD", type = "clinical")
datatable(clinical, filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
Field | Row 1 | Row 2
submitter_id | TCGA-05-4244 | TCGA-05-4245
classification_of_tumor | not reported | not reported
last_known_disease_status | not reported | not reported
updated_datetime | 2018-09-06T21:10:41.393019-05:00 | 2018-09-06T21:10:41.393019-05:00
primary_diagnosis | Adenocarcinoma, NOS | Adenocarcinoma, NOS
tumor_stage | stage iv | stage iiia
age_at_diagnosis | 25752 | 29647
vital_status | alive | alive
morphology | 8140/3 | 8140/3
state | released | released
diagnosis_id | 71222be8-573b-5d40-a15e-57649e0aec0a | e2fb63d6-2eac-535c-bcd0-e9575ebaf006
tumor_grade | not reported | not reported
tissue_or_organ_of_origin | Lower lobe, lung | Upper lobe, lung
days_to_birth | -25752 | -29647
progression_or_recurrence | not reported | not reported
prior_malignancy | not reported | not reported
site_of_resection_or_biopsy | Lower lobe, lung | Upper lobe, lung
days_to_last_follow_up | 0 | 730
cigarettes_per_day | 2.08219178082192 | 1.75342465753425
exposure_id | 19e0b9d9-717b-53b1-a0f4-44ca005aeedd | 8d858a93-54e3-5432-9589-68c6b90a24c8
gender | male | male
year_of_birth | 1939 | 1928
race | not reported | not reported
demographic_id | 139ab7b9-af85-58e3-ac01-933eeae4afb9 | 3ec79286-1857-5eeb-a778-0c86ff2a13f6
ethnicity | not reported | not reported
treatment_id | 834430fa-3aa8-570c-bac0-0c09865fbb2b | 85506557-5c39-5d73-ba25-228aee1aea2a
bcr_patient_barcode | TCGA-05-4244 | TCGA-05-4245
disease | LUAD | LUAD

Empty in both rows: days_to_death, days_to_last_known_disease_status, created_datetime, days_to_recurrence, weight, alcohol_history, alcohol_intensity, bmi, years_smoked, height, year_of_death, therapeutic_agents, treatment_intent_type, treatment_or_therapy.

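Only part of the result is shown.

3. XML: parsing the clinical data

Clinical data are obtained directly from the XML files as follows:

  1. Use GDCquery and GDCdownload to search for and download the biospecimen or clinical XML files.
  2. Use GDCprepare_clinic to parse the XML files.

Note that the relationship between a patient and the other clinical information is 1:n: one patient can have several records of the same kind, for example several radiation treatments. For this reason, TCGAbiolinks returns one table per kind of clinical information when parsing the XML, for example only the drug information or only the radiation information. The clinical.info argument selects which table is returned.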
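The 1:n relationship matters as soon as the per-topic tables are combined. The following is a minimal base R sketch with made-up toy tables (the barcodes and values are placeholders, not TCGA data); only the column name bcr_patient_barcode is taken from the parsed tables shown later in this post.

```r
# Toy stand-ins for two parsed tables (all values are made up).
patient <- data.frame(
  bcr_patient_barcode = c("PATIENT-1", "PATIENT-2"),
  vital_status        = c("Alive", "Dead"),
  stringsAsFactors    = FALSE
)
radiation <- data.frame(
  bcr_patient_barcode = c("PATIENT-1", "PATIENT-1"),
  radiation_dosage    = c(30, 45),
  stringsAsFactors    = FALSE
)

# Left join: every patient is kept; a patient with several radiation
# courses is repeated once per course, a patient with none gets NA.
merged <- merge(patient, radiation, by = "bcr_patient_barcode", all.x = TRUE)
print(merged)

# Number of radiation records per patient (0 for PATIENT-2).
courses <- table(factor(radiation$bcr_patient_barcode,
                        levels = patient$bcr_patient_barcode))
print(courses)
```

Because of the repeated rows, counts or survival summaries computed on such a merged table should first be reduced back to one row per patient.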

3.1. The clinical.info argument

For each data category, clinical.info accepts the following values:

data.category | clinical.info
Clinical | drug
Clinical | admin
Clinical | follow_up
Clinical | radiation
Clinical | patient
Clinical | stage_event
Clinical | new_tumor_event
Biospecimen | sample
Biospecimen | bio_patient
Biospecimen | analyte
Biospecimen | aliquot
Biospecimen | protocol
Biospecimen | portion
Biospecimen | slide
Other | msi
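The table above can be kept as a small lookup so that a script checks a data.category / clinical.info pair before calling GDCprepare_clinic. This is only a sketch: is_valid_clinical_info is a hypothetical helper written for this post, not a TCGAbiolinks function, and the values are transcribed from the table.

```r
# Lookup transcribed from the table above.
clinical_info_options <- list(
  Clinical    = c("drug", "admin", "follow_up", "radiation",
                  "patient", "stage_event", "new_tumor_event"),
  Biospecimen = c("sample", "bio_patient", "analyte", "aliquot",
                  "protocol", "portion", "slide"),
  Other       = c("msi")
)

# Hypothetical helper (not part of TCGAbiolinks): TRUE when the
# clinical.info value is listed for the given data.category.
is_valid_clinical_info <- function(data.category, clinical.info) {
  clinical.info %in% clinical_info_options[[data.category]]
}

ok_pair  <- is_valid_clinical_info("Clinical", "drug")
bad_pair <- is_valid_clinical_info("Biospecimen", "drug")
print(c(ok_pair, bad_pair))
```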

3.2. Data retrieval examples

The following examples retrieve clinical data directly from the clinical XML files.

  1. Patient clinical information:

    library(TCGAbiolinks)
    library(SummarizedExperiment)
    library(DT)  # provides datatable()
    query <- GDCquery(project = "TCGA-COAD", 
                      data.category = "Clinical", 
                      file.type = "xml", 
                      barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
    GDCdownload(query)
    clinical <- GDCprepare_clinic(query, clinical.info = "patient")
    datatable(clinical, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
    
    Columns: bcr_patient_barcode, additional_studies, tumor_tissue_site, histological_type, other_dx, gender, vital_status, days_to_birth, days_to_last_known_alive, days_to_death, days_to_last_followup, race_list, tissue_source_site, patient_id, bcr_patient_uuid, history_of_neoadjuvant_treatment, informed_consent_verified, icd_o_3_site, icd_o_3_histology, icd_10, tissue_prospective_collection_indicator, tissue_retrospective_collection_indicator, days_to_initial_pathologic_diagnosis, age_at_initial_pathologic_diagnosis, year_of_initial_pathologic_diagnosis, person_neoplasm_cancer_status, ethnicity, weight, height, day_of_form_completion, month_of_form_completion, year_of_form_completion, residual_tumor, anatomic_neoplasm_subdivision, primary_lymph_node_presentation_assessment, lymph_node_examined_count, number_of_lymphnodes_positive_by_he, number_of_lymphnodes_positive_by_ihc, preoperative_pretreatment_cea_level, non_nodal_tumor_deposits, circumferential_resection_margin, venous_invasion, lymphatic_invasion, perineural_invasion_present, microsatellite_instability, number_of_loci_tested, number_of_abnormal_loci, kras_gene_analysis_performed, kras_mutation_found, kras_mutation_codon, braf_gene_analysis_performed, braf_gene_analysis_result, synchronous_colon_cancer_present, history_of_colon_polyps, colon_polyps_present, loss_expression_of_mismatch_repair_proteins_by_ihc, loss_expression_of_mismatch_repair_proteins_by_ihc_results, number_of_first_degree_relatives_with_cancer_diagnosis, radiation_therapy, postoperative_rx_tx, primary_therapy_outcome_success, has_new_tumor_events_information, has_drugs_information, has_radiations_information, has_follow_ups_information, project, stage_event_system_version, stage_event_clinical_stage, stage_event_pathologic_stage, stage_event_tnm_categories, stage_event_psa, stage_event_gleason_grading, stage_event_ann_arbor, stage_event_serum_markers, stage_event_igcccg_stage, stage_event_masaoka_stage

    Only selected columns are shown:

    bcr_patient_barcode | tumor_tissue_site | histological_type | gender | vital_status | days_to_birth | age_at_initial_pathologic_diagnosis | person_neoplasm_cancer_status | project | stage_event_system_version | stage_event_pathologic_stage | stage_event_tnm_categories
    TCGA-AA-3972 | Colon | Colon Adenocarcinoma | MALE | Alive | -26360 | 72 | WITH TUMOR | TCGA-COAD | 6th | Stage IV | T3N1M1
    TCGA-AA-3972 | Colon | Colon Adenocarcinoma | MALE | Alive | -26360 | 72 | WITH TUMOR | TCGA-COAD | 6th | Stage IV | T3N1M1
    TCGA-RU-A8FL | Colon | Colon Adenocarcinoma | MALE | Alive | -18975 | 51 | WITH TUMOR | TCGA-COAD | 7th | Stage IIIB | T3N2aMX

  2. Drug information:

    clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
    datatable(clinical.drug, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
    
    Columns: bcr_patient_barcode, tx_on_clinical_trial, regimen_number, bcr_drug_barcode, bcr_drug_uuid, total_dose, total_dose_units, prescribed_dose, prescribed_dose_units, number_cycles, days_to_drug_therapy_start, days_to_drug_therapy_end, therapy_types, drug_name, clinical_trail_drug_classification, regimen_indication, regimen_indication_notes, route_of_administrations, therapy_ongoing, measure_of_response, day_of_form_completion, month_of_form_completion, year_of_form_completion, project

    bcr_patient_barcode | tx_on_clinical_trial | bcr_drug_barcode | bcr_drug_uuid | therapy_types | drug_name | therapy_ongoing | measure_of_response | project
    TCGA-AA-3972 | NO | TCGA-AA-3972-D37426 | 45D2A60B-C35E-4BAC-9769-B1D5C0153FED | Chemotherapy | Capecitabine | NO | Clinical Progressive Disease | TCGA-COAD
    TCGA-AA-3972 | NO | TCGA-AA-3972-D37426 | 45D2A60B-C35E-4BAC-9769-B1D5C0153FED | Chemotherapy | Capecitabine | NO | Clinical Progressive Disease | TCGA-COAD

    Only part of the information is shown.

  3. Radiation information:

    clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
    datatable(clinical.radiation, options = list(scrollX = TRUE,  keys = TRUE), rownames = FALSE)
    
    Field | Value
    bcr_patient_barcode | TCGA-RU-A8FL
    bcr_radiation_barcode | TCGA-RU-A8FL-R68275
    bcr_radiation_uuid | 07403C45-7D5F-4DAB-AFD2-0AC913DD0AFB
    days_to_radiation_therapy_start | 788
    days_to_radiation_therapy_end | 788
    radiation_type | Internal
    radiation_dosage | 568.1
    units | Gy
    numfractions | 5
    anatomic_treatment_site | Distant Recurrence
    radiation_treatment_ongoing | NO
    measure_of_response | Stable Disease
    day_of_form_completion | 14
    month_of_form_completion | 11
    year_of_form_completion | 2014
    project | TCGA-COAD

    Empty fields: radiation_type_notes, regimen_indication, regimen_indication_notes, course_number.

  4. Administrative information:

    clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")
    datatable(clinical.admin, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
    
    bcr_patient_barcode | bcr | file_uuid | batch_number | project_code | disease_code | day_of_dcc_upload | month_of_dcc_upload | year_of_dcc_upload | patient_withdrawal | program | dbgap_registration_code | project
    TCGA-AA-3972 | Nationwide Children's Hospital | EA58E3D4-40E0-406F-A7A3-C7C282CFA203 | 41.78.0 | TCGA | COAD | 22 | 12 | 2016 | false | | | TCGA-COAD
    TCGA-AA-3972 | Nationwide Children's Hospital | EA58E3D4-40E0-406F-A7A3-C7C282CFA203 | 41.78.0 | TCGA | COAD | 22 | 12 | 2016 | false | | | TCGA-COAD

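3.3. Microsatellite data

The MSI-Mono-Dinucleotide Assay tests: 1) a panel of four mononucleotide repeat loci (the polyadenine tracts BAT25, BAT26, and BAT40, and transforming growth factor receptor type II); 2) three dinucleotide repeat loci (CA repeats in D2S123, D5S346, and D17S250). Two additional pentanucleotide loci, Penta D and Penta E, are included in the assay to evaluate sample identity. Multiplex fluorescent-labeled PCR and capillary electrophoresis are used to identify MSI when the number of microsatellite repeats differs between the tumor and matched non-neoplastic tissue or mononuclear blood cells. Equivocal or failed markers are re-evaluated by singleplex PCR.

Classification: 1) microsatellite-stable (MSS); 2) low level MSI (MSI-L): fewer than 40% of the markers are altered; 3) high level MSI (MSI-H): more than 40% of the markers are altered.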
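As a rough illustration of the classification rule, the sketch below maps a count of altered markers to a class. classify_msi is a hypothetical helper, not part of TCGAbiolinks. Two points are assumptions that the text above does not state: MSS is taken to mean that no marker is altered, and a fraction of exactly 40% is grouped with MSI-H.

```r
# Hypothetical helper: classify MSI status from marker counts.
# Assumptions: MSS = no altered marker; exactly 40% counts as MSI-H.
classify_msi <- function(n_altered, n_tested, cutoff = 0.40) {
  stopifnot(n_tested > 0, n_altered >= 0, n_altered <= n_tested)
  fraction <- n_altered / n_tested
  if (n_altered == 0) {
    "MSS"
  } else if (fraction < cutoff) {
    "MSI-L"
  } else {
    "MSI-H"
  }
}

# The assay panel has seven markers (four mono- and three dinucleotide loci).
status <- c(classify_msi(0, 7), classify_msi(2, 7), classify_msi(4, 7))
print(status)
```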

Reference: TCGA wiki

The Level 3 data are included in the submitted BCR clinical data and can be downloaded as follows:

library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-COAD", 
                  data.category = "Other",
                  legacy = TRUE,
                  access = "open",
                  data.type = "Auxiliary test",
                  barcode = c("TCGA-AD-A5EJ","TCGA-DM-A0X9"))  
GDCdownload(query)
msi_results <- GDCprepare_clinic(query, "msi")
datatable(msi_results, options = list(scrollX = TRUE, keys = TRUE))
row | bcr_aliquot_uuid | mononucleotide_and_dinucleotide_marker_panel_analysis_status | mononucleotide_marker_panel_analysis_status | bcr_patient_barcode
1 | 6139cab7-959f-4c49-b9e7-cc25ea8e041c | MSI-H | | TCGA-AD-A5EJ
2 | 511f645e-f107-4f58-9c6a-6f931e612bd6 | MSS | | TCGA-DM-A0X9

4. Retrieving Legacy clinical data

The clinical data types available in the Legacy database are:

  • Biospecimen data (Biotab format)
  • Tissue slide image (SVS format)
  • Clinical Supplement (XML format)
  • Pathology report (PDF)
  • Clinical data (Biotab format)

  1. Retrieving Tissue slide images

    # Tissue slide image files
    library(TCGAbiolinks)
    library(SummarizedExperiment)
    library(dplyr)  # provides %>% and select()
    library(DT)     # provides datatable()
    query <- GDCquery(project = "TCGA-COAD", 
                      data.category = "Clinical", 
                      data.type = "Tissue slide image",
                      legacy = TRUE,
                      barcode = c("TCGA-RU-A8FL","TCGA-AA-3972")) 
    query %>% getResults %>% datatable(options = list(scrollX = TRUE, keys = TRUE))
    
    Columns: data_release, data_type, tags, file_name, submitter_id, file_id, file_size, cases, state_comment, id, md5sum, updated_datetime, data_format, access, platform, state, version, data_category, type, project, code, center_name, center_short_name, center_center_id, center_namespace, center_center_type, tissue.definition

    Only selected columns are shown:

    row | data_type | tags | file_name | file_id | file_size | cases | data_format | access | state | project
    494 | Tissue slide image | image | TCGA-RU-A8FL-01A-01-TSA.3743238D-C8AE-49A3-B5C9-3B18C1B64964.svs | 530083be-6bdf-49e8-85dc-3c3ee5d5dcd5 | 186948553 | TCGA-RU-A8FL | SVS | open | live | TCGA-COAD
    975 | Tissue slide image | image | TCGA-AA-3972-01A-01-BS1.6c038157-503f-47dc-9445-72d7cdcae1a7.svs | 0d4a3c6c-0ab2-44d2-9b08-90f5ea84555f | 130569853 | TCGA-AA-3972 | SVS | open | live | TCGA-COAD
    980 | Tissue slide image | image | TCGA-AA-3972-01A-01-TS1.3492c5d3-141f-48f9-a61b-48a1f595792c.svs | 4966c8e3-37fd-4296-8a8a-216def9ec311 | 62750835 | TCGA-AA-3972 | SVS | open | live | TCGA-COAD

  2. Retrieving Pathology reports

    # Pathology report
    query <- GDCquery(project = "TCGA-COAD", 
                      data.category = "Clinical", 
                      data.type = "Pathology report",
                      legacy = TRUE,
                      barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))  
    query %>% getResults %>% datatable(options = list(scrollX = TRUE, keys = TRUE))
    
    Columns: data_release, data_type, updated_datetime, file_name, submitter_id, file_id, file_size, cases, state_comment, id, md5sum, data_format, access, platform, state, version, data_category, type, project, code, center_name, center_short_name, center_center_id, center_namespace, center_center_type, tissue.definition

    Only selected columns are shown:

    row | data_type | updated_datetime | file_name | file_id | file_size | cases | data_format | access | state | project
    8 | Pathology report | 2017-03-05T18:42:42.189892-06:00 | TCGA-RU-A8FL.92DF9BB6-DB3F-40A7-AE1E-8272C46B1968.pdf | a4753077-2bd3-4301-8424-b7575c8ccd66 | 206913 | TCGA-RU-A8FL | PDF | open | live | TCGA-COAD
    307 | Pathology report | 2017-03-05T16:30:32.274191-06:00 | TCGA-AA-3972.2562de97-b8b4-4547-9e7f-4c0fab6552b3.pdf | b77a41e9-cf0d-4b94-9576-09e91b6d8f61 | 8658 | TCGA-AA-3972 | PDF | open | live | TCGA-COAD

  3. Retrieving Clinical Supplements

    # Clinical Supplement
    query <- GDCquery(project = "TCGA-COAD", 
                      data.category = "Clinical", 
                      data.type = "Clinical Supplement",
                      legacy = TRUE,
                      barcode = c("TCGA-RU-A8FL","TCGA-AA-3972")) 
    query %>% getResults %>% datatable(options = list(scrollX = TRUE, keys = TRUE))
    
    Columns: data_release, data_type, updated_datetime, file_name, submitter_id, file_id, file_size, cases, state_comment, id, created_datetime, md5sum, data_format, access, state, version, data_category, type, project, tissue.definition

    Only selected columns are shown:

    row | data_type | updated_datetime | file_name | file_id | file_size | cases | created_datetime | data_format | access | state | type | project
    275 | Clinical Supplement | 2017-03-04T16:40:34.621217-06:00 | nationwidechildrens.org_clinical.TCGA-RU-A8FL.xml | 3c5a4713-6855-42d4-aed6-3129bfe80c58 | 62955 | TCGA-RU-A8FL | 2016-05-04T08:54:03.280502-05:00 | BCR XML | open | live | clinical_supplement | TCGA-COAD
    427 | Clinical Supplement | 2017-03-04T16:40:34.621217-06:00 | nationwidechildrens.org_clinical.TCGA-AA-3972.xml | c76af5df-aab0-47a0-a543-77668be3f0c7 | 66578 | TCGA-AA-3972 | 2016-05-04T08:54:13.372056-05:00 | BCR XML | open | live | clinical_supplement | TCGA-COAD

  4. Retrieving Clinical data

    # Clinical data
    query <- GDCquery(project = "TCGA-COAD", 
                      data.category = "Clinical", 
                      data.type = "Clinical data",
                      legacy = TRUE,
                      file.type = "txt")  
    query %>% getResults %>% select(-matches("cases"))%>% datatable(options = list(scrollX = TRUE, keys = TRUE))
    
    Columns: data_release, data_type, tags, file_name, submitter_id, file_id, file_size, state_comment, id, created_datetime, md5sum, updated_datetime, data_format, access, platform, state, version, data_category, type, project, code, center_name, center_short_name, center_center_id, center_namespace, center_center_type, tissue.definition

    Only selected columns are shown:

    row | data_type | tags | file_name | file_id | file_size | data_format | access | state | project
    23 | Clinical data | drug | nationwidechildrens.org_clinical_drug_coad.txt | 0415ffe2-a98d-40b9-ac60-6753fce56c7b | 237163 | Biotab | open | live | TCGA-COAD
    25 | Clinical data | patient | nationwidechildrens.org_clinical_patient_coad.txt | b58b5947-d2b6-4cc7-9eff-cc0083d5bf4b | 396060 | Biotab | open | live | TCGA-COAD
    29 | Clinical data | radiation | nationwidechildrens.org_clinical_radiation_coad.txt | 36b48c2d-f45d-4995-bc2d-931f5c190919 | 6264 | Biotab | open | live | TCGA-COAD

  5. Downloading and parsing the Clinical data (Biotab) files from the previous query

    GDCdownload(query)
    clinical.biotab <- GDCprepare(query)
    names(clinical.biotab)
    ## [1] "clinical_radiation_coad"          "clinical_nte_coad"               
    ## [3] "clinical_patient_coad"            "clinical_drug_coad"              
    ## [5] "clinical_follow_up_v1.0_nte_coad" "clinical_omf_v4.0_coad"          
    ## [7] "clinical_follow_up_v1.0_coad"
    datatable(clinical.biotab$clinical_radiation_coad, options = list(scrollX = TRUE, keys = TRUE))
    
    In the parsed table, rows 1 and 2 are extra header rows from the Biotab file (an alternative column name and the CDE ID); the data start at row 3. The table is shown transposed:

    Column | Row 1 (alternative name) | Row 2 (CDE ID) | Row 3 | Row 4 | Row 5
    bcr_patient_uuid | bcr_patient_uuid | CDE_ID: | e6ec5a68-7555-4f26-bd7e-9cdb4c5f7004 | bce3ce45-4fb3-4d8e-9ec7-d24427c2ba4d | bce3ce45-4fb3-4d8e-9ec7-d24427c2ba4d
    bcr_patient_barcode | bcr_patient_barcode | CDE_ID:2673794 | TCGA-AA-3549 | TCGA-AA-3692 | TCGA-AA-3692
    bcr_radiation_barcode | bcr_radiation_barcode | CDE_ID: | TCGA-AA-3549-R38338 | TCGA-AA-3692-R38345 | TCGA-AA-3692-R38346
    bcr_radiation_uuid | bcr_radiation_uuid | CDE_ID: | B72A855F-225F-4537-A74F-8485ABDBA0D0 | 2054309D-1EBC-4311-BF8A-621F6447F385 | 1081C34F-DA75-4856-966C-8F9B10E784AA
    form_completion_date | form_completion_date | CDE_ID: | 2012-12-13 | 2012-12-13 | 2012-12-13
    radiation_therapy_type | radiation_type | CDE_ID:2842944 | External | External | External
    radiation_therapy_site | anatomic_treatment_site | CDE_ID:2793522 | Distant Recurrence | Distant Recurrence | Distant Recurrence
    radiation_total_dose | radiation_dosage | CDE_ID:2721441 | 9 | 39 | 38
    radiation_adjuvant_units | units | CDE_ID:61446 | Gy | Gy | Gy
    radiation_adjuvant_fractions_total | numfractions | CDE_ID:61465 | [Not Available] | [Not Available] | [Not Available]
    radiation_therapy_started_days_to | days_to_radiation_therapy_start | CDE_ID:3008313 | 1126 | 31 | 365
    radiation_therapy_ongoing_indicator | radiation_treatment_ongoing | CDE_ID:2842745 | NO | NO | NO
    radiation_therapy_ended_days_to | days_to_radiation_therapy_end | CDE_ID:3008333 | 1126 | 426 | 761
    treatment_best_response | measure_of_response | CDE_ID:2857291 | Radiographic Progressive Disease | Radiographic Progressive Disease | Radiographic Progressive Disease
    course_number | course_number | CDE_ID:2732184 | [Not Available] | [Not Available] | [Not Available]
    radiation_type_other | radiation_type_notes | CDE_ID:2195477 | [Not Applicable] | [Not Applicable] | [Not Applicable]
    therapy_regimen | regimen_indication | CDE_ID:2793511 | [Not Available] | [Not Available] | [Not Available]
    therapy_regimen_other | regimen_indication_notes | CDE_ID:2793516 | [Not Available] | [Not Available] | [Not Available]

    Only part of the data is shown.

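5. Inconsistencies in the clinical data

Some inconsistencies have been found in the Clinical indexed data and are being investigated by the GDC team. These inconsistencies are:

  • The Vital status field is not correctly updated
  • The Tumor Grade field is not populated
  • The Progression or Recurrence field is not populated

5.1. Vital status inconsistency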

# Get XML files and parse them
clin.query <- GDCquery(project = "TCGA-READ", data.category = "Clinical", file.type = "xml", barcode = "TCGA-F5-6702")
GDCdownload(clin.query)
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
clinical.patient.followup <- GDCprepare_clinic(clin.query, clinical.info = "follow_up")

# Get indexed data
clinical.index <- GDCquery_clinic("TCGA-READ")

dplyr::select(clinical.patient,vital_status,days_to_death,days_to_last_followup) %>% datatable
row | vital_status | days_to_death | days_to_last_followup
1 | Alive | | 66
dplyr::select(clinical.patient.followup, vital_status,days_to_death,days_to_last_followup) %>% datatable
row | vital_status | days_to_death | days_to_last_followup
1 | Dead | 869 |
2 | Alive | | 452
# Vital status should be the same in the follow up table 
dplyr::filter(clinical.index,submitter_id == "TCGA-F5-6702") %>% dplyr::select(vital_status,days_to_death,days_to_last_follow_up) %>% datatable
row | vital_status | days_to_death | days_to_last_follow_up
1 | alive | 869 | 452
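One possible way to work around this inconsistency is to derive the vital status from the XML records directly. The sketch below is an assumption about how that could be done, not a TCGAbiolinks feature: it treats a patient as dead if any XML record reports death. The input values are copied by hand from the output above for patient TCGA-F5-6702.

```r
# Values copied from the XML-derived tables above (patient TCGA-F5-6702).
patient_tab <- data.frame(vital_status          = "Alive",
                          days_to_death         = NA_real_,
                          days_to_last_followup = 66,
                          stringsAsFactors      = FALSE)
followup_tab <- data.frame(vital_status          = c("Dead", "Alive"),
                           days_to_death         = c(869, NA),
                           days_to_last_followup = c(NA, 452),
                           stringsAsFactors      = FALSE)

# Assumed rule: the patient is dead if any record says so.
all_records <- rbind(patient_tab, followup_tab)
is_dead <- any(tolower(all_records$vital_status) == "dead")

resolved <- data.frame(
  vital_status = if (is_dead) "dead" else "alive",
  days_to_death = if (is_dead) min(all_records$days_to_death, na.rm = TRUE) else NA_real_,
  days_to_last_follow_up = max(all_records$days_to_last_followup, na.rm = TRUE),
  stringsAsFactors = FALSE
)
print(resolved)
```

With this rule the resolved record reports the patient as dead, whereas the Clinical indexed row above still reports alive together with a days_to_death value.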

5.2. Progression or Recurrence and Tumor Grade inconsistencies

# Get XML files and parse them
recurrent.samples <- GDCquery(project = "TCGA-LIHC",
                             data.category = "Transcriptome Profiling",
                             data.type = "Gene Expression Quantification", 
                             workflow.type = "HTSeq - Counts",
                             sample.type =  "Recurrent Solid Tumor")$results[[1]] %>% select(cases)
recurrent.patients <- unique(substr(recurrent.samples$cases,1,12))
clin.query <- GDCquery(project = "TCGA-LIHC", data.category = "Clinical", file.type = "xml", barcode = recurrent.patients)
GDCdownload(clin.query)
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient") 

# Get indexed data
GDCquery_clinic("TCGA-LIHC") %>% dplyr::filter(submitter_id %in% recurrent.patients) %>% 
    dplyr::select(progression_or_recurrence,days_to_recurrence,tumor_grade) %>% datatable
row | progression_or_recurrence | days_to_recurrence | tumor_grade
1 | not reported | | not reported
2 | not reported | | not reported
# XML data
clinical.patient %>% dplyr::select(bcr_patient_barcode,neoplasm_histologic_grade) %>% datatable
row | bcr_patient_barcode | neoplasm_histologic_grade
1 | TCGA-DD-AACA | G3
2 | TCGA-ZS-A9CF | G2

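6. Functions for filtering clinical data

Some functions for working with clinical data are also provided.

For example, the function TCGAquery_SampleTypes filters barcodes according to the sample types given in the typesample argument.

Argument | Description
barcode | a list of samples as TCGA barcodes
typesample | a character vector indicating the tissue type to query. Possible values are listed below.

typesample | Sample type
TP | PRIMARY SOLID TUMOR
TR | RECURRENT SOLID TUMOR
TB | Primary Blood Derived Cancer-Peripheral Blood
TRBM | Recurrent Blood Derived Cancer-Bone Marrow
TAP | Additional-New Primary
TM | Metastatic
TAM | Additional Metastatic
THOC | Human Tumor Original Cells
TBM | Primary Blood Derived Cancer-Bone Marrow
NB | Blood Derived Normal
NT | Solid Tissue Normal
NBC | Buccal Cell Normal
NEBV | EBV Immortalized Normal
NBM | Bone Marrow Normal

The function TCGAquery_MatchedCoupledSampleTypes keeps only the samples of patients that have all the sample types given in typesample. For example, if typesample is c("TP", "TR"), the function returns the barcodes of a patient only if that patient has both types. If a patient has TP but no TR, no barcode is returned for that patient. If a patient has both TP and TR, both barcodes are returned.

Example: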

bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",  
         "TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
         "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
         "TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
         "TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
         "TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
         "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
         "TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
         "TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
         "TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
         "TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
         "TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
         "TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
         "TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
         "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
         "TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")

S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")

# Retrieve multiple tissue types  NOT FROM THE SAME PATIENTS
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))

# Retrieve multiple tissue types  FROM THE SAME PATIENTS
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))
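
To make the filtering logic concrete, here is a base R sketch of the same idea. It relies on the TCGA barcode layout, in which characters 14 and 15 hold the sample-type code (01 = TP, 02 = TR, 10 = NB, 11 = NT) and the first 12 characters identify the patient. filter_by_type and matched_types are hypothetical re-implementations for illustration only; the real TCGAbiolinks functions may handle more codes and edge cases differently.

```r
# Subset of the TCGA sample-type codes, enough for this sketch.
sample_codes <- c(TP = "01", TR = "02", NB = "10", NT = "11")

# Hypothetical stand-in for TCGAquery_SampleTypes:
# keep barcodes whose sample-type code matches any requested type.
filter_by_type <- function(barcodes, typesample) {
  code <- substr(barcodes, 14, 15)
  barcodes[code %in% sample_codes[typesample]]
}

# Hypothetical stand-in for TCGAquery_MatchedCoupledSampleTypes:
# keep barcodes only for patients that have both requested types.
matched_types <- function(barcodes, typesample) {
  stopifnot(length(typesample) == 2)
  hits    <- filter_by_type(barcodes, typesample)
  patient <- substr(hits, 1, 12)
  code    <- substr(hits, 14, 15)
  n_types <- tapply(code, patient, function(x) length(unique(x)))
  keep    <- names(n_types)[n_types == 2]
  hits[patient %in% keep]
}

toy <- c("TCGA-G9-6336-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-01A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07")
tp_only <- filter_by_type(toy, "TP")
paired  <- matched_types(toy, c("NT", "TP"))
print(tp_only)
print(paired)
```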

7. Other useful code

To get all the information for the TCGA samples, you can use the following scripts:

# This code will get all clinical indexed data from TCGA
library(TCGAbiolinks)
library(data.table)
library(dplyr)
library(regexPipes)
clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>% 
    regexPipes::grep("TCGA",value=T) %>% 
    sort %>% 
    plyr::alply(1,GDCquery_clinic, .progress = "text") %>% 
    rbindlist
# The output file is passed by position, which works whether readr names this argument path or file
readr::write_csv(clinical, "all_clin_indexed.csv")

# This code will get all clinical XML data from TCGA
getclinical <- function(proj){
    message(proj)
    # Retry until the project has been downloaded and parsed without error
    while(1){
        result = tryCatch({
            query <- GDCquery(project = proj, data.category = "Clinical",file.type = "xml")
            GDCdownload(query)
            clinical <- GDCprepare_clinic(query, clinical.info = "patient")
            for(i in c("admin","radiation","follow_up","drug","new_tumor_event")){
                message(i)
                aux <- GDCprepare_clinic(query, clinical.info = i)
                if(is.null(aux) || nrow(aux) == 0) next
                # add a suffix to columns (other than the merge key) that already exist in clinical
                replicated <- which(colnames(aux) != "bcr_patient_barcode" & colnames(aux) %in% colnames(clinical))
                colnames(aux)[replicated] <- paste0(colnames(aux)[replicated],".",i)
                clinical <- merge(clinical,aux,by = "bcr_patient_barcode", all = TRUE)
            }
            readr::write_csv(clinical, paste0(proj,"_clinical_from_XML.csv")) # Save the clinical data into a csv file
            return(clinical)
        }, error = function(e) {
            message(paste0("Error clinical: ", proj))
        })
    }
}
clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>% 
    regexPipes::grep("TCGA",value=T) %>% sort %>% 
    plyr::alply(1,getclinical, .progress = "text") %>% 
    rbindlist(fill = TRUE) %>% setDF %>% unique  # drop duplicated rows

readr::write_csv(clinical, "all_clin_XML.csv")
# result: https://drive.google.com/open?id=0B0-8N2fjttG-WWxSVE5MSGpva1U
# Note: this table has multiple rows for each patient, as a patient might have several follow-ups, drug treatments,
# new tumor events, etc.
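Updated: 2019-05-25 17:18:18

Written by 石九流.
Licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or otherwise attributed, articles on this site are original work or translations by this site; please credit the source when reposting.
Original link: https://blog.computsystmed.com/archives/translation-tcgabiolinks-clinical-data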
