Published: March 20, 2019
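TCGAbiolinks provides functions to search, download, and parse clinical data. This section first explains the different sources of clinical information in the GDC, then presents the functions needed to access each source, and finally shows inconsistencies between the sources.
In the GDC database, clinical data can be retrieved from two sources: the indexed clinical data (Clinical indexed) and the clinical XML files.
In this example, we fetch the Clinical indexed data.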
library(TCGAbiolinks)
library(DT) # provides datatable()
clinical <- GDCquery_clinic(project = "TCGA-LUAD", type = "clinical")
datatable(clinical, filter = 'top',
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = FALSE)
submitter_id | classification_of_tumor | last_known_disease_status | updated_datetime | primary_diagnosis | tumor_stage | age_at_diagnosis | vital_status | morphology | days_to_death | days_to_last_known_disease_status | created_datetime | state | days_to_recurrence | diagnosis_id | tumor_grade | tissue_or_organ_of_origin | days_to_birth | progression_or_recurrence | prior_malignancy | site_of_resection_or_biopsy | days_to_last_follow_up | cigarettes_per_day | weight | alcohol_history | alcohol_intensity | bmi | years_smoked | exposure_id | height | gender | year_of_birth | race | demographic_id | ethnicity | year_of_death | treatment_id | therapeutic_agents | treatment_intent_type | treatment_or_therapy | bcr_patient_barcode | disease |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-05-4244 | not reported | not reported | 2018-09-06T21:10:41.393019-05:00 | Adenocarcinoma, NOS | stage iv | 25752 | alive | 8140/3 | released | 71222be8-573b-5d40-a15e-57649e0aec0a | not reported | Lower lobe, lung | -25752 | not reported | not reported | Lower lobe, lung | 0 | 2.08219178082192 | 19e0b9d9-717b-53b1-a0f4-44ca005aeedd | male | 1939 | not reported | 139ab7b9-af85-58e3-ac01-933eeae4afb9 | not reported | 834430fa-3aa8-570c-bac0-0c09865fbb2b | TCGA-05-4244 | LUAD | ||||||||||||||
TCGA-05-4245 | not reported | not reported | 2018-09-06T21:10:41.393019-05:00 | Adenocarcinoma, NOS | stage iiia | 29647 | alive | 8140/3 | released | e2fb63d6-2eac-535c-bcd0-e9575ebaf006 | not reported | Upper lobe, lung | -29647 | not reported | not reported | Upper lobe, lung | 730 | 1.75342465753425 | 8d858a93-54e3-5432-9589-68c6b90a24c8 | male | 1928 | not reported | 3ec79286-1857-5eeb-a778-0c86ff2a13f6 | not reported | 85506557-5c39-5d73-ba25-228aee1aea2a | TCGA-05-4245 | LUAD |
Only part of the result is shown.
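The process for getting clinical data directly from the XML files is as follows: first, use the GDCquery and GDCdownload functions to search for and download the biospecimen or clinical XML files; then use the GDCprepare_clinic function to parse the XML files.
Note that the relationship between a patient and the other clinical information is 1:n, that is, one patient can have several records of the same kind; for example, a patient can receive several radiation treatments. For this reason, when parsing the XML, TCGAbiolinks returns a single table for one kind of clinical information at a time, such as only the drug information or only the radiation information. The clinical.info argument selects which kind of information is returned.
For each data category, the clinical.info argument accepts the following values: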
data.category | clinical.info |
---|---|
Clinical | drug |
Clinical | admin |
Clinical | follow_up |
Clinical | radiation |
Clinical | patient |
Clinical | stage_event |
Clinical | new_tumor_event |
Biospecimen | sample |
Biospecimen | bio_patient |
Biospecimen | analyte |
Biospecimen | aliquot |
Biospecimen | protocol |
Biospecimen | portion |
Biospecimen | slide |
Other | msi |
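Because one patient can map to several rows of a given kind of clinical information, joining a parsed table back to the patient table repeats the patient columns. The following is a minimal sketch of that 1:n relationship using only base R merge. The two data frames are made-up stand-ins (the barcodes, values, and reduced column sets are illustrative assumptions, not real GDCprepare_clinic output).

```r
# Toy stand-ins for tables returned by GDCprepare_clinic (hypothetical values)
patient <- data.frame(
  bcr_patient_barcode = c("TCGA-XX-0001", "TCGA-XX-0002"),
  gender = c("MALE", "FEMALE"),
  stringsAsFactors = FALSE
)
radiation <- data.frame(
  bcr_patient_barcode = c("TCGA-XX-0001", "TCGA-XX-0001"),
  radiation_dosage = c(30, 45),
  stringsAsFactors = FALSE
)

# Left join: keep every patient, one output row per radiation course (1:n)
merged <- merge(patient, radiation, by = "bcr_patient_barcode", all.x = TRUE)
print(merged)
```

The patient with two radiation courses appears twice, and the patient without any keeps a single row with a missing dosage, which is why a merged table should not be read as one row per patient.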
The following are several examples of getting clinical data directly from the clinical XML files.
Patient clinical information:
library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT) # provides datatable()
library(dplyr) # provides %>% and select()
query <- GDCquery(project = "TCGA-COAD",
data.category = "Clinical",
file.type = "xml",
barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
datatable(clinical, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
bcr_patient_barcode | additional_studies | tumor_tissue_site | histological_type | other_dx | gender | vital_status | days_to_birth | days_to_last_known_alive | days_to_death | days_to_last_followup | race_list | tissue_source_site | patient_id | bcr_patient_uuid | history_of_neoadjuvant_treatment | informed_consent_verified | icd_o_3_site | icd_o_3_histology | icd_10 | tissue_prospective_collection_indicator | tissue_retrospective_collection_indicator | days_to_initial_pathologic_diagnosis | age_at_initial_pathologic_diagnosis | year_of_initial_pathologic_diagnosis | person_neoplasm_cancer_status | ethnicity | weight | height | day_of_form_completion | month_of_form_completion | year_of_form_completion | residual_tumor | anatomic_neoplasm_subdivision | primary_lymph_node_presentation_assessment | lymph_node_examined_count | number_of_lymphnodes_positive_by_he | number_of_lymphnodes_positive_by_ihc | preoperative_pretreatment_cea_level | non_nodal_tumor_deposits | circumferential_resection_margin | venous_invasion | lymphatic_invasion | perineural_invasion_present | microsatellite_instability | number_of_loci_tested | number_of_abnormal_loci | kras_gene_analysis_performed | kras_mutation_found | kras_mutation_codon | braf_gene_analysis_performed | braf_gene_analysis_result | synchronous_colon_cancer_present | history_of_colon_polyps | colon_polyps_present | loss_expression_of_mismatch_repair_proteins_by_ihc | loss_expression_of_mismatch_repair_proteins_by_ihc_results | number_of_first_degree_relatives_with_cancer_diagnosis | radiation_therapy | postoperative_rx_tx | primary_therapy_outcome_success | has_new_tumor_events_information | has_drugs_information | has_radiations_information | has_follow_ups_information | project | stage_event_system_version | stage_event_clinical_stage | stage_event_pathologic_stage | stage_event_tnm_categories | stage_event_psa | stage_event_gleason_grading | stage_event_ann_arbor | stage_event_serum_markers | stage_event_igcccg_stage | stage_event_masaoka_stage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-AA-3972 | Colon | Colon Adenocarcinoma | Yes | MALE | Alive | -26360 | 0 | AA | 3972 | 93cd5d07-e0d3-40f8-ae7f-beadb1efb7c1 | No | YES | C18.7 | 8140/3 | C18.7 | NO | YES | 0 | 72 | 2008 | WITH TUMOR | 18 | 5 | 2010 | R2 | Sigmoid Colon | YES | 12 | 3 | 5.15 | NO | NO | NO | NO | NO | NO | NO | 0 | YES | YES | NO | YES | TCGA-COAD | 6th | Stage IV | T3N1M1 | |||||||||||||||||||||||||||||
TCGA-AA-3972 | Colon | Colon Adenocarcinoma | Yes | MALE | Alive | -26360 | 0 | AA | 3972 | 93cd5d07-e0d3-40f8-ae7f-beadb1efb7c1 | No | YES | C18.7 | 8140/3 | C18.7 | NO | YES | 0 | 72 | 2008 | WITH TUMOR | 18 | 5 | 2010 | R2 | Sigmoid Colon | YES | 12 | 3 | 5.15 | NO | NO | NO | NO | NO | NO | NO | 0 | YES | YES | NO | YES | TCGA-COAD | 6th | Stage IV | T3N1M1 | |||||||||||||||||||||||||||||
TCGA-RU-A8FL | Colon | Colon Adenocarcinoma | No | MALE | Alive | -18975 | 921 | BLACK OR AFRICAN AMERICAN | RU | A8FL | 5F6ED48B-3B5A-4D20-8FBF-C9AA3FEEA186 | No | YES | C18.0 | 8140/3 | C18.0 | NO | YES | 0 | 51 | 2011 | WITH TUMOR | NOT HISPANIC OR LATINO | 125 | 187.96 | 24 | 1 | 2014 | R0 | Cecum | YES | 21 | 5 | 24.6 | NO | 65 | NO | YES | NO | NO | NO | NO | NO | NO | 0 | NO | YES | Partial Remission/Response | YES | YES | YES | YES | TCGA-COAD | 7th | Stage IIIB | T3N2aMX |
Drug clinical information:
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
datatable(clinical.drug, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
bcr_patient_barcode | tx_on_clinical_trial | regimen_number | bcr_drug_barcode | bcr_drug_uuid | total_dose | total_dose_units | prescribed_dose | prescribed_dose_units | number_cycles | days_to_drug_therapy_start | days_to_drug_therapy_end | therapy_types | drug_name | clinical_trail_drug_classification | regimen_indication | regimen_indication_notes | route_of_administrations | therapy_ongoing | measure_of_response | day_of_form_completion | month_of_form_completion | year_of_form_completion | project |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-AA-3972 | NO | TCGA-AA-3972-D37426 | 45D2A60B-C35E-4BAC-9769-B1D5C0153FED | 60 | 1246 | Chemotherapy | Capecitabine | NO | Clinical Progressive Disease | 3 | 12 | 2012 | TCGA-COAD | ||||||||||
TCGA-AA-3972 | NO | TCGA-AA-3972-D37426 | 45D2A60B-C35E-4BAC-9769-B1D5C0153FED | 60 | 1246 | Chemotherapy | Capecitabine | NO | Clinical Progressive Disease | 3 | 12 | 2012 | TCGA-COAD |
Only part of the information is shown.
Radiation clinical information:
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
datatable(clinical.radiation, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
bcr_patient_barcode | bcr_radiation_barcode | bcr_radiation_uuid | days_to_radiation_therapy_start | days_to_radiation_therapy_end | radiation_type | radiation_type_notes | radiation_dosage | units | numfractions | anatomic_treatment_site | regimen_indication | regimen_indication_notes | radiation_treatment_ongoing | course_number | measure_of_response | day_of_form_completion | month_of_form_completion | year_of_form_completion | project |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-RU-A8FL | TCGA-RU-A8FL-R68275 | 07403C45-7D5F-4DAB-AFD2-0AC913DD0AFB | 788 | 788 | Internal | 568.1 | Gy | 5 | Distant Recurrence | NO | Stable Disease | 14 | 11 | 2014 | TCGA-COAD |
Administrative (admin) information:
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")
datatable(clinical.admin, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
bcr_patient_barcode | bcr | file_uuid | batch_number | project_code | disease_code | day_of_dcc_upload | month_of_dcc_upload | year_of_dcc_upload | patient_withdrawal | program | dbgap_registration_code | project |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-AA-3972 | Nationwide Children's Hospital | EA58E3D4-40E0-406F-A7A3-C7C282CFA203 | 41.78.0 | TCGA | COAD | 22 | 12 | 2016 | false | TCGA-COAD | ||
TCGA-AA-3972 | Nationwide Children's Hospital | EA58E3D4-40E0-406F-A7A3-C7C282CFA203 | 41.78.0 | TCGA | COAD | 22 | 12 | 2016 | false | TCGA-COAD |
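The MSI-Mono-Dinucleotide Assay tests: 1) a panel of four mononucleotide repeat loci (polyadenine tracts BAT25, BAT26, BAT40, and transforming growth factor receptor type II); and 2) three dinucleotide repeat loci (CA repeats in D2S123, D5S346, and D17S250). Two additional pentanucleotide loci, Penta D and Penta E, are included in the assay to evaluate sample identity. Multiplex fluorescent-labeled PCR and capillary electrophoresis are used to identify microsatellite instability (MSI) when a change in the number of microsatellite repeats is detected between the tumor and matched non-neoplastic tissue or mononuclear blood cells. Equivocal or failed markers are re-evaluated by singleplex PCR.
**Classification:** 1) microsatellite-stable (MSS); 2) low level MSI (MSI-L): fewer than 40% of the markers are altered; 3) high level MSI (MSI-H): more than 40% of the markers are altered.
Reference: TCGA wiki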
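The classification rule can be sketched as a small function. This is only an illustration and not code from TCGAbiolinks or the TCGA pipeline. The function name classify_msi is hypothetical, and two assumptions go beyond the text: zero altered markers is treated as MSS, and a fraction of exactly 40% returns NA because the rule leaves that boundary undefined.

```r
# Hypothetical helper: classify MSI status from marker counts.
# Assumptions (not stated in the text): zero altered markers means MSS,
# and exactly 40% altered returns NA because the rule does not cover it.
classify_msi <- function(n_altered, n_tested) {
  stopifnot(n_tested > 0, n_altered >= 0, n_altered <= n_tested)
  if (n_altered == 0) return("MSS")
  # Compare 10 * altered with 4 * tested to avoid floating point rounding
  if (10 * n_altered < 4 * n_tested) return("MSI-L")
  if (10 * n_altered > 4 * n_tested) return("MSI-H")
  NA_character_
}

classify_msi(0, 7)  # no markers altered
classify_msi(1, 7)  # 1 of 7 markers is below 40%
classify_msi(5, 7)  # 5 of 7 markers is above 40%
```

The seven markers used in the calls correspond to the four mononucleotide and three dinucleotide loci named in the assay description.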
Level 3 data is included in the BCR clinical data submissions and can be downloaded as follows:
library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-COAD",
data.category = "Other",
legacy = TRUE,
access = "open",
data.type = "Auxiliary test",
barcode = c("TCGA-AD-A5EJ","TCGA-DM-A0X9"))
GDCdownload(query)
msi_results <- GDCprepare_clinic(query, "msi")
datatable(msi_results, options = list(scrollX = TRUE, keys = TRUE))
bcr_aliquot_uuid | mononucleotide_and_dinucleotide_marker_panel_analysis_status | mononucleotide_marker_panel_analysis_status | bcr_patient_barcode | |
---|---|---|---|---|
1 | 6139cab7-959f-4c49-b9e7-cc25ea8e041c | MSI-H | TCGA-AD-A5EJ | |
2 | 511f645e-f107-4f58-9c6a-6f931e612bd6 | MSS | TCGA-DM-A0X9 |
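The clinical data types available in the legacy database include Tissue slide image, Pathology report, Clinical Supplement, and Clinical data. Each is queried in turn.
Getting Tissue slide image files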
# Tissue slide image files
library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-COAD",
data.category = "Clinical",
data.type = "Tissue slide image",
legacy = TRUE,
barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
query %>% getResults %>% datatable(options = list(scrollX = TRUE, keys = TRUE))
data_release | data_type | tags | file_name | submitter_id | file_id | file_size | cases | state_comment | id | md5sum | updated_datetime | data_format | access | platform | state | version | data_category | type | project | code | center_name | center_short_name | center_center_id | center_namespace | center_center_type | tissue.definition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
494 | Tissue slide image | image | TCGA-RU-A8FL-01A-01-TSA.3743238D-C8AE-49A3-B5C9-3B18C1B64964.svs | 530083be-6bdf-49e8-85dc-3c3ee5d5dcd5 | 186948553 | TCGA-RU-A8FL | 530083be-6bdf-49e8-85dc-3c3ee5d5dcd5 | 50b05bcaadce3553bb763a734f976316 | 2017-03-05T18:19:57.255497-06:00 | SVS | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR | |||||
975 | Tissue slide image | image | TCGA-AA-3972-01A-01-BS1.6c038157-503f-47dc-9445-72d7cdcae1a7.svs | 0d4a3c6c-0ab2-44d2-9b08-90f5ea84555f | 130569853 | TCGA-AA-3972 | 0d4a3c6c-0ab2-44d2-9b08-90f5ea84555f | 65d8fe397397d1fd633fd3279301310f | 2017-03-05T18:32:29.523130-06:00 | SVS | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR | |||||
980 | Tissue slide image | image | TCGA-AA-3972-01A-01-TS1.3492c5d3-141f-48f9-a61b-48a1f595792c.svs | 4966c8e3-37fd-4296-8a8a-216def9ec311 | 62750835 | TCGA-AA-3972 | 4966c8e3-37fd-4296-8a8a-216def9ec311 | 18d8cc99e47b1522264f0f00cd4f85ca | 2017-03-05T09:56:10.833727-06:00 | SVS | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR |
Getting Pathology report files
# Pathology report
query <- GDCquery(project = "TCGA-COAD",
data.category = "Clinical",
data.type = "Pathology report",
legacy = TRUE,
barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
query %>% getResults %>% datatable(options = list(scrollX = TRUE, keys = TRUE))
data_release | data_type | updated_datetime | file_name | submitter_id | file_id | file_size | cases | state_comment | id | md5sum | data_format | access | platform | state | version | data_category | type | project | code | center_name | center_short_name | center_center_id | center_namespace | center_center_type | tissue.definition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | Pathology report | 2017-03-05T18:42:42.189892-06:00 | TCGA-RU-A8FL.92DF9BB6-DB3F-40A7-AE1E-8272C46B1968.pdf | a4753077-2bd3-4301-8424-b7575c8ccd66 | 206913 | TCGA-RU-A8FL | a4753077-2bd3-4301-8424-b7575c8ccd66 | 1c0cd3716d7c879d41d973282926c489 | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR | ||||||
307 | Pathology report | 2017-03-05T16:30:32.274191-06:00 | TCGA-AA-3972.2562de97-b8b4-4547-9e7f-4c0fab6552b3.pdf | b77a41e9-cf0d-4b94-9576-09e91b6d8f61 | 8658 | TCGA-AA-3972 | b77a41e9-cf0d-4b94-9576-09e91b6d8f61 | 626f8c94641ea67b6bc4b96de0ad4fd0 | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR |
Getting Clinical Supplement files
# Clinical Supplement
query <- GDCquery(project = "TCGA-COAD",
data.category = "Clinical",
data.type = "Clinical Supplement",
legacy = TRUE,
barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
query %>% getResults %>% datatable(options = list(scrollX = TRUE, keys = TRUE))
data_release | data_type | updated_datetime | file_name | submitter_id | file_id | file_size | cases | state_comment | id | created_datetime | md5sum | data_format | access | state | version | data_category | type | project | tissue.definition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
275 | Clinical Supplement | 2017-03-04T16:40:34.621217-06:00 | nationwidechildrens.org_clinical.TCGA-RU-A8FL.xml | 3c5a4713-6855-42d4-aed6-3129bfe80c58 | 62955 | TCGA-RU-A8FL | 3c5a4713-6855-42d4-aed6-3129bfe80c58 | 2016-05-04T08:54:03.280502-05:00 | 5f997f26c363f1d0727aaea5760b7c92 | BCR XML | open | live | Clinical | clinical_supplement | TCGA-COAD | |||||
427 | Clinical Supplement | 2017-03-04T16:40:34.621217-06:00 | nationwidechildrens.org_clinical.TCGA-AA-3972.xml | c76af5df-aab0-47a0-a543-77668be3f0c7 | 66578 | TCGA-AA-3972 | c76af5df-aab0-47a0-a543-77668be3f0c7 | 2016-05-04T08:54:13.372056-05:00 | 13edc34d65d52aa25b74dd9d769e1feb | BCR XML | open | live | Clinical | clinical_supplement | TCGA-COAD |
Getting Clinical data files
# Clinical data
query <- GDCquery(project = "TCGA-COAD",
data.category = "Clinical",
data.type = "Clinical data",
legacy = TRUE,
file.type = "txt")
query %>% getResults %>% select(-matches("cases"))%>% datatable(options = list(scrollX = TRUE, keys = TRUE))
data_release | data_type | tags | file_name | submitter_id | file_id | file_size | state_comment | id | created_datetime | md5sum | updated_datetime | data_format | access | platform | state | version | data_category | type | project | code | center_name | center_short_name | center_center_id | center_namespace | center_center_type | tissue.definition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23 | Clinical data | drug | nationwidechildrens.org_clinical_drug_coad.txt | 0415ffe2-a98d-40b9-ac60-6753fce56c7b | 237163 | 0415ffe2-a98d-40b9-ac60-6753fce56c7b | 2016-04-20T16:20:56.238694-05:00 | dac877287ee1938fc69659127c02a151 | 2017-03-04T20:47:52.066809-06:00 | Biotab | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR | |||||
25 | Clinical data | patient | nationwidechildrens.org_clinical_patient_coad.txt | b58b5947-d2b6-4cc7-9eff-cc0083d5bf4b | 396060 | b58b5947-d2b6-4cc7-9eff-cc0083d5bf4b | 2016-04-20T16:20:56.238694-05:00 | 1f458eee1a95f3f45725e9881dcb1bf3 | 2017-03-05T11:00:22.460578-06:00 | Biotab | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR | |||||
29 | Clinical data | radiation | nationwidechildrens.org_clinical_radiation_coad.txt | 36b48c2d-f45d-4995-bc2d-931f5c190919 | 6264 | 36b48c2d-f45d-4995-bc2d-931f5c190919 | 2016-04-20T16:20:56.238694-05:00 | f8f2c8a4a8131fa4fb600ead81565c39 | 2017-03-05T11:47:51.010995-06:00 | Biotab | open | Clinical | live | Clinical | file | TCGA-COAD | 36 | Nationwide Children's Hospital BCR | NCH | a6b3bcf1-9ca6-56e9-8f04-0e3a63e60a6a | nationwidechildrens.org | BCR |
Downloading and parsing the Clinical data (Biotab) files
GDCdownload(query)
clinical.biotab <- GDCprepare(query)
names(clinical.biotab)
## [1] "clinical_radiation_coad" "clinical_nte_coad"
## [3] "clinical_patient_coad" "clinical_drug_coad"
## [5] "clinical_follow_up_v1.0_nte_coad" "clinical_omf_v4.0_coad"
## [7] "clinical_follow_up_v1.0_coad"
datatable(clinical.biotab$clinical_radiation_coad, options = list(scrollX = TRUE, keys = TRUE))
bcr_patient_uuid | bcr_patient_barcode | bcr_radiation_barcode | bcr_radiation_uuid | form_completion_date | radiation_therapy_type | radiation_therapy_site | radiation_total_dose | radiation_adjuvant_units | radiation_adjuvant_fractions_total | radiation_therapy_started_days_to | radiation_therapy_ongoing_indicator | radiation_therapy_ended_days_to | treatment_best_response | course_number | radiation_type_other | therapy_regimen | therapy_regimen_other | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | bcr_patient_uuid | bcr_patient_barcode | bcr_radiation_barcode | bcr_radiation_uuid | form_completion_date | radiation_type | anatomic_treatment_site | radiation_dosage | units | numfractions | days_to_radiation_therapy_start | radiation_treatment_ongoing | days_to_radiation_therapy_end | measure_of_response | course_number | radiation_type_notes | regimen_indication | regimen_indication_notes |
2 | CDE_ID: | CDE_ID:2673794 | CDE_ID: | CDE_ID: | CDE_ID: | CDE_ID:2842944 | CDE_ID:2793522 | CDE_ID:2721441 | CDE_ID:61446 | CDE_ID:61465 | CDE_ID:3008313 | CDE_ID:2842745 | CDE_ID:3008333 | CDE_ID:2857291 | CDE_ID:2732184 | CDE_ID:2195477 | CDE_ID:2793511 | CDE_ID:2793516 |
3 | e6ec5a68-7555-4f26-bd7e-9cdb4c5f7004 | TCGA-AA-3549 | TCGA-AA-3549-R38338 | B72A855F-225F-4537-A74F-8485ABDBA0D0 | 2012-12-13 | External | Distant Recurrence | 9 | Gy | [Not Available] | 1126 | NO | 1126 | Radiographic Progressive Disease | [Not Available] | [Not Applicable] | [Not Available] | [Not Available] |
4 | bce3ce45-4fb3-4d8e-9ec7-d24427c2ba4d | TCGA-AA-3692 | TCGA-AA-3692-R38345 | 2054309D-1EBC-4311-BF8A-621F6447F385 | 2012-12-13 | External | Distant Recurrence | 39 | Gy | [Not Available] | 31 | NO | 426 | Radiographic Progressive Disease | [Not Available] | [Not Applicable] | [Not Available] | [Not Available] |
5 | bce3ce45-4fb3-4d8e-9ec7-d24427c2ba4d | TCGA-AA-3692 | TCGA-AA-3692-R38346 | 1081C34F-DA75-4856-966C-8F9B10E784AA | 2012-12-13 | External | Distant Recurrence | 38 | Gy | [Not Available] | 365 | NO | 761 | Radiographic Progressive Disease | [Not Available] | [Not Applicable] | [Not Available] | [Not Available] |
Only part of the data is shown.
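In the output above, the first two data rows of the Biotab table are not patient records: row 1 repeats alternative column names and row 2 holds CDE_ID entries. A minimal sketch for dropping them before analysis follows. It uses a toy data frame in place of clinical.biotab$clinical_radiation_coad, the helper name drop_biotab_meta_rows is hypothetical, and the assumption that only these two kinds of rows are metadata rests solely on the output shown here.

```r
# Hypothetical helper: remove the alternate-name row and the CDE_ID row
drop_biotab_meta_rows <- function(biotab) {
  is_meta <- biotab$bcr_patient_barcode == "bcr_patient_barcode" |
    grepl("^CDE_ID:", biotab$bcr_patient_barcode)
  biotab[!is_meta, , drop = FALSE]
}

# Toy table mimicking the first rows shown above
toy <- data.frame(
  bcr_patient_barcode = c("bcr_patient_barcode", "CDE_ID:2673794",
                          "TCGA-AA-3549", "TCGA-AA-3692"),
  radiation_total_dose = c("radiation_dosage", "CDE_ID:2721441", "9", "39"),
  stringsAsFactors = FALSE
)
clean <- drop_biotab_meta_rows(toy)
# Numeric columns are read as text because of the metadata rows
clean$radiation_total_dose <- as.numeric(clean$radiation_total_dose)
print(clean)
```

Values such as [Not Available] in the real files would become NA with a warning under as.numeric, so they may need explicit handling first.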
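Some inconsistencies have been found in the Clinical indexed data and are being investigated by the GDC team. Two of them are illustrated by the examples that follow: the vital status field is not correctly updated, and the tumor grade field is not populated.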
# Get XML files and parse them
clin.query <- GDCquery(project = "TCGA-READ", data.category = "Clinical", file.type = "xml", barcode = "TCGA-F5-6702")
GDCdownload(clin.query)
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
clinical.patient.followup <- GDCprepare_clinic(clin.query, clinical.info = "follow_up")
# Get indexed data
clinical.index <- GDCquery_clinic("TCGA-READ")
dplyr::select(clinical.patient,vital_status,days_to_death,days_to_last_followup) %>% datatable
vital_status | days_to_death | days_to_last_followup | |
---|---|---|---|
1 | Alive | 66 |
dplyr::select(clinical.patient.followup, vital_status,days_to_death,days_to_last_followup) %>% datatable
vital_status | days_to_death | days_to_last_followup | |
---|---|---|---|
1 | Dead | 869 | |
2 | Alive | 452 |
# Vital status should be the same in the follow up table
dplyr::filter(clinical.index,submitter_id == "TCGA-F5-6702") %>% dplyr::select(vital_status,days_to_death,days_to_last_follow_up) %>% datatable
vital_status | days_to_death | days_to_last_follow_up | |
---|---|---|---|
1 | alive | 869 | 452 |
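The mismatch can also be flagged programmatically. The sketch below re-creates the values shown in the tables above as toy data frames and marks a record as suspicious when a time to death is recorded but the vital status is not dead. Reading the collapsed empty cells of the follow-up table as missing values is an assumption, and the helper name is_inconsistent is hypothetical.

```r
# Follow-up records for TCGA-F5-6702 as displayed above (empty cells read as NA)
followup <- data.frame(
  vital_status = c("Dead", "Alive"),
  days_to_death = c(869, NA),
  days_to_last_followup = c(NA, 452),
  stringsAsFactors = FALSE
)
# Indexed record as displayed above
indexed <- data.frame(
  vital_status = "alive",
  days_to_death = 869,
  days_to_last_follow_up = 452,
  stringsAsFactors = FALSE
)

# Hypothetical check: a death time is present but the status is not "dead"
is_inconsistent <- function(vital_status, days_to_death) {
  !is.na(days_to_death) & tolower(vital_status) != "dead"
}

is_inconsistent(indexed$vital_status, indexed$days_to_death)
is_inconsistent(followup$vital_status, followup$days_to_death)
```

Applied to a whole project, the same check could be run on the vital_status and days_to_death columns returned by GDCquery_clinic.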
# Get XML files and parse them
recurrent.samples <- GDCquery(project = "TCGA-LIHC",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
sample.type = "Recurrent Solid Tumor")$results[[1]] %>% select(cases)
recurrent.patients <- unique(substr(recurrent.samples$cases,1,12))
clin.query <- GDCquery(project = "TCGA-LIHC", data.category = "Clinical", file.type = "xml", barcode = recurrent.patients)
GDCdownload(clin.query)
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
# Get indexed data
GDCquery_clinic("TCGA-LIHC") %>% dplyr::filter(submitter_id %in% recurrent.patients) %>%
dplyr::select(progression_or_recurrence,days_to_recurrence,tumor_grade) %>% datatable
progression_or_recurrence | days_to_recurrence | tumor_grade | |
---|---|---|---|
1 | not reported | not reported | |
2 | not reported | not reported |
# XML data
clinical.patient %>% dplyr::select(bcr_patient_barcode,neoplasm_histologic_grade) %>% datatable
bcr_patient_barcode | neoplasm_histologic_grade | |
---|---|---|
1 | TCGA-DD-AACA | G3 |
2 | TCGA-ZS-A9CF | G2 |
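Some functions for working with clinical data are also provided.
For example, the function TCGAquery_SampleTypes filters barcodes by the sample types given in the typesample argument.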
Argument | Description | |
---|---|---|
barcode | is a list of samples as TCGA barcodes | |
typesample | a character vector indicating tissue type to query. Example: | |
TP | PRIMARY SOLID TUMOR | |
TR | RECURRENT SOLID TUMOR | |
TB | Primary Blood Derived Cancer-Peripheral Blood | |
TRBM | Recurrent Blood Derived Cancer-Bone Marrow | |
TAP | Additional-New Primary | |
TM | Metastatic | |
TAM | Additional Metastatic | |
THOC | Human Tumor Original Cells | |
TBM | Primary Blood Derived Cancer-Bone Marrow | |
NB | Blood Derived Normal | |
NT | Solid Tissue Normal | |
NBC | Buccal Cell Normal | |
NEBV | EBV Immortalized Normal | |
NBM | Bone Marrow Normal |
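The function TCGAquery_MatchedCoupledSampleTypes filters the samples down to patients that have every sample type listed in typesample. For example, if TP and TR are passed as typesample, the function returns a patient's barcodes only if that patient has both types. So if a patient has TP but no TR, no barcode is returned; if the patient has both TP and TR, both barcodes are returned.
Example: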
bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",
"TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
"TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
"TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
"TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
"TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
"TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
"TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
"TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
"TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
"TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
"TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
"TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
"TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
"TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
"TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
"TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")
S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")
# Retrieve multiple tissue types NOT FROM THE SAME PATIENTS
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))
# Retrieve multiple tissue types FROM THE SAME PATIENTS
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))
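To make the filtering logic concrete, here is a base R sketch of the same idea. It is not the TCGAbiolinks implementation: the helper names filter_sample_types and filter_matched_types are hypothetical, and only four sample type codes are included. It relies on the TCGA barcode layout, where characters 1 to 12 identify the patient and characters 14 to 15 hold the sample type code.

```r
# Sample type codes at barcode positions 14-15 (subset of the TCGA code table)
type_codes <- c(TP = "01", TR = "02", NB = "10", NT = "11")

# Keep barcodes whose sample type is one of `types`
filter_sample_types <- function(barcodes, types) {
  barcodes[substr(barcodes, 14, 15) %in% type_codes[types]]
}

# Keep barcodes only for patients that have every type in `types`
filter_matched_types <- function(barcodes, types) {
  hits <- filter_sample_types(barcodes, types)
  patient <- substr(hits, 1, 12)
  code <- substr(hits, 14, 15)
  n_types <- tapply(code, patient, function(x) length(unique(x)))
  complete <- names(n_types)[n_types == length(types)]
  hits[patient %in% complete]
}

toy <- c("TCGA-G9-6336-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6378-02A-11R-1789-07", "TCGA-G9-6380-11A-11R-1789-07",
         "TCGA-G9-6380-01A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07")
filter_sample_types(toy, "TP")
filter_matched_types(toy, c("NT", "TP"))
```

In the toy vector, patients TCGA-G9-6336 and TCGA-G9-6380 each have both a tumor and a normal tissue sample, so the matched filter keeps their four barcodes and drops the rest.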
To get all the clinical information for the TCGA samples, you can use the following scripts:
# This code will get all clinical indexed data from TCGA
library(data.table)
library(dplyr)
library(regexPipes)
clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>%
regexPipes::grep("TCGA",value=T) %>%
sort %>%
plyr::alply(1,GDCquery_clinic, .progress = "text") %>%
rbindlist
readr::write_csv(clinical, "all_clin_indexed.csv") # file name passed positionally; newer readr renamed path to file
# This code will get all clinical XML data from TCGA
getclinical <- function(proj){
message(proj)
while(1){
result = tryCatch({
query <- GDCquery(project = proj, data.category = "Clinical",file.type = "xml")
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
for(i in c("admin","radiation","follow_up","drug","new_tumor_event")){
message(i)
aux <- GDCprepare_clinic(query, clinical.info = i)
if(is.null(aux) || nrow(aux) == 0) next
# add suffix manually if it already exists
replicated <- which(colnames(aux) != "bcr_patient_barcode" & colnames(aux) %in% colnames(clinical))
colnames(aux)[replicated] <- paste0(colnames(aux)[replicated],".",i)
clinical <- merge(clinical,aux,by = "bcr_patient_barcode", all = TRUE)
}
readr::write_csv(clinical, paste0(proj,"_clinical_from_XML.csv")) # Save the clinical data into a csv file
return(clinical)
}, error = function(e) {
message(paste0("Error clinical: ", proj))
})
}
}
clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>%
regexPipes::grep("TCGA",value=T) %>% sort %>%
plyr::alply(1,getclinical, .progress = "text") %>%
rbindlist(fill = TRUE) %>% setDF %>% unique # drop duplicated rows of the combined table
readr::write_csv(clinical, "all_clin_XML.csv")
# result: https://drive.google.com/open?id=0B0-8N2fjttG-WWxSVE5MSGpva1U
# Obs: this table has multiple lines for each patient, as the patient might have several followups, drug treatments,
# new tumor events etc...
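The getclinical function above retries inside while(1), so a project that keeps failing would loop forever. One possible alternative is a bounded retry, sketched here in base R. The helper with_retry, its arguments, and the toy task are hypothetical and are not part of TCGAbiolinks.

```r
# Hypothetical helper: call f() up to max_tries times, pausing between attempts
with_retry <- function(f, max_tries = 3, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed")
}

# Toy task that fails twice and then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
with_retry(flaky)
```

In the script above, the body of the tryCatch in getclinical could be wrapped in a function and passed to such a helper, so that a persistent download error stops with a message instead of retrying indefinitely.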
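Written by 石九流.
Licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as a repost or otherwise attributed, articles on this site are original works or translations by this site; please credit the source when reposting.
Original link: https://blog.computsystmed.com/archives/translation-tcgabiolinks-clinical-data
Last updated: 2019-05-25 17:18:18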