HODGKIN'S LYMPHOMA (HL) DICTIONARY MODELING
For a more detailed overview of our dictionary format, we encourage you to explore our D4CG Data Dictionary wiki.
For more information regarding our collaborators and executive committee, please visit our D4CG Data Commons page.
PUBLIC LICENSE
The dictionary and all decisions made in creating it are preliminary, adaptable, and open to modification in response to the requirements and input of the HL consortia members.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
MODELING PROCESS
The process of creating a data dictionary involves the following steps:
Step 1
D4CG Project Managers collect Case Report Forms (CRFs), REDCap forms, and any other useful clinical-reporting documentation (in Word or Excel format) from participating institutions, so that essential clinical data is documented comprehensively and accurately.
Step 2
The D4CG Data Standards Team compares the received items, examining the forms to identify similarities and differences between them. This comparison aims to achieve standardization and consensus on data elements (variables and values) across all participating institutions. We strongly encourage each group to reach a consensus on each term (e.g., using Sex rather than Gender). Our team then draws on well-established terminologies and ontologies, including the National Cancer Institute Thesaurus (NCIt), to obtain computable definitions that are broadly recognized and accepted within the research community. Adopting these widely accepted standards keeps data elements consistent and harmonized across research groups.
Step 3
After the initial draft of the data dictionary is prepared, we meet with representatives from each institution for a comprehensive review, also referred to as the "line-by-line" dictionary review. During these meetings, participants advise on which data elements are necessary, which can be removed, and which need modification. These discussions ensure that the data dictionary aligns with the needs and requirements of all members involved.
Step 4
After the review meetings, we facilitate a "balloting" process in which each representative reviews the data dictionary on their own time and provides feedback, comments, and questions. We then convene further meetings to discuss the points raised, so that all perspectives are considered and any necessary adjustments or clarifications are made. More rounds of "balloting" are conducted if needed. The result is a robust, standardized data dictionary that reflects the collective expertise and insights of our team.
Step 5
The next step is the "tiering" process, which, depending on each group's decision, may occur either before or after "balloting". During tiering, consortia members collectively determine the importance of each data element. For instance, a data element such as TUMOR_SITE may be ranked as Tier 1, signifying that it is critical to the data contribution, whereas an element like TUMOR_SIZE may be classified as Tier 3, indicating that it is supplementary but not crucial. For further explanation of the distinctions between the tiers, see below or visit our wiki page linked above.
DESCRIPTORS
Domain
Protocol - relating to the consortium and institutional data contributor, to the timing of the reported event, and to the patient's involvement in a clinical trial or study.
Demographics - relating to patient characteristics and medical history.
Testing - relating to various modes and types of testing.
Disease Attributes - relating to the description of the disease.
Treatment - relating to various modes and types of treatments administered to the patient.
Events - relating to adverse events and other long-term outcomes.
Table Guide
DD | Domain Declaration - The row indicates the domain of the next table in the spreadsheet.
TD | Table Declaration - The row is the beginning of a new table and includes the name of the table.
TG | Table Guidance - The row contains a short description of how the table should be implemented by contributors.
VD | Variable Declaration - The row describes a variable. The placeholder "_undefined_" is used to ensure that permissible values are not declared on the same row.
PD | Permissible Value Declaration - The row describes a permissible value.
DPD | Deprecated Permissible Value Declaration - The row describes a permissible value that was in the previous version of the data dictionary but is not to be used in the current version.
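As a rough illustration of how the row codes above could be consumed, the following Python sketch groups rows into tables, variables, and permissible values. The two-cell row layout (code, name), the function name group_rows, and the sample permissible values are assumptions made for illustration; they are not taken from the D4CG tooling or from the HL dictionary.

```python
from collections import defaultdict

# Declaration codes listed in the Table Guide.
ROW_CODES = {"DD", "TD", "TG", "VD", "PD", "DPD"}


def group_rows(rows):
    """Group (code, name) rows into {table: {variable: [permissible values]}}.

    Assumes each row is a (code, name) pair; real dictionary rows carry
    more columns, which this sketch ignores.
    """
    tables = defaultdict(dict)
    table = variable = None
    for code, name in rows:
        code = code.strip()
        if code not in ROW_CODES:
            raise ValueError(f"unknown row code: {code!r}")
        if code == "TD":
            table, variable = name, None
            tables[table]  # register the table even if it has no variables
        elif code == "VD":
            if table is None:
                raise ValueError("VD row before any TD row")
            variable = name
            tables[table][variable] = []
        elif code == "PD":
            if variable is None:
                raise ValueError("PD row before any VD row")
            tables[table][variable].append(name)
        # DD, TG, and DPD rows are skipped: they add no current values.
    return dict(tables)


rows = [
    ("DD", "Disease Attributes"),
    ("TD", "Tumor Assessment"),
    ("VD", "TUMOR_SITE"),
    ("PD", "Example Site A"),   # hypothetical permissible value
    ("DPD", "Example Site B"),  # hypothetical deprecated value, ignored
]
print(group_rows(rows))
```

Skipping DPD rows reflects the guide's statement that deprecated values are not to be used in the current version; a real loader would probably keep them for mapping purposes.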
Data Type
String - free-text, can be a single word or multiple words.
Enum - one of a set list of permissible values.
Integer - a whole number, typically reserved for ordinal variables (such as time period number) or the age of a patient.
Decimal - for variables that are not guaranteed to be whole values, such as lab results, doses, etc.
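The four data types above lend themselves to a simple per-cell check. The sketch below is one possible interpretation, assuming values arrive as text cells; the function name validate and the sample values are hypothetical and not part of the dictionary.

```python
from decimal import Decimal, InvalidOperation


def validate(value, data_type, permissible=()):
    """Return True if `value` fits the declared data type.

    `permissible` is only consulted for Enum variables.
    """
    if data_type == "String":
        return isinstance(value, str)
    if data_type == "Enum":
        return value in permissible
    if data_type == "Integer":
        try:
            int(str(value))  # str() so that a float like 4.2 is rejected
        except ValueError:
            return False
        return True
    if data_type == "Decimal":
        try:
            return Decimal(str(value)).is_finite()  # rejects NaN/Infinity
        except InvalidOperation:
            return False
    raise ValueError(f"unknown data type: {data_type!r}")


print(validate("42", "Integer"))   # True
print(validate("4.2", "Integer"))  # False
print(validate("4.2", "Decimal"))  # True
```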
Tiers
Tier 1 - contributors must include, regardless of the resource cost.
Tier 2 - contributors should prioritize inclusion if resources are available.
Tier 3 - contributors should not prioritize inclusion, but may include if resources are available.
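One practical use of the tiers is checking a contributed record for missing Tier 1 variables. The sketch below assumes a record is a plain mapping of variable names to values and reuses the TUMOR_SITE (Tier 1) and TUMOR_SIZE (Tier 3) example from the modeling process above; the function name missing_required is hypothetical.

```python
def missing_required(record, tiers):
    """Return the Tier 1 variables that are absent or empty in `record`."""
    return sorted(
        name
        for name, tier in tiers.items()
        if tier == 1 and record.get(name) in (None, "")
    )


tiers = {"TUMOR_SITE": 1, "TUMOR_SIZE": 3}
print(missing_required({"TUMOR_SIZE": "2.5"}, tiers))  # ['TUMOR_SITE']
```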
Mapping
SSSOM Predicates:
skos:exactMatch - the target (current version) is the same as the source (previous version).
skos:narrowMatch - the target (current version) is a narrower concept than the source (previous version).
skos:broadMatch - the target (current version) is a broader concept than the source (previous version).
Predicate Format:
predicate [disease_group].[previous_data_dictionary_version].[table_name].[variable_name].[permissible_value]
Predicate Examples:
skos:exactMatch [EWS].[v2.1].[Tumor Assessment].[TUMOR_CLASSIFICATION]
skos:narrowMatch [HL].[v1.0].[Disease Characteristics].[KARNOFSKY]
skos:broadMatch [CNS].[v1.0].[Radiation Therapy].[ENERGY_TYPE]
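A mapping string in the format above can be split into its parts with a small parser. The sketch below is an illustration only: the names Mapping and parse_mapping are hypothetical, and treating the trailing [permissible_value] segment as optional is an assumption based on the examples, which stop at the variable name.

```python
import re
from typing import NamedTuple, Optional

PREDICATES = {"skos:exactMatch", "skos:narrowMatch", "skos:broadMatch"}
SEGMENT = re.compile(r"\[([^\]]*)\]")


class Mapping(NamedTuple):
    predicate: str
    disease_group: str
    version: str
    table_name: str
    variable_name: str
    permissible_value: Optional[str]  # None for variable-level mappings


def parse_mapping(text):
    """Parse 'predicate [group].[version].[table].[variable](.[value])'."""
    predicate, _, path = text.strip().partition(" ")
    if predicate not in PREDICATES:
        raise ValueError(f"unknown predicate: {predicate!r}")
    parts = SEGMENT.findall(path)
    # Rebuild the path to reject stray text between or around the brackets.
    if ".".join(f"[{p}]" for p in parts) != path.strip():
        raise ValueError(f"malformed path: {path!r}")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 segments, got {len(parts)}")
    if len(parts) == 4:
        parts.append(None)
    return Mapping(predicate, *parts)


m = parse_mapping("skos:narrowMatch [HL].[v1.0].[Disease Characteristics].[KARNOFSKY]")
print(m.table_name, m.variable_name)  # Disease Characteristics KARNOFSKY
```

Matching on brackets rather than splitting on "." keeps version strings such as v2.1 intact.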
LONGITUDINAL REPORTING
AGE_AT:
For HIPAA compliance, the PCDC does not use dates. Dates in the source data must be transformed into the age of the patient (in days) at the time of the observation. These AGE_AT variables can be found throughout the PCDC data model.
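The date-to-age transformation described above amounts to a date difference. The sketch below assumes age is counted in whole days from the date of birth; the function name age_in_days and the sample dates are made up for illustration.

```python
from datetime import date


def age_in_days(date_of_birth, event_date):
    """Return the patient's age in days at the time of the event."""
    if event_date < date_of_birth:
        raise ValueError("event date precedes date of birth")
    return (event_date - date_of_birth).days


# Hypothetical patient born 2010-01-01 with an observation on 2011-01-01.
print(age_in_days(date(2010, 1, 1), date(2011, 1, 1)))  # 365
```

Only the resulting integer would be contributed; the source dates themselves stay with the institution.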
TIME_PERIOD:
D4CG uses time periods to complement the patient's age (expressed in days). Each time period is associated with specific reference IDs, a type, an ordinal number (relevant for multiple occurrences of a single type), the year in which the time period began, and, where applicable, the patient's age in days at the start and end of the period. Please visit our wiki page linked above for more details and examples.
GENOMIC REPORTING
Harmonizing clinical genomics data is a common challenge because data granularity varies across institutions. To accommodate this variation, we have introduced an "ALTERATION" variable that holds the non-standardized "name" of each alteration. This allows institutions to use the terminology and conventions common within their respective disease groups. Please visit our wiki page linked above for more details and examples.
TUMOR REPORTING
The tumor assessment table describes the various tumor types linked to the cancer. Each tumor is accompanied by specific reported attributes (TUMOR_SUBMITTER_ID & PRIMARY_TUMOR_SUBMITTED_ID), as outlined in the data dictionary of each disease group. Please visit our wiki page linked above for more details and examples.
ADDITIONAL DETAILS
The initial dictionary draft was collaboratively created using the following data descriptors:
Children's Oncology Group (COG)
Nodular Lymphocyte Predominant B-Cell Lymphoma (NLPHL)
St Jude Children's Research Hospital (SJCRH)
Dictionary workshops were conducted in person at the following locations:
The Chicago Firehouse Restaurant during ASCO - 6/3/2018
San Diego Water Grill during ASH - 12/2/2018
Hash House a Go Go, Orlando during ASH - 12/6/2019