CONSolidated REcommendations for sharing Individual participant Data (CONSIDER)

http://w3id.org/CONSIDER

1 Format

1.1 Share person table in CDISC or OMOP format

A listing of persons participating in the study is always present. The person table typically includes demographic data (year of birth, gender at birth). The CDISC SDTM person table is the DM.xpt domain. The specification is at https://www.cdisc.org/standards/foundational/sdtm (requires creating a free login). An example of a DM table is available at https://github.com/lhncbc/r-snippets-bmi/blob/master/cdisc/inst/extdata/cdisc01/csv/dm.csv and the OMOP person table specification can be found at https://github.com/OHDSI/CommonDataModel/wiki/PERSON

Positive Example: For trial http://clinicaltrials.gov/ct2/show/NCT01612169, the IPD posted on NIDA Data Share (at https://datashare.nida.nih.gov/study/nidactn0049) includes a file dem.csv that provides one row per person with person_id and basic demographic data (following the CDISC standard).

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if person table is provided, 0 if not provided

1.2 Group data and data elements into relevant data domains (e.g., medication history, laboratory results history, medical procedure history)

Consider integration of research and routine healthcare data (e.g., from Electronic Health Record system or healthcare billing data). The emergence of several common data models (CDMs) for healthcare data shows that there is a common way to organize clinical data. For example, rather than considering numerical value as test result and unit in which the value is expressed as two separate data elements, a review of several CDMs shows that these are typically grouped into a single data row.
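
The grouping of a value with its unit can be sketched as follows. This is a minimal illustration, not a rendering of any particular CDM: the record layout, the `_value`/`_unit` naming pattern, and the column names are all hypothetical.

```python
def wide_to_rows(record):
    """Turn one wide record into one row per test, keeping value and unit together."""
    rows = []
    for key, value in record.items():
        if key.endswith("_value"):
            test = key[: -len("_value")]
            rows.append(
                {
                    "person_id": record["person_id"],
                    "test": test,
                    "value": value,
                    # The unit travels with the value instead of being a separate element.
                    "unit": record.get(test + "_unit"),
                }
            )
    return rows


# Hypothetical wide record with value and unit stored as separate data elements.
wide = {
    "person_id": "P001",
    "glucose_value": 5.4,
    "glucose_unit": "mmol/L",
    "hemoglobin_value": 13.9,
    "hemoglobin_unit": "g/dL",
}
lab_rows = wide_to_rows(wide)
```

Each resulting row belongs to a single laboratory-results domain and carries the value and its unit side by side.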

Positive Example:

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if data elements are grouped into relevant domains, 0 if data elements are not grouped

1.3 Follow a convention when using relative time.

The recommendation applies to a context where absolute dates in raw data were replaced with relative dates. The convention is to start counting at day 1, not 0. Assume that the index event (e.g., the day when the patient consented to the trial or the day of the first visit) has been specified at datetime granularity. Refer to the first day as relative day 1. Do not use day 0 as a relative date. For example, if the index event is the signing of informed consent and it was signed at 10:31am on March 10, 2011, the index date-time is midnight at the start of March 10, 2011. An event on the next day (March 11) at 11:15am would then have a relative time of Day 2, 11:15am. (For an analogous discussion in astronomical data, see https://en.wikipedia.org/wiki/Sol_(day_on_Mars)#Usage_in_Mars_landers)
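
The day-1 convention can be expressed as a short calculation. The sketch below is one possible reading of the rule, and the function name is hypothetical. The text does not say how events before the index date should be numbered, so the sketch rejects them.

```python
from datetime import datetime, time


def relative_time(index_event: datetime, event: datetime):
    """Return (relative_day, time_of_day), counting the index date as Day 1."""
    offset_days = (event.date() - index_event.date()).days
    if offset_days < 0:
        # Assumption: numbering of pre-index events is not defined in the
        # recommendation, so this sketch refuses them instead of inventing a rule.
        raise ValueError("event precedes the index date")
    return offset_days + 1, event.time()


consent = datetime(2011, 3, 10, 10, 31)   # informed consent signed (index event)
next_day = datetime(2011, 3, 11, 11, 15)  # event on the following day
```

With these inputs, `relative_time(consent, next_day)` yields Day 2 at 11:15, and any event on the consent date itself falls on Day 1.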

Positive Example: NCT00262522 uses relative time: day 1 for each patient is the date of their first visit, and all events are referenced as days relative to day 1.

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if a convention for relative time is used, 0 if not

1.4 Utilize previously defined Common Data Elements and reference them by their identifiers

If you considered formally defined research Common Data Elements (CDEs) at study design (more common for studies initiated after 2015), provide a spreadsheet file that lists all CDEs utilized by your study. Include unique CDE identifiers (e.g., PhenX VariableID). This recommendation promotes two aspects. The first is to adopt established CDEs. The second is that, if the study did adopt CDEs, it must clearly indicate which data elements (DEs) are CDEs and which are not (such DEs would be unique to the study).
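
One possible shape for such a listing is sketched below. The column names and the study-specific element are hypothetical; the LOINC code 29463-7 for body weight is the one cited in the positive example of this recommendation.

```python
import csv
import io

# Hypothetical column layout for a CDE listing.
FIELDS = ["data_element", "is_cde", "cde_source", "cde_id"]

EXAMPLE = [
    {"data_element": "body_weight", "is_cde": "yes", "cde_source": "LOINC", "cde_id": "29463-7"},
    # A study-specific element: flagged as not a CDE, so no identifier is given.
    {"data_element": "site_visit_note", "is_cde": "no", "cde_source": "", "cde_id": ""},
]


def cde_listing_csv(elements):
    """Serialize the listing as CSV text, one row per data element."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(elements)
    return buffer.getvalue()
```

The `is_cde` column makes explicit which elements are CDEs and which are unique to the study.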

Positive Example: Data elements in the AllofUs study are linked to LOINC codes. Elements are listed here and an example of mapping can be seen for this weight (http://athena.ohdsi.org/search-terms/terms/903121) data element (mapped to LOINC CDE 29463-7 (https://loinc.org/29463-7/)).

Challenging Example: The AllofUs study provides the source for each data element (http://athena.ohdsi.org/search-terms/terms?vocabulary=PPI&conceptClass=Clinical+Observation&page=1&pageSize=50&query=). Identifiers for individual CDEs are not provided, although in many cases the instrument does not have a formal identifier that could be listed.

Score: For studies initiated after Jan 1, 2015, 1 if common data elements are used, and 0 if no defined common data elements are used

1.5 Use formats that can be natively loaded (without highly specialized add-ons) into multiple statistical platforms

The preferred file types are comma- or tab-separated values files (e.g., .CSV) rather than SAS XPT or XLS/XLSX files, which require add-ons or conversions to be read in and used in different statistical platforms (e.g., SAS, STATA, R).
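
As an illustration of what "natively loaded" means in practice, the sketch below reads CSV text using only the Python standard library, with no add-on package or conversion step. The sample content is hypothetical.

```python
import csv
import io

# Hypothetical two-row extract standing in for a shared CSV file.
SAMPLE = "person_id,year_of_birth\nP001,1970\nP002,1985\n"


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


rows = load_rows(SAMPLE)
```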

Positive Example: NCT01751646 provides IPD in CSV files that are easily usable in a multitude of statistical platforms. Trial https://clinicaltrials.gov/ct2/show/study/NCT00005159 provides 3 formats.

Challenging Example: NCT00951249 provides IPD as SAS XPT files, which require processing and conversion before they can be viewed and analyzed in any non-SAS platform.

Score: Score 1 or 0. No Partial score. 1 if a format that can be natively loaded into multiple platforms is used, 0 if a format is used that cannot be natively loaded into multiple platforms and requires conversion, add-ons, or specialty software.

2 Data Sharing

2.1 Register your study at ClinicalTrials.gov registry

Trial titled RC-HIVMAB060-00-AB (VRC01) in People With Chronic HIV Infection Undergoing Analytical Treatment Interruption is registered at ClinicalTrials.gov under NCT02471326 https://clinicaltrials.gov/ct2/show/study/NCT02471326. Registration allows retrieval of study metadata that is unified across various platforms.

Positive Example:

Challenging Example: dbGaP repository contains datasets that originate from a clinical trial but the trial reference at ClinicalTrials.gov is not provided.

Score: Score 1 or 0. No partial score. 1 if study is registered, 0 if study is not registered.

2.2 Do not limit study metadata to the legally required elements. Also populate optional elements (such as data sharing metadata)

Some study metadata collected by the ClinicalTrials.gov registry are mandated by regulation whereas others are added as optional elements. The FAIR principles are reflected by many of those optional fields.

The following ClinicalTrials.gov optional fields are important:

Registration data: (1) Study Protocol (a type of Supporting Information; see https://prsinfo.clinicaltrials.gov/definitions.html#AvailableStudyData); (2) Statistical Analysis Plan; (3) Informed Consent Form; (4) Data Dictionary (use the Supporting Information type of Other and specify “Data dictionary”); (5) Result reference (field IsResultsRef=Y in combination with field Citation; it specifies the article PMID or full bibliographic data for a publication that describes the results of the study). Such a linked publication does not depend on proper extraction of the NCT identifier from the abstract: automatic detection of result publications may miss some articles, whereas ResultRef is manually specified by the study record manager (see https://prsinfo.clinicaltrials.gov/definitions.html#RefCitations).

Results Data: (1) Statistical Analyses: Provide entries for the results of tests of statistical significance for primary and secondary outcome measures. Report results using the scenario of ‘Statistical Test of Hypothesis’ or ‘Method of Estimation’. Use the field ‘Other Statistical Analysis’ to provide a description and the results of any other scientifically appropriate tests of statistical significance if the statistical analysis cannot be submitted using the ‘Statistical Test of Hypothesis’ or ‘Method of Estimation’ options (see https://prsinfo.clinicaltrials.gov/results_definitions.html#Result_Outcome_Analysis). (2) Specify ‘Source Vocabulary Name’ when reporting adverse events. For data re-use, it is important to describe what terminology you used (if any) to describe adverse events (e.g., MedDRA or SNOMED CT).

Positive Example:

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if optional fields are included in study registration data, 0 if only required fields are completed.

2.3 Fully populate data_sharing_plan text field on ClinicalTrials.gov (if sharing data)

On ClinicalTrials.gov, if you answer Yes to share_ipd_data, do not leave the data_sharing_plan text empty
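
The rule can be checked mechanically before a record is submitted. The sketch below assumes the record is available as a plain Python dict keyed by the two field names used above. It does not call any ClinicalTrials.gov interface, and the function name is hypothetical.

```python
def sharing_plan_problem(record):
    """Return a message if IPD sharing is 'Yes' but the plan text is empty, else None."""
    shares = str(record.get("share_ipd_data", "")).strip().lower() == "yes"
    # Treat a missing field, None, and whitespace-only text as empty.
    plan = (record.get("data_sharing_plan") or "").strip()
    if shares and not plan:
        return "share_ipd_data is Yes but data_sharing_plan is empty"
    return None
```

A record that answers Yes and carries only whitespace in the plan text is flagged; a record that answers No is not checked further.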

Positive Example: NCT03518060 (https://clinicaltrials.gov/ct2/show/NCT03518060?term=NCT03518060&draw=2&rank=1) provides a link on its ClinicalTrials.gov registration page to where study IPD can be requested.

Challenging Example: Out of 69 reviewed IPD sharing plans of HIV trials on ClinicalTrials.gov, 2 (NCT02756208 [https://clinicaltrials.gov/ct2/show/study/NCT02756208] and NCT03275701 [https://clinicaltrials.gov/ct2/show/study/NCT03275701]) left the plan description blank.

Score: Score 1 or 0. No partial score. 1 if a full data sharing plan with details on how to find IPD is included, 0 if no data sharing plan or if an incomplete data sharing plan is included

2.5 Provide basic summary results using results registry component of ClinicalTrials.gov

If the clinical trial registry allows it (such as ClinicalTrials.gov), upload basic summary results to the registry at the completion of the study or when first available.

Positive Example: NCT00962780 (https://clinicaltrials.gov/ct2/show/results/NCT00962780) has basic summary results of the study posted on ClinicalTrials.gov using the results registry component

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if results are included, 0 if no results are uploaded. If a registry does not support posting results, score is 0.

2.7 Provide de-identified Individual Participant Data

Several sponsors require sharing of IPD data to allow for external validation of results and to facilitate secondary research. This recommendation is linked to Data Sharing: Registry: Link IPD

Positive Example: The ClinicalTrials.gov record for NCT00933595 provides a link to request IPD through the data sharing platform https://biolincc.nhlbi.nih.gov/studies/lung_hiv/

Challenging Example: For studies registered on ClinicalTrials.gov during 2019, 68.2% of the studies that answered whether they plan to share IPD answered ‘No’.

Score: Score 1 or 0. No Partial score. 1 if study shared IPD, 0 if no IPD were shared.

3 Study Design

3.1 Adopt previously defined applicable Common Data Elements

This recommendation assumes there are significant resources (financial or staff) that can be used for this goal. Common Data Elements initiatives in various domains (e.g., PhenX, PROMIS) aim to standardize data collection.

Positive Example: AllofUs study adopted LOINC terminology to document body measurements.

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if CDEs are used, 0 if no CDEs are used.

4 Case Report Forms

4.1 Share all Case Report Forms used in a study

It is extremely common to organize data collection into forms. Interpretation of data is easier if full context for each collected DE is provided. This is also referred to as sharing annotated CRFs.

Positive Example: UK Biobank provides a screenshot of the form (see the UKBB Showcase for data element http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=22501). This allows researchers to understand why the value -1: never went to school appears in the data, since the form shows a check box near the question that allows a research participant to indicate that they never went to school. ClinicalTrials.gov allows upload of resource type: annotated CRFs.

Challenging Example: NCT00962780 provides no CRFs and does not organize the data based on form collection, giving users no indication of how the data was collected.

Score: Score is a percent. 1 if all CRFs are shared, 0 if no CRFs are shared. A score of 0.5 means 50% of the CRFs used were shared.

4.2 Share Case Report Forms in non-PDF, machine-readable format.

Machine-readable Case Report Forms allow for easier integration and use of the information provided by the empty Case Report Forms.

Positive Example: For example, REDCap, OpenClinica and several other Electronic Data Capture (EDC) systems allow export into CDISC Operational Data Model (ODM) format for forms. If no cross-platform standard is supported by your EDC, provide CRFs in the platform-specific format.

Challenging Example: NCT00005273 provides case report forms as PDFs of photocopied forms, making any automated integration or machine reading of the information impractical.

Score: Score 1 or 0. No Partial score. 1 if CRFs are included in machine-readable format, 0 if shared in a non-machine-readable format (such as PDFs) or not shared at all.

4.3 List all CRFs

Provide a machine readable list of forms.

Positive Example:

Challenging Example:

Score:

5 Data Dictionary

5.1 Provide data dictionary

Data cannot be interpreted if necessary metadata is missing. If data are provided in spreadsheet format, a data dictionary explains and describes what each data column contains.

Positive Example: On ClinicalTrials.gov approximately 1260 studies provided uploaded data dictionaries or links to the data dictionary

Challenging Example: NCT00711009 provided IPD and other documents in the shared data package, but excluded a data dictionary

Score:

5.2 Provide data dictionary in machine readable format

A data dictionary (DD) can be provided as a PDF, but a PDF is not fully machine readable: processing it raises issues that require human attention (e.g., removal of header and footer text).

Positive Example: NCT01751646 (https://dash.nichd.nih.gov/study/18343) provides data dictionary as a single CSV file.

Challenging Example:

Score:

5.3 Separate data dictionary from de-identified individual participant data. Since it contains no participant level data, do not require local ethical approval as a condition of releasing the data dictionary (avoid a requestwall for data dictionary).

If DD does contain important intellectual property (IP), consider creating a smaller list of DEs that do not contain any IP and release this limited subset of DEs without employing a request-wall.

Positive Example: NCT01769456 has the data dictionary on the data sharing platform, where it is available for download without requiring any request, approval, or filling out of any documents. Another example for study NCT00005159 is here (https://biolincc.nhlbi.nih.gov/media/studies/nlms/Code_Manuals_and_Forms.pdf).

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if the data dictionary is provided separately and in advance of any request for de-identified data, 0 if the data dictionary is only provided as part of the de-identified data package.

5.4 Share a data dictionary as soon as possible. Do not wait until the data collection is complete.

Data dictionary has scientific value for other studies in the field and can help such studies speed up their study design phase. For registry studies (without a fixed end date), early sharing is equally important.

Positive Example: AllofUs research study shared their Case Report Forms at the start of the study (see https://www.researchallofus.org/data-sources/survey-explorer).

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if a data dictionary is shared as early as possible, 0 if the data dictionary is not made available until after data collection and study completion

5.5 Provide data dictionary in a single, machine-readable file.

This simplifies machine processing of available study data. A single-file approach also guarantees one consistent structure (e.g., DE label, DE data type, DE permissible values [for categorical DEs]), which is not assured when the dictionary is scattered across multiple files.
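
When a dictionary already exists as several CSV files, they can be consolidated only if they share one structure. The sketch below merges such files and stops at the first file whose columns differ. The file names and column names in the usage are hypothetical.

```python
import csv
import io


def merge_dictionaries(files):
    """Merge dictionary files given as {file_name: csv_text}.

    Raises ValueError when a file's header differs from the first file's header.
    """
    merged, header = [], None
    for name, text in files.items():
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError(f"{name} has columns {reader.fieldnames}, expected {header}")
        merged.extend(reader)
    return merged
```

For example, two files that both use the columns `label,data_type` merge into one list of rows, while a file using `name,type` is rejected and needs manual attention.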

Positive Example: NCT01772823 provides a single data dictionary document as a CSV containing all data elements and information pertaining to the data elements such as data type and description

Challenging Example: Trials NCT00005274 and NCT00005274 provided the DD in several files. In order to relate those to CDEs, manual processing is required. NCT01233531 has 17 data dictionary files, which include documents in different formats and document types. They also include identical file names that represent dictionaries for different data based on visit type and study population group. A NIDA Data Share trial had 65 data files but only 63 data dictionary PDF files. Matching data files with data dictionaries requires manual work.

Score: Score 1 or 0. No Partial score. 1 if the data dictionary is included in a single machine-readable file, 0 if the data dictionary is in multiple files or a non-machine-readable format.

5.6 For each data element, provide a data type (such as numeric, date, string, categorical)

Specifying data type helps computers to process the information properly. Data type also helps with semantic matching to corresponding CDEs. For example, date of death data type is stated as date (not as character). Most studies collect categorical data types.
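
A declared data type lets software verify values automatically. The sketch below checks a raw string against a declared type. The four type names follow the heading of this recommendation; the use of ISO 8601 (YYYY-MM-DD) for dates is an assumption of the sketch, and the function name is hypothetical.

```python
from datetime import date


def conforms(value: str, data_type: str) -> bool:
    """Check whether a raw string value is consistent with its declared data type."""
    if data_type == "numeric":
        try:
            float(value)
            return True
        except ValueError:
            return False
    if data_type == "date":
        try:
            # Assumption: dates are written as YYYY-MM-DD.
            date.fromisoformat(value)
            return True
        except ValueError:
            return False
    if data_type in ("string", "categorical"):
        # Categorical values need the permissible-value list for a real check.
        return True
    raise ValueError(f"unknown data type: {data_type}")
```

With a date type declared, a value such as `unknown` in a date-of-death column is detected, which would go unnoticed if the column were typed as character.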

Positive Example: NCT00491556 provides the data type for each of the data elements in the DD. UK Biobank uses a comprehensive set of data types (available at http://biobank.ctsu.ox.ac.uk/showcase/help.cgi?cd=value_type). The UK Biobank description and listed types follow. The Value Type of a Data-Field describes the type of variable corresponding to it. There are 10 categories: 1) Integer - whole numbers, for example the age of a participant on a particular date; 2) Categorical (single) - a single answer selected from a coded list or tree of mutually exclusive options, for example a yes/no choice; 3) Categorical (multiple) - sets of answers selected from a coded list or tree of options, for instance concurrent medications; 4) Continuous - floating-point numbers, for example the height of a participant; 5) Text - data composed of alphanumeric characters, for example the first line of an address; 6) Date - a calendar date, for example 14th October 2010; 7) Time - a time, for example 13:38:05 on 14th October 2010; 8) Compound - a set of values required as a whole to describe some compound property, for example an ECG trace; 9) Binary object - a complex dataset (blob), for example an image; 10) Records - a summary showing the volume of records data available via the secure portal.

Challenging Example: NCT00046280 does not provide data type at all.

Score: Score is a percent. 1 if all data elements have a data type, 0 if no elements have a type. A score of 0.5 means 50% of data elements include a data type.

5.7 For categorical data elements, provide a list of permissible values and distinguish when numerical code or string code is a code for a permissible value (versus actual number or string)

For example, for educational level data element, it is important to know what possible values were considered during data collection. While it is possible to discover those permissible values from IPD, if some values were never applicable to any of the subjects, the reverse-engineered permissible value list will be incomplete. In terms of standards, CDISC ODM and REDCap provide a mechanism to list permissible values.
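
The limitation of reverse engineering can be shown with a small sketch. The coded list for an educational level element and the observed column below are hypothetical.

```python
# Hypothetical permissible values for an educational level data element.
PERMISSIBLE = {
    "1": "Primary school",
    "2": "Secondary school",
    "3": "University degree",
    "-1": "Never went to school",
}


def unobserved_codes(permissible, ipd_column):
    """Return permissible codes that never occur in the shared data."""
    return set(permissible) - set(ipd_column)


# Hypothetical IPD column: no participant reported codes 3 or -1.
observed = ["1", "2", "2", "1"]
missing = unobserved_codes(PERMISSIBLE, observed)
```

A list rebuilt from `observed` alone would contain only codes 1 and 2; the codes in `missing` are recoverable only from a published permissible-value list.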

Positive Example: NCT00683579 provides permissible values and definitions for the values associated with categorical data elements.

Challenging Example: NCT00000590 does not provide permissible values for categorical variables in the data dictionary. Another example is providing permissible values in the same document that describes the data elements, but in a non-machine-readable way.

Score: Score 1 or 0. No Partial score. 1 if a list of permissible values is included, 0 if no list is included

5.8 Distinguish categorical string data elements from free-text string data elements

Categorical data type is often not properly assigned to string and numerical data elements. If this is the case, the data dictionary must separate true free text strings from strings that are picked from a list of enumerated possible values. The same problem applies to numerical data elements. Data dictionary should distinguish proper numerical data elements from numerical-categorical.
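
One way to record the distinction is to keep the storage type and the permissible-value list side by side in each dictionary entry. The sketch below derives a combined label from the two; the entry layout and the label wording are hypothetical.

```python
def element_kind(entry):
    """Label a dictionary entry as '<storage type>-categorical' or '<storage type>-proper'."""
    storage = entry["storage_type"]  # "string" or "numeric"
    # An entry with enumerated permissible values is categorical, whatever its storage type.
    suffix = "categorical" if entry.get("permissible_values") else "proper"
    return f"{storage}-{suffix}"


# Hypothetical dictionary entries.
free_text = {"name": "visit_comment", "storage_type": "string"}
coded = {"name": "smoking_status", "storage_type": "numeric",
         "permissible_values": {"0": "Never", "1": "Former", "2": "Current"}}
```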

Positive Example: While NCT01751646 does not provide a separate permissible values dictionary or have a categorical data type listed, it does provide the permissible values and a categorical data type flag in the primary data dictionary.

Challenging Example: NCT01233531 does not provide a permissible values dictionary or label any data element as categorical. There is also no flag indicating data elements as categorical, making it impossible to know from the provided data dictionaries which elements are string-proper compared to string-categorical and which are numeric-proper versus numeric-categorical.

Score:

5.10 Link data elements or permissible values to applicable routine healthcare terminologies (either because you designed them to be linked or post-hoc, they can be semantically linked as equivalent)

Positive Example: NCT00963235 states in the data dictionary the use of LOINC for the coding of lab tests in the study

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if data elements are linked to routine healthcare terminologies, 0 if data elements are not linked to routine terminologies and custom values are used.

5.11 Provide complete data dictionary (all elements in data are listed in a dictionary) and all types of applicable dictionaries (data elements, forms [or groupings], and permissible values)

Note: NIDA trial had 5 data files that were missing a data dictionary file.

Positive Example: NCT00000590 (https://biolincc.nhlbi.nih.gov/studies/pactg/) provides 100% of the data elements found in the IPD, in the data dictionary

Challenging Example: NCT01751646 (https://dash.nichd.nih.gov/study/18343) includes less than 50% of the data elements in the data dictionary

Score: Score is a percent. 1 if all data elements are included in dictionary, 0 if no data elements are included. A score of 0.5 means 50% of data elements are included in the data dictionary.
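
The percent score can be computed directly from the column names found in the data files and the element names listed in the dictionary. The sketch below is one way to do so; the function name is hypothetical and it assumes names match exactly.

```python
def dictionary_coverage(data_columns, dictionary_elements):
    """Fraction of data elements in the data files that the dictionary describes."""
    columns = set(data_columns)
    if not columns:
        raise ValueError("no data elements found in the data files")
    documented = columns & set(dictionary_elements)
    return len(documented) / len(columns)
```

For four data columns of which two appear in the dictionary, the function returns 0.5; dictionary entries that match no data column do not raise the score.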

5.12 Include sufficient description for data elements

In some cases, a descriptive name can be sufficient to define a data element and interpreting a data element is straightforward. However, if a study has two distinct data elements with identical name, data can be hard to interpret.

Positive Example: All elements in UK Biobank have a description. For example, the data element titled Year ended full time education (see http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=22501) has the following detailed description: Some values have special meanings defined by Data-Coding 100306. Units of measurement are calendar year.

Challenging Example: In study NCT01751646 (Vitamin D Absorption in HIV Infected Young Adults Being Treated With Tenofovir Containing cART), the descriptions for forms C100 and B100 both state: Specimen Tracking Form. This makes it hard to interpret whether the data represent the same specimen or two distinct specimens. Avoid identical descriptions for 2 separate items.

Score: Score is a percent. 1 if all data elements have adequate descriptions, 0 if no descriptions are included. A score of 0.5 means 50% of data elements include a sufficient description.

5.13 Use identifiers (unique where applicable) for data elements, forms and permissible values.

The permissible value data dictionary should be linked to the data element data dictionary. Provide annotated case report forms.
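
Both requirements (unique identifiers and a working link between the two dictionaries) can be checked with a few lines. The sketch below assumes each permissible-value row carries an `element_id` column pointing at the data element dictionary; that column name and the function names are hypothetical.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return identifiers that occur more than once, sorted."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)


def orphan_value_rows(element_ids, value_rows):
    """Return permissible-value rows whose element_id is absent from the element dictionary."""
    known = set(element_ids)
    return [row for row in value_rows if row["element_id"] not in known]
```

An empty result from both functions indicates that identifiers are unique and that every permissible value can be traced to a data element.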

Positive Example:

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if all identifiers for elements, forms and permissible values are unique, 0 if duplicate identifiers are used.

6 Data de-identification

6.1 Provide data de-identification notes

Data de-identification notes state how identifiers have been removed or redacted to ensure compliance with privacy regulations. This includes the removal of personal identifiers and the shifting or relativization of dates.

Positive Example: NCT00490412 (https://dash.nichd.nih.gov/study/17335) provides data de-identification notes and methodology prior to sharing IPD and in shared data packages.

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if data de-identification notes are provided, 0 if de-identification notes are not provided.

7 Choice of a Data Sharing platform

7.1 Use platforms that allow download of all studies available on the platform

In order for a computerized algorithm to access a data sharing platform, a machine-readable list (not a web search interface) is required (FAIR principles).

Positive Example: The ClinicalStudyDataRequest.com platform provides a spreadsheet download with a list of all studies included on the platform (available at https://www.clinicalstudydatarequest.com/Documents/All%20Sponsor-Funder%20Studies.xls).

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if the platform used allows for the downloading of a list of all studies available, 0 if the platform does not have this capability

7.2 Choose a platform that supports batch request (ability to request multiple studies with one request)

Survey of data recipients indicates that batch requests are preferred because they simplify retrieval of multiple studies by a data recipient. Filing a separate request for each of multiple studies is less efficient.

Positive Example: Vivli allows a researcher to define multiple studies in a single request.

Challenging Example:

Score: Score 1 or 0. No Partial score. 1 if batch study requests can be made, 0 if each study must be requested separately