Data Requirements • dmcognigen

library(dmcognigen)
library(dplyr)

data("dmcognigen_pk_requirements")

data("dmcognigen_cov")
data("dmcognigen_pk")

Introduction

Data requirements are specifications of a dataset. They describe the required variables in varying detail, such as the variable label, type, description, source, formula, rounding specifications, and relationships to other variables.

Generally, the requirements are maintained in a file. These functions support gathering and applying this information from Excel files, Word (docx) tables, or a data.frame.

For the purposes of this vignette, consider files in a directory that include tables with these variables:

#>            name                                           description
#>   variable_name                                         Variable Name
#>  variable_label                                        Variable Label
#>          pk_ard Whether the variable is in the Analysis Ready Dataset
#>          pk_mif       Whether the variable is in the Model Input File
#>   format_decode  Pairs of (generally) numeric values and descriptions

Identify data requirements files

With a versioning strategy based on each file's latest modification date, the data assembly directory may contain multiple versions of the data requirements file.

Consider the data assembly directory below:

asmbdat_directory <- file.path(tempdir(), "asmbdat")
asmbdat_directory
#> [1] "/tmp/Rtmp675OXs/asmbdat"

#> /tmp/Rtmp675OXs/asmbdat
#> ├── data-requirements-2019-12-01.xlsx
#> ├── data-requirements-2020-01-01.xlsx
#> ├── data-requirements-2020-01-10.xlsx
#> ├── data-requirements-2025-01-16.xlsx
#> ├── legacy-data-requirements-2000-01-01.docx
#> └── qc-data-requirements-2025-01-16.xlsx

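Use the available_requirements_table() function to collect information about multiple data requirements files. Files of a supported type that match the pattern and have the required sheet names or positions are kept. By default, the required sheet name is "specs". In the example below, data-requirements-2019-12-01.xlsx and the docx file are left out because neither has a "specs" sheet.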

The resulting table can be used to identify the desired version of the data requirements file to pass to the read_requirements() function.

available_requirements_table(
  path = asmbdat_directory,
  pattern = "req"
) %>% 
  select(-c(created, modified)) %>% 
  print.data.frame()
#>                                                        path  path_date
#> 1 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-01.xlsx 2020-01-01
#> 2 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-10.xlsx 2020-01-10
#> 3 /tmp/Rtmp675OXs/asmbdat/data-requirements-2025-01-16.xlsx 2025-01-16
#>         date                               sheets is_qc
#> 1 2020-01-01                                specs FALSE
#> 2 2020-01-10              specs, unit_conversions FALSE
#> 3 2025-01-16 specs, unit_conversions, discussions FALSE

Use sheet = NULL to remove the sheet name requirement. This is necessary for docx files to be included in the search.

available_requirements_table(
  path = asmbdat_directory,
  pattern = "req",
  sheet = NULL
) %>%
  select(-c(created, modified)) %>% 
  print.data.frame()
#>                                                               path  path_date
#> 1        /tmp/Rtmp675OXs/asmbdat/data-requirements-2019-12-01.xlsx 2019-12-01
#> 2        /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-01.xlsx 2020-01-01
#> 3        /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-10.xlsx 2020-01-10
#> 4        /tmp/Rtmp675OXs/asmbdat/data-requirements-2025-01-16.xlsx 2025-01-16
#> 5 /tmp/Rtmp675OXs/asmbdat/legacy-data-requirements-2000-01-01.docx 2000-01-01
#>         date                               sheets is_qc
#> 1 2019-12-01                              Sheet 1 FALSE
#> 2 2020-01-01                                specs FALSE
#> 3 2020-01-10              specs, unit_conversions FALSE
#> 4 2025-01-16 specs, unit_conversions, discussions FALSE
#> 5 2000-01-01                                 NULL FALSE

Use drop_qc = FALSE to include the versions of the data requirements used for QC. By default, these files are excluded.

available_requirements_table(
  path = asmbdat_directory,
  pattern = "req",
  sheet = NULL,
  drop_qc = FALSE
) %>% 
  select(-c(created, modified)) %>% 
  print.data.frame()
#>                                                               path  path_date
#> 1        /tmp/Rtmp675OXs/asmbdat/data-requirements-2019-12-01.xlsx 2019-12-01
#> 2        /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-01.xlsx 2020-01-01
#> 3        /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-10.xlsx 2020-01-10
#> 4        /tmp/Rtmp675OXs/asmbdat/data-requirements-2025-01-16.xlsx 2025-01-16
#> 5 /tmp/Rtmp675OXs/asmbdat/legacy-data-requirements-2000-01-01.docx 2000-01-01
#> 6     /tmp/Rtmp675OXs/asmbdat/qc-data-requirements-2025-01-16.xlsx 2025-01-16
#>         date                                            sheets is_qc
#> 1 2019-12-01                                           Sheet 1 FALSE
#> 2 2020-01-01                                             specs FALSE
#> 3 2020-01-10                           specs, unit_conversions FALSE
#> 4 2025-01-16              specs, unit_conversions, discussions FALSE
#> 5 2000-01-01                                              NULL FALSE
#> 6 2025-01-16 specs, unit_conversions, discussions, qc_findings  TRUE
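
Because the result is an ordinary data frame, the desired version can be picked with standard subsetting. The sketch below uses a small hand-built stand-in for the table (only the path, path_date, and is_qc columns, with path_date assumed to be a Date) and a hypothetical cutoff date; it is an illustration, not output from the package.

```r
# Stand-in for the table printed above; in practice this would be the
# result of available_requirements_table(). Values are illustrative.
available <- data.frame(
  path = c(
    "data-requirements-2020-01-10.xlsx",
    "data-requirements-2025-01-16.xlsx",
    "qc-data-requirements-2025-01-16.xlsx"
  ),
  path_date = as.Date(c("2020-01-10", "2025-01-16", "2025-01-16")),
  is_qc = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical rule: the latest non-QC version dated before a cutoff.
cutoff <- as.Date("2021-01-01")
candidates <- available[!available$is_qc & available$path_date < cutoff, ]
chosen_path <- candidates$path[which.max(as.numeric(candidates$path_date))]
chosen_path
```

The value of chosen_path is what would then be passed to read_requirements().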

Read data requirements

If no path is provided, read_requirements() reads the latest version of the matching data requirements files in the working directory. The latest version is the file with the most recent date in its filename; if no dates are detected in the filenames, the matching file with the most recent modification time is selected instead. In the example below, a directory is passed and the latest matching file within it is detected.

requirements <- read_requirements(asmbdat_directory)
#> ✔ Detected requirements file: data-requirements-2025-01-16.xlsx
#> ℹ Modification time: 2025-01-16 20:55:50.4306
#> ℹ Sheet name: "specs"
#> ✔ Applied the "labels_named_list" attribute
#> ✔ Applied the "decode_tbls" attribute
glimpse(requirements)
#> Rows: 54
#> Columns: 5
#> $ variable_name  <chr> "ONUM", "NUM", "STUDYID", "USUBJID", "ID", "TSFD", "TSP…
#> $ variable_label <chr> "Overall Sequence Number", "Sequence Number", "Study Id…
#> $ pk_ard         <chr> "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", …
#> $ pk_mif         <chr> "x", "x", "", "", "x", "x", "x", "", "", "x", "", "x", …
#> $ format_decode  <chr> "", "", "", "", "", "", "", "", "", "0=Dose\n1=Xanomeli…
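
The "latest version" rule described at the start of this section can be sketched in base R. The helper below is hypothetical and is not the package's implementation: it assumes dates appear in the filename as YYYY-MM-DD, and it assumes that when only some filenames carry a date, the dated files take priority.

```r
# Hypothetical helper (not part of dmcognigen): choose the "latest" file,
# preferring a YYYY-MM-DD date in the file name over modification time.
pick_latest_file <- function(paths, mtimes = file.mtime(paths)) {
  file_names <- basename(paths)
  date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
  has_date <- grepl(date_pattern, file_names)

  if (any(has_date)) {
    dated_names <- file_names[has_date]
    # Extract the first date-like substring from each dated file name.
    date_strings <- regmatches(dated_names, regexpr(date_pattern, dated_names))
    file_dates <- as.Date(date_strings)
    return(paths[has_date][which.max(as.numeric(file_dates))])
  }

  # Fallback: no dates detected, so use the most recent modification time.
  paths[which.max(as.numeric(mtimes))]
}

paths <- c(
  "data-requirements-2020-01-01.xlsx",
  "data-requirements-2025-01-16.xlsx",
  "data-requirements-2020-01-10.xlsx"
)
latest <- pick_latest_file(paths)
latest
```

The mtimes argument is only evaluated in the fallback branch, so the example runs even though these files do not exist on disk.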

Use the subset argument to provide subset/filter criteria:

requirements_pk_mif <- read_requirements(
  asmbdat_directory,
  subset = pk_mif == "x"
)
#> ✔ Detected requirements file: data-requirements-2025-01-16.xlsx
#> ℹ Modification time: 2025-01-16 20:55:50.4306
#> ℹ Sheet name: "specs"
#> ✔ Applied requested subset: `pk_mif == "x"`
#> ✔ Applied the "labels_named_list" attribute
#> ✔ Applied the "decode_tbls" attribute
glimpse(requirements_pk_mif)
#> Rows: 16
#> Columns: 5
#> $ variable_name  <chr> "ONUM", "NUM", "ID", "TSFD", "TSPD", "DVID", "EVID", "M…
#> $ variable_label <chr> "Overall Sequence Number", "Sequence Number", "Subject …
#> $ pk_ard         <chr> "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", …
#> $ pk_mif         <chr> "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", …
#> $ format_decode  <chr> "", "", "", "", "", "0=Dose\n1=Xanomeline Concentration…

Use as_requirements() to apply requirements attributes to an existing data frame:

requirements_pk_mif_with_as <- as_requirements(
  as.data.frame(requirements_pk_mif)
)
#> ✔ Applied the "labels_named_list" attribute
#> ✔ Applied the "decode_tbls" attribute
identical(requirements_pk_mif_with_as, requirements_pk_mif)
#> [1] TRUE

Attributes

When possible, additional attributes are assigned to the result of read_requirements() and as_requirements(). Resulting requirements objects can be used directly in other functions in this package to apply these attributes.

decode_tbls

This attribute will be set if the variable_name_col and decode_col arguments are defined and those variables are in the requirements.

See more details on using decode_tbls in the Decode Tables vignette.

attr(requirements, "decode_tbls")
#> 
#> ── Decode tables ───────────────────────────────────────────────────────────────
#> 
#> ── DVID ──
#> 
#> 0=Dose
#> 1=Xanomeline Concentration (ug/mL)
#> 
#> ── EVID ──
#> 
#> 0=PK or PD measure
#> 1=Dose
#> 2=Other
#> 
#> ── MDV ──
#> 
#> 0=PK or PD measure
#> 1=Dose or Other
#> 
#> ── BLQFN ──
#> 
#> 0=No
#> 1=Yes
#> 
#> ── FED ──
#> 
#> 0=Fasted
#> 1=Fed
#> 
#> ── RACEN ──
#> 
#> 1=White/Caucasian
#> 2=Black/African American
#> 3=Asian
#> 4=American Indian or Alaska Native
#> 
#> ── SEXF ──
#> 
#> 0=Male
#> 1=Female
#> 
#> ── RFCAT ──
#> 
#> 1=Normal Function (>=90 mL/min)
#> 2=Mild Impairment (60-89 mL/min)
#> 3=Moderate Impairment (30-59 mL/min)
#> 4=Severe Impairment (15-29 mL/min)
#> 5=End Stage Disease (<15 mL/min or Dialysis)
#> 
#> ── NCILIV ──
#> 
#> 0=Normal Group A
#> 1=Mild Group B1
#> 2=Mild Group B2
#> 3=Moderate Group C
#> 4=Severe Group D
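
To make the link between the format_decode column and the tables above concrete, here is a minimal base R sketch of how one "value=description" string could be split into pairs. The helper and its lvl/lbl column names are hypothetical; the package's own parsing may differ.

```r
# Hypothetical helper (not the package's parser): split a format_decode
# string of newline-separated "value=description" entries into a data frame.
parse_format_decode <- function(format_decode) {
  entries <- strsplit(format_decode, "\n", fixed = TRUE)[[1]]
  entries <- entries[nzchar(trimws(entries))]

  # Split on the first "=" only, so descriptions such as ">=90 mL/min" survive.
  eq_pos <- regexpr("=", entries, fixed = TRUE)

  data.frame(
    lvl = as.numeric(trimws(substr(entries, 1, eq_pos - 1))),
    lbl = trimws(substr(entries, eq_pos + 1, nchar(entries))),
    stringsAsFactors = FALSE
  )
}

rfcat_decode <- parse_format_decode(
  "1=Normal Function (>=90 mL/min)\n2=Mild Impairment (60-89 mL/min)"
)
rfcat_decode
```

Splitting on the first "=" rather than every "=" matters for entries like the RFCAT descriptions, which contain ">=" themselves.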

Join decodes

To join new variables to a data set based on decode_tbls or requirements objects, use join_decode_labels() or join_decode_levels().

dmcognigen_pk %>% 
  select(USUBJID, RACEN, SEXF) %>% 
  join_decode_labels(
    dmcognigen_pk_requirements,
    lvl_to_lbl = list(RACEN = "RACEC", "{var}C")
  ) %>% 
  cnt(RACEN, RACEC, SEXF, SEXFC, n_distinct_vars = USUBJID)
#> ✔ Joined `RACEC` by `RACEN`.
#> 
#> ── RACEN ──
#> 
#> 1=White/Caucasian
#> 2=Black/African American
#> 4=American Indian or Alaska Native
#> ✔ Joined `SEXFC` by `SEXF`.
#> 
#> ── SEXF ──
#> 
#> 0=Male
#> 1=Female
#> # A tibble: 5 × 7
#>   RACEN RACEC                            SEXF SEXFC n_USUBJID     n n_cumulative
#>   <dbl> <chr>                           <dbl> <chr>     <int> <int>        <int>
#> 1     1 White/Caucasian                     0 Male        104  1456         1456
#> 2     1 White/Caucasian                     1 Fema…       126  1764         3220
#> 3     2 Black/African American              0 Male          6    84         3304
#> 4     2 Black/African American              1 Fema…        17   238         3542
#> 5     4 American Indian or Alaska Nati…     0 Male          1    14         3556
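
Conceptually, the join above is a lookup from a numeric level to its description. A minimal base R sketch of that idea, using the RACEN decode shown earlier and made-up subject identifiers (this is not how the package performs the join):

```r
# Decode table for RACEN, copied from the decode_tbls output shown earlier.
racen_decode <- data.frame(
  lvl = c(1, 2, 3, 4),
  lbl = c(
    "White/Caucasian",
    "Black/African American",
    "Asian",
    "American Indian or Alaska Native"
  ),
  stringsAsFactors = FALSE
)

# Illustrative data; these subject identifiers are made up.
subjects <- data.frame(
  USUBJID = c("A", "B", "C"),
  RACEN = c(1, 4, 2),
  stringsAsFactors = FALSE
)

# Look up each RACEN value in the decode table and add the description.
subjects$RACEC <- racen_decode$lbl[match(subjects$RACEN, racen_decode$lvl)]
subjects
```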

Create or modify variables as factors

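Use set_decode_factors() to either modify variables in place or create new variables. In the example below, RACEN is mapped to its own name and is therefore modified in place, while the other decoded variables are created as new variables named with the "{var}FCT" pattern.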

dmcognigen_cov %>% 
  set_decode_factors(
    requirements,
    new_names = list(RACEN = "RACEN", "{var}FCT")
  ) %>% 
  cnt(RACEN, across(ends_with("FCT")), n_cumulative = FALSE)
#> ✔ Created new variable `NCILIVFCT` as a factor of `NCILIV`.
#> ✔ Modified variable `RACEN` as a factor of `RACEN`.
#> ✔ Created new variable `RFCATFCT` as a factor of `RFCAT`.
#> ✔ Created new variable `SEXFFCT` as a factor of `SEXF`.
#> # A tibble: 22 × 5
#>    RACEN           NCILIVFCT      RFCATFCT                         SEXFFCT     n
#>    <fct>           <fct>          <fct>                            <fct>   <int>
#>  1 White/Caucasian Normal Group A Mild Impairment (60-89 mL/min)   Male       26
#>  2 White/Caucasian Normal Group A Mild Impairment (60-89 mL/min)   Female     30
#>  3 White/Caucasian Normal Group A Moderate Impairment (30-59 mL/m… Male       65
#>  4 White/Caucasian Normal Group A Moderate Impairment (30-59 mL/m… Female     78
#>  5 White/Caucasian Normal Group A NA                               Male        3
#>  6 White/Caucasian Normal Group A NA                               Female      5
#>  7 White/Caucasian Mild Group B1  Mild Impairment (60-89 mL/min)   Male        3
#>  8 White/Caucasian Mild Group B1  Mild Impairment (60-89 mL/min)   Female      3
#>  9 White/Caucasian Mild Group B1  Moderate Impairment (30-59 mL/m… Male        1
#> 10 White/Caucasian Mild Group B1  Moderate Impairment (30-59 mL/m… Female      6
#> # ℹ 12 more rows
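
The factor variables above pair each numeric value with its decode description. In base R, the same pairing for a single variable can be sketched with factor(); the values below are illustrative and this is not the package's implementation.

```r
# Illustrative SEXF values, including one missing value.
sexf <- c(0, 1, 1, 0, NA)

# Levels and labels follow the SEXF decode table: 0=Male, 1=Female.
sexf_fct <- factor(sexf, levels = c(0, 1), labels = c("Male", "Female"))

levels(sexf_fct)
as.character(sexf_fct)
```

Because the levels come from the decode table rather than from the data, the level order stays fixed even when a level does not occur in a given dataset.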

labels_named_list

This attribute will be set if the variable_name_col and variable_label_col arguments are defined and those variables are in the requirements. The requirements can be used directly in set_labels() or the attribute can be extracted and possibly modified.

attr(requirements, "labels_named_list") %>% 
  head()
#> $ONUM
#> [1] "Overall Sequence Number"
#> 
#> $NUM
#> [1] "Sequence Number"
#> 
#> $STUDYID
#> [1] "Study Identifier"
#> 
#> $USUBJID
#> [1] "Unique Subject Identifier"
#> 
#> $ID
#> [1] "Subject ID"
#> 
#> $TSFD
#> [1] "Time Since First Dose (h)"

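Apply the labels to a data set. The construction of dmcognigen_cov_labels_removed is not shown; judging by its structure below, it is a three-variable covariate data set whose variable labels are empty strings: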

dmcognigen_cov_labels_removed %>% 
  str()
#> tibble [254 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ STUDYID: chr [1:254] "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" ...
#>   ..- attr(*, "label")= chr ""
#>  $ USUBJID: chr [1:254] "01-701-1015" "01-701-1023" "01-701-1028" "01-701-1033" ...
#>   ..- attr(*, "label")= chr ""
#>  $ RACEN  : num [1:254] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..- attr(*, "label")= chr ""
#>  - attr(*, "label")= chr "CDISCPILOT01 Covariates"

# apply the labels
dmcognigen_cov_labels_removed %>% 
  set_labels(labels = requirements) %>% 
  str()
#> ℹ Inheriting labels from `requirements` <requirements> object.
#> tibble [254 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ STUDYID: chr [1:254] "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" ...
#>   ..- attr(*, "label")= chr "Study Identifier"
#>  $ USUBJID: chr [1:254] "01-701-1015" "01-701-1023" "01-701-1028" "01-701-1033" ...
#>   ..- attr(*, "label")= chr "Unique Subject Identifier"
#>  $ RACEN  : num [1:254] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..- attr(*, "label")= chr "Race"
#>  - attr(*, "label")= chr "CDISCPILOT01 Covariates"
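
For reference, the effect of applying a named list of labels can be sketched in base R by setting the "label" attribute on each matching column. This is only an illustration of the idea with a one-row stand-in data frame, not the implementation of set_labels().

```r
# A named list in the same shape as the "labels_named_list" attribute.
labels_named_list <- list(
  STUDYID = "Study Identifier",
  USUBJID = "Unique Subject Identifier",
  RACEN = "Race"
)

# One-row stand-in data frame using values shown in the output above.
dat <- data.frame(
  STUDYID = "CDISCPILOT01",
  USUBJID = "01-701-1015",
  RACEN = 1,
  stringsAsFactors = FALSE
)

# Set the "label" attribute on every column that has an entry in the list.
for (var in intersect(names(dat), names(labels_named_list))) {
  attr(dat[[var]], "label") <- labels_named_list[[var]]
}

attr(dat$RACEN, "label")
```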