library(dmcognigen)
library(dplyr)
data("dmcognigen_pk_requirements")
data("dmcognigen_cov")
data("dmcognigen_pk")
Introduction
Data requirements are specifications of a dataset. They describe required variables with varying details like variable label, type, description, source, formula, rounding specifications, and relationships to other variables.
Generally, the requirements are maintained in a file. These functions support gathering and applying this information from Excel files, Word (docx) tables, or a data.frame.
For the purposes of this vignette, consider files in a directory that include tables with these variables:
#> name description
#> variable_name Variable Name
#> variable_label Variable Label
#> pk_ard Whether the variable is in the Analysis Ready Dataset
#> pk_mif Whether the variable is in the Model Input File
#> format_decode Pairs of (generally) numeric values and descriptions
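As a rough illustration of what such a table looks like once it is in R, the sketch below builds a small data.frame with the five columns listed above. The object name example_requirements is hypothetical. Labels and decodes are copied from output shown later in this vignette, while the pk_ard and pk_mif flags for RACEN are assumed for illustration only.

```r
# Hypothetical stand-in for a requirements table; not read from any file.
example_requirements <- data.frame(
  variable_name  = c("ONUM", "STUDYID", "ID", "RACEN"),
  variable_label = c(
    "Overall Sequence Number",
    "Study Identifier",
    "Subject ID",
    "Race"
  ),
  # "x" marks inclusion; an empty string marks exclusion.
  # The RACEN flags below are assumptions, not taken from the real file.
  pk_ard = c("x", "x", "x", "x"),
  pk_mif = c("x", "", "x", ""),
  # One "value=description" pair per line, separated by "\n".
  format_decode = c(
    "", "", "",
    "1=White/Caucasian\n2=Black/African American\n3=Asian\n4=American Indian or Alaska Native"
  ),
  stringsAsFactors = FALSE
)

str(example_requirements)
```

A data.frame of this shape is the kind of input that as_requirements(), described later, is meant to accept.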
Identify data requirements files
With a versioning strategy based on the latest modification date of the file, multiple versions of the data requirements file may exist in the data assembly directory.
Consider the data assembly directory below:
asmbdat_directory <- file.path(tempdir(), "asmbdat")
asmbdat_directory
#> [1] "/tmp/Rtmp675OXs/asmbdat"
#> /tmp/Rtmp675OXs/asmbdat
#> ├── data-requirements-2019-12-01.xlsx
#> ├── data-requirements-2020-01-01.xlsx
#> ├── data-requirements-2020-01-10.xlsx
#> ├── data-requirements-2025-01-16.xlsx
#> ├── legacy-data-requirements-2000-01-01.docx
#> └── qc-data-requirements-2025-01-16.xlsx
Use the available_requirements_table() function to collect information about multiple data requirements files. Supported files that match the pattern and have the required sheet names or positions will be kept. By default, the required sheet name is "specs".
The resulting table can be used to identify the desired version of the data requirements file to pass to the read_requirements() function.
available_requirements_table(
path = asmbdat_directory,
pattern = "req"
) %>%
select(-c(created, modified)) %>%
print.data.frame()
#> path path_date
#> 1 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-01.xlsx 2020-01-01
#> 2 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-10.xlsx 2020-01-10
#> 3 /tmp/Rtmp675OXs/asmbdat/data-requirements-2025-01-16.xlsx 2025-01-16
#> date sheets is_qc
#> 1 2020-01-01 specs FALSE
#> 2 2020-01-10 specs, unit_conversions FALSE
#> 3 2025-01-16 specs, unit_conversions, discussions FALSE
Use sheet = NULL to drop the sheet-name requirement. This is necessary to include docx files in these searches.
available_requirements_table(
path = asmbdat_directory,
pattern = "req",
sheet = NULL
) %>%
select(-c(created, modified)) %>%
print.data.frame()
#> path path_date
#> 1 /tmp/Rtmp675OXs/asmbdat/data-requirements-2019-12-01.xlsx 2019-12-01
#> 2 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-01.xlsx 2020-01-01
#> 3 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-10.xlsx 2020-01-10
#> 4 /tmp/Rtmp675OXs/asmbdat/data-requirements-2025-01-16.xlsx 2025-01-16
#> 5 /tmp/Rtmp675OXs/asmbdat/legacy-data-requirements-2000-01-01.docx 2000-01-01
#> date sheets is_qc
#> 1 2019-12-01 Sheet 1 FALSE
#> 2 2020-01-01 specs FALSE
#> 3 2020-01-10 specs, unit_conversions FALSE
#> 4 2025-01-16 specs, unit_conversions, discussions FALSE
#> 5 2000-01-01 NULL FALSE
Use drop_qc = FALSE to include versions of the data requirements used for QC. These files are ignored otherwise.
available_requirements_table(
path = asmbdat_directory,
pattern = "req",
sheet = NULL,
drop_qc = FALSE
) %>%
select(-c(created, modified)) %>%
print.data.frame()
#> path path_date
#> 1 /tmp/Rtmp675OXs/asmbdat/data-requirements-2019-12-01.xlsx 2019-12-01
#> 2 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-01.xlsx 2020-01-01
#> 3 /tmp/Rtmp675OXs/asmbdat/data-requirements-2020-01-10.xlsx 2020-01-10
#> 4 /tmp/Rtmp675OXs/asmbdat/data-requirements-2025-01-16.xlsx 2025-01-16
#> 5 /tmp/Rtmp675OXs/asmbdat/legacy-data-requirements-2000-01-01.docx 2000-01-01
#> 6 /tmp/Rtmp675OXs/asmbdat/qc-data-requirements-2025-01-16.xlsx 2025-01-16
#> date sheets is_qc
#> 1 2019-12-01 Sheet 1 FALSE
#> 2 2020-01-01 specs FALSE
#> 3 2020-01-10 specs, unit_conversions FALSE
#> 4 2025-01-16 specs, unit_conversions, discussions FALSE
#> 5 2000-01-01 NULL FALSE
#> 6 2025-01-16 specs, unit_conversions, discussions, qc_findings TRUE
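One way to turn a table like this into a single file choice is sketched below. The data.frame named available is a hand-built stand-in that reproduces only three of the rows and three of the columns shown above, with the directory prefix omitted. It assumes the date column sorts as a date, which the printed output suggests but does not confirm.

```r
library(dplyr)

# Hand-built stand-in for part of the table above (not produced by the package).
available <- data.frame(
  path = c(
    "data-requirements-2020-01-10.xlsx",
    "data-requirements-2025-01-16.xlsx",
    "qc-data-requirements-2025-01-16.xlsx"
  ),
  date = as.Date(c("2020-01-10", "2025-01-16", "2025-01-16")),
  is_qc = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep non-QC files, then take the single row with the most recent date.
latest_path <- available %>%
  filter(!is_qc) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  pull(path)

latest_path
```

The selected path could then be supplied to read_requirements().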
Read data requirements
If no path is provided, read_requirements() will read the latest version of the matching data requirements files in the working directory. The latest version is selected as the file with the most recent date in the filename. If no dates are detected in the filenames, the matching file with the most recent modification time is selected.
requirements <- read_requirements(asmbdat_directory)
#> ✔ Detected requirements file: data-requirements-2025-01-16.xlsx
#> ℹ Modification time: 2025-01-16 20:55:50.4306
#> ℹ Sheet name: "specs"
#> ✔ Applied the "labels_named_list" attribute
#> ✔ Applied the "decode_tbls" attribute
glimpse(requirements)
#> Rows: 54
#> Columns: 5
#> $ variable_name <chr> "ONUM", "NUM", "STUDYID", "USUBJID", "ID", "TSFD", "TSP…
#> $ variable_label <chr> "Overall Sequence Number", "Sequence Number", "Study Id…
#> $ pk_ard <chr> "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", …
#> $ pk_mif <chr> "x", "x", "", "", "x", "x", "x", "", "", "x", "", "x", …
#> $ format_decode <chr> "", "", "", "", "", "", "", "", "", "0=Dose\n1=Xanomeli…
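The selection rule described above (most recent date in the filename, otherwise most recent modification time) can be approximated in base R. This is a minimal sketch and not the package's implementation. The helper names extract_date() and pick_latest() are hypothetical, and the YYYY-MM-DD pattern is an assumption based on the example filenames.

```r
# Assumed date format in filenames: YYYY-MM-DD.
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"

# Return the first date found in each name, or NA when none is present.
extract_date <- function(x) {
  m <- regexpr(date_pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  as.Date(out)
}

# Prefer filename dates; fall back to modification times when no name has a date.
pick_latest <- function(files, mtimes = file.mtime(files)) {
  dates <- extract_date(basename(files))
  if (any(!is.na(dates))) {
    files[which.max(as.numeric(dates))]
  } else {
    files[which.max(as.numeric(mtimes))]
  }
}

files <- c(
  "data-requirements-2019-12-01.xlsx",
  "data-requirements-2020-01-01.xlsx",
  "data-requirements-2020-01-10.xlsx",
  "data-requirements-2025-01-16.xlsx"
)
pick_latest(files)
```

With these names, the sketch selects data-requirements-2025-01-16.xlsx, matching the file detected above.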
Use the subset argument to provide subset/filter criteria:
requirements_pk_mif <- read_requirements(
asmbdat_directory,
subset = pk_mif == "x"
)
#> ✔ Detected requirements file: data-requirements-2025-01-16.xlsx
#> ℹ Modification time: 2025-01-16 20:55:50.4306
#> ℹ Sheet name: "specs"
#> ✔ Applied requested subset: `pk_mif == "x"`
#> ✔ Applied the "labels_named_list" attribute
#> ✔ Applied the "decode_tbls" attribute
glimpse(requirements_pk_mif)
#> Rows: 16
#> Columns: 5
#> $ variable_name <chr> "ONUM", "NUM", "ID", "TSFD", "TSPD", "DVID", "EVID", "M…
#> $ variable_label <chr> "Overall Sequence Number", "Sequence Number", "Subject …
#> $ pk_ard <chr> "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", …
#> $ pk_mif <chr> "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", …
#> $ format_decode <chr> "", "", "", "", "", "0=Dose\n1=Xanomeline Concentration…
Use as_requirements() to apply requirements attributes to an existing data frame:
requirements_pk_mif_with_as <- as_requirements(
as.data.frame(requirements_pk_mif)
)
#> ✔ Applied the "labels_named_list" attribute
#> ✔ Applied the "decode_tbls" attribute
identical(requirements_pk_mif_with_as, requirements_pk_mif)
#> [1] TRUE
Attributes
When possible, additional attributes are assigned to the result of read_requirements() and as_requirements(). Resulting requirements objects can be used directly in other functions in this package to apply these attributes.
decode_tbls
This attribute will be set if the variable_name_col and decode_col arguments are defined and those variables are in the requirements. See more details on using decode_tbls in the Decode Tables vignette.
attr(requirements, "decode_tbls")
#>
#> ── Decode tables ───────────────────────────────────────────────────────────────
#>
#> ── DVID ──
#>
#> 0=Dose
#> 1=Xanomeline Concentration (ug/mL)
#>
#> ── EVID ──
#>
#> 0=PK or PD measure
#> 1=Dose
#> 2=Other
#>
#> ── MDV ──
#>
#> 0=PK or PD measure
#> 1=Dose or Other
#>
#> ── BLQFN ──
#>
#> 0=No
#> 1=Yes
#>
#> ── FED ──
#>
#> 0=Fasted
#> 1=Fed
#>
#> ── RACEN ──
#>
#> 1=White/Caucasian
#> 2=Black/African American
#> 3=Asian
#> 4=American Indian or Alaska Native
#>
#> ── SEXF ──
#>
#> 0=Male
#> 1=Female
#>
#> ── RFCAT ──
#>
#> 1=Normal Function (>=90 mL/min)
#> 2=Mild Impairment (60-89 mL/min)
#> 3=Moderate Impairment (30-59 mL/min)
#> 4=Severe Impairment (15-29 mL/min)
#> 5=End Stage Disease (<15 mL/min or Dialysis)
#>
#> ── NCILIV ──
#>
#> 0=Normal Group A
#> 1=Mild Group B1
#> 2=Mild Group B2
#> 3=Moderate Group C
#> 4=Severe Group D
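To make the relationship between a format_decode string and a decode table concrete, the sketch below parses one such string in base R. The function parse_decode() is hypothetical and only illustrates the idea; the package's own parsing may differ. It splits each line at the first "=" only, because descriptions such as the RFCAT ones contain ">=".

```r
# Hypothetical helper: turn "value=description" lines into a two-column table.
parse_decode <- function(x) {
  lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]
  # Position of the first "=" on each line; later "=" stay in the label.
  eq_pos <- regexpr("=", lines, fixed = TRUE)
  data.frame(
    # Values are generally numeric; type.convert() keeps text as character.
    value = utils::type.convert(substr(lines, 1, eq_pos - 1), as.is = TRUE),
    label = substr(lines, eq_pos + 1, nchar(lines)),
    stringsAsFactors = FALSE
  )
}

rfcat_decode <- parse_decode(
  "1=Normal Function (>=90 mL/min)\n2=Mild Impairment (60-89 mL/min)"
)
rfcat_decode
```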
Join decodes
To join new variables to a data set based on decode_tbls or requirements objects, use join_decode_labels() or join_decode_levels().
dmcognigen_pk %>%
select(USUBJID, RACEN, SEXF) %>%
join_decode_labels(dmcognigen_pk_requirements, lvl_to_lbl = list(RACEN = "RACEC", "{var}C")) %>%
cnt(RACEN, RACEC, SEXF, SEXFC, n_distinct_vars = USUBJID)
#> ✔ Joined `RACEC` by `RACEN`.
#>
#> ── RACEN ──
#>
#> 1=White/Caucasian
#> 2=Black/African American
#> 4=American Indian or Alaska Native
#> ✔ Joined `SEXFC` by `SEXF`.
#>
#> ── SEXF ──
#>
#> 0=Male
#> 1=Female
#> # A tibble: 5 × 7
#> RACEN RACEC SEXF SEXFC n_USUBJID n n_cumulative
#> <dbl> <chr> <dbl> <chr> <int> <int> <int>
#> 1 1 White/Caucasian 0 Male 104 1456 1456
#> 2 1 White/Caucasian 1 Fema… 126 1764 3220
#> 3 2 Black/African American 0 Male 6 84 3304
#> 4 2 Black/African American 1 Fema… 17 238 3542
#> 5 4 American Indian or Alaska Nati… 0 Male 1 14 3556
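Conceptually, joining decode labels resembles a left join against a small lookup table. The sketch below shows that idea with dplyr on made-up data. The subject identifiers and SEXF values are hypothetical, and this is not a description of how join_decode_labels() is implemented.

```r
library(dplyr)

# Lookup table built from the SEXF decode shown above.
sexf_decode <- data.frame(
  SEXF = c(0, 1),
  SEXFC = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical subject-level data.
example_data <- data.frame(
  USUBJID = c("A", "B", "C"),
  SEXF = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Add the character description alongside the numeric code.
joined <- left_join(example_data, sexf_decode, by = "SEXF")
joined
```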
Create or modify variables as factors
Use set_decode_factors() to either modify variables in-place or create new variables.
dmcognigen_cov %>%
set_decode_factors(requirements, new_names = list(RACEN = "RACEN", "{var}FCT")) %>%
cnt(RACEN, across(ends_with("FCT")), n_cumulative = FALSE)
#> ✔ Created new variable `NCILIVFCT` as a factor of `NCILIV`.
#> ✔ Modified variable `RACEN` as a factor of `RACEN`.
#> ✔ Created new variable `RFCATFCT` as a factor of `RFCAT`.
#> ✔ Created new variable `SEXFFCT` as a factor of `SEXF`.
#> # A tibble: 22 × 5
#> RACEN NCILIVFCT RFCATFCT SEXFFCT n
#> <fct> <fct> <fct> <fct> <int>
#> 1 White/Caucasian Normal Group A Mild Impairment (60-89 mL/min) Male 26
#> 2 White/Caucasian Normal Group A Mild Impairment (60-89 mL/min) Female 30
#> 3 White/Caucasian Normal Group A Moderate Impairment (30-59 mL/m… Male 65
#> 4 White/Caucasian Normal Group A Moderate Impairment (30-59 mL/m… Female 78
#> 5 White/Caucasian Normal Group A NA Male 3
#> 6 White/Caucasian Normal Group A NA Female 5
#> 7 White/Caucasian Mild Group B1 Mild Impairment (60-89 mL/min) Male 3
#> 8 White/Caucasian Mild Group B1 Mild Impairment (60-89 mL/min) Female 3
#> 9 White/Caucasian Mild Group B1 Moderate Impairment (30-59 mL/m… Male 1
#> 10 White/Caucasian Mild Group B1 Moderate Impairment (30-59 mL/m… Female 6
#> # ℹ 12 more rows
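For comparison, a similar result for a single variable can be produced with base R's factor(), using the decode values as levels and the descriptions as labels. The input vector below is hypothetical. How set_decode_factors() treats unused levels is not shown in this vignette, so the sketch makes no claim about that.

```r
# Hypothetical numeric codes for RACEN.
racen <- c(1, 2, 4, 1)

# Levels and labels taken from the RACEN decode table shown earlier.
racen_fct <- factor(
  racen,
  levels = c(1, 2, 3, 4),
  labels = c(
    "White/Caucasian",
    "Black/African American",
    "Asian",
    "American Indian or Alaska Native"
  )
)

as.character(racen_fct)
levels(racen_fct)
```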
labels_named_list
This attribute will be set if the variable_name_col and variable_label_col arguments are defined and those variables are in the requirements. The requirements can be used directly in set_labels() or the attribute can be extracted and possibly modified.
attr(requirements, "labels_named_list") %>%
head()
#> $ONUM
#> [1] "Overall Sequence Number"
#>
#> $NUM
#> [1] "Sequence Number"
#>
#> $STUDYID
#> [1] "Study Identifier"
#>
#> $USUBJID
#> [1] "Unique Subject Identifier"
#>
#> $ID
#> [1] "Subject ID"
#>
#> $TSFD
#> [1] "Time Since First Dose (h)"
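Apply the labels to a data set. Here, dmcognigen_cov_labels_removed is a three-variable data set whose variable labels are empty strings: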
dmcognigen_cov_labels_removed %>%
str()
#> tibble [254 × 3] (S3: tbl_df/tbl/data.frame)
#> $ STUDYID: chr [1:254] "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" ...
#> ..- attr(*, "label")= chr ""
#> $ USUBJID: chr [1:254] "01-701-1015" "01-701-1023" "01-701-1028" "01-701-1033" ...
#> ..- attr(*, "label")= chr ""
#> $ RACEN : num [1:254] 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "label")= chr ""
#> - attr(*, "label")= chr "CDISCPILOT01 Covariates"
# apply the labels
dmcognigen_cov_labels_removed %>%
set_labels(labels = requirements) %>%
str()
#> ℹ Inheriting labels from `requirements` <requirements> object.
#> tibble [254 × 3] (S3: tbl_df/tbl/data.frame)
#> $ STUDYID: chr [1:254] "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" ...
#> ..- attr(*, "label")= chr "Study Identifier"
#> $ USUBJID: chr [1:254] "01-701-1015" "01-701-1023" "01-701-1028" "01-701-1033" ...
#> ..- attr(*, "label")= chr "Unique Subject Identifier"
#> $ RACEN : num [1:254] 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "label")= chr "Race"
#> - attr(*, "label")= chr "CDISCPILOT01 Covariates"
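The effect on each column can be pictured as setting a "label" attribute from a named list. The sketch below does this in base R for a small made-up data.frame. The names labels_list and example_cov are hypothetical, and set_labels() may do more than this.

```r
# Named list in the same shape as the "labels_named_list" attribute.
labels_list <- list(
  STUDYID = "Study Identifier",
  USUBJID = "Unique Subject Identifier",
  RACEN = "Race"
)

# Small stand-in data set without labels.
example_cov <- data.frame(
  STUDYID = "CDISCPILOT01",
  USUBJID = c("01-701-1015", "01-701-1023"),
  RACEN = c(1, 1),
  stringsAsFactors = FALSE
)

# Assign a label to every column that has an entry in the list.
for (var in intersect(names(example_cov), names(labels_list))) {
  attr(example_cov[[var]], "label") <- labels_list[[var]]
}

str(example_cov)
```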