Summarize multiple response variables by group or pattern

mean_group_tbl() calculates summary statistics (i.e., mean, median, standard deviation, minimum, maximum, and count of non-missing values) for continuous (i.e., interval and ratio-level) variables, grouped either by another variable in your dataset or by a matched pattern in the variable names.

mean_group_tbl(
  data,
  var_stem,
  group,
  var_input = "stem",
  regex_stem = FALSE,
  ignore_stem_case = FALSE,
  group_type = "variable",
  group_name = NULL,
  regex_group = FALSE,
  ignore_group_case = FALSE,
  remove_group_non_alnum = TRUE,
  na_removal = "listwise",
  only = NULL,
  var_labels = NULL,
  ignore = NULL
)

Arguments

data

A data frame.

var_stem

A character vector with one or more elements, where each represents either a variable stem or the complete name of a variable present in data. A variable 'stem' refers to a common naming pattern shared among related variables, typically reflecting repeated measures of the same idea or a group of items assessing a single concept.

group

A character string representing a variable name or a pattern used to search for variables in data.

var_input

A character string specifying whether the values supplied to var_stem should be treated as variable stems (stem) or as complete variable names (name). By default, this is set to stem, so the function searches for variables that begin with each stem provided. Setting this argument to name directs the function to look for variables that exactly match the provided names.
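
The difference between the two lookup modes can be sketched in base R. This is only an illustration of the matching rules described above, not the function's internal code; the column names are hypothetical.

```r
# Hypothetical column names, used only for illustration.
cols <- c("ACS_PCT_AGE_0_4", "ACS_PCT_AGE_5_9", "ACS_PCT_AGE", "REGION")

# "stem"-style lookup: keep every column that begins with the stem.
stem_matches <- cols[startsWith(cols, "ACS_PCT_AGE")]

# "name"-style lookup: keep only columns whose full name matches exactly.
name_matches <- cols[cols %in% "ACS_PCT_AGE"]

print(stem_matches)
print(name_matches)
```

Here the stem lookup returns three columns, while the name lookup returns only the single column called ACS_PCT_AGE.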

regex_stem

A logical value indicating whether to use Perl-compatible regular expressions when searching for variable stems. Default is FALSE.

ignore_stem_case

A logical value indicating whether the search for columns matching the supplied var_stem is case-insensitive. Default is FALSE.

group_type

A character string that defines how the group argument should be interpreted. Should be one of pattern or variable. Defaults to variable, which searches for a matching variable name in data.

group_name

An optional character string used to rename the group column in the final table. When group_type is set to variable, the column name defaults to the matched variable name from data. When set to pattern, the default column name is group.

regex_group

A logical value indicating whether to use Perl-compatible regular expressions when searching for group variables or matching variable name patterns. Default is FALSE.

ignore_group_case

A logical value specifying whether the search for a grouping variable (if group_type is variable) or for variables matching a pattern (if group_type is pattern) should be case-insensitive. Default is FALSE. Set to TRUE to ignore case.

remove_group_non_alnum

A logical value indicating whether to remove all non-alphanumeric characters (i.e., anything that is not a letter or number) from group. Default is TRUE.
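
As a rough base R sketch of how a pattern-based group label might be derived and then cleaned, assuming the matched portion of each variable name becomes the group value (this mirrors the output of the second example, but is not taken from the function's source):

```r
# Variable names from the second example in this page.
var_names <- c("symptoms.t1", "symptoms.t2")

# Extract the part of each name matched by the pattern, using
# Perl-compatible matching (comparable to regex_group = TRUE).
raw_groups <- regmatches(var_names, regexpr(".t\\d", var_names, perl = TRUE))

# Strip everything that is not a letter or a digit, which is what
# remove_group_non_alnum = TRUE is described as doing.
clean_groups <- gsub("[^[:alnum:]]", "", raw_groups)

print(raw_groups)
print(clean_groups)
```

The raw matches keep the leading dot (.t1, .t2); the cleaned labels are t1 and t2.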

na_removal

A character string specifying how missing values are handled. Must be one of listwise or pairwise. Defaults to listwise.

listwise: Removes any row that has at least one missing value across all variables returned or analyzed. (Effectively uses complete cases only.)
pairwise: Handles missing values per variable or per pair of variables, using all available data, even if other variables in the row have missing values.
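
The two strategies can be compared with a small base R sketch. The data frame below is made up for illustration, and the code shows the general idea of listwise versus pairwise deletion rather than the function's own implementation.

```r
# Toy data with one missing value in each column (hypothetical).
df <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(10, NA, 30, 40)
)

# Listwise: keep only rows with no missing value in any analyzed column.
listwise <- df[complete.cases(df), ]
listwise_n <- colSums(!is.na(listwise))
listwise_mean <- colMeans(listwise)

# Pairwise: each column uses all of its own non-missing values.
pairwise_n <- colSums(!is.na(df))
pairwise_mean <- colMeans(df, na.rm = TRUE)

print(rbind(listwise_n, pairwise_n))
print(rbind(listwise_mean, pairwise_mean))
```

Under listwise deletion both columns are summarized on the same two complete rows, so their counts always agree. Under pairwise deletion each column keeps three observations, and the means differ accordingly.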

only

A character string or vector of character strings specifying which summary statistics to return. Defaults to NULL, which includes mean (mean), median (median), standard deviation (sd), minimum (min), maximum (max), and count of non-missing values (nobs).

var_labels

An optional named character vector or list used to assign custom labels to variable names. Each element must be named and correspond to a variable included in the returned table. If var_input is set to stem, and any element is either unnamed or refers to a variable not present in the table, all labels will be ignored and the table will be printed without them.

ignore

An optional named vector or list indicating values to exclude from variables matching specified stems (or names), and, if applicable, from a grouping variable in data. Defaults to NULL, indicating that all values are retained. To specify exclusions for variables identified by var_stem, use the corresponding stems or variable names as names in the vector or list. To exclude multiple values from these variables or a grouping variable, supply them as a named list.
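
The shapes this argument can take are sketched below. The second half is a base R approximation that assumes excluded values are treated like missing values before summarizing; that behavior is an assumption for illustration, not something stated above, and the value -998 is hypothetical.

```r
# One value to exclude for a stem: a named vector is enough.
ignore_one <- c(symptoms = -999)

# Several values to exclude for the same stem: use a named list.
ignore_many <- list(symptoms = c(-999, -998))

# Assumed effect (illustration only): excluded values are set to NA
# so that they drop out of the summary statistics.
x <- c(3, -999, 7, -998, 5)
x[x %in% ignore_many$symptoms] <- NA

print(x)
print(mean(x, na.rm = TRUE))
```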

Value

A tibble showing summary statistics for continuous variables, grouped either by a specified variable in the dataset or by matching patterns in variable names.

Author

Ama Nyame-Mensah

Examples

sdoh_child_ages_region <- 
  dplyr::select(sdoh, c(REGION, ACS_PCT_AGE_0_4, ACS_PCT_AGE_5_9,
                        ACS_PCT_AGE_10_14, ACS_PCT_AGE_15_17))

mean_group_tbl(data = sdoh_child_ages_region,
               var_stem = "ACS_PCT_AGE",
               group = "REGION",
               group_name = "us_region",
               na_removal = "pairwise",
               var_labels = c(
                 ACS_PCT_AGE_0_4 = "% of population between ages 0-4",
                 ACS_PCT_AGE_5_9 = "% of population between ages 5-9",
                 ACS_PCT_AGE_10_14 = "% of population between ages 10-14",
                 ACS_PCT_AGE_15_17 = "% of population between ages 15-17"))
#> # A tibble: 16 × 9
#>    variable        variable_label us_region  mean median    sd   min   max  nobs
#>    <chr>           <chr>          <chr>     <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
#>  1 ACS_PCT_AGE_0_4 % of populati… Midwest    5.90   5.81 1.13   2.4  12.0   1055
#>  2 ACS_PCT_AGE_0_4 % of populati… Northeast  5.04   5    0.829  0.95  8.12   217
#>  3 ACS_PCT_AGE_0_4 % of populati… South      5.76   5.78 1.26   0.98 18.4   1422
#>  4 ACS_PCT_AGE_0_4 % of populati… West       5.80   5.71 1.67   0.23 13.8    449
#>  5 ACS_PCT_AGE_5_9 % of populati… Midwest    6.17   6.11 1.18   0.95 12.9   1055
#>  6 ACS_PCT_AGE_5_9 % of populati… Northeast  5.28   5.35 0.762  0.53  7.53   217
#>  7 ACS_PCT_AGE_5_9 % of populati… South      5.99   6.03 1.24   0    14.9   1422
#>  8 ACS_PCT_AGE_5_9 % of populati… West       6.23   6.1  1.78   0    12.2    449
#>  9 ACS_PCT_AGE_10… % of populati… Midwest    6.48   6.49 1.15   1.71 11.6   1055
#> 10 ACS_PCT_AGE_10… % of populati… Northeast  5.69   5.77 0.779  1.08  7.94   217
#> 11 ACS_PCT_AGE_10… % of populati… South      6.48   6.48 1.23   0    13.6   1422
#> 12 ACS_PCT_AGE_10… % of populati… West       6.46   6.29 1.62   0    11.6    449
#> 13 ACS_PCT_AGE_15… % of populati… Midwest    3.94   3.94 0.635  0.64  7.83  1055
#> 14 ACS_PCT_AGE_15… % of populati… Northeast  3.59   3.61 0.383  2.02  4.67   217
#> 15 ACS_PCT_AGE_15… % of populati… South      3.86   3.88 0.747  0    11.9   1422
#> 16 ACS_PCT_AGE_15… % of populati… West       3.80   3.78 0.985  0    11.6    449

set.seed(0222)
grouped_data <-
  data.frame(
    symptoms.t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
    symptoms.t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
  )

mean_group_tbl(data = grouped_data,
               var_stem = "symptoms",
               group = ".t\\d",
               group_type = "pattern",
               na_removal = "listwise",
               ignore = c(symptoms = -999))
#> # A tibble: 2 × 8
#>   variable    group  mean median    sd   min   max  nobs
#>   <chr>       <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
#> 1 symptoms.t1 t1     5.51      6  3.19     0    10    37
#> 2 symptoms.t2 t2     4.95      5  2.97     0    10    37