Introduction to summarytabl

Overview

summarytabl is an R package designed to simplify the creation of summary tables for different types of data. It provides a set of functions that help you quickly describe:

Categorical variables
Multiple response variables
Continuous variables

Each function is clearly prefixed based on the type of data it summarizes, making it easy to identify and apply the right tool for your analysis.

Use these functions to summarize binary and nominal variables:

cat_tbl() creates a summary table for a categorical variable.
cat_group_tbl() summarizes two categorical variables.

These functions are ideal for summarizing binary, ordinal, and Likert-scale variables in which respondents select one response per statement, question, or item:

select_tbl() summarizes multiple response and ordinal variables.
select_group_tbl() summarizes multiple response and ordinal variables by a group or pattern.

For interval and ratio-level variables, use:

mean_tbl() generates summary statistics for continuous variables.
mean_group_tbl() generates summary statistics for continuous variables by group or pattern.

All functions work with data frames and tibbles, and each returns a tibble as output.

This document is organized into three sections, each focusing on a different set of functions for summarizing a specific type of variable.

To begin working with summarytabl, load the package:

library(summarytabl)

Keep reading to learn more about how each function works, or jump to the section that matches the type of variable or data you’re working with.

Working with categorical variables

Let’s explore how to use cat_tbl() and cat_group_tbl() to summarize categorical variables. We’ll begin by summarizing a single categorical variable, race, from the nlsy dataset.

cat_tbl(data = nlsy, var = "race")

## # A tibble: 3 × 3
##   race                   count percent
##   <chr>                  <int>   <dbl>
## 1 Black                    868   0.292
## 2 Hispanic                 631   0.212
## 3 Non-Black,Non-Hispanic  1477   0.496

The function returns a tibble with three columns by default:

race: the categories of the variable being summarized
count: the number of observations in each category of race
percent: the proportion of observations in each category of race, calculated relative to the total (for example, 0.292 corresponds to 29.2%)
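As a quick check on how the percent column relates to the count column, the counts printed above can be divided by their total. The sketch below uses only base R and the counts from the output; it illustrates the arithmetic and is not the package's internal code.

```r
# Counts copied from the cat_tbl() output above
counts <- c("Black" = 868, "Hispanic" = 631, "Non-Black,Non-Hispanic" = 1477)

# Each category's share of the total number of observations
percent <- counts / sum(counts)

# Rounded to three decimals for comparison with the printed table
round(percent, 3)
```

The rounded shares match the percent column shown above (0.292, 0.212, and 0.496).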

You can exclude certain values and eliminate missing values from the data using the ignore and na.rm arguments, respectively.

cat_tbl(data = nlsy, 
        var = "race",
        ignore = "Hispanic",
        na.rm = TRUE)

## # A tibble: 2 × 3
##   race                   count percent
##   <chr>                  <int>   <dbl>
## 1 Black                    868   0.370
## 2 Non-Black,Non-Hispanic  1477   0.630

Suppose we want to create a contingency table to summarize two categorical variables. We can do this using the cat_group_tbl() function. In this example, we summarize race by bthwht. Before applying cat_group_tbl(), we’ll recode the values of bthwht, replacing 0 and 1 with descriptive labels for regular and low birthweight, respectively.

nlsy_cross_tab <- 
  nlsy |>
  dplyr::select(c(race, bthwht)) |>
  dplyr::mutate(bthwht = ifelse(bthwht == 0, "regular_bithweight", "low_birthweight")) 

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht")

## # A tibble: 6 × 4
##   race                   bthwht             count percent
##   <chr>                  <chr>              <int>   <dbl>
## 1 Black                  low_birthweight      102  0.0343
## 2 Black                  regular_bithweight   766  0.257 
## 3 Hispanic               low_birthweight       42  0.0141
## 4 Hispanic               regular_bithweight   589  0.198 
## 5 Non-Black,Non-Hispanic low_birthweight       83  0.0279
## 6 Non-Black,Non-Hispanic regular_bithweight  1394  0.468

The function returns a tibble with four columns by default:

race: the categories of the row_var variable
bthwht: the categories of the col_var variable
count: the number of observations for each combination of race and bthwht categories
percent: the proportion of observations for each combination of race and bthwht categories, calculated relative to the total

To pivot the output to the wide format, set pivot = "wider".

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider")

## # A tibble: 3 × 5
##   race      count_bthwht_low_bir…¹ count_bthwht_regular…² percent_bthwht_low_b…³
##   <chr>                      <int>                  <int>                  <dbl>
## 1 Black                        102                    766                 0.0343
## 2 Hispanic                      42                    589                 0.0141
## 3 Non-Blac…                     83                   1394                 0.0279
## # ℹ abbreviated names: ¹count_bthwht_low_birthweight,
## #   ²count_bthwht_regular_bithweight, ³percent_bthwht_low_birthweight
## # ℹ 1 more variable: percent_bthwht_regular_bithweight <dbl>

To display only percentages, set only = "percent". You can also control how those percentages are calculated and displayed using the margins argument.

# Default: percentages across the full table sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider",
              only = "percent")

## # A tibble: 3 × 3
##   race                   percent_bthwht_low_birthweight percent_bthwht_regular…¹
##   <chr>                                           <dbl>                    <dbl>
## 1 Black                                          0.0343                    0.257
## 2 Hispanic                                       0.0141                    0.198
## 3 Non-Black,Non-Hispanic                         0.0279                    0.468
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight

# Rowwise: percentages sum to one across columns within each row
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "rows",
              pivot = "wider",
              only = "percent")

## # A tibble: 3 × 3
##   race                   percent_bthwht_low_birthweight percent_bthwht_regular…¹
##   <chr>                                           <dbl>                    <dbl>
## 1 Black                                          0.118                     0.882
## 2 Hispanic                                       0.0666                    0.933
## 3 Non-Black,Non-Hispanic                         0.0562                    0.944
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight

# Columnwise: percentages within each column sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "columns",
              pivot = "wider",
              only = "percent")

## # A tibble: 3 × 3
##   race                   percent_bthwht_low_birthweight percent_bthwht_regular…¹
##   <chr>                                           <dbl>                    <dbl>
## 1 Black                                           0.449                    0.279
## 2 Hispanic                                        0.185                    0.214
## 3 Non-Black,Non-Hispanic                          0.366                    0.507
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight
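The three margin settings can be reproduced from the counts alone with base R's prop.table(). The sketch below is a hedged illustration of the arithmetic, using counts copied from the long-format output earlier in this section; it does not claim to mirror how cat_group_tbl() is implemented.

```r
# Counts copied from the long-format cat_group_tbl() output
# (column labels are spelled as they appear in that output)
counts <- matrix(
  c(102,  766,
     42,  589,
     83, 1394),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    race   = c("Black", "Hispanic", "Non-Black,Non-Hispanic"),
    bthwht = c("low_birthweight", "regular_bithweight")
  )
)

all_cells <- prop.table(counts)              # whole table sums to one
by_rows   <- prop.table(counts, margin = 1)  # each row sums to one
by_cols   <- prop.table(counts, margin = 2)  # each column sums to one

round(by_rows, 3)
```

Here all_cells corresponds to the default output, by_rows to margins = "rows", and by_cols to margins = "columns".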

Sometimes, you may want to exclude specific values from your analysis. To do this, use a named vector or list to specify which values to exclude from the row_var and col_var variables. In the example below, the Non-Black,Non-Hispanic category is excluded from the race variable (i.e., row_var), and na.rm.row_var is set to TRUE so that NAs are not returned in the final table.

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = c(race = "Non-Black,Non-Hispanic"))

## # A tibble: 4 × 4
##   race     bthwht             count percent
##   <chr>    <chr>              <int>   <dbl>
## 1 Black    low_birthweight      102  0.0680
## 2 Black    regular_bithweight   766  0.511 
## 3 Hispanic low_birthweight       42  0.0280
## 4 Hispanic regular_bithweight   589  0.393

When you need to exclude more than one value from row_var or col_var, use a named list. In the example below, both the Non-Black,Non-Hispanic and Hispanic categories are excluded from the race variable.

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = list(race = c("Non-Black,Non-Hispanic", "Hispanic")))

## # A tibble: 2 × 4
##   race  bthwht             count percent
##   <chr> <chr>              <int>   <dbl>
## 1 Black low_birthweight      102   0.118
## 2 Black regular_bithweight   766   0.882
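One plausible reason a named list is needed for several values is a base R behaviour: combining multiple values under one name in c() changes the element names. The sketch below shows only that base R behaviour; how summarytabl itself parses the ignore argument is not confirmed here.

```r
# A named vector with one value per name keeps the name intact
single <- c(race = "Non-Black,Non-Hispanic")
names(single)

# Two values under one name: R appends suffixes, so the name "race" is lost
flattened <- c(race = c("Non-Black,Non-Hispanic", "Hispanic"))
names(flattened)

# A named list keeps both values under the single name "race"
as_list <- list(race = c("Non-Black,Non-Hispanic", "Hispanic"))
names(as_list)
length(as_list$race)
```

The flattened vector ends up with the names race1 and race2, whereas the list still has a single element called race holding both values.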

Working with multiple response and ordinal variables

Next, let’s explore how to use the select_tbl() and select_group_tbl() functions to summarize multiple response and ordinal variables. Multiple response and ordinal variables are commonly used in survey research, psychology, and health sciences. Examples include symptom checklists, scales like a depression index with multiple items, or questions allowing respondents to select all choices that apply to them.

The depressive dataset contains eight variables that share the same variable stem: dep, with each one representing a different item used to measure depression.

names(depressive)

##  [1] "cid"   "race"  "sex"   "yob"   "dep_1" "dep_2" "dep_3" "dep_4" "dep_5"
## [10] "dep_6" "dep_7" "dep_8"

Using the select_tbl() function, we can summarize participants’ responses to these items by showing how many respondents chose each answer option (i.e., value) for every variable.

select_tbl(data = depressive, var_stem = "dep")

## # A tibble: 24 × 4
##    variable values count percent
##    <chr>     <int> <int>   <dbl>
##  1 dep_1         1   109  0.0678
##  2 dep_1         2   689  0.429 
##  3 dep_1         3   809  0.503 
##  4 dep_2         1   144  0.0896
##  5 dep_2         2   746  0.464 
##  6 dep_2         3   717  0.446 
##  7 dep_3         1  1162  0.723 
##  8 dep_3         2   392  0.244 
##  9 dep_3         3    53  0.0330
## 10 dep_4         1   601  0.374 
## # ℹ 14 more rows

Alternatively, you can summarize specific variables by passing their names to the var_stem argument and setting the var_input argument to "name".

select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name")

## # A tibble: 9 × 4
##   variable values count percent
##   <chr>     <int> <int>   <dbl>
## 1 dep_1         1   117  0.0714
## 2 dep_1         2   703  0.429 
## 3 dep_1         3   818  0.499 
## 4 dep_4         1   608  0.371 
## 5 dep_4         2   854  0.521 
## 6 dep_4         3   176  0.107 
## 7 dep_6         1   398  0.243 
## 8 dep_6         2   872  0.532 
## 9 dep_6         3   368  0.225

By default, missing values are removed using listwise deletion, which excludes any row containing at least one missing value in any of the variables returned or analyzed. To use pairwise deletion instead, set na_removal = "pairwise". Pairwise deletion handles missing values per variable or per pair of variables, using all available data, even if other variables in the row have missing values.

As a result, applying select_tbl() as shown below will yield different results from the previous example, as all three variables (dep_1, dep_4, dep_6) are analyzed independently when na_removal = "pairwise".

select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name",
           na_removal = "pairwise")

## # A tibble: 9 × 4
##   variable values count percent
##   <chr>     <int> <int>   <dbl>
## 1 dep_1         1   120  0.0726
## 2 dep_1         2   709  0.429 
## 3 dep_1         3   825  0.499 
## 4 dep_4         1   611  0.371 
## 5 dep_4         2   856  0.519 
## 6 dep_4         3   181  0.110 
## 7 dep_6         1   399  0.242 
## 8 dep_6         2   879  0.533 
## 9 dep_6         3   371  0.225
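The difference between the two deletion strategies can be sketched in base R on a small, hypothetical data frame (the toy data and item names below are made up for illustration). Listwise deletion keeps only fully observed rows, while pairwise deletion lets each variable keep all of its own non-missing values.

```r
# Hypothetical toy data: three items with scattered missing values
toy <- data.frame(
  item_1 = c(1, 2, NA, 3, 2),
  item_2 = c(2, NA, 1, 3, 3),
  item_3 = c(1, 1, 2, NA, 3)
)

# Helper: number of non-missing values in each column
count_observed <- function(df) {
  vapply(df, function(x) sum(!is.na(x)), integer(1))
}

# Listwise: drop every row that has a missing value in any item
listwise_n <- count_observed(toy[complete.cases(toy), ])

# Pairwise: each item keeps all of its own non-missing values
pairwise_n <- count_observed(toy)

listwise_n
pairwise_n
```

In this toy data only two rows are complete, so listwise deletion leaves two observations per item, while pairwise deletion leaves four per item.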

You can display the output from select_tbl() in the wide format by setting pivot = "wider", and choose which summary statistics to include with the only argument.

# Table in wide format; all summary statistics returned
# (counts and percentages)
select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name",
           na_removal = "pairwise",
           pivot = "wider")

## # A tibble: 3 × 7
##   variable count_value_1 count_value_2 count_value_3 percent_value_1
##   <chr>            <int>         <int>         <int>           <dbl>
## 1 dep_1              120           709           825          0.0726
## 2 dep_4              611           856           181          0.371 
## 3 dep_6              399           879           371          0.242 
## # ℹ 2 more variables: percent_value_2 <dbl>, percent_value_3 <dbl>

# Table in wide format; counts only returned
select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name",
           na_removal = "pairwise",
           pivot = "wider",
           only = "count")

## # A tibble: 3 × 4
##   variable count_value_1 count_value_2 count_value_3
##   <chr>            <int>         <int>         <int>
## 1 dep_1              120           709           825
## 2 dep_4              611           856           181
## 3 dep_6              399           879           371

# Table in wide format; percentages only returned
select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name",
           na_removal = "pairwise",
           pivot = "wider",
           only = "percent")

## # A tibble: 3 × 4
##   variable percent_value_1 percent_value_2 percent_value_3
##   <chr>              <dbl>           <dbl>           <dbl>
## 1 dep_1             0.0726           0.429           0.499
## 2 dep_4             0.371            0.519           0.110
## 3 dep_6             0.242            0.533           0.225
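To see how the long and wide layouts relate, the pairwise counts shown above can be reshaped with base R's xtabs(). This is only a sketch of the reshaping idea under the assumption that each variable and value combination appears once; it is not how select_tbl() performs the pivot.

```r
# Long-format counts copied from the pairwise select_tbl() output
long <- data.frame(
  variable = rep(c("dep_1", "dep_4", "dep_6"), each = 3),
  values   = rep(1:3, times = 3),
  count    = c(120, 709, 825, 611, 856, 181, 399, 879, 371)
)

# One row per variable, one column per response value
wide <- xtabs(count ~ variable + values, data = long)

# Mimic the column naming used in the wide output above
colnames(wide) <- paste0("count_value_", colnames(wide))
wide
```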

It’s common practice to group multiple response or ordinal variables by another variable. This type of descriptive analysis allows for meaningful comparisons across different segments of your dataset. With select_group_tbl(), you can create a summary table for multiple response and ordinal variables, grouped either by another variable in your dataset or by matching a pattern in the variable names. For example, we often want to summarize survey responses by race.

First, recode the race variable and the values for each of the eight depressive index variables in the depressive dataset, replacing numeric categories with descriptive string labels for easier interpretation.

dep_recoded <- 
  depressive |>
  dplyr::mutate(
    race = dplyr::case_match(.x = race,
                             1 ~ "Hispanic", 
                             2 ~ "Black", 
                             3 ~ "Non-Black/Non-Hispanic",
                             .default = NA)
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("dep"),
      .fns = ~ dplyr::case_when(.x == 1 ~ "often", 
                                .x == 2 ~ "sometimes", 
                                .x == 3 ~ "hardly ever")
    ))

Next, use the select_group_tbl() function to summarize responses for all eight variables by race:

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race")

## # A tibble: 72 × 5
##    variable race                   values      count percent
##    <chr>    <chr>                  <chr>       <int>   <dbl>
##  1 dep_1    Black                  hardly ever   248  0.154 
##  2 dep_1    Black                  often          45  0.0280
##  3 dep_1    Black                  sometimes     194  0.121 
##  4 dep_1    Hispanic               hardly ever   187  0.116 
##  5 dep_1    Hispanic               often          28  0.0174
##  6 dep_1    Hispanic               sometimes     155  0.0965
##  7 dep_1    Non-Black/Non-Hispanic hardly ever   374  0.233 
##  8 dep_1    Non-Black/Non-Hispanic often          36  0.0224
##  9 dep_1    Non-Black/Non-Hispanic sometimes     340  0.212 
## 10 dep_2    Black                  hardly ever   234  0.146 
## # ℹ 62 more rows

As with select_tbl(), setting the pivot argument to "wider" reshapes the table into the wide format, while using "pairwise" for the na_removal argument ensures missing values are addressed through pairwise deletion.

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider")

## # A tibble: 24 × 8
##    variable values   count_race_Black count_race_Hispanic count_race_Non-Black…¹
##    <chr>    <chr>               <int>               <int>                  <int>
##  1 dep_1    hardly …              256                 190                    379
##  2 dep_1    often                  54                  28                     38
##  3 dep_1    sometim…              203                 159                    347
##  4 dep_2    hardly …              241                 172                    315
##  5 dep_2    often                  52                  38                     61
##  6 dep_2    sometim…              213                 165                    384
##  7 dep_3    hardly …               20                  20                     15
##  8 dep_3    often                 342                 252                    598
##  9 dep_3    sometim…              149                 105                    152
## 10 dep_4    hardly …               48                  40                     93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## #   `percent_race_Non-Black/Non-Hispanic` <dbl>

The ignore argument can be used to exclude specific values from analysis. In the example below, the value "often" is removed from all eight depression index variables, and the Non-Black/Non-Hispanic category is excluded from the race variable.

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider",
                 ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"))

## # A tibble: 16 × 6
##    variable values      count_race_Black count_race_Hispanic percent_race_Black
##    <chr>    <chr>                  <int>               <int>              <dbl>
##  1 dep_1    hardly ever              256                 190             0.317 
##  2 dep_1    sometimes                203                 159             0.251 
##  3 dep_2    hardly ever              241                 172             0.305 
##  4 dep_2    sometimes                213                 165             0.269 
##  5 dep_3    hardly ever               20                  20             0.0680
##  6 dep_3    sometimes                149                 105             0.507 
##  7 dep_4    hardly ever               48                  40             0.0854
##  8 dep_4    sometimes                269                 205             0.479 
##  9 dep_5    hardly ever              253                 201             0.333 
## 10 dep_5    sometimes                182                 124             0.239 
## 11 dep_6    hardly ever              128                  95             0.190 
## 12 dep_6    sometimes                249                 200             0.371 
## 13 dep_7    hardly ever               38                  28             0.110 
## 14 dep_7    sometimes                152                 128             0.439 
## 15 dep_8    hardly ever              171                 127             0.238 
## 16 dep_8    sometimes                237                 182             0.331 
## # ℹ 1 more variable: percent_race_Hispanic <dbl>

When group_type is set to "variable" (the default), the margins argument determines how percentages are calculated and displayed. To highlight the differences in percentages across tables, the only argument is set to "percent" in the examples below, so that only percentages are returned.

# Default: percentages across each variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider",
                 only = "percent")

## # A tibble: 24 × 5
##    variable values      percent_race_Black percent_race_Hispanic
##    <chr>    <chr>                    <dbl>                 <dbl>
##  1 dep_1    hardly ever             0.155                 0.115 
##  2 dep_1    often                   0.0326                0.0169
##  3 dep_1    sometimes               0.123                 0.0961
##  4 dep_2    hardly ever             0.147                 0.105 
##  5 dep_2    often                   0.0317                0.0232
##  6 dep_2    sometimes               0.130                 0.101 
##  7 dep_3    hardly ever             0.0121                0.0121
##  8 dep_3    often                   0.207                 0.152 
##  9 dep_3    sometimes               0.0901                0.0635
## 10 dep_4    hardly ever             0.0291                0.0243
## # ℹ 14 more rows
## # ℹ 1 more variable: `percent_race_Non-Black/Non-Hispanic` <dbl>

# Rowwise: for each value of the variable, the percentages 
# across all levels of the grouping variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "rows",
                 na_removal = "pairwise",
                 pivot = "wider",
                 only = "percent")

## # A tibble: 24 × 5
##    variable values      percent_race_Black percent_race_Hispanic
##    <chr>    <chr>                    <dbl>                 <dbl>
##  1 dep_1    hardly ever              0.310                 0.230
##  2 dep_1    often                    0.45                  0.233
##  3 dep_1    sometimes                0.286                 0.224
##  4 dep_2    hardly ever              0.331                 0.236
##  5 dep_2    often                    0.344                 0.252
##  6 dep_2    sometimes                0.280                 0.217
##  7 dep_3    hardly ever              0.364                 0.364
##  8 dep_3    often                    0.287                 0.211
##  9 dep_3    sometimes                0.367                 0.259
## 10 dep_4    hardly ever              0.265                 0.221
## # ℹ 14 more rows
## # ℹ 1 more variable: `percent_race_Non-Black/Non-Hispanic` <dbl>

# Columnwise: for each level of the grouping variable, 
# the percentages across all values of the variable sum 
# to one.
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "columns",
                 na_removal = "pairwise",
                 pivot = "wider",
                 only = "percent")

## # A tibble: 24 × 5
##    variable values      percent_race_Black percent_race_Hispanic
##    <chr>    <chr>                    <dbl>                 <dbl>
##  1 dep_1    hardly ever             0.499                 0.504 
##  2 dep_1    often                   0.105                 0.0743
##  3 dep_1    sometimes               0.396                 0.422 
##  4 dep_2    hardly ever             0.476                 0.459 
##  5 dep_2    often                   0.103                 0.101 
##  6 dep_2    sometimes               0.421                 0.44  
##  7 dep_3    hardly ever             0.0391                0.0531
##  8 dep_3    often                   0.669                 0.668 
##  9 dep_3    sometimes               0.292                 0.279 
## 10 dep_4    hardly ever             0.0949                0.106 
## # ℹ 14 more rows
## # ℹ 1 more variable: `percent_race_Non-Black/Non-Hispanic` <dbl>

Another way to use select_group_tbl() is to summarize responses that match a specific pattern, such as survey waves or time points. To enable this feature, set group_type = "pattern" and provide the desired pattern in the group argument. For example, the stem_social_psych dataset contains variables that capture student responses about their sense of belonging in the STEM community at two distinct time points: “w1” and “w2”. You can summarize these responses using a pattern-based approach, where the time points (e.g., “w1” and “w2”) serve as grouping variables.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern")

## # A tibble: 10 × 5
##    variable             group values count percent
##    <chr>                <chr>  <dbl> <int>   <dbl>
##  1 belong_belongStem_w1 w1         1     5  0.0185
##  2 belong_belongStem_w1 w1         2    20  0.0741
##  3 belong_belongStem_w1 w1         3    59  0.219 
##  4 belong_belongStem_w1 w1         4   107  0.396 
##  5 belong_belongStem_w1 w1         5    79  0.293 
##  6 belong_belongStem_w2 w2         1    11  0.0407
##  7 belong_belongStem_w2 w2         2    11  0.0407
##  8 belong_belongStem_w2 w2         3    44  0.163 
##  9 belong_belongStem_w2 w2         4   113  0.419 
## 10 belong_belongStem_w2 w2         5    91  0.337
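To get a feel for what the pattern "_w\\d" matches, the sketch below applies it to the two variable names with base R's regexpr() and regmatches(). Dropping the leading underscore is an assumption made here to reproduce the labels in the group column; the package's internal handling may differ.

```r
# Variable names taken from the output above
vars <- c("belong_belongStem_w1", "belong_belongStem_w2")

# Extract the part of each name that matches the pattern:
# an underscore, the letter w, and a single digit
matched <- regmatches(vars, regexpr("_w\\d", vars))

# Drop the leading underscore to obtain the labels shown in the group column
wave <- sub("^_", "", matched)
wave
```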

Use the group_name argument to assign a descriptive name to the column containing the matched pattern values.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave")

## # A tibble: 10 × 5
##    variable             wave  values count percent
##    <chr>                <chr>  <dbl> <int>   <dbl>
##  1 belong_belongStem_w1 w1         1     5  0.0185
##  2 belong_belongStem_w1 w1         2    20  0.0741
##  3 belong_belongStem_w1 w1         3    59  0.219 
##  4 belong_belongStem_w1 w1         4   107  0.396 
##  5 belong_belongStem_w1 w1         5    79  0.293 
##  6 belong_belongStem_w2 w2         1    11  0.0407
##  7 belong_belongStem_w2 w2         2    11  0.0407
##  8 belong_belongStem_w2 w2         3    44  0.163 
##  9 belong_belongStem_w2 w2         4   113  0.419 
## 10 belong_belongStem_w2 w2         5    91  0.337

You can also include variable labels in your summary table by using the var_labels argument.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 var_labels = c(
                   belong_belongStem_w1 = "I feel like I belong in STEM (wave 1)",
                   belong_belongStem_w2 = "I feel like I belong in STEM (wave 2)"
                 ))

## # A tibble: 10 × 6
##    variable             variable_label                wave  values count percent
##    <chr>                <chr>                         <chr>  <dbl> <int>   <dbl>
##  1 belong_belongStem_w1 I feel like I belong in STEM… w1         1     5  0.0185
##  2 belong_belongStem_w1 I feel like I belong in STEM… w1         2    20  0.0741
##  3 belong_belongStem_w1 I feel like I belong in STEM… w1         3    59  0.219 
##  4 belong_belongStem_w1 I feel like I belong in STEM… w1         4   107  0.396 
##  5 belong_belongStem_w1 I feel like I belong in STEM… w1         5    79  0.293 
##  6 belong_belongStem_w2 I feel like I belong in STEM… w2         1    11  0.0407
##  7 belong_belongStem_w2 I feel like I belong in STEM… w2         2    11  0.0407
##  8 belong_belongStem_w2 I feel like I belong in STEM… w2         3    44  0.163 
##  9 belong_belongStem_w2 I feel like I belong in STEM… w2         4   113  0.419 
## 10 belong_belongStem_w2 I feel like I belong in STEM… w2         5    91  0.337

Working with continuous variables

Finally, let’s look at how to use the mean_tbl() and mean_group_tbl() functions to summarize continuous variables. The mean_tbl() function allows you to generate descriptive statistics for either a set of continuous variables that share a common stem or for individual continuous variables. The resulting summary table includes key metrics such as the variable’s mean, median, standard deviation, minimum value, maximum value, and the count of non-missing observations for each variable.

The sdoh dataset contains six variables describing characteristics of health care facilities, all of which begin with the prefix HHC_PCT. Using the mean_tbl() function, you can generate summary statistics for these variables:

mean_tbl(data = sdoh, var_stem = "HHC_PCT")

## # A tibble: 6 × 7
##   variable                  mean median    sd   min   max  nobs
##   <chr>                    <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING       58.2  100    49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY  56.7  100    48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY   52.4   76.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH        49.1   50    47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL       42.2    0    46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE          55.1   95.2  48.6     0   100  3227
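The columns in this table correspond to familiar base R summaries. The sketch below computes the same six statistics for a single hypothetical vector (the values are made up and are not taken from sdoh); it is meant to clarify what each column reports rather than to reproduce mean_tbl() itself.

```r
# Hypothetical percentages for five facilities, one of them missing
x <- c(0, 100, 50, NA, 100)

stats <- c(
  mean   = mean(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE),
  min    = min(x, na.rm = TRUE),
  max    = max(x, na.rm = TRUE),
  nobs   = sum(!is.na(x))   # count of non-missing observations
)
stats
```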

Alternatively, if you want to generate summary statistics for only a subset of those variables, you can specify their names directly in the var_stem argument and set var_input = "name" to indicate you’re referencing variable names rather than a shared stem.

mean_tbl(
  data = sdoh,
  var_stem = c("HHC_PCT_HHA_PHYS_THERAPY",
               "HHC_PCT_HHA_OCC_THERAPY",
               "HHC_PCT_HHA_SPEECH"),
  var_input = "name"
)

## # A tibble: 3 × 7
##   variable                  mean median    sd   min   max  nobs
##   <chr>                    <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_PHYS_THERAPY  56.7  100    48.8     0   100  3227
## 2 HHC_PCT_HHA_OCC_THERAPY   52.4   76.4  48.3     0   100  3227
## 3 HHC_PCT_HHA_SPEECH        49.1   50    47.6     0   100  3227

Like select_tbl() and select_group_tbl(), functions beginning with the prefix mean_ (i.e., mean_tbl() and mean_group_tbl()) also remove missing values by default using listwise deletion, which excludes any row containing at least one missing value in any of the variables analyzed or returned. To use pairwise deletion instead, set na_removal = "pairwise". With pairwise deletion, missing values are handled on a per-variable or per-variable-pair basis, so all available data is used even if some variables in the row have missing values.

# Default listwise removal
mean_tbl(data = sdoh, var_stem = "HHC_PCT")

## # A tibble: 6 × 7
##   variable                  mean median    sd   min   max  nobs
##   <chr>                    <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING       58.2  100    49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY  56.7  100    48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY   52.4   76.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH        49.1   50    47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL       42.2    0    46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE          55.1   95.2  48.6     0   100  3227

# Pairwise removal
mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise")

## # A tibble: 6 × 7
##   variable                  mean median    sd   min   max  nobs
##   <chr>                    <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING       58.2  100    49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY  56.7  100    48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY   52.4   76.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH        49.1   50    47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL       42.2    0    46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE          55.1   95.2  48.6     0   100  3227
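Because the two calls above return the same numbers, the difference between the methods is easier to see on data where missing values fall in different rows. The following is a minimal base R sketch, not summarytabl's implementation; the toy data frame and its column names are hypothetical.

```r
# Toy data (hypothetical): the NAs fall in different rows for each column
toy <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(10, NA, 30, 40, 50)
)

# Listwise deletion: keep only rows that are complete across all columns
listwise <- toy[complete.cases(toy), ]
listwise_nobs <- vapply(listwise, function(x) sum(!is.na(x)), integer(1))
listwise_mean <- vapply(listwise, mean, numeric(1))

# Pairwise deletion: drop missing values one column at a time
pairwise_nobs <- vapply(toy, function(x) sum(!is.na(x)), integer(1))
pairwise_mean <- vapply(toy, mean, numeric(1), na.rm = TRUE)

rbind(listwise_nobs, pairwise_nobs)
rbind(listwise_mean, pairwise_mean)
```

Listwise deletion summarizes both columns over the same three complete rows, while pairwise deletion uses the four non-missing values available in each column.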

Consider adding variable labels using the var_labels argument to help make the variable names easier to interpret.

mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise",
         var_labels = c(
           HHC_PCT_HHA_NURSING="% agencies offering nursing care services",
           HHC_PCT_HHA_PHYS_THERAPY="% agencies offering physical therapy services",
           HHC_PCT_HHA_OCC_THERAPY="% agencies offering occupational therapy services",
           HHC_PCT_HHA_SPEECH="% agencies offering speech pathology services",
           HHC_PCT_HHA_MEDICAL="% agencies offering medical social services",
           HHC_PCT_HHA_AIDE="% agencies offering home health aide services"
         ))

## # A tibble: 6 × 8
##   variable                 variable_label    mean median    sd   min   max  nobs
##   <chr>                    <chr>            <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING      % agencies offe…  58.2  100    49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY % agencies offe…  56.7  100    48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY  % agencies offe…  52.4   76.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH       % agencies offe…  49.1   50    47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL      % agencies offe…  42.2    0    46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE         % agencies offe…  55.1   95.2  48.6     0   100  3227

As with multiple response variables, it’s common to summarize continuous variables by group so that segments of a dataset can be compared. mean_group_tbl() generates summary statistics for continuous variables, grouped either by a variable in the dataset or by a pattern matched in the variable names. For example, it’s often useful to present summary statistics by demographic categories such as region, gender, age, or race. The example below groups the HHC_PCT variables by REGION.

mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               group_type = "variable")

## # A tibble: 24 × 8
##    variable                 REGION     mean median    sd   min   max  nobs
##    <chr>                    <chr>     <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest    57.4  100    49.5     0   100  1055
##  2 HHC_PCT_HHA_NURSING      Northeast  74.2  100    43.9     0   100   217
##  3 HHC_PCT_HHA_NURSING      South      58.8  100    49.2     0   100  1422
##  4 HHC_PCT_HHA_NURSING      West       56    100    49.7     0   100   450
##  5 HHC_PCT_HHA_PHYS_THERAPY Midwest    55.2  100    48.9     0   100  1055
##  6 HHC_PCT_HHA_PHYS_THERAPY Northeast  68.0  100    43.1     0   100   217
##  7 HHC_PCT_HHA_PHYS_THERAPY South      58.4  100    49.0     0   100  1422
##  8 HHC_PCT_HHA_PHYS_THERAPY West       54.5   95.4  49.0     0   100   450
##  9 HHC_PCT_HHA_OCC_THERAPY  Midwest    52.9   82.2  48.7     0   100  1055
## 10 HHC_PCT_HHA_OCC_THERAPY  Northeast  64.8   89.7  42.8     0   100   217
## # ℹ 14 more rows
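For readers who want to see what grouping by a variable amounts to, here is a rough base R sketch of the same idea for a single variable. It is only an illustration under assumed toy data (the region and pct columns are hypothetical) and does not reproduce how mean_group_tbl() works internally.

```r
# Toy data (hypothetical names and values)
toy <- data.frame(
  region = c("Midwest", "Midwest", "South", "South", "South"),
  pct    = c(0, 100, 50, 100, 100)
)

# Compute the mean and the number of observations within each region
means <- tapply(toy$pct, toy$region, mean)
nobs  <- tapply(toy$pct, toy$region, length)

# Assemble one row per group, similar in shape to the table above
grouped <- data.frame(
  region = names(means),
  mean   = as.vector(means),
  nobs   = as.vector(nobs)
)
grouped
```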

You can control which values to exclude and how missing data are handled using the ignore and na_removal arguments. To specify values to ignore, pass a named vector or list in which each name is a variable stem or a specific variable name. Use a list when you need to exclude more than one value for the same stem or variable.

# Default listwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

## # A tibble: 18 × 8
##    variable                 REGION   mean median    sd    min   max  nobs
##    <chr>                    <chr>   <dbl>  <dbl> <dbl>  <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest 100      100  0    100      100   403
##  2 HHC_PCT_HHA_NURSING      South   100      100  0    100      100   681
##  3 HHC_PCT_HHA_NURSING      West    100      100  0    100      100   200
##  4 HHC_PCT_HHA_PHYS_THERAPY Midwest  97.7    100  7.15  50      100   403
##  5 HHC_PCT_HHA_PHYS_THERAPY South    99.2    100  4.78  50      100   681
##  6 HHC_PCT_HHA_PHYS_THERAPY West     98.3    100  5.31  60      100   200
##  7 HHC_PCT_HHA_OCC_THERAPY  Midwest  96.3    100 10.4   33.3    100   403
##  8 HHC_PCT_HHA_OCC_THERAPY  South    95.5    100 12.4   28.6    100   681
##  9 HHC_PCT_HHA_OCC_THERAPY  West     94.8    100 12.2   25      100   200
## 10 HHC_PCT_HHA_SPEECH       Midwest  91.9    100 16.2   33.3    100   403
## 11 HHC_PCT_HHA_SPEECH       South    93.4    100 15.3   25      100   681
## 12 HHC_PCT_HHA_SPEECH       West     91.0    100 17.2   20      100   200
## 13 HHC_PCT_HHA_MEDICAL      Midwest  82.4    100 23.8    9.09   100   403
## 14 HHC_PCT_HHA_MEDICAL      South    89.4    100 18.6   16.7    100   681
## 15 HHC_PCT_HHA_MEDICAL      West     92.6    100 15.3   33.3    100   200
## 16 HHC_PCT_HHA_AIDE         Midwest  97.3    100  8.97  50      100   403
## 17 HHC_PCT_HHA_AIDE         South    96.1    100 10.3   42.9    100   681
## 18 HHC_PCT_HHA_AIDE         West     96.4    100  9.96  50      100   200

# Pairwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

## # A tibble: 18 × 8
##    variable                 REGION   mean median    sd    min   max  nobs
##    <chr>                    <chr>   <dbl>  <dbl> <dbl>  <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest 100      100  0    100      100   606
##  2 HHC_PCT_HHA_NURSING      South   100      100  0    100      100   836
##  3 HHC_PCT_HHA_NURSING      West    100      100  0    100      100   252
##  4 HHC_PCT_HHA_PHYS_THERAPY Midwest  97.8    100  8.36  25      100   595
##  5 HHC_PCT_HHA_PHYS_THERAPY South    99.4    100  4.32  50      100   836
##  6 HHC_PCT_HHA_PHYS_THERAPY West     97.7    100  8.14  33.3    100   251
##  7 HHC_PCT_HHA_OCC_THERAPY  Midwest  96.3    100 11.5   25      100   579
##  8 HHC_PCT_HHA_OCC_THERAPY  South    95.8    100 12.2   28.6    100   787
##  9 HHC_PCT_HHA_OCC_THERAPY  West     94.5    100 13.0   25      100   232
## 10 HHC_PCT_HHA_SPEECH       Midwest  92.6    100 16.1   25      100   552
## 11 HHC_PCT_HHA_SPEECH       South    93.7    100 15.2   25      100   769
## 12 HHC_PCT_HHA_SPEECH       West     91.3    100 17.0   20      100   221
## 13 HHC_PCT_HHA_MEDICAL      Midwest  83.0    100 23.6    9.09   100   419
## 14 HHC_PCT_HHA_MEDICAL      South    89.7    100 18.6   16.7    100   724
## 15 HHC_PCT_HHA_MEDICAL      West     92.5    100 15.8   33.3    100   224
## 16 HHC_PCT_HHA_AIDE         Midwest  98.0    100  7.85  50      100   588
## 17 HHC_PCT_HHA_AIDE         South    96.6    100  9.82  42.9    100   816
## 18 HHC_PCT_HHA_AIDE         West     96.4    100 10.8   33.3    100   247

# Pairwise removal excluding several values from the same stem 
# or group variable.
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = list(HHC_PCT = 0, REGION = c("Northeast", "South")))

## # A tibble: 12 × 8
##    variable                 REGION   mean median    sd    min   max  nobs
##    <chr>                    <chr>   <dbl>  <dbl> <dbl>  <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest 100      100  0    100      100   606
##  2 HHC_PCT_HHA_NURSING      West    100      100  0    100      100   252
##  3 HHC_PCT_HHA_PHYS_THERAPY Midwest  97.8    100  8.36  25      100   595
##  4 HHC_PCT_HHA_PHYS_THERAPY West     97.7    100  8.14  33.3    100   251
##  5 HHC_PCT_HHA_OCC_THERAPY  Midwest  96.3    100 11.5   25      100   579
##  6 HHC_PCT_HHA_OCC_THERAPY  West     94.5    100 13.0   25      100   232
##  7 HHC_PCT_HHA_SPEECH       Midwest  92.6    100 16.1   25      100   552
##  8 HHC_PCT_HHA_SPEECH       West     91.3    100 17.0   20      100   221
##  9 HHC_PCT_HHA_MEDICAL      Midwest  83.0    100 23.6    9.09   100   419
## 10 HHC_PCT_HHA_MEDICAL      West     92.5    100 15.8   33.3    100   224
## 11 HHC_PCT_HHA_AIDE         Midwest  98.0    100  7.85  50      100   588
## 12 HHC_PCT_HHA_AIDE         West     96.4    100 10.8   33.3    100   247
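The choice between a named vector and a named list matters in plain R because a vector holds a single type. The sketch below shows that behavior and then illustrates one way to think about ignored values: recoding them to NA before summarizing. That recoding step is an assumption made for illustration, not a description of summarytabl's internals, and the toy data are hypothetical.

```r
# A named vector holds a single type, so mixing numbers and strings
# coerces every element to character
ignore_vec <- c(pct = 0, region = "Northeast")
class(ignore_vec)        # "character"

# A named list keeps each element's type and allows several values per name
ignore_list <- list(pct = 0, region = c("Northeast", "South"))
class(ignore_list$pct)   # "numeric"

# Hypothetical illustration: treat ignored values as missing
toy <- data.frame(
  region = c("Midwest", "Northeast", "South", "West"),
  pct    = c(0, 100, 50, 100)
)
for (nm in names(ignore_list)) {
  toy[[nm]][toy[[nm]] %in% ignore_list[[nm]]] <- NA
}
toy
```

After the loop, the Northeast and South regions and the 0 in pct are NA, so they would be dropped by whichever missing-data rule is applied next.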

Another way to use mean_group_tbl() is to summarize responses based on a shared pattern, such as survey time points. To enable this feature, set group_type = "pattern" and specify the desired pattern in the group argument.

Consider a (fictitious) dataset compiled by researchers studying how many symptoms participants reported after a long illness. Responses were collected at three time points: “t1” (baseline), “t2” (6-month follow-up), and “t3” (one-year follow-up). Using a pattern-based approach, you can group variables by these time points to generate summary statistics for each phase of data collection.

In the example below, we first create the symptoms_data dataset and then use mean_group_tbl() to generate summary statistics for variables that begin with the prefix symptoms and contain a substring matching the pattern "_t\\d" (an underscore followed by the letter “t” and a single digit), which marks the time point. The ignore argument excludes the value -999 from the analysis.

set.seed(0803)
symptoms_data <-
  data.frame(
    symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
    symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
    symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
  )

mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               ignore = c(symptoms = -999))

## # A tibble: 3 × 8
##   variable    group  mean median    sd   min   max  nobs
##   <chr>       <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 t1     4.03      3  3.14     0    10    33
## 2 symptoms_t2 t2     5.12      5  3.33     0    10    33
## 3 symptoms_t3 t3     4.64      4  3.29     0    10    33
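To check which variable names a pattern such as "_t\\d" will pick up before passing it to group, you can test it with base R's regular expression functions. This is a small sketch on a hypothetical vector of names. It shows how the pattern behaves in R, not how summarytabl applies it internally.

```r
# Hypothetical variable names, including two that should not match
vars <- c("symptoms_t1", "symptoms_t2", "symptoms_t3", "symptoms_total", "age")

# Keep names that start with the stem and contain the time-point pattern
matched <- vars[startsWith(vars, "symptoms") & grepl("_t\\d", vars)]

# Extract the matched substring from each name
time_point <- regmatches(matched, regexpr("_t\\d", matched))

matched
time_point
```

Note that "symptoms_total" is excluded because "_t" is followed by a letter rather than a digit. The extracted substrings keep the leading underscore, whereas the group column in the table above shows t1, t2, and t3, so the package appears to drop it for display; sub("^_", "", time_point) would mimic that.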

To make your output easier to understand, use the group_name argument to add a label to the column that shows grouping values or matched patterns. You can also use the var_labels argument to display descriptive labels for each variable.

mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999), 
               var_labels = c(symptoms_t1 = "# of symptoms at baseline",
                              symptoms_t2 = "# of symptoms at 6-month follow-up",
                              symptoms_t3 = "# of symptoms at one-year follow-up"))

## # A tibble: 3 × 9
##   variable    variable_label     time_point  mean median    sd   min   max  nobs
##   <chr>       <chr>              <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 # of symptoms at … t1          4.03      3  3.14     0    10    33
## 2 symptoms_t2 # of symptoms at … t2          5.12      5  3.33     0    10    33
## 3 symptoms_t3 # of symptoms at … t3          4.64      4  3.29     0    10    33

Finally, use the only argument to choose which summary statistics to return.

# Default: all summary statistics returned
# (mean, median, sd, min, max, nobs)
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999))

## # A tibble: 3 × 8
##   variable    time_point  mean median    sd   min   max  nobs
##   <chr>       <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 t1          4.03      3  3.14     0    10    33
## 2 symptoms_t2 t2          5.12      5  3.33     0    10    33
## 3 symptoms_t3 t3          4.64      4  3.29     0    10    33

# Means and non-missing observations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "nobs"))

## # A tibble: 3 × 4
##   variable    time_point  mean  nobs
##   <chr>       <chr>      <dbl> <int>
## 1 symptoms_t1 t1          4.03    33
## 2 symptoms_t2 t2          5.12    33
## 3 symptoms_t3 t3          4.64    33

# Means and standard deviations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "sd"))

## # A tibble: 3 × 4
##   variable    time_point  mean    sd
##   <chr>       <chr>      <dbl> <dbl>
## 1 symptoms_t1 t1          4.03  3.14
## 2 symptoms_t2 t2          5.12  3.33
## 3 symptoms_t3 t3          4.64  3.29