chensus

library(chensus)
library(dplyr)

Introduction

The chensus package estimates population frequencies, means and proportions, together with their confidence intervals, from surveys conducted by the Swiss Federal Statistical Office (FSO):

structural survey: Strukturerhebung (SE) / relevé structurel (RS),
mobility and transport survey: Mikrozensus Mobilität und Verkehr (MZMV) / Microrecensement mobilité et transports (MRMT).

In this vignette, we demonstrate the main features of the package using the built-in nhanes dataset, which contains a subset of data from the National Health and Nutrition Examination Survey for the period 2015-2016 (see ?nhanes and vignette("nhanes") for details). Its structure is similar to that of FSO survey data: it contains strata and weights columns as well as demographic variables such as gender and household_size.

Structural Survey

Total Estimates

Suppose we want to estimate the population in the nhanes dataset by gender and birth country. We can use the main analysis function se_total():

se_total(
  data = nhanes,
  weight = weights,
  strata = strata,
  gender, birth_country
)
#> # A tibble: 5 × 10
#>   gender birth_country   occ      total    vhat stand_dev      ci ci_per    ci_l
#>   <chr>  <chr>         <int>      <dbl>   <dbl>     <dbl>   <dbl>  <dbl>   <dbl>
#> 1 Female Missing           2     29693. 4.61e 8    21460.  4.21e4 142.   -1.24e4
#> 2 Female Other          1168  23914531. 6.54e11   808405.  1.58e6   6.63  2.23e7
#> 3 Female US             3909 137978222. 7.76e12  2785964.  5.46e6   3.96  1.33e8
#> 4 Male   Other          1068  23897302. 7.91e11   889444.  1.74e6   7.29  2.22e7
#> 5 Male   US             3824 130661296. 7.66e12  2768516.  5.43e6   4.15  1.25e8
#> # ℹ 1 more variable: ci_u <dbl>
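
The interval columns in the output above appear to follow the usual normal approximation. The following base R sketch assumes (from the printed values, not from the package source) that ci is the 95% half-width qnorm(0.975) * stand_dev, that ci_per expresses this half-width as a percentage of total, and that ci_l and ci_u are total minus and plus ci. The two input numbers are copied from row 2 (Female, Other).

```r
# Values copied from row 2 of the output above (Female, Other).
total     <- 23914531
stand_dev <- 808405

# Assumption: a 95% normal-approximation interval.
z      <- qnorm(0.975)       # about 1.96
ci     <- z * stand_dev      # half-width of the interval
ci_per <- 100 * ci / total   # half-width as a percentage of the estimate
ci_l   <- total - ci         # lower bound
ci_u   <- total + ci         # upper bound

round(c(ci = ci, ci_per = ci_per, ci_l = ci_l, ci_u = ci_u), 2)
```

Up to the rounding of the printed stand_dev, the result agrees with the ci, ci_per and ci_l columns of that row. The same arithmetic explains the negative lower bound in row 1: with only two observations, the half-width is larger than the estimate itself.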

Column names can be passed programmatically with the help of rlang’s !!sym() and !!!syms() in the function call:

w <- "weights"
s <- "strata"
v <- c("gender", "birth_country")

se_total(
  data = nhanes,
  strata = !!sym(s),
  weight = !!sym(w),
  !!!syms(v)
)
#> # A tibble: 5 × 10
#>   gender birth_country   occ      total    vhat stand_dev      ci ci_per    ci_l
#>   <chr>  <chr>         <int>      <dbl>   <dbl>     <dbl>   <dbl>  <dbl>   <dbl>
#> 1 Female Missing           2     29693. 4.61e 8    21460.  4.21e4 142.   -1.24e4
#> 2 Female Other          1168  23914531. 6.54e11   808405.  1.58e6   6.63  2.23e7
#> 3 Female US             3909 137978222. 7.76e12  2785964.  5.46e6   3.96  1.33e8
#> 4 Male   Other          1068  23897302. 7.91e11   889444.  1.74e6   7.29  2.22e7
#> 5 Male   US             3824 130661296. 7.66e12  2768516.  5.43e6   4.15  1.25e8
#> # ℹ 1 more variable: ci_u <dbl>

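We can also estimate the population for several grouping variables at once, each one taken separately, with se_total_map(). The result is in long format, with one row per level of each variable: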

se_total_map(
  nhanes,
  weight = weights,
  strata = strata,
  gender, birth_country
)
#> # A tibble: 5 × 10
#>   variable     value   occ  total    vhat stand_dev     ci ci_per    ci_l   ci_u
#>   <chr>        <chr> <int>  <dbl>   <dbl>     <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
#> 1 gender       Fema…  5079 1.62e8 7.82e12  2795884. 5.48e6   3.38  1.56e8 1.67e8
#> 2 gender       Male   4892 1.55e8 7.90e12  2810039. 5.51e6   3.56  1.49e8 1.60e8
#> 3 birth_count… Miss…     2 2.97e4 4.61e 8    21460. 4.21e4 142.   -1.24e4 7.18e4
#> 4 birth_count… Other  2236 4.78e7 1.30e12  1140910. 2.24e6   4.68  4.56e7 5.00e7
#> 5 birth_count… US     7733 2.69e8 1.16e13  3402537. 6.67e6   2.48  2.62e8 2.75e8
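
To make the long format above concrete, here is a minimal base R sketch of the idea on made-up data. It computes weighted totals only (no variances or intervals), the helper totals_by_variable() is hypothetical, and nothing here is taken from the chensus implementation.

```r
# Toy data; the column names mirror nhanes but the values are made up.
toy <- data.frame(
  gender        = c("Female", "Female", "Male", "Male"),
  birth_country = c("US", "Other", "US", "US"),
  weights       = c(10, 20, 30, 40)
)

# Hypothetical helper: one weighted total per level of each grouping
# variable, with the variables handled one at a time and then stacked.
totals_by_variable <- function(data, vars, weight = "weights") {
  pieces <- lapply(vars, function(v) {
    totals <- tapply(data[[weight]], data[[v]], sum)
    data.frame(
      variable = v,
      value    = names(totals),
      total    = as.numeric(totals)
    )
  })
  do.call(rbind, pieces)
}

long <- totals_by_variable(toy, c("gender", "birth_country"))
long
```

Because every variable partitions the same sample, the totals within each variable add up to the same overall sum of weights.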

To estimate the population for every combination of the grouping variables, including the grand total (no grouping) and the subtotals (partial grouping), we can use se_total_ogd(), a wrapper around se_total():

se_total_ogd(nhanes, strata = strata, weight = weights, gender, birth_country)
#> # A tibble: 11 × 7
#>    gender birth_country   occ      total       ci       ci_l       ci_u
#>    <fct>  <fct>         <int>      <dbl>    <dbl>      <dbl>      <dbl>
#>  1 Total  Total          9971 316481044. 6370681. 310110363. 322851725.
#>  2 Female Total          5079 161922446. 5479833. 156442613. 167402279.
#>  3 Male   Total          4892 154558598. 5507576. 149051022. 160066174.
#>  4 Total  Missing           2     29693.   42060.    -12367.     71753.
#>  5 Total  Other          2236  47811833. 2236143.  45575690.  50047977.
#>  6 Total  US             7733 268639517. 6668850. 261970667. 275308367.
#>  7 Female Missing           2     29693.   42060.    -12367.     71753.
#>  8 Female Other          1168  23914531. 1584444.  22330087.  25498975.
#>  9 Female US             3909 137978222. 5460390. 132517832. 143438611.
#> 10 Male   Other          1068  23897302. 1743278.  22154024.  25640580.
#> 11 Male   US             3824 130661296. 5426191. 125235105. 136087486.
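
The 11 rows above correspond to the four possible subsets of the two grouping variables: none (1 row), gender only (2 rows), birth_country only (3 rows) and both (5 rows). The sketch below enumerates such subsets in base R on made-up data and computes a weighted total for each. It is an illustration under these assumptions, not the package's code: weighted_total() is a hypothetical helper, and no "Total" labels or intervals are produced.

```r
# Toy data; the column names mirror nhanes but the values are made up.
toy <- data.frame(
  gender        = c("Female", "Female", "Male", "Male"),
  birth_country = c("US", "Other", "US", "US"),
  weights       = c(10, 20, 30, 40)
)
group_vars <- c("gender", "birth_country")

# Every subset of the grouping variables, from none to all of them.
mask    <- expand.grid(rep(list(c(FALSE, TRUE)), length(group_vars)))
subsets <- lapply(seq_len(nrow(mask)), function(i) group_vars[unlist(mask[i, ])])

# Hypothetical helper: weighted total for one subset of grouping variables.
weighted_total <- function(data, vars, weight = "weights") {
  if (length(vars) == 0) {
    return(data.frame(total = sum(data[[weight]])))  # grand total
  }
  out <- aggregate(data[weight], by = data[vars], FUN = sum)
  names(out)[names(out) == weight] <- "total"
  out
}

results <- lapply(subsets, function(vars) weighted_total(toy, vars))
results
```

With k grouping variables there are 2^k subsets, so the number of rows can grow quickly as variables are added.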

Proportion Estimates

We can also estimate the proportion of the population in each combination of gender and birth country. The proportions are shares of the whole population, so they sum to one across all rows:

se_prop(
  data = nhanes,
  gender,
  birth_country,
  weight = weights,
  strata = strata
)
#> # A tibble: 5 × 9
#>   gender birth_country   occ     prop    vhat stand_dev      ci     ci_l    ci_u
#>   <chr>  <chr>         <int>    <dbl>   <dbl>     <dbl>   <dbl>    <dbl>   <dbl>
#> 1 Female Missing           2  9.38e-5 4.59e-9 0.0000678 1.33e-4 -3.90e-5 2.27e-4
#> 2 Female Other          1168  7.56e-2 7.30e-6 0.00270   5.30e-3  7.03e-2 8.09e-2
#> 3 Female US             3909  4.36e-1 5.20e-5 0.00721   1.41e-2  4.22e-1 4.50e-1
#> 4 Male   Other          1068  7.55e-2 8.51e-6 0.00292   5.72e-3  6.98e-2 8.12e-2
#> 5 Male   US             3824  4.13e-1 5.19e-5 0.00721   1.41e-2  3.99e-1 4.27e-1

We can display total and proportion estimates in a single table with se_total_prop() and apply the FSO publication format with fso_flag_mask(). This format qualifies the reliability of the estimates and masks confidential ones (fewer than five observations):

se_total_prop(
  data = nhanes,
  gender,
  birth_country,
  weight = weights,
  strata = strata
) |>
  fso_flag_mask()
#> # A tibble: 5 × 12
#>   gender birth_country   occ      total ci_total ci_l_total ci_u_total    prop
#>   <chr>  <chr>         <int>      <dbl>    <dbl>      <dbl>      <dbl>   <dbl>
#> 1 Female Missing          NA        NA       NA         NA         NA  NA     
#> 2 Female Other          1168  23914531. 1584444.  22330087.  25498975.  0.0756
#> 3 Female US             3909 137978222. 5460390. 132517832. 143438611.  0.436 
#> 4 Male   Other          1068  23897302. 1743278.  22154024.  25640580.  0.0755
#> 5 Male   US             3824 130661296. 5426191. 125235105. 136087486.  0.413 
#> # ℹ 4 more variables: ci_prop <dbl>, ci_l_prop <dbl>, ci_u_prop <dbl>,
#> #   obs_status <chr>
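
The prop column can be checked by hand against the totals. The sketch below assumes, based on the printed values only, that each proportion is the cell total divided by the overall total (the Total / Total row of the se_total_ogd() output). The Female / Missing total comes from the unmasked se_total() output.

```r
# Totals copied from the outputs shown earlier in this vignette.
grand_total <- 316481044
cell_totals <- c(
  female_missing = 29693,
  female_other   = 23914531,
  female_us      = 137978222,
  male_other     = 23897302,
  male_us        = 130661296
)

# Assumption: prop is the cell total divided by the overall total.
prop <- cell_totals / grand_total
signif(prop, 3)

# The five cells partition the population, so the shares add up to one.
sum(prop)
```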

Mean Estimates

To estimate the mean household size instead, we can use se_mean():

se_mean(
  data = nhanes,
  variable = household_size,
  strata = strata,
  weight = weights
)
#> # A tibble: 1 × 7
#>     occ household_size     vhat stand_dev     ci  ci_l  ci_u
#>   <int>          <dbl>    <dbl>     <dbl>  <dbl> <dbl> <dbl>
#> 1  9971           3.46 0.000495    0.0222 0.0436  3.42  3.51

or the wrapper function se_mean_ogd() for all possible combinations of grouping variables gender and interview_lang:

se_mean_ogd(
  nhanes,
  variable = household_size,
  strata = strata,
  weight = weights,
  gender, interview_lang
)
#> # A tibble: 9 × 7
#>   gender interview_lang   occ household_size     ci  ci_l  ci_u
#>   <fct>  <fct>          <int>          <dbl>  <dbl> <dbl> <dbl>
#> 1 Total  Total           9971           3.46 0.0436  3.42  3.51
#> 2 Female Total           5079           3.44 0.0611  3.38  3.50
#> 3 Male   Total           4892           3.49 0.0621  3.43  3.55
#> 4 Total  English         8584           3.38 0.0453  3.33  3.42
#> 5 Total  Spanish         1387           4.60 0.0959  4.50  4.69
#> 6 Female English         4345           3.36 0.0636  3.29  3.42
#> 7 Female Spanish          734           4.56 0.129   4.43  4.69
#> 8 Male   English         4239           3.40 0.0645  3.33  3.46
#> 9 Male   Spanish          653           4.63 0.139   4.49  4.77

and with FSO format:

nhanes |>
  se_mean_ogd(
    variable = household_size,
    gender, birth_country,
    strata = strata,
    weight = weights
  ) |>
  fso_flag_mask(lang = "en") # lang defaults to "de"; "fr" and "it" are also available
#> # A tibble: 11 × 8
#>    gender birth_country   occ household_size      ci  ci_l  ci_u obs_status     
#>    <fct>  <fct>         <int>          <dbl>   <dbl> <dbl> <dbl> <chr>          
#>  1 Total  Total          9971           3.46  0.0436  3.42  3.51 Reliable estim…
#>  2 Female Total          5079           3.44  0.0611  3.38  3.50 Reliable estim…
#>  3 Male   Total          4892           3.49  0.0621  3.43  3.55 Reliable estim…
#>  4 Total  US             7733           3.40  0.0485  3.35  3.44 Reliable estim…
#>  5 Total  Other          2236           3.85  0.0840  3.77  3.94 Reliable estim…
#>  6 Total  Missing          NA          NA    NA      NA    NA    No estimate (c…
#>  7 Female US             3909           3.36  0.0678  3.30  3.43 Reliable estim…
#>  8 Female Other          1168           3.88  0.114   3.77  4.00 Reliable estim…
#>  9 Female Missing          NA          NA    NA      NA    NA    No estimate (c…
#> 10 Male   US             3824           3.43  0.0691  3.36  3.50 Reliable estim…
#> 11 Male   Other          1068           3.83  0.123   3.70  3.95 Reliable estim…

Mobility Survey

To estimate the mean household income, we can use mzmv_mean():

mzmv_mean(
  data = nhanes,
  variable = annual_household_income,
  weight = weights
)
#> # A tibble: 1 × 4
#>   variable                  occ wmean    ci
#>   <chr>                   <int> <dbl> <dbl>
#> 1 annual_household_income  9626  11.9 0.240

and grouped by gender (note the variable argument must be quoted here):

mzmv_mean_map(
  data = nhanes,
  variable = "annual_household_income",
  gender,
  weight = weights
)
#> # A tibble: 2 × 6
#>   variable                group_vars group_vars_value   occ wmean    ci
#>   <chr>                   <chr>      <fct>            <int> <dbl> <dbl>
#> 1 annual_household_income gender     Female            4906  11.8 0.350
#> 2 annual_household_income gender     Male              4720  12.0 0.328
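
As a rough illustration of the wmean and occ columns above, the following base R sketch computes a weighted mean overall and by group on made-up data. It assumes that rows with a missing value are dropped before averaging, which would be consistent with occ being smaller than the number of rows in nhanes, but this is an assumption and not taken from the package source. The helper weighted_summary() is hypothetical, and the confidence interval is not reproduced.

```r
# Toy data with one missing income value; all values are made up.
toy <- data.frame(
  gender  = c("Female", "Female", "Male", "Male"),
  income  = c(2, 4, NA, 6),
  weights = c(1, 1, 5, 2)
)

# Hypothetical helper: weighted mean over the non-missing observations.
# occ counts the observations that actually enter the estimate.
weighted_summary <- function(x, w) {
  keep <- !is.na(x)
  data.frame(
    occ   = sum(keep),
    wmean = sum(w[keep] * x[keep]) / sum(w[keep])
  )
}

overall <- weighted_summary(toy$income, toy$weights)

# The same summary within each level of gender.
by_gender <- do.call(rbind, lapply(split(toy, toy$gender), function(d) {
  cbind(gender = d$gender[1], weighted_summary(d$income, d$weights))
}))

overall
by_gender
```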

Flagging Estimate Reliability

fso_flag_mask() applies the FSO's reliability rules for survey estimates, which are based on the number of observations (occ). It adds an obs_status column that flags low-reliability estimates, and it masks estimates whose sample size is too small (occ <= 4). The thresholds are as follows:

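Number of observations	obs_status
`occ <= 4`	No estimate (confidential)
`5 <= occ <= 49`	Estimate of low reliability
`occ > 49`	Reliable estimate

The example below groups by four variables, which produces many small cells, keeps those with fewer than 60 observations and applies the flags. The labels are in German because lang defaults to "de". The edu_level column is left out of the printed table, so some rows appear to share the same grouping values: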

results <- nhanes |>
  se_total(
    strata = strata,
    weight = weights,
    gender,
    birth_country,
    interview_lang,
    edu_level
  )
results |>
  filter(occ < 60) |>
  fso_flag_mask() |>
  select(gender, birth_country, interview_lang, occ, total, ci, obs_status)
#> # A tibble: 26 × 7
#>    gender birth_country interview_lang   occ    total      ci obs_status        
#>    <chr>  <chr>         <chr>          <int>    <dbl>   <dbl> <chr>             
#>  1 Female Missing       English           NA      NA      NA  Kein Schätzwert (…
#>  2 Female Missing       Spanish           NA      NA      NA  Kein Schätzwert (…
#>  3 Female Other         English           57 1098021. 301680. Schätzwert verläs…
#>  4 Female Other         English           41  883410. 287543. Schätzwert beding…
#>  5 Female Other         Spanish           45  673772. 217372. Schätzwert beding…
#>  6 Female Other         Spanish           NA      NA      NA  Kein Schätzwert (…
#>  7 Female Other         Spanish           18  310672. 154623. Schätzwert beding…
#>  8 Female US            English           14  403508. 277121. Schätzwert beding…
#>  9 Female US            Spanish           21  258146. 114605. Schätzwert beding…
#> 10 Female US            Spanish           13  208119. 116775. Schätzwert beding…
#> # ℹ 16 more rows
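
The thresholds listed in this section can be written down in a few lines of base R. The sketch below is only an illustration of the rule: flag_reliability() is a hypothetical helper, not a chensus function, the data are made up, and the English labels are the ones given in the table above.

```r
# Hypothetical helper, not part of chensus: classify estimates by occ.
flag_reliability <- function(occ) {
  cut(
    occ,
    breaks = c(-Inf, 4, 49, Inf),  # intervals (-Inf, 4], (4, 49], (49, Inf)
    labels = c(
      "No estimate (confidential)",
      "Estimate of low reliability",
      "Reliable estimate"
    )
  )
}

# Made-up estimates covering both sides of each threshold.
est <- data.frame(
  occ   = c(2, 4, 5, 49, 50, 57),
  total = c(10, 20, 30, 40, 50, 60)
)
est$obs_status <- as.character(flag_reliability(est$occ))

# Mask confidential rows: occ and the estimate are replaced by NA,
# mirroring the masked rows in the outputs above.
confidential <- est$obs_status == "No estimate (confidential)"
est[confidential, c("occ", "total")] <- NA
est
```

The flag is computed before masking, since occ itself is hidden for confidential rows.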