Introduction
The chensus package estimates population frequencies, means and
proportions, along with their confidence intervals, from surveys conducted by
the Swiss Federal Statistical Office (FSO):
- structural survey: Strukturerhebung (SE) / relevé structurel (RS),
- mobility and transport survey: Mikrozensus Mobilität und Verkehr (MZMV) / Microrecensement mobilité et transports (MRMT).
In this vignette, we demonstrate the main features of the package
using the built-in nhanes dataset, which contains a subset of data from
the National Health and Nutrition Examination Survey for the period 2015-2016
(see ?nhanes and vignette("nhanes") for details).
Its structure is similar to FSO survey data in that it contains
strata and weights columns and demographic features such as gender and
household_size.
Structural Survey
Total Estimates
Suppose we want to estimate population totals in the nhanes
dataset by gender and birth country. We can use the main analysis
function se_total():
library(chensus) # provides se_total() and the nhanes dataset
se_total(
data = nhanes,
weight = weights,
strata = strata,
gender, birth_country
)
#> # A tibble: 5 × 10
#> gender birth_country occ total vhat stand_dev ci ci_per ci_l
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing 2 29693. 4.61e 8 21460. 4.21e4 142. -1.24e4
#> 2 Female Other 1168 23914531. 6.54e11 808405. 1.58e6 6.63 2.23e7
#> 3 Female US 3909 137978222. 7.76e12 2785964. 5.46e6 3.96 1.33e8
#> 4 Male Other 1068 23897302. 7.91e11 889444. 1.74e6 7.29 2.22e7
#> 5 Male US 3824 130661296. 7.66e12 2768516. 5.43e6 4.15 1.25e8
#> # ℹ 1 more variable: ci_u <dbl>
Column names can be passed programmatically with the help of
rlang’s !!sym() and !!!syms() in the function call:
library(rlang) # provides sym() and syms()
w <- "weights"
s <- "strata"
v <- c("gender", "birth_country")
se_total(
data = nhanes,
strata = !!sym(s),
weight = !!sym(w),
!!!syms(v)
)
#> # A tibble: 5 × 10
#> gender birth_country occ total vhat stand_dev ci ci_per ci_l
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing 2 29693. 4.61e 8 21460. 4.21e4 142. -1.24e4
#> 2 Female Other 1168 23914531. 6.54e11 808405. 1.58e6 6.63 2.23e7
#> 3 Female US 3909 137978222. 7.76e12 2785964. 5.46e6 3.96 1.33e8
#> 4 Male Other 1068 23897302. 7.91e11 889444. 1.74e6 7.29 2.22e7
#> 5 Male US 3824 130661296. 7.66e12 2768516. 5.43e6 4.15 1.25e8
#> # ℹ 1 more variable: ci_u <dbl>
With se_total_map(), we can also estimate population totals separately for each of several grouping variables:
se_total_map(
nhanes,
weight = weights,
strata = strata,
gender, birth_country
)
#> # A tibble: 5 × 10
#> variable value occ total vhat stand_dev ci ci_per ci_l ci_u
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 gender Fema… 5079 1.62e8 7.82e12 2795884. 5.48e6 3.38 1.56e8 1.67e8
#> 2 gender Male 4892 1.55e8 7.90e12 2810039. 5.51e6 3.56 1.49e8 1.60e8
#> 3 birth_count… Miss… 2 2.97e4 4.61e 8 21460. 4.21e4 142. -1.24e4 7.18e4
#> 4 birth_count… Other 2236 4.78e7 1.30e12 1140910. 2.24e6 4.68 4.56e7 5.00e7
#> 5 birth_count… US 7733 2.69e8 1.16e13 3402537. 6.67e6 2.48 2.62e8 2.75e8
If we wish to estimate population totals for all combinations of the grouping
variables, including no grouping and partial grouping, we can use
se_total_ogd(), a wrapper around the main se_total() function:
se_total_ogd(nhanes, strata = strata, weight = weights, gender, birth_country)
#> # A tibble: 11 × 7
#> gender birth_country occ total ci ci_l ci_u
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Total Total 9971 316481044. 6370681. 310110363. 322851725.
#> 2 Female Total 5079 161922446. 5479833. 156442613. 167402279.
#> 3 Male Total 4892 154558598. 5507576. 149051022. 160066174.
#> 4 Total Missing 2 29693. 42060. -12367. 71753.
#> 5 Total Other 2236 47811833. 2236143. 45575690. 50047977.
#> 6 Total US 7733 268639517. 6668850. 261970667. 275308367.
#> 7 Female Missing 2 29693. 42060. -12367. 71753.
#> 8 Female Other 1168 23914531. 1584444. 22330087. 25498975.
#> 9 Female US 3909 137978222. 5460390. 132517832. 143438611.
#> 10 Male Other 1068 23897302. 1743278. 22154024. 25640580.
#> 11 Male US 3824 130661296. 5426191. 125235105. 136087486.
Proportion Estimates
We can also estimate the proportion of males and females by birth
country in the nhanes survey:
se_prop(
data = nhanes,
gender,
birth_country,
weight = weights,
strata = strata
)
#> # A tibble: 5 × 9
#> gender birth_country occ prop vhat stand_dev ci ci_l ci_u
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing 2 9.38e-5 4.59e-9 0.0000678 1.33e-4 -3.90e-5 2.27e-4
#> 2 Female Other 1168 7.56e-2 7.30e-6 0.00270 5.30e-3 7.03e-2 8.09e-2
#> 3 Female US 3909 4.36e-1 5.20e-5 0.00721 1.41e-2 4.22e-1 4.50e-1
#> 4 Male Other 1068 7.55e-2 8.51e-6 0.00292 5.72e-3 6.98e-2 8.12e-2
#> 5 Male US 3824 4.13e-1 5.19e-5 0.00721 1.41e-2 3.99e-1 4.27e-1
We can also display total and proportion estimates in a single table with se_total_prop() and apply the FSO publication format with fso_flag_mask(). This format qualifies the reliability of estimates and hides confidential ones (those based on fewer than five observations):
se_total_prop(
data = nhanes,
gender,
birth_country,
weight = weights,
strata = strata
) |>
fso_flag_mask()
#> # A tibble: 5 × 12
#> gender birth_country occ total ci_total ci_l_total ci_u_total prop
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing NA NA NA NA NA NA
#> 2 Female Other 1168 23914531. 1584444. 22330087. 25498975. 0.0756
#> 3 Female US 3909 137978222. 5460390. 132517832. 143438611. 0.436
#> 4 Male Other 1068 23897302. 1743278. 22154024. 25640580. 0.0755
#> 5 Male US 3824 130661296. 5426191. 125235105. 136087486. 0.413
#> # ℹ 4 more variables: ci_prop <dbl>, ci_l_prop <dbl>, ci_u_prop <dbl>,
#> # obs_status <chr>
Mean Estimates
If on the other hand we wish to estimate the mean household size then
we can use the function se_mean():
se_mean(
data = nhanes,
variable = household_size,
strata = strata,
weight = weights
)
#> # A tibble: 1 × 7
#> occ household_size vhat stand_dev ci ci_l ci_u
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 9971 3.46 0.000495 0.0222 0.0436 3.42 3.51
Alternatively, the wrapper function se_mean_ogd() estimates the mean for all
possible combinations of the grouping variables gender and
interview_lang:
se_mean_ogd(
nhanes,
variable = household_size,
strata = strata,
weight = weights,
gender, interview_lang
)
#> # A tibble: 9 × 7
#> gender interview_lang occ household_size ci ci_l ci_u
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Total Total 9971 3.46 0.0436 3.42 3.51
#> 2 Female Total 5079 3.44 0.0611 3.38 3.50
#> 3 Male Total 4892 3.49 0.0621 3.43 3.55
#> 4 Total English 8584 3.38 0.0453 3.33 3.42
#> 5 Total Spanish 1387 4.60 0.0959 4.50 4.69
#> 6 Female English 4345 3.36 0.0636 3.29 3.42
#> 7 Female Spanish 734 4.56 0.129 4.43 4.69
#> 8 Male English 4239 3.40 0.0645 3.33 3.46
#> 9 Male Spanish 653 4.63 0.139 4.49 4.77
With the FSO format applied, here grouped by gender and birth_country:
nhanes |>
se_mean_ogd(
variable = household_size,
gender, birth_country,
strata = strata,
weight = weights
) |>
fso_flag_mask(lang = "en") # Default is "de", further possibilities: "fr", "it"
#> # A tibble: 11 × 8
#> gender birth_country occ household_size ci ci_l ci_u obs_status
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Total Total 9971 3.46 0.0436 3.42 3.51 Reliable estim…
#> 2 Female Total 5079 3.44 0.0611 3.38 3.50 Reliable estim…
#> 3 Male Total 4892 3.49 0.0621 3.43 3.55 Reliable estim…
#> 4 Total US 7733 3.40 0.0485 3.35 3.44 Reliable estim…
#> 5 Total Other 2236 3.85 0.0840 3.77 3.94 Reliable estim…
#> 6 Total Missing NA NA NA NA NA No estimate (c…
#> 7 Female US 3909 3.36 0.0678 3.30 3.43 Reliable estim…
#> 8 Female Other 1168 3.88 0.114 3.77 4.00 Reliable estim…
#> 9 Female Missing NA NA NA NA NA No estimate (c…
#> 10 Male US 3824 3.43 0.0691 3.36 3.50 Reliable estim…
#> 11 Male Other 1068 3.83 0.123 3.70 3.95 Reliable estim…
Mobility Survey
If we want to estimate the mean household income then we can use
mzmv_mean():
mzmv_mean(
data = nhanes,
variable = annual_household_income,
weight = weights
)
#> # A tibble: 1 × 4
#> variable occ wmean ci
#> <chr> <int> <dbl> <dbl>
#> 1 annual_household_income 9626 11.9 0.240
To group the estimate by gender, we can use mzmv_mean_map() (note that the variable argument must be quoted here):
mzmv_mean_map(
data = nhanes,
variable = "annual_household_income",
gender,
weight = weights
)
#> # A tibble: 2 × 6
#> variable group_vars group_vars_value occ wmean ci
#> <chr> <chr> <fct> <int> <dbl> <dbl>
#> 1 annual_household_income gender Female 4906 11.8 0.350
#> 2 annual_household_income gender Male 4720 12.0 0.328
Flagging Estimate Reliability
fso_flag_mask() applies the FSO’s reliability rules for survey
estimates, based on the number of observations (occ). It
flags estimates of low reliability and masks estimates when the sample size
is too small (occ <= 4), as follows:
Number of observations (occ) | Label (obs_status)
occ <= 4 | No estimate (confidential)
4 < occ <= 49 | Estimate of low reliability
occ > 49 | Reliable estimate
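The thresholds listed above can be sketched in base R. The helper classify_occ() below is hypothetical and is not part of chensus; it only illustrates how the three labels follow from occ, and it does not mask any columns the way fso_flag_mask() does.

```r
# Hypothetical helper (not part of chensus): map the number of
# observations to the English reliability labels listed above.
classify_occ <- function(occ) {
  ifelse(
    occ <= 4, "No estimate (confidential)",
    ifelse(occ <= 49, "Estimate of low reliability", "Reliable estimate")
  )
}

# Boundary cases on either side of each threshold
classify_occ(c(4, 5, 49, 50))
```

The nested ifelse() checks the confidentiality threshold first, so a cell with four or fewer observations is never labelled as merely unreliable.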
results <- nhanes |>
se_total(
strata = strata,
weight = weights,
gender,
birth_country,
interview_lang,
edu_level
)
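library(dplyr) # provides filter() and select()
results |>
filter(occ < 60) |> # keep only cells with few observations
fso_flag_mask() |> # default lang = "de", hence the German labels below
select(gender, birth_country, interview_lang, occ, total, ci, obs_status) # edu_level is dropped, so some rows look identical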
#> # A tibble: 26 × 7
#> gender birth_country interview_lang occ total ci obs_status
#> <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
#> 1 Female Missing English NA NA NA Kein Schätzwert (…
#> 2 Female Missing Spanish NA NA NA Kein Schätzwert (…
#> 3 Female Other English 57 1098021. 301680. Schätzwert verläs…
#> 4 Female Other English 41 883410. 287543. Schätzwert beding…
#> 5 Female Other Spanish 45 673772. 217372. Schätzwert beding…
#> 6 Female Other Spanish NA NA NA Kein Schätzwert (…
#> 7 Female Other Spanish 18 310672. 154623. Schätzwert beding…
#> 8 Female US English 14 403508. 277121. Schätzwert beding…
#> 9 Female US Spanish 21 258146. 114605. Schätzwert beding…
#> 10 Female US Spanish 13 208119. 116775. Schätzwert beding…
#> # ℹ 16 more rows
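Returning to the se_total() output at the start of this vignette, the printed values suggest how the interval columns relate to one another. The base R sketch below recomputes ci, ci_per, ci_l and ci_u for the Female / Other row from total and stand_dev, assuming a normal-approximation 95% interval. This relationship is inferred from the numbers shown in the tables, not taken from the package source, so treat it as an illustration only.

```r
# Values copied from the Female / Other row of the se_total() output
total     <- 23914531
stand_dev <- 808405

# Assumption: half-width = 97.5% normal quantile times the standard deviation
ci     <- qnorm(0.975) * stand_dev
ci_per <- 100 * ci / total  # half-width as a percentage of the estimate
ci_l   <- total - ci        # lower bound
ci_u   <- total + ci        # upper bound

round(c(ci = ci, ci_per = ci_per, ci_l = ci_l, ci_u = ci_u), 2)
```

Under this assumption the results agree with the printed ci, ci_per, ci_l and ci_u values up to the rounding of stand_dev in the table.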