Introduction
The chensus package estimates population frequencies, means, and proportions, together with their confidence intervals, from surveys conducted by the Swiss Federal Statistical Office (FSO):
- structural survey: Strukturerhebung (SE) / relevé structurel (RS),
- mobility and transport survey: Mikrozensus Mobilität und Verkehr (MZMV) / Microrecensement mobilité et transports (MRMT).
In this vignette, we demonstrate the main features of the package using the built-in nhanes dataset, which contains a subset of data from the National Health and Nutrition Examination Survey for the period 2015-2016 (see ?nhanes and vignette("nhanes") for details).
Its structure is similar to FSO survey data in that it contains strata and weights columns and demographic features such as gender and household_size.
Structural Survey
Total Estimates
Suppose we want to estimate the population in the nhanes dataset by gender and birth country. We can use the main analysis function se_total():
se_total(
data = nhanes,
weight = weights,
strata = strata,
gender, birth_country
)
#> # A tibble: 5 × 10
#> gender birth_country occ total vhat stand_dev ci ci_per ci_l
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing 2 29693. 4.61e 8 21460. 4.21e4 142. -1.24e4
#> 2 Female Other 1168 23914531. 6.54e11 808405. 1.58e6 6.63 2.23e7
#> 3 Female US 3909 137978222. 7.76e12 2785964. 5.46e6 3.96 1.33e8
#> 4 Male Other 1068 23897302. 7.91e11 889444. 1.74e6 7.29 2.22e7
#> 5 Male US 3824 130661296. 7.66e12 2768516. 5.43e6 4.15 1.25e8
#> # ℹ 1 more variable: ci_u <dbl>
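The interval columns in the output above appear to follow the usual normal approximation: ci is roughly 1.96 times stand_dev, ci_l and ci_u are total minus and plus ci, and ci_per expresses ci as a percentage of total. This relationship is inferred from the printed numbers rather than taken from the package documentation, so the sketch below is an assumption. It reproduces the second row (Female, Other) with base R only.

```r
# Values copied from the second row of the printed output (Female, Other).
total     <- 23914531
stand_dev <- 808405

# Assumption: a two-sided 95% interval based on the normal quantile.
z  <- qnorm(0.975)
ci <- z * stand_dev

ci_l   <- total - ci        # lower bound
ci_u   <- total + ci        # upper bound
ci_per <- 100 * ci / total  # half-width as a percentage of the estimate

round(c(ci = ci, ci_l = ci_l, ci_u = ci_u, ci_per = ci_per), 2)
```

Under this reading, the negative ci_l in the first row (two observations only) is simply the symmetric interval extending below zero; it is not truncated.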
Column names can be passed programmatically with the help of rlang's !!sym() and !!!syms() in the function call:
w <- "weights"
s <- "strata"
v <- c("gender", "birth_country")
se_total(
data = nhanes,
strata = !!sym(s),
weight = !!sym(w),
!!!syms(v)
)
#> # A tibble: 5 × 10
#> gender birth_country occ total vhat stand_dev ci ci_per ci_l
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing 2 29693. 4.61e 8 21460. 4.21e4 142. -1.24e4
#> 2 Female Other 1168 23914531. 6.54e11 808405. 1.58e6 6.63 2.23e7
#> 3 Female US 3909 137978222. 7.76e12 2785964. 5.46e6 3.96 1.33e8
#> 4 Male Other 1068 23897302. 7.91e11 889444. 1.74e6 7.29 2.22e7
#> 5 Male US 3824 130661296. 7.66e12 2768516. 5.43e6 4.15 1.25e8
#> # ℹ 1 more variable: ci_u <dbl>
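To see what !!sym() and !!!syms() do outside of chensus, the following sketch applies the same pattern to dplyr verbs on the built-in mtcars data. The column choices (cyl, gear, wt) are purely illustrative and unrelated to the package; the point is only that strings converted to symbols and unquoted behave like column names typed directly.

```r
library(dplyr)
library(rlang)

# Column names held as strings, e.g. coming from a config file or a loop.
group_cols <- c("cyl", "gear")
value_col  <- "wt"

# sym() turns one string into a symbol, syms() turns a vector into a list of
# symbols; !! and !!! unquote them inside the tidy-evaluation call.
by_name <- mtcars |>
  group_by(!!!syms(group_cols)) |>
  summarise(total = sum(!!sym(value_col)), .groups = "drop")

# The equivalent call with the column names typed directly.
by_symbol <- mtcars |>
  group_by(cyl, gear) |>
  summarise(total = sum(wt), .groups = "drop")

all.equal(by_name, by_symbol)
```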
We can also estimate the population for several grouping variables at once, with each variable treated separately:
se_total_map(
nhanes,
weight = weights,
strata = strata,
gender, birth_country
)
#> # A tibble: 5 × 10
#> variable value occ total vhat stand_dev ci ci_per ci_l ci_u
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 gender Fema… 5079 1.62e8 7.82e12 2795884. 5.48e6 3.38 1.56e8 1.67e8
#> 2 gender Male 4892 1.55e8 7.90e12 2810039. 5.51e6 3.56 1.49e8 1.60e8
#> 3 birth_count… Miss… 2 2.97e4 4.61e 8 21460. 4.21e4 142. -1.24e4 7.18e4
#> 4 birth_count… Other 2236 4.78e7 1.30e12 1140910. 2.24e6 4.68 4.56e7 5.00e7
#> 5 birth_count… US 7733 2.69e8 1.16e13 3402537. 6.67e6 2.48 2.62e8 2.75e8
If we wish to estimate the population for all combinations of grouping variables, including no grouping and partial grouping, we can use se_total_ogd(), a wrapper around the main se_total() function:
se_total_ogd(nhanes, strata = strata, weight = weights, gender, birth_country)
#> # A tibble: 11 × 7
#> gender birth_country occ total ci ci_l ci_u
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Total Total 9971 316481044. 6370681. 310110363. 322851725.
#> 2 Female Total 5079 161922446. 5479833. 156442613. 167402279.
#> 3 Male Total 4892 154558598. 5507576. 149051022. 160066174.
#> 4 Total Missing 2 29693. 42060. -12367. 71753.
#> 5 Total Other 2236 47811833. 2236143. 45575690. 50047977.
#> 6 Total US 7733 268639517. 6668850. 261970667. 275308367.
#> 7 Female Missing 2 29693. 42060. -12367. 71753.
#> 8 Female Other 1168 23914531. 1584444. 22330087. 25498975.
#> 9 Female US 3909 137978222. 5460390. 132517832. 143438611.
#> 10 Male Other 1068 23897302. 1743278. 22154024. 25640580.
#> 11 Male US 3824 130661296. 5426191. 125235105. 136087486.
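The idea of "all combinations of grouping variables, including no or partial grouping" can be sketched in base R as follows. This is not the package's implementation: it uses a small made-up data frame (toy), computes weighted totals only (no variances or confidence intervals), and assumes that a variable left out of the grouping is reported as "Total", as in the output above.

```r
group_vars <- c("gender", "birth_country")

# Enumerate every subset of the grouping variables, starting with the empty one.
subsets <- list(character(0))
for (k in seq_along(group_vars)) {
  subsets <- c(subsets, combn(group_vars, k, simplify = FALSE))
}

# Made-up data for illustration only.
toy <- data.frame(
  gender        = c("Female", "Female", "Male", "Male"),
  birth_country = c("US", "Other", "US", "US"),
  weights       = c(10, 20, 30, 40)
)

totals <- lapply(subsets, function(vars) {
  out <- toy
  # Variables left out of this subset collapse to the label "Total".
  for (v in setdiff(group_vars, vars)) out[[v]] <- "Total"
  aggregate(weights ~ gender + birth_country, data = out, FUN = sum)
})

# One table: overall total, totals by one variable, then the full breakdown.
ogd <- do.call(rbind, totals)
ogd
```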
Proportion Estimates
We can also estimate the share of the total population in each combination of gender and birth country in the nhanes survey:
se_prop(
data = nhanes,
gender,
birth_country,
weight = weights,
strata = strata
)
#> # A tibble: 5 × 9
#> gender birth_country occ prop vhat stand_dev ci ci_l ci_u
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing 2 9.38e-5 4.59e-9 0.0000678 1.33e-4 -3.90e-5 2.27e-4
#> 2 Female Other 1168 7.56e-2 7.30e-6 0.00270 5.30e-3 7.03e-2 8.09e-2
#> 3 Female US 3909 4.36e-1 5.20e-5 0.00721 1.41e-2 4.22e-1 4.50e-1
#> 4 Male Other 1068 7.55e-2 8.51e-6 0.00292 5.72e-3 6.98e-2 8.12e-2
#> 5 Male US 3824 4.13e-1 5.19e-5 0.00721 1.41e-2 3.99e-1 4.27e-1
We can also display total and proportion estimates in a single table using the FSO format. The FSO publication format qualifies the reliability of estimates and hides confidential estimates (fewer than five observations):
se_total_prop(
data = nhanes,
gender,
birth_country,
weight = weights,
strata = strata
) |>
fso_flag_mask()
#> # A tibble: 5 × 12
#> gender birth_country occ total ci_total ci_l_total ci_u_total prop
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Female Missing NA NA NA NA NA NA
#> 2 Female Other 1168 23914531. 1584444. 22330087. 25498975. 0.0756
#> 3 Female US 3909 137978222. 5460390. 132517832. 143438611. 0.436
#> 4 Male Other 1068 23897302. 1743278. 22154024. 25640580. 0.0755
#> 5 Male US 3824 130661296. 5426191. 125235105. 136087486. 0.413
#> # ℹ 4 more variables: ci_prop <dbl>, ci_l_prop <dbl>, ci_u_prop <dbl>,
#> # obs_status <chr>
Mean Estimates
If, on the other hand, we wish to estimate the mean household size, we can use the function se_mean():
se_mean(
data = nhanes,
variable = household_size,
strata = strata,
weight = weights
)
#> # A tibble: 1 × 7
#> occ household_size vhat stand_dev ci ci_l ci_u
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 9971 3.46 0.000495 0.0222 0.0436 3.42 3.51
Alternatively, we can use the wrapper function se_mean_ogd() for all possible combinations of the grouping variables gender and interview_lang:
se_mean_ogd(
nhanes,
variable = household_size,
strata = strata,
weight = weights,
gender, interview_lang
)
#> # A tibble: 9 × 7
#> gender interview_lang occ household_size ci ci_l ci_u
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Total Total 9971 3.46 0.0436 3.42 3.51
#> 2 Female Total 5079 3.44 0.0611 3.38 3.50
#> 3 Male Total 4892 3.49 0.0621 3.43 3.55
#> 4 Total English 8584 3.38 0.0453 3.33 3.42
#> 5 Total Spanish 1387 4.60 0.0959 4.50 4.69
#> 6 Female English 4345 3.36 0.0636 3.29 3.42
#> 7 Female Spanish 734 4.56 0.129 4.43 4.69
#> 8 Male English 4239 3.40 0.0645 3.33 3.46
#> 9 Male Spanish 653 4.63 0.139 4.49 4.77
The same kind of estimate can be displayed in FSO format, here grouped by gender and birth_country:
nhanes |>
se_mean_ogd(
variable = household_size,
gender, birth_country,
strata = strata,
weight = weights,
) |>
fso_flag_mask(lang = "en") # Default is "de", further possibilities: "fr", "it"
#> # A tibble: 11 × 8
#> gender birth_country occ household_size ci ci_l ci_u obs_status
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Total Total 9971 3.46 0.0436 3.42 3.51 Reliable estim…
#> 2 Female Total 5079 3.44 0.0611 3.38 3.50 Reliable estim…
#> 3 Male Total 4892 3.49 0.0621 3.43 3.55 Reliable estim…
#> 4 Total US 7733 3.40 0.0485 3.35 3.44 Reliable estim…
#> 5 Total Other 2236 3.85 0.0840 3.77 3.94 Reliable estim…
#> 6 Total Missing NA NA NA NA NA No estimate (c…
#> 7 Female US 3909 3.36 0.0678 3.30 3.43 Reliable estim…
#> 8 Female Other 1168 3.88 0.114 3.77 4.00 Reliable estim…
#> 9 Female Missing NA NA NA NA NA No estimate (c…
#> 10 Male US 3824 3.43 0.0691 3.36 3.50 Reliable estim…
#> 11 Male Other 1068 3.83 0.123 3.70 3.95 Reliable estim…
Mobility Survey
If we want to estimate the mean household income, we can use mzmv_mean():
mzmv_mean(
data = nhanes,
variable = annual_household_income,
weight = weights
)
#> # A tibble: 1 × 4
#> variable occ wmean ci
#> <chr> <int> <dbl> <dbl>
#> 1 annual_household_income 9626 11.9 0.240
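As a rough intuition for the wmean column, a weighted mean can be computed in base R as shown below. The printed occ (9626) is lower than the 9971 observations seen in earlier outputs, which suggests that rows with a missing income are excluded; that reading, and the toy numbers, are assumptions for illustration and say nothing about how mzmv_mean() derives its confidence interval.

```r
# Toy data: the third respondent did not report a value.
x <- c(2, 4, NA, 6)
w <- c(1, 1, 5, 2)

# weighted.mean() drops the missing value together with its weight.
wmean <- weighted.mean(x, w, na.rm = TRUE)

# The same quantity written out by hand.
keep   <- !is.na(x)
manual <- sum(w[keep] * x[keep]) / sum(w[keep])

# Number of observations that actually contribute to the estimate.
occ <- sum(keep)

c(wmean = wmean, manual = manual, occ = occ)
```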
To group the estimate by gender, we can use mzmv_mean_map() (note that the variable argument must be quoted here):
mzmv_mean_map(
data = nhanes,
variable = "annual_household_income",
gender,
weight = weights
)
#> # A tibble: 2 × 6
#> variable group_vars group_vars_value occ wmean ci
#> <chr> <chr> <fct> <int> <dbl> <dbl>
#> 1 annual_household_income gender Female 4906 11.8 0.350
#> 2 annual_household_income gender Male 4720 12.0 0.328
Flagging Estimate Reliability
fso_flag_mask() applies the FSO's reliability rules for survey estimates, based on the number of observations (occ). It flags estimates of low reliability and masks estimates when the sample size is too small (occ <= 4), as follows:
occ <= 4 | No estimate (confidential)
5 <= occ <= 49 | Estimate of low reliability
occ > 49 | Reliable estimate
In the example that follows, the labels are printed in German because the default language is "de".
results <- nhanes |>
se_total(
strata = strata,
weight = weights,
gender,
birth_country,
interview_lang,
edu_level
)
results |>
filter(occ < 60) |>
fso_flag_mask() |>
select(gender, birth_country, interview_lang, occ, total, ci, obs_status)
#> # A tibble: 26 × 7
#> gender birth_country interview_lang occ total ci obs_status
#> <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
#> 1 Female Missing English NA NA NA Kein Schätzwert (…
#> 2 Female Missing Spanish NA NA NA Kein Schätzwert (…
#> 3 Female Other English 57 1098021. 301680. Schätzwert verläs…
#> 4 Female Other English 41 883410. 287543. Schätzwert beding…
#> 5 Female Other Spanish 45 673772. 217372. Schätzwert beding…
#> 6 Female Other Spanish NA NA NA Kein Schätzwert (…
#> 7 Female Other Spanish 18 310672. 154623. Schätzwert beding…
#> 8 Female US English 14 403508. 277121. Schätzwert beding…
#> 9 Female US Spanish 21 258146. 114605. Schätzwert beding…
#> 10 Female US Spanish 13 208119. 116775. Schätzwert beding…
#> # ℹ 16 more rows
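The thresholds listed at the start of this section can be sketched in base R as follows. flag_reliability() is a hypothetical helper written for this illustration, not a chensus function, and the toy values are made up; the sketch only mirrors the documented cut-offs and the masking of confidential cells seen in the output.

```r
# Hypothetical helper, not part of chensus: label estimates by observation count.
flag_reliability <- function(occ) {
  labels <- c(
    "No estimate (confidential)",
    "Estimate of low reliability",
    "Reliable estimate"
  )
  # Right-closed intervals: (-Inf, 4], (4, 49], (49, Inf)
  as.character(cut(occ, breaks = c(-Inf, 4, 49, Inf), labels = labels))
}

# Made-up estimates for illustration only.
toy <- data.frame(occ = c(2, 18, 49, 57), total = c(100, 200, 300, 400))
toy$obs_status <- flag_reliability(toy$occ)

# Mask confidential cells, mirroring the NA rows in the printed output.
confidential <- toy$occ <= 4
toy[confidential, c("occ", "total")] <- NA
toy
```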