This vignette describes the mathematical method used to estimate confidence intervals for the structural survey and for the mobility and transport survey, both conducted by the Swiss Federal Statistical Office (FSO).
Structural Survey
The FSO provides formulas (in German, Section 6) for estimating population figures and their variances from the structural survey.
The estimator depends on:
- The type of variable:
  - Categorical: a factor-like variable, e.g., gender, country of birth.
  - Continuous: a numeric variable, e.g., income, household size.
- The type of estimate:
  - Total (sum across the population).
  - Proportion (relative frequency) or mean (average).
Population Estimator
The estimator of variable y depends on the type of the variable and the desired statistic:
Variable type | Estimate type | Estimate |
---|---|---|
Categorical | Total | \hat{y} = \sum_k w_k I_c(y_k) |
Continuous | Total | \hat{y} = \sum_k w_k y_k |
Categorical | Proportion | \bar y = \frac{\sum_k w_k I_c(y_k)} {\sum _k w_k} |
Continuous | Mean | \bar y = \frac {\sum_k w_k y_k} {\sum _k w_k} |
where:
- w_k is the sampling weight for respondent k,
- I_c(y_k) = 1 if y_k satisfies condition c, 0 otherwise,
- y_k is the observed value for respondent k.
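The four estimators in the table reduce to two operations: a weighted sum, and a weighted sum divided by the sum of weights. The following is a minimal Python sketch of that idea, not the package's implementation. The function names `weighted_total` and `weighted_mean` and the toy data are hypothetical and chosen only for illustration.

```python
def weighted_total(weights, values):
    """Estimated total: sum_k w_k * y_k (pass 0/1 indicators for a categorical total)."""
    return sum(w * y for w, y in zip(weights, values))


def weighted_mean(weights, values):
    """Estimated mean or proportion: sum_k w_k * y_k / sum_k w_k."""
    return weighted_total(weights, values) / sum(weights)


# Hypothetical toy sample: four respondents with a weight, a category and an income.
weights = [10.0, 20.0, 30.0, 40.0]
gender = ["f", "m", "f", "m"]
income = [3000.0, 4000.0, 5000.0, 6000.0]

# Indicator I_c(y_k) for the condition c: gender == "f".
is_female = [1 if g == "f" else 0 for g in gender]

total_female = weighted_total(weights, is_female)  # categorical total
share_female = weighted_mean(weights, is_female)   # categorical proportion
total_income = weighted_total(weights, income)     # continuous total
mean_income = weighted_mean(weights, income)       # continuous mean
```

Encoding the condition as a 0/1 indicator first means that categorical and continuous variables can share the same two functions.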
The variance of the estimator of the variable y is approximated by the variance of the estimate of variable z defined as:
\hat z = \sum_{k \in r} w_k z_k
where the transformation z_k depends on both the type of variable y and the desired statistic:
Variable type | Estimate type | Transformation z_k |
---|---|---|
Categorical | Total | z_k = I_c(y_k) |
Continuous | Total | z_k = y_k |
Categorical | Proportion | z_k = \frac{I_c(y_k) - \bar y} {\sum _i w_i} |
Continuous | Mean | z_k=\frac{y_k - \bar y} {\sum _i w_i} |
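As an illustration of the transformation, the Python sketch below computes z_k for a total and for a mean. It assumes that categorical variables have already been encoded as 0/1 indicators, so a proportion is handled as the mean of the indicator. The function name `linearized_values` and the data are hypothetical.

```python
def linearized_values(weights, values, estimate):
    """Return z_k for estimate in {"total", "mean"}.

    For a categorical variable, pass 0/1 indicators as `values`;
    "mean" then corresponds to a proportion.
    """
    if estimate == "total":
        return list(values)
    if estimate == "mean":
        sum_w = sum(weights)
        y_bar = sum(w * y for w, y in zip(weights, values)) / sum_w
        return [(y - y_bar) / sum_w for y in values]
    raise ValueError("estimate must be 'total' or 'mean'")


# Hypothetical toy data.
weights = [10.0, 20.0, 30.0, 40.0]
income = [3000.0, 4000.0, 5000.0, 6000.0]

z_total = linearized_values(weights, income, "total")
z_mean = linearized_values(weights, income, "mean")

# For a mean, the weighted total of z is zero by construction,
# because the weighted deviations from the weighted mean cancel out.
z_hat_mean = sum(w * z for w, z in zip(weights, z_mean))
```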
Variance Estimator
The variance estimator for the estimator \hat{z} is given by:
\hat V(\hat z) = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2
where:
- h is the stratum index (zone),
- r_h is the set of respondents in stratum h,
- m_h is the number of respondents in r_h,
- N_h = \sum_{i \in r_h} w_i is the estimated population size in stratum h,
- w_i is the sampling weight for respondent i,
- z_i is the transformation of y_i defined above,
- \hat{z}_h is the estimate of variable z in stratum h.
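The half-width of the confidence interval around the estimate is given by:
\text{CI} = \sqrt{\hat{V}(\hat{z})} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right)
where \alpha is the significance level, for example \alpha = 0.05 for a 95% confidence interval, and qnorm is the quantile function of the standard normal distribution.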
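The variance and confidence interval formulas above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the package's code: the function names `stratified_variance` and `ci_half_width` and the toy data are hypothetical, and `statistics.NormalDist().inv_cdf` plays the role of R's `qnorm`.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def stratified_variance(weights, z, strata):
    """Variance estimate of z_hat = sum_i w_i z_i, summed over strata."""
    by_stratum = defaultdict(list)
    for w, z_i, h in zip(weights, z, strata):
        by_stratum[h].append((w, z_i))

    variance = 0.0
    for rows in by_stratum.values():
        m_h = len(rows)                            # respondents in stratum h
        if m_h < 2:
            raise ValueError("each stratum needs at least two respondents")
        n_h = sum(w for w, _ in rows)              # N_h: sum of weights in stratum h
        z_hat_h = sum(w * z_i for w, z_i in rows)  # stratum estimate of z
        squares = sum((w * z_i - z_hat_h / m_h) ** 2 for w, z_i in rows)
        variance += m_h / (m_h - 1) * (1 - m_h / n_h) * squares
    return variance


def ci_half_width(variance, alpha=0.05):
    """sqrt(V) * qnorm(1 - alpha / 2)."""
    return sqrt(variance) * NormalDist().inv_cdf(1 - alpha / 2)


# Hypothetical toy data: two strata, total of a 0/1 indicator (z_i = I_c).
weights = [10.0, 10.0, 20.0, 20.0]
indicator = [1, 0, 1, 1]
strata = ["A", "A", "B", "B"]

variance = stratified_variance(weights, indicator, strata)
half_width = ci_half_width(variance)
```

In this toy example, stratum B contributes nothing to the variance because all of its respondents satisfy the condition and carry the same weight.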
Simplification of Variance Estimates
Total of Categorical Variable
The estimated total for a condition c is given by:
\hat{N}_c = \sum_{i \in r} w_i I_c
with corresponding variance estimate:
\hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2
where:
- \hat{N}_c is the estimated total for condition c,
- \hat{N}_{hc} is the estimated total for condition c in stratum h.
For condition c, the inner sum over the respondents of stratum h becomes:
\begin{aligned} \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2 &= \sum_{i \notin r_{hc}} \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \\ &= \left(m_h - m_{hc}\right) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \end{aligned}
where r_{hc} is the set of respondents in stratum h who fulfill condition c, and m_{hc} is the number of respondents in r_{hc}.
Thus, the original variance estimate equation becomes:
\hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \left[(m_h - m_{hc}) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2\right]
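A quick numerical check can confirm that the simplified expression agrees with the general one. The Python sketch below evaluates both forms on the same hypothetical toy data; all function names are illustrative and not part of the package.

```python
from collections import defaultdict


def group_by_stratum(weights, indicator, strata):
    """Collect (weight, indicator) pairs per stratum."""
    groups = defaultdict(list)
    for w, flag, h in zip(weights, indicator, strata):
        groups[h].append((w, flag))
    return groups


def variance_general(weights, indicator, strata):
    """General form: inner sum over all respondents of each stratum."""
    total = 0.0
    for rows in group_by_stratum(weights, indicator, strata).values():
        m_h = len(rows)
        n_h = sum(w for w, _ in rows)
        n_hc = sum(w * flag for w, flag in rows)
        squares = sum((w * flag - n_hc / m_h) ** 2 for w, flag in rows)
        total += m_h / (m_h - 1) * (1 - m_h / n_h) * squares
    return total


def variance_simplified(weights, indicator, strata):
    """Simplified form: respondents outside r_hc contribute a constant term."""
    total = 0.0
    for rows in group_by_stratum(weights, indicator, strata).values():
        m_h = len(rows)
        n_h = sum(w for w, _ in rows)
        matching = [w for w, flag in rows if flag == 1]  # weights of respondents in r_hc
        m_hc = len(matching)
        n_hc = sum(matching)
        squares = (m_h - m_hc) * (n_hc / m_h) ** 2
        squares += sum((w - n_hc / m_h) ** 2 for w in matching)
        total += m_h / (m_h - 1) * (1 - m_h / n_h) * squares
    return total


# Hypothetical toy data: two strata with three respondents each.
weights = [12.0, 8.0, 15.0, 25.0, 30.0, 10.0]
indicator = [1, 0, 1, 0, 1, 1]
strata = ["A", "A", "A", "B", "B", "B"]

v_general = variance_general(weights, indicator, strata)
v_simplified = variance_simplified(weights, indicator, strata)
```

The simplified form only needs the weights of the respondents who fulfill the condition, plus the counts m_h and m_{hc}.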
Mean of a Continuous Variable
The estimate of the mean of a continuous variable y, for example the average rent (variable rentnet), is given by the weighted mean:
\bar y = \frac{\sum_k w_k y_k}{\sum_k w_k}
The variance of \bar y is approximated by that of the total \hat{z} = \sum_k w_k z_k, where:
z_k = \frac{y_k - \bar y}{\sum_i w_i}
In other words:
\begin{align*} \hat V(\bar y) & = \hat V(\hat z) \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\sum_{j \in r_h} w_j z_j}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i \frac{y_i - \bar y}{\sum_{k \in r} w_k} - \frac{\sum_{j \in r_h} w_j \left(\frac{y_j - \bar y}{\sum_{k \in r} w_k}\right)}{m_h}\right)^2 \end{align*}
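The Python sketch below illustrates the derivation for a weighted mean. It takes z_k as defined above and assumes that the sum of weights in its denominator runs over all respondents, not over a single stratum. The function name `mean_with_ci` and the rent values are hypothetical, and `NormalDist().inv_cdf` stands in for `qnorm`.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def mean_with_ci(weights, values, strata, alpha=0.05):
    """Weighted mean and confidence half-width via the linearized variable z."""
    sum_w = sum(weights)
    y_bar = sum(w * y for w, y in zip(weights, values)) / sum_w

    # w_i * z_i with z_i = (y_i - y_bar) / sum of all weights, grouped by stratum.
    wz_by_stratum = defaultdict(list)
    n_by_stratum = defaultdict(float)
    for w, y, h in zip(weights, values, strata):
        wz_by_stratum[h].append(w * (y - y_bar) / sum_w)
        n_by_stratum[h] += w  # N_h: sum of weights in stratum h

    variance = 0.0
    for h, wz in wz_by_stratum.items():
        m_h = len(wz)  # assumed to be at least 2 in every stratum
        centre = sum(wz) / m_h
        squares = sum((v - centre) ** 2 for v in wz)
        variance += m_h / (m_h - 1) * (1 - m_h / n_by_stratum[h]) * squares

    half_width = sqrt(variance) * NormalDist().inv_cdf(1 - alpha / 2)
    return y_bar, half_width


# Hypothetical toy data: rents in two strata.
weights = [10.0, 10.0, 20.0, 20.0]
rent = [1000.0, 1400.0, 1800.0, 2000.0]
strata = ["A", "A", "B", "B"]

mean_rent, ci = mean_with_ci(weights, rent, strata)
```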
Mobility and Transport Survey
From the survey (MZMV/MRMT) data, mzmv_mean() estimates:
- the mean or proportion of a variable in the real population, computed as the weighted mean over the sub-population of interest,
- the confidence interval of the estimate at significance level \alpha,
while mzmv_mean_map() additionally uses grouping variables.
Note that one can simply use mzmv_mean() to estimate both proportions and means, as shown below.
The FSO provides formulas to estimate variances of the MZMV/MRMT.
Means
The estimated mean is:
\hat{Y} = \frac{1}{\sum\limits_{i\in r} w_i}\sum_{i \in r} w_i y_i
where:
- w_i is the weight for participant i,
- y_i is the response of participant i,
- r is the set of respondents.
The confidence interval of the estimated mean is:
\begin{aligned}\text{CI} &= 1.14\times Z_{\alpha}\frac{\hat{\sigma}_{y}}{\sqrt{n}}\\ &= 1.14 \times \frac{\hat{\sigma}_{y}}{\sqrt{n}} \times \text{qnorm}(1 - \frac{\alpha}2) \end{aligned}
where:
- 1.14 is a correction factor,
- \alpha is the significance level, for example 0.05 for a 95% confidence interval,
- Z_{\alpha} is the Z-value for the desired confidence level (Z_{0.05} = 1.96 for a two-sided 95% confidence interval),
- n is the size of set r, i.e. number of respondents,
- \hat{\sigma}_{y}^2 is the variance of variable Y estimated with sample r.
The (sample) variance of variable Y is estimated by:
\hat{\sigma}_{y}^2 = \frac{\sum\limits_{i\in r} w_i \left(y_i - \bar{y}\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}
where \bar{y} is the estimated mean \hat{Y}.
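The mean, the weighted sample variance and the confidence interval above can be combined in a few lines. The following Python sketch is a hypothetical helper written for illustration and is not the package's mzmv_mean(). The name `weighted_mean_ci` and the distances are assumptions, and `NormalDist().inv_cdf` stands in for `qnorm`.

```python
from math import sqrt
from statistics import NormalDist

CORRECTION_FACTOR = 1.14  # correction factor from the confidence interval formula


def weighted_mean_ci(weights, values, alpha=0.05):
    """Weighted mean and confidence half-width for a simple (unstratified) sample."""
    n = len(values)       # number of respondents
    sum_w = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / sum_w
    # Weighted sample variance with (sum of weights - 1) in the denominator.
    variance = sum(w * (y - mean) ** 2 for w, y in zip(weights, values)) / (sum_w - 1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = CORRECTION_FACTOR * z * sqrt(variance) / sqrt(n)
    return mean, ci


# Hypothetical toy data: daily distances in km for four respondents.
weights = [100.0, 200.0, 300.0, 400.0]
distance_km = [10.0, 20.0, 30.0, 40.0]

mean_km, ci_km = weighted_mean_ci(weights, distance_km)
```

Note that the variance is weighted, whereas n in the standard error is the unweighted number of respondents.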
Proportions
If y_i \in \{0, 1\}, for example possession of a car, then the mean estimate becomes the proportion estimate:
p = \frac{1}{\sum\limits_{i \in r} w_i} \sum_{i \in r} w_i I_c
where:
- w_i is the weight for participant i,
- I_c = 1 if condition c is true (y_i = 1), 0 otherwise (y_i = 0),
- r is the set of respondents.
The sample variance in the previous section then becomes:
\hat{\sigma}_{p}^2 = \frac{\sum\limits_{i\in r} w_i \left(I_c - p\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}
Noting that I_c^2 = I_c and \sum\limits_i w_i I_c = p \sum\limits_i w_i, the numerator then becomes:
\begin{aligned} \sum\limits_{i\in r} w_i \left(I_c - p\right)^2 &= \sum_i w_i \left(I_c^2 +p^2 -2pI_c\right) \\ &= \sum_i w_i I_c + p^2 \sum_i w_i -2p\sum_i w_i I_c\\ &= p \sum_i w_i + p^2 \sum_i w_i - 2p^2 \sum_i w_i\\ &= p \sum_i w_i - p^2 \sum_i w_i\\ &= p(1-p) \sum_i w_i \end{aligned}
Therefore, the estimated sample variance becomes:
\hat{\sigma}_{p}^2 = \frac{p(1-p) \sum\limits_{i} w_i}{\left(\sum\limits_{i} w_i \right)- 1}
which, when \sum\limits_i w_i \gg 1, can be approximated by:
\hat{\sigma}_{p}^2 \approx p(1-p)
The confidence interval for proportions can therefore be approximated by:
\text{CI} \approx 1.14 \times \sqrt{\frac{p(1-p)}{n}} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right)
where:
- \alpha is the significance level,
- \text{qnorm} returns the standard normal quantile (Z-value) for the probability 1 - \alpha/2,
- n is the size of set r, i.e. number of respondents.
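To see how close the approximation is, the Python sketch below computes the confidence interval for a proportion twice: once with the exact variance term and once with p(1-p). The function name `proportion_ci` and the data are hypothetical, and `NormalDist().inv_cdf` stands in for `qnorm`.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(weights, indicator, alpha=0.05):
    """Weighted proportion with exact and approximate confidence half-widths."""
    n = len(indicator)    # number of respondents
    sum_w = sum(weights)
    p = sum(w * flag for w, flag in zip(weights, indicator)) / sum_w
    z = NormalDist().inv_cdf(1 - alpha / 2)

    exact_variance = p * (1 - p) * sum_w / (sum_w - 1)
    ci_exact = 1.14 * z * sqrt(exact_variance / n)
    ci_approx = 1.14 * z * sqrt(p * (1 - p) / n)  # valid when sum_w is much larger than 1
    return p, ci_exact, ci_approx


# Hypothetical toy data: car ownership (1 = owns a car) for four respondents.
weights = [100.0, 200.0, 300.0, 400.0]
has_car = [1, 0, 1, 1]

p, ci_exact, ci_approx = proportion_ci(weights, has_car)
```

With survey weights that sum to a population in the thousands or millions, the two half-widths differ only by the factor \sqrt{\sum_i w_i / (\sum_i w_i - 1)}, which is very close to 1.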
Confidence Interval - Definition
A confidence interval is a range of plausible values for a population parameter, calculated from sample data. A 95% confidence interval means that if the same sampling procedure were repeated many times, approximately 95% of the resulting intervals would contain the true population value. This does not imply that there is a 95% probability that the true value lies within any single interval; rather, it reflects the reliability of the estimation method across repeated samples.
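The repeated-sampling interpretation can be illustrated with a small simulation. The Python sketch below assumes simple random samples from a normal population, which is much simpler than the survey designs described in this vignette. The function name `coverage` and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def coverage(true_mean=10.0, sd=2.0, n=50, repetitions=2000, alpha=0.05, seed=1):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    z = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        centre = mean(sample)
        half_width = z * stdev(sample) / sqrt(n)
        if centre - half_width <= true_mean <= centre + half_width:
            hits += 1
    return hits / repetitions


share_covering = coverage()
```

Under these assumptions, `share_covering` should land close to 0.95. Each individual interval either contains the true mean or does not; the 95% describes the procedure, not any single interval.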