vignettes/01-Introduction.Rmd
01-Introduction.RmdIn this book, we demonstrate how to measure poverty and income concentration in a population based on microdata collected from a complex survey sample. Most surveys administered by government agencies or larger research organizations utilize a sampling design that violates the assumption of simple random sampling (SRS), including:
Therefore, basic unweighted R commands such as mean() or
glm() will not properly account for the weighting nor the
measures of uncertainty (such as the confidence intervals) present in
the dataset. For some examples of publicly-available complex survey data
sets, see http://asdfree.com.
Unlike other software, the R convey package does not
require that the user specify these parameters throughout the analysis.
So long as the svydesign
object or svrepdesign
object has been constructed properly at the outset of the analysis,
the convey package will incorporate the survey design
automatically and produce statistics and variances that take the complex
sample into account.
In the following example, we’ve loaded the data set
eusilc from the R library laeken (Alfons and Templ 2013).
Next, we create an object of class survey.design using
the function svydesign of the library survey:
library(survey)
des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)Right after the creation of the design object
des_eusilc, we should use the function
convey_prep that adds an attribute to the survey design
which saves information on the design object based upon the whole
sample, needed to work with subset designs.
library(convey)
des_eusilc <- convey_prep( des_eusilc )To estimate the at-risk-of-poverty rate, we use the function
svyarpt:
svyarpr(~eqIncome, design=des_eusilc)
arpr SE
eqIncome 0.14444 0.0028To estimate the at-risk-of-poverty rate across domains defined by the
variable db040 we use:
svyby(~eqIncome, by = ~db040, design = des_eusilc, FUN = svyarpr, deff = FALSE)
db040 eqIncome se
Burgenland Burgenland 0.1953984 0.017202852
Carinthia Carinthia 0.1308627 0.010606502
Lower Austria Lower Austria 0.1384362 0.006513217
Salzburg Salzburg 0.1378734 0.011581408
Styria Styria 0.1437464 0.007453192
Tyrol Tyrol 0.1530819 0.009884094
Upper Austria Upper Austria 0.1088977 0.005933094
Vienna Vienna 0.1723468 0.007684540
Vorarlberg Vorarlberg 0.1653731 0.013756389Using the same data set, we estimate the quintile share ratio:
# for the whole population
svyqsr(~eqIncome, design=des_eusilc, alpha1= .20)
qsr SE
eqIncome 3.97 0.0426
# for domains
svyby(~eqIncome, by = ~db040, design = des_eusilc,
FUN = svyqsr, alpha1= .20, deff = FALSE)
db040 eqIncome se
Burgenland Burgenland 5.008486 0.32755685
Carinthia Carinthia 3.562404 0.10909726
Lower Austria Lower Austria 3.824539 0.08783599
Salzburg Salzburg 3.768393 0.17015086
Styria Styria 3.464305 0.09364800
Tyrol Tyrol 3.586046 0.13629739
Upper Austria Upper Austria 3.668289 0.09310624
Vienna Vienna 4.654743 0.13135731
Vorarlberg Vorarlberg 4.366511 0.20532075These functions can be used as S3 methods for the classes
survey.design and svyrep.design.
Let’s create a design object of class svyrep.design and
run the function convey_prep on it:
des_eusilc_rep <- as.svrepdesign(des_eusilc, type = "bootstrap")
des_eusilc_rep <- convey_prep(des_eusilc_rep) and then use the function svyarpr:
svyarpr(~eqIncome, design=des_eusilc_rep)
arpr SE
eqIncome 0.14444 0.0025
svyby(~eqIncome, by = ~db040, design = des_eusilc_rep, FUN = svyarpr, deff = FALSE)
db040 eqIncome se.eqIncome
Burgenland Burgenland 0.1953984 0.016713791
Carinthia Carinthia 0.1308627 0.012061625
Lower Austria Lower Austria 0.1384362 0.007294696
Salzburg Salzburg 0.1378734 0.010050357
Styria Styria 0.1437464 0.008558783
Tyrol Tyrol 0.1530819 0.010328225
Upper Austria Upper Austria 0.1088977 0.006212301
Vienna Vienna 0.1723468 0.007259732
Vorarlberg Vorarlberg 0.1653731 0.012792618The functions of the library convey are called in a similar way to the functions in library survey.
It is also possible to deal with missing values by using the argument
na.rm.
# survey.design using a variable with missings
svygini( ~ py010n , design = des_eusilc )
gini SE
py010n NA NA
svygini( ~ py010n , design = des_eusilc , na.rm = TRUE )
gini SE
py010n 0.64606 0.0036
# svyrep.design using a variable with missings
svygini( ~ py010n , design = des_eusilc_rep )
gini SE
py010n NA NA
svygini( ~ py010n , design = des_eusilc_rep , na.rm = TRUE )
gini SE
py010n 0.64606 0.0043In what follows, we often use the linearization method as a tool to produce an approximation for the variance of an estimator. From the linearized variable \(z\) of an estimator \(T\), we get from the expression @ref(eq:var) an estimate of the variance of \(T\)
If \(T\) can be expressed as a
function of the population totals \(T = g(Y_1,
Y_2, \ldots, Y_n)\), and if \(g\) is linear, the estimation of the
variance of \(T = g(Y_1, Y_2, \ldots,
Y_n)\) is straightforward. If \(g\) is not linear but is a ‘smooth’
function, then it is possible to approximate the variance of \(g(Y_1, Y_2, \ldots, Y_n)\) by the variance
of its first order Taylor expansion. For example, we can use Taylor
expansion to linearize the ratio of two totals. However, there are
situations where Taylor linearization cannot be immediately possible,
either because \(T\) cannot be
expressed as functions of the population totals, or because \(g\) is not a smooth function.
An example is the case where \(T\) is a
quantile.
In these cases, it might work an alternative form of linearization of
\(T\), by
Influence Function, as defined in @ref(eq:lin), proposed in
Deville (1999). Also, it coud be used
replication methods such as bootstrap and
jackknife.
In the convey library, there are some basic functions
that produce the linearized variables needed to measure income
concentration and poverty. For example, looking at the income variable
in some complex survey dataset, the quantile of that income
variable can be linearized by the function
convey::svyiqalpha and the sum total below any quantile of
the variable is linearized by the function
convey::svyisq.
From the linearized variables of these basic estimates, it is possible by using rules of composition, valid for influence functions, to derive the influence function of more complex estimates. By definition the influence function is a Gateaux derivative and the rules rules of composition valid for Gateaux derivatives also hold for Influence Functions.
The following property of Gateaux derivatives was often used in the library convey. Let \(g\) be a differentiable function of \(m\) variables. Suppose we want to compute the influence function of the estimator \(g(T_1, T_2,\ldots, T_m)\), knowing the Influence function of the estimators \(T_i, i=1,\ldots, m\). Then the following holds:
\[ I(g(T_1, T_2,\ldots, T_m)) = \sum_{i=1}^m \frac{\partial g}{\partial T_i}I(T_i) \]
In the library convey this rule is implemented by the function
contrastinf which uses the R function deriv to
compute the formal partial derivatives \(\frac{\partial g}{\partial T_i}\).
For example, suppose we want to linearize the
Relative median poverty gap(rmpg), defined as the
difference between the at-risk-of-poverty threshold (arpt)
and the median of incomes less than the arpt relative to
the arprt:
\[ rmpg= \frac{arpt-medpoor} {arpt} \]
where medpoor is the median of incomes less than
arpt.
Suppose we know how to linearize arpt and
medpoor, then by applying the function
contrastinf with \[
g(T_1,T_2)= \frac{(T_1 - T_2)}{T_1}
\] we linearize the rmpg.
Using the notation in Osier (2009), the variance of the estimator \(T(\hat{M})\) can approximated by:
\[\begin{equation} Var\left[T(\hat{M})\right]\cong var\left[\sum_s w_i z_i\right] (\#eq:var) \end{equation}\]
The linearized variable \(z\) is given by the derivative of the
functional:
\[\begin{equation} z_k=lim_{t\rightarrow0}\frac{T(M+t\delta_k)-T(M)}{t}=IT_k(M) (\#eq:lin) \end{equation}\]
where, \(\delta_k\) is the Dirac measure in \(k\): \(\delta_k(i)=1\) if and only if \(i=k\).
This derivative is called Influence Function and was introduced in the area of Robust Statistics.
Some measures of poverty and income concentration are defined by
non-differentiable functions so that it is not possible to use Taylor
linearization to estimate their variances. An alternative is to use
Influence functions as described in Deville (1999) and Osier
(2009). The convey library implements this methodology to work
with survey.design objects and also with
svyrep.design objects.
Some examples of these measures are:
At-risk-of-poverty threshold: \(arpt=.60q_{.50}\) where \(q_{.50}\) is the income median;
At-risk-of-poverty rate \(arpr=\frac{\sum_U 1(y_i \leq arpt)}{N}.100\)
Quintile share ratio
\(qsr=\frac{\sum_U 1(y_i>q_{.80})}{\sum_U 1(y_i\leq q_{.20})}\)
Note that it is not possible to use Taylor linearization for these measures because they depend on quantiles and the Gini is defined as a function of ranks. This could be done using the approach proposed by Deville (1999) based upon influence functions.
Let \(U\) be a population of size \(N\) and \(M\) be a measure that allocates mass one to the set composed by one unit, that is \(M(i)=M_i= 1\) if \(i\in U\) and \(M(i)=0\) if \(i\notin U\)
Now, a population parameter \(\theta\) can be expressed as a functional of \(M\) \(\theta=T(M)\)
Examples of such parameters are:
Total: \(Y=\sum_Uy_i=\sum_U y_iM_i=\int ydM=T(M)\)
Ratio of two totals: \(R=\frac{Y}{X}=\frac{\int y dM}{\int x dM}=T(M)\)
Cumulative distribution function: \(F(x)=\frac{\sum_U 1(y_i\leq x)}{N}=\frac{\int 1(y\leq x)dM}{\int{dM}}=T(M)\)
To estimate these parameters from the sample, we replace the measure \(M\) by the estimated measure \(\hat{M}\) defined by: \(\hat{M}(i)=\hat{M}_i= w_i\) if \(i\in s\) and \(\hat{M}(i)=0\) if \(i\notin s\).
The estimators of the population parameters can then be expressed as functional of the measure \(\hat{M}\).
Total: \(\hat{Y}=T(\hat{M})=\int yd\hat{M}=\sum_s w_iy_i\)
Ratio of totals: \(\hat{R}=T(\hat{M})=\frac{\int y d\hat{M}}{\int x d\hat{M}}=\frac{\sum_s w_iy_i}{\sum_s w_ix_i}\)
Cumulative distribution function: \(\hat{F}(x)=T(\hat{M})=\frac{\int 1(y\leq x)d\hat{M}}{\int{d\hat{M}}}=\frac{\sum_s w_i 1(y_i\leq x)}{\sum_s w_i}\)
Total: \[ \begin{aligned} IT_k(M)&=lim_{t\rightarrow 0}\frac{T(M+t\delta_k)-T(M)}{t}\\ &=lim_{t\rightarrow 0}\frac{\int y.d(M+t\delta_k)-\int y.dM}{t}\\ &=lim_{t\rightarrow 0}\frac{\int yd(t\delta_k)}{t}=y_k \end{aligned} \]
Ratio of two totals: \[ \begin{aligned} IR_k(M)&=I\left(\frac{U}{V}\right)_k(M)=\frac{V(M)\times IU_k(M)-U(M)\times IV_k(M)}{V(M)^2}\\ &=\frac{X y_k-Y x_k}{X^2}=\frac{1}{X}(y_k-Rx_k) \end{aligned} \]
\[ z_k= -\frac{0.6}{f(m)}\times\frac{1}{N}\times\left[I(y_k\leq m-0.5) \right] \]
\[ arpr=\frac{\sum_U I(y_i \leq t)}{\sum_U w_i}.100 \] \[ z_k=\frac{1}{N}\left[I(y_k\leq t)-t\right]-\frac{0.6}{N}\times\frac{f(t)}{f(m)}\left[I(y_k\leq m)-0.5\right] \]
where:
\(N\) - population size;
\(t\) - at-risk-of-poverty threshold;
\(y_k\) - income of person \(k\);
\(m\) - median income;
\(f\) - income density function;
All major functions in the library convey have S3 methods for the
classes: survey.design, svyrep.design and
DBIdesign. When the argument design is
a survey design with replicate weights created by the library survey,
convey uses the method svrepdesign.
Considering the remarks in (Wolter
1985), p. 163, concerning the deficiency of the
Jackknife method in estimating the variance of
quantiles, we adopted the type bootstrap instead.
The function bootVar from the library
laeken , (Alfons and Templ
2013), also uses the bootstrap method to estimate variances.
Some inequality and multidimensional poverty measures can be
decomposed. As of December 2016, the decomposition methods in
convey are limited to group decomposition.
For instance, the generalized entropy index can be decomposed into between and within group components. This sheds light on a very simple question: of the overall inequality, how much can be explained by inequalities between groups and within groups? Since this measure is additive decomposable, one can get estimates of the coefficients, SEs and covariance between components. For a more practical approach, see (Lima 2013).
The Alkire-Foster class of multidimensional poverty indices can be decomposed by dimension and groups. This shows how much each group (or dimension) contribute to the overall poverty.
This technique can help understand where and who is more affected by inequality and poverty, contributing to more specific policy and economic analysis.