vignettes/01-Introduction.Rmd
In this book, we demonstrate how to measure poverty and income concentration in a population based on microdata collected from a complex survey sample. Most surveys administered by government agencies or larger research organizations use a sampling design that departs from simple random sampling (SRS), typically through features such as unequal selection probabilities, stratification, clustering, and weight adjustments for nonresponse.
Therefore, basic unweighted R commands such as mean() or glm() will not properly account for the weighting nor for the measures of uncertainty (such as confidence intervals) present in the dataset. For some examples of publicly available complex survey data sets, see http://asdfree.com.
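As a minimal sketch of why the weights matter, the base R snippet below compares an unweighted and a weighted mean. The incomes and weights are made-up toy values (not taken from any survey), chosen so that a small high-income group is oversampled and therefore carries small weights.

```r
# Toy sample (hypothetical values): the first three units come from an
# oversampled high-income group, so each represents few population units.
income <- c(50, 60, 55, 20, 22)
w      <- c(10, 10, 10, 100, 100)   # sampling weights

unweighted <- mean(income)              # treats the sample as if it were SRS
weighted   <- weighted.mean(income, w)  # sum(w * income) / sum(w)

print(c(unweighted = unweighted, weighted = weighted))
```

The unweighted mean is pulled toward the oversampled group. The weighted mean corrects the point estimate, but a design-based standard error still requires the full design information, which is what the survey design objects described next provide.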
Unlike other software, the R convey package does not require that the user specify these parameters throughout the analysis. So long as the svydesign object or svrepdesign object has been constructed properly at the outset of the analysis, the convey package will incorporate the survey design automatically and produce statistics and variances that take the complex sample into account.
In the following example, we’ve loaded the data set eusilc from the R library laeken (Alfons and Templ 2013).
Next, we create an object of class survey.design using the function svydesign of the library survey:
# load the eusilc data set shipped with the laeken package
library(laeken)
data(eusilc)
library(survey)
des_eusilc <- svydesign(ids = ~rb030, strata = ~db040, weights = ~rb050, data = eusilc)
Right after the creation of the design object des_eusilc, we should use the function convey_prep, which adds an attribute to the survey design. This attribute saves information on the design object based upon the whole sample, which is needed to work with subset designs.
library(convey)
des_eusilc <- convey_prep( des_eusilc )
To estimate the at-risk-of-poverty rate, we use the function svyarpr:
svyarpr(~eqIncome, design=des_eusilc)
            arpr     SE
eqIncome 0.14444 0.0028
To estimate the at-risk-of-poverty rate across domains defined by the variable db040 we use:
svyby(~eqIncome, by = ~db040, design = des_eusilc, FUN = svyarpr, deff = FALSE)
                      db040  eqIncome          se
Burgenland       Burgenland 0.1953984 0.017202852
Carinthia         Carinthia 0.1308627 0.010606502
Lower Austria Lower Austria 0.1384362 0.006513217
Salzburg           Salzburg 0.1378734 0.011581408
Styria               Styria 0.1437464 0.007453192
Tyrol                 Tyrol 0.1530819 0.009884094
Upper Austria Upper Austria 0.1088977 0.005933094
Vienna               Vienna 0.1723468 0.007684540
Vorarlberg       Vorarlberg 0.1653731 0.013756389
Using the same data set, we estimate the quintile share ratio:
# for the whole population
svyqsr(~eqIncome, design=des_eusilc, alpha1= .20)
          qsr     SE
eqIncome 3.97 0.0426
# for domains
svyby(~eqIncome, by = ~db040, design = des_eusilc,
      FUN = svyqsr, alpha1= .20, deff = FALSE)
                      db040 eqIncome         se
Burgenland       Burgenland 5.008486 0.32755685
Carinthia         Carinthia 3.562404 0.10909726
Lower Austria Lower Austria 3.824539 0.08783599
Salzburg           Salzburg 3.768393 0.17015086
Styria               Styria 3.464305 0.09364800
Tyrol                 Tyrol 3.586046 0.13629739
Upper Austria Upper Austria 3.668289 0.09310624
Vienna               Vienna 4.654743 0.13135731
Vorarlberg       Vorarlberg 4.366511 0.20532075
These functions can be used as S3 methods for the classes survey.design and svyrep.design.
Let’s create a design object of class svyrep.design and run the function convey_prep on it:
des_eusilc_rep <- as.svrepdesign(des_eusilc, type = "bootstrap")
des_eusilc_rep <- convey_prep(des_eusilc_rep)
and then use the function svyarpr:
svyarpr(~eqIncome, design=des_eusilc_rep)
            arpr     SE
eqIncome 0.14444 0.0025
svyby(~eqIncome, by = ~db040, design = des_eusilc_rep, FUN = svyarpr, deff = FALSE)
                      db040  eqIncome se.eqIncome
Burgenland       Burgenland 0.1953984 0.016713791
Carinthia         Carinthia 0.1308627 0.012061625
Lower Austria Lower Austria 0.1384362 0.007294696
Salzburg           Salzburg 0.1378734 0.010050357
Styria               Styria 0.1437464 0.008558783
Tyrol                 Tyrol 0.1530819 0.010328225
Upper Austria Upper Austria 0.1088977 0.006212301
Vienna               Vienna 0.1723468 0.007259732
Vorarlberg       Vorarlberg 0.1653731 0.012792618
The functions of the library convey are called in a similar way to the functions in library survey.
It is also possible to deal with missing values by using the argument na.rm.
# survey.design using a variable with missings
svygini( ~ py010n , design = des_eusilc )
       gini SE
py010n   NA NA
svygini( ~ py010n , design = des_eusilc , na.rm = TRUE )
          gini     SE
py010n 0.64606 0.0036
# svyrep.design using a variable with missings
svygini( ~ py010n , design = des_eusilc_rep )
       gini SE
py010n   NA NA
svygini( ~ py010n , design = des_eusilc_rep , na.rm = TRUE )
          gini     SE
py010n 0.64606 0.0043
In what follows, we often use the linearization method as a tool to approximate the variance of an estimator. From the linearized variable \(z\) of an estimator \(T\), the expression @ref(eq:var) gives an estimate of the variance of \(T\).
If \(T\) can be expressed as a function of the population totals, \(T = g(Y_1, Y_2, \ldots, Y_n)\), and if \(g\) is linear, the estimation of the variance of \(T\) is straightforward. If \(g\) is not linear but is a ‘smooth’ function, then it is possible to approximate the variance of \(g(Y_1, Y_2, \ldots, Y_n)\) by the variance of its first-order Taylor expansion. For example, we can use Taylor expansion to linearize the ratio of two totals. However, there are situations where Taylor linearization is not immediately possible, either because \(T\) cannot be expressed as a function of the population totals, or because \(g\) is not a smooth function. An example is the case where \(T\) is a quantile.
In these cases, an alternative form of linearization of \(T\) may work: the Influence Function, as defined in @ref(eq:lin), proposed in Deville (1999). Replication methods such as bootstrap and jackknife could also be used.
In the convey library, there are some basic functions that produce the linearized variables needed to measure income concentration and poverty. For example, looking at the income variable in some complex survey dataset, the quantile of that income variable can be linearized by the function convey::svyiqalpha and the sum total below any quantile of the variable is linearized by the function convey::svyisq.
From the linearized variables of these basic estimates, it is possible to derive the influence function of more complex estimates by using rules of composition that are valid for influence functions. By definition, the influence function is a Gateaux derivative, so the rules of composition valid for Gateaux derivatives also hold for influence functions.
The following property of Gateaux derivatives is often used in the library convey. Let \(g\) be a differentiable function of \(m\) variables. Suppose we want to compute the influence function of the estimator \(g(T_1, T_2,\ldots, T_m)\), knowing the influence functions of the estimators \(T_i, i=1,\ldots, m\). Then the following holds:
\[ I(g(T_1, T_2,\ldots, T_m)) = \sum_{i=1}^m \frac{\partial g}{\partial T_i}I(T_i) \]
In the library convey this rule is implemented by the function contrastinf, which uses the R function deriv to compute the formal partial derivatives \(\frac{\partial g}{\partial T_i}\).
For example, suppose we want to linearize the relative median poverty gap (rmpg), defined as the difference between the at-risk-of-poverty threshold (arpt) and the median of incomes less than the arpt, relative to the arpt:
\[ rmpg= \frac{arpt-medpoor} {arpt} \]
where medpoor is the median of incomes less than arpt.
Suppose we know how to linearize arpt and medpoor. Then, by applying the function contrastinf with
\[ g(T_1,T_2)= \frac{(T_1 - T_2)}{T_1} \]
we linearize the rmpg.
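The composition rule can be sketched with base R alone. The snippet below is not the internal code of contrastinf; it only illustrates the idea of using deriv to obtain the partial derivatives of \(g\) and combining them with linearized variables. The point estimates and the per-unit linearized values are hypothetical numbers chosen for illustration.

```r
# Hypothetical point estimates (illustrative values, not computed from eusilc)
arpt    <- 10000
medpoor <- 8000

# Symbolic partial derivatives of g(T1, T2) = (T1 - T2) / T1
g <- deriv(~ (T1 - T2) / T1, c("T1", "T2"))
val  <- eval(g, list(T1 = arpt, T2 = medpoor))
grad <- attr(val, "gradient")   # 1 x 2 matrix with columns "T1" and "T2"

# Hypothetical linearized variables of arpt and medpoor for three sampled units
z_arpt    <- c(0.5, -0.2, 0.1)
z_medpoor <- c(0.3,  0.0, -0.4)

# Composition rule: I(g) = dg/dT1 * I(T1) + dg/dT2 * I(T2)
z_rmpg <- grad[1, "T1"] * z_arpt + grad[1, "T2"] * z_medpoor

print(as.numeric(val))   # rmpg point estimate for these inputs
print(z_rmpg)            # linearized variable of rmpg, one value per unit
```

Here the derivatives evaluate to \(T_2/T_1^2\) and \(-1/T_1\), which can be checked by hand against the formula for \(g\).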
Using the notation in Osier (2009), the variance of the estimator \(T(\hat{M})\) can be approximated by:
\[\begin{equation} Var\left[T(\hat{M})\right]\cong var\left[\sum_s w_i z_i\right] (\#eq:var) \end{equation}\]
The linearized variable \(z\) is given by the derivative of the functional:
\[\begin{equation} z_k=\lim_{t\rightarrow0}\frac{T(M+t\delta_k)-T(M)}{t}=IT_k(M) (\#eq:lin) \end{equation}\]
where \(\delta_k\) is the Dirac measure in \(k\): \(\delta_k(i)=1\) if and only if \(i=k\).
This derivative is called Influence Function and was introduced in the area of Robust Statistics.
Some measures of poverty and income concentration are defined by non-differentiable functions, so it is not possible to use Taylor linearization to estimate their variances. An alternative is to use influence functions as described in Deville (1999) and Osier (2009). The convey library implements this methodology to work with survey.design objects and also with svyrep.design objects.
Some examples of these measures are:
At-risk-of-poverty threshold: \(arpt=.60q_{.50}\) where \(q_{.50}\) is the income median;
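At-risk-of-poverty rate: \(arpr=\frac{\sum_U 1(y_i \leq arpt)}{N}\times 100\);
Quintile share ratio: \(qsr=\frac{\sum_U 1(y_i>q_{.80})y_i}{\sum_U 1(y_i\leq q_{.20})y_i}\), the ratio of the total income above the 80th percentile to the total income at or below the 20th percentile.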
Note that it is not possible to use Taylor linearization for these measures because they depend on quantiles, and the Gini is defined as a function of ranks. Their variances can instead be estimated using the approach proposed by Deville (1999), based upon influence functions.
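To make the first two definitions concrete, here is a small base R sketch that evaluates them on a toy population. The income vector is made up for illustration, and every unit is treated as a full population unit (no weights).

```r
# Toy population of ten incomes (hypothetical values)
y <- c(5, 8, 10, 12, 15, 18, 20, 25, 40, 80)

# At-risk-of-poverty threshold: 60% of the median income
arpt <- 0.60 * median(y)

# At-risk-of-poverty rate: percentage of units with income at or below arpt
arpr <- 100 * mean(y <= arpt)

print(c(arpt = arpt, arpr = arpr))
```

Both quantities depend on the median, which is why a small change in the data can move the threshold and the rate together, and why their variance cannot be obtained by linearizing a smooth function of totals.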
Let \(U\) be a population of size \(N\) and \(M\) be a measure that allocates mass one to the set composed by one unit, that is, \(M(i)=M_i= 1\) if \(i\in U\) and \(M(i)=0\) if \(i\notin U\).
Now, a population parameter \(\theta\) can be expressed as a functional of \(M\): \(\theta=T(M)\).
Examples of such parameters are:
Total: \(Y=\sum_Uy_i=\sum_U y_iM_i=\int ydM=T(M)\)
Ratio of two totals: \(R=\frac{Y}{X}=\frac{\int y dM}{\int x dM}=T(M)\)
Cumulative distribution function: \(F(x)=\frac{\sum_U 1(y_i\leq x)}{N}=\frac{\int 1(y\leq x)dM}{\int{dM}}=T(M)\)
To estimate these parameters from the sample, we replace the measure \(M\) by the estimated measure \(\hat{M}\) defined by: \(\hat{M}(i)=\hat{M}_i= w_i\) if \(i\in s\) and \(\hat{M}(i)=0\) if \(i\notin s\).
The estimators of the population parameters can then be expressed as functionals of the measure \(\hat{M}\):
Total: \(\hat{Y}=T(\hat{M})=\int yd\hat{M}=\sum_s w_iy_i\)
Ratio of totals: \(\hat{R}=T(\hat{M})=\frac{\int y d\hat{M}}{\int x d\hat{M}}=\frac{\sum_s w_iy_i}{\sum_s w_ix_i}\)
Cumulative distribution function: \(\hat{F}(x)=T(\hat{M})=\frac{\int 1(y\leq x)d\hat{M}}{\int{d\hat{M}}}=\frac{\sum_s w_i 1(y_i\leq x)}{\sum_s w_i}\)
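The three estimators above reduce to weighted sums over the sample. A minimal base R sketch follows, using a made-up sample of four units with hypothetical weights.

```r
# Toy sample (hypothetical values): y and x are study variables, w the weights
y <- c(10, 20, 30, 40)
x <- c(2, 4, 5, 8)
w <- c(100, 50, 50, 25)

Y_hat <- sum(w * y)               # estimated total of y
X_hat <- sum(w * x)               # estimated total of x
R_hat <- Y_hat / X_hat            # estimated ratio of two totals

# Estimated cumulative distribution function evaluated at v
F_hat <- function(v) sum(w * (y <= v)) / sum(w)

print(c(Y_hat = Y_hat, R_hat = R_hat, F_hat_20 = F_hat(20)))
```

Each estimator is the corresponding population functional with the measure \(M\) replaced by the weights, which is exactly the substitution of \(\hat{M}\) for \(M\) described above.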
Total:
\[ \begin{aligned} IT_k(M)&=\lim_{t\rightarrow 0}\frac{T(M+t\delta_k)-T(M)}{t}\\ &=\lim_{t\rightarrow 0}\frac{\int y\,d(M+t\delta_k)-\int y\,dM}{t}\\ &=\lim_{t\rightarrow 0}\frac{\int y\,d(t\delta_k)}{t}=y_k \end{aligned} \]
Ratio of two totals:
\[ \begin{aligned} IR_k(M)&=I\left(\frac{U}{V}\right)_k(M)=\frac{V(M)\times IU_k(M)-U(M)\times IV_k(M)}{V(M)^2}\\ &=\frac{X y_k-Y x_k}{X^2}=\frac{1}{X}(y_k-Rx_k) \end{aligned} \]
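These two results can be checked numerically by approximating the limit with a small finite step. The sketch below uses a made-up population of five units; the helper gateaux is a hypothetical name introduced here for illustration and is not part of convey.

```r
# Toy population (hypothetical values); M puts mass one on every unit
y <- c(12, 7, 20, 15, 9)
x <- c(3, 2, 5, 4, 3)
M <- rep(1, length(y))

T_total <- function(m) sum(m * y)               # total of y as a functional of M
T_ratio <- function(m) sum(m * y) / sum(m * x)  # ratio of two totals

# Finite-difference approximation of the Gateaux derivative in direction delta_k
gateaux <- function(Tfun, m, k, t = 1e-6) {
  delta <- replace(numeric(length(m)), k, 1)
  (Tfun(m + t * delta) - Tfun(m)) / t
}

units <- seq_along(y)
num_total <- sapply(units, function(k) gateaux(T_total, M, k))
num_ratio <- sapply(units, function(k) gateaux(T_ratio, M, k))

# Analytical influence functions derived above
X <- sum(x)
R <- sum(y) / X
ana_total <- y
ana_ratio <- (y - R * x) / X

print(cbind(num_total, ana_total, num_ratio, ana_ratio))
```

The numerical and analytical columns agree up to the finite-difference error, which shrinks as the step t gets smaller.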
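At-risk-of-poverty threshold:
\[ arpt=0.6\times m \]
\[ z_k= -\frac{0.6}{f(m)}\times\frac{1}{N}\times\left[I(y_k\leq m)-0.5 \right] \]
At-risk-of-poverty rate:
\[ arpr=\frac{\sum_U I(y_i \leq t)}{N}\times 100 \]
\[ z_k=\frac{1}{N}\left[I(y_k\leq t)-F(t)\right]-\frac{0.6}{N}\times\frac{f(t)}{f(m)}\left[I(y_k\leq m)-0.5\right] \]
where:
\(N\) - population size;
\(t\) - at-risk-of-poverty threshold;
\(y_k\) - income of person \(k\);
\(m\) - median income;
\(f\) - income density function;
\(F\) - income cumulative distribution function, so that the second \(z_k\) linearizes the rate expressed as the proportion \(F(t)\).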
All major functions in the library convey have S3 methods for the classes survey.design, svyrep.design and DBIdesign. When the argument design is a survey design with replicate weights created by the library survey, convey uses the svyrep.design method.
Considering the remarks in Wolter (1985, p. 163) concerning the deficiency of the jackknife method in estimating the variance of quantiles, we adopted the bootstrap type instead.
The function bootVar from the library laeken (Alfons and Templ 2013) also uses the bootstrap method to estimate variances.
Some inequality and multidimensional poverty measures can be decomposed. As of December 2016, the decomposition methods in convey are limited to group decomposition.
For instance, the generalized entropy index can be decomposed into between-group and within-group components. This sheds light on a very simple question: of the overall inequality, how much can be explained by inequalities between groups and how much by inequalities within groups? Since this measure is additively decomposable, one can get estimates of the coefficients, SEs and covariance between components. For a more practical approach, see Lima (2013).
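As an illustration of additive decomposability, the base R sketch below decomposes the Theil T index (the generalized entropy index with parameter 1) into within-group and between-group parts. It is a hand-rolled point-estimate calculation on made-up incomes and groups, not the convey implementation, and it produces no standard errors.

```r
# Toy data (hypothetical values): incomes, group labels, and unit weights
y <- c(10, 12, 15, 30, 45, 60)
g <- c("a", "a", "a", "b", "b", "b")
w <- rep(1, length(y))

# Weighted Theil T index: mean of (y / mu) * log(y / mu)
theil <- function(y, w) {
  mu <- sum(w * y) / sum(w)
  sum(w * (y / mu) * log(y / mu)) / sum(w)
}

total  <- theil(y, w)
mu     <- sum(w * y) / sum(w)
groups <- split(seq_along(y), g)

# Per-group income share, mean income, and Theil index
share <- sapply(groups, function(i) sum(w[i] * y[i]) / sum(w * y))
mu_g  <- sapply(groups, function(i) sum(w[i] * y[i]) / sum(w[i]))
T_g   <- sapply(groups, function(i) theil(y[i], w[i]))

within  <- sum(share * T_g)             # income-share-weighted group indices
between <- sum(share * log(mu_g / mu))  # inequality between group means

print(c(total = total, within = within, between = between))
```

The within and between components add up to the overall index, which is the property that makes the question "how much inequality is due to differences between groups" well defined.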
The Alkire-Foster class of multidimensional poverty indices can be decomposed by dimension and by group. This shows how much each group (or dimension) contributes to the overall poverty.
This technique can help identify where inequality and poverty are concentrated and who is most affected by them, contributing to more specific policy and economic analysis.