Survey Data Analysis with {Kobocruncher}

class: center, middle, inverse, title-slide

.title[
# Survey Data Analysis with {Kobocruncher}
]
.subtitle[
## Session 6 - Indicator Calculation
]
.author[
### <a href="https://edouard-legoupil.github.io/kobocruncher/">Link to Documentation</a> – <a href="05-Searching_Asssociation.html">Link to Previous Session</a> – <a href="07-Weighting.html">Link to Next Session</a>
]
.date[
### Training Content as of 02 November 2023
]

---

## Use cases for new calculated variables
 
 
 * Create a filter on specific criteria 
 
 * Ratio between 2 numeric variables 
 
 * Calculation on date 
 
 * Discretization of numeric variable according to quintile 
 
 * Discretization of numeric variable according to fixed break 
 
 * Aggregate variable from nested frame (aka within repeat) to parent table 

---

## kobo_indicator
 
In kobocruncher, calculated variables are created with the [kobo_indicator](https://edouard-legoupil.github.io/kobocruncher/reference/kobo_indicator.html) function. The function goes through the following steps:

1 - load the indicators already defined in the xlsform,

2 - append the new indicator supplied to the function, if any,

3 - apply the indicators, i.e. do the calculation,

4 - re-save all the working indicator definitions within the extended xlsform, in the dedicated indicator worksheet,

5 - bind the new indicators in the dictionary for further plotting,

6 - rebuild the plan if indicators are allocated to a chapter or subchapter.
 
---

## Create a new variable based on a combination of specific criteria

.pull-left[

When adding a new indicator, a few elements should be provided:

 * Name and Label for the new variable: ideally the name should be concise and meaningful (less than 12 characters), and the label should be less than 80 characters
 
 * Type: a calculated variable will be either of type `select_one` or `numeric`. Documenting the indicator type will allow the indicator to be crunched automatically
 
 * list_name and list_label are the labelling for the response options in case the indicator is of type `select_one`. This will allow for automatic relabelling for charting 
 
 * repeatvar: you need to document the frame in which the new indicator will be calculated. For most household surveys, this will be either the household (aka the first frame, named main by default and referenced as `datalist[[\"main\"]]`) or the frame for individuals (for instance, if the frame was named members within the repeat of the xlsform, `datalist[[\"members\"]]`)
 
 * Calculation: this is the complex part. The calculation should be defined as an R statement using data manipulation functions. To build the statement, you need to identify the correct variable names and response names.

]
.pull-right[

```r
indicatoradd <- c(name = "inColombia",
                  label = "Is from Colombia",
                  type = "select_one",
                  repeatvar = "datalist[[\"main\"]]",
                  calculation = "dplyr::if_else(datalist[[\"main\"]]$variable == \"criteria\",
                                                \"yes\", \"no\")")

## then we add our indicators and expand
expanded <- kobo_indicator(datalist = datalist,
                           dico = dico,
                           indicatoradd = indicatoradd,
                           xlsformpath = xlsformpath,
                           xlsformpathout = xlsformpathout)
dico <- expanded[["dico"]]
datalist <- expanded[["datalist"]]
```

]
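
The naming guidance above can be checked before an indicator is submitted. The sketch below is only an illustration: `check_indicator()` is a hypothetical helper, not a kobocruncher function, and the list of required fields is an assumption taken from the example on this slide.

```r
# Hypothetical helper (NOT part of kobocruncher): returns a character vector
# of problems found in an indicator definition, empty if none are found.
check_indicator <- function(ind, max_name = 12, max_label = 80) {
  problems <- character(0)
  # Fields used in the slide example; assumed here to be the required ones
  required <- c("name", "label", "type", "repeatvar", "calculation")
  missing_fields <- setdiff(required, names(ind))
  if (length(missing_fields) > 0) {
    problems <- c(problems, paste("missing field:", missing_fields))
  }
  if ("name" %in% names(ind) && nchar(ind[["name"]]) >= max_name) {
    problems <- c(problems,
                  sprintf("name should be shorter than %d characters", max_name))
  }
  if ("label" %in% names(ind) && nchar(ind[["label"]]) >= max_label) {
    problems <- c(problems,
                  sprintf("label should be shorter than %d characters", max_label))
  }
  problems
}

good <- c(name = "inColombia",
          label = "Is from Colombia",
          type = "select_one",
          repeatvar = "datalist[[\"main\"]]",
          calculation = "dplyr::if_else(datalist[[\"main\"]]$variable == \"criteria\", \"yes\", \"no\")")

bad <- c(name = "a_very_long_indicator_name", label = "Too long a name")

check_indicator(good)  # no problems reported
check_indicator(bad)   # reports the missing fields and the long name
```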

---

## The key data manipulation verbs for indicator calculations

* `dplyr::mutate()` is used to create a new variable

* `dplyr::filter()` extracts rows that meet logical criteria, for instance when you want to extract the information on the head of household from the nested information and append it to the household information

* `dplyr::if_else()` & `dplyr::case_when()` are used to apply a single condition (if) or multiple conditions (case) in order to create a calculated variable

* `dplyr::group_by()`, `dplyr::summarise()` & `dplyr::ungroup()` are used to perform aggregation and calculation (count, sum) based on a specific group

* `dplyr::left_join()` is used to merge information from 2 frames, for instance the information available at household level on one side and the information at individual level on the other
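
As a minimal sketch of how these verbs combine, the example below extracts the head of household from a toy individual-level frame and appends it to a toy household frame. All frame and column names (`index`, `parent_index`, `relation`, `age`) are assumptions made for illustration and will differ in a real dataset.

```r
library(dplyr)

# Toy household frame and nested individual frame (hypothetical names)
main <- data.frame(index = 1:2)
members <- data.frame(
  parent_index = c(1L, 1L, 2L),
  relation     = c("head", "child", "head"),
  age          = c(40, 7, 67)
)

heads <- members |>
  # keep only the rows describing the head of household
  filter(relation == "head") |>
  mutate(
    # multiple conditions: classify the age of the head
    head_agegroup = case_when(
      age < 18 ~ "child",
      age < 60 ~ "adult",
      TRUE     ~ "elderly"
    ),
    # single condition: yes/no flag
    head_is_elderly = if_else(age >= 60, "yes", "no")
  ) |>
  select(parent_index, head_agegroup, head_is_elderly)

# append the head-of-household information to the household frame
main <- main |>
  left_join(heads, by = c("index" = "parent_index"))

main
```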

???

https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf 
https://posit.co/wp-content/uploads/2022/10/tidyr.pdf

---

## Ratio between 2 numeric variables

.pull-left[

]
.pull-right[

```r
indicatoradd <- c(  name =  "ratio",
                    label = "Ratio varnum1 on varnum2",
                    type = "numeric",
                    repeatvar = "datalist[[\"main\"]]",
        # each variable is accessed with $ on the main frame
        calculation = "datalist[[\"main\"]]$varnum1 /
                        datalist[[\"main\"]]$varnum2"
                            )
```

]
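
One way to sanity-check a calculation string before passing it to `kobo_indicator` is to evaluate it against a small toy `datalist`. This is only a sketch: the toy data and the `eval(parse())` call are illustrative assumptions, not a description of how kobocruncher evaluates the calculation internally.

```r
# Toy data standing in for the real main frame (values are made up)
datalist <- list(
  main = data.frame(varnum1 = c(10, 4, 9),
                    varnum2 = c(2, 8, 0))
)

calculation <- "datalist[[\"main\"]]$varnum1 / datalist[[\"main\"]]$varnum2"

# Evaluate the string as R code to see what the indicator would contain
ratio <- eval(parse(text = calculation))
ratio

# In R, dividing a non-zero number by zero gives Inf; replacing those
# cases with NA is one possible way to keep them out of summary statistics
ratio_safe <- ifelse(datalist[["main"]]$varnum2 == 0, NA_real_, ratio)
ratio_safe
```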
 
---

## Calculation on date

.pull-left[

]
.pull-right[

```r
indicatoradd <- c(  name =  "duration",
                    label = "Difference between today and datetocheck",
                    type = "numeric",
                    repeatvar = "datalist[[\"main\"]]",
        calculation = "lubridate::interval( datalist[[\"main\"]]$datetocheck,
                        lubridate::today()) %/% months(1)"
                            )
```

]
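
The interval calculation can be tried on a few made-up dates first. In this sketch a fixed reference date replaces `lubridate::today()` so that the result is reproducible; the dates themselves are assumptions for illustration.

```r
suppressPackageStartupMessages(library(lubridate))

# Made-up dates to check, and a fixed reference date instead of today()
datetocheck <- as.Date(c("2023-01-15", "2022-11-02"))
reference   <- as.Date("2023-11-02")

# Number of whole months elapsed between each date and the reference date
months_elapsed <- interval(datetocheck, reference) %/% months(1)
months_elapsed
```

Integer division by `months(1)` counts only completed months, so a partial month at the end of the interval is dropped.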
 
---

## Discretization of numeric variable according to quintile

.pull-left[

]
.pull-right[

```r
indicatoradd <- c(  name =  "varnum_cat",
                    label = "Discretise the varnum into quintile",
                    type = "select_one",
                    repeatvar = "datalist[[\"main\"]]",
        calculation = "Hmisc::cut2(datalist[[\"main\"]]$varnum, g = 5)"
                            )
```

]
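
To see what a quintile discretisation produces, here is a base R sketch that uses `quantile()` and `cut()` on made-up values. It is an alternative shown for illustration, not the `Hmisc::cut2()` call from the indicator definition, and the two may label or place cut points differently.

```r
# Made-up numeric variable
varnum <- 1:10

# Cut points at the 0%, 20%, 40%, 60%, 80% and 100% quantiles
breaks <- quantile(varnum, probs = seq(0, 1, by = 0.2), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value in the first class
varnum_cat <- cut(varnum, breaks = breaks, include.lowest = TRUE)

table(varnum_cat)
```

With heavily tied data several quantiles can coincide, in which case `cut()` stops with an error because the breaks are not unique.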
 
---

## Discretization of numeric variable according to fixed break

.pull-left[

For instance, converting case size from an integer to a categorical variable.
]
.pull-right[

```r
indicatoradd <- c(  name =  "varnum_cat",
                    label = "Discretise the varnum into fixed break",
                    type = "select_one",
                    repeatvar = "datalist[[\"main\"]]",
        calculation = "cut(datalist[[\"main\"]]$casesize, 
                              breaks = c(0, 1, 2, 3,5,30),
                             labels = c(\"Case.size.1\", 
                                        \"Case.size.2\", 
                                        \"Case.size.3\", 
                                        \"Case.size.4.5\", 
                                        \"Case.size.6.or.more\" ),
                         include.lowest=TRUE)"
                            )
```

]
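
The effect of the fixed breaks can be checked on a few made-up case sizes. This sketch reuses the breaks and labels from the indicator definition; only the input values are invented.

```r
# Made-up case sizes
casesize <- c(1, 2, 3, 4, 5, 6, 12)

casesize_cat <- cut(casesize,
                    breaks = c(0, 1, 2, 3, 5, 30),
                    labels = c("Case.size.1",
                               "Case.size.2",
                               "Case.size.3",
                               "Case.size.4.5",
                               "Case.size.6.or.more"),
                    include.lowest = TRUE)

# Show each value next to the class it falls into
data.frame(casesize, casesize_cat)
```

Intervals are closed on the right, so 5 falls in `Case.size.4.5` and 6 in `Case.size.6.or.more`. Any value above the last break (30) becomes `NA`.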
 
---

## Aggregate variable from nested frame (aka within repeat) to parent table

.pull-left[

]
.pull-right[

```r
indicatoradd <- c(  name =  "vnumber_female_HH",
                    label = "Number of females in the household",
                    hint = "this indicator counts the number of females 
                            as registered in the household roster",
                    type = "integer",
                    repeatvar = "datalist[[\"main\"]]",
        # count members by household and sex, then keep the female count;
        # assumes the nested frame is named members
        calculation = "datalist[[\"members\"]] |> 
                       dplyr::select( members.sex, parent_index) |> 
                       dplyr::count(parent_index, members.sex) |> 
                       tidyr::spread(members.sex, n, fill = 0) |> 
                       dplyr::select( female)"
                            )
```

]
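
The aggregation logic can also be worked through step by step outside the indicator definition. The sketch below counts females per household on toy data and joins the count back to the household frame. The column names `index` and `parent_index` and the value `"female"` are assumptions for illustration; check the actual names in your own `datalist`.

```r
library(dplyr)

# Toy household frame (4 households) and individual frame (hypothetical names)
main <- data.frame(index = 1:4)
members <- data.frame(
  parent_index = c(1L, 1L, 1L, 2L, 3L, 3L),
  members.sex  = c("female", "male", "female", "male", "female", "female")
)

# One row per household present in the roster, with its number of females
female_per_hh <- members |>
  group_by(parent_index) |>
  summarise(n_female = sum(members.sex == "female"), .groups = "drop")

# Join back to the household frame; households without any roster entry
# get NA from the join, which is replaced by 0 here
main <- main |>
  left_join(female_per_hh, by = c("index" = "parent_index")) |>
  mutate(n_female = coalesce(n_female, 0L))

main
```

Joining on the household key keeps the result aligned with the rows of the household frame, including households that have no members recorded.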

---
class: inverse, center, middle

# TIME TO PRACTISE ON YOUR OWN!

### .large[.white[<svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M368.4 18.3L312.7 74.1 437.9 199.3l55.7-55.7c21.9-21.9 21.9-57.3 0-79.2L447.6 18.3c-21.9-21.9-57.3-21.9-79.2 0zM288 94.6l-9.2 2.8L134.7 140.6c-19.9 6-35.7 21.2-42.3 41L3.8 445.8c-3.8 11.3-1 23.9 7.3 32.4L164.7 324.7c-3-6.3-4.7-13.3-4.7-20.7c0-26.5 21.5-48 48-48s48 21.5 48 48s-21.5 48-48 48c-7.4 0-14.4-1.7-20.7-4.7L33.7 500.9c8.6 8.3 21.1 11.2 32.4 7.3l264.3-88.6c19.7-6.6 35-22.4 41-42.3l43.2-144.1 2.8-9.2L288 94.6z"/></svg>] **10 minutes! **]

Open your expanded xlsform again, set up the outliers treatment, and clean the _"or_other"_ in the .large[clean] column. Then add calculated variables. Save and knit your report again!

Do not hesitate to raise your questions in the [ticket system](https://github.com/Edouard-Legoupil/kobocruncher/issues/new) or in the chat so the training content can be improved accordingly! 
 
---
class: inverse, center, middle

### .large[.white[<svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M256 0a256 256 0 1 1 0 512A256 256 0 1 1 256 0zM232 120V256c0 8 4 15.5 10.7 20l96 64c11 7.4 25.9 4.4 33.3-6.7s4.4-25.9-6.7-33.3L280 243.2V120c0-13.3-10.7-24-24-24s-24 10.7-24 24z"/></svg>] **Let's take a break! **] 
 
<div class="countdown" id="timer_e1388590" data-update-every="1" data-play-sound="true" tabindex="0" style="right:0;bottom:0;margin:5%;font-size:8em;position: relative; width: min-content;">
<div class="countdown-controls"><button class="countdown-bump-down">−</button><button class="countdown-bump-up">+</button></div>
<code class="countdown-time"><span class="countdown-digits minutes">05</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

__Next session__:

[07-Weighting](07-Weighting.html): if the data was collected through a probabilistic sampling approach, we can apply weights to the data and regenerate the report so that those weights are reflected.