class: center, middle, inverse, title-slide

.title[
# Data Collection Monitoring
]
.subtitle[
## {HighFrequencyChecks}
]
.date[
### Training Content as of 02 November 2023
]

---

---
class: inverse, center, middle

# Checks

---
## High Frequency Checks

* Corrective actions:
  * Correct set-up of data collection devices and encoding of the forms
  * Data collected according to the sampling plan

* Pro-active actions:
  * Ensuring enumerators' rigorous work standards
  * Promoting enumerators' productivity

---
## Corrective: Correct set-up of data collection devices and encoding of the forms

---
## Corrective: Data collected according to the sampling plan

---
## Pro-active: Ensuring enumerators' rigorous work standards

---
## Pro-active: Promoting enumerators' productivity

---
class: inverse, center, middle

# Cleaning

---
## When do you need to clean the data?

Survey data cleaning involves identifying and removing responses from individuals who either do not match your target audience criteria or did not answer your questions thoughtfully. This filtering is done to avoid drawing misleading conclusions.

Data cleaning remains a last-resort option that can first be minimized by:

* .large[__Quality of questionnaire design__], not only to minimize social desirability and biased questions but also to keep the interview duration limited (_ideally less than 45 minutes for a face-to-face interview and less than 25 minutes for a telephone interview_)

* .large[__Good form encoding__], with well-defined [constraints](https://xlsform.org/en/#constraints), [skip logic](https://xlsform.org/en/#relevant) and [requirements](https://xlsform.org/en/#required) to avoid inconsistent responses, plus sufficient testing to ensure that the questions are well understood and that the response options cover the expected answers

* .large[__Good training for the enumerators__], with detailed [question hints](https://xlsform.org/en/#hints), so that the enumerators fill in the questionnaire correctly

* .large[__Sufficient data collection quality monitoring__] to identify, prevent and cure issues early on. This can be done through [High Frequency Checks](https://unhcr.github.io/HighFrequencyChecks/docs/). It should help to flag early on straightlining / patterned responses, for instance when one enumerator uses the same answer option ("B") over and over (as an example, for at least five rows in a grid)...

.bg-blue[
For data quality, prevention is a lot more effective, quicker and cheaper than curing. Take the time to thoroughly test the questionnaire before starting full-on data collection.
]

???

https://dimewiki.worldbank.org/Checklist:_Data_Cleaning

https://dimewiki.worldbank.org/Data_Cleaning

---
## Cleaning is the most time-consuming task: go through your initial exploration report to identify issues!

To guide this selection phase, data experts, in collaboration with the data analysis group, can use the following elements (a quick R sketch follows on the next slide):

* For numeric variables, check the .large[__frequency distributions__] of each variable: average, deviation, including outliers and oddities

* For categorical variables, check for .large[__unexpected values__]: any odd results based on common-sense expectations

* Use cross-tabulations to verify potential .large[__illogical combinations__] of answers (for instance "pregnant men")

* Use correlation analysis to check for potential .large[__contradictions__] in respondents' answers to different questions for identified associations (chi-square)

* Always check for .large[__missing data__] (NA) or "% of respondents who answered" (in the chart caption) that you cannot confidently explain

* Check unanswered questions that correspond to .large[__unused skip logic__] in the questionnaire: for instance, did a person who was never displaced answer displacement-related questions? Were employment-related answers provided for a toddler?
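
---
## A quick exploration sketch in R

Below is a minimal, hypothetical sketch of such checks in base R. It assumes the survey data is already loaded in `datalist[["main"]]` (as in the `kobo_remove` example later in this deck) and uses made-up variable names (`water_liters`, `shelter`, `sex`, `pregnant`): adapt it to your own form.

```r
main <- datalist[["main"]]

## Frequency distribution of a numeric variable: mean, spread, outliers and oddities
summary(main$water_liters)
hist(main$water_liters)

## Unexpected values in a categorical variable
table(main$shelter, useNA = "ifany")

## Cross-tabulation to spot illogical combinations (e.g. "pregnant men")
table(main$sex, main$pregnant, useNA = "ifany")

## Share of missing values per variable, which you should be able to explain
sort(colMeans(is.na(main)), decreasing = TRUE)
```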
---
## Variables cleaning

Cleaning involves cleaning records (rows) and variables (columns). Cleaning variables is relevant in a few situations:

* System variables (precise date) and section timestamps

* Removing .large[__direct identifiers__] such as:
  * Name and surname
  * Document number (passport, national ID, driving license, etc.)
  * Address or other precise geographic information
  * GPS coordinates, if collected during face-to-face interviews. In some situations, it is possible to at least decrease the accuracy of the coordinates (removing digits)...
  * Telephone number

.large[Fix them] by setting the concerned variable as `identifier` in the `anonymise` column of the `xlsform`: see the next chapter of this presentation.

---
## Situations where you will still need minimum record cleaning a priori

Whatever the quality of the form design, enumerator training and data collection quality monitoring, there will still be cases where cleaning involves removing entire records:

* Remove from the dataset records where no consent was obtained and/or, more broadly, records matching a specific filter/condition (the respondent does not meet certain criteria, or the data comes from an unreliable enumerator identified during data collection quality monitoring)...

* Remove duplicate respondent IDs based on the original sample list

* Surveys often include nested tables (aka `repeat`), so if you remove records from the main table, you also need to remove the linked records in the nested tables

`kobo_remove` will remove records based on a specific filter. For instance, if you want to remove all the records from the enumerator edouard (assuming you have a variable called `enumerator` where the enumerator name is recorded):

> kobo_remove(datalist = datalist, filter(datalist[["main"]]$enumerator == "edouard"))

---
## Clean based on time

* Remove from the dataset records before or at specific dates

.large[Fix them] by setting within the `xlsform`:

  * Starting date of the data collection in the `clean` column / `start` row
  * Ending date of the data collection in the `clean` column / `end` row

* Remove from the dataset records whose interview duration appears as an outlier, either too long or too short, aka "speed responses" (see the sketch on the next slide)
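
---
## Interview duration: a quick check

As a rough illustration (not the package's internal logic), the base R sketch below flags interviews whose duration looks like an outlier, assuming the form recorded the standard `start` and `end` metadata in `datalist[["main"]]`.

```r
main <- datalist[["main"]]

## Interview duration in minutes, from the standard 'start' and 'end' metadata
duration <- as.numeric(difftime(as.POSIXct(main$end),
                                as.POSIXct(main$start),
                                units = "mins"))

## Durations more than 3 standard deviations from the mean, or implausibly short
too_extreme <- abs(duration - mean(duration, na.rm = TRUE)) >
  3 * sd(duration, na.rm = TRUE)
too_short <- duration < 10   # "speed responses": adapt the threshold to your questionnaire

main[which(too_extreme | too_short), c("start", "end")]
```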
---
## Situations where you will still need minimum cleaning a posteriori

In other cases, cleaning will involve .large[`recoding`] some variables:

1. Recode unexplainable .large[__outliers for numerical questions__]. An example of this would be if you asked how much water one person uses in a day and someone answered that they use 1,000 liters, while the second largest usage reported is 150 liters.

2. Recode questions resulting from .large[__"or other" choices__].

3. Recode some question answers as .large[__new calculated variables__] to obtain more balanced response categories, based on frequency or close meaning.

---
## Outliers for numerical questions

* Outliers: values significantly different from the others
* Outliers should be removed or modified only if they are (clearly) wrong values
* Common outlier definition: observations three standard deviations from the mean

.large[Identify outliers] by looking at histograms, boxplots and scatter plots from the exploration report.

.large[Fix them] by setting the maximum accepted standard deviation for the variable in the `clean` column of the `xlsform`.

---
## Recode categories and/or treat "or other" choices

Often, categorical questions include an `or other` option and this option might be mis-used. In the exploratory report, those are plotted with a word cloud. You may also have categories you would like to merge.

.large[Fix them] by setting the `clean` column of the `xlsform`. For instance, a first question `shelter` includes an option `other`, which triggers a subsequent text question `shelter_other` (a small illustration follows on the next slide).

.pull-left[
* If you want to clean the badly categorized `other` answers from `shelter`, on the row `shelter_other` in the `clean` column, insert

> shelter == "other"

.bg-blue[
This will automatically create an additional cleaning log sheet in the xlsform called `clean_shelter_other`, where the first column is the original value and the second column is the cleaned value. You can then manually edit the second column and, the next time you run `kobo_clean`, __the 'other' values__ will be automatically replaced by the values from your cleaning log whenever they differ from the original values.
]
]

.pull-right[
* If you want to adjust the current categories of `shelter`, on the row `shelter` in the `clean` column, insert

> shelter

.bg-blue[
This will automatically create an additional cleaning log sheet in the xlsform called `clean_shelter`, where the first column is the original value and the second column is the cleaned value. You can then manually edit the second column and, the next time you run `kobo_clean`, __all values__ will be automatically replaced by the values from your cleaning log whenever they differ from the original values.
]
]
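
---
## Recoding "or other": a quick illustration

As a small, hypothetical illustration of what such a cleaning log achieves (this is not the `kobo_clean` implementation itself), the base R sketch below maps badly categorized free-text `shelter_other` answers back into the main `shelter` categories.

```r
main <- datalist[["main"]]

## Hypothetical cleaning log: original free-text value -> cleaned category
cleaning_log <- data.frame(
  old = c("tent provided by ngo", "Tent", "flat"),
  new = c("tent",                 "tent", "apartment")
)

## Replace the 'other' answers that actually match an existing category
to_fix <- match(main$shelter_other, cleaning_log$old)
main$shelter[!is.na(to_fix)] <- cleaning_log$new[to_fix[!is.na(to_fix)]]

table(main$shelter, useNA = "ifany")
```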
---
class: inverse, center, middle

# Anonymisation

---
# Dissemination of microdata (i.e. raw survey data) is crucial

* Reducing duplication in data collection;
* Improving the reliability and relevance of data;
* Supporting research and promoting the development of new tools for using data;
* Enhancing the credibility of UNHCR as an authoritative source of information on refugees.

---
## But it first requires proper anonymisation

Suppose a hypothetical intruder has access to some released microdata and attempts to identify or find out more information about a particular respondent. Disclosure, also known as re-identification, occurs when the intruder reveals previously unknown information about a respondent by using the released data.

Three types of disclosure can be distinguished:

* __Identity disclosure__ occurs if the intruder associates a known individual with a released data record. For example, the intruder links a released data record with external information, or identifies a respondent with extreme data values. In this case, an intruder can exploit a small subset of variables to make the linkage, and once the linkage is successful, the intruder has access to all other information in the released data related to the specific respondent.

* __Attribute disclosure__ occurs if the intruder is able to determine some new characteristics of an individual based on the information available in the released data. For example, if a hospital publishes data showing that all female patients aged 56 to 60 have cancer, an intruder then knows the medical condition of any female patient aged 56 to 60 without having to identify the specific individual.

* __Inferential disclosure__ occurs if the intruder is able to determine the value of some characteristic of an individual more accurately with the released data than would otherwise have been possible. For example, with a highly predictive regression model, an intruder may be able to infer a respondent's sensitive income information using attributes recorded in the data, leading to inferential disclosure.

???

> Even when personal data is not being collected, it may still be appropriate to apply the methodology, since quasi-identifiable data or other sensitive data could lead to personal identification or should not be shared.

https://jangorecki.github.io/blog/2014-11-07/Data-Anonymization-in-R.html

This is based on the World Bank sponsored [disclosure Control Toolbox](http://www.ihsn.org/software/disclosure-control-toolbox) for the R language and built on the recommendations from the [International Household Survey Network](http://ihsn.org/sites/default/files/resources/ihsn-working-paper-007-Oct27.pdf).

By default, it is assumed that __direct identifiers__ (such as name, proGres ID, telephone, GPS locations) are removed.

---
## The paradox of disclosure risks and information loss

.pull-left[
Data anonymisation is always a trade-off between disclosure risks and information loss. The objective is to modify the data in such a way that both the disclosure risk and the information loss caused are acceptably low.

There is no standard threshold for the risk measurement, as it depends both on the level of conservatism in the number of selected variables considered to be "at risk" and on the trust placed in the data analyst.
]

.pull-right[
]

---
## Identify potential statistical disclosure risks

* Risk linked to each record in the dataset: __global disclosure risk__ & __record-level disclosure risk__;

* Risk linked to combinations of categorical variables in the dataset: __k-anonymity__ & __l-diversity__;

* Risk linked to specific values of numeric variables in the dataset: various indices based on __robust Mahalanobis distances__ are calculated.

---
## Anonymisation treatment

Perturbative and non-perturbative approaches can be used to decrease risks:

* for __categorical__ variables: recoding, suppressing, post-randomization;
* for __continuous__ variables: discretisation (converting from continuous to categorical), adding noise, micro-aggregation, swapping.

Once those additional treatments are applied, the report can be regenerated until the risk/loss ratio is acceptable.

.bg-blue[
__Global recoding__ is a non-perturbative method that can be equally applied to both categorical and continuous key variables. The basic idea of recoding a categorical variable is to combine several categories into a new, less informative category. A frequent use case is the recoding of age given in years into age groups. If the method is applied to a continuous variable, it means discretizing the variable. You can apply recoding using the basic cleaning functions (see the sketch on the next slide).
]
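
---
## Global recoding: a minimal sketch

As a simple illustration of global recoding (not the package's internal implementation), the base R snippet below turns an age in years into age groups, assuming a hypothetical `age` variable in `datalist[["main"]]`.

```r
## Hypothetical example: recode age in years into less informative age groups
main <- datalist[["main"]]

main$age_group <- cut(main$age,
                      breaks = c(0, 17, 59, Inf),
                      labels = c("0-17", "18-59", "60+"),
                      include.lowest = TRUE)

table(main$age_group, useNA = "ifany")
```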
---
## Variables for risk scenarios

To assess disclosure risk, one must make realistic assumptions about the information data users might have at hand to match against the microdata set; these assumptions are called disclosure risk scenarios.

This goes hand in hand with the selection of categorical key variables, because the choice of these identifying variables defines a specific disclosure risk scenario. The specific set of chosen key variables has a direct influence on the risk assessment.

---
## The intrusion scenario: what variables to consider when configuring the anonymisation?

* __Direct identifiers__: can be directly used to identify an individual, e.g. name, address, date of birth, telephone number, GPS location. Metadata (data about who collected the data, where and how) is often stored separately from the main data, can be used to identify individuals, and should usually be removed.

* __Quasi-identifiers__, also called _implicit identifiers_ or _key variables_: can be used in combination to re-identify respondents in the released dataset when it is joined with other information (e.g. gender, age, occupation, specific needs, region...).

* __Sensitive information__: variables whose values must not be discovered for any respondent. Determination is often subject to legal and ethical concerns (protection risk, vulnerabilities, ethnicity, religious belief...). Such information might not identify an individual but could put an individual or group at risk.

---
## Corporate Data Curation

.pull-left[
Given the risk linked to statistical disclosure, full and final anonymisation in UNHCR is performed by a dedicated team:

1. Start from the data documented and archived in the Internal Data Library: [RIDL](https://im.unhcr.org/ridl/)

2. Review and proceed with anonymisation in order to get the formal approval of the data controller in the operation, in line with the [UNHCR Data Protection Policy](https://www.refworld.org/docid/55643c1d4.html)

3. Publish the resulting microdata within the [UNHCR Microdata Library](https://www.unhcr.org/blogs/advancing-unhcrs-open-data-vision-the-new-microdata-library/)
]

.pull-right[
![https://microdata.unhcr.org/index.php/home](data:image/png;base64,#../../inst/mdl.png)
]

---
## How to set it up in Kobocruncher?

By default, the initial exploration report does not include anonymisation, as it depends on your data. You can, though, implement a very first level so that potential high risks of re-identification are prevented during the analysis stage (for the record, disclosure risk also occurs in tabulations, not only when releasing microdata).

To do so:

1. Open the expanded xlsform

2. Categorize the variables in the `anonymise` column as:
  * `remove`
  * `key`
  * `sensitive`

---
## What will you get then?

Once you regenerate the report, the function `kobo_anonymise` will automatically output some information about the statistical disclosure risks for your dataset:

* Key metrics
* Visual representation

.bg-blue[
Those measurements will automatically appear in the report. Obviously, a revision of the intrusion scenario, or removing certain variables for a recoded version, will change the metrics.

Some automatic recoding and variable removal will be applied to the data and used in the rest of the data exploration (a small illustration of what a k-anonymity check does follows on the next slide).

A new version of the dataset will be saved in the data folder and can be uploaded to RIDL for __internal data sharing__ (recall that external data sharing is managed through corporate data curation).
]
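
---
## Behind the metrics: counting key combinations

As a rough illustration of what a k-anonymity check looks at (this is not the actual `kobo_anonymise` implementation), the sketch below counts how many respondents share each combination of hypothetical key variables (`gender`, `age_group`, `occupation`): combinations shared by very few respondents are the risky ones.

```r
library(dplyr)

## Hypothetical key variables assumed to exist in the main data table
main <- datalist[["main"]]

## How many respondents share each combination of key variables?
combos <- count(main, gender, age_group, occupation, name = "n")

## Combinations shared by fewer than 3 respondents violate 3-anonymity
filter(combos, n < 3)
```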
---
class: inverse, center, middle

# TIME TO PRACTISE ON YOUR OWN!

### .large[.white[ ] **5 minutes!**]
- Open the expanded xlsform again locally and fill in the `anonymise` column

- Regenerate the report and check the results

Do not hesitate to raise your questions in the [ticket system](https://github.com/Edouard-Legoupil/kobocruncher/issues/new) or in the chat so that the training content can be improved accordingly!