BUILD_PNADC_PANEL


Description

Our load_pnadc function uses the internal function build_pnadc_panel to identify households and individuals across quarters. The base identification method draws on Rafael Perez Ribas and Sergei Suarez Dillon Soares (2008), “Sobre o painel da Pesquisa Mensal de Emprego (PME) do IBGE”, with extensions implemented by the Data Zoom team to handle missing data and typographical errors.


Usage

Basic Panel:

panel_data <- build_pnadc_panel(dat = pnad_sample, panel = "basic")

Advanced Panel (Stages 1, 2, or 3):

# Stage 1: Exact matching with donated birth dates
panel_data <- build_pnadc_panel(dat = pnad_sample, panel = "advanced_1")

# Stage 2: Relaxed matching constraints
panel_data <- build_pnadc_panel(dat = pnad_sample, panel = "advanced_2")

# Stage 3: Fuzzy matching using Graph Theory (Recommended)
panel_data <- build_pnadc_panel(dat = pnad_sample, panel = "advanced_3")

Basic Identification

The household identifier – stored as id_dom – combines the variables:

  • UPA – Primary Sampling Unit (PSU);
  • V1008 – Household;
  • V1014 – Panel Number.

Each distinct combination of these variables is assigned a unique number.
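As an illustration only, the household identifier can be sketched in base R as follows. The toy values and the use of paste() with match() are assumptions made for this example; the internal code of build_pnadc_panel may construct id_dom differently.

```r
# Toy data (made-up values): three rows, two distinct households.
pnad_toy <- data.frame(
  UPA   = c("110000016", "110000016", "110000017"),
  V1008 = c("01", "01", "01"),
  V1014 = c("08", "08", "08"),
  stringsAsFactors = FALSE
)

# Build a text key per row; the separator keeps "1"+"23" distinct from "12"+"3".
key <- paste(pnad_toy$UPA, pnad_toy$V1008, pnad_toy$V1014, sep = "_")

# Number the distinct keys in order of first appearance.
pnad_toy$id_dom <- match(key, unique(key))
```

Rows that share UPA, V1008 and V1014 receive the same id_dom, so the same household can be followed across quarters.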


The basic individual identifier – stored as id_ind – combines the household id with:

  • V2007 – Sex;
  • Date of Birth – [V20082 (year), V20081 (month), V2008 (day)].

Each distinct combination of the household id and these variables is assigned a unique number.
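A minimal sketch of the same idea for individuals is shown below. The toy data and the key-building approach are assumptions for illustration, not the package's internal implementation.

```r
# Toy data (made-up values): rows 1 and 3 are the same person seen in two quarters.
people_toy <- data.frame(
  id_dom = c(1L, 1L, 1L, 2L),
  V2007  = c(1L, 2L, 1L, 1L),              # sex code
  V20082 = c(1980L, 1982L, 1980L, 1980L),  # year of birth
  V20081 = c(5L, 11L, 5L, 5L),             # month of birth
  V2008  = c(17L, 3L, 17L, 17L)            # day of birth
)

# Combine household id, sex and date of birth into one key per row.
key <- with(people_toy, paste(id_dom, V2007, V20082, V20081, V2008, sep = "_"))

# Number the distinct keys in order of first appearance.
people_toy$id_ind <- match(key, unique(key))
```

Here rows 1 and 3 share an id_ind, while the person in household 2 with the same sex and birth date receives a different one because the household id is part of the key.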


Advanced Identification

For individuals who were not matched across all interviews by the basic method, we apply a progressive multi-stage algorithm that increases matching power without compromising uniqueness.

First, we reproduce the birth date donation method described in the IPEA technical note (Osório, 2019). It estimates and imputes missing birth dates (day, month, and year) by matching individuals with donors from other interviews within the same household, based on sex, acceptable household condition changes, and estimated age. This process is executed by our internal function donate_birth_dates.
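The donation idea can be sketched roughly as below. This is a simplified illustration, not the actual donate_birth_dates function: the function name donate_birth_dates_sketch, the tolerance tol, and the use of Ano (survey year) minus V2009 (age) as the estimated birth year are all assumptions, and the check on household condition changes is omitted.

```r
# Hypothetical, simplified donation: fill a missing birth date from a donor in
# the same household with the same sex and a birth year consistent with the
# recipient's reported age. Donation happens only if the donor date is unique.
donate_birth_dates_sketch <- function(df, tol = 1L) {
  src  <- df  # donors are always read from the original data
  cols <- c("birth_day", "birth_month", "birth_year")
  for (i in which(is.na(df$birth_year))) {
    est_year <- src$Ano[i] - src$V2009[i]  # rough birth year implied by age
    is_donor <- !is.na(src$birth_year) &
      src$id_dom == src$id_dom[i] &
      src$V2007 == src$V2007[i] &
      abs(src$birth_year - est_year) <= tol
    dates <- unique(src[which(is_donor), cols, drop = FALSE])
    if (nrow(dates) == 1) {
      for (col in cols) df[[col]][i] <- dates[[col]][1]
    }
  }
  df
}

# Toy data (made-up values): row 2 lacks a birth date, row 1 is a plausible donor.
toy <- data.frame(
  id_dom      = c(1L, 1L, 1L),
  Ano         = c(2019L, 2019L, 2019L),
  V2007       = c(1L, 1L, 2L),
  V2009       = c(39L, 39L, 37L),
  birth_day   = c(17L, NA, 3L),
  birth_month = c(5L, NA, 11L),
  birth_year  = c(1980L, NA, 1982L)
)
filled <- donate_birth_dates_sketch(toy)
```

In the toy data, row 2 receives the date 17/5/1980 from row 1; row 3 cannot donate because the sex differs.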

  • Stage 1 (advanced_1): Repeats the basic identification logic, but using the donated dates. The identifier – stored as id_rs1 – combines:
    • id_dom – Household ID
    • V2007 – Sex
    • birth_day – Donated day of birth
    • birth_month – Donated month of birth
    • birth_year – Donated year of birth
  • Stage 2 (advanced_2): For individuals not completely matched in Stage 1, we relax the year of birth constraint (assuming it is often misreported). The identifier – stored as id_rs2 – combines:
    • id_dom – Household ID
    • birth_month – Donated month of birth
    • birth_day – Donated day of birth
    • V2003 – Order number in the household
  • Stage 3 (advanced_3): Targets candidates with fragmented interviews (fewer than 5 matches in the previous stages). It considers a match successful if there is a unique individual in the same household, in a different quarter, who satisfies the acceptable difference criteria established by Ribas and Soares (up to 4 days difference in the day of birth, 2 months in the month of birth, and a dynamically adjusted year-of-birth difference based on the individual’s reported age). The final identifier is stored as id_rs3.
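The Stage 3 acceptance rule can be illustrated with the sketch below. The function find_unique_match and the parameter max_year_diff are hypothetical: the text above does not state how the year tolerance is derived from age, so it is passed in as an argument here, and month differences are compared as plain absolute values.

```r
# Hypothetical helper: return the row index of the single candidate whose birth
# date is within the tolerances, or NA when there is no candidate or more than one.
find_unique_match <- function(target, candidates, max_year_diff) {
  ok <- abs(candidates$birth_day   - target$birth_day)   <= 4 &
        abs(candidates$birth_month - target$birth_month) <= 2 &
        abs(candidates$birth_year  - target$birth_year)  <= max_year_diff
  hits <- which(ok)
  if (length(hits) == 1) hits else NA_integer_
}

# Toy example (made-up values): same household, different quarter.
target <- list(birth_day = 17, birth_month = 5, birth_year = 1980)
candidates <- data.frame(
  birth_day   = c(15, 3),
  birth_month = c(6, 11),
  birth_year  = c(1981, 1982)
)
match_row <- find_unique_match(target, candidates, max_year_diff = 2)

# Two acceptable candidates make the match ambiguous, so it is rejected.
ambiguous <- rbind(candidates[1, ], candidates[1, ])
no_match <- find_unique_match(target, ambiguous, max_year_diff = 2)
```

Requiring exactly one acceptable candidate is what keeps the relaxed criteria from merging two different people into one identifier.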

Attrition and Identification Rates

1. Household Attrition Rate

The table below summarizes unconditional household attrition. Each value is the percentage of household units observed in Wave 1 that were successfully re-interviewed and tracked in the given wave, so the share of households lost by that wave is 100 minus the value shown.

Interview (Wave)   Household Attrition Rate (%)
1                  100.00000
2                   93.75667
3                   91.84360
4                   90.59256
5                   89.60861
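A rate of this kind could be computed along the following lines. The toy data are made up, and using V1016 (the PNADC interview number) to define the wave is an assumption about how the figures were produced.

```r
# Toy data (made-up values): one row per household per interview.
visits <- data.frame(
  id_dom = c(1, 1, 1, 2, 2, 3, 4, 4, 4),
  V1016  = c(1, 2, 3, 1, 2, 1, 1, 2, 3)
)

# Households observed in the first interview form the fixed denominator.
wave1 <- unique(visits$id_dom[visits$V1016 == 1])

# Share of those households seen again in each wave, regardless of whether
# they were observed in the waves in between.
waves <- sort(unique(visits$V1016))
rate <- sapply(waves, function(w) {
  100 * mean(wave1 %in% visits$id_dom[visits$V1016 == w])
})
```

In this toy example the rates are 100, 75 and 50 for waves 1 to 3.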

2. Initial Identification Rate (Line-Based)

This table reports the percentage of raw PNADC individual observations (lines) in Wave 1 for which we successfully built a valid identifier. Observations are lost at this stage only because the identifier cannot be constructed (e.g., essential data are missing) or because of household grouping constraints.

Interview (Wave)   Basic Rate (%)   Adv 1 Rate (%)   Adv 2 Rate (%)   Adv 3 Rate (%)
1                  93.82378         95.82954         96.40170         96.39606
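As a rough sketch, a line-based rate like this is the share of rows that carry an identifier. The toy data and the convention that unidentified rows hold NA are assumptions for illustration.

```r
# Toy Wave 1 rows (made-up values): NA marks rows where no identifier could be built.
wave1_rows <- data.frame(
  id_ind = c(101L, 102L, NA, 103L, NA, 104L, 105L, 106L)
)

# Percentage of lines with a valid identifier: 6 of 8 rows.
id_rate <- 100 * mean(!is.na(wave1_rows$id_ind))
```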

3. Individual Unconditional Tracking Rate (ID-Based)

This table shows the cumulative retention of tracked individuals over time. It uses the total number of uniquely identified individuals in Wave 1 as the common denominator (starting at 100%), showing how much tracking power is gained by using the advanced algorithms.

Interview (Wave)   Basic Rate (%)   Adv 1 Rate (%)   Adv 2 Rate (%)   Adv 3 Rate (%)   Difference (Adv 3 - Basic)
1                  100.00000        100.00000        100.00000        100.00000         0.00000 p.p.
2                   87.01360         88.20828         88.50570         88.83374        +1.82014 p.p.
3                   80.55773         82.33729         82.85546         83.38268        +2.82495 p.p.
4                   75.81465         77.91677         78.57830         79.25447        +3.43982 p.p.
5                   72.01655         74.27815         75.01486         75.79868        +3.78213 p.p.
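The last column is the Adv 3 rate minus the Basic rate, in percentage points. The short check below reproduces it from the values in the table; the vector names are chosen for this example only.

```r
# Rates copied from the table above, waves 1 to 5.
basic <- c(100, 87.01360, 80.55773, 75.81465, 72.01655)
adv3  <- c(100, 88.83374, 83.38268, 79.25447, 75.79868)

# Gain in percentage points from using the Stage 3 identifier.
gain_pp <- round(adv3 - basic, 5)
```

The gain grows with each wave, from 1.82014 p.p. in Wave 2 to 3.78213 p.p. in Wave 5.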