Authors & Year: Christopher Kenny, Shiro Kuriwaki, Cory McCartan, Evan Rosenman, Tyler Simko, and Kosuke Imai (2021)
Journal: Science Advances [DOI:10.1126/sciadv.abk3283]
U.S. Census & Differential Privacy
Census statistics play a pivotal role in public policy decisions such as redrawing legislative districts and allocating federal funds, and they also support social science research. However, given the risk of revealing individual information, many statistical agencies are considering disclosure control methods based on differential privacy, which add noise to tabulated data and then postprocess the result. The U.S. Census Bureau in particular has implemented a Disclosure Avoidance System (DAS) based on differential privacy to protect individual Census responses. This system adds random noise, calibrated by a privacy loss budget (denoted by ϵ), to Census tabulations, aiming to prevent the disclosure of personal information as mandated by law. The value of ϵ determines the level of privacy protection: a smaller ϵ requires more noise and gives stronger protection, while a larger ϵ permits less noise and yields more accurate but less protected data. While the adoption of differential privacy has been controversial, the approach is central to the Bureau's effort to maintain data confidentiality. Other countries and organizations are also considering this technology.
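The relationship between ϵ and the amount of noise can be sketched with a textbook Laplace mechanism. This is an illustration only, not the Bureau's actual DAS algorithm: the function name, the hypothetical block population of 120, the sensitivity of 1, and the ϵ values are all assumptions chosen for the example.

```python
import numpy as np

def laplace_noisy_counts(true_count, epsilon, n_draws, rng, sensitivity=1.0):
    """Draw n_draws noisy versions of true_count using Laplace noise.

    The noise scale is sensitivity / epsilon, so a smaller epsilon
    (stronger privacy) produces larger noise.
    """
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale, size=n_draws)

rng = np.random.default_rng(0)
true_count = 120  # hypothetical block population, not taken from the study

# Illustrative privacy budgets; only the ordering matters here.
for epsilon in (0.5, 2.0, 12.2):
    noisy = laplace_noisy_counts(true_count, epsilon, 20_000, rng)
    mae = np.mean(np.abs(noisy - true_count))
    print(f"epsilon={epsilon:>4}: mean absolute error = {mae:.3f}")
```

Under these assumptions the average error shrinks as ϵ grows, which is the privacy versus accuracy trade-off described above.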
Disclosure Avoidance System
Differential privacy in practice involves more than just adding noise. The noisy tabulations must also be postprocessed so that the published numbers are usable. The U.S. Census Bureau adjusts counts to avoid negative or otherwise inadmissible values, such as a negative population size for a small rural community or a negative number of solar-powered homes. Postprocessing is also required to maintain consistency across tabulations, so that the adjusted data remain logically coherent and preserve known relationships between variables. The challenge lies in determining whether these adjustments unintentionally introduce systematic biases into reported Census statistics.
The study evaluates the impact of the DAS, including both noise injection and postprocessing, on redistricting and voting rights analysis at various levels. To do so, it generated 10 sets of redistricting datasets for various offices and states using precinct-level data, and it employed a generalized additive model (GAM) to separate the systematic and residual components of population changes. A GAM extends linear regression by modeling the outcome as a sum of smooth functions of the predictors, which lets it capture non-linear relationships and makes it flexible for complex data patterns. The GAM was fitted to the precinct-level population difference between the DAS-12.2 data and the original Census data, with predictors including the Democratic vote share of elections in the precinct, turnout as a fraction of the voting age population, a logarithmic transform of population density, the fraction of the population that is White, and the Herfindahl-Hirschman Index (HHI) of race as a measure of racial heterogeneity. The HHI is a concentration measure: it is close to 0 for the most racially diverse precincts and equals 1 when a precinct consists of a single racial group.
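To see why the nonnegativity adjustment described above can introduce bias, consider a toy simulation that simply clamps noisy counts at zero. This is a deliberately simplified stand-in: the actual DAS postprocessing is more involved than clamping, and the function name, noise scale, and true counts below are assumptions made for illustration.

```python
import numpy as np

def clamp_bias(true_counts, noise_scale, n_draws=50_000, seed=0):
    """Compare the average error of raw noisy counts and clamped counts.

    Returns two arrays: mean(noisy) - truth and mean(clamped) - truth,
    one entry per true count.
    """
    rng = np.random.default_rng(seed)
    true_counts = np.asarray(true_counts, dtype=float)
    # Zero-mean Laplace noise, drawn independently for each count and draw.
    noise = rng.laplace(0.0, noise_scale, size=(n_draws, true_counts.size))
    noisy = true_counts + noise
    # Toy postprocessing: replace negative values with zero.
    clamped = np.clip(noisy, 0.0, None)
    return noisy.mean(axis=0) - true_counts, clamped.mean(axis=0) - true_counts

# Hypothetical true counts for four small groups or areas.
raw_bias, clamped_bias = clamp_bias([0, 1, 5, 50], noise_scale=2.0)
print("bias before clamping:", np.round(raw_bias, 3))
print("bias after clamping: ", np.round(clamped_bias, 3))
```

In this sketch the raw noisy counts are unbiased on average, while clamping pushes the smallest counts upward and leaves large counts essentially unchanged. It illustrates how an adjustment that looks harmless can treat small and large groups differently.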
Figure 1 below displays the errors or deviations of the model fits using the DAS-12.2 data against the minority fraction of the population in each precinct for 8 U.S. states. These 8 states include those frequently studied in redistricting (Pennsylvania and North Carolina), the Deep South (South Carolina, Louisiana, and Alabama), small states (Delaware), and heavily Republican (Utah) or Democratic (Washington) western states. A consistent U-shaped pattern indicates that mixed White/non-White precincts lose the most population relative to more homogeneous precincts.
Figure 1. Model-smoothed error in precinct populations by the minority fraction of voters. The overlaid GAM-smoothed curve displays the mean error by minority share. A U-shaped pattern suggests that precincts with a mix of White and non-White populations exhibit the greatest population loss compared to precincts with more uniform racial compositions.
Figure 2 below plots this error against the HHI and shows the same pattern more vividly. The fitted population error becomes more negative, and at a steeper rate, as the precinct becomes more racially diverse. The study suggests that this could be due to the accuracy targets adopted for the DAS, which prioritize accuracy for the largest racial group in a given geographic area. Thus, the DAS seems to systematically undercount the population in racially and politically heterogeneous precincts. The study also reports that this measured difference under the DAS is orders of magnitude larger than the difference under the block population numbers released in 2010. Hence, the DAS introduces systematic biases along racial and partisan lines, shifting population between precincts. The noise can also make it difficult to create districts of equal population, which may violate the “One Person, One Vote” principle. Smaller districts, such as state legislative districts, are particularly affected.
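The equal-population concern can be made concrete with a small calculation. The sketch below uses one common definition of population deviation (the largest absolute departure from the ideal district size, as a fraction of that size). The definition, the function name, and the district totals are assumptions for illustration and are not taken from the study.

```python
def max_population_deviation(district_pops):
    """Largest absolute deviation from the ideal (mean) district population,
    expressed as a fraction of the ideal population."""
    ideal = sum(district_pops) / len(district_pops)
    return max(abs(p - ideal) for p in district_pops) / ideal

# Hypothetical plan: perfectly balanced according to one dataset...
pops_in_dataset_a = [1000, 1000, 1000]
# ...but the same district lines give different totals in another dataset.
pops_in_dataset_b = [1000, 1010, 990]

print(max_population_deviation(pops_in_dataset_a))
print(max_population_deviation(pops_in_dataset_b))
```

A plan drawn to be balanced under one set of counts can therefore appear unbalanced under another, and the effect is proportionally larger when districts are small.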
Implications for Redistricting
This finding means that the DAS has partisan and racial implications: it can lead to unpredictable changes in district-level partisan outcomes and may alter redistricting analyses. The study shows that DAS-protected data underpredict minority voters, which affects the number of majority-minority districts. That is, DAS-protected data tend to show a smaller or less accurate count of minority voters than is present in the original, unaltered data, creating a systematic bias in counting or identifying minority voters. These findings have critical implications for future redistricting and voting rights analysis that relies on privacy-protected Census data, because they reveal the challenges and potential biases introduced by the DAS. The research provides a comprehensive assessment of the impact of the DAS on redistricting across various levels and contexts. It builds on prior work by using the latest DAS demonstration data to examine the consequences of DAS-induced errors across multiple use cases and states.
Figure 2. Model-smoothed error in precinct populations by the Herfindahl-Hirschman Index (HHI). An HHI of 100% indicates that the precinct consists of only one racial group. The plots show that the fitted population error becomes more negative, at a steeper rate, as the precinct’s racial diversity rises.
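For readers unfamiliar with the HHI, the standard formula is the sum of squared group shares. A minimal sketch follows, in which the function name and the precinct counts are hypothetical and the racial categories used in the study are not reproduced.

```python
def hhi(group_counts):
    """Herfindahl-Hirschman Index: the sum of squared population shares.

    Equals 1.0 when one group makes up the whole precinct and falls
    toward 1/k when k groups are equally represented.
    """
    total = sum(group_counts)
    if total == 0:
        raise ValueError("precinct has no population")
    return sum((count / total) ** 2 for count in group_counts)

# Hypothetical precincts, counts listed by racial group.
print(hhi([100, 0, 0, 0]))    # a single group
print(hhi([50, 50]))          # two equal groups
print(hhi([25, 25, 25, 25]))  # four equal groups
```

A lower value therefore corresponds to a more racially diverse precinct, which is the direction in which the fitted error in Figure 2 worsens.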
Challenges of Privacy vs. Accuracy: A Summary
In summary, the study focuses on the Census Bureau’s DAS and its impact on the redistricting process. The DAS employs a two-step algorithm, combining differentially private noise injection with postprocessing, to protect data privacy in the decennial census. The study examines the demonstration data released in April 2021, which are based on the 2010 decennial census, to assess the effects of the DAS. The DAS differs from previous approaches, which relied on swapping to protect the privacy of individuals with unique responses, in that it adds statistical noise to counts in various Census tables. However, the study highlights that the DAS, particularly its postprocessing, introduces systematic biases: postprocessing redistributes population from racially mixed areas to more racially homogeneous ones. The impact on redistricting was analyzed through redistricting simulation, which revealed population discrepancies at the voting tabulation district (VTD) level. Mixed White/non-White precincts lose more population than homogeneous precincts, and undercounting biases along partisan lines are also observed. These biases along racial and partisan lines can compromise equitable representation and affect redistricting outcomes. The findings emphasize the trade-offs between privacy protection and data accuracy, and they suggest a need for careful consideration of these effects in redistricting and policymaking. The authors also acknowledge the challenges that national statistics agencies worldwide face in balancing privacy protection and data accuracy, especially in full enumeration censuses. The study further recommends reconsidering the previous swapping-based method as a way to better balance privacy and data accuracy.
Until a satisfactory solution is found, unresolved flaws in the 2020 Census data may have critical implications for future redistricting and related matters.