Jenna Shelton, MPP Staff Writer, Brief Policy Perspectives
With the 2020 Census just around the corner, the Census Bureau announced it will apply differential privacy, a rigorous data protection practice, to its data collection methods. The announcement comes as several high-profile hacking and data abuse incidents, like those involving Facebook and Equifax, have left many people concerned about whether and how their personal information can be kept from falling into the wrong hands.
These concerns are not without merit. An estimated 87 percent of U.S. residents can be uniquely identified by combining three basic pieces of information: ZIP code, date of birth, and gender. Data linkage arose as a major concern in 1997, when the Massachusetts Group Insurance Commission published anonymized datasets of hospital visits by state employees, having removed what it believed was all personally identifiable information. However, Latanya Sweeney, then a graduate student in computer science, combined the published health records with voter registration records to identify Governor William Weld’s medical records, demonstrating how individuals can be re-identified when anonymized data is cross-referenced with other sources.
This event raised serious concerns about data privacy, and served as the impetus for the development of stringent de-identification provisions within landmark legislation such as the 2003 Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Today, to avoid such re-identification more broadly, the federal government is implementing strict individual-level privacy protections that balance protecting data from linkage attacks with allowing permitted researchers to use data to inform policy-making.
Why is protecting Census data important?
Article 1, Section 2 of the U.S. Constitution dictates that Congress conduct a census of the population every 10 years to determine the apportionment of seats in the House of Representatives. In the country’s early years, policymakers primarily used Census data to inform political representation, but within the last century the use of Census data has grown widely to inform other arenas, including social science research, city planning, and private-sector management. Today, the internet allows anyone to freely access and use certain levels of Census data for their own purposes.
Despite the long history and expanding use of the decennial Census, as well as the more recent advent of the annual American Community Survey (ACS), the Census Bureau is still determining the best mechanisms for safeguarding individual-level data. Privacy-enhancing strategies such as differential privacy allow for greater protection of an individual’s data even when more researchers are accessing this information than ever before.
What is differential privacy?
To combat concerns about data linkage, the Census Bureau will use differential privacy in the 2020 Census. Simply put, differential privacy is a statistical technique that allows researchers to collect and publish data without compromising the confidentiality of individual-level records (known as microdata). When researchers collect data and apply differential privacy, they use a statistical algorithm to randomly decide whether to record an individual’s real survey response or to substitute another answer instead, making individual-level information less reliable within a given dataset, but also less traceable. Because the algorithm randomizes responses in a known way and the datasets are often large, aggregate results from the survey retain their validity while individual-level data stays protected.
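The record-or-substitute idea described above resembles a classic technique called randomized response. The following is a minimal sketch of that technique in Python, not the Census Bureau's actual system; the function names, the 75 percent truth probability, and the simulated 30 percent "yes" rate are all assumptions chosen for illustration.

```python
import math
import random


def randomize(true_answer: bool, p_truth: float, rng: random.Random) -> bool:
    """With probability p_truth record the real answer; otherwise record a coin flip."""
    if rng.random() < p_truth:
        return true_answer
    return rng.random() < 0.5


def estimate_true_rate(reported: list, p_truth: float) -> float:
    """Undo the known randomization to estimate the real share of 'yes' answers."""
    observed = sum(reported) / len(reported)
    # observed = p_truth * true_rate + (1 - p_truth) * 0.5, solved for true_rate
    return (observed - (1 - p_truth) * 0.5) / p_truth


def epsilon(p_truth: float) -> float:
    """Privacy-loss parameter of this particular scheme (smaller means more private)."""
    return math.log((1 + p_truth) / (1 - p_truth))


# Hypothetical survey: 100,000 respondents, 30 percent of whom would answer "yes".
rng = random.Random(2020)
true_answers = [i % 10 < 3 for i in range(100_000)]
reported = [randomize(a, 0.75, rng) for a in true_answers]

print("estimated yes rate:", round(estimate_true_rate(reported, 0.75), 3))
print("epsilon:", round(epsilon(0.75), 3))
```

No single recorded answer can be trusted, since it may be a substitute, yet the overall rate can still be recovered because the randomization rule is known.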
Differential privacy can also minimize unethical data use. For the first time since 1950, the 2020 Census will ask the citizenship question: “Is this person a citizen of the United States?”
The addition of this question is controversial and has stoked fears that individual responses may be used to identify undocumented immigrants, allowing authorities to find and penalize or deport them. The use of differential privacy would reduce the likelihood that these data could be traced back to specific individuals and used for unintended purposes.
What are the pitfalls of differential privacy?
The purpose of differential privacy is to prevent published responses from being linked back to an individual. To that end, differential privacy modifies individual responses to ensure that individuals cannot be identified when information is combined with other datasets. While differential privacy does not pose serious risks to validity or data accuracy within large datasets, such as the decennial Census, there may be validity challenges when researchers pull and use smaller datasets (e.g. county-level data or the ACS) from differentially private Census information. Thus, analysis based on small differentially private datasets may not be accurate. In practice, microdata used for research and analysis remains available with restricted-use (i.e. nonpublic) access in one of the Census Bureau’s Federal Statistical Research Data Centers, separate from differentially private publicly available data.
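To see why small datasets suffer more, consider a simple scheme in which each real answer is kept with some probability and otherwise replaced by a coin flip. The sketch below computes the standard error of the estimated "yes" rate under that assumed scheme; the function name, the 75 percent truth probability, and the sample sizes are hypothetical and do not describe any actual Census product.

```python
import math


def standard_error(n: int, true_rate: float, p_truth: float) -> float:
    """Standard error of the estimated 'yes' rate after randomizing n responses."""
    # Probability that any one recorded answer is "yes" after randomization
    q = p_truth * true_rate + (1 - p_truth) * 0.5
    # Sampling error of the recorded rate, inflated when the randomization is undone
    return math.sqrt(q * (1 - q) / n) / p_truth


# Hypothetical population sizes, from a small area to a large one
for n in (500, 50_000, 5_000_000):
    print(n, round(standard_error(n, true_rate=0.3, p_truth=0.75), 5))
```

The error shrinks with the square root of the number of responses, so the same privacy protection that is nearly invisible at the national level can blur estimates for a small county.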
The uncertain future of privacy
Given the statistical complexities of differential privacy, it is crucial to have trained Census employees verify that it has been implemented correctly. In recent years the federal government has struggled to hire young talent with IT and computer science skills due to competition with private-sector firms, which may pose a significant issue for future Census data development and improvement.
While adopting differential privacy will not solve all concerns related to government collection of data, it is a promising signal that the Census Bureau is committed to protecting confidential information. Nevertheless, it is imperative that the agency fully implement and operationalize strategies for differential privacy in the 2020 Census.