BOOK EXCERPT

Democracies need to collect data to work properly. The US isn't doing a good job

The government can’t level the playing field if it doesn’t know who’s playing

Published September 27, 2020 10:00AM (EDT)

US Census workers stand outside Lincoln Center for the Performing Arts as the city continues Phase 4 of re-opening following restrictions imposed to slow the spread of coronavirus on September 24, 2020 in New York City. The fourth phase allows outdoor arts and entertainment, sporting events without fans and media production. (Noam Galai/Getty Images)

Adapted from “Democratizing Our Data: A Manifesto” by Julia Lane (MIT Press, 2020). Reprinted with permission from MIT Press.

A government of, by, and for the people can't function very well if it doesn't know who those people are. Indeed, that's why public data are foundational to our democratic system: in order to distribute resources and understand how to serve the public, elected officials have to understand the composition of the body politic. All nations collect data on their citizens in some way in order to do this.

In the United States, we know about income inequality and job trends thanks to data from the Bureau of Labor Statistics. We know what's happening to economic growth thanks to data from the US Census Bureau. We know about the impact of business tax changes thanks to data from the Statistics of Income Division. Data like these are profoundly important for most of us, and especially for individuals and small businesses who can't pay for expensive experts to produce customized reports. 

One of the government's jobs is to level the data playing field. Statistical agencies have historically been the source of accurate and objective information for democracies, due to the limitations of private sector–produced data. For example, emergency supplies probably shouldn't be allocated to an area based on the frequency of tweets from that location. Why? Because that would mean more supplies going to the people who tweet, underserving babies and elderly residents who are less likely to have Twitter accounts. Emergency supplies should be allocated based on information about the people likely to need such supplies, and government data are the way in which we ensure that the right people are counted. If people aren't counted, they don't count, and that threatens our democracy.

But recently, the playing field has been tilting against public data. Our current statistical system is under stress, too often based on old technology and with too little room for innovation. Our public statistical institutions are often not structurally capable of taking advantage of massive changes in the availability of data and the public need for new and better data to make decisions. And without breakthroughs in public measurement, we're not going to get intelligent public decision-making.

Over twenty years ago, one of the great statistical administrators of the twentieth century, Janet Norwood, pointed to the failing organizational structure of federal statistics, warning that, "In a democratic society, public policy choices can be made intelligently only when the people making the decisions can rely on accurate and objective statistical information to inform them of the choices they face and the results of choices they make." 

We must rethink ways to democratize data. There are successful models to follow and new legislation that can help effect change. The so-called Data Revolution happening in the private sector — where new types of data are collected and new measurements created by the private sector to build machine learning and artificial intelligence algorithms — can be mirrored by a public sector Data Revolution, one that is characterized by attention to counting all who should be counted, measuring what should be measured, and protecting privacy and confidentiality. Just as US private sector companies—Google, Amazon, Microsoft, Apple, and Facebook—have led the world in the use of data for profit, the US can show the world how to produce data for the public good.

There are massive challenges to be addressed. The national statistical system — our national system of measurement — has ossified. Public agencies struggle to change the approach to collecting the statistics that they have produced for decades — in some cases, as we shall see, since the Great Depression. Hamstrung by excessive legislative control, inertia, lack of incentives, ill-advised budget cuts, and the "tyranny of the established," they have largely lost the ability to respond to quickly changing user needs. Despite massive increases in the availability of new types of data, such as administrative records (data produced through the administration of government programs, such as tax records) or data generated by digital activities (such as social media or cell phone calls), the US statistical agencies struggle to operationalize their use. Worse still, the government agencies that produce public data are at the bottom of the funding chain—staffing is being cut, funding is stagnant if not being outright slashed, and entire agencies are being decimated.

If we don't move quickly, the cuts that have already affected physical, research, and education infrastructures will also eventually destroy our public data infrastructure and threaten our democracy. Trust in government institutions will be eroded if government actions are based on political preference rather than grounded in statistics. The fairness of legislation will be questioned if there are no impartial data with which the public can examine the impact of legislative changes in, for example, the provision of health care and the imposition of taxes. National problems, like the opioid crisis, will not be addressed, because governments won't know where or how to allocate resources. Lack of access to public data will increase the power of big businesses, which can pay for data to make better decisions, and reduce the power of small businesses, which can't. The list is endless because the needs are endless.

I want to provide a solution to the impending critical failure in public data. Our current approach and the current budget realities mean that we cannot produce all the statistics needed to meet today's expectations for informing increasingly complex public decisions. We must design a new statistical system that will produce public data that are useful at all levels of government—and make scientific, careful, and responsible use of many newly available data, such as administrative records from agencies that administer government programs, data generated from the digital lives of citizens, and even data generated within the private sector.

What's at stake

Measurement is at the core of democracy, as Simon Winchester points out: "All life depends to some extent on measurement, and in the very earliest days of social organization a clear indication of advancement and sophistication was the degree to which systems of measurement had been established, codified, agreed to and employed." 

Public data and measurement have to be paid for out of the public purse, so there is great scrutiny of costs and quality. Yet in a world where private data are getting cheaper, the current system of producing public data costs a lot of money—and costs are going up, not down. One benchmark is how much it costs the Census Bureau to count the US population. In 2018 dollars, the 1960 Census cost about $1 billion, or about $5.50 per person. The 1990 Census cost about $20 per person. The 2020 Census is projected to cost about $16 billion, or about $48 per person. And the process is far from instant: Census Day is April 1, 2020, but the results won't be delivered until December.
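To make the per-person arithmetic concrete, here is a minimal sketch of the calculation. The total costs are the 2018-dollar figures quoted above; the population counts are approximate decennial totals that are assumed here for illustration, not figures from the text.

    # Per-person census cost: total cost divided by population counted.
    # Costs are the 2018-dollar figures from the text; populations are
    # approximate decennial totals (an assumption added for illustration).
    census = {
        1960: {"total_usd": 1.0e9,  "population": 179_000_000},   # ~$1B
        2020: {"total_usd": 16.0e9, "population": 331_000_000},   # ~$16B projected
    }

    for year, c in census.items():
        per_person = c["total_usd"] / c["population"]
        print(f"{year} Census: ~${per_person:.2f} per person")

    # 1960 Census: ~$5.59 per person
    # 2020 Census: ~$48.34 per person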

Another standard is the quality of data that are collected. Take a look, for example, at the National Center for Health Statistics report to the Council of Professional Associations on Federal Statistics. Response rates on the National Health Interview Survey have dropped by over 20 percentage points, increasing the risk of nonresponse bias, and the rate at which respondents "break off" or fail to complete the survey has almost tripled over a twenty-year period.
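To see why a falling response rate raises the risk of nonresponse bias, consider a toy calculation; the prevalence figures below are invented for illustration and are not NCHS numbers. If the people who answer differ from the people who don't, the survey's naive estimate drifts away from the truth in proportion to the nonresponse rate.

    # Toy illustration of nonresponse bias (invented numbers, not NCHS data).
    # Suppose we estimate the share of people with some health condition,
    # and nonresponders are more likely to have it than responders.
    p_responders = 0.10     # prevalence among those who answer (assumed)
    p_nonresponders = 0.20  # prevalence among those who don't (assumed)

    for response_rate in (0.80, 0.60):
        truth = response_rate * p_responders + (1 - response_rate) * p_nonresponders
        estimate = p_responders  # the survey only ever sees responders
        print(f"response rate {response_rate:.0%}: truth {truth:.3f}, "
              f"estimate {estimate:.3f}, bias {estimate - truth:+.3f}")

    # bias = (1 - response_rate) * (p_responders - p_nonresponders), so a
    # 20-point drop in the response rate doubles the bias in this example.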

As a result, communities are not getting all the information they need from government for decision-making. If we made a checklist of features of data systems that have made private sector businesses like Amazon and Google successful, it might include producing data that are: (1) real-time so customers can make quick decisions; (2) accurate so customers aren't misled; (3) complete so there is enough information for the customer to make a decision; (4) relevant to the customer; (5) accessible so the customer can easily get to information and use it; (6) interpretable so everyone can understand what the data mean; (7) innovative so customers have access to new products; and (8) granular enough so each customer has customized information.

If we were to look at the flagship programs of the federal system, they don't have those traits. Take, for example, the national government's largest survey—the Census Bureau's American Community Survey (ACS). It was originally designed to consistently measure the entire country so that national programs that allocated dollars to communities based on various characteristics were comparing the whole country on the same basis. It is an enormous and expensive household survey. It asks questions of 295,000 households every month—about 3.5 million households a year. The cost to the Census Bureau is about $220 million, and another $64 million can be attributed to respondents in the value of the time taken to answer the questions. Because there is no high-quality alternative, it is used in hundreds if not thousands of local decisions—as the ACS website says, it "helps local officials, community leaders, and businesses understand the changes taking place in their communities." In New York alone, the police department must report on priority areas that are determined, in part, using ACS poverty measures; pharmacies must provide translations for top languages as defined by the ACS; and the New York Department of Education took 2008 ACS population estimates into account when it decided to make Diwali a school holiday.
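For a rough sense of scale, the figures above imply an all-in cost on the order of $80 per sampled household. A quick back-of-the-envelope check, using only the numbers quoted in the text:

    # Back-of-the-envelope ACS cost per sampled household,
    # using only the figures quoted above.
    bureau_cost = 220e6                 # annual cost to the Census Bureau
    respondent_time_cost = 64e6         # value of respondents' time
    households_per_year = 295_000 * 12  # ~3.5 million households

    total = bureau_cost + respondent_time_cost
    print(f"households per year: {households_per_year:,}")
    print(f"all-in cost per household: ~${total / households_per_year:.0f}")
    # households per year: 3,540,000
    # all-in cost per household: ~$80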

Yet while reliable local data are desperately needed, the very expensive ACS data are too error-prone for reliable local decision-making. The reasons include the survey design, sample sizes that are too small for local areas, the difficulty the public has in interpreting the large margins of error that come with small samples, and the lack of timely dissemination of the data.

The core problem is the reliance on old technology. The data are collected by mailing a survey to a random set of households (one out of 480 households in any given month). One person is asked to fill out the survey on behalf of everyone else in the household, as well as to answer questions about the housing unit itself. To give you a sense of the issues with this approach: there is no complete national list of households (the Census Bureau's list misses about 6 percent of households), about a third of recipients refuse to respond, and of those who do respond, many do not fill out all parts of the survey. There is follow-up of a subset of nonresponders by phone, internet, and in-person interviews, but each of these introduces different sources of bias in terms of who responds and how they respond. Because response rates vary by geography and demography, those biases can be very difficult to adjust for.
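The "sample sizes that are too small" problem can be made concrete with the standard margin-of-error formula for an estimated proportion. The sketch below uses the normal approximation and a 90 percent confidence level (the level the ACS publishes); the sample sizes and the 15 percent poverty rate are illustrative assumptions, not ACS figures.

    import math

    # Margin of error for an estimated proportion (normal approximation).
    # z ~ 1.645 corresponds to the 90% confidence level the ACS publishes.
    def margin_of_error(p, n, z=1.645):
        return z * math.sqrt(p * (1 - p) / n)

    p = 0.15  # an illustrative estimated poverty rate of 15%
    for n in (10_000, 400, 50):  # big city vs. small town vs. tiny tract
        print(f"n={n:>6}: 15% +/- {margin_of_error(p, n):.1%}")

    # n= 10000: 15% +/- 0.6%  -- usable
    # n=   400: 15% +/- 2.9%  -- shaky
    # n=    50: 15% +/- 8.3%  -- nearly useless for local decisions

The same survey that yields tight estimates for a large city can be nearly uninformative for a small town or census tract, which is exactly where many of the local decisions described above are made.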

Such problems are not unique to the ACS; surveys in general are less and less likely to be truly representative of the people in the United States, and the mismatch between intentions and reality can result in the systematic erasure of millions of Americans from governmental decision-making.


By Julia Lane

Julia Lane is a founder of the Coleridge Initiative, Professor at the NYU Wagner Graduate School of Public Service and the NYU Center for Urban Science and Progress, and an NYU Provostial Fellow for Innovation Analytics.
