Knowing what our data can’t say

Joshua Wu, PhD
5 min read · Jul 1, 2020


When we do research, we are almost always working in a data-deficit environment. To do reasonable research, we need to balance what we can say with the data we have against what we cannot say because we lack better data.


In an ideal world, there would be a perfect match between research questions and the data available for empirical analysis. In the real world, we simply do what we can with the best available data. It is critical to optimize the confidence with which we answer the questions the data lets us answer while properly recognizing the limitations of our analysis. Too much confidence without qualification can lead to over-extrapolation of insights into unintended domains, while an overemphasis on shortcomings and the questions we cannot answer can undermine the usefulness of the analyses that are possible.

So how do we strike that balance? One useful strategy is to imagine better data and the types of analyses that would be possible if it were available. Using my previous analyses as examples, I discuss two core questions we should consider. Doing this will not only help us understand the limits of what we know, but can also give us confidence in the things we can say with the data we have now.

What are baseline expectations?

In data analysis, we cannot look at observed outcomes without reference to expected outcomes. As part of the scientific process of discovery, we need not only hypotheses of what we expect to find but also null hypotheses: baseline expectations of what we would find if the hypothesized effects or expected shifts did not happen.

In a previous piece, I describe how Black Americans are over twice as likely to be arrested and nearly seven times as likely to be murdered as White Americans. While this is empirical evidence of racial disparities, it does not answer whether the rates of arrests and murders of Black Americans are higher than expected given non-racial determinants of criminality. For example, we would expect more crime in more economically depressed communities: where economic opportunity is lower, crime may be more attractive because its opportunity cost is lower. We might also expect more arrests, and especially more murders, in communities with more firearms, since the availability of lethal weapons increases the likelihood that escalations and disputes end in deadly force, and thus raises murder rates.

However, the lack of systematic data makes it difficult to calculate the baseline expectation of arrests and murders given socio-economic conditions and the availability of lethal weapons. Reporting requirements for police are inconsistent, and compliance with the quantity and quality of data shared with national agencies varies widely. And there are no robust estimates of gun availability, nor a national database of gun purchases or transfers.

But if this data were collected and available, we could build prediction models that use socio-economic indicators and firearm availability to identify an expected baseline likelihood of arrest and murder. Or we could leverage quasi-random variation to assess the marginal effect of differences in the racial composition of otherwise similar communities (with comparable socio-economic conditions and firearm availability) and compare relative arrest and murder rates. We could then answer whether, and to what extent, racial disparities remain after accounting for differences in non-racial determinants of arrests and murders.
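To make the idea concrete, here is a minimal sketch of what such a baseline comparison could look like. Everything in it is invented for illustration: the coefficients are assumptions rather than estimates, and the community figures are placeholders, not real data.

```python
# Hypothetical sketch: compare observed arrest rates against a baseline
# expectation predicted from non-racial community characteristics.
# All coefficients and figures below are invented for illustration.

def expected_arrest_rate(poverty_rate, guns_per_capita):
    """Toy linear baseline model; coefficients are assumed, not estimated."""
    return 0.02 + 0.10 * poverty_rate + 0.03 * guns_per_capita

communities = [
    # (name, poverty_rate, guns_per_capita, observed_arrest_rate)
    ("A", 0.10, 0.3, 0.045),
    ("B", 0.25, 0.5, 0.070),
    ("C", 0.25, 0.5, 0.095),  # same conditions as B, higher observed rate
]

for name, poverty, guns, observed in communities:
    expected = expected_arrest_rate(poverty, guns)
    gap = observed - expected
    print(f"{name}: expected={expected:.3f}, observed={observed:.3f}, gap={gap:+.3f}")
```

The interesting quantity is the gap: communities B and C share the same socio-economic conditions, so any remaining difference in observed rates is what a researcher would then try to attribute to other factors, such as racial composition.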

What about temporal effects?

A second key question to ask is how our data and corresponding analyses account (or fail to account) for temporal effects. There are at least three types to consider: measurement lags, where indicators move slowly and the observable implications of a change appear only after time has passed; temporal dependencies, where each observation in a period depends not just on contemporaneous changes but also on lagged effects from previous periods; and temporal dimensionality effects, where the unit of measurement, usually a count, does not fully capture the temporal dynamics of the outcome of interest, usually a duration or repeated tenure.

My previous analyses reveal that Black Americans are nearly three times more likely to be incarcerated after an arrest than White Americans. To calculate this incarceration rate, I divide the current number of prisoners by the number of arrests in the same time period. While this is a good proxy for relative risk, it does not capture the temporal dynamics of varying incarceration sentences. It reveals that Black Americans are more likely to be incarcerated, but not the cumulative magnitude of the difference in total time imprisoned.

Since we only have data on the number of prisoners, and not on the time they have served or have yet to serve of their original sentence, we cannot differentiate between a prisoner finishing the last few months of a 10-month sentence and a prisoner in the middle of a 10-year sentence. In the data they are empirically equivalent, each counting as one prisoner, though substantively very different when calculating differences in incarceration outcomes by race.
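A tiny example makes the point. The records below are invented, and fields like `sentence_months` and `served_months` are exactly the kind of data that is not consistently available: a headcount treats the two prisoners identically, while the time-based measures diverge enormously.

```python
# Toy illustration of "empirically equivalent" prisoners in count data.
# Records and fields are hypothetical; real national data does not
# consistently record sentence length or time served.

prisoners = [
    {"id": 1, "sentence_months": 10,  "served_months": 9},   # nearly done
    {"id": 2, "sentence_months": 120, "served_months": 60},  # mid 10-year term
]

# A headcount treats both prisoners identically:
headcount = len(prisoners)

# But sentenced and remaining time differ enormously:
total_sentenced = sum(p["sentence_months"] for p in prisoners)
total_remaining = sum(p["sentence_months"] - p["served_months"] for p in prisoners)

print(headcount, total_sentenced, total_remaining)  # 2 130 61
```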

In an ideal world, there would be a comprehensive dataset recording the reason for the initial arrest, the charges filed against an individual, and the length of the sentence for that crime. That would allow comparison of both incarceration rates and sentence lengths. Rather than just comparing the number of prisoners as a proportion of arrests, we could examine both differences in conviction rates and temporal differences in sentenced jail time. We could also calculate cumulative sentenced time (either total sentence duration or time already served) for a more comprehensive measure of racial differences in incarceration outcomes. For example, are Black Americans not just more likely to be incarcerated, but also more likely to be sentenced to longer terms for similar offenses?
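With that ideal dataset, the question in the last sentence becomes a straightforward grouped comparison. Here is a hedged sketch under that assumption; the records, offense labels, and sentence lengths are all invented for illustration.

```python
# Sketch of the comparison the ideal dataset would allow: sentence
# lengths for the same offense, compared across groups.
# All records below are invented for illustration.
from statistics import median

records = [
    {"offense": "drug_possession", "group": "black", "sentence_months": 24},
    {"offense": "drug_possession", "group": "black", "sentence_months": 30},
    {"offense": "drug_possession", "group": "white", "sentence_months": 12},
    {"offense": "drug_possession", "group": "white", "sentence_months": 18},
]

def median_sentence(records, offense, group):
    """Median sentence length for one offense within one group."""
    months = [r["sentence_months"] for r in records
              if r["offense"] == offense and r["group"] == group]
    return median(months)

print(median_sentence(records, "drug_possession", "black"))  # 27.0
print(median_sentence(records, "drug_possession", "white"))  # 15.0
```

Holding the offense fixed is what makes the comparison meaningful: a gap in median sentence for the same charge is evidence of a sentencing disparity rather than a difference in the underlying offenses.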

Imagining the data we do not yet have

These two questions are starting points for imagining the kind of better data that could provide more insightful answers, complementing the best analyses we can do with the data we have. It is critical that we imagine the ideal, so that we can be more confident in what we do know and clearer about what to do next to learn what we do not yet know.
