Data ethics is a hot topic these days, and rightly so. For any organization, there is a whole world of data ethics in play that, at its core, is about mitigating risk. So how do you arrive at a better stance on the ethical use of data?

For the clients I work with, the answer is both simple and highly complicated. You mitigate risk by focusing on how your organization uses data. This often means considering the unintended consequences of exposing data to anyone in your organization. That’s where the complexity comes in: it’s nearly impossible to control how anyone uses data once it’s out of your hands. It’s also simply a reality that any data set’s degree of hygiene or bias constantly shifts depending on how the data was collected and how it’s being analyzed.

Valid data, unintended consequences

In quantitative measurement, a subfield of educational psychology, we talk about the importance of validating what you measure. If you’re conducting a survey, for example, you have an intended purpose for what you measure and should provide validity evidence around that purpose. But that doesn’t mean this is how people will read your results or ultimately use that information, even if you provide a disclaimer.

A recent example is the Houston Astros. It’s perfectly ethical for catchers on opposing teams to signal to their pitchers whether to throw a curve or a slider to a particular Houston batter. That’s valid data. The use of this data crossed the line when the team’s manager and general manager helped develop an illegal scheme to steal those pitch signs from opposing catchers so their hitters could anticipate what was coming. Once this system came to light, both men were fired in 2020. Other teams have faced similar charges.

This dynamic hits our daily lives when someone collects valid data and someone else starts doing something different with it. Ethical data practices, in other words, should be thought of as sitting on a continuum. If a subset of a population could be negatively affected by the data we collect, it’s up to us as data leaders to evaluate our process and determine how we can clean it up.

Linking data literacy to ethics 

This is where data literacy comes in. Our ability to work with data, to understand its purpose and how it was collected and measured, directly determines how nuanced our reading of it is and, ultimately, how we use it.

Literacy is a two-way street, however. No matter how well we may curate data, we must rely on the consumers of that information, within and outside our organizations, to make sound judgments about how the data should be used based on how we've presented it. In short, the data literacy of both curator and consumer shapes whether the assumptions about how the data was to be used are respected or violated.

We know, in real-world practice, that the systems we have in place for creating data create the conditions for bias, whether intended or not. A person’s zip code is a well-researched example. This passive data (data that can be collected without the subject’s involvement) can sit innocuously in a record, or it can be used to determine whether to loan someone money, and at what interest rate. The passive data of a person’s race can have real implications for how they are treated by healthcare systems.

Data literacy plays a crucial role here, because if we don’t account for differences among sub-populations reflected in passive, objective data, the inferences we make will be incorrect at best and unethical at worst. This is especially problematic when the data is made available to someone who doesn’t understand those implications, or who frames the information in a way that benefits them.
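To make that concrete, here is a minimal sketch of the kind of check a data team might run before sharing a data set downstream. Everything in it is hypothetical: the loan-approval records, the zip-code-derived group labels, and the disparate-impact ratio, which is one common heuristic for surfacing outcome gaps across sub-populations, not a complete fairness audit.

```python
# Minimal sketch: checking whether approval outcomes differ across
# sub-populations. All data and group labels here are hypothetical.
from collections import defaultdict

def approval_rates(records):
    """Compute the approval rate for each group in `records`,
    where each record is a (group, approved) pair."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, ok in records:
        total[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / total[g] for g in total}

def disparate_impact(rates, reference):
    """Ratio of each group's approval rate to the reference group's.
    Values well below 1.0 flag a potential equity problem."""
    return {g: r / rates[reference] for g, r in rates.items()}

# Hypothetical records: (zip-code-derived group, loan approved?)
records = [("A", True), ("A", True), ("A", False),
           ("B", True), ("B", False), ("B", False)]

rates = approval_rates(records)
print(disparate_impact(rates, reference="A"))  # e.g. {'A': 1.0, 'B': 0.5}
```

Even a crude ratio like this gives curators and consumers a shared, concrete number to discuss before the data leaves their hands.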

Questions to ask

As data leaders, it’s important to ask several questions to prevent unintended data use, even when the data was passively collected:

  • Is the data we create as an organization equitable, and can it track the continuous change among all the customer groups we serve?
  • How might the use of this data influence its outcomes and the choices that are made as a result? 
  • How should the potential long-term implications of the display or use of this data shape how we collect it and distribute it? 
  • How explicitly should we validate the intended use of our data, especially if downstream users are likely to have lower data fluency? 
  • Do we have the processes in place to ensure long-term fairness?    

In business, data ethics conversations often center on whether you see the potential to drive additional margin based on the data at hand, and whether this advantage is based on active data that users volunteer or passive data over which they have no control. 

My point is, if we care about equity as data leaders, it’s incumbent on us to drive data literacy forward not only in our own organizations but also in our customers’ or end-users’ organizations. The literacy of our data ecosystems may play an outsized role in how willing any decision-maker or organization is to sacrifice the valid interpretation of a metric for an ulterior motive that unfairly disadvantages others for their own profit.

Conversations about data equity are beginning to happen, but they need to happen much more frequently, at all levels of organizations. There is currently an enormous gulf between organizations that say they care about data ethics and those actually building machine learning and AI models that drive equity. The bulk of organizations are indifferent to these questions, or defensive in responding to them, because they don’t believe their practices could drive ethically questionable outcomes. Yet the ultimate success of these models, and of the organizations behind them, is a direct function of creating higher data literacy, so that people understand the implications of their actions.