Based on my experience working in global corporate for-profits and tiny start-up not-for-profits, it’s a mistake to assume that data gets cleaner and higher quality as an organization grows. Whether data has been collected from a website form or brought in through an API, you’re following the same process of gathering information and putting it into a system. You may have more data staff in a large enterprise or fewer people in a small one, but without the right checks and balances in place, you’ll end up with data that’s just as dirty as the next business down the road, and you’ll pay the cost of bad data all the same.

Some seemingly trivial examples of dirty data that can have very real implications include the following (a validation sketch follows the list):

  • Using an open text field for “country” where a dropdown pick list would stop people from entering whatever they want; e.g., UK, England, Britain, Great Britain, The United Kingdom, Northern Ireland, etc.
  • Phone number formats, which are all well and good in their infinite variety until you decide to integrate a telephony system
  • Zip codes and postal codes, which work very well with apps that calculate the distance from a local office to every code in the vicinity – unless, of course, they aren’t entered consistently and can’t be verified
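
To make these concrete, here’s a minimal sketch in Python of the kind of pick-list and format checks that stop this data at the door. The abbreviated country list, the crude phone rule, and the simplified UK postcode pattern are all illustrative assumptions, not production logic:

```python
import re

# Illustrative pick list -- a real form would use the full ISO 3166 list.
COUNTRIES = {"United Kingdom", "Ireland", "France", "Germany", "United States"}

# Free-text variants mapped back to one canonical value.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "england": "United Kingdom",
    "britain": "United Kingdom",
    "great britain": "United Kingdom",
    "the united kingdom": "United Kingdom",
}

def normalize_country(raw):
    """Return a canonical country name, or None if it can't be resolved."""
    cleaned = raw.strip()
    if cleaned in COUNTRIES:
        return cleaned
    return COUNTRY_ALIASES.get(cleaned.lower())

def normalize_phone(raw):
    """Keep only digits and a leading +; a real telephony integration
    would use a dedicated library, but even this removes format noise."""
    digits = re.sub(r"[^\d+]", "", raw)
    return digits if 7 <= len(digits.lstrip("+")) <= 15 else None

# Simplified UK postcode shape check -- it does not prove the code exists.
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$", re.IGNORECASE)

print(normalize_country("Great Britain"))      # United Kingdom
print(normalize_phone("+44 (0)20 7946 0958"))  # +4402079460958
print(bool(UK_POSTCODE.match("SW1A 1AA")))     # True
```

In practice, checks like these belong in the form or the API layer itself, so bad values are rejected before they ever reach the database.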

If your team knows how much time and resources bad data costs them and is looking to get their arms around the problem, here are some of the steps I recommend in my practice.

  1. Evaluate what you’ve got. If your inside sales reps or marketing staff are filling out profiles with no customer emails available, it’s better to let them check a box to that effect – No Email Available – than to have them guess at the company’s email format or fill in the field with something like noemail@email.com. The record can be corrected later, versus mounting one digital campaign after another to an address that goes nowhere. Not much ROI in that. A quick way to surface placeholder addresses is sketched below.
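
As an illustration, here is a small sketch of that idea. The placeholder patterns are assumptions extrapolated from the noemail@email.com example, not an exhaustive list:

```python
import re

# Patterns we assume indicate a guessed or placeholder address.
PLACEHOLDER_PATTERNS = [
    re.compile(r"^noemail@", re.IGNORECASE),
    re.compile(r"@email\.com$", re.IGNORECASE),
    re.compile(r"^(none|unknown|n/?a)@", re.IGNORECASE),
]

def flag_email(record):
    """Set no_email_available instead of keeping a junk address."""
    email = (record.get("email") or "").strip()
    if not email or any(p.search(email) for p in PLACEHOLDER_PATTERNS):
        record["email"] = None
        record["no_email_available"] = True  # the checkbox, in data form
    return record

print(flag_email({"name": "Acme Ltd", "email": "noemail@email.com"}))
# {'name': 'Acme Ltd', 'email': None, 'no_email_available': True}
```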
     
  2. Get started versus boiling the ocean. Once you’ve accepted that you need to do something about your data quality, it’s important to get going rather than waiting until you have a perfect plan – that could take months or years. Try to understand which data you’re collecting and why. Assess what your largest source of data is – in the world of non-profits, it’s often the fundraising database or the volunteer management integration – and move to stop the flow of bad data there; for example, update data fields and validation to ensure you have the right formats. (A simple way to rank sources by bad data is sketched below.) I’m also an advocate of hiding data from view using system admin field settings, so users who don’t need to see all the data – and would likely be slowed down by it – don’t see it at all.
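
One way to decide where to start is simply to count validation failures by source system. This sketch assumes each record carries a source tag and uses stand-in checks; swap in your own systems and rules:

```python
from collections import Counter

def is_valid(record):
    # Stand-in checks; replace with your real field validations.
    return bool(record.get("email")) and bool(record.get("country"))

records = [
    {"source": "fundraising_db", "email": "a@b.org", "country": "UK"},
    {"source": "fundraising_db", "email": "", "country": "UK"},
    {"source": "volunteer_app", "email": "c@d.org", "country": ""},
    {"source": "volunteer_app", "email": "", "country": ""},
]

# Tally failures per source; start fixing whichever tops the list.
failures = Counter(r["source"] for r in records if not is_valid(r))
for source, count in failures.most_common():
    print(f"{source}: {count} bad records")
```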
     
  3. Investigate “perfect” data claims. Data is almost never pristine. When I was in hospitality, there were hotels that would turn in perfect audits month after month. I finally asked a contact at one of them how they did it. Their response was, “Have you actually looked at what we put in?” It turned out that some of these hotels did have a perfect record – in finding ways around the data validation. One month at the same company, someone accidentally added two zeroes to the end of an inquiry about a group stay, making it £1 million instead of £10,000. Since the stay never happened, it skewed the conversion rate for the whole month and panicked our business leaders. So much for perfect data. A basic magnitude check, like the one sketched below, would have caught it.
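
The flag-anything-over-ten-times-the-median rule here is an arbitrary illustration; choose a threshold that suits your own figures:

```python
from statistics import median

def flag_outliers(values, factor=10.0):
    """Flag values more than `factor` times the median -- a crude but
    effective catch for fat-finger errors like an extra pair of zeroes."""
    baseline = median(values)
    return [v for v in values if v > baseline * factor]

recent_inquiries = [8_000, 12_000, 9_500, 10_000, 1_000_000]
print(flag_outliers(recent_inquiries))  # [1000000] -- worth a second look
```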
     
  4. Practice good data governance. Good governance is what separates organizations that collect gender and ethnicity data because they feel it will be “interesting” from those that know General Data Protection Regulation (GDPR) compliance and understand the requirements for gathering data, aggregating it, sharing it appropriately, storing it, backing it up, and deleting it when required. Because a lot of non-profits work on the basis of goodwill, they assume no one will mind if they collect a little extra data, or if they keep it past the date when they’re no longer supposed to be holding it. However, bad data practices have costs. I like to refer to a tabloid newspaper here in the U.K. and say, “Imagine how it will look on the cover of The Daily Mail after your data’s been breached.” No one ever suspects they’ll be hacked until they are. (A minimal retention sweep is sketched below.)
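
On the deletion requirement specifically, a retention sweep can be very simple. The two-year period and the field names below are assumptions – GDPR doesn’t prescribe a single retention period, so follow your own policy and legal advice:

```python
from datetime import date, timedelta

RETENTION = timedelta(days=365 * 2)  # assumed policy: keep for two years

def expired(record, today):
    """True when a record has passed its retention date."""
    return today - record["last_consented"] > RETENTION

records = [
    {"id": 1, "last_consented": date(2020, 1, 15)},
    {"id": 2, "last_consented": date(2024, 6, 1)},
]

today = date(2025, 1, 1)
to_delete = [r["id"] for r in records if expired(r, today)]
print(to_delete)  # [1] -- schedule these for deletion, and log that you did
```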
     
  5. Be careful what you ask for. The data requests we make of users such as potential customers or donors have become a bit more reasonable in recent years, and pre-populating form functionality can be a godsend, but overreach is still rampant. Before presenting a colleague or sales prospect with 100 fields to fill out without noting which ten are mandatory, consider their likely behavior: “Oh, I don’t know. Click, click, click.” Suddenly you’ve got a boatload of dirty data. (A quick check on form length is sketched below.)
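
If you want to police this in your own form definitions, even a toy check helps. The field list and the ten-mandatory-fields ceiling below are purely illustrative:

```python
# Hypothetical form definition: (field name, required?)
FORM_FIELDS = [
    ("email", True), ("first_name", True), ("last_name", True),
    ("company", False), ("phone", False), ("fax", False),
    ("favourite_colour", False),  # does anyone really need this?
]

MAX_REQUIRED = 10  # an assumed ceiling, not a standard

required = [name for name, req in FORM_FIELDS if req]
optional = [name for name, req in FORM_FIELDS if not req]

print(f"{len(required)} required, {len(optional)} optional")
if len(required) > MAX_REQUIRED:
    print("Too many mandatory fields -- expect click, click, click.")
```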
     
  6. Test and learn. If one of your clients or users insists on collecting a dubious piece of data, ask them what they want it for and what the overall plan is for using and measuring it. Can it be selectively added for just a subset of users? Is it going to add any value? If they insist, offer to implement the change but to report on it periodically. If after a year only 1 percent of the target audience is providing this data, you’ve got the evidence you need to retire the field. (The report can be as simple as the fill-rate sketch below.)
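
In this sketch, the field name and stand-in data are hypothetical; the 1 percent threshold mirrors the example above:

```python
def fill_rate(records, field):
    """Share of records where `field` is actually populated."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

# Stand-in data: the dubious field a stakeholder insisted on collecting.
records = [{"preferred_contact_hour": None} for _ in range(99)]
records.append({"preferred_contact_hour": 14})

rate = fill_rate(records, "preferred_contact_hour")
print(f"{rate:.0%} of records provide this field")  # 1%
if rate <= 0.01:
    print("A year of this is evidence enough to retire the field.")
```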
     
  7. Be honest about your limitations. I’ve seen far too many examples where a data team fudges the figures because leadership believes something is doable. Then they get into trouble six months down the line when leadership asks, “What’s our sustainer rate for donors?” and the new analyst admits there’s no way to tell them. To prevent you or another analyst from being thrown under the bus at some point in the future, be honest and admit what you can and can’t offer. Better to have a difficult conversation now than to spend three days every month tweaking all your numbers to try to bring them in where someone wants them.

AI: Good data in, good data out
If you’ve ever seen an AI application driven by a Large Language Model hallucinate, you know that AI is only as good as the data it gets. The last thing you want is for your AI application to churn through thousands of records with zip codes like XXXXX, or to be turned loose on a database where Sales puts data in one field and Service puts the same data in another. The fewer gaps there are, the more meaningful AI’s outputs will be. You need to make sure your data is as clean as possible, which is why you need good data governance, including a full data dictionary that explains the purpose of all your fields, as well as a common, consistent language to describe your terms. (A minimal data dictionary is sketched below.)
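
A data dictionary doesn’t have to be elaborate to be useful. Here’s a minimal sketch of one, paired with a scan for placeholder values like those XXXXX zip codes; the fields, owners, and junk values are illustrative assumptions:

```python
# Minimal data dictionary: purpose, owner, and known junk for each field.
DATA_DICTIONARY = {
    "zip_code": {
        "purpose": "Distance calculations from the local office",
        "owner": "Operations",
        "placeholder_values": {"XXXXX", "00000", "99999"},
    },
    "email": {
        "purpose": "Digital campaigns and receipts",
        "owner": "Marketing",
        "placeholder_values": {"noemail@email.com"},
    },
}

def scan_for_placeholders(records):
    """Return (record index, field) pairs holding known junk values."""
    hits = []
    for i, record in enumerate(records):
        for field, spec in DATA_DICTIONARY.items():
            if record.get(field) in spec["placeholder_values"]:
                hits.append((i, field))
    return hits

records = [{"zip_code": "XXXXX", "email": "a@b.org"}]
print(scan_for_placeholders(records))  # [(0, 'zip_code')]
```

The same dictionary that feeds a scan like this is exactly what you’d hand to an AI system, or a human analyst, so they know what each field means.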

I like to think of using AI as sending your data off to a large data company for reporting and analysis. With bad data and no data dictionary, the AI will return bad results. But the data company has no magic wand, either. They’ll come back to you straightaway with a list of questions. Do you have your data in this format? Can you please update all these records? Are these elements part of this formula? And so on. So iron out these issues in advance if you can.

Taking data seriously
Data is no longer a trivial thing that you collect and use haphazardly. It’s what businesses run on. There are not only financial and reputational implications of getting it wrong, but regulatory and legal ones as well. Make sure your organization knows what data you’re collecting and why. A heightened awareness of data pays off both today and in the long term.