What is data cleansing, do you need it and how does it work?
Spring has sprung, friends! The calendar officially propelled us from winter into spring this past weekend and, with the sun shining, the Easter holiday just around the corner and COVID-19 vaccines becoming widely available, everything sure does seem brighter, doesn’t it? Hopefully you’re feeling energized for one of the season’s biggest tasks – spring cleaning. If so, we encourage you to think beyond reorganizing the closets and deep cleaning the oven (but please, do deep clean the oven). We’re curious: what are your current data cleansing practices? If you’re not sure what we’re talking about, no worries. Now is an excellent time to tackle it.
First things first: what is data cleansing?
Data cleansing is the process of removing or correcting any “dirty” data that doesn’t belong in your dataset. This could be incorrect info (often due to human error and “fat thumbs”), outdated items, outliers, duplicates, or incomplete or improperly formatted data. A lot of this process can be automated via a friendly software solution. A very simple example would be running a spellcheck, though much more advanced data cleansing tools exist. However, be aware that some data cleansing tasks need the time and attention of a human (friendliness optional but very much preferred).
Why is data cleansing important?
Because technology can do so much, it’s tempting to just assume that computers are smart and they know what’s up. The reality is that, while computers have enormous potential, they are incredibly literal, meaning that they only do exactly what we tell them to do. In terms of data, that means that if you want your software to perform any kind of analysis or machine learning, it’s critical that you feed it accurate data as a starting point. Incorrect data undermines any effort to use your data to make decisions – business intelligence, personalized marketing, predictive maintenance, etc. And these errors will cost you. According to Forbes, dirty data can account for up to 12 percent in revenue losses.
Unfortunately, data cleansing isn’t something you can do once and check off your list – especially if customer data is important to your operation. In the B2B world, customer data decays at a rate of 30 percent a year. People change, man. Your customer database needs to change with them.
Here’s how you do it.
There are a lot of methods for finding dirty data. Aside from running data scrubbing software, you might use algorithms, graphing, histograms or even just good old-fashioned skimming and note-taking. Once you find your problematic data, here are some steps you won’t want to skip:
Set structural rules and stick to them. This refers to naming conventions. For example, does your company abbreviate the United States as US, U.S., USA or U.S.A.? Chances are that your computer will see those as four separate places.
Flag and filter outliers. The term “outlier” refers to a value that is just way off base. For instance, many businesses are currently recording employees’ temperatures each day as a pandemic precaution. An outlier would be 988 degrees instead of 98.8 degrees. That number just doesn’t make sense. Check whether each outlier is a genuine value or a mistake and, if you can’t tell, remove it from any analysis.
Get rid of obvious junk. Some data just has to go. This could be your duplicates or info that just isn’t at all relevant to the type of insight you’re after. Say you want to analyze failure rates on equipment that is 10 years old or older. Any newer equipment that ends up in this dataset is really going to skew your results. Filter it out.
Address the gaps. Inevitably, your dataset will have missing info. Sometimes you can figure out the missing values based on other data (but know that in some cases this degrades the data integrity). Your other options are dropping the data altogether or flagging it for your machine learning algorithms. It all depends on what exactly you are analyzing and why.
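If you like to see things in code, here is a minimal Python sketch of the four steps above, applied to the employee temperature example. It is only an illustration, not a finished tool: the field names (`employee_id`, `country`, `temp_f`), the choice of "US" as the house spelling and the 90 to 110 degree "plausible" range are all assumptions we made up for the example. The steps also run in a slightly different order than the list, so that duplicates are caught after the country names have been standardized.

```python
# Hypothetical spelling variants mapped to one label (assumption: "US" is the house style).
COUNTRY_ALIASES = {"US": "US", "U.S.": "US", "USA": "US", "U.S.A.": "US"}

# Assumed plausible range for a body temperature in degrees Fahrenheit.
TEMP_RANGE_F = (90.0, 110.0)


def clean(records):
    """Return (cleaned_rows, flagged_rows) for a list of dict records."""
    seen = set()
    cleaned, flagged = [], []
    for row in records:
        row = dict(row)  # work on a copy so the input is left untouched

        # Structural rules: collapse spelling variants into one label.
        country = (row.get("country") or "").strip()
        row["country"] = COUNTRY_ALIASES.get(country.upper(), country)

        # Obvious junk: skip exact duplicates of a row we have already seen.
        key = (row.get("employee_id"), row["country"], row.get("temp_f"))
        if key in seen:
            continue
        seen.add(key)

        # Gaps: flag missing readings instead of guessing a value.
        temp = row.get("temp_f")
        if temp is None:
            row["issue"] = "missing temp_f"
            flagged.append(row)
            continue

        # Outliers: set aside implausible readings for a human to review.
        low, high = TEMP_RANGE_F
        if not low <= temp <= high:
            row["issue"] = "temp_f out of range"
            flagged.append(row)
            continue

        cleaned.append(row)
    return cleaned, flagged


records = [
    {"employee_id": 1, "country": "U.S.A.", "temp_f": 98.8},
    {"employee_id": 1, "country": "USA", "temp_f": 98.8},   # duplicate once normalized
    {"employee_id": 2, "country": "us", "temp_f": 988},     # typo, so an outlier
    {"employee_id": 3, "country": "U.S.", "temp_f": None},  # gap
]

cleaned, flagged = clean(records)
print(cleaned)
print(flagged)
```

Notice that the sketch flags the outlier and the missing reading rather than silently deleting them. That keeps a human in the loop, which fits the point above that the right fix depends on what you are analyzing and why.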
Happy data cleansing! Remember there are some very thorough data cleansing tools you can adopt to speed up this process. Remember, too, to clean your data periodically since it’s always changing. Oh, and good luck cleaning that oven – remember to wear gloves. If you have more questions about data, take a look at this blog post, or reach out to us directly any time.