You can use your favorite social network to register or link an existing account:
Or use your email address to register without a social network:
Sign in with these social networks:
Or enter your username and password
Forgot your password?
Yes, please link my existing account with for quick, secure access.
No, I would like to create a new account with my profile information.
Tips
How-to
News
Videos
Stories
Today’s author: Chad Rothschiller, a program manager on the Excel team. Chad is going to discuss using formulas to 'clean up' data in Excel.
Overview
Excel is a great tool to use when you need to take data in one format, manipulate it into another format, and push the results along to another process, e.g. a database. In this context, Excel is a great landing pad or middle man, serving as a data transformation tool to move data from one system to another.
This example considers a sample data set and walks through the steps to clean up the data and perform various transformations on the data set to massage it into a more desirable format.
Example Data
Here’s the data discussed in this example exercise, feel free to copy/paste it into a spreadsheet and work along through the steps:
SSN,Last Name,Middle Name,First Name,DOB,Date of Letter,E-mail Address,Home Phone,Address,City,State,Postal Code123456789,freehafer ,drew ,nancy,19700101,11/1/2007,NancyF@northwindtraders.com,1235550102,123 1st Avenue,Seattle ,wa,09999123456789,cencini ,a ,andrew,19700101,11/1/2007,AndrewC@northwindtraders.com,1235550102,123 2nd Avenue,Bellevue,wa,09999123456789,kotas ,bollen,jan,19700101,11/1/2007,JanK@northwindtraders.com,1235550102,123 3rd Avenue,Redmond ,wa,09999123456789,sergienko ,,mariya,19700101,11/1/2007,MariyaS@northwindtraders.com,1235550102,123 4th Avenue,Kirkland,wa,09999123456789,thorpe ,j ,steven,19700101,11/1/2007,steven@northwindtraders.com,1235550102,123 5th Avenue,Seattle,wa,09999123456789,neipper ,john,michael,19700101,11/1/2007,MichaelN@northwindtraders.com,1235550102,123 6th Avenue,Redmond,wa,09999123456789,zare ,b ,robert,19700101,11/1/2007,RobertZ@northwindtraders.com,1235550102,123 7th Avenue,Seattle,wa,09999123456789,giussani ,,laura,19700101,11/1/2007,LauraG@northwindtraders.com,1235550102,123 8th Avenue,Redmond,wa,09999123456789,hellung-larsen ,marie,anne,19700101,11/1/2007,AnneHL@northwindtraders.com,1235550102,123 9th Avenue,Seattle ,wa,09999
Notice a few things about the data:
Here’s what needs to be done to the data to get it ready for processing:
Step 1: Import the data and don’t accidentally drop the leading zeros!
To Excel, the Postal Code values look like numbers, and so that’s the default way it will treat them when loading this file. (Unfortunately Excel can’t read the header “Postal Code” and switch it to a text field based on the semantic meaning of the data.)
Fortunately there is a way to override this default, using the Text Import Wizard:
Best fit the columns (select those columns, double click on the line between column headers), and here’s what the result should look like:
Now that the data has landed in Excel without dropping critical information, next up is to work it into the format needed for reporting and uploading.
Step 2: Formatting SSN
The SSN values need to be converted to the form NNN-NN-NNNN. There are two ways to accomplish this.
Change the display via cell formatting
Here’s what the end result looks like:
Note that the underlying value is “123456789” (look in the formula bar to verify this), while the cell is formatted to display the hyphens.
Inserting hyphens
A formula is needed to accomplish this. This example will use the convention of creating a new table of data from the original, where the formulas are in cells off to the right of the original data.
The formula for this will use the MID() function to pull out the first 3 numbers of the SSN. Then it will use the “&” to concatenate that with a hyphen, use the MID() function to grab the middle two numbers, use “&” again to insert another hyphen, and finally use MID() again to tack on the last 4 numbers of the SSN.
Here’s what the result looks like:
To prove it, select the column of SSNs, Copy / Paste Special… Values and examine the contents of the cells. The hyphens are in the right places.
Step 3: Fixing up names
There are several problems to be addressed with the names:
The “&” operator will be used to concatenate the value in Last Name with a comma and space; the “&” operator will be used to add the value from First Name and to add a space to that result; the LEFT() function along with the “&” operator will be used to create and add a middle initial from Middle Name. The last two steps will be to wrap those functions with a TRIM() function to remove extra spaces, and wrap the final result with a PROPER() function to set the letter casing correctly.
Note: while extra spaces were in the original values, even if they weren’t the TRIM() function would need to be used in this case because for people without Middle Name values listed, the concatenation part of the formula introduces an extra space. That’s also the reason why the TRIM() function should be used on the final result as opposed to deeper in the formula, i.e. on First Name, Last Name, and Middle Name individually.
Note: Extra spaces may not be the only undesirable characters in the data. The CLEAN() function can be used to remove non-printable characters. See Appendix 1 for a more thorough discussion of removing undesirable characters from the data.
Step 4: Convert DOB to an actual date
The DOB values are not recognized by Excel as dates. These need to be converted to MM/DD/YYYY format. Using the DATE() function and MID() function will produce the desired results. The YEAR, MONTH, and DAY portions of the date will be extracted from the DOB field using MID(), and the result is converted to a date using the DATE() function:
Step 5: Fill in Date of Letter
No changes are needed to the Date of Letter field, so either copy / paste those values in, or use a formula to refer to them so that the table created is complete.
Step 6: lower case E-Mail Address
For consistency sake, the email addresses should all be lower case, since they aren’t case-sensitive anyhow. This will be accomplished using the LOWER() function.
Step 7: Format Home Phone
The Home Phone values are hard to read, and not in the correct format. They need to be fixed up to match the (NNN) NNN-NNNN format.
Just like SSN, this can be accomplished through cell formatting or inserting the proper characters in the right places.
Inserting parenthesis & hyphens
A formula needs to be used to accomplish this. The formula will use MID() and concatenate operations to build up the correct value.
Step 8: Trim extra spaces from Address
Use the TRIM() function as done on the name fields to remove extra spaces. This one is easy, so go ahead and do it and move to the next step!
Step 9: Add City values to the new table
Since a whole new table is being created to replace the original data, the values from City need to be brought over, even though there’s nothing wrong with them. If you have a huge data set you might not be able to inspect them all by hand, so in that case you’ll probably still want to use TRIM() and CLEAN() on those values, just in case.
Step 10: Make all State value upper case
The State field contains lower case values, and all the characters for this field need to be upper case. Both PROPER() and LOWER() have already been used to clean up the data, so using UPPER() to convert the lower case characters to uppercase ones is easy.
Step 11: Fill in Postal Code
To complete the table, copy/paste the Postal Code values from L2:L10 to W2:W10.
Step 12: Finalize the values
Now creating the new table is completed, with the data in the format required. The sheet should be finalized by removing formulas, keeping only the values, and then deleting the original data.
Note: It might be useful to keep the formulas around on a template saved separately, so they don’t have to be built up over and over as different sets of data are processed.
Now the newly cleaned data is the only data set in the sheet. Nice work!
Bonus Discussion: Sorting Text using more than 3 keys
It may be that someone needs to sort data sets using more than 3 keys for the sort. As you might now, Excel 2003 only supports 3 levels of sort keys. Excel 2007 supports a whole lot more, so this exercise is trivial using Excel 2007. Here’s how it could be done using Excel 2003. Assume the original data set needs to be sorted by State / City / Last Name / First Name / Middle Name, consider this similar data set:
A new column can be created (say, in column F), using text concatenation to string together the sort keys. In F2, enter this formula: =E2&D2&A2&C2&B2, and give it a header, like “SortKey”.
Now, sorting on that single column is the same as sorting on those keys individually. Of course, ascending/descending order can’t be specified on each key using this method, so Excel 2007 is still better!
Appendix 1: Cleaning Text
Trim means
Removes all spaces from text except for single spaces between words. Use TRIM on text that you have received from another application that may have irregular spacing. The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity, . By itself, the TRIM function does not remove this non-breaking space character.
Removes all spaces from text except for single spaces between words. Use TRIM on text that you have received from another application that may have irregular spacing.
The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity, . By itself, the TRIM function does not remove this non-breaking space character.
Clean means
Removes all non-printable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed. The CLEAN function was designed to remove the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
Removes all non-printable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed.
The CLEAN function was designed to remove the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
Advanced Clean for Unicode Characters using SUBSTITUTE & CHAR
Sometimes text values contain leading, trailing, or multiple embedded space characters (Unicode character set values 32 and 160), or non-printing characters (Unicode character set values 0 to 31, 127, 129, 141, 143, 144, and 157). These characters can sometimes cause unexpected results when you sort, filter, or search. For example, users may make typographical errors by inadvertently adding extra space characters, or imported text data from external sources may contain nonprinting characters embedded in the text. Because these characters are not easily noticed, the unexpected results may be difficult to understand. To remove these unwanted characters, you can use a combination of the TRIM, CLEAN, and SUBSTITUTE functions. The TRIM function removes spaces from text except for single spaces between words. The CLEAN function removes all nonprintable characters from text. Both functions were designed to work with 7-bit ASCII, which is a subset of the ANSI character set (ANSI character set: An 8-bit character set used by Microsoft Windows that allows you to represent up to 256 characters (0 through 255) by using your keyboard. The ASCII character set is a subset of the ANSI set.). It's important to understand that the first 128 values (0 to 127) in 7-bit ASCII represent the same characters as the first 128 values in the Unicode character set. The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity, . By itself, the TRIM function does not remove this non-breaking space character. The CLEAN function was designed to remove the first 32 non-printing characters in the 7 bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters. To do this task, use the SUBSTITUTE function to replace the higher value Unicode characters with the 7-bit ASCII characters for which the TRIM and CLEAN functions were designed.
Sometimes text values contain leading, trailing, or multiple embedded space characters (Unicode character set values 32 and 160), or non-printing characters (Unicode character set values 0 to 31, 127, 129, 141, 143, 144, and 157). These characters can sometimes cause unexpected results when you sort, filter, or search. For example, users may make typographical errors by inadvertently adding extra space characters, or imported text data from external sources may contain nonprinting characters embedded in the text. Because these characters are not easily noticed, the unexpected results may be difficult to understand. To remove these unwanted characters, you can use a combination of the TRIM, CLEAN, and SUBSTITUTE functions.
The TRIM function removes spaces from text except for single spaces between words. The CLEAN function removes all nonprintable characters from text. Both functions were designed to work with 7-bit ASCII, which is a subset of the ANSI character set (ANSI character set: An 8-bit character set used by Microsoft Windows that allows you to represent up to 256 characters (0 through 255) by using your keyboard. The ASCII character set is a subset of the ANSI set.). It's important to understand that the first 128 values (0 to 127) in 7-bit ASCII represent the same characters as the first 128 values in the Unicode character set.
The CLEAN function was designed to remove the first 32 non-printing characters in the 7 bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
To do this task, use the SUBSTITUTE function to replace the higher value Unicode characters with the 7-bit ASCII characters for which the TRIM and CLEAN functions were designed.
Comments: (25) Collapse
Thanks for bringing the issue with Data Tables to our attention. This is a known issue and we're working on getting a fix out as a part of a future release. We'll update you on when the fix will be available when we have more information to share.
Any way to use CLEAN to clean-up non-printable characters in the whole cloumn or from multiple columns?
I have an excel spreadsheet with 2000 lines of data. Some of the data is the same address. One may be writtes as 301 N Howeth St, one may be written 301 Howeth St, one may be written 301 Howeth. I want to have a way to run a script and find these similar rows and ask me if I want to delete any so I can get rid of the dups. Any help on this? rlasalle@mfgtx.com
<p>Excel Novice - I'd just copy the CLEAN function down the whole column. To cover multiple columns, just keep adding new columns of formulas off to the right, for as many as you need to clean.</p>
<p>Robert LaSalle - You're essentially talking about Data Quality, Cleansing, Standardization, and Deduplication. I have seen companies spend tens of thousands of dollars to clean up their marketing databases / address lists. It is just plain irritating to send someone a piece of mail 3 different times. It gives the impression you don't know who they are. You have a similar problem with names: is S. Thorpe the same as Steven Thorpe? I would say it depends. If the address is different, then certainly they’re not the same. If the address is the same, then, well, what if Steven has a wife Shelly? But at least you know it's the same household.</p>
<p>The real way to solve this problem is to first normalize the address (N becomes North, St becomes Street (in most cases), and so on) and parse it out into sub fields (like house number, direction, street name, street type). Then you can use matching rules to decide if they're the same. You can even weight certain fields as stronger indications of a match than others. e.g. it isn't that big of a deal if the direction is missing on 1 of the records you're comparing, but it certainly isn't a match if the street name doesn't match. To do all this, you'll need specialized software that might cost you upwards of $100K (see <a href="www.trilliumsoftware.com/" target="_new" rel="nofollow">http://www.trilliumsoftware.com</a> for an example of such software).</p>
<p>To do a one-off job in Excel, you're going to have to play with the data and eyeball it. You might consider making a copy of your data, then removing all the directional pieces from the address (N/S/E/W/NE/NW/SE/SW/North/South/East/West/etc), and maybe even the suffix data as well (Ave/Avenue/St/Street/Drive/Dr/etc). Then sort by those fields and look for duplicates, looking at first name/last name to help, and so on. Then go back to the original data to find and remove possible duplicates. You could also try sorting by last name then first name, and looking for duplicates there.</p>
<p>To quickly flag duplicates in a list, here's what I sometimes do. Say there are sorted addresses in A1:A10. In B2 I'd put this formula: =IF(A2=A1,"X",""), and then copy it down. The X will flag any duplicates and you can quickly work through them.</p>
<p>OK, enough on that topic!</p>
Sorry, newbees and don't know where to blog.
when I plot my data as an exponential curve in Excel, the Excel 2007 gave a wrong equation (y=18159e0.0236x)
while Excel 2003 (y=181596e0.0236x) give correct equation
Raw data:
x y
0 200000
25 302500
49 535833
74 1035000
96 1863333
msn999 - this is a known issue that we are hoping to fix in the upcoming SP1.
I have a spreadsheet of email addresses that all appear to have the @ sign in them - but when some of them reach their destination they @ sign has been replaced with the HTML code %40. There is no visual difference between those that appear this way and those that don't.
Is there any way to clean this?
Thanks!
Charles
Charles - could you send an example of each type in a workbook? Use blogs.msdn.com/.../contact.aspx to start an email conversation and I can take a look.
--Chad
David/Chad
Great post - these are techniques I use all the time, and I frequently get asked to help colleagues 'clean' their data. Nice to see that someone has had time to put all this information together so now I have a place I can send people when I don't have time to help!
-TB 8>
Comments: (loading) Collapse