You can use your favorite social network to register or link an existing account:
Or use your email address to register without a social network:
Sign in with these social networks:
Or enter your username and password
Forgot your password?
Yes, please link my existing account with for quick, secure access.
No, I would like to create a new account with my profile information.
Tips
How-to
News
Videos
Stories
Today’s author: Chad Rothschiller, a program manager on the Excel team. Chad is going to discuss using formulas to 'clean up' data in Excel.
Overview
Excel is a great tool to use when you need to take data in one format, manipulate it into another format, and push the results along to another process, e.g. a database. In this context, Excel is a great landing pad or middle man, serving as a data transformation tool to move data from one system to another.
This example considers a sample data set and walks through the steps to clean up the data and perform various transformations on the data set to massage it into a more desirable format.
Example Data
Here’s the data discussed in this example exercise, feel free to copy/paste it into a spreadsheet and work along through the steps:
SSN,Last Name,Middle Name,First Name,DOB,Date of Letter,E-mail Address,Home Phone,Address,City,State,Postal Code123456789,freehafer ,drew ,nancy,19700101,11/1/2007,NancyF@northwindtraders.com,1235550102,123 1st Avenue,Seattle ,wa,09999123456789,cencini ,a ,andrew,19700101,11/1/2007,AndrewC@northwindtraders.com,1235550102,123 2nd Avenue,Bellevue,wa,09999123456789,kotas ,bollen,jan,19700101,11/1/2007,JanK@northwindtraders.com,1235550102,123 3rd Avenue,Redmond ,wa,09999123456789,sergienko ,,mariya,19700101,11/1/2007,MariyaS@northwindtraders.com,1235550102,123 4th Avenue,Kirkland,wa,09999123456789,thorpe ,j ,steven,19700101,11/1/2007,steven@northwindtraders.com,1235550102,123 5th Avenue,Seattle,wa,09999123456789,neipper ,john,michael,19700101,11/1/2007,MichaelN@northwindtraders.com,1235550102,123 6th Avenue,Redmond,wa,09999123456789,zare ,b ,robert,19700101,11/1/2007,RobertZ@northwindtraders.com,1235550102,123 7th Avenue,Seattle,wa,09999123456789,giussani ,,laura,19700101,11/1/2007,LauraG@northwindtraders.com,1235550102,123 8th Avenue,Redmond,wa,09999123456789,hellung-larsen ,marie,anne,19700101,11/1/2007,AnneHL@northwindtraders.com,1235550102,123 9th Avenue,Seattle ,wa,09999
Notice a few things about the data:
Here’s what needs to be done to the data to get it ready for processing:
Step 1: Import the data and don’t accidentally drop the leading zeros!
To Excel, the Postal Code values look like numbers, and so that’s the default way it will treat them when loading this file. (Unfortunately Excel can’t read the header “Postal Code” and switch it to a text field based on the semantic meaning of the data.)
Fortunately there is a way to override this default, using the Text Import Wizard:
Best fit the columns (select those columns, double click on the line between column headers), and here’s what the result should look like:
Now that the data has landed in Excel without dropping critical information, next up is to work it into the format needed for reporting and uploading.
Step 2: Formatting SSN
The SSN values need to be converted to the form NNN-NN-NNNN. There are two ways to accomplish this.
Change the display via cell formatting
Here’s what the end result looks like:
Note that the underlying value is “123456789” (look in the formula bar to verify this), while the cell is formatted to display the hyphens.
Inserting hyphens
A formula is needed to accomplish this. This example will use the convention of creating a new table of data from the original, where the formulas are in cells off to the right of the original data.
The formula for this will use the MID() function to pull out the first 3 numbers of the SSN. Then it will use the “&” to concatenate that with a hyphen, use the MID() function to grab the middle two numbers, use “&” again to insert another hyphen, and finally use MID() again to tack on the last 4 numbers of the SSN.
Here’s what the result looks like:
To prove it, select the column of SSNs, Copy / Paste Special… Values and examine the contents of the cells. The hyphens are in the right places.
Step 3: Fixing up names
There are several problems to be addressed with the names:
The “&” operator will be used to concatenate the value in Last Name with a comma and space; the “&” operator will be used to add the value from First Name and to add a space to that result; the LEFT() function along with the “&” operator will be used to create and add a middle initial from Middle Name. The last two steps will be to wrap those functions with a TRIM() function to remove extra spaces, and wrap the final result with a PROPER() function to set the letter casing correctly.
Note: while extra spaces were in the original values, even if they weren’t the TRIM() function would need to be used in this case because for people without Middle Name values listed, the concatenation part of the formula introduces an extra space. That’s also the reason why the TRIM() function should be used on the final result as opposed to deeper in the formula, i.e. on First Name, Last Name, and Middle Name individually.
Note: Extra spaces may not be the only undesirable characters in the data. The CLEAN() function can be used to remove non-printable characters. See Appendix 1 for a more thorough discussion of removing undesirable characters from the data.
Step 4: Convert DOB to an actual date
The DOB values are not recognized by Excel as dates. These need to be converted to MM/DD/YYYY format. Using the DATE() function and MID() function will produce the desired results. The YEAR, MONTH, and DAY portions of the date will be extracted from the DOB field using MID(), and the result is converted to a date using the DATE() function:
Step 5: Fill in Date of Letter
No changes are needed to the Date of Letter field, so either copy / paste those values in, or use a formula to refer to them so that the table created is complete.
Step 6: lower case E-Mail Address
For consistency sake, the email addresses should all be lower case, since they aren’t case-sensitive anyhow. This will be accomplished using the LOWER() function.
Step 7: Format Home Phone
The Home Phone values are hard to read, and not in the correct format. They need to be fixed up to match the (NNN) NNN-NNNN format.
Just like SSN, this can be accomplished through cell formatting or inserting the proper characters in the right places.
Inserting parenthesis & hyphens
A formula needs to be used to accomplish this. The formula will use MID() and concatenate operations to build up the correct value.
Step 8: Trim extra spaces from Address
Use the TRIM() function as done on the name fields to remove extra spaces. This one is easy, so go ahead and do it and move to the next step!
Step 9: Add City values to the new table
Since a whole new table is being created to replace the original data, the values from City need to be brought over, even though there’s nothing wrong with them. If you have a huge data set you might not be able to inspect them all by hand, so in that case you’ll probably still want to use TRIM() and CLEAN() on those values, just in case.
Step 10: Make all State value upper case
The State field contains lower case values, and all the characters for this field need to be upper case. Both PROPER() and LOWER() have already been used to clean up the data, so using UPPER() to convert the lower case characters to uppercase ones is easy.
Step 11: Fill in Postal Code
To complete the table, copy/paste the Postal Code values from L2:L10 to W2:W10.
Step 12: Finalize the values
Now creating the new table is completed, with the data in the format required. The sheet should be finalized by removing formulas, keeping only the values, and then deleting the original data.
Note: It might be useful to keep the formulas around on a template saved separately, so they don’t have to be built up over and over as different sets of data are processed.
Now the newly cleaned data is the only data set in the sheet. Nice work!
Bonus Discussion: Sorting Text using more than 3 keys
It may be that someone needs to sort data sets using more than 3 keys for the sort. As you might now, Excel 2003 only supports 3 levels of sort keys. Excel 2007 supports a whole lot more, so this exercise is trivial using Excel 2007. Here’s how it could be done using Excel 2003. Assume the original data set needs to be sorted by State / City / Last Name / First Name / Middle Name, consider this similar data set:
A new column can be created (say, in column F), using text concatenation to string together the sort keys. In F2, enter this formula: =E2&D2&A2&C2&B2, and give it a header, like “SortKey”.
Now, sorting on that single column is the same as sorting on those keys individually. Of course, ascending/descending order can’t be specified on each key using this method, so Excel 2007 is still better!
Appendix 1: Cleaning Text
Trim means
Removes all spaces from text except for single spaces between words. Use TRIM on text that you have received from another application that may have irregular spacing. The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity, . By itself, the TRIM function does not remove this non-breaking space character.
Removes all spaces from text except for single spaces between words. Use TRIM on text that you have received from another application that may have irregular spacing.
The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity, . By itself, the TRIM function does not remove this non-breaking space character.
Clean means
Removes all non-printable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed. The CLEAN function was designed to remove the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
Removes all non-printable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed.
The CLEAN function was designed to remove the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
Advanced Clean for Unicode Characters using SUBSTITUTE & CHAR
Sometimes text values contain leading, trailing, or multiple embedded space characters (Unicode character set values 32 and 160), or non-printing characters (Unicode character set values 0 to 31, 127, 129, 141, 143, 144, and 157). These characters can sometimes cause unexpected results when you sort, filter, or search. For example, users may make typographical errors by inadvertently adding extra space characters, or imported text data from external sources may contain nonprinting characters embedded in the text. Because these characters are not easily noticed, the unexpected results may be difficult to understand. To remove these unwanted characters, you can use a combination of the TRIM, CLEAN, and SUBSTITUTE functions. The TRIM function removes spaces from text except for single spaces between words. The CLEAN function removes all nonprintable characters from text. Both functions were designed to work with 7-bit ASCII, which is a subset of the ANSI character set (ANSI character set: An 8-bit character set used by Microsoft Windows that allows you to represent up to 256 characters (0 through 255) by using your keyboard. The ASCII character set is a subset of the ANSI set.). It's important to understand that the first 128 values (0 to 127) in 7-bit ASCII represent the same characters as the first 128 values in the Unicode character set. The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity, . By itself, the TRIM function does not remove this non-breaking space character. The CLEAN function was designed to remove the first 32 non-printing characters in the 7 bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters. To do this task, use the SUBSTITUTE function to replace the higher value Unicode characters with the 7-bit ASCII characters for which the TRIM and CLEAN functions were designed.
Sometimes text values contain leading, trailing, or multiple embedded space characters (Unicode character set values 32 and 160), or non-printing characters (Unicode character set values 0 to 31, 127, 129, 141, 143, 144, and 157). These characters can sometimes cause unexpected results when you sort, filter, or search. For example, users may make typographical errors by inadvertently adding extra space characters, or imported text data from external sources may contain nonprinting characters embedded in the text. Because these characters are not easily noticed, the unexpected results may be difficult to understand. To remove these unwanted characters, you can use a combination of the TRIM, CLEAN, and SUBSTITUTE functions.
The TRIM function removes spaces from text except for single spaces between words. The CLEAN function removes all nonprintable characters from text. Both functions were designed to work with 7-bit ASCII, which is a subset of the ANSI character set (ANSI character set: An 8-bit character set used by Microsoft Windows that allows you to represent up to 256 characters (0 through 255) by using your keyboard. The ASCII character set is a subset of the ANSI set.). It's important to understand that the first 128 values (0 to 127) in 7-bit ASCII represent the same characters as the first 128 values in the Unicode character set.
The CLEAN function was designed to remove the first 32 non-printing characters in the 7 bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
To do this task, use the SUBSTITUTE function to replace the higher value Unicode characters with the 7-bit ASCII characters for which the TRIM and CLEAN functions were designed.
Comments: (25) Collapse
ok, can you answer how to extract numbers from column like that:
545.52 kB
78.4 MB
852.27 B
.......
45.5 MB
970 kB
thank you, best wishes!
Nice article. I think, however, that those MID() functions can be easier to understand by replacing them with a TEXT() formula, for example
=TEXT(A2, "000-00-0000")
to format the SSN fields.
Still, good job, this articles are definitely needed.
I love this piece!
I often start with a column of names as you have in your final "Name" column and need to separate them into "Last Name" "Initial" and "First Name" - basically the reverse of what you have done. Could you please please tell me how to do that?
Alternately, is there a way to alphabetize using the last word of the Name field?
Thanks!!
d.volans - - you could do this by either using the Text to Columns feature (picking "space" as the delimiter), or assuming your data is in A1, use this formula: =LEFT(A1,FIND(" ",A1,1)-1)
Juan Pablo González - - beautiful!!! Thanks for the elegant solution.
donnap99 - - I KNEW someone was going to ask this question, as I've encountered names in this format myself. It seems like databases like names split out into first, middle, and last, and they always come strung together like this. While the solution I came up with works, it's a bit complicated. Perhaps Mr. González has a more elegant solution :). First, the solution assumes that the name is in the format "LName, FName, MName", where LName and FName (at least first initial) are required, and MName is optional, and can be either full middle name or just initial. I know these assumptions don't work for people from Mexico, whom I have seen have 2-4 'words' in one or all of FName / MName / LName.
Here goes:
Assume we have the name in the format "LName, FName MName" in A1.
To get the FName, use this formula: =MID(A1,FIND(" ",A1,1)+1,IF(NOT(ISERROR(FIND(" ",A1,FIND(" ",A1,1)+1))), FIND(" ",A1,FIND(" ",A1,1)+1)-1-FIND(" ",A1,1), 100))
To get the MName, use this formula: =MID(A1,IF(NOT(ISERROR(FIND(" ",A1,FIND(" ",A1,1)+1)+1)),FIND(" ",A1,FIND(" ",A1,1)+1)+1,LEN(A1)+1),100)
To get the LName, use this formula: =LEFT(A1,FIND(",",A1,1)-1)
Just one note: The "100" at the end of the first two formulas means that if we can't find the 2nd space in the string (no MName value), then the FName starts after the first space and goes for 100 characters. Set that to 1000 or whatever if you don't think 100 is big enough.
Hi Chad,
Excellent Article. Do tell what number format to use if we need to change the number format to Indian Style. Like
1000.00 to 1,000.00
10000.00 to 10,000.00
100000.00 to 1,00,000.00
1000000.00 to 10,00,000.00
10000000.00 to 1,00,00,000.00
100000000.00 to 10,00,00,000.00
1000000000.00 to 1,00,00,00,000.00
Thanks and Regards
CA kanwaljit Singh Dhunna
I don't think your Bonus Sort Method always works (but there is a method that does work!). Your proposed method does not work if, for example, you had a city called Mill and a city called Millworks. The people in Mill should always be listed before the people in Millworks. But if John Works lived in Mill (concatenated as MillWorksJohn), he would be listed after Abby Bright from Millworks (concatenated as MillworksBrightAbby), wouldn't he? The solution that does work (I think) is to sort on the "lowest" three fields first (columns B, C, and A) and the re-sort on the higher two (in this case) fields (Columns E and D). I'm too lazy to test it out right now, but that's the way I've always done it and I seem to recall that it always works.
Great Post. I'm curious to know if anyone else has had problems running macros with an if function. I've been trying to create macros like I have in excel 03 that run if functions and they won't run in excel 07. Is this a common problem or user error?
Kanwaljit - -
You can format non-negative numbers this way using a custom number format: [>=10000000]##\,##\,##\,##0.00;[>=100000]##\,##\,##0.00;##,##0.00
The above format uses conditions to specify formatting, and the custom number format codes don't support enough conditions to cover negative numbers as well.
To get Excel to display both negative and positive numbers, you would have to change your operating system's Regional and Language Options / Format to a format that supports that number format. However, this is a system setting, so the display of the number will vary depending on the system settings for each user.
Excel does not support a way of specifying that this type of number format should be used, regardless of system settings.
Thanks for these kinds of questions, it helps us understand what challenges people face when using Excel.
Gary - -
Thanks for pointing out the limitations of using this method to sort. You're correct, but what a pain to have to specify levels of sort in so many steps!
Drew - -
I am not sure what you mean. Are you talking about XLM macros? If you like, please use the contact page (blogs.msdn.com/.../contact.aspx) to send us an example of specifically what you are trying to do, and we can take a look.
...forgot to mention that my custom number format won't work for very large numbers either, since the formatting is specified for a specific range of numbers.
Gee, does Excel 2007 launch the text import wizard when opening CSV files? Or does it behave the same as previous versions and automatically MISparse CSV files without launching the text import wizard?
If the latter, you missed step 0: rename the @#$% file, changing the extension from CSV to TXT.
Useful article. Synopsizes thousands of newsgroup articles on this topic (parsing and text transformation) that have been posted over the past two decades.
Chad, check out Morefunc Excel add-in: xcell05.free.fr/.../index.htm
You'll never do "massaging" of the textual data the same (cumbersome) way again :)
A collection of useful liitle tips and tricks for an exercise that is frequently needed - though I think that using functions to add hyphens is horrible especially as the custom format works fine
One thing I've noticed with Excel 2000 is that when you have data to import that is not comma separated but uses another character (such as the *); import it once, then use Text to Columns and change the character from comma to *; in future, when you import text with copy and Paste Special it will work automatically since it now recognises the * as the separator
Off-topic, but there seems to be another bug in Excel 2007's calculation functionality. At least it's a difference between Excel 2007 and all previous versions.
groups.google.com/.../af39bb94f6ca26e7
Nice article. Nothing new in the techniques here, but it is a good tutorial for those not already in the know.
Not mentioned here is the matter of saving the results in an appropriate format. While CSV is convenient enough for some applications, it does not round-trip reliably. If you save as CSV from Excel and then open in Excel you do not always get what you saved. This is primarily due to text fields that Excel tries to convert into numbers, as with the dropped leading zeros mentioned, and also long strings of digits that are truncated, and certain strings resembling dates.
Something new that I would like to see is an old fashioned output format that quotes text fields if they resemble numbers, or even all text fields.
Comments: (loading) Collapse