Back
Excel

Manipulating and Massaging Data in Excel

Today’s author: Chad Rothschiller, a program manager on the Excel team.  Chad is going to discuss using formulas to ‘clean up’ data in Excel. 

Overview

Excel is a great tool to use when you need to take data in one format, manipulate it into another format, and push the results along to another process, e.g. a database. In this context, Excel is a great landing pad or middle man, serving as a data transformation tool to move data from one system to another.

This example considers a sample data set and walks through the steps to clean up the data and perform various transformations on the data set to massage it into a more desirable format.

Example Data

Here’s the data discussed in this example exercise, feel free to copy/paste it into a spreadsheet and work along through the steps:

SSN,Last Name,Middle Name,First Name,DOB,Date of Letter,E-mail Address,Home Phone,Address,City,State,Postal Code
123456789,freehafer    ,drew   ,nancy,19700101,11/1/2007,NancyF@northwindtraders.com,1235550102,123 1st   Avenue,Seattle  ,wa,09999
123456789,cencini        ,a   ,andrew,19700101,11/1/2007,AndrewC@northwindtraders.com,1235550102,123 2nd Avenue,Bellevue,wa,09999
123456789,kotas      ,bollen,jan,19700101,11/1/2007,JanK@northwindtraders.com,1235550102,123 3rd Avenue,Redmond  ,wa,09999
123456789,sergienko      ,,mariya,19700101,11/1/2007,MariyaS@northwindtraders.com,1235550102,123     4th Avenue,Kirkland,wa,09999
123456789,thorpe     ,j ,steven,19700101,11/1/2007,steven@northwindtraders.com,1235550102,123  5th Avenue,Seattle,wa,09999
123456789,neipper    ,john,michael,19700101,11/1/2007,MichaelN@northwindtraders.com,1235550102,123 6th Avenue,Redmond,wa,09999
123456789,zare       ,b ,robert,19700101,11/1/2007,RobertZ@northwindtraders.com,1235550102,123 7th Avenue,Seattle,wa,09999
123456789,giussani       ,,laura,19700101,11/1/2007,LauraG@northwindtraders.com,1235550102,123 8th Avenue,Redmond,wa,09999
123456789,hellung-larsen      ,marie,anne,19700101,11/1/2007,AnneHL@northwindtraders.com,1235550102,123 9th Avenue,Seattle   ,wa,09999

Notice a few things about the data:

  • SSN isn’t formatted, it’s just a raw number
  • Last Name is all lower case and contains extra spaces
  • Some people don’t have middle names at all, some are abbreviated, some are spelled out in full
  • Date of Birth (DOB) is in YYYYMMDD format, while Date of Letter is in MM/D/YYYY format
  • E-Mail Address contains both upper and lower case characters
  • Home Phone isn’t formatted
  • Address has extra spaces
  • State is all lower case
  • Postal Code has a leading zero that shouldn’t be dropped

Here’s what needs to be done to the data to get it ready for processing:

  • Not lose the leading zero when bringing this data into Excel
  • Format the SSN with dashes in the proper place
  • Remove extra spaces from the name fields and Address field
  • Get the names all in one field, in the form of “Last Name, First Name Middle Initial.â€, with proper casing
  • Convert DOB to MM/DD/YYYY format
  • Make E-Mail Address values be all lower case
  • Format the Home Phone field according to the standard US phone number format (NNN) NNN-NNNN
  • Upper case the State values
  • Sort the data according to these sort keys: State / City / Last Name / First Name / Middle Name (not middle initial)

Step 1: Import the data and don’t accidentally drop the leading zeros!

To Excel, the Postal Code values look like numbers, and so that’s the default way it will treat them when loading this file. (Unfortunately Excel can’t read the header “Postal Code†and switch it to a text field based on the semantic meaning of the data.)

Fortunately there is a way to override this default, using the Text Import Wizard:

  1. File | Open the text file
  2. Text Import Wizard – Step 1 of 3: Choose “Delimitedâ€, then click Next
  3. Text Import Wizard – Step 2 of 3: Choose “Comma†as the delimiter, then click Next (note that other delimiters can be selected, and even a custom one can be specified if the data uses a character not listed in the UI)
  4. Text Import Wizard – Step 3 of 3: Scroll over to the Postal Code field, select it, and set “Column data format†to Text, then click Finish

Best fit the columns (select those columns, double click on the line between column headers), and here’s what the result should look like:

Now that the data has landed in Excel without dropping critical information, next up is to work it into the format needed for reporting and uploading.

Step 2: Formatting SSN

The SSN values need to be converted to the form NNN-NN-NNNN. There are two ways to accomplish this.

  1. Change the way the data is displayed, but keep the underlying values as numbers, or
  2. Insert hyphens into the data

Change the display via cell formatting

  1. Select the range A2:A10 where the data is stored
  2. Bring up the Format Cells dialog (CTRL + 1)
  3. On the Number tab, click on the Custom category
  4. In the Type field, enter this string: 000-00-0000
  5. Click OK

Here’s what the end result looks like:

Note that the underlying value is “123456789†(look in the formula bar to verify this), while the cell is formatted to display the hyphens.

Inserting hyphens

A formula is needed to accomplish this. This example will use the convention of creating a new table of data from the original, where the formulas are in cells off to the right of the original data.

The formula for this will use the MID() function to pull out the first 3 numbers of the SSN. Then it will use the “&†to concatenate that with a hyphen, use the MID() function to grab the middle two numbers, use “&†again to insert another hyphen, and finally use MID() again to tack on the last 4 numbers of the SSN.

  1. Copy/Paste the headers from A1:L1 into N1:Y1 (to not get lost)
  2. In N2, enter this formula: =MID(A2,1,3)&”-”&MID(A2,4,2)&”-”&MID(A2,6,4)
  3. Fill that formula down across all the rows of data (N2:N10 for this example)

Here’s what the result looks like:

To prove it, select the column of SSNs, Copy / Paste Special… Values and examine the contents of the cells. The hyphens are in the right places.

Step 3: Fixing up names

There are several problems to be addressed with the names:

  • They’re not in proper casing
  • They contain extra spaces
  • Only the middle initial is desired

The “&†operator will be used to concatenate the value in Last Name with a comma and space; the “&†operator will be used to add the value from First Name and to add a space to that result; the LEFT() function along with the “&†operator will be used to create and add a middle initial from Middle Name. The last two steps will be to wrap those functions with a TRIM() function to remove extra spaces, and wrap the final result with a PROPER() function to set the letter casing correctly.

  1. For this part, the headers “Last Nameâ€, “Middle Nameâ€, and “First Name†that were created in O1:Q1 aren’t needed. Delete columns P & Q, and rename “Last Name†in O1 to “Nameâ€. The formula will go in column O, using the original Name columns as input.
  2. In O2 insert this function: =PROPER(TRIM(TRIM(B2)&”, “&D2&” “&LEFT(C2,1)))

Here’s what the result looks like:

Note: while extra spaces were in the original values, even if they weren’t the TRIM() function would need to be used in this case because for people without Middle Name values listed, the concatenation part of the formula introduces an extra space. That’s also the reason why the TRIM() function should be used on the final result as opposed to deeper in the formula, i.e. on First Name, Last Name, and Middle Name individually.

Note: Extra spaces may not be the only undesirable characters in the data. The CLEAN() function can be used to remove non-printable characters. See Appendix 1 for a more thorough discussion of removing undesirable characters from the data.

Step 4: Convert DOB to an actual date

The DOB values are not recognized by Excel as dates. These need to be converted to MM/DD/YYYY format. Using the DATE() function and MID() function will produce the desired results. The YEAR, MONTH, and DAY portions of the date will be extracted from the DOB field using MID(), and the result is converted to a date using the DATE() function:

  1. In P2 enter this formula: =DATE(MID(E2,1,4),MID(E2,5,2),MID(E2,7,2))
  2. Fill that formula down across the rows of data (P2:P10 in this case)

Here’s what the result looks like:

Step 5: Fill in Date of Letter

No changes are needed to the Date of Letter field, so either copy / paste those values in, or use a formula to refer to them so that the table created is complete.

Step 6: lower case E-Mail Address

For consistency sake, the email addresses should all be lower case, since they aren’t case-sensitive anyhow. This will be accomplished using the LOWER() function.

  1. In cell R2, enter this formula: =LOWER(G2)
  2. Fill that formula down across the rows of data

Here’s what the result looks like:

Step 7: Format Home Phone

The Home Phone values are hard to read, and not in the correct format. They need to be fixed up to match the (NNN) NNN-NNNN format.

Just like SSN, this can be accomplished through cell formatting or inserting the proper characters in the right places.

Change the display via cell formatting

  1. Select the range H2:H10 where the data is stored
  2. Bring up the Format Cells dialog (CTRL + 1)
  3. On the Number tab, click on the Custom category
  4. In the Type field, enter this string: (000) 000-0000
  5. Click OK

Here’s what the end result looks like:

Inserting parenthesis & hyphens

A formula needs to be used to accomplish this. The formula will use MID() and concatenate operations to build up the correct value.

  1. In S2, enter this formula: =”(“&MID(H2,1,3)&”) “&MID(H2,4,3)&”-”&MID(H2,7,4)
  2. Fill that formula down across all the rows of data (S2:S10 for this example)

Here’s what the result looks like:

Step 8: Trim extra spaces from Address

Use the TRIM() function as done on the name fields to remove extra spaces. This one is easy, so go ahead and do it and move to the next step!

Step 9: Add City values to the new table

Since a whole new table is being created to replace the original data, the values from City need to be brought over, even though there’s nothing wrong with them. If you have a huge data set you might not be able to inspect them all by hand, so in that case you’ll probably still want to use TRIM() and CLEAN() on those values, just in case.

Step 10: Make all State value upper case

The State field contains lower case values, and all the characters for this field need to be upper case. Both PROPER() and LOWER() have already been used to clean up the data, so using UPPER() to convert the lower case characters to uppercase ones is easy.

  1. In V2 enter this formula: =UPPER(K2)
  2. Fill down that formula across all the rows of data (V2:V10 for this example)

Here’s what the result looks like:

Step 11: Fill in Postal Code

To complete the table, copy/paste the Postal Code values from L2:L10 to W2:W10.

Step 12: Finalize the values

Now creating the new table is completed, with the data in the format required. The sheet should be finalized by removing formulas, keeping only the values, and then deleting the original data.

Note: It might be useful to keep the formulas around on a template saved separately, so they don’t have to be built up over and over as different sets of data are processed.

  1. Select the entire range of newly massaged data (N1:W10)
  2. Copy / Paste Special… Values / OK (This step is to ensure the cells contain data values and not formulas. Since the formulas are referencing the original data, The original data can’t be deleted until the formulas are converted into values)
  3. Now select the columns A:M, which include the range of original data (A1:L10) plus the extra space column, and delete those entire columns.

Now the newly cleaned data is the only data set in the sheet. Nice work!

Bonus Discussion: Sorting Text using more than 3 keys

It may be that someone needs to sort data sets using more than 3 keys for the sort. As you might now, Excel 2003 only supports 3 levels of sort keys. Excel 2007 supports a whole lot more, so this exercise is trivial using Excel 2007. Here’s how it could be done using Excel 2003. Assume the original data set needs to be sorted by State / City / Last Name / First Name / Middle Name, consider this similar data set:

A new column can be created (say, in column F), using text concatenation to string together the sort keys. In F2, enter this formula: =E2&D2&A2&C2&B2, and give it a header, like “SortKeyâ€.

Now, sorting on that single column is the same as sorting on those keys individually. Of course, ascending/descending order can’t be specified on each key using this method, so Excel 2007 is still better!

Appendix 1: Cleaning Text

Trim means

Removes all spaces from text except for single spaces between words. Use TRIM on text that you have received from another application that may have irregular spacing.

The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity,  . By itself, the TRIM function does not remove this non-breaking space character.

Clean means

Removes all non-printable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed.

The CLEAN function was designed to remove the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.

Advanced Clean for Unicode Characters using SUBSTITUTE & CHAR

Sometimes text values contain leading, trailing, or multiple embedded space characters (Unicode character set values 32 and 160), or non-printing characters (Unicode character set values 0 to 31, 127, 129, 141, 143, 144, and 157). These characters can sometimes cause unexpected results when you sort, filter, or search. For example, users may make typographical errors by inadvertently adding extra space characters, or imported text data from external sources may contain nonprinting characters embedded in the text. Because these characters are not easily noticed, the unexpected results may be difficult to understand. To remove these unwanted characters, you can use a combination of the TRIM, CLEAN, and SUBSTITUTE functions.

The TRIM function removes spaces from text except for single spaces between words. The CLEAN function removes all nonprintable characters from text. Both functions were designed to work with 7-bit ASCII, which is a subset of the ANSI character set (ANSI character set: An 8-bit character set used by Microsoft Windows that allows you to represent up to 256 characters (0 through 255) by using your keyboard. The ASCII character set is a subset of the ANSI set.). It’s important to understand that the first 128 values (0 to 127) in 7-bit ASCII represent the same characters as the first 128 values in the Unicode character set.

The TRIM function was designed to trim the 7-bit ASCII space character (value 32) from text. In the Unicode character set, there is an additional space character called the non-breaking space character that has a decimal value of 160. This character is commonly used in Web pages as the HTML entity,  . By itself, the TRIM function does not remove this non-breaking space character.

The CLEAN function was designed to remove the first 32 non-printing characters in the 7 bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.

To do this task, use the SUBSTITUTE function to replace the higher value Unicode characters with the 7-bit ASCII characters for which the TRIM and CLEAN functions were designed.