Flash Fill

This post is brought to you by Chad Rothschiller a Program Manager on the Excel team. To try out this new feature download the Office Preview here.

Before coming to Microsoft I spent a good deal of time messing around with data. I would extract first names, last names, and email addresses from chunks of text. I would join first names and last names together into a single column of text. Sometimes I only needed first name initials, put together with the last name. I had to change the format & display of dates, social security numbers, and dollar amounts. Because the data came from a database, I often had to make sure the names had Proper Casing (e.g. not ALL CAPS, not all lower).

Today I’m proud to be writing about a new feature in Excel 2013 called Flash Fill that makes all of those tasks very simple for even the most novice user. Several years ago, I wrote a blog post demonstrating how to do string manipulations like the above examples. The problem is that the solutions relied heavily on using formulas, which can feel quite complex and intimidating for people not as familiar with Excel. Now, with Flash Fill, you can do it all without formulas. In fact, it’s easier than that. Now you can do it all by providing a few examples.

Here’s what Flash Fill looks like in action:

clip_image001

Let’s work through the examples in my original post, but using Flash Fill as the solution. I updated a few of the values from the original data set to make things more interesting, you can get it here. After pasting the data into Excel, you’ll need to split it out into separate columns. I used Text to Columns feature, using “comma” as the separator.

clip_image002
Note: I used bold formatting on the top row so they stand out. But it also helps Excel to understand that the top row is a header row so that Flash Fill won’t use it as part of your examples when generating pattern rules.

Step 1: Format social security numbers like this: 123-45-6789

Insert a new column between SSN and Last Name and title it “SSN”. Notice below that as you start typing the formatted SSN, Excel will help you complete the column based on the value you entered in the prior row.

  1. Make sure the inserted column is wide enough to display the formatted value (80 pixels should be enough; this isn’t required to use Flash Fill, but it looks better that way)
  2. Type the first social security number in B2, with hyphens: 413-66-4341
  3. Start typing the second SSN value: 2

At this point Flash Fill recognizes what you’re doing and makes additional suggestions:

clip_image003

If you like the suggestions (in this case, I do!), just hit ENTER to accept them.

Now, if you’re paranoid like me, then at step 2. You won’t trust yourself to re-type the SSN correctly. Instead you can double click in A2 (goes into “cell edit mode”), copy the value, then double click cell B2 (back into “edit mode”), and paste the value. Then, before hitting ENTER to commit the value, put the cursor between the 3rd and 4th digit, and type the hyphen. Ditto for between the 5th and 6th character. Now hit enter, and continue to step 3.

Step 2: Format names like this: Rothschiller, Chad B. (in other words: <Last Name>, <First Name> <Middle Initial>.)

Insert another column between First Name and DOB, call it Name.

  1. Make sure the new Name column is wide enough for everyone’s names (162 pixels should be enough; again, not required, just looks nice)
  2. Enter into cell F2: Freehafer, Nancy D.
  3. Type the “C” for Cencini, and see Excel Flash Fill make suggestions for the rest of the dataclip_image004 
  4. If you like the results (not perfect, but I’ll take them), hit ENTER to accept. clip_image005
     
    In this case, Flash Fill wasn’t sure what to do for rows without a middle name or initial, because the only example I provided used a middle initial.
    By providing additional examples, called revisions, I can teach Flash Fill what I want for those cases that it got wrong.
  5. Go down to cell F5 (first blank cell)
  6. Type: Sergienko, Mariya, then hit ENTER
    Notice that while editing that cell, the Flash Fill visuals come back for the Flash Filled range: clip_image006

   

After hitting enter, Flash Fill adds my new example, creates a new set of rules, and re-applies the new set of rules to the entire range (excluding the values you typed), which includes the other row without a middle name / initial (Guissani, Laura, 2nd from the bottom):clip_image007 

Notice that all the extra spaces in the original name values were removed. Flash Fill ignored the extra whitespace and cleaned up my results.

Step 3: Convert DOB from 19610222 to 2/22/1961 (more generally, from YYYYMMDD to M/D/YYYY)

Let’s convert the DOB column to date format so it’s more readable and so Excel can understand the values as dates. Insert a new column called DOB.

  1. Make sure the column is wide enough so you can see what you’re doing (75 pixels would work).
  2. Select the entire range where the new dates will go, and format them as mm/dd/yyyy (Format Cells / Number / Custom, then enter “mm/dd/yyyy” into the Type: text box).clip_image008
  3. Type the first 3 dates.clip_image009
  4. Notice that Flash Fill hasn’t triggered. We decided to disable automatic Flash Fill for numeric data. After all, there are only 10 unique characters in our numbering system, and it’s just too easy to come up with random patterns that don’t make good sense. For this case, providing additional examples and using the ribbon button is the way to go.
  5. On the Data tab, click the Flash Fill button:clip_image010

…and the results:clip_image011

Now that we have converted to dates, we can change the date format to whatever we like, say m/d/yyyy, by using the Number Format dialog box again.clip_image012

You can use the same steps to convert Date of Letter, so I’ll skip that step.

Notice that there were a few extra steps when dealing with dates (and numbers in general). The algorithm engine we’re using looks at everything as text. So it doesn’t know any special rules that go along with numbers (for example, it’s OK to drop leading zeros when dealing with numbers, but not so much when you’re dealing with Postal Codes or other IDs). If I hadn’t initially formatted the new DOB column with two digits for month and two digits for day, then the algorithm would see me converting “02” to “2” (say, “February”), and figure that I only want the 2nd digit of the month number. But then it would also convert “12” to “2”, because it doesn’t know that there’s a special number rule that says it’s OK to drop leading zeros, but not other leading digits. Looking to the future, this is definitely an area where we can push the feature forward and get better results faster.

Step 4: Format phone numbers like (206) 555-1212

Insert a new column after Home Phone, and call it Phone (I made it 96 pixels so I could see the suggestions).

  1. Start typing! (206) 695-9457
  2. When you start typing the “(“ for the second row, the suggestions kick in (Note: while we’re working with numbers here, since it’s mixed with text, Excel treats them as text, not numbers, so you will automatically get suggestions). clip_image013
  3. Hit enter to accept, and you’re done!

Step 5: Put the leading zero back on Postal Codes (like 01001)

This is going to be very similar to the date step. Once you’ve got your target column set up (I called mine Zip), select the whole range, and format it as TEXT. We do this so that Excel doesn’t treat the number as a number and re-drop the leading zero.

1. Format the Zip range as TEXT.clip_image014

2. Start typing Postal Codes, and see the suggestion UI.clip_image015

I do see some of the suggestions are a bit off, but I’m going to take it anyway, and then go revise the ones it didn’t get right. Flash Fill is probably thinking that instead of a “hard coded” leading zero, I want to extract that digit from somewhere else in my data.

3. After accepting, I go to the one that needs to be revised (in my case, it’s in cell S5), and type the desired value. clip_image016

4. …and hit enter. clip_image017

Notice that with that single revision, the other values were updated as well.

What about the other columns in your source data?

I didn’t give detailed steps for these because getting the desired results is very straight forward, once you know how to trigger Flash Fill.

  • E-Mail Address values can be converted to lower case. Just insert a new column and start typing lower-case versions of the values and Flash Fill will suggest to do the rest of them for you.
  • Extra spaces can be removed from Address. Just start typing the values without the spaces in a new column and let Flash Fill do the rest.
  • State values can be made UPPER CASE by typing a couple UPPER CASE values in a new column.

Some final tips on using Flash Fill

You might have noticed some patterns to how I was working with my data. Here are a couple things you should know to get the most out of Flash Fill.

  • The Flash Fill algorithm for recognizing patterns doesn’t treat numbers or digits any differently than text.
    • It doesn’t know that zeros to the left of the left-most non-zero digit are insignificant and can be dropped (or shouldn’t be dropped, if it’s a Postal Code).
    • It doesn’t know that 1 might correspond to “January”, 2 to “February”, and so on.
    • Having said the above, Flash Fill does a good job with numbers and dates, though in some cases you might have to think more explicitly about how to get the desired results.
  • Flash Fill does better when the relationship between the source data and the example output you provide is clear, not ambiguous. For example, if your source data looks like this:

clip_image018

Then it would be better to provide an example suffix value for the 2nd or 3rd row of data (“EF” or “YZ”, respectively), rather than the first (“AB”). The reason is because it’s unclear whether the “AB” value would come from before or after the hyphen in the source data.

  • To get automatic suggestions, you have to be editing right next to related data.
    • You can’t have one or more blank columns between the cell you’re editing and the data its related to.
    • If you’re in doubt, just select the cell, and hit CTRL + A (select the contiguous range). If the resulting selection is the whole sheet, then Excel doesn’t think the cell is next to any data. If the select is some set of data that’s related, then your cell is in the right spot.
  • To get automatic suggestions, you have to make two sequential edits, one right on top of the other. “Two sequential edits” means that you don’t do something else between edit #1 and edit #2 (e.g. switch to another sheet) .
  • If your table of data has headers at the top, make sure Excel recognizes those as headers, so that it excludes them when generating Flash Fill rules.
    • The way I like to see whether Excel thinks I have headers is to put the active cell in the data range and hit CTRL + T (create Table). The dialog that comes up has a checkbox saying “My table has headers”. If that box is checked, then Excel’s automatic header detection logic thinks there are headers. Otherwise, Excel doesn’t think you have headers.
    • The easiest way to “tell” Excel you have headers is to change your range into a Table (just hit OK on the dialog after hitting CTRL + T).

Thanks everyone for reading and providing comments. If you come across a pattern that doesn’t work well with Flash Fill, please let us know. We’re always looking for ways to improve, and your feedback is valued and appreciated!