Tips for Excel, Word, PowerPoint and Other Applications

Cleaning Data With Notepad

Why It Matters To You

We all spend a significant amount of time trying to get data from non-Excel sources (e.g., PDF, Word, websites, etc) into Excel. One of the problems we encounter when importing data is the formatting of the original source, which can be quite a pain to deal with. The faster and more efficiently we can get data into Excel in a useable format, the faster we can move to more interesting things ... like lunch.

A lot of times, a simple Paste Special: Text is all you need, but in the example case, as in many others, there is more formatting than Paste Special can sort out...

How To ...

So today, we're going to talk about something very mundane ... Notepad, a humble text editor, which when used creatively, can be used to strip all of the formatting out of a block of text.

A slightly more advanced application is the cleansing of data through Notepad, Wordpad, and/or Word. The search and replace features of the latter two and the ability to replace special characters, such as tab marks, can be pretty powerful.

Included are two sample files. One, a furniture spec called, data_cleaning.doc is extract of a furniture spec document that lays out the parts and pieces of a office configuration component (e.g., cubicle). The second file is an Excel file named, data_cleaning.xls, which are the results of the 3 attempts to import the data described below. Our goal was to get several dozen configurations into an Excel workbook for evaluation as quickly as possible.

Attempt #1 - Copy from Word and paste into Excel - FAILURE

Here's my first attempt at getting the data into Excel. I highlighted the text, CTRL-C to copy, the CTRL-V to paste into Excel. It looks nearly the same as in the Word document, but now there are lots of spaces, bullets, and fonts to clean up.

Attempt #2 - Copy from Word and paste into Excel using Paste Special - EPIC FAILURE

In an attempt to remove the formatting and extra spaces, I tried copying from Word, then selecting Edit:Paste Special:Text in Excel.

This is supposed to strip out most of the formatting and just leave me with text. This is almost as bad. Scroll through the lines and you'll see what I mean, some lines have additional rows added in, in some cases, it's hard to tell that data exists in adjacent cells, the Example line has 7 lines embedded in it. This isn't any good to me either.

Attempt #3 - Copy from Word and paste into Notepad. Copy from Notepad and paste into Excel - SUCCESS

Enter Notepad. Notepad is a simple text editor that lets you format font and font size -- that's about it. This also means that if you paste something into Notepad, it will get stripped of just about all of it's formating except for tabs - which is exactly what we need.

Start by creating a New Text Document. You can do this by going to your desktop, right-clicking, then selecting New: Text Document.

You should get a new file called "New Text Document.txt". Double-click on it to open. NOTE: I've also include my data_cleaning.txt with the pasted text so that you can compare what this looks like vs the original Word document.

Now, Copy your text from Word, paste it into Notepad to clear the formatting, then copy from Notepad and paste into Excel. As you can see, all I have to do is delete the first column to eliminate the bullet symbols, but other than that, this is about as perfect a result as I could have hoped for.

Notes

Last updated7/24/08
Application VersionExcel 2003
AuthorMichael Kan
Pre-requisitesNone
Related Tips None