Sunday, June 26, 2011

Microsoft Excel: Advanced Techniques for Data analysis

Many of the techniques to be discussed in the following lesson need to be practiced in order to be fully understood. Still, realizing that these tools exist, and learning the basics of their use is an important first step.

LOGICAL FUNCTIONS:

=IF( Logical test, Value if TRUE, value if False)
Checks whether a condition is met, and returns one value if the condition is true and other value if the condition is false.

=AND (Logical 1, Logical 2,…Logical n)
Checks whether all of the input arguments are TRUE, and returns “TRUE” if so, or “FALSE” if not.

= NOT()
Changes FALSE to TRUE and TRUE to FALSE (inverts the logic).

=OR(Logical 1, Logical 2,…Logical n)
Returns true if ANY of the input arguments are TRUE.

Understanding the use of logical functions takes practice. The best method for learning these logical functions is to use simple math to examine how the function is working. Still, you should remember that you can also use strings, or combinations of numbers and strings, (See NOT example below). Here are a few examples of the main logical functions:


NOTE: Most logical functions automatically produce a “TRUE” or “FALSE” output. However, with the “=IF( )” function, the values that are returned are custom. Here I have chosen to use true and false, but I could have just as easily used any other phrases I desired. Fortunately, if you wish to return Excel's true or false logic, you can simply nest the functions =true and =false inside the if function. This would simply be accomplished by removing the "'s.

CONDITIONAL FORMATTING:

One fun feature of Excel that can be used for a wide variety of tasks is conditional formatting. Conditional formatting can be accessed under the “Home” tab of MSExcel 2007-2010 under the “Styles” section.



Essentially, conditional formatting uses logic to highlight or color code certain bits of data. This can be useful for visually separating certain cells from others. A few ways one might use this is:

1) To highlight errors or odd values in the data (i.e. checking data)
2) To separate categories of a certain variable for easy viewing.


The menu for conditional formatting gives you a number of methods for deciding which data to color code or highlight. For instance, you could color code all numbers greater than 50, or all cells where the strings match your input string. In the example below, the conditional formatting has been used to color code the different experimental subjects by using the “Conditional Formatting”/”Highlight Cell Rules”/”Text that contains” option. In some cases, you will need to create multiple “rules”.



SORT OPTIONS:
The next function that is useful for data analysis is the “sort” function. In Excel 2007-2010 this can be found under the “Data” tab (1) under the “Sort & Filter section” (2). Under this section, there are two quick sort buttons (A->Z; Z->A) as well as a Sort button/icon that will display a pop-up menu to create more complex sorting (3). Importantly, the quick sort buttons will either sort a single column of data or sort a larger data set by the first column of your data (if the entire data set is selected). Still, for files with multiple columns, the pop-up menu is a far more versatile and prudent option.


Once the pop-up menu has been opened (3), you can “Add Level(s)” to the filtering. This effectively allows you to sort the data by as many variables as you choose. In the above example, the data is first sorted by the column named “Type Code” and then by the column named “Call Type”.

FILTERING:

Another option for organizing data and extracting meaningful information is Excel’s filtering tool. This tool can be found under the “Data” tab in the “Sort and Filter” section.


After selecting the filter option, a pull-down menu will appear at the top of each column. When the pull-down menu for a given column is selected, the menu will display a set of Sort and Filtering options.


While it is worth exploring the sub-menus, most filtering can be done by using the check boxes that appear under the pull-down menu. These checkboxes generally display all of the “unique” values that occur in the column and allow you to select one or more of these values to display. For example, when the check box for “TRILL” is selected above, the filter hides the rows containing the other two values in the column(FF and FM) and displays only the data that are categorized as a TRILL(see below).

Note: The filtering is NOT case sensitive.

Importantly, the other data are not removed from the worksheet, but instead removed from the display. Thus, the row numbers for the filtered cells (and un-filtered cells) are maintained. Unfortunately, this also means that when using functions and formulas, one must either select the individual cells or copy the post-filter results to another spreadsheet in order to safely work with the data. While Excel is designed to allow you to work with the data in the filtered worksheet, small errors and glitches can prove costly if you neglect to copy the filtered data into its own worksheet.



CALCULATING STANDARD ERROR OF THE MEAN:

When examining group data in our laboratory, we often need to calculate two parameters of the data: 1) a measure of central tendency (generally the mean) and 2) a measure of the variance (commonly the standard error).

Measures of central tendency can be easily derived in excel using functions for the mean, median, mode, etc. Furthermore, one measure of variance—the standard deviation—can also be calculated using the MSExcel function “=Stdev()”.

Still, given that the standard error of the mean is normalized by the number of subjects/observations, many consider it to be a more appropriate measure of variance. Fortunately, the standard error (which cannot be calculated using an Excel function) and the standard deviation (which can be calculated using the Stdev function) are intimately related. The formula for the standard error is as follows:

SEM= Standard deviation/sqrt(n)

Where:
SEM- is the standard error of the mean
Sqrt- is the square root of the value in the parentheses
n- is the number of subjects or observations

Knowing this, we can easily calculate the SEM in excel with a nested function:

=stdev(A1:A10)/sqrt(n)

OR

=stdev(A1:A10)/sqrt(count(A1:A10))


The first example involves manually inputting the number of subjects or observations, while the second example counts the number of cells for which you are taking the standard deviation (which is generally the number of observations you will use for n). Thus, by nesting the =stdev, =sqrt, and =count functions in excel, you can easily calculate the standard error of the mean.


NOTE: There are many instances where Excel is able to “calculate” the standard error automatically. The most common instance of this is when adding error bars to plots. While each instance is different, Microsoft Excel is often using different numbers to calculate the standard error than those that should be used. Therefore, the prudent researcher will calculate these values by hand to ensure their accuracy.

PLOTTING:



Plotting in Excel is useful for a quick look at data. We will spend some time in the future discussing plotting in general and plotting in other programs (E.g. Matlab and Sigma Plot). However, here are a few key points:

1) Graphs are not “one size fits all”. Certain plots are more appropriate for certain types of data. For instance, the categorical counts on the top left are best fit by a bar graph, but would be poorly represented by the scatter plot on the right, whereas the pairs of data on the bottom left would not fit well into a bar plot.

2) If data is set up correctly, Microsoft Excel is able to accurately label the graph automatically (see the upper left data and the bar plot).

3) Nearly every component of a plot can be selected and modified. In the Excel 2007-2010, most of the options for charts are contained under a series of “Chart Tools” tabs that appear whenever a chart is selected.


THE CANONICAL FORM: PREPARING DATA FOR MATLAB

Canonical- “Conforming to orthodox or well-established rules or patterns, as of procedure.”


For our purpose, this means that we will prepare data to be easily transferred to Matlab, SPSS, SAS, or any other program. Most programs can handle numerical data, but not all programs can adequately handle text/strings. Moreover, while some programs can handle strings, the use of strings often slows computations. Thus, the canonical form for data analysis is defined as an array of data wherein all of the data is numerical with the possible exception of one row of headers. Most often, this is accomplished by incorporating one observation per row.

This is best accomplished by creating your own “Rosetta Stone” for the data. Specifically, you can create one Excel spreadsheet containing the data with both the strings and their respective numerical coding and a second sheet that is free of all strings. For some programs (e.g. Matlab and SPSS) it is appropriate to leave the column headers in the second sheet.

The creation of the first sheet is most easily accomplished by inserting rows into Excel and subsequently using Excel functions to convert strings to data. The second sheet can be easily created by using “Paste Special/Values” in a new sheet and then deleting columns that contain strings.

Some commands that might prove helpful for converting your data to the canonical form are:
=IF,=Left,=Right,=Middle, or =Replace.

Note: Some text commands need to be nested within the =Value command in order to transform numbers in text form to numerical form.


CHALLENGE:

Find a set of data that has at least 3 subjects and at least one other variable with 2 or more levels. Then, complete the following steps, making sure to take a screen shot of your success at each stage:

1) use the Sort function to sort the data by subject, 2) use filtering to filter out one level of the independent variable of interest, 3) plot the data you have filtered using a graph of your choice and finally 4) repeat this process for the second level of the independent variable.
2) Once your plots have been created convert your dataset to the canonical form using one or more new excel sheets.

DON'T FORGET TO SUBMIT YOUR COMPLETED CHALLENGES SO THAT THEY CAN BE SHARED ON THE BLOG!

No comments:

Post a Comment

Thoughts? Leave a comment!