Analyzing Survey Data in Microsoft Excel: Coding, Inputting Data, and Creating Frequency Distributions
In the following hub I provide a brief tutorial on how to use Microsoft Excel to analyze survey data. I use two small data sets, the College Rating Survey and the General Social Survey 2008 Extract, to illustrate some basic techniques. The examples below should help you to familiarize yourself with some basic tools of analysis that can be applied when analyzing your own survey results.
Microsoft Excel: The Basics
Microsoft Excel is a spreadsheet program with many basic features that are useful for analyzing data. In the spreadsheet, each column corresponds to a letter and each row corresponds to a number. The intersection of a column and row is referred to as a cell. Above the column letters is the formula bar, which is used to create and execute computations in Excel. Many common computations can be performed using the AutoSum button, located on the right side of the screen.
Before bringing your data into Microsoft Excel, I would strongly suggest assigning codes to all non-numerical (non-interval/ratio level) data. To perform certain functions in Excel, data must be numeric. This is also the case for nearly all more advanced statistical packages—so it’s useful to get into the practice of coding.
Most numerical (interval/ratio-level) data does not require coding. For example, data on age (in years) can be kept as is and need not be recoded. However, if data on age has been collected using categories/ranges (ex. less than 20, 20-29, 30-39), then these categories/ranges must also be given numeric codes.
Step 1: For each survey question (variable), where coding is deemed appropriate, determine the number of unique response options.
- Ex. In the GSS 2008 subsample, respondents are asked to choose their “highest degree completed” from a set of 5 response options (less than high school, high school, junior college, bachelor’s degree, graduate degree). Therefore, we know we will need to create 5 unique codes.
Step 2: Determine whether response options have a logical ordering (this is almost always this case for ordinal-level variables but not necessarily for nominal-level variables). If so, assign values accordingly.
- Ex. The GSS 2008 response options for “highest degree completed” can be logically ranked from lowest to highest level of education. Therefore, we decide to assign the following codes: 1=less than high school, 2=high school, 3=junior college 4=bachelor’s degree, 5=graduate degree.
*Missing Responses: If any of the questions on your survey were left unanswered, it is best to go ahead and code them anyway (although sometimes missing responses are simply left blank—Excel does not deal well with empty cells). Following from the GSS 2008 example above, if some respondents did not provide a selection for “highest degree” completed, we could create an additional code to represent (6=missing).
Once a coding scheme has been created for all response options corresponding to all survey questions (variables) requiring them, it is time to enter the data into Microsoft Excel. Data from the College Rating Survey is used to illustrate how to input data. I have already identified all unique response options and have created a coding scheme for the five items to be analyzed in Excel. The items and coding scheme are presented below:
*Note in the table above that some codes refer to combinations of response options. Unfortunately, not all responses in the survey were mutually exclusive or mutually exhaustive (or were not answered in this way). Although not ideal, it was necessary to create and assign additional codes for all unique combinations.
Step 1: For each survey item, provide a name/description at the top of each column. If you choose to provide column headers that are not intrinsically meaningful (ex. QI, Q2), make sure to include a description of each survey item along with your findings. If respondents have been assigned study ID#s, list them in the first column of the spreadsheet. If respondents have not been assigned study ID#s I would strongly urge you to assign them (as they are needed to use Excel’s built-in contingency table function). For the College Ranking Survey data, I have chosen the following column headings and have listed Study ID#s in the rows:
Step #2: Now that the columns of the spreadsheet have been properly labeled, it is time to input the data. The data for each participant should be entered into the row corresponding to their Study ID#. Remember to implement your coding scheme when inputting data. In the first image below I have provided the survey data collected from one of the survey participants (ID #1). In the second image displaying the entire data set, data for this respondent is located in the row directly beneath the column headings (this row is highlighted in blue). I have also included the coding scheme for each survey item below its corresponding column on the spreadsheet.
Survey Data for Respondent #1
Image 2: Spreadsheet including all Data (Respondent #1’s Coded Responses Highlighted in Blue)
Analyzing the Data
Once all of the survey data has been entered into the spreadsheet we can begin our analysis. Techniques for analyzing and presenting survey data are presented throughout the remainder of this tutorial.
Frequency distributions are a useful way to present and interpret data. A frequency distribution is a numerical display showing the number of cases, and usually the percentage of cases (the relative frequencies), corresponding to each value or group of values of a variable (definition from Schutt p. 454).
To illustrate, I will begin by creating a frequency distribution of the #1 ranked colleges/universities in the College Rating Survey Ranking Task #1. In this item, respondents are asked to rank colleges/universities by their profile characteristics. Symbols were given in lieu of names. I have chosen only to include respondents’ #1 selection.
Data on this variable is located in the Excel spreadsheet in Column D entitled “Highest Rank (Symbols).” We will work through frequency distribution creation step-by-step. Manual instructions (using hand calculations) and Excel-specific instructions (using formulas) are both provided (where applicable). The method you choose to analyze your data will likely be influenced by the size of your data set and your proficiency with Excel.
Step 1: Choose a location in your Microsoft Excel spreadsheet to create your frequency distribution. Give your frequency distribution a title. Include “Value,” “Frequency,” and “Percentage” in your column headings. Beneath “Value,” list all unique response options. Even if values have been re-assigned codes, present them in textual form in your frequency distribution. Include “Total” in the cell beneath your last value.
- Ex. I have chosen to construct the frequency distribution on the same spreadsheet as the survey findings, beneath the input data and coding scheme. It will begin in cell B18. I provide the title “Frequency Distribution of Highest Rank (Symbols).” I include “Value,” “Frequency,” and “Percentage” in the column headings. Beneath “Value,” I list all four unique survey response options (Harvard, Yale, MIT, Brandeis) in textual form. I include a cell for “Total.”
Step 2 (manual calculation): Count the number of times each value occurs in your data spreadsheet and enter it into the “Frequency” column next to its corresponding value. Include the sum of the frequencies next to “Total.”
- Ex. (manual calculation) Returning to Column D of the data spreadsheet (rows 2-9), I manually count that “Harvard” (coded as 1) was chosen as the highest rank 3 times. “Yale” (coded as 2) was chosen 3 times. “MIT” (coded as 3) was chosen once. “Brandeis” was chosen once. I enter these counts into the spreadsheet in the “frequency” column. I report the summed frequencies (they add up to 8) next to “Total.”
Alternative Step #2 (using formulas): For large data sets, manually counting the number of times a value has occurred may not be feasible. Luckily, Microsoft Excel has a function that will do it for you. The formula takes the form =COUNTIF(range, criteria). To use this function, your frequency distribution must be located on the same spreadsheet as your data.
- Ex. (using formulas): Returning to the data spreadsheet, we see that data on highest rank is located in column D, rows 2-9. Our range can be specified as D2:D9 (this literally translates as cells D2 through D9, with the colon meaning “to”). Our criteria is the numeric code corresponding to a particular value. Criteria must always be specified in quotations. For Harvard, our criteria is “1” (its numerical code). Click in the frequency cell corresponding to Harvard (C20), enter the formula =COUNTIF(D2:D9, “1”) and press enter. Our frequency for Harvard (3) is calculated for us. Formulas for each additional value must be constructed and entered in its corresponding cell. Never forget the = sign (or the formula will not execute).
Once all the frequencies have been computed you can have Excel sum the frequences for you using “AutoSum” (∑). Highlight frequencies for all values (in this case there are four of them located in cells C20 through C23). To select all values simply click on cell C20, hold down the mouse, and drag the box that appears down to C23. Click on the drop-down arrow next to the “AutoSum” (∑) button. Once you click “sum,” your calculation will appear in the cell directly beneath your selection by default, in this case, next to “Total.”
Step #3 (manual calculation): Calculate a percentage corresponding to the frequency for each value. To do this, divide the frequency for each value by the frequency total and enter the percentage into the spreadsheet. Also sum the percentages (they should equal 100%).
- Ex. (manual calculation): Harvard has a frequency of 3. If we divide the frequency for Harvard (3) by the frequency total (8), we obtain a percentage of 37.5% for Harvard. Yale: 3/8= 37.5%. MIT: 1/8 = 12.5%. Brandeis: 1/8 = 12.5%. The percentages sum to 100%. Calculations are noted in the table.
Alternative Step #3 (using formulas): Instead of computing percentages using a calculator, we are also able to make basic computations using formulas in Excel (similar to the illustration of the use of formulas in Alternative Step #2).
- Ex. (using formulas): If we were to select cell D6 we could calculate the percentage for Harvard by entering =3/8 (/ is the symbol for division in Excel) in the formula bar and pressing enter. Percentages for Yale, MIT, and Brandeis would also be calculated by using formulas accordingly. Once all four percentages are obtained, they can be highlighted and summed using the AutoSum function, just like in Alternative Step #2.
Step #4 (manual calculation): In many frequency distributions it is using to include a column for “Cumulative percentage.” Cumulative percentage is simply a “running tally” of percentages. Although it often makes more sense for ordinal and interval-ratio level variables, we can calculate it for any frequency distributions if desired.
- Ex. Add a column for “cumulative percentage” next to “percentage.” For the first value (in this case Harvard), simply re-enter it into the corresponding cell in the “cumulative percentage” column. To calculate the second cumulative percentage, we must add the percentage corresponding to the first value (Harvard – 37.5%) to the percentage corresponding to the second value (Yale – 37.5%) for a total of 75%. To calculate the third percentage, add the percentages corresponding to the first two values to the third value. Continue this process. Your final table should look like this:
*How to read cumulative percentage: 37.5% of respondents chose Harvard as their top ranking. 75% of respondents chose either Yale or Harvard as their top ranking. 87% of respondents chose MIT, Yale, or Harvard as their top ranking. 100% of respondents chose Brandeis, MIT, Yale, or Harvard as their top ranking. Although the cumulative percentage may not be particularly meaningful in this case—it is often useful to include for variables that include ranked response options.