Organizing data for Future Analysis

2.2 Organizing data for Future Analysis

We are now at the place where we have learned something about extracting data from a problem situation and recording it on data collection forms. Recording ”live” data that we have extracted from a problems situation, however, may not be the only way to gather the data we need to solve problems. Some or all of the data could have been collected by someone else and stored in computer data banks or archived in some other medium. By whatever means we have gathered our data, we will eventually need to input that data into a computer program so that we can use that program to analyze the data. The most common kind of program that is used in business to analyze data is the spreadsheet, and the most commonly used spreadsheet is Microsoft Excel. This section will teach you how to code and organize your data so you can process it with whatever data analysis tool you are most familiar.

Data should be organized in rows and columns. The intersection of a row and column is called a cell. Each column contains the data associated with a variable, e.g. salary, or age or gender or opinion. An observation is a complete row of data and contains all the information about a particular individual or a particular case of what we are studying. You may also see observations referred to as records.


EmpID	AnnualSalary	Gender	Height	Dept	YrsExp
	(thousands of dollars)		(inches)		(years)

90020	31.5	Male	68	Sales	5.4

90034	40.3	Female	64	Research	0.5

92300	65.1	Male	72	Admin	15.1

92305	40.1	Male	69	Sales	6.1

92307	32.6	Female	68	Admin	7.8

92455	51.9	Male	70	Sales	3.1

94500	28.9	Male	65	Research	3.2

94700	44	Female	62	Sales	9.1

94545	49.9	Male	71	Admin	8.3

There are a few rules that must be followed when entering data in a spreadsheet. Following these rules will help make the data useable, which is the primary requirement. The organization of the spreadsheet should also be done for readability, but not at the expense of the useability. Once the analysis is complete, one can worry about making the data or the output of the analysis look nice for presentation, but that should be the last concern. The main considerations about spreadsheet organization are these:

Every column of data must have a variable name at the top of its column. This is the purpose of the column headings ”EmpID”, ”AnnualSalary”, ”Gender”, ”Height”, ”Dept”, and ”YrsExp” in the table above. Note that when entering variable names into spreadsheets, you may need to be careful to avoid blanks spaces or non-alphanumeric characters. If so, you can use an underscore character (e.g. Annual_Salary) or run the words together as we have above.
Every observation should have a unique identifier, usually at the beginning of its row. The column ”EmpID” serves this purpose in the data above, clarifying to which employee a particular row of data refers. Such identifiers could be as simple as a sequential number that is different for each record in the data, or the name of the person represented in that record, or an address if the records are data on buildings.
A data cell can contain only one kind of information; that is, two variables cannot share the same cell. We will see examples of this later.
If the data is numerical, the units should appear in the column heading or a comment, not as part of the data entered into the cell. The information in parentheses for each variable name defines the units in the table above, e.g., years, inches, thousands of dollars, etc. It is vital that you use the same units throughout a variable. In other words, if you are recording wait times for orders, you cannot record the time in minutes for some records and seconds for others; if you do, then your analysis will be critically flawed.

  2.2.1 Definitions and Formulas
  2.2.2 Worked Examples
  2.2.3 Exploration 2B: Entering Beef n’ Buns Data into a Spreadsheet

[next] [prev] [prev-tail] [front] [up]