2.1 Extracting Data from the Problem Situation

In the previous chapter we learned how to define a problem. We recognized that a real-world problem is often embedded in an interconnected web of events taking place in time and space, usually involving people, objects, or machines. To gather meaningful data about a problem, we must think about how the data is related to its surroundings. For example, to gather data that we can use to identify and then correct excessive wait times at Beef n’ Buns, we need to consider when a “wait time” begins and when it ends. We then need to connect each wait time to the type of order being filled, because not all orders are created equal with regard to wait times.

To gather such data, we need to understand why not all orders are created equal with regard to wait times. One of the first things we recognize as we try to understand this connection is that there seems to be an inherent difference between wait-time data and type-of-order data. In this section we move ahead by learning how to recognize different types of data in a problem situation and how to record them on data collection forms. This is the process of extracting data from the problem situation.

Before we can complete the data extraction process by recording the data on data collection forms, we need to know exactly what type of data we are recording. The type determines whether we mark down “how many of what” (numerical data) or check a category (categorical data).

   Types of Data
   The Units for Recording Numerical Data
   Categories for Recording Non-Numerical Data
   Raw Data, Summary Data, and Computed Fields
  2.1.1 Definitions and Formulas
  2.1.2 Worked Examples
  2.1.3 Exploration 2A: Extracting Data at Beef n’ Buns
Types of Data

As we mentioned above, not all data has to do with numbers. Data that comes from counting or measuring something is called numerical data, and data that comes from classifying or categorizing something is called categorical data. Examples of numerical data are salaries, sales, heights, weights, number of customers, and number of children. Examples of categorical data are gender (male, female), job classification (e.g., office staff, management, vice president), day of the week, and marital status. Sometimes it is obvious what type of data we are dealing with in a particular problem situation; other times we have to make a conscious decision as to whether to record our data numerically or categorically. In the latter case, we have to ask ourselves whether it would be more beneficial for our analysis to retain the numerical differences between the individual things we are observing or to group them into categories. Each has its advantages.

Almost any type of numerical data can be converted into categorical data by some sort of classification scheme. For example, individual numerical heights could be lumped into short, medium, tall, and very tall categories by a scheme such as this: all heights below 60 inches are placed in the “short” category, all heights between 60 inches and 68 inches are placed in the “medium” category, etc. Categorical data, however, cannot be meaningfully converted to numerical data. Take, for example, the gender categorical data. It would not make sense to find the add-up-and-divide average of the categories “female” and “male” even if we decided to think of a female as “0” and a male as “1.” It would make no sense to talk about (0+1)/2, or 0.5, as a gender. In general, we can distinguish numerical and categorical data by this rule of thumb: if you can do meaningful arithmetic with the data, it is numerical; if not, it is categorical.
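A classification scheme of this kind can be sketched in a few lines of Python. The two lower cutoffs come from the text; the text leaves the "tall" and "very tall" boundaries unstated, so the 72-inch cutoff below is only an assumption made for illustration, as is the function name.

```python
def height_category(height_in: float) -> str:
    """Classify a height in inches into a named category."""
    if height_in < 60:
        return "short"      # below 60 inches, as stated in the text
    if height_in <= 68:
        return "medium"     # 60 to 68 inches, as stated in the text
    if height_in < 72:
        return "tall"       # assumed cutoff, not given in the text
    return "very tall"      # assumed cutoff, not given in the text


heights = [58, 64, 68, 70, 72]
categories = [height_category(h) for h in heights]
print(categories)  # ['short', 'medium', 'medium', 'tall', 'very tall']
```

Notice that the conversion only runs one way: once a height has been recorded as "medium", the original number of inches cannot be recovered from the category.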

When coding data, note that numbers can be used as codes for categorical data, e.g., 0 for male and 1 for female, or 1-5 in opinion poll rankings. Without prior knowledge or provided information, it is often difficult to distinguish between numerical and categorical data. Consider, e.g., Age: 59, 52, 58, 12, 43, 23. This data could be either numerical or categorical, depending on the purpose and design of the study. If age is treated as numerical, 59 has a different impact on the sum of all the ages than 52 does. If age is treated as categorical, both 59 and 52 might be lumped into the “middle-aged” category, whereas ages such as 70 and 80 would be counted in the “senior” category.
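To see the two treatments side by side, the short sketch below handles the same six ages first numerically and then categorically. The age-group names other than "middle-aged" and "senior", and all of the cutoffs, are hypothetical choices made for this illustration; a real study would define its own.

```python
from collections import Counter

ages = [59, 52, 58, 12, 43, 23]

# Numerical treatment: every individual value contributes to the arithmetic.
total = sum(ages)
mean_age = total / len(ages)


def age_group(age: int) -> str:
    # Hypothetical cutoffs for illustration; the text does not define them.
    if age < 18:
        return "minor"
    if age < 40:
        return "young adult"
    if age < 65:
        return "middle-aged"
    return "senior"


# Categorical treatment: only the group each age falls into is kept.
group_counts = Counter(age_group(a) for a in ages)

print(total, round(mean_age, 1))  # 247 41.2
print(dict(group_counts))
```

In the categorical version, 59 and 52 are indistinguishable: each simply adds one to the "middle-aged" count.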

Each type of data, numerical and categorical, has two subtypes. Numerical data can be either discrete or continuous, and categorical data can be either ordinal or nominal. In short, continuous numerical data can take on values that fall anywhere within a continuous range of numbers, whereas discrete numerical data can only take on particular number values and nothing in between them (non-continuous). With ordinal categorical data, the categories are related by some sort of “more than” or “later than” or “better than” structure, whereas nominal categorical data (name-only categorical data) does not have any kind of inherent ordering structure (see Definitions and Formulas for examples). There are cases in which some of these distinctions break down. The point of making them in the first place, however, is that they give us more than just a way of focusing on and thinking about data as we attempt to extract it from a problem situation. They also give us the vocabulary to talk about it, especially when we are deciding how to record it.

The Units for Recording Numerical Data

Numerical data is recorded in units. In some cases, there is more than one choice for the units. For example, bottled soft drink could be measured in metric units or conventional English units. A bottle with a volume of 500 mL holds 16.9 fl oz, which could be recorded as 0.5 L or as 0.53 qt. The business manager must be constantly aware of units. For example, if you hurriedly ran your eyes over an invoice and saw an order of 10,000 bottles of soft drink, each recorded on the invoice as having a volume of 0.5, you might assume that the order was for 10,000 half-quart bottles. But if the unit is a liter, the order is actually for about 5,280 quarts rather than 5,000, and you would be making an error of roughly 280 quarts.
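The size of the invoice discrepancy can be checked with a few lines of arithmetic. The sketch below assumes US liquid quarts (one US liquid quart is 0.946352946 liters); the variable names are illustrative only.

```python
LITERS_PER_US_QUART = 0.946352946  # US liquid quart expressed in liters

bottles = 10_000
volume_per_bottle = 0.5  # the bare number printed on the invoice

# Reading the 0.5 as quarts:
assumed_quarts = bottles * volume_per_bottle

# Reading the 0.5 as liters, then converting to quarts:
actual_quarts = bottles * volume_per_bottle / LITERS_PER_US_QUART

error_quarts = actual_quarts - assumed_quarts
print(round(actual_quarts), round(error_quarts))  # 5283 283
```

The same number on the invoice describes two different orders, which is why the unit should always be recorded alongside the value.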

The issue of units, however, is more fundamental than avoiding oversight errors. The choice of units can change the nature of the data we are extracting from a problem context. The different units in the bottled soft drink example all measure the amount of liquid as volume. We could instead have measured the amount of soft drink in units of mass (grams or kilograms) or weight (pounds). Each unit, mL or grams, measures a quantity of soft drink, but whether the data is recorded as volume or as weight determines how easily we can incorporate the data into other problem contexts. For example, if the soft drink is being transported, there may be a weight limit, but our units are in mL (volume). In this particular case, we could, with time and effort, make the necessary conversion from volume to weight to see if our shipment is under the weight limit. The point is that we have to give some thought to how our data might be used in the future when we go about extracting it from its context.
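As a rough sketch of the volume-to-weight conversion mentioned above, the code below converts a shipment recorded in mL into kilograms and pounds. The density and the weight limit are hypothetical values chosen only for illustration (the text gives neither), and the weight of the bottles and packaging is ignored.

```python
ASSUMED_DENSITY_G_PER_ML = 1.04    # hypothetical density of the soft drink
POUNDS_PER_KILOGRAM = 2.20462262   # standard conversion factor
WEIGHT_LIMIT_LB = 12_000           # hypothetical shipping limit

bottles = 10_000
ml_per_bottle = 500

total_ml = bottles * ml_per_bottle
mass_kg = total_ml * ASSUMED_DENSITY_G_PER_ML / 1000  # grams to kilograms
weight_lb = mass_kg * POUNDS_PER_KILOGRAM

under_limit = weight_lb <= WEIGHT_LIMIT_LB
print(round(mass_kg), round(weight_lb), under_limit)
```

The conversion is possible only because an extra piece of information, the density, was brought in from outside the recorded data. Had the shipment been recorded by weight in the first place, no such lookup would be needed.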

Categories for Recording Non-Numerical Data

Units are usually associated only with numerical data. Non-numerical data is recorded in categories that have to be explicitly defined unless they are obvious. Gender is an example of non-numerical data whose categories are obvious when recorded as Male or Female, or even when recorded as M and F. The categories are not obvious, however, when gender is recorded as 0 or 1. In this case, we should make a note (for example, by adding a “comment” to the cell in Excel) that explicitly states that, for example, 0 is being used to represent Male and 1 is being used to represent Female (the numbers could, of course, be reversed).
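Outside of a spreadsheet, the same kind of note can be kept as an explicit code table stored with the data. The sketch below is one possible way to do this in Python; the name GENDER_CODES and the sample values are illustrative assumptions.

```python
from collections import Counter

# Explicit definition of the codes, kept alongside the recorded data.
GENDER_CODES = {0: "Male", 1: "Female"}

recorded = [0, 1, 0]  # coded values as they might appear on a collection form

# Decode before reporting so that readers never have to guess the meaning.
decoded = [GENDER_CODES[code] for code in recorded]
counts = Counter(decoded)

print(decoded)       # ['Male', 'Female', 'Male']
print(dict(counts))  # {'Male': 2, 'Female': 1}
```

Counting the decoded categories is meaningful; averaging the codes 0 and 1 would not be, because the numbers are only labels.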

Raw Data, Summary Data, and Computed Fields

A very important idea in data collection is the difference between raw data, a data summary, and a computed field. Raw data is the data as directly collected: one value for each variable per observation. Newspaper articles and other readings rarely display the raw data, however, as it may contain thousands (or even millions) of observations. Instead, the data is often presented in summary form. The difference between the two is best illustrated with a database of employee information, such as annual salary, gender, and height. The raw data would contain one observation of each of these variables for each employee, so a row of the raw data table would correspond to a single employee in the database. This raw data file would typically be large and have many entries, but you need it in order to do any data analysis. Another clue that you are looking at raw data is that there should be an identifier for each set of observations (in the table below, this is the employee ID).

Employee ID   Annual Salary ($1,000)   Gender   Height (inches)   Gender (0=Male, 1=Female)   Height Range   Monthly Salary ($)
90020         31.5                     Male     68                0                           Medium         2,625
90034         40.3                     Female   64                1                           Medium         3,358
92300         65.1                     Male     72                0                           Very Tall      5,425

On the other hand, data could be represented in summary form by reporting the number of male and female employees, the average salaries of male and female employees, or the number of employees over a certain height. Notice that a summary tells us nothing about individual employees; instead, it gives us information about the employees in aggregate.

Gender   Count   Average Height (inches)
Male     452     69.4
Female   309     65.6
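A summary like this is computed from the raw rows. The sketch below shows one way to do it, using only the three employee rows displayed in the raw data table earlier, so its counts and averages will not match the full-company figures in the summary table. The variable names are illustrative.

```python
from collections import defaultdict

# (employee ID, gender, height in inches) for the three sample rows
raw_rows = [
    (90020, "Male", 68),
    (90034, "Female", 64),
    (92300, "Male", 72),
]

# Collect the heights belonging to each gender.
heights_by_gender = defaultdict(list)
for _employee_id, gender, height in raw_rows:
    heights_by_gender[gender].append(height)

# One summary row per gender: (count, average height).
summary = {
    gender: (len(heights), sum(heights) / len(heights))
    for gender, heights in heights_by_gender.items()
}

for gender, (count, avg_height) in summary.items():
    print(gender, count, avg_height)
```

The employee IDs disappear in the summary, which is exactly why individual employees can no longer be identified from it.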



The examples above also illustrate the idea of a computed field (Gender as a 0 or 1; height range as a descriptor). In these cases, someone probably collected the raw data on the employees in terms of their heights and genders, then added a new variable that compares the raw data (Gender as male or female; actual height in inches) to a set of values and assigns a new number or name based on the employee’s information. Another example is the monthly salary variable above. Once we have the annual salary, we can compute the monthly salary easily: we just divide by 12. While this variable contains no new information compared to the original raw data, it does show the information in a different way. This might be useful if, for example, we are trying to put together a project proposal that would assign some of these employees to the project for periods other than a full year; having the monthly salary would allow us to cost out the project more accurately.
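The monthly salary field and the project costing it supports can be sketched as follows. The annual salaries are the ones in the raw data table (recorded in thousands of dollars); the project assignments, in months, are hypothetical values invented for this illustration.

```python
# Annual salary in $1,000, keyed by employee ID (from the raw data table).
annual_salary_k = {90020: 31.5, 90034: 40.3, 92300: 65.1}

# Computed field: monthly salary in dollars, rounded to the nearest dollar.
monthly_salary = {
    emp_id: round(salary_k * 1000 / 12)
    for emp_id, salary_k in annual_salary_k.items()
}

# Hypothetical project assignments: months each employee spends on the project.
months_on_project = {90020: 3, 92300: 5}

project_cost = sum(
    monthly_salary[emp_id] * months
    for emp_id, months in months_on_project.items()
)

print(monthly_salary)  # {90020: 2625, 90034: 3358, 92300: 5425}
print(project_cost)    # 35000
```

Because the computed field is derived entirely from the annual salary, it can be regenerated at any time and never needs to be collected separately.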