2.2.1 Definitions and Formulas

Identifier

Usually the leftmost column in your data, the identifier contains a name or other piece of information that identifies each set of observations separately. Identifiers should be unique; that is, no two observations should have the same identifier. Examples include names of employees, social security numbers, and home addresses. An identifier gives us a way of quickly and accurately locating all the information about a particular observation from among all the observations in the data set, something we frequently need to do in our analysis. Sometimes an identifier is nothing more than what its name implies: a way of identifying a particular observation. In other situations, however, identifiers are coded so that they do contain information that can be used for data analysis beyond identification. The point is that the analyst must be on guard when it comes to identifiers. A column of identifiers may look like data, and may even have a heading that looks like a variable name, but when its entries are no more than identifiers, it should not be included with the actual data when performing analysis. Doing so might give rise to some very peculiar, and erroneous, results. Identifiers can also be extremely helpful in the analysis phase for tracing data that may have been entered incorrectly or that may represent outliers.
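As a minimal sketch of these points, the following Python uses a small invented data set (the field names `employee_id`, `age`, and `salary` are hypothetical) to check that identifiers are unique, to average a variable without touching the identifier column, and to use the identifier only for locating one observation.

```python
from collections import Counter

# Hypothetical records; the values are invented purely for illustration.
records = [
    {"employee_id": "E001", "age": 34, "salary": 52000},
    {"employee_id": "E002", "age": 45, "salary": 61000},
    {"employee_id": "E003", "age": 29, "salary": 48000},
]

def duplicate_identifiers(rows, id_field):
    """Return the identifiers that appear more than once."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(ident for ident, n in counts.items() if n > 1)

def mean_of(rows, field):
    """Average a single variable; the identifier column is never averaged."""
    return sum(row[field] for row in rows) / len(rows)

def find_record(rows, id_field, ident):
    """Use the identifier only to locate one observation."""
    return next(row for row in rows if row[id_field] == ident)

duplicates = duplicate_identifiers(records, "employee_id")
average_salary = mean_of(records, "salary")
suspect = find_record(records, "employee_id", "E002")
```

An empty `duplicates` list means every observation can be located unambiguously; a non-empty one is a signal to inspect the data before analyzing it.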

Row (Observation or Record)

Each row of your data should contain the observations of the different variables that are all associated with one identifier. If data is collected on people, including name, age, education level, and salary, then the complete set of information for one person is called a record or an observation. The term record is usually used in databases, and the term observation is usually used in statistical settings. When the data is organized into a spreadsheet, the records usually appear as rows.

Column (Variable or Field)

Each column in your data should contain a set of observations of a single variable. In database terms, variables are called fields.
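To make the row and column distinction concrete, here is a small sketch with invented data: each dictionary is one record (row), and collecting one key across all records gives one variable (column). The helper name `column` is hypothetical.

```python
# Invented example data: each dict is one record (row / observation).
people = [
    {"name": "Avery", "age": 41, "education": "BA", "salary": 58000},
    {"name": "Blake", "age": 36, "education": "MS", "salary": 67000},
    {"name": "Casey", "age": 52, "education": "HS", "salary": 49000},
]

def column(rows, field):
    """Collect one variable (column / field) across all records."""
    return [row[field] for row in rows]

second_record = people[1]      # one row: every variable for one person
ages = column(people, "age")   # one column: one variable for every person
```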

Coding

This is the process by which information is converted from the raw form in which it was collected into data entries for analysis. For example, when collecting information on the gender of employees, the data could be coded in several ways:

You could enter the words “Male” or “Female”
You could enter “M” or “F”
You could enter “0” for male, “1” for female
You could enter “0” for female, “1” for male

The choice you make determines the way the data is coded. It is a good idea to include a comment for each variable that explains how it has been coded and what each code means.
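One way to record such a choice is to keep the codebook next to the data. The sketch below assumes the scheme that enters 0 for male and 1 for female; the names `GENDER_CODES` and `encode_gender` are hypothetical, and any of the other schemes would work the same way with a different mapping.

```python
# Codebook for the variable "gender" (assumed scheme):
#   0 = male, 1 = female
GENDER_CODES = {"Male": 0, "Female": 1}

def encode_gender(raw_value):
    """Convert a raw answer into its numeric code.

    Accepts "Male"/"Female" or "M"/"F" in any letter case and raises
    ValueError for anything else, so bad entries are not silently coded.
    """
    cleaned = raw_value.strip().lower()
    for label, code in GENDER_CODES.items():
        if cleaned in (label.lower(), label[0].lower()):
            return code
    raise ValueError(f"Unrecognized gender value: {raw_value!r}")

raw_answers = ["Male", "F", "female", " M "]
coded = [encode_gender(answer) for answer in raw_answers]
```

Keeping the mapping in one place means the comment and the code cannot drift apart, and anyone reading the data later can tell what each code means.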

Computed Field

A data item that is not collected directly from the problem situation but is computed from the collected data. For example, we might collect an employee’s BirthDate, then compute his/her age as of a certain date.
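The age example can be sketched in Python as follows. The function name `age_on` and the dates are invented for illustration; the only collected item is BirthDate, and Age is derived from it.

```python
from datetime import date

def age_on(birth_date, as_of):
    """Return age in whole years on the date `as_of`."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# Invented record: BirthDate is collected, Age is computed from it.
employee = {"BirthDate": date(1990, 6, 15)}
employee["Age"] = age_on(employee["BirthDate"], as_of=date(2024, 6, 14))
```

Because the computed value depends on the "as of" date, it is worth recording that date alongside the computed field.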

Cross-sectional Data

Cross-sectional data is data in which the variables are all observed at some “frozen instant in time”. Each observation is independent of the other observations (it has no effect on them). Such data is usually used to capture information about a population by cutting through the entire population and recording information on all the variables for each individual in the population.

Time Series Data

If the same variables are observed at different times, then the data is time series data. Analysis of time series data is more difficult than the analysis of cross-sectional data, since the values of the variables at one time usually have an effect on the values of the variables the next time they are observed. For example, if a stock closed up one day, this has an effect on the likelihood of the stock closing up the next day. This means that the observations are not independent of each other.

Population

Populations are collections of individual items (people, houses, companies, countries, cars) that are being investigated. For cross-sectional data on populations, each observation in the data is for a different member of the population. For example, in collecting data on incomes for families, you could define a population to be “all families in cities with fewer than 100,000 people” or “all families with two children in the United States”.

Sample

When collecting data, it is rare indeed to collect information from every member of a population. Usually this is impractical because of time or expense, so some portion of the population, usually randomly chosen according to carefully defined criteria, is sampled. Each member of the sample produces an observation of the variables in the data. However, it is possible that the sample you have collected is not representative of the entire population. It is critical to make certain that the sample resembles the population as closely as possible. When you calculate any statistical information based on a sample, you are using this information to infer the characteristics of the population. This usually changes how the statistics are calculated. (For an example, see chapter 3 on the standard deviation.)
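The standard deviation is one place where this difference shows up. As a sketch with made-up numbers, using Python's standard `statistics` module: `pstdev` treats the data as the whole population and divides by n, while `stdev` treats the data as a sample and divides by n - 1. This only illustrates the kind of adjustment meant and is not necessarily how chapter 3 presents it.

```python
import random
import statistics

# Made-up "population" of 1,000 incomes, for illustration only.
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(50000, 12000) for _ in range(1000)]

# Draw a simple random sample of 30 members without replacement.
sample = rng.sample(population, 30)

# Whole population available: divide by n.
population_sd = statistics.pstdev(population)

# Only a sample available: divide by n - 1 to estimate the population value.
sample_sd = statistics.stdev(sample)

# The same 30 numbers give a smaller result if wrongly treated as a population.
sample_sd_as_population = statistics.pstdev(sample)
```

For the same data, the sample formula always gives the larger of the two values, and the gap shrinks as the sample size grows.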
