Usually the leftmost column in your data, it should contain a name or other
piece of information that identifies each set of observations.
Identifiers should be unique; that is, no two observations should have the same
identifier. Examples include: names of employees, social security numbers, and home
addresses. An identifier gives us a way of quickly and accurately locating all the
information about a particular observation from among all the observations in the
data set, something that we quite frequently have to be able to do in our analysis.
Sometimes an identifier is nothing more than what its name implies, a way of identifying
a particular observation, which is certainly important. In other situations, however,
identifiers might be coded in a way so that they do indeed contain information that
can be used for data analysis beyond their identification purpose. The point is that
the analyst must be on guard when it comes to identifiers. A column of identifiers
may look like data, and may even have a heading that looks like a variable name, but
because they are no more than identifiers they should not be included along with the
actual data when performing analysis. To do so might give rise to some very peculiar,
and erroneous, results. Identifiers can be extremely helpful in the analysis phase for
identifying data that may have been entered incorrectly or data that may represent
outliers.
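The two points above, that identifiers should be unique and that they should be kept out of the calculations, can be checked with a few lines of code. The following is only a minimal sketch: the records, the field names employee_id and salary, and the helper functions are all hypothetical and chosen for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: employee_id is the identifier, salary is a variable.
records = [
    {"employee_id": 1001, "salary": 52000},
    {"employee_id": 1002, "salary": 61000},
    {"employee_id": 1003, "salary": 58000},
    {"employee_id": 1003, "salary": 47000},  # repeated identifier, possibly an entry error
]

def duplicate_ids(rows, id_field):
    """Return the identifier values that appear more than once, sorted."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

def find_by_id(rows, id_field, wanted):
    """Return every row whose identifier equals `wanted`."""
    return [row for row in rows if row[id_field] == wanted]

duplicates = duplicate_ids(records, "employee_id")

# Analysis uses the variable only; the identifier is left out.
average_salary = mean(row["salary"] for row in records)

# Averaging the identifier runs without complaint but describes nothing real.
meaningless = mean(row["employee_id"] for row in records)
```

Calling find_by_id(records, "employee_id", 1003) pulls up both rows that share the repeated identifier, which is the kind of lookup that helps when tracking down a suspicious entry.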
Row (Observation or Record)
Each row of your data should contain the observations
of the different variables that are all associated with one identifier. If data is collected
on people, including name, age, education level, and salary, then the complete set of
information for one person is called a record or observation of the variables. Usually the term record
is used in databases, and the term observation is used in statistical settings. When
the data is organized into a spreadsheet, the records usually appear as rows.
Column (Variable or Field)
Each column in your data should contain a set of
observations of a single variable. In database terms, variables are called fields.
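One way to picture rows and columns outside a spreadsheet is as a list of records, each holding the same fields. The sketch below is an illustration only; the names, values, and the column helper are hypothetical.

```python
# Hypothetical data set: each dict is one record (row),
# and each key is one field (column).
people = [
    {"name": "Ana", "age": 34, "education": "BA", "salary": 52000},
    {"name": "Ben", "age": 45, "education": "MS", "salary": 61000},
    {"name": "Cy", "age": 29, "education": "HS", "salary": 38000},
]

def column(rows, field):
    """Collect the observations of a single variable across all records."""
    return [row[field] for row in rows]

first_record = people[0]      # one row: every variable for one person
ages = column(people, "age")  # one column: one variable for every person
```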
Coding
This is the process by which the information is converted from the raw form in which it was
collected into data entries for analysis. For example, when collecting information on the
gender of employees, the data could be coded in several ways:
You could enter the words "Male" or "Female"
You could enter "M" or "F"
You could enter "0" for male, "1" for female
You could enter "0" for female, "1" for male
The choice you make determines the way the data is coded. It is a good idea to include a
comment for each variable that explains how it has been coded and what each code
means.
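As a rough illustration of coding, the sketch below converts raw text entries into one of the numeric schemes listed above (0 for male, 1 for female) and keeps the explanation of the codes next to the data. The names GENDER_CODES and encode_gender are hypothetical, and the clean-up of spacing and capitalization is an added assumption about how the raw entries might look.

```python
# Hypothetical codebook: records what each code means (0 = male, 1 = female).
GENDER_CODES = {"Male": 0, "Female": 1}

def encode_gender(raw_values, codebook=GENDER_CODES):
    """Convert raw text entries into numeric codes.

    Entries are trimmed and capitalized before lookup; anything not in
    the codebook raises KeyError so that bad entries are not coded silently.
    """
    return [codebook[value.strip().title()] for value in raw_values]

raw = ["Male", "female", " FEMALE ", "male"]
coded = encode_gender(raw)
```

Had the opposite scheme been chosen (0 for female, 1 for male), only the codebook would change, which is one reason to write the coding down in a single place.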
Computed Field
A data item that is not collected directly from the problem situation, but
computed based on the collected data. For example, we might collect an employee’s
BirthDate, then compute his/her age as of a certain date.
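The age example can be sketched as follows. The function name age_on, the sample employee, and the chosen dates are hypothetical; the only field name taken from the text is BirthDate.

```python
from datetime import date

def age_on(birth_date, as_of):
    """Return the age in whole years on the date `as_of`."""
    years = as_of.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# BirthDate is collected; Age is a computed field derived from it.
employee = {"Name": "A. Example", "BirthDate": date(1990, 6, 15)}
employee["Age"] = age_on(employee["BirthDate"], date(2024, 6, 14))
```

Because the age depends on the "as of" date, it is usually safer to store the collected BirthDate and recompute the age when needed than to store the age itself.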
Cross-sectional data
Cross-sectional data is data in which the variables are all observed at some
"frozen instant in time". Each of the observations is independent of the other observations
(has no effect on it). Such data is usually used to capture information about a population by
cutting through the entire population and recording information on all the variables for each
individual in the population.
Time Series Data
If the same variables are observed at different times, then the data is time
series data. Analysis of time series data is more difficult than the analysis of
cross-sectional data since usually the values of the variables at one time have an effect on
the values of the variables at the next time they are observed. For example, if a
stock closed up one day, this has an effect on the likelihood of the stock closing
up the next day. This means that the observations are not independent of each
other.
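One rough way to see this kind of dependence is to measure how closely each value in a series follows the value just before it. The sketch below uses the lag-1 autocorrelation for that purpose; this statistic is brought in here only as an illustration and is not named in the text above. Both series are made-up numbers, not real stock prices.

```python
from statistics import mean

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: how strongly each value tracks the previous one.

    Values near +1 mean a high value tends to be followed by a high value;
    values near -1 mean a high value tends to be followed by a low one.
    """
    m = mean(series)
    numerator = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    denominator = sum((x - m) ** 2 for x in series)
    return numerator / denominator

# Hypothetical series, for illustration only.
trending = [100, 101, 103, 104, 106, 107, 109, 110]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

trend_dependence = lag1_autocorrelation(trending)
alternating_dependence = lag1_autocorrelation(alternating)
```

A result far from zero for either series suggests that consecutive observations are not independent of each other.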
Population
Populations are collections of individual items (people, houses, companies, countries,
cars) that are being investigated. For cross-sectional data on populations, each observation in
the data is for a different member of the population. For example, in collecting data on
incomes for families, you could define a population to be "all families in cities
with fewer than 100,000 people" or "all families with two children in the United
States".
Sample
When collecting data, it is rare indeed to collect information from every member of a
population. Usually this is impractical because of time or expense, so some portion, usually
randomly chosen according to some carefully defined criteria, is sampled. Each member of the
sample produces an observation of the variables in the data. However, it is possible that the
sample you have collected is not representative of the entire population. It is critical
that you make certain that the sample and population are as similar as possible.
When you calculate any statistical information based on a sample, you are using
this information to infer the characteristics of the population. Because a sample
only estimates the population, the statistical calculations usually have to be
modified slightly. (For an example, see chapter 3 on the standard deviation.)
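A common instance of such a modification is the standard deviation, which divides by n when the data is the whole population and by n - 1 when the data is a sample. The sketch below uses Python's standard statistics module to show the difference; the numbers are hypothetical, and the treatment in chapter 3 may be presented differently.

```python
from statistics import pstdev, stdev

# Hypothetical observations, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the values as the entire population: divide by n.
population_sd = pstdev(values)

# Treating the values as a sample from a larger population: divide by n - 1.
sample_sd = stdev(values)
```

The sample version comes out slightly larger, and the gap shrinks as the number of observations grows.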