Before you submit data to analysis, it will often be useful to perform some preliminary operations. These may include:
When choosing the method of analysis, you have to ask yourself whether you want to analyse single variables one at a time, or the relations between several variables. Or do you just want to use the measured variables for classifying or sorting individuals or cases?
Another important decision concerns the final purpose of your project. Do you want to describe the present (or past) state of your object, or do you wish to find out how the object should be, i.e. which degree of the measured attributes would be optimal? This latter type of analysis is discussed under the title Normative Study of Variables.
In the following, there is a list of some usual methods of statistical analysis of
one single variable. The methods have been arranged according
to the scale of measurement of the variable.
| Method | Nominal scale | Ordinal scale | Interval scale | Ratio scale |
|---|---|---|---|---|
| Presenting data: tabulation, graphical presentation | yes | yes | yes | yes |
| Average: the mode | yes | yes | yes | yes |
| Average: the median | - | yes | yes | yes |
| Average: the arithmetic mean | - | - | yes | yes |
| Dispersion: the quartile deviation | - | yes | yes | yes |
| Dispersion: the range | - | yes | yes | yes |
| Dispersion: standard deviation | - | - | yes | yes |
A simple way of presenting a distribution of values is to
present each value as a dot on a scale. If there is a great number of
values, it may be better first to classify them and then present the
frequency of each class as a histogram (Fig. on the
right).
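The classification step can be sketched in a few lines of Python. This is only an illustration: the measurements and the class width of 5 are made-up assumptions, not data from the figure.

```python
from collections import Counter

# Hypothetical whole-number measurements (not from the original study).
values = [5, 7, 7, 8, 9, 11, 12, 12, 13, 15, 18, 21, 24, 34]
class_width = 5  # assumed width of each class


def classify(data, width):
    """Count how many values fall into each class [lower, lower + width)."""
    counts = Counter((v // width) * width for v in data)
    return dict(sorted(counts.items()))


frequencies = classify(values, class_width)

# Text histogram: one '#' per observation in the class.
# The printed upper limit assumes whole-number measurements.
for lower, count in frequencies.items():
    print(f"{lower:>3}-{lower + class_width - 1:<3} {'#' * count}")
```

A wider class gives a smoother but coarser picture, so it is worth trying a few widths before choosing one.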
If your studies involve people, your measurements quite often turn out to be distributed according to a certain curve, the so-called Gauss curve (on the left), which is therefore called the normal distribution. One of its properties is that about 68% of all measurements
will differ from the mean (in the figure: M) by no more than one standard deviation, and about 95% by no more than two standard deviations.
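If you want to check these percentages yourself, the share of a normal distribution lying within k standard deviations of the mean can be computed with the error function. The function name `share_within` is just a label chosen for this sketch.

```python
import math


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```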
Sometimes you will wish to emphasize not the absolute distribution but the proportional or percentage
distribution. A suitable diagram for this is the pie chart (on the
right):
The median is the value in the middle of the data, when all the values are first arranged from the smallest to the largest.
The (arithmetic) mean is the sum of all the values divided by their number, or M = (x1 + x2 + ... + xn) / n, where n is the number of values.
From among the averages that were presented above, the researcher can usually choose the one that best shows the typical value of the variable. Arithmetic mean is the most popular one, but it can give the wrong picture e.g. in data which include one value which greatly differs from the others (see the picture below).
The same happens if the distribution is skewed,
like in the picture on the right. The example lists the minutes that
different subjects spent carrying out a certain task. The
fastest ones needed 5 minutes, but the most common performance (= the
mode) was seven minutes. The value in the middle, i.e. the
median, is shown as a red letter M in the picture. The median has the
value 11 here.
What about the mean? As the performance of the slowest
subject took as long as 34 minutes, the mean went up to 11.98 minutes,
which does not give a very accurate picture of the average performance
in this case. This shows that if the data is skewed, the type of average
must be chosen with care. A graphic presentation would often be more
illustrative than calculating a single statistic.
The distribution shown in the picture is positively skewed,
because the measurements that are larger than the median (11) are
spread out over a large range (from 11 to 34) while the measurements
below the median are concentrated in a narrow range (from 5 to 11).
A statistic can also be found to describe the amount of
skew, if necessary.
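To see how the three averages drift apart in skewed data, you can try a small experiment like the one below. The task times are invented for the illustration (they are not the data behind the picture), and the skew statistic used here is the common moment coefficient of skewness, which is only one of several possible choices.

```python
import statistics

# Hypothetical task times in minutes, with one very slow subject.
times = [5, 6, 7, 7, 7, 8, 9, 10, 12, 15, 34]

mode = statistics.mode(times)
median = statistics.median(times)
mean = statistics.mean(times)

# Moment coefficient of skewness: the mean cubed deviation divided by the
# cube of the (population) standard deviation. Positive = long tail to the right.
sd = statistics.pstdev(times)
skew = sum((t - mean) ** 3 for t in times) / (len(times) * sd ** 3)

print(f"mode={mode}, median={median}, mean={mean:.2f}, skew={skew:.2f}")
```

In data of this kind the single slow subject pulls the mean above the median, while the mode and the median stay close to the typical performance.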
When selecting the most suitable average, you should consider the scale which was used in the collection of the data. If the scale was nominal, the only possible average is the mode. If the scale was ordinal, you may use either the median or the mode. Note, however, that the categories of scales are not always quite distinct; for example the common worded scale
beautiful / - / - / - / - / - / ugly
should actually be regarded as ordinal because the intervals between the markings are not truly equal (people prefer to put their ticks near the middle because the intervals near the ends are sensed as greater). However, many researchers prefer using the mean, not the median, as a summary for this type of questionnaire, which means that they treat this scale as an interval scale.
Finally, if the average was calculated from a sample, you should test its statistical significance, i.e. how probable it is that the average found in the sample also holds in the population from which the sample was drawn. A suitable test for this is the t-test.
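One way of putting this advice into practice is sketched below: it computes a 95% interval around a sample mean and the t statistic against an assumed population mean. The sample, the comparison value `mu0` and the variable names are all hypothetical; the critical value 2.262 is the usual table value of Student's t for 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical sample of 10 measurements (illustration only).
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation (divisor n - 1)
standard_error = s / math.sqrt(n)

# Table value of Student's t for n - 1 = 9 degrees of freedom, 95 %, two-sided.
T_CRIT_DF9 = 2.262

low = mean - T_CRIT_DF9 * standard_error
high = mean + T_CRIT_DF9 * standard_error

# t statistic for an assumed (hypothetical) population mean.
mu0 = 12
t = (mean - mu0) / standard_error

print(f"mean={mean}, 95% interval=({low:.2f}, {high:.2f}), t={t:.2f}")
```

If the absolute value of t exceeds the table value, the sample mean differs from the assumed population mean more than chance alone would usually explain.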
Once you have calculated the average value, it would sometimes be interesting to describe how far the individual values are scattered around the average. To this end, you may choose between a variety of statistics. The selection depends on the type of average that you have used:
However, if the standard deviation is calculated from just a random sample, the divisor n is replaced by n - 1, and the formula is s = √( Σ(x - M)² / (n - 1) ), where M is the arithmetic mean.
In both formulas, n is the number of values, and each value of the variable is substituted for x, one at a time.
Hardly any researcher bothers to perform the calculation himself because
the necessary algorithm for it now exists even in pocket calculators.
The square of the standard deviation is called variance, and
it, too, is often used to describe and to analyse the dispersion.
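The statistics of dispersion mentioned above can be tried out with Python's standard `statistics` module, as in this sketch. The measurements are hypothetical, and note that programs differ slightly in how they define quartiles, so the quartile deviation may not match other software exactly.

```python
import statistics

# Hypothetical measurements (illustration only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)

# Quartile deviation: half the distance between the first and third quartile.
q1, _, q3 = statistics.quantiles(values, n=4)
quartile_deviation = (q3 - q1) / 2

sd_population = statistics.pstdev(values)      # divisor n
sd_sample = statistics.stdev(values)           # divisor n - 1
variance_sample = statistics.variance(values)  # square of sd_sample

print(value_range, quartile_deviation, sd_population, round(sd_sample, 3))
```

The sample formula always gives a slightly larger figure than the population formula, and the difference shrinks as n grows.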
If the statistic of dispersion has been calculated from a sample, its statistical significance should also be calculated in the end. The t-test is suited to this.
If two variables vary in such a way that they follow each other to some extent, we say that there is an association or covariation between the variables. For example, the height and weight of people are statistically associated: although nobody's weight is caused by his height nor the height by the weight, it is, however, usual for a tall person to weigh more than a short person. On the other hand, the data usually include exceptions as well, which means that a statistical association is inherently stochastic.
The science of statistics offers numerous methods for revealing and presenting the associations between two and even more variables. The simplest means are the methods of graphic presentation and tabulation. The association between the variables can also be measured with the help of special statistics, such as correlation.
If, when analysing the data, some association between the variables is discovered, this does not mean that either variable necessarily causally depends on the other. A strong covariation between, say, A and B, can be due to four alternative reasons:
In the following, we mention some usual methods of statistical analysis which can be used when studying the interdependence between
two or more variables. The methods have been arranged according to which measurement scale the variables mostly correspond to.
| Method | Nominal scale | Ordinal scale | Interval scale | Ratio scale |
|---|---|---|---|---|
| Presenting data: tabulation, graphics | yes | yes | yes | yes |
| Association: coefficient of contingence, chi squared | yes | yes | yes | yes |
| Association: ordinal correlation | - | yes | yes | yes |
| Association: Pearson's r correlation, ANOVA | - | - | yes | yes |
| Association: regression analysis, factor analysis | - | - | yes | yes |
Some abbreviations conventionally used in tables are presented on the page Classification.
Products, as objects of study, are often presented as pictures, which
are one kind of graphical presentation. (Examples of pictorial presentations.)
If the researcher wishes to highlight some common traits or
general patterns he has found in a group of objects, he can combine
several objects in one graphic, like in the figure on the left. In the
diagram, Sture Balgård shows how the old buildings in
Härnösand follow uniform proportions of width and height (the
red line) with just a few exceptions. In inventing illustrative methods of
presenting the findings of the study of products, the most serious
restriction is the imagination of the researcher.
Often, however, the appearance of the object itself is not
important and only the numerical values of its measurements are
of interest. If that is the case, the first question that you should consider
when selecting the type of graphics is which structure in the data
you wish to show. Of course, you must not "lie with the help of
statistics", but it is always admissible to select a style of presentation
which highlights the important patterns by eliminating or diminishing the
uninteresting relations and structures.
If your data consists of only a few measurements, it is possible to show all of them as a scattergram. You may exhibit the values of two variables on the two axes x and y, and additionally a couple of variables by utilizing the colours or shapes of the dots. In the diagram to the right, the variable z has two values, which are indicated by a square and a plus sign.
If the variation is too small to appear clearly, you may emphasize it by cutting off parts of one or both of the scales, see examples. You simply cut off the uninteresting part, either from the top or the bottom. The discarded part must be empty of empirically measured values. To make sure that the reader notices the operation, it is better to show it not only in the scales but also in the background grid of the diagram.
On the other hand, if the variation range of your data is very
large, you may consider using a logarithmic scale on one or both
of the axes, see diagram on the left. Logarithmic scaling is appropriate on
a ratio scale only.
If you have hundreds of measurements, you will probably not want to
show each of them as a scattergram. One possibility in this case is to
classify the cases and present them as a histogram.
The histogram may be adapted to present up to four or five
variables. You can do this by varying the widths of the columns, their
colours, background patterns and by a three dimensional presentation
(fig. on the left). All these variations are easily created by a spreadsheet
program like Excel, but they should not be used for decoration only.
The patterns filling or making up the histogram columns may
be chosen so that they symbolize one of the variables. For example, the
columns describing the number of cars may be formed by piling up
pictures of cars one above or after the other. This is all right, provided
that you do not vary the size of the symbols used in a histogram.
Otherwise, the interpretation would be difficult for the reader (does the
number of cars relate to the length, to the area or to the volume of the
car symbols?)
The researcher is often interested in the relations of two or more variables rather than in the detached pairs of measurements as such. The normal way of presenting two or more interdependent variables is the curve. It implies a continuous variable (i.e. one where the number of possible values is infinite). (Examples.)
You should not fabricate a curve from measurements which are not values of the same variable. For instance, the attributes of an object are different variables. Examples are the personal evaluations that researchers often gather with the help of semantic differential scales of the type below:
Estimate the characteristics of your bedroom. Tick one box on every line.

| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Light | _ | _ | _ | _ | _ | _ | _ | Dark |
| Noisy | _ | _ | _ | _ | _ | _ | _ | Quiet |
| Clean | _ | _ | _ | _ | _ | _ | _ | Dirty |
| Big | _ | _ | _ | _ | _ | _ | _ | Small |
It would now be pointless to present the various evaluations
of the bedroom as a single "profile" as in the diagram on the left
(although you often find this type of illogical presentation in research
reports).
If you absolutely want to stress that the variables belong together
(e.g. because all of them are evaluations of the same object), an
appropriate method would be e.g. a horizontal histogram (on the right).
All of the above diagrams can be combined with maps and other topological presentations. For example, the variation in the different areas of the country is often shown as a cartogram by distinguishing the different districts with different colours or shades. Another technique is the cartopictogram in which small pie or column diagrams have been placed on the map. If you need to show associations between areas this can be done with arrows whose thickness indicates the intensity of the connection. (Example.)
How closely two variables are related to each other can be studied by means of tabulation or graphical presentation, and it can be measured with special statistics, too. The statistics available for analysing the links between two variables depend on the type of scale with which the variables have been measured (see the table that was presented earlier).
Formulas for calculating the statistics of contingence are not shown here because performing the calculations manually would be awkward and researchers usually do them with computers.
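To give an idea of what the computer does, here is a minimal sketch that calculates chi squared and Pearson's coefficient of contingency from a cross-tabulation. The table of observed frequencies is invented for the example.

```python
import math

# Hypothetical cross-tabulation: rows = two groups, columns = three answers.
observed = [
    [20, 15, 5],
    [10, 15, 25],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Chi squared: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# Pearson's coefficient of contingency, always between 0 and 1.
contingency = math.sqrt(chi2 / (chi2 + total))
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi2={chi2:.2f}, C={contingency:.2f}, df={degrees_of_freedom}")
```

The chi squared value is then compared with the table value for the given degrees of freedom, in the same way as the other test statistics.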
The product moment correlation or Pearson's correlation, which is usually abbreviated with the letter r, measures how closely the association between two variables resembles the linear equation y = ax + b. If the correlation coefficient is high, in other words if its value approaches either +1 or -1, it means that the relation between the two variables approaches this equation. If the correlation is low, e.g. something between -0.3 and +0.3, the two variables have little to do with each other (more exactly, they have almost no linear covariation). The sign of the correlation coefficient does not affect the strength of the association; it only shows the direction of the relation and is always identical with the sign of the coefficient a in the above equation.
Below you can see three scattergrams which show three different sets of data from two variables, each set consisting of eight pairs of values. The correlations between the two variables have been calculated and are shown under each scattergram. It can be seen that there is no correlation between the variables in the set on the left, and the other two sets show correlations of 0.5 and 1.0.
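The calculation behind such figures can be sketched as follows. The three data sets of eight pairs each are hypothetical (they are not the ones in the scattergrams) and were chosen so that the results are easy to check: a perfect rising line, a perfect falling line and a pattern with no linear covariation.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation between two equally long lists of values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y_rising = [2 * v + 1 for v in x]        # lies exactly on y = 2x + 1
y_falling = [10 - v for v in x]          # lies exactly on y = -x + 10
y_unrelated = [1, 3, 3, 1, 1, 3, 3, 1]   # no linear covariation with x

for name, y in [("rising", y_rising), ("falling", y_falling), ("unrelated", y_unrelated)]:
    print(f"{name:>9}: r = {pearson_r(x, y):+.2f}")
```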
Notwithstanding the fact that correlation analysis is able to handle only two variables, it is an excellent tool for the initial analysis of a large number of variables, when you have no clear idea of their mutual relations. A computer can quickly calculate the correlations between all possible pairs of variables, finally constructing a correlation matrix from the results. You can then select those pairs that have the strongest correlation and continue by examining these pairs with other, more refined tools for analysis.
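A sketch of this screening procedure is given below. The variable names and values are made up for the illustration; the program calculates r for every pair of variables and lists the pairs with the strongest correlation first.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Product moment correlation between two equally long lists of values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical data: one list per variable, one value per case.
data = {
    "height": [160, 165, 170, 175, 180, 185],
    "weight": [55, 62, 66, 74, 79, 88],
    "shoe_size": [37, 38, 40, 42, 43, 45],
    "age": [34, 21, 45, 28, 52, 30],
}

# One correlation for each possible pair of variables.
pairs = {(a, b): pearson_r(data[a], data[b]) for a, b in combinations(data, 2)}

# Strongest associations first, judged by the absolute value of r.
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)
for (a, b), r in ranked:
    print(f"{a:>10} ~ {b:<10} r = {r:+.2f}")
```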
A weakness of the correlation analysis is that it cannot detect other than linear relations between the variables. E.g., a relation that obeys the equation y = ax² + bx + c would pass unnoticed. However, some of the newer analysis programs are able to detect even this and some other usual relationships of variables. Besides, you can try to:
Once you have found a pair of variables with a strong correlation (or contingence) you can continue, for example, with the following operations:
Finally remember that when a correlation has been calculated from a sample, you should examine its statistical significance with the t test.
The ANOVA method is based on the principle that
a real difference between the groups is indicated only if the
between-groups variance is clearly greater than
the within-group variance.
The analysis is initiated by calculating the within-group variance
for each group, and the mean of all these group variances.
The next step is to calculate the mean for each group, and then the
variance of these means. Multiplied by the number of cases in each group (when the groups are of equal size), it is the between-groups variance.
Then you calculate the ratio of the above two figures, which is called
F. In other words,
F = (group size × variance of the group means) / (mean of the group variances).
Finally you refer to a table (in statistics handbooks) which shows
how high a value F may reach when only chance operates.
If the F obtained from the ANOVA is higher than the table value, the
difference between the groups is statistically significant at the level that the table reports.
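The steps can be followed in the short sketch below, which uses three hypothetical groups of equal size. Note that it multiplies the variance of the group means by the group size before dividing, as the usual F ratio of one-way ANOVA does when the groups are equally large.

```python
import statistics

# Hypothetical measurements in three groups of equal size.
groups = [
    [4, 5, 6, 5, 5],
    [7, 8, 6, 7, 7],
    [5, 6, 7, 6, 6],
]
group_size = len(groups[0])

# Within-group variance: the mean of the variances of the groups.
within = statistics.mean([statistics.variance(g) for g in groups])

# Between-groups variance: group size times the variance of the group means.
group_means = [statistics.mean(g) for g in groups]
between = group_size * statistics.variance(group_means)

f_ratio = between / within
df_between = len(groups) - 1
df_within = len(groups) * (group_size - 1)

print(f"F = {f_ratio:.2f} with {df_between} and {df_within} degrees of freedom")
```

With 2 and 12 degrees of freedom the usual 5% table value of F is about 3.89, so an F clearly above that would be read as a significant difference between the groups.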
The algorithm of regression analysis constructs an equation of the following pattern, and it gives the parameters a1, a2 etc. and b such values that the equation corresponds to the empirical values as closely as possible.
y = a1x1 + a2x2 + a3x3 + ... + b
In the equation,
y = the dependent variable
x1 , x2 etc. = independent variables
a1 , a2 etc. = parameters (regression coefficients)
b = constant.
If you have extensive data with many variables, at the beginning of the analysis you are perhaps not sure which variables are mutually connected and which should thus be included in the equation. You might first study this with correlation analysis, or you can let the regression analysis program select the "right" variables (x1, x2 etc) to the equation. "Right" are those variables which improve the closeness of fit between the equation and the empirical values.
When one of the independent variables is time, and especially when we have a series of measurements at equal intervals, this series goes by the name time series. Regression analysis is a suitable tool for revealing a trend or a long-term development in a time series. For other structures which can be found in time series, see Historical Study.
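As an illustration, the sketch below fits a trend line to a short time series, i.e. the regression equation with time as the only independent variable (y = ax + b). The yearly figures are hypothetical, and the function `fit_line` is simply the usual least-squares formula written out; programs for several independent variables work on the same principle.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in the equation y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov / ss_x
    b = mean_y - a * mean_x
    return a, b


# Hypothetical time series: one measurement per year, at equal intervals.
year = [1, 2, 3, 4, 5]
measurement = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(year, measurement)
print(f"trend: y = {a:.2f} * year + {b:.2f}")
```

The coefficient a is the trend, i.e. the average change per year, and the deviations of the single years from the line show what the trend leaves unexplained.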
The researcher sometimes has a great amount of data on numerous different variables which correlate with each other. With the help of factor analysis, such data can often be compressed and the variations presented through just a few variables.
As an example, let us consider the data from a questionnaire (shown
elsewhere) where a number of test subjects were
asked to indicate how closely their personal bedrooms would correspond
to adjectives provided by the researcher (given with "semantic
differential" scales). The researcher now wants to find out if behind the
estimates of the subjects, there are some "background variables" whose
direct measuring through linguistic means would not be possible because
of the lack of appropriate adjectives in language. The researcher's
hypothesis is that these background variables "appear" through the
adjectives used in the semantic scales, usually not through any single
adjective but instead through a group of correlated adjectives.
With the help of factor analysis, the combined variables or
factors hidden behind the measured attributes can be detected
and specified, and the analysis also tells how closely these factors are
linked with the original variables. Sometimes an extra condition is also
placed on the factors, namely that they must not correlate with each
other at all, and they can hence be said to be "perpendicular" to each
other (= "orthogonal rotation" of the factors during analysis).
A drawback of the method of factor analysis is that it is too easy to use it for practically absurd but formally correct studies because the analysis always presents its results in an elegant, mathematically exact form, even when the factors obtained have no sensible empirical contents.
There is no great difference between informative and normative
methods of analysis. In normative study at least one of the variables is
evaluative like "usefulness" or "satisfaction" etc., and the final aim
of the study is to improve the attributes of the object of study.
Because all evaluation is subjective it is important to consider and define
exactly whose point of view is used in the evaluation; this aspect
is discussed under the titles Human Subjectivity
and Objectivity and Normative Research. The choice depends also on the degree of participation which prevails in the
organisation that is going to use the normative suggestions that you are
preparing.
Once the assessing persons have been selected, the evaluations
can be gathered with a survey.
| Property: Ease of use | Utility value |
|---|---|
| All operations are automatic. | 5 |
| Many operations are automatic. There is a detailed instruction booklet. | 4 |
| No automatic operations. The instructions book is mediocre. | 3 |
| The machine reacts contrary to the booklet. | 2 |
| Operation is confusing. No instructions book. | 1 |
A common difficulty when trying to improve the object of study is that its attributes are interdependent. While one of the attributes of the object of study is being improved, other attributes like usability, beauty, meaning, ecology or economy often deteriorate. In such a case you will want to uncover the exact relationship between these attributes, for example with regression analysis.
The figure on the left gives an example of finding an
optimum for the sum of two variables, both of which are dependent on a
third variable. The aim is to find out how thick an insulation you should
select for a new building. The costs which vary along with the thickness
of insulation are its price (which is proportional to its thickness) and the
subsequent heating costs of the building (which will be diminished if the
insulation is augmented). Because investment and annual payments
cannot be added together as such, we first have to transform the
investment into annual cost. Such an annual cost which corresponds to
the investment is presented as curve B in the diagram, while
curve A shows the yearly heating costs. The optimum for an
insulation will be found at the point where the sum of these two costs is
at its minimum.
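The search for the optimum can be sketched as follows. All the figures (price, annuity factor, heating constant, range of thicknesses) are hypothetical assumptions made for the illustration, and so is the simple model in which the heating cost is inversely proportional to the thickness.

```python
# All cost figures are hypothetical and only serve to show the method.
PRICE_PER_CM = 2.0        # investment per cm of insulation
ANNUITY_FACTOR = 0.08     # share of the investment charged as yearly cost
HEATING_CONSTANT = 40.0   # assumed: yearly heating cost = constant / thickness


def annual_investment_cost(thickness_cm):
    """Curve B: the investment transformed into an annual cost."""
    return PRICE_PER_CM * ANNUITY_FACTOR * thickness_cm


def annual_heating_cost(thickness_cm):
    """Curve A: yearly heating cost, which falls as insulation gets thicker."""
    return HEATING_CONSTANT / thickness_cm


def total_annual_cost(thickness_cm):
    return annual_investment_cost(thickness_cm) + annual_heating_cost(thickness_cm)


# Try every whole centimetre from 5 to 40 and pick the cheapest alternative.
candidates = range(5, 41)
optimum = min(candidates, key=total_annual_cost)

print(f"optimum thickness: {optimum} cm, yearly cost {total_annual_cost(optimum):.2f}")
```

Near the optimum the sum curve is usually quite flat, so a small deviation from the calculated thickness costs very little.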
The science of operations analysis includes other comparable analysis methods which can be used to find the common optimum of several quantifiable attributes of a product. Such is for example the algorithm of linear programming.
Sometimes it turns out that two or more variables are causally dependent on each other; such a relation should be made explicit because you will normally prefer manipulating the cause, not the effect.
| Product attribute | Weight |
|---|---|
| Speed at least 100 mph | 40 |
| Easy to use, automatic | 40 |
| Design: sportive, unlike the competitors | 10 |
| Materials are potentially recyclable | 10 |
| Total weights | 100 |
Often the table of weights grows too large and difficult to grasp and to work with. To restrain its growth, you may want to present a family of related properties in the pattern of a logical tree, which means combining groups of associated properties into one bundle. When searching for such bundles, or families of attributes, you can consider using factor analysis.
| Attribute, or property of the product | Weight W | Alternative 1: utility value U | Alternative 1: WxU | Alternative 2: utility value U | Alternative 2: WxU |
|---|---|---|---|---|---|
| Capacity | 40 | 2 | 80 | 5 | 200 |
| Ease of use | 40 | 3 | 120 | 4 | 160 |
| Design, appearance | 10 | 5 | 50 | 2 | 20 |
| Materials, recycling | 10 | 3 | 30 | 2 | 20 |
| Total | 100 | -- | 280 | -- | 400 |
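The arithmetic of the table is easy to repeat with a few lines of code, which is handy when there are many attributes or alternatives. The sketch uses the weights and utility values of the table above; the function name `weighted_total` is just a label chosen here.

```python
# Weights W of the attributes, as in the table above.
weights = {
    "Capacity": 40,
    "Ease of use": 40,
    "Design, appearance": 10,
    "Materials, recycling": 10,
}

# Utility values U of each alternative, as in the table above.
utilities = {
    "Alternative 1": {"Capacity": 2, "Ease of use": 3,
                      "Design, appearance": 5, "Materials, recycling": 3},
    "Alternative 2": {"Capacity": 5, "Ease of use": 4,
                      "Design, appearance": 2, "Materials, recycling": 2},
}


def weighted_total(utility_values):
    """Sum of W x U over all attributes."""
    return sum(weights[attr] * utility_values[attr] for attr in weights)


totals = {name: weighted_total(u) for name, u in utilities.items()}
best = max(totals, key=totals.get)

print(totals, "->", best)
```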
An alternative method is to include the costs in the table as an additional row. Its weight in the final evaluation is then set at 40% or 50%. An example of such a table is given elsewhere.
January 2, 2005. Original location:
http://www2.uiah.fi/projects/metodi