Assorting has been moved to another page: Classification
Time series has been moved to another page: Historical Study

Quantitative Analysis

  1. Analysing Individual Variables
  2. Analysing Relationships between Variables
  3. Normative Study of Variables

Before you submit data to analysis, it will often be useful to perform some preliminary operations. These may include:

When choosing the method of analysis, ask yourself whether you want to analyse individual variables or the relations between several variables. Or do you just want to use the measured variables for classifying or assorting individuals or cases?

Another important decision concerns the final purpose of your project. Do you want to describe the present (or past) state of your object, or do you wish to find out how the object should be, i.e. which degree of the measured attributes would be optimal? This latter type of analysis is discussed under the title Normative Study of Variables.

Analysing Individual Variables

In the following, there is a list of some usual methods of statistical analysis of one single variable. The methods have been arranged according to the scale of measurement of the variable.
 
Method                                Nominal  Ordinal  Interval  Ratio
Methods of presenting data:
  Tabulation; graphical presentation  yes      yes      yes       yes
Averages:
  The mode                            yes      yes      yes       yes
  The median                          -        yes      yes       yes
  The arithmetic mean                 -        -        yes       yes
Measures of dispersion:
  The quartile deviation              -        yes      yes       yes
  The range                           -        yes      yes       yes
  Standard deviation                  -        -        yes       yes

Graphical Presentation of One Variable

A simple way of presenting a distribution of values is to present each value as a dot on a scale. If there is a great number of values, it may be better first to classify them and then present the frequency of each class as a histogram (Fig. on the right).
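The classification step can be sketched in a few lines of Python. The measurements and the class width below are hypothetical, chosen only for illustration; the helper name `classify` is likewise an assumption.

```python
from collections import Counter

# Hypothetical measurements (e.g. task times in minutes); not data from the text.
values = [5, 6, 7, 7, 7, 8, 9, 11, 12, 14, 15, 18, 21, 26, 34]
class_width = 5  # assumed width of each class


def classify(values, width):
    """Count how many values fall into each class [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


frequencies = classify(values, class_width)

# Print a crude text histogram, one row per class.
for lower, count in frequencies.items():
    print(f"{lower:>3}-{lower + class_width - 1:<3} {'#' * count}")
```

The choice of class width matters: classes that are too wide hide the shape of the distribution, while classes that are too narrow leave most of them nearly empty.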

If your studies involve people, your measurements quite often turn out to be distributed according to a certain curve, the so-called Gauss curve (on the left), also known as the normal distribution. One of its properties is that about 68% of all measurements differ from the mean (in the figure: M) by no more than one standard deviation, and about 95% by no more than two standard deviations.
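The two percentages can be checked numerically. The sketch below uses the error function from Python's standard library; the function name `share_within` is an assumption made for this example.

```python
import math


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


print(f"within 1 standard deviation:  {share_within(1):.1%}")
print(f"within 2 standard deviations: {share_within(2):.1%}")
```

The results are roughly 68.3% and 95.4%, which the text rounds to 68% and 95%.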

Sometimes you will wish to emphasize not the absolute distribution but the proportional or percentage distribution. A suitable diagram for this is the pie chart (on the right):

Averages

An average is a statistic which characterizes the typical value of your data and eliminates the random scattering of values. For each of the various measurement scales there is an appropriate type of average.

The mode is the most common value in your data set.

The median is the value in the middle of the data set when all the values are arranged from the smallest to the largest.

The (arithmetic) mean is the sum of all the values divided by their number, or

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

From among the averages that were presented above, the researcher can usually choose the one that best shows the typical value of the variable. Arithmetic mean is the most popular one, but it can give the wrong picture e.g. in data which include one value which greatly differs from the others (see the picture below).

The same happens if the distribution is skewed, as in the picture on the right. The example lists the minutes that the different subjects spent carrying out a certain task. The fastest ones needed 5 minutes, but the most common performance (= the mode) was seven minutes. The value in the middle, i.e. the median, is shown as a red letter M in the picture. The median has the value 11 here.
What about the mean? As the performance of the slowest subject took as long as 34 minutes, the mean went up to 11.98 minutes, which does not give a very accurate picture of the average performance in this case. This shows that if the data is skewed, the type of average must be chosen with care. A graphic presentation would often be more illustrative than calculating a single statistic.
The distribution shown in the picture is positively skewed, because the measurements that are larger than the median (11) spread out over a large range (from 11 to 34) while the measurements below the median concentrate into just a few values (5...11).
A statistic can also be found to describe the amount of skew, if necessary.
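The effect of skew on the three averages can be tried out with Python's standard `statistics` module. The task times below are hypothetical and are not the data of the picture; they only mimic a positively skewed distribution with one very slow subject.

```python
import statistics

# Hypothetical task times in minutes, positively skewed by one slow subject.
times = [5, 6, 7, 7, 7, 8, 9, 11, 12, 34]

mode = statistics.mode(times)      # most common value
median = statistics.median(times)  # middle value of the sorted data
mean = statistics.mean(times)      # sum divided by the number of values

print(f"mode = {mode}, median = {median}, mean = {mean}")
```

In this made-up data the single value 34 pulls the mean well above both the median and the mode, which is the pattern described above.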

When selecting the most suitable average, you should consider the scale which was used in the collection of the data. If the scale was nominal, the only possible average is the mode. If the scale was ordinal, you may use either the median or the mode. Note, however, that the categories of scales are not always quite distinct; for example the common worded scale

beautiful / - / - / - / - / - / ugly

should actually be regarded as ordinal because the intervals between the markings are not truly equal (people prefer to put their ticks near the middle because the intervals near the ends are sensed as greater). However, many researchers prefer using the mean, not the median, as a summary for this type of questionnaire, which means that they treat this scale as an interval scale.

Finally, if the average was calculated from a sample, you should test its statistical significance, that is, how well it can be expected to represent the average of the population from which the sample was drawn. A suitable test for this is the t-test.

Dispersion of Data

Once you have calculated the average value, it would sometimes be interesting to describe how far the singular values are scattered around the average. To this end, you may choose between a variety of statistics. The selection depends on the type of average that you have used:

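The standard deviation of a whole population is

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}$ is the arithmetic mean. However, if the standard deviation concerns just a random sample, the formula is

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$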

In both formulas, n is the number of the values, and the values of the variable are substituted for x one at a time. Hardly any researcher bothers to perform the calculation by hand because the necessary algorithm now exists even in pocket calculators.
The square of the standard deviation is called variance, and it, too, is often used to describe and to analyse the dispersion.
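As a minimal sketch, the two versions of the standard deviation differ only in the divisor: n for a whole population, n - 1 for a sample. The values below are hypothetical; the variable names are assumptions made for this example.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

n = len(values)
mean = sum(values) / n
sum_of_squares = sum((x - mean) ** 2 for x in values)

sd_population = math.sqrt(sum_of_squares / n)    # divisor n
sd_sample = math.sqrt(sum_of_squares / (n - 1))  # divisor n - 1
variance_population = sd_population ** 2         # variance = square of the deviation

print(sd_population, sd_sample, variance_population)
```

The standard library offers the same two statistics ready-made as `statistics.pstdev` (population) and `statistics.stdev` (sample).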

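If the statistic of dispersion has been calculated from a sample, its statistical significance should also be assessed in the end. For a variance, the usual tools are the chi-squared test (one sample) and the F-test (comparing two samples); the t-test concerns means.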

Analysing Relationships between Variables

If two variables vary in such a way that they follow each other to some extent, we say that there is an association or covariation between the variables. For example, the height and weight of people are statistically associated: although nobody's weight is caused by his height nor the height by the weight, it is, however, usual for a tall person to weigh more than a short person. On the other hand, the data usually include exceptions as well, which means that a statistical association is inherently stochastic.

The science of statistics offers numerous methods for revealing and presenting the associations between two and even more variables. The simplest means are the methods of graphic presentation and tabulation. The association between the variables can also be measured with the help of special statistics, such as correlation.

If, when analysing the data, some association between the variables is discovered, this does not mean that either variable necessarily causally depends on the other. A strong covariation between, say, A and B, can be due to four alternative reasons:

The researcher must deliberately choose one of these alternatives. There are no means of statistical analysis for the job of finding out the causal explanation for a statistical association. In many cases, the original theory of the researcher can provide an explanation; if not, the researcher must use his common sense to clarify the cause.

In the following, we mention some usual methods of statistical analysis which can be used when studying the interdependence between two or more variables. The methods have been arranged according to which measurement scale the variables mostly correspond to.
 
Method                                Nominal  Ordinal  Interval  Ratio
Methods of presenting data:
  Tabulation; graphics                yes      yes      yes       yes
Measures of association:
  Coefficient of contingence;
  chi squared                         yes      yes      yes       yes
  Ordinal correlation                 -        yes      yes       yes
  Pearson's r correlation; ANOVA      -        -        yes       yes
  Regression analysis;
  factor analysis                     -        -        yes       yes

Tabulating

Tabulating is a common way of presenting the associations between two or more variables. A table has the advantage that extensive data can be fitted into it and the precise figures are preserved. A disadvantage is that a large table is not illustrative: it seldom reveals more than the most obvious regularities or interdependencies of the data.

Some abbreviations conventionally used in tables are presented on the page Classification.

Graphical Presentation

Products, as objects of study, are often presented as pictures, which are one kind of graphical presentation. (Examples of pictorial presentations.)
If the researcher wishes to highlight some common traits or general patterns he has found in a group of objects, he can combine several objects in one graphic, as in the figure on the left. In the diagram, Sture Balgård shows how the old buildings in Härnösand follow uniform proportions of width and height (the red line) with just a few exceptions. In inventing illustrative methods of presenting the findings of the study of products, the most serious restriction is the imagination of the researcher.

Often, however, the appearance of the object itself is not important and only the numerical values of its measurements are of interest. In that case, the first question to consider when selecting the type of graphics is which structure in the data you wish to show. Of course, you must not "lie with the help of statistics", but it is always admissible to select a style of presentation which highlights the important patterns by eliminating or diminishing the uninteresting relations and structures.

If your data consists of only a few measurements, it is possible to show all of them as a scattergram. You may exhibit the values of two variables on the two axes x and y, and additionally a couple of variables by utilizing the colours or shapes of the dots. In the diagram to the right, the variable z has two values, which are indicated by a square and a plus sign.

If the variation is too small to appear clearly, you may emphasize it by cutting off parts of one or both of the scales, see examples. You simply cut off the uninteresting part, either from the top or the bottom. The discarded part must be empty of empirically measured values. To make sure that the reader notices the operation, it is better to show it not only in the scales but also in the background grid of the diagram.

On the other hand, if the variation range of your data is very large, you may consider using a logarithmic scale on one or both of the axes, see diagram on the left. Logarithmic scaling is appropriate on a ratio scale only.

If you have hundreds of measurements, you will probably not want to show each of them as a scattergram. One possibility in this case is to classify the cases and present them as a histogram.
The histogram may be adapted to present up to four or five variables. You can do this by varying the widths of the columns, their colours, background patterns and by a three dimensional presentation (fig. on the left). All these variations are easily created by a spreadsheet program like Excel, but they should not be used for decoration only.
The patterns filling or making up the histogram columns may be chosen so that they symbolize one of the variables. For example, the columns describing the number of cars may be formed by piling up pictures of cars one above or after the other. This is all right, provided that you do not vary the size of the symbols used in a histogram. Otherwise, the interpretation would be difficult for the reader (does the number of cars relate to the length, to the area or to the volume of the car symbols?)

The researcher is often interested in the relations of two or more variables rather than in the individual pairs of measurements as such. The normal way of presenting two or more interdependent variables is the curve. It implies a continuous variable (i.e. one where the number of possible values is infinite). (Examples.)

You should not fabricate a curve from measurements which are not values of the same variable. For instance, the attributes of an object are different variables. Examples are the personal evaluations that researchers often gather with the help of semantic differential scales of the type below:
Estimate the characteristics of your bedroom.
Tick one box on every line.
Light _ _ _ _ _ _ _ Dark
Noisy _ _ _ _ _ _ _ Quiet
Clean _ _ _ _ _ _ _ Dirty
Big _ _ _ _ _ _ _ Small

It would now be pointless to present the various evaluations of the bedroom as a single "profile" as in the diagram on the left (although you often find this type of illogical presentation in research reports).
If you absolutely want to stress that the variables belong together (e.g. because all of them are evaluations of the same object), an appropriate method would be e.g. a horizontal histogram (on the right).

All of the above diagrams can be combined with maps and other topological presentations. For example, the variation in the different areas of the country is often shown as a cartogram by distinguishing the different districts with different colours or shades. Another technique is the cartopictogram in which small pie or column diagrams have been placed on the map. If you need to show associations between areas this can be done with arrows whose thickness indicates the intensity of the connection. (Example.)

Contingence and Correlation

How closely two variables are related to each other can be studied by means of tabulation or graphical presentation, and it can be measured with special statistics, too. The statistics available for analysing the links between two variables depend on the type of scale on which the variables have been measured (see the table that was presented earlier).

Formulas for calculating the statistics of contingence are not shown here because performing the calculations manually would be awkward and researchers usually do them with computers.

The product moment correlation, or Pearson's correlation, which is usually abbreviated with the letter r, measures how closely the association between two variables resembles the linear equation y = ax + b. If the correlation coefficient is high, in other words if its value approaches either +1 or -1, it means that the relation between the two variables approaches this equation. If the correlation is low, e.g. something between -0.3 and +0.3, the two variables have not much to do with each other (more exactly, they have almost no linear covariation). The sign of the correlation coefficient tells the direction of the association, not its strength; the sign is always identical with the sign of the coefficient a in the above equation.

Below you can see three scattergrams which show three different sets of data from two variables, each set consisting of eight pairs of values. The correlations between the two variables have been calculated and are shown under each scattergram. It can be seen that there is no correlation between the variables in the set on the left, and the other two sets show correlations of 0.5 and 1.0.
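As a sketch of what the computer does, Pearson's r can be calculated directly from its definition: the sum of the products of the deviations, divided by the square roots of the two sums of squared deviations. The two data sets below are hypothetical sets of eight pairs, and the function name `pearson_r` is an assumption for this example.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y_linear = [2 * v + 1 for v in x]     # lies exactly on a line y = ax + b
y_scattered = [3, 1, 4, 2, 6, 5, 8, 7]  # follows x only loosely

r_linear = pearson_r(x, y_linear)
r_scattered = pearson_r(x, y_scattered)
print(round(r_linear, 3), round(r_scattered, 3))
```

The first set gives r = 1 because every pair fits the line exactly; the second gives about 0.83, a strong but imperfect linear association.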

Although correlation analysis handles only two variables at a time, it is an excellent tool for the initial analysis of a large number of variables, when you have no clear idea of their mutual relations. A computer can quickly calculate the correlations between all possible pairs of variables, finally constructing a correlation matrix from the results. You can then select those pairs that have the strongest correlation and continue by examining these pairs with other, more refined tools for analysis.
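A correlation matrix of this kind can be sketched as follows. The three variables `a`, `b` and `c` and their values are hypothetical, and `pearson_r` is a helper written for this example rather than a library function.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Product moment correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Hypothetical data: three variables measured on the same eight cases.
data = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 5, 4, 5, 7, 8, 9],
    "c": [5, 3, 6, 2, 7, 1, 8, 4],
}

# Correlation of every variable with every other variable.
matrix = {p: {q: pearson_r(data[p], data[q]) for q in data} for p in data}

# Rank the pairs by the absolute value of r, strongest first.
pairs = sorted(
    combinations(data, 2),
    key=lambda pair: abs(matrix[pair[0]][pair[1]]),
    reverse=True,
)
strongest = pairs[0]

for p, q in pairs:
    print(f"{p}-{q}: r = {matrix[p][q]:+.2f}")
```

Ranking by the absolute value matters because a strongly negative correlation is as interesting as a strongly positive one.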

A weakness of the correlation analysis is that it cannot detect other than linear relations between the variables. E.g., a relation that obeys the equation $y = ax^2 + bx + c$ would pass unnoticed. However, some of the newer analysis programs are able to detect even this and some other usual relationships of variables. Besides, you can try to:

Once you have found a pair of variables with a strong correlation (or contingence) you can continue, for example, with the following operations:

Finally, remember that when a correlation has been calculated from a sample, you should examine its statistical significance with the t-test.
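The weakness mentioned above, that r reacts only to linear relations, is easy to demonstrate. In the hypothetical data below y depends completely on x through y = x^2, yet the correlation is zero. Transforming the variable before correlating, as done in the last step, is one possible remedy and is shown here as an assumption of this sketch, not as a method prescribed by the text.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # y is fully determined by x, but not linearly

r_raw = pearson_r(x, y)                               # linear correlation of x and y
r_transformed = pearson_r([v ** 2 for v in x], y)     # correlation of x squared and y

print(r_raw, r_transformed)
```

A scattergram of the same data would reveal the curved relation at a glance, which is one reason to look at the data graphically before trusting a low correlation.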

Analysis of Variance

ANOVA (ANalysis Of VAriance) examines two or more sets of measurements, especially their variances, and tries to detect statistically significant differences between the sets. These sets might be, for example, measured reactions from two experimental groups, where the researcher wants to find out whether there is a difference in the reactions, perhaps caused by the different stimuli given to the groups.

The ANOVA method rests on the observation that a real difference between the groups shows up as a between-groups variance that is clearly greater than the within-group variance.
The analysis is initiated by calculating the within-group variance for each group, and the mean of all these group variances.
The next step is to calculate the mean for each group, and then the variance of these means. Multiplied by the number of measurements in each group (assuming groups of equal size), this gives the between-groups variance.
Then you calculate the ratio of the above two figures, which is called F. In other words,
F = (group size × variance of the group means) / (mean of the group variances).
Finally you refer to the table (in statistics handbooks) which shows how high values F may reach when only chance operates. If the F obtained from the ANOVA is higher than the table value, the difference between the groups is significant at the level which the table reports.
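The steps above can be sketched for three hypothetical groups of equal size. One assumption of this sketch should be stated plainly: the between-groups estimate is taken as the group size times the variance of the group means, which is the usual convention in one-way ANOVA with equal group sizes. The data and variable names are invented for illustration.

```python
import statistics

# Hypothetical reactions measured in three experimental groups of equal size.
groups = [
    [4, 5, 6, 5, 5],
    [6, 7, 8, 7, 7],
    [5, 6, 7, 6, 6],
]
group_size = len(groups[0])

# Step 1: within-group variance of each group, and the mean of these variances.
within = statistics.mean([statistics.variance(g) for g in groups])

# Step 2: mean of each group, and the variance of these means,
# scaled by the group size to make it comparable with the within-group variance.
group_means = [statistics.mean(g) for g in groups]
between = group_size * statistics.variance(group_means)

# Step 3: the ratio F.
F = between / within

# Degrees of freedom needed for looking up the critical value in an F table.
df_between = len(groups) - 1
df_within = len(groups) * (group_size - 1)

print(f"F = {F:.2f} with {df_between} and {df_within} degrees of freedom")
```

The program stops where the handbook table takes over: the value of F is compared with the tabulated critical value for the two degrees of freedom printed on the last line.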

Regression Analysis

The researcher often has theoretical or practical reasons to believe that a certain variable is causally dependent on one or more other variables. If there are enough empirical data on these variables, regression analysis is a suitable method for revealing the exact pattern of this association.

The algorithm of regression analysis constructs an equation of the following pattern, and gives the parameters $a_1$, $a_2$ etc. and $b$ such values that the equation corresponds to the empirical values as closely as possible.

$$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + b$$

In the equation,
$y$ = the dependent variable
$x_1$, $x_2$ etc. = independent variables
$a_1$, $a_2$ etc. = parameters (regression coefficients)
$b$ = constant.

If you have extensive data with many variables, at the beginning of the analysis you are perhaps not sure which variables are mutually connected and which should thus be included in the equation. You might first study this with correlation analysis, or you can let the regression analysis program select the "right" variables ($x_1$, $x_2$ etc.) for the equation. "Right" are those variables which improve the closeness of fit between the equation and the empirical values.

When one of the independent variables is time, and especially when we have a series of measurements at equal intervals, this series goes by the name time series. Regression analysis is a suitable tool for revealing a trend or a long-term development in a time series. For other structures which can be found in time series, see Historical Study.
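For the simplest case, one independent variable (time) and a linear trend y = ax + b, the least-squares values of a and b can be computed directly. The series below is hypothetical, and `fit_line` is a helper written for this sketch; with several independent variables one would normally rely on a statistics program instead.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in the equation y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    b = mean_y - a * mean_x
    return a, b


# Hypothetical time series: one measurement per year at equal intervals.
years = [0, 1, 2, 3, 4, 5]
measurements = [10, 12, 13, 15, 18, 19]

a, b = fit_line(years, measurements)
trend = [a * t + b for t in years]  # fitted values along the trend line

print(f"trend: y = {a:.2f} * t + {b:.2f}")
```

The coefficient a is the average change per time step, i.e. the trend; the differences between `measurements` and `trend` are what remains for the other structures of a time series.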

Factor Analysis

The researcher sometimes has a great amount of data on numerous different variables which correlate with each other. With the help of factor analysis, such data can often be compressed and the variations presented through just a few variables.

As an example, let us consider the data from a questionnaire (shown elsewhere) where a number of test subjects were asked to indicate how closely their personal bedrooms would correspond to adjectives provided by the researcher (given with "semantic differential" scales). The researcher now wants to find out if behind the estimates of the subjects, there are some "background variables" whose direct measuring through linguistic means would not be possible because of the lack of appropriate adjectives in language. The researcher's hypothesis is that these background variables "appear" through the adjectives used in the semantic scales, usually not through any single adjective but instead through a group of correlated adjectives.
With the help of factor analysis, the combined variables or factors hidden behind the measured attributes can be detected and specified, and the analysis also tells how closely these factors are linked with the original variables. Sometimes an extra condition is also placed on the factors, namely that they must not correlate with each other at all; they can then be said to be "perpendicular" to each other (= "orthogonal rotation" of the factors during analysis).

A drawback of the method of factor analysis is that it is too easy to use it for practically absurd but formally correct studies because the analysis always presents its results in an elegant, mathematically exact form, even when the factors obtained have no sensible empirical contents.

Normative Study of Variables

There is no great difference between informative and normative methods of analysis. In normative study at least one of the variables is evaluative like "usefulness" or "satisfaction" etc., and the final aim of the study is to improve the attributes of the object of study. Because all evaluation is subjective it is important to consider and define exactly whose point of view is used in the evaluation; this aspect is discussed under the titles Human Subjectivity and Objectivity and Normative Research. The choice depends also on the degree of participation which prevails in the organisation that is going to use the normative suggestions that you are preparing.
Once the assessing persons have been selected, the evaluations can be gathered with a survey.

Property: Ease of use                                                      Utility value
All operations are automatic.                                              5
Many operations are automatic. There is a detailed instruction booklet.    4
No automatic operations. The instructions book is mediocre.                3
The machine reacts contrary to the booklet.                                2
Operation is confusing. No instructions book.                              1

If you want to improve the state of only one, or a few, attributes, the analysis is simple: you have to specify the acceptability of various potential levels of this attribute. Once you have obtained the evaluations from appropriate persons (Who? See above), you can present them as a table like the one above, or as a curve. Other examples can be found under the title Degrees of Satisfaction.

A common difficulty when trying to improve the object of study is that its attributes are interdependent. While one of the attributes of the object of study is being improved, other attributes like usability, beauty, meaning, ecology or economy often deteriorate. In such a case you will want to uncover the exact relationship between these attributes, for example with regression analysis.

The figure on the left gives an example of finding an optimum for the sum of two variables, both of which are dependent on a third variable. The aim is to find out how thick an insulation you should select for a new building. The costs which vary along with the thickness of insulation are its price (which is proportional to its thickness) and the subsequent heating costs of the building (which will be diminished if the insulation is augmented). Because investment and annual payments cannot be added together as such, we first have to transform the investment into annual cost. Such an annual cost which corresponds to the investment is presented as curve B in the diagram, while curve A shows the yearly heating costs. The optimum for an insulation will be found at the point where the sum of these two costs is at its minimum.
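The search for the minimum of the summed costs can be sketched numerically. Everything quantitative here is an assumption made for illustration: the annual cost of the investment is taken as 4 units per centimetre of insulation, and the heating cost is assumed to fall in inverse proportion to the thickness. The text itself states only that the price is proportional to the thickness and that heating costs diminish as insulation is added.

```python
ANNUAL_COST_PER_CM = 4.0   # hypothetical annual cost of the investment per cm
HEATING_CONSTANT = 900.0   # hypothetical; heating cost assumed = constant / thickness


def investment_cost(thickness_cm):
    """Curve B: annual cost corresponding to the investment."""
    return ANNUAL_COST_PER_CM * thickness_cm


def heating_cost(thickness_cm):
    """Curve A: yearly heating cost, falling as insulation gets thicker."""
    return HEATING_CONSTANT / thickness_cm


def total_cost(thickness_cm):
    return investment_cost(thickness_cm) + heating_cost(thickness_cm)


# Try every thickness from 5 to 40 cm and keep the cheapest one.
candidates = range(5, 41)
best_thickness = min(candidates, key=total_cost)

print(best_thickness, total_cost(best_thickness))
```

With these assumed cost curves the minimum falls where the two curves cross, but that is a property of this particular pair of functions, not a general rule; in general the sum has to be examined as a whole.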

The science of operations analysis includes other comparable analysis methods which can be used to find the common optimum of several quantifiable attributes of a product. Such is for example the algorithm of linear programming.

Sometimes it turns out that two or more variables are causally dependent on each other; such a relation should be made explicit because you will normally prefer manipulating the cause, not the effect.

Product attribute                            Weight
Speed at least 100 mph                       40
Easy to use, automatic                       40
Design: sportive, unlike the competitors     10
Materials are potentially recyclable         10
Total weights                                100

However, it often happens that the relations between the variables remain obscure, perhaps because the practical problem at hand needs swift action and there is no time for meticulous investigation. In such a situation you can sometimes do without the exact relationships and just make a table, like the one above, which gives the relative weights of the desirable or avoidable properties of the object of study.

Often the table of weights grows too large and difficult to grasp and to work with. To restrain its growth, you may want to present a family of related properties in the pattern of a logical tree, which means combining groups of associated properties into one bundle. When searching for such bundles, or families of attributes, you can consider using factor analysis.

Value Engineering

Value engineering is a method of summing up the significant utility values of the available alternatives and finding their optimum. If the cost or price of each alternative is also included in the comparison, the method can be called cost-benefit analysis. It is definitely a quantitative tool, and it necessitates the measurement of all the components to be analysed. The analysis is carried out in distinct and logical steps:
  1. Before you can start the analysis you have to find, for each attribute to be considered, the relation between its value and its acceptability.
  2. Moreover, you will need to define the weights of all the important attributes of the object of study.
  3. Cross tabulate all the available alternatives and all the attributes to be appraised.

  4. Under each alternative are placed two columns: one for the utility values for each attribute, and another column for its final evaluation which you get by multiplying the utility value with the weight of the attribute.
  5. The next step is to add together all the evaluations for one alternative. In the example below, the sums are written on the bottom line. The best alternative will be the one with the highest sum. (This is not necessarily the most profitable alternative yet, because the prices and inputs are not included in this stage.)

    Attribute, or property     Weight   Alternative 1          Alternative 2
    of the product             W        Utility U    W x U     Utility U    W x U
    Capacity                   40       2            80        5            200
    Ease of use                40       3            120       4            160
    Design, appearance         10       5            50        2            20
    Materials, recycling       10       3            30        2            20
    Total                      100      --           280       --           400

  6. If the analysis is made in the above fashion and only the utility values are considered, a final step in the analysis will be, for each alternative, to compare its total utility value (from the bottom line of the above table) to the price (or other input) of the same alternative. This is done simply by comparing the ratios between total utility and input (which often includes both investment and annual cost, cf. Expenditures). The highest ratio indicates the optimal alternative.

    An alternative method is to include the costs in the table as an additional row. Its weight in the final evaluation is then set at 40% or 50%. An example of such a table is given elsewhere.
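The steps of the analysis can be traced in a short Python sketch. The weights and utility values are those of the example table; the prices of the two alternatives are hypothetical, added only to illustrate the final comparison of utility against input.

```python
# Weights of the attributes, as in the example table (they add up to 100).
weights = {
    "Capacity": 40,
    "Ease of use": 40,
    "Design, appearance": 10,
    "Materials, recycling": 10,
}

# Utility value U of each attribute for each alternative, as in the example table.
utilities = {
    "Alternative 1": {"Capacity": 2, "Ease of use": 3,
                      "Design, appearance": 5, "Materials, recycling": 3},
    "Alternative 2": {"Capacity": 5, "Ease of use": 4,
                      "Design, appearance": 2, "Materials, recycling": 2},
}

# Hypothetical prices; the text gives none.
prices = {"Alternative 1": 1000, "Alternative 2": 2000}


def total_utility(utility_values):
    """Sum of W x U over all attributes of one alternative."""
    return sum(weights[attr] * utility_values[attr] for attr in weights)


totals = {name: total_utility(u) for name, u in utilities.items()}
ratios = {name: totals[name] / prices[name] for name in totals}

best_by_utility = max(totals, key=totals.get)
best_by_ratio = max(ratios, key=ratios.get)

print(totals)
print(best_by_utility, best_by_ratio)
```

With the assumed prices the ranking reverses: Alternative 2 has the higher total utility (400 against 280), but Alternative 1 gives more utility per unit of price, which is why the comparison with the input is treated as a separate final step.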


January 2, 2005. Original location: http://www2.uiah.fi/projects/metodi