# Patterns in physical data

**Introduction**

The goal here is to identify patterns in the physical and
environmental parameters that determine habitat type. Substratum
(rock, gravel, sand, mud etc.) is usually the principal factor, as
each different type provides different living conditions and so is
usually inhabited by recognisably different communities. Those
conditions are further modified by environmental gradients such as
depth, exposure (to wave and tidal energy), salinity etc., forming
a multidimensional matrix of physical habitats. Analysing the
physical and environmental data indicates which parts of this
matrix your sample represents and which are the principal
components that characterise sample groups and differentiate
between them.

**Data preparation**

Only true variables, such as the % gravel, sand or mud in
sediments, or actual temperature and salinity measurements, can be
used in numerical analyses. Sometimes it may be desirable to
simplify the analysis by converting variable data into categorical
classes, which are then used as factors. This is a common approach
with sediment data, using the Folk triangle to classify sediment
samples into categories such as 'gravelly-sand', 'sandy-gravel',
'sandy-mud' etc., as seen on many seabed sediment maps.

Many environmental factors or descriptors are recorded as
categorical data, such as wave exposure (exposed, sheltered,
extremely sheltered) or biological zones (eulittoral,
infralittoral, circalittoral etc.). Often these are given shorter
codes, which are easier to handle in spreadsheets and read on
graphical outputs. If numerals are used as codes it is important to
remember that the data is still categorical and cannot be used in
numerical analyses.
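
As a minimal sketch of this point, the Python snippet below treats numeric exposure codes purely as labels. The coding scheme (1, 2, 3) and the sample values are hypothetical, chosen only for illustration; they are not taken from any standard.

```python
from collections import Counter

# Hypothetical coding scheme: the numerals are labels, not measurements.
EXPOSURE_CODES = {1: "exposed", 2: "sheltered", 3: "extremely sheltered"}

# Hypothetical wave-exposure codes recorded for six samples.
sample_codes = [1, 3, 3, 2, 1, 3]

# Decode to labels first, so the codes cannot be averaged by mistake.
sample_labels = [EXPOSURE_CODES[code] for code in sample_codes]

# Counting samples per category is a valid summary of categorical data.
counts = Counter(sample_labels)
for label, n in counts.items():
    print(f"{label}: {n} sample(s)")

# By contrast, the arithmetic mean of the codes (about 2.17 here) does not
# correspond to any exposure class and should not be used in an analysis.
```

Decoding to text labels before analysis is one simple safeguard; the labels can then be used as factors on plots and in group comparisons.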

Where a suite of physical or environmental variables has been
measured, the data will be expressed in a variety of different
units (e.g. temperature in °C, salinity in ‰, current speed in
knots or metres per second). It is futile to attempt numerical
analysis using the absolute data values, as the variable with the
largest range will always appear to be the most influential. To
overcome this, the data need to be expressed as relative,
‘unitless’ values. This is achieved by a simple mathematical
operation called ‘*normalisation*’, which uses the mean and
standard deviation of each variable: the mean is subtracted from
each data value and the result is divided by the standard
deviation.

In the example, the temperature values range from 5 to 14 °C and
salinity from 33.1 to 34.0 ‰. There appears to be very little
similarity between these data sets when the absolute values are
inspected (column A); they even have different ranges, means and
standard deviations. Normalising the data removes their dependence
on the units of measurement. Comparing the normalised data (column
C) shows that temperature and salinity follow identical patterns.

**Methodology**

There is a variety of ways that patterns in physical data can
be determined. Perhaps the most widely known is Principal
Components Analysis (PCA). This ordination method simplifies the
physical dataset by transforming the data to a new coordinate
system such that the greatest variance lies in the first coordinate
(First Principal Component), the second greatest variance then
forms the Second Principal Component and so on. In this method of
indirect gradient analysis, samples are spread out relative to the
PCA axes. The principal components represent linear combinations of
the variables.
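
The PCA steps described above can be sketched in Python with NumPy. This is an illustrative outline, not the procedure of any particular statistics package. The data values and the choice of variables (temperature, salinity, depth) are hypothetical, and the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

# Hypothetical data: rows are samples; columns are temperature (deg C),
# salinity and depth (m).
data = np.array([
    [5.0, 33.1, 12.0],
    [8.0, 33.4, 30.0],
    [11.0, 33.7, 18.0],
    [14.0, 34.0, 45.0],
    [9.0, 33.6, 25.0],
])

# Normalise each variable: subtract its mean, divide by its standard deviation.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# The covariance matrix of normalised variables is their correlation matrix.
corr = np.cov(z, rowvar=False)

# Eigen-decomposition of the symmetric matrix; eigh returns eigenvalues in
# ascending order, so reorder them from largest to smallest variance.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Proportion of the total variance carried by each principal component.
explained = eigenvalues / eigenvalues.sum()

# Sample scores: the position of each sample along the new PCA axes.
scores = z @ eigenvectors

print("Proportion of variance per component:", np.round(explained, 3))
print("Coefficients of PC1 (one per variable):", np.round(eigenvectors[:, 0], 3))
```

Each column of `eigenvectors` holds the coefficients of one linear combination of the variables, and plotting the first two columns of `scores` gives the usual two-dimensional ordination of the samples.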

Alternatively, the same techniques used in analysing biological
data can also be applied to physical data. Thus cluster analysis
and MDS plots offer simple methods for pattern determination. In
contrast to the biological data, however, the Bray-Curtis
similarity coefficient is not appropriate, because in physical data
zero has no special meaning; it is simply one point on a scale (in
biological data zero indicates the absence of a species). In
addition, as the variables will likely be on different scales, they
need to be normalised, which produces both negative and positive
values. Distance coefficients such as Euclidean distance are
therefore the preferred measure of resemblance for
physical/environmental data sets.
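
A minimal Python sketch of normalisation followed by Euclidean distance is given below. The four temperature and salinity values are hypothetical; they are chosen only to span the ranges quoted earlier (5 to 14 °C and 33.1 to 34.0 ‰). The use of the sample standard deviation is also an assumption.

```python
import math
import statistics


def normalise(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical measurements for four samples.
temperature = [5.0, 8.0, 11.0, 14.0]   # deg C
salinity = [33.1, 33.4, 33.7, 34.0]    # parts per thousand

z_temp = normalise(temperature)
z_sal = normalise(salinity)

# Normalised values are unitless and include negatives, which is one
# reason a coefficient such as Bray-Curtis is not suitable here.
print("Normalised temperature:", [round(z, 3) for z in z_temp])
print("Normalised salinity:   ", [round(z, 3) for z in z_sal])

# Each sample becomes a point in normalised (temperature, salinity) space.
samples = list(zip(z_temp, z_sal))

# Pairwise Euclidean distances between samples.
distances = [[math.dist(p, q) for q in samples] for p in samples]

for row in distances:
    print([round(d, 3) for d in row])
```

A distance matrix of this kind is the usual input to cluster analysis or an MDS plot. With these particular values the two normalised series coincide, because each is an evenly spaced sequence over its range.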