Patterns in physical data

Introduction

The goal here is to identify patterns in the physical and environmental parameters that determine habitat type. Substratum (rock, gravel, sand, mud etc.) is usually the principal factor, as each type provides different living conditions and so is usually inhabited by a recognisably different community. Those conditions are further modified by environmental gradients such as depth, exposure (to wave and tidal energy) and salinity, forming a multidimensional matrix of physical habitats. Analysing the physical and environmental data indicates which parts of this matrix your samples represent, and which are the principal components that characterise sample groups and differentiate between them.

Data preparation

Only true numerical variables, such as the % gravel, sand or mud in sediments, or actual temperature and salinity measurements, can be used in numerical analyses. Sometimes it may be desirable to simplify the analysis by converting numerical data into categorical classes, which are then used as factors. This is a common approach with sediment data, using the Folk triangle to classify sediment samples into categories such as 'gravelly sand', 'sandy gravel', 'sandy mud' etc., as seen on many seabed sediment maps.

Many environmental factors or descriptors are recorded as categorical data, such as wave exposure (exposed, sheltered, extremely sheltered) or biological zones (eulittoral, infralittoral, circalittoral etc.). Often these are given shorter codes, which are easier to handle in spreadsheets and to read on graphical outputs. If numerals are used as codes, it is important to remember that the data are still categorical and cannot be used in numerical analyses.

Where a suite of physical or environmental variables has been measured, the data will be expressed in a variety of different units (e.g. temperature in °C, salinity in ‰, current speed in knots or metres per second). It is futile to attempt numerical analysis using the absolute data values, as the variable with the largest numerical range will appear to be the most influential. To overcome this, the data need to be expressed as relative, 'unitless' values. This is achieved by a simple mathematical operation called 'normalisation' (also known as standardisation), which uses the mean and standard deviation of each variable: the mean is subtracted from each data value and the result is divided by the standard deviation. In the example, the temperature values range from 5 to 14 °C, and salinity from 33.1 to 34.0 ‰. There appears to be very little similarity between these data sets when the absolute values are inspected (column A); they even have different ranges, means and standard deviations. Normalising the data removes their dependence on the units of measurement. Comparing the normalised data (column C) shows that temperature and salinity follow identical patterns.
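As a minimal sketch of the normalisation step, the following Python snippet subtracts the mean from each value and divides by the standard deviation. The readings are hypothetical values chosen only to span the ranges quoted in the text, and the use of the sample standard deviation (n - 1 divisor) is an assumption; the source does not say which form was used.

```python
from statistics import mean, stdev

def normalise(values):
    """Return unitless values: (value - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - m) / s for v in values]

# Hypothetical readings (not from the source), evenly spaced over each range
temperature_c = [5.0, 8.0, 11.0, 14.0]     # degrees Celsius
salinity_ppt = [33.1, 33.4, 33.7, 34.0]    # parts per thousand

z_temp = normalise(temperature_c)
z_sal = normalise(salinity_ppt)

# The absolute values look unrelated, but the normalised series coincide
print([round(z, 2) for z in z_temp])
print([round(z, 2) for z in z_sal])
```

Because both hypothetical series rise in equal steps, their normalised values coincide even though the raw numbers differ in range, mean and standard deviation.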

Methodology

There are various ways to determine patterns in physical data. Perhaps the most widely known is Principal Components Analysis (PCA). This ordination method simplifies the physical data set by transforming the data to a new coordinate system such that the greatest variance lies along the first coordinate (the First Principal Component), the second greatest variance, in a direction uncorrelated with the first, lies along the second (the Second Principal Component), and so on. In this method of indirect gradient analysis, samples are spread out relative to the PCA axes. The principal components are linear combinations of the original variables.

Figure: Graphical output from a Principal Components Analysis (PCA) of sediment samples from four areas (A to D) on the Hastings Shingle Bank in the English Channel (Brown et al., 2001). The variables were mean particle size (mm), sorting coefficient, % gravel, % sand and % silt/clay content. The greatest variance (along PC Axis 1) is clearly driven by the sand:gravel content of the samples.
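To make the idea of a principal component concrete, here is a minimal sketch in plain Python. It normalises each variable, builds the covariance matrix, and finds only the first principal component by power iteration; a full PCA would extract every component, normally with a statistics package. The sediment values and function names are hypothetical and are not taken from Brown et al. (2001).

```python
import math
from statistics import mean, stdev

def normalise_columns(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def covariance_matrix(rows):
    """Sample covariance matrix of columns that are already mean-centred."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def first_principal_component(cov, iterations=500):
    """Power iteration: return (unit loading vector, variance along it)."""
    # Uneven starting vector, so it is unlikely to be orthogonal to PC1
    v = [1.0 / (i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, variance

# Hypothetical samples: (% gravel, % sand, % silt/clay)
samples = [
    [70.0, 25.0, 5.0],
    [60.0, 35.0, 5.0],
    [40.0, 50.0, 10.0],
    [30.0, 62.0, 8.0],
    [20.0, 70.0, 10.0],
    [55.0, 40.0, 5.0],
]

z = normalise_columns(samples)
cov = covariance_matrix(z)
loadings, variance = first_principal_component(cov)

# Share of the total variance captured by the first component
explained = variance / sum(cov[i][i] for i in range(len(cov)))

# Position of each sample along PC Axis 1 (a linear combination of variables)
scores = [sum(l * x for l, x in zip(loadings, row)) for row in z]

print("PC1 loadings:", [round(l, 3) for l in loadings])
print("Proportion of variance explained:", round(explained, 3))
```

In this made-up data set the gravel and sand loadings take opposite signs, which is the kind of sand:gravel gradient along the first axis that the figure caption describes.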

The same techniques used in analysing biological data can also be applied to physical data. Thus cluster analysis and MDS plots offer simple methods for pattern determination. In contrast to biological data, however, the Bray-Curtis similarity coefficient is not appropriate, because in physical data zero has no special meaning; it is simply one point on a scale (in biological data zero indicates the absence of a species). In addition, as the variables will likely be on different scales, they will need normalising, which produces both negative and positive values that Bray-Curtis is not designed to handle. Distance coefficients such as Euclidean distance are therefore the preferred measure of (dis)similarity for physical/environmental data sets.

Figure: MDS plot of the same set of samples displayed in the PCA image (Brown et al., 2001). The tighter clustering of samples from regions A and D indicates they were similar and more consistent in their composition than samples from regions B and C.

All material variously copyrighted by MESH project partners 2004-2010
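The Euclidean distance calculation that would feed a cluster analysis or MDS plot can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the figures: the sample labels and measurements are hypothetical, and each variable is normalised with the sample standard deviation before distances are computed.

```python
import math
from statistics import mean, stdev

def normalise_columns(rows):
    """Z-score each column so that no variable dominates through its units."""
    columns = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def euclidean_distance_matrix(rows):
    """Pairwise Euclidean distances between samples (rows)."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical samples: (depth in m, temperature in deg C, salinity in ppt)
labels = ["A1", "A2", "B1", "B2"]
samples = [
    [12.0, 9.5, 33.2],
    [13.0, 9.8, 33.3],
    [45.0, 6.1, 34.0],
    [40.0, 6.5, 33.9],
]

z = normalise_columns(samples)      # contains negative and positive values
distances = euclidean_distance_matrix(z)

for label, row in zip(labels, distances):
    print(label, [round(d, 2) for d in row])
```

Small distances indicate physically similar samples. A matrix of this kind is the usual input to hierarchical clustering or MDS routines in statistical software.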