Standard occupational classification (SOC 2000) backcasting


The Standard Occupational Classification (SOC) 2000 that replaced SOC 90 was introduced to the Labour Force Survey (LFS) in spring 2001.

When new questions and classifications are introduced to the LFS, it is normal practice not to release these for public use until they have been quality assured over a number of quarters.

However, for SOC 2000, the questions and methods to code the classification were well-established. Only the categories of the classification were new.

Though SOC 2000 still has 9 major groups, there have been considerable changes in the structure and composition of the classification.

Because of this, a meaningful comparison of results based on one classification with those based on the other is not possible. This is a problem if one wants to compare data over time.

To overcome this problem of comparability there were 2 possible solutions.

The first solution was to code the historical microdata to SOC 2000. However, this would have been a very time-consuming and costly operation.

The second solution was to code certain data sources to both the classifications. These dual-coded datasets could be used to estimate the correspondence between the two classifications. These correspondences could then be used to backcast historical data at an aggregated level.

This solution was quicker and easier, but it has its own problems, which are outlined in these notes.

The Office for National Statistics (ONS) made the decision to dual-code the LFS summer 2000 quarter to both SOC 90 and SOC 2000. Further details of this dual-coding exercise can be found in an article in the July 2001 edition of Labour Market Trends.

Apart from this dual-coded quarter, other dual-coded LFS data were available. Analysis of these various dual-coded data showed that the LFS winter 2000/01 quarter provided the best estimates on which to base the backcasting probabilities.


Matrices showing the correspondence between SOC 90 and SOC 2000, which were derived from the LFS winter 2000/01 dual-coded quarter, have been used to backcast the historical time series.

Where individuals in the LFS winter 2000/01 dual-coded quarter had codes assigned on both SOC 90 and SOC 2000, the observed relationship was included in a matrix.

The cell counts in these matrices were then converted to percentages of each SOC 90 minor group total, showing how each SOC 90 minor group was distributed across the SOC 2000 categories.

Each cell in the resulting matrix showed the probability that an observation in a given category of SOC 90 would be classified to a specific category of SOC 2000.
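
The construction described above can be sketched in a few lines of Python. This is a minimal illustration only, not the ONS implementation: the category labels (soc90_A, soc2000_X and so on), the record layout and the counts are all invented, and the function name build_matrix is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical dual-coded records: (SOC 90 minor group, SOC 2000 category).
# Labels and counts are invented for illustration only.
dual_coded = [
    ("soc90_A", "soc2000_X"),
    ("soc90_A", "soc2000_X"),
    ("soc90_A", "soc2000_X"),
    ("soc90_A", "soc2000_Y"),
    ("soc90_B", "soc2000_Y"),
    ("soc90_B", "soc2000_Z"),
]


def build_matrix(records):
    """Return {soc90: {soc2000: proportion}}; each row sums to 1."""
    counts = defaultdict(Counter)
    for soc90, soc2000 in records:
        counts[soc90][soc2000] += 1  # cell count of the cross-tabulation

    matrix = {}
    for soc90, row in counts.items():
        total = sum(row.values())  # all dual-coded cases in this SOC 90 group
        matrix[soc90] = {soc2000: n / total for soc2000, n in row.items()}
    return matrix


matrix = build_matrix(dual_coded)
for soc90, row in sorted(matrix.items()):
    print(soc90, {k: round(v, 2) for k, v in sorted(row.items())})
```

Each row of the result is a probability distribution over SOC 2000 categories for one SOC 90 group, which is the form in which the correspondences are applied to other datasets.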

Separate matrices were calculated for each economic activity group at the lowest level, split by full-time or part-time status and by gender.

Using this method preserved the distinct occupational characteristics of each group.

For example, a smaller percentage of part-time workers than of full-time workers are in managerial occupations.
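
A rough sketch of how separate matrices per subgroup might be held is given below. It assumes, purely for illustration, that each dual-coded record carries a gender and a full-time or part-time marker; the field layout, labels, counts and the function name build_stratified are hypothetical and are not taken from the LFS datasets.

```python
from collections import Counter, defaultdict

# Hypothetical records: (gender, hours, SOC 90 group, SOC 2000 category).
records = [
    ("female", "part-time", "soc90_A", "soc2000_X"),
    ("female", "part-time", "soc90_A", "soc2000_Y"),
    ("female", "part-time", "soc90_A", "soc2000_Y"),
    ("female", "part-time", "soc90_A", "soc2000_Y"),
    ("male", "full-time", "soc90_A", "soc2000_X"),
    ("male", "full-time", "soc90_A", "soc2000_X"),
    ("male", "full-time", "soc90_A", "soc2000_X"),
    ("male", "full-time", "soc90_A", "soc2000_Y"),
]


def build_stratified(records):
    """Return one row-proportion matrix per (gender, hours) subgroup."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for gender, hours, soc90, soc2000 in records:
        counts[(gender, hours)][soc90][soc2000] += 1

    matrices = {}
    for stratum, rows in counts.items():
        matrices[stratum] = {
            soc90: {k: n / sum(row.values()) for k, n in row.items()}
            for soc90, row in rows.items()
        }
    return matrices


matrices = build_stratified(records)
for stratum, matrix in sorted(matrices.items()):
    print(stratum, matrix)
```

In this invented example the same SOC 90 group maps differently in the two subgroups, which is the kind of distinction a single pooled matrix would average away.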

The SOC 2000 probability distributions for each SOC 90 category were then applied to other datasets as a proxy for what respondents would have been coded to under SOC 2000.
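
As a hedged sketch of this step, the example below applies a row-proportion matrix to aggregated historical estimates. The matrix values, the historical totals and the function name backcast are assumptions made for illustration; they are not ONS figures.

```python
from collections import defaultdict

# Hypothetical row-proportion matrix: SOC 90 category -> SOC 2000 shares.
matrix = {
    "soc90_A": {"soc2000_X": 0.75, "soc2000_Y": 0.25},
    "soc90_B": {"soc2000_Y": 0.5, "soc2000_Z": 0.5},
}

# Hypothetical historical estimates (thousands of people) by SOC 90 category.
historical_soc90 = {"soc90_A": 400.0, "soc90_B": 200.0}


def backcast(soc90_totals, matrix):
    """Distribute each SOC 90 total across SOC 2000 using the row shares."""
    soc2000_totals = defaultdict(float)
    for soc90, total in soc90_totals.items():
        for soc2000, share in matrix[soc90].items():
            soc2000_totals[soc2000] += total * share
    return dict(soc2000_totals)


backcast_soc2000 = backcast(historical_soc90, matrix)
print(backcast_soc2000)

# Because every row sums to 1, the overall total is preserved.
print(sum(backcast_soc2000.values()), sum(historical_soc90.values()))
```

The overall total is unchanged by the transformation; only its distribution across categories is re-estimated.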

The estimates provided using the matrices from the LFS winter 2000/01 quarter are considered the best available. However, any methodology that uses a single time period as a proxy for the relationship in other periods will be subject to a number of quality issues that users should take into consideration before using the data.

Transformation matrices for SOC 2000: quality issues

Caution should be exercised when analysing or interpreting the backcast data series. This section presents a number of issues relating to data quality that need to be considered.

Modal differences

The dual-coded LFS winter 2000/01 quarter, which produced the correspondence matrix, cannot replicate the exact method used to code to SOC 2000 in the LFS spring 2001 quarter.

Two different coding systems were used for the 2 quarters, which meant there were minor differences in the on-screen information available to coders.

These differences were mainly linked to information on supervisory and managerial duties, and they may have caused discontinuities. This would be particularly true for areas where the classifications have seen the most change, for example, major groups 1, 4, and 7.

Sampling error

The LFS is a sample survey so the data are subject to sampling error.

Estimates based on smaller subgroups tend to have larger relative sampling errors, although sampling errors also depend on the way the sample and population are distributed.

Therefore, both the data from previous time periods being transformed into SOC 2000 and the probabilities based on the data from the dual-coded dataset were subject to sampling error.
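
To give a feel for how the sampling error of an estimated probability grows as the subgroup shrinks, the sketch below uses the textbook standard error of a proportion. It assumes simple random sampling, which the LFS design does not follow, so the figures indicate only the general pattern; the function name proportion_se and the chosen values are illustrative.

```python
import math


def proportion_se(p, n):
    """Standard error of a proportion p estimated from n cases,
    assuming simple random sampling (an assumption, not the LFS design)."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical cell probability of 0.2, estimated from SOC 90 groups
# with different numbers of dual-coded cases.
p = 0.2
for n in (50, 500, 5000):
    se = proportion_se(p, n)
    print(f"n={n}: standard error={se:.4f}, relative error={se / p:.1%}")
```

Under these assumptions a tenfold reduction in the number of cases raises the standard error by a factor of about 3.2, so probabilities for small SOC 90 groups are estimated much less precisely.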

Coder error

In addition to sampling error in the dual-coded dataset, the observed relationship in the LFS winter 2000/01 quarter will have been affected by coder variance. Occupational information on the LFS is coded to SOC by interviewers, so there will have been a certain amount of variation in the way interviewers assign SOC codes.

This will have affected the distributions in the probability matrix and the historic time series data.


It is also difficult to assess whether any seasonal differences affect the use of a probability matrix based on only one data quarter.

The SOC 2000 major group 5 (skilled trade occupations), which includes such occupations as skilled farm and construction workers, did show a seasonal pattern in data produced from a transformation matrix.

The strength of the seasonal pattern depends as much on the clarity of the relationship between the categories in the 2 classifications as it does on the seasonal changes in numbers for that group.

Therefore, if a specific SOC 90 category corresponds to only one category in SOC 2000, the seasonal pattern would have been replicated in its entirety, even though the relationship was based on data from only one time point (the LFS winter 2000/01 dual-coded quarter).

However, if the SOC 90 group was spread over several SOC 2000 groups, then the seasonal pattern would also be diffused. In that case, basing the relationship on only one time point (the LFS winter 2000/01 quarter) would most likely affect the results.
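
The contrast between the two cases can be illustrated with invented numbers. In the sketch below, one hypothetical SOC 90 category maps entirely to a single SOC 2000 category, while another is split between two. The quarterly figures, the shares and the function name backcast_series are assumptions for illustration only.

```python
# Hypothetical fixed matrix estimated from a single quarter.
matrix = {
    "soc90_A": {"soc2000_X": 1.0},                    # one-to-one
    "soc90_B": {"soc2000_Y": 0.6, "soc2000_Z": 0.4},  # split over two groups
}

quarters = ["spring", "summer", "autumn", "winter"]

# Hypothetical seasonal SOC 90 series (thousands of people).
soc90_series = {
    "soc90_A": [100.0, 120.0, 110.0, 90.0],
    "soc90_B": [200.0, 260.0, 220.0, 180.0],
}


def backcast_series(series, matrix):
    """Apply the same fixed shares to every quarter of each SOC 90 series."""
    n_periods = len(next(iter(series.values())))
    out = {}
    for soc90, values in series.items():
        for soc2000, share in matrix[soc90].items():
            target = out.setdefault(soc2000, [0.0] * n_periods)
            for i, value in enumerate(values):
                target[i] += value * share
    return out


soc2000_series = backcast_series(soc90_series, matrix)
for soc2000, values in sorted(soc2000_series.items()):
    print(soc2000, dict(zip(quarters, (round(v, 1) for v in values))))
```

In this sketch soc2000_X reproduces the seasonal pattern of soc90_A exactly. The seasonal swing of soc90_B is shared between soc2000_Y and soc2000_Z in the same fixed ratio every quarter, whichever of the two the seasonal workers would really belong to.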

Changing occupational structure

Over time, the structure of industry changes, and people's occupations change with it. It is therefore not meaningful to apply a classification containing new occupations to data for a time period in which those occupations did not exist.

This problem will increase the further back in time data are backcast.

In balancing this risk and users' interests in the time series of data, ONS has estimated the occupations under the new classification from LFS spring 1995 quarter to LFS winter 2000/01 quarter.

The LFS winter 1996/97 quarter was also recoded to SOC 2000 (further details of this dual-coding exercise can be found in the article in the July 2001 edition of Labour Market Trends). It shows that the distribution of occupational groups has not changed significantly over the intervening period.

Therefore, the matrix based on LFS winter 2000/01 quarter should reasonably reflect, in most cases, the likely relationship between SOC 90 and SOC 2000 for those earlier periods.

Other issues

The probabilities between SOC 90 and SOC 2000, for the LFS winter 2000/01 quarter, were computed from unweighted data because the aim was to capture the internal correspondences between the 2 classifications.

However, the backcast data could have been affected if any given relationship between the 2 classifications was over- or under-represented in the correspondence tables used.

This could have occurred because data used in unweighted form are not corrected for differences in response rates across the UK. Such differences in response rates in different parts of the UK may have led to more subtle relationships being affected.

For example, an area such as the North East, which is rich in energy-intensive industries, may have a high response rate, while inner London, which has fewer of these industries, may have a lower response rate. This may mean that occupations typical of these energy-intensive industries in the North East would have suppressed more subtle relationships for similar occupations from inner London, originally in the same SOC 90 group.

This would have occurred simply because there was a disproportionately high number of people from the North East in the sample.
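
The effect can be illustrated with a small invented example. The sketch below compares unweighted and weighted shares for a single SOC 90 group, where respondents from one region are over-represented. The regions' counts, the weights and the function name shares are hypothetical and serve only to show the direction of the effect; the ONS matrices themselves were based on unweighted data.

```python
from collections import defaultdict

# Hypothetical dual-coded respondents from one SOC 90 group:
# (region, SOC 2000 category, survey weight).
# The North East is over-represented here, so its weights are lower.
records = (
    [("North East", "soc2000_X", 1.0)] * 6
    + [("North East", "soc2000_Y", 1.0)] * 2
    + [("Inner London", "soc2000_Y", 4.0)] * 2
)


def shares(records, use_weights):
    """Share of the SOC 90 group falling in each SOC 2000 category."""
    totals = defaultdict(float)
    for _region, soc2000, weight in records:
        totals[soc2000] += weight if use_weights else 1.0
    grand_total = sum(totals.values())
    return {soc2000: t / grand_total for soc2000, t in totals.items()}


unweighted = shares(records, use_weights=False)
weighted = shares(records, use_weights=True)
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

With these invented figures the unweighted share of soc2000_X is well above the weighted share, because the over-represented region dominates the unweighted counts.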


When comparing the spring data and the historic datasets, it can be observed from the estimates that there were some discontinuities in distribution.

This difference in distribution is in groups 4 (administrative and secretarial) and 7 (sales and customer services), where the historic data are at a lower level.

Most of these unexplained changes in level between the historic time series and the LFS spring 2001 quarter could be attributable to one or more of the quality issues mentioned above. Alternatively, they could reflect an unusual movement or sampling error in the spring data.

However, the differences are small and the time series is broadly consistent over the time periods.

SOC 2000 backcasting tables

The full range of backcasting tables available includes people in employment, employees, self-employed, full- and part-time workers, temporary workers and second jobs, people who are long-term unemployed and unemployed by previous occupation.

These are all split by gender, at the 1-digit (major group) and 2-digit (sub-major group) levels, and cover the periods from LFS spring 1995 quarter to LFS autumn 2001 quarter. These are the only backcast data available.

Please see the Downloads section for the full range of backcasting tables available.

Further information

For general information on methodology and background to the LFS, please see User Guide Volume 1.

For information on the structure of the SOC 2000 classifications, see Related links.

If you have any specific comments or questions on the SOC 2000 backcasting please contact: labour

