Yesterday I attended a presentation at the Keil Centre delivered by Janette Edmonds on 'Predictive Assessment of Human Reliability'.
The focus of the presentation was on prediction and looking forward to what could happen, rather than the traditional approach of looking back at times when something has gone wrong.
While around 80% of accidents have some human element, this should just be the start of an investigation and not a conclusion. Skilled people in the right circumstances mostly get it right. Generally, humans are required to make systems work.
An early 'solution' to avoiding human error was simply to design out the human element and fully automate something. The problem with this approach is that great care needs to be taken with edge cases. Humans are good when it comes to things not being in the 'normal' configuration and dealing with unexpected situations.
For example, consider an advanced process control (APC) scheme around a part of the plant. An APC solution is unlikely to cope if an instrument fails, and could cause trouble. A human is likely to see the erroneous number and not believe it, investigate, and come up with a workaround until the instrument can be fixed.
If you design out humans (as with APC or in aeroplanes), then when things go wrong you can't rely on a human being in position, with the correct competency, to jump back in and take control.
Reliability research can be generally divided into two different approaches, Safety I and Safety II.
Safety I: understanding human failure and why people perform 'not as expected'.
Safety I is the traditional and original view of human reliability and focuses on when things go wrong. The approaches here tend to break a task down into discrete steps and focus on preventing any individual step from going wrong. This is good at considering skills and rules; however, it often overlooks the larger context, environment and organisational factors.
Safety II: looking at the rest of the time and the rest of the actions.
As stated earlier, most of the time humans get things right. The second generation of tools and methods started to be developed in the 1990s. It looks at why humans get things right and what can be done to promote getting things right, rather than at avoiding getting things wrong.
The defining principles behind Safety II are:
- Human failure is not the cause; it is a symptom of a bigger failure
- Human failure is not random, it is linked to tools, tasks, environment
- Human failure is not the conclusion of an investigation, just the starting point.
Understanding Human Failure
We cannot eliminate human failure. We can minimise it, detect it and mitigate it.
A number of concepts and definitions were discussed that are useful to keep in mind when dealing with human reliability.
Human error vs Violation
The first major concept was the distinction between human error and a violation.
- Error - unintentional
- Violation - intentional non-compliance (like taking shortcuts)
The majority of the presentation covered human errors, when the person involved does not intend to take the wrong action.
There was a brief discussion around violations, when people knowingly take the wrong action. Usually it is because they don't agree with the 'correct' method or procedure.
It was mentioned that you can perform ABC (Antecedent, Behaviour, Consequence) analysis to help avoid violations. It is generally concerned with the analysis of consequences, whether they are:
- Positive or negative to the individual
  - Taking a drug gives a buzz (positive) vs negative health effects
- Immediate or long term consequence
  - Drug gives immediate short term buzz vs long term health risk
- Certainty of consequence
  - Guaranteed to get the buzz, may or may not get illness
Slips, Lapses and Mistakes
Slips, lapses and mistakes are different categories of errors.
Slips are when the wrong action is carried out unintentionally. This could include typing the wrong number into the control panel or turning a handle the wrong way. If you ask the person what they need to do, they know the answer, but in this particular case they got it wrong.
Lapses are when something is forgotten, often caused by something interrupting or distracting the person carrying out the task.
Mistakes are when someone does the wrong thing thinking it is the correct thing to do. This may be because of a lack of training, or because something has changed that they are not aware of.
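To keep the categories straight, the definitions above can be sketched as a small decision function. This is only my own illustration, not something shown in the presentation: the three yes/no questions and all names (`FailureType`, `classify_failure`) are assumptions made for the example, and real cases are rarely this clear cut.

```python
from enum import Enum


class FailureType(Enum):
    VIOLATION = "violation"
    MISTAKE = "mistake"
    LAPSE = "lapse"
    SLIP = "slip"


def classify_failure(intentional: bool, knew_correct_action: bool, forgot_step: bool) -> FailureType:
    """Classify a human failure using the definitions in this post.

    The three questions are a hypothetical simplification for illustration.
    """
    if intentional:
        # The person knowingly departed from the 'correct' method.
        return FailureType.VIOLATION
    if not knew_correct_action:
        # The person believed the wrong action was the right one.
        return FailureType.MISTAKE
    if forgot_step:
        # The person knew what to do, but something was forgotten.
        return FailureType.LAPSE
    # The person knew what to do, but carried out the wrong action.
    return FailureType.SLIP


# Example: an operator who knows the setpoint but types the wrong number.
print(classify_failure(intentional=False, knew_correct_action=True, forgot_step=False).value)
```

The order of the questions matters: intent is checked first because it separates violations from all three kinds of error.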
Performance Shaping Factors
Skilled people get it right most of the time.
Performance Shaping Factors (PSFs) are things that cause someone to get it wrong.
Examples of performance shaping factors:
- Stress (including workload, time, monotony, fatigue)
- Instructions and procedures (including accuracy, clarity, availability)
- Environment (including temperature, noise, lighting)
- Individual (including training, experience, health)
- Task characteristics (including frequency, duration)
Active vs Latent Failures
Active failures occur in operation. They have an immediate effect and often happen where there is no room for error, such as when driving.
Latent failures are removed in time from the operation - they can be caused during design, historical actions or management decisions.
Active failures are relatively easy to spot and act upon to avoid. Latent failures often remain hidden until it is too late and they trigger a significant event.
Approach to Safety Criticality
Safety Critical tasks are tasks which, if not performed correctly, could contribute to a major accident.
Safety Critical Task Analysis (SCTA) is used to determine which tasks are safety critical and what to do about them.
7 step plan
- Identify main hazards
- Identify critical tasks
- Understand the tasks
- Represent critical tasks (analyse in more detail - hierarchical rich data)
- Identify human failures and PSFs
- Determine safety measures to control human failures
- Review effectiveness of process
More details are available in the publication Human Factors Assessment of Safety Critical Tasks (HSE OTO 99/092) by the Health and Safety Executive.
One tool used to help analyse a critical task is SHERPA (Systematic Human Error Reduction and Prediction Approach). It covers:
- Description of a task
- Identify PSFs
- Error mode (type of error)
- Describe error and consequences
- Identify existing risk controls & potential error recovery
- Identify error prevention measures
To determine the error modes, the team make use of a list of trigger words, similar to HAZOP.
The identification of existing and suggested prevention measures is similar to the use in a LOPA or Risk Assessment.
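As a rough illustration of what one row of a SHERPA-style worksheet might hold, here is a minimal sketch that mirrors the headings listed above. The class name, field names, the `needs_attention` rule and the example values are all my own assumptions, not a format given in the presentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SherpaEntry:
    """One row of a SHERPA-style worksheet (field names are illustrative)."""
    task_description: str
    psfs: List[str]
    error_mode: str
    error_description: str
    consequences: str
    existing_controls: List[str] = field(default_factory=list)
    error_recovery: str = ""
    prevention_measures: List[str] = field(default_factory=list)

    def needs_attention(self) -> bool:
        # Hypothetical rule: flag rows with no existing control and no recovery route.
        return not self.existing_controls and not self.error_recovery


# Hypothetical example row, invented for illustration only.
entry = SherpaEntry(
    task_description="Enter new setpoint on the control panel",
    psfs=["time pressure", "poor lighting"],
    error_mode="wrong value entered",
    error_description="Operator types the wrong number",
    consequences="Process moves outside its normal operating range",
    existing_controls=["high alarm"],
    error_recovery="Operator corrects setpoint after alarm",
    prevention_measures=["confirmation prompt showing old and new values"],
)
print(entry.needs_attention())
```

Keeping each row as structured data makes it easy to filter for the steps that have no controls at all, which is where the team's attention is likely to be most useful.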
There was not a lot of detail on the different quantitative tools. These were popular in the 1980s and gave numbers that could then be factored into other calculations.
The problem is that these numbers are often taken as absolutes, when in reality it is very difficult to arrive at an accurate figure for human error. As a result, they are not often used by the Keil Centre. The only time they are useful is to get a relative number to compare two different methods or systems.
Generally, if you want a probability of error for a larger calculation, just use 100%.
The one number that was given was that if a 'truly independent' check is carried out on a task by another individual, the number of errors can be reduced by a factor of 3.
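To show how that factor of 3 might feed into a larger calculation, here is a minimal sketch. It assumes the reduction can be applied as a simple division of the error probability, and that each further truly independent check divides by 3 again. Both are my assumptions; the talk only gave the single-check figure. The function name and parameters are hypothetical.

```python
def residual_error_probability(base_probability: float,
                               independent_checks: int = 0,
                               reduction_factor: float = 3.0) -> float:
    """Error probability left after a number of truly independent checks.

    Assumption: each check divides the probability by `reduction_factor`.
    """
    if not 0.0 <= base_probability <= 1.0:
        raise ValueError("base_probability must be between 0 and 1")
    if independent_checks < 0:
        raise ValueError("independent_checks must not be negative")
    return base_probability / (reduction_factor ** independent_checks)


# Starting from the pessimistic 100% suggested above, with one independent check.
print(residual_error_probability(1.0, independent_checks=1))
```

In line with the caution above, a figure like this is better used to compare two options (with and without a check) than as an absolute number.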
Safety II methods
These methods are still relatively new and not yet fully implemented. They aim to consider the wider context of the task.
Functional Resonance Analysis Method (FRAM)
More details on FRAM can be found on the website functionalresonance.com
FRAM is based on four principles:
- Equivalence of success and failure
  - Things go wrong and right in the same way.
  - Failures = adaptations necessary to cope.
- Approximate adjustments
  - Performance is adjusted to meet conditions.
- Emergence
  - Normal variability is not particularly large.
  - Variability of multiple factors can combine in unexpected ways.
- Functional resonance
  - Cause-effect relations are replaced by resonance.
FRAM looks at six aspects of each function in a task: input, output, precondition, resource, control and time.
For more information, see the FRAM handbook
Systems-Theoretic Accident Model and Processes (STAMP)
The STAMP model treats the whole system, including any accidents, as a control system with control actions and feedback. This allows the analyst to apply existing control engineering methods to analyse and improve the system.
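To make the control-system view a little more concrete, here is a small sketch of my own. It is not the STAMP method itself, just an illustration of the idea: control actions and feedback are recorded as links between parts of the system, and a helper lists control actions that have no feedback coming back. The names (`Link`, `controls_without_feedback`) and the example system are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Link:
    """A directed link between two parts of the system."""
    source: str
    target: str
    label: str


def controls_without_feedback(control_actions: List[Link], feedback: List[Link]) -> List[Link]:
    """Return control actions whose target sends no feedback to the controller."""
    feedback_pairs = {(f.source, f.target) for f in feedback}
    return [c for c in control_actions if (c.target, c.source) not in feedback_pairs]


# Hypothetical example: an operator supervising an APC scheme on a plant.
control_actions = [
    Link("operator", "APC", "enable or disable controller"),
    Link("APC", "plant", "adjust valve position"),
]
feedback = [
    Link("plant", "APC", "flow measurement"),
]

missing = controls_without_feedback(control_actions, feedback)
for link in missing:
    print(f"No feedback for: {link.source} -> {link.target} ({link.label})")
```

In this made-up example the operator issues commands to the APC but gets nothing back, which is the kind of open loop a control-system view of safety is meant to expose.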
- Most of the time humans get it right. This should not be overlooked.
- The key to avoiding human error is to get the environment and Performance Shaping Factors right.
- Automating out the human is not the best solution. Instead, ensure computers and humans play to their strengths:
  - Computers are good at very repetitive analytical tasks.
  - Humans are good at adapting to unexpected situations.
- It might be worth not automating something to keep enough interest in the job and ensure a human is available to take over.
- Many errors are slips or lapses, and simply scaring people by pointing out the consequences of failure will not necessarily bring the desired improvement (though it may help to avoid violations).
Note: This post has been backdated to the date it was actually written, not the day it was noticed in the drafts folder and posted to the web. AM 2017-01-18