April 2013, Vol. 240 No. 4
Features
Pipeline Remote Operations: The Human Factor
This article presents a primary justification for an Emergency Operations Center (EOC): the human element. Human factors is a critical component of all safety programs, particularly the PHMSA pipeline safety regulations (49 CFR Parts 190-194). It is a relatively young but maturing science and an increasingly mature engineering discipline in the process industries. It is also an area that is often overlooked and not well understood.
We will explore how an EOC can reduce personnel and asset risk caused by human-centered risk factors that are common to remote, expansive and environmentally sensitive pipeline systems.
As this topic is discussed, it is essential to understand that an accident caused by human error can typically be traced to a failure in the design of the “system” in which humans must operate. [1, 2] A staffing plan, procedure or control system display that ignores human limitations increases the likelihood of human error and, therefore, accidents. Each should be viewed as a known and preventable “design” error. A second layer of humans in the loop, if properly designed and implemented, provides an additional means to prevent or mitigate a potential hazard caused by human error and unforeseen events.
Example factors that increase the likelihood, degree and variations of human error are:
• Activities, circumstances and environments that increase stress.
• Mental task loads that exceed individual limits for accurate and timely assessment, decision making and execution performance.
• Unexpected events that result from gaps in risk assessments, training, response and management plans or facility design.
• Unique circumstances and event patterns that tend to exploit specific patterns of human error.
• Attributes that may be inherent to many pipeline systems, e.g., complex interactions and complex incident scenarios.
Time Vs. Human Performance
Human response capability and performance in the presence of unforeseen events is a critical consideration when selecting personnel and defining organizational and training requirements. Personnel must respond to unforeseen events that are not mitigated by automated safety systems and not covered by standard operating procedures. If left unchecked, an alarm or an unforeseen event can trigger a progression that ends with an accident. The duration of this progression, referred to as the Process Response Time or PRT, begins with the initiating event (e.g., a gas release) and can end in an accident, e.g., a fire or explosion.
To prevent the accident, the time needed to accurately assess, plan and execute a response that halts the progression must be less than the PRT, preferably less than half the PRT. This is the design approach applied to automated safety systems. Correctly designed alarm systems have documented operator actions for each alarm. In addition, alarms are provided only if there is sufficient time to execute a timely human response.
In practice, however, the actual PRT is rarely known with accuracy, and the same is true for unexpected events. Maintaining situational awareness is, therefore, critical. Factors that degrade human performance, such as stress, must be identified and mitigated to the extent possible to achieve this end.
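As a simple illustration of this timing constraint, the sketch below compares a hypothetical assess-plan-execute budget against an assumed PRT. The numbers, the function name and the one-half margin are illustrative assumptions drawn from the rule of thumb above, not values taken from any standard or from the referenced sources.

```python
# Illustrative only: check whether a human assess-plan-execute budget fits
# within the preferred fraction of an assumed Process Response Time (PRT).
# All values below are hypothetical examples, not recommended figures.

def response_fits_prt(assess_min, plan_min, execute_min, prt_min, margin=0.5):
    """Return (fits, total) where fits is True if the total response time
    is no more than the chosen margin (here one-half) of the PRT."""
    total = assess_min + plan_min + execute_min
    return total <= margin * prt_min, total

if __name__ == "__main__":
    prt = 30.0  # assumed minutes from initiating event (e.g., gas release) to accident
    fits, total = response_fits_prt(assess_min=6, plan_min=3, execute_min=5, prt_min=prt)
    print(f"Response budget {total} min vs. limit {0.5 * prt} min -> acceptable: {fits}")
```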
Contributors To Stress
The control room environment has some unique stressors that work to increase human error and degrade human performance. [3] Specific issues include:
• A “right-sized” staffing plan that provides limited work surge capacity.
• High consequence and rapidly deteriorating events that can occur any time, day or night.
• Ongoing changes to equipment and operating, notification and emergency response procedures.
• Diverse communication systems used to convey and coordinate actions both internally and with external organizations.
• Increased trainee numbers due to high employee turnover.
• Changes in work schedule that disrupt the circadian cycle.
• Unexpected events that are not covered by training or existing Incident Command System (ICS) procedures.
• Large-scale external events that create complex upset and emergency scenarios, e.g., an earthquake, hurricane or cyberattack.
Combining these conditions with unexpected hazards and events WILL exploit known human weaknesses.
Common Errors
People are fallible. They will make mistakes regardless of circumstances. The following are common error types that can occur at any time and under any circumstance. [4]
• Slip – This refers to an error made while executing a task that is well-known and practiced. Entering an obviously incorrect setpoint or control command at a control system display is an example.
• Lapse – This is a memory-recall error in which the action taken is incorrect: it may be performed in the wrong sequence or executed at the wrong time.
• Mistake – An event occurs, but the required action is not known because the event was not foreseen or the necessary procedure is not available.
• Violation – The correct action is known but an individual intentionally chooses a different action.
High Mental Task Load Errors
Under normal task loads, every person has their own limits of accuracy and performance within a defined and acceptable error rate. These limits are affected by training, experience, health, time of day and stress, to name a few factors. Failing to assess operator task load under abnormal conditions will ultimately create a “system” in which mental task loads exceed the safe and predictable capacity of an individual. [3] We will call this limit the Red Line. Beyond this hypothetical limit, human error increases dramatically. This occurs when task loads become too high; it can also occur under normal task loads when stress contributors are high. In its simplest form, pushing humans beyond their Red Line increases the likelihood, degree and type of human error.
The analogy is a golf swing. Attempting to push a swing to gain greater distance tends to increase unpredictability: the ball can fly in any direction, fall embarrassingly short or go long and end up deep in the woods. Under “normal” conditions a competent operator can correctly respond to a maximum of 10 actionable alarms within a 10-minute period. [5] Beyond that, various types of errors can result, such as ignoring an alarm or delaying a required response.
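To make the 10-alarms-in-10-minutes guideline concrete, the following sketch flags the moments when a stream of actionable-alarm timestamps crosses that rate. The function and detection logic are hypothetical assumptions for illustration; real alarm-management packages implement their own flood metrics.

```python
# Minimal sketch: detect periods where the actionable-alarm rate exceeds the
# ~10 alarms per 10 minutes guideline discussed above.  Timestamps are in
# minutes; the detection logic is an illustrative assumption, not a standard.
from collections import deque

def alarm_flood_onsets(alarm_times_min, max_alarms=10, window_min=10.0):
    """Return the timestamps at which the trailing-window alarm count first
    exceeds max_alarms (i.e., the operator is likely past their Red Line)."""
    window = deque()
    onsets = []
    in_flood = False
    for t in sorted(alarm_times_min):
        window.append(t)
        while t - window[0] > window_min:
            window.popleft()
        if len(window) > max_alarms and not in_flood:
            onsets.append(t)
            in_flood = True
        elif len(window) <= max_alarms:
            in_flood = False
    return onsets

# Example: 12 alarms arriving every 30 seconds trip the guideline at the 11th alarm.
print(alarm_flood_onsets([i * 0.5 for i in range(12)]))  # -> [5.0]
```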
Pattern-Based Errors
The following topics discuss well-documented human patterns that are known to be significant contributors to major accidents. An EOC with the appropriate organization, staffing and procedures provides additional “eyes on” from a remote, safe location. This provides an added opportunity to detect and break dangerous patterns through early recognition and intervention.
Normalization of Deviance: This term was popularized by the Columbia shuttle disaster investigation. It is manifested when a series of potentially hazardous events continues to occur, yet no harm results. The human manifestation is “the unexpected became the expected, which became the accepted.” [6] In the case of Columbia, the frequent foam strikes during launch did not initially cause serious damage. The full range of failure modes that could result from this event was never fully assessed or understood.
Over time, the high frequency of occurrence, random events and the wide range of statistically possible outcomes converged, and a disaster resulted. Examples that apply to a pipeline:
• Repeated nuisance leak-detection alarms eventually degrade the control room operator’s responsiveness and the urgency to confirm or resolve the alarm cause. Eventually, an actual leak occurs in a waterway or public area.
• An earthquake causes a minor leak in an older pipeline section, an event that has happened before. Repairs are completed and the line is quickly placed back into service. A second quake then causes a major breach, or perhaps multiple breaches; line integrity or structural supports were already deficient and further weakened by the first quake, but this went unnoticed in the haste to complete the initial repairs.
• The cause of a pressure transient in an oil pipeline connected to several offshore platforms is detected but not quickly or correctly resolved. Perhaps one of the platforms has a control or mechanical design issue with its pump system that is not examined in detail. The pressure rise is fast but remains below the maximum allowable operating pressure. Eventually this occurs concurrently with another event that also increases line pressure. Together, they push line pressure above safe limits.
• Shift handovers conducted without note-taking, followed by daily “white knuckling” as the incoming operator assumes the seat.
Ignoring Risk When Nearing Event Completion: There is a tendency to accept additional risk, or demonstrate a pattern of ignoring risk, when a series of relatively high-risk activities is nearing completion. [1, 3] The outcome of initiating a safety response under these conditions may be certain, e.g., it stops production or project progress, and all are aware that this will draw management attention. The risk of doing nothing, however, is unknown or not fully understood. The natural tendency is to take no action, or to delay action, if the alternative behavior is not constantly reinforced by the organizational culture, training and reward systems. Examples of this pattern may include:
• A pipeline tie-in or facility startup operation is nearing completion to meet a gas introduction schedule. As this time nears, unexpected minor delays occur. Decisions are made to abbreviate RTU testing, site housekeeping or commissioning checklist signoffs. Perhaps a last-minute control system change is made without a Management of Change review or verification test.
• Pressure mounts to meet the completion schedule for a facility change, so contractors are pressed to work excessive hours under difficult conditions.
• A series of orders is held up by compressor downtime, and rerouting the pipeline will cause nominations to be missed.
Analysis Bias: When attempting to understand or assess information or an event, humans tend to quickly reach and fixate on an initial theory or understanding. [3, 6] The bias occurs as the assessor receives additional information: information that contradicts or does not support the initial theory is ignored, and other theories may be considered late or not at all. A subsequent action may then be taken that is based on an incorrect assumption or conclusion. A significant delay in the assessment and response process can be disastrous if the initiating event is of the type that can quickly progress to a dangerous outcome. A person performing these activities from the EOC may prove to be more accurate, effective and quicker, since they are not inside a highly stressful event or environment. Examples may include:
• A sophisticated cybersecurity attack replaces real-time information on control room displays with recorded or false information. The same attack disables automated safety systems and provides the external attacker with full access to process control systems and isolation valves.
• A process operation, such as a pigging run, encounters problems, e.g., the pig sticks for unknown reasons.
Distorted Time Sense: Under pressure, individuals often fail to realize that time is passing more quickly than they believe. [1] Any delay in the assessment and response process can significantly increase the safety exposure of personnel if the required response comes too late. As time runs out, the added tasks can increase stress and quickly push a decision maker beyond their Red Line, reducing their capability to quickly and accurately assess and act. As an example, a complex event occurs and the control room operator prioritizes work to focus on an immediate and urgent issue. Lower-priority issues and alarms are put on hold or ignored. As time passes, a deferred alarm or a required but delayed response progresses to become an urgent issue. The EOC can be tasked with monitoring timelines to ensure critical activities and actions are implemented in time, and it can provide additional resources to ensure the less urgent issues are managed.
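One way an EOC could support this timeline monitoring is a simple deferred-action tracker that escalates any deferred alarm or task as its response deadline approaches. The sketch below is a hypothetical illustration; the field names, the example item and the 80% warning threshold are assumptions, not part of any referenced system.

```python
# Illustrative sketch of a deferred-action tracker an EOC might keep to watch
# response deadlines on items the control room has put on hold.  Field names
# and the warning threshold are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class DeferredAction:
    description: str
    deferred_at_min: float   # minutes since the initiating event
    respond_by_min: float    # latest acceptable time to act

def items_to_escalate(actions, now_min, warn_fraction=0.8):
    """Return actions whose elapsed time has consumed warn_fraction or more
    of their response window, so the EOC can escalate or add resources."""
    flagged = []
    for a in actions:
        window = a.respond_by_min - a.deferred_at_min
        if window > 0 and (now_min - a.deferred_at_min) >= warn_fraction * window:
            flagged.append(a)
    return flagged

# Example: a hypothetical deferred leak-detection alarm nearing its deadline.
pending = [DeferredAction("Confirm deferred leak-detection alarm", 5.0, 35.0)]
print([a.description for a in items_to_escalate(pending, now_min=30.0)])
```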
A note on San Bruno: The PG&E San Bruno event involved a number of flawed designs and process hazards, including the decision not to install automated valves, insufficient fire and gas monitoring, inadequate communications infrastructure and poor procedures for pressure-spiking old gas lines. [8] A layered approach to safety would have alerted the pipeline owner and operators to these threats and could have prevented the deadly event.
Other Forms Of Human-Centered Systemic Errors
There are many other forms of systemic failures that are human in nature and contribute to a defective or faulty system. [7] The following are examples. Each is worthy of exploration to understand the mechanism and then take a critical view of how it can affect a given facility.
– Leadership
– Training
– Maintenance
– Organizational structure
– Communication channels
– Ambiguous lines of authority
Other Facility Attributes That Drive EOC Design And Capabilities
New terms were coined to describe inherent attributes of systems and highlight how they contribute to major accidents. This simple approach provides a framework for understanding interactions and hazards that may be inherent to each facility. Facilities and systems fall into the categories of “loosely or tightly coupled” and “linear or complex” systems. [1] Tightly coupled and complex systems present the greatest challenge, and many liquid pipeline systems and tightly integrated gas systems fall into one or both categories. Operators need to fully understand the unique challenges of these attributes in terms of PRT, variability in human response in the presence of complexity, and the types of organization, systems and training needed to effectively manage incidents in these systems.
Complexity is a property in which many interactions between systems are difficult to identify, assess and respond to. Some are unforeseen and, therefore, not addressed in response plans. As an example, a large-scale earthquake causes line breaches in several geographically diverse locations. Each incident presents a different hazard scenario that requires different responses and communication channels, all demanding simultaneous response in real time. The same event damages or degrades communication systems, SCADA systems and isolation valves, so elements of the ICS fail and planned mitigation actions are not possible.
In general, complex interactions can occur between pipeline systems and both upstream and downstream assets, created by a myriad of issues at each step in the production and transport process. Simply put, the operation of midstream pipelines is too complex for any one person to grasp in real time if a more advanced and layered ICS is not provided. Complex events often require multiple resources with different skill sets and training to assess, plan and execute actions that may not have been foreseen. A failure to understand the nature of human error when designing these systems increases the likelihood that plans and responses will be inadequate.
Human systems designed to respond to tightly coupled system events tend toward tight organization and repetitive training on specific events to achieve the required response performance. Complex systems, and some tightly coupled systems, tend to produce a higher number of unforeseen event scenarios that cannot be quickly and accurately assessed and mitigated; they lean toward training that is broader and more intuitively oriented, with elements of organizational decentralization. [1] Combining the tightly coupled approach in the control room with the complex-systems approach in the EOC provides the means to address all contingencies.
Cimation is a member of the Control Systems Integrators Association (CSIA). For more information, see www.controlsys.org.
Author
Tom Shephard, a project, technology and standards consultant at Cimation, has 30 years of control and safety system experience in oil and gas, refining, marketing, pipeline and product distribution systems. He is a certified project management professional (PMI), a certified automation professional (ISA) and holds a BS degree in chemical engineering from the University of Notre Dame.
References
1. Charles Perrow, “Normal Accidents: Living with High-Risk Technologies” (New York: Basic Books Inc. 1999).
2. Sidney Dekker, “The Field Guide to Understanding Human Error” (Surrey, UK: Ashgate Publishing Ltd., reprint 2010).
3. Thomas B. Sheridan, “Humans and Automation: System Design and Research Issues” (Wiley, 2002).
4. Center for Chemical Process Safety, “Human Factors Methods for Improving Performance in the Process Industries” (Wiley, 2007).
5. Peter Bullemer and Doug Metzger, “CCPS Process Safety Metric Review: Considerations from an ASM Perspective” (ASM Consortium Metrics Work Group, May 23, 2008).
6. Brigadier General Duane W. Deal, USAF, “Beyond the Widget: Columbia Accident Lessons Affirmed” (Air & Space Power Journal, Summer 2004).
7. Tom Shephard, “Process Safety: Blind Spots and Red Flags” (Hydrocarbon Processing, March 2011).
8. Richard Clark, Public Utilities Commission deposition (hearing on PG&E San Bruno event, March 2011).