Here is the full transcript: System Hazard Analysis.
In the 45-minute video, The Safety Artisan looks at System Hazard Analysis, or SHA, which is Task 205 in Mil-Std-882E. We explore Task 205’s aim, description, scope and contracting requirements. We also provide value-adding commentary, which explains SHA – how to use it to complement Sub-System Hazard Analysis (SSHA, Task 204) in order to get the maximum benefits for your System Safety Program.
Hello, everyone, and welcome to the Safety Artisan, where you will find professional, pragmatic, and impartial safety training resources and videos. I’m Simon, your host, and I’m recording this on the 13th of April 2020. And given the circumstances when I record this, I hope this finds you all well.
System Hazard Analysis Task 205
Let’s get on to our topic for today, which is System Hazard Analysis. Now, system hazard analysis, as you may know, is Task 205 in the Mil. Standard 882E system safety standard.
Topics for this Session
What we’re going to cover in this session is purpose, task description, reporting, contracting and some commentary – although I’ll be making commentary all the way through. Going back to the top, with this task and with task 204, I’m using the yellow highlighting to indicate differences between 205 and 204 because they are superficially quite similar. And then I’m using underlining to emphasize those things that I really want to bring to your attention. Within task 205, purpose. We’ve got four purposes for this one. Verify system compliance is the first, and recommend necessary actions is the fourth one there. And then in the middle of the sandwich, we’ve got identification of hazards, both between the subsystem interfaces and faults from the subsystem propagating upwards to the overall system, and identifying hazards in the integrated system design. So, quite a different emphasis to 204, which was really thinking about subsystems in isolation. We’ve got five slides of task description, a couple on reporting, one on contracting – nothing new there – and several commentaries.
System Hazard Analysis (T205)
Let’s get straight on with it. The purpose, as we’ve already said, is three-fold: verify system compliance, hazard identification and recommended actions. And then, as we can see in the yellow, identifying previously unidentified hazards is split into two: looking at subsystem interfaces and faults, and the integration of the overall system design. And you can see the yellow bit, that’s different from 204, in that we are taking this much higher-level view, taking an inter-subsystem view and then an integrated view.
Task Description (T205) #1
On to the task description. The contractor has got to do it and document it, as usual, looking at hazards and mitigations, or controls, in the integrated system design, including software and human interface. That’s very important and we’ll come onto that later. All the usual stuff about we’ve got to include COTS, GOTS, GFE and NDI. So, even if stuff is not being developed, if we’re putting together a jigsaw system from existing pieces, we’ve still got to look at the overall thing. And as with 204, we go down to the underlined text at the bottom of the slide, areas to consider. Think about performance, and degradation of performance, functional failures, timing and design errors, defects, inadvertent functioning – that classic functional failure analysis that we’ve seen before. And again, while conducting this analysis, we’ve got to include human beings as an integral component of the system, receiving inputs, and initiating outputs. Human factors were included in this standard from long ago.
Task Description (T205) #2
Slide two. We’ve got to include a review of subsystem interrelationships. The assumption is that we’ve previously done task 204 down at a low level and now we’re building up to task 205. Again, verification of system compliance with requirements (A.), identification of new hazards and emergent hazards, recommendations for actions (B.), but Part C is really the new bit. We are looking at possible independent, dependent, and simultaneous events (C.) including system failures, failures of safety devices, common cause failures, and system interactions that could create a hazard or increase risk. And this is really the new stuff in 205 and we are going to emphasize it in the commentary. You’re going to want to look very carefully at those underlined things because they are key to understanding task 205.
Task Description (T205) #3
Moving on to Slide 3, all new stuff, all in yellow. Degradation of the system or the total system (D.), design changes that affect subsystems (E.). Now, I’ve underlined this because what’s the constant in projects? It’s change. You start off thinking you’re going to do something and maybe the concept changes subtly or not so subtly during the project. Maybe your assumptions change, the schedule changes, the resources available change. You thought you were going to get access to something, but it turns out that you’re not. So, all these things can change and cause problems, quite frankly, as I am sure we know. So, we need to deal with not just the program as we started out, but the program as it turns out to be – as it’s actually implemented. And that’s something I’ve seen often go awry because people hold on to what they started out with, partly because they’re frightened of change and also because of the work of really taking note of changes. And it takes a really disciplined program or project manager to push back on random change and to control it well, and then think through the implications. So, that’s where strength of leadership comes in, but it is difficult to do.
Moving on now. It says effects of human errors (F.), and in the blue, I’ve changed that. Human error implies that the human is at fault, that the human made a mistake. But very often, we design suboptimal systems and we just expect the human operator to cope. Whether it’s fair or unfair or unreasonable, it results in accidents. So, what we need to think about more generally is erroneous human action. So, something has gone wrong but it’s not necessarily the human’s fault. Maybe the system has induced the human to make an error. We need to think very carefully about that.
Moving on, determination (G.), potential contribution of all those components in G.1. As we said before, all the non-developmental stuff. G.2, have design requirements in the specifications been satisfied? This standard emphasizes specifications and meeting requirements; we’ve discussed that in other lessons. And G.3, whether methods of system implementation have introduced any new hazards. Because of course, in the attempt to control hazards, we may introduce technology or plant or substances that themselves can create problems. So, we need to be wary of that.
Task Description (T205) #4
Moving on to slide four. Now, in 205.2.2, the assumption here is that the PM has specified methods to be used by the contractor. That’s not necessarily true; the PM may not be an expert in this stuff. Or they may, for contractual or whatever reasons, have decided they want the contractor to decide what techniques to use. But the assumption here is that the PM has control and if the contractor decides they want to do something different they’ve got to get the PM’s authority to do that. This is assuming, of course, that this has been specified in the contract.
And 205.2.3, whichever contractor is performing the system hazard analysis, the SHA, they are expected to have oversight of software development that’s going to be part of their system. And again, that doesn’t happen unless it’s contracted. So, if you don’t ask for it, you’re not going to get it because it costs money. So, the ultimate client has got to insist on this in the contract and police it, to be fair, because it’s all very well asking for stuff. If you never check what you’re getting or what’s going on, you can’t be sure that it’s really happening. As the American Admiral Rickover once said, “You get the safety you inspect”. So, if you don’t inspect it, don’t expect to get anything in particular, or it’s an unknown. And again, if anything requires mitigation, the expectation in the standard is that it will be reported to the PM, the client PM that is, and that they will have authority. This is an assumption in the way that the standard works. If you’re not going to run your project like that, then you need to think through the implications of using this standard and manage accordingly.
Task Description (T205) #5
And the final slide on task description. We’ve got another reminder that the contractor performing the SHA shall evaluate design changes. Again, if the client doesn’t contract for this it won’t necessarily happen. Or indeed, if the client doesn’t communicate that things have changed to the contractor or the subcontractors don’t communicate with the prime contractor then this won’t happen. So, we need to put in place communication channels and insist that these things happen. Configuration control, and so forth, is a good tool for making sure that this happens.
Reporting (T205) #1
So, if we move on to reporting, we’ve got two slides on this. No surprises, the contractor shall prepare a report that contains the results from the analysis as described. First, part A, we’ve got to have a system description, including the physical and functional characteristics and subsystem interfaces. Again, always important: if we don’t have that system description, we don’t have the context to understand the hazard analysis that has been done, or not been done for whatever reason. And the expectation is that there will be reference to more detailed information as and when it becomes available. So maybe detailed design stuff isn’t going to emerge until later, but it has to be included. Again, this has got to be required.
Reporting (T205) #2
Moving onto parts B and C. Part B as before we need to provide a description of each analysis method used, the assumptions made, and the data used in that analysis. Again, if you don’t do this, if you don’t include this description, it’s very hard for anybody to independently verify that what has been done is correct, complete, and consistent. And without that assurance, then that’s going to undermine the whole purpose of doing the analysis in the first place.
And then part C, we’ve got to provide the analysis results and at the bottom of this subparagraph is the assumption. The analysis results could be captured in the hazard tracking system, say the hazard log, but I would only expect the leading details to be captured in that hazard log. And the detail is going to be in the task 205 hazard analysis report, or whatever you’re calling it. We’ve talked about that before, so I’m not going to get into that here.
And then the final bit of quotation from the standard is contracting. And again, it’s all the same things that you’ve seen before. We need to require the task to be completed. It’s no good just saying apply Mil. Standard 882E because the contractor, if they understand 882E, will tailor it to suit themselves, not the client. Or if they don’t understand 882E they may not do it at all, or just do it badly. Or indeed they may just produce a bunch of reports that have got all the right headings in, as per the data item description, which is usually supplied in the contract, but there may be no useful data under those headings. So, you’ve got to make it clear to the contractor that they need to conduct this analysis and then report on the results. I know it sounds obvious. I know this sounds silly having to say this, but I’ve seen it happen. You’ve got a contractor that does not understand what system safety is.
(Mind you, why have you contracted them in the first place to do this? You should know that you should have done your research, found out.)
But if it’s new to them, you’re going to have to explain it to them in words of one syllable or get somebody else to do it for them. And in my day job, this is very often what consultancies get called in to do. You’ve got a contractor who maybe is expert in building tanks, or planes, or ships, or chemical plants, or whatever it might be, but they’re not expert in doing this kind of stuff. So, you bring in a specialist. And that’s part of my day job.
So, getting back to the subject. Yes, we’ve got to specify this stuff. We’ve got to specify it early, which implies that the client has done quite a lot of work to work this all out. And again, the client may above the line, as we say, say engage a consultant or whoever to help them with this, a specialist. We’ve got to include all of the details that are necessary. And of course, how do you know what’s necessary, unless you’ve worked it out. And you’ve got to supply the contractor, it says concept of operations, but really supplying the contractor with as much relevant data and information as you can, without bogging them down. But that context is important to getting good results and getting a successful program.
I’ve got a little illustration here. The supposition in the standard in Task 205 is we’ve got a number of subsystems and there may be some other building blocks in there as well. And some infrastructure. We’re probably going to have some users, we’re going to have an operating environment, and maybe some external systems that our system, or the system of interest, interfaces with or interacts with in some way. And that interaction might be deliberate, or it might be just that they are in the same operating environment. And they will interact intentionally or otherwise.
Commentary – Go Early
With that picture in mind, let’s think about some important points. And the first one is to get some 205 work done early. Now, the implication in the standard, by the numbering and when you read the text, is that subsystem hazard analysis comes first. You do those hexagonal building blocks first and then you build it up and task 205 comes after the subsystem hazard analysis. You’d think, “Well, you’ve already got the SSHAs for each subsystem and then you build the SHA on top”. However, if you don’t do 205 early, you’re going to lose an opportunity to influence the design and to improve your system requirements. So, it’s worth doing an initial pass of 205 first, top-down, before you do the 204 hexagons and then come back up and redo 205. So, the first pass is done early to gain insight, to influence the design, and to improve your requirements, and to improve, let’s say, the prime contractor’s appreciation and reporting of what they are doing. And really, dare I say, a quick and dirty stab at 205 could be quite cheap, and the payback, the return on investment, should be large if you do it early enough. And of course, act on the results.
And then the second pass is more about verifying compliance, verifying those as-required interfaces, and looking at emergent stuff, stuff that’s emerged – the devil’s in the detail, as the saying goes. We can look at the emerging stuff that’s coming out of that detail and then pull all that together and tidy it up and look for emergent behaviour.
Commentary – Tools & Techniques
Looking at tools and techniques, most safety analysis techniques look at single events or single failures only in isolation. And usually, we expect those events and failures to be independent. So, there’re lots of analyses out there. Basic fault tree analysis, event tree analysis, (well, event tree is slightly different in that we can think about subsequent [control] failures), but there’re lots of basic techniques out there that will really only deal with a single failure at a time. However, 205.2.1C requires us to go further. We’ve got to think about dependent and simultaneous events and common cause failures. And for a large and complex system, each of those can be a significant undertaking. So, if we’re doing task 205 well, we are going to push into these areas and not simply do a copy of task 204, but at a higher level. We’re now really talking about the second pass of 205. The previous, quick and dirty, 205 is done. Task 204 on the subsystems is done. Now we’re pulling it all together.
Let’s think about dependent and simultaneous events. First, dependent failures. Can an initial failure propagate? For example, a fire could lead to an explosion or an explosion could lead to a fire. That’s a classic combination. Or something breaks or wears; it could be as simple as components wearing and then we get debris in the lubrication system. Could the debris from component wear clog up the lubrication system and cause it to fail and then cause a more serious seizure of the overall system? Stuff like that. Or there may be more subtle functional effects. For example, electrical effects, if we get a failure in an electrical system, or even non-failure events that happen together.
Could we get what’s called a sneak circuit? Could we get a reverse flow of current that we’re not expecting? And could that cause unexpected effects? There’s a special technique for looking at that called sneak circuit analysis. That’s sneak, SNEAK, go look it up if you’re interested. Or could there be multiple effects from one failure? Now, I’ve already mentioned fire. It’s worth repeating again. Fire is the absolute classic. First, the effects of fire. You’ve got the fire triangle. So, to get fire, we need an inflammable substance (fuel), we need an ignition source (heat), and we need oxygen. And without all three, we don’t get a fire. But once we do get a fire, all bets are off, and we can get multiple effects. So, you might remember from being tortured doing thermodynamics in class, you might remember the old equation that P1V1/T1 equals P2V2/T2. (And I’ve put R2 there for some reason, so sorry about that.)
What that’s saying is, your initial pressure and volume multiplied together and divided by your initial temperature, P1V1/T1, is going to be the same as your subsequent pressure and volume multiplied together and divided by your subsequent temperature, P2V2/T2, for a fixed amount of gas with temperature measured on an absolute scale. So, what that means is if you dramatically increase the temperature, say, because that’s what a fire does, then your volume and your pressure are going to change. So, in an enclosed space we get a great big increase in pressure, or if we’re in an unenclosed space, we’re going to get an increase in volume in a [gas or] fluid. So, if we start to heat the [gas or] fluid, it’s probably going to expand. And then that could cause a spill and further knock-on effects.
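To put a rough number on the enclosed-space case, here is a minimal Python sketch of that gas law. It assumes an ideal gas in a sealed, rigid compartment, and the pressures and temperatures are illustrative figures of my own, not data from the standard or the slides. The function name is hypothetical.

```python
def pressure_after_heating(p1_kpa, t1_kelvin, t2_kelvin, v1=1.0, v2=1.0):
    """Combined gas law, P1*V1/T1 = P2*V2/T2, solved for P2.

    Assumes an ideal gas and a fixed amount of gas.
    Temperatures must be absolute (kelvin).
    """
    if t1_kelvin <= 0 or t2_kelvin <= 0:
        raise ValueError("temperatures must be positive and in kelvin")
    return p1_kpa * v1 * t2_kelvin / (t1_kelvin * v2)


# Hypothetical case: a sealed compartment at roughly atmospheric pressure
# (101.3 kPa) and 20 degrees C (293.15 K) is heated by a fire to
# 600 degrees C (873.15 K). The volume is fixed, so v1 == v2.
p2 = pressure_after_heating(101.3, 293.15, 873.15)
print(f"Pressure after heating: {p2:.0f} kPa")
```

Under those assumptions the pressure roughly triples, which is the kind of figure you would then compare against what the enclosure and its fire containment can actually withstand.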
Fire, as well as causing pressure and volume changes in fluids, can weaken structures, it makes smoke, and it produces toxic gases. So, it can produce all kinds of secondary hazardous effects that are dangerous in themselves and can mess up your carefully orchestrated engineering and procedural controls. So, for example, if you’ve got a fire that causes a pressure burst, you can destroy structures and your fire containment can fail. You can’t necessarily send people in to fix the problem because the area is now full of smoke and toxic gas. So, fire is a great example of this kind of thing where you think, “Well, if this happens, then this really messes up a lot of controls and causes a lot of secondary effects”. So, there’s a good example, but not the only one.
And then simultaneous events, a hugely different issue. What we’re talking about here is we have got undetected, or latent, failures. Something has failed, but it’s not apparent that it’s failed, we’re not aware, and that could be for all sorts of reasons. It could be a fatigue failure. We’ve got something that’s cracked, or it could be thermal fatigue. So, lots of things that can degrade physical systems, make them brittle. For example, an odd one, radiation causes most metals to expand and neutron bombardment makes them brittle. So, it can weaken things, structure and so forth. Or we might have a safety system that has failed, but because we’ve not called upon it in anger, we don’t notice. And then we have a failure, maybe the primary system fails. We expect the secondary system to kick in, but it doesn’t because there’s been some problem, or some knock-on effect has prevented the secondary system from kicking in. And I suspect we’ve all seen that happen.
My own experience of that was on a site I was working on. We had a big electricity failure: a contractor had sawed through the mains electricity cable or dug through it. And then, for some unknown reason, the emergency generators failed to kick in. So, that meant that a major site where thousands of people worked had to be evacuated because there was no electricity to run the computers. Even the old analogue phones failed after a while. Today, those phones would be digital, probably voice over IP, and without electricity, they’d fail instantly. And eventually, without power for the plumbing, the toilets back up. So, you’re going to end up having to evacuate the entire site because it’s unhygienic. So, some effects can be very widespread, just because you had a latent failure and your backup system didn’t kick in when you expected it to.
So how can we look at that? Well, this is classic reliability modelling territory. We can look at mean time between failures (MTBF) and mean time to repair (MTTR), and therefore we could work out what the exposure time might be. We can work out, “What’s the likelihood of a latent failure occurring?” If we’ve got an interval, presumably we’re going to test the system periodically. We’ve got to do a proof test. How often do we have to do the proof test to get a certain level of reliability or availability when we need the system to work? And we can look at synchronous and asynchronous events.
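As an illustration of the proof test question, here is a minimal Python sketch. It uses the common approximation that a dormant item’s average unavailability is the failure rate times the proof test interval, divided by two. That approximation assumes a constant failure rate, a perfect proof test, and a small product of rate and interval. The function names and the generator figures are hypothetical.

```python
def average_unavailability(failure_rate_per_hour, proof_test_interval_hours):
    """Average probability that a dormant item is sitting in a failed state.

    Approximation: lambda * T / 2. Assumes a constant failure rate,
    a perfect proof test, and lambda * T much smaller than 1.
    """
    return failure_rate_per_hour * proof_test_interval_hours / 2.0


def max_proof_test_interval(failure_rate_per_hour, target_unavailability):
    """Longest proof test interval (hours) that still meets the target."""
    return 2.0 * target_unavailability / failure_rate_per_hour


# Hypothetical standby generator: one undetected failure per 100,000 hours.
lam = 1e-5
for interval in (730, 4380, 8760):  # roughly monthly, six-monthly, yearly
    print(f"{interval:>5} h -> {average_unavailability(lam, interval):.4f}")

# How often must we test to be unavailable no more than 1% of the time?
print(f"Test at least every {max_proof_test_interval(lam, 0.01):.0f} hours")
```

The point of the sketch is the trade-off: the longer a latent failure can sit undetected, the more likely the backup is dead when you call on it.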
And to do that, we can use several techniques. The classic ones are Reliability Block Diagrams (RBD) and Fault Tree Analysis (FTA). Or if we’ve got repairable systems, we can use Markov chain modelling, which is very powerful. So, we can bring in time-dependent effects of systems failing at certain times and then being required, or systems failing and being repaired, and look at overall availability, so that we can get an estimate of how often the overall system will be available if we look at potential failures in all the redundant constituent parts. Lots of techniques there for doing that, some of them quite advanced. And again, very often this is what we safety consultants find ourselves doing.
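Here is a minimal Python sketch of the simplest versions of those ideas: a two-state (working or failed) repairable item in steady state, combined using series and parallel reliability blocks. All names and figures are hypothetical, and the parallel block assumes the items fail independently, which is exactly the assumption that common cause failure undermines.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Two-state (up/down) repairable item in steady state.

    A = MTBF / (MTBF + MTTR).
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*availabilities):
    """RBD series block: every item must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """RBD parallel block: at least one item must be up.

    Assumes the items fail independently of each other.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Hypothetical figures: mains supply backed up by a standby generator,
# both feeding a single distribution board.
mains = steady_state_availability(mtbf_hours=5000, mttr_hours=8)
generator = steady_state_availability(mtbf_hours=1000, mttr_hours=24)
board = steady_state_availability(mtbf_hours=50000, mttr_hours=4)

system = series(parallel(mains, generator), board)
print(f"System availability: {system:.6f}")
```

A full Markov model would add more states (for example, a backup that has failed but nobody has noticed yet), but the block arithmetic above is where most people start.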
Common Cause Failures
Common cause failure, this is another classic. We might think about something very obvious and physical, maybe we get debris, maybe we’ve got three sets of input channels guarded by filters to stop debris getting into the system, but what if debris blocks all the filters so we get no flow? So, obvious – I say obvious – often missed sources of sometimes quite major accidents. Or let’s say something more subtle, we’ve got three redundant channels, or a number of redundant channels, in an electronic system and we need two out of three to work, or whatever it might be. But we’ve got the same software working each channel. So, if the software fails systematically, as it does, then potentially all three channels will just fail at the same time.
So, there’s a good example of non-independent failures taking down a system that on paper has a very high reliability but actually doesn’t, once you start considering common cause failure or common mode analysis. So, really what we would like is for all redundancy to be diverse if possible. So, for example, if we wanted to know how much fuel we had left in the aeroplane, which is quite important if you want the engines to keep working, then we can employ diverse methods. We can use sensors to measure how much fuel is in the tanks directly and then we can cross-check that against a calculated figure where we’ve entered, let’s say, how much fuel was in the tanks to start with. And then we’ve been measuring the flow of fuel throughout the flight. So, we can calculate or estimate the amount of fuel and then cross-check that against the actual measurements in the tanks. So, there’s a good diverse method. Now, it’s not always possible to engineer a diverse method, particularly in complex systems. Sometimes there’s only really one way of doing something. So, diversity kind of goes out of the window in such an engineered system.
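The fuel cross-check can be sketched in a few lines of Python. This is only an illustration of the idea of comparing two diverse estimates. The function names, the tolerance, and the choice to fall back to the lower figure are all my assumptions, not how any real aircraft system works.

```python
def calculated_fuel_kg(initial_fuel_kg, flow_samples_kg_per_s, sample_period_s):
    """Estimate remaining fuel: starting load minus integrated fuel flow."""
    burned = sum(flow_samples_kg_per_s) * sample_period_s
    return initial_fuel_kg - burned


def fuel_cross_check(sensed_kg, calculated_kg, tolerance_kg):
    """Compare two diverse estimates of fuel remaining.

    Returns (agree, conservative_value). Taking the lower figure as the
    conservative value is an assumption made for this sketch.
    """
    agree = abs(sensed_kg - calculated_kg) <= tolerance_kg
    return agree, min(sensed_kg, calculated_kg)


# Hypothetical flight: 5000 kg loaded, 0.8 kg/s measured once a second
# for an hour, and the tank sensors now read 1900 kg.
estimate = calculated_fuel_kg(5000.0, [0.8] * 3600, 1.0)
agree, usable = fuel_cross_check(sensed_kg=1900.0,
                                 calculated_kg=estimate,
                                 tolerance_kg=100.0)
if not agree:
    print(f"Fuel disagreement: sensed 1900 kg, calculated {estimate:.0f} kg")
print(f"Plan on {usable:.0f} kg")
```

A disagreement does not tell you which method is wrong, only that one of them is, and that is precisely the prompt for somebody to go and investigate.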
But maybe we can bring a human in
So, another classic in the air world, we give pilots instruments in order to tell them what’s going on with the aeroplane, but we also suggest that they look out the window to look at reality and cross-check. Which is great if you’re not flying in cloud or in darkness, when there are maybe no visual references, so you can’t necessarily cross-check. But even with things like system failures, can the pilot look out the window and see which propeller has stopped turning? Or which engine has smoke and flames coming out of it? And that might sound basic and silly, but there have been lots of very major accidents where that hasn’t been done and the pilots have shut down the wrong engine or they’ve managed the wrong emergency. And not just pilots, but operators of nuclear power plants and all kinds of things. So, visual inspection, going and looking at stuff if you have time, or taking some diverse way of checking what’s going on, can be very helpful if you’re getting confusing results from instrument readings or sensor readings.
And those are examples of the terrific power of human diversity. Humans are good at taking different sensory inputs and fusing them together and forming a picture. Now, most of the time they fuse the data well and they get the correct picture, but sometimes they get confused by a system or they get contradictory inputs and they get the wrong mental model of what’s going on and then you can have a really bad accident. So, thinking about how we alert humans, how we use alarms to get humans’ attention, and how we employ human factors to make sure that we give the humans the right input, the right mental picture, mental model, is very important. So, back to human factors again, especially important at this level for task 205.
And of course, there are many specialist common cause failure analysis techniques, so we can use fault trees. Normally in a fault tree when you’ve got an AND gate, we assume that those two sub-events are independent, but we can use ‘beta factors’ (they’re called) to say, “Let’s say event A and event B are not independent, but we think that 50 percent or 10 percent of the time they will happen at the same time”. So, you can put that beta factor in to change the calculation. So, fault trees can cope with non-independent failures, providing you program the logic correctly and you understand what’s going on. And maybe if there’s uncertainty on the beta factors, you must do some sensitivity modelling on the tree with different beta factors. Or you run multiple models of the tree, but again, we’re now talking quantitative techniques with the fault tree, maybe, or semi-quantitative. We’re talking quite advanced techniques, where you would need a specialist who knows what they’re doing in this area to come up with realistic results from that sensitivity analysis. The other thing you need to do is if the sensitivity analysis gives you an answer that you don’t want, you need to do something about that and not just file away the analysis report in a cupboard and pretend it never happened. (Not that that’s ever happened in real life, boys and girls, never, ever, ever. You see my nose getting longer? Sorry, let’s move on before I get sued.)
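To show what a beta factor does to an AND gate, here is a minimal Python sketch for two identical redundant channels. It uses the simple beta-factor model, where a fraction beta of each channel’s failure probability is treated as common to both, with the usual rare-event approximation. The function names and numbers are hypothetical, and a real fault tree tool will handle this in more detail.

```python
def and_gate_independent(q_a, q_b):
    """AND gate with fully independent inputs."""
    return q_a * q_b


def and_gate_beta(q, beta):
    """Two identical redundant channels with a beta-factor model.

    beta is the fraction of each channel's failure probability that is
    common cause. The independent part of each channel is (1 - beta) * q,
    and the common cause part, beta * q, defeats both channels at once.
    Rare-event approximation: the two contributions are simply added.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    independent = ((1.0 - beta) * q) ** 2
    common_cause = beta * q
    return independent + common_cause


# Sensitivity sweep with a hypothetical channel failure probability.
q = 1e-3
print(f"Independent AND gate: {and_gate_independent(q, q):.3e}")
for beta in (0.0, 0.01, 0.05, 0.1, 0.5):
    print(f"beta={beta:<5} Q_system={and_gate_beta(q, beta):.3e}")
```

Even a small beta swamps the independent term, which is why the ‘on paper’ reliability of a redundant system can be so misleading.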
So other classic techniques. Zonal hazard analysis looks at lots of different components in a compartment. If component A blows up, does it take out everything else in that compartment? Or if the compartment floods, what functionality do we lose in there? It’s particularly good for things like ships and planes, but also buildings with complex machinery, big plant where you’ve got different stuff in different locations. There’s also something called particular risk analysis, and these tend to be very unusual things, where you think about what happens if a fan blade breaks in a jet engine. Can the jet engine contain the fan blade failure? And if not, you’ve got a very high-energy piece of metal flying off somewhere – where does that go? Does that embed itself in the fuselage of the aeroplane? Does it puncture the pressure hull of the aeroplane? Or, as has sadly happened occasionally, does it penetrate and injure passengers? So, things like that, usually quite unusual things that are all very domain or industry specific. And then there are common mode analysis techniques, and a good example of a standard that incorporates those things is ARP 4761. This is a civil aircraft standard which looks at those things quite well. That’s one example; there are many others.
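The zonal question, “if we lose this compartment, what functionality do we lose?”, can be sketched as a simple lookup in Python. The layout below is entirely hypothetical, and a real zonal analysis also considers things like cable runs, pipework and proximity effects that this ignores.

```python
# Hypothetical layout: which zone each component sits in.
COMPONENT_ZONE = {
    "pump_a": "engine_room",
    "pump_b": "engine_room",
    "pump_c": "aft_bay",
    "controller_1": "bridge",
    "controller_2": "engine_room",
}

# Hypothetical redundancy: any one listed component can provide the function.
FUNCTION_PROVIDERS = {
    "cooling": ["pump_a", "pump_b"],
    "fire_main": ["pump_b", "pump_c"],
    "control": ["controller_1", "controller_2"],
}


def functions_lost(zone):
    """Functions left with no surviving provider if the whole zone is lost."""
    lost = []
    for function, providers in FUNCTION_PROVIDERS.items():
        survivors = [c for c in providers if COMPONENT_ZONE[c] != zone]
        if not survivors:
            lost.append(function)
    return sorted(lost)


for zone in sorted(set(COMPONENT_ZONE.values())):
    print(zone, functions_lost(zone))
```

In this made-up layout the two cooling pumps are redundant on paper, but they share a compartment, so one fire or flood takes out both. That is the sort of finding a zonal analysis is there to catch.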
In summary, I’ve emphasized the differences between Task 205 and 204. So, we might do a first pass of 205 and 204 where we’re essentially doing the same thing just at different levels of granularity. So, we might do the whole system initially in 205, one big hexagon, and then we might break down the jigsaw and do some 204 at a more detailed level. But where 205 is really going to score is in the differences from 204. So, it’s valuable to repeat that analysis at a higher level, but really we’ve got to diversify if we want success. So, we need to think about the different purpose and timing of these analyses. We need to think about what we’re going to get out of going top-down versus bottom-up, different sides of the ‘V’ model let’s say.
We need to think about the differences of looking at internals versus external interfaces and interactions, and we need to think of appropriate techniques and tools for all those things – and, of course, whether we need to do that at all! We will have an idea about whether we need to do that from all the previous analysis. So, if we’ve done our PHL or PHA, we’ve looked at the history and some simple functional techniques, and we’ve involved end-users and we’ve learnt from experience. If we’ve done our early tasks, we’re going to get lots of clues about how much risk is present, both in terms of the magnitude of the risk and the complexity of the things that we’re dealing with.
So, clearly, if we’ve got a very complex thing with lots of risks where we could kill lots of people, we’re going to do a whole lot more analysis than for a simple low-risk system. And we’re going to be guided by the complexity and risks and the hot spots where they are and go “Clearly, I’ve got a particular interface or particular subsystem, which is a hotspot for risk. We’re going to concentrate our effort there”. If you haven’t done the early analysis, you don’t get those clues. So, you do the homework early, which is quite cheap and that helps you. We direct effort to get the best return on investment.
The second major bullet point, which I talk about again and again, is that the client and end-user and/or the prime contractor need to do analysis early in order to get the benefits and to help them set requirements for lower down the hierarchy and pass relevant information to the sub-contractors. Because the sub-contractors, if you leave them in isolation, will do a hazard analysis in isolation, which is usually not as helpful as it could be. You get more out of it if you give them more context. So really, the ultimate client, end-user, and probably the prime as well, all need to do this task, even if they’re subcontracting it to somebody else. Whereas, maybe the Sub-System Hazard Analysis, Task 204, could be delegated just down to the sub-system contractors and suppliers. If they know what they’re doing and they’ve got the data to do it, of course. And if they haven’t, then somebody further up the food chain, or the supply chain, may have to do that.
And lastly, Tasks 204 and 205 are complementary, but not the same. If you understand that and exploit those similarities and differences, you will get a much more powerful overall result. You’ll get synergy. You’ll get a win-win situation where the two different analyses complement and reinforce each other. And you’re going to get a lot more success, probably for not much more money, effort and time. If you’ve done that thinking exercise and really sought to exploit the two together, then you’re going to get a greater holistic result.
So, that’s the end of our session for today. Just a reminder that I’ve quoted from the Mil. Standard 882, which is copyright free, but the contents of this presentation are copyright Safety Artisan, 2020.
For More …
That’s the end of the lesson on system hazard analysis task 205. And it just remains for me to say thanks very much for watching and look out for the next in the series of Mil. Standard 882 tasks. We will be moving on to Task 206, which is Operating and Support Hazard Analysis (O&SHA), a quite different analysis to what we’ve just been talking about. Well, thanks very much for watching and it’s goodbye from me.