Categories
Mil-Std-882E Safety Analysis

System of Systems Hazard Analysis

In this full-length (38-minute) session, The Safety Artisan looks at System of Systems Hazard Analysis, or SoSHA, which is Task 209 in Mil-Std-882E. SoSHA analyses collections of systems, which are often put together to create a new capability, which is enabled by human brokering between the different systems. We explore the aim, description, and contracting requirements of this Task, and an extended example to illustrate SoSHA. (We refer to other lessons for special techniques for Human Factors analysis.)

This is the seven-minute demo version of the full 38-minute video.

System of Systems Hazard Analysis: Topics

  • System of Systems (SoS) HA Purpose;
  • Task Description (2 slides);
  • Documentation (2 slides);
  • Contracting (2 slides);
  • Example (7 slides); and
  • Summary.

Transcript: System of Systems Hazard Analysis

Click here for the Transcript

Introduction

Hello everyone and welcome to the Safety Artisan. I’m Simon and today we’re going to be talking about System of Systems Hazard Analysis – a bit of a mouthful that. What does it actually mean? Well, we shall see.

System of Systems Hazard Analysis

So, for Systems of Systems Hazard Analysis, we’re using task 209 as the description of what to do taken from a military standard, 882E. But to be honest, it doesn’t really matter whether you’re doing a military system or a civil system, whatever it might be – if you’ve got a system of systems, then this will help you to do it.

Topics for this Session

Looking at what we’ve got coming up.

So, we look at the purpose of system of systems – and by the way, if you’re wondering what that is what I’m talking about is when we take different things that we’ve developed elsewhere, e.g. platforms, electronic systems, whatever it might be, and we put them together. Usually, with humans gluing the system together somewhere, it must be said, to make it all tick and fit together. Then we want this collection of systems to do something new, to give us some new capability, that we didn’t have before. So, that’s what I’m talking about when I say a system of systems. I’ll show you an example – it’s the best way. So, we’ve got a couple of slides on task description, a couple of slides or documentation, and a couple of slides on contracting. Tasks 209 is a very short task, and therefore I’ve decided to go through an example.

So, we’ve got seven slides of an example of a system of systems, safety case, and safety case report that I wrote. And hopefully, that will illustrate far better than just reading out the description. And that will also give us some issues that can emerge with systems of systems and I’ll summarize those at the end.

SOSHA Purpose

So, let’s get on. I’m going to call it the SOSHA for short; Systems of Systems Hazard Analysis. The purpose of the SOSHA, task 209, is to document or perform and document the analysis of the system of systems and identify unique system of systems hazards. So, things we don’t get from each system in isolation. This task is going to produce special requirements to deal with these hazards, which otherwise would not exist. Because until we put the things together and start using them for something new – We’ve not done this before.

Task Description (T209) #1

Task description: As in all of these tasks, the contractor shall perform and document an analysis of the system of systems to identify hazards and mitigation requirements. A big part of this, as I said earlier, we tend to use people to glue these collections, these portfolios, of systems together and humans are fantastic at doing that. Not always the ideal way of doing it, but sometimes it’s the only way of doing it within the constraints that we’ve got. The human is very important. The human will receive inputs from one or more systems and initiate outputs within the analysis and in fact within the real world, to be honest, which is what we’re trying to analyse. That’s probably a better way of looking at it.

And we’ve got to provide traceability of all those hazards to – it says – architecture locations, interfaces, data and stakeholders associated with the hazard. This is particularly important because with a system of systems each system tends to come with its own set of stakeholders, its own physical location, its own interfaces, etc, etc. The issue of managing all of those extraneous things and getting the traceability, it goes up. It is multiplied with every system you’ve got. In fact, I would say it was the square of. The example we’ll see: we’ve got three systems being put together in a system of systems and, in effect, we had nine times the amount of work in that area, I would say. I think that’s a reasonable approximation.

Task Description (T209) #2

Part two of the task description: The contractor will assess the risk of each hazard and recommend mitigation measures to eliminate the hazards. Or, very often, we can’t eliminate the hazards to reduce the associated risks. Then, as always with this standard, it says we’re going to use tables one, two and three, which are the severity, probability and the risk matrix that comes with the standard. Unless, of course, we have created or tailored our own matrix. Which we very often should do but it isn’t often done – I’ll have to do a session on how to do tailoring a matrix.

Then the contractor has got to verify and validate the effectiveness of those recommended mitigation measures. Now, that’s a really good point and I often see that missed. People come up with control measures or mitigation measures but don’t always assess how effective they’re going to be. Sometimes you can’t so we just have to be conservative but it’s not always done as well as it could be.

Documentation (T209) #1

So, let’s move on. Documentation: So, whoever does the analysis- the standard assumes it’s a contractor – shall document the results to include: you’ve got to describe the system of systems, the physical and functional characteristics of the system of systems, which is very important. Capturing these things is not a given. It’s not easy when you’ve got one system, but when you’ve got multiple systems, some of which are being misused to do something they’ve never done before, perhaps, then you’ve got to take extra care.

Then basically it says when you get more detail of the individual systems you need to supply that when it becomes available. Again, that’s important. And not only if the contractor supplies it, who’s going to check it? Who’s going to verify it? Etc., etc.

Documentation (T209) #2

Slide two on documentation: We’ve got to describe the hazard analysis methods and techniques used, providing a description of each method and technique used, and the assumptions and the data used in support. This is important because I’ve seen lots of times where you get a hazard analysis’ results and you only get the results. It’s impossible to verify those results or validate them to say whether they’ve been done in the correct context. And it’s impossible to say whether the results are complete or whether they’re up to date or even whether they were analysing the correct system because often systems come in different versions. So, how do you know that the version being analysed was the version you’re actually going to use? Without that description, you don’t know. So, it’s important to contract for these things.

And then hazard analysis results. What contents and formats do you want? It’s important to say. Also, we’re going to be looking to put the key items, the leading particular’s, from the results. The top-level results are going to go into the hazard tracking system which is more commonly known as a hazard log or a risk register, whatever it might be. Might be an Excel spreadsheet, might be a very fancy database, but whatever it’s going to be you’re going to have to standardize your fields of what things mean. Otherwise, you’re going to have – the data is going to be a mess and a poor quality and not very usable. So, again, you’ve got a contract for these things upfront and make sure you make clear definitions and say what you want.

Contracting #1

Contracting; implicitly, we’ve been talking about contracting already, but this is what a standard says. So, the request for proposal or statement of work has got to include the following. Typically we have an RFP before we’ve got a contract, so we need to have worked out what we need really early in the program or project, which isn’t always done very well. To work out what you need the customer, the purchaser, has probably got to do some analysis of their own in order to work all this stuff out. And I know I say this every time with these tasks, but it is so important. You can’t just dump everything on the contractor and expect them to produce good results because often the contractor is hamstrung. If you haven’t done your homework to help them do their work, then you’re going to get poor results and it’s not their fault.

So, we’ve got to impose the requirement for the task if we want it or need it. We’ve got to identify the functional disciplines. So, which specialists are going to do this work? Because very often the safety team are generalists. They do not have specialist technical knowledge in some of these areas. Or maybe they are not human factor specialists. We need somebody in, some human factor specialists, some user representatives, people who understand how the system will be used in real life and what the real-world constraints are. We need those stakeholders involved – That’s very important. We’ve got to identify those architectures and systems which make up the SOS -very important. The concept of operations. SOS is very much about giving capability. So, it’s all about what are you going to do with the whole thing when you put it together? How’s all that going to work?

Contracting #2

Interesting one, E, which is unique, I think, to task 209, what are the locations of the different systems and how far apart are they? We might be dealing with systems where the distance between them is so great that transmission time becomes an issue for energy or communications. Let’s say you’re bouncing a signal from an aircraft or a drone around the world via a couple of satellites back to home base. There could be a significant lag in communications. So, we need to understand all of these things because they might give rise to hazards or reduce the effectiveness of controls.

Part F; what analysis, methods, techniques do you want to use? And any special data to be used? Again, with these collections of systems that becomes more difficult to specify and more important. And then do we have any specific hazard management requirements? For example, are we using standard definitions and risk matrix from a standard or have we got our own? That all needs to be communicated.

Example #1

So, that is the totality of the task. As you can see, there’s not much to Task 209, so I thought it would be much more helpful to use an example, an illustration, and as they used to say in children’s TV, “Here’s one I made earlier” because a few years ago I had to produce a safety case report. I was the safety case report writer, and there was a small team of us generating the evidence, doing the analysis for the safety case itself.

What we were asked to do is to assure the safety of a system and – in fact, it was two systems but I just treat it as one – of a system for guiding aircraft onto ships in bad weather. So, all of these things existed beforehand. The aircraft were already in service. The ships were already in service. Some of the systems were already in service, but we were putting them together in a new combination. So, we had to take into account human factors. That was very important. We’ll see why in just a moment.

The operating environment, which was quite demanding. So, the whole point is to get the aircraft safely back to the ships in bad weather. They could do it in good weather you could do it visually, but in bad weather, visual wasn’t going to cut it. So, the operating environment- we were being asked to operate in a much more difficult environment. So, that changed everything and drove everything.

We’ve got to consider operating procedures because, as we’re about to see, people are gluing the systems together. So, how do they make it work? And also got to think about maintenance and management. Although in actual fact, we didn’t really consider maintenance and management that much. As an ex-maintainer, this annoys me, but the truth is people are much more focused on getting their capability and service. Often, they think about support as an afterthought. We’ll talk about that one day.

Example #2

Here’s a little demonstration of our system of systems. Bottom right-hand corner, we’ve got the ship with lots of people on the ship. So, if the aircraft crashes into it that could be bad news, not just for the people in the aircraft, but for the people on the ship – big risks there!

We’ve got our radar mounted on the ship so the ship is supplying the radar with power and control and data, telling it where to point for example. Also, the ship might be inadvertently interfering with the radar. There are lots of other electronic systems on the ship. There are bits of the ship getting in the way of the radar, depending on where you’ve put it, and so on and so forth. So, the ship interacts with the radar, the radar interacts with the ship, radars producing radiation. Could that be doing anything to the ship systems?

And then the radar is being operated. Now, I think that symbol is meant to indicate a DJ, but we’ve got the DJ wearing headphones and we got a disk there but it looks like a radar scope to me. So, I’ve just hijacked that. That’s the radar operator who is going to talk to the pilot and give the pilot verbal commands to guide them safely back to the ship. So, that’s how the system works.

In an ideal world, the ship would use the radar and then talk electronically direct to the aircraft and guide it – maybe automatically? That would be a much more sensible setup. In fact, that’s often the way it’s done. But in this particular case, we had to produce a bit of a – I hesitate to call it a lash-up because it was a multi-million-dollar project, but it was a bit of a lash-up.

So, there is the human factors. We’ve got a radar operator doing quite a difficult job and a pilot doing a very difficult job trying to guide their aircraft back onto the ship in bad weather. How are they going to interact and perform? And then lastly, as I alluded to earlier, the aircraft and the ship do actually interact in a limited way. But of course, it’s a physical interaction, so you can actually hurt people and of course, if we get it wrong, the aircraft interacts with the surface of the ocean, which is very bad indeed for the aircraft. So, we’ve got to be careful there. So, there’s a little illustration of our system of systems.

Example #3

And – this is the top-level argument that we came up with – it’s in goal structuring notation. But don’t worry too much about that – We’ll have a session on how to do GSN another time.

So, our top goal, or claim if you like, is that our system of systems is adequately safe for the aircraft to locate and approach the ship. So, that’s a very basic, very simple statement, but of course, the devil is in the detail and all of that detail we call the context. So, surrounding that top goal or claim, we’ve got descriptions of the system, of the aircraft and the ship. We got a definition of what we mean by adequately safe and we’ve got safety targets and reporting requirements.

So, what supports the top goal? We’ve got a strategy and after a lot of consultation and designing the safety argument, we came up with a strategy where we said, “We are going to show that all elements of the system of systems are safe and all the interactions are safe”. To do that, we had to come up with a scope and some assumptions to underpin that as well to simplify things. Again, they sit in the context, we just keep the essence of the argument down the middle.

And then underneath, we’ve got four subgoals. We aim to show that each system equipment is safe to operate, so it’s ready to be operated safely. Then each one is safe in operation so it can be operated safely with real people, etc. And then we’ve got all system safety requirements are satisfied for the whole collection of stuff and then finally that all interactions are safe. So, if we can argue all four of those, we should have covered everything. Now, I suspect if I did this again today, I might do it slightly differently. Maybe a little bit more elegantly, but that’s not the point. The point is, we came up with this and it worked.

Example #4

So, I’m going to unpack each one very briefly, just to illustrate some points. First of all, each component system is safe to operate. Each of these systems, bar one, had all been purchased already, sometimes a long time ago. They all came with their own safety targets, their own risk matrices, etc, etc. So, we had to make sure that when an individual system said, “This is what we’ve got to achieve” that that was good enough for the overall system of systems. So, we had to make sure that each system met its own safety requirements and targets and that they were valid in context.

Now, you would think that double-checking existing systems would be a foregone conclusion. In reality, we discovered that the ship’s communication system and its combat data system were not as robust as assumed. We discovered some practical issues were reported by stakeholders and we also discovered some flaws in previous analysis that had been accepted a long time ago. Now, in the end, those problems didn’t change the price of fish, as we say. It didn’t make a difference to the overall system of systems.

The frailty of the ship’s comms got sorted out and we discovered it didn’t actually matter about the combat system. So, we just assumed that the data coming out of the combat system was garbage and it didn’t make any difference. However, we did upset a few stakeholders along the way. So beware, people don’t like discovering that a system that they thought was “tickety-boo” was not as good as they thought.

Example #5

The second goal was to show that the system of systems is safe in operation. So, we looked at the actual performance. We looked at test results of the radar and then also we were very fortunate that trials of the radar on the ship with aircraft were carried out and we were able to look at those trials reports. And once again, it emerged that the system in the real world wasn’t operating quite as intended, or quite as people had assumed that it would. It wasn’t performing as well. So, that was an issue. I can’t say any more about that but these things happen.

Also, a big part of the project was we included the human element. So, as I’ve said before, we had pilots and we had radar air traffic talk-down operators. So, we brought in some human factors specialists. They captured the procedures and tasks that the pilots and the radar operators had to perform. They captured them with what’s called a Hierarchical Task Analysis, they did some analysis of the tasks and what could go wrong. Then they created a model of what the humans were doing and ran it through a simulation several thousand times. So in that way, they did some performance modelling.

Now, they couldn’t give us an absolute figure on workload or anything like that but what they could do – fortunately, our new system was replacing an older system which was even more informally cobbled together than the one that we were we were bringing in. And so, the Human Factor specialists were able to compare human performance in the old system vs. human performance with the new system. Very fortunately, we were pleased to find out that the predicted performance was far better with the new system. The new system was much easier to operate for both the pilots and the talk-down radar operators. So, that was terrific.

Example #6

So, the third one; All system of systems safety requirements are satisfied. Now, this is a bit more nebulous, this goal, but what it really came down to was when you put things together, very often you get what’s called emergent behaviour. As in things start to happen that you didn’t expect or you didn’t predict based on the individual pieces. It’s the saying, two plus two equals five. You get more out of a system – you get synergy for good or ill out when you start putting different things together.

So, does the whole thing actually work? And broadly speaking, the answer was yes, it works very well. There were some issues, a good example the old radar that they used to use to talk the planes down was a search radar so the operator could see other traffic apart from the plane they were they were guiding in. Now, the operator being able to see other things is both good and bad because on the one hand gives them improved situational awareness so they can warn off traffic if it’s a collision situation develops. But also, it’s bad because it’s a distraction for the operator. So, it could have gone either way.

So, the new radar was specialized. It focused only on the aircraft being talked down. So, the operator was blind to other traffic. So that was great in terms of decreasing operator workload and ultimately pilot workload as well. But would this increase the collision risk with other traffic? And I’ll talk about that in the summary briefly.

Example #7

And then our final goal is to show that all interactions are safe between the guidance system, the aircraft and the ship. This was a non-trivial exercise because ships have large numbers of electronic systems and there’s a very involved process to go through to check that a new piece of kit doesn’t interfere with anything else or vice versa.

And also, of course, does the new electronic system/the new radar does the radiation effect ship? Because you’ve got weapons on the ship and some of those explosive devices that the weapons uses are electrically initiated. So, could the radiation set off an explosion? So, all of those things had to be checked. And that’s a very specialized area.

And then we’ve got, does the system interfere with the aircraft and the aircraft with the system? What about the integration of the ship and the aircraft and the aircraft to the ship? Yet another specialized area where there’s a particular way of doing things. And of course, the aircraft people want to protect the aircraft and the ship people want to protect the ship. So, getting those two to marry up is also another one of those non-trivial exercises I keep referring to but it all worked out in the end.

Summary

Points to note: When we’re doing system of systems – I’ve got five points here, you can probably work some more points out from what I’ve said for yourself – but we’re putting together disparate systems. They’re different systems. They’ve been procured by different organizations, possibly, to do different things. The stakeholders who bought them and care about them have got different aims and objectives. They’ve got different agendas to each other. So, getting everyone to play nicely in the zoo can be challenging. And even with somebody pulling it all together at the top to say “This has got to work. Get with the program, folks!” there’s still some friction.

Particularly, you end up with large numbers of stakeholders. For example, we would have regular safety meetings, but I don’t think we ever had two meetings in a row with exactly the same attendees because with a large group of people, people are always changing over and things move up. And that can be a challenge in itself. We need to include the human in the loop in systems of systems because typically that’s how we get them all to play together. We rely on human beings to do a lot of translation work and in effect. So, how do the systems cope?

A classic mistake really with systems design is to design a difficult-to-operate system and then just expect the operator to cope. That can be from things as seemingly trivial as amusement park rides – I did a lesson on learning lessons from an amusement park ride accident only a month or two ago and even there it was a very complex system for two operators, neither of whom had total authority over the system or to be honest, really had the full picture of what was going on. As a result, there were several dead bodies. So, how did the operators cope, and have we done enough to support them? That’s a big issue with a system of systems.

Thirdly, this is always true with safety analysis, but especially so with system of systems. The real-world performance is important. You can do all the analysis in the world making certain assumptions and the analysis can look fine, but in the real world, it’s not so simple. We have to do analysis that assumes the kit works as advertised because you’ve got nothing else to go on until you get the test results and you don’t get them until towards the end of the program. So, you’re going down a path, assuming that things work, that they do what they say on the tin, and perhaps you then discover they don’t do what they say on the tin. Or they don’t do everything they say on a tin. Or they do what they say and they do some other things that you weren’t expecting as well and then you’ve got to deal with those issues.

And then fourthly, somewhat related to what I’ve just talked about, but you put systems together in an informal way, perhaps, and then you discover how they actually get on – what really happens. In reality, once you get above a certain level of complexity, you’re not really going to discover all the emergent behaviours and consequences until you get things into service and it’s clocked up a bit of time in service under different conditions in the real world. In fact, that was the case with this and I think with a system of systems, you’ve just got to assume that it’s sufficiently complex that that is the case.

Now, that’s not an unsolvable problem but, of course, how do you contract for that? Where you’ve got your contractors wanting you to accept their kit and pay them at a certain date or a certain point in the program, but you’re not going to find out whether it all truly works until it’s got into service and been in service for a while. So, how do you incentivize the contractor to do a good job or indeed to correct defects in a timely manner? That’s quite a challenge for system systems and it’s something that needs thinking about upfront.

And then finally, I’ve said, remember the bigger picture. It’s very easy when you’re doing analysis and you’ve made certain assumptions and you set the scope, it’s very easy to get fixated on that scope and on those assumptions and forget the real world is out there and is unpredictable. We had lots of examples of that on this program. We had the ship’s comms that didn’t always work properly, we couldn’t rely on the combat system, the radar in the real world didn’t operate as well as it said in the spec, etc, etc. There were lots of these things.

And, one example I mentioned was that with the new radar, the radar operator does not see any traffic other than the aircraft that is being guided in. So, there’s a loss of situational awareness there and there’s a risk, maybe an increased risk, of collision with other traffic. And that actually led to a disagreement in our team because some people who had got quite fixated on the analysis and didn’t like the suggestion that maybe they’d missed something. Although it was never put in those terms, that’s the way they took it. So, we need to be careful of egos. We might think we’ve done a fantastic analysis and we’ve produced hundreds of pages of data and fault trees or whatever it might be but that doesn’t mean that our analysis has captured everything or that it’s completely captured what goes on in the real world because that’s very difficult to do with such a complex system of systems.

So, we need to be aware of the bigger picture, even if it’s only just qualitatively. Somebody, a little voice, piping up somewhere saying, “What about this? And we thought about that? I know we’re ignoring this because we’ve been told to but is that the right thing to do?” And sometimes it’s good to be reminded of those things and we need to remember the big picture.

Copyright Statement

Anyway, I’ve talked for long enough. It just remains for me to point out that all the text in quotations, in italics, is from the military standard, which is copyright free but this presentation is copyright of the Safety Artisan. As I’m recording this, it’s the 5th of September 2020.

For More …

And so if you want more, please do subscribe to the Safety Artisan channel on YouTube and you can see the link there, but just search for Safety Artisan in YouTube and you’ll find us. So, subscribe there to get free video lessons and also free previews of paid content. And then for all lessons, both paid and free, and other resources on safety topics please visit the Safety Artisan at www.safetyartisan.com/  where I hope you’ll find much more good stuff that you find helpful and enjoyable.

End: System of Systems Hazard Analysis

So, that is the end of the presentation and it just remains for me to say thanks very much for watching and listening. It’s been good to spend some time with you and I look forward to talking to you next time about environmental analysis, which is Task 210 in the military standard. That’ll be next month, but until then, goodbye.

Categories
Mil-Std-882E Safety Analysis

Health Hazard Analysis

In this full-length (55-minute) session, The Safety Artisan looks at Health Hazard Analysis, or HHA, which is Task 207 in Mil-Std-882E. We explore the aim, description, and contracting requirements of this complex Task, which covers: physical, chemical & biological hazards; Hazardous Materials (HAZMAT); ergonomics, aka Human Factors; the Operational Environment; and non/ionizing radiation. We outline how to implement Task 207 in compliance with Australian WHS. (We refer to other lessons for specific tools and techniques, such as Human Factors analysis methods.)

This is the seven-minute-long demo. The full version is a 55-minute-long whopper!

Health Hazard Analysis: Topics

  • Task 207 Purpose;
  • Task Description;
  • ‘A Health Hazard is…’;
  • ‘HHA Shall provide Information…’;
  • HAZMAT;
  • Ergonomics;
  • Operating Environment;
  • Radiation; and
  • Commentary.

Health Hazard Analysis: Transcript

Click here for the Transcript

Introduction

Hello, everyone, and welcome to the Safety Artisan. I’m Simon, your host, and today we are going to be talking about health hazard analysis.

Task 207: Health Hazard Analysis

This is task 207 in the Mil. standard, 882E approach, which is targeted for defense systems, but you will see it used elsewhere. The principles that we’re going to talk about today are widely applicable. So, you could use this standard for other things if you wish.

Topics for this Session

We’ve got a big session today so I’m going to plough straight on. We’re going to cover the purpose of the task; the description; the task helpfully defines what a health hazard is; says what health hazard analysis, or HHA, shall provide in terms of information. We talk about three specialist subjects: Hazardous materials or hazmat, ergonomics, and operating environment. Also, radiation is covered, another specialist area. Then we’ll have some commentary from myself.

Now the requirements of the standard of this task are so extensive that for the first time I won’t be quoting all of them, word for word. I’ve actually had to chop out some material, but I’ll explain that when we come to it. We can work with that but it is quite a demanding task, as we’ll see.

Task Purpose

Let’s look at the task purpose. We are to perform and document a health hazard analysis and to identify human health hazards and evaluate what it says, materials and processes using materials, etc, that might cause harm to people, and to propose measures to eliminate the hazards or reduce the associated risks. In many respects, it’s a standard 882 type approach. We’re going to do all the usual things. However, as we shall see it, we’re going to do quite a lot more on this one.

Task Description #1

So, task description. We need to evaluate the potential effects resulting from exposure to hazards, and this is something I will come back to again and again. It’s very easy dealing in this area, particularly with hazardous materials, to get hung up on every little tiny amount of potentially hazardous material that is in the system or in a particular environment and I’ve seen this done to death so many times. I’ve seen it overdone in the UK when COSHH, a control of substance hazardous to health, came in in the military. We went bonkers about this. We did risk assessments up the ying-yang for stuff that we just did not need to worry about. Stuff that was in every office up and down the land. So, we need to be sensible about doing this, and I’ll keep coming back to that.

So, we need to do as it says; identification assessment, characterisation, control, and communicate assets in the workplace environment. And we need to follow a systems approach, considering “What’s the total impact of all these potential stressors on the human operator or maintainer?” Again, I come from a maintenance background. The operator often gets lots of attention because a) because if the operator stuffs up, you very often end up with a very nasty accident where lots of people get hurt. So, that’s a legitimate focus for a human operator of a system.

But also, a lot of organizations, the executive management tend to be operators because that’s how the organization evolves. So, sometimes you can have an emphasis on operations and maintenance and support, and other things get ignored because they’re not sexy enough to the senior management. That’s a bad reason for not looking at stuff. We need to think about the big picture, not just the people who are in control.

Task Description #2

Moving on with task description. We need to do all of this good stuff and we’re thinking about materials and components and so forth, and if they cause or contribute to adverse effects in organisms or offspring. We’re talking about genetic effects as well. Or pose a substantial present or future danger to the environment. So in 882, we are talking about environmental impact as well as human health impact. There is a there is an environmental task as well that is explicitly so.

Personally, I would tend to keep the human impact and the environmental impact separate because there are very often different laws that apply to the two. If you try and mix them together or do a sort of one size fits all analysis, you’ll frequently make life more difficult for yourself than you need to. So, I would tend to keep them separate. However, that’s not quite how the standard is written.

A Health Hazard is …

So what is a health hazard? As it says, a health hazard is a condition and it’s got to be inherent to the operation, etc, through to disposal of the system. So, it’s cradle to grave – That’s important. That’s consistent with a lot of Western law. It’s got to be capable of causing death, injury, illness, disability, or even in this standard, they’ve just reduced the job performance of personnel by exposure to physiological stresses.

Now I’m getting ahead of myself because, in Australia, health hazards can include psychological impacts as well, not just impacts on physical health. Now reduced job performance? – Are we really interested in minor stuff? Maybe not. Maybe we need to define what we mean by that. Particularly when it comes to operators or maintainers making mistakes, perhaps through fatigue that can have very serious consequences.

So, this analysis task is going to address lots of causes or factors that we typically find in big accidents and relate them to effects on human performance. Then it goes on to specify that certain specific hazards must be included chemical, physical, biological, ergonomic – for ergonomic, I would say human factors, because when you look at the standard, what we call ergonomics is much wider than the narrow definition of ergonomics that I’m used to.

Now, this is the first area that chops some material because where in a-d it says e.g. in those examples there is in effect a checklist of chemical, physical, biological, and ergonomic hazards that you need to look at. This task has its own checklist. You might recall when we talked about preliminary hazard identification, a hazard checklist is a very good method for getting broad coverage in general. Now, in this task, we have further checklists that are specific to human health. That’s something to note.

We’ve also got to think about hazardous materials that may be formed by test, operation, maintenance, disposal, or recycling. That’s very important, we’ll come back to that later. Thinking about crashworthiness and survivability issues. We’ve got to also think about it says non-ionizing radiation hazards, but in reality, we’ve got to consider ionizing as well. If we have any radioactive elements in our system and it does say that in G. So, we’ve got to do both non-ionizing and ionizing.

HHA Shall Provide Info #1

What categories of information should this health hazard analysis generate? Well, first of all, it’s got to identify hazards and as I’ve said or hinted at before, we’ve got to think about how could human beings be exposed? What is the pathway, or the conditions, or mode of operations by which a hazardous agent could come into contact with a person? I will focus on people. So, just because there is a potentially hazardous chemical present doesn’t mean that someone’s going to get hurt. I suspect if I looked around in the computer in front of me that I’m recording this on or at the objects on my desk, there are lots of materials that if I was to eat them or swallow them or ingest them in some other way would probably not do me a lot of good. But it’s highly unlikely that I’m going to start eating them so maybe we don’t need to worry about that.

HHA Shall Provide Info #2

We also need to think about the characterization of the exposure. Describing the assessment process: names of the tools or any models used; how did we estimate intensities of energy or substances at the concentrations and so on and so forth? This is one of those analyses that are particularly sensitive to the way we go about doing stuff. Indeed, in lots of jurisdictions, you will be directed as to how you should do some of these analyses and we’ll talk about that in the commentary later. So, we’ve got to include that. We’ve got to “show our working” as our teachers used to tell us when preparing us for exams.

HHA Shall Provide Info #3

We’ve got to think about severity and probability. Here the task directs us to use the standard definition tables that are found in 882. I talked about those under task 202 so I’m not going to talk about further here. Now, of course, we can, and maybe should tailor these matrices. Again, I’ve talked about that elsewhere, but if we’re not using the standard matrices and tables, then we should set out what we’ve done and why that’s appropriate as well.

HHA Shall Provide Info #4

Then finally, the mitigation strategy. We shouldn’t be doing analysis for the sake of analysis. We should be doing to say, “How can we make things better?” And in particular for health, “How can we make things acceptable?” Because health hazards very often attract absolute limits on exposure. So, questions of SFARP or ALARP or cost-benefit analysis simply may not enter into the equation. We simply may be direct to say “This is the upper limit of what you can expose a human being to. This is not negotiable.” So, that’s another important difference with this task.

Three More Topics

Now, at this point, I am just foreshadowing. We’re about to move on to talk about some different topics. First of all, in this section, we’re going to talk about three particular topics. Hazardous material or HAZMAT for short; ergonomics; and the operational environment. When we say the operational environment, it’s mainly about the people, aspects of the system, and the environment that they experience. Then after these three, we would go on to talk about radiation. There are special requirements in these three areas for HAZMAT, ergonomics, and operational environment.

HAZMAT (T207) #1

First of all, we have to deal with HAZMAT. If it’s going to appear in our system, or in the support system, we’ve got to identify the HAZMAT and characterize it. There are lots of international and national standards about how this is to be done. There’s a UN convention on hazardous materials, which most countries follow. And then there will usually be national standards as well that direct what we shall do. More on that later. So, we’ve got to think about the HAZMAT.

A word of caution on that. Certainly in Australian defence, we do HAZMAT to death because of a recent historical example of a big national scandal about people being exposed to hazardous materials while doing defence work. So, the Australian Defence Department is ultrasensitive about HAZMAT and will almost certainly mandate very onerous requirements on performing this. And whilst we might look at that go “This is nuts! This is totally over the top!” Unfortunately, we just have to get on with it because no one is going to make, I’m afraid, a sensible decision about the level of risk that we don’t have to worry about because it’s just too sensitive a topic.

So, this is one of those areas were learning from experience has actually gone a bit wrong and we now find ourselves doing far too much work looking at tiny risks. Possibly at the expense of looking at the big picture. That’s just something to bear in mind.

HAZMAT (T207) #2

So, lots of requirements for HAZMAT. In particular, we need to think about what are we going to do with it when it comes to disposal? Either disposal of consumables, worn components or final disposal of the system. And very often, the hazardous material may have become more hazardous. In that, let’s say engine or lubricating oil will probably have metal fragments in it once it’s been used and other chemical contamination, which may render it carcinogenic. So, very often we start with a material that is relatively harmless, but use – particularly over a long period of time – can alter those chemicals or introduce contaminants and make them more dangerous. So, we need to think about the full life of the system.

Ergonomics (T207) #1

Moving on to ergonomics, and this is another big topic. Now, Mil.standard 882 doesn’t address human factors, in my view, particularly well. The human factors stuff gets buried in various tasks and we don’t identify a separate human factors program with all of the interconnections that you need in order to make it fully effective. But this is one task where human factors do come in, very much so, but they are called ergonomics rather than human factors. Under this task description, we need to think about mission scenarios. We need to think about the staff who will be exposed as operators or maintainers, whatever they might be doing. We’ve got to start to characterize the population at risk.

Ergonomics (T207) #2

We’ve got to think about the physical properties of things that personnel will handle or wear and the implications that has on body weight. So, for example, there is a saying that the “Air Force and the Navy man their equipment and the army equip their men”. Apologies for the gendered language but that’s the saying. So, we’re putting human beings – very often – inside ships and planes and tanks and trucks. And we’re also asking soldiers to carry – very often – lots of heavy equipment. Their rations, their weapons, their ammunition, water, various tools and stuff that they need to survive and fight on the battlefield. And all that stuff weighs and all of that stuff, if you’re running about carrying it, bangs into the body and can hurt people. So, we need to address that stuff.

Secondly, we need to look at physical and cognitive actions that operators will take. So, this is really very broad once we get into the cognitive arena thinking about what are the operators going to be doing. And exposures to mechanical stress while performing work. So, maybe more of a focus on the maintainer in part three. Now, for all of this stuff, we need to identify characteristics of the design of the system or the design of the work that could degrade performance or increase the likelihood of erroneous action that could result in mishaps or accidents.

This is classic human factor’s stuff. How might the designed work or the designed equipment induce human error? So, that’s a huge area of study for a lot of systems and very important. And this will be typically a very large contributor to serious accidents and, in fact, accidents of all kinds. So, it should be an area of great focus. Often it is not. We just tend to focus on the so-called technical risks and overdo that while ignoring the human in the system. Or just assuming that the human will cope, which is worse.

Ergonomics (T207) #3

Continuing with ergonomics. How many staff do we need to operate and maintain the system and what demands are we placing on them? Also, if we overdo these demands, what are we going to do about that? Now, this can be a big problem in certain systems. I come from an aviation background and fatigue and crew duty time tend to be very heavily policed in aviation. But I was actually quite shocked when I sort of began looking at naval surface ships, submarines, where it seemed that fatigue and crew duty time was not well policed. In fact, there even seemed to be, in some places, quite a macho attitude to forcing the crew into working long hours. I say macho attitude because the feeling seemed to be “Well if you can’t take it, you shouldn’t have joined.”

So, it seems to be to me, quite a negative culture in those areas potentially, and it’s something that we need to think about. In particular, I’ve noticed on certain projects that you have a large crew who seem to be doing an extraordinary amount of work and becoming very fatigued. That’s concerning because, of course, you could end up with a level of fatigue where the crew might as well – they’re making mistakes to the same level as a drunk driver. So, this is something that needs to be considered carefully and given the attention it deserves.

Operating Environment #1

Moving on to the operating environment. How will these systems be used and maintained? And what does that imply for human exposure? This is another opportunity where we need to learn from legacy systems and go back and look at historical material and say “What are people being exposed to in the past? And what could happen again?”

Now, that’s important. It’s often not very systematically done. We might go and talk to a few old, bold operators and maintainers and ask their advice on the things that can go wrong but we don’t always do it very systematically. We don’t always survey past hazard and accident data in order to learn from it. Or if we do there is sometimes a tendency to say, “That happened in the past, but we will never make those mistakes. We’re far too clever to stuff up like that – like our predecessors did.” Forgetting that our predecessors were just as clever as we are and just as well –meaning as we are but they were human and so are we.

I think pride can get in the way of a lot of these analyses as well. And there may be occasions where we’re getting close to exposure limits, where regulations say we simply cannot expose people to a certain level of noise, or whatever, and then “How are we going to deal with that? How are we going to prevent people from being overexposed?” Again, this can be a problem area.

Operating Environment #2

This next bit of operating environment is really – I said about putting people in the equipment. Well, this is this bit. This is part A and B. So, we’re thinking about “If we stick people in a vehicle – whether it be a land vehicle, marine vehicle, an air vehicle, whatever it might be – what is that vehicle going to do to their bodies?” In terms of noise, of vibration and stresses like G forces, for example, and shock, shock loading? Could we expose them to blast overpressure or some other sudden changes of pressure or noise that’s going to damage their ears, temporarily or permanently? Again, remarkably easy to do. So, that’s that aspect.

Operating Environment #3

Moving on, we continue to talk about noise and vibration in general. In this particular standard, we’ve got some quite stringent guidance on what needs to be looked at. Now, these requirements, of course, are assuming a particular way of doing things, which we will come to later. There are a lot of standards reference by task 207. This task is assuming that we’re going to do things the American government or the American military way, which may not be appropriate for what we’re doing or the jurisdiction we’re in. So, we’ll just move on.

Operating Environment #4

Then again, talking about noise, blast, vibration, how are we going to do it? Some quite specific requirements in here. And again, you’ll notice, two-thirds of the way down in the paragraph, I’ve had to chop out some examples. There is some more in effect, hazard checklists in here saying we must consider X, Y, Z. Now, again, this seems to be requiring a particular way of doing things that may not be appropriate in a non-American defence environment.

However, the principle I think, to take away from this is that this is a very demanding task. If we consider human health effects properly, it’s going to require a lot of work by some very specialist and skilled people. In fact, we may even get in some specialist medical people. If you work in aviation or medicine, you may be aware that there is a specialist branch of medicine for called aviation medicine where these things are specifically considered. And similarly, there are medical specialists are a diving operations and other things where we expose human beings to strange effects. So, this can be a very, very demanding task to follow.

Operating Environment #5

So, when we’re going to equip people with protective equipment or we’re going to make engineering changes to the system to protect them, how effective are these things going to be? And given that most of these things have a finite effectiveness – they’re rarely perfect unless you can take the human out of the system entirely, then we’re going to be exposing people to some level of hazard and there will be some risk that that might cause that injury.

So, how many individuals are we going to expose per platform or over the total population exposed over the life of the system? Now, bearing in mind we’re talking sometimes about very large military systems that are in service for decades. This can be thousands and thousands of people. So, we may need to think about that and certainly in Australia, if we expose people to certain potential contaminants and noise, we may have to run a monitoring program to monitor the health and exposure of some of this exposed population or all of them. So, that can be a major task and we would need to identify the requirements to do that quite early on, hopefully.

And then, of course, again, we’re not doing this for the sake of it. How can we optimize the design and effectively reduce noise exposure and vibration exposure to humans? And how did we calculate it? How did we come to those conclusions? Because we’re going to have to keep those records for a long, long time. So, again, very demanding recording requirements for this task.

Operating Environment #6

And then I think this is the final one on operating environment. What are the limitations of this protective equipment and what burden do they impose? Because, of course, if we load people up with protective equipment that may introduce further hazards. Maybe we’re making the individual more likely to suffer a muscular musculoskeletal disorder.

Or maybe we are making them less agile or reducing their sensitivity to noise? Maybe if we give people hearing protection, if somebody else has assumed that they will hear a hazard coming, well, they’re not going to anymore, are they? If they’re wearing lots of protective equipment, they may not be as aware of the environment around them as they once were. So, we can introduce secondary hazards with some of this stuff. And then we need to look at the trade-offs. When and where? Is it better to equip people or not to equip people and limit their exposure or just keep them away altogether?

Radiation (T207)

So moving on briefly, we’re just going to talk about radiation. Now in this task – again, I’ve had to chop a lot of stuff out – you’ll see that in square brackets this task refers to certain US standards for radiation. Both ionizing and non-ionizing, lasers and so forth. That’s appropriate for the original domain, which this standard was targeted at. It may be wholly inappropriate for what you and I are doing.

So, we need to look at the principles of this task, but we may need to tailor the task substantially in order to make it appropriate for the jurisdiction we’re working in. Again, we’re going to have to keep these records for a long time. Radiation is always going to be dreaded by humans so it’s a controversial topic. We’re going to have to monitor people’s exposure and protect them and show that we have done so, potentially decades into the future. So, we should be looking for the very highest standards of documentation and recording in these areas because they will come under scrutiny.

Contracting #1

Moving onto contracting, this is more of a standard part of this task or part of the standard, I should say. These words or very similar words exist in every task. So, I’m not going to go through all of these things in any great detail. It’s worth noting, and I’ll come back to this in part B, we may need to direct whoever is doing the analyses to consider or exclude certain areas because it’s quite possible to fritter away a lot of resources doing either a wide but shallow analysis that fails to get to the things that can really hurt people.

So, we might be doing a superficial analysis or we might go overboard on a particular area and I’ve mentioned HAZMAT but there are many things that people can get overexcited about. So, we might see people spending a lot of time and effort and money in a particular area and ignoring others that can still hurt people. Even though they might be mundane, not as sexy. Maybe the analysts don’t understand them or don’t want to know. So, the customer who is paying for this may need to direct the analysis. I will come on to how you do that later.

Also the customer or client may need to specify certain sources of information, certain standards, certain exposure standards, certain assumptions, certain historical sets of data and statistics to be used. Or some statistics about the population, because, of course, for example, the military systems, the people who operate military systems tend to be quite a narrow subset of the population. So, there are very often age limits. Frontline infantry soldiers tend to be young and fit. In certain professions, you may not be allowed to work if you are colour-blind or have certain disabilities. So, it may be that a broad analysis of the general population is not appropriate for certain tasks.

It may be perfectly reasonable to assume certain things about the target population. So, we need to think about all of these things and ensure that we don’t have an unfocused analysis that as a result is ineffective or wastes a lot of money looking at things that don’t really matter, that are irrelevant.

Contracting #2

Standards and criteria. In part F, there are 29 references which the standard lists, which are all US military standards or US legal standards. Now, probably a lot of those will be inappropriate for a lot of jurisdictions and a lot of applications. So, there’s going to be quite a lot of work there to identify what are the appropriate and mandatory references and standards to use. And as I said, in the health hazard area, there are often a lot. So, we will often be quite tightly constrained on what to do.

And Part H, if the customer knows or has some idea of the staff numbers and profile, they’re going to be exposed to this system of operating and maintaining the system. That’s a very useful information and needs to be shared. We don’t want to make the analyst, the contractor, guess. We want them to use appropriate information. So, tell them and make sure you’ve done your homework, that you tell them the right thing to do.

Commentary #1

So, that’s all of the standard. I’ve got four slides now of commentary. And the first one, I just want to really summarize what we’ve talked about and think about the complexity of what we’re being asked to do. First bullet point, we are considering cradle to grave operation and maintenance and disposal. Everything associated with, potentially, quite a complex system. Now, this lines up very nicely with the requirements of Australian law, which require us to do all of this stuff. So, it’s got to be comprehensive.

Second bullet point, we’ve got to think about a lot of things. Death and injury, illness, disability, the effects on and could we infect somebody or contaminate somebody with something that will cause birth defects in their offspring? There’s a wide range of potential vectors of harm that we’re talking about here, and we will probably – for some systems, we will need to bring in some very specialist knowledge in order to do this effectively. And also thinking about reduced job performance – this is one aspect of human factors. This task is going to linking very strongly to whatever human factors program we might.

Thirdly, we’ve got to think about chemical, physical, and biological hazards. So, again, there’s a wide range of stuff to think about there. An example of that is hazmat and the requirements on hazmat are, in most jurisdictions, tend to be very stringent. So, that is going to be done and we need to be prepared to do a thorough job and demonstrate that we’ve done a thorough job and provide all the evidence.

Then we’ve also got ergonomics. Actually, strictly speaking, we’re talking human factors here because it’s a much wider definition than what the definition of ergonomics that I’m used to, which tends to be purely physical effects on a human. Because we’re talking about cognitive and perception and job performance as well and also we’ve got vibration and acoustics. So, again, particular medical effects and stringent requirements. So, a whole heap of other specialists work there.

And operating environment, thinking about the humans that will be exposed. How are we going to manage that? What do we need to specify in order to set up whatever medical monitoring program of the workforce we might have to bring in in the future through life? So, again, potentially a very big, expensive program. We need to plan that properly.

Then finally, radiation. Another controversial topic which gets lots of attention. Very stringent requirements, both in terms of exposure levels and indeed we will often be directed as to how we are to calculate and estimate stuff. It’s another specialist area and it has to be done properly and thoroughly.

Overall, every one of those seven bullet points shows how complex and how comprehensive a good health hazard analysis needs to be. So, to specify this well, to understand what is required and what is needed through life, for the program to meet our legal and regulatory obligations, this is a big task and it needs a lot of attention and potentially a lot of different specialist knowledge to make it work. I flogged that one to death, so I’ll move on.

Commentary #2

Now, as I’ve said before, too, this is an American military standard, so it’s been written to conform to that world. Now in Australia, the requirements of Australian work, health and safety are quite different to the American way of doing things. Whilst we tend to buy a lot of American equipment and there’s a lot of American-style thinking in our military and in our defence industry, actually, Australian law much is much more closely linked to English law. It’s a different legal basis to what the Americans do. So Australian practitioners take note.

It’s very easy to go down the path of following this standard and doing something that will not really meet Australian requirements. It’ll be, “We’ll do some work” and it may be very good work, but when we come to the end and we have to demonstrate compliance with Australian requirements, if we haven’t thought about and explicitly upfront, we’re probably in for a nasty shock and a lot of expensive rework that will delay the program. And that means we’re going to become very, very unpopular very quickly. So, that’s one to avoid in my experience.

So, we will need to tailor task 207 requirements upfront in order to achieve WHS compliance. And the client customer needs to do that and understand that not the – well the contractor needs to. The analysts need to understand that. But the customer needs to understand that first, otherwise, it won’t happen.

Commentary #3

Let’s talk a bit more about tailoring for WHS. For example, there are several WHS codes of practice which are relevant. And just to let you know, these codes of practice cover not only requirements of what you have to achieve, but also, to a degree, how you are to achieve them. So, they mandate certain approaches. They mandate certain exposure standards. Some of them also list a lot of other standards that are not mandated but are useful and informative.

So, we’ve got codes of practice on hazardous manual tasks so avoiding muscular-skeletal injuries. We’ve got several codes of practice on hazardous chemicals. So, we’ve got a COP specifically on risk management and risk assessment of hazardous chemicals, on safety data sheets, on labelling of HAZCHEM in a workplace. We’ve got a COP on noise and hearing loss and also, we have other COPs on specific risks, such as asbestos, electricity and others, depending on what you’re doing. So, potentially there is a lot of regulation and codes of practice that we need to follow.

And remember that COPs are, while they contain regulations, they also are a standard that a court will look to enforce if you get prosecuted. If you wind up in court, the prosecution will be asking questions to determine whether you’ve met the requirements of COP or not. If you can’t demonstrate that you’ve met them, you might have done a whole heap of work and you might be the greatest expert in the world on a certain kind of risk, but if he can’t demonstrate that you’ve met at minimum the requirements of COP – because they are minimum requirements – then you’re going to be in trouble. So, you need to be aware of what those things are.

Then on radiation, we have separate laws outside the WHS. So, we have the Australian Radiation Protection and Nuclear Safety Agency, ARPANSA, and there is an associated act and associated regulations and some COP as well. So, for radiation side, there’s a whole other world that you’ve got to be aware of and associated with all of this stuff are exposure standards.

Commentary #4

Finally, how do we do all of this without spending every dollar in the defence budget and taking 100 years to do it? Well, first of all, we need to set our scope and priorities. So, before we get to Task 207, the client/the customer should be involving end-users and doing a preliminary hazard identification exercise. That should be broad and as thorough as possible. They should also be doing a preliminary hazardous hazard analysis exercise, Task 202, to think about those hazards and risks further.

Also, you should be doing Task 203, which is system requirements hazard analysis. We need to be thinking about what are the applicable requirements for my system from the law all the way down to what specific standards? What codes of practice? What historical norms do we expect for this type of equipment? Maybe there is industry good practice on the way things are done. Maybe as we work through the specifications for the equipment, we will derive further requirements for hazard controls or a safety management system or whatever it might be. That’s a big job in itself.

So, we need to do all three of those tasks, 201, 202, 203, in order to be prepared and ready to focus on those things that we think might hurt us. Might hurt people physically, but also might hurt us in terms of the amount of effort we’re going to have to make in order to demonstrate compliance and assurance. So, that will focus our efforts.

Secondly, when we need to do the specialist analyses and we may not always need to do so. This is where 201, 202, and 203 come in. But where we need to do specialist analyses, we may need to find specialist staff who are competent to do these this kind of unusual or specialist work and do it well. Now, typically, these people are not cheap, and they tend to be in short supply. So, if you can think about this early and engage people early, then you’re going to get better support.

You’re probably going to get a better deal because in my experience if you call in the experts and ask their opinion early on, they’re more likely to come back and help you later. As opposed to, if you ignore them or disregard their advice and then ask them for help because you’re in trouble, they may just ignore you because they’ve got so much work on. They don’t need your work. They don’t need you as a client. You may find yourself high and dry without the specialists you need or you may find yourself paying through the nose to get them because you’re not a priority in their eyes. So do think about this stuff early, I would suggest and do cultivate the specialist. If you get them in early and listen to them and they feel involved, you’re much more likely to get a good service out of them.

So thirdly, try not to do huge amounts of work on stuff that doesn’t really have a credible impact on health. Now, I know that sounds like a statement of the blinking obvious, but because people get so het up about health issues, particularly things like radiation and other hazards that humans can’t see: we dread them. We get very emotional about this stuff and therefore, management tends to get very, very worried about this stuff. And I’ve seen lots of programs spend literally millions of dollars analyzing stuff to death, which really doesn’t make any difference to the safety of people in the real world. Now, obviously, that’s wasted money, but also it diverts attention from those areas that really are going to cause or could cause harm to people through the life of the system.

So, we need to use that risk matrix to understand what is the real level of risk exposure to human beings and therefore, how much money should we be spending? How much effort and priority should we be spending on analyzing this stuff? If the risk is genuinely very low, then probably we just take some standard precautions, follow industry best practice, and leave it at that and we keep our pennies for where they can really make a difference.

Now, having said that, there are some exceptions. We do need to think about accident survivability. So, what stresses are people going to be exposed to if their vehicle is an accident? How do we protect them? How do they escape afterward? Hopefully. How do we get them to safety and treat the injured? And so on and so forth. That may be a very significant thing for your system.

Also post-accident scenarios in terms of – very often a lot of hazardous materials are safely locked away inside components and systems but if the system catches fire or is smashed to pieces and then catches fire, then potentially a lot of that HAZMAT is going to become exposed. Very often materials that pose a very low level of risk, if you set them on fire and then you look at the toxic residue left behind after the fire, it becomes far more serious. So, that is something to consider. What do we do after we’ve had an accident and we need to sort of clean up the site afterward? And so on and so forth.

Again, this tends to be a very specialist job so maybe we need to get in some specialists to give us advice on that. Or we need to look to some standards if it’s a commonplace thing in our industry, as it often is. We learn we learned from bitter experience. Well, hopefully, we learn from bitter experience.

Copyright Statement

So, that’s it from me. I appreciate it’s been a long session, but this is a very complex task and I’ve really only skimmed the surface on this and pointed you at sort of further reading and maybe some principles to look at in more depth. So, all the quotations are from the Mill standard, which is copyright free. But this presentation is copyright of the Safety Artisan.

For More…

And for more information on this topic and others, and for more resources, do please visit www.safetyartisan.com. There are lots of free resources on the website as well, and there’s plenty of free videos to look at.

End: Health Hazard Analysis

So, that is the end of the session. Thank you very much for listening. And all that remains for me to say is thanks very much for supporting the work of the Safety Artisan and tuning into this video. And I wish you every success in your work now and in the future. Goodbye.

Categories
Mil-Std-882E Safety Analysis

Operating & Support Hazard Analysis

In this full-length session, The Safety Artisan looks at Operating & Support Hazard Analysis, or O&SHA, which is Task 206 in Mil-Std-882E. We explore Task 206’s aim, description, scope, and contracting requirements. We also provide value-adding commentary, which explains O&SHA: how to use it with other tasks; how to apply it effectively on different products; and some of the pitfalls to avoid. We refer to other lessons for specific tools and techniques, such as Human Factors analysis methods.

This is the seven-minute-long demo. The full version is about 35 minutes long.

Operating & Support Hazard Analysis: Topics

  • Task 206 Purpose:
    • To identify and assess hazards introduced by O&S activities and procedures;
    • To evaluate the adequacy of O&S procedures, facilities, processes, and equipment used to mitigate risks associated with identified hazards.
  • Task Description (six slides);
  • Reporting (two slides);
  • Contracting (two slides); and
  • Commentary (four slides).

Operating & Support Hazard Analysis: Transcript

Click here for the Transcript

Introduction

Hello everyone and welcome to the Safety Artisan; home of safety engineering training. I’m Simon and today we’re going to be carrying on with our series on Mil. Standard 882E system safety engineering.

Operating & Support Hazard Analysis

Today, we’re going to be moving on to the subject of operating and support hazard analysis. This is, as it says, task 206 under the standard. Operating and support hazard analysis, I’ll just call it O&S or OSHA (also O&SHA) for short. Unfortunately, that will confuse people if I call OSHA. Let’s call it O&S.

Topics for this Session

The purpose of O&S hazard analysis is to identify and assess hazards introduced by those activities and procedures and also to evaluate the adequacy of O&S procedures, processes, equipment, facilities, etc, to mitigate risks that have been already identified. A twofold task but a very big task. And as we’ll see, we’ve got lots of slides today on task description, and reporting, contracting, and commentary. As always, I present the full text as is of the task, which is copyright free, but I’m only going to talk about the things that are important. So, we’re not going to go through every little clause of the standard that would be pointless.

O&S Hazard Analysis (T206)

Let’s get started with the purpose. As we’ve already said, it’s to identify and assess those hazards which are introduced by operational and support activities and procedures and evaluate their adequacy. So, we’re looking at operating the system, whatever it may be- And of course, this is a military standard, so we assume a military system, but not all military systems are weapon systems by any means. Not all are physical systems. So, there may be inventory management systems, management information systems, all kinds of stuff. So, does operating those systems and just supporting them (maintaining them are resupplying them, disposing of them, etc.,) does that create any hazards or introduce any hazards? And how do we mitigate? That’s the purpose of the task.

Task Description (T206) #1

Let’s move on to the task description. Again, we’re assuming a contractor is performing the analysis, but that’s not necessarily the case. For this task, this actually says this typically begins during engineering and manufacturing development, or EMD.  So, we’re assuming an American style lifecycle for a big system and EMD comes after concept and requirements development. So, we are beginning to move into the very expensive stage of development for a system where we begin to commit serious money. It’s suggesting that O&SHA can wait until then which is fine in general unless you’ve identified any particularly novel hazards that will need to be dealt with earlier on. As it says, it should build on design hazard analyses, but we’ll also talk about the case later on when there is no design hazard analyses. And the O&SHA shall identify requirements or alternatives or eliminating hazards, mitigating risks, etc. This is one of those tasks where the human is very important – In fact, dominant to be honest. Both as a source of hazards and the potential victim of the associated risks. A lot of human-centric stuff going on here.

Task Description (T206) #2

As always, we’re going to think about the system configurations. We’re going to think about what we’re going to do with the system and the environment that we’re going to do it in. So, a familiar triad and I know I keep banging on about this, but this really is fundamental to bounding and therefore evaluating safety. We’ve got to know what the system is, what we’re doing with it, and the environment in which we’re doing it. Let’s move on.

Task Description (T206) #3

Again, Human Factors, regulatory requirements, and particularly specified personnel requirements need to be thought of. Particularly for operating and support, we need to take into account the staffing and personnel concept that we have. It’s frighteningly easy to produce a system that needs so much maintenance, for example, or support activity that it is unaffordable. And lots and lots of military systems and, it must be said, government and commercial systems in the past have come in that required enormous amounts of support, which soon proved to be unaffordable or no one would sign up to the commitment required. So, lots of projects have simply died because the system was going to be too expensive to sustain. That’s a key point of what we’re doing with O&S here. It’s not just about health and safety. It’s about health and safety, which is affordable.

We also need to look at unplanned events. So, not just designed in things, but things introduced- It says human errors. Again, I’m going to re-emphasize it’s erroneous human action because human error makes it sound like a human is at fault. Whereas very often it’s the design or the concept or the requirements that are at fault and place unacceptable burdens on the human being. Again, lots of messy systems seen in the past, which didn’t quite work and we just kind of expected the operator to cope. And most of the time they cope and then every so often they have a bad day at the office or a bunch of factors come together and lots of people die. And then we blame the human. Well, it’s not the human’s fault at all. We put them in that position. And as always, we need to look at past- Past evaluations of related legacy systems and support operations. If you have good data about legacy systems or about similar systems that your organization or another organization has operated, then that’s gold dust. So, do make an effort to get hold of that information if you can. Maybe a trade association or some wider pan organization body can help you there.

Task Description (T206) #4

At a minimum, we’ve got to identify activities involving known hazards. This assumes that we’ve done some hazard analysis in the past, which is very important. We always need to do that. I’ll come back to that commentary. Secondly, changes needed in requirements, be they functional requirements – what we want the system to do. Or design requirements, if we put constraints on how the system may do it for whatever it may be, hardware, software, support equipment, whatever to make those hazards and risks more manageable. Requirements for safety features – so requirements for engineered features and devices, equipment, because always, in almost any jurisdiction, we will have a hierarchy of control that recognizes that designed and engineered in safety features are more effective than just relying on people to get it right. And then we’ve also got to communicate to people the hazards associated with the system. Warnings, cautions, and whatever special emergency procedures might be required associated with the system. Again, that’s something that we see reinforced in law and regulations in many parts of the world. This is all good stuff. It’s accepted good practice all across the world.

Task Description (T206) #5

Moving on, we also need to think about how are we going to move the system around and the associated spares and supplies? How are we going to package them, handle them, stole them, transport them? Particularly if there are hazardous materials, etc, etc, involved. That’s the next part, G. Again, training requirements. We’re thinking about a human-centric approach. Whatever we expect people to do, they’ve got to be trained in how to do it. Point I, we’ve got to include everything, whether it’s developmental or non-developmental terms. We can’t just ignore stuff because it’s GFE or it’s off the shelf. It doesn’t mean it can never go wrong. Far from it. Particularly if we are putting stuff together that’s never been put together before in a novel combination or in a novel environment. Something that might be perfectly safe and stable in an air-conditioned office might start to do odd things in a much more corrosive and uncontrolled environment, let’s say.

We need to think about what modes might the system be potentially hazardous when under operative control. Particularly, we might think about degraded modes of operation. So, for whatever reason, a part of the system has gone wrong or the system has got into an operating environment within which it doesn’t operate as well as it could. It’s not in an optimal operating environment or state. The human being in control of it, we’re assuming, has still got to be able to operate the system, even if it’s only to shut it down or to get it back into a safer state or safer environment. We’ve got to think about all of those nuances.

Then because we’re talking about support as well, we need to think about a related legacy systems, facilities and processes which may provide background information. Also, of course, the system presumably will very often be operating alongside other systems or it will be supported by all systems maybe that exist or being procured separately. So, we’ve got to think about all those interactions as well and all those potential contributions. As you can see, this is quite a wide-ranging broadly scoped task.

Task Description (T206) #6

Finally, on this section, the customer/the end-user/or whoever may specify some specific analysis techniques. Very often they will not. So, whoever is doing the analysis, be they a contractor or third party outside agency, needs to make sure that whatever they propose to do is going to be acceptable to the program manager. In the sense that it is going to be compatible and relevant and useful. And then finally, the contractor has got to do some O&SHA at the appropriate time but maybe more detailed data will come along later. In which case that needs to be incorporated and also operational changes.

An absolute classic [situation] with military and non-military systems is; the system gets designed, it goes into test and evaluation and we discover that things- assumptions that were made during development- don’t actually hold up. The real world isn’t like that or whatever it might be and we find we’re making changes- making changes in assumptions. Those need to be factored in which, sadly, is often not done very well. So, that’s an important point to think about. What’s my change control mechanism and how will the people doing the and O&SHA find out about these changes? Because very often it’s easy to assume that everybody knows about this stuff but when you start making assumptions, the truth is that it very often goes adrift.

Reporting (T206) #1

Let’s talk about reporting- Just a couple of slides here. In the reporting, there’s some fairly standard stuff in here, the physical and functional characteristics of the system- that’s important. Again, we might assume that everybody knows what they are, but it’s important to put them in. It may be that the people doing the analysis were given a different system description to the people developing the system, to the people doing the personnel planning, etc. All the different things that have to be brought together, we need to make sure that they join up again. It’s too easy to get that wrong. Reinforcing the point I made on the previous slide, as more detailed descriptions and specifications come in that needs to be supplied when it becomes available and provided.

Hazard analysis methods and techniques. What techniques are we using? Give a description. If you’re doing it to a particular standard, so much the better. Great- that saves a lot of paper. What assumptions that we made? What data, both qualitative and quantitative have we used to support analysis? That all needs to be declared. By the way, one of the reasons is to be declared is that when things change- not if- that’s when these assumptions and the data and the techniques get exposed. So, if there are changes, if we don’t have this kind of information declared, we can’t assess the impact changes. And it gets even more difficult to keep up with what’s going on.

Reporting (T206) #2

And then hazard analysis results. Again, the leading particulars of the results should be recorded in the hazard tracking system, the HTS, or hazard log, or risk register- whatever you want to call it. But there will be more detailed information that we wouldn’t want to clutter up the risk register with and we also need to provide warnings, cautions, and procedures to be included in maintenance manuals, training courses, operator manuals, etc. So, we’re going to or we’re probably going to generate an awful lot of data out of this task and that needs to be provided in a suitable format. Again, whoever the program manager on the client-side, or is the end-user representation, needs to think about this stuff quite early on.

Contracting #1

That leads us neatly on to contracting. Now, this task, in theory, can be specified a little bit down the track, after the program started. In practice, what you find is program managers tried to specify everything up front in a single contract for various reasons.

There are good reasons for doing that sometimes. Also, there are bad reasons but I’m not going to talk about this session. We’ll have a talk about planning your system safety program in another session. There’s a lot of nuances in there to be considered.

Just sticking to this task, identification of functional disciplines – who do we need to get involved in order to do this work properly? It’s likely that the safety team if you have one, may not have relevant operating experience or relevant sustainment experience for this kind of system. If they do, that’s fantastic but that doesn’t negate the read the requirement to get the end-user represented and involved. In fact, that’s a near legal requirement in Australia, for example, and in some other jurisdictions. We need to get the end-users involved. We need the discipline specialist to get involved. Typically, your integrated logistic support team, your reliability people, your maintainability, and your testability people, if you have those disciplines. Or maybe you’re calling them something else, it doesn’t really matter.

We need to know what are the reporting requirements. What, if any, analysis methods and techniques do we desire to be used? Maybe the client or end-user has got to jump through some regulatory hoops and therefore they need specific analysis work and safety results to be done and produced. If that’s the case, then that needs to be specified in the contract. And what data is to be generated in what format? And how is it to be reported on when, etc? Considering the hazard tracking system, etc? And then the client may also select or specify known hazards, known hazardous areas, or other specific items to be examined or excluded because maybe it’s being covered elsewhere or we don’t expect the contractor to be able to do this stuff. Maybe we need to use a specialist organization. Again, maybe a regulator has directed us to do so. So, all of these things need to be thought about when we’re putting together the contract requirements for task 206.

Contracting #2

Again, I say this every time, we need to include all items within the scope of the system and the environment, not just developmental stuff. In fact, these days, maybe the majority of programs that I am seeing are mostly non-developmental. So, we’re taking lots of COTS stuff, GFE components, and putting it all together. That’s all going to be included, particularly integration.

We need to think about legacy and related processes and the hazard analysis associated with them if we can get them. They should be supplied to whoever is doing the work and an analyst should be directed to review them and include lessons learned.

Then, reinforcing the previous point that has a tracking system- How will information reported in this task be correlated with tasks and analyses that are being done maybe elsewhere or by different teams? And the example here is 207 health hazard analysis. I’ll talk a little bit about the linkages between the two later. But it’s quite likely in this sort of area there will be large groups of people thinking about operations and maintenance and support. Very often those groups are very different. Sometimes they don’t even talk to each other. That’s the culture in different organizations. You don’t see airline pilots hanging around with baggage handlers very much, do you, down the pub for whatever reason? Different set of people- they don’t always mix very much. And again, you may also have different specialist disciplines, especially the Human Factors people. Again, you’ve got to tie everything in there. So, there’s going to be lots of interfaces in this kind of task that they’ve got to be managed.

Point I – concept of operations. Yes, that’s in every task. You’ve got to understand what we intend to do with this system or what the end-user intends to do with the system in order to have some context for the analysis.

And then finally, what risk definitions and what risk matrix are we using? If we’re not using the standard 882 matrix, then what are we doing?

Commentary #1

I’ve got four slides of commentary now – a number of things to say about Task 206.

Now, I’ve picked an Australian example. So, Task 206 ties in very neatly with Australian WHS requirements. I suspect Australian WHS requirements have been strongly influenced by American OSHA and system safety practices. In Australia, we are heavily influenced by the US approach. This standard and legal requirements in Australia, and in many other states and territories let’s be honest, do tie in nicely with the standard. Although not always perfectly, you’ve got to remember that. So, we do need to focus on operations and support activities. That’s a big part of WHS, thinking about all relevant activities and cradle to grave – the whole life of the system. We need to think about the working environment, the workplace. We need to think about humans as an integral part of the system, be they operators or maintainers, suppliers, other kinds of sustainers. And we need to be providing relevant information on hazards, risks, warnings, trainings, and procedures, and requirements for PPE, and so on and so forth to workers.

So, task 206 is going to be absolutely vital to achieving WHS compliance in Australia and compliance with health and safety legislation and regulations in many parts of the world. In the US and UK and I would say in virtually all developed nations. So, this is a very important task for achieving compliance with the law and regulations. It needs to get the requisite amount of attention- It doesn’t always. People so often on a program during procurement and acquisition development, the technical system is the sexy thing. That’s the thing that gets all the attention, especially early on. The operating and particularly the support side tends to get neglected because it’s not so sexy. We don’t buy a system to support it after all do we? We buy a system to do a job. So, we get the operators in and we get their input on how to optimize the system to do the job most cost-effectively and with most mission effectiveness that we can get out of it. We don’t often think about support effectiveness. But to achieve WHS compliance or the equivalent this is a very important task so we will almost always need to do it.

Commentary #2

The second item to think about – what is going to be key for the maintenance support side is a technique called Job Safety Analysis or Job Hazard Analysis. I’ve highlighted a couple of sources of information there, particularly I would recommend going to the American www.OSHA.gov site and the guidance that they provide on how to do a job hazard analysis. So, use that or use something else if something different is specified in the jurisdiction you’re working it, then go ahead and use that. But if you don’t have any [guidance] on what to do, this will help you.

This is all about – I’ve got a task to do, whatever it might be doing, how do I do it? Let’s analyse this step-by-step, or at least in reasonable size chunks, thinking about how we do the tasks that need to be done. Now, there’s the operator side, and then, of course, we’re always dealing with human beings working on the system or working with the system. So, we’re going to be seeing potentially a lot of Human Factors type techniques being relevant. And there are lots of tasks that we can think about, Hierarchical Task Analysis and that kind of approach is going to fit in with the Job Hazard Analysis as well. Those are going to link together quite well. There will also be things like workload analysis. Particularly for the operators, if we’re asking the operator to do a lot and to maintain a particular level of concentration or respond rapidly, we need to think about workload and too much workload and too little workload can make things worse.

There are lots of techniques out there, I’m not going to talk about Human Factors here. I’m going to be putting on a series on Human Factors techniques in cooperation with a specialist in that area. So, I’m not going to say more here.

For certain kinds of operators, let’s say, pilots, people navigating a ship and so on, drivers, there will be well-established ways that those operators are trained the way they have to operate. There will often be a legal framework and a regulatory framework that says how they have to operate. And then that may direct a particular kind of analysis to be done or a particular approach to be taken for how operators do their jobs. But equally, there is a vast range of operator roles in industry, in chemical plants. Various specialist operating roles where there’s an industry-specific approach to doing things. Or indeed the general approach may be left up to whoever is developing the system. So, there’s a huge range of approaches here that are going to be largely dictated by the concept of operations and also an awareness of what is relevant law, regulation, and good practice in a particular industry, in a particular situation. That’s where doing your Task 203, your safety requirements analysis really kicks in. It’s a very broad subject we’re covering here. You’ve got to get the specialist in to do it well.

Contracting #3

Now, I mention that these days we’re seeing more and more legacy and COTS systems being used and repurposed. Partly to save time and money. We’re not developing mega systems as often as we used to, particularly in defence, but also in many other walks of life as well. So, we may find ourselves evaluating a system where very little technical hazard analysis has been done because there are no developmental items and it’s even difficult to do analysis on legacy or a COTS system because we cannot get the data to do so. Perhaps we can’t get the data for commercial reasons, contractual reasons.

Or maybe we’ve got a legacy system that was developed in a different jurisdiction and whatever information is available with it just doesn’t fit the jurisdictional regulatory system that we’ve got to work in where we want to operate the system. This is very common. Australia, for example, [acquires] a lot of systems from abroad, which have not been developed in line with how we normally do things.

We could in theory just do Task 206 if there was no developmental hazard analysis to do but that’s not quite true. At a minimum, we will always need to do some Preliminary Hazard Listing and hazard analysis – that’s Tasks 201 and 202 respectively. And we will very definitely need to do some System Requirements Hazard Analysis, Task 203, to understand what we need to do for a particular system in a particular application, operating environment, and regulatory jurisdiction. So, we’re always going to have to do those and we may well have to look at the integration of COTS things and do some system-level analysis. That’s 204. We’re definitely going to need to do the early analyses. In fact, the client and the end-user representatives should be doing 201, 202 and 203 and then we may be in a position to finish things off with 206 for certain systems.

Contracting #4

Now, having said that, I’ve mentioned already that Task 206 can be very broad in scope and very wide-ranging. There’s a danger that we will turn Task 206 into a bottomless pit into which we pour money and effort and time without end. So, for most systems, we cannot afford to just do O&SHA across the board without any discernment or any prioritization.

So, we need to look at those other hazard analyses and prioritize those areas where people could get hurt. Particularly we should be using legacy and historical data here to say “What does – in reality, what does hurt people when looking after these systems or operating systems?” Again, as I’ve said before, in many industries there is a standard industry approach or good practice to how certain systems are operated, and maintained, and supported. So, if there is a standard industry approach available – particularly if we can justify that by available historical data – if that [is as good] as doing analysis, then why not just use the standard approach? It’s going to be easier to make a SFARP or a ALARP argument that way anyway. And why spend the money on analysis when we don’t have to? We could just spend the money on actually making the system safer. So, let’s not do analysis for the sake of doing analysis.

Also, there’s a strong synergy between the later tasks in the 200 series. There’s a strong linkage between this Task 206 and 207, which is Health Hazard Analysis. Also, there can be a strong linkage between Task 210, which is the Environmental Hazard Analysis. So, this trio of tasks focuses on the impact on living things, whether they be human beings or animals and plants and ecosystems and very often there’s a lot of overlap between them. For example, hazardous chemicals that are dangerous for humans are often dangerous for animals and plants and watercourses and so on and so forth. I’ll be talking about that more in the next session on Task 207.

One word of warning, however. Certainly, in Australia, we have got fixated on hazardous chemicals because we’ve had some very high-profile scandals involving HAZCHEM in the past. Now, there’s nothing wrong, of course, with learning from experience and applying rigorous standards when we know things have gone wrong in the past. But sometimes we go into a mindset of analysis for analysis sake. Dare I say, to cover people’s backsides rather than to do something useful. So, we need to focus on whether the presence of a HAZCHEM could be a problem. Whether people get exposed to it, not just that it’s there.

Certain chemicals may be quite benign in certain circumstances, and they only become dangerous after an emergency, for example. There are lots of things in the system that are perfectly safe until the system catches fire. Then when you’re trying to dispose or repair a fire damage system that can be very dangerous, for example. So, we need to be sensible about how we go about these things. Anyway, more on that in the next session.

Copyright Statement

That’s the commentary that I have on Task 206. As we said, it links very tightly with other things and we will talk about those in later sessions. I just like to point out that the “italic text in quotations” is from the Mil. standard. That is copyright free as most American government standards are. However, this presentation and my commentary, etc. are copyright of the Safety Artisan 2020.

For More …

Now, for all lessons and resources, please do visit the www.safetyartisan.com. Now, as you’ll notice, it’s an https – it’s a secure website.

End: Operating & Support Hazard Analysis

So, that is the end of the lesson and it just remains for me to say thank you very much for your time and for listening. And I look forward to seeing you again soon. Cheers.

Categories
Mil-Std-882E Safety Analysis

Functional Hazard Analysis

In this full-length (40-minute) session, The Safety Artisan looks at Functional Hazard Analysis, or FHA, which is Task 208 in Mil-Std-882E. FHA analyses software, complex electronic hardware, and human interactions. We explore the aim, description, and contracting requirements of this Task, and provide extensive commentary on it. (We refer to other lessons for special techniques for software safety and Human Factors.)

This is the seven-minute demo; the full version is 40 minutes long.

Topics: Functional Hazard Analysis

  • Task 208 Purpose;
  • Task Description;
  • Update & Reporting
  • Contracting; and
  • Commentary.

Transcript: Functional Hazard Analysis

Click here for the Transcript

Introduction

Hello, everyone, and welcome to the Safety Artisan; Home of Safety Engineering Training. I’m Simon and today we’re going to be looking at how you analyse the safety of functions of complex hardware and software. We’ll see what that’s all about in just a second.

Functional Hazard Analysis

I’m just going to get to the right page. This, as you can see, functional hazard analysis is Task 208 in Mil. Standard 882E.

Topics for this Session

What we’ve got for today: we have three slides on the purpose of functional hazard analysis, and these are all taken from the standard. We’ve got six slides of task description. That’s the text from the standard plus we’ve got two tables that show you how it’s done from another part of the standard, not from Task 208. Then we’ve got update and recording, another two slides. Contracting, two slides. And five slides of commentary, which again include a couple of tables to illustrate what we’re talking about.

Functional Purpose HA #1

What we’re going to talk about is, as I say, functional hazard analysis. So, first of all, what’s the purpose of it? And in classic 882 style, Task 208 is to perform this functional hazard analysis on a system or subsystem or more than one. Again, as with all the other tasks, it’s used to identify and classify system functions and the safety consequences of functional failure or malfunction. In other words, hazards.

Now, I should point out at this stage that the standard is focused on malfunctions of the system. The truth is in the real world, that lots of software-intensive systems have been involved in accidents that have killed lots of people, even when they’re functioning as intended. That’s one of the short-sightedness of this Mil. Standard is that it focuses on failure. The idea that if something is performing as specified, that either the specification might be wrong or there might be some disconnect between what the system is doing and what the human expects – The way the standard is written just doesn’t recognize that. So, it’s not very good in that respect. However, bearing that in mind, let’s carry on with looking at the task.

Functional HA Purpose #2

We’re going to look at these consequences in terms of severity – severity only, we’ll come back to that – for the purpose of identifying what they call safety-critical functions, safety-critical items, safety-related functions, and safety-related items. And a quick word on that, I hate the term ‘safety-critical’ because it suggests a sort of binary “Either it’s safety-critical. Yes. Or it’s not safety-critical. No.” And lots of people take that to mean if it’s “safety-critical, no,” then it’s got nothing to do with safety. They don’t recognize that there’s a sort of a sliding scale between maximum safety criticality and none whatsoever. And that’s led to a lot of bad thinking and bad behaviour over the years where people do everything they can to pretend that something isn’t safety-related by saying, “Oh, it’s not safety-critical, therefore we don’t have to do anything.” And that kind of laziness kills people is the short answer.

Anyway, moving on. So, we’ve got these SCFs, SCIs, SRFs, SRIs and they’re supposed to be allocated or mapped to a system design architecture. The presumption in this – the assumption in this task is that we’re doing early – We’ll see that later – and that system design, system architecture, is still up for grabs. We can still influence it. Often that is not the case these days. This standard was written many years ago when the military used to buy loads of bespoke equipment and have it all developed from new. That doesn’t happen anymore so much in the military and it certainly doesn’t happen in many other walks of life – But we’ll talk about how you deal with the realities later. And they’re allocating these functions and these items of interest to hardware, software and human interfaces. And I should point out, when we’re talking about all that, all these things are complex. Software is complex, human is complex, and we’re talking about complex hardware. So, we’re talking about components where you can’t just say, “Oh, it’s got a reliability of X, and that’s how often it goes wrong” because those type of simple components that are only really subject to random failure, that’s not what we’re talking about here. We’re talking about complex stuff where we’re talking about systematic failure dominating over random, simple hardware failure. So, that’s the focus of this task and what we’re talking about. That’s not explained in the standard, but that’s what’s going on.

Functional HA Purpose #3

Now, our third slide on purpose; so we use the FHA to identify consequences of malfunction or functional failure, lack of function. As I said just now, we need to do this as early as possible in the systems engineering process to enable us to influence the design. Of course, this is assuming that there is a systems engineering process – that’s not always the case. We’ll talk about that at the end as well. And we’re going to identify and document these functions and items and allocate and it says partition them in the software design architecture. When we say partition, that’s jargon for separate them into independent functions. We’ll see the value of that later on. Then we’re going to identify requirements and constraints to put on the design team to say, “To achieve this allocation in this partitioning, this is what you must do and this is what you must not do”. So again, the assumption is we’re doing this early. There’s a significant amount of bespoke design yet to be done.

Task Description (T208) #1

Moving on to task description. It says the contractor, but whoever’s doing the analysis has to perform and document the FHA, to analyse those functions, as it says, with the proposed design. I talked about that already so we’ll move on.

It’s got to be based on the best available data, including mishap data. So, accident/incident data, if you can get it from similar systems and lessons learned. As I always say in these sessions, this is hard to do, but it’s really, really valuable so do put some effort into trying to get hold of some data or look at previous systems or similar systems. We’re looking at inputs, outputs, interfaces and the consequences of failure. So, if you can get historical data or you can analyse a previous system or a similar system, then do so. It will ultimately save you an awful lot of money and heartache if you can do that early on. It really is worth the effort.

Task Description (T208) #2

At a minimum, we’ve got to identify and evaluate functions and to do that, we need to decompose the system. So, imagine we’ve got this great big system. We’ve got to break it down into subsystems of major components. We’ve got to describe what each subsystem and major component does, its function or its intended function. Then we need a functional description of interfaces and thinking about what connects to what and the functional ins and outs. I guess pretty obvious stuff  – needs to be done.

Task Description (T208) #3

And then we also need to think about hazards associated with, first of all, loss of function. So, no function when we need it. Now, we have degraded functional malfunction and sort of functioning out of time or out of sequence. So, we’ve got different kinds of malfunctions. What we don’t have here is function when not required. So, the system goes active for some reason and does something when it’s not meant to. Now, if we add that third base and we’ve got a functional failure analysis.

Essentially here, we’re talking about a functional failure analysis, maybe something a bit more sophisticated, like a HAZOP. And the HAZOP is more sophisticated because instead of just those three things that can go wrong, we think about we’ve got lots of guide words to help us think about ‘out of time, out of sequence’. So, too early, too late, before intended, after intended, whatever it might be. And there are there variations on HAZOP called computer HAZOP, or CHAZOP, where people have come up with different keywords, different prompt words, to help you think about software in data-intensive systems. So, that’s a possible technique to use here.

And then when we’re thinking about these hazards that might be generated by malfunction, or functional failure in its various forms, we need to think about, “What’s the next step in the mishap sequence? In the accident sequence? And what’s the final outcome of the accident sequence?” And that’s very important for software because software is intangible. It has no physical form. On its own, in isolation, software cannot possibly hurt anyone. So, you’ve got to look at how the software failure propagates through the system into the real world and how it could harm people. So, that’s a very important prompt that that last sentence in yellow there.

Task Description (T208) #4

And we carry on. We need to assess the risk with failure of a function subsystem or component. We’re going to do so using the standard 882 tables, tables one and two, and risk assessment codes in table three, unless we come up with our own tailored versions of those tables and that matrix and that’s all approved. In reality, most people don’t tailor this stuff. They should make it appropriate for the system, but they rarely do.

Table I and II

So just to remind us what we’re talking about, here’s table one and two. Table one is severity categories ranging from catastrophic, which could kill somebody – a catastrophic outcome – down to negligible, where we’re talking cuts and bruises – very, very, very minor injuries.

And then table two, probability levels. We’ve got everything from frequent down to eliminated – There’s no hazard at all because we’ve eliminated. It will never happen in the lifetime of the universe. So, it really is a zero probability. We’ve got frequent down to improbable and then in the standard, we’ve got a definition for these things in words, for a single item and also for a fleet or inventory of those items, assuming that there’s a large number of them. And that’s very useful. That helps us to think about how often something might go wrong per item and per fleet.

Table III

So, that’s tables one and two, we put them together, the severity and the probability to give us table three. As you can see, we’ve got probability down the left-hand side and at the bottom, if we’ve eliminated the hazard, then there is no severity. The hazard is completely eliminated. So, forget about that row. Then everything else we’ve got frequent down to improbable, probability. And we’ve got catastrophic down to negligible. Together those generate the risk assessment code, which is either high, serious, medium or low. That’s the way this standard defines things. Nothing is off-limits. Nothing is perfect except for elimination. We’ve just defined a level of risk and then you have to make up rules about how you will treat these levels of risk. The standard does some of that for you, but usually, you’ve got to work out depending on which jurisdiction you’re in legally, what you’re required to do about different levels of risk.

Now this table on its own, I’ll just mention, is not helpful in a British or Australian jurisdiction where we have to reduce or eliminate risks SOFARP. The table on its own won’t help you do that, because this is just an absolute level of risk. It’s not considering what you could have done to make it better. It’s just saying where we are. It’s a status report.

So, those are your tables one, two and three, as the standard describes them. That’s the overall method and we’re going to do what it says in Section four of the standard. In the main body of the standard, Section four talks about software and complex hardware and how we allocate these things.

Task Description (T208) #5

And then finally, I think on task description, an assessment of whether the functions identified are to be implemented in the design – sorry, of whether the functions are to be implemented in the design and map those functions into the components. And then it says functions allocated to software should be matched to the lowest level of technical design or configuration item. So, if you’ve got a software or hardware configuration item that is further subdivided into sub-items, then you need to go all the way down and see which items can contribute to that function and which can’t.

That’s an important labour-saving device, because if you’ve got  – you could have quite a large configuration item, but actually, only a tiny bit contributes to the hazard. So, that’s the only thing you need to worry about in theory. In reality, partitioning software is not as easy as the standard might suggest. However, if we can do a meaningful partition, then we could and should aim to have as little software safety-related as we possibly can. If nothing else, for cost in order to get the project in on time. So, the less criticality we have in our system, the better.

Task Description (T208) #6

So, we need to assess the software control category for each configuration item that’s been allocated a safety-significant software function (SSSF). Having assigned the SCC, we then have to work at the software criticality index for each of those functions and we’ll talk about how to do that at the end. Then from all of this work, we need to generate a list of requirements and constraints to include in the spec which, if they work, will eliminate the hazard or reduce the risk.

And the standard talks about that these could be in the form of fault tolerance, fault detection, fault isolation, fault annunciation or warning, or fault recovery. Now, this breakdown reveals – basically this is a reliability breakdown. So, in the world of reliability, we talk typically about fault tolerance, fault detection, warning, and recovery. Four things – I mean they split them down to five here. Now, software reliability is highly controversial. So really, this is a bit of a mismatch here. These reliability-based suggestions are not necessarily much use for software, or indeed for people sometimes. You may have to use other more typical software techniques to do this and in fact, the standard does point you to do that. But that’s for another session.

FHA Update & Records #1

So, we’ve done the FHA, or we’re doing the FHA. We’ve got to record it and we’ve got to update it when new information comes through. So, we’ve got to update the FHA as the design progresses or operational changes come in. We’ve got to have a system description of the physical and functional characteristics of the system and subsystems. And of course, for design complex items like software, context is everything. So, this is very important. Again, software in isolation cannot hurt anyone. You’ve got to have the context to understand what the implications might be. If we don’t have that, we’re stuffed pretty much. Then it goes on to say that when further documentation becomes available, more detail that needs to be supplied. So, don’t forget to ask for that in your contract and expect it as well and be ready to deal with it.

FHA Update & Records #2

 Moving on. When it comes to hazard analysis, method and techniques, we need to describe the method and the technique used for the analysis, what assumptions and what data was used in support of the analysis and this statement is pretty much in every single task so I’ll say no more. You’ve heard this before. Then again, analysis results need to be captured in the hazard tracking system and, as I’ve always said, usually the leading details, the top-level details, go in there has a tracking system. The rest of it goes into the hazard analysis report otherwise, you end up with a vast amount of data in your HTS and it becomes unwieldy and potentially useless.

Contracting #1

Contracting – Again, this is a pretty standard clause, or set of clauses, in a Mil. Standard 882 task. So, in our request for proposal and statement of work, we’ve got to ask for Task 208. We’ve got to point the analyst, the contractor, at what we want them to analyse particularly or maybe as a minimum. And what we don’t want to analyse, maybe because it’s been done elsewhere or it’s out of scope for this system.

We need to say what are data reporting requirements are considering Task 106, which is all about hazard tracking system or the hazard log or the risk register, whatever you want to call it. So, what data do we want? What format? What are the definitions, etc.? Because if you’re dealing with multiple contractors or you want data that is compatible with the rest of your inventory, then you’ve got to specify what you want. Otherwise, you’re going to get variability in your data and that’s going to make your life a whole lot harder downstream – Again, this is standard stuff.

And what are the applicable requirements, specifications and standards? Of course, this is an American standard so compliance with specifications, requirements and standards is all because that’s the American system.

Contracting #2

We need to supply the concept of operations, as I’ve said before, with a complex design. Especially software, context is everything. So, we need to know what we’re going to do with the system that the software is sat within. So, this system has got some functions, this is what we’re looking at in Task 208: What are those functions for? How do they to relate with the real world? How could we hurt people? And then if we got any other specific hazard management requirements. Maybe we’re using a special matrix because we’ve decided the standard matrix isn’t quite right for our system. Whatever we’re doing, if we’ve got special requirements that are not the norm for the vanilla standard, that we need to say what they are. Pretty straightforward stuff.

Commentary #1

We’re onto commentary, and I think we’ve got five slides of commentary today.

As it says, functional hazard analysis depends on systems engineering. So, if we don’t have good systems engineering, we’re unlikely to have good functional analysis. So, what do I mean by good systems engineering? I mean, that for the complete system – apart from things that we deliberately excluded for a good reason – but for the complete system we need or functions to be identified, we need those functions to be analysed and allocated correctly in accordance and rigorously and consistently. We need interface analysis, control, and we need the architecture of the design to be determined based on the higher-level requirements, all that work that we’ve done.

Now, if those things are not done or they’re incomplete, or they were done too late to influence the design architecture, then you’re going to have some compromised systems engineering. And these days, because we’re using lots of commercial off the shelf stuff, what you find is that your top-level design architecture is very often determined before you even start because you’ve decided you’re going to have an off the shelf this and you’re going to have a modified off the shelf that and you’re going to put them together in a particular way with a set of business rules, a concept of operations, that says this is how we’re going to use this stuff. And our new system interfaces with some existing stuff and we can’t modify the existing stuff.

So, that really limits what we can do with the design architecture. A lot of the big design decisions have already been taken before we even got started. Now, if that’s the case, then that needs to be recognized and dealt with. I’ve seen those things dealt with well. In other words, the systems engineering has been done recognizing those constraints, those things that that can’t be done. And I’ve seen it done badly in that figuratively speaking, the systems engineering team or the program manager, whoever has just given us a Gallic shrug and gone “Yeah, what the heck, who cares?” So, there’s this the two extremes that you can see.

Now, if the systems engineering is weak or incomplete, then you’re going to get a limited return on doing Task 208. Maybe there are some areas where you can do it on new areas, or maybe you’ve got a new interface that’s got to be worked up and created in order to get these things to talk to each other. Clearly, there is some mileage in doing that. You’re going to get some benefits from doing that in that area. But for the stuff that’s already been done, probably – Well, what what’s the point of doing systems engineering here? What does it achieve? So, maybe in those circumstances, it’s better – Well, in fact, I would say it’s essential to understand where systems engineering is still valid, where you are still going to get some results and where it isn’t. And maybe you just declare that scope; What’s in and out.

Or maybe you take a different approach. Maybe you go “OK, we’re dealing with a predominantly COTS system. We need a different way of dealing with this than the way the Mil. standard 882 assumes.” So, you’re going to have to do some heavy tailoring of the standard because 882 assumes that you’re determining all these requirements predesigned. If that’s not the case, then maybe 882 isn’t for you. Or maybe you just need to recognize you’re going to have to hack it about severely. Which in turn means you’ve got to know what you’re doing fundamentally. In which case the standard really is no longer fulfilling its role of guiding people.

Commentary #2

Moving on. Let’s assume that we are still going to do some Task 208. We’re going to determine some software criticality. We’re also going to determine some criticality for complex hardware. So, things whether it be software in complex electronics, so pre-programmed electronics, whatever that might be. First of all, as we said before, we’re going to determine the software control category and what that’s really saying is how much authority does the software have? And then secondly, we’re going to be looking at severity, which was table one. How severe is the worst hazard or risk that the software could contribute to? And these are illustrated in the next two slides. And we do a session or several sessions on software safety is coming soon. That will be elsewhere. I’m not going to go into massive detail here. I’m just giving you an overview of what the task requires.

Commentary #3: Software Control Categories 1-5

First of all, how do we determine software control category? So, there’s the table from the standard. We’ve got five levels of SCC.

At the top, we’ve got autonomous. Basically, the software does whatever it wants to and there’s no checks and balances.

Secondly, they’re semi-autonomous. The software is there’s one software system performing a function, but there are hardware interlocks and checks. And those hardware interlocks and checks, and whatever else that are not software, can work fast enough to prevent the accident happening. So, they can prevent harm. So, that’s semi-autonomous.

Then we’ve got redundant fault-tolerant where you’ve got an architecture typically with more than one channel, and maybe all channels are software controlled. Maybe there’s diversity in the software and there is some fault-tolerant architecture. Maybe a voting system or some monitoring system saying, “Well, Channel Three’s output is looking a bit dodgy” or “Something gone wrong with Channel two”. I’ll ignore the channel at fault, and I’ll take the good output from the channels that are still working and I’ll use that. So that’s that option. Very common.

Then we’ve got number four, which is influential. So, the software is displaying some information for a human to interpret and to accept or reject.

And then we’ve got five, which is no safety impact at all. Now, the problem is this: because it’s very easy to say, “The software just displays some information, it doesn’t do anything”. So, unless a human does something – so we don’t have to worry about the safety implications of that at all. Wrong! Because the human operator may be forced to rely on the software output by circumstances, there may not be time to do anything else. Or the human may not be able to work out what’s going on without using the software output. Or more typically, the humans have just got used to the software generating the correct information or even they interpret it incorrectly.

And a classic example of that was when the American warship, the USS Vincennes, shot down an airliner and killed three hundred people because the way the system was set up, the supposedly not safety-related radar system was displaying information not associated with the airliner, but associated with the with a military Iranian aircraft. And the crew got mixed up and shot down the airliner. So, that’s a risky one. Even though it’s down at number four, that doesn’t mean it’s without risk or without criticality.

Commentary #4

So, if we have the software control category, and that’s down the right-hand side – sorry down the left-hand side, one to five. And along the top, we have the severity category from catastrophic down to negligible. We can use that to determine the software criticality index, which varies from one most critical down to five least critical. It’s similar to the risk assessment code in the table three coloured matrix that I showed you earlier. So, we’ve made – the writers of the standard have made a determination for us based on some assessment that they’ve done saying, “Well, this is this is how we assess these different criticality levels”. Whether there is actually any real-world evidence supporting this assessment, I don’t know and I’m not sure anybody else does either. However, that’s the standard and that’s where we are.

Commentary #5

And so just to finish up on the commentary. Task 208 is focused on software engineering, also programmable electronics, complex hardware, but typically electronics with software functionality or logic functionality embedded within it. Now if all of that software, all that programmable electronic systems, if they’re all developed already, is there any point in doing Task 208? That’s the first step – it’s got to pass the “So what?” test.

Is it feasible to do 208 and expect to get benefits? If not, maybe you just do system and subsystem hazard analysis. That’s tasks 205 and 204, respectively. And we just look at the complex components and subsystems as a black box and say, “OK, what’s it meant to do? What are the interfaces?” Maybe that would be a better thing to do. Particularly bearing in mind that the software or the complex electronic system could be working perfectly well and we still get an accident because there’s been a misunderstanding of the output. Maybe it’s more beneficial to look at those interfaces and think about, “Well, in what scenarios could the human misunderstand? How do we how do we guard against that?”

It’s also worth saying that some particularly American software development standards, can work well with Mil. Standard 882 because they share a similar conceptual basis. For example, I’ve seen many, many times in the air world, the systems software system safety standard is 882 and the systems software standard is DO-178 (AKA ED12, it’s the same standard, just different labels). Now they work relatively well together because the concept underpinning 178 is very similar to 882. It’s American centric.

It’s all about, you put requirements on the software development and – this is sort of a cookbook approach – the standard assumes that if you use the right ingredients and you mix them up in the right way, then you’re going to get a good result. And that’s a similar sort of concept for 882 and the two work relatively well together, fairly consistently. Also because they’re both American, there’s a great focus on software testing. Certainly, in the earlier versions of DO-178, it’s exclusively focused on software testing. Things like source code analysis and other things – more modern techniques that have come in – they’re not recognized at all in earlier versions of 178 because they just weren’t around.

That focus on testing suits 882, because 882, generates lots of requirements and constraints which you need to test. What it’s not so good at is generating cases where you say, “Well if this goes wrong” or “If we’re at the edge of the envelope where we should be, let’s test for those edge of the envelope cases, let’s test that the software is working correctly when it’s outside of the operating envelope that it should be”. Now, that kind of thinking isn’t so strong in 882, nor in 178. So, there are some limitations there. Good practice, experienced practitioners will overcome those by adding in the smarts that the standards lack. But just to be aware, a standard is not smart. You’ve still got to know what you’re doing in order to get the most out of it.

So, maybe you’re buying software that’s predevelopment or that you’re using – you’re not in the States. You’ve got a European or an Asian Indian supplier or Japanese supplier or whatever. Maybe they’re not using American style techniques and standards. Is that – how well is that going to work with 882? Are they compatible? They might be, but maybe they’re not. So, that requires some thought. If they’re not obviously compatible, then what do you need to do to make that translation and make it work. Or at least understand where the gaps are and what you might do about it to compensate?

And I’ve not talked about data, but it is worth mentioning that with data-rich systems these days – and I heard just the other day, is it two quintillion bytes of data being generated every two days or something ridiculous? That was back in 2017. So, gigantic amounts of data being generated these days and used by computing systems, particularly artificial intelligence systems. So, the rigour associated with that data  – the things that we need to think about on data are potentially just as important as the software. Because if the software is processing rubbish data, you’re probably going to get rubbish results. Or at the very least unreliable results that you can’t trust. So, you need to be thinking about all of those attributes of your data; correct, complete, consistent, etc, etc. I mean, I probably need to do a session on that and maybe I will.

Copyright Statement

That’s the presentation. As you can see, everything in italics and quotes is out of the standard, which is copyright free. But this presentation is copyright of the Safety Artisan.

For More…

And you will find many more presentations and a lot more resources at the website www.safetyartisan.com. Also, you’ll find the paid videos on our Patreon page, which is www.patreon.com/SafetyArtisan or go to Patreon and search for the Safety Artisan.

End

Well, that’s the end of our presentation, and it just remains for me to say thanks very much for listening. Thanks for your time and I look forward to seeing you in the next session, Task 209. Looking forward to it. Goodbye.

End

Categories
Mil-Std-882E Safety Analysis

System Hazard Analysis

In this 45-minute session, The Safety Artisan looks at System Hazard Analysis, or SHA, which is Task 205 in Mil-Std-882E. We explore Task 205’s aim, description, scope, and contracting requirements. We also provide value-adding commentary, which explains SHA – how to use it to complement Sub-System Hazard Analysis (SSHA, Task 204) in order to get the maximum benefits for your System Safety Program.

This is the seven-minute-long demo. The full video is 47 minutes long.

System Hazard Analysis: Topics

  • Task 205 Purpose [differences vs. 204];
    • Verify subsystem compliance;
    • ID hazards (subsystem interfaces and faults);
    • ID hazards (integrated system design); and
    • Recommend necessary actions.
  • Task Description (five slides);
  • Reporting;
  • Contracting; and
  • Commentary.
Transcript: System Hazard Analysis

Introduction

Hello, everyone, and welcome to the Safety Artisan, where you will find professional, pragmatic, and impartial safety training resources and videos. I’m Simon, your host, and I’m recording this on the 13th of April 2020. And given the circumstances when I record this, I hope this finds you all well.

System Hazard Analysis Task 205

Let’s get on to our topic for today, which is System Hazard Analysis. Now, system hazard analysis is, as you may know, is Task 205 in the Mil. Standard 882E system safety standard.

Topics for this Session

What we’re going to cover in this session is purpose, task description, reporting, contracting and some commentary – although I’ll be making commentary all the way through. Going to the back to the top, the yellow highlighting with this and with task 204, I’m using the yellow highlighting to indicate differences between 205 and 204 because they are superficially quite similar. And then I’m using underlining to emphasize those things that I want to really bring to your attention and emphasize. Within task 205, purpose. We’ve got four purposes for this one. Verify subsistent compliance and recommend necessary actions – fourth one there. And then in the middle of the sandwich, we’ve got identification of hazards, both between the subsystem interfaces and faults from the subsystem propagating upwards to the overall system and identifying hazards in the integrated system design. So, quite different emphasis to 204 which was really thinking about subsystems in isolation. We’ve got five slides of task description, a couple on reporting, one on contracting – nothing new there – and several commentaries.

System Requirements Hazard Analysis (T205)

Let’s get straight on with it. The purpose, as we’ve already said, there is a three-fold purpose here; Verify system compliance, hazard identification and recommended actions, and then, as we can see in the yellow, the identifying previously unidentified hazards is split into two. Looking at subsystem interfaces and faults and the integration of the overall system design. And you can see the yellow bit, that’s different from 204 where we are taking this much higher-level view, taking an inter subsystem view and then an integrated view.

Task Description (T205) #1

On to the task description. The contract has got to do it and documented, as usual, looking at hazards and mitigations, or controls, in the integrated system design, including software and human interface. It’s very important that we’ll come onto that later. All the usual stuff about we’ve got to include COTS, GOTS, GFE and NDI. So, even if stuff is not being developed, if we’re putting together a jigsaw system from existing pieces, we’ve still got to look at the overall thing. And as with 204, we go down to the underlined text at the bottom of the slide, areas to consider. Think about performance, and degradation of performance, functional failures, timing and design errors, defects, inadvertent functioning – that classic functional failure analysis that we’ve seen before. And again, while conducting this analysis, we’ve got to include human beings as an integral component of the system, receiving inputs, and initiating outputs.  Human factors were included in this standard from long ago.

Task Description (T205) #2

Slide two. We’ve got to include a review of subsystem interrelationships. The assumption is that we’ve previously done task 204 down at a low level and now we’re building up to task 205. Again, verification of system compliance with requirements (A.), identification of new hazards and emergent hazards, recommendations for actions (B.), but Part C is really the new bit. We are looking at possible independent, dependent, and simultaneous events (C.) including system failures, failures of safety devices, common cause failures, and system interactions that could create a hazard or increase risk. And this is really the new stuff in 205 and we are going to emphasize in the commentary, you’re going to look very carefully at those underlying things because they are key to understanding task 205.

Task Description (T205) #3

Moving on to Slide 3, all new stuff, all in yellow. Degradation of the system or the total system (D.), design changes that affect subsystems (E.). Now, I’ve underlined this because what’s the constant in projects? It’s change. You start off thinking you’re going to do something and maybe the concept changes subtly or not so subtly during the project. Maybe your assumptions change the schedule changes, the resources available change. You thought you were going to get access to something, but it turns out that you’re not. So, all these things can change and cause problems, quite frankly, as I am sure we know. So, we need to deal with not just the program as we started out, but the program as it turns out to be – as it’s actually implemented. And that’s something I’ve seen often go awry because people hold on to what they started out with, partly because they’re frightened of change and also because of the work of really taking note changes. And it takes a really disciplined program or project manager to push back on random change and to control it well, and then think through the implications. So, that’s where strength of leadership comes in, but it is difficult to do.

Moving on now. It says effects of human errors (F.) in the blue, I’ve changed that. Human error implies that the human is at fault, that the human made a mistake. But very often, we design suboptimal systems and we just expect the human operator to cope. Whether it’s fair or unfair or unreasonable, it results in accidents. So, what we need to think about more generally is erroneous human action. So, something has gone wrong but it’s not necessarily the humans’ fault. Maybe the system has induced the human to make an error. We need to think very carefully about.

Moving on, determination (G.), potential contribution of all those components in G. 1. As we said before, all the non-developmental stuff. G.2, have design requirements in the specifications being satisfied? This standard emphasizes specifications and meeting requirements, we’ve discussed that in other lessons. G.3 and whether methods of system implementation have introduced any new hazards. Because of course, in the attempted to control hazards, we may introduce technology or plant or substances that themselves can create problems. So, we need to be wary of that.

Task Description (T205) #4

Moving on to slide four. Now, in 205.2.2, the assumption here is that the PM has specified methods to be used by the contractor. That’s not necessarily true, the PM may not be an expert in this stuff. While they may for contractual or whatever reasons have decided we want the contractor to decide what techniques to use. But the assumption here is that the PM has control and if the contractor decides they want to do something different they’ve got to get the PM’s authority to do that. This is assuming, of course, that the this has been specified in the contract.

And 205.2.3, whichever contractor is performing the system hazard analysis, the SHA, they are expected to have oversight of software development that’s going to be part of their system. And again, that doesn’t happen unless it’s contracted. So, if you don’t ask for it, you’re not going to get it because it costs money. So, if the ultimate client doesn’t insist on this in the contract and police it to be fair because it’s all very well asking for stuff. If you never check what you’re getting or what’s going on, you can’t be sure that it’s really happening. As an American Admiral Rickover once said, “You get the safety you inspect”. So, if you don’t inspect it, don’t expect to get anything in particular, or it’s an unknown. And again, if anything requires mitigation, the expectation in the standard is that it will be reported to the PM, the client PM this is and that they will have authority. This is an assumption in the way that the standard works. If you’re not going to run your project like that, then you need to think through the implications of using this standard and manage accordingly.

Task Description (T205) #5

And the final slide on task description. We’ve got another reminder that the contractor performing the SHA shall evaluate design changes. Again, if the client doesn’t contract for this it won’t necessarily happen. Or indeed, if the client doesn’t communicate that things have changed to the contractor or the subcontractors don’t communicate with the prime contractor then this won’t happen. So, we need to put in place communication channels and insist that these things happen. Configuration control, and so forth, is a good tool for making sure that this happens.

Reporting (T205) #1

So, if we move on to reporting, we’ve got two slides on this. No surprises, the contractor shall prepare a report that contains the results from the analysis as described. First, part A, we’ve got to have a system description. Including the physical and functional characteristics and subsystem interfaces. Again, always important, if we don’t have that system description, we don’t have the context to understand the hazard analysis that had been done or not being done for whatever reason. And the expectation is that there will be reference to more detailed information as and when it becomes available. So maybe detailed design stuff isn’t going to emerge until later, but it has to be included. Again, this has got to be required.

Reporting (T205) #2

Moving onto parts B and C. Part B as before we need to provide a description of each analysis method used, the assumptions made, and the data used in that analysis. Again, if you don’t do this, if you don’t include this description, it’s very hard for anybody to independently verify that what has been done is correct, complete, and consistent. And without that assurance, then that’s going to undermine the whole purpose of doing the analysis in the first place.

And then part C, we’ve got to provide the analysis results and at the bottom of this subparagraph is the assumption. The analysis results could be captured in the hazard tracking system, say the hazard log, but I would only expect the sort of leading to be captured in that hazard log. And the detail is going to be in the task 205 hazard analysis report, or whatever you’re calling it. We’ve talked about that before, so I’m not going to get into that here.

Contracting

And then the final bit of quotation from the standard is the contracting. And again, all the same things that you’ve seen before. We need to require the task to be completed. It’s no good just saying apply Mil. Standard 882E because the contractor, if they understand 882E, they will tailor it to suit selves, not the client. Or if they don’t understand 882E they may not do it at all, or just do it badly. Or indeed they may just produce a bunch of reports that have got all the right headings in as the data item description, which is usually supplied in the contract, but there may be no useful data under those headings. So, if you haven’t made it clear to the contractor, they need to conduct this analysis and then report on the results – I know it sounds obvious. I know this sounds silly having to say this, but I’ve seen it happen. You’ve got a contractor that does not understand what system safety is.

(Mind you, why have you contracted them in the first place to do this? You should know that you should have done your research, found out.)

But if it’s new to them, you’re going to have to explain it to them in words of one syllable or get somebody else to do it for them. And in my day job, this is very often what consultancies get called in to do. You’ve got a contractor who maybe is expert building tanks, or planes, or ships, or chemical plants, or whatever it might be, but they’re not expert in doing this kind of stuff. So, you bring in a specialist. And that’s part of my day job.

So, getting back to the subject. Yes, we’ve got to specify this stuff. We’ve got to specify it early, which implies that the client has done quite a lot of work to work this all out. And again, the client may above the line, as we say, say engage a consultant or whoever to help them with this, a specialist. We’ve got to include all of the details that are necessary. And of course, how do you know what’s necessary, unless you’ve worked it out. And you’ve got to supply the contractor, it says concept of operations, but really supplying the contractor with as much relevant data and information as you can, without bogging them down. But that context is important to getting good results and getting a successful program.

Illustration

I’ve got a little illustration here. The supposition in the standard in Task 205 is we’ve got a number of subsystems and there may be some other building blocks in there as well. And some infrastructure we’ve going to have probably some users, we’re going to have an operating environment, and maybe some external systems that our system, or the system of interest, interfaces with or interacts with in some way. And that interaction might be deliberate, or it might be just in the same operating environment at night. And they will interact intentionally or otherwise.

Commentary – Go Early

With that picture in mind, let’s think about some important points. And the first one is to get 205, get some 205-work done early. Now, the implication in the standard by the numbering and when you read the text is that subsystem hazard analysis comes first. You do those hexagonal building blocks first and then you build it up and task 205 comes after the subsystem hazard analysis. You thought, “Well, you’ve already got the SHHAs for each subsystem and then you build the SHA on top”. However, if you don’t do 205 early, you’re going to lose an opportunity to influence the design and to improve your system requirements. So, it’s worth doing an initial pass of 205 first, top-down, before you do the 204 hexagons and then come back up and redo 205. So, the first pass is done early to gain insight, to influence the design, and to improve your requirements, and to improve, let’s say, the prime contractor’s appreciation and reporting of what they are doing. And that’s really, dare I say, a quick and dirty stab at 205 could be quite cheap and will probably the payback/the return on investment should be large if you do it early enough. And of course, act on the results.

And then the second part is more about verifying compliance, verifying those as required interfaces, and looking at emergent stuff, stuff that’s emerged – the devil’s in the detail as the saying goes. We can look at the emerging stuff that’s coming out of that detail and then pull all that together and tidy up it up and look for emergent behaviour.

Commentary – Tools & Techniques

Looking at tools and techniques, most safety analysis techniques that we use look at single events or single failures only in isolation. And usually, we expect those events and failures to be independent. So, there’re lots of analyses out there. Basic fault tree analysis, event tree analysis. Well, event tree is slightly different in that we can think about subsequent failures, but there’re lots of basic techniques out there that will really only deal with a single failure at a time. However, 205.2.1C requires us to go further. We’ve got to think about dependent simultaneous events and common cause failures. And for a large and complex system, each of those can be a significant undertaking. So, if we’re doing task 205 well, we are going to push into these areas and not simply do a copy of task 204, but at a higher level. We’re now really talking about the second pass of 205. The previous, quick and dirty, 205 is done. The task 204 on the subsystems is done. Now we’re pulling it all together.

Dependent & Simultaneous Events

Let’s think about independent simultaneous events. First, dependent failures. Can an initial failure propagate? For example, a fire could lead to an explosion or an explosion could lead to a fire. That’s a classic combination. If something breaks or wears could be as simple as components wearing and then we get debris in the lubrication system. Could that – could the debris from component wear clog up the lubrication system and cause it to fail and then cause a more serious seizure of the overall system? Stuff like that. Or there may be more subtle functional effects. For example, electric effects, if we get a failure in an electrical system or even non-failure events that happen together. Could we get what’s called a sneak circuit? Could we get a reverse flow of current that we’re not expecting? And could that cause unexpected effects? There’s a special technique we’re looking at called sneak circuits analysis. That’s sneak, SNEAK, go look it up if you’re interested. Or could there be multiple effects from one failure? Now, I’ve already mentioned fire. It’s worth repeating again. Fire is the absolute classic. First, the effects of fire. You’ve got the fire triangle. So, to get fire, we need an inflammable substance, we need an ignition source, and we need heat. And without all three, we don’t get a fire. But once we do get a fire, all bets are off, and we can get multiple effects. So, we recall, you might remember from being tortured doing thermodynamics in class, you might remember the old equation that P1V1T1 equals P2V2T2. (And I’ve put R2 that for some reason, so sorry about that.)

What that’s saying is, your initial pressure, volume and temperature multiplied together, P1V1T1, is going to be the same as your subsequent pressure, volume and temperature multiply together, P2V2T2. So, what that means is if you dramatically increase the temperature say, because that’s what a fire does, then your volume and your pressure are going to change. So, in an enclosed space we get a great big increase in pressure, or if we’re in an unenclosed space, we’re going to get an increase in volume in a [gas or] fluid. So, if we start to heat the [gas or] fluid, it’s probably going to expand. And then that could cause a spill and further knock-on effects. And fire, as well as effect making pressure and volume changes to the fluids, it can weaken structures, it makes smoke, and produces toxic gases. So, it can produce all kinds of secondary hazardous effects that are dangerous in themselves and can mess up your carefully orchestrated engineering and procedural controls. So, for example, if you’ve got a fire that causes a pressure burst, you can destroy structures and your fire containment can fail. You can’t send necessarily people in to fix the problem because the area is now full of smoke and toxic gas. So, fire is a great example of this kind of thing where you think, “Well, if this happens, then this really messes up a lot of controls and causes a lot of secondary effects”. So, there’s a good example, but not the only one.

And then simultaneous events, a hugely different issue. What we’re talking about here is we have got undetected, or latent, failures. Something has failed, but it’s not apparent that it’s failed, we’re not aware, and that could be for all sorts of reasons. It could be a fatigue failure. We’ve got something that’s cracked, or it could be thermal fatigue. So, lots of things that can degrade physical systems, make them brittle. For example, an odd one, radiation causes most metals to expand and neutron bombardment makes them brittle. So, it can weaken things, structure and so forth. Or we might have a safety system that has failed, but because we’ve not called upon it in anger, we don’t notice. And then we have a failure, maybe the primary system fails. We expect the secondary system to kick in, but it doesn’t because there’s been some problem, or some knock-on effect has prevented the secondary system from kicking in. And I suspect we’ve all seen that happen.

My own experience of that was on a site I was working on. We had a big electricity failure, a contractor had sawed through the mains electricity cable or dug through it. And then, for some unknown reason, the emergency generators failed to kick in. So, that meant that a major site where thousands of people worked had to be evacuated because there was no electricity to run the computers. Even the old analogue phones failed after a while. Today, those phones would be digital, probably voice over IP, and without electricity, they’d fail instantly. And eventually, without power for the plumbing, the toilets back up. So, you’re going to end up having to evacuate the entire site because it’s unhygienic. So, some effects can be very widespread. Just because you had a late failure, and your backup system didn’t kick in when you expected it to.

So how can we look at that? Well, this is classic reliability modelling territory. We can look at meantime between failures, MTBF, and meantime to repair (MTTR) and therefore we could work out what the exposure time might be. We can work out, “What’s the likelihood of a latent failure occurring?” If we’ve got an interval, presumably we’ve going to test the system periodically. We’ve got to do a proof test. How often do we have to do the proof test to get a certain level of reliability or availability when we need the system to work? And we can look at synchronous and asynchronous events. And to do that, we can use several techniques. The classic ones, reliability, lock diagrams and fault tree analysis. Or if we’ve got repairable systems, we can use Markov chain modelling, which is very powerful. So, we can bring in time-dependent effects of systems failing at certain times and then being required, or systems failing and being repaired, and look at overall availability so that we can get an estimate of how often the overall system will be available. If we look at potential failures in all the redundant constituent parts. Lots of techniques there for doing that, some of them quite advanced. And again, very often this is what safety consultants, this is what we find ourselves doing so.

Common Cause Failures

Common cause failure, this is another classic. We might think about something very obvious and physical, maybe we get debris, maybe we’ve got three sets of input channels guarded by filters to stop debris getting into the system, but what if debris blocks all the filters so we get no flow? So, obvious – I say obvious – often missed sources of sometimes quite major accidents. Or let’s say something more subtle, we’ve got three redundant channels, or a number of redundant channels, in an electronic system and we need two out of three to work, or whatever it might be. But we’ve got the same software working each channel. So, if the software fails systematically, as it does, then potentially all three channels will just fail at the same time.

So, there’s a good example of non-independent failures taking down a system that on paper has a very high reliability but actually doesn’t. Once you start considering common cause failure or common mode analysis. So, really what we would like is we would like all redundancy to be diverse if possible. So, for example, if we wanted to know how much fuel we had left in the aeroplane, which is quite important if you want the engines to keep working, then we can employ diverse methods. We can use sensors to measure how much fuel is in the tanks directly and then we can cross-check that against a calculated figure where we’ve entered, let’s say, how much fuel was in the tanks to start with. And then we’ve been measuring the flow of fuel throughout the flight. So, we can calculate or estimate the amount of fuel and then cross-check that against the actual measurements in the tanks. So, there’s a good diverse method. Now, it’s not always possible to engineer a diverse method, particularly in complex systems. Sometimes there’s only really one way of doing something. So, diversity kind of goes out of the window in such an engineered system.

But maybe we can bring a human in.

So, another classic in the air world, we give pilots instruments in order to tell them what’s going on with the aeroplane, but we also suggest that they look out the window to look at reality and cross-check. Which is great if you’re not flying a cloud or in darkness and there are maybe visual references so you can’t necessarily cross-check. But even things like system failures, can the pilot look out the window and see which propeller has stopped turning? Or which engine the smoke and flames coming out of? And that might sound basic and silly, but there have been lots of very major accidents where that hasn’t been done and the pilots have shut down the wrong engine or they’ve managed the wrong emergency. And not just pilots, but operators of nuclear power plants and all kinds of things. So, visual inspection, going and looking at stuff if you have time, or take some diverse way of checking what’s going on, can be very helpful if you’re getting confusing results from instrument readings or sensor readings.

And those are examples of the terrific power of human diversity. Humans are good at taking different sensory inputs and fusing them together and forming a picture. Now, most of the time they fuse the data well and they get the correct picture, but sometimes they get confused by a system or they get contradictory inputs and they get the wrong mental model of what’s going on and then you can have a really bad accident. So, thinking about how we alert humans, how we use alarms to get humans attention, and how we employ human factors to make sure that we give the humans the right input, the right mental picture, mental model, is very important. So, back to human factors again, especially important, at this level for task 205.

And of course, there are many specialist common cause failure analysis techniques so we can use fault trees. Normally in a fault tree when you’ve got an and gate, we assume that those two sub-events are independent, but we can use ‘beta factors’ (they’re called) to say, “Let’s say event a and event b are not independent, but we think that 50 percent or 10 percent of the time they will happen at the same time”. So, you can put that beta factor in to change the calculation. So, fault trees can cope with non-independent fate is providing you program the logic correctly. You understand what’s going on. And maybe if there’s uncertainty on the beta factors, you must do some sensitivity modelling on the tree with different beta factors. Or you run multiple models of the tree, but again, we’re now talking quantitative techniques with the fault tree, maybe, or semi-quantitative. We’re talking quite advanced techniques, where you would need a specialist who knows what they do in this area to come up with realistic results, that sensitivity analysis. The other thing you need to do is if the sensitivity analysis gives you an answer that you don’t want, you need to do something about that and not just file away the analysis report in a cupboard and pretend it never happened. (Not that that’s ever happened in real life, boys and girls, never, ever, ever. You see my nose getting longer? Sorry, let’s move on before I get sued.)

So other classic techniques. Zonal hazard analysis, it looks at lots of different components in a compartment. If component A blows up, does it take out everything else in that compartment? Or if the compartment floods, what functionality do we lose in there? And particularly good for things like ships and planes, but also buildings with complex machinery. Big plant where you’ve got different stuff in different locations. There’re also things called particular risk analysis where you think of, and these tend to be very unusual things where you think about what a fan blade breaks in a jet engine. Can the jet engine contain the fan blade failure? And if not, where you’ve got very high energy piece of metal flying off somewhere – where does that go? Does that embed itself in the fuselage of the aeroplane? Does it puncture the pressure hull of the aeroplane? Or, as has sadly happened occasionally, does it penetrate and injure passengers? So, things like that, usually quite unusual things that are all very domain or industry specific. And then there are common mode analysis techniques and a good example of a standard that incorporates those things is ARP 4761. This is a civil aircraft standard which looks at those things quite well, for example, there are many others.

Summary

In summary, I’ve emphasized the differences between Task 205 and 204. So, we might do a first pass 205 and 204 where we’re essentially doing the same thing just at different levels of granularity. So, we might do the whole system initially 205, one big hexagon, and then we might break down the jigsaw and do some 204 at a more detailed level. But where 205 is really going to score is in the differences between 204. So instead of just repeating, it’s valuable to repeat that analysis at a higher-level, but really if we go to diversify if we want success. So, we need to think about the different purpose and timing of these analyses. We need to think about what we’re going to get out of going top-down versus bottom-up, different sides of the ‘V’ model let’s say.

We need to think about the differences of looking at internals versus external interfaces and interactions, and we need to think of appropriate techniques and tools for all those things – and, of course, whether we need to do that at all! We will have an idea about whether we need to do that from all the previous analysis. So, if we’ve done our PHI or PHA, we’ve looked at the history and some simple functional techniques, and we’ve involved end-users and we’ve learnt from experience. If we’ve done our early tasks, we’re going to get lots of clues about how much risk is present, both in terms of the magnitude of the risk and the complexity of the things that we’re dealing with. So, clearly, we’ve got a very complex thing with lots of risks where we could kill lots of people, we’re going to do a whole lot more analysis than for a simple low-risk system. And we’re going to be guided by the complexity and risks and the hot spots where they are and go “Clearly, I’ve got a particular interface or particular subsystem, which is a hotspot for risk. We’re going to concentrate our effort there”. If you haven’t done the early analysis, you don’t get those clues. So, you do the homework early, which is quite cheap and that helps you. Direct effort, the best return on investment.

The Second major bullet point, which I talk about this again and again. That the client and end-user and/or the prime contractor need to do analysis early in order to get the benefits and to help them set requirements for lower down the hierarchy and pass relevant information to the sub-contractors. Because the sub-contractors, if you leave them in isolation, they’ll do a hazard analysis in isolation, which is usually not as helpful as it could be. You get more out of it if you give them more context. So really, the ultimate client, end-user, and probably the prime as well, both need to do this task, even if they’re subcontracting it to somebody else. Whereas, maybe the sub-system hazard analysis 204 could be delegated just down to the sub-system contractors and suppliers. If they know what they’re doing and they’ve got the data to do it, of course. And if they haven’t, there’s somebody further up the food chain on the supply chain may have to do that.

And lastly, 204 and 205 are complimentary, but not the same. If you understand that and exploit those similarities and differences, you will get a much more powerful overall result. You’ll get synergy. You’ll get a win-win situation where the two different analyses complement, reinforce each other. And you’re going to get a lot more success probably for not much more money and effort time. If you’ve done that thinking exercise and really sought to exploit the two together, then you’re going to get a greater holistic result.

Copyright

So, that’s the end of our session for today. Just a reminder that I’ve quoted from the Mil. Standard 882, which is copyright free, but the contents of this presentation are copyright Safety Artisan, 2020.

For More …

And for more lessons and more resources, please do visit www.safetyartisan.com and you can see the videos at www.patreon.com/safetyartisan.

End

That’s the end of the lesson on system hazard analysis task 205. And it just reminds me to say thanks very much for watching and look out for the next in the series of Mil. Standard 882 tasks. We will be moving on to Task 206, which is Operating and Support Hazard Analysis (OSHA), a quite different analysis to what we’ve just been talking. Well, thanks very much for watching and it’s goodbye from me.

The End

You can find a free pdf of the System Safety Engineering Standard, Mil-Std-882E, here.

Categories
Mil-Std-882E Safety Analysis

Sub-System Hazard Analysis

In this video lesson, The Safety Artisan looks at Sub-System Hazard Analysis, or SSHA, which is Task 204 in Mil-Std-882E. We explore Task 204’s aim, description, scope, and contracting requirements. We also provide value-adding commentary and explain the issues with SSHA – how to do it well and avoid the pitfalls.

This is the seven-minute demo, the full video is 40-minutes’ long.

Topics: Sub-System Hazard Analysis

  • Preamble: Sub-system & System HA.
  • Task 204 Purpose:
    • Verify subsystem compliance;
    • Identify (new) hazards; and
    • Recommend necessary actions.
  • Task Description (six slides);
  • Reporting;
  • Contracting; and
  • Commentary.
Transcript: Sub-System Hazard Analysis

Introduction

Hello, everyone, and welcome to the Safety Artisan, where you will find professional, pragmatic, and impartial instruction on all things system safety. I’m Simon – I’m your host for today, as always and it’s the fourth of April 22. With everything that’s going on in the world, I hope that this video finds you safe and well.

Sub-System Hazard Analysis

Let’s move straight on to what we’re going to be doing. We’re going to be talking today about subsystem hazard analysis and this is task 204 under the military standard 882E. Previously we’ve done 201, which was preliminary hazard identification, 202, which is preliminary hazard analysis, and 203, which is safety requirements hazard analysis. And with task 204 and task 205, which is system has analysis, we’re now moving into getting stuck into particular systems that we’re thinking about, whether they be physical systems or intangible. We’re thinking about the system under consideration and I’m really getting into that analysis.

Topics for this Session

So, the topics that we’re going to cover today, I’ve got a little preamble to set things in perspective. We then get into the three purposes of task 204. First, to verify compliance. Secondly, to identify new hazards. And thirdly, to recommend necessary actions. Or in fact, that would be recommend control measures for hazards and risks. We’ve got six slides of task description, a couple of slides on reporting, one on contracting, and then a few slides on some commentary where I put in my tuppence worth and I’ll hopefully add some value to the basic bones of the standard. It’s worth saying that you’ll notice that subsystem is highlighted in yellow and the reason for that is that the subsystem and system hazard analysis tasks are very, very similar. They’re identical except for certain passages and I’ve highlighted those in yellow. Normally I use a yellow highlighter to emphasize something I want to talk about. This time around, I’m using underlining for that and the yellow is showing you what these different for subsystem analysis as opposed to system. And when you’ve watched both sessions on 204 and 205, I think you’ll see the significance of why I’ve done.

Preamble – Sub-system & System HA

Before we get started, we need to explain the system model that the 882 is assuming. If we look on the left-hand side of the hexagons, we’ve got our system in the centre, which we’re considering. Maybe that interfaces with other systems. They work within operating environment; hence we have the icon of the world, and the system and maybe other systems are there for a purpose. They’re performing some task; they’re doing some function and that’s indicated by the tools. We’re using the system to do something, whatever it might be.

Then as we move to the right-hand side, the system is itself broken down into subsystems. We’ve got a couple here. We’ve got sub-system A and B and then A further broken down into A1 and A2, for example. There’s some sort of hierarchy of subsystems that are coming together and being integrated to form the overall system. That is the overall picture that I’d like to bear in mind while we’re talking about this. The assumption in the 882, is we’re going to be looking at this subsystem hierarchy bottom upwards, largely. We’ll come on to that.

System Requirements Hazard Analysis (T204)

Purpose of the task, as I’ve said before, it’s threefold. We must verify subsystem compliance with requirements. Requirements to deal with risk and hazards. We must identify previously unidentified hazards which may emerge as we’re working at a lower level now. And we must recommend actions necessary. That’s further requirements to eliminate all hazards or mitigate associated risks. We’ll keep those three things in mind and that will keep coming up.

Task Description (T204) #1

The first of six slides on the task description. Basically, we are being told to perform and document the SSHA, sub-system hazard analysis. And it’s got to include everything, whether it be new developments, COTS, GOTS, GFE, NDI, software and humans, as we’ll see later. Everything must be included. And we’re being guided to consider the performance of the subsystem: ‘What it is doing when it is doing it properly’. We’ve got to consider performance degradation, functional failures, timing errors, design errors or defects, and inadvertent functioning – we’ll come back to that later. And while we’re doing analysis, we must consider the human as a component within the subsystem dealing with inputs and making outputs. If, of course, there is an associated human. We’ve got to include everything, and we’ve got to think about what could go wrong with the system.

Task Description (T204) #2

The minimum that the analysis has got to cover is as follows. We’ve got to verify subsystem compliance with requirements and that is to say, requirements to eliminate hazards or reduce risks. The first thing to note about that is you can’t verify compliance with requirements if there are no requirements. if you haven’t set any requirements on the subsystem provider or whoever is doing the analysis, then there’s nothing to comply with and you’ve got no leverage if the subsystem turns out to be dangerous. I often see it as it gets missed. People don’t do their top-down systems engineering properly; They don’t think through the requirements that they need; and, especially, they don’t do the preliminary hazard identification and analysis that they need to do. They don’t do Task 203, the SRHA, to think about what requirements they need to place further down the food chain, down the supply chain. And if you haven’t done that work, then you can’t be surprised if you get something back that’s not very good, or you can’t verify that it’s safe. Unfortunately, I see that happen often, even on exceptionally large projects. If you don’t ask, you don’t get, basically.

We’ve got two sub-paragraphs here that are unique to this task. First, we’ve got to validate flow down of design requirements. “Are these design requirements valid?”, “Are they the right requirements?” From the top-level spec down to more detailed design specifications for the subsystem. Again, if you haven’t specified anything, then you’ve got no leverage. Which is not to say that you have to dive into massive detail and tell the designer how to do their job, but you’ve got to set out what you want from them in terms of the product and what kind of process evidence you want associated with that product.

And then the second sub-paragraph, you’ve got to ensure design criteria in the subsystem specs have been satisfied. We need to verify that they’re satisfied, and that V and V of subsystem mitigation measures or risk controls have been included in test plans and procedures. As always, the Mil. standard 882 is the American standard, and they tend to go big on testing. Where it says test plans and procedures that might be anything – you might have been doing V and V by analysis, by demonstration, by testing, by other means. It’s not necessarily just testing, but that’s often the assumption.

Task Description (T204) #3

We must also identify previously unidentified hazards because we are now down at a low level of detail in a subsystem and stuff probably will emerge at that level that wasn’t available before. First, number one, we’ve got to ensure the implementation of subsystem design requirements and controls. And ensure that those requirements and controls have not introduced any new hazards, because very often accidents occur. Not because the system has gone wrong – the system is working as advertised – but the hazards with normal operation maybe just weren’t appreciated and guarded against or we just didn’t warn the operators that something might happen that they needed to look out for. A common shortfall, I’m afraid.

And number two, we’ve got to determine modes of failure down to component failure and human errors, single points of failure, common-mode failures, effects when failures occur in components, and from functional relationships. “What happens if something goes wrong over on this side of the system or subsystem and something else is happening over here?” What are those combinations? What could result? And again, we’ve got to consider hardware and software, including all non-developmental type stuff, and faults, and occurrences. Again, I see very often, buyers/purchases don’t think about the off the shelf stuff in advance or don’t include it. And then sometimes also you see contractors going “This is off the shelf, so we’re not analysing it.” Well, the standard requires that they do analyse it to the extent practicable. And they’ve got to look at what might go wrong with all of this non-development to stuff and integrate the possible effects and consider. That’s another common gotcha, I’m afraid. we do need to think about everything, whether it’s developmental or not.

Task Description (T204) #4

And then part C, recommending actions necessary to eliminate hazards if we can. Very often we can’t, of course, and we have to mitigate. We must reduce or minimize the associated risk of those hazards. In terms of the harm that might come to people. We’ve got to ensure that system-level hazards, it says attributed. Maybe we believe when we did the earlier analysis that the subsystem could contribute to a higher-level hazard, or maybe we’ve allocated some failure budget to this particular subsystem, which it has got to keep to if we’re going to meet the higher-level targets. You can imagine lots of these subsystems all feeding up a certain failure rate and different failure modes. And overall, when you pull it all together, we may have to meet some target or reduce the number of failures in their propagation upwards in order to manage hazards and risks. We’ve got to make sure that we’ve got adequate mitigation controls of these potential hazards are implemented in the design.

If we think back to the hierarchy, we prefer to fix things in the design, eliminate the hazard if possible, or make changes to the design to eliminate or reduce the hazard, rather than just rely on human beings to catch the problem and deal with it further downstream. It’s far more effective and cheaper, in the long run, to fix things in design they are more effective controls. Certainly, in this standard in Australian law, and in the UK and elsewhere, you will find either regulations or law or codes of practice or recognized and accepted good practice that says, “You should do this”. It’s a very, very common requirement and we should pretty much assume that we have to do this.

Task Description (T204) #5

Interesting clause here in 2.2, it says if no specific hazard analysis techniques are directed or the contractor wants to take a different route to what is directed, then they’ve got to obtain approval from the program manager. If the PM (Project Manager) hasn’t specified analysis techniques, and they may not wish to, they may just wish to say you’ll do whatever analysis is required in order to identify hazards and mitigate them. But in many industries, there are certain ways of doing things and I’ve said before in previous lessons, if you don’t specify that you want something, then contractors will very often cut the safety program to the bone in order to be the cheapest bid. the customer will get what they prioritize. If the customer prioritizes a cheap bid and doesn’t specify what they want, then they will get the bare minimum that the contractor thinks they can get away with. If you don’t ask, you don’t get – Becoming a theme that isn’t it?

Task Description (T204) #6

Let’s move on to 2.3. Returning to software, we’ve got to include that. The software might be developed separately, but nevertheless, the contractor performing the SSHA shall monitor the software development, shall obtain data from each phase of the software development process in order to evaluate the contribution of the software to the subsystem hazard analysis. There’s no excuse for just ignoring the software and treating it as a black box. Of course, very often these days the software is already developed. It’s a GFE or NDI item, but there still should be evidence available or you do a black-box analysis of the subsystem that the software is sitting in. Again, if the software developer reports any identified hazards, they’ve got to be reported to the program manager in order to request appropriate direction.

This assumes a level of interaction between the software developers right up the chain to the program manager. Again, this won’t happen unless the program manager directs it and pays for it. If the PM doesn’t want to pay for it, then they are either going to have to take a risk on not knowing about the functionality of the software that’s hidden within the subsystem. Or they’re going to deal with it some other way, which is often not effective. The PM needs to do a lot of work upfront in order to think what kind of problems there might be associated with a typical subsystem of whatever kind it is we’re dealing with. And think about “How would I deal with the associated risks?” “What’s the best way to deal with them in the circumstances?” If I’m buying stuff off the shelf and I’m not going to get access to hazard analysis or other kinds of evidence, how am I going to deal with them? Big questions.

And then 2.4, the contractor shall update the SSHA following changes, including software design changes. Again, we can’t just ignore those things.  That’s slide six out of six. Let’s move on to reporting.

Reporting (T204) #1

The first slide, contractor’s got to prepare a report that contains results from the task, including within the system description, physical and functional characteristics of the system, a list of the subsystems, and a detailed description of the subsystem being analysed, including its boundaries. And from other videos, you’ll know how much and how often I emphasize knowing where the boundaries are because you can’t really do effective safety analysis and safety management on an unbounded system. It just doesn’t work. There’s a requirement here for quite a lot of information reference to more detailed descriptions as they become available. The standard says they shall be supplied. That’s a lot of information that probably texts and pictures of all sorts of stuff and that’s going to need to go into a report. And typically, we would expect to see a hazard analysis report or a HAR with this kind of information in it. Again, if the PM/customer doesn’t specify that HAR, then they’re not going to get it and they’re not going to get textual information that they need to manage the overall system.

Reporting (T204) #2

So, if we move on to parts B and C of the reporting requirement. We’ve got to describe hazard analysis methods and techniques, provide a description of each method and the technique used, and a description of the assumptions made. And it says for each qualitative or quantitative data. This is another area that often gets missed. If you don’t know what techniques have been used and you don’t know the assumptions that almost certainly that subsystem analyser will have to make because they probably don’t have visibility in the rest of the system. If you don’t have that information, it becomes very difficult to verify the hazard analysis work and to have confidence in it.

And the hazard analysis results. Content and format vary. Something else the PM is going to think about and specify upfront. Then results should be captured the hazard tracking system. Now, usually, this hazard tracking system is hazard log. It might be a database, a spreadsheet or even a word document, or something like that. And usually, in the hazard tracking system, we have the leading particulars. We don’t always have, in fact, we shouldn’t have, every little piece of information in the hazard tracking system because it will quickly become unwieldy. Really, we want the hazard log to have the leading particulars of all the hazards, causes, consequences and controls. And then the hazard log should refer out to that hazard analysis report or other reports and data, whatever they’re called, other records.

If we go back up, this reemphasizes the kind of detail that’s here in 2.5 A. That really shouldn’t be going in the hazard log. That should be going in a separate report which the hazard log/the hazard tracking system refers to. Otherwise, it all gets that unwieldy.

Contracting

I’ve said repeatedly the PM needs to think about this and ask for that.

Contracting; The standard assumes that the information in A to H below is specified way up front in the request for proposal. That’s not always possible to do in full detail, but nevertheless, you’ve got to think about these things really early and include them in the contractual documentation. And again, if your if you’re running a competition, by the time you get to the final RFP, you need to make sure that you’re asking for what you really need. maybe run a preliminary expression of interest or pre-competition exercise in order to tease out, detect. We’ve got to impose task 204 (A.) as a requirement. We may have to specify which people we want to involve, which functional specialists, which discipline specialists (B.). We want to get involved to address this work. Identification of subsystems to be analysed (C.). Well, if you don’t know what the design is upfront, we can’t always do that, but you could say all.

You may specify desired analysis methodologies and techniques (D.). And again, that’s largely domain dependent. We tend to do safety in certain ways in different worlds, in the air world is done in a particular way. in the maritime world, it’s a different way. With Road or Off-Road Vehicle, it’s done in a particular way, etc, etc., whatever it might be. Chemical plant, whatever. If they’re known hazards, hazardous areas or other specific items be examined or excluded (E.) because they’re covered adequately elsewhere. The PM or the client has got to provide technical data on all those non-development developmental items (F.), particularly if they’re specifying that the contractor will use them. If the client says “You will use this. You will use these tires, therefore, this data with these tires” or whatever it might be, you’re going to – we want a system that’s going to use to standardized spares of standardized fuel or whatever it might be or is maintainable by technicians and mechanics with these standard skill sets. There may be all sorts of reasons for asking or forcing contracts to do certain things, in which case the purchaser is responsible for providing that data.

And again, many purchases forget to do that entirely or do it very badly, and then that can cripple a safety program. What’s the concept of operations (G.). What are we going to do with this stuff? What’s the context? What’s the big picture? That’s important. And any other specific requirements (H.). What risk matrix? What risk definitions are we using on this program? Again, important otherwise, different contractors do their own thing, or they do nothing at all. And then the client must pick up pieces afterwards, which is always time-consuming and expensive and painful. And it tends to happen at the back of a program when you’re under time pressure anyway. It’s never a happy place to be. do make sure that clients and purchases that you’ve done your homework and specified this stuff upfront, even if it turns out to be not the best thing you could have specified, it’s better to have an 80 percent solution that’s pretty standard and locked down.

Commentary #1

That’s the wording that’s in the task with some commentary by myself. Now some additional commentary. It says right up front, areas to consider include performance, performance degradation, functional failures, timing errors, design errors or defects and inadvertent function. What we have here basically is a causal analysis, there will be some simple techniques that you can use to identify this kind of stuff. Something like a functional failure analysis or a failure modes effects analysis, which is like an FFA, but an FMEA requires design to work on. And, FMEA, a variant of FME is FMECA, where we include the criticality of the failure as it possibly propagates out the hierarchy of the system.

These sorts of techniques will think about what could go wrong, no function when required, inadvertent function – the subsystem functions when it’s not supposed to – and incorrect function, and there’s often multiple versions of incorrect function. considering all of those causes, all of those failure modes and if we’re doing a big safety program on something quite critical, very often the those identified faults and failures and failure modes will feed into the bottom of a fault tree where we have a hierarchical build-up of causation and we look at how redundancy and mitigation and control measures mitigate those low-level failures and hopefully prevent them from becoming full-blown incidents and accidents.

And these techniques, particularly the FFA in the FMEA, are also good for hazard identification and for investigating performance and non-compliance issues. you can apply an FFA and FMEA those type of techniques to a specification and say “We’ve asked for this. What could happen if we get what we ask for?” What could go wrong? And, what could go wrong with these requirements?

Commentary #2

Now, the second part that I’ve chosen to highlight a consideration of the human within a subsystem and this is important. Traditionally, it’s not always been done that well. Human factors, I’m glad to say is becoming more prominent and more used both because in many, many systems, human is a key component, is a key player in the overall system. And in the past, we have tended to build systems and then just expect the human operator and maintainer to cope with the vicissitudes of that system. maybe the system isn’t that well designed in terms of it is not very usable, its performance depends on being lovingly looked after and tweaked and maybe systems are vulnerable to human error, and even induce human error. We need to get a lot better at designing systems for human use.

So, we could use several techniques. We could use a HAZOP, a hazard analysis operability study to consider information flows to and from the human. There are lots of specialist human factors analyses out there. And I’m hoping to run a series of human factors sessions, interviewing a very knowledgeable colleague of mine but more on that later. that will come in due course. We’ll look at those specialist human analysis techniques. But there’s been a couple of conceptual models around for quite a long time, about 20 years now at least, for how to think about humans in the system.

Human-System Models

So, we’ve got a 5M model and the SHELL model. I’m just going to briefly illustrate those. Now, both models are taken from the US Federal Aviation Authority System Safety Handbook, which dates to the end of 2000. These have been around a long time and they were around before the year 2000, and they’re quite long in the tooth.

We’ve got the SHELL model, which considers our software, hardware, environments and live-ware – the human. And there’s quite a nice checklist on Wikipedia for things to consider. We’re considering all the different interfaces between those different elements. That’s at the hyperlink you can see at the bottom of the slide.

Then on the right-hand side, we’ve got the 5M model and apologies for the gendered language. Where the five Ms are the man/the human, the machine, management, the media – and the media is the environment for operating and maintenance environment – and then in the middle is the mission. the humans, the machines, the systems, and the management come together in order to perform a mission within a certain environment. that’s another very useful way of conceptualizing our contribution of humans and interaction between human and system. Human operators usually are maintainers, frontline staff, and management, all in a particular operating environment and environmental context and how they come together to accomplish the mission or the function of the system, whatever it might be.

Now a word of caution, on this. It’s very possible to spend gigantic sums of money on human analysis. very often we tend to target it at the most critical points and we very often target it at the operator, particularly for those phases of operation where the operator must do things in a limited amount time. the operator will be under pressure and if they don’t take the right action within a certain time, something could go wrong. we do tend to target this analysis in those areas and tend to spend money hopefully in a sensible and targeted way.

Commentary #3

My final slide on the additional commentary. The other things we’ve talked about for this task, compliance checking. We should get a subsystem specification. If we don’t get a subsystem specification, well, what are the expectations on the subsystem? Are they documented anywhere? Is it in the consent to box? Is there an interface requirement document or are there interface control documents for other systems that or subsystems that interface with our subsystem – anywhere where we can get information. if we have a subsystem spec, a bunch of functional requirements say, early on we could do a functional failure analysis of those functional requirements. we can do this work really quite early if we need to and think about, “Well, what interfaces are expected or required from our subsystem?” versus “What is our subsystem actually do?” any mismatches that could give rise to problems.

So, this is a type of activity where we’re looking for continuity and we’re looking for coherence across the interface. And we’re looking for things to join up. And if they don’t join up or they’re mismatched, then there’s a potential problem. And, as we look down into the subsystem, are there any derived safety requirements from above that says this subsystem needs to do this or not do that in order to manage a hazard? Those are important to identify.

Again, if it’s not been done probably the subsystem contractor won’t do it because it’s extra expense. And they may well truly believe that they don’t need to. We’re all proud of the things that we do, and we feel sometimes emotionally threatened if somebody suggests a piece of kit might go wrong and it does blind people to potential problems.

If going the other way, we are a higher-level authority where a system prime contractor or something and we’ve got to look at the documentation from a subsystem supplier. Well, we might find out some information from sales brochures or feature lists, or there might be a description of the benefits or the functions of the system with its outputs. We hopefully should be able to get hold of some operating and maintenance manuals. And very often those manuals will contain warnings and cautions and say, “You must look after the piece of cake by doing this”. I’m thinking the gremlins now “Don’t feed it after midnight or get it wet” otherwise bad things will happen. Sorry about that, slightly fatuous example, but a good illustration, I think. And ideally, if there’s any training materials associated with the piece of kit, is there a training needs analysis that shows how the training was developed? It’s very often in a TNA if it’s done well, there’s lots of good information in there. Even if it’s not quite for the same application that weighs in the piece of kit for, you could learn a lot from that kind of stuff

And finally, if all else fails, if you’ve got a legacy piece of kit, then you can physically inspect it. And if you can take it apart, put it back together again – do so. You might discover there’s asbestos in it. You might discover that lithium batteries or whatever it might be, fire hazards, flammable materials, toxic materials, you name it. there’s a lot of ways that we can get information about the subsystem. Ideally, we ask for everything upfront. Say, you know, if there’s any hazardous chemicals in there, then you must provide the hazard sheets and the hazard data in accordance with international or national standards and so on and so forth. But if you can’t get that or you haven’t asked for it, there are other ways of doing it, but they’re often time-consuming and not the optimal way of doing it.

So again, do think about what you need upfront and do ask for it. And if the contractor can’t supply exactly what you want, what you need, you then have to decide whether you could live with that, whether you could use some of these alternative techniques or whether you just have to say, “No, thanks. I’ll go to another supplier of something similar”. And I may have to pay more for it, but I’ll get a better-quality product that actually comes with some safety evidence that means I can actually integrate it and use it within my system. Sometimes you do have to make some tough decisions and the earlier we do those tough decisions the better, in my experience.

Copyright Statement

So that’s all the technical content. Just to say that all the text that’s in italics and in speech marks is from the standard, which is copyright free. But this presentation, and especially all the commentary and the added value, is copyright at the Safety Artisan 2020.

For More …

And if you want more videos like this, rest in the 882 series and other resources on safety topics, you can find them at the website www.safetyartisan.com. And you can also go to the safety artisan page at Patreon. that’s www.pateron.com and search for Safety Artisan – all one word.

End

So, that’s the end of the presentation and it just remains for me to say, thanks very much for watching and supporting the Safety Artisan. And I’ll be doing Task 205 system hazard analysis next in the series, look forward to seeing you again soon. Bye-bye, everyone.

End: Sub-System Hazard Analysis

You can find a free pdf of the System Safety Engineering Standard, Mil-Std-882E, here.

Categories
Mil-Std-882E Safety Analysis

Preliminary Hazard Analysis

In this 45-minute session, The Safety Artisan looks at Preliminary Hazard Analysis, or PHA, which is Task 202 in Mil-Std-882E. We explore Task 202’s aim, description, scope and contracting requirements. We also provide value-adding commentary and explain the issues with PHA – how to do it well and avoid the pitfalls.

This is the seven-minute-long demo video. The full video is 45 minutes’ long.

Topics: Preliminary Hazard Analysis

  • Task 202 Purpose;
  • Task Description;
  • Recording & Scope;
  • Risk Assessment (Tables I, II & III);
  • Risk Mitigation (order of preference);
  • Contracting; and
  • Commentary.

Transcript: Preliminary Hazard Analysis

Transcript: Preliminary Hazard Analysis

Hello and welcome to the Safety Artisan, where you’ll find professional, pragmatic and impartial safety training resources. So, we’ll get straight on to our session and it is the 8th February 2020. 

Preliminary Hazard Analysis

Now we’re going to talk today about Preliminary Hazard Analysis (PHA). This is Task 202 in Military Standard 882E, which is a system safety engineering standard. It’s very widely used mostly on military equipment, but it does turn up elsewhere.  This standard is of wide interest to people and Task 202 is the second of the analysis tasks. It’s one of the first things that you will do on a systems safety program and therefore one of the most informative. This session forms part of a series of lessons that I’m doing on Mil-Std-882E.

Topics for This Session

What are we going to cover in this session? Quite a lot! The purpose of the task, a task description, recording and scope. How we do risk assessments against Tables 1, 2 and 3. Basically, it is severity, likelihood and the overall risk matrix.  We will talk about all three, about risk mitigation and using the order of preference for risk mitigation, a little bit of contracting and then a short commentary from myself. In fact, I’m providing commentary all the way through. So, let’s crack on.

Task 202 Purpose

The purpose of Task 202, as it says, is to perform and document a preliminary hazard analysis, or PHA for short, to identify hazards, assess the initial risks and identify potential mitigation measures. We’re going to talk about all of that.

Task Description

First, the task description is quite long here. And as you can see, I’ve highlighted some stuff that I particularly want to talk about.

It says “the contractor” [does this or that], but it doesn’t really matter who is doing the analysis, and actually, the customer needs to do some to inform themselves, otherwise they won’t really understand what they’re doing.  Whoever does it needs to perform and document PHA. It’s about determining initial risk assessments. There’s going to be more work, more detailed work done later. But for now, we’re doing an initial risk assessment of identified hazards. And those hazards will be associated with the design or the functions that we’re proposing to introduce. That’s very important. We don’t need a design to do this. We can get in early when we have user requirements, functional requirements, that kind of thing.

Doing this work will help us make better requirements for the system. So, we need to evaluate those hazards for severity and probability. It says based on the best available data. And of course, early in a program, that’s another big issue. We’ll talk about that more later. It says including mishap data as well, if accessible: American term mishap, it means accident, but we’re avoiding any kind of suggestion about whether it is accidental or deliberate.  It might be stupidity, deliberate, whatever. It’s a mishap. It’s an undesirable event. We look for accessible data from similar systems, legacy systems and other lessons learned. I’ve talked about that a little bit in Task 201 lesson about that, and there’s more on that today under commentary. We need to look at provisions, alternatives, meaning design provisions and design alternatives in order to reduce risks and adding mitigation measures to eliminate hazards. If we can all reduce associated risk, we need to include all of that. What’s the task description? That’s a good overview of the task and what we need to talk about.

Reading & Scope

First, recording and scope, as always, with these tasks, we’ve got to document the results of the PHA in a hazard tracking system. Now, a word on terminology; we might call hazard tracking system; we might call it hazard log; we might call it a risk register. It doesn’t really matter what it’s called. The key point is it’s a tracking system. It’s a live document, as people say, it’s a spreadsheet or a database, something like that. It’s something relatively easy to update and change. And, we can track changes through the safety program once we do more analysis because things will change. We should expect to get some results and to refine them and change them as time goes on. Very important point.

Scope #1

Scope. Big section this. Let me just check. Yes, we’ve got three slides on scope. This does go on and on. The scope of the PHA is to consider the potential contribution from a lot of different areas. We might be considering a whole system or a subsystem, depending on how complex the thing is we’re dealing with. And we’re going to consider mishaps, the accidents and incidents, near misses, whatever might occur from components of the system (a. System components), energy sources (b. Energy sources), ordnance (c. Ordnance)- well that’s bullets and explosives to you and me, rockets and that kind of stuff.

Hazardous materials (d. Hazardous Materials (HAZMAT)), interfaces and controls (3. Interfaces and controls), interface considerations to other systems (f. Interface considerations to other systems when in a network or System-of-Systems (SoS) architecture), external systems. Maybe you’ve got a network of different systems talking to each other. Sometimes that’s called a system of systems architecture. Don’t worry about the definitions. Our system probably interacts and talks to other systems, or It relies on other systems in some way, or other systems rely on it. There are external interfaces. That’s the point.

Scope #2

We might think about material compatibilities (g. Material Compatibilities) – Different materials and chemicals are not compatible with others-, inadvertent activation (h. Inadvertent activation).

Now, I’ve highlighted I. (Commercial-Off-the-Shelf (COTS), Government-Off-the-Shelf (GOTS), Non-Developmental Items (NDIs), and Government-Furnished Equipment (GFE).) because it’s something that often gets neglected. We also need to think about stuff that’s already been developed. The general term is NDIs and it might be commercial off the shelf, it might be a government off the shelf system, or government-furnished equipment  GFE- doesn’t really matter what it is. These days, especially, very few complex systems are developed purely from scratch. We try and reuse stuff wherever we can in order to keep costs down and schedule down.

We’re going to need to integrate all these things and consider how they contribute to the overall risk picture. And as I say, that’s not often done well. Well, it’s hardly ever done well. It’s often not done at all. But it needs to be, even if only crudely. That’s better than nothing.

J. (j. Software, including software developed by other contractors or sources.  Design criteria to control safety-significant software commands and responses (e.g., inadvertent command, failure to command, untimely command or responses, and inappropriate magnitude) shall be identified, and appropriate action shall be taken to incorporate these into the software (and related hardware) specifications)  we need to include software, including software developed elsewhere. Again, that’s very difficult, often not done well. Software is intangible. If somebody else has developed it maybe we don’t have the rights to see the design, or code, or anything like that. Effectively it’s a black box to us. We need to look at software. I’m not going to bother going through all the blurb there.

Another big thing in part k (k.  Operating environment and constraints) is we need to look at the operating environment. Because a piece of kit that behaves in a certain way in one environment, you put it in a different environment and it behaves differently. And it might become much more dangerous. You never know. And the constraints that we put under on system. Operating environment is very big. And in fact, if you see the lesson I did on the definition of safety, we can’t really define whether a system is safe or not until we define the operating environment. It’s that important, a big point there.

Scope #3

And then third slide of three procedures (l. Procedures for operating, test, maintenance, built-in-test, diagnostics, emergencies, explosive ordnance render-safe and emergency disposal). Again, these are well these often don’t appear until later unless of course, we’ve gone off the shelf system. But if we have got off the shelf system; there should be a user manual, there should be maintenance manuals, there should be warnings and cautions, all this kind of stuff. So, we should be looking for procedures for all these things to see what we could learn from them. We want to think about the different modes (m. Modes) of operation of the system. We want to think about health hazards (n. Health hazards) to people, environmental impacts (o. Environmental Impacts), because they take to includes environmental.

We need to think about human factors, human engineering and human error analysis (p. Human factors engineering and human error analysis of operator functions, tasks, and requirements). And it says operator function tasks and requirements, but there’s also maintenance and disposal of storage. All the good stuff. Again, Human Factors is another big issue. Again, it’s not often done well, but actually, if you get a human factor specialist statement early, you can do a lot of good work and save yourself a lot of money, and time, and aggravation by thinking about things early on.

We need to think about life support requirements (q.  Life support requirements and safety implications in manned systems, including crash safety, egress, rescue, survival, and salvage). If the system is crewed or staffed in some way, I’m thinking about, well, ‘What happens if it crashes?’ ‘How do we get out?’ ‘How do we rescue people?’ ‘How do we survive?’ ‘How do we salvage the system?’

 Event-unique hazards (r. Event-unique hazards). Well, that’s kind of a capsule for your system does something unusual. If it does something unusual you need to think about it.

And then thinking about part s. infrastructure (s.  Built infrastructure, real property installed equipment, and support equipment), property installed equipment and support equipment in property and infrastructure.

And then malfunctions (t. Malfunctions of the SoS, system, subsystems, components, or software) of all the above.

I’m just going to whizz back and forth. We’ve got to sub-item T there. We’ve got an awful lot of stuff there to consider. Now, of course, this is kind of a hazard checklist, isn’t it? It’s sort of a checklist of things. We need to look at all that stuff. And in that respect, that’s excellent, and we should aim to do something on all of them just to see if they’re relevant or not if nothing else. The mistake people often make is because they can’t do something perfect and comprehensive, they don’t do anything. We’ve got a lot of things to go through here. And it’s much better to have a go at all these things early and do a bit of rough work in order to learn some stuff about our system. It’s much better to do that than to do nothing at all. And with all of these things, it may be difficult to do some of these things, the software, the COTS, things where we don’t have access to all the information, but it’s better to do a little bit of work early than to do nothing at all waiting for the day to arrive when we’ll be able to do it perfectly with only information. Because guess what? That day never comes! Get in and have a go at everything early, even if it’s only to say, ‘I know nothing about this subject, and we need to investigate it.’ That’s the pros and cons of this approach. Ideally, we need to do all these things, but it can be difficult.

Risk Assessment

Moving on. Well, we’ve looked to a broad scope of things for all the hazards that we identify and there are various techniques you can use. The PHA has go to include a risk assessment. That means that we’ve got to think about likelihood and severity and then that gives us an overall picture of risk when we combine the two together. That’s tables 1 and 2.

And then, forget risk assessment codes I’m not sure why that’s in there, table 3 is the risk matrix and 88 2 has a standard risk matrix. And it says use that unless you’ve got a tailored matrix for your system that’s been approved for use. And in this case, it says approved effectively in accordance with the US Department of Defence. But it’s whoever is the acquiring organization, the authority, the customer, the purchaser, whatever you want to call it, the end-user. We’ll talk about that more in a sec.

Table I, Severity

Let’s start by looking at severity, which in many ways is the easiest thing to look at. Now, here we’ve got in this standard we’ve got an approach based on harm to people, harm to the environment, and monetary loss due to smashing stuff up. At the top catastrophic accident. Category 1 is a fatal accident. This accident could result in death, permanent total disability, irreversible significant environmental impact, or monetary loss. And in this case, it says $10 million. Well, this, that’s 10 million US dollars. This standard was created in 2012, this version of the standard, probably inflation has had an effect since then. And a critical accident, we could cause partial disability injuries or occupational illness that can hospitalized three people are reversible. Significant environmental impact or some losses between 1 million and 10. And then we go down to marginal. Injury or hospital, lost workdays for one person, reversible moderate environmental impact or monetary loss between $100,000 and one million dollars. And then finally negligible is less than that. Negligible is an injury or illness that doesn’t result in any lost time at work, minimal environmental impact, or a monetary loss of less than a hundred thousand dollars. That’s easy to do in this standard. We just say, ‘What are our losses that we think could result?’ Worst case, reasonable scenario or an accident? That’s straightforward.

Table II, Probability

Now let’s look at probability. We’ve got a range here from ‘a’ to ‘e’, frequent down to improbable, and then F is eliminated. And eliminated in the standard really does mean eliminated. It cannot happen ever! It does not mean that we managed to massage the figures, the likelihood a probability figures, down Low that we pretend that it will never happen. It means that it is a physical impossibility. Please take note because I’ve seen a lot of abuse of that approach. That’s bad practices to massage the figures down to a level where you say, ’I don’t need to bother thinking about this at all!’ because the temptation is just to frig [massage] the figures and not really consider stuff that needs to be considered. Well, I’ll get off my soapbox now.

Let’s go back to the top. Frequent- you’ve said, for one item, likely to occur often. Down to probable- occur several times in the life of an item. Occasional- likely to occur sometimes, we think it’ll happen once in the life of an item. Remote- we don’t think it’ll happen at all, but it could do. And improbable – so unlikely for an individual item that we might assume that the occurrence won’t happen at all. But when we consider a fleet, particularly, I’ve got hundreds or thousands of items, the cumulative risk or cumulative probability, sorry, I should say, is unlikely to occur across the fleet, but it could.

And this is where this specific vs. fleet occurrence or probability is useful. For example, if we think ‘Let’s imagine a frequent hazard’. We think that something could happen to an item, per item, let’s say once a year. Now, if we’ve got a fleet of fifty of these items or fifty-something of these items, that means it’s going to happen across the fleet pretty much every week on average. That’s the difference. And sometimes it’s helpful to think about an individual system. And sometimes it’s helpful to think about a fleet where you can you’ve got relevant experience to say, ‘Well the fleet that we’re replacing. We had a fleet of 100 of these things. And we at this go, this went wrong every week or every month or once a year or only happened once every 10 years across the entire fleet.’ And therefore, we could reason about it that way.

We’ve got two different ways of looking at probability here. And use whichever one is more useful or helps you. But when we’re doing that, try and do that with historical data, not just subjective judgment. Because otherwise your subjective judgment, one individual might say ‘That will never happen!’, whereas another will say, ‘Well, actually we experienced it every month on our fleet!’. Circumstances are different.

Table III, Risk Matrix

We put severity and probability together. We have got 1 to 4 on severity, A to F on probability, and we get this matrix. We’ve got probability down the side and severity along the top. And in this standard, we’ve got high risk, serious risk, medium risk and low risk. And now how exactly you define these things is, of course, somewhat arbitrary. We’ll just look at some general principles.

The good thing about this risk matrix is- First, the thing to remember is that risk is the product of probability and severity. Effectively we multiply the two together and we go, well, if we’ve got a catastrophic or a critical risk. And it’s if we’ve got a more serious risk and it’s going to happen often that’s a big risk. That’s a that’s a high risk. Where alternatively, if we’ve got a low severity accident that we think will happen very, very rarely, then that’s a low risk. That’s great.

One thing to note here it’s easier to estimate the severity than it is the probability. It’s quite easy to under- or overestimate probability. Usually, because of the physical mechanism involved, it’s easier to estimate the severity correctly. If we look on the right-hand side, at negligible. We can see that if we’re confident that something is negligible, then it can be a low risk. But at the very most, it can only be a medium risk. We are effectively prioritizing negligible severity risks quite low down the pecking order.

Now, on the other side, if we think we’ve got a risk that could be catastrophic, we could kill somebody or do irreversible environmental damage, then, however improbable we think it is, it’s never going to be classified less than medium. That’s a good point to note. This matrix has been designed well, in the sense that all catastrophic and critical risks are never going to get the low medium and they can quite easily become serious or high. That means they’re going to get serious management attention. When you put risks up in front of a manager, senior person, a decision-maker, who’s responsible and they see red and orange, they’re going to get uncomfortable and they’re going to want to know all about that stuff. And they will want to be confident that we’ve understood the risk correctly and it’s as low as we can get it. This matrix is designed to get attention and to focus attention where it is needed.

And in this standard, in 88, you ultimately determine whether you can accept risk based on this risk rating. In 882, there is no unacceptable, intolerable risk. You can accept anything if you can persuade the right person with the right amount of authority to sign it off. And the higher the risk, the higher the level of authority you must get in order to accept the risk and expose people to it. This matrix is very important because it prioritizes attention. It prioritizes how much time and effort money gets spent on reducing risks. You will use it to rank things all the time and it also prioritizes, as we’ll see later, how often you review a risk because clearly, you don’t want to have high risks or serious risks. Those are going to get reviewed more often than a medium risk or low risk. A low risk might just get review routinely, not very often, maybe once a year or even less. We want to concentrate effort and attention on high risks and this matrix helps us to do that. But of course, no matrix is perfect.

Now, if we go back. Looking at the yellow highlight, we’re going to use table three unless there’s a tailored alternative definition, a tailored alternative matrix. Now, noting this matrix, catastrophic risk, the highest possible risk, we’ve got one death. Now, if we had a system where it was feasible to kill more than one person in an accident, then really, we would need another column worse than catastrophic. We could imagine that if you had a vehicle that had one person in it and the vehicle crashed, whatever it was, a motorbike let’s say. Let’s imagine you only said ‘We’re only going to have solo riders. We can only kill one person.’. We’re assuming we won’t hurt anybody else. But if you’ve got a car where you’ve got four or more people in, you could kill several people. If you’ve got a coach or a bus, you could drive it off a cliff and kill everybody, or you might have a fire and some people die, but most of them get out. You can see that for some vehicles, for some systems, you would need additional columns. Killing one person isn’t the worst conceivable accident.

Some systems. You might imagine quite easily, say with a ship, it’s actually very rare for a ship to sink and everybody dies. But it’s quite common for individuals on ships to die in health and safety type accidents, workplace accidents. In fact, being a merchant seaman is quite a risky occupation. But also in between those two, it’s also quite possible to have a fire or asphyxiating gases in a compartment. You can kill more than one person, but you won’t kill the entire ship’s company. Straight away in a ship, you can see there are three classes, if you like, of serious accidents where you can kill people. And we knew we should really differentiate between the three when we’re thinking about risk management. And this matrix doesn’t allow you to do that. If you’ve got a system where more than one death this is feasible, then this matrix isn’t necessarily going to serve well, because all of those type of accidents get shoved over into a catastrophic column, on this matrix, and you don’t differentiate between any of between them which is not helpful. You may need to tailor your matrix and add further columns.

And depending on the system, you might want to change the way that those risks are distributed. Because you might have a system, for example riding a bicycle. It’s very common riding a bicycle to get negligible type injuries. You know you fall off, cuts and bruises, that kind of thing. But, if you’re not on the road, let’s say you’re riding off-road it is quite rare to get utilities unless you do a mountain biking on some extreme environment. You’ve got to tailor the matrix for what you’re doing. I think we’ve talked about that enough. We’ll come back to that in later lessons, I’m sure.

Risk Mitigation

Risk mitigation, we’re doing this analysis, not for the sake of it, we’re doing it because we want to do something about it. We want to reduce the risk or eliminate it if we can. 88 2 standard gives us an order of precedence, and as it says it’s specified in section 4.3.4, but I’ve reproduced that here for convenience. Ideally, we would like to eliminate hazards by designing them. We would make a design decision to say, ‘We will- we won’t have a petrol engine, let’s say, in this vehicle or vessel because petrol a serious fire explosion hazard. We’ll have something else. We’ll have diesel or we’ll have an all-electric vehicle maybe these days or something like that.’ We can eliminate the risk.

We could reduce the risk by altering the design introducing sort of failsafe features, or making the design crashworthy, or whatever it might be. We could add engineered features or devices to reduce risk safety features seatbelts in cars or airbags, roll balls, crash survivable cages around the people, whatever it might be. We can provide warning devices to say ‘Something’s going wrong here, and you need to pull over’ or whatever it is you need to do. ‘Watch out!’ because the system is failing and maybe ‘Your brakes are failing. You’ve got low brake fluid. Time to pull over now before it gets worse!’.

And then finally, the least effective precautions or mitigations signage, warning signs – because nobody reads warning signs, sadly. Procedures. Good, if they’re followed. Again, very often people don’t follow them. They cut corners. We train people. Again, they don’t always listen to the training or carry it out. And we provide PPE. That’s personal protective equipment. And again, PPE is great if you enforce it. For example, I live in Australia. If you cycle in Australia, if you ride a bicycle, it’s the law that you wear a bike helmet. Most people obey the law because they don’t want to get a $300 fine or whatever it is if the cops catch you, but you still see people around who don’t wear one. Presumably, because they think they’re bulletproof, and it will never happen to them. PPE is fine if it’s useful. But of course, sometimes PPE can make a job that much harder that people discard it. We really need to think about designing a job to make it easy to do, if we’re going to ask people to wear awkward PPE. Also, by the way, we need to not ask them to wear PPE for trivial reasons just so that the managers can cover their backsides. If you ask people to wear PPE when they’re doing trivial jobs where they don’t need it then it brings the system into disrepute. And then people end up not wearing PPE for jobs where they really should be wearing it. You can over-specify safety and lose goodwill amongst your workers if you’re not careful.

Now those risk mitigation priorities, that’s the one in this standard, but you will see an order of precedence like that in many different countries in the law. It’s the law in Australia. It’s the law in the UK, for example, expressed slightly differently. It’s in lots of different standards for good reason because we want to design out the risks. We want to reduce them in the design because that’s more effective than trying to bolt on or stick home safety afterwards. And that’s another reason why we want to get in early in a project and think about our hazards and our risks early on. Because it’s cheaper at an early stage to say, ‘We will insist on certain things in the design. We will change the requirements to favour a design that is inherently safe.’

Contracting

We only get these things if we contract for them. The model in 88 2, the assumption is it’s a government somewhere contracting a contractor to do stuff. But it doesn’t have to be a government, it can be any client or purchase of world authority or end-user asking for something, buying something, contracting something, be it the physical system, or service, or whatever it might be. The assumption is that the client issues a request for proposal.

Right at the start, they say ‘I want a gizmo’. Or ‘I want- I don’t even want to specify that I want a gizmo. I want something that will do this job. I don’t care what it is. Give me something that will do this job.’ But even at that early stage, we should be asking for preliminary hazard analysis (PHA) to be done. We should be saying, ‘Well, who?’ ‘Which specialists?’ ‘Which functional disciplines need to be involved?’. We need to specify the data that we require and the format that it’s in. Considering, especially the tracking system, which is task 106. If we’re going to get data from lots of different people, best we get it in a standardized format we can put it all together. We want to insist that they identify hazards, hazardous locations, etc. We want to insist on getting technical data on non-developmental items, either getting it for the client or the client supplies it. Says to the contractor or doing it ‘This is the information that I’m going to supply you’ and you will use it. We need to supply the concept of operations and of course, the operating environment. Let me just check, no that that’s it. We’ve only got one slide on commentary. It doesn’t say the environment, but we do need to specify that as well, and hopefully that should be in the concept of operations, and a specific hazard management requirement. For example, what matrix are we going to use? What is a suitable matrix to use for this system?

Now to do all of this, the purchaser, the client really probably needs to have done Task 202 and 201 themselves, and they’ve done some thinking about all of this in order to say, ‘With this system, we can envisage- with this kind of requirement, we can envisage these risks might be applicable.’ And ‘We think that the risks might be large or small’ depending on what the system is or ‘We think that-’. Let’s say if you if you purchase a jet fighter, jet fighters because of that demand, the overwhelming demand for performance, they tend to be a bit riskier than airliners. They fall out of the sky more often. But the advantages, there’s normally only one or two people on board. And jet fighters tend to fly a lot of the time in the middle of nowhere. You’re likely to hurt relatively few people, but it happens more often.

Whereas if you’re buying an airliner something, you can shove a couple of hundred people in at one go, those fall out of the sky much less frequently, thank goodness, but when they do, lots of people get hurt. Aa different approach to risk might be appropriate for different types of system. And when your, you should be thinking about early on, if you’re the client, if you’re the purchaser. You should have done some analysis to enable you to write a good request for proposal, because if you write a bad request for proposal, it’s very difficult to recover the situation afterwards because you start at a disadvantage. And the only way often to fix it is to reissue the RFP and start again. And of course, nobody wants to do that because it’s expensive and it wastes a lot of time. And it’s very embarrassing. It is a career-limiting thing to do, a lot of people. You do need to do some work up front in order to get your RFP correct. That’s what it says in the standard.

Commentary

I want to add a couple of comments, I’m not going to say the much. First, it’s a little line from a poem by Kipling that I find very, very helpful. And Kipling used to be a journalist and it was it was his job to go out and find out what the story was and report it. And to do that he used his six honest serving men. He asked ‘What?’ and ‘Why?’ and “When?’ and ‘Who?’, sorry, and ‘How?’ and ‘Where?’ and ‘Who?’. Those are all good questions to ask. If you can ask all those questions and get definite answers, you’re doing well. And a little tip here as a consultant, I rock up and one of the tricks of the trade I use is I turn up as the ‘dumb consultant’ – I always pretend to be a bit dumber than what I really are- and I ask these stupid questions. And I ask the same questions of several different people. And if I get the same answer to the same question from everyone, I’m happy. But that doesn’t always happen. If you start getting very different answers to the same question from different people, then you think, ‘Okay, I need to do some more digging here’. And it’s the same with hazard analysis. Ask the what, why, when, where and who question.

Another issue, of course, is ‘How much?’ ‘How much is this going to take?’ ‘How long is this going to take?’ ‘How many people am I going to have to invite to this meeting?’, etc. And that’s difficult. And really, the only way to answer these questions properly is to just do some PHI and PHA early and to learn from the results. The other alternative, which we are really good as human beings, is to ask the questions early to get answers that we don’t really like and then just to sweep them under the carpet and not ask those questions ever again because we’re frightened of the answers that we might. However frightened you are of the answer, you might get do ask the question because forewarned is forearmed. And if you know about a problem, you can do something about it. Even if that something is to rewrite your CV and start looking for another job. Do ask the questions even if it makes people uncomfortable. And I guess learning how to ask the questions without making people uncomfortable is one of the tricks that we must learn as safety engineers and consultants. And that’s an important part of the job. The soft skills really that you can only learn through practice, really, and observing people.

What’s the way to do it? Well, I’ve said this several times but do your PHI and PHA early. Do it as early as possible because it’s cheap to do it early. If you’re the only safety person or safety, you often in the beginning, maybe you’re a manager, maybe safety is part of your portfolio, you’ve got other responsibilities as well. Just sit down one day and ask these dumb questions, go through the checklist in Task 202 and say, ‘Do I have these things in my system?’

If you know for sure you’re not going to have explosive ordnance, or radiation, or whatever it might be, you can go, ‘Great. I can cross those off the list’. I can make an assumption or I can put a constraint in, by the way, if you really want to do it well and say ‘We will have no explosive devices’, ‘We will have no energetic materials.’, ‘We will have no radiation’ or whatever it might be. Make sure that you insist that you’ll have none of it then you can hopefully move on and never have to deal with those issues again.

Do the analysis early, but expect to repeat it because things change, and you learn more and more information comes in. But of course, the further you go down the project, the more expensive everything gets. Now, having said do it, do it early, the Catch 22 is very often people think ‘How can I analyse when I don’t have a design?’

The catch 22 question is what comes first, design or analysis? Now, the truth is that you could do analysis on very simple functions. You don’t need any design at all. You don’t even need to know what kind of vehicle or what kind of system you might dealing with. But of course, that will only take you so far. And it may be that you want to do early analysis, but for whatever reason, [Intellectual Property Rights] IPR or whatever it might be, you can’t get access to data.

What do you do? You can’t get access to data about your system or the system that you’re replacing. What do you do? Well, one of the things you can do is you can borrow an idea from the logistics people. Logistic support analysis Task 203 is a baseline comparison system. Imagine that you’re going to have a new system, maybe is replacing an old system, but maybe it does a lot more than the old system used to do. Just looking at the old system isn’t going to give you the full picture. Maybe what you need to do is make up an imaginary comparison system. You take the old system and say, ‘Well, I’m adding all this extra functionality’. Maybe the old system, we just bought the vehicle. We didn’t buy the support system, we didn’t buy the weapons, we didn’t buy the training, whatever it might be. But, this time round, we’re buying the complete package. We’re going to have all this extra stuff that probably has hazards associated with it, but just doing lessons learned from the previous system will not be enough.

Maybe you need to construct an imaginary Baseline Comparison System and go, ‘I’ll borrow bits from all these other systems, put them all together, and then try and learn from that sort of composite system that I’ve invented, even though it’s imaginary.’ That can be a very powerful technique. You may get told, ‘Oh, we haven’t got the money’ or ‘We haven’t got the time to do that’. But to be honest, if there’s no other way of doing effective, early analysis, then spend the money and do it early. Because many times I’ve seen people go, ‘Oh, we haven’t got time to do that’. They’ve never got time to do it properly and therefore, you end up doing it. You go around the buoy two or three times. You do it badly. You do it again slightly less badly. You do it a third time. And it’s sort of barely adequate. And then you move forward. Well, you’ve wasted an awful lot of time and money and held up other people, the rest of the project doing that. Probably it’s better off to spend the money and just get on with it. And then you’re informed going forwards before you start to spend serious money elsewhere on the project.

Copyright Statement

Well, that’s it for me. Just one thing to say, that Mil. Standard 882E came out in 2012. Still going strong, unlikely to be replaced anytime soon. It’s copyright free. All the quotations are from the standard, they’re copyright free. But this video is copyright of The Safety Artisan 2020.

For More …

And you can find a lot more information, a lot more safety videos, at The Safety Artisan page at www.Patreon.com and you can find more resources at www.safetyartisan.com.

End

That is the end of the show. Thank you very much for listening. And it just remains for me to say. Come and watch some more videos on Mill-Std-882E. There’s going to be a complete course on them, and you should be able to get, I hope, a lot of value out of the course. So, until I see you again, cheers.

End: Preliminary Hazard Analysis

You can find a free pdf of the System Safety Engineering Standard, Mil-Std-882E, here.

Categories
Mil-Std-882E Safety Analysis

System Requirements Hazard Analysis

In this 45-minute session, The Safety Artisan looks at System Requirements Hazard Analysis, or SRHA, which is Task 203 in the Mil-Std-882E standard. We explore Task 203’s aim, description, scope, and contracting requirements.  SRHA is an important and complex task, which needs to be done on several levels to be successful.  This video explains the issues and discusses how to perform SRHA well.

This is the seven-minute demo video, the full version is 40-minutes’ long.

Topics: System Requirements Hazard Analysis

  • Task 202 Purpose;
  • Task Description:
    • Determine Requirements;
    • Incorporate Requirements; and
    • Assess the compliance of the System.
  • Contracting;
  • Section 4.2 (of the standard); and
  • Commentary.
Transcript: System Requirements Hazard Analysis

Introduction

Hello and welcome to the Safety Artisan, where you will find professional, pragmatic and impartial advice on all things system, safety and related.

System Requirements Hazard Analysis

And so today, which is the 1st of March 2020, we’re going to be talking about – let me just find it for you – we’ll be talking about system requirements, hazard analysis. And this is part of our series on Mil. Standard 882E (882 Echo) and this one a task 203. Task 203 in the Mil. standard. And it’s a very widely used system safety engineering standard and its influence is found in many places, not just on military procurement programs.

Topics for this Session

We’re going to look at this task, which is very important, possibly the most important task of all, as we’ll see. so in to talk about the purpose of the task, which is word for word from the task description itself. We’re going to talk about in the task description, the three aims of this task, which is to determine or work out requirements, incorporate them, and then assess the compliance of the system with those requirements, because, of course, it may not be a simple read-across. We’ve got six slides on that. That’s most of the task. Then we’ve just got one slide on contracting, which if you’ve seen any of the others in this series, will seem very familiar. We’ve got a little bit of a chat about Section 4.2 from the standard and some commentary, and the reason for that will become clear. So, let’s crack on.

System Requirements Hazard Analysis

Task 203.1, the purpose of Task 203 is to perform and document a System Requirements Hazard Analysis or SRHA. And as we’ve already said, the purpose of this is to determine the design requirements. We’re going to focus on design rather than buying stuff off the shelf – we’ll talk about the implications of that a little bit later. Design requirements to eliminate or reduce hazards and risks, incorporate those requirements, into a says, into the documentation, but what it should say is incorporate risk reduction measures into the system itself and then document it. And then finally, to assess compliance of the system with these requirements. Then it says the SRHA address addresses all life-cycle phases, so not just meant for you to think about certain phases of the program. What are the requirements through life for the system? And in all modes. Whether it’s in operation, whether it’s in maintenance or refit, whether it’s being repaired or disposed of, whatever it might be.

Task Description #1

First of six slides on the task description. I’m using more than one colour because there’s some quite a lot of important points packed quite tightly together in this description. We’re assuming that the contractor performs and documents this SRHA. The customer needs to do a lot of work here before ever gets near a contractor. More on that later. We need to determine system design requirements to eliminate hazards or reduce associated risks.

Two things here. By identifying applicable policies, regulations and standards etc. More on that later. And analysing identified hazards. So, requirements to perform the analysis as well as to simply just state ‘We want a system to do this and not to do that’. So, we need to put some requirements to say ‘Here’s what we want analysed maybe to what degree? And why.’ is always helpful.

Task Description #2

Breaking those breaking those two requirements down.

Part a. We’re going to identify applicable requirements by reviewing our military and industry standards and specs, historical documentation of systems that are similar or with a system that we’re replacing, perhaps. Look at, it’s assumed that the US Department of Defense is the customer, ultimate customer. So, the ultimate customer’s requirements, including whatever they’ve said about standard ways of mitigating certain common risks. System performance spec, that’s your functional performance spec or whatever you want to call it. Other system design requirements and documents- Bit of a catchall there. And applicable federal, military, state and local regulations. This is a US standard. It’s a federated system, much like Australia or indeed lots of modern states, even the UK. There are variations in law across England, Wales, Scotland and Ireland. They’re not great, but they do exist. And in the US and Australia, those differences are greater. And it says applicable executive orders. Executive orders, they’re not law, but they are what the executive arm of the U.S. government has issued, and international agreements. A lot of words in there- have a look at the different statements that are in that in white, blue and yellow. Basically, from international agreements right down to whatever requirements may be applicable, they all need to be looked at and taken account of. So, there’s a huge amount of work there for someone to do. I’ll come back to who that someone should be later.

Task Description #3

Part B. It says the contractor shall recommend appropriate system design requirements. The assumption here is that the contractor is the designer and knows the design better than anybody, better than the purchaser, which is fair enough. It’s your system, you should understand it. And the requirement is that the contractor is not just passive, ‘doing as they’re told’, they’re there to actively investigate possible hazards associated with their system and recommend appropriate requirements in order to manage those hazards and risks. And then there’s further guidance here is the contractor to do that in accordance with Section 4 of Mil. Standard 882E. Now, Section 4 is the general requirements of the standards and there’s lots of good advice in that. And I’ll be doing a lesson, maybe more than one lesson in fact, on Section 4 because there is quite a lot in there. The contractor is to refer to the standard and apply the principles therein. All good stuff.

Part C. The contractor shall also define verification and validation approaches. So, the contractor shall define V and V approaches for each design requirement to eliminate hazards and reduce risks. In part C- Well, B and C- we’ve got a very much narrower focus on requirements to eliminate hazards or reduce risks. Whereas in A, notice we’ve got incredibly broad scope looking requirements. It’s not just about the narrow job of dealing with hazards and controlling them, that we’ve got in parts B and C.

Task Description #4

Onwards and upwards. We get to the second major part of this task, which is to incorporate those design requirements. It’s all very well to have them, but they’ve got to be built into the engineering design, into documentation, hardware, software, test plans, etc. And the second highlighted bit that I’ve got is ‘as the design evolves ensure applicable design requirements flow down into lower-level specifications’, etc, etc, etc. There’s a lot of repetition there, so I won’t go through it. Clearly the assumption in this standard is that the design will be done top-down and that the main contractor, design contractor, will be doing work and then identifying lower-level requirements to be passed on to subcontractors and suppliers. And again, the assumption is we’re dealing with a large military system, which is at least, in part, bespoke. It is being developed and/or integrated for the first time for a specific user and a specific use.

I’ll come onto the third yellow highlighted bit first, and then it says as appropriate use engineering change proposals to incorporate applicable design requirements into these documents. What we’re saying here is that even if something hasn’t been specified up front in the original contract, the contractor should use Engineering Change Proposals – ECP – should use it controlled change mechanism in order to change things as they go with approval and refine and evolve the design.

Years of experience have taught me that these statements are coming from the assumption – still true in the US, I believe – whereby major military projects are designed and developed under a cost-plus basis. In other words, the government pays the main contractor / the prime contractor / prime designer on a sort of time and materials basis, not on a firm or fixed price basis, but says ‘Go away and do what we say’. And there’s controls there, and there’s open-book accounting to try and prevent the government being defrauded. But basically, the contractor goes off and does what is required and gets paid for what they do. So, the government has transferred relatively low amounts of risk onto the contractor anticipating that this will result in the lowest possible overall cost of design development. Now, as we probably could know from the news, that doesn’t always work. However, that is the assumption behind this standard. This cost-plus approach will pay you to do the job and therefore we don’t have to specify every single nut and bolt in the contract right at the beginning. Which in some ways takes a lot of risk away from the purchaser because they don’t have to get everything right at the start. So that’s good. There’s always a balance of risk in whichever approach we take.

So, if we go firm price, yes, we could inject more competition into procurement and supply activity, but you’ve got to get your contract upfront right. And all your requirements, right- more or less. That is notoriously difficult to do. Whichever way you go, there are risks. But it’s important to note that this is the assumption underlying the standard. Not every standard follows this approach, follows this philosophy, but 88 2 does. So, if we’re going to use it in a different way, we need to understand the fact that in. More on that later.

Task Description #5

Fifth slide of six. Third part. We need to assess compliance of that development of hardware, software, documentation, data, etc., whatever it might be. In order to do that, the contractor is going to have to address the customer requirements at technical reviews. So again, the assumption is that development is following a systems-engineering process with certain gated reviews. So, you go into a series of reviews, you might start with system requirements review, SRR. Then you might have preliminary design review, top-level design, PDR. And then we go down to detailed design which is reviewed at Critical Design Review, or CDR. And then we might have a further software specification review for software components and then we’ll go on and test readiness routines and so on and so forth. Mil. Standard 882 is assuming a particular systems-engineering-lifecycle approach to development. This is very widely used not just for military standards, but for civil, and all over the place. Whatever we call these reviews, the idea of a gated review is that you don’t start a review until you’ve reached maturity requirements or design. You then conduct the review against objective criteria and then decide whether the review has passed. Now, usually, there is a hefty payment milestone associated with passing review. The contractor is incentivized to pass the review. And hopefully, if we’ve got the requirements right, a passed review means we’re on the right track and we’re getting the right product. But that’s not always that’s not always the case that we’ve got to get all these things right.

And then it says during those reviews, the contractor shall address hazards, mitigation measures or controls and methods of V and V, and recommendations arising. A lot goes on at these reviews. They are on big programs, especially, the very important, very high stress. And in fact, in Australia now, there are some projects that are so big that a delay in a PDR review actually made it into the national news on the future submarine because it’s such a huge multibillion-dollar project. It could all get very painful and political as well.

Task Description #6

However, let’s move on to the final slide of the task description. So, A. was is do the reviews. B. is review test plans and review test results to make sure to verify and validate hardware and software compliance with those requirements. And as it says, this includes V and V of the effectiveness of risk mitigation measures. So, we need to test these risk controls where we can and see how effective they are and whether they live up to the requirements or the assumptions that we’ve made. Now, again, this is an American standard, so it’s very test centric. American government likes to test things to death and depending on your point of view, that’s sensible or not, it’s sensible in the sense that you’re testing a real system hopefully in a representative test environment. Although it’s going to be representative of the operational environment. So, it should be a very solid, robust, valid approach to proving a system.

However, there is a downside to testing in that it’s very expensive and it tends to come at the end of a program. Whereas really you need an indication much earlier on if things are going astray. So, you really need to review documentation and do analysis and so forth. Or maybe you maybe you test a prototype for some samples or something early on, rather than waiting until yet when it’s often may be too late and then very expensive to fix things.

And then part C, we need to ensure that hazard control information is incorporated into manuals and plans, whether it be for the operator, the maintainer, the trainer, the logistician, the diagnostics or indeed for the final disposal. We need to take that hazard control information, risk control information, and record it so that it doesn’t get lost and it gets to the people who need it. That’s very important.

OK, so we’ve spent quite a lot of time going through the description because it’s a big, complex task this one, as you can see, with three major parts to it. It’s worth just going back over it. We’ve got our top-level description on slide one, which summarizes the whole thing. We’re talking about finding those requirements, identifying them. We’re talking about the contractor as an active recommender and developer of requirements and actively developing the V and V techniques to make sure that they’re met. Second major part, we’re talking about incorporating those design requirements as the design evolves and using a controlled change method to make sure that we keep up with what’s going on. We’re talking about assessing compliance both at major systems engineering reviews and during testing. And then finally, we’re talking about making sure that the required information gets through to those who need it at the end of the food chain, as it were. All important stuff.

Contracting

Here’s as a page we should be familiar with by now, contracting. We need to require SRHA, Task 203.  We need to put it in the request for proposal and the contractual state, the work. So once again, as I’ve said before, we’ve got to get this stuff in early on. At least the requirement to do it, even if we haven’t fully worked everything out. We need to get that in right at the start of the request for proposal. We need to require task 203 to be done. It’s imposed (A. Imposition of Task 203).

We need to identify (B. Identification of functional disciplines) who we want to take part in it because it’s not, as we will see, it’s not just the discipline and the job of the safety engineers or the safety team to do this. The design engineers, the specialist engineers in reliability, maintainability and testability, whoever, they all need to be involved as well, etc, etc.

Contractor level of effort (C.) for reviews and so on. We may need to specify some hard requirements there to ensure that we get early scrutiny of the product and the design.

Big point tailoring of the task (D. Tailor 203.2 and 203.2.3 as appropriate). The task may need to be tailored assuming again that the contractor is responsible for the design. Maybe if the prime contractor isn’t responsible for the design, maybe we’re contracting somebody to buy something that’s mostly off the shelf and then operating force for 30 years. Let’s say a so-called turnkey solution. And we might do that for a piece of military kit, or we might do that for a hospital, or whatever it might be. A piece of infrastructure, a service, whatever. So, it may be that the contractor who must do most of task 203 is not the Prime at all. But, the prime needs to pass those requirements down to some key subcontractors who are doing the development stuff. So, it’s not a given that the prime contractor right underneath the customer must do all this stuff. It may have to be done at several different levels.

And again, we’ve got to provide the concept of operations (E.), that gives the context for all this work. Otherwise, it gets very difficult to do it. You’ve got to say, ‘What’s the jurisdictional context?’ ‘Where will we be operating under?’ ‘Which rules and conditions?’ As well as everything else that you would find in Con. Ops (Concept of Operations).

And then if there are any specific hazard management requirements (F.) that need to be imposed and specific measures of risk, then they need to be passed on to the contractor as well. This is how we will assess, and measure, and prioritize risks. That needs to be done for the program otherwise, you can end up with lots of different ways doing it and it becomes difficult to govern mess.

Section 4.2 #1

I promised we would have a little section on Section 4.2 in the standard and I’ve got two slides here that say two important things. We’re not going to go through all of Section 4 of the 882- That’s for another session. But here in 4.2, we’ve got two important things.

It says Section 4 defines system safety requirements through life for any system. And when properly applied, these requirements should enable the identification and management of hazards and their associated risks. Not only during system development but also during sustainment. And any engineering activities that go on in sustainment, whether it be repair, overhaul, modification, update, whatever it might be. These requirements are put in place to enable that good work to take place and make predictions for the through-life operation, support, sustainment of system, whatever it might be.

Section 4.2 #2

And then secondly, there’s another important point here, which I alluded to earlier. System safety staff are not responsible for hazard management in other functional disciplines. If you’re a structural designer, you’re responsible for making your structure or designing your structure such that risks of failure and collapse and catastrophe are managed. And the same for everything else. Whatever it is you’re dealing with, propulsion, fuels, you name it, whatever the discipline is, they’re all responsible for managing the risks.

The safety team are there really to pull it together and try and ensure some consistency and honesty and to report status. They are not there to do it all for the designers. Indeed, they can’t because they will not have the design specialist knowledge to do so. Only the designers can do. But it does go on to say all functional disciplines, using this generic methodology that’s in Section 4, should coordinate their efforts as part of the overall systems engineering process. The standard provides standardization and it should force all these different disciplines to work together in a standardized way following a standardized-systems-engineering process. And remember we said earlier, Mil. standard 882 assumes that there is a higher-level systems-engineering process going on into which the safety program fits. And that’s very, very important.

On so many programs I’ve seen, there’s either no systems engineering process or a weak one. Or the safety program is divorced or isolated from the systems engineering, the higher-level program, and as a result, it can become irrelevant if you’re not careful. So, having these things and making sure that they lock together is very important. And the reasoning given here is because you might mitigate a hazard in one discipline only to make it worse for somebody else. We can all think of examples of one (which is code for me saying I can’t right now). But anyway, trade-offs – that’s what we end up with. There’s Section 4.2, which gives us a little insight into the thrust of the whole of section 4.

Commentary #1

Just two slides of commentary for me. First, it’s worth remembering that there are lots, and lots, and lots of requirements. We’ve got requirements of the standard itself, which is about following a rigorous process. We’ve got law at international and national level, and whether those laws apply in a particular jurisdiction or not can be complex. You’ve got product specifications; you’ve got applicable standards, or maybe only parts of the standards that are applicable to your system. And then you’ve got program project requirements, etc., etc. You’ve got lots and lots of layers of requirements that are out there and may or may not be relevant to your system you want to develop, or service, whatever it is going to be. But of course, if we’re using this kind of approach, it’s going to be a complex system or service. It’s going to be challenging to find and identify all these things. It’s going to take some dedicated effort.

That’s one issue, doing all that work. And this is not a trivial exercise and I’ve seen it done badly far more often than I’ve seen it done well. That’s the thing to bear in mind, this is not easy to do. And people didn’t really want to do it – it’s hard work.

And then secondly, we get down to what we might call derived safety requirements. We have a high-level requirement that says, ‘We want a very high level of performance out of this vehicle’ or whatever it might be. And that very demanding performance requirement might force us to use some very high energy fuel, or it might force us to pack a lot of power and a lot of equipment into a very small space, and these requirements can lead to sort of secondary hazards. So, we’ve got high energy fuel inside the vehicle- Well, clearly, that’s dangerous if it leaks. We’ve got a lot of stuff, complex stuff, packed into a small system that can give us thermal control problems. Or if a bit of it goes wrong, if it’s tightly packed together, it can take out something else next to it.

So, these performance requirements can cause hazards that probably weren’t there before or needn’t have been there in, let’s say, a common or garden system that doesn’t have to perform as well. So, we might well look at doing some analysis on our requirements and our top-level design or conceptual design, whatever it might be very early on. And we might say, ‘Well, clearly this is going to drive us down a particular path’ and therefore we will derive some additional safety requirements to deal with these challenges. They don’t come out straight out of higher-level requirements, they’re a secondary effect. But in complex systems, these are very common. And if we’re doing our systems engineering well, we will identify, derive safety requirements for ourselves and for the next level of contractors down the chain.

So, instead of just passing on ‘back-to-back’ requirements from the ultimate customer, which may not mean anything at all to the component supplier (in fact, it probably won’t). We need to change these top-level requirements and say, ‘What’s relevant for you as the supplier role of the engine?’ Let’s say or the wheels, or the wings, or the hull, or whatever it might be. We need to pass on required controls, whether it be prevention of hazards, detection or mitigation. We also need to remember the order of precedence. It’s preferable to eliminate hazards if we can’t, we put in engineering- engineered features- to reduce the risk or lessen the probability, or severity, etc. And those rules are in section 4.3.4 of the Mil. Standard. There’s a lot of work to do on requirements on many different levels and it may be that this task must be repeated at many different levels.

Commentary #2

But the first level task must be done by the client, and actually by the ultimate end-user because to mangle a famous quote, ‘What you don’t specify – what you don’t see can hurt you’. So, we need to do this work as end-users, and as purchases, as customers. It is tempting to assume that the contractors will just do it, that they’ll just get it. ‘They’ve been making planes for years’ or ‘They’ve been making tanks’, or boots, or guns, or ships, or whatever it might be. ‘They’ve been making fuel for years’, ‘these chemicals for years’. We just assume that they know what they’re doing. Well, they probably do know what they’re doing within a particular context. However, if we impose competition, as we always do because we’re always looking for value for money, and whether we have a competition where we’re asking for a firm price to do something or whether we employ other methods of competition and cost-cutting, that will always be pressure on the contract costs. And that means they will be tempted to tailor the safety approach they’re taking in order to reduce costs. Which is a perfectly legitimate thing to do, nothing immoral about doing that, if it’s done appropriately and sensibly.

But if you as the customer or client are going to incentivize your suppliers to do that, you need to be aware of that and the fact that may just not bother because you haven’t told them to. You’re not contractually specified it so you aren’t going to get it. It’s not their problem. And indeed, the suppliers may not understand how their customer will integrate what they provide or use it. The prime contractor may not have a great idea as to how you’re going to use their product. And you can be certain that the subcontractors and the low level secondary and tertiary suppliers are probably going to have no clue whatsoever about what’s going to happen to their components. They are just not going to know. So, you need to specify that as purchaser and you need to make sure that your immediate suppliers pass on those requirements and that context and that they police contract appropriately. Otherwise, there’s going to be trouble for the ultimate client and end-user.

And then finally, in these days of globalization and business-to-business and international procurement, you may be – probably are – buying stuff that’s been made abroad and designed in another country where they may have completely different laws or no laws at all on how safety is built-in – designed in – to a system. And of course, you don’t always know where design work is going to get done; just because you engage a prime contractor in your own country and think that you’re safe. You don’t know whether the prime contractor is going to subcontract software development – let’s say, out to India. It’s so common it’s a cliché! But there are certain things that tend to be done offshore because it’s cheaper, or quicker, or whatever. Or because somebody has already got a system that you can just plugin and use – allegedly.

There’s all kinds of reasons why your supply chain will not necessarily ‘Just get it’, or ‘Just do it”’. In fact, there’s lots of good reasons why they won’t. So, the purchaser has got to do a lot of work and it’s critical for the purchaser to know what their obligations are, because a lot of purchasers don’t. They sit there in blithe ignorance of what their safety responsibilities are, and the lucky ones get away with it. And the unlucky ones are either killed or maimed, or they kill or maim somebody else and they end up going to jail or massive fines. But you’ve not only got to understand the requirements, the obligations, safety on the end item being used but how do you translate that to the contractors, because it’s not always obvious. You can’t just say, ‘Well, these are the laws that I have to obey- I’ll just pass those on to you, Mr Contractor’ because they may not apply to the contractor if they’re in a different country.

Or it just may not make any sense at their level. Laws that were designed to protect people will not often make much sense to a component supplier. Just doesn’t work. Two important points there on the commentary. Lots of layers of requirements that need to be worked on. This is all classic systems engineering stuff, isn’t it? And then the purchaser and the end-user cannot evade their responsibilities at the top of the food chain. Indeed, they’ll be stuck with the problem, whatever it is, for 30 years or however long they use the system.

It’s important for the end-user and the ultimate client to do this work may be several times at many different layers.

Copyright Statement

Well, that’s the end of the technical content. I just wanted to say that I’ve quoted a lot of text from the Mil, standard, which is itself copyright-free, and it’s available for free online, including on the Web site the Safety Artisan. But this presentation’s copyright of the Safety Artisan 2020.

For More …

And for more resources and for more videos like this one, please go to www.safetyartisan.com.

End

Well, that is the end of the presentation. And it just remains for me to say thanks again for watching and do lookout for the next sessions in the series on 882 echo (882E). There are quite a few to go. We’re going to go through all the tasks and the general and specific requirements of the standard and the appendices. We will also talk about more advanced topics, about how we manage and apply all this stuff.

So, from The Safety Artisan, thanks very much and goodbye.

End: System Requirements Hazard Analysis

You can find a free pdf of the System Safety Engineering Standard, Mil-Std-882E, here.

Categories
Mil-Std-882E Safety Analysis

Preliminary Hazard Identification

Want to know how to perform Preliminary Hazard Identification?  This is the first step in safety assessment.  We look at three classic complementary techniques to identify hazards and their pros and cons.  This includes all the content from Task 201, and also practical insights from my 25 years of experience with Mil-Std-882. 

This is the seven-minute-long demo video.

Topics: Preliminary Hazard Identification

  • Task 201 Purpose & Task Description;
  • Historical Review;
  • Recording Results;
  • Contracting; and
  • Commentary:
    • Historical Data;
    • Hazard Checklists; and
    • Analysis Techniques.
Transcript: Preliminary Hazard Identification

Hello, everyone, and welcome to the Safety Artisan, where you will find instructional materials that are professional, pragmatic and impartial, because we don’t have anything to sell and we don’t have an axe to grind. Let’s look at what we’re doing today, which is Preliminary Hazard Identification. We are looking at one of the first actual analysis tasks in Mil-Std-882E, which is a systems safety engineering standard from the US government, and it’s typically used on military systems, but it does turn up elsewhere.

Preliminary Hazard ID is Task 201

I’m recording this on the 2nd of February 2020, however, the Mil-Std has been in existence since May 2012 and it is still current, it looks like it is sticking around for quite a while, this lesson isn’t likely to go out of date anytime soon.

Topics for this session

What we’re going to cover is, quoting from the task, first of all, we’re going to look at the purpose and the task description, where the task talks quite a lot about historical review (I think we’ve got three slides of that), recording results, putting stuff in contracts and then I’m adding some commentary of my own. I will be commenting all the way through, that’s the value add, that’s why I’m doing this, but then there’s some specific extra information that I think you will find helpful, should you need to implement Task 201. In this session, we’ve moved up one level from awareness and we are now looking at practice, at being equipped to actually perform safety jobs, to do safety tasks.

Preliminary Hazard Identification (T201)

The purpose of Task 201 is to compile a list of potential hazards early in development. two things to note here: it is only a list, it’s very preliminary. I’ll keep coming back to that, this is important. Remember, this is the very first thing we do that’s an analytical task. There are planning tasks in the 100 series, but actually, some of them depend on you doing Task 201 because you can’t work out how are you going to manage something until you’ve got some idea of what you’re dealing with. We’ll come back to that in later lessons.

It is a list of potential hazards that we’re after, and we’re trying to do it early in development. And I really can’t over emphasize how important it is to do these things early in development, because we need to do some work early on in order to set expectations, in order to set budgets, in order to set requirements and to basically get a grip, get some scope on what we think we might be doing for the rest of the program. this is a really important task and it should be done as early as possible, and it’s okay to do it several times. Because it’s an early task it should be quick, it should be fairly cheap. We should be doing it just as soon as we can when we’re at the conceptual stage when we don’t even have a proper set of requirements and then we redo it thereafter maybe. And maybe different organizations will do it for themselves and pass the information on to others. And we’ll talk about that later as well.

the task description. It says the contractor shall – actually forget about who’s supposed to do it, lots of people could and should be doing this as part of their project management or program management risk reduction because as I said, this is fundamental to what we’re doing for the rest of the safety program and indeed maybe the whole project itself. So, what we need to do is “examine the system shortly after the material solution analysis begins and compile a Preliminary Hazard List (PHL) identifying potential hazards inherent in the concept”. That’s what the standard actually says.

A couple of things to note here. Saying that you start doing it after material solution analysis has begun might be read as implying you don’t do it until after you finish doing the requirements, and I think that’s wrong, I think that’s far too late. to my mind, that is not the correct interpretation. Indeed, if we look at the last four words in the definition, it says we’re “identifying potential hazards inherent in the concept”. that, I think, gives us the correct steer. we’ve got a concept, maybe not even a full set of requirements, what are the hazards associated with that concept, with that scope? And I think that’s a good way to look at it.

Historical Review

This task places a great deal of emphasis on the review of historical documentation, and specifically on reviewing documentation with similar and legacy systems. an old system, a legacy system that we are maybe replacing with this system but there might be other legacy systems around. We need to look at those systems. The assumption is that we actually have some data from similar and legacy systems. And that’s a key weakness really with this, is that we’re assuming that we can get hold of that data. But I’ll talk about the issues with that when I get to my commentary at the end.

We need to look at the following (and it says including but not limited to).

a) mishap and incident reports, this is a US standard. they talk about mishaps because they’re trying to avoid saying accidents because that implies that something has gone wrong accidentally. Whereas the term mishap, I believe, is meant to imply that it might be accidental, it might be deliberate, whatever it might be, it doesn’t matter, something has gone wrong. An undesirable event has happened, it’s a mishap. we need to look at mishap and incident reports. Well, that’s great, if you’ve got them if they’re of good quality.

b) You need to look at hazard tracking systems. When the Mil-Std talks about hazard tracking systems it is referring to what you and I might describe as a hazard log or a risk register. It doesn’t really matter what they called, where are you storing information about your hazards? And indeed, the tracking implies that they are live hazards, in other words, associated with a live system and things are dynamic and changing. But don’t worry about that, you should, we should, be looking in our hazard logs, in our risk registers, that kind of thing.

c) Can we look at lessons learned? Fantastic, again, if we’ve got them. But unfortunately, learning lessons can be a somewhat political exercise, unfortunately. it doesn’t always happen.

d) We need to look at previous safety analyses and assessments. That’s fantastic. If we’ve got stuff that’s even halfway relevant, maybe we could use it and save ourselves a lot of time and trouble. Or maybe we could look at what’s around and go, actually, I think that’s not suitable because…, and then even that gives you a steer to say, we need to avoid what’s gone wrong with the previous set of analysis. But hopefully without just throwing them out and dismissing them out of hand, because that’s far too easy to do (not invented here, I didn’t do it, therefore it’s no good). Human pride is a dangerous thing.

e) It says health hazard information. Maybe there are some medical results, some toxicology, maybe we’ll be tracking the exposure of people to certain toxins in similar systems. What can we learn from that?

f) And test documentation. let’s look at these legacy systems. What went right, what didn’t go right and what had to be done about it. all useful sources of information.

g) And then that list continues. Mil-Std 882 includes environmental impact, its safety, and environmental impact is implicit all the way through the standard. we also need to look at environmental issues, thinking about system testing, training, where it’s going to be deployed, and maintenance at different levels. And we talk about potential locations for these things because often environmental issues are location sensitive. doing a particular task in the middle of nowhere in a desert, for example, might be completely harmless, doing it next to a significant watercourse, which is near a Ramsar Wetland (an environment of international importance) or an area of outstanding natural beauty or a national park, something like that, might have very different implications. it’s always location-sensitive with environmental stuff.

h) And being an American publication, it goes on to give a specific example: The National Environmental Policy Act (NEPA), which is in the U.S. and then similarly there is an executive order looking at actions by the federal government when abroad and how the federal government should manage that. Now, those are U.S. examples. If you’re not in the U.S. there’s probably a local equivalent of these things. And certainly, I live and work in Australia, we have an Australian Environmental Protection and Biodiversity Conservation Act (AEPBC), and that equivalent of it, it doesn’t just apply in Australia, it also applies to what the Commonwealth Government does abroad as well. outside the normal Australian jurisdiction, it does apply.

i) And then finally, we’ve got to think about disposing of the kit. Demilitarisation: maybe we’re going to take out the old military stuff and flog it to somebody, we need to think about the safety and environmental impacts of doing that. Or maybe we’re just going to dispose of the kit, whatever it might be, we’re going to scrap it or destroy it or put it away somewhere, store it again in the desert somewhere for a rainy day. If that’s not a contradiction in terms. we’re going to think about the disposal of it as well and what are the safety and environmental implications of doing so? there’s a good, broad checklist here to help us think about different issues.

Recording Results

It says that, whoever is doing this stuff, the contractor, shall document identified hazards in this hazard tracking system, in this hazard log, this risk register, whatever you want to call it. And the content of this recording and the formats to be used have got to be agreed between, it says the contractor and the program office, but generally the purchaser and whoever is doing the work. the purchaser might also be the ultimate end-user, as is often the case with the government, or it might be something else. Again, it might be the purchaser who will sell on to an end-user, but they’ve got to agree what they’re going to do with the contractor.

And of course, doing so, you’ve got to understand what your legal obligations are. Again, for example, in Australia, the WHS Act puts particular obligations on designers, manufacturers, suppliers, importers, etc. There are three duties and two of them are associated with passing on information to the end-user. be aware of what your obligations are, the kind of information that at a minimum you must provide, and probably make sure that you’re going to get, that minimum information in a usable format and maybe some other stuff as well that you might need. And it says unless specified elsewhere, in other words, by agreement with the government or whoever is the purchaser, you’ve got to have a brief description of the hazard and the causal factors associated with each identified hazard.

Now this is beginning to get away from just a pure list, isn’t it? it’s not just a list, we have to have a description that we can scope out the hazard that we’re talking about. Bear in mind, early on we might identify a lot of hazards that subsequently actually turn out to be just one hazard or are not applicable or are covered by something else. we need a description that allows us to understand the boundaries of what we’re talking about. And then we’re also being asked to identify causes or causal factors. maybe circumstances, what could cause these things, etc. it’s a little bit more than just a list, but we’re beginning to fill in the fields in the hazard log as we do this at the start.

Contracting

Now, this is very useful, in the standard for every task it says here are the details to be specified in the contractual documentation, and notice it says details to be specified in the Request for Proposal. you’ve got to ask for this stuff if you need it. You’ve got to know that you need it and why you need it and what you’re going to do with the information as a purchaser. And you’ve got to put that in right at the start in the Request for Proposal and the Statement of Work. And here’s some guidance on what to include.

The big point here is this needs to be done very early on. In fact, to be honest, the purchaser is going to have to do Task 201 themselves and maybe some other tasks in order to get enough data and enough understanding to write the Request for Proposal and the Statement of Work in the first place. you do it yourself and then maybe you do a quick job to inform your contracting strategy and what you’re going to do and then you get the contractor to do it as well.

What have we got to include? Well, we’ve got to impose Task 201. I’ve seen lots of contracts where they just say, ah, do safety, do safety in accordance with this standard, do Mil-Std 882, or whatever it might be. And a very broad open-ended statement like that is vulnerable to interpretation because what your contractors, your tenderers, will do is in order to come in at the minimum price and try and be competitive is they will tailor the Mil-Std and they will chop out things that they think are unnecessary, or that they can get away without doing and they might chop out some stuff that actually you find that you need. That can cause problems.

But even worse, if you’ve got a contractor who doesn’t understand how to do system safety engineering, who doesn’t understand Mil-Std 882, they might just blindly say, oh, yeah we’ll do that, and the classic mistake is you get in the contract, it says do Mil-Std 882E and here are all the DIDs, data item descriptors which describe what’s got to be in the various documents that the contractor has to provide. And of course, government projects love having lots of documentation, whether it’s actually helpful or not.

But the danger with this is this can mislead the contractor because if they don’t understand what a system safety program is, they might just go, I’ve got to produce all these documents, yeah, I can do that and not actually realize that they’ve got to do quite a lot of analysis work in order to generate the content for those reports. And I know that sounds daft, but it does happen, I’ve seen it again and again. You got a contractor who produces these reports that on paper have met the requirements of the DID because it’s got all the right headings, it’s got all the right columns, or whatever else. But it’s full of garbage information or TBD or stuff that is obviously rubbish. And you think, no, no, you actually have to specify, you need to do the task and the documentation is the result of the task. we don’t want the tail wagging the dog. Anyway, I’ll get off my soapbox. You’ve got to impose the task, it’s a job to be done, not just a piece of paper to be produced.

Identification of the functional disciplines to be addressed. who’s going to be involved? What are you including? Are you including engineering, maintenance, human factors? Who’s got to be involved? Ideally, you want quite a wide involvement, you want lots of stakeholders, which you need to think about.

Guidance on obtaining access to government information. Now, whether it’s the government or whoever the purchaser is, it doesn’t have to be a government, getting a hold of information and guidance out of the purchaser can be very difficult. And very often that’s because the purchaser hasn’t done their homework. They haven’t worked out what information they will need to provide because maybe they don’t understand the demands of the task or they’ve just not thought it through, quite frankly. And the contractor or whoever is trying to do the analysis finds that they are hamstrung, they can’t actually do the work without the information provided by the purchaser.

That means the contractor can’t do the work, and then they just pass the risk straight back to the government, back to the purchaser, and say: I need this stuff. And then the purchaser ends up having to generate information very quickly at short notice, which is never good, you never get a quality result doing that. And often my job as a consultant is I get called in by the purchaser as often as I do by the supplier to say help, we don’t know what’s going on here, the contractor has said I can’t do the safety program without this information and I don’t understand what they want or what to tell them. as a consultant, I find myself spending a lot of time providing this kind of expertise because either the purchase or contractor doesn’t understand their obligations and hasn’t fulfilled them. Which is great for me, my firm gets paid a lot of money. It’s not good for the safety program.

Content and format requirements. Yes, we need to specify the content that we need. I say need not want. What are we going to do with this stuff? If we’re not going to do anything with it, do we actually need it at all? And what’s the format requirements? Because maybe we need to take information from lots of different subcontractors and put it all together in a consistent risk register. if it comes in all different formats, that’s going to make a lot more work and it may even make merging the information impossible. we need to think about that.

Now, what’s the concept of operations? We’ll come back to that in later tasks. But the concept of operations is, what are we going to do with this system? that should provide the operating environment. It should provide an overview of some basic requirements, maybe how the system will interface with other systems, how it will interact, concepts of operation deployment basing, and maintenance. And maybe they’re only assumptions at this stage, but the people doing the analysis will need this stuff. You recall the environmental stuff is very location-sensitive, we need a stab at where these things will happen and we need to understand what the system is going to be used for because in safety, context is everything. A system that might be perfectly safe in one context, if it’s being used not for what it was originally designed for or conceived for, can become very dangerous without anybody realizing.

Other specific hazard management requirements. What definitions are we using? Very important because again, it’s very easy to get different information that’s being generated against different definitions by different contractors. And then it’s utter confusion. Can we compare like with like, or can’t we? What risk matrix are we going to use on this program? What normally happens on 882 programs is people just take the risk matrix out of the standard and use it without changing it. Now, that might be appropriate in certain circumstances, but it isn’t always. But I’m going to talk about that, that’s a very complex, high-level management issue and I’m going to be talking about that in a separate issue about how do we actually derive a suitable risk matrix for our purposes and why we should do so. Because the use of an unsuitable matrix can cause all sorts of problems downstream, both conceptual problems in the way that we think about stuff and lower levels, sort of mechanistic problems. But I don’t have time to go into that here.

Then references and sources of hazard identification. This is another reason why the purchaser needs to have done their homework. Maybe we want the contractor or whoever is doing the analysis to look at particular sources of information that we consider to be relevant and necessary to consider. we need to specify that and understand what they are. And usually we need to understand why we want them as well.

Commentary

That’s what was in the standard, as you see it’s very short, is only a page and a half in the standard and it is quite a light, high-level definition of the task because it’s an early task. Now let’s add some value here. Task 201 goes talks all about historical data. However, that is not the only way to do preliminary hazard identification. There are in fact two other classic methods to do PHI. One is the use of hazard checklists and you can also use some simple analysis techniques. And we need to remember that this is preliminary hazard identification, we’re doing this early and often to identify as many hazards as possible to find those hazards and the associated causes, consequences, maybe some controls as well. we’re trying to find stuff, not dismiss it or close out the hazards. And again, I’ve seen projects where I’ve read a preliminary hazard identification report and it says, we closed 50 hazards, and I think, no, you didn’t, you weren’t supposed to close anything because this is preliminary hazard identification. You identify stuff and then it gets further analyzed. And if upon analysis, you discover actually this hazard is not relevant, it cannot possibly happen, then, and only then, can you close it. Let’s remember, this is a preliminary hazard ID.

Commentary – Historical Data

First of all, let’s look at historical data. And first of all some issues with using this historical data, availability. Can we actually get hold of it? Now, it may be that you work for a big corporate or government organization that for whatever reason has good record keeping and you’ve got lots and lots of internal data that is of good quality that you’re allowed to access and that you know about and you can find or discover. If you are one of those people who are very, very lucky, you are in a minority, in my experience. If you’ve got all that stuff, fantastic, use it. But if you haven’t or if the information is of poor quality or people won’t give you access for whatever reason. And there are all sorts of reasons why people want to conceal information, they’re frightened of what people may discover, especially safety engineers. You may have to go out to external sources.

Now, the good news is that in the age of the Internet, getting hold of external data is extremely easy. there’s lots of potential sources of data out there, and it may range from stuff on Wikipedia, public reporting of accidents and incidents by regulators or by trade associations or by learned societies that study these things or by academics or by consultancy such as the one that I work for. there’s all kinds of potential sources of information out there that might be relevant to what you’re doing. And even if you’ve got good internal information, it’s probably worth searching out there for what’s external as a due diligence exercise, if nothing else, just to show that you haven’t just looked inwardly, that you’ve actually looked outwardly the rest of the world. There are lots of good sources of information out there. And depending on what industry you’re in, what domain you work in, you will probably know some of the things that are relevant in your area.

Now, just because data is available doesn’t mean that it’s reliable. It might be vague or inconsistent. We’ll come onto that later. It might be patchy. It’s usual for incidents to be underreported, especially minor incidents. you will find often that the stuff that gets reported is only the more serious stuff, and you should really assume that there has been under-reporting unless you’ve got a good reason not to. But to be honest, underreporting is the norm almost everywhere. there’s the issue of reliability, the data that you’ve got will be incomplete.

Secondly, another big issue is consistency. People might be reporting mishaps or incidents or accidents or events or occurrences. They might be using all sorts of different terminology to describe stuff that may or may not be relevant to what you’re talking about. And there’s lots of information out there, but actually, how has it been classified? Is it consistent? Can you compare all these different sources of information? And that can be quite tricky. And very often because of inconsistencies in the definition of a serious injury, for example, you may find that all you can actually compare with confidence is fatalities because it’s difficult to interpret death in different ways. As a safety engineer, frequently I start with fatal accidents, if there are any, because those can’t be misinterpreted. And then you start looking at serious injuries, minor injuries, incidents where no one gets hurt, but somebody could have been. There are all sorts of pitfalls with the consistency of the data that you might get a hold of.

There’s relevance. It may be that you’re looking at data from a system that superficially looks similar to yours, but with a bit of digging, you may discover that although the system was similar, it’s being used in a completely different context and therefore there are significant differences in the reporting and what you’re seeing. there may be data that is out there, but just not relevant for whatever reason.

And finally, objectivity. Now, this is a two-way street. Historical data is fantastic for objectivity because it stops people from saying subjectively, this couldn’t possibly happen. And I’ve heard this many times, you come up with something and somebody said, oh that couldn’t possibly happen, and then you show them the historical evidence that says, well it’s happened many times already and then they have to eat their words. Historical data is fantastic for keeping things objective, provided of course, that it’s available, reliable, consistent, and relevant. you’ve got to do a bit of work to make sure that you’re getting good data, but if you can, it’s absolutely worth its weight in gold, not just for Hazard I.D., but for torpedoing some of the stupid things that people come out with when they’re trying to stop you doing your job for whatever reason. historical data is great for shooting down prejudice is basically what I’m saying. reality always wins. That’s true in safety in the real world and in the safety analysis.

Having said all that, what’s the applicability of historical data? It may be that really we can only use it for preliminary hazard identification and analysis. (I’ve just noticed I’ve got preliminary hazard identification and analysis.) Sometimes I see contractors try to use historical data to say, that’s the totality of my safety argument, my kit is wonderful, it never goes wrong and therefore it will never go wrong, that’s the totality of my safety argument. And that never works, because when you start trying to use historical data as the complete safety argument, you very quickly come up against these problems of availability, reliability, consistency, and relevance.

It’s almost impossible to argue that a future system will be safe purely because it’s never gone wrong in the past. And in fact, trying to make such claims as, it’s never gone wrong, we’ve never had a problem, we’ve never had an incident, straight away that would suggest to me that they don’t have a very good incident reporting system or that they’ve just conveniently ignored the information they do have and not that people selling things ever do anything like that? Of course, no, never. There’s a lot of used car salesmen out there. probably this use of historical data, we might have to keep it fairly limited. It might be usable for preliminary work only. And then we have to do the real work with analysis. But almost certainly it’s not going to be the whole answer on its own. do bear that in mind, historical data has its limits.

It’s also worth remembering that we get data from people as well. In Australia, the law requires managers to consult with workers in order to get this kind of information. No doubt in other countries there are similar obligations. there’s lots of people out there, potentially workers, management, suppliers and users, maintainers, regulators, trade associations, lots of people who might have relevant information. we really ought to consult them if we can. Sometimes that information is published, but other times we have to go and talk to people or get them to come to a preliminary hazard I.D. meeting in order to take part. There are lots of good ways of doing this stuff.

Commentary – Hazard Checklists

Let’s move on to hazard checklists. Checklists are great because someone else has done the work for you to a degree that it’s quick and cheap to get a checklist from somewhere and go through it to see if you can find anything that prompts you to go, yeah that could be an issue with my system. And the great thing about checklists is they broaden the scope of your hazard I.D. because if your historical data is a bit patchy or a bit inconsistent as it often is, it will identify some stuff, but not everything. the great thing about a checklist is very often broad and shallow, it really broadens the scope of the hazard I.D., it complements your historical data. I would always recommend having a go with a checklist.

Now, bear in mind that checklists tend to identify causes, you then have to use some imagination to go, okay, here’s a cause, how in the context of my system, how in the context of this concept of operations (very important), in this context, how could that cause lead to a hazard and maybe to a mishap? you need to apply some imagination with your checklist and it can be a good way of prompting a meeting of stakeholders to think about different issues because people will turn up with an axe to grind, they’ll have their favourite thing they want to talk about. Having a checklist keeps it objective or having historical data to review, keep it objective, and keeps people on track, so they don’t just go down a rabbit hole and never look at anything else.

But again, this is preliminary hazard identification only. if something comes up, I would advise you to take the position that it could happen unless we have evidence that it could not. And notice, I say evidence, not opinion. I’ve met plenty of people who will swear blind, that such and such could not possibly happen. A classic one that suckered me, somebody said no British pilot would ever be stupid enough to take off with that problem, and like a fool, I believed them. Don’t listen to opinion, however convincing it is unless there’s evidence to say it cannot happen because it will. And in that case, it did two weeks later. don’t believe people when they say, oh that couldn’t possibly happen, it just shows a lack of imagination. Or they’ve got some vested interests and they’re trying to keep peace and keep you away from something.

It’s worth mentioning, in Australia at minimum, we need to use the approach for Hazard I.D. that is in the WHS Risk Management Code of Practice. there’s some good basic advice in that code of practice on what to do to identify and analyse hazards and assess risks and manage them. We need to do it, at minimum. It’s a good way to start, and in fact there’s a bit of a hazard checklist in there as well. It’s not great, it’s workplace stuff mainly rather than design stuff or systems engineering stuff. But nevertheless, there’s some good stuff in there and that is the absolute bare minimum that we have to do in Australia. And there will probably be local equivalents wherever you are.

If you’re looking for a good example for a general checklist, if you look in, the U.K.’s ASEMS, which is the MOD acquisition safety and environmental management system. In POSMS, which is the project-oriented safety management system, there is a safety management procedure, SMP04, which is PHI. And that’s got a checklist in there. It’s aimed at sort of big equipment, military equipment, but there’s a lot of interesting stuff in there that you could apply to almost anything. If you look online, you’ll probably find lots of checklists, both general checklists and specialist checklists for your areas, maybe your trade association, or whatever has a specialist checklist for the particular stuff, the thing that you do. always good to look up those things online, and see if you can access them and use them. And as I say, using multiple techniques helps us to ensure or have confidence that we’ve got fairly complete coverage, which is something that we’re going to need later on. And dependent on your regulator, you might have to demonstrate that you’ve done a thorough job, using multiple techniques is a good way of doing that. I’ve already said checklists nicely complement historical data because they’ve got different weaknesses.

Commentary – Analysis Technique

A third technique, which again takes a different approach, it complements the other two, is to use some kind of analysis technique to identify hazards. And there are lots of them out there. Again, I’m not going to go into them now in this session, I’m just going to give you one example, which is probably the simplest one I know, and therefore the most cost-effective. And you can do this as a desktop exercise and then get some stakeholders in and do it live with the stakeholders. Either use what you’ve prepared, or keep what you’ve prepared in your back pocket if you need to get things going, if people are stumped, they’re not sure what to do.

Now, this technique I’m just going to talk about is called Functional Failure Analysis (FFA). And really all it does, you take a basic top-level function of whatever it is that you’re considering, you’ve got your concept of operations that says, I need a system to do X, Y, Z, you go, let’s look at X, Y, and Z, and with each one of these functions, what happens if it doesn’t work when it’s supposed to work, or what happens if it works when I don’t want it to? That’s the un-commanded function or unwanted function, maybe. And then what if it happens, but it doesn’t happen completely correctly. What if it happens incorrectly? And there might be several different answers to that.

I’ll give you an example. Let’s assume that you’re Mr. Mercedes, and you’re inventing the horseless carriage, you’re inventing the automobile, the car. And you say, this thing, it’s got a motor, I wanted it to start off, I want it to go and then I want to stop. those really, really simple conceptual ideas, I want it to go, or I want it to start moving. What happens if it doesn’t? Well, nothing actually, from a safety point of view. The driver might be a bit frustrated, but it’s not going to hurt anybody. Next, an un-commanded function: what if it goes when it’s not supposed to? Now that’s bad. Or maybe the vehicle will roll away downhill when it’s not supposed to. We need a parking brake, in that case, we need a handbrake or use chocks or something or we restrain it.

Straight away, with something as simple and simplistic as this, you can begin to identify issues and say, we need to do something about that. this is a really powerful technique, you get a lot of bangs per buck. And then, of course, we could go on with the example, it’s a trivial example, but you can see potentially how powerful it is providing you’re prepared to ask these open-ended questions and answer them imaginatively without closing your mind to different possibilities. there’s an example of an analysis technique, and again, remember that this preliminary hazard ID. If we’ve identified something that could happen, then it could happen unless we have evidence that it could not.

I’ve talked for long enough, it just remains for me to point out that the quotations from Mil-Std are copyright free. But this video is copyright of The Safety Artisan 2020. And you can find more safety information, more lessons, and more safety resources at www.safetyartisan.com. I just want to say that’s the end of the lesson, thank you very much for listening and I hope you’ve found today’s session useful. Goodbye.

End: Preliminary Hazard Identification

You can find a free pdf of the System Safety Engineering Standard, Mil-Std-882E, here.