
Transcript: Functional Hazard Analysis (T208)

In the full-length (40-minute) session, The Safety Artisan looks at Functional Hazard Analysis, or FHA, which is Task 208 in Mil-Std-882E. FHA analyses software, complex electronic hardware, and human interactions. We explore the aim, description, and contracting requirements of this Task, and provide extensive commentary on it. (We refer to other lessons for special techniques for software safety and Human Factors.)

Transcript: Functional Hazard Analysis

Introduction

Hello, everyone, and welcome to the Safety Artisan; Home of Safety Engineering Training. I’m Simon and today we’re going to be looking at how you analyse the safety of functions of complex hardware and software. We’ll see what that’s all about in just a second.

Functional Hazard Analysis

I’m just going to get to the right page. As you can see, Functional Hazard Analysis is Task 208 in Mil-Std-882E.

Topics for this Session

What we’ve got for today: we have three slides on the purpose of Functional Hazard Analysis, and these are all taken from the standard. We’ve got six slides of task description (that’s the text from the standard), plus two tables that show you how it’s done, taken from another part of the standard, not from Task 208. Then we’ve got update and recording, another two slides; contracting, two slides; and five slides of commentary, which again include a couple of tables to illustrate what we’re talking about.

Functional HA Purpose #1

What we’re going to talk about is, as I say, functional hazard analysis. So, first of all, what’s the purpose of it? And in classic 882 style, task 208 is to perform this functional hazard analysis on a system or subsystem or more than one. Again, as with all the other tasks, it’s used to identify and classify system functions and the safety consequences of functional failure or malfunction. In other words, hazards.

Now, I should point out at this stage that the standard is focused on malfunctions of the system. The truth is, in the real world, lots of software-intensive systems have been involved in accidents that have killed lots of people, even when they were functioning as intended. That’s one of the short-sighted aspects of this Mil-Standard: it focuses on failure. The idea that, if something is performing as specified, the specification might be wrong, or there might be some disconnect between what the system is doing and what the human expects; the way the standard is written just doesn’t recognize that. So, it’s not very good in that respect. However, bearing that in mind, let’s carry on with looking at the task.

Functional HA Purpose #2

We’re going to look at these consequences in terms of severity (severity only; we’ll come back to that) for the purpose of identifying what they call safety-critical functions, safety-critical items, safety-related functions, and safety-related items. And a quick word on that: I hate the term ‘safety-critical’ because it suggests a binary choice, “Either it’s safety-critical, yes, or it’s not safety-critical, no.” And lots of people take that to mean that if it’s not safety-critical, then it’s got nothing to do with safety. They don’t recognize that there’s a sliding scale between maximum safety criticality and none whatsoever. And that’s led to a lot of bad thinking and bad behaviour over the years, where people do everything they can to pretend that something isn’t safety-related by saying, “Oh, it’s not safety-critical, therefore we don’t have to do anything.” And that kind of laziness kills people, is the short answer.

Anyway, moving on. So, we’ve got these SCFs, SCIs, SRFs, and SRIs, and they’re supposed to be allocated or mapped to a system design architecture. The assumption in this task is that we’re doing it early (we’ll see that later) and that the system design, the system architecture, is still up for grabs; we can still influence it. Often that is not the case these days. This standard was written many years ago, when the military used to buy loads of bespoke equipment and have it all developed from new. That doesn’t happen so much anymore in the military, and it certainly doesn’t happen in many other walks of life, but we’ll talk about how you deal with the realities later. And we’re allocating these functions and these items of interest to hardware, software, and human interfaces.

And I should point out, when we’re talking about all of this, that these things are complex. Software is complex, the human is complex, and we’re talking about complex hardware. So, we’re talking about components where you can’t just say, “Oh, it’s got a reliability of X, and that’s how often it goes wrong”, because those types of simple components that are only really subject to random failure are not what we’re talking about here. We’re talking about complex stuff, where systematic failure dominates over random, simple hardware failure. So, that’s the focus of this task and what we’re talking about. That’s not explained in the standard, but that’s what’s going on.

Functional HA Purpose #3

Now, our third slide on purpose: we use the FHA to identify the consequences of malfunction, functional failure, or lack of function. As I said just now, we need to do this as early as possible in the systems engineering process to enable us to influence the design. Of course, this is assuming that there is a systems engineering process; that’s not always the case. We’ll talk about that at the end as well. And we’re going to identify and document these functions and items, allocate them and, as it says, partition them in the software design architecture. When we say partition, that’s jargon for separating them into independent functions. We’ll see the value of that later on. Then we’re going to identify requirements and constraints to put on the design team, to say, “To achieve this allocation and this partitioning, this is what you must do and this is what you must not do.” So again, the assumption is that we’re doing this early and there’s a significant amount of bespoke design yet to be done.

Task Description (T208) #1

Moving on to task description. It says ‘the contractor’, but whoever’s doing the analysis has to perform and document the FHA, analysing those functions, as it says, in the proposed design. I’ve talked about that already, so we’ll move on.

It’s got to be based on the best available data, including mishap data. So, accident/incident data, if you can get it from similar systems and lessons learned. As I always say in these sessions, this is hard to do, but it’s really, really valuable so do put some effort into trying to get hold of some data or look at previous systems or similar systems. We’re looking at inputs, outputs, interfaces and the consequences of failure. So, if you can get historical data or you can analyse a previous system or a similar system, then do so. It will ultimately save you an awful lot of money and heartache if you can do that early on. It really is worth the effort.

Task Description (T208) #2

At a minimum, we’ve got to identify and evaluate functions, and to do that we need to decompose the system. So, imagine we’ve got this great big system: we’ve got to break it down into subsystems and major components. We’ve got to describe what each subsystem and major component does, its function or its intended function. Then we need a functional description of interfaces, thinking about what connects to what and the functional ins and outs. Pretty obvious stuff, I guess, but it needs to be done.

Task Description (T208) #3

And then we also need to think about hazards associated with, first of all, loss of function: no function when we need it. Then we have degraded function, malfunction, and functioning out of time or out of sequence. So, we’ve got different kinds of malfunction. What we don’t have here is function when not required, where the system goes active for some reason and does something when it’s not meant to. Now, if we add that third case, we’ve got a functional failure analysis. Essentially, we’re talking here about a functional failure analysis, or maybe something a bit more sophisticated, like a HAZOP. And the HAZOP is more sophisticated because, instead of just those three things that can go wrong, we’ve got lots of guide words to help us think about ‘out of time, out of sequence’: too early, too late, before intended, after intended, whatever it might be. And there are variations on HAZOP, called computer HAZOP or CHAZOP, where people have come up with different guide words, different prompt words, to help you think about software and data-intensive systems. So, that’s a possible technique to use here.
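
To make that concrete, here is a minimal sketch in Python of how you might set up a functional failure analysis worksheet: every function is paired with every deviation guide word, and the team fills in the causes and consequences. The function names and guide words below are illustrative assumptions, not taken from the standard; a real CHAZOP would use an agreed prompt-word set.

```python
# Minimal functional failure analysis worksheet generator (illustrative only).
# The functions and guide words below are assumed examples, not from Mil-Std-882E.

from itertools import product

functions = [
    "Provide altitude data to the display",
    "Command the safe/arm state",
]

guide_words = [
    "No function when required",      # loss of function
    "Function when not required",     # unintended activation
    "Degraded function",
    "Incorrect function (malfunction)",
    "Too early / before intended",    # out of time
    "Too late / after intended",
    "Out of sequence",
]

def make_worksheet(functions, guide_words):
    """Return one analysis row per (function, guide word) pair for the team to assess."""
    return [
        {
            "function": fn,
            "deviation": gw,
            "cause": "",          # to be completed in the analysis session
            "consequence": "",    # next step in the mishap sequence
            "mishap": "",         # final outcome in the real world
        }
        for fn, gw in product(functions, guide_words)
    ]

for row in make_worksheet(functions, guide_words):
    print(f"{row['function']:40} | {row['deviation']}")
```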

And then, when we’re thinking about these hazards that might be generated by malfunction or functional failure in its various forms, we need to think about, “What’s the next step in the mishap sequence, in the accident sequence? And what’s the final outcome of the accident sequence?” That’s very important for software, because software is intangible; it has no physical form. On its own, in isolation, software cannot possibly hurt anyone. So, you’ve got to look at how the software failure propagates through the system into the real world and how it could harm people. So, that last sentence in yellow there is a very important prompt.

Task Description (T208) #4

And we carry on. We need to assess the risk associated with failure of a function, subsystem, or component. We’re going to do so using the standard 882 tables, Tables I and II, and the risk assessment codes in Table III, unless we come up with our own tailored versions of those tables and that matrix, and that tailoring is approved. In reality, most people don’t tailor this stuff. They should make it appropriate for the system, but they rarely do.

Table I and II

So, just to remind us what we’re talking about, here are Tables I and II. Table I is severity categories, ranging from Catastrophic, which could kill somebody (a catastrophic outcome), down to Negligible, where we’re talking cuts and bruises, very minor injuries.

And then Table II is probability levels. We’ve got everything from Frequent down to Eliminated, where there’s no hazard at all because we’ve eliminated it; it will never happen in the lifetime of the universe, so it really is a zero probability. So, we’ve got Frequent down to Improbable and then, in the standard, we’ve got a definition for each of these in words, for a single item and also for a fleet or inventory of those items, assuming that there’s a large number of them. And that’s very useful. That helps us to think about how often something might go wrong per item and per fleet.
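
As a rough illustration of that per-item versus per-fleet point (my own worked example, not from the standard): if each item has some probability of the event over its life, the chance of seeing it at least once across a fleet grows quickly with fleet size, assuming the items fail independently.

```python
# Illustrative only: relationship between per-item and per-fleet likelihood.
# Assumes independent items; the numbers below are made-up examples.

def fleet_probability(p_item: float, fleet_size: int) -> float:
    """Probability of at least one occurrence across the fleet,
    given the per-item probability over the item's service life."""
    return 1.0 - (1.0 - p_item) ** fleet_size

p_item = 1e-3      # assumed per-item probability over its service life
for n in (1, 10, 100, 1000):
    print(f"fleet of {n:4d}: P(at least one occurrence) = {fleet_probability(p_item, n):.3f}")
```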

Table III

So, those are Tables I and II. We put them together, the severity and the probability, to give us Table III. As you can see, we’ve got probability down the left-hand side, and at the bottom, if we’ve eliminated the hazard, then there is no severity; the hazard is completely eliminated, so forget about that row. For everything else, we’ve got probability from Frequent down to Improbable, and severity from Catastrophic down to Negligible. Together those generate the risk assessment code, which is either high, serious, medium, or low. That’s the way this standard defines things. Nothing is off-limits, and nothing is perfect except for elimination. We’ve just defined a level of risk, and then you have to make up rules about how you will treat these levels of risk. The standard does some of that for you, but usually you’ve got to work out, depending on which jurisdiction you’re in legally, what you’re required to do about different levels of risk.
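
By way of illustration, here is a minimal Python sketch of Table III as a lookup. The cell values reflect my reading of the standard’s Table III and should be checked against the standard, or against your approved tailored matrix, before anything like this is relied on.

```python
# Minimal sketch of a Table III style risk assessment code lookup.
# Cell values reflect my reading of Mil-Std-882E Table III; verify against the
# standard or your approved, tailored matrix before relying on them.

SEVERITIES = ["Catastrophic", "Critical", "Marginal", "Negligible"]

RISK_MATRIX = {
    "Frequent":   ["High",    "High",    "Serious", "Medium"],
    "Probable":   ["High",    "High",    "Serious", "Medium"],
    "Occasional": ["High",    "Serious", "Medium",  "Low"],
    "Remote":     ["Serious", "Serious", "Medium",  "Low"],
    "Improbable": ["Medium",  "Medium",  "Medium",  "Low"],
}

def risk_assessment_code(probability: str, severity: str) -> str:
    if probability == "Eliminated":
        return "Eliminated"          # no hazard remains, so no severity applies
    return RISK_MATRIX[probability][SEVERITIES.index(severity)]

print(risk_assessment_code("Occasional", "Catastrophic"))  # -> High
print(risk_assessment_code("Remote", "Negligible"))        # -> Low
```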

Now, I’ll just mention that this table on its own is not helpful in a British or Australian jurisdiction, where we have to reduce or eliminate risks so far as is reasonably practicable (SFAIRP). The table on its own won’t help you do that, because this is just an absolute level of risk. It’s not considering what you could have done to make it better. It’s just saying where we are; it’s a status report.

So, those are your tables one, two and three, as the standard describes them. That’s the overall method and we’re going to do what it says in Section four of the standard. In the main body of the standard, Section four talks about software and complex hardware and how we allocate these things.

Task Description (T208) #5

And then finally, I think, on task description: an assessment of how the functions identified are to be implemented in the design, and a mapping of those functions onto the components. And then it says functions allocated to software should be matched to the lowest level of technical design or configuration item. So, if you’ve got a software or hardware configuration item that is further subdivided into sub-items, then you need to go all the way down and see which items can contribute to that function and which can’t.

That’s an important labour-saving device, because you could have quite a large configuration item where, actually, only a tiny bit contributes to the hazard. So, in theory, that’s the only thing you need to worry about. In reality, partitioning software is not as easy as the standard might suggest. However, if we can do a meaningful partition, then we could and should aim to have as little safety-related software as we possibly can, if nothing else for cost, and in order to get the project in on time. So, the less criticality we have in our system, the better.

Task Description (T208) #6

So, we need to assess the software control category for each configuration item that’s been allocated a safety-significant software function, or SSSF. Having assigned the SCC, we then have to work out the software criticality index for each of those functions, and we’ll talk about how to do that at the end. Then, from all of this work, we need to generate a list of requirements and constraints to include in the spec which, if they work, will eliminate the hazard or reduce the risk.

And the standard says that these could be in the form of fault tolerance, fault detection, fault isolation, fault annunciation or warning, or fault recovery. Now, this breakdown is basically a reliability breakdown. In the world of reliability, we typically talk about fault tolerance, fault detection, warning, and recovery: four things; they split them into five here. Now, software reliability is highly controversial, so really there is a bit of a mismatch here. These reliability-based suggestions are not necessarily much use for software, or indeed for people sometimes. You may have to use other, more typical software techniques to do this, and in fact the standard does point you to do that. But that’s for another session.

FHA Update & Records #1

So, we’ve done the FHA, or we’re doing the FHA. We’ve got to record it and we’ve got to update it when new information comes through. So, we’ve got to update the FHA as the design progresses or as operational changes come in. We’ve got to have a system description of the physical and functional characteristics of the system and subsystems. And of course, for complex items like software, context is everything, so this is very important. Again, software in isolation cannot hurt anyone; you’ve got to have the context to understand what the implications might be. If we don’t have that, we’re pretty much stuffed. Then it goes on to say that, when further documentation becomes available, more detail needs to be supplied. So, don’t forget to ask for that in your contract, and expect it as well, and be ready to deal with it.

FHA Update & Records #2

Moving on. When it comes to hazard analysis methods and techniques, we need to describe the method and the technique used for the analysis, and what assumptions and what data were used in support of the analysis. This statement is in pretty much every single task, so I’ll say no more; you’ve heard this before. Then, again, analysis results need to be captured in the hazard tracking system and, as I’ve always said, usually the leading details, the top-level details, go in the hazard tracking system. The rest of it goes into the hazard analysis report; otherwise, you end up with a vast amount of data in your HTS and it becomes unwieldy and potentially useless.

Contracting #1

Contracting. Again, this is a pretty standard clause, or set of clauses, in a Mil-Std-882 task. So, in our request for proposal and statement of work, we’ve got to ask for Task 208. We’ve got to point the analyst, the contractor, at what we want them to analyse in particular, or maybe as a minimum, and at what we don’t want them to analyse, maybe because it’s been done elsewhere or it’s out of scope for this system.

We need to say what our data reporting requirements are, considering Task 106, which is all about the hazard tracking system, or the hazard log, or the risk register, whatever you want to call it. So, what data do we want? What format? What are the definitions, etc.? Because if you’re dealing with multiple contractors, or you want data that is compatible with the rest of your inventory, then you’ve got to specify what you want. Otherwise, you’re going to get variability in your data, and that’s going to make your life a whole lot harder downstream. Again, this is standard stuff.
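
One way to pin down “what data, what format, what definitions” is simply to specify a record structure up front. The sketch below is illustrative only; the field names and allowed values are my assumptions, not a mandated schema. The point is that every contractor delivers the same fields with the same definitions.

```python
# Illustrative hazard-record structure for a hazard tracking system (HTS).
# Field names and values are assumptions, not mandated by Mil-Std-882E; the aim
# is to agree one definition with every contractor before the data starts arriving.

from dataclasses import dataclass, field

@dataclass
class HazardRecord:
    hazard_id: str                    # e.g. "HAZ-0042", unique across the programme
    title: str
    description: str
    severity: str                     # one of the agreed Table I categories
    probability: str                  # one of the agreed Table II levels
    risk_assessment_code: str         # derived from the agreed Table III matrix
    status: str = "Open"              # e.g. Open / Mitigated / Closed
    related_requirements: list = field(default_factory=list)

record = HazardRecord(
    hazard_id="HAZ-0042",
    title="Loss of altitude display",
    description="Altitude data not presented to the operator when required.",
    severity="Critical",
    probability="Remote",
    risk_assessment_code="Serious",
)
print(record)
```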

And what are the applicable requirements, specifications, and standards? Of course, this is an American standard, so compliance with specifications, requirements, and standards is what it’s all about, because that’s the American system.

Contracting #2

We need to supply the concept of operations because, as I’ve said before, with a complex design, especially software, context is everything. So, we need to know what we’re going to do with the system that the software sits within. This system has got some functions, and that’s what we’re looking at in Task 208: What are those functions for? How do they relate to the real world? How could we hurt people? And then, have we got any other specific hazard management requirements? Maybe we’re using a special matrix because we’ve decided the standard matrix isn’t quite right for our system. Whatever we’re doing, if we’ve got special requirements that are not the norm for the vanilla standard, then we need to say what they are. Pretty straightforward stuff.

Commentary #1

We’re onto commentary, and I think we’ve got five slides of commentary today. As it says, functional hazard analysis depends on systems engineering. So, if we don’t have good systems engineering, we’re unlikely to have a good functional hazard analysis. What do I mean by good systems engineering? I mean that, for the complete system (apart from things that we deliberately excluded for a good reason), we need all functions to be identified, and we need those functions to be analysed and allocated correctly, rigorously, and consistently. We need interface analysis and control, and we need the architecture of the design to be determined based on the higher-level requirements, all that work that we’ve done.

Now, if those things are not done or they’re incomplete, or they were done too late to influence the design architecture, then you’re going to have some compromised systems engineering. And these days, because we’re using lots of commercial off the shelf stuff, what you find is that your top-level design architecture is very often determined before you even start because you’ve decided you’re going to have an off the shelf this and you’re going to have a modified off the shelf that and you’re going to put them together in a particular way with a set of business rules, a concept of operations, that says this is how we’re going to use this stuff.

And our new system interfaces with some existing stuff, and we can’t modify the existing stuff. So, that really limits what we can do with the design architecture. A lot of the big design decisions have already been taken before we even get started. Now, if that’s the case, then that needs to be recognized and dealt with. I’ve seen those things dealt with well; in other words, the systems engineering has been done recognizing those constraints, those things that can’t be done. And I’ve seen it done badly, where, figuratively speaking, the systems engineering team or the program manager, whoever, has just given a Gallic shrug and gone, “Yeah, what the heck, who cares?” So, those are the two extremes that you can see.

Now, if the systems engineering is weak or incomplete, then you’re going to get a limited return on doing Task 208. Maybe there are some areas where you can do it, some new areas, or maybe you’ve got a new interface that’s got to be worked up and created in order to get these things to talk to each other. Clearly, there is some mileage in doing that; you’re going to get some benefits in that area. But for the stuff that’s already been done, well, what’s the point of doing systems engineering there? What does it achieve? So, maybe in those circumstances it’s better, in fact I would say it’s essential, to understand where systems engineering is still valid, where you’re still going to get some results, and where it isn’t. And maybe you just declare that scope: what’s in and what’s out.

Or maybe you take a different approach. Maybe you go, “OK, we’re dealing with a predominantly COTS system. We need a different way of dealing with this than the way Mil-Std-882 assumes.” So, you’re going to have to do some heavy tailoring of the standard, because 882 assumes that you’re determining all these requirements before the design is done. If that’s not the case, then maybe 882 isn’t for you. Or maybe you just need to recognize that you’re going to have to hack it about severely, which in turn means you’ve got to know what you’re doing, fundamentally. In which case the standard is really no longer fulfilling its role of guiding people.

Commentary #2

Moving on. Let’s assume that we are still going to do some of Task 208. We’re going to determine some software criticality. We’re also going to determine some criticality for complex hardware, so whether it be software or complex electronics, pre-programmed electronics, whatever that might be.

First of all, as we said before, we’re going to determine the software control category, and what that’s really asking is: how much authority does the software have? And then, secondly, we’re going to be looking at severity, which was Table I: how severe is the worst hazard or risk that the software could contribute to? These are illustrated in the next two slides. And a session, or several sessions, on software safety is coming soon; that will be elsewhere. I’m not going to go into massive detail here. I’m just giving you an overview of what the task requires.

Commentary #3: Software Control Categories 1-5

First of all, how do we determine software control category? So, there’s the table from the standard. We’ve got five levels of SCC.

At the top, we’ve got autonomous. Basically, the software does whatever it wants to, and there are no checks and balances.

Secondly, there’s semi-autonomous. There’s one software system performing a function, but there are hardware interlocks and checks. And those hardware interlocks and checks, or whatever else is there that is not software, can work fast enough to prevent the accident happening. So, they can prevent harm. That’s semi-autonomous.

Then we’ve got redundant fault-tolerant, where you’ve got an architecture, typically with more than one channel, and maybe all channels are software controlled. Maybe there’s diversity in the software, and there is some fault-tolerant architecture: maybe a voting system or some monitoring system saying, “Well, Channel Three’s output is looking a bit dodgy”, or “Something’s gone wrong with Channel Two. I’ll ignore the channel at fault, and I’ll take the good output from the channels that are still working and use that.” So that’s that option. Very common.

Then we’ve got number four, which is influential. So, the software is displaying some information for a human to interpret and to accept or reject.

And then we’ve got five, which is no safety impact at all. Now, the problem child in this, of course, is influential, because it’s very easy to say, “The software just displays some information; it doesn’t do anything unless a human does something, so we don’t have to worry about the safety implications of that at all.” Wrong! Because the human operator may be forced by circumstances to rely on the software output; there may not be time to do anything else. Or the human may not be able to work out what’s going on without using the software output. Or, more typically, the humans have just got used to the software generating the correct information, or they interpret it incorrectly.

A classic example of that was when the American warship USS Vincennes shot down an airliner and killed 290 people because, the way the system was set up, the supposedly not-safety-related radar system was displaying information associated not with the airliner but with an Iranian military aircraft. And the crew got mixed up and shot down the airliner. So, that’s a risky one. Even though it’s down at number four, that doesn’t mean it’s without risk or without criticality.

Commentary #4

So, we have the software control category down the left-hand side, one to five, and along the top we have the severity category, from Catastrophic down to Negligible. We can use those to determine the software criticality index, which varies from one, most critical, down to five, least critical. It’s similar to the risk assessment code in Table III, the coloured matrix that I showed you earlier. So, the writers of the standard have made a determination for us, based on some assessment that they’ve done, saying, “This is how we assess these different criticality levels.” Whether there is actually any real-world evidence supporting this assessment, I don’t know, and I’m not sure anybody else does either. However, that’s the standard, and that’s where we are.
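
By way of illustration, here is the same idea as a lookup in Python. The index values reflect my reading of the standard’s software criticality matrix (Table V); as with the risk matrix sketch earlier, verify them against the standard or your approved tailoring before use.

```python
# Minimal sketch of a Software Criticality Index (SwCI) lookup from
# Software Control Category (SCC) and severity. Values reflect my reading of
# Mil-Std-882E Table V; verify against the standard or your tailored matrix.

SEVERITIES = ["Catastrophic", "Critical", "Marginal", "Negligible"]

SWCI_TABLE = {
    1: [1, 1, 3, 4],   # SCC 1: Autonomous
    2: [1, 2, 3, 4],   # SCC 2: Semi-Autonomous
    3: [2, 3, 4, 4],   # SCC 3: Redundant Fault Tolerant
    4: [3, 4, 4, 4],   # SCC 4: Influential
    5: [5, 5, 5, 5],   # SCC 5: No Safety Impact
}

def software_criticality_index(scc: int, severity: str) -> int:
    return SWCI_TABLE[scc][SEVERITIES.index(severity)]

# An 'influential' display function whose worst credible outcome is catastrophic
# still attracts a meaningful level of rigour, per the Vincennes point above.
print(software_criticality_index(4, "Catastrophic"))  # -> 3
```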

Commentary #5

And so, just to finish up on the commentary. Task 208 is focused on software, and also on programmable electronics and complex hardware, typically electronics with software or logic functionality embedded within it. Now, if all of that software, all of those programmable electronic systems, are already developed, is there any point in doing Task 208? That’s the first thing: it’s got to pass the “So what?” test. Is it feasible to do 208 and expect to get benefits? If not, maybe you just do system and subsystem hazard analysis; that’s Tasks 205 and 204, respectively. And we just look at the complex components and subsystems as black boxes and say, “OK, what’s it meant to do? What are the interfaces?” Maybe that would be a better thing to do.

Particularly bearing in mind that the software or the complex electronic system could be working perfectly well and we still get an accident, because there’s been a misunderstanding of the output. Maybe it’s more beneficial to look at those interfaces and think about, “Well, in what scenarios could the human misunderstand? How do we guard against that?”

It’s also worth saying that some software development standards, particularly American ones, can work well with Mil-Std-882 because they share a similar conceptual basis. For example, I’ve seen many, many times in the air world that the system safety standard is 882 and the software standard is DO-178 (or ED-12; anyway, it’s the same standard, just different labels). Now, they work relatively well together because the concept underpinning 178 is very similar to 882. It’s American-centric. It’s a sort of cookbook approach: you put requirements on the software development, and the standard assumes that if you use the right ingredients and you mix them up in the right way, then you’re going to get a good result. That’s a similar concept to 882, and the two work relatively well together, fairly consistently. Also, because they’re both American, there’s a great focus on software testing. Certainly, the earlier versions of DO-178 were exclusively focused on software testing. Things like source code analysis and other, more modern techniques that have come in are not recognized at all in earlier versions of 178, because they just weren’t around. So, that focus on testing suits 882, because 882 generates lots of requirements and constraints, which you need to test.

What it’s not so good at is generating cases where you say, “Well, what if this goes wrong?” or “If we’re at the edge of the envelope, let’s test for those edge-of-the-envelope cases; let’s test that the software is working correctly when it’s outside of the operating envelope it should be in.” Now, that kind of thinking isn’t so strong in 882, nor in 178. So, there are some limitations there. Good, experienced practitioners will overcome those by adding in the smarts that the standards lack. But just be aware: a standard is not smart. You’ve still got to know what you’re doing in order to get the most out of it.

So, maybe you’re buying software that’s already developed, or you’re not in the States; you’ve got a European supplier, or an Indian or Japanese supplier, or whatever. Maybe they’re not using American-style techniques and standards. How well is that going to work with 882? Are they compatible? They might be, but maybe they’re not. So, that requires some thought. If they’re not obviously compatible, then what do you need to do to make that translation and make it work? Or, at least, where are the gaps, and what might you do to compensate?

And I’ve not talked about data, but it is worth mentioning that we have data-rich systems these days. I heard just the other day, is it two quintillion bytes of data being generated every two days, or something ridiculous? And that was back in 2017. So, gigantic amounts of data are being generated these days and used by computing systems, particularly artificial intelligence systems. So, the rigour associated with that data, the things that we need to think about on data, is potentially just as important as the software. Because if the software is processing rubbish data, you’re probably going to get rubbish results, or at the very least unreliable results that you can’t trust. So, you need to be thinking about all of those attributes of your data: correct, complete, consistent, etc. I probably need to do a session on that, and maybe I will.
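
As a small, hedged illustration of checking those data attributes before the software consumes them, here is a sketch of some completeness, validity, and consistency checks. The field names, ranges, and rules are my assumptions for illustration, not anything from the standard.

```python
# Illustrative data-quality checks (completeness, validity, consistency).
# Field names, ranges, and rules are assumed examples, not from Mil-Std-882E.

records = [
    {"sensor_id": "ALT-1", "altitude_ft": 35_000, "timestamp": 100.0},
    {"sensor_id": "ALT-1", "altitude_ft": None,   "timestamp": 101.0},   # incomplete
    {"sensor_id": "ALT-1", "altitude_ft": 250_000, "timestamp": 100.5},  # invalid and out of order
]

def check_record(rec, prev_timestamp):
    issues = []
    if rec["altitude_ft"] is None:
        issues.append("missing altitude (incomplete)")
    elif not (-1_000 <= rec["altitude_ft"] <= 60_000):
        issues.append("altitude outside credible range (invalid)")
    if prev_timestamp is not None and rec["timestamp"] <= prev_timestamp:
        issues.append("timestamp not increasing (inconsistent)")
    return issues

prev = None
for rec in records:
    problems = check_record(rec, prev)
    print(rec["timestamp"], problems or "OK")
    prev = rec["timestamp"]
```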

Copyright Statement

That’s the presentation. As you can see, everything in italics and quotes is out of the standard, which is copyright free. But this presentation is copyright of the Safety Artisan.

For More…

And you will find many more presentations and a lot more resources at the website www.safetyartisan.com. Also, you’ll find the paid videos on our Patreon page, which is www.patreon.com/SafetyArtisan or go to Patreon and search for the Safety Artisan.

End

Well, that’s the end of our presentation, and it just remains for me to say thanks very much for listening. Thanks for your time and I look forward to seeing you in the next session, Task 209. Looking forward to it. Goodbye.


Mil-Std-882E Appendix B

This is Mil-Std-882E Appendix B.
Back to Appendix A.

SOFTWARE SYSTEM SAFETY ENGINEERING AND ANALYSIS

B.1 Scope. This Appendix is not a mandatory part of the standard. The information contained herein is intended for guidance only. This Appendix provides additional guidance on the software system safety engineering and analysis requirements in 4.4. For more detailed guidance, refer to the Joint Software Systems Safety Engineering Handbook and Allied Ordnance Publication (AOP) 52, Guidance on Software Safety Design and Assessment of Munition-Related Computing Systems.

B.2. Software system safety. A successful software system safety engineering activity is based on a hazard analysis process, a safety-significant software development process, and Level of Rigor (LOR) tasks. The safety-significant software development process and LOR tasks comprise the software system safety integrity process. Emphasis is placed on the context of the “system” and how software contributes to or mitigates failures, hazards, and mishaps. From the perspective of the system safety engineer and the hazard analysis process, software is considered as a subsystem. In most instances, the system safety engineers will perform the hazard analysis process in conjunction with the software development, software test, and Independent Verification and Validation (IV&V) team(s). These teams will implement the safety-significant software development and LOR tasks as a part of the overall Software Development Plan (SDP). The hazard analysis process identifies and mitigates the exact software contributors to hazards. The software system safety integrity process increases the confidence that the software will perform as specified to software system safety and performance requirements while reducing the number of contributors to hazards that may exist in the system. Both processes are essential in reducing the likelihood of software initiating a propagation pathway to a hazardous condition or mishap.

B.2.1 Software system safety hazard analysis. System safety engineers performing the hazard analysis for the system (Preliminary Hazard Analysis (PHA), Subsystem Hazard Analysis (SSHA), System Hazard Analysis (SHA), System-of-Systems (SoS) Hazard Analysis, Functional Hazard Analysis (FHA), Operating and Support Hazard Analysis (O&SHA), and Health Hazard Analysis (HHA)) will ensure that the software system safety engineering analysis tasks are performed. These tasks ensure that software is considered in its contribution to mishap occurrence for the system under analysis, as well as interfacing systems within an SoS architecture. In general, software functionality that directly or indirectly contributes to mishaps, such as the processing of safety-significant data or the transitioning of the system to a state that could lead directly to a mishap, should be thoroughly analyzed. Software sources and specific software errors that cause or contribute to hazards should be identified at the software module and functional level (functions out-of-time or out-of-sequence malfunctions, degrades in function, or does not respond appropriately to system stimuli). In software-intensive, safety significant systems, mishap occurrence will likely be caused by a combination of hardware, software, and human errors. These complex initiation pathways should be analyzed and thoroughly tested to identify existing and/or derived mitigation requirements and constraints to the hardware and software design. As a part of the FHA (Task 208), identify software functionality which can cause, contribute to, or influence a safety-significant hazard. Software requirements that implement Safety-Significant Functions (SSFs) are also identified as safety significant.

B.2.2 Software system safety integrity. Software developers and testers play a major role in producing safe software. Their contribution can be enhanced by incorporating software system safety processes and requirements within the SDP and task activities. The software system safety processes and requirements are based on the identification and establishment of specific software development and test tasks for each acquisition phase of the software development life-cycle (requirements, preliminary design, detailed design, code, unit test, unit integration test, system integration test, and formal qualification testing). All software system safety tasks will be performed at the required LOR, based on the safety criticality of the software functions within each software configuration item or software module of code. The software system safety tasks are derived by performing an FHA to identify SSFs, assigning a Software Control Category (SCC) to each of the safety-significant software functions, assigning an Software Criticality Index (SwCI) based on severity and SCC, and implementing LOR tasks for safety-significant software based on the SwCI. These software system safety tasks are further explained in subsequent paragraphs.

B.2.2.1 Perform a functional hazard analysis. The SSFs of the system should be identified. Once identified, each SSF is assessed and categorized against the SCCs to determine the level of control of the software over safety-significant functionality. Each SSF is mapped to its implementing computer software configuration item or module of code for traceability purposes.

B.2.2.2 Perform a software criticality assessment for each SSF. The software criticality assessment should not be confused with risk. Risk is a measure of the severity and probability of occurrence of a mishap from a particular hazard, whereas software criticality is used to determine how critical a specified software function is with respect to the safety of the system. The software criticality is determined by analyzing the SSF in relation to the system and determining the level of control the software exercises over functionality and contribution to mishaps and hazards. The software criticality assessment combines the severity category with the SCC to derive a SwCI as defined in Table V in 4.4.2 of this Standard. The SwCI is then used as part of the software system safety analysis process to define the LOR tasks which specify the amount of analysis and testing required to assess the software contributions to the system-level risk.

B.2.2.3 Software Safety Criticality Matrix (SSCM) tailoring. Tables IV through VI should be used, unless tailored alternative matrices are formally approved in accordance with Department of Defense (DoD) Component policy. However, tailoring should result in a SSCM that meets or exceeds the LOR tasks defined in Table V in 4.4.2 of this Standard. A SwCI 1 from the SSCM implies that the assessed software function or requirement is highly critical to the safety of the system and requires more design, analysis, and test rigor than software that is less critical prior to being assessed in the context of risk reduction. Software with SwCI 2 through SwCI 4 typically requires progressively less design, analysis, and test rigor than high criticality software. Unlike the hardware-related risk index, a low index number does not imply that a design is unacceptable. Rather, it indicates a requirement to apply greater resources to the analysis and testing rigor of the software and its interaction with the system. The SSCM does not consider the likelihood of a software-caused mishap occurring in its initial assessment. However, through the successful implementation of a system and software system safety process and LOR tasks, the likelihood of software contributing to a mishap may be reduced.

B.2.2.4 Software system safety and requirements within software development processes. Once safety-significant software functions are identified, assessed against the SCC, and assigned a SwCI, the implementing software should be designed, coded, and tested against the approved SDP containing the software system safety requirements and LOR tasks. These criteria should be defined, negotiated, and documented in the SDP and the Software Test Plan (STP) early in the development life-cycle.

  • a. SwCI assignment. A SwCI should be assigned to each safety-significant software function and the associated safety-significant software requirements. Assigning the SwCI value of Not Safety to non-safety-significant software requirements provides a record that functionality has been assessed by software system safety engineering and deemed Not Safety. Individual safety-significant software requirements that track to the hazard reports will be assigned a SwCI. The intent of SwCI 4 is to ensure that requirements corresponding to this level are identified and tracked through the system. These “low” safety-significant requirements need only the defined safety-specific testing.
  • b. Task guidance. Guidance regarding tasks that can be placed in the SDP, STP, and safety program plans can be found in multiple references, including the Joint Software Systems Safety Engineering Handbook by the Joint Software Systems Safety Engineering Workgroup and AOP 52, Guidance on Software Safety Design and Assessment of Munition-Related Computing Systems. These tasks and others that may be identified should be based on each individual system or SoS and its complexity and safety criticality, as well as available resources, value added, and level of acceptable risk.

B.2.2.5. Software system safety requirements and tasks. Suggested software system safety requirements and tasks that can be applied to a program are listed in the following paragraphs for consideration and applicability:

  • a. Design requirements. Design requirements to consider include fault tolerant design, fault detection, fault isolation, fault annunciation, fault recovery, warnings, cautions, advisories, redundancy, independence, N-version design, functional partitioning (modules), physical partitioning (processors), design safety guidelines, generic software safety requirements, design safety standards, and best and common practices.
  • b. Process tasks. Process tasks to consider include design review, safety review, design walkthrough, code walkthrough, independent design review, independent code review, independent safety review, traceability of SSFs, SSFs code review, SSFs, Safety-Critical Function (SCF) code review, SCF design review, test case review, test procedure review, safety test result review, independent test results review, safety quality audit inspection, software quality assurance audit, and safety sign-off of reviews and documents.
  • c. Test tasks. Test task considerations include SSF testing, functional thread testing, limited regression testing, 100 percent regression testing, failure modes and effects testing, out-of-bounds testing, safety-significant interface testing, Commercial-Off-the-Shelf (COTS), Government-Off-the-Shelf (GOTS), and Non-Developmental Item (NDI) input/output testing and verification, independent testing of prioritized SSFs, functional qualification testing, IV&V, and nuclear safety cross-check analysis.
  • d. Software system safety risk assessment. After completion of all specified software system safety engineering analysis, software development, and LOR tasks, results will be used as evidence (or input) to assign software’s contribution to the risk associated with a mishap. System safety and software system safety engineering, along with the software development team (and possibly the independent verification team), will evaluate the results of all safety verification activities and will perform an assessment of confidence for each safety-significant requirement and function. This information will be integrated into the program hazard analysis documentation and formal risk assessments. Insufficient evidence or evidence of inadequate software system safety program application should be assessed as risk.
  • (1) Figure B-1 illustrates the relationship between the software system safety activities (hazard analyses, software development, and LOR tasks), system hazards, and risk. Table B-I provides example criteria for determining risk levels associated with software.

FIGURE B-1. Assessing software’s contribution to risk

  • (2) The risks associated with system hazards that have software causes and controls may be acceptable based on evidence that hazards, causes, and mitigations have been identified, implemented, and verified in accordance with DoD customer requirements. The evidence supports the conclusion that hazard controls provide the required level of mitigation and the resultant risks can be accepted by the appropriate risk acceptance authority. In this regard, software is no different from hardware and operators. If the software design does not meet safety requirements, then there is a contribution to risk associated with inadequately verified software hazard causes and controls. Generally, risk assessment is based on quantitative and qualitative judgment and evidence. Table B-I shows how these principles can be applied to provide an assessment of risk associated with software causal factors.

TABLE B-I. Software hazard causal factor risk assessment criteria

  • e. Defining and following a process for assessing risk associated with hazards is critical to the success of a program, particularly as systems are combined into more complex SoS. These SoS often involve systems developed under disparate development and safety programs and may require interfaces with other Service (Army, Navy/Marines, and Air Force) or DoD agency systems. These other SoS stakeholders likely have their own safety processes for determining the acceptability of systems to interface with theirs. Ownership of the overarching system in these complex SoS can become difficult to determine. The process for assessing software’s contribution to risk, described in this Appendix, applies the same principles of risk mitigation used for other risk contributors (e.g., hardware and human). Therefore, this process may serve as a mechanism to achieve a “common ground” between SoS stakeholders on what constitutes an acceptable level of risk, the levels of mitigation required to achieve that acceptable level, and how each constituent system in the SoS contributes to, or supports mitigation of, the SoS hazards.

This is the last excerpt from the Standard
