Transcript: System Requirements Hazard Analysis (T203)

Here is the full transcript: System Requirements Hazard Analysis.

The full video is here.

Introduction

Hello and welcome to the Safety Artisan, where you will find professional, pragmatic and impartial advice on all things system, safety and related.

System Requirements Hazard Analysis

And so today, which is the 1st of March 2020, we’re going to be talking about – let me just find it for you – System Requirements Hazard Analysis. This is part of our series on Mil. Standard 882E (882 Echo), and this one is Task 203 in the Mil. Standard. It’s a very widely used system safety engineering standard and its influence is found in many places, not just on military procurement programs.

Topics for this Session

We’re going to look at this task, which is very important, possibly the most important task of all, as we’ll see. So, we’re going to talk about the purpose of the task, which is word for word from the task description itself. We’re going to talk about the three aims of this task in the task description, which are to determine or work out requirements, incorporate them, and then assess the compliance of the system with those requirements, because, of course, it may not be a simple read-across. We’ve got six slides on that. That’s most of the task. Then we’ve just got one slide on contracting, which, if you’ve seen any of the others in this series, will seem very familiar. We’ve got a little bit of a chat about Section 4.2 from the standard and some commentary, and the reason for that will become clear. So, let’s crack on.

Purpose of SRHA

Task 203.1: the purpose of Task 203 is to perform and document a System Requirements Hazard Analysis or SRHA. And as we’ve already said, the purpose of this is to determine the design requirements. We’re going to focus on design rather than buying stuff off the shelf – we’ll talk about the implications of that a little bit later. So: design requirements to eliminate or reduce hazards and risks; incorporate those requirements into, it says, the documentation, but what it should say is incorporate risk reduction measures into the system itself and then document it. And then finally, to assess compliance of the system with these requirements. Then it says the SRHA addresses all life-cycle phases, so you’re not meant to think about only certain phases of the program. What are the requirements through life for the system? And in all modes: whether it’s in operation, whether it’s in maintenance or refit, whether it’s being repaired or disposed of, whatever it might be.

Task Description #1

First of six slides on the task description. I’m using more than one colour because there are quite a lot of important points packed quite tightly together in this description. We’re assuming that the contractor performs and documents this SRHA. The customer needs to do a lot of work here before it ever gets near a contractor. More on that later. We need to determine system design requirements to eliminate hazards or reduce associated risks.

Two things here. By identifying applicable policies, regulations and standards etc. More on that later. And analysing identified hazards. So, requirements to perform the analysis as well as to simply state ‘We want a system to do this and not to do that’. So, we need to put in some requirements to say ‘Here’s what we want to be analysed, and to what degree’. And ‘why’ is always helpful.

Task Description #2

Breaking those two requirements down.

Part A. We’re going to identify applicable requirements by reviewing our military and industry standards and specs, and historical documentation of systems that are similar, or of a system that we’re replacing, perhaps. It’s assumed that the US Department of Defense is the customer, the ultimate customer. So, we look at the ultimate customer’s requirements, including whatever they’ve said about standard ways of mitigating certain common risks. System performance spec, that’s your functional performance spec or whatever you want to call it. Other system design requirements and documents: a bit of a catch-all there. And applicable federal, military, state and local regulations.

This is a US standard. It’s a federated system, much like Australia or indeed lots of modern states, even the UK. There are variations in law across England, Wales, Scotland and Northern Ireland. They’re not great, but they do exist. And in the US and Australia, those differences are greater. And it says applicable executive orders. Executive orders are not law, but they are what the executive arm of the U.S. government has issued. And international agreements. A lot of words in there; have a look at the different statements that are in there in white, blue and yellow. Basically, from international agreements right down to whatever requirements may be applicable, they all need to be looked at and taken account of. So, there’s a huge amount of work there for someone to do. I’ll come back to who that someone should be later.

Task Description #3

Part B. It says the contractor shall recommend appropriate system design requirements. The assumption here is that the contractor is the designer and knows the design better than anybody, better than the purchaser, which is fair enough. It’s your system, you should understand it. And the requirement is that the contractor is not just passive, ‘doing as they’re told’; they’re there to actively investigate possible hazards associated with their system and recommend appropriate requirements in order to manage those hazards and risks. And then there’s further guidance here: the contractor is to do that in accordance with Section 4 of Mil. Standard 882E. Now, Section 4 is the general requirements of the standard and there’s lots of good advice in that. And I’ll be doing a lesson, maybe more than one lesson in fact, on Section 4 because there is quite a lot in there. The contractor is to refer to the standard and apply the principles therein. All good stuff.

Part C. The contractor shall also define verification and validation approaches. So, the contractor shall define V and V approaches for each design requirement to eliminate hazards and reduce risks. In part C (well, B and C) we’ve got a very much narrower focus on requirements to eliminate hazards or reduce risks. Whereas in A, notice we’ve got an incredibly broad scope when looking for requirements. It’s not just about the narrow job of dealing with hazards and controlling them that we’ve got in parts B and C.
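To make Part C a little more concrete, here is a minimal sketch of the kind of check a project might run to confirm that every safety design requirement has at least one V and V approach recorded against it. The requirement IDs, wording and method names are made up for illustration; the standard does not prescribe any particular data format or tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DesignRequirement:
    """One safety design requirement (IDs and wording are hypothetical)."""
    req_id: str
    text: str
    vv_methods: List[str] = field(default_factory=list)  # e.g. test, analysis


def missing_vv(requirements: List[DesignRequirement]) -> List[str]:
    """Return the IDs of requirements with no V and V approach defined."""
    return [r.req_id for r in requirements if not r.vv_methods]


requirements = [
    DesignRequirement("SR-001", "Fuel shut-off valve closes on loss of power",
                      ["test"]),
    DesignRequirement("SR-002", "High-voltage bays are interlocked",
                      ["inspection", "test"]),
    DesignRequirement("SR-003", "Battery pack has thermal runaway protection"),
]

gaps = missing_vv(requirements)
print("Requirements with no V and V approach:", gaps)
```

In this made-up data set only SR-003 is flagged, which is the sort of gap you would want to find before a design review rather than at it.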

Task Description #4

Onwards and upwards. We get to the second major part of this task, which is to incorporate those design requirements. It’s all very well to have them, but they’ve got to be built into the engineering design, into documentation, hardware, software, test plans, etc. And the second highlighted bit that I’ve got is ‘as the design evolves ensure applicable design requirements flow down into lower-level specifications’, etc, etc, etc. There’s a lot of repetition there, so I won’t go through it. Clearly the assumption in this standard is that the design will be done top-down and that the main contractor, design contractor, will be doing work and then identifying lower-level requirements to be passed on to subcontractors and suppliers. And again, the assumption is we’re dealing with a large military system, which is at least, in part, bespoke. It is being developed and/or integrated for the first time for a specific user and specific use.
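As a rough illustration of what ‘flow down’ means in practice, the sketch below checks a simple parent-to-child trace between top-level safety requirements and the lower-level specifications. All of the identifiers are hypothetical, and a real program would normally do this in a requirements management tool rather than a script.

```python
from collections import defaultdict

# Hypothetical top-level safety requirement IDs.
top_level = ["SYS-SAF-01", "SYS-SAF-02", "SYS-SAF-03"]

# Hypothetical lower-level requirements as (child_id, parent_id) pairs.
lower_level = [
    ("ENG-SAF-03", "SYS-SAF-01"),
    ("SW-SAF-11", "SYS-SAF-01"),
    ("HULL-SAF-02", "SYS-SAF-02"),
    ("SW-SAF-12", "SYS-SAF-09"),  # parent is not a known top-level requirement
]


def trace(top_level, lower_level):
    """Return (top-level IDs with no children, child IDs with unknown parent)."""
    children = defaultdict(list)
    for child_id, parent_id in lower_level:
        children[parent_id].append(child_id)

    known_parents = set(top_level)
    not_flowed_down = [p for p in top_level if not children[p]]
    orphans = [c for c, p in lower_level if p not in known_parents]
    return not_flowed_down, orphans


not_flowed_down, orphans = trace(top_level, lower_level)
print("Not flowed down:", not_flowed_down)
print("Orphaned lower-level requirements:", orphans)
```

The two lists answer two different questions: which customer-level requirements never reached a supplier, and which supplier-level requirements cannot be traced back to anything the customer asked for.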

I’ll come on to the third yellow highlighted bit, where it says ‘as appropriate, use engineering change proposals to incorporate applicable design requirements into these documents’. What we’re saying here is that even if something hasn’t been specified upfront in the original contract, the contractor should use Engineering Change Proposals (ECPs), that is, a controlled change mechanism, in order to change things as they go, with approval, and refine and evolve the design.

Years of experience have taught me that these statements are coming from the assumption – still true in the US, I believe – whereby major military projects are designed and developed on a cost-plus basis. In other words, the government pays the main contractor / the prime contractor / prime designer on a sort of time and materials basis, not on a firm or fixed price basis, but says ‘Go away and do what we say’. And there are controls there, and there’s open-book accounting to try and prevent the government from being defrauded. But basically, the contractor goes off and does what is required and gets paid for what they do. So, the government has transferred relatively low amounts of risk onto the contractor, anticipating that this will result in the lowest possible overall cost of design development. Now, as we probably know from the news, that doesn’t always work. However, that is the assumption behind this standard. This cost-plus approach says ‘We’ll pay you to do the job’, and therefore we don’t have to specify every single nut and bolt in the contract right at the beginning. Which in some ways takes a lot of risk away from the purchaser because they don’t have to get everything right at the start. So that’s good. There’s always a balance of risk in whichever approach we take.

So, if we go firm price, yes, we could inject more competition into procurement and supply activity, but you’ve got to get your contract right upfront. And all your requirements right, more or less. That is notoriously difficult to do. Whichever way you go, there are risks. But it’s important to note that this is the assumption underlying the standard. Not every standard follows this approach, follows this philosophy, but 882 does. So, if we’re going to use it in a different way, we need to understand that fact and take it into account. More on that later.

Task Description #5

Fifth slide of six. Third part. We need to assess compliance of that development of hardware, software, documentation, data, etc., whatever it might be. In order to do that, the contractor is going to have to address the customer requirements at technical reviews. So again, the assumption is that development is following a systems-engineering process with certain gated reviews. So, you go into a series of reviews. You might start with a system requirements review, SRR. Then you might have a preliminary design review of the top-level design, PDR. And then we go down to detailed design, which is reviewed at Critical Design Review, or CDR. And then we might have a further software specification review for software components, and then we’ll go on to test readiness reviews and so on and so forth.

Mil. Standard 882 is assuming a particular systems-engineering-lifecycle approach to development. This is very widely used, not just for military standards, but for civil, and all over the place. Whatever we call these reviews, the idea of a gated review is that you don’t start a review until the requirements or design have reached the required maturity. You then conduct the review against objective criteria and then decide whether the review has passed. Now, usually, there is a hefty payment milestone associated with passing a review. The contractor is incentivized to pass the review. And hopefully, if we’ve got the requirements right, a passed review means we’re on the right track and we’re getting the right product. But that’s not always the case; we’ve got to get all these things right.

And then it says during those reviews, the contractor shall address hazards, mitigation measures or controls and methods of V and V, and recommendations arising. A lot goes on at these reviews. On big programs especially, they are very important and very high stress. And in fact, in Australia now, there are some projects that are so big that a delay in a PDR review actually made it into the national news on the future submarine because it’s such a huge multi-billion-dollar project. It could all get very painful and political as well.
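The gated-review idea can be sketched in a few lines: entry criteria decide whether the review may start, and exit criteria decide whether it passes. The class, the criteria wording and the example PDR below are all invented for illustration; they are not taken from the standard.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GatedReview:
    """A review with entry (maturity) and exit (objective) criteria."""
    name: str
    entry_criteria: Dict[str, bool]  # criterion -> met?
    exit_criteria: Dict[str, bool]   # criterion -> met?

    def can_start(self) -> bool:
        # The review should not start until every entry criterion is met.
        return all(self.entry_criteria.values())

    def outcome(self) -> str:
        if not self.can_start():
            return "not started"
        return "passed" if all(self.exit_criteria.values()) else "failed"


pdr = GatedReview(
    name="PDR",
    entry_criteria={"top-level design baselined": True,
                    "hazard log updated": True},
    exit_criteria={"all safety requirements allocated": True,
                   "V and V approach agreed for each requirement": False},
)

print(pdr.name, pdr.outcome())
```

Here the hypothetical PDR is mature enough to start but does not pass, because one exit criterion is still open.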

Task Description #6

However, let’s move on to the final slide of the task description. So, A. was to do the reviews. B. is to review test plans and test results to verify and validate hardware and software compliance with those requirements. And as it says, this includes V and V of the effectiveness of risk mitigation measures. So, we need to test these risk controls where we can and see how effective they are and whether they live up to the requirements or the assumptions that we’ve made. Now, again, this is an American standard, so it’s very ‘test centric’. The American government likes to test things to death and, depending on your point of view, that’s sensible or not. It’s sensible in the sense that you’re testing a real system, hopefully in a representative test environment, although it may not be representative of the operational environment. So, it should be a very solid, robust, valid approach to proving a system.

However, there is a downside to testing in that it’s very expensive and it tends to come at the end of a program. Whereas really you need an indication much earlier on if things are going astray. So, you really need to review documentation and do analysis and so forth. Or maybe you test a prototype or some samples early on, rather than waiting until the end, when it may often be too late and then very expensive to fix things.

And then part C, we need to ensure that hazard control information is incorporated into manuals and plans, whether it be for the operator, the maintainer, the trainer, the logistician, the diagnostics or indeed for the final disposal. We need to take that hazard control information, risk control information, and record it so that it doesn’t get lost and it gets to the people who need it. That’s very important.

OK, so we’ve spent quite a lot of time going through the description because it’s a big, complex task this one, as you can see, with three major parts to it. It’s worth just going back over it. We’ve got our top-level description on slide one, which summarizes the whole thing. We’re talking about finding those requirements, identifying them. We’re talking about the contractor as an active recommender and developer of requirements and actively developing the V and V techniques to make sure that they are met.

In the second major part, we’re talking about incorporating those design requirements as the design evolves and using a controlled change method to make sure that we keep up with what’s going on. We’re talking about assessing compliance both at major systems engineering reviews and during testing. And then finally, we’re talking about making sure that the required information gets through to those who need it at the end of the food chain, as it were. This is all important stuff.

Contracting

Here’s a page we should be familiar with by now, contracting. We need to require SRHA, Task 203. We need to put it in the request for proposal and the contractual statement of work. So once again, as I’ve said before, we’ve got to get this stuff in early on. At least the requirement to do it, even if we haven’t fully worked everything out. We need to get that in right at the start of the request for proposal. We need to require task 203 to be done. It’s imposed (A. Imposition of Task 203).

We need to identify (B. Identification of functional disciplines) who we want to take part in it because it’s not, as we will see, it’s not just the discipline and the job of the safety engineers or the safety team to do this. The design engineers, the specialist engineers in reliability, maintainability and testability, whoever, they all need to be involved as well, etc, etc.

Contractor level of effort (C.) for reviews and so on. We may need to specify some hard requirements there to ensure that we get early scrutiny of the product and the design.

A big point is tailoring of the task (D. Tailor 203.2 and 203.2.3 as appropriate). The task may need to be tailored, assuming again that the contractor is responsible for the design. Maybe the prime contractor isn’t responsible for the design; maybe we’re contracting somebody to buy something that’s mostly off the shelf and then operate it for us for 30 years. Let’s say a so-called turnkey solution. And we might do that for a piece of military kit, or we might do that for a hospital, or whatever it might be. A piece of infrastructure, a service, whatever. So, it may be that the contractor who must do most of task 203 is not the prime at all. But the prime needs to pass those requirements down to some key subcontractors who are doing the development stuff. So, it’s not a given that the prime contractor right underneath the customer must do all this stuff. It may have to be done at several different levels.

And again, we’ve got to provide the concept of operations (E.), which gives the context for all this work. Otherwise, it gets very difficult to do it. You’ve got to say, ‘What’s the jurisdictional context?’ ‘Where will we be operating?’ ‘Under which rules and conditions?’ As well as everything else that you would find in Con. Ops (Concept of Operations).

Then if there are any specific hazard management requirements (F.) that need to be imposed and specific measures of risk, then they need to be passed on to the contractor as well. This is how we will assess, and measure, and prioritize risks. That needs to be done for the program; otherwise, you can end up with lots of different ways of doing it and it becomes a mess that is difficult to govern.

Section 4.2 #1

I promised we would have a little section on Section 4.2 in the standard and I’ve got two slides here that say two important things. We’re not going to go through all of Section 4 of the 882- That’s for another session. But here in 4.2, we’ve got two important things.

It says Section 4 defines system safety requirements through life for any system. And when properly applied, these requirements should enable the identification and management of hazards and their associated risks. Not only during system development but also during sustainment. And any engineering activities that go on in sustainment, whether it be repair, overhaul, modification, update, whatever it might be. These requirements are put in place to enable that good work to take place and make predictions for the through-life operation, support, sustainment of system, whatever it might be.

Section 4.2 #2

And then secondly, there’s another important point here, which I alluded to earlier. System safety staff are not responsible for hazard management in other functional disciplines. If you’re a structural designer, you’re responsible for making your structure or designing your structure such that risks of failure and collapse and catastrophe are managed. And the same for everything else. Whatever it is you’re dealing with, propulsion, fuels, you name it, whatever the discipline is, they’re all responsible for managing the risks.

The safety team is there really to pull it together and try and ensure some consistency and honesty and to report status. They are not there to do it all for the designers. Indeed, they can’t because they will not have the design specialist knowledge to do so. Only the designers can do that. But it does go on to say all functional disciplines, using this generic methodology that’s in Section 4, should coordinate their efforts as part of the overall systems engineering process. The standard provides standardization and it should force all these different disciplines to work together in a standardized way following a standardized systems-engineering process. And remember we said earlier, Mil. Standard 882 assumes that there is a higher-level systems-engineering process going on into which the safety program fits. And that’s very, very important.

On so many programs I’ve seen, there’s either no systems engineering process or a weak one. Or the safety program is divorced or isolated from the systems engineering, the higher-level program, and as a result, it can become irrelevant if you’re not careful. So, having these things and making sure that they lock together is very important. And the reasoning given here is because you might mitigate a hazard in one discipline only to make it worse for somebody else. We can all think of examples of one (which is code for me saying I can’t right now). But anyway, trade-offs – that’s what we end up with. There’s Section 4.2, which gives us a little insight into the thrust of the whole of section 4.

Commentary #1

Just two slides of commentary from me. First, it’s worth remembering that there are lots, and lots, and lots of requirements. We’ve got requirements of the standard itself, which is about following a rigorous process. We’ve got law at the international and national levels, and whether those laws apply in a particular jurisdiction or not can be complex. You’ve got product specifications; you’ve got applicable standards, or maybe only parts of the standards that are applicable to your system. And then you’ve got program and project requirements, etc., etc. You’ve got lots and lots of layers of requirements that are out there and may or may not be relevant to the system you want to develop, or service, whatever it is going to be. But of course, if we’re using this kind of approach, it’s going to be a complex system or service. It’s going to be challenging to find and identify all these things. It’s going to take some dedicated effort.

That’s one issue, doing all that work. And this is not a trivial exercise and I’ve seen it done badly far more often than I’ve seen it done well. That’s the thing to bear in mind, this is not easy to do. And people don’t really want to do it – it’s hard work.

And then secondly, we get down to what we might call derived safety requirements. We have a high-level requirement that says, ‘We want a very high level of performance out of this vehicle’ or whatever it might be. And that very demanding performance requirement might force us to use some very high energy fuel, or it might force us to pack a lot of power and a lot of equipment into a very small space, and these requirements can lead to sort of secondary hazards. So, we’ve got high energy fuel inside the vehicle- Well, clearly, that’s dangerous if it leaks. We’ve got a lot of stuff, complex stuff, packed into a small system that can give us thermal control problems. Or if a bit of it goes wrong, if it’s tightly packed together, it can take out something else next to it.

So, these performance requirements can cause hazards that probably weren’t there before or needn’t have been there in, let’s say, a common or garden system that doesn’t have to perform as well. So, we might well look at doing some analysis on our requirements and our top-level design or conceptual design, whatever it might be, very early on. And we might say, ‘Well, clearly this is going to drive us down a particular path’ and therefore we will derive some additional safety requirements to deal with these challenges. They don’t come straight out of higher-level requirements, they’re a secondary effect. But in complex systems, these are very common. And if we’re doing our systems engineering well, we will identify derived safety requirements for ourselves and for the next level of contractors down the chain.

So, instead of just passing on ‘back-to-back’ requirements from the ultimate customer, which may not mean anything at all to the component supplier (in fact, they probably won’t), we need to change these top-level requirements and say, ‘What’s relevant for you as the supplier of the engine?’, let’s say, or the wheels, or the wings, or the hull, or whatever it might be. We need to pass on required controls, whether it be the prevention of hazards, detection or mitigation. We also need to remember the order of precedence. It’s preferable to eliminate hazards; if we can’t, we put in engineered features to reduce the risk or lessen the probability, or severity, etc. And those rules are in section 4.3.4 of the Mil. Standard. There’s a lot of work to do on requirements on many different levels and it may be that this task must be repeated at many different levels.
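The order of precedence can be sketched as a simple ranked choice. The five levels below are my paraphrase of the design order of precedence in section 4.3.4 of Mil. Standard 882E, so do check the exact wording against the standard; the function name and the example options are purely illustrative.

```python
from enum import IntEnum


class Mitigation(IntEnum):
    """Paraphrased order of precedence; a lower number is more preferred."""
    ELIMINATE_BY_DESIGN_SELECTION = 1
    REDUCE_RISK_BY_DESIGN_ALTERATION = 2
    ENGINEERED_FEATURES_OR_DEVICES = 3
    WARNING_DEVICES = 4
    SIGNAGE_PROCEDURES_TRAINING_PPE = 5


def preferred(feasible_options):
    """Pick the most preferred mitigation type from those that are feasible."""
    options = set(feasible_options)
    if not options:
        raise ValueError("no feasible mitigation has been identified")
    return min(options)


# Hypothetical case: the hazard cannot be designed out, but an interlock
# (an engineered feature) and a warning light are both feasible.
choice = preferred([Mitigation.WARNING_DEVICES,
                    Mitigation.ENGINEERED_FEATURES_OR_DEVICES])
print(choice.name)
```

In practice you would often apply several of these together; the ranking only says which one to reach for first.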

Commentary #2

But the first level task must be done by the client, and actually by the ultimate end-user because, to mangle a famous quote, ‘What you don’t specify – what you don’t see can hurt you’. So, we need to do this work as end-users, as purchasers, as customers. It is tempting to assume that the contractors will just do it, that they’ll just get it. ‘They’ve been making planes for years’ or ‘They’ve been making tanks’, or boots, or guns, or ships, or whatever it might be. ‘They’ve been making fuel for years’, ‘these chemicals for years’. We just assume that they know what they’re doing. Well, they probably do know what they’re doing within a particular context. However, if we impose competition, as we always do because we’re always looking for value for money, and whether we have a competition where we’re asking for a firm price to do something or whether we employ other methods of competition and cost-cutting, there will always be pressure on the contract costs. And that means they will be tempted to tailor the safety approach they’re taking in order to reduce costs. Which is a perfectly legitimate thing to do, nothing immoral about doing that, if it’s done appropriately and sensibly.

But if you as the customer or client are going to incentivize your suppliers to do that, you need to be aware of that and of the fact that they may just not bother because you haven’t told them to. You’ve not contractually specified it, so you aren’t going to get it. It’s not their problem. And indeed, the suppliers may not understand how their customer will integrate what they provide or use it. The prime contractor may not have a great idea as to how you’re going to use their product. And you can be fairly sure that the subcontractors and the low-level secondary and tertiary suppliers are going to have no clue whatsoever about what’s going to happen to their components. They are just not going to know. So, you need to specify that as purchaser and you need to make sure that your immediate suppliers pass on those requirements, and that context, and that they police the contract appropriately. Otherwise, there’s going to be trouble for the ultimate client and end-user.

And then finally, in these days of globalization and business-to-business and international procurement, you may be – probably are – buying stuff that’s been made abroad and designed in another country where they may have completely different laws or no laws at all on how safety is built-in – designed in – to a system. And of course, you don’t always know where design work is going to get done; just because you engage a prime contractor in your own country and think that you’re safe. You don’t know whether the prime contractor is going to subcontract software development – let’s say, out to India. It’s so common it’s a cliché! But there are certain things that tend to be done offshore because it’s cheaper, or quicker, or whatever. Or because somebody has already got a system that you can just plug in and use – allegedly.

There are all kinds of reasons why your supply chain will not necessarily ‘Just get it’ or ‘Just do it’. In fact, there are lots of good reasons why they won’t. So, the purchaser has got to do a lot of work. It’s critical for the purchaser to know what their obligations are because a lot of purchasers don’t. They sit there in blithe ignorance of what their safety responsibilities are, and the lucky ones get away with it. And the unlucky ones are either killed or maimed, or they kill or maim somebody else and they end up going to jail or facing massive fines. But you’ve not only got to understand the requirements, the safety obligations on the end item being used, but also how you translate that to the contractors, because it’s not always obvious. You can’t just say, ‘Well, these are the laws that I have to obey, so I’ll just pass those on to you, Mr Contractor’ because they may not apply to the contractor if they’re in a different country.

Or it just may not make any sense at their level. Laws that were designed to protect people will not often make much sense to a component supplier. Just doesn’t work. Two important points there on the commentary. Lots of layers of requirements that need to be worked on. This is all classic systems engineering stuff, isn’t it? And then the purchaser and the end-user cannot evade their responsibilities at the top of the food chain. Indeed, they’ll be stuck with the problem, whatever it is, for 30 years or however long they use the system.

It’s important for the end-user and the ultimate client to do this work, maybe several times, at many different layers.

Copyright Statement

Well, that’s the end of the technical content. I just wanted to say that I’ve quoted a lot of text from the Mil. Standard, which is itself copyright-free, and it’s available for free online, including on the Safety Artisan website. But this presentation is copyright of the Safety Artisan, 2020.

For More …

And for more resources and for more videos like this one, please go to either www.safetyartisan.com or go to the Safety Artisan page at www.patreon.com.

Well, that is the end of the presentation. And it just remains for me to say thanks again for watching and do look out for the next sessions in the series on 882 echo (882E). There are quite a few to go. We’re going to go through all the tasks and the general and specific requirements of the standard and the appendices. We will also talk about more advanced topics, about how we manage and apply all this stuff.

So, from The Safety Artisan.com, thanks very much and goodbye.

Mil-Std-882E Preliminary Hazard List (T201) & Analysis (T202)

This is Mil-Std-882E Preliminary Hazard List & Analysis.

The 200-series tasks fall into several natural groups. Tasks 201 and 202 address the generation of a Preliminary Hazard List and the conduct of Preliminary Hazard Analysis, respectively.

TASK 201 PRELIMINARY HAZARD LIST

201.1 Purpose. Task 201 is to compile a list of potential hazards early in development.

201.2 Task description. The contractor shall:

201.2.1 Examine the system shortly after the materiel solution analysis begins and compile a Preliminary Hazard List (PHL) identifying potential hazards inherent in the concept.

201.2.2 Review historical documentation on similar and legacy systems, including but not limited to:

  • a. Mishap and incident reports.
  • b. Hazard tracking systems.
  • c. Lessons learned.
  • d. Safety analyses and assessments.
  • e. Health hazard information.
  • f. Test documentation.
  • g. Environmental issues at potential locations for system testing, training, fielding/basing, and maintenance (organizational and depot).
  • h. Documentation associated with National Environmental Policy Act (NEPA) and Executive Order (EO) 12114, Environmental Effects Abroad of Major Federal Actions.
  • i. Demilitarization and disposal plans.

201.2.3 The contractor shall document identified hazards in the Hazard Tracking System (HTS). Contents and formats will be as agreed upon between the contractor and the Program Office. Unless otherwise specified in 201.3.d, minimum content shall include:

  • a. A brief description of the hazard.
  • b. The causal factor(s) for each identified hazard.
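As an illustration only, the minimum content in 201.2.3 could be captured in a record like the one sketched below. The class name, the `hazard_id` field and the example hazard are assumptions made for this sketch; the actual contents and format are whatever the contractor and the Program Office agree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhlEntry:
    """A Preliminary Hazard List entry holding the 201.2.3 minimum content."""
    hazard_id: str                    # hypothetical tracking identifier
    description: str                  # a. brief description of the hazard
    causal_factors: List[str] = field(default_factory=list)  # b. causal factor(s)

    def problems(self) -> List[str]:
        """List any minimum-content items that are missing from this entry."""
        issues = []
        if not self.description.strip():
            issues.append("missing hazard description")
        if not any(f.strip() for f in self.causal_factors):
            issues.append("missing causal factor(s)")
        return issues


complete = PhlEntry(
    hazard_id="PHL-001",
    description="Fuel leak into the engine bay",
    causal_factors=["Seal degradation", "Over-pressurisation during refuelling"],
)
incomplete = PhlEntry(hazard_id="PHL-002", description="Uncommanded door opening")

print(complete.problems())
print(incomplete.problems())
```

A check like `problems()` is only a completeness test; it says nothing about whether the description or the causal factors are any good.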

201.3 Details to be specified. The Request for Proposal (RFP) and Statement of Work (SOW) shall include the following, as applicable:

  • a. Imposition of Task 201. (R)
  • b. Identification of functional discipline(s) to be addressed by this task. (R)
  • c. Guidance on obtaining access to Government documentation.
  • d. Content and format requirements for the PHL.
  • e. Concept of operations.
  • f. Other specific hazard management requirements, e.g., specific risk definitions and matrix to be used on this program.
  • g. References and sources of hazard identification.

TASK 202 PRELIMINARY HAZARD ANALYSIS

202.1 Purpose. Task 202 is to perform and document a Preliminary Hazard Analysis (PHA) to identify hazards, assess the initial risks, and identify potential mitigation measures.

202.2 Task description. The contractor shall perform and document a PHA to determine initial risk assessments of identified hazards. Hazards associated with the proposed design or function shall be evaluated for severity and probability based on the best available data, including mishap data (as accessible) from similar systems, legacy systems, and other lessons learned. Provisions, alternatives, and mitigation measures to eliminate hazards or reduce associated risk shall be included.

202.2.1 The contractor shall document the results of the PHA in the Hazard Tracking System (HTS).

202.2.2 The PHA shall identify hazards by considering the potential contribution to subsystem or system mishaps from:

  • a. System components.
  • b. Energy sources.
  • c. Ordnance.
  • d. Hazardous Materials (HAZMAT).
  • e. Interfaces and controls.
  • f. Interface considerations to other systems when in a network or System-of-Systems (SoS) architecture.
  • g. Material compatibilities.
  • h. Inadvertent activation.
  • i. Commercial-Off-the-Shelf (COTS), Government-Off-the-Shelf (GOTS), Non-Developmental Items (NDIs), and Government-Furnished Equipment (GFE).
  • j. Software, including software developed by other contractors or sources. Design criteria to control safety-significant software commands and responses (e.g., inadvertent command, failure to command, untimely command or responses, and inappropriate magnitude) shall be identified, and appropriate action shall be taken to incorporate these into the software (and related hardware) specifications.
  • k. Operating environment and constraints.
  • l. Procedures for operating, test, maintenance, built-in-test, diagnostics, emergencies, explosive ordnance render-safe and emergency disposal.
  • m. Modes.
  • n. Health hazards.
  • o. Environmental impacts.
  • p. Human factors engineering and human error analysis of operator functions, tasks, and requirements.
  • q. Life support requirements and safety implications in manned systems, including crash safety, egress, rescue, survival, and salvage.
  • r. Event-unique hazards.
  • s. Built infrastructure, real property installed equipment, and support equipment.
  • t. Malfunctions of the SoS, system, subsystems, components, or software.

202.2.3 For each identified hazard, the PHA shall include an initial risk assessment. The definitions in Tables I and II, and the Risk Assessment Codes (RACs) in Table III shall be used, unless tailored alternative definitions and/or a tailored matrix are formally approved in accordance with Department of Defense (DoD) Component policy.
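The initial risk assessment in 202.2.3 amounts to a table lookup: a severity category and a probability level give a Risk Assessment Code and a risk level. The sketch below assumes the untailored matrix. The cell values are a transcription of Tables I to III and should be checked against the Standard (or the program's approved tailored matrix) before use. The function and variable names are hypothetical.

```python
# Severity categories (Table I) and probability levels (Table II).
SEVERITY = {1: "Catastrophic", 2: "Critical", 3: "Marginal", 4: "Negligible"}
PROBABILITY = {
    "A": "Frequent",
    "B": "Probable",
    "C": "Occasional",
    "D": "Remote",
    "E": "Improbable",
    "F": "Eliminated",
}

# Assumed transcription of the Table III risk matrix: one row per
# probability level, one column per severity category 1..4.
RISK_MATRIX = {
    "A": ["High", "High", "Serious", "Medium"],
    "B": ["High", "High", "Serious", "Medium"],
    "C": ["High", "Serious", "Medium", "Low"],
    "D": ["Serious", "Medium", "Medium", "Low"],
    "E": ["Medium", "Medium", "Medium", "Low"],
}


def initial_risk(severity: int, probability: str) -> tuple[str, str]:
    """Return (RAC, risk level) for one hazard, e.g. ("1C", "High")."""
    probability = probability.upper()
    if severity not in SEVERITY:
        raise ValueError(f"unknown severity category: {severity}")
    if probability not in PROBABILITY:
        raise ValueError(f"unknown probability level: {probability}")
    if probability == "F":
        # An eliminated hazard has no remaining risk level to look up.
        return ("F", "Eliminated")
    return (f"{severity}{probability}", RISK_MATRIX[probability][severity - 1])
```

Keeping the matrix as data rather than as branching logic makes it straightforward to swap in a tailored matrix once one has been formally approved.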

202.2.4 For each identified hazard, the PHA shall identify potential risk mitigation measures using the system safety design order of precedence specified in 4.3.4.

202.3 Details to be specified. The Request for Proposal (RFP) and Statement of Work (SOW) shall include the following, as applicable:

  • a. Imposition of Task 202. (R)
  • b. Identification of functional discipline(s) to be addressed by this task. (R)
  • c. Special data elements, format, or data reporting requirements (consider Task 106, Hazard Tracking System).
  • d. Identification of hazards, hazardous areas, or other specific items to be examined or excluded.
  • e. Technical data on COTS, GOTS, NDIs, and GFE to enable the contractor to accomplish the defined task.
  • f. Concept of operations.
  • g. Other specific hazard management requirements, e.g., specific risk definitions and matrix to be used on this program.

Forward to the next excerpt: Task 203

Back to the Home Page | Mil-Std-882 Page | System Safety Page

Professional | Pragmatic | Impartial

Mil-Std-882E Appendix B

This is Mil-Std-882E Appendix B.
Back to Appendix A.

SOFTWARE SYSTEM SAFETY ENGINEERING AND ANALYSIS

B.1 Scope. This Appendix is not a mandatory part of the standard. The information contained herein is intended for guidance only. This Appendix provides additional guidance on the software system safety engineering and analysis requirements in 4.4. For more detailed guidance, refer to the Joint Software Systems Safety Engineering Handbook and Allied Ordnance Publication (AOP) 52, Guidance on Software Safety Design and Assessment of Munition-Related Computing Systems.

B.2. Software system safety. A successful software system safety engineering activity is based on a hazard analysis process, a safety-significant software development process, and Level of Rigor (LOR) tasks. The safety-significant software development process and LOR tasks comprise the software system safety integrity process. Emphasis is placed on the context of the “system” and how software contributes to or mitigates failures, hazards, and mishaps. From the perspective of the system safety engineer and the hazard analysis process, software is considered as a subsystem. In most instances, the system safety engineers will perform the hazard analysis process in conjunction with the software development, software test, and Independent Verification and Validation (IV&V) team(s). These teams will implement the safety-significant software development and LOR tasks as a part of the overall Software Development Plan (SDP). The hazard analysis process identifies and mitigates the exact software contributors to hazards. The software system safety integrity process increases the confidence that the software will perform as specified to software system safety and performance requirements while reducing the number of contributors to hazards that may exist in the system. Both processes are essential in reducing the likelihood of software initiating a propagation pathway to a hazardous condition or mishap.

B.2.1 Software system safety hazard analysis. System safety engineers performing the hazard analysis for the system (Preliminary Hazard Analysis (PHA), Subsystem Hazard Analysis (SSHA), System Hazard Analysis (SHA), System-of-Systems (SoS) Hazard Analysis, Functional Hazard Analysis (FHA), Operating and Support Hazard Analysis (O&SHA), and Health Hazard Analysis (HHA)) will ensure that the software system safety engineering analysis tasks are performed. These tasks ensure that software is considered in its contribution to mishap occurrence for the system under analysis, as well as interfacing systems within an SoS architecture. In general, software functionality that directly or indirectly contributes to mishaps, such as the processing of safety-significant data or the transitioning of the system to a state that could lead directly to a mishap, should be thoroughly analyzed. Software sources and specific software errors that cause or contribute to hazards should be identified at the software module and functional level (functions out-of-time or out-of-sequence malfunctions, degrades in function, or does not respond appropriately to system stimuli). In software-intensive, safety significant systems, mishap occurrence will likely be caused by a combination of hardware, software, and human errors. These complex initiation pathways should be analyzed and thoroughly tested to identify existing and/or derived mitigation requirements and constraints to the hardware and software design. As a part of the FHA (Task 208), identify software functionality which can cause, contribute to, or influence a safety-significant hazard. Software requirements that implement Safety-Significant Functions (SSFs) are also identified as safety significant.

B.2.2 Software system safety integrity. Software developers and testers play a major role in producing safe software. Their contribution can be enhanced by incorporating software system safety processes and requirements within the SDP and task activities. The software system safety processes and requirements are based on the identification and establishment of specific software development and test tasks for each acquisition phase of the software development life-cycle (requirements, preliminary design, detailed design, code, unit test, unit integration test, system integration test, and formal qualification testing). All software system safety tasks will be performed at the required LOR, based on the safety criticality of the software functions within each software configuration item or software module of code. The software system safety tasks are derived by performing an FHA to identify SSFs, assigning a Software Control Category (SCC) to each of the safety-significant software functions, assigning a Software Criticality Index (SwCI) based on severity and SCC, and implementing LOR tasks for safety-significant software based on the SwCI. These software system safety tasks are further explained in subsequent paragraphs.

B.2.2.1 Perform a functional hazard analysis. The SSFs of the system should be identified. Once identified, each SSF is assessed and categorized against the SCCs to determine the level of control of the software over safety-significant functionality. Each SSF is mapped to its implementing computer software configuration item or module of code for traceability purposes.

B.2.2.2 Perform a software criticality assessment for each SSF. The software criticality assessment should not be confused with risk. Risk is a measure of the severity and probability of occurrence of a mishap from a particular hazard, whereas software criticality is used to determine how critical a specified software function is with respect to the safety of the system. The software criticality is determined by analyzing the SSF in relation to the system and determining the level of control the software exercises over functionality and contribution to mishaps and hazards. The software criticality assessment combines the severity category with the SCC to derive a SwCI as defined in Table V in 4.4.2 of this Standard. The SwCI is then used as part of the software system safety analysis process to define the LOR tasks which specify the amount of analysis and testing required to assess the software contributions to the system-level risk.
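The criticality assessment in B.2.2.2 combines two inputs, the Software Control Category and the severity category, to give a SwCI. A minimal sketch of that lookup follows. The matrix values are an assumed transcription of Table V and must be verified against the Standard, or replaced by an approved tailored matrix. The names `SSCM` and `swci` are hypothetical.

```python
# Assumed transcription of the Software Safety Criticality Matrix (Table V).
# Rows: Software Control Category (SCC) 1..5.
# Columns: severity category 1 (Catastrophic) .. 4 (Negligible).
SSCM = {
    1: [1, 1, 3, 4],
    2: [1, 2, 3, 4],
    3: [2, 3, 4, 4],
    4: [3, 4, 4, 4],
    5: [5, 5, 5, 5],
}


def swci(scc: int, severity: int) -> int:
    """Return the Software Criticality Index for one safety-significant function.

    Note this is a criticality index, not a risk level: probability of
    occurrence plays no part in the lookup.
    """
    if scc not in SSCM:
        raise ValueError(f"unknown Software Control Category: {scc}")
    if severity not in (1, 2, 3, 4):
        raise ValueError(f"unknown severity category: {severity}")
    return SSCM[scc][severity - 1]
```

A lower index calls for a higher Level of Rigor, which is the opposite sense from "lower is better".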

B.2.2.3 Software Safety Criticality Matrix (SSCM) tailoring. Tables IV through VI should be used, unless tailored alternative matrices are formally approved in accordance with Department of Defense (DoD) Component policy. However, tailoring should result in a SSCM that meets or exceeds the LOR tasks defined in Table V in 4.4.2 of this Standard. A SwCI 1 from the SSCM implies that the assessed software function or requirement is highly critical to the safety of the system and requires more design, analysis, and test rigor than software that is less critical prior to being assessed in the context of risk reduction. Software with SwCI 2 through SwCI 4 typically requires progressively less design, analysis, and test rigor than high criticality software. Unlike the hardware-related risk index, a low index number does not imply that a design is unacceptable. Rather, it indicates a requirement to apply greater resources to the analysis and testing rigor of the software and its interaction with the system. The SSCM does not consider the likelihood of a software-caused mishap occurring in its initial assessment. However, through the successful implementation of a system and software system safety process and LOR tasks, the likelihood of software contributing to a mishap may be reduced.

B.2.2.4 Software system safety and requirements within software development processes. Once safety-significant software functions are identified, assessed against the SCC, and assigned a SwCI, the implementing software should be designed, coded, and tested against the approved SDP containing the software system safety requirements and LOR tasks. These criteria should be defined, negotiated, and documented in the SDP and the Software Test Plan (STP) early in the development life-cycle.

  • a. SwCI assignment. A SwCI should be assigned to each safety-significant software function and the associated safety-significant software requirements. Assigning the SwCI value of Not Safety to non-safety-significant software requirements provides a record that functionality has been assessed by software system safety engineering and deemed Not Safety. Individual safety-significant software requirements that track to the hazard reports will be assigned a SwCI. The intent of SwCI 4 is to ensure that requirements corresponding to this level are identified and tracked through the system. These “low” safety-significant requirements need only the defined safety-specific testing.
  • b. Task guidance. Guidance regarding tasks that can be placed in the SDP, STP, and safety program plans can be found in multiple references, including the Joint Software Systems Safety Engineering Handbook by the Joint Software Systems Safety Engineering Workgroup and AOP 52, Guidance on Software Safety Design and Assessment of Munition-Related Computing Systems. These tasks and others that may be identified should be based on each individual system or SoS and its complexity and safety criticality, as well as available resources, value added, and level of acceptable risk.

B.2.2.5. Software system safety requirements and tasks. Suggested software system safety requirements and tasks that can be applied to a program are listed in the following paragraphs for consideration and applicability:

  • a. Design requirements. Design requirements to consider include fault tolerant design, fault detection, fault isolation, fault annunciation, fault recovery, warnings, cautions, advisories, redundancy, independence, N-version design, functional partitioning (modules), physical partitioning (processors), design safety guidelines, generic software safety requirements, design safety standards, and best and common practices.
  • b. Process tasks. Process tasks to consider include design review, safety review, design walkthrough, code walkthrough, independent design review, independent code review, independent safety review, traceability of SSFs, SSFs code review, SSFs, Safety-Critical Function (SCF) code review, SCF design review, test case review, test procedure review, safety test result review, independent test results review, safety quality audit inspection, software quality assurance audit, and safety sign-off of reviews and documents.
  • c. Test tasks. Test task considerations include SSF testing, functional thread testing, limited regression testing, 100 percent regression testing, failure modes and effects testing, out-of-bounds testing, safety-significant interface testing, Commercial-Off-the-Shelf (COTS), Government-Off-the-Shelf (GOTS), and Non-Developmental Item (NDI) input/output testing and verification, independent testing of prioritized SSFs, functional qualification testing, IV&V, and nuclear safety cross-check analysis.
  • d. Software system safety risk assessment. After completion of all specified software system safety engineering analysis, software development, and LOR tasks, results will be used as evidence (or input) to assign software’s contribution to the risk associated with a mishap. System safety and software system safety engineering, along with the software development team (and possibly the independent verification team), will evaluate the results of all safety verification activities and will perform an assessment of confidence for each safety-significant requirement and function. This information will be integrated into the program hazard analysis documentation and formal risk assessments. Insufficient evidence or evidence of inadequate software system safety program application should be assessed as risk.
  • (1) Figure B-1 illustrates the relationship between the software system safety activities (hazard analyses, software development, and LOR tasks), system hazards, and risk. Table B-I provides example criteria for determining risk levels associated with software.

FIGURE B-1. Assessing software’s contribution to risk

  • (2) The risks associated with system hazards that have software causes and controls may be acceptable based on evidence that hazards, causes, and mitigations have been identified, implemented, and verified in accordance with DoD customer requirements. The evidence supports the conclusion that hazard controls provide the required level of mitigation and the resultant risks can be accepted by the appropriate risk acceptance authority. In this regard, software is no different from hardware and operators. If the software design does not meet safety requirements, then there is a contribution to risk associated with inadequately verified software hazard causes and controls. Generally, risk assessment is based on quantitative and qualitative judgment and evidence. Table B-I shows how these principles can be applied to provide an assessment of risk associated with software causal factors.

TABLE B-I. Software hazard causal factor risk assessment criteria

  • e. Defining and following a process for assessing risk associated with hazards is critical to the success of a program, particularly as systems are combined into more complex SoS. These SoS often involve systems developed under disparate development and safety programs and may require interfaces with other Service (Army, Navy/Marines, and Air Force) or DoD agency systems. These other SoS stakeholders likely have their own safety processes for determining the acceptability of systems to interface with theirs. Ownership of the overarching system in these complex SoS can become difficult to determine. The process for assessing software’s contribution to risk, described in this Appendix, applies the same principles of risk mitigation used for other risk contributors (e.g., hardware and human). Therefore, this process may serve as a mechanism to achieve a “common ground” between SoS stakeholders on what constitutes an acceptable level of risk, the levels of mitigation required to achieve that acceptable level, and how each constituent system in the SoS contributes to, or supports mitigation of, the SoS hazards.

This is the last excerpt from the Standard


Mil-Std-882E Appendix A

This is Mil-Std-882E Appendix A.
Back to the previous excerpt: 400-Series Tasks

GUIDANCE FOR THE SYSTEM SAFETY EFFORT

A.1 Scope. This Appendix is not a mandatory part of the standard. The information contained herein is intended for guidance only. This Appendix provides guidance on the selection of the optional tasks and use of quantitative probability levels.

A.2. Task Application. The system safety effort described in Section 4 of this Standard can be augmented by identifying specific tasks that may be necessary to ensure that the contractor adequately addresses areas that the Program needs to emphasize. Consideration should be given to the complexity and dollar value of the program and the expected levels of risks involved. Table A-I provides a list of the optional tasks and their applicability to program phases. Once recommendations for task applications have been determined, tasks can be prioritized and a “rough order of magnitude” estimate should be created for the time and effort required to complete each task. This information will be of considerable value in selecting the tasks that can be accomplished within schedule and funding constraints.

TABLE A-I. Task application matrix

A.3. Quantitative Probability Example. For quantitative descriptions, the frequency is the actual or expected number of mishaps (numerator) during a specified exposure (denominator). The denominator can be based on such things as the life of one item; number of missile firings, flight hours, systems fielded, or miles driven; years of service, etc.
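The arithmetic in A.3 is a simple ratio, sketched below. The band thresholds are an assumption about what Table A-II contains and should be checked against the Standard. The function names and example figures are hypothetical.

```python
def mishap_frequency(mishaps: float, exposure: float) -> float:
    """Actual or expected mishaps (numerator) per unit of exposure (denominator)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return mishaps / exposure


# Assumed lower bounds for each example probability level (to be verified
# against Table A-II). Anything below the last bound is "Improbable".
LEVEL_LOWER_BOUNDS = [
    ("Frequent", 1e-1),
    ("Probable", 1e-2),
    ("Occasional", 1e-3),
    ("Remote", 1e-6),
]


def probability_level(frequency: float) -> str:
    """Map a quantitative frequency onto a named probability level."""
    for name, lower_bound in LEVEL_LOWER_BOUNDS:
        if frequency >= lower_bound:
            return name
    return "Improbable"


# Invented example: 3 expected mishaps over 150,000 flight hours.
freq = mishap_frequency(3, 150_000)
level = probability_level(freq)
```

Whatever denominator is chosen (flight hours, firings, item life), it needs to be stated alongside the number, because the same frequency means very different things against different exposures.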

TABLE A-II. Example probability levels

Forward to the next excerpt: Appendix B


System Safety Principles (Short)

Here is the short (15 mins) video on System Safety Principles, which is a sample of the full (45 mins) video on the subject.

System Safety Principles (15 mins)

Topics

  • Foundational statement;
  • Planning;
  • Management Authority;
  • Safety Precedence;
  • Safety Requirements;
  • System Analyses;
  • Assumptions & Criteria;
  • Emphasis & Results;
  • Management Authority Responsibilities;
  • Software hazard analysis; and
  • An Effective System Safety Program.

See both the videos on Patreon, here.


System Safety Principles

The Full Transcript

Hello!

Welcome to the Safety Artisan, where you will find professional, pragmatic and impartial guidance and educational products on all things safety, be they System Safety, design safety or functional safety. Call it whatever you want. Today we’re going to be talking about System Safety principles. We will be going through some System Safety principles from the American Federal Aviation Administration (FAA) System Safety Handbook.

This is a transcript of the full, 45-minute video, which you can see on Patreon, here.

Topics

So, our topics for today. There’s a fundamental statement to start with. We’ll talk about planning and Management Authority, how we achieve safety and the precedence of controls that we prefer to use, safety requirements and analyses, assumptions and criteria, emphasis and results, Management Authority responsibilities, software, and how to get an effective System Safety program. There’s quite a lot here, so we’re going to charge on and see what we get.

System Safety is a Basic Requirement

The first thing we need to consider is that System Safety is a basic requirement of the total system. The FAA deal with airplanes, so I thought I’d show you a picture of an airplane that’s had a bad day. Now, the engines, the wings and the tail have, I think, been removed after the crash, but as you can see it got bashed in at the front when it crashed. The point we’re making here is that safety is to do with the total system. An unsafe airplane, an airplane that’s crashed, no longer flies. It’s no longer really an airplane, it’s just shattered remains. Safety is a fundamental thing that we need from the whole system. We need the whole airplane to work. We could, for example, talk about the safety of the wings or the safety of the engines, but that wouldn’t make much sense in isolation, would it? If the engines aren’t on the airplane, or the wings aren’t on the airplane, then what’s the point of them? So, we need System Safety. It’s a basic requirement of the whole thing, and the whole thing working.

Planning

OK, the next principle is planning. What do we need from planning? Well, we need the safety engineering effort to be comprehensive. In other words, we need it to cover everything it needs to cover, and it needs to be integrated: it all needs to be joined up. If the safety effort isn’t both of those things, then it’s either going to fall short or it’s going to be disconnected in some way, and then we’re not going to have an effective safety effort.

Now, we need ongoing effort over a period to achieve safety for any kind of significant system. That probably means that we’re going to do a whole bunch of different tasks, and those tasks have got to be done in sequence. They’ve got to relate to each other. If you can, imagine a planning chart, a Gantt chart, a waterfall chart, that kind of thing, with tasks linked together. Typical planning stuff. Nothing unusual there. The plan must also influence facilities, equipment, procedures and personnel.

When it says influence, I guess it’s better to say making choices, or decisions. Which facilities? Which personnel? Which procedures? And why are they appropriate to what we’re trying to achieve? That’s what that’s really all about, the fourth bullet point. Here we’ve got ‘applicable to all program phases’. We need a plan that gets us started, that gets the work done and brings things to a satisfactory conclusion. It has to cover all parts of the program, right through to integration and getting our airplane or our other system into service, and then we need it to cover all the other stuff as well.

It’s very easy to think about the sexy design stuff, particularly with things like airplanes. But we need to cover all the other things as well. What about transporting our system or spares? What about logistics support? What about spares and repair? What about storage, packaging and handling? How do we ensure that stuff arrives where it’s supposed to, in a fit state to be used, and that kind of thing? Finally, not every program is all about the development of new things. There are probably going to be some non-developmental items or designs along the way. We’re going to reuse some stuff from elsewhere, and we’ve got to make sure that it fits in and contributes to safety, so there are no disconnections or incompatibilities. We need to think about those NDIs as well. Whether or not we are in control of their development, we need to think about that stuff. These seven bullet points talk about the comprehensiveness of the plan.

Management Authority

Okay, Management Authority. The FAA handbook is getting a bit long in the tooth by now, it must be said: it’s about 19 years old. In it, we have the concept that the FAA is the regulator. Then we’ve got the Management Authority, which is whoever is putting together, in this case, an airplane project. And then we’ve got the idea that the Management Authority has staff and also contractors. The Management Authority is contracting out certain things. They might be contracting out all the development, or just bits of it, or whatever it might be.

So, in this concept, the M.A. has got to manage the overall system safety effort. They’ve got to pull it all together, and the managerial and technical procedures to be used must be approved by the Management Authority. It’s the Management Authority that resolves any conflicts between safety and other design issues and resolves conflicts between different contractors. The Management Authority really has the power here and, if need be, they must knock heads together in order to make sure that the whole thing works. That’s a key concept here. We’ll come back to that later, as you’ll see.

Precedence of Controls

Moving on now, when we talk about controlling risk, we have several options for what kind of controls we can use. The FAA principles say we should start with designing for minimum hazards. So, we should try and make our system, whatever it may be, as inherently safe, as intrinsically safe as we can by designing out dangerous features.

Almost certainly, we cannot completely design out risk in any significant system. Maybe we need to use specific safety devices. There’s a very simple illustration on the right. What you see are those little white boxes in the center with the wiring coming out the top and bottom. They are circuit breakers, and they are what’s called residual current device (RCD) circuit breakers. If a circuit breaker detects a spike of voltage or current on the line, it will trip and isolate whatever it is feeding electricity to. So, if you’ve had a short circuit, or you have an accident that would probably cause a voltage spike, the RCD circuit breaker trips and protects people from electrocution, or protects equipment from being overvolted, in which case it might fail or catch fire or something.

There is a good example of some safety devices that you could fit into an electrical system. Having designed for minimum hazard and added safety devices, we could warn people of impending problems. We could fit alarms or warning lights, or there might be warning signs. We might have a sign on the side of this box with these circuit breakers in, saying ‘watch out, there’s electricity’.

Finally, we can use procedures. We can have written procedures that tell people how to do stuff safely, with warnings and cautions that say ‘watch out for this’ or ‘don’t do that’ or ‘do this in a particular way’. In the case of the illustration, the procedure might say that you need to isolate the electricity before you open this box. There are all sorts of options, but we want to start with the most effective option, which is designing out hazards. In fact, you will still see a version of this precedence of controls in, for example, Australian work health and safety today. It’s not called precedence of controls, it’s called a hierarchy of controls, but it says much the same thing.
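To make the ordering concrete, here is a small sketch of the precedence of controls as an ordered type. It is only an illustration: the class and function names are hypothetical, and the candidate controls are taken from the circuit-breaker example, apart from the lower supply voltage, which is invented to stand in for a design-out option.

```python
from enum import IntEnum


class ControlType(IntEnum):
    """Precedence of controls: a lower number is preferred."""

    DESIGN_FOR_MINIMUM_HAZARD = 1
    SAFETY_DEVICE = 2
    WARNING = 3
    PROCEDURE = 4


def rank_controls(candidates: dict[str, ControlType]) -> list[str]:
    """Return control descriptions ordered from most to least preferred."""
    return sorted(candidates, key=lambda name: candidates[name])


candidates = {
    "Isolate the supply before opening the box": ControlType.PROCEDURE,
    "Fit RCD circuit breakers": ControlType.SAFETY_DEVICE,
    "Warning sign on the enclosure": ControlType.WARNING,
    "Use a lower supply voltage": ControlType.DESIGN_FOR_MINIMUM_HAZARD,
}
ranked = rank_controls(candidates)
```

In practice you would usually apply several of these together. The ranking only says where to start looking.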

Safety Requirements

Let’s talk about safety requirements. There are two points here that the FAA is making very wisely. First, those safety requirements have got to be consistent with other program requirements. A safety program in isolation is probably not going to be much use. It’s got to fit in and be consistent with what the overall program is doing to be effective. For example, if the safety program is making assumptions about how stuff is going to be used or maintained, or the environment it’s going to work in, but those assumptions are incorrect and not aligned with reality, then you probably have a problem.

Secondly, and this sounds a bit more controversial, performance, cost and other requirements may have priority over safety requirements.

I’ll let that sink in.

So, it sounds odd that other requirements may have priority over safety, but it’s quite logical when you think about it, because there’s no such thing as perfect safety. Nothing is safe. Breathing in and out has risks for human beings. We just need to get on with it. It may be that if we give safety priority over everything, we will end up with a system that has such low performance that it’s not worth using, or it may cost so much that nobody could afford to buy it, use it or sustain it. We’ve got to balance safety requirements with others, and safety may not always win. It may not always be the predominant requirement.

System Analyses

OK. So, how do we understand what safety we need and whether we’ve achieved it or not? The answer is system analysis. System analyses, as it says, are basic tools for developing design specifications. Now, they do a lot more than that, as we’ll see. But the focus of the FAA approach to System Safety is very much requirements-centric. The idea is that you do a lot of work to get the specifications and the requirements right, then you make sure that what you design matches the specifications, and then you verify and validate that it’s met the requirements at the end. And that is very much the American ethos for how you do safety.

Now, not all legal systems take this approach. For example, the UK and the Australian legal systems take the view that it’s safety by intent. So, we measure safety, or the achievement of safety, based on saying that risks have been reduced to an acceptable level (but even that, of course, is a requirement). The two approaches are not incompatible. We must understand what we’re doing and remember that these legal requirements, in whatever jurisdiction you’re in, are themselves requirements and need to be fed into the specifications. That’s the key thing, and it’s something I often see missing in safety programs in all sorts of countries, where whoever is developing the requirements specifications, at whatever level, has forgotten about a bunch of requirements that just have to be met.

Of course, we have to remember that the measure of safety is not the scope of the analysis. The analysis is just a means to an end. It’s a means to satisfy a requirement. That’s what it’s about. Having made sure we’ve considered all the requirements that we need for safety, we need to satisfy them. System analysis helps us to do that by looking at the system as a whole.

Purpose of Analyses

The purpose of these analyses: what do we do with them? I said they weren’t just for requirements. We can use analysis to identify hazards. It says corrective actions: it may be that we’ve identified hazards associated with the design, or possible designs, and we’re going to correct that design to reduce the hazard.

Or it may be that we’re going to add controls. We might use analysis in trade-offs, to understand and review safety considerations and see how much safety we can get. How much safety is reasonable to have? Back to the requirements: we might use analysis to determine or evaluate safety design requirements, and not just safety design requirements. We might also need to evaluate operational requirements, or requirements for testing, logistics, etc. Testing might be: how are we going to demonstrate safety? Again, the FAA is an American organization and the American approach to verification and validation tends to emphasize testing, sometimes to the exclusion of all else. Now, this isn’t necessarily the best way to do things, but that’s the mentality. Just be aware that’s one of the underlying philosophies of these principles, because they come from the American FAA.

Finally, we might use analyses to validate that requirements have been met. We might not be able to do testing. It might be too expensive or too dangerous to test something to destruction. Maybe what we need is a whole bunch of tests, at different test points, and analysis is the way to do that, particularly in the world of aircraft development. These days the way things tend to be done is that you have a model of your system and you use the model, in general, to validate that your system is correct, and then you use certain test points to validate the model, because it’s just too expensive and too time-consuming to physically test everything.

And then a final point that sounds rather odd: analyses are hazard analyses, not safety analyses. And I think what the FAA means by this is that we need to focus on real-world hazards. I’ve seen people get hung up trying to analyze a program, or trying to start their analysis by analyzing safety controls and thinking about, well, what happens if my control goes wrong?

Well, we need to start at the other end. We need to start with the real-world hazard. That’s what’s really going to hurt people. We can work out how effective controls need to be from analyzing the hazard, not the other way around. That’s quite a common mistake I see in safety programs: not focusing on physical hazards, because then you can end up going around in circles in a rather theoretical or philosophical approach as opposed to getting the job done. That rather harks back to the previous point. The whole point of the exercise is to satisfy requirements by having a safe system, not to do the analysis. Those are some purposes of analysis.

Assumptions and Criteria

As always in science and engineering, we’re going to need to make some assumptions because we can’t possibly prove absolutely everything. Now assumptions are good because they enable us to proceed. They enable us to work pragmatically, but we’ve got to make sure that they are sensible. We’ve got to verify and validate them as far as we can, and if we discover that an assumption turns out to be incorrect then we’re going to have to do something about it. Change in a program is inevitable. Sometimes, as we go through a large development program, we discover that the assumptions that we started with are not correct and we need to review and make changes.

That’s important. Again, people are sometimes nervous about doing that. They just want to, well, dare I say, some people just want to stick their head in the sand and ignore these things, but that’s not good safety management either. We’re going to have to set some risk criteria. I think we’re going to have to decide how much risk we can accept, what our risk appetite is. Because, as I’ve said before, you can’t have zero risk, and to pretend that you can is foolish and ultimately self-defeating, because that’s an unrealistic assumption and you end up with a safety program that’s built on fantasy rather than reality.

That’s no good. Making assumptions and setting criteria are an inherent part of risk management. We need to understand that a risk is something that hasn’t yet happened. If it’s already happened, it’s an issue. So, a risk is something that could happen in the future. We’re talking about making estimates. We must set assumptions and we must set criteria. OK, I think I’ve said enough about that.

Safety Management

Moving on to safety management. So, we’ve got the Management Authority. But of course, safety management needs to be done at every level where we can influence the design. So, it’s not just the Management Authority’s responsibility to manage safety. Everybody who is managing safety must define safety functions, the authority that various people have to make decisions, and the interrelationships between bodies and individuals. And then safety management must be about exercising appropriate control. It is control of the safety process that we’re talking about here, rather than management of hazards (controls). When we’re exercising safety management, we need to do all those things.

Effort and Emphasis

Not all risks are equal; not all safety controls, hazard controls, are equal. So, the degree of safety effort and the achievements that are required are dependent upon management emphasis. Now it says here by the FAA and contractors. The FAA acts as a regulator. Where we apply safety emphasis, the precedence, and how much effort we put in are going to be partly directed by the regulator, if you’re working in a regulated industry, or they may be directed by the law. Then the Management Authority or their contractors take and interpret those directives and apply them practically. And then, of course, we’re going back to safety management: we define functions, authority and relationships, and we exercise control in order to achieve the safety emphasis and the results that are required. That’s going to direct the effort.

We are probably going to spend a lot more effort managing higher risks than lower ones. We know our risks. Now that sounds so obvious, doesn’t it? But the reality is it’s very easy for programs to lose sight of what the big risks are and major on the minors, if you will. It’s too easy to get carried away with little things, and you end up spending all your time on a program dealing with trivia while ignoring the fact that the horse has already bolted (escaped)!

Clarity of Objectives

I guess that comes back to the clarity of objectives, doesn’t it? There’s an old saying, one of my favorites (I apologize): “if you don’t know to which port you are sailing then no wind is favorable”. You’ve got to know what your safety objectives are, what your safety targets are (if you’re going to set quantitative targets, but you don’t have to). Whatever your safety objectives and requirements are, the Management Authority needs to clearly state and communicate them to everybody who is required to take action to manage safety. So, again, this sounds obvious, but people get it wrong so often, or they just don’t do it. Then at the back end of a program, they’re surprised that they haven’t got what they need.

This can become a big problem if you’re at the back end of a program and the Management Authority is trying to demonstrate to the regulator, or whoever it might be, customers perhaps, that they met safety requirements and met safety objectives. They may find either that they’ve got kit that can’t meet the requirements, because they didn’t specify the requirement up front, or, more often, that they can’t demonstrate that the kit meets the requirements. That is quite galling, because you’ve got kit which you suspect is perfectly okay but you can’t prove it. So, then you end up having to spend more money and waste time at the back end of the program trying to fix those things. A lot of programs end up being late and over budget for things like that. The earlier and the clearer you set your objectives the better. That supports things like making trade-offs and making decisions.

It’s all about decision making.

Management Authority Responsibilities

And that brings us neatly on to Management Authority responsibilities. The assumption is that we have an SSP, a System Safety Program. So, we have a planned program that’s going to achieve safety. The MA must plan it, organize it and make it happen. The MA has got to establish what the safety requirements are for a system, for the design, and they’ve got to state those safety requirements in a contract. (The assumption is that we’re going to contract with somebody for the whole system maybe, or parts of the system.) We need a Statement of Work, to say: OK, what activities do we need to meet these requirements?

Now I guess what varies here is the amount of detail in the Statement of Work. The Management Authority might take a hands-off approach and go: okay, I’m going to specify some things in a Statement of Work, like we want reviews at particular points in the program, or we want safety reporting, or whatever it might be. Or they might take a really prescriptive approach and say we’re going to specify in a lot of detail what we want in the SoW. To do that effectively, the Management Authority has really got to understand the minimum that they need, and how that minimum might be reasonably achieved. The danger is that if you over-specify the Statement of Work and you’ve got something wrong, then you might end up stopping the contractors doing something sensible. Or the contractors might just blindly follow what you’ve told them to do rather than thinking about safety, which is what you really want!

Moving on. The MA must also review things and ensure (I think we would say in English) ENSURE an adequate and complete System Safety Program Plan. We’ve got a System Safety Program. We need a plan for it, and whether it be the MA that produces an overall plan or whether they produce a plan for themselves and then specify that the other stakeholders do their own, whichever it might be.

So, this System Safety Program, System Safety Program Plan, the Statement of Work and the requirements: those four things really are linked together and need to be thought of together. You need to take a holistic approach, because if the requirements are out of step with the program, if the plan doesn’t adequately describe the program that you need, or if the Statement of Work is at odds with the plan or the intended program, all these things are going to cause major problems. Those four things, the System Safety Program, safety requirements, Statement of Work and the System Safety Program Plan, really need to be worked consistently and coherently, and need to fit together.

Let’s move on from the first five bullet points. A rather odd one, it seems, is to supply historical data. Now that looks really odd, doesn’t it? Out of place with the others. But it’s quite logical. The Management Authority are the people who say: I want a system and I’m going to set everything up to make sure I get the system that I need. They’re not doing this in isolation. This might be a new system, but it’s probably replacing an old system, and a Management Authority should have some expectations from prior use of other systems or related systems. They should have some expectation of what is reasonable to expect from this kind of system. In other words, setting the safety requirements.

What kind of accidents and incidents have we seen in the past, and therefore what kind of hazards and risks are we going to need to control? So, that historical data is very important, and it might literally be lots and lots of low-level data, or it might be something a bit higher level, where we’ve learned some lessons from the past and those lessons have helped to form our safety requirements for this future system. Historical data is very important.

And again, it’s very easy to get wrong. With historical data, usually what we find in the real world is that we have underreporting. We have confused reporting, and we’ve got a lot of data. We’re not always sure what it means, whether there are any overlaps, that kind of thing. Gathering historical data and analyzing it can be quite difficult, but it can also be tremendously useful. It’s worth doing.

So, next bullet point: the MA may need to review contractor System Safety effort, what they’re doing and the data that they’re producing. The MA needs to ensure specifications are updated with analysis and test results. Again, we talked about change being inevitable. Somebody has got to make that change happen and make sure the effects of change ripple through the system consistently, and that somebody is the MA. Somebody has got to have the authority to manage these things. One body. Management by committee doesn’t always work very well. Somebody, some organization or some individual, who clearly has authority to lead.

Finally, we need to establish and operate System Safety groups. With these groups or committees, whatever you want to call them, we need to bring together different stakeholders, different expertise and people with different competencies in order to support the Management Authority. The final decision rests with the Management Authority, but the MA needs to pull together enough expertise to enable them to make sensible decisions. There’s a balance between this unity of leadership, unity of purpose, and the diversity of representation that brings everything we need into the decision-making process.

Software

Okay: software! Now, this is a slight aside, when the FAA came up with these principles, software was maybe a little bit rarer back then. Now, these days software is everywhere. But back in 2000, particularly on high integrity systems, like airplane software, it was rarer. It was there and had been for some time, but it wasn’t always doing safety-related stuff. So, it’s still seen as a bit of a special case and to be honest, even these days lots of people are frightened of software because it’s intangible, and I suspect I’m going to end up doing quite a few sessions talking about software safety and explaining it.

We note that the FAA is still taking their very much requirements-focused approach. So, analysing software for hazards is seen in this approach as all about taking requirements from the top left-hand side of the V model, which we see illustrated here, and flowing those requirements down to lower and lower levels until we get to implementation: the development of the software. Then as we build, we conduct unit testing, integration testing, system testing, and user testing or operational testing, whatever you want to call it.

We progressively build up testing to show that we have verified that the requirements, at every level in the V model, have been met. This is a philosophy for looking at software and it is correct, but it’s not the only way of looking at software. This is a very American approach. It emphasizes requirements. It emphasizes testing. We will see when we get to the specialist subjects on software that software is not always very amenable to being tested. And just because software meets all its requirements – that’s great, maybe we can demonstrate that – can we demonstrate that it doesn’t do anything it’s not supposed to do? Often in safety that’s half the battle, or even more than half. So, I’m not necessarily a fan of this statement here, and to be honest it is a bit out of date.

System Safety Program

So, we move on and this is our final slide. We’ve talked about the System Safety Program before and we’ve got some good principles here. What do we need for an effective System Safety Program? And that word effective is key, because anybody can make up a program that may or may not be effective. What do we need to make it work? Well, we need a plan, a planned approach to getting tasks done, getting them accomplished. Again, I have seen lots of people start tasks and not finish them, or not finish them successfully. We need qualified people. Once again, I’ve seen lots of programs with people who don’t really know what they’re doing and they’re very busy. They’re running around like headless chickens. Maybe they’ve got a lot of people, but if they don’t know what they’re doing then, sure, if directed sensibly they may still get a result, but it’s probably not going to be very elegant. So, we need people who are competent at what they are doing.

We need somebody or something that wields the authority to get stuff done, implement tasks, and that authority has got to flow through all levels of management (because we might have multiple levels).  We’ve got the Management Authority in this model, who is reporting to the FAA and trying to demonstrate to that regulator that they’ve done what they were supposed to. Maybe you’ve got internal levels of management, but in the end the Management Authority has got to manage contractors, perhaps at multiple levels. On complex systems, you may have many levels of contracts contributing these parts and components and sub-assemblies et cetera et cetera into an overall complex system.

Finally, we’ve got to have appropriate staffing and funding. We’ve got to have enough people with the right skills to get the job done, and that all costs money. Very often safety-qualified people are hard to find and therefore they tend to be expensive. That’s when people like myself, safety consultants, get brought in, because a Management Authority, or the contractors working for them, discover that they don’t have enough staff with the right experience and competence to get the job done. People like me get brought in and we can be quite expensive!

Nothing wrong with doing that of course. But usually, to get effective results, I find that the Management Authority needs to have enough competent people at least to understand, to be able to realize we’re not making progress here, we need to bring in more highly qualified people. You need enough knowledge about safety in order just to realize that you’re not cutting it and you need to bring in some higher-powered help.

That’s one of the reasons for The Safety Artisan to exist, really, is to help people have enough background to realize what they’re supposed to be doing versus maybe what’s going on. Once you have that knowledge then hopefully you can build up enough knowledge to assess the situation and to decide whether what you’re doing is adequate or whether you need further help. That minimum level of knowledge is what you need to succeed. Once you’ve got that then maybe you buy more expertise and employ people in-house or maybe you bring people in temporarily, but that understanding requires a certain base-level knowledge about safety.  And that’s what the Safety Artisan is all about, ladies and gentlemen. That’s a nice point on which to end.

Copyright Statement

Just to say that all the “quotations in italics” are from the U.S. Federal Aviation Administration System Safety Handbook. As you can see, it was published in the year 2000. It is getting a bit long in the tooth in some ways, but the basic principles are good ones. To be honest, I can’t find them as clearly articulated anywhere else, even today, certainly not in a form publicly available for you and me to share. So, thanks and appreciation to the FAA for doing that. I do hope one day soon they’re going to update that System Safety Handbook because it is a very useful beast. There are still people out there using it and maybe not understanding where it falls short these days.

Now, U.S. government standards tend to be copyright free. The text itself is copyright free, but this video presentation and the value add that I’m providing is copyright of the Safety Artisan, 2019.

To understand how current this is: I’m recording this on the 26th of October 2019. Maybe you found this video on the Safety Artisan Page at www.Patreon.com, or maybe you found it elsewhere, but you will find all my System Safety videos on Patreon.com/SafetyArtisan.

That’s the end of the presentation on System Safety Principles. Thanks for your attention. It just remains for me to say thanks for tuning in, as always. I will see you soon. Cheers now.

See the video on Patreon, here.

System Safety Concepts, Part 2

There are two versions of the System Safety Concepts video. The short version is available in a post here, as well as at the Safety Artisan Patreon page and on my YouTube channel.

The full version of the video is only available at the Safety Artisan Patreon page. The transcript is below.

Transcript, ‘System Safety Concept’ (Full)

Hi everyone, and welcome to the Safety Artisan, where you will find professional, pragmatic, and impartial advice on all things safety. I’m Simon and welcome to the show today, which is recorded on the 23rd of September 2019. Today we’re going to talk about system safety concepts. A couple of days ago I recorded a short presentation on this, which is on the Patreon website and is also on YouTube. Today we are going to talk about the same concepts but in much more depth.

Hence, this video is only available on the ‘Safety Artisan’ Patreon page. In the short session, we took some time picking apart the definition of ‘safe’. I’m not going to duplicate that here, so please feel free to go have a look. We said that to demonstrate that something was safe, we had to show that risk had been reduced to a level that is acceptable in whatever jurisdiction we’re working in.

And in this definition, there are a couple of tests that are appropriate to the U.K., but perhaps not elsewhere. We also must meet safety requirements. And we must define the scope and bound the system that we’re talking about, whether a physical system or an intangible system like a computer program. We must define what we’re doing with it, what it’s being used for, and within which operating environment, within which context, it is being used. And if we can do all those things, then we can objectively say, or claim, that this system is safe. OK, that’s very briefly that.

Topics

We’re going to talk about a lot more topics. We’re going to talk about risk; accidents; the cause, hazard and consequence sequence. We’ll talk about requirements and, spoiler alert, what I consider to be the essence of system safety. And then we’ll get into talking about the process of demonstrating safety: hazard identification and analysis.

Risk reduction and estimation. Risk evaluation and acceptance. And then pulling it all together: risk management and safety management. And finally, reporting: making an argument that the system is safe, supporting it with evidence, and summarizing all of that in a written report. This is what we do, albeit in different ways and calling it different things.

Risk

Onto the first topic: risk and harm. Our concept of risk is a combination of the likelihood and severity of harm. Generally, we’re talking about harm to people: death, injury, damage to health. Now we might also choose to consider any damage to property and the environment. That’s all good. But I’m going to concentrate on harm to people, because usually that’s what we’re required to do by the law. And there are other laws covering the environment and property, sometimes, that we’re not going to talk about. Just to illustrate this point: risk is a combination of severity and likelihood.

We’ve got a very crude risk table here, with likelihood along the top and severity down the side. And we might see, by looking at the table, that if we have a high likelihood and high severity, well, that’s a high risk. Whereas if we have low likelihood and low severity, we might say that’s a low risk. And then in between, with a combination of high and low, we might say that’s medium. Now, this is a very crude and simple example. Deliberately.
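As a minimal sketch, the crude risk table just described can be written as a simple lookup. The level names and the `risk_level` function are illustrative assumptions, not taken from any standard; real matrices usually have more rows and columns.

```python
# Crude two-by-two risk table, keyed by (likelihood, severity).
# Names and levels are hypothetical, chosen only to mirror the spoken example.
RISK_TABLE = {
    ("high", "high"): "high",
    ("high", "low"): "medium",
    ("low", "high"): "medium",
    ("low", "low"): "low",
}


def risk_level(likelihood: str, severity: str) -> str:
    """Look up the risk level for a given likelihood and severity."""
    return RISK_TABLE[(likelihood, severity)]


print(risk_level("high", "high"))  # high
print(risk_level("low", "high"))   # medium
```

A dictionary is used so that the table itself is the single thing to change if a program defines its own matrix.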

You will see risk matrices like this in loads of different standards. And you may be required to define your own for a specific system. There are lots of variations on this, but they’re all basically doing this thing, and we’re illustrating how we determine the level of risk by that combination of severity and likelihood. I think a picture is worth a thousand words. Moving on to the accident: we’re talking about (in this standard) an unintended event that causes harm.

Accidents, Sequences and Consequences

Not all jurisdictions just consider accidental events; some consider deliberate ones as well. We’ll leave that out. A good example of that is work health and safety in Australia, but no doubt we’ll get to that in another video sometime. And the accident sequence is the progression of events that results in an accident. Now we’re going to illustrate the accident sequence in a moment, but before we get there, we need to think about causes. Here we’ve got a hazard: a physical situation or state of a system, often following some initiating event, that may lead to an accident. A thing that may cause harm.

And then allied with that we have the idea of consequences: an outcome or outcomes resulting from an event. Now that all sounds a bit woolly, doesn’t it? Let’s illustrate that. Hopefully, this will make it a lot clearer. Now, I’ve got a sequence here. We have causes that might lead to a hazard, and the hazard might lead to different consequences. And that’s the accident sequence. Now in this standard, they didn’t explicitly define causes.

Cause, Hazard and Consequence

They’re just called events. But mostly we will deal with causes and consequences in system safety. And it’s probably just easier to implement it. Whether or not you choose to explicitly address every cause, that’s often an optional step. But this is the accident sequence that we’re looking at. And these sort of funnels are meant to illustrate the fact that there may be many causes for one hazard, and one hazard may lead to many consequences. With some of those consequences, maybe, there is no harm at all.

We may not actually have an accident. We may get away with it. We may have a hazard, and no harm may befall a human. And if we take all of this together, that’s the accident sequence. Now it’s worth reiterating that just because a hazard exists, it does not necessarily lead to harm. But to get to harm, we must have a hazard; a hazard is both necessary and sufficient to lead to harmful consequences. OK.

Hazards: an Example

And you can think of a hazard as an accident waiting to happen. You can think of it in lots of different ways. Let’s think about an example: the hazard might be that somebody slips while walking. That slip might be caused by many things. It might be a wet surface: let’s say it’s been raining, and the pavement is slippery. Or it might be icy. It might be a spillage of oil on a surface, or you could imagine something slippery like ball bearings on a surface.

So, there’s something that’s caused the surface to become slippery. A person slips – that’s the hazard. Now the person may catch themselves; they may not fall over. They may suffer no injury at all. Or they might fall and suffer a slight injury; and, very occasionally, they might suffer a severe injury. It depends on many different factors. You can imagine if you slipped while going downstairs, you’re much more likely to be injured.
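The slip example can be sketched as a small data structure: many causes funnel into one hazard, which fans out into several consequences, some of them harmless. The class and field names here are hypothetical, chosen only to mirror the cause, hazard and consequence sequence described above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Consequence:
    description: str
    harmful: bool  # False models "we got away with it"


@dataclass
class Hazard:
    description: str
    causes: List[str] = field(default_factory=list)
    consequences: List[Consequence] = field(default_factory=list)

    def harmful_paths(self) -> List[Tuple[str, str, str]]:
        """Every cause -> hazard -> consequence path that ends in harm."""
        return [
            (cause, self.description, c.description)
            for cause in self.causes
            for c in self.consequences
            if c.harmful
        ]


slip = Hazard(
    description="person slips",
    causes=["wet surface", "ice", "oil spillage", "ball bearings"],
    consequences=[
        Consequence("catches themselves, no injury", harmful=False),
        Consequence("falls, slight injury", harmful=True),
        Consequence("falls, severe injury", harmful=True),
    ],
)

# 4 causes x 2 harmful consequences = 8 possible accident sequences
print(len(slip.harmful_paths()))
```

The sketch enumerates paths only; it says nothing about how likely each one is, which is where factors like stairs or the person's age would come in.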

And younger, healthy, fit people are more likely to get over a fall without being injured, whereas if they’re very elderly and frail, a fall can quite often result in a broken bone. If an elderly person breaks a bone in a fall the chances of them dying within the next 12 months are quite high. They’re about one in three.

So, the level of risk is sensitive to a lot of different factors. To get an accurate picture, an accurate estimate of risk, we’re going to need to factor in all those things. But before we get to that, we’ve already said that a hazard need not lead to harm. In this standard, where a hazard has occurred and could have progressed to an accident but didn’t, we call this an incident. A near miss.

We got away with it. We were lucky. Whatever you want to call it. We’ve had an incident but no one has been hurt. Hopefully, that incident is being reported, which will help us to prevent an actual accident in future. That’s another very useful concept that reminds us that not all hazards result in harm. Sometimes there will be no accident. There will be no harm simply because we were lucky, or because someone present took some action to prevent harm to themselves or others.

Mitigation Strategies (Controls)

But we would really like to deliberately design out or avoid hazards if we can. What we need is a mitigation strategy: we need a measure or measures that, when we put them into practice, reduce that risk. Normally, we call these things controls. Again, now we’ve illustrated this; we’ve added to the funnels. We’ve added some mitigation strategies and they are the dark blue dashed lines.

And they are meant to represent barriers that prevent the accident sequence progressing towards harm. And they have dashed lines because very few controls are perfect; you know, everything’s got holes in it. And we might have several of them. But usually, no control will cover all possible causes, and very few controls will deal with all possible consequences. That’s what those barriers are meant to illustrate.
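One way to picture the point that no single control covers every cause is a simple coverage check. This is only a sketch: the control names and the causes they address are hypothetical, and it does not model how imperfect each individual barrier is.

```python
from typing import Dict, Iterable, List, Set


def uncovered_causes(causes: Iterable[str], controls: Dict[str, Set[str]]) -> List[str]:
    """Return the causes that no control addresses at all."""
    covered: Set[str] = set().union(*controls.values())
    return sorted(set(causes) - covered)


causes = ["wet surface", "ice", "oil spillage", "ball bearings"]

# Hypothetical controls, each mapped to the causes it is assumed to address.
controls = {
    "mop up spills": {"wet surface", "oil spillage"},
    "grit the pavement": {"ice"},
}

print(uncovered_causes(causes, controls))  # ['ball bearings']
```

A gap in the result is a prompt to look for another barrier, or to accept and justify the remaining risk.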

That idea, that picture, will be very useful to us later, when we are thinking about how we’re going to estimate and evaluate risk overall and what risk reduction we have achieved, and how we talk about justifying that what we’ve done is good. That’s a very powerful illustration. Well, let’s move on to safety requirements.

Safety Requirements

Now. I guess it’s no great surprise to say that requirements, once met, can contribute directly to the safety of the system. Maybe we’ve got a safety requirement that says all cars will be fitted with seatbelts. Let’s say we’ll be required to wear a seatbelt.  That makes the system safer.

Or the requirement might be saying we need to provide evidence of the safety of the system. And, the requirement might refer to a process that we’ve got to go through or a set kind of evidence that we’ve got to provide. Safety requirements can cover either or both of these.

The Essence of System Safety

Requirements covering the safety of the system, or demonstrating that the system is safe, should give us assurance, which is adequate confidence, or justified confidence, supported with evidence by following a process. And we’ll talk more about process. We meet safety requirements. We get assurance that we’ve done the right thing. And this really brings us to the essence of what system safety is: we’ve got all these requirements – everything is a requirement really – including the requirement to demonstrate risk reduction.

And those requirements may apply to the system itself, the product. Or they may apply to the process that generates the evidence, or to the evidence itself. Putting all those things together in an organized and orderly way really is the essence of system safety. This is where we are addressing safety in a systematic way, in an orderly way, in an organized way. (Those words will keep coming back.) That’s the essence of system safety, as opposed to the day-to-day task of keeping a workplace safe.

Maybe by mopping up spills and providing handrails, so people don’t slip over. Things like that. We’re talking about a more sophisticated level of safety, because we have a more complex problem, a more challenging problem, to deal with. That’s system safety. We will start on the process now, and we begin with hazard identification and analysis. First, we need to identify and list the hazards and the accidents associated with the system.

We’ve got a system, physical or not. What could go wrong? We need to think about all the possibilities. And then having identified some hazards we need to start doing some analysis, we follow a process. That helps us to delve into the detail of those hazards and accidents. And to define and understand the accident sequences that could result. In fact, in doing the analysis we will very often identify some more hazards that we hadn’t thought of before, it’s not a straight-through process it tends to be an iterative process.

Risk Reduction

And ultimately what we’re trying to do is reduce risk. We want a systematic process, which is what we’re describing now: a systematic process of reducing risk. And at some point, we must estimate the risk that we’re left with, before and after all these controls, these mitigations, are applied. That’s risk estimation. Again, there’s that systematic word: we’re going to use all the available information to estimate the level of risk that we’ve got left. Recalling that risk is a combination of severity and likelihood.
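To make "a combination of severity and likelihood" concrete, here is a minimal sketch of risk estimation before and after a control is applied. The category names, the four-point scales and the use of multiplication are illustrative assumptions only; any real standard or project defines its own risk scheme.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical four-point severity scale (assumption, not from the standard)
    NEGLIGIBLE = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4


class Likelihood(IntEnum):
    # Hypothetical four-point likelihood scale (assumption, not from the standard)
    IMPROBABLE = 1
    REMOTE = 2
    OCCASIONAL = 3
    FREQUENT = 4


def risk_score(severity: Severity, likelihood: Likelihood) -> int:
    """Combine severity and likelihood into a single illustrative score."""
    return int(severity) * int(likelihood)


# Estimate risk before controls, then again after a control that makes
# the accident less likely but does not change how bad it would be.
risk_before = risk_score(Severity.CRITICAL, Likelihood.OCCASIONAL)
risk_after = risk_score(Severity.CRITICAL, Likelihood.REMOTE)
```

In this sketch the control reduces likelihood only, so the score drops from 9 to 6 while the severity stays the same.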

Now as we get towards the end of the process, we need to evaluate risk against set criteria. And those criteria vary depending on which country you’re operating in or which industry we’re in: what regulations apply and what good practice is relevant. All those things can be a factor. Now, in this case, this is a U.K. standard, so we’ve got two tests for evaluating risk. It’s a systematic determination using all the available evidence. And it should be an objective evaluation as far as we can make it.

Risk Evaluation

We should use certain criteria on whether a risk can be accepted or not. And in the U.K. there are two tests for this. As we’ve said before, there is ALARP, the ‘As Low As is Reasonably Practicable’ test, which says: Have we put into practice all reasonably practicable controls to reduce risk? (This is the risk reduction target.) And then there’s an absolute level of risk to consider as well. Because even if we’ve taken all practicable measures, the risk remaining might still be so high as to be unacceptable to the law.

Now that test is specific to the U.K., so we don’t have to worry too much about it. The point is there are objective criteria, which we must test ourselves or measure ourselves against: an evaluation that will pop out the decision as to whether further risk reduction is necessary if the risk level is still too high. We might conclude that there are still reasonably practicable measures that we could take. Then we’ve got to take them.

We have an objective decision-making process to say: have we done enough to reduce risk? And if not, we need to do some more until we get to the point where we can apply the test again and say yes, we’ve done enough. Right, that’s rather a long-winded way of explaining that. I apologize, but it is a key issue and it does trip up a lot of people.
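The two tests described above can be sketched as a simple decision function. This is only an illustration: the `RiskAssessment` structure, the numeric risk score and the `TOLERABLE_LIMIT` value are hypothetical, and in practice deciding what is "reasonably practicable" is a matter of judgement and evidence, not a flag in a program.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical absolute limit on residual risk (assumption for illustration)
TOLERABLE_LIMIT = 8


@dataclass
class RiskAssessment:
    residual_risk: int  # estimated risk score after controls are applied
    # Reasonably practicable controls that have NOT yet been implemented
    outstanding_practicable_controls: List[str] = field(default_factory=list)


def enough_risk_reduction(assessment: RiskAssessment,
                          tolerable_limit: int = TOLERABLE_LIMIT) -> bool:
    """Return True only if both tests pass."""
    # Test 1 (ALARP): no reasonably practicable control is left undone
    alarp = not assessment.outstanding_practicable_controls
    # Test 2 (absolute level): the remaining risk is not too high
    tolerable = assessment.residual_risk <= tolerable_limit
    return alarp and tolerable
```

If either test fails, the answer is "not yet": we do more risk reduction and apply the tests again.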

Risk Acceptance

Now, once we’ve concluded that we’ve done enough to reduce risk and no further risk reduction is necessary, somebody should be in a position to accept that risk.  Again, it’s a systematic process, by which relevant stakeholders agree that risks may be accepted. In other words, somebody with the right authority has said yes, we’re going to go ahead with the system and put it into practice, implement it. The resulting risks to people are acceptable, providing we apply the controls.

And we accept that responsibility.  Those people who are signing off on those risks are exposing themselves and/or other people to risk. Usually, they are employees, but sometimes members of the public as well, or customers. If you’re going to put customers in an airliner you’re saying yes there is a level of risk to passengers, but that the regulator, or whoever, has deemed [the risk] to be acceptable. It’s a formal process to get those risks accepted and say yes, we can proceed. But again, that varies greatly between different countries, between different industries. Depending on what regulations and laws and practices apply. (We’ll talk about different applications in another section.)

Risk Management

Now putting all this together we call this risk management.  Again, that wonderful systematic word: a systematic application of policies, procedures and practices to these tasks. We have hazard identification, analysis, risk estimation, risk evaluation, risk reduction & risk acceptance. It’s helpful to demonstrate that we’ve got a process here, where we go through these things in order. Now, this is a simplified picture because it kind of implies that you just go through the process once.

With a complex system, you go through the process at least once. We may identify further hazards when we get into hazard analysis and estimating risk. In the process of trying to do those things, even as late as applying controls and getting to risk acceptance, we may discover that we need to do additional work. We may try and apply controls and discover the controls that we thought were going to be effective are not effective.

Our evaluation of the level of risk and its acceptability is wrong because it was based on the premise that controls would be effective, and we’ve discovered that they’re not, so we must go back and redo some work. Maybe as we go through, we even discover hazards that we hadn’t anticipated before. This can and does happen; it’s not necessarily a straight-through process. We can iterate through this process, perhaps several times, while we are moving forward.

Safety Management

OK, Safety Management. We’ve gone to a higher level really than risk because we’re thinking about requirements as well as risk. We’re going to apply organization, we’re going to apply management principles to achieve safety with high confidence. For the first time we’ve introduced this idea of confidence in what we’re doing. Well, I say the first time, this is assurance, isn’t it? Assurance, having justified confidence or appropriate confidence, because we’ve got the evidence. And that might be product evidence too: we might have tested the product to show that it’s safe.

We might have analysed it. We might have said, well, we’ve shown that we follow the process that gives us confidence that our evidence is good. And we’ve done all the right things and identified all the risks. That’s safety management. We need to put that in a safety management system: we’ve got a defined organization structure, we have defined processes, procedures and methods. That gives us direction and control of all the activities that we need to put together in a combination to effectively meet safety requirements and safety policy.

And our safety tests, whatever they might be. More and more now we’re thinking about top-level organization and planning to achieve the outcomes we need. With a complex system, with a complex operating environment and a complex application.

Safety Planning

Now I’ll just mention planning. Okay, we need a safety management plan that defines the strategy: how we’re going to get there, how we’re going to address safety. We need to document that safety management system for a specific project. Planning is very important for effective safety. Safety is very vulnerable to poor planning. If a project is badly planned or not planned at all, it becomes very difficult to do safety effectively, because we are dependent on the process, on following a rigorous process to give us confidence that all results are correct. If you’ve got a project that is a bit haphazard, that’s not going to help you achieve the objectives.

Planning is important. Now the bit of that safety plan that deals with timescales, milestones and other date-related information, we might refer to as a safety programme. Now, this being a UK definition, British English has two spellings of program. The double-m-e version, ‘programme’, applies to that time-based progression, or milestone-based progression.

Whereas in the US and in Australia, for example, we don’t have those two words; we just have the one word, ‘program’, which covers everything: computer programs, and a program of work that might or might not be determined by timescales or milestones. But the point is that certain things may have to happen at certain points in time or before certain milestones. We may need to demonstrate safety before we are allowed to proceed to tests and trials or before we are allowed to put our system into service.

Demonstrating Safety

We’ve got to demonstrate that safety has been achieved before we expose people to risk. That’s very simple. Now, finally, we’re almost at the end. Now we need to provide a demonstration – maybe to a regulator, maybe to customers – that we have achieved safety. This standard uses the concept of a safety case. The safety case is basically, imagine a portfolio full of evidence. We’ve got a structured argument to put it all together. We’ve got a body of evidence that supports the argument.

It provides a compelling, comprehensible (or understandable) and valid case that a system is safe, for a given application or use, in a given operating environment. Really, that definition of what a safety case is harks back to that meaning of safety. We’ve got something that really hits the nail on the head. And we might put all of that together and summarise it in a safety case report, which summarises those arguments and evidence, and documents progress against the safety programme.

Remember I said our planning was important. We started off saying that we need to do this, that and the other in order to achieve safety. Hopefully, in the end, in the safety report we’ll be able to state that we’ve done exactly that. We did do all those things. We did follow the process rigorously. We’ve got good results. We’ve got a robust safety argument, with evidence to support it. At the end, it’s all written up in a report.

Documenting Safety

Now that isn’t always going to be called a safety case report; it might be called a safety assessment report or a design justification report. There are lots of names for these things. But they all tend to do the same kind of thing: they pull together the argument as to why the system is safe and the evidence to support the argument, and they document progress against a plan or some set of process requirements from a standard or a regulator or just good practice in an industry to say: Yes, we’ve done what we were expected to do.

The result is usually what justifies [the system] getting past that milestone, where the system is going into service and can be used. People can be exposed to those risks, but safely and under control.

Everyone’s a winner, as they say!

Copyright – Creative Commons Licence

Okay. I’ve used a lot of information from the UK government website. I’ve done that in accordance with the terms of its creative commons license, and you can see more about that [here]. We have complied with that, as we are required to, and we are required to say to you that the information we’ve supplied is under the terms of this license.

More Resources

And for more resources and for more lessons on system safety and other safety topics, I invite you to visit the safety artisan.com website or to go and look at the videos on Patreon, at my safety artisan page. And that’s www.Patreon.com/SafetyArtisan. Thanks very much for watching. I hope you found that useful.

We’ve covered a lot of information there, but hopefully in a structured way. We’ve repeated the key concepts and you can see that in that standard. The key concepts are consistently defined, and they reinforce each other, in order to get that systematic, disciplined approach to safety that we need.

Anyway, that’s enough from me. I hope you enjoyed watching and found that useful. I look forward to talking to you again soon. Please send me some feedback about what you thought about this video and also what you would like to see covered in the future.

Thank you for visiting the Safety Artisan. I look forward to talking to you again soon. Goodbye.

Links

You can see the full video at the Safety Artisan Patreon Page!

You can see the Short Video posted here.


System Safety Concepts, Part 1

System Safety Concepts – a Short Introduction

System Safety Concepts – Transcript

Hi everyone and welcome to the Safety Artisan, where you will find professional, pragmatic and impartial advice. Whether you want to know how safety is done or how to do it. I hope you’ll find today’s session helpful. It’s the 21st of September 2019 as I record this. Welcome to the show. So, let’s get started. Well, we’re going to talk today about System Safety concepts. What does it all mean?  We need to ask this question because it’s not obvious, as we will see.

If we look at a dictionary definition of the word safe, it’s an adjective. To be protected from or not exposed to danger or risk. Not likely to be harmed or lost. There are synonyms – protect, shield, shelter, guard and keep out of harm’s way. They’re all good words, and I think we all know what we’re talking about. However, as a definition, it’s too imprecise. We can’t objectively say whether we have achieved safety or not.

A Practical Definition of ‘Safe’

What we need is a better definition, a more practical definition. I’ve taken something from an old UK Defence standard. Forget about which standard, that’s not important. It’s just that we’re using a consistent set of definitions to work through basic safety concepts. And it’s important to do that because different standards, come from different legal systems and they have different philosophies. So, if you start mixing standards and different concepts together, that doesn’t always work.

OK so whatever you do, be consistent. That’s the key point. We’re going to use this set of definitions from the U.K. defence standard because they are consistent.

In this standard, ‘safe’ means: “Risk has been demonstrated to have been reduced to a level that is ALARP, and broadly acceptable or tolerable. And relevant prescriptive safety requirements have been met. For a system, in a given application, in a given Operating Environment.” OK, so let’s unpack that.

System Safety – Risk

So, we start with risk. We need to manage risk. We need to show that risk has been reduced to an acceptable level. As required perhaps by law, or regulation or a standard. Or just good practice in a particular industry. Whatever it is, we need to show that the risk of harm to people has been reduced. Not just any old reduction, we need to show that it’s been reduced to a particular level. Now in this standard, there are two tests for that.

And they’re both objective tests. The first one says as low as reasonably practicable. Basically, it’s asking have all reasonably practicable risk reduction measures been taken. So that’s one test. And the second test is a bit simpler. It’s basically saying reduce the absolute level of risk to something that is tolerable or acceptable. Now don’t worry too much about precisely what these things mean. The purpose for today is to note that we’ve got an objective test to say that we’ve done enough.

System Safety – Requirements

So that’s dealt with risk. Let’s move on to safety requirements. If a requirement is relevant, then we need to apply it. If it’s prescriptive, if it says you must do this, or you must do that. Then we need to meet it. There are two separate parts to this ‘Safe’ thing: we’ve got to meet requirements; and, we’ve got to manage risk. We can’t use one as an excuse for not doing the other.

So just because we reduce risk until it’s tolerable or acceptable doesn’t mean that we can ignore safety requirements. Or vice versa. So those are the two key things that we’ve got to do. But that’s not actually quite enough to get us there. Because we’ve got to define what we’re doing, with what and in what context. Well, we’re reducing the risk of a system. And the system might be a physical thing.

Defining the Scope: The System

It might be a vehicle, an aeroplane or a ship or a submarine, it might be a car or a truck. Or it might be something a bit more intangible. It might be a computer program that we’re using to make decisions that affect the safety of human beings, maybe a medical diagnosis system. Or we’re processing some scripts or prescriptions for medicine and we’ve got to get it right. We could poison somebody.

So, whether it’s a tangible or an intangible system, we need to define it. And that’s not as easy as it sounds, because if we’re applying system safety, we’re doing it because we have a complex system. It’s not a toaster. It’s something a bit more challenging. Defining the system carefully and precisely is really important and helpful. So, we define what our system is, our thing or our service. The system. What are we doing with it? What are we applying it to?

Defining the Scope: The Application

What are we using it for? Now, just to illustrate that no standard is perfect, whoever wrote that defence standard didn’t bother to define the application. Which is kind of a major stuff-up to be honest, because that’s really important. So, let’s go back to an ordinary dictionary definition just to get an idea of what it means. By the way, I checked through the standard that I was referring to, and it does not explain what it means by the application.

Otherwise, I would use that by preference. But if we go back to the dictionary, we see application: the act of putting something into operation. OK, so, we’re putting something to use. We’re implementing, employing it or deploying it; maybe we’re utilizing it, applying it, executing it, enacting it. We’re carrying it out, putting it into operation or putting it into practice. All useful words that help us to understand.

I think we know what we’re talking about. So, we’ve got a thing or a service. Well, what are we using it for? Quite obviously, you know a car is probably going to be quite safe on the road. Put it in water and it probably isn’t safe at all. So, it’s important to use things for their proper application, for the use for which they were designed. And then, kind of harking back to what I just said, the correct operating environment.

Defining the Scope: The Operating Environment

For this system, and the application to which we will put it. So, we’ve got a thing that we want to use for something. What’s the operating environment in which it will be safe? What’s it qualified or certified for? What’s the performance envelope that it’s been designed for? Typically, things work pretty well within the operating environment, within the envelope for which they were designed. Take them outside of that envelope and they perform not so well, maybe not at all.

You take an aeroplane too high and the air is too thin, and it becomes uncontrollable. You take it too low and it smashes into the ground. Neither outcome is particularly good for the occupants of the aeroplane, or whoever happens to be underneath it when it hits the ground. All of those three things: What is the system? What are we doing with it? And where are we doing it? All those things have to be defined. Otherwise, we can’t really say that risk has been dealt with, or that safety requirements have been met.
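Pulling the pieces of that definition together, here is a minimal sketch of ‘safe’ as a checklist. The `SafetyClaim` structure and its field names are hypothetical, invented for illustration; the point is only that the claim fails if the scope is undefined, if either risk test fails, or if relevant prescriptive requirements are not met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyClaim:
    # Scope: all three must be defined before 'safe' means anything
    system: str
    application: str
    operating_environment: str
    # Risk: both tests from the definition
    risk_is_alarp: bool
    risk_is_tolerable: bool  # stands for "broadly acceptable or tolerable"
    # Requirements: relevant prescriptive safety requirements
    prescriptive_requirements_met: bool


def is_safe(claim: SafetyClaim) -> bool:
    """Apply the definition of 'safe' discussed above to one claim."""
    scope_defined = all(
        text.strip()
        for text in (claim.system, claim.application, claim.operating_environment)
    )
    return (
        scope_defined
        and claim.risk_is_alarp
        and claim.risk_is_tolerable
        and claim.prescriptive_requirements_met
    )


car_on_road = SafetyClaim("car", "road transport", "public roads", True, True, True)
car_no_environment = SafetyClaim("car", "road transport", "", True, True, True)
```

Note that meeting requirements and managing risk are separate conditions joined by "and": passing one does not excuse failing the other.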

System Safety: why Bother?

So, we’ve spent several slides just talking about what safe means, which might seem a bit over the top. But I promise you it is not, because having a solid understanding of what we’re trying to do is important in safety. Because safety is intangible. So, we need to understand what it is we’re aiming for. As some Greek bloke said, thousands of years ago: “If you don’t know to which port you are bound, then no wind is favourable.”

It’s almost impossible to have a satisfactory Safety Program if you don’t know what you’re trying to achieve. Whereas, if you do have a precise understanding of what you’re trying to achieve, you’ve got a reasonably good chance of success. And that’s what it’s all about.

Copyright Statement

Well, I’ve quoted you some information from a UK government web site. And I’ve done so in accordance with the terms of its creative commons license. More information about the terms of that can be found at this page.

The Full Version is Here…

If you want more, if you want to unpack all the major definitions, all the system safety concepts that we’re talking about, then there’s a longer version of this video, which you can get at my Patreon page.

I hope you enjoy it. Well that’s it for the short video, for now. Please go and have a look at the longer video to get the full picture. OK, everyone, it’s been a pleasure talking to you and I hope you found that useful. I’ll see you again soon. Goodbye.