Hello everyone and welcome to the Safety Artisan. I’m Simon and today we’re going to be talking about System of Systems Hazard Analysis – a bit of a mouthful that. What does it actually mean? Well, we shall see.
System of Systems Hazard Analysis
So, for System of Systems Hazard Analysis, we’re using Task 209 as the description of what to do, taken from a military standard, Mil-Std-882E. But to be honest, it doesn’t really matter whether you’re doing a military system or a civil system, whatever it might be – if you’ve got a system of systems, then this will help you to do it.
Topics for this Session
Looking at what we’ve got coming up.
So, we look at the purpose of system of systems. By the way, if you’re wondering what that is, what I’m talking about is when we take different things that we’ve developed elsewhere, e.g. platforms, electronic systems, whatever it might be, and we put them together. Usually, it must be said, with humans gluing the system together somewhere to make it all tick and fit together. Then we want this collection of systems to do something new, to give us some new capability that we didn’t have before. So, that’s what I’m talking about when I say a system of systems. I’ll show you an example – it’s the best way. So, we’ve got a couple of slides on task description, a couple of slides on documentation, and a couple of slides on contracting. Task 209 is a very short task, and therefore I’ve decided to go through an example.
So, we’ve got seven slides of an example of a system of systems, safety case, and safety case report that I wrote. And hopefully, that will illustrate far better than just reading out the description. And that will also give us some issues that can emerge with systems of systems and I’ll summarize those at the end.
So, let’s get on. I’m going to call it the SOSHA for short: System of Systems Hazard Analysis. The purpose of the SOSHA, Task 209, is to perform and document the analysis of the system of systems and identify unique system of systems hazards. So, things we don’t get from each system in isolation. This task is going to produce special requirements to deal with these hazards, which otherwise would not exist, because until we put the things together and start using them for something new, we’ve not done this before.
Task Description (T209) #1
Task description: As in all of these tasks, the contractor shall perform and document an analysis of the system of systems to identify hazards and mitigation requirements. A big part of this, as I said earlier, we tend to use people to glue these collections, these portfolios, of systems together and humans are fantastic at doing that. Not always the ideal way of doing it, but sometimes it’s the only way of doing it within the constraints that we’ve got. The human is very important. The human will receive inputs from one or more systems and initiate outputs within the analysis and in fact within the real world, to be honest, which is what we’re trying to analyse. That’s probably a better way of looking at it.
And we’ve got to provide traceability of all those hazards to – it says – architecture locations, interfaces, data and stakeholders associated with the hazard. This is particularly important because with a system of systems each system tends to come with its own set of stakeholders, its own physical location, its own interfaces, etc, etc. The issue of managing all of those extraneous things and getting the traceability goes up. It is multiplied with every system you’ve got. In fact, I would say it goes up with the square of the number of systems. In the example we’ll see, we’ve got three systems being put together in a system of systems and, in effect, we had nine times the amount of work in that area, I would say. I think that’s a reasonable approximation.
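One way to read that "square of" estimate, as a rough sketch rather than anything the standard says: if every system can affect every system (itself included), the number of directed pairs to trace is n times n. The system names are borrowed from the example later in this session, and the function name `interaction_pairs` is made up purely for illustration.

```python
from itertools import product

def interaction_pairs(systems):
    """Return every ordered (source, target) pair, including a system with itself."""
    return list(product(systems, repeat=2))

# The three systems from the worked example later in this session.
systems = ["ship", "radar", "aircraft"]
pairs = interaction_pairs(systems)

print(len(pairs))  # 9, i.e. 3 squared
for source, target in pairs:
    print(f"{source} -> {target}")
```

Add a fourth system and the same count becomes 16, which is why the traceability effort grows so quickly.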
Task Description (T209) #2
Part two of the task description: The contractor will assess the risk of each hazard and recommend mitigation measures to eliminate the hazards. Or, as is very often the case when we can’t eliminate the hazards, to reduce the associated risks. Then, as always with this standard, it says we’re going to use tables one, two and three, which are the severity, probability and the risk matrix that come with the standard. Unless, of course, we have created or tailored our own matrix. Which we very often should do but it isn’t often done – I’ll have to do a session on how to tailor a matrix.
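To make the severity, probability and risk matrix idea concrete, here is a minimal sketch of a matrix lookup. The category names and cell values are my recollection of Tables I to III in Mil-Std-882E, so treat them as an assumption and check them against the standard, or against your own tailored matrix, before relying on them. The function name `assess_risk` is hypothetical.

```python
# Severity categories (Table I), in column order. Assumed values: verify before use.
SEVERITIES = ["Catastrophic", "Critical", "Marginal", "Negligible"]

# Risk matrix (Table III): probability level (Table II) -> risk level for each
# severity, in the same order as SEVERITIES. Assumed values: verify before use.
RISK_MATRIX = {
    "Frequent":   ["High", "High", "Serious", "Medium"],
    "Probable":   ["High", "High", "Serious", "Medium"],
    "Occasional": ["High", "Serious", "Medium", "Low"],
    "Remote":     ["Serious", "Medium", "Medium", "Low"],
    "Improbable": ["Medium", "Medium", "Medium", "Low"],
}

def assess_risk(severity, probability):
    """Look up the risk level for one hazard from its severity and probability."""
    if probability == "Eliminated":
        # An eliminated hazard has no residual risk to rank.
        return "Eliminated"
    return RISK_MATRIX[probability][SEVERITIES.index(severity)]

print(assess_risk("Catastrophic", "Remote"))
print(assess_risk("Marginal", "Occasional"))
```

A tailored matrix would simply swap in different lists, which is one reason it pays to keep the matrix as data rather than burying it in logic.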
Then the contractor has got to verify and validate the effectiveness of those recommended mitigation measures. Now, that’s a really good point and I often see that missed. People come up with control measures or mitigation measures but don’t always assess how effective they’re going to be. Sometimes you can’t so we just have to be conservative but it’s not always done as well as it could be.
Documentation (T209) #1
So, let’s move on. Documentation: So, whoever does the analysis – the standard assumes it’s a contractor – shall document the results. First, you’ve got to describe the system of systems, the physical and functional characteristics of the system of systems, which is very important. Capturing these things is not a given. It’s not easy when you’ve got one system, but when you’ve got multiple systems, some of which are perhaps being misused to do something they’ve never done before, then you’ve got to take extra care.
Then basically it says when you get more detail of the individual systems you need to supply that when it becomes available. Again, that’s important. And not only if the contractor supplies it, who’s going to check it? Who’s going to verify it? Etc., etc.
Documentation (T209) #2
Slide two on documentation: We’ve got to describe the hazard analysis methods and techniques used, providing a description of each method and technique used, and the assumptions and the data used in support. This is important because I’ve seen lots of times where you get a hazard analysis’ results and you only get the results. It’s impossible to verify those results or validate them to say whether they’ve been done in the correct context. And it’s impossible to say whether the results are complete or whether they’re up to date or even whether they were analysing the correct system because often systems come in different versions. So, how do you know that the version being analysed was the version you’re actually going to use? Without that description, you don’t know. So, it’s important to contract for these things.
And then hazard analysis results. What contents and formats do you want? It’s important to say. Also, we’re going to be looking to pull out the key items, the leading particulars, from the results. The top-level results are going to go into the hazard tracking system, which is more commonly known as a hazard log or a risk register, whatever it might be. It might be an Excel spreadsheet, it might be a very fancy database, but whatever it’s going to be, you’re going to have to standardize your fields and what things mean. Otherwise, the data is going to be a mess, of poor quality and not very usable. So, again, you’ve got to contract for these things upfront and make sure you make clear definitions and say what you want.
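As a sketch of what "standardize your fields" might look like in practice, here is one possible hazard log record. None of these field names come from the standard; they are assumptions chosen to reflect the points made above: record the method, the assumptions and the version of each system that was actually analysed, so that somebody can later check the analysis matches what goes into service.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class Status(Enum):
    """A fixed vocabulary for hazard status, so entries stay comparable."""
    OPEN = "open"
    MITIGATED = "mitigated"
    CLOSED = "closed"

@dataclass
class HazardRecord:
    hazard_id: str
    description: str
    system_versions: Dict[str, str]   # system name -> version that was analysed
    method: str                       # analysis method or technique used
    assumptions: List[str] = field(default_factory=list)
    status: Status = Status.OPEN

def version_mismatches(record, fielded_versions):
    """Names of systems whose analysed version differs from the fielded version."""
    return [name for name, version in record.system_versions.items()
            if fielded_versions.get(name) != version]

# Entirely hypothetical example data.
record = HazardRecord(
    hazard_id="H-001",
    description="Operator loses track of the aircraft during approach",
    system_versions={"radar": "1.2", "ship comms": "4.0"},
    method="Functional hazard analysis",
    assumptions=["Radar performs to specification"],
)
print(version_mismatches(record, {"radar": "1.3", "ship comms": "4.0"}))
```

The same structure works whether the log lives in a spreadsheet or a database; the point is that every entry answers the same questions in the same terms.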
Contracting: implicitly, we’ve been talking about contracting already, but this is what the standard says. So, the request for proposal or statement of work has got to include the following. Typically we have an RFP before we’ve got a contract, so we need to have worked out what we need really early in the program or project, which isn’t always done very well. To work out what you need, the customer, the purchaser, has probably got to do some analysis of their own. And I know I say this every time with these tasks, but it is so important. You can’t just dump everything on the contractor and expect them to produce good results because often the contractor is hamstrung. If you haven’t done your homework to help them do their work, then you’re going to get poor results and it’s not their fault.
So, we’ve got to impose the requirement for the task if we want it or need it. We’ve got to identify the functional disciplines. So, which specialists are going to do this work? Because very often the safety team are generalists. They do not have specialist technical knowledge in some of these areas. Or maybe they are not human factors specialists. We need to bring somebody in: some human factors specialists, some user representatives, people who understand how the system will be used in real life and what the real-world constraints are. We need those stakeholders involved – that’s very important. We’ve got to identify those architectures and systems which make up the SOS – very important. Then there’s the concept of operations. SOS is very much about giving capability. So, it’s all about what you are going to do with the whole thing when you put it together. How’s all that going to work?
Interesting one, E, which is unique, I think, to task 209, what are the locations of the different systems and how far apart are they? We might be dealing with systems where the distance between them is so great that transmission time becomes an issue for energy or communications. Let’s say you’re bouncing a signal from an aircraft or a drone around the world via a couple of satellites back to home base. There could be a significant lag in communications. So, we need to understand all of these things because they might give rise to hazards or reduce the effectiveness of controls.
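To put a rough number on that lag, here is a back-of-envelope sketch. It assumes geostationary relay satellites, that each hop is one uplink plus one downlink at the minimum possible path length (straight up and down), and it ignores all processing and switching delay, so the result is a lower bound. The function name `min_relay_delay_s` is made up for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
GEO_ALTITUDE_M = 35_786_000           # approximate geostationary altitude

def min_relay_delay_s(satellite_hops):
    """Lower bound on one-way delay: each hop is one uplink plus one downlink."""
    path_m = satellite_hops * 2 * GEO_ALTITUDE_M
    return path_m / SPEED_OF_LIGHT_M_PER_S

one_way = min_relay_delay_s(2)
print(f"one-way, two hops: {one_way:.2f} s")
print(f"round trip:        {2 * one_way:.2f} s")
```

Under these assumptions the one-way delay through two satellites is already close to half a second before any equipment delay is added, which is long enough to matter for a control that depends on a timely human or system response.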
Part F; what analysis, methods, techniques do you want to use? And any special data to be used? Again, with these collections of systems that becomes more difficult to specify and more important. And then do we have any specific hazard management requirements? For example, are we using standard definitions and risk matrix from a standard or have we got our own? That all needs to be communicated.
So, that is the totality of the task. As you can see, there’s not much to Task 209, so I thought it would be much more helpful to use an example, an illustration, and as they used to say in children’s TV, “Here’s one I made earlier” because a few years ago I had to produce a safety case report. I was the safety case report writer, and there was a small team of us generating the evidence, doing the analysis for the safety case itself.
What we were asked to do was to assure the safety of a system – in fact, it was two systems but I’ll just treat it as one – for guiding aircraft onto ships in bad weather. So, all of these things existed beforehand. The aircraft were already in service. The ships were already in service. Some of the systems were already in service, but we were putting them together in a new combination. So, we had to take into account human factors. That was very important. We’ll see why in just a moment.
The operating environment was quite demanding. So, the whole point is to get the aircraft safely back to the ships in bad weather. In good weather you could do it visually, but in bad weather, visual wasn’t going to cut it. So, we were being asked to operate in a much more difficult environment. That changed everything and drove everything.
We’ve got to consider operating procedures because, as we’re about to see, people are gluing the systems together. So, how do they make it work? And we’ve also got to think about maintenance and management. Although in actual fact, we didn’t really consider maintenance and management that much. As an ex-maintainer, this annoys me, but the truth is people are much more focused on getting their capability into service. Often, they think about support as an afterthought. We’ll talk about that one day.
Here’s a little demonstration of our system of systems. Bottom right-hand corner, we’ve got the ship with lots of people on the ship. So, if the aircraft crashes into it that could be bad news, not just for the people in the aircraft, but for the people on the ship – big risks there!
We’ve got our radar mounted on the ship so the ship is supplying the radar with power and control and data, telling it where to point for example. Also, the ship might be inadvertently interfering with the radar. There are lots of other electronic systems on the ship. There are bits of the ship getting in the way of the radar, depending on where you’ve put it, and so on and so forth. So, the ship interacts with the radar and the radar interacts with the ship. The radar is producing radiation. Could that be doing anything to the ship’s systems?
And then the radar is being operated. Now, I think that symbol is meant to indicate a DJ: we’ve got the DJ wearing headphones and we’ve got a disc there, but it looks like a radar scope to me. So, I’ve just hijacked that. That’s the radar operator who is going to talk to the pilot and give the pilot verbal commands to guide them safely back to the ship. So, that’s how the system works.
In an ideal world, the ship would use the radar and then talk electronically direct to the aircraft and guide it – maybe automatically? That would be a much more sensible setup. In fact, that’s often the way it’s done. But in this particular case, we had to produce a bit of a – I hesitate to call it a lash-up because it was a multi-million-dollar project, but it was a bit of a lash-up.
So, there are the human factors. We’ve got a radar operator doing quite a difficult job and a pilot doing a very difficult job trying to guide their aircraft back onto the ship in bad weather. How are they going to interact and perform? And then lastly, as I alluded to earlier, the aircraft and the ship do actually interact in a limited way. But of course, it’s a physical interaction, so you can actually hurt people and of course, if we get it wrong, the aircraft interacts with the surface of the ocean, which is very bad indeed for the aircraft. So, we’ve got to be careful there. So, there’s a little illustration of our system of systems.
And – this is the top-level argument that we came up with – it’s in goal structuring notation. But don’t worry too much about that – We’ll have a session on how to do GSN another time.
So, our top goal, or claim if you like, is that our system of systems is adequately safe for the aircraft to locate and approach the ship. So, that’s a very basic, very simple statement, but of course, the devil is in the detail and all of that detail we call the context. So, surrounding that top goal or claim, we’ve got descriptions of the system, of the aircraft and the ship. We’ve got a definition of what we mean by adequately safe and we’ve got safety targets and reporting requirements.
So, what supports the top goal? We’ve got a strategy and after a lot of consultation and designing the safety argument, we came up with a strategy where we said, “We are going to show that all elements of the system of systems are safe and all the interactions are safe”. To do that, we had to come up with a scope and some assumptions to underpin that as well to simplify things. Again, they sit in the context, we just keep the essence of the argument down the middle.
And then underneath, we’ve got four subgoals. We aim to show that each system equipment is safe to operate, so it’s ready to be operated safely. Then each one is safe in operation so it can be operated safely with real people, etc. And then we’ve got all system safety requirements are satisfied for the whole collection of stuff and then finally that all interactions are safe. So, if we can argue all four of those, we should have covered everything. Now, I suspect if I did this again today, I might do it slightly differently. Maybe a little bit more elegantly, but that’s not the point. The point is, we came up with this and it worked.
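The shape of that argument can be sketched as a small tree. This is only an illustration of the structure described above: a top goal, one strategy and four subgoals, with the wording paraphrased from this session. It is not a full goal structuring notation model; context, assumptions and evidence nodes are left out, and the class and function names are made up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                       # "goal" or "strategy"
    text: str
    children: List["Node"] = field(default_factory=list)

argument = Node("goal", "System of systems is adequately safe for the aircraft "
                        "to locate and approach the ship", [
    Node("strategy", "Show that all elements and all interactions are safe", [
        Node("goal", "Each system is safe to operate"),
        Node("goal", "Each system is safe in operation"),
        Node("goal", "All system of systems safety requirements are satisfied"),
        Node("goal", "All interactions are safe"),
    ]),
])

def outline(node, depth=0):
    """Render the argument as indented lines, one per node."""
    lines = ["  " * depth + f"[{node.kind}] {node.text}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines

def leaf_goals(node):
    """Goals with nothing beneath them, i.e. the ones still needing evidence."""
    if not node.children:
        return [node] if node.kind == "goal" else []
    return [leaf for child in node.children for leaf in leaf_goals(child)]

print("\n".join(outline(argument)))
print(len(leaf_goals(argument)), "subgoals to support with evidence")
```

Listing the leaf goals is a quick way to see what still has to be backed up by evidence before the top claim can stand.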
So, I’m going to unpack each one very briefly, just to illustrate some points. First of all, each component system is safe to operate. Each of these systems, bar one, had all been purchased already, sometimes a long time ago. They all came with their own safety targets, their own risk matrices, etc, etc. So, we had to make sure that when an individual system said, “This is what we’ve got to achieve” that that was good enough for the overall system of systems. So, we had to make sure that each system met its own safety requirements and targets and that they were valid in context.
Now, you would think that double-checking existing systems would be a foregone conclusion. In reality, we discovered that the ship’s communication system and its combat data system were not as robust as assumed. Some practical issues were reported by stakeholders and we also discovered some flaws in previous analysis that had been accepted a long time ago. Now, in the end, those problems didn’t change the price of fish, as we say. They didn’t make a difference to the overall system of systems.
The frailty of the ship’s comms got sorted out and we discovered it didn’t actually matter about the combat system. So, we just assumed that the data coming out of the combat system was garbage and it didn’t make any difference. However, we did upset a few stakeholders along the way. So beware, people don’t like discovering that a system that they thought was “tickety-boo” was not as good as they thought.
The second goal was to show that the system of systems is safe in operation. So, we looked at the actual performance. We looked at test results of the radar and then also we were very fortunate that trials of the radar on the ship with aircraft were carried out and we were able to look at those trials reports. And once again, it emerged that the system in the real world wasn’t operating quite as intended, or quite as people had assumed that it would. It wasn’t performing as well. So, that was an issue. I can’t say any more about that but these things happen.
Also, a big part of the project was that we included the human element. So, as I’ve said before, we had pilots and we had radar air traffic talk-down operators. So, we brought in some human factors specialists. They captured the procedures and tasks that the pilots and the radar operators had to perform, using what’s called a Hierarchical Task Analysis, and they did some analysis of the tasks and what could go wrong. Then they created a model of what the humans were doing and ran it through a simulation several thousand times. So in that way, they did some performance modelling.
Now, they couldn’t give us an absolute figure on workload or anything like that. But fortunately, our new system was replacing an older system which was even more informally cobbled together than the one that we were bringing in. And so, the human factors specialists were able to compare human performance in the old system vs. human performance with the new system. We were pleased to find out that the predicted performance was far better with the new system. The new system was much easier to operate for both the pilots and the talk-down radar operators. So, that was terrific.
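To illustrate why a relative comparison can still be useful when you don’t trust the absolute numbers, here is a toy Monte Carlo sketch. It is not the specialists’ model. The task steps, the timings and the choice of a triangular distribution are all invented for illustration; only the idea of running a task model several thousand times and comparing old against new comes from the account above.

```python
import random
import statistics

def simulate_task(step_times, runs, rng):
    """Total task time per run; each step is drawn from a triangular distribution.

    step_times is a list of (low, mode, high) tuples in seconds.
    """
    totals = []
    for _ in range(runs):
        totals.append(sum(rng.triangular(low, high, mode)
                          for low, mode, high in step_times))
    return totals

# Hypothetical step timings (seconds) for one exchange between operator and pilot.
OLD_SYSTEM = [(4, 6, 12), (3, 5, 10), (5, 8, 15)]
NEW_SYSTEM = [(3, 4, 7), (2, 3, 6), (4, 5, 9)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
old_mean = statistics.mean(simulate_task(OLD_SYSTEM, 5000, rng))
new_mean = statistics.mean(simulate_task(NEW_SYSTEM, 5000, rng))

print(f"new / old mean task time: {new_mean / old_mean:.2f}")
```

Even if every timing here were off by the same factor, the ratio between the two systems would be unchanged, which is the sense in which a comparison can be more trustworthy than an absolute figure.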
So, the third one: all system of systems safety requirements are satisfied. Now, this is a bit more nebulous, this goal, but what it really came down to was that when you put things together, very often you get what’s called emergent behaviour. As in, things start to happen that you didn’t expect or you didn’t predict based on the individual pieces. It’s the saying, two plus two equals five. You get more out of a system – you get synergy, for good or ill, when you start putting different things together.
So, does the whole thing actually work? And broadly speaking, the answer was yes, it works very well. There were some issues. A good example: the old radar that they used to use to talk the planes down was a search radar, so the operator could see other traffic apart from the plane they were guiding in. Now, the operator being able to see other things is both good and bad. On the one hand, it gives them improved situational awareness so they can warn off traffic if a collision situation develops. But it’s also bad because it’s a distraction for the operator. So, it could have gone either way.
So, the new radar was specialized. It focused only on the aircraft being talked down. So, the operator was blind to other traffic. So that was great in terms of decreasing operator workload and ultimately pilot workload as well. But would this increase the collision risk with other traffic? And I’ll talk about that in the summary briefly.
And then our final goal is to show that all interactions are safe between the guidance system, the aircraft and the ship. This was a non-trivial exercise because ships have large numbers of electronic systems and there’s a very involved process to go through to check that a new piece of kit doesn’t interfere with anything else or vice versa.
And also, of course, does the radiation from the new radar affect the ship? Because you’ve got weapons on the ship and some of the explosive devices that the weapons use are electrically initiated. So, could the radiation set off an explosion? So, all of those things had to be checked. And that’s a very specialized area.
And then we’ve got, does the system interfere with the aircraft and the aircraft with the system? What about the integration of the ship and the aircraft and the aircraft to the ship? Yet another specialized area where there’s a particular way of doing things. And of course, the aircraft people want to protect the aircraft and the ship people want to protect the ship. So, getting those two to marry up is also another one of those non-trivial exercises I keep referring to but it all worked out in the end.
Points to note: When we’re doing system of systems – I’ve got five points here, you can probably work some more points out from what I’ve said for yourself – but we’re putting together disparate systems. They’re different systems. They’ve been procured by different organizations, possibly, to do different things. The stakeholders who bought them and care about them have got different aims and objectives. They’ve got different agendas to each other. So, getting everyone to play nicely in the zoo can be challenging. And even with somebody pulling it all together at the top to say “This has got to work. Get with the program, folks!” there’s still some friction.
In particular, you end up with large numbers of stakeholders. For example, we would have regular safety meetings, but I don’t think we ever had two meetings in a row with exactly the same attendees because with a large group of people, people are always changing over and things move on. And that can be a challenge in itself. Secondly, we need to include the human in the loop in systems of systems because typically that’s how we get them all to play together. We rely on human beings to do a lot of translation work, in effect. So, how do the operators cope?
A classic mistake really with systems design is to design a difficult-to-operate system and then just expect the operator to cope. That can be from things as seemingly trivial as amusement park rides – I did a lesson on learning lessons from an amusement park ride accident only a month or two ago and even there it was a very complex system for two operators, neither of whom had total authority over the system or to be honest, really had the full picture of what was going on. As a result, there were several dead bodies. So, how did the operators cope, and have we done enough to support them? That’s a big issue with a system of systems.
Thirdly, this is always true with safety analysis, but especially so with system of systems. The real-world performance is important. You can do all the analysis in the world making certain assumptions and the analysis can look fine, but in the real world, it’s not so simple. We have to do analysis that assumes the kit works as advertised because you’ve got nothing else to go on until you get the test results and you don’t get them until towards the end of the program. So, you’re going down a path, assuming that things work, that they do what they say on the tin, and perhaps you then discover they don’t do what they say on the tin. Or they don’t do everything they say on the tin. Or they do what they say and they do some other things that you weren’t expecting as well and then you’ve got to deal with those issues.
And then fourthly, somewhat related to what I’ve just talked about, but you put systems together in an informal way, perhaps, and then you discover how they actually get on – what really happens. In reality, once you get above a certain level of complexity, you’re not really going to discover all the emergent behaviours and consequences until you get things into service and it’s clocked up a bit of time in service under different conditions in the real world. In fact, that was the case with this and I think with a system of systems, you’ve just got to assume that it’s sufficiently complex that that is the case.
Now, that’s not an unsolvable problem but, of course, how do you contract for that? You’ve got your contractors wanting you to accept their kit and pay them at a certain date or a certain point in the program, but you’re not going to find out whether it all truly works until it’s got into service and been in service for a while. So, how do you incentivize the contractor to do a good job or indeed to correct defects in a timely manner? That’s quite a challenge for system of systems and it’s something that needs thinking about upfront.
And then finally, I’ve said, remember the bigger picture. It’s very easy when you’re doing analysis and you’ve made certain assumptions and you set the scope, it’s very easy to get fixated on that scope and on those assumptions and forget the real world is out there and is unpredictable. We had lots of examples of that on this program. We had the ship’s comms that didn’t always work properly, we couldn’t rely on the combat system, the radar in the real world didn’t operate as well as it said in the spec, etc, etc. There were lots of these things.
And one example I mentioned was that with the new radar, the radar operator does not see any traffic other than the aircraft that is being guided in. So, there’s a loss of situational awareness there and there’s a risk, maybe an increased risk, of collision with other traffic. And that actually led to a disagreement in our team because some people had got quite fixated on the analysis and didn’t like the suggestion that maybe they’d missed something. Although it was never put in those terms, that’s the way they took it. So, we need to be careful of egos. We might think we’ve done a fantastic analysis and we’ve produced hundreds of pages of data and fault trees or whatever it might be, but that doesn’t mean that our analysis has captured everything or that it’s completely captured what goes on in the real world, because that’s very difficult to do with such a complex system of systems.
So, we need to be aware of the bigger picture, even if it’s only qualitatively. Somebody, a little voice, piping up somewhere saying, “What about this? Have we thought about that? I know we’re ignoring this because we’ve been told to but is that the right thing to do?” And sometimes it’s good to be reminded of those things and we need to remember the big picture.
Anyway, I’ve talked for long enough. It just remains for me to point out that all the text in quotations, in italics, is from the military standard, which is copyright free but this presentation is copyright of the Safety Artisan. As I’m recording this, it’s the 5th of September 2020.
For More …
And so if you want more, please do subscribe to the Safety Artisan channel on YouTube. You can see the link there, but just search for Safety Artisan on YouTube and you’ll find us. So, subscribe there to get free video lessons and also free previews of paid content. And then for all lessons, both paid and free, and other resources on safety topics, please visit the Safety Artisan at www.safetyartisan.com/ where I hope you’ll find much more good stuff that is helpful and enjoyable.