
Apollo 13 and lessons for resilience

Chris McFee • Apr 17, 2020


The Apollo 13 astronauts and recovery crew after splashdown on Friday 17th April 1970. Credit: NASA

Fifty years ago this Friday, the three astronauts on board the Apollo 13 command module safely splashed down in the Pacific Ocean. The story has been retold extensively in the press over the past week or so, with most articles focusing on the drama of the events themselves. But a good article by Kevin Fong in the Guardian emphasised how vital the training during the Apollo programme, and the response to the crisis, really were. Reading it, it struck me how much we can still learn from what happened then as we try to build resilient organisations.

In 1967 three astronauts died in the Apollo 1 fire whilst the spacecraft was on the ground undergoing testing for its mission. The subsequent NASA inquiry found that potential warning signs had been ignored: the high-oxygen atmosphere was always a serious fire risk, but whilst NASA had developed a procedure for a fire in the spacecraft in space, there was no procedure for a fire on the ground. Following the fire, NASA investigated in detail everything related to the accident and applied the lessons to the design and building of the later spacecraft. It also applied those lessons to how it planned and responded.

In the case of Apollo 13, survival was only possible because of great teamwork between the three astronauts, 180,000 nautical miles from Earth, and the people on the ground. Initial decisions had to be taken very quickly, with a limited understanding of what exactly had happened. That ability to handle an incident quickly and correctly is a key skill for a resilient organisation today.

There was strong leadership during the crisis, and a clear structure by which problems were addressed. In particular, everyone understood their roles and responsibilities, which had been rehearsed and applied many times. This was communicated clearly and widely, so that everyone knew what was happening and how their work fitted into the overall response. Where an individual had the knowledge and expertise, they were listened to. The team had to work quickly to understand what the situation was, how it could develop, and what the possible solutions were. And they knew exactly the timescale by which each solution needed to be in place. The response of the mission operations teams was crucial, and they were eventually awarded the Presidential Medal of Freedom for their work. (Although, this being the 1960s, it is really noticeable how almost all of the people you see working in Mission Control are men!)

Another contribution to the eventual success was the huge amount of work during the Apollo programme that went into identifying possible fault and multiple-fault situations, and then simulating and testing extensively to identify solutions. This showed the sort of things that would work, and what wouldn't. Even though the actual incident itself had never been simulated in detail (which is often the case when trying to identify potential emergencies), the processes and knowledge gained were crucial.

The use of the lunar module as a potential "lifeboat" was considered early in the Apollo programme, though not necessarily in combination with a disabled command and service module. And during the Apollo 10 simulation and testing programme an emergency lunar module activation plan was developed and refined. That plan proved invaluable as the basis of the procedure that could then be adapted during the actual incident as it became clearer what exactly was needed.

These qualities were shown when an improvised fix had to be made to allow square carbon dioxide filters to fit into round filter holes in the lunar module. Without this fix, the astronauts would have asphyxiated before they returned to Earth. It was achieved using all of the teamwork and design skills learned so painstakingly over the preceding years. (And by always having a good quantity of duct tape close to hand!)

The definition of resilience is “the ability of an organisation to absorb and adapt in a changing environment”. Throughout the Apollo programme NASA demonstrated some of the qualities that are needed to be a resilient organisation. If you can strengthen your current levels of resilience, it means you can respond quickly and effectively to a sudden disruption. If you achieve this, your business will be more adaptive, competitive, agile and robust, and better able to survive in uncertain times.


by Chris McFee 18 Jan, 2022
We know about the importance of resilience in our own lives to help us to successfully cope with stress and adversity – being able to "bounce back" when something bad happens. But the importance of resilience is now recognised in both the wider community and in business. If you have a resilient business then you are more prepared for the surprises and shocks that could affect it. When unpleasant surprises occur, you are more able to bounce back and keep your business operating successfully.

So, you may be thinking about how you can make your business more resilient. If you are, try thinking about what could be coming up in the future, and what would have the most impact on, and cause the most disruption to, your business. Think about how long you could cope before severe problems started to occur. However, resilience is not just about having a range of specialist techniques such as business continuity (although this helps); it is about having the ability to respond quickly when things do go wrong. The skills and capabilities you already have in your business will be key in helping.

What are these capabilities? You need to be confident in what you are doing, and be flexible enough to change what you are doing quickly. And don't rely on a single person to be responsible for a particular activity: everyone should feel confident that they would know what to do in a given situation. Of course, you can never foresee everything that may happen, and will not have a complete answer to everything that could happen, but you may be able to select tools and techniques that will help you out.

In particular, being resilient needs a flexible mindset – which means no "we always do it this way" type of thinking. It needs people who are always willing to learn, have the time to do this learning, and are allowed to apply it when needed. Does everyone have a good understanding of how the organisation works? How happy are your staff to move out of their comfort zone? How well do you communicate in your organisation – are you willing to be challenged and to take advice? For instance, when mistakes happen (and they will), how do you respond – are people blamed, or are you willing to learn from those mistakes?

But equally important is a regular cycle that you need to put in place to build and improve resilience. Key is good leadership, focusing on the capabilities you need for success. Then you need to develop your plans, build them and put them into practice, and finally deliver them, continually checking that they actually work as planned and updating them regularly.

Rather than try to anticipate solutions to everything that could happen (an impossible task), consider building a "toolkit" of responses and mitigations that you can mix and match as appropriate. Having a toolkit of approaches recognises that there is no one-size-fits-all solution, and that you will need to adjust your response as seems most appropriate at the time. The important thing is that you have something in place, and you are confident it will work (even though you hope you will never have to use it). The toolkit allows you to adapt your current mitigations flexibly when circumstances change or when something unexpected happens (you can never plan for everything). Once you have started to develop an approach, plan how to do it, work out who needs to be involved and when, and what specialist techniques (such as business continuity or cyber security) you may need to call upon.
Make sure everyone knows what they need to do, and work out how they will do it when an incident occurs. You will find that building resilience isn't something you can do in a day. In reality, you need a range of approaches, they need to be flexible, and everyone needs to know what they are and be confident at applying them when the time comes. It is always good to have a plan in place to deal with any incident – even something as simple as who to call and when. To achieve your goals, you need to build in resilience every day: for example, by empowering your staff so they know what their role is, are confident, and can work independently if necessary.
Communicating risks to colleagues - life is risky post
by Chris McFee 18 Jun, 2020
One of the aspects of risk management planning is the uncertainty in trying to identify the full range of threats that could cause disruption – in particular, assessing the likelihood of each threat occurring. One way that smaller businesses can deal with this is to focus on the generic outcomes of threats. For example, flooding, power cuts or fire could all have the same outcome: that you are unable to access your business premises. However, when we want to do a more detailed analysis we need to think in more detail about each specific threat: how likely is it to occur, and what would the direct impacts be? This is particularly important in dealing with some of the more unusual threats which may have a large impact but whose likelihood is low – but how low is not clear. To illustrate, how likely is it that we have a complete failure of the UK electricity network?

This sort of information is known as a "planning assumption", and it is important to be clear what your assumptions are. Obviously, if you assume that a threat is far less likely than it turns out to be, you may quickly discover that your business is severely disrupted because your plans aren't up to the job. But if you assume the threat is more likely (or more serious) than it turns out to be, you could end up spending lots of time and money that would be better spent in other key areas of your business.

While this information is available in some areas, it can be difficult to get hold of. Examples include the frequencies and durations of power cuts in your local area over the past ten years: there is quite a lot of certainty around those figures, but you can't access them (unless you are willing to pay quite a bit of money). In contrast, the information for other threats is more uncertain. I have already mentioned the likelihood of a failure of the UK electricity infrastructure; other examples are space weather or an unconventional terrorist attack. In those cases, detailed information about frequency of occurrence is not available, expert judgement is needed, and there is often disagreement about the frequencies to use.

Given this, what figures should you take for your planning assumptions? How can you make sense of the information that is publicly available? It is important that you communicate these assumptions clearly and effectively to colleagues so that they can include them in their business decisions. But how can you do this without providing a false level of confidence in your assumptions? Be aware that there is rarely a consensus at the early stages.

An example could be that of a power cut affecting your business. It is highly likely that you will face a power cut at some time – you have almost certainly had one in the past – and the immediate consequences are also quite well understood. You may lose IT provision (unless of course you have already mitigated this with some form of backup); if your business is public facing, you may need to temporarily close access to the public; and so on. That said, how long is the power cut likely to last? This is where the uncertainty can lie. In such a situation, where there is consensus about the nature of the threat but a lot less about the likelihood, include the full range of scenarios in your planning. For example, you could aim to plan for a most likely case of one hour, a possibility of two days, and an unlikely (but still possible) worst case duration of two weeks. You can then use these scenarios to set boundaries.
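To show how those duration scenarios might be used to set boundaries, here is a minimal sketch in Python. The activity names and tolerance figures are purely illustrative assumptions (they are not taken from any real business impact analysis); the idea is simply to compare each assumed outage duration against the longest outage each activity could tolerate, and flag where problems would start.

from datetime import timedelta

# Planning assumptions for power-cut duration (the scenarios described above).
scenarios = {
    "most likely": timedelta(hours=1),
    "possible": timedelta(days=2),
    "worst case": timedelta(weeks=2),
}

# Maximum tolerable outage per activity - illustrative figures of the kind a
# business impact analysis would provide, not real data.
max_tolerable_outage = {
    "customer orders": timedelta(hours=4),
    "payroll": timedelta(days=3),
    "email and phones": timedelta(hours=8),
}

for scenario, duration in scenarios.items():
    at_risk = [activity for activity, limit in max_tolerable_outage.items()
               if duration > limit]
    if at_risk:
        print(f"{scenario} ({duration}): serious disruption to {', '.join(at_risk)}")
    else:
        print(f"{scenario} ({duration}): within tolerance for all activities")

With these made-up figures, the one-hour case is manageable, the two-day case already breaches some tolerances, and the two-week case breaches them all – which is exactly the kind of boundary-setting the scenarios are for.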
The situation is more difficult where you have a lower-probability threat, particularly when the impacts of the threat itself would be very high. Together, this combination can make it difficult to communicate the threat effectively enough to enable appropriate and proportionate planning to take place. One technique is to use carefully chosen language, and this should be the first component of your discussions. When considering the likelihood (frequency) that something may happen for which there is little information, framing the information in the context of verbal boundaries is often highlighted as an effective technique. For instance:
• Very high confidence – 90%.
• Highly unlikely – less than a 1 in 10 chance (<10%).
This is a good technique to use, but be aware that it can be a hostage to fortune and should be used carefully. Whilst everyone may be clear that "highly unlikely" corresponds to a less than one in ten chance of occurrence, our own biases may lead us to interpret this in different ways, particularly when it is natural to focus on short-term issues. Describing an event as "highly unlikely" often leads to that event being filed away in the "it will probably never happen" part of the brain, and we approach that threat in a different way. It may be better to expand the language slightly: for example, "highly unlikely but still possible" makes that uncertainty clearer. A misunderstanding over the use of language can quickly lead to you being blamed as the individual responsible for underestimating the threat.

Also beware of over-interpreting extreme possibilities. The press can often become fixated on extreme scenarios that are almost apocalyptic but really are extremely unlikely. A natural tendency to focus on that extreme case can lead to a poor allocation of resources. Conversely, by rejecting that "extreme" case, we can go too far in the other direction and assume that the threat is not as likely as it actually is. In UK central government planning the concept of the "reasonable worst case" is used to try to avoid these biases. But of course, this always leads to the question: "what do you mean by reasonable?".

Having bounded the potential likelihoods, you could try planning around a number of different scenarios that provide a sensitivity analysis, to see how vulnerable your business is. It may be that in most scenarios your business would be disrupted but could still cope, and you only need to focus on one or two scenarios. Weather forecasts are produced in a similar way: the inherent non-linearity in weather systems means that there is quite a high sensitivity to the initial starting conditions, so models are run many times to produce a range of potential outcomes which the forecasters use their judgement to review. When you run sensitivity scenarios you are looking to identify how extreme a situation needs to be before things start to go seriously wrong. The information you have from your business impact analysis should be very helpful here. Note that what we are focusing on is not the failure of a specific component or service in your business process: with a major event there are likely to be multiple compounding areas of disruption that stress the system until something gives. If possible, include a second incident in your analysis. You may find your business can cope with one incident, but if a similar incident occurs at the same time things start to deteriorate very quickly.
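As a rough illustration of that kind of sensitivity check, the sketch below (again in Python, with entirely assumed figures) tests a range of outage durations, with and without a second overlapping incident, against a single assumed recovery tolerance. The 1.5 compounding factor is an invented assumption, used only to show the shape of the analysis rather than how any real pair of incidents would interact.

# Illustrative sensitivity sketch: how long an outage, or a pair of
# overlapping incidents, can last before an assumed recovery tolerance
# is breached. All figures here are assumptions for illustration.

MAX_TOLERABLE_OUTAGE_HOURS = 24  # e.g. taken from a business impact analysis

def total_disruption(first_hours, second_hours=0.0):
    """Crude model: a second incident during recovery compounds the first,
    stretching total recovery beyond the simple sum of the two durations."""
    if second_hours == 0.0:
        return first_hours
    # Assumed compounding factor: recovering from two incidents at once
    # takes half as long again as the sum of their individual durations.
    return 1.5 * (first_hours + second_hours)

for first in (1, 8, 24, 48):      # hours for the first incident
    for second in (0, 4, 24):     # hours for a second incident; 0 = none
        total = total_disruption(first, second)
        status = "BREACH" if total > MAX_TOLERABLE_OUTAGE_HOURS else "ok"
        print(f"first={first:>3}h second={second:>3}h total={total:>6.1f}h {status}")

The point is not the numbers, which are invented, but how quickly a second incident pushes otherwise manageable scenarios over the limit.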
Running multiple scenarios is obviously very time consuming and resource intensive, and may not be possible. But if you can do something along these lines it can give greater confidence around these uncertainties – it helps bound the areas that you really need to worry about.

Alternatively, if you can't manage extensive scenario planning, how about some role playing? One of the difficulties with assessing low-probability events is being able to "imagine" the actual impacts. Numbers and slides may demonstrate how bad the impacts could be, but the inherent low probability of the event means that we have difficulty dealing with this. It can often feel so abstract that we make assumptions about how we would react which are wrong. Role playing can help with this: asking simple questions such as, "if this did happen to the business, how would you feel?". In a similar vein, focusing on similar issues that occurred in the past, and the consequences of the decisions made, can be very helpful.

And finally, try to understand the inherent biases and compensate for them – in particular, overconfidence and groupthink. Discussions around uncertainty will often be led by one or two individuals who are very vocal in their views. Sometimes this is justified (they may be experts in the field) but often it is down to pure overconfidence. Try to find ways to remove this tendency, perhaps by using questionnaires to get individual views (this also helps minimise groupthink, where no one wants to be the odd one out). In these situations, a diversity of individuals with different opinions and experiences can be extremely helpful.
by Chris McFee 26 Mar, 2020
For many people, the arrival of a pandemic has come as a total surprise. Most people had no conception that such an event could ever occur (apart from in a movie). This is a bad place to be in: our understanding of how people think about risk shows that when people are confronted with a brand-new risk, the reaction can be bewilderment and bad choices.

But the chance that the world would be massively disrupted by a major infectious disease pandemic was never remote. It is now well known that one hundred years ago the world faced a major flu outbreak that is thought to have killed over 50 million people. There were two smaller but significant outbreaks in the 20th century (in 1957 and 1968). And the government's own risk assessment (published every two years) makes it clear that a pandemic flu outbreak was the highest risk over a five-year period, with an outbreak of a novel infection (such as COVID-19) not far behind.

So why did so many people seem to be completely blindsided by the news? There is worry and concern about the impacts on family and the country, but why does there also seem to be such shock about the event itself? The government and academia review these risks regularly, and most large organisations have a pandemic preparedness plan as part of their business continuity arrangements. Yet knowledge of the potential risk seems not to have reached the public as a whole.

More needs to be done to get the message out about the major risks we face. There are other risks out there, such as a Black Start or space weather, which could have a huge impact on the country, but which many people will be totally unaware of. Decisions need to be taken now about how much money we want to spend as a country to mitigate these risks to a level we find acceptable. This can include very unpleasant decisions, such as how we allocate life-saving resources (such as intensive care) when demand may exceed supply. But it is better to think about these things before they happen, and not play catch-up later.

Japan is due a major earthquake at some point. Awareness is high and regular exercises are held. This doesn't mean that when the earthquake hits people won't be terrified, but it does mean that everyone will know what they must do, and how they need to respond to save lives. Talking about these things much more widely over the next few years won't stop any of them happening. But the discussions will help guide where we spend our money. And knowing that these risks exist can help build societal resilience, so people respond in ways that are helpful and don't create more problems.

Who should do this? For many years, scientists have been strongly encouraged to get out and explain their work, and its potential implications for society. Those of us who work in resilience and business continuity should do the same. We need to highlight these risks, and the work we do to try to mitigate them. We should not just keep talking to each other about our work, but talk to the public as well. This should be seen as part of every professional's job description.
by Chris McFee 14 Feb, 2020
Business resilience to Space Weather and wider planning consequences

The Mullard Space Science Laboratory and the Institute for Risk and Disaster Reduction (both from UCL) have published a report on organisational resilience to severe space weather. The report sets out what severe space weather is, and the possible impacts of an event. Impacts can be wide ranging, but the ones that always generate most interest (and TV programmes) are around the potential for localised power outages. The report also sets out a checklist for continuity management that organisations can use to improve their business continuity arrangements.

As with many potential hazards, the focus for business continuity planning should be dealing with the consequences, many of which are common to other hazards. The National Risk Register (2017 edition) lists a range of hazards and threats (threats being the more "human" related events such as terrorism or cyber-attacks) which could impact individuals and organisations. Loss of power is a common consequence of a range of hazards: alongside space weather, high winds or flooding, extreme low temperatures, electricity system failures (such as happened in August 2019) or industrial and urban accidents all include the potential for impacts on power.

Businesses trying to identify potential impacts can often become confused by the number of hazards and threats to consider. But by planning to deal with the consequences of space weather your business will also mitigate the impacts from other hazards. Think about what the consequences could be: loss of power, loss of IT provision, inability to access your main business premises, and so on. Planning for those broad consequences will cover a lot of things that could potentially go wrong, and is the approach followed in the National Business Resilience Planning Assumptions (although that document is getting a bit old).

Of course, some specific hazards are serious enough that they may warrant an individual plan. Pandemic flu (or a similar infectious disease) is potentially serious enough for a separate plan. And space weather is one of those areas where some of the impacts may not be common to other hazards, particularly as the dependence on technology in many areas of society could lead to cascading effects. So it is important that you consider space weather and all its possible impacts, to make sure they are captured somewhere in your planning. And remember, if you have a plan, make sure you test it regularly!
by Chris McFee 13 Feb, 2020
Cyber education can only go so far in helping to prevent phishing-type attacks. The level of sophistication of many attacks is such that it is almost impossible for individuals to spot an attack 100% of the time – even when they have received training, are aware of the different types of attack, and have been given advice on how to spot them and what to do.

Last year I remember speaking with a security representative from a major UK/US firm who said that the US side of his company had a policy of sacking people who clicked on malicious links – even by accident. This is a bad idea. Firstly, you will probably lose a lot of good staff. Secondly, it will encourage a culture where people try to hide any mistakes they make in case the company punishes them.

An IT security policy for your company must acknowledge that it is not possible to educate away the dangers of a phishing attack. Sophisticated software is needed to provide an additional technical layer of defence, as is a well-rehearsed plan to deal with an incident when it happens (as it probably will). This is not to say that education is unnecessary. It is very important that staff are educated in good cyber security and feel as confident as they can about understanding what threats are likely and what they need to do. A helpful consequence of such education is that people are also much less likely to fall for similar attacks at home. Good cyber training will not prevent attacks, but it will reduce the overall risk.

See: https://blogs.scientificamerican.com/observations/how-to-protect-people-against-phishing-and-other-scams/?amp
by Chris McFee 09 Jan, 2020
The cyber-attack on Travelex just before the end of last year caused massive issues for the company, with a demand for a ransom payment and the Travelex website taken down for a number of weeks. The company was attacked on New Year's Eve with ransomware, and news reports have said that staff resorted to manual methods (such as paper and pens) to try to keep currency transactions moving. Details of how the attack worked are not known, but there are a few lessons for businesses of any size.

Don't assume that the technical defences will work 100% of the time, and have a plan to deal with the consequences. It is almost certain they will not always work. Having good technical defences in place will do an awful lot to reduce the risk, and ensuring that your staff are well trained and have a good understanding of the types of cyber-attack that could take place is also important. An obvious point is to ensure that all of your staff are aware of the dangers from phishing attacks and malware. But cyber criminals will design sophisticated attacks to fool even the cyber-savvy; even with extensive training, it only takes one click to get through some defences.

You should always assume that you could be the victim of a cyber-attack and plan accordingly. Know what systems you have and what information is on them, so you can understand the potential impact of an attack. And make sure that you have detailed plans for responding, and test them regularly. Any business, of any size, should do this.

Don't forget your supply chain. One of the biggest impacts of the cyber-attack was on organisations that use Travelex services. The Guardian reported that companies such as Sainsbury's Bank, Tesco Bank and First Direct all had issues with foreign exchange services because they used Travelex. Your business planning should identify third-party services in your supply chain and consider the impact on your business if they had problems. Larger businesses will review third parties as part of their contractual arrangements, but this is probably something you may not be able to do as a small business. However, you can at least look for indications that the company understands the importance of good cyber security. For example, the Cyber Essentials scheme is one way to do this.