We are all in the middle of a worldwide incident in which we are all incident responders in some way. The large-scale disruption that Covid-19 has introduced to the day-to-day functioning of society has profound implications for the software systems that are a part of daily life. And the experts tasked with keeping those systems running are not exempt from that disruption.
In this article, we will talk through the 5 key themes that emerged from a series of two facilitated video calls with engineers and managers from several different organizations, along with expert suggestions on how to adapt to some of these problems, all backed up by research.
The 5 key themes elicited from the session were:
- Staying in the loop and maintaining context on decisions and changes is more challenging. Maintaining common ground requires additional effort, and successful teams adjusted their effort to maintain levels of engagement more consistent with face-to-face interactions.
- There’s more effort required, and some of that additional effort is invisible to colleagues. New strategies were developed, or existing ones amplified, to create lightweight, effective methods of sustaining common ground.
- New tools are being adopted to try and keep people in the loop. Tool adoption was used to bridge the gap between face-to-face and virtual operations.
- Successful adaptation was the result of front-line engagement and short-cycle feedback. Drawing on internal expertise, knowledge sharing and previously established (but smaller-scale) initiatives, many teams met the challenge of change.
- Sustained resilience stems from cooperative cross-functional roles. Now more than ever, strong cross-functional collaboration enables organizations to adapt quickly to the changing environment. Roles that span different parts of the business help ‘pollinate’ information and identify both threats and opportunities.
The themes are simple and obvious, but the analysis is deep and insightful. We know everyone is busy, so we made this as straightforward as possible to get you to what matters quicker – what changes can you make now to help with what’s going on now? The first three themes share a common thread: enabling people to work together to sustain shared understanding. The last two relate to the current and continued ability to adjust to surprise and disruption. After reading this article, it is our hope that you think critically about how your team might experiment with some of the suggested adaptations and engage your colleagues in the process of adapting together.
The Setup
Software engineers from the Learning from Incidents Slack workspace who used to roll their chairs back to query a co-worker for advice on an incident now find themselves frequently struggling through problems on their own. A team lead who sourced valuable intel about other teams’ projects while bumping into colleagues at the coffee machine has lost her ‘finger on the pulse’ of organizational happenings. And managers who were carefully considering how to balance coherence across their remote and collocated team members are now faced with suddenly leading a fully distributed team. We are all reacting and recalibrating as we adapt to the new normal.
The Participants
In mid-March we brought together a collection of these engineers, product managers, engineering managers and members of the Learning from Incidents community to share insights into how they, and their organizations, were adapting to the sudden changes. The 15 participants worked across the spectrum, at small startups, specialized software engineering organizations and large multinationals. All were part of organizations that had 24/7 operations and high expectations for reliability. Many of the engineers we spoke with ran the kinds of systems experiencing double- and triple-digit increases in load, or were part of teams that went from entirely collocated to entirely non-collocated within the span of several days.
Some were on teams that ran customer-facing systems with sudden high loads such as video streaming, online gaming, food delivery, and B2B operations running business-critical support functions. In other words, they are on the front lines of keeping their company functioning in the face of highly anomalous conditions. The conversation was intended as a way to elicit some of the key challenges, collect practices that were working and connect these to practices from Resilience Engineering.
The Patterns
As we listened, we saw some general patterns emerge despite the obvious differences amongst the participants and the specific contexts being described – incident response or code development, for example. We collated the comments, then reflected on what was driving the challenges being discussed. We compiled the suggestions given, then bolstered these recommendations by looking to the literature for guidance for expert practitioners who, like software engineers, have to communicate effectively under constraints while coping with fast-changing conditions and uncertainty. These are some of the hallmarks of many high-consequence work domains.
This write-up frames the shared experiences from our group to help organizations facing similar circumstances make adjustments that help their teams cope better. While the discussion generated a wide range of insights, these five stood out to us as the most useful to examine in more depth.
As mentioned at the beginning of the article, the 5 key themes elicited from the session were:
- “Staying in the loop” and maintaining context on decisions and changes is more challenging.
- There’s more effort required, and some of that additional effort is invisible to colleagues.
- New tools are being adopted to try and keep people in the loop.
- Some teams with certain characteristics adapted more resiliently than others.
- Sustained resilience stems from cooperative cross-functional roles.
Each theme is described briefly using examples we heard in the discussions. We then highlight some findings from past research relevant to each theme and give a few suggestions for easily implemented ways to address the challenges. The suggestions for improvement are estimates based on practical strategies and informed by research. They might not fit your team, which is why we’ve given you the background to help you adjust them for your own context. Just as with open source code, this is open source adaptive capacity, and our hope is that you give us feedback about what is or isn’t working for you.
1. Staying in the loop is harder.
The problem:
The most common issue across participants was some form of difficulty in maintaining coherence around what was happening - both locally within the team and more broadly (with other teams or parts of the business) - without being in the office together. Being up-to-date matters greatly in times like these, when changes are continuous as teams make ongoing adjustments to fit the new conditions. It matters most when time pressure increases and the consequences of being out-of-date are greater, such as during incident response.
There were obvious critical barriers to staying in the loop when you are suddenly remote. We discussed two, which we describe as ‘the problem of access’ and ‘the problem of alignment’.
The first, the problem of access, is simple: It’s difficult to be in the loop when you can’t just overhear a conversation relevant to you or you lose the ability to bump into someone in the hallway. The loss of being able to ‘happen upon’ critical information is exacerbated if the move to remote work has driven these kinds of conversations into direct messages or other similarly private forums. The consensus in our discussion was that many teams were adapting by trying to increase the amount of communication - many said Slack messages and Zoom meetings had unquestionably increased. But unless you knew where to find those conversations – what online forums were being used - and you were invited to join them, it was as if they were being held behind closed doors. This issue will become increasingly problematic the longer we are in remote work conditions, as fewer of these informal interactions are able to keep information flowing across the organization.
The second part of the problem is a bit more nuanced and, with apologies, we’ll invoke another idiom: It has to do with “being on the same page”. Just having access to the conversation isn’t the same thing as understanding the conversation and its implications. One engineer from a large multinational company in our group noted they had instituted a 3-week company-wide code freeze because they recognized the potential threat of misunderstandings arising from someone who might be distracted trying to manage children or spouses at home. This is an extreme example, and one that may have the unintended consequence of engineers being ‘out of the loop’ for longer, which increases the effort needed to get back up to date.
The research shows:
“Staying in the loop” and “being on the same page” are ways of describing what the research literature calls maintaining common ground.
Common ground is the shared knowledge, beliefs and assumptions needed to coordinate actions across groups of involved parties.
An example of this is in feature development. If you’ve been part of a development team you know what it means to have a breakdown in common ground when you recognize that you all have different understandings of what the intended purpose of a feature is. The research has shown that we listen for and observe for common ground breakdowns, that is, we naturally seek to repair any discrepancies when working together. It’s a continual cycle of establishing, maintaining and repairing because shared understanding across a working group continuously erodes – as knowledge, beliefs and assumptions change in some individuals but not all.
It’s much easier to identify breakdowns in face-to-face interactions: we can see someone hesitating and needing more time to process, and we can infer they might not share our understanding. The ability to have micro-interactions (such as those hallway conversations) to repair common ground continually serves to keep everyone aligned.
Technology-mediated communication, for a variety of reasons, can become unintentionally ambiguous as it is hard to read cues, fully express questions or articulate concerns. While some tools may be better suited for remote work - particularly for tightly coupled, cognitively demanding work - all will introduce some degree of degradation to your interactions. Therefore, as your team shifts to fully remote communication, an emphasis on maintaining common ground is of critical importance.
How to adapt:
It’s not very satisfying to say we need to “get used to it”, but to a degree this is helpful advice. We have had to adapt our working style, so adapting our communication styles as well only makes sense. The practice of emphasizing common ground will remain useful well beyond the end of the current period of remote work. Modern business environments are always undergoing change, albeit not typically so abruptly or so substantially. Recognizing that we always have partial or incomplete information, to a greater or lesser extent, helps a team stay aware of its limitations and actively look for assumptions that may be faulty.
- To maintain common ground for yourself and your team: Ask “what might we be missing?” around critical tasks or in projects where regular feedback loops have been broken. The question can prompt more information gathering and help prevent misalignment.
- Practice reciprocity and consider how you can help others you work with maintain their common ground: Ask “who might need to know this? When do they need to know? How involved do they need to be?”
- Engage in collaborative cross-checking for high-consequence actions or decisions to bring different perspectives together, test assumptions and assess accuracy.
- Verify common ground more frequently - when leaving meetings have participants describe in their own words what they think they heard.
2. There’s more effort required, and some of that additional effort is invisible to colleagues.
The problem:
The second theme is closely related to the first. It’s obvious to say that software has become critical to bridging the gap of physical distance from co-workers. Just as obviously, the limitations of collaborative software require new kinds of effort.
Our group described activities like gathering around a whiteboard to brainstorm ideas, jointly diagnosing problems by looking at shared dashboards, or pair programming side-by-side in the workspace. All of these had become less effective and more effortful in some fashion when done online. In addition, several discussants feared that losing the fall-back of being able to meet in person when task complexity increased was likely to have an impact on their team’s performance in activities like incident response.
The research shows:
This can be attributed to multiple factors. Chief among them is that the costs of coordination are substantial for technology-mediated, cognitively demanding, tightly coupled and time-pressured work. Costs of coordination are the additional mental effort and load required to participate in joint activities. When relatively low-tempo day-to-day activities like group brainstorming are done remotely, most technological mediums lose important cues, which adds friction to typically smooth interactions. In face-to-face interactions, cues such as tone of voice, body language or facial expressions aid timing in conversation and help identify when participants are deep in thought and should not be interrupted. In urgent situations such as incident response, the loss of these cues can seriously disrupt critical cognitive functions such as diagnosing and resolving outages.
Some teams in our group sought to address the additional effort to maintain common ground and carry out collaborative work functions by simply increasing communications. However, that doesn’t necessarily work because the signal to noise ratio drops. So, instead of working harder to retain the same degree of common ground, you work harder to interpret larger volumes of incoming information and you spend additional effort generating your own additional communication to be understood. Coordinating with others when maintaining common ground is essential and cues are absent means the quality of the communication itself has to change.
How to adapt:
- We’ve now noted two ways in which load can increase on engineers in current conditions. The first, as mentioned in point 1, is the additional effort needed to maintain common ground; the second is due to the technology itself. Knowing that your team has limited attention right now, as multiple life demands and distractions occur simultaneously, how can you best account for the additional effort? In aviation, the safety literature points to “take offs and landings” as high-risk activities. This gives a reasonable heuristic to engineering teams: what are the highest-consequence tasks and priorities happening? What practices can be introduced to these to A) maintain common ground and B) lower the amount of drag introduced by technology-mediated interactions?
- One of the engineers in our discussion mentioned their CEO had sent out a company-wide mandate to use webcams during meetings and yes, we know, there are times when joining a webconference with video is excruciating. However, we are operating with tools that create impoverished communication environments relative to face-to-face. Webcams provide additional contextual cues for your colleagues that can help maintain common ground and lower the cognitive burden of newly distributed work.
- For those new to text-based online collaboration, don’t underestimate the role of reactjis in your communications. For the uninitiated, reactjis are emoji used to show a response to someone’s comment or post. Using something like the eyes emoji is a lightweight way to let someone know you’ve seen their post and are looking into it. Reactjis replace the nods, raised eyebrows, “uh huhs” and other forms of engagement that take place in a face-to-face conversation.
- If your colleagues will be off-line for part of the day as they balance caregiving demands, additional effort may be needed to help them maintain common ground. Short summaries of a Zoom call posted in Slack or Teams save them from re-watching an hour-long recording, lowering their costs of coordination with minimal impact on teammates.
3. New tools are being adopted to keep people in the loop.
The problem:
Our group consisted of software engineers (some of whom were remote workers) for whom text-based collaborative forums like Slack or Teams and web conferencing tools like Zoom were commonplace. However, many noted that other functions of the business were relatively new to using online tooling as their primary means of communication. There has been plenty of advice on working remotely, but one underexplored aspect is how these tools shape understanding (or lack thereof) in collaborative activities.
As we noted in the theme above, many teams were looking to support the traditionally in-person collaborative activities like pair programming and brainstorming as they shifted to virtual. Several engineers mentioned they’d started using tools such as JamBoard, Mural, LiveShare, Google Docs for this kind of work. Others adopted low-tech solutions such as writing on sticky notes and holding them up for their webcam or getting a second monitor with a webcam to enable better visual reference for the increased web conferencing taking place. And yet, everyone wanted something better to support them.
One of our respondents astutely noted they were thankful to be handling Covid-19 with modern tooling, because even 5 years ago many organizations would have struggled tremendously to get anything done with non-collocated teams. Even so, remote collaboration tools still don’t match our needs very well.
The research shows:
The use of software to aid collaborative activities has been a topic of study since the mid-1980s, when researchers in the field of computer supported cooperative work (CSCW) began to consider how ‘groupware’ could connect distributed teams of people in joint activities. But more than three decades later, virtual coordination and collaboration remains a challenge. Despite this, modern tools do support some features needed for distributed work teams - namely, they provide a shared frame of reference for participants that can be used to support maintaining common ground.
A shared frame of reference is a common understanding about the state of the world and the meaningful activities relative to that state. Individual mental models can serve as shared frames of reference where they overlap, but the gaps between them are hard to identify. Think back to the last time you were discussing a complex project you were working on with someone else. You start describing the problem and they jump in with a solution completely unrelated to the problem you were trying to solve. This is an example of a discrepancy in understanding that is only uncovered once you begin talking about it. It’s well understood that there are differences between mental models of software systems, but these are typically only surfaced when things go wrong, or when the participants notice a misalignment and then invest time and effort in discussing it.
However, generating a shared frame of reference in the form of an artifact can be a useful way of reducing the gaps for important tasks or topics. In anthropology and sociology, an artifact is anything created by humans which gives information about the culture of its creator and users. As interpreted for the current software engineering context, cognitive artifacts are virtual objects made to aid or enhance shared understanding. These can help replicate what can take place in person - whiteboarding or looking over someone’s shoulder to watch someone physically type. Having a shared frame of reference can decrease cognitive effort of complex tasks by taking ephemeral things being described and making them tangible. It’s a shortcut to making discrepancies in mental models more visible.
How to adapt:
- Create shared visual frames of reference as much as possible (virtual whiteboards, Trello boards, Google Docs, Mural boards) that can be easily shared and jointly worked on.
- Center the meeting or discussion around the artifact and encourage others to annotate, edit, adjust. In some cases, it may be beneficial to leave it unfinished or unpolished so it is clear to your collaborators that their input is necessary and the meeting is not just about signing off on someone else’s ideas.
- Set the expectation that shared frames of reference (however lightweight or rudimentary) are an expected part of meetings and shouldn’t be shortcut. If you don’t have time to build one in advance, build it together.
- Use the screenshare feature in Zoom to ground other participants to the ideas being discussed.
- Pin active or featured topics in Slack so they are able to be quickly referenced.
- Use Slackbots to capture repetitive or time-sensitive content.
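As one illustration of the last suggestion, and of the short meeting summaries recommended under theme 2, here is a minimal Python sketch that formats a summary and sends it to a Slack channel through an incoming webhook. It is a sketch under stated assumptions, not a description of what any team in our group actually built: the function names, the sample meeting content and the webhook URL are all hypothetical, and it assumes only that your workspace has an incoming webhook configured that accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def build_summary_payload(title, decisions, action_items):
    """Build the JSON body for a Slack incoming webhook message.

    `decisions` is a list of strings; `action_items` is a list of
    (owner, task) tuples. Both shapes are assumptions for this sketch.
    """
    lines = [f"*{title}*", "Decisions:"]
    lines += [f"• {decision}" for decision in decisions]
    lines.append("Action items:")
    lines += [f"• {owner}: {task}" for owner, task in action_items]
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical meeting content, for illustration only.
payload = build_summary_payload(
    "Incident review sync",
    ["Keep the weekly lightweight retro"],
    [("Dana", "update the paging runbook")],
)
print(payload["text"])

# To actually send it, supply your own webhook URL (placeholder below):
# post_to_slack("https://hooks.slack.com/services/<your-webhook-path>", payload)
```

Keeping the formatting separate from the sending step means the summary can be previewed, or posted to another tool such as Teams, without changing how it is written.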
4. Some teams adapted faster than others - successful adaptation was the result of front-line engagement and short-cycle feedback.
The problem:
Organizations that are coping well with increased load on the system seem to be doing so in part because of adaptive capacity developed prior to the event (some of which could be considered the ‘infrastructure’) but primarily because of the inherent adaptability of their people. We see resilience through infrastructure and resilience through people as symbiotic - one without the other is insufficient to enable an organization to adequately adjust to conditions.
Companies that had invested in building up the platforms and technologies to enable distributed incident management for their reliability engineers and non-collocated employees have reportedly been able to handle the sudden company-wide scaling of these tools. We point to two reasons. The first is that, particularly in larger organizations, scaling an existing tool is easier than dealing with the inherent financial and effort costs associated with adopting new technologies. Attempting to get new vendors through arduous procurement and security gating processes - at a point in time when thousands of other organizations are doing the same - can slow down the adoption of new tools. In addition, learning how to use the technology itself and adapting existing practices to the new technology adds burden at the individual and team level even while it provides benefits.
The second reason has to do with capitalizing on internal resources to aid the transition. Several members of our group, long-time remote workers within their organization, described being pulled into an ad hoc task force or working group to generate tips for their office-based colleagues on how to effectively make the transition. One engineer described being proud of their efforts helping people new to remote work “get up to speed” quicker by sharing their experience. Redirecting internal expertise in this way to support other parts of the business is an example of resilient performance.
Earlier, we noted the example one engineer gave of a temporary company-wide code freeze as an attempt to contain potentially adverse consequences. Other organizations took more measured approaches – employing additional precautionary tactics, increasing cross-checking and ensuring lightweight retros get done so that early signals of danger or concern amongst the engineering team surface quickly. One organization recognized remote interviewing and hiring would be very difficult for their team and paused these efforts to avoid mis-hires. Some of the examples given were temporary; some had become ‘the new normal’.
The research shows:
We subscribe to the definition of resilience as systems that can “adapt in the face of variation, but much more importantly, are able to sustain adaptability as the forms and sources of variation continue to change over longer cycles”. Companies that make decisions about how to cope with the current conditions based on feedback from front-line engineers’ experiences with the changing dynamics are more likely to demonstrate sustained resilience as the pandemic continues to redefine business-as-usual. In other words, changes instituted to regulate the workload on, and respond to the concerns raised by, ‘hands-on-the-keyboard’ staff are guided by the demands of the conditions and the capacities of those same staff. Companies that have instituted changes based on perceived potential threats, independent of feedback from their front-line employees, are likely introducing greater brittleness to their systems by disconnecting the policy about how to respond from the practice of response itself.
How to adapt:
This information is predominantly useful for managers, team leads and senior leadership but engineering teams with high autonomy can adopt these practices as well.
- It’s worth underscoring that the reason most companies are able to continue operations in the current conditions is that their people have been able to adjust their performance to carry on work in spite of the challenges they face. Think about this carefully: it is unlikely you had policies in place to define these new work arrangements. Instead, the ingenuity, flexibility and adaptability of your people enabled work to still happen.
- Create short, iterative feedback loops that connect the engineers responsible for system reliability with the resources they need to continue handling disruptions.
- Minimize any additional burden to rationalize or justify adopting new tools or practices requested by the frontline. This does not mean change should be flippant. Instead, utilize the expertise and the capabilities across roles so that front line engineers provide real-time data about what changes would help support reliable performance and management roles provide a cross-check on feasibility, remove blockers and free up resources to help support adapting in real-time.
- Carefully consider the implications of top-down policy changes or newly mandated requirements. These can have the unintended consequence of stifling the flexibility and adaptability in your teams that is contributing to your current productivity.
- Continue to use internal expertise to inform on-going transitions.
- We said this earlier but we will say it again: Practice reciprocity. It’s an exceptional moment in society where people are sharing resources to help vulnerable populations stay safe. It’s also a lesson for organizations as many teams function with thin margins and there is very little capacity to take on additional work. However, making small, ongoing contributions to support others to carry out their work is an investment in having additional resources to call on when your own team is overwhelmed.
5. Sustained resilience stems from cooperative cross-functional roles.
The problem:
The final theme that emerged was how the sudden move to remote operations had a siloing effect. Several engineers in our discussion group described their work as ‘pollinator’ roles with a strong emphasis on connecting people, ideas or departments. These may not be formal systems engineering roles, but having them creates a structural form for maintaining common ground at a macro-level. These types of roles play an important function not only in improving task- or project-specific performance but also in helping identify threats and new opportunities across the organization. Our respondents found that the move to remote work made this kind of work challenging.
The research shows:
In complex systems, efforts to create reliability through redundancy and compounding technical debt can amplify the inherent hidden interdependencies. Arguably, the roles that operate across scales and functional units serve to counteract the “robust yet fragile” nature of complex systems. An analogy often used to describe the critical function of systems-focused roles is how wild animals balance ‘heads down’ time feeding and tending to young with ‘heads up’ time scanning the environment for both predators and new sources of food. The self-described ‘pollinators’ (whether in a formal role or through the informal functions provided by those who are adept at keeping track of changes across the organization) help provide critical information about what’s happening on the horizon.
How to adapt:
- Work collaboratively across teams to ensure cross-functional roles can continue in some form in the remote environment. Think about your interactions - both frequent and infrequent - and how to establish low friction methods of keeping those parties proactively informed.
- Manage the workload of pollinator roles so they are able to spend time in other teams’ Slack workspaces or attending update meetings.
- Adopt a stance of cooperative advocacy - that is, a practice that can “provide broadening and cross-checks that (reduce) the risk of premature narrowing”. It was first identified in studies of NASA mission control and reduces fragmentation of information across multiple groups that can lead to decisions that fail to take into account the complexity inherent in a problem.
- Create forums for knowledge to be shared amongst team members. In large organizations with multiple Slack workspaces, or in vendor-client workspaces consider how to replicate the flow of information across boundaries.
Concluding statements
There are three practical things we hope you’ll consider doing after reading this article. The first is that you will make maintaining common ground central to your remote interactions. We’ve offered some suggestions on how to do this but we suspect you’ll come up with others. The second is that you will be able to better recognize the sources of resilience - the things or people that enabled your team to adapt in ways that not only kept the business functioning but also benefited from newly remote operations. Lastly, beyond recognition, you’ll invest (even a little) time into identifying ways to support on-going resilience to cope with the inevitable continued adjustments we will need to make in the coming months.
A common thread connecting our group is an interest in learning from incidents. As we continue to adapt to the challenges of distributed work and incident management we continue to meet to discuss how teams are coping with incident response and the challenges of doing distributed incident analysis. Stay tuned for the next installment.
A huge thanks goes out to the Learning from Adaptations (LFA) meetup participants:
John Allspaw
Brent Chapman
Morgan Collins
Jessica Devita
Fred Hebert
Vanessa Huerta Granda
Joshua Kaderlan
Ryan Kitchens
Alan Kraft
Michael McClimet
Tim Nicholas
Pirmin Schuermann
Tim Tischler
Anonymous participant