In June, our Digital Service team won the Special Innovation DevOps award at the IT Service Management Forum (ITSMF) Professional Service Management awards.
Each year, ITSMF presents innovation awards to organisations that are exploring new territory, often around the edges of traditional IT service management, or that have found innovative solutions to well-known problems.
We’re proud our work has been recognised as being innovative and thought this would be a good time to share our story.
IT service management at the Co-op
When we talk about ‘IT service management’ we mean making sure we can operationally support our products and services.
Over the years Co-op has put in place IT service management policies and processes based on the IT Infrastructure Library (ITIL), the industry standard framework for developing and running IT services. It includes processes to help manage incidents, requests and changes.
The principles of the framework aim to manage business change whilst maintaining stable services. And because the Co-op has been going through digital transformation and business change over the past few years, maintaining stable services whilst being able to make frequent changes in an agile model has been hugely important.
Adapting traditional processes for an agile environment
ITIL processes were created before working in an agile way was commonplace, and the Co-op service management policies and processes were originally written for traditional, on-premises, waterfall applications. So recently, the Co-op Digital IT service management team have been adapting them so they’re better suited to our fast-paced, cloud-hosted, agile world.
Here are some of the ways we’ve been working innovatively.
Working collaboratively (especially when things go wrong)
Typically, development teams are separate from the IT service management teams who operate live services. But we’ve been involving development teams in running those services. For example, our monitoring systems continually check the health of our services, and we’ve set up alerts so that when something breaks, the problem is automatically posted into an incident chat room. We’ve made these chat rooms visible to the whole Digital team. This way, the wider team can swarm on fixing the problem.
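As an illustration of this alerting pattern, the sketch below shows one way a monitoring check could post a failure into a chat room through an incoming webhook. It is a minimal sketch, not a description of the tooling the Co-op uses: the function names, the message format, the example service name and the webhook URL are all hypothetical, and it assumes the chat tool accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def build_alert_message(service: str, check: str, status: str) -> dict:
    """Build a chat message payload for a failing health check (hypothetical format)."""
    return {"text": f"[{status.upper()}] {service}: {check} check is failing"}


def post_to_incident_room(webhook_url: str, message: dict) -> int:
    """POST the message as JSON to a chat webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Example usage (the URL is a placeholder, so the call is left commented out):
# message = build_alert_message("example-service", "homepage", "critical")
# post_to_incident_room("https://chat.example.com/hooks/incident-room", message)
```

Keeping the message builder separate from the network call means the wording of an alert can be checked without sending anything to a real chat room.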
We also review incidents together for 2 reasons:
- To make sure we’re continually improving by preventing recurring issues.
- To create training guides so new colleagues can learn from past mistakes.
Creating patterns to make things more efficient
We created patterns for how we build and support infrastructure, how we deploy, and how we manage availability and change. Every service follows the same patterns and is scaled appropriately for its size.
Patterns make getting a service live for the first time simpler and quicker. When a service needs something different, we can fully concentrate on those areas rather than trying to reinvent the more basic, standard things. Before we put patterns in place, teams would often hit a wall just as they were planning to launch because they hadn’t sufficiently considered all the security and operational needs they had to meet. Now, our digital teams can take learnings from an alpha, and create the application and infrastructure for a production-ready service within months.
So far, so good
We’re now consistently doing 5 to 10 releases a day without service outages. We display our alerting and monitoring in the open so we’re transparent about our weak points, and we share our post-incident reviews widely so everyone can learn from our mistakes.
As a result, we’ve seen improved uptime, which typically stays at or above 99.95%, and a change failure rate of less than 1%. We’re also catching more issues proactively, all while supporting an increased number of services with the same size team.
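To put those figures in context, the arithmetic behind an uptime percentage and a change failure rate can be sketched in Python. This is a rough illustration only: the function names, the 30-day period and the example counts of changes are assumptions made for the example, not measurements from our services.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime percentage allows over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Return the percentage of changes that failed."""
    if total_changes <= 0:
        raise ValueError("total_changes must be greater than zero")
    return 100 * failed_changes / total_changes


# 99.95% uptime over an assumed 30-day period leaves about 21.6 minutes of downtime.
print(round(allowed_downtime_minutes(99.95), 1))

# Hypothetical counts: 2 failed changes out of 250 is a 0.8% change failure rate.
print(round(change_failure_rate(2, 250), 2))
```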
A reasonable amount of governance
As product teams take on more responsibility for managing their own services, our role as a service team is shifting from being the gatekeepers of production, to making sure we have great processes and governance in place.
We’re giving teams the tools they need to manage changes and incidents themselves, which saves time. Our aim is to create processes that are supported by tools and automation, so that the appropriate governance happens without relying on people to do repetitive admin tasks. And as we try new tools and techniques, we’re sharing these with the rest of the Co-op IT teams, as well as here on our Digital blog, so that they can build on what we’ve learnt.
Principal Service Manager