It’s been just over 18 months since we first spoke about our Digital Technology Operations team, so it’s time to reflect on how we’ve been helping product teams to deliver, what we’ve learnt and what’s next.
Developing our role as a ‘landlord’
In our first post we explained that our team looks after 3 things:
- Service management.
- Platform infrastructure.
- IT security.
Another way to think of our role is as a property landlord.
The various delivery teams are independent from us in the sense that they look after the direction and design of their digital product or service – in other words, how each household is run isn’t our concern. However, all Co-op products and services have some things in common which make up a shared digital platform. This includes firewalls, a logging and alerting platform and core templates used to build out our infrastructure. This is the part that our team is responsible for. We’ve set the ground rules and make sure all new products and services follow them (like good tenants). It helps teams keep their products and services safe and operational.
How we’ve helped
1. Putting checks in place
We’ve set up a ‘service readiness’ check that new products and features go through before we call them ‘live’. A recent example is when we launched Coop.co.uk on new infrastructure. The check includes making sure:
- security testing is complete
- the relevant operational teams know what’s changed
- the relevant people know what to do if something goes wrong
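As a rough illustration only, the checklist above could be modelled as a small data structure that reports which items are still outstanding. This is a minimal sketch, not how our check is actually implemented: the class name `ReadinessCheck` and its field names are hypothetical and simply mirror the three bullet points.

```python
from dataclasses import dataclass


@dataclass
class ReadinessCheck:
    """Hypothetical model of the three 'service readiness' items listed above."""

    security_testing_complete: bool
    operational_teams_informed: bool
    incident_contacts_briefed: bool

    def outstanding(self) -> list[str]:
        """Return the names of the items that have not yet been completed."""
        return [name for name, passed in vars(self).items() if not passed]

    def is_ready(self) -> bool:
        """A service is only ready for 'live' when nothing is outstanding."""
        return not self.outstanding()


# Example: security testing and handover are done, but nobody has been
# told what to do if something goes wrong.
check = ReadinessCheck(
    security_testing_complete=True,
    operational_teams_informed=True,
    incident_contacts_briefed=False,
)
print(check.is_ready())
print(check.outstanding())
```

Listing the outstanding items, rather than returning a single pass or fail, makes it clear to a delivery team what is left to do before launch.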
2. Helping teams manage cost
We now provide a cost management tool for our delivery teams to help them manage their cloud infrastructure costs. Each team has a dashboard that shows their current spend and their forecast spend. Having visibility over this gives them autonomy over how they manage their budget and lessens the likelihood that they’ll overspend.
Having a cost dashboard helped the Membership team to see that infrastructure logging was costing more than expected one month. When they investigated, they found that the logging was sending data every second rather than every minute. Real-time cost reporting helped them spot the increase quickly, so they could fix the configuration and incur only a few days of increased cost.
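To give a sense of scale, the short sketch below compares how many times a day data is sent at each interval. The function name is hypothetical, and the figures are plain arithmetic, not numbers from our dashboard. How closely cost follows volume depends on the supplier’s pricing, which we are not assuming here.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sends_per_day(interval_seconds: int) -> int:
    """Number of times data is sent in one day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


every_second = sends_per_day(1)   # misconfigured interval
every_minute = sends_per_day(60)  # intended interval

print(every_second)                  # 86400
print(every_minute)                  # 1440
print(every_second // every_minute)  # 60
```

Sending every second means 60 times as many sends as sending every minute, which is why a small configuration slip can show up so clearly on a cost dashboard.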
3. Improving reliability
Over the last 18 months we haven’t changed our tools, such as the logging, monitoring and alerting systems, but we have worked hard to make them more reliable. For example, we worked with our monitoring tool supplier to tweak how it is configured. Now it can handle more data easily.
We’ve also put efficient processes in place for when there’s a problem. Teams can see the details on a status page and we alert the relevant people to fix the problem through our ‘major incident’ process.
4. Getting buy-in from teams
We ask that delivery teams secure their infrastructure following our guidelines, carry out regular security and disaster recovery testing, and build products and services in line with our approved tech stack.
It’d be easy to say we just provide the tools for them to do this. But in the last 18 months we’ve done much more than that: we’ve successfully explained the importance of good technical standards to teams. We have their trust and their buy-in, and as a result there’s commonality and consistency between all Co-op products and services.
What’s next
Over the next 12 months we’ll be working with the rest of Co-op IT to shape some of our existing IT management processes, like disaster recovery, to make them work for the new challenges that cloud infrastructure brings. And as our digital teams start to use new technologies like containers and serverless, we’re looking at how our tools and processes can be adapted to support these as well.
Over the next few blog posts we’ll talk in more depth about how we do monitoring, how we respond when our services become unavailable and how we onboard a new service.
Michaela Kurkiewicz
Head of Digital Technology Operations