18 months on: our Digital Technology Operations team

It’s been just over 18 months since we first spoke about our Digital Technology Operations team, so it’s time we reflected on how we’ve been helping product teams to deliver, what we’ve learnt and what’s next.

Developing our role as a ‘landlord’

In our first post we explained that our team looks after 3 things:

  1. Service management.
  2. Platform infrastructure.
  3. IT security.

Another way to think of our role is as a property landlord.

The various delivery teams are independent from us in the sense that they look after the direction and design of their digital product or service – in other words, how each household is run isn’t our concern. However, all Co-op products and services have some things in common which make up a shared digital platform. This includes firewalls, a logging and alerting platform and core templates used to build out our infrastructure. This is the part that our team is responsible for. We’ve set the ground rules and make sure all new products and services follow them (like good tenants). It helps teams keep their products and services safe and operational.

How we’ve helped

1. Putting checks in place

We’ve set up a ‘service readiness’ check that new products and features go through before we call them ‘live’. A recent example is when we launched Coop.co.uk on new infrastructure. The check includes making sure:

  • security testing is complete
  • the relevant operational teams know what’s changed
  • the relevant people know what to do if something goes wrong
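One way to picture a check like this is as a small gate that holds back a launch until every item is ticked. The sketch below is illustrative only: the ReadinessCheck class and its field names are hypothetical stand-ins for the three checks listed above, not our actual tooling.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ReadinessCheck:
    """Hypothetical model of the three service readiness checks."""
    security_testing_complete: bool = False
    operational_teams_informed: bool = False
    incident_contacts_briefed: bool = False

    def outstanding(self) -> List[str]:
        # Names of every check that has not been ticked yet.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_live_ready(self) -> bool:
        # A service is only called 'live' once nothing is outstanding.
        return not self.outstanding()


check = ReadinessCheck(security_testing_complete=True)
print(check.is_live_ready())
print(check.outstanding())
```

Listing the outstanding items, rather than returning a bare yes or no, tells a team exactly what is left to do before launch.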

2. Helping teams manage cost

We now provide a cost management tool for our delivery teams to help them manage their cloud infrastructure cost. Each team has a dashboard that shows their current spend and their forecasted spend. Having visibility over this gives them autonomy over how they manage their budget and lessens the likelihood that they’ll overspend.

Image of the cost management dashboard. It shows a 6 month forecast, a past 6 month spend and the actual spend.

Having a cost dashboard helped the Membership team to see that infrastructure logging was costing more than expected one month. When they investigated they found that the logging was sending data every second rather than every minute. Real-time cost reporting helped them spot the increase quickly, so they could fix the configuration and incur only a few days of increased cost.
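To see why the sending interval matters so much, here is a quick back-of-the-envelope calculation. It is a sketch only: the function name is made up for illustration, and it assumes cost grows with the number of submissions, which will depend on how a given logging service charges.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sends_per_day(interval_seconds: int) -> int:
    """Number of log submissions per day at a fixed sending interval."""
    return SECONDS_PER_DAY // interval_seconds


per_minute = sends_per_day(60)  # the intended configuration
per_second = sends_per_day(1)   # the misconfigured interval

print(per_minute)                # 1440
print(per_second)                # 86400
print(per_second // per_minute)  # 60
```

Sending every second produces 60 times as many submissions as sending every minute, which is the kind of jump that shows up quickly on a dashboard.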

3. Improving reliability

Over the last 18 months we haven’t changed our tools, such as the logging, monitoring and alerting systems, but we have worked hard to make them more reliable. For example, we’ve worked with our monitoring tool supplier to tweak how we configured it. Now it can handle more data easily.

We’ve also put efficient processes in place when there’s a problem. Teams can see the details on a status page and we alert the relevant people to fix the problem through our ‘major incident’ process.

4. Getting buy-in from teams

We ask that delivery teams secure their infrastructure following our guidelines, carry out regular security and disaster recovery testing, and build products and services in line with our approved tech stack.

It’d be easy to say we just provide the tools for them to do this. But in the last 18 months we’ve done much more than that: we’ve successfully explained the importance of good technical standards to teams. We have their trust and their buy-in and, as a result, there’s commonality and consistency between all Co-op products and services.

What’s next

Over the next 12 months we’ll be working with the rest of Co-op IT to shape some of our existing IT management processes, like disaster recovery, to make them work for the new challenges that cloud infrastructure brings. And as our digital teams start to use new technologies like containers and serverless, we’re looking at how our tools and processes can be adapted to support these as well.

Over the next few blog posts we’ll talk in more depth about how we do monitoring, how we manage our services becoming unavailable and how we onboard a new service.

Michaela Kurkiewicz
Head of Digital Technology Operations

How we’ve made release management quicker and simpler

Release management is about how we plan and schedule when we’re building software. Every digital team has its own release management process but sometimes it’s worth reassessing it to make sure it’s as slick and quick as it could be. That’s exactly what we did.

How things were

Our process on the Digital Operations team was complex and repetitive. It was very specific to the Information Technology Infrastructure Library (ITIL) framework, which is used to align IT service management with business needs. Our process was also dependent on a single gatekeeper. It didn’t work for us.

The process worked like this:

  1. A developer would email me a release note.
  2. I’d forward it to the environment owner.
  3. They’d give me permission over email to put the release into their environment.
  4. I’d email the developer to say this was approved and to advise when complete.

This process would then go on for each of the 3 environments (system integration testing or ‘SIT’, pre-production and production) so testing could be started. A typical release would result in around 30 to 40 emails. It meant we wasted a lot of time and the release cycle was slow.

The recording process wasn’t much slicker either. I had to update 3 spreadsheets, make a new folder for each release note and save each one to a central location. Then I sent an email every evening to document what releases we’d made that day.

Something had to change

Frustrations were running high because it was such a tedious, long-winded process. Developers were frustrated because of the number of emails they had to send and environment owners were frustrated because of the number of emails they were receiving. It wasn’t practical or sustainable.

Making things simpler and quicker

We agreed what the ideal release management process should look like. We wanted something less email-intensive, more intuitive, easier to manage, something that’s always up to date.

Photograph of the Trello board on a big screen in the office.

I thought a kanban-style approach using Trello and Google Forms might work well. We still had a requirement to keep the release note part of the process, so I created a Google Form that asked similar (but more simply worded) questions. We could then convert the answers from the Google Form into a PDF using Google plug-ins and email it to the Trello board, where it would automatically be converted into a card. At this point we’d reduced the number of emails by between 5 and 10.

Adding in audits

We ran this new process past environment owners who thought a series of checklists on the Trello cards would be useful. This way we could include evidence that testing had been done and that we’d released in the correct order through the environments. When the Change Advisory Board (CAB) reviewed releases they would already have the evidence in front of them, which would save time.
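The ‘correct order’ rule can be written down as a simple check. This is a minimal sketch, not how the Trello checklists work: the environment names come from our process, but the function names and the idea of a release ‘history’ list are assumptions made for illustration.

```python
from typing import List, Optional

# Order a release must follow through the environments.
ENVIRONMENTS = ["SIT", "pre-production", "production"]


def released_in_order(history: List[str]) -> bool:
    """True if the environments released to so far follow the required order."""
    return history == ENVIRONMENTS[: len(history)]


def next_environment(history: List[str]) -> Optional[str]:
    """The environment the release may go to next, or None once it is in production."""
    if not released_in_order(history):
        raise ValueError(f"release did not follow the environment order: {history}")
    if len(history) == len(ENVIRONMENTS):
        return None
    return ENVIRONMENTS[len(history)]


print(released_in_order(["SIT", "pre-production"]))  # True
print(released_in_order(["SIT", "production"]))      # False
print(next_environment(["SIT"]))                     # pre-production
```

A checklist on a card does the same job for people: each ticked item is evidence that the previous environment was completed before the next one was started.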

Trello also lets you assign tasks to people and they’ll get a notification when something’s been completed. So developers could release without having to wait for an email because the testing team had given approval which triggered a notification for the developer. This saved another 5 to 10 emails.

Testing things out

At this point we held demo sessions before putting the Google Form live. After a week we evaluated where we were. The feedback was positive: releases didn’t get stuck at any approval points, there were far fewer emails and there was a live version of the status and position of releases at all times. The whole thing was much easier and it was self-managing.

Going from good to great

We kept improving the process and after 6 months we’d changed the way we labelled releases and how release checklists are automatically added when a new release is created. I was now only spending around an hour a day making sure things were flowing correctly.

We’re now coming up to a year since we started doing things differently and the process is down to minutes per day. It’s now totally self-managed by the developers, testers and product managers, which gives us more time to work on what we’re actually here for: solving bigger problems.

Steven Allcock
Digital service manager

Introducing the Digital Operations team

On the Co-op Digital blog we’ve spoken a lot about the products and services we’re working on like Membership, our new coop.co.uk site and location finder. We’ve spoken less about the Digital Operations team and the work it does before those products and services can be made available to the world.

Time for an intro?

We recently did a show and tell over in Federation but for those who couldn’t make it, here’s what we spoke about.

Photo shows a group of colleagues watching the Digital Operations team show and tell.

The Digital Operations team’s responsibilities

The Digital Operations team looks after 3 things:

  1. Service management.
  2. Platform infrastructure.
  3. IT security.

The role we play differs for each area of work. For example, for Membership our role is to run the live service and its infrastructure, whereas for location finder we’re supporting the team while they run things themselves. Sometimes, our role is more about helping teams who are designing new services to think about how they’ll be operated and made secure during their life cycle, right from the early idea through to being live.

How we support teams

Photo shows 4 members of the Digital Operations team at their show and tell.

The Digital Operations team doesn’t take on development, support or responsibility for running new services. These things fall under a product or service team’s remit and we advise them. When teams need platform or operations engineers to build and run something, we help them find the people and resources they need.

We help Digital and Group work together

Co-op Digital is only one part of the Co-op, so it’s important that the work we do is in line with the wider policies. We help digital and non-digital people work together by translating Group policies into something accessible for digital teams to work from, and by helping Group colleagues understand how agile ways of working can support the policies.

Saving teams time by creating patterns

A really important part of our role is to build a set of patterns and ways of working that will help teams build things that are secure, reliable and scalable, and that perform well. We’re still in the early stages but the plan is that using the patterns will help teams make sure their product or service has security controls, disaster recovery, monitoring, alerting, a way for users to tell us about issues, and a support route to get those bugs to the developers.

The patterns are being built around Co-op policies such as our security and data protection policy, which means that if a team uses one to build they will have ticked most of the security policy checkboxes.

Ready for public consumption?

We’re also the keepers of the ‘readiness checklists’ – a list of things that need to be in place before teams make something new publicly available. Points on the checklist include whether an alpha is publicly accessible, whether it captures colleague, member or customer data, and whether it integrates with any internal Co-op systems. The checklists aren’t a hoop to jump through just before a service goes live – teams need to start thinking about being production-ready right from alpha phase.

Working on something new? Tell us all about it!

Our big message to teams at our show and tell was: if you’re working on something new, involve us as early as possible. This way we can share any patterns and technology that might help you work more efficiently. There’s no reason to reinvent the wheel each time we build something new. If we’ve got something that works – your team can just reuse it.

Coming to us early usually means we can pick up any problems and point out anything on our checklist that your product or service might not meet much earlier. That’ll mean we won’t have to delay anything.

Another place we can help is if you’re thinking of subscribing to an online service or purchasing a product. Maybe you are thinking of starting a new blog, creating a wiki, using a productivity tool or anything else that will help you with your job – you should make sure you speak to us to find out if it needs review or if there is a suitable product already available.

Come and say hi

We have a regular ‘surgery’ on the sixth floor in Federation House at 11am on Tuesdays. We also have a Slack channel, or you can drop us an email at digitaloperations@coopdigital.co.uk

Michaela Kurkiewicz
Principal service manager