18 months on: our Digital Technology Operations team

It’s been just over 18 months since we first spoke about our Digital Technology Operations team, so it’s time we reflected on how we’ve been helping product teams to deliver, what we’ve learnt and what’s next.

Developing our role as a ‘landlord’

In our first post we explained that our team looks after 3 things:

  1. Service management.
  2. Platform infrastructure.
  3. IT security.

Another way to think of our role is as a property landlord.

The various delivery teams are independent from us in the sense that they look after the direction and design of their digital product or service – in other words, how each household is run isn’t our concern. However, all Co-op products and services have some things in common which make up a shared digital platform. This includes firewalls, a logging and alerting platform and core templates used to build out our infrastructure. This is the part that our team is responsible for. We’ve set the ground rules and make sure all new products and services follow them (like good tenants). It helps teams keep their products and services safe and operational.

How we’ve helped

1. Putting checks in place

We’ve set up a ‘service readiness’ check that new products and features go through before we call them ‘live’. A recent example is when we launched Coop.co.uk on new infrastructure. The check includes making sure:

  • security testing is complete
  • the relevant operational teams know what’s changed
  • the relevant people know what to do if something goes wrong
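For readers who think in code, here is a minimal sketch of how a check like this could be modelled. It is illustrative only: the `ReadinessCheck` class and its field names are hypothetical, chosen to mirror the three points above, and are not a description of any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReadinessCheck:
    """Hypothetical record of the three readiness points listed above."""
    security_testing_complete: bool
    operational_teams_informed: bool
    incident_contacts_briefed: bool

    def outstanding(self) -> list[str]:
        """Return the names of the checks that have not been met yet."""
        return [name for name, done in vars(self).items() if not done]

    def is_ready(self) -> bool:
        """Something is only ready to be called 'live' when nothing is outstanding."""
        return not self.outstanding()


check = ReadinessCheck(
    security_testing_complete=True,
    operational_teams_informed=True,
    incident_contacts_briefed=False,
)
print(check.is_ready())     # False
print(check.outstanding())  # ['incident_contacts_briefed']
```

Returning the list of outstanding items, rather than a bare yes or no, means a team can see exactly what is left to do.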

2. Helping teams manage cost

We now provide a cost management tool for our delivery teams to help them manage their cloud infrastructure cost. Each team has a dashboard that shows their current spend and their forecasted spend. Having visibility over this gives them autonomy over how they manage their budget and lessens the likelihood that they’ll overspend.

Image of the cost management dashboard. It shows a 6 month forecast, a past 6 month spend and the actual spend.

Having a cost dashboard helped the Membership team to see that infrastructure logging was costing more than expected one month. When they investigated, they found that the logging was sending data every second rather than every minute. Real-time cost reporting helped them spot the increase quickly, so they could fix the configuration and incur only a few days of increased cost.
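A quick bit of arithmetic shows why an interval mistake like that matters. The sketch below only counts how often data is sent per day at each interval. How that translates into cost depends on the pricing model, which is not covered here, so treat the link between volume and cost as an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sends_per_day(interval_seconds: int) -> int:
    """Number of times data is sent in a day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


intended = sends_per_day(60)  # once a minute
actual = sends_per_day(1)     # once a second

print(intended)            # 1440
print(actual)              # 86400
print(actual // intended)  # 60
```

Sending every second instead of every minute means 60 times as many sends, which is the kind of jump a spend dashboard makes easy to spot.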

3. Improving reliability

Over the last 18 months we haven’t changed the tools such as logging, monitoring and alerting systems, but we have worked hard to make them more reliable. For example, we’ve worked with our monitoring tool supplier to tweak how we configured it. Now it can handle more data easily.

We’ve also put efficient processes in place for when there’s a problem. Teams can see the details on a status page and we alert the relevant people to fix the problem through our ‘major incident’ process.

4. Getting buy-in from teams

We ask that delivery teams secure their infrastructure following our guidelines, carry out regular security and disaster recovery testing, and build products and services in line with our approved tech stack.

It’d be easy to say we just provide the tools for them to do this. But in the last 18 months we’ve done much more than that: we’ve successfully explained the importance of good technical standards to teams. We have their trust and their buy-in and, as a result, there’s commonality and consistency between all Co-op products and services.

What’s next

Over the next 12 months we’ll be working with the rest of Co-op IT to shape some of our existing IT management processes, like disaster recovery, to make them work for the new challenges that cloud infrastructure brings. And as our digital teams start to use new technologies like containers and serverless, we’re looking at how our tools and processes can be adapted to support these as well.

Over the next few blog posts we’ll talk in more depth about how we do monitoring, how we manage our services becoming unavailable and how we onboard a new service.

Michaela Kurkiewicz
Head of Digital Technology Operations

Introducing the Digital Operations team

On the Co-op Digital blog we’ve spoken a lot about the products and services we’re working on like Membership, our new coop.co.uk site and location finder. We’ve spoken less about the Digital Operations team and the work it does before those products and services can be made available to the world.

Time for an intro?

We recently did a show and tell over in Federation but for those who couldn’t make it, here’s what we spoke about.

Photo shows a group of colleagues watching the Digital Operations team show and tell.

The Digital Operations team’s responsibilities

The Digital Operations team looks after 3 things:

  1. Service management.
  2. Platform infrastructure.
  3. IT security.

The role we play differs for each area of work. For example, for Membership our role is to run the live service and its infrastructure, whereas for location finder we’re supporting the team while they run things themselves. Sometimes, our role is more about helping teams who are designing new services to think about how they’ll be operated and made secure during their life cycle, right from the early idea through to being live.

How we support teams

Photo shows 4 members of the Digital Operations team at their show and tell.

The Digital Operations team doesn’t take on development, support or responsibility for running new services. These things fall under a product or service team’s remit and we advise them. When teams need platform or operations engineers to build and run something, we help them find the people and resources they need.

We help Digital and Group work together

Co-op Digital is only one part of the Co-op, so it’s important that the work we do is in line with the wider policies. We help digital and non-digital people work together by translating Group policies into something accessible for digital teams to work from, and by helping Group colleagues understand how agile ways of working can support the policies.

Saving teams time by creating patterns

A really important part of our role is to build a set of patterns and ways of working that will help teams build things that are secure, reliable and scalable, and that perform well. We’re still in the early stages but the plan is that using the patterns will help teams make sure their product or service has security controls, disaster recovery, monitoring, alerting, a way for users to tell us about issues, and a support route to get those issues to the developers.

The patterns are being built around Co-op policies such as our security and data protection policy, which means that if a team uses one to build they will have ticked most of the security policy checkboxes.

Ready for public consumption?

We’re also the keepers of the ‘readiness checklists’ – a list of things that need to be in place before teams make something new publicly available. Points on the checklist include whether an alpha is publicly accessible, whether it captures colleague, member or customer data, and whether it integrates with any internal Co-op systems. The checklists aren’t a hoop to jump through just before a service goes live – teams need to start thinking about being production-ready right from alpha phase.

Working on something new? Tell us all about it!

Our big message to teams at our show and tell was: if you’re working on something new, involve us as early as possible. This way we can share any patterns and technology that might help you work more efficiently. There’s no reason to reinvent the wheel each time we build something new. If we’ve got something that works – your team can just reuse it.

Coming to us early usually means we can pick up any problems much sooner, and point out anything on our checklist that your product or service might not meet. That’ll mean we won’t have to delay anything.

Another place we can help is if you’re thinking of subscribing to an online service or purchasing a product. Maybe you are thinking of starting a new blog, creating a wiki, using a productivity tool or anything else that will help you with your job – you should make sure you speak to us to find out if it needs review or if there is a suitable product already available.

Come and say hi

We have a regular ‘surgery’ on the sixth floor in Federation House at 11am on Tuesdays. You can also find us in our Slack channel or email us at digitaloperations@coopdigital.co.uk

Michaela Kurkiewicz
Principal Service Manager

Moving to continuous delivery

I joined the Digital Engineering team in February as the Principal Service Manager. We’ve had a busy few months designing how we’re going to run the new digital services, the first of which is the Local Causes application website, which forms part of our new Co-op Membership.

Picture of Michaela - Principal Service Manager

A different way of working

Coming from a background of traditional IT systems with on-premises infrastructure, I knew from the moment I walked onto the 13th floor of 1 Angel Square that I’d have to start thinking differently. Every wall, surface and window was covered in sketches, Kanban boards and Post-its, and the floor was buzzing with energy.

Picture of a Kanban board.

As a service team, our job is to make sure the systems and services keep running, whether that’s by handling incidents, tackling problems or making sure that changes don’t cause outages or new issues. And that last one has been one of our biggest challenges. In the new digital teams, the pace is much faster than anything we’d been used to. Previously we’d handled one or two big changes a month. The digital teams were aiming to release daily – we needed a different way of working.

Our challenge

The systems are quite complex, with lots of different moving parts using lots of different technologies. Some front-end components like the website are built using new tools and technologies that build in automated deployment and automated regression testing. Other components down the stack are slower moving – they haven’t been built using these tools and require more manual intervention. We needed to build a change process that didn’t cause a bottleneck to releasing new features but at the same time would give us enough assurance that changes to one component weren’t going to cause issues up and down the stack. We also needed to make sure we weren’t drowning in admin that doesn’t add value.

Whilst we’re still keeping in line with the Co-op’s core change policies, we’ve tweaked the way we work to enable us to handle the higher volumes of change. At the moment, whilst there’s a lot going on, we’re having daily Change Approval Board (CAB) meetings to make sure representatives from each of the components are aware of the changes going on across the whole ecosystem. This way we’re catching any potential conflicts, and we’ve seen some great challenges between the teams to make their changes safer by improving their testing or deployment approaches.
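To give a feel for the kind of conflict a daily meeting like this is looking for, here is a minimal sketch that flags planned changes whose time windows overlap and that touch a shared component. Everything in it is hypothetical: the `Change` class, the team and component names and the date are made up for illustration, and a real conflict can be subtler than a simple overlap.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Change:
    """A planned change. All fields are hypothetical, for illustration."""
    team: str
    components: frozenset  # names of the components the change touches
    start: datetime
    end: datetime


def find_conflicts(changes):
    """Return pairs of changes that overlap in time and share a component."""
    return [
        (a, b)
        for a, b in combinations(changes, 2)
        if a.start < b.end and b.start < a.end and a.components & b.components
    ]


# Arbitrary example date and made-up change windows.
planned = [
    Change("Website", frozenset({"website", "membership-api"}),
           datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 11, 0)),
    Change("Membership", frozenset({"membership-api"}),
           datetime(2024, 1, 10, 10, 30), datetime(2024, 1, 10, 11, 30)),
    Change("Data", frozenset({"database"}),
           datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 0)),
]

found = find_conflicts(planned)
for a, b in found:
    shared = sorted(a.components & b.components)
    print(f"{a.team} and {b.team} overlap on {shared}")
```

A check like this can only raise a flag. Deciding whether two overlapping changes are actually risky still needs the people who know the components in the room.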

We’re trialling different ways to make sure everyone knows what’s going on, from Post-its on a whiteboard to venturing into the world of ChatOps with a shared calendar integrated into a Slack channel. And as we go, we’re collecting feedback from all of the different teams: What’s not working? How can we make it more efficient? How could we tackle the admin differently?

We hit a great milestone this month – over 13 days we averaged one successful deployment a day to the Co-op Local Community Fund. Whilst we’re not quite at full continuous delivery levels yet, we’ve learnt a lot in getting this far.

Michaela Kurkiewicz
Principal Service Manager