Rogers responds to CRTC questions over outage, will split network

Rogers’ response to questions from the Canadian Radio-television and Telecommunications Commission (CRTC) about the July 8th outage — or ‘Red Friday,’ as Vass Bednar has taken to calling it — arrived late on July 22nd in a document filed to the CRTC’s website.

The lengthy, partially redacted document (which downloads as a .docx file) includes responses to various CRTC questions, with explanations about what happened, what Rogers will do to keep it from happening again, who was affected, and more. Rogers opens the document with a note that it will be “as transparent as possible” when answering the CRTC’s questions, but it also asks the CRTC to treat certain information in the document as confidential to protect the company’s customers, network, and vendors.

Frustratingly, Rogers redacted many details of its plans to prevent future outages.

Still, some of the broader goals remain available to the public. Rogers confirmed in the document that it plans to “increase resiliency in our networks and systems which will include fully segregating our wireless and wireline core networks,” as was previously reported by MobileSyrup.

Details on what caused the outage

Moreover, Rogers provided additional details about the cause of the outage. Previously, the company had said a maintenance update caused routers in its network to malfunction.

In the Friday disclosures, Rogers detailed that the update was the sixth in a seven-phase process that started on February 8th. The previous five phases “proceeded without incident.” That sixth stage began at 2:27am on July 8th (the company notes it usually performs upgrades at times when traffic is low). The update contained a coding error that started the issue at 4:43am, which cascaded through Rogers’ core network “very quickly.”

That coding error deleted a “routing filter” in Rogers’ distribution routers, which allowed all possible routes to the internet to flow through the routers. Rogers explains that this caused the routers to propagate “abnormally high volumes of routes throughout the core network,” leading certain network equipment to exceed capacity and fail.
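To make the failure mode concrete, here is a minimal Python sketch of what a routing filter does and what can happen when it disappears. It is an illustration under stated assumptions, not a model of Rogers’ actual equipment or configuration: the prefixes, the route count, the capacity figure, and the function names (make_filter, advertise, core_accepts) are all hypothetical.

```python
from ipaddress import ip_network

# Hypothetical limit on how many routes a core device can hold.
CORE_ROUTE_CAPACITY = 1_000

def make_filter(allowed_prefixes):
    """Build a filter that accepts only routes inside the allowed prefixes."""
    allowed = [ip_network(p) for p in allowed_prefixes]

    def accept(route):
        net = ip_network(route)
        return any(net.subnet_of(a) for a in allowed)

    return accept

def advertise(routes, route_filter=None):
    """Return the routes a distribution router passes on to the core.

    With no filter in place, every route it knows about goes through.
    """
    if route_filter is None:
        return list(routes)
    return [r for r in routes if route_filter(r)]

def core_accepts(advertised, capacity=CORE_ROUTE_CAPACITY):
    """True if the core device can hold the advertised routes."""
    return len(advertised) <= capacity

# 5,000 synthetic /24 routes standing in for "all possible routes".
internet_routes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(5000)]

# Filter in place: only routes under 10.0.0.0/16 reach the core.
filtered = advertise(internet_routes, make_filter(["10.0.0.0/16"]))

# Filter deleted: everything reaches the core.
unfiltered = advertise(internet_routes)

print(len(filtered), core_accepts(filtered))
print(len(unfiltered), core_accepts(unfiltered))
```

In this toy setup the filtered advertisement stays well under the assumed capacity, while the unfiltered one exceeds it. That mirrors, in a very simplified way, Rogers’ description of equipment exceeding capacity once the filter was gone.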

Rogers goes on to describe that it uses a “common core” network — like “many large Telecommunications Services Providers” (TSPs) — that combines wireless, wireline and other sources. The company explains that its core consists of various vendors’ equipment, that different equipment can have different designs and routing management protocols, and that these differences are “at the heart of the outage.”

Rogers notes that the outage impacted employees, preventing them from connecting to the company’s IT and network infrastructure. While some Rogers employees were able to communicate with each other using Bell or Telus SIM cards they received as part of a 2015 emergency contingency plan established between the carriers, staff still had to travel to centralized locations to access the network and begin sorting out what went wrong and how to fix it. This contributed to delays in restoring service.

Again, much of this mirrors previous MobileSyrup reporting about what caused the outage, although there are some new details that weren’t known before. Primarily, previous external analysis of the outage indicated that the issues stemmed from gateway routers, whereas Rogers says the outage started with distribution routers.

Rogers says it couldn’t transfer customers to competitors’ networks

As the Globe and Mail highlights in its report, Rogers revealed in the disclosure that it couldn’t transfer customers to competitors’ networks during the outage.

Bell and Telus offered Rogers assistance, but the company determined it couldn’t transfer customers to the other networks since some aspects of Rogers’ network — such as the centralized user database — weren’t accessible due to the outage. Moreover, Rogers says that competitors’ networks wouldn’t “have been able to handle the extra and sudden volume of wireless users (over 10.2M) and the related voice/data traffic surge.”

This is particularly notable in light of the government’s response. Following the events of Red Friday, Industry Minister François-Philippe Champagne directed Canadian telecom companies to develop a mutual assistance agreement to help each other during outages. Given that Rogers couldn’t transfer customers to other networks, and given its claim that other networks couldn’t handle the surge in traffic, it remains unclear how telecoms could implement a mutual assistance structure without significant changes to each company’s network. Moreover, if Bell and Telus also use common core networks, as Rogers implies, then those networks are potentially vulnerable to the same kind of failure as Rogers’ network.

Still, Rogers said it will explore various mutual assistance options with other companies before delivering a formalized agreement to the minister in September.

Changes to the update review process and communication

Rogers also noted in the disclosure that it went through a “comprehensive planning process including scoping, budget approval, project approval, kickoff, design document, method of procedure, risk assessment, and testing, finally culminating in the engineering and implementation phases” for the update.

The company stressed that it makes updates to the core network “very carefully.”

However, Rogers said it would review the process it uses to plan and implement updates to the network. The company also detailed plans to improve communication between its teams and the public when it comes to outages.

Changes include giving communication teams backup devices on alternate networks to use if Rogers’ network fails, updating policies and procedures for sharing updates in the event of a “network blackout,” increasing the frequency of updates, providing information across all channels about impacts to critical services like 9-1-1, and ensuring all statements posted to social media include the use of alt text.

That last change is particularly interesting given that Rogers and its flanker brands posted outage updates to Twitter as pictures of text, which people with visual impairments may not be able to read. Alt text provides descriptions of an image’s appearance and function, which can be picked up by technology like screen readers to help people with visual impairments understand images.
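On the web, alt text is typically carried in the alt attribute of an img element. As a rough illustration of how a publishing team might check for it, the Python sketch below scans a fragment of HTML and lists images that have no alt text. The class and function names, the file names, and the sample wording are all made up for this example, and it says nothing about how Rogers or Twitter actually handle alt text.

```python
from html.parser import HTMLParser

class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        alt = attr_map.get("alt") or ""
        # An empty alt is valid for purely decorative images, but an image
        # that contains the text of a status update needs a description.
        if not alt.strip():
            self.missing.append(attr_map.get("src") or "")

def images_missing_alt(html):
    """Return the src values of images in `html` that lack alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing

# Hypothetical post markup: the first image has no alt text.
sample = (
    '<img src="update-1.png">'
    '<img src="update-2.png" alt="Service update: we are working to restore the network.">'
)

print(images_missing_alt(sample))
```

A check like this only confirms that a description exists. Whether the description actually conveys what the image says still needs a human review.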

What’s next

A House of Commons committee on industry and technology plans to study the outage and will have a hearing Monday. The Globe and Mail notes that Rogers replaced its chief technology officer (CTO), Jorge Fernandes, just days before this hearing.

Telecom veteran Ron McKenzie replaced Fernandes. As MobileSyrup reported, the change is unlikely to disrupt Rogers’ plans to address the outage by separating wireless and wireline traffic.

The Globe expects the committee, which includes members from all four major federal parties, will question Rogers executives about the outage and the five-day credit delivered to customers as compensation. Critics previously questioned whether the credit was enough, given that the scope of the damage went far beyond not having service for several days. Moreover, a Quebec resident has filed a class-action lawsuit against the company seeking $400 for each customer impacted by the outage.

Those interested in diving into the details shared by Rogers can read the disclosure in full here (note the link will download a .docx file).

Source: CRTC (.docx file) Via: The Globe and Mail