How a coding error caused Rogers outage that left millions without service

People use the wifi inside Toronto’s Fairview Mall on July 8. Yader Guzman/The Globe and Mail

Rogers Communications Inc. engineers began the sixth step of a seven-step process to upgrade the core infrastructure that supports the company’s wireless and broadband networks at 2:27 a.m. on July 8.

Two hours and 16 minutes later, a coding error was introduced that triggered a cascade of events, resulting in a massive outage that left millions of Canadians without cellphone, internet or home phone service for at least a day.

The shutdown of one of Canada’s dominant telecommunications networks created widespread chaos. Rogers was unable to deliver four emergency alerts to its wireless customers in Saskatchewan, including three tornado warnings and one dangerous person report.

Rogers customers were unable to call 911, and the Interac debit system was also affected, causing issues for both consumers and businesses. In Toronto, the disruption forced Canadian singer-songwriter the Weeknd to postpone a concert that was supposed to have been held at the Rogers Centre that night.

Initially, even Rogers itself was unsure what was causing the service disruption. But weeks later, in a detailed submission in response to questions from the Canadian Radio-television and Telecommunications Commission, the company gave a full account of its version of events.

Those documents, which were disclosed publicly by the CRTC in redacted form on Friday, give new details on the outage and provide an early glimpse at the set of facts Rogers executives will draw upon on Monday, when they are expected to testify about the incident in a public hearing before the House of Commons committee on industry and technology.

Like many of its peers, Rogers currently has one core network that supports all of the services it provides. The core is essentially the network’s brain. It receives, processes, transmits and connects all voice, wireless data, internet and television traffic.

The telecom had started the seven-phase process to upgrade the core back in February, after what the company described in its CRTC submission as a comprehensive planning process that included budget and project approvals, risk assessment and testing.

The first five phases had gone smoothly. But, at 4:43 a.m. on July 8, a piece of code was introduced that deleted a routing filter. In telecom networks, packets of data are guided and directed by devices called routers, and filters prevent those routers from becoming overwhelmed by limiting the number of possible routes that are presented to them.

Deleting the filter caused all possible routes to the internet to pass through the routers, resulting in several of the devices exceeding their memory and processing capacities. This caused the core network to shut down.
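The failure mode described above can be illustrated with a toy model. The following Python sketch is not Rogers's configuration or any vendor's router software; the `Router` class, the capacity figure and the address prefixes are all hypothetical, chosen only to show how a filter keeps a routing table within its limits and what happens when that filter is removed.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional, Set


@dataclass
class Router:
    """Toy router with a fixed-size routing table (all values hypothetical)."""
    capacity: int                                    # max routes the table can hold
    route_filter: Optional[Callable[[str], bool]] = None
    table: Set[str] = field(default_factory=set)
    overloaded: bool = False

    def receive(self, routes: Iterable[str]) -> None:
        """Accept advertised routes until the table is full."""
        for prefix in routes:
            # A filter, when present, drops routes this router does not need.
            if self.route_filter is not None and not self.route_filter(prefix):
                continue
            if prefix not in self.table and len(self.table) >= self.capacity:
                # Table is full: the router can no longer cope.
                self.overloaded = True
                return
            self.table.add(prefix)


# 1,000 made-up routes standing in for "all possible routes to the internet".
all_routes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(1000)]

# With a filter: only a small, relevant subset of routes is accepted.
filtered = Router(capacity=500, route_filter=lambda p: p.startswith("10.0."))
filtered.receive(all_routes)

# Filter deleted: every route is presented and the table overflows.
unfiltered = Router(capacity=500, route_filter=None)
unfiltered.receive(all_routes)

print(len(filtered.table), filtered.overloaded)
print(len(unfiltered.table), unfiltered.overloaded)
```

In this simplified model the filtered router stays well under its capacity, while the unfiltered one fills its table and flags an overload. Real routers differ by vendor in how they respond when that limit is reached, and the company's submission points to those differences as central to the outage.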

Rogers uses equipment from different manufacturers in its network core, and the two vendors the company buys routers from have different designs and approaches to managing traffic and protecting the equipment from overloading. Those differences are at the core of the outage Rogers experienced, the company said in the documents.

But, in the early hours, the company’s technicians had not yet pinpointed the cause of the catastrophe. Rogers apparently considered the possibility that its networks had been attacked by cybercriminals. At 6 a.m., Jorge Fernandes, who at the time was the company’s chief technology officer, reached out to his counterparts at Telus Corp. and BCE Inc.’s Bell Canada to inform them of the outage and warn them to look out for cyberattacks, the company said in its submission.

Although Bell and Telus offered to help, Rogers quickly determined that it would not be able to transfer its customers to its rivals’ networks because certain elements of the Rogers network, such as its centralized user database, were inaccessible as a result of the outage. In any case, the rival networks would not have been able to handle the sudden surge of traffic from Rogers’s 10.2 million wireless subscribers, the telecom said.

Mr. Fernandes was in Portugal when the outage began, and he immediately started making arrangements to return to Canada, according to two sources familiar with his whereabouts. The Globe is not identifying the sources because they were not authorized to speak publicly about the matter.

Meanwhile, the Rogers network team gathered at the company’s network operations centre in Brampton, Ont., re-established access to the network and started trying to figure out the cause of the outage.

In order to communicate with each other and coordinate the recovery effort, some employees started swapping out their SIM cards for Bell or Telus SIM cards that they had received back in 2015 as part of an emergency contingency plan established between the wireless carriers.

It wasn’t until 8:54 a.m. – roughly four hours after the start of the outage – that the company publicly acknowledged the situation. “We know how important it is for our customers to stay connected,” the telecom tweeted through its customer service account. “We are aware of issues currently affecting our networks and our teams are fully engaged to resolve the issue as soon as possible. We will continue to keep you updated as we have more information to share.”

The company’s disclosures to the CRTC suggest the delayed reaction might have had to do with problems logging in to online accounts used to communicate with customers. The telecom said that, in the future, it will ensure its crisis response teams have alternative methods of accessing social-media accounts that are protected by two-factor authentication linked to Rogers devices.

It took all day for the network team to restore the network. They had to disconnect the equipment that was causing the problem, redirect traffic and confirm the stability of the network before slowly bringing services back online. The process had to be done methodically to prevent overloading the network and triggering another outage, the company said.

“Our wireless services are starting to recover and our technical teams are working hard to get everyone back online as quickly as possible,” the company tweeted shortly before 10 p.m.

The following morning, Rogers announced that it had restored services for the “vast majority” of its customers. But intermittent issues persisted throughout the weekend.

This Sunday, in an open letter to customers, Rogers CEO Tony Staffieri vowed to invest more in testing, oversight and artificial intelligence to improve the reliability of the company’s networks. He put the price tag of the changes at around $10-billion over three years.

The wireless giant will also physically separate its wireless and wireline core networks to ensure that any future outages don’t affect both services, Mr. Staffieri said.

Last week, the company replaced Mr. Fernandes, a former Vodafone executive, with veteran telecom executive Ron McKenzie. Mr. McKenzie was previously the president of Rogers for Business, the division that offers wireless and internet services to corporate clients.

Mr. McKenzie will kick off his new role with an appearance in front of the House of Commons committee that is studying the outage. The committee, which is made up of members of Parliament from all four major federal parties, is expected to grill him, Mr. Staffieri and Rogers chief regulatory officer Ted Woodhead on the five-day billing credit the company is offering to compensate its customers for the outage. The committee may also ask about the network and operational changes the telecom plans to make in order to prevent future outages.

As all of this is happening, Rogers is awaiting regulatory approval of its contested $26-billion takeover of Shaw Communications Inc., ahead of a July 31 deadline. The Competition Bureau is attempting to block the merger, arguing that it will result in poorer service and higher prices for cellphone customers.
