T-Mobile Slashes Production Time from 7 Months to Days with Cloud Foundry
A major telco uses Cloud Foundry to speed up the development life cycle and achieve scalability of its applications.
- A 1,000-function monolith was renovated by introducing a microservices-based architecture.
- Production cycles were slashed from seven months to days.
- New platform auto-scales and handles 40 million calls a day.
- Bug fixing is done within a day with zero downtime.
Monitoring is critical for avoiding major disasters and for responding to operational issues at short notice. Transferring ownership to the teams themselves reduces cross-organizational conflict and empowers team members to develop, test, and run their own code. Management buy-in is essential.
Java, Spring Cloud Services, LeoFS, MySQL, RabbitMQ, Apigee Edge, Kubernetes, Swagger
With onboarding started in May 2016, T-Mobile was able to launch its platform into production early in July that year. By August 2016, 100% of production traffic was moved to Pivotal Cloud Foundry.
T-Mobile US, Inc. is a major wireless network operator in the United States. Its headquarters are located in Bellevue, Washington, in the Seattle metropolitan area.
The company traces its roots to VoiceStream Wireless PCS, founded in 1994 as a subsidiary of Western Wireless Corporation. In May 2001, it was purchased by Deutsche Telekom for $35 billion, and in July 2002 it was renamed T-Mobile USA, Inc.
Becoming the “un-carrier”
T-Mobile US is a wireless network operator with 72.6 million customers as of early 2017. Operating under the T-Mobile and MetroPCS brands, it bills itself as the “un-carrier,” offering a range of continuously updated innovative services and flexible rate plans.
In addition to reducing dependence on contracts and uncoupling device costs from servicing expenses, the initiative has provided:
- Access to free unlimited music and video streaming
- The ability to make calls and send messages via a Wi-Fi connection without additional apps, logins, or costs
- Tools to help business customers go mobile (such as a free .com domain name, a website optimized for mobile devices, and an e-mail address)
- A wide variety of exclusive promotions and bonuses from T-Mobile’s partners (e.g., Walmart, Domino’s Pizza, Lyft, StubHub, etc.)
While giving customers more freedom in how they use wireless and mobile services, the “un-carrier” initiative required a lot of technical innovation internally.
For instance, an environment made up mainly of large Java-based monoliths was a major impediment to change. T-Mobile had trouble scaling up (and then back down) during the several big holiday promotion periods it faces each year. In addition, release cycles were long and complex: launching code to production took almost seven months and 72 steps. With traditional development and operations teams working separately, differences between test and production servers made configuration difficult.
Investigating Cloud Foundry
Seeking ways to speed up deployment and improve scalability and configuration management, T-Mobile turned to Cloud Foundry—at first, to its open-source version. After some time, the company realized it needed expert assistance in onboarding and troubleshooting, as well as in adopting microservices. So, it decided to switch to Pivotal Cloud Foundry (PCF).
A recent presentation revealed how T-Mobile used PCF to rebuild one of its legacy Java-based systems, which ran on WebLogic and had more than 1,000 functions. According to Brendan Aye (Principal Architect for Cloud Foundry) and Melissa Chapman (Sr. Product Manager at T-Mobile), the platform was seen as a way to create a scalable environment.
Microservices were meant to reverse the dependence on the existing large, monolithic infrastructure. Containers were conceptualized as a way to ensure environment consistency with all the containers pushed in the same way.
The team began with one of the monolith’s functions, GetUsage, which provided T-Mobile customers with access to information about their data usage. Having quite a low impact on the whole system, it was a perfect choice to try out a new approach. Though representing a relatively small portion of the legacy app, the function handled 2.5 to 12 million calls per day.
Adoption and results
With onboarding started in May 2016, T-Mobile was able to launch its platform into production early in July that year. By August 2016, 100% of production traffic was moved to Pivotal Cloud Foundry.
The team rebuilt the monolith’s GetUsage function from scratch, scaling it up to handle about 40 million calls per day. Whereas a release on the old system could take seven months and 72 steps, Cloud Foundry made it possible to reduce the production cycle to just days. In addition, bugs are now fixed the same day they are detected, with zero impact on system performance.
“If we were to do this with our traditional infrastructure, it would probably take us seven months. With Cloud Foundry, it was a day or so.” —Melissa Chapman, T-Mobile
T-Mobile started gathering metrics on their application instances in August 2016. The team began the shift to PCF and microservices when the system was reaching 1,000 instances.
Autoscaling removed the need for human intervention. By May 2017, the team already had 3,000 application instances to manage, reaching the critical mass adoption rate, where “users started helping users with problems, onboarding, etc.”
Apart from dramatically reducing production time, Cloud Foundry enabled T-Mobile to put the fail-fast and fix-fast concepts into practice. According to Brendan, the platform makes it possible to take risks on new ideas “by allowing us to move more quickly and try things out that we’re not able to try otherwise, and we’re able to recover from that risk just as quickly.”
Furthermore, he says, “the principle of the cloud is rapid elasticity, being able to get code out the door more quickly and being able to scale those applications up and down to meet demand, even sometimes daily.”
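To put the elasticity point in numbers, the sketch below converts the daily call volumes quoted in this article into average per-second rates and estimates how many application instances each volume might need. The per-instance capacity of 50 requests per second, the peak-to-average headroom factor of 2, and the floor of two instances are purely illustrative assumptions, not T-Mobile figures.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(calls_per_day: int) -> float:
    """Average requests per second implied by a daily call count."""
    return calls_per_day / SECONDS_PER_DAY


def instances_needed(calls_per_day: int,
                     rps_per_instance: float = 50.0,  # assumed capacity (hypothetical)
                     headroom: float = 2.0,           # assumed peak-to-average factor
                     min_instances: int = 2) -> int:
    """Instances required to serve the load with headroom, never below a floor."""
    required = math.ceil(average_rps(calls_per_day) * headroom / rps_per_instance)
    return max(min_instances, required)


if __name__ == "__main__":
    volumes = [("legacy, low end", 2_500_000),
               ("legacy, high end", 12_000_000),
               ("rebuilt GetUsage", 40_000_000)]
    for label, calls in volumes:
        print(f"{label}: {average_rps(calls):.0f} rps on average, "
              f"{instances_needed(calls)} instances")
```

Under these assumptions, 40 million calls a day works out to roughly 463 requests per second on average, and the instance count moves from 2 to 19 across the range of volumes. An autoscaler makes the same kind of calculation continuously, typically from live metrics rather than daily totals.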
Whereas previously T-Mobile had separate teams for testing and production, adopting Cloud Foundry has aided the company in introducing an entirely new DevOps culture. Guiding a product through the development process to production, teams are now “wholly responsible for owning, developing, and operating these different services.”
Shifting the responsibility, the company helped its employees to realize the true value and meaning of the DevOps culture. “They are responsible for the code they develop, how they test it, and how they run it,” Brendan says.
The role of APIs
The PCF platform is complemented by Apigee’s API management framework, which exposes existing services (like GetUsage) for consumption. Via APIs, T-Mobile’s customers and partners can access a variety of data resources, which in turn helps the company diversify its offers with new services and additional bonuses (such as free pizzas).
“With our growth come exciting opportunities where our business partners, other innovative apps, and entrepreneurs want to integrate with us. APIs are our way to involve them.”
—Himanshu Kumar, T-Mobile
Using APIs, both internal and external developers can quickly roll out applications without compromising security and stability.
A webinar from Himanshu Kumar, a Principal Developer at T-Mobile, and Paul Williams of Apigee provided a detailed overview of how these APIs work.
According to the webinar, the “capability” APIs are “detached from experience” and “purely focused on the resource or the underlying representation of an entity in the purest form.” The “experience” APIs are designed to “achieve an optimal use from a user experience perspective.”
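The distinction between the two API types can be illustrated with a minimal sketch. Everything in it is hypothetical: the `UsageRecord` fields, the in-memory store, the function names, and the 80% warning threshold are assumptions made for illustration and do not describe T-Mobile's actual GetUsage service or Apigee configuration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class UsageRecord:
    """Capability-level resource: a plain representation of the entity."""
    subscriber_id: str
    bytes_used: int
    bytes_allowed: int


# Hypothetical in-memory store standing in for a back-end system.
_USAGE = {
    "sub-001": UsageRecord("sub-001", 3 * GIB, 10 * GIB),
}


def get_usage_capability(subscriber_id: str) -> UsageRecord:
    """Capability API: returns the resource as-is, with no presentation logic."""
    return _USAGE[subscriber_id]


def get_usage_experience(subscriber_id: str) -> dict:
    """Experience API: reshapes the capability output for one particular screen."""
    record = get_usage_capability(subscriber_id)
    percent = round(100 * record.bytes_used / record.bytes_allowed)
    return {
        "headline": f"{record.bytes_used / GIB:.1f} GB of "
                    f"{record.bytes_allowed / GIB:.0f} GB used",
        "percentUsed": percent,
        "showWarning": percent >= 80,  # assumed threshold
    }


if __name__ == "__main__":
    print(get_usage_capability("sub-001"))
    print(get_usage_experience("sub-001"))
```

Because the experience function only calls the capability function, the presentation can change (new wording, a different threshold) without touching the underlying resource or the back end that serves it.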
As Paul explained, APIs are at the heart of creating a digital value chain. They are seen as “technical contracts between developers and the team that’s implementing functions and capabilities.”
“The goal is to rapidly innovate, as developers build apps for customers to do more business. They (the teams) can iterate on top of those APIs without having to affect ongoing development and maintenance of apps and their interactions with back-end systems.” —Paul Williams, Apigee
Himanshu also pointed out, “we are trying to imagine IT systems and solutions as things that we can break into capabilities, with an unambiguous assignment to teams that can own them, have a life cycle, and they feel empowered.”
The customer-centric APIs implemented by T-Mobile Nederland are an example of how the company employs APIs to exchange data with internal and external parties. (For more about using APIs for running microservices on Cloud Foundry, read our brief post on the topic, featuring another Apigee discussion.)
Technical challenges and lessons learned
According to Brendan and Melissa, T-Mobile’s initial infrastructure resources “were established to support teams developing monolithic apps on stateful servers.” So, migrating to Spring Cloud Services on PCF required certain workarounds:
Networking was “the biggest hurdle.” Initially, PCF didn’t support T-Mobile’s networking layout, which provided a number of separate networks for each of the company’s availability zones. Some of the PCF tiles (e.g., MySQL and RabbitMQ) did not support such multi-subnet topologies. So, while the Pivotal team worked on updating PCF services to support multiple subnets, developers at T-Mobile had to perform manual BOSH deployments as a temporary measure to enable Spring Cloud Services.
“We had to actually crack open the tiles, use BOSH releases from them, and deploy them manually with BOSH.” —Brendan Aye, T-Mobile
“RabbitMQ and MySQL were a must,” but the tiles for these two applications are multi-tenant single clusters, which was not going to work well within T-Mobile’s large production environment. Brendan explained that the so-called “bad neighbors” could flood the app with messages and requests, bringing down the entire cluster. Although the RabbitMQ and MySQL tiles were successfully used for Spring Cloud Services (e.g., Hystrix), it was decided not to offer them for actual production workloads.
Working with private clouds, the company needed on-premises S3-compatible object storage. There was an initial option to use a built-in NFS server, but it ran as a single instance only, so it did not meet the availability requirements. To provide highly available S3 object storage, the team created BOSH releases for the open-source LeoFS tool and got them up and running across all three availability zones.
Having multiple data centers, T-Mobile lacked global control over load balancing. So, cross-region load balancing became a customer responsibility.
After shifting to Cloud Foundry and microservices, the rapid increase in app instances caused compliance concerns. To address them, the company’s team set up automated provisioning of permissions to the platform. “We used a tool called CF Management that allows you to use a GitHub repo as a source of permissions for your orgs,” explained Brendan. “By doing this, we can leverage all the existing stuff we have in source control, such as pull requests and permissions.”
“It’s very easy to see who approves something, when they approved it, and what’s changed.”
—Brendan Aye, T-Mobile
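The idea of treating a source-controlled repository as the source of permissions can be sketched as a reconciliation step: compare the desired role assignments (as they would be declared in the repository) with what the platform currently has, and derive the grants and revocations needed. This is a conceptual illustration only. The data shapes, names, and `plan_changes` function below are hypothetical and do not reproduce the configuration format or behavior of the CF Management tool.

```python
from typing import Dict, List, Set, Tuple

# (org, role) -> set of user names; a hypothetical shape for illustration.
Permissions = Dict[Tuple[str, str], Set[str]]
Step = Tuple[str, str, str, str]  # (action, org, role, user)


def plan_changes(desired: Permissions, actual: Permissions) -> List[Step]:
    """Return the grant/revoke steps that turn `actual` into `desired`."""
    steps: List[Step] = []
    # Sort keys and users so the plan is deterministic and easy to review.
    for org, role in sorted(set(desired) | set(actual)):
        want = desired.get((org, role), set())
        have = actual.get((org, role), set())
        for user in sorted(want - have):
            steps.append(("grant", org, role, user))
        for user in sorted(have - want):
            steps.append(("revoke", org, role, user))
    return steps


if __name__ == "__main__":
    # Desired state, as it might be declared in a reviewed pull request.
    desired = {
        ("billing", "OrgManager"): {"alice"},
        ("billing", "OrgAuditor"): {"bob"},
    }
    # Current state, as it might be read back from the platform.
    actual = {
        ("billing", "OrgManager"): {"alice", "carol"},
    }
    for step in plan_changes(desired, actual):
        print(step)
```

With the sample data, the plan grants `bob` the auditor role and revokes `carol`'s manager role. Since the desired state lives in version control, each such change traces back to a commit and its approval.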
Brendan and Melissa emphasized the need for strong, consistent executive sponsorship to break down existing walls between departments and make the new system a success. Furthermore, the speakers pointed to persistence as one of the key factors in achieving important goals.
“We also learned to figure out what truly matters, what you need to ensure success and put your foot down when you need to.” —Brendan Aye, T-Mobile
Keeping all of the critical platform components under control by monitoring them allows for detecting problems in a timely manner and predicting possible hazards in the future.
What’s next?
With a new cloud-native development approach, a DevOps culture, and customer-centric APIs, T-Mobile has made substantial progress in unleashing its “un-carrier” initiative. The results for the second quarter of 2017 include another 1.3 million customers added to its network and service revenues reaching a record level, up 8% year-over-year.
According to Brendan Aye, the team plans to further expand its offerings, to improve working with RabbitMQ and MySQL tiles, and to update the infrastructure foundations to get multi-subnet support. Furthermore, T-Mobile aspires to resolve the load-balancing issue to enable customers to “build and push one time and have it run across all the foundations.”
Earlier this September, the company also announced another move within the “un-carrier” initiative—providing a free Netflix subscription.
Want details? Watch the videos!
In the video below, James Webb and Brendan Aye highlight how the bundle of Cloud Foundry and microservices helped T-Mobile to avoid downtime during the recent iPhone X launch.
Further reading
- Orange Labs Tests Massive Migration to the Cloud
- Huawei Deploys 4,500+ Apps with Cloud Foundry
- Running Microservices and APIs with Cloud Foundry