Carnegie Mellon University

October 15, 2018

Optimizing computing systems

By Marika Yang

Machine learning has grown rapidly in engineering and computer science in recent years, driven by the explosion of interest in artificial intelligence. In machine learning, engineers and computer scientists feed large data sets into a neural network model, training it to learn from the data and, eventually, to identify and analyze patterns and make decisions.

Gauri Joshi researches the analysis and optimization of computing systems. Joshi, assistant professor of electrical and computer engineering, has been named a recipient of a 2018 IBM Faculty Award for her research in distributed machine learning. Faculty Award recipients are nominated by IBM employees in recognition of a specific project of significant interest to the company, and they receive a cash award in support of the selected project.

Joshi’s research focuses on distributed training algorithms for deep learning. The data sets used to train neural network models are massive, so a single machine cannot handle both the volume of data and the computation required to analyze it. Data sets and computations are therefore typically divided across multiple computing nodes (i.e., computers, machines, or servers), with each node responsible for one part of the data set.
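
To make the idea concrete, here is a minimal sketch of this kind of data sharding (not code from Joshi's project; the data, the shard count, and names such as num_nodes are invented for illustration):

```python
import numpy as np

# Toy data set: one million examples with 10 features each.
num_examples, num_features = 1_000_000, 10
X = np.random.randn(num_examples, num_features)
y = np.random.randn(num_examples)

# Split the examples evenly across the computing nodes, so that
# each node is responsible for one part of the data set.
num_nodes = 4
X_shards = np.array_split(X, num_nodes)
y_shards = np.array_split(y, num_nodes)

for node_id, shard in enumerate(X_shards):
    print(f"node {node_id}: {shard.shape[0]} examples")
```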

In a distributed machine learning system with the data set divided across nodes, researchers use an algorithm called stochastic gradient descent (SGD), which is at the center of Joshi’s research. The algorithm is distributed across the nodes and iteratively drives the model toward the lowest possible training error. In its standard form, it requires exact synchronization among the nodes at every iteration, which can lead to delays.
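
A toy simulation of synchronous distributed SGD might look like the following. This is a sketch under simplifying assumptions (a linear-regression model with squared loss, nodes simulated sequentially in one process); all names and numbers are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression problem, sharded across simulated nodes.
num_nodes, n_per_node, d = 4, 250, 10
true_w = rng.normal(size=d)
shards = []
for _ in range(num_nodes):
    X = rng.normal(size=(n_per_node, d))
    y = X @ true_w + 0.1 * rng.normal(size=n_per_node)
    shards.append((X, y))

w = np.zeros(d)          # model parameters held by the central server
lr, batch = 0.1, 32      # step size and minibatch size

def local_gradient(X, y, w, batch, rng):
    """Stochastic gradient of the squared loss on one node's shard."""
    idx = rng.integers(0, X.shape[0], size=batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

for step in range(200):
    # Synchronous step: the server waits for a gradient from EVERY
    # node, averages them, and then updates the shared parameters.
    grads = [local_gradient(X, y, w, batch, rng) for X, y in shards]
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - true_w))
```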

“My work is about trying to strike the best balance between the error and the delay in distributed SGD algorithms,” Joshi said. “In particular, this framework fits well with the vision of the IBM Watson machine learning platform; I will be working with the IBM Research AI team.”

In every iteration of SGD, a central server must communicate with all of the nodes. If any node slows down, the entire network waits for it, which can significantly reduce the overall speed of the computation. Joshi aims to improve both the efficiency and the speed of the computation without sacrificing the accuracy of the network.
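
The bottleneck can be illustrated with an invented timing simulation (a sketch, not Joshi's method): if each node's per-iteration time is random, a fully synchronous step pays the time of the slowest node, while a variant that proceeds after the fastest k responses, one family of approaches to the straggler problem, avoids that wait:

```python
import numpy as np

rng = np.random.default_rng(1)
num_nodes, num_iters, k = 8, 1000, 6

# Each node's per-iteration compute time, drawn from an exponential
# distribution to model occasional slow (straggler) nodes.
times = rng.exponential(scale=1.0, size=(num_iters, num_nodes))

# Fully synchronous SGD: every iteration waits for the slowest node.
sync_time = times.max(axis=1).sum()

# k-out-of-n variant: each iteration proceeds once the fastest k
# nodes have responded, ignoring the stragglers for that round.
k_sync_time = np.sort(times, axis=1)[:, k - 1].sum()

print(f"fully synchronous:             {sync_time:.0f} time units")
print(f"wait for fastest {k} of {num_nodes} nodes: {k_sync_time:.0f} time units")
```

Waiting for fewer nodes shortens each iteration but averages fewer gradients per update, which is exactly the error-versus-delay balance Joshi describes.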

“When you have a distributed system, communication and synchronization delays in the system always affect the performance of the algorithm. I'm trying to design robust algorithms that work well on unreliable computing nodes,” she said.

Prior to joining Carnegie Mellon’s College of Engineering in Fall 2017, Joshi was a research staff member at IBM’s Thomas J. Watson Research Center. Because of that experience, she was aware of the specific research projects that were relevant to the company’s interests.

The funding provided by the Faculty Award will be used to support Joshi’s students, who are working on the theoretical analysis for this project. In the future, she hopes to release an open-source implementation of the new algorithm they have developed. Joshi plans to work with IBM to make this method available to anybody who wants to train their own machine learning models using distributed SGD.