Carnegie Mellon University


August 20, 2019

Exploring the essence of big data

By Miranda Liu

Krista Burns

Whether you notice it or not, you create and receive data constantly in everyday life, sometimes merely by sending messages or browsing items on a shopping site. Many fields, such as medicine and entertainment, have become data-rich, which drives researchers to find new ways to capture and analyze this rapidly growing volume of information.

Yuejie Chi, the Robert E. Doherty Career Development Associate Professor of Electrical and Computer Engineering, is one of these researchers. “There’re lots of interesting questions about how you can model such data and how you can extract information from these data,” said Chi. “They allow me to apply the type of tools I know to some practical problems that domain experts might be interested in.”

For her research, Chi has received the Presidential Early Career Award for Scientists and Engineers (PECASE). Established in 1996, the PECASE is the highest honor bestowed by the United States Government on outstanding scientists and engineers who have begun their independent research careers and have shown exceptional promise for advancing their fields.

Chi’s research focuses on representing data efficiently to reduce complexity and improve decision making. Big data holds plenty of information, but the data we observe and collect every day can be highly redundant, messy, and incomplete. Take movie sites such as Netflix: each user may review only a small number of films, even though thousands are available.

How, then, can people extract useful information from such raw data? Though overwhelming at first glance, the entries in big data matrices can be highly correlated. A movie site may have millions of users, but they share many characteristics, such as age, country of origin, and educational background. Likewise, movies can share genres, directors, and lead actors. By studying how entries correlate, we can uncover their hidden features, also called latent variables.

By focusing on latent structures, movie sites can predict the missing entries and which movies the users might like. In this way, they can design algorithms to build an effective recommendation system.
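To make the idea concrete, here is a minimal sketch of predicting missing ratings with a low-rank latent-factor model fitted by gradient descent. It is one common textbook approach, not necessarily the algorithm Chi's group uses. The function name `complete_matrix`, the rank, the step size, and the toy ratings are all assumptions chosen for illustration.

```python
import numpy as np

def complete_matrix(ratings, mask, rank=2, steps=3000, lr=0.005, reg=0.01, seed=0):
    """Fill in missing ratings by fitting low-rank latent factors.

    ratings: (users x movies) array; values where mask == 0 are ignored.
    mask:    same shape, 1.0 where a rating was observed, 0.0 otherwise.
    All hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    # Each user and each movie gets a short vector of latent features.
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(steps):
        # Compare predictions with the truth on observed entries only.
        residual = mask * (U @ V.T - ratings)
        grad_U = residual @ V + reg * U
        grad_V = residual.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V
    # The product of the latent factors predicts every entry, seen or unseen.
    return U @ V.T

if __name__ == "__main__":
    # Toy data: 4 users x 3 movies, with 0 marking an unobserved rating.
    ratings = np.array([[5.0, 4.0, 0.0],
                        [4.0, 0.0, 1.0],
                        [0.0, 1.0, 5.0],
                        [1.0, 0.0, 4.0]])
    mask = (ratings > 0).astype(float)
    print(np.round(complete_matrix(ratings, mask), 1))
```

The entries the model fills in for unobserved positions are its guesses at how each user would rate each unseen movie, which is the raw material for a recommendation.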

“You don’t directly just think about the data itself; you’re trying to get some structures,” said Chi. “Once you get a good model of the latent structure, you can think about solving an inverse problem where you try to recover those latent structures using optimization. So we’re studying how to design algorithms to recover these structures.”

Aside from recommendation systems, Chi also uses latent representations to examine problems associated with imaging modalities. Biologists build devices, such as single-molecule super-resolution microscopes, to look at structures within cells, but the images they collect often lack the desired resolution because of the devices' limitations. By studying latent structures, Chi’s team has developed a new algorithm that significantly enhances image resolution while using the same available data and fewer computational resources.

Recently, Chi has been developing algorithms for distributed optimization. Nowadays, people often distribute data to different machines, as the data sets are too massive to fit onto a single device. Once they establish a distributed setting, however, communication issues may arise among individual machines. There may be adversarial events, and some entities may not want to share their data with the central location for privacy reasons. Thus, Chi aims to design algorithms that are communication-efficient and resilient to outlier events.
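As a rough illustration of what resilience to outliers can mean in a distributed setting, the sketch below has several workers compute gradients on their own data shards while a central server combines them with a coordinate-wise median instead of an average. This is a generic robust-aggregation technique, not a description of Chi's algorithms, and it does not address communication efficiency. The names `local_gradient`, `robust_distributed_gd`, and `faulty_worker` are hypothetical, and the faulty worker is simulated.

```python
import numpy as np

def local_gradient(X, y, w):
    """Least-squares gradient computed on one worker's private shard."""
    return X.T @ (X @ w - y) / len(y)

def robust_distributed_gd(shards, dim, steps=200, lr=0.1, faulty_worker=None):
    """Gradient descent where the server aggregates with a median.

    shards: list of (X, y) pairs, one per worker; raw data never leaves a worker.
    faulty_worker: optional index of a worker that sends garbage (simulated).
    """
    w = np.zeros(dim)
    for _ in range(steps):
        # Each worker shares only its gradient, not its data.
        grads = [local_gradient(X, y, w) for X, y in shards]
        if faulty_worker is not None:
            # Simulated outlier: this worker reports a wildly wrong gradient.
            grads[faulty_worker] = 100.0 * np.ones(dim)
        # A coordinate-wise median ignores a minority of extreme values,
        # whereas a plain average would be dragged far off course.
        w -= lr * np.median(np.stack(grads), axis=0)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -2.0, 0.5])
    shards = []
    for _ in range(5):
        X = rng.standard_normal((200, 3))
        shards.append((X, X @ w_true))
    print(robust_distributed_gd(shards, dim=3, faulty_worker=0))
```

With a plain mean in place of the median, the single faulty worker in this toy setup would dominate every update, which is why the choice of aggregation rule matters.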

“Once you know how to represent your data, you can leverage the structures in your algorithm design and achieve the goal more efficiently,” said Chi.