Finding Extremists in Online Social Networks

Extremist groups such as ISIS use social media to recruit, spread propaganda, and incite violence.  How can one monitor social networks to shut down these users?  Our work develops simple machine learning methods to predict if new users will belong to existing extremist groups using their network neighborhoods.  We also develop a network search model based on Polya’s Urns that lets one monitor the network to see if suspended users return with new accounts.  We tested our method on real Twitter accounts of ISIS members and found we can find the returning users much faster than other search methods.This work is done with Lieutenant Colonel Christopher Marks.

[Theory Paper and Application Paper

(Originally these were 2 papers, then the reviewers didn’t like either of them because one had not enough  theory and one had not enough application, so now they are combined into one paper.  But I still like having them in two separate PDFs).

Picking Winners: A Framework For Venture Capital Investment

We develop a data-driven approach to build portfolios of startup companies.  Our framework involves building portfolios to maximize the probability of at least one company “winning” which means having an IPO or an acquisition.  We call this the “Picking Winners” portfolio.  We develop a model based on first passage times random walks for the companies’ funding round time series.  We train our model on data for 24,000 companies.  We find with our approach we can achieve higher exit rates than many top venture capital firms.

This work is done with David Scott Hunter.

Media mentions

Picking Winners with Integer Programming in Daily Fantasy Sports

We develop an integer programming formulation for picking entries into daily fantasy sports contests on sites like DraftKings. Our “Picking Winners”
portfolio of entries aims to have one entry achieve a huge number of points and win the contest. This is because nearly all the money
goes to the winner. Our algorithm creates lineups with high variance (known as stacking) and then forces the lineups
to not share too many entries (diversify the portfolio). We were able to win multiple contests in hockey and baseball
using our approach. 

This work is done with David Scott Hunter.
The Julia code to implement our algorithm:

Predicting Performance Using Biometric Data

We show that you can predict who will choke under pressure using biometric data measured in low stress environments.  We conduct an experiment where subjects perform arithmetic questions for money.  They answer the questions in low stress and high stress rounds.  High stress is induced by time constraints and higher monetary reward.  We measure their electrodermal activity (EDA), which is a measure of how much they sweat.  EDA is a known physiological indicator of mental arousal.  Basically, if your mind is aroused, you sweat.  The arousal can be due to stress or excitement.  We find that subjects whose EDA decreases in the low stress round do poorly in the high stress round.  Unlike other studies in this area, ours is truly predictive, in that we forecast a future outcome using historical biometric data.  Our results suggest the potential of biometric data for job training and hiring processes.

This work is done with Carter Mundell and Juan Pablo Vielma.

Media mentions

Choice-Time Models

What can how fast you swipe on a face in Tinder tell us about how much you like (or dislike) that face?  Turns out quite a bit.  We develop a “choice-time-engagement” model that uses choice and time data to infer how much you like something, and how engaged you are with the decision you make.  Think about apps like Tinder.  A lot of times, people aren’t engaged and swipe blindly.  It would be nice to measure this engagement, and maybe count their swipes less when trying to figure out the quality of users on the site.  Our choice-time-engagement model can denoise online survey data using choice and time data to get better estimates of true user preferences.  This is useful for not only dating apps, but nearly any app where people vote on stuff.To demonstrate that our model works, we tested it on some online surveys.  One is a Mathematicians survey, where the subjects vote on their preferred mathematicians in head to head comparisons.  The resulting response times and preferences are shown in the figure on the left.  As can be seen, Newton vs. Pascal is a fast decision, with Newton winning most of the time.  On the other hand, Newton vs Gauss takes 4.3 sec to decide, and its nearly 50/50.

This is joint work with Zhengli Wang.


Building a set of Location Specific Social Media Users

We present an algorithm to build a set of social media users starting from a small seed set.  Our method uses an expand-classify approach where we classify the set of users as in the target location or not, and then expands the set by collecting the neighbors of the people in the target location.  Our classification algorithm uses a factor graph model and we show that classification can be easily done by solving a min-cut problem.  We find that we can collect several thousand users from Twitter in a few hours.  Testing on geo-tagged content, we find that our method is fairly accurate.

This work is done with Lieutenant Colonel Christopher Marks.


Predicting the Popularity of Tweets

We developed a probabilistic model for the spread of an individual tweet in Twitter. By observing the times of the retweets of a tweet, we are able to predict the total number of retweets the tweet will receive.
This is joint work with Emily Fox and Eric Bradlow.

The data for this work can be found on our Data page.

You can try a demo of our predictor called Twouija at


Finding Rumor Sources

Imagine a rumor spreads in a network and all we know is the source. Then can we find the source using only the network structure? It turns out that we can and there is a network centrality we created called rumor centrality which does this. For regular tree networks under a certain rumor spreading model it is an exact maximum likelihood estimator for the source. We have proven that it performs well on essentially any sparse network, showing the rumor centrality is a universally good source estimator. Rumor centrality is also a rather interesting mathematical object, having connections with linear extensions of partial orders, random walks, Markov chains, and Nash equilibria of certain network games.Related publications can be found here.This work was done in collaboration with Devavrat Shah.The figure on the right shows how rumor centrality finds the source. A simulated rumor was spread on this network, and the each node’s normalized rumor centrality was calculated, with red=1 and blue = 0. As can be seen, rumor centrality narrows down the set of likely rumor sources. In this figure, the true source is the red node in the center.

This work is done with Devavrat Shah.

[Paper 1 and Paper 2]

Learning Community Structure

What defines a community in a social network? Is it the community’s leaders? Or is it the less important followers in a community? We have found that these followers actually provide the identifiable signal for a community in a network. These followers have no friends outside of their community, and in this way provide the community with an identity. Using this observation, we created the leader-follower algorithm (LFA) which finds communities in networks by searching for these followers. The LFA has linear run-time and so can be applied to incredibly large networks. It can find overlapping communities and learns the number of communities naturally from the network structure. We have also proven that the LFA has good performance on a wide class of networks. These strengths give the LFA and advantage over other community detection methods such as spectral clustering or inference based methods. The figure on the right shows the communities found by the LFA in my own Facebook network, with appropriate social labels for each community found.

This is joint work with Devavrat Shah.

[Paper ]