Computer Science Student Research Day
This schedule is preliminary; the exact ordering of talks may change, but the times of the invited talks will not.
Attendees stand a chance of winning mystery prizes.
- Friday, September 3, 2021
- Old Mill 325 (John F. Dewey Lounge)
Enter the Old Mill building via the south-side entrance (door sign: "Old Mill Annex")
- 16m presentation + 4m Q/A
- 20m presentation + 5m Q/A
|8:15am–8:30am||Chris Skalka (CS Chair)|
Aaron Clauset (https://aaronclauset.github.io)
Nearly-optimal Prediction of Missing Links in Networks
Predicting missing links in networks is a fundamental task in network analysis and modeling. However, current link prediction algorithms exhibit wide variations in their accuracy, and we lack a general understanding of which methods work better in which contexts. In this talk, I'll describe a novel meta-learning solution to this problem, which makes predictions that appear to be nearly optimal by learning to combine three classes of prediction methods: community detection algorithms, structural features like degrees and triangles, and network embeddings. We evaluate 203 component methods individually and in stacked generalization on (i) synthetic data with known structure, for which we analytically calculate the optimal link prediction performance, and (ii) a large corpus of 550 structurally diverse networks from social, biological, technological, information, economic, and transportation domains. Across settings, supervised stacking nearly always performs best and produces nearly-optimal performance on synthetic networks. Moreover, we show that accuracy saturates quickly, and near-optimal predictions typically require only a handful of component methods. Applied to real data, we quantify the utility of each method on different types of networks, and then show that the difficulty of predicting missing links varies considerably across domains: it is easiest in social networks and hardest in technological networks. I'll close with forward-looking comments on the limits of predictability for missing links in complex networks and on the utility of stacked generalizations for achieving them.
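To give a flavor of the approach, here is a minimal, self-contained sketch (not the authors' code; the feature set and fixed weights are illustrative assumptions) of scoring candidate links by combining several structural features. In actual stacked generalization, a supervised meta-classifier learns how to weight the component predictors on held-out edges; fixed weights stand in for that here.

```python
def features(adj, u, v):
    """Structural features for a candidate edge (u, v).

    adj maps each node to the set of its neighbors.
    """
    cn = len(adj[u] & adj[v])               # common neighbors
    union = len(adj[u] | adj[v])
    jaccard = cn / union if union else 0.0  # normalized overlap
    pa = len(adj[u]) * len(adj[v])          # preferential attachment
    return [cn, jaccard, pa]

def stacked_score(adj, u, v, weights=(0.5, 0.3, 0.2)):
    # Placeholder for a learned meta-classifier: combine the
    # component predictors with fixed, illustrative weights.
    return sum(w * f for w, f in zip(weights, features(adj, u, v)))

# Tiny example graph: a-b, a-c, b-c, c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
score_ad = stacked_score(adj, "a", "d")  # candidate missing link a-d
```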
|9:30am–9:45am||Aaron Clauset, 2nd Talk: The Unequal Impact of Parenthood in Academia|
|9:45am–9:50am||Mystery Host, Raffle Draw/Mystery Prize #1|
SOMTimeS: Self-organizing Maps for Time Series Clustering and Its Application to Serious Illness Conversations
(Advisors: Byung S. Lee + Donna M. Rizzo)
There is an increasing demand for scalable algorithms capable of clustering and analyzing large time series datasets. The Kohonen self-organizing map (SOM) is a type of unsupervised artificial neural network used for visualizing and clustering complex data, reducing the dimensionality of data, and selecting influential features. Like all clustering methods, the SOM requires a means of measuring similarity between input observations (in this work, time series data). Dynamic time warping (DTW) is one such method, and a top performer in time series analysis due to its ability to distort the temporal dimension to find the best match. Despite its prior use in clustering methods, including the SOM algorithm, DTW is limited in practice because this resilience comes at a high computational cost when clustering the large amounts of data associated with real applications. To address this, we present a new method, called SOMTimeS (a Self-Organizing Map for TIME Series), that retains the robustness of DTW yet scales better and runs faster than competing DTW-based clustering methods. The computational efficiency of SOMTimeS stems from its ability to prune unnecessary DTW computations during the SOM’s competitive learning (i.e., training) phase. We evaluated accuracy and scalability on 112 benchmark time series datasets from the University of California, Riverside classification archive. SOMTimeS clustered these data with state-of-the-art accuracy, and the DTW computation pruning enabled it to scale linearly with respect to the number and length of time series. To demonstrate its utility on high-dimensional, temporally sequenced phenomena of increasing relevance to computer science, we applied SOMTimeS to natural language conversation data collected as part of a large healthcare cohort study of patient-family-clinician serious illness discussions.
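As a rough illustration of the pruning idea (a sketch under the assumptions of equal-length series and a Sakoe-Chiba warping window; this is not the SOMTimeS implementation), a cheap lower bound such as LB_Keogh can skip full DTW computations that cannot beat the current best match:

```python
def dtw(a, b, w):
    """DTW distance (squared-difference cost) between equal-length
    series, restricted to a Sakoe-Chiba warping window of width w."""
    inf = float("inf")
    n = len(a)
    D = [[inf] * (n + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][n]

def lb_keogh(a, b, w):
    """Cheap lower bound on dtw(a, b, w) using b's envelope of width w."""
    total = 0.0
    for i, x in enumerate(a):
        window = b[max(0, i - w):i + w + 1]
        lo, hi = min(window), max(window)
        if x > hi:
            total += (x - hi) ** 2
        elif x < lo:
            total += (x - lo) ** 2
    return total

def nearest(query, candidates, w):
    """1-NN search that prunes the full DTW computation whenever the
    lower bound already exceeds the best distance found so far."""
    best, best_d = None, float("inf")
    for c in candidates:
        if lb_keogh(query, c, w) >= best_d:
            continue  # pruned: no DTW needed for this candidate
        d = dtw(query, c, w)
        if d < best_d:
            best, best_d = c, d
    return best, best_d
```

The same trick applies inside the SOM's competitive learning step, where each input series must find its best-matching map unit.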
The Penumbra of Open Source: Projects Outside of Centralized Platforms Are Longer Maintained, More Academic, and More Collaborative
(Advisors: James P. Bagrow + Laurent Hébert-Dufresne)
GitHub has become the central online platform for open source, hosting most open source repositories. With this popularity, the public digital traces of GitHub are now a valuable means to study teamwork and collaboration. In many ways, however, GitHub is a convenience sample. We need to assess its representativeness, particularly how GitHub’s design may alter the working patterns of its users. Here we develop a novel, extensive sample of public open source project repositories outside of centralized platforms like GitHub. We characterize these projects along a number of dimensions and compare them to a time-matched sample of corresponding GitHub projects. Compared to GitHub, these projects tend to have more collaborators, are maintained for longer periods, and tend to be more focused on academic and scientific problems.
Ollin D. Langle-Chimal
Social and Economic Disparities During the COVID-19 Pandemic
(Advisor: Nick Cheney)
The current COVID-19 pandemic has strengthened the already marked differences among socio-economic groups. Non-pharmacological interventions such as mobility restrictions and work-from-home policies have, on the one hand, helped mitigate the spread of the virus and, on the other, created a job deficit and a reduction in consumption. This has especially impacted middle- and low-income countries, where economic stimulus programs were small at best, further affecting the most vulnerable populations. GPS signals from cellphone usage allow us to track users’ behavior in order to evaluate compliance with mobility restrictions and discern differences between economic strata, so as to create better evidence-based public policies. We use GPS data from a provider and government censuses to optimize a mobility model and measure the impact of the pandemic in six middle-income countries. We also use a separate data partnership, along with economic surveys, to deepen our understanding of the pandemic’s economic impact.
|11:05am–11:10am||Mystery Host, Raffle Draw/Mystery Prize #2|
A Strategy for Provably Secure Multi-party Computation
(Advisors: Joe Near + Chris Skalka)
Multi-party computation is a generalization of homomorphic encryption in which many parties may have secret inputs to the computation. This talk will first introduce some basic strategy (circuit computation) and a simple MPC protocol (BGW, aka Shamir Secret Sharing). Then most of the talk will focus on the security properties MPC protocols need to provide, and how proofs of those properties are typically constructed. Finally, we'll look at parallels between security proofs for MPC systems and handlers of effect signatures in algebraic-effects systems.
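A minimal sketch of Shamir secret sharing, the building block the talk introduces (illustrative only; the prime and the share/threshold counts are arbitrary choices): the secret is embedded as the constant term of a random degree-(t−1) polynomial over a prime field, and any t shares reconstruct it by Lagrange interpolation at x = 0.

```python
import random

P = 2 ** 61 - 1  # a Mersenne prime; all arithmetic is over GF(P)

def share(secret, n, t):
    """Split `secret` into n shares such that any t reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule, mod P
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse (Fermat's little theorem)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret
```

Because shares are points on polynomials, adding corresponding shares of two secrets yields shares of their sum, which is the additive homomorphism that BGW-style protocols build on for circuit computation.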
Standardizing Corpora for Sociotechnical System Measurements Using Twitter Data
(Advisors: Chris Danforth + Peter Dodds)
The widespread availability of social media data offers researchers the potential for insights into previously difficult-to-measure phenomena. Many claims have been made in the literature about the potential of social media data feeds, from supplementing traditional polling to nowcasting economic indicators. However, the methods used to study such systems are not standardized, and often leave researchers with wide discretion in how they construct datasets. We aim to review methods of corpus construction, and to highlight strengths and weaknesses of existing methods for common research applications. For focused case studies, we argue for algorithmically expanding sets of anchor words using co-occurring tokens as a key metric, while relying on expert discretion to exclude weakly related words. Using case studies focusing on public perception of face masks, vaccination, and the coronavirus, we show how differences at the data collection stage can substantially influence the results of a case study.
LanePainter: Lane Marks Enhancement via Generative Adversarial Network
(Advisor: Safwan Wshah)
Lane detection is a popular research field in computer vision and autonomous vehicles. However, research rarely focuses on how the quality of lane marks affects the performance of lane detection algorithms. In this work, we study this problem and propose LanePainter, a GAN-based model that simultaneously classifies and enhances lane marks. Our experiments show that low-quality lane marks degrade the performance of existing lane detection algorithms, and we demonstrate that our model can successfully detect low-quality lane marks and enhance them. Finally, we show that the enhanced lane marks improve the performance of existing lane detection algorithms on low-quality inputs.
Active Magnetic Sensing for Subterranean Urban Target Discrimination
(Advisor: Dryver Huston)
Given the rapid expansion of urban sectors in recent years, the location, identification, management, and hazard detection of subterranean infrastructure is now more pivotal than ever before. Here, a system is being developed using electronically-geared, rotating neodymium magnets to project oscillating magnetic fields. When these fields interact with ferromagnetic or non-ferromagnetic conductive materials, eddy currents—which induce a counter-propagating magnetic field—are generated and then measured by a magnetometer sampling at 490 Hz or higher. The data are then read into a Raspberry Pi 4, interpreted through a series of signal processing algorithms, and displayed in graph form. Future plans entail feeding processed data into a neural network and training the program to discriminate between different types of sensed materials like copper, steel, iron, aluminum, and even lead; and integrating with edge-based augmented reality interfaces. In order to develop a dedicated system that can handle the strain of heavy data processing, read the data from the magnetometer in a timely manner, support neural network processing, integrate RealSense and augmented reality software, and perform applications in real time, a “cluster” of Raspberry Pis using socket programming is in the process of being developed.
Continual Audit of Individual and Group Fairness in Deployed Classifiers via Prediction Sensitivity
(Advisor: Joe Near)
As AI-based systems increasingly impact many areas of our lives, the fairness of these systems is an increasingly high-stakes concern. Group fairness metrics are used to audit such systems, but these metrics may not always capture fairness perfectly, and they cannot easily be applied to continuously audit AI-based systems after deployment. In this paper, we address the first challenge by arguing in favor of model-based evaluation of classifier fairness, based on counterfactual augmentation of data. To address the second challenge, we propose a new measure of individual fairness—called prediction sensitivity—for the continual audit of deployed classifiers. We show that prediction sensitivity is effective at distinguishing fair from unfair models (at the group level) and fair from unfair predictions (at the individual level) in the context of real (biased) data.
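As a toy illustration of the intuition (an assumption-laden sketch, not the paper's definition), a sensitivity-style measure can be approximated by how strongly a model's output responds to a small perturbation of the protected attribute:

```python
def prediction_sensitivity(model, x, attr_idx, eps=1e-4):
    """Central finite-difference estimate of how much the model's
    prediction at x changes per unit change in feature attr_idx.

    `model` is any callable taking a feature list; attr_idx is the
    (hypothetical) index of the protected attribute.
    """
    hi = list(x); hi[attr_idx] += eps
    lo = list(x); lo[attr_idx] -= eps
    return abs(model(hi) - model(lo)) / (2 * eps)

# Two toy linear scorers: one ignores the protected attribute x[1],
# one leans on it directly.
fair = lambda x: 0.8 * x[0]
unfair = lambda x: 0.8 * x[0] + 0.5 * x[1]
```

For the linear models above, the measure recovers the coefficient on the protected attribute: near zero for `fair`, about 0.5 for `unfair`.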
|2:00pm–2:05pm||Mystery Host, Raffle Draw/Mystery Prize #3|
Avian Vs. Redditor: Establishing a Baseline Rate of Information Transfer Between Twitter and Reddit
(Advisors: Chris Danforth + Peter Dodds)
The evolution and spread of narratives online often transcends the boundaries of a single social media platform, calling for investigation of information transfer. In our Storywrangler API, we make available a privacy-preserving, high-level view of what people have been talking about over the last decade through phrase counts for public messages on both Twitter and Reddit. This presentation describes a study of the dynamics of information transfer between the two platforms through Granger causality on word frequencies. Our results show that a subset of 3–30% of common words exhibit statistically significant information transfer; for Twitter's influence on Reddit, information from the preceding 2 days is most relevant, while Reddit's peak influence on Twitter occurs at the 9 day mark. Word embedding models reveal the thematic qualities of connected terms—showing broadly that Twitter drives conversation on politics, and Reddit drives more discussion around business. This work provides an overview of conversation dynamics on two major social media platforms, and establishes a baseline for more tailored analyses of cross-platform dynamics within a specific topic area.
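A minimal sketch of the kind of test involved (illustrative only, not the study's code or its statistical machinery): compare the residual error of an autoregression of one word-frequency series with and without lagged values of the other platform's series. A positive gain suggests the second series helps predict the first, in the Granger sense.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def granger_gain(x, y, lag=2):
    """Relative reduction in residual error when `lag` lags of x are
    added to an autoregression of y; larger values suggest that x
    Granger-causes y. (A proper analysis would use an F-test.)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    Y = y[lag:]
    ones = np.ones(n - lag)
    y_lags = np.column_stack([y[lag - k - 1:n - k - 1] for k in range(lag)])
    x_lags = np.column_stack([x[lag - k - 1:n - k - 1] for k in range(lag)])
    restricted = ssr(Y, np.column_stack([ones, y_lags]))
    full = ssr(Y, np.column_stack([ones, y_lags, x_lags]))
    return (restricted - full) / restricted if restricted else 0.0
```

On synthetic data where y is driven by x at a one-day lag, the gain is large in the x-to-y direction and near zero in the reverse direction.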
Deep Learning for Model Parameters Calibration in Power Systems
(Advisor: Safwan Wshah)
In power systems, accurate device models are crucial for grid reliability, availability, and resiliency. Existing model calibration methods based on mathematical approaches often lead to multiple solutions due to the ill-posed nature of the problem, requiring further intervention from field engineers to select the optimal solution. Current practices, such as staged tests, are costly and time-consuming. Recent approaches using phasor measurement unit (PMU) data have shown promising results by providing a unique opportunity to calibrate dynamic models without time-consuming and costly offline tests. In this work, we propose a new practical method for model calibration using deep learning and the event playback approach.
Laurent Hébert-Dufresne (https://laurenthebertdufresne.github.io)
The Importance of Human Behavior for Epidemic Models
Modern epidemic models have a long history, going back almost a hundred years to the seminal work of William O. Kermack and Anderson G. McKendrick on the Susceptible-Infectious-Recovered (or SIR) model in the 1920s. Their work was meant to provide a theoretical understanding of epidemic dynamics, and not necessarily to inform epidemic forecasting and policies. As such, some of their simplifying assumptions about human behavior are harder to justify now that mathematical models and advanced computing power regularly contribute to public health. In this lecture, to go beyond the assumptions used by Kermack and McKendrick, we will rely on theoretical tools from network science as well as data on human mobility. The implications of these new computational and data-driven tools will be discussed in the context of the ongoing COVID-19 pandemic. As we will see, failure to account for the diversity and adaptiveness of human behavior undoubtedly affects epidemic forecasting as well as our basic understanding of epidemic dynamics.
|4:00pm–4:20pm||Wrap-up & Best Presentation Awards (Show Up to Vote):
1st place ($300), 2nd place ($200), 3rd place ($150)