Institute for Digital Research and Education
Speaker: Bryor Snefjella, Ph.D.
IDRE Scholar,
Psychology Department,
University of California Los Angeles
Video Recording: https://youtu.be/RTpW1-FtBGs
Abstract: Inquiry in the language sciences makes extensive use of open-source data sets. Examples include data sets of hand-annotations of words for properties such as their connotation and familiarity. Other common types of open-source resources include behavioural or neuroimaging recordings of responses to linguistic stimuli in controlled experiments, or measurements taken from massive repositories of digitized natural language use. A challenge in the language sciences is extensive missing data in extant open-source data sets. Most data sets contain information on orders of magnitude fewer words than an average speaker knows, and the words they do contain are non-randomly sampled and non-overlapping. A commonly proposed remedy to this missing data is to replace hand-annotation with machine learning. This is the approach taken by the English Lexicon Imputation Project, the first comprehensive resource of word-level annotations created in cognitive science. In this talk I present the resource, the Bayesian deep neural network used to create it, and how missing data methodology was key to overcoming the limitations of prior literature on computational linguistic resource generation. The talk should be of interest to computational social scientists, language scientists, and those interested in deep learning and missing data methods.
About speaker: Bryor Snefjella is a postdoctoral researcher in the Psychology Department, Cognitive Area, mentored by Idan Blank, Keith Holyoak, and Hongjing Lu. Before moving to UCLA, Bryor received a PhD in Cognitive Science of Language from McMaster University in Canada. His research on language use patterns in social media has received international media attention. Check him out on his personal website, Twitter, LinkedIn, and ResearchGate.