Naomi Saphra
Give us a brief overview of your background:
I started doing research as an undergrad at Carnegie Mellon University, focusing on analyzing informal social media text, and continued with my PhD at Johns Hopkins. Eventually my advisor, Adam Lopez, moved to the University of Edinburgh, and I followed, gradually shifting my focus from social media to domain adaptation, then curriculum learning, and finally training dynamics.
During the move, I lost the ability to type and took a lot of medical leave so I could learn to dictate code. I knew I couldn’t win “races” against possible scooping because dictating code was so much slower, so I focused on questions I found interesting even though other people weren’t yet asking them. After graduating, I moved to NYU for a postdoc with Kyunghyun Cho. I’m now a Kempner Research Fellow at Harvard, continuing my work on understanding the training of language models.
Where are you located?
Cambridge, MA
What languages do you speak?
I’m an embarrassingly monolingual anglophone. Machine learning people sometimes describe me as a language person or even a linguist, which makes it all the more embarrassing that I actually speak only one language.
One thing you are looking forward to?
I’m most excited right now about a new project on trying to understand fish.
Describe your work with EleutherAI:
My work with Eleuther has focused on understanding memorization in language models. We are using memorization as a testbed for scientific approaches that treat model behaviors as diverse phenomena, best understood through taxonomies rather than through homogeneous treatments.
What issue has your attention in the industry?
I think one of the most urgent problems in language modeling right now is sourcing and attribution. These models are incredibly complex, so this is a very challenging problem. When a model makes a claim, people often trust it without question. Even worse, as these models are increasingly trained on junk LLM output scraped from the web, what little trustworthiness they do have may disintegrate.
Why do you contribute to EleutherAI?
I contribute to EAI because I want to understand the training process, and they are the only organization that releases training information at this level of detail. From the training corpus to intermediate checkpoints to optimizer states, they publish everything because they consider scientific understanding a priority, beyond simply building capable models.
What is important to you about Open Access to ML or LLMs?
We have two choices as a society. The first is to accept that one company will eventually emerge dominant with a monopoly on the models that shape our world. The other is to commit to public open source models.
What contribution are you most proud of?
My greatest contribution has probably been my advocacy for open-source models to release intermediate training checkpoints for scientific study. I’ve been cited in justifications for that decision, and I’ve released a manifesto in which I argue for developmental analysis over “interpretability creationism”, the practice of interpreting only a model’s end state while ignoring how it emerged over training.