Alignment

Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.

Current Projects

Featured

Eliciting Latent Knowledge

Alignment MineTest

Mesaoptimization

Releases

Featured

Library

trlX

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

tuned-lens

Library

A library implementing the Tuned Lens, along with other tools for extracting, manipulating, and studying the learned representations of transformers across layers.

Dataset

Simulacra Aesthetic Captions

A dataset of prompts, synthetic AI-generated images, and aesthetic ratings of those images.

Papers

Featured

Feb 12, 2024

arXiv

Suppressing Pink Elephants with Direct Principle Feedback

Dec 16, 2023

ICLR

Quality-Diversity through AI Feedback

Dec 14, 2023

NeurIPS

The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs

Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette. "Large language models are not zero-shot communicators." arXiv preprint arXiv:2210.14986, 2022.

Dec 9, 2023

ICML Workshop

Do LLMs selectively encode the goal of an agent's reach?

Dec 8, 2023

EMNLP

trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) utilizes human feedback to better align large language models with human preferences via online optimization against a learned reward model. Current RLHF paradigms rely on Proximal Policy Optimization (PPO), which quickly becomes a challenge to implement and scale up to large architectures. To address this difficulty we present the trlX library as a feature-complete open-source framework for RLHF fine-tuning of models up to and exceeding 70 billion parameters. We implement support for multiple types of distributed training including distributed data parallel, model sharded, as well as tensor, sequential, and pipeline parallelism.

To increase the accessibility of RLHF to researchers, we implement compute- and memory-saving features that give trlX the flexibility to support users with a wide range of compute resources. This includes offline RL methods like Implicit Language Q Learning (ILQL), low-rank adapters, and the Hydra architecture. We find offline fine-tuning offers competitive performance relative to online algorithms while being easier to implement, train, and scale. To evaluate our framework we train RLHF models on two separate well-known tasks using publicly available human preference data. Models trained with trlX achieve preference win-rates over baselines at rates comparable to the original works.
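
As a rough illustration of why low-rank adapters count as a memory-saving feature, the sketch below compares the number of trainable parameters in one full weight matrix with the number in a rank-r adapter for that matrix. This is only arithmetic; it is not trlX code, and the function names and the layer sizes are assumptions chosen for the example.

```python
# Hypothetical helpers, not part of trlX: count trainable parameters
# for one linear layer with and without a low-rank adapter.

def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is trained."""
    return d_in * d_out


def adapter_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters updated when only two low-rank factors are trained.

    The adapter is a pair of matrices of shape (rank x d_in) and
    (d_out x rank); the original weight matrix stays frozen.
    """
    return rank * (d_in + d_out)


# Assumed example sizes: a 4096 x 4096 projection with a rank-8 adapter.
full = full_trainable_params(4096, 4096)
adapter = adapter_trainable_params(4096, 4096, 8)
reduction = full // adapter
```

With these assumed sizes the adapter trains 65,536 parameters instead of 16,777,216, a 256-fold reduction for that layer. Gradients and optimizer state are kept only for trainable parameters, so they shrink accordingly.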

Oct 2, 2023

Representation Engineering: A Top-Down Approach to AI Transparency

May 25, 2023

arXiv

Role-Play with Large Language Models

Feb 9, 2023

Alignment Forum

Anomalous tokens reveal the original identities of Instruct models

I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models on the OpenAI API with the base models they were initialized from. Prompting GPT-3 models with these tokens causes aberrant and correlated behaviors, and I found that the correlation is preserved between base models and Instruct versions, thereby exposing a "fingerprint" inherited from pretraining.

I was inspired to try this by JDP's proposal to fingerprint generalization strategies using correlations in model outputs on out-of-distribution inputs. This post describes his idea and the outcome of my experiment, which I think is positive evidence that this "black box cryptanalysis"-inspired approach to fingerprinting models is promising.
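
The matching idea described above can be sketched in a few lines, under heavy assumptions. Suppose each model is summarized by one number per anomalous token (for example, how often it produces a particular aberrant response), and an Instruct model is paired with whichever base model has the most strongly correlated profile. The scores, the model names, and the choice of Pearson correlation are all hypothetical placeholders, not the measurements or the method used in the post.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_match(target, candidates):
    """Return the candidate name whose profile correlates most with target."""
    return max(candidates, key=lambda name: pearson(target, candidates[name]))


# Made-up per-token scores, one entry per anomalous token (illustrative only).
base_profiles = {
    "base_a": [0.9, 0.1, 0.8, 0.2],
    "base_b": [0.2, 0.7, 0.1, 0.9],
}
instruct_profile = [0.8, 0.2, 0.7, 0.1]

matched = best_match(instruct_profile, base_profiles)
```

On this toy data the Instruct profile pairs with "base_a", because the two profiles rise and fall on the same tokens. A real experiment would need many more tokens and a check that the correlation is not produced by chance.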

Oct 15, 2022

arXiv

Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning

Oct 24, 2021

Alignment Forum

Towards Deconfusing Gradient Hacking