Evals are so hot right now
![[so-hot-right-now-trending.gif]]
I think one of the reasons we talk so much about evals is that the term is very overloaded. It means something different to each person, and we don't yet have language or comfort around the full ontology of everything that falls under "Evals".
To complicate things further, evals on an API happen on both the user side (provided the user is sophisticated enough) and the vendor side.
### Evals are monitoring
For anyone running a non-deterministic API in production (be that a foundation model, or any composite system that relies on one), "online evals" are a way to verify that live performance does not degrade.
These can come in two flavors:
* Binary (alert when the eval score falls below a certain threshold)
* Scoring (track how the system is doing over time)
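The two flavors can be sketched together in a few lines. This is a minimal illustration, not a real monitoring stack: `run_eval`, the stubbed scores, and the threshold are all hypothetical stand-ins for whatever your system actually grades.

```python
import statistics

# Hypothetical grader: in production this would call the live system
# and score its response; here it just returns a canned score.
def run_eval(sample):
    return sample["score"]

samples = [{"score": s} for s in (0.9, 0.85, 0.7, 0.95)]
scores = [run_eval(s) for s in samples]
mean_score = statistics.mean(scores)

# Flavor 1, binary: alert when the mean falls below a threshold.
THRESHOLD = 0.8
alert = mean_score < THRESHOLD

# Flavor 2, scoring: append to a time series and watch the trend.
history = []  # in production: a metrics store, not a Python list
history.append(mean_score)

print(f"mean={mean_score:.2f} alert={alert}")
```

In practice the two flavors share the same plumbing; the binary version just puts a trigger on top of the scored one.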
### Evals are benchmarks
For the user of an API, evals are a way to compare between different vendors at the point of choice.
These can come in a few flavors:
* Publicly available benchmarks (e.g. `SWE-bench`)
* ELO-based arenas like LMArena
* Custom built by the user of an API
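The custom-built flavor reduces to running the same eval set against each candidate and tallying scores. Everything below is hypothetical: the two "vendor" functions stand in for real API clients, and the toy eval set stands in for your actual task distribution.

```python
# Hypothetical vendor clients; real ones would wrap each vendor's API.
def vendor_a(prompt: str) -> str:
    return prompt.upper()

def vendor_b(prompt: str) -> str:
    return prompt

# Toy eval set of (input, expected output) pairs.
eval_set = [("hello", "HELLO"), ("world", "WORLD")]

def accuracy(model) -> float:
    return sum(model(p) == want for p, want in eval_set) / len(eval_set)

leaderboard = {"vendor_a": accuracy(vendor_a), "vendor_b": accuracy(vendor_b)}
print(leaderboard)
```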
### Evals are acceptance testing
For a user of an API in live production, evals are a way to verify that v4.3 does not regress over v4.2, or that a new vendor performs as well as the old one and won't regress their system.
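A minimal acceptance gate might look like this, assuming per-task scores for both versions have already been computed (the task names, numbers, and tolerance below are invented for illustration):

```python
# Stubbed per-task eval scores for the old and new versions.
OLD = {"summarize": 0.92, "extract": 0.88}
NEW = {"summarize": 0.93, "extract": 0.81}

TOLERANCE = 0.02  # absorb eval noise; anything worse counts as a regression

def regressions(old: dict, new: dict, tol: float) -> set:
    return {task for task in old if new[task] < old[task] - tol}

failed = regressions(OLD, NEW, TOLERANCE)
accept = not failed  # gate the rollout on having no regressions
print("accept" if accept else f"blocked by {sorted(failed)}")
```

The tolerance matters: eval scores on non-deterministic systems are noisy, so gating on any drop at all would block nearly every release.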
### Evals are system testing
For the vendor of an API, evals are a way to approximate the customer's acceptance testing, or public benchmarks; something to run before releasing the next version.
### Evals are an ongoing scoreboard
For the vendor of an API, evals are a way to track how the model is doing week to week, something for a modeling team to "hill climb" on as they adjust architecture, data pipelines, and hyperparameters.
### Evals are a surrogate for customer experience
Philosophically, evals are a surrogate for how the consumer will experience the product. In traditional software engineering, this is done periodically via UX studies, where we watch non-deterministic users experience a deterministic product. This is the ultimate function of evals -- being a surrogate for the vibes of the user.
### Evals are a guidepost for a loss function
The flip side of the above is that, in some way, a loss function is a surrogate for the hill-climbing eval. So the platonic ideal is that customer preference flows down reliably across the different "flavors" of evals, all the way to the loss function we train on.
So it's surrogates upon surrogates, each arrow meaning "surrogate for" or "approximates":
![[Eval chain.png]]