Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a model customized to optimize for human preference. As a result, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.
https://twitter.com/lmarena_ai/status/1909397817434816562
lmarena.ai (formerly lmsys.org) (@lmarena_ai):
We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences. […]
Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in.
Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
https://twitter.com/Ahmad_Al_Dahle/status/1909302532306092107
Ahmad Al-Dahle (@Ahmad_Al_Dahle):
We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models.
That said, we're also hearing some reports of mixed quality across different services. […]