StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation
The very first entirely self-aligned code LLM trained with a fully permissive and transparent pipeline.
- weights: https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1
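A minimal usage sketch with Hugging Face transformers, assuming the checkpoint above ships a chat template; the prompt and generation settings are illustrative, not taken from the model card.

```python
# Minimal sketch: load starcoder2-15b-instruct-v0.1 and ask for a function.
# Assumes a GPU with enough memory and that the checkpoint provides a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-15b-instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Write a Python function that checks if a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```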
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Prometheus 2 (8x7B) is an open-source evaluator language model. Compared to Prometheus 1 (13B), it shows improved evaluation performance and additionally supports pairwise ranking (relative grading). It reaches 72% to 85% agreement with human judgments across multiple pairwise ranking benchmarks.
Prometheus 2 (7B) is a lighter version of the Prometheus 2 (8x7B) model with reasonable performance: it outperforms Llama-2-70B, is on par with Mixtral-8x7B, and retains at least 80% of the evaluation performance of Prometheus 2 (8x7B).
GitHub: https://github.com/prometheus-eval/prometheus-eval
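For context, a rough sketch of rubric-based (absolute) grading with an evaluator LM. The model id, prompt wording, and score-parsing convention below are assumptions for illustration; the repository above provides the official prompt templates and a dedicated Python package.

```python
# Rough sketch of absolute (rubric-based) grading with an evaluator LM.
# Model id, prompt format, and the "[RESULT] <score>" convention are assumptions;
# see the prometheus-eval repo for the official templates.
from transformers import pipeline

judge = pipeline("text-generation",
                 model="prometheus-eval/prometheus-7b-v2.0",  # assumed model id
                 device_map="auto")

prompt = (
    "You are a fair judge. Evaluate the response below against the rubric.\n\n"
    "### Instruction:\nExplain what a race condition is.\n\n"
    "### Response:\nA race condition happens when two threads access shared state "
    "and the result depends on timing.\n\n"
    "### Rubric (1-5): correctness and completeness of the explanation.\n\n"
    "Write brief feedback, then output '[RESULT] <score>' on the last line."
)

feedback = judge(prompt, max_new_tokens=256, do_sample=False,
                 return_full_text=False)[0]["generated_text"]
score = feedback.rsplit("[RESULT]", 1)[-1].strip()  # naive parse of the final score
print(feedback, "\nParsed score:", score)
```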
ICLR 2024
6—11 May, Schedule
Workshops (some):
- Representational Alignment, papers
- Privacy Regulation and Protection in Machine Learning, papers
- LLM Agents
- How Far Are We From AGI
- Secure and Trustworthy Large Language Models
- Bridging the Gap Between Practice and Theory in Deep Learning
Papers
ICLR Proceedings at OpenReview
Better & Faster Large Language Models via Multi-token Prediction
LLMs are trained with a next-token prediction loss. The authors propose multi-token prediction as an improvement over next-token prediction for training language models on generative and reasoning tasks. The experiments (up to 7B parameters and 1T tokens) show that the benefit grows with model size, with particularly strong improvements on code tasks.
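A toy sketch of the idea: a shared trunk produces one hidden state per position, several output heads each predict the token k positions ahead, and the per-head losses are averaged. The GRU trunk and dimensions are stand-ins for illustration, not the paper's transformer architecture.

```python
# Toy multi-token prediction: one shared trunk, n_future output heads,
# head k predicts the token k positions ahead. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_future = 1000, 128, 4
embed = nn.Embedding(vocab, d_model)
trunk = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for the transformer trunk
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))

tokens = torch.randint(0, vocab, (2, 64))             # (batch, seq)
hidden, _ = trunk(embed(tokens))                       # shared representation per position

loss = 0.0
for k, head in enumerate(heads, start=1):              # head k predicts the token k steps ahead
    logits = head(hidden[:, :-k])                      # keep positions that have a target k ahead
    targets = tokens[:, k:]
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
(loss / n_future).backward()
```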
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as PPO.
However, in academic benchmarks, the SotA results are often achieved via reward-free methods, such as DPO.
Is DPO truly superior to PPO?
Through theoretical and experimental analysis, the authors explore the limitations of DPO and find that DPO is sensitive to the distribution shift between the base model outputs and the preference data. DPO also fails to improve performance on challenging tasks such as code generation. PPO, in contrast, demonstrates robust effectiveness across diverse tasks and achieves state-of-the-art results in challenging code competition tasks.
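For reference, the reward-free objective in question: a minimal sketch of the standard DPO loss over chosen/rejected response log-probabilities (not code from the paper).

```python
# Minimal DPO loss: maximize the margin between the policy's and the reference
# model's log-prob ratios on chosen vs. rejected responses (standard formulation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are summed log-probs of full responses, shape (batch,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# toy usage with random numbers standing in for real log-probs
batch = torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8)
print(dpo_loss(*batch))
```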
Open sourcing IBM’s Granite code models
IBM is releasing a family of Granite code models to the open-source community.
- paper
- github: https://github.com/ibm-granite
- models: https://huggingface.co/ibm-granite
Large Language Models Cannot Self-Correct Reasoning Yet
The research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.
AgentBench: Evaluating LLMs as Agents
AgentBench is a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities.
github: https://github.com/THUDM/AgentBench
AutoDev: Automated AI-Driven Development
One more agent-based framework for software engineering tasks (Microsoft). AutoDev enables AI Agents to autonomously interact with repositories, perform actions, and tackle complex software engineering tasks.
Research questions (RQs):
- How effective is AutoDev in a code generation task?
- How effective is AutoDev in a test generation task?
- How efficient is AutoDev in completing tasks?
The evaluation on the HumanEval dataset for code and test generation showed strong results: a Pass@1 score of 91.5% for code generation, the second-best result on the leaderboard at the time of writing and the best among approaches requiring no extra training data. AutoDev also performs well in test generation, with a Pass@1 score of 87.8% and 99.3% coverage from passing tests.
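Since the results are reported as Pass@1, here is the standard unbiased pass@k estimator used with HumanEval-style benchmarks, for context; it is generic, not AutoDev-specific code.

```python
# Standard unbiased pass@k estimator: given n samples per problem with c correct,
# pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 10 samples for a task, 6 of them passing the tests
print(pass_at_k(n=10, c=6, k=1))   # 0.6
```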
From Human-to-Human to Human-to-Bot Conversations in Software Engineering
The paper studies similarities and differences between human-to-human and human-to-bot conversations, comparing conversations between a software developer and:
1. a fellow software developer
2. an NLU-based chatbot
3. an LLM-based chatbot
Codestral
- 22B parameters
- 32K context window
- non-production license
HuggingFace: https://huggingface.co/mistralai/Codestral-22B-v0.1
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
The paper bridges the conceptual gap between SSMs and attention variants. It yields insights into why recent SSMs (e.g., Mamba) perform as well as Transformers on language modeling, and it suggests new ways to improve SSMs (and potentially Transformers) by connecting the algorithmic and systems advances on both sides.
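The duality can be seen in a toy example: a linear SSM computed as a recurrence produces exactly the same output as multiplying the input sequence by a lower-triangular, attention-like matrix. The scalar-state sketch below is a simplification for illustration; the paper treats the general structured case.

```python
# A 1-D linear SSM y_t = sum_{s<=t} c * a^(t-s) * b * x_s computed two ways:
# (1) as a recurrence, (2) as a lower-triangular (attention-like) matrix multiply.
import numpy as np

T, a, b, c = 8, 0.9, 0.5, 1.2
x = np.random.randn(T)

# (1) recurrent form: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t
h, y_rec = 0.0, np.zeros(T)
for t in range(T):
    h = a * h + b * x[t]
    y_rec[t] = c * h

# (2) matrix form: M[t, s] = c * a^(t-s) * b for s <= t, else 0
M = np.tril(c * a ** (np.arange(T)[:, None] - np.arange(T)[None, :]) * b)
y_mat = M @ x

print(np.allclose(y_rec, y_mat))   # True
```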
Multi-turn Reinforcement Learning from Preference Human Feedback
RLHF has become the standard approach for aligning LLMs with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal.
The authors propose novel methods for reinforcement learning from preference feedback between two full multi-turn conversations.
Algorithms for the multi-turn setting:
- Preference-based Q-function
- Multi-turn Preference Optimization algorithm
- MTPO with mixture policy
- Multi-turn RLHF
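To make the setting concrete, below is a generic trajectory-level Bradley-Terry preference loss in which the comparison is between two full multi-turn conversations rather than single turns. It illustrates the problem setup under simple assumptions and is not the paper's MTPO algorithm.

```python
# Generic trajectory-level preference loss: the Bradley-Terry comparison is applied
# to aggregated per-turn scores of two full conversations, not to single turns.
# This sketches the multi-turn setting, not the paper's MTPO algorithms.
import torch
import torch.nn.functional as F

def multi_turn_preference_loss(preferred_turn_scores: torch.Tensor,
                               rejected_turn_scores: torch.Tensor) -> torch.Tensor:
    """Each input holds per-turn scores for one side of a conversation pair, shape (batch, turns)."""
    preferred_total = preferred_turn_scores.sum(dim=-1)   # aggregate over the whole dialogue
    rejected_total = rejected_turn_scores.sum(dim=-1)
    return -F.logsigmoid(preferred_total - rejected_total).mean()

# toy usage: 4 conversation pairs, 5 turns each, random scores as stand-ins
loss = multi_turn_preference_loss(torch.randn(4, 5), torch.randn(4, 5))
print(loss)
```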