ml4se
Machine Learning for Software Engineering
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation

The very first entirely self-aligned code LLM trained with a fully permissive and transparent pipeline.
- weights: https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1
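A minimal generation sketch using the Hugging Face transformers pipeline with the released weights; the plain-text prompt below is an assumption for illustration, so check the model card for the recommended chat format:

```python
# Minimal generation sketch for bigcode/starcoder2-15b-instruct-v0.1.
# The plain-text prompt is illustrative and may differ from the model's
# recommended chat template; assumes enough GPU memory for a 15B model.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="bigcode/starcoder2-15b-instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Write a Python function that checks whether a string is a palindrome."
output = generator(prompt, max_new_tokens=256, do_sample=False)
print(output[0]["generated_text"])
```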
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Prometheus 2 (8x7B) is an open-source evaluator language model. Compared to Prometheus 1 (13B), it shows improved evaluation performance and also supports assessment in the pairwise ranking (relative grading) format. It scores 72% to 85% agreement with human judgments across multiple pairwise ranking benchmarks.

Prometheus 2 (7B) is a lighter version of the Prometheus 2 (8x7B) model with reasonable performance (outperforming Llama-2-70B and on par with Mixtral-8x7B). It achieves at least 80% of the evaluation performance of Prometheus 2 (8x7B).

GitHub: https://github.com/prometheus-eval/prometheus-eval
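A rough sketch of relative grading with plain transformers; the model id and the judge prompt here are assumptions for illustration, not the official prometheus-eval template (see the GitHub repo for that):

```python
# Illustrative pairwise (relative grading) call; the model id and the prompt
# wording are assumptions, not the official prometheus-eval template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"  # assumed HF id of the 7B evaluator
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "You are a strict evaluator. Given the instruction and two responses, "
    "decide which response is better and answer with 'A' or 'B'.\n\n"
    "Instruction: Explain what a binary search does.\n"
    "Response A: It scans every element one by one.\n"
    "Response B: It repeatedly halves a sorted range to locate the target.\n"
    "Better response:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
verdict = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(verdict[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```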
Better & Faster Large Language Models via Multi-token Prediction

LLMs are trained with a next-token prediction loss. The authors propose multi-token prediction as an improvement over next-token prediction for training language models on generative and reasoning tasks. Experiments at up to 7B parameters and 1T tokens show that the benefit grows with model size, with particularly strong improvements on code tasks.
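A toy sketch of the idea: several output heads on a shared trunk, where head k predicts the token k positions ahead and the per-head cross-entropy losses are averaged. Shapes and head count are illustrative, not the paper's exact architecture:

```python
# Toy multi-token prediction loss: n_heads output heads on a shared trunk,
# head k predicts the token k positions ahead. Illustrative only.
import torch
import torch.nn.functional as F
from torch import nn

class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_heads)]
        )

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) trunk states; tokens: (batch, seq) input ids
        total = hidden.new_zeros(())
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])   # predict the token k steps ahead
            targets = tokens[:, k:]         # targets shifted by k positions
            total = total + F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        return total / len(self.heads)

# Usage with dummy trunk outputs:
hidden = torch.randn(2, 16, 32)
tokens = torch.randint(0, 100, (2, 16))
print(MultiTokenHead(32, 100).loss(hidden, tokens))
```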
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Prominent applications such as ChatGPT and Claude use reward-based methods, which first learn a reward model and then apply actor-critic algorithms such as PPO.

However, in academic benchmarks, the SotA results are often achieved via reward-free methods, such as DPO.

Is DPO truly superior to PPO?

Through theoretical and experimental analysis, the authors explore the limitations of DPO and find that it is sensitive to the distribution shift between the base model's outputs and the preference data. Moreover, DPO fails to improve performance on challenging tasks such as code generation. PPO, in contrast, demonstrates robust effectiveness across diverse tasks and achieves state-of-the-art results on challenging code competition tasks.
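For reference, a minimal sketch of the reward-free DPO objective discussed above; it only needs per-sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model:

```python
# Minimal DPO loss: inputs are summed log-probabilities of the chosen and
# rejected responses under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit reward margin: beta * (chosen log-ratio minus rejected log-ratio)
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Dummy per-sequence log-probs for a batch of 4 preference pairs:
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```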
Open sourcing IBM’s Granite code models

IBM is releasing a family of Granite code models to the open-source community.

- paper
- github: https://github.com/ibm-granite
- models: https://huggingface.co/ibm-granite
Large Language Models Cannot Self-Correct Reasoning Yet

The research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.
AgentBench: Evaluating LLMs as Agents

AgentBench is a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities.

github: https://github.com/THUDM/AgentBench
AutoDev: Automated AI-Driven Development

Another agent-based framework for software engineering tasks, this time from Microsoft. AutoDev enables AI agents to autonomously interact with repositories, perform actions, and tackle complex software engineering tasks.

RQs:
- How effective is AutoDev in a code generation task?
- How effective is AutoDev in a test generation task?
- How efficient is AutoDev in completing tasks?

The evaluation on the HumanEval dataset for code and test generation showed strong results: a Pass@1 score of 91.5% for code generation, the second-best result on the leaderboard at the time of writing and the best among approaches that require no extra training data. AutoDev also performs well in test generation, with a Pass@1 score of 87.8% and 99.3% coverage from passing tests.
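For context, Pass@k here refers to the standard HumanEval metric; below is a sketch of the unbiased estimator from the HumanEval paper (generic metric code, not AutoDev's own evaluation harness):

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# given n samples per problem of which c pass, estimate 1 - C(n-c, k) / C(n, k).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 10 samples per task, 9 of them pass -> pass@1 estimate of 0.9
print(pass_at_k(n=10, c=9, k=1))
```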
From Human-to-Human to Human-to-Bot Conversations in Software Engineering

The paper examines similarities and differences between human-to-human and human-to-bot conversations in software engineering, comparing conversations between a software developer and
1. a fellow software developer
2. an NLU-based chatbot
3. an LLM-based chatbot
Codestral

- 22B parameters
- 32K context window
- non-production license

HuggingFace: https://huggingface.co/mistralai/Codestral-22B-v0.1
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

The paper bridges the conceptual gap between SSMs and attention variants. It yields insights on how recent SSMs (e.g. Mamba) perform as well as Transformers on language modeling. Also, it provides new ideas to improve SSMs (and potentially Transformers) by connecting the algorithmic and systems advances on both sides.
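A toy illustration of the duality for a scalar, time-invariant SSM: the recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t can equivalently be computed as multiplication by a lower-triangular, attention-like matrix. The paper generalizes this correspondence to selective SSMs such as Mamba:

```python
# Toy state space duality: a scalar linear recurrence equals multiplication
# by a lower-triangular matrix M[t, s] = c * a**(t-s) * b, the "masked
# attention" view of the same computation.
import torch

def ssm_recurrent(x, a, b, c):
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt      # state update
        ys.append(c * h)        # readout
    return torch.stack(ys)

def ssm_matrix(x, a, b, c):
    T = x.shape[0]
    t = torch.arange(T).view(-1, 1)
    s = torch.arange(T).view(1, -1)
    M = c * (a ** (t - s).clamp(min=0).float()) * b  # a^(t-s) along each diagonal
    M = torch.tril(M)                                # causal (lower-triangular) mask
    return M @ x

x = torch.randn(8)
print(torch.allclose(ssm_recurrent(x, 0.9, 0.5, 1.2), ssm_matrix(x, 0.9, 0.5, 1.2)))
```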
Multi-turn Reinforcement Learning from Preference Human Feedback

RLHF has become the standard approach for aligning LLMs with human preferences, allowing them to demonstrate remarkable abilities across a variety of tasks. Existing methods emulate preferences at the single-decision (turn) level, which limits them in settings that require planning or multi-turn interaction to achieve a long-term goal.

The authors propose novel methods for reinforcement learning from preference feedback between two full multi-turn conversations.

Algorithms for the multi-turn setting:
- Preference-based Q-function
- Multi-turn Preference Optimization algorithm
- MTPO with mixture policy
- Multi-turn RLHF
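A hypothetical illustration (not the authors' MTPO algorithm, and omitting the reference-model term for brevity): the key shift from single-turn preference optimization is that log-probabilities are summed over all assistant turns of a full conversation before the two trajectories are compared:

```python
# Hypothetical illustration, not the paper's MTPO algorithm: a trajectory-level
# preference loss where log-probs are summed over all assistant turns of each
# conversation before a DPO-style comparison (reference-model term omitted).
import torch
import torch.nn.functional as F

def trajectory_logp(per_turn_logps):
    # per_turn_logps: list of per-turn log-probabilities for one conversation
    return torch.stack(per_turn_logps).sum()

def multi_turn_preference_loss(preferred_turns, rejected_turns, beta: float = 0.1):
    margin = beta * (trajectory_logp(preferred_turns) - trajectory_logp(rejected_turns))
    return -F.logsigmoid(margin)

preferred = [torch.tensor(-3.1), torch.tensor(-2.4), torch.tensor(-1.9)]  # 3 assistant turns
rejected = [torch.tensor(-4.0), torch.tensor(-3.6)]                       # 2 assistant turns
print(multi_turn_preference_loss(preferred, rejected))
```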