ml4se
Machine Learning for Software Engineering
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Prometheus 2 (8x7B) is an open-source evaluator language model. Compared to Prometheus 1 (13B), it shows improved evaluation performance and also supports assessment in pairwise ranking (relative grading) formats. It reaches 72% to 85% agreement with human judgments across multiple pairwise ranking benchmarks.

Prometheus 2 (7B) is a lighter version of the Prometheus 2 (8x7B) model with solid performance: it outperforms Llama-2-70B, is on par with Mixtral-8x7B, and retains at least 80% of the evaluation performance of Prometheus 2 (8x7B).

GitHub: https://github.com/prometheus-eval/prometheus-eval
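
A hypothetical sketch of what pairwise (relative) grading with an evaluator LM can look like. The prompt layout below is an illustrative assumption, not the library's actual API or templates (see the repo above for those); the model id is assumed to be the 7B checkpoint on the Hub:

```python
from transformers import pipeline

# Hypothetical sketch of relative grading with an evaluator LM. The prompt
# format here is an illustrative assumption, not the prometheus-eval API.
judge = pipeline("text-generation", model="prometheus-eval/prometheus-7b-v2.0")

def relative_grade(instruction: str, response_a: str, response_b: str, rubric: str) -> str:
    prompt = (
        f"###Instruction: {instruction}\n"
        f"###Response A: {response_a}\n"
        f"###Response B: {response_b}\n"
        f"###Rubric: {rubric}\n"
        "###Which response is better? Answer A or B:"
    )
    out = judge(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    verdict = out[len(prompt):]  # keep only the newly generated text
    return "A" if "A" in verdict else "B"
```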
Better & Faster Large Language Models via Multi-token Prediction

LLMs are typically trained with a next-token prediction loss. The authors propose multi-token prediction as an improvement over next-token prediction for training language models on generative and reasoning tasks. Experiments (up to 7B parameters and 1T tokens) show that the benefit grows with model size, with particularly strong improvements on code tasks.
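
A minimal sketch of the idea (my own illustration, not the paper's exact architecture): a shared trunk feeds n independent output heads, each predicting one of the next n tokens, and the per-head losses are averaged:

```python
import torch
import torch.nn as nn

# Illustrative multi-token prediction: a shared trunk produces hidden states,
# and each head predicts the token at a different future offset.
class MultiTokenHead(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk  # any causal backbone: (B, T) -> (B, T, d_model)
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))
        self.n_future = n_future

    def loss(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.trunk(tokens)                  # (B, T, d_model)
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(h[:, :-k])            # head k predicts the token at offset k
            target = tokens[:, k:]
            total = total + nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return total / self.n_future            # at inference, only head 1 is needed
```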
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Prominent applications such as ChatGPT and Claude use reward-based methods, which first learn a reward model and then apply actor-critic algorithms such as PPO.

However, on academic benchmarks, state-of-the-art results are often achieved by reward-free methods such as DPO.

Is DPO truly superior to PPO?

Through theoretical and experimental analysis, the authors explore the limitations of DPO and find that it is sensitive to the distribution shift between the base model's outputs and the preference data. DPO also fails to improve performance on challenging tasks such as code generation. PPO, by contrast, demonstrates robust effectiveness across diverse tasks and achieves state-of-the-art results on challenging code competition tasks.
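
For reference, the DPO objective under scrutiny, in a minimal sketch (the standard formulation, not the authors' code). The dependence on the reference model's log-probs is where sensitivity to distribution shift enters:

```python
import torch
import torch.nn.functional as F

# Standard DPO loss: increase the policy/reference log-prob ratio of the
# chosen response relative to the rejected one.
def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta: float = 0.1):
    # each argument: summed log-prob of a full response under policy / reference
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin)

# toy numbers for one preference pair
loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.1),
                torch.tensor(-12.8), torch.tensor(-14.9))
```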
Open sourcing IBM’s Granite code models

IBM is releasing a family of Granite code models to the open-source community.

- paper
- github: https://github.com/ibm-granite
- models: https://huggingface.co/ibm-granite
Large Language Models Cannot Self-Correct Reasoning Yet

The research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.
AgentBench: Evaluating LLMs as Agents

AgentBench is a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities.

github: https://github.com/THUDM/AgentBench
AutoDev: Automated AI-Driven Development

Another agent-based framework for software engineering tasks, this time from Microsoft. AutoDev enables AI agents to autonomously interact with repositories, perform actions, and tackle complex software engineering tasks.

RQs:
- How effective is AutoDev in a code generation task?
- How effective is AutoDev in a test generation task?
- How efficient is AutoDev in completing tasks?

The evaluation on the HumanEval dataset for code and test generation showed strong results: a Pass@1 score of 91.5% for code generation, the second-best result on the leaderboard at the time of writing and the best among approaches that require no extra training data. AutoDev also performs well in test generation, with a Pass@1 score of 87.8% and 99.3% coverage from passing tests.
From Human-to-Human to Human-to-Bot Conversations in Software Engineering

The paper examines similarities and differences between human-to-human and human-to-bot conversations in software engineering, comparing conversations between a software developer and:
1. a fellow software developer
2. an NLU-based chatbot
3. an LLM-based chatbot
Codestral

- 22B parameters
- 32K context window
- non-production license

HuggingFace: https://huggingface.co/mistralai/Codestral-22B-v0.1
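
A minimal load-and-generate sketch via transformers using the checkpoint above; generation settings are illustrative, and access to the gated repo (accepting the non-production license on Hugging Face) is assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Plain code completion with Codestral; requires accepting the model license
# on the Hub and enough GPU memory for a 22B model (device_map needs accelerate).
model_id = "mistralai/Codestral-22B-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "def fibonacci(n: int) -> int:\n"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```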
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

The paper bridges the conceptual gap between SSMs and attention variants. It yields insights into how recent SSMs (e.g., Mamba) perform as well as Transformers on language modeling, and it suggests new ways to improve SSMs (and potentially Transformers) by connecting the algorithmic and systems advances on both sides.
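
To make the duality concrete, a tiny numeric sketch (my own illustration with a scalar state, not the paper's code): the same sequence map computed once as a gated linear recurrence (the SSM view) and once as multiplication by a semiseparable lower-triangular matrix (the attention-like view):

```python
import torch

T, d = 6, 4
torch.manual_seed(0)
x = torch.randn(T, d)
a = torch.rand(T)   # per-step decay (scalar SSM transition)
B = torch.randn(T)  # input projection (scalar state for simplicity)
C = torch.randn(T)  # output projection

# SSM view: sequential recurrence h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t * h_t
h = torch.zeros(d)
y_rec = []
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec.append(C[t] * h)
y_rec = torch.stack(y_rec)

# Attention-like view: y = M @ x with M[t, s] = C_t * (a_{s+1} * ... * a_t) * B_s
M = torch.zeros(T, T)
for t in range(T):
    for s in range(t + 1):
        decay = torch.prod(a[s + 1 : t + 1]) if s < t else torch.tensor(1.0)
        M[t, s] = C[t] * decay * B[s]
y_mat = M @ x

assert torch.allclose(y_rec, y_mat, atol=1e-5)  # the two views coincide
```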
Multi-turn Reinforcement Learning from Preference Human Feedback

RLHF has become the standard approach for aligning LLMs with human preferences, allowing LLMs to demonstrate remarkable abilities across various tasks. Existing methods emulate preferences at the level of a single decision (turn), which limits them in settings that require planning or multi-turn interaction to achieve a long-term goal.

The authors propose novel methods for reinforcement learning from preference feedback given between two full multi-turn conversations.

Algorithms for the multi-turn setting:
- Preference-based Q-function
- Multi-turn Preference Optimization (MTPO) algorithm
- MTPO with a mixture policy
- Multi-turn RLHF
HuggingFace: Agents 2.0 and langchain_huggingface

* Release of Transformers Agents 2.0
- new agents
- new agent framework
* A new package, langchain_huggingface, jointly maintained by Hugging Face and LangChain (minimal usage sketch below)
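
A minimal usage sketch of the new package; the model id and generation settings here are illustrative assumptions:

```python
from langchain_huggingface import HuggingFacePipeline

# Wrap a local transformers pipeline as a LangChain LLM; model id and
# generation settings are illustrative.
llm = HuggingFacePipeline.from_model_id(
    model_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)
print(llm.invoke("Explain RLHF in one sentence."))
```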
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct

Training LLMs requires extensive high-quality data, but we can distill the knowledge of a powerful teacher model to guide a smaller model. This leads to problems:

While the small model can achieve significant performance improvements, its final accuracy is unlikely to surpass that of the teacher model.

Moreover, although closed-source teacher models reduce costs compared to manual annotation, they are still expensive to use.

RQs:
1. Can we correct the incorrect knowledge generated by the teacher model to provide more accurate code for the student model?
2. Instead of relying on expensive closed-source teacher models, can we enable our student model to learn autonomously?

GitHub: https://github.com/bin123apple/AutoCoder
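
An illustrative sketch (my own, not the AutoCoder implementation) of the execution-verification idea behind these research questions: run generated code against unit tests and feed failures back for another round, instead of trusting the teacher model's output blindly:

```python
import subprocess
import sys
import tempfile

# Generate code, execute it with its unit tests, and retry with the error
# message appended; unverified samples are discarded. `generate` stands in
# for any code-LLM call (a hypothetical callable: prompt -> code string).
def validate_and_refine(generate, code_prompt, tests, max_rounds=3):
    prompt = code_prompt
    for _ in range(max_rounds):
        code = generate(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + tests)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code  # tests pass: keep this sample
        # tests fail: show the traceback and ask for a corrected version
        prompt = (code_prompt + "\n\nPrevious attempt:\n" + code
                  + "\n\nIt failed with:\n" + result.stderr
                  + "\nPlease fix the code.")
    return None  # could not verify within the budget
```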