ml4se – Telegram
Machine Learning for Software Engineering
Large Language Models Cannot Self-Correct Reasoning Yet

The research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.
AgentBench: Evaluating LLMs as Agents

AgentBench is a multi-dimensional benchmark consisting of 8 distinct environments to assess the reasoning and decision-making abilities of LLMs acting as agents.

github: https://github.com/THUDM/AgentBench
AutoDev: Automated AI-Driven Development

Another agent-based framework for software engineering tasks, this one from Microsoft. AutoDev enables AI agents to autonomously interact with repositories, perform actions, and tackle complex software engineering tasks.

RQs:
- How effective is AutoDev in a code generation task?
- How effective is AutoDev in a test generation task?
- How efficient is AutoDev in completing tasks?

The evaluation on the HumanEval dataset for code and test generation showed strong results: a Pass@1 score of 91.5% for code generation, the second-best result on the leaderboard at the time of writing and the best among approaches requiring no extra training data. AutoDev also performs well in test generation, with a Pass@1 score of 87.8% and 99.3% coverage from passing tests.
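
For reference, Pass@k here is the standard HumanEval estimator from the Codex paper: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples per problem,
    c of which pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 samples with 18 passing gives pass@1 = 0.9
print(pass_at_k(20, 18, 1))
```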
From Human-to-Human to Human-to-Bot Conversations in Software Engineering

The study examines similarities and differences between human-to-human and human-to-bot conversations, comparing conversations between a software developer and
1. a fellow software developer
2. an NLU-based chatbot
3. an LLM-based chatbot
Codestral

- 22B parameters
- 32K context window
- non-production license

HuggingFace: https://huggingface.co/mistralai/Codestral-22B-v0.1
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

The paper bridges the conceptual gap between SSMs and attention variants. It yields insights into how recent SSMs (e.g., Mamba) match Transformers on language modeling, and it provides new ideas for improving SSMs (and potentially Transformers) by connecting algorithmic and systems advances on both sides.
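
The duality is easy to check numerically in the scalar-decay case: the SSM recurrence and a masked, attention-like matrix multiplication produce identical outputs. A minimal sketch (the per-step scalar decay and the dimensions are illustrative assumptions matching the SSD setting):

```python
import numpy as np

T, d = 6, 4                          # sequence length, state dimension
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, T)         # per-step scalar decay A_t
B = rng.normal(size=(T, d))          # input maps B_t
C = rng.normal(size=(T, d))          # output maps C_t
x = rng.normal(size=T)               # one input channel

# SSM (recurrent) form: h_t = a_t * h_{t-1} + B_t * x_t ; y_t = <C_t, h_t>
h, y_rec = np.zeros(d), np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Dual (attention-like) form: y = (L * (C @ B.T)) @ x, where the causal
# mask L[t, s] = a_{s+1} * ... * a_t is a 1-semiseparable matrix
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1:t + 1])   # empty product = 1 on the diagonal
y_mat = (L * (C @ B.T)) @ x

assert np.allclose(y_rec, y_mat)
```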
Multi-turn Reinforcement Learning from Preference Human Feedback

RLHF has become the standard approach for aligning LLMs with human preferences, allowing LLMs to demonstrate remarkable abilities across tasks. Existing methods model preferences at the single-decision (turn) level, which limits them in settings that require planning or multi-turn interaction to achieve a long-term goal.

The authors propose novel methods for reinforcement learning from preference feedback between two full multi-turn conversations (see the sketch after the list below).

Algorithms for the multi-turn setting:
- Preference-based Q-function
- Multi-turn Preference Optimization algorithm
- MTPO with mixture policy
- Multi-turn RLHF
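
A minimal sketch of the trajectory-level idea, assuming a Hugging Face-style causal LM and DPO-style optimization; it illustrates preference learning over full conversations, not the paper's exact MTPO objective:

```python
import torch
import torch.nn.functional as F

def conversation_logprob(model, tokens, agent_mask):
    """Sum of log-probs the model assigns to the agent's own tokens
    across every turn of one conversation (shapes: [B, T])."""
    logits = model(tokens[:, :-1]).logits
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, tokens[:, 1:, None]).squeeze(-1)
    return (token_logps * agent_mask[:, 1:]).sum(-1)

def trajectory_preference_loss(policy, ref, preferred, dispreferred, beta=0.1):
    """Bradley-Terry loss between two *full* conversations; each conversation
    is a (tokens, agent_mask) pair, and `ref` is a frozen reference model."""
    lw = conversation_logprob(policy, *preferred) - conversation_logprob(ref, *preferred)
    ll = conversation_logprob(policy, *dispreferred) - conversation_logprob(ref, *dispreferred)
    return -F.logsigmoid(beta * (lw - ll)).mean()
```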
HuggingFace: Agents 2.0 and langchain_huggingface

* Release of Transformers Agents 2.0
- new agents
- new agent framework
* A new package langchain_huggingface jointly maintained by Hugging Face and LangChain
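
A quick taste of the new package via its documented HuggingFacePipeline entry point (the model choice and generation settings here are placeholders):

```python
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct",   # any causal LM works here
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)
print(llm.invoke("Explain what an LLM agent is in one sentence."))
```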
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct

Training LLMs requires extensive high-quality data. But we can distill the knowledge of a powerful teacher model to guide a smaller model. This leads to a problem:

While the small model can achieve significant performance improvements, its final accuracy is unlikely to surpass that of the teacher model.

Moreover, although using closed-source models reduces costs compared to manual annotation, their cost remains high.

RQs:
1. Can we correct the incorrect knowledge generated by the teacher model to provide more accurate code for the student model?
2. Instead of relying on expensive closed-source teacher models, can we enable our student model to learn autonomously?

GitHub: https://github.com/bin123apple/AutoCoder
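
RQ2 hints at the core idea of verifying generated code by executing it and feeding errors back. A hypothetical sketch of such an execution-verified loop (function names and the retry policy are assumptions, not the paper's implementation):

```python
import subprocess, sys, tempfile

def execution_verified_sample(generate, task, max_rounds=3):
    """Accept a sample only once the generated solution-plus-tests executes
    cleanly, feeding execution errors back to the generator on each round."""
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)                       # model proposes solution + unit tests
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        try:
            run = subprocess.run([sys.executable, f.name],
                                 capture_output=True, text=True, timeout=30)
            error = run.stderr if run.returncode != 0 else None
        except subprocess.TimeoutExpired:
            error = "timeout"
        if error is None:
            return code                               # verified sample joins the training set
        prompt = f"{task}\nPrevious attempt failed with:\n{error}"
    return None                                       # discard samples that never verify
```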
CodeR: Issue Resolving with Multi-Agent and Task Graphs

CodeR adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs and add new features within a code Repository.

Agents:
• Manager: interacts with the user directly and is in charge of the whole issue-resolving task.
• Reproducer: generates a test that reproduces the issue.
• Fault Localizer: identifies the code regions that could cause the issue.
• Editor: performs the actual code changes.
• Verifier: runs the reproduction or integration tests to check whether the modifications have resolved the issue.
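
One way to picture a pre-defined task graph: a graph over the five roles with a retry edge from the Verifier back to the Editor. The wiring below is an illustrative assumption; CodeR defines several such graphs and plans over them:

```python
# Hypothetical linear task graph over the five roles.
TASK_GRAPH = {
    "manager":         ["reproducer"],
    "reproducer":      ["fault_localizer"],
    "fault_localizer": ["editor"],
    "editor":          ["verifier"],
    "verifier":        [],               # terminal; on failure, loop back to editor
}

def run_issue(agents: dict, issue: dict) -> dict:
    """Walk the graph from the manager, passing a shared context between agents."""
    ctx, node = {"issue": issue}, "manager"
    while node:
        ctx = agents[node](ctx)          # each agent reads and edits the shared context
        nxt = TASK_GRAPH[node]
        if node == "verifier" and not ctx.get("resolved") and ctx.get("retries", 0) < 3:
            ctx["retries"] = ctx.get("retries", 0) + 1
            node = "editor"              # retry the edit after a failed verification
        else:
            node = nxt[0] if nxt else None
    return ctx
```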
Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

The work proposes a zero-shot synthetic code detector based on the similarity between the code and its rewritten variants. The method relies on the intuition that the differences between the LLM-rewritten and original codes tend to be smaller when the original code is synthetic.
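
A minimal sketch of that intuition, with `rewrite` standing in for any LLM rewriting call and difflib standing in for the paper's learned code-similarity model:

```python
import difflib

def synthetic_score(code: str, rewrite, m: int = 4) -> float:
    """Rewrite the code m times with an LLM and average the similarity of each
    variant to the original; higher similarity suggests the original was
    itself LLM-generated."""
    sims = [difflib.SequenceMatcher(None, code, rewrite(code)).ratio() for _ in range(m)]
    return sum(sims) / m

# is_synthetic = synthetic_score(snippet, my_llm_rewrite) > threshold
# (threshold would be tuned on held-out human and synthetic code)
```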
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

DeepSeek-Coder-V2 is an open-source MoE code language model that achieves performance comparable to GPT-4-Turbo on code-specific tasks. Its training dataset is composed of 60% source code, 10% math corpus, and 30% natural-language corpus. DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6T tokens. The model expands its supported programming languages from 86 to 338 while extending the context length from 16K to 128K.

github: https://github.com/deepseek-ai/DeepSeek-Coder-V2
paper: https://arxiv.org/abs/2406.11931
16B Instruct: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
238B Instruct: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct
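
A minimal sketch for trying the Lite Instruct model with the standard transformers chat API (the prompt and generation settings are placeholders; the V2 architecture requires trust_remote_code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```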
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Can we systematically address jailbreak attacks?

It is difficult to prepare against all possible jailbreak queries, which is what current approaches like SFT attempt to do. However, varied jailbreak queries usually elicit related harmful responses that rely on the same underlying knowledge (e.g., detailed steps to make a bomb).

Consequently, directly unlearning that harmful knowledge from the LLM prevents it from generating harmful responses, even when confronted with unseen jailbreak prompts.

The authors implement an unlearning method named Safe Unlearning, with three complementary objectives (sketched below):
- minimizing the probability of generating harmful responses,
- maximizing the probability of rejecting harmful queries, and
- maintaining general performance on harmless queries.
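
A sketch of how the three objectives might combine into one training loss, assuming a Hugging Face-style causal LM; the weights and exact formulation are assumptions, not the paper's released loss:

```python
import torch.nn.functional as F

def nll(model, seq):
    """Token-level negative log-likelihood of a (query + response) sequence;
    for brevity, query tokens are not masked out as they would be in practice."""
    logits = model(seq[:, :-1]).logits
    return F.cross_entropy(logits.transpose(1, 2), seq[:, 1:])

def safe_unlearning_style_loss(policy, batch, alpha=1.0, beta=1.0):
    unlearn = -nll(policy, batch["harmful"])   # push harmful responses down (gradient ascent)
    reject = nll(policy, batch["refusal"])     # pull refusals to harmful queries up
    retain = nll(policy, batch["benign"])      # keep performance on harmless queries
    return alpha * unlearn + beta * reject + retain
```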

github: https://github.com/thu-coai/SafeUnlearning