ml4se – Telegram
Machine Learning for Software Engineering
Multi-turn Reinforcement Learning from Preference Human Feedback

RLHF has become the standard approach for aligning LLMs with human preferences, enabling them to demonstrate remarkable abilities across a wide range of tasks. However, existing methods emulate preferences at the level of a single decision (turn), which limits them in settings that require planning or multi-turn interactions to achieve a long-term goal.

The authors propose novel methods for reinforcement learning from preference feedback between two full multi-turn conversations.

Algorithms for the multi-turn setting:
- Preference-based Q-function
- Multi-turn Preference Optimization algorithm
- MTPO with mixture policy
- Multi-turn RLHF
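
As a rough illustration of the trajectory-level idea (this is a DPO-style sketch in that spirit, not the paper's exact MTPO update; tensor names and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def multi_turn_preference_loss(policy_logps_w, policy_logps_l,
                               ref_logps_w, ref_logps_l, beta=0.1):
    """Preference loss over full conversations.

    Each argument is a 1D tensor of per-token log-probs of the agent's
    turns in the preferred (w) / dispreferred (l) conversation, covering
    every agent turn in the dialogue -- not just the last one.
    """
    # Trajectory-level log-ratios between policy and reference model.
    ratio_w = policy_logps_w.sum() - ref_logps_w.sum()
    ratio_l = policy_logps_l.sum() - ref_logps_l.sum()
    # Bradley-Terry preference between the two whole conversations.
    return -F.logsigmoid(beta * (ratio_w - ratio_l))
```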
HuggingFace: Agents 2.0 and langchain_huggingface

* Release of Transformers Agents 2.0
- new agents
- new agent framework
* A new package langchain_huggingface jointly maintained by Hugging Face and LangChain
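
A minimal usage sketch of the new package (the model id and generation settings below are arbitrary choices, not from the announcement):

```python
from langchain_huggingface import HuggingFacePipeline

# Run a Hub model locally through a transformers pipeline.
llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",  # any causal LM on the Hub works here
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)
print(llm.invoke("Explain what a code smell is in one sentence."))
```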
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct

Training LLMs requires extensive high-quality data. One workaround is to distill the knowledge of a powerful teacher model to guide a smaller student model, but this leads to a problem:

While the small model can achieve significant performance improvements, the final accuracy of the small model is unlikely to surpass that of the teacher model.

Moreover, although using closed-source models reduces costs compared to manual annotation, those API costs remain high.

RQs:
1. Can we correct the incorrect knowledge generated by the teacher model to provide more accurate code for the student model?
2. Instead of relying on expensive closed-source teacher models, can we enable our student model to learn autonomously?

GitHub: https://github.com/bin123apple/AutoCoder
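
AIEV-Instruct pairs generation with execution verification so that only code that actually runs enters the training set. The sketch below shows the general shape of such a loop; `llm_generate` is a hypothetical stand-in for whatever model produces or repairs the code:

```python
import subprocess, sys, tempfile

def run_candidate(code: str) -> tuple[bool, str]:
    """Execute candidate code (with its embedded asserts/tests) in a
    subprocess and report whether it ran cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stderr

def verified_example(task: str, llm_generate, max_rounds: int = 3):
    """Keep asking the model to repair its code until the tests pass."""
    code = llm_generate(task)
    for _ in range(max_rounds):
        ok, err = run_candidate(code)
        if ok:
            return (task, code)  # execution-verified instruction/answer pair
        code = llm_generate(f"{task}\n\nYour code failed with:\n{err}\nFix it.")
    return None                  # discard examples that never verify
```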
CodeR: Issue Resolving with Multi-Agent and Task Graphs

CodeR adopts a multi-agent framework and pre-defined task graphs to repair and resolve reported bugs and to add new features within a code repository.

Agents:
• Manager: The manager is an agent who interacts with the user directly and is in charge of the whole issue-resolving task.
• Reproducer: The reproducer is an agent that is responsible for generating a test to reproduce the issue.
• Fault Localizer: The fault localizer is an agent that identifies the code regions that could cause the issue.
• Editor: The editor is the one who performs the actual code changes.
• Verifier: The verifier is an agent that runs the reproduction or integration tests to check whether the modifications have resolved the issue.
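
A toy sketch of how these roles might compose into a task-graph pipeline (the structure is illustrative; CodeR's actual graphs and agent prompts live in the paper):

```python
from typing import Callable

Agent = Callable[[dict], dict]  # each agent maps shared context -> updates

def manager(ctx):         return {"plan": f"resolve {ctx['issue']}"}
def reproducer(ctx):      return {"repro_test": "test_issue_1234.py"}
def fault_localizer(ctx): return {"suspect_files": ["src/module.py"]}
def editor(ctx):          return {"patch": "<patch for src/module.py>"}
def verifier(ctx):        return {"resolved": True}  # would rerun repro_test

# A linear task graph; CodeR's real graphs can branch and loop back.
TASK_GRAPH: list[Agent] = [manager, reproducer, fault_localizer,
                           editor, verifier]

ctx = {"issue": "#1234: crash on empty input"}
for agent in TASK_GRAPH:
    ctx.update(agent(ctx))
print(ctx["resolved"])  # True once the verifier signs off
```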
Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

The work proposes a zero-shot synthetic code detector based on the similarity between a piece of code and its rewritten variants. The method relies on the intuition that the difference between the original code and its LLM-rewritten version tends to be smaller when the original code is itself LLM-generated.
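
A minimal sketch of the scoring idea, using difflib for similarity; `llm_rewrite` is a hypothetical stand-in for the rewriting model:

```python
import difflib

def detection_score(code: str, llm_rewrite, n: int = 4) -> float:
    """Average similarity between the code and n LLM rewrites.
    Higher scores suggest the original code was LLM-generated."""
    sims = []
    for _ in range(n):
        rewritten = llm_rewrite(f"Rewrite this code:\n{code}")
        sims.append(difflib.SequenceMatcher(None, code, rewritten).ratio())
    return sum(sims) / len(sims)

# is_synthetic = detection_score(snippet, llm_rewrite) > threshold
```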
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

DeepSeek-Coder-V2 is an open-source MoE code language model that achieves performance comparable to GPT4-Turbo on code-specific tasks. Its pre-training dataset is composed of 60% source code, 10% math corpus, and 30% natural-language corpus. DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6T tokens. The model expands its supported programming languages from 86 to 338, while extending the context length from 16K to 128K.

github: https://github.com/deepseek-ai/DeepSeek-Coder-V2
paper: https://arxiv.org/abs/2406.11931
16B Instruct: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
236B Instruct: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct
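
A quick-start sketch for the Lite model with transformers (the prompt and generation settings are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Write quicksort in Python."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
print(tok.decode(model.generate(inputs, max_new_tokens=256)[0],
                 skip_special_tokens=True))
```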
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Can we systematically address jailbreak attacks?

It is difficult to prepare for every possible jailbreak query, which is what current approaches such as SFT attempt to do. Crucially, these queries usually elicit related harmful responses that rely on the same underlying knowledge (e.g., detailed steps to make a bomb).

Consequently, directly unlearning harmful knowledge from the LLM prevents it from generating harmful responses, even when confronted with unseen jailbreak prompts.

The authors propose an unlearning method named Safe Unlearning, which implements three complementary objectives:
- minimizing the probability of generating harmful responses,
- maximizing the probability of rejecting harmful queries, and
- maintaining general performance on harmless queries

github: https://github.com/thu-coai/SafeUnlearning
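
A rough sketch of how these three objectives might combine into one training loss (the weights and clamp bound are assumptions; the paper's exact formulation may differ):

```python
import torch

def safe_unlearning_loss(nll_harmful, nll_reject, nll_harmless,
                         w1=1.0, w2=1.0, w3=1.0, bound=5.0):
    """Inputs are the model's mean NLLs on: harmful responses, refusal
    responses to harmful queries, and normal responses to harmless queries."""
    # Unlearn: push up the NLL of harmful responses (gradient ascent),
    # clamped so this term cannot diverge (the bound is our assumption).
    loss_unlearn = -torch.clamp(nll_harmful, max=bound)
    loss_reject = nll_reject    # make refusals likely on harmful queries
    loss_retain = nll_harmless  # preserve behavior on harmless queries
    return w1 * loss_unlearn + w2 * loss_reject + w3 * loss_retain
```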
Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

In complex code generation tasks, utilizing the diversity encoded in LMs helps generate correct outputs.

RQs:
- Can recent code LMs generate sufficiently diverse solutions to specific problems?
- Is there a correlation between the diversity and correctness of the generated codes?
- Do advanced code generation strategies enhance both code diversity and correctness?

The authors observe that existing code LMs tend to generate functionally correct code with limited diversity.
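
One simple way to quantify the kind of inter-solution diversity the paper asks about, using pairwise string similarity as a crude proxy (the paper's own metrics may differ):

```python
import difflib
from itertools import combinations

def solution_diversity(solutions: list[str]) -> float:
    """1 - mean pairwise similarity over sampled solutions.
    0 means all samples are identical; higher means more diverse."""
    pairs = list(combinations(solutions, 2))
    sim = sum(difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return 1.0 - sim / len(pairs)

print(solution_diversity(["def f(x): return x * 2",
                          "def f(x): return x + x",
                          "def f(x): return x * 2"]))
```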
Understanding Defects in Generated Codes by Language Models

LLMs sometimes generate code with defects.

RQs:
- What are the types of defects in the generated code, and how can they be classified based on their characteristics?
- Can existing prompt engineering techniques help in fixing the problematic code?
Diffusion is spectral autoregression

Autoregression and diffusion are currently the two dominant generative modelling paradigms. And they aren’t all that different: diffusion models of images perform approximate autoregression in the frequency domain.

Colab: https://colab.research.google.com/drive/1siywvhvl1OxI1UmqRrJHiFUK0M5SHlcx
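
The argument rests on the fact that natural images have roughly power-law radially averaged power spectra, so denoising from high to low noise levels fills in frequencies coarse-to-fine. A self-contained sketch of measuring that spectrum (a synthetic 1/f image stands in for a real photo):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Synthesize an image with an approximately 1/f amplitude spectrum.
fy, fx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
radius = np.sqrt(fx**2 + fy**2)
radius[0, 0] = 1.0  # avoid dividing the DC component by zero
img = np.fft.ifft2(rng.standard_normal((n, n)) / radius).real

# Radially averaged power spectral density of the image.
power = np.abs(np.fft.fft2(img))**2
bins = (radius * n).astype(int).ravel()
counts = np.bincount(bins)
rapsd = np.bincount(bins, weights=power.ravel()) / np.maximum(counts, 1)

# On a log-log plot, rapsd vs. frequency is close to a straight line.
freqs = np.arange(1, n // 2)
slope = np.polyfit(np.log(freqs), np.log(rapsd[1:n // 2]), 1)[0]
print(f"spectral slope = {slope:.2f}")  # near -2 for power proportional to 1/f^2
```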
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

The authors explore a novel approach to building a time-series forecasting foundation model from natural images, based on the intrinsic similarities between images and time series. The proposed VisionTS, without any training on time series data, outperforms the largest foundation model, MOIRAI_Large, in the zero-shot setting.
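
The core trick, as we read it, is to render a 1D series as a 2D "image" by stacking period-length segments, mask the part corresponding to the future, and let a visual masked autoencoder inpaint it as the forecast. A shape-level sketch (the real pipeline adds normalization and resizing to the MAE's input resolution, and the orientation is a detail):

```python
import numpy as np

def series_to_image(series: np.ndarray, period: int) -> np.ndarray:
    """Stack consecutive period-length windows into rows of a 2D array,
    so seasonality lines up vertically like texture in an image."""
    n = len(series) // period * period
    return series[:n].reshape(-1, period)

# Toy seasonal series: 14 "days" of a daily pattern plus noise.
t = np.arange(14 * 24)
series = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(len(t))
img = series_to_image(series, period=24)   # shape (14, 24)

# Forecasting = masked image reconstruction: hide the last rows and
# ask the visual MAE to fill them in (the model call is omitted here).
visible, masked = img[:-2], img[-2:]
print(visible.shape, masked.shape)          # (12, 24) (2, 24)
```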
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Long-context generalization depends on the token distances set by position indices, which are then combined with token representations. LongRecipe is primarily focused on optimizing the learning process by efficiently handling both position indices and token representations.
The approach extends the effective context window of open-source LLMs from 8k to 128k tokens, achieving performance close to GPT-4 with just one day of dedicated training on a single GPU with 80GB of memory.

code: https://github.com/zhiyuanhubj/LongRecipe
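
One common trick in this family (PoSE-style position skipping; LongRecipe's exact token-selection strategy is more involved) is to train on short sequences whose position ids are spread across the full target range, so the model sees long-range relative distances without long inputs:

```python
import numpy as np

def stretched_position_ids(seq_len=8192, target_len=131072, n_chunks=2,
                           rng=np.random.default_rng(0)):
    """Split a short training sequence into chunks and place each chunk
    at a random offset inside the target context window, so attention
    sees relative distances up to target_len during short-sequence
    training (chunk-overlap handling is omitted in this sketch)."""
    chunk = seq_len // n_chunks
    starts = np.sort(rng.choice(target_len - chunk, size=n_chunks,
                                replace=False))
    return np.concatenate([np.arange(s, s + chunk) for s in starts])

pos = stretched_position_ids()
print(len(pos), pos.min(), pos.max())  # 8192 positions spanning up to ~128k
```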
My Python code is a neural network

Many programs we write can be embedded in an RNN, and a trained RNN can perform better than the hand-written algorithm. The author demonstrates the idea with a program that determines whether a message sent during code review clearly refers to the program code.
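
To make the embedding claim concrete, here is a tiny hand-weighted recurrent cell implementing a keyword rule ("fires if any trigger token appears"), the kind of program that can then be relaxed and trained; the vocabulary and weights are made up for illustration:

```python
import numpy as np

VOCAB = {"def": 0, "return": 1, "lgtm": 2, "typo": 3}
TRIGGERS = {"def", "return"}  # hand-written rule: "mentions code"

# One-hot input weights: a trigger token pushes the state to 1.
W_in = np.array([[1.0 if w in TRIGGERS else 0.0 for w in VOCAB]])
W_rec = np.array([[1.0]])  # the state is sticky: once set, it stays set

def rnn_detects_code(tokens: list[str]) -> bool:
    h = np.zeros(1)
    for tok in tokens:
        x = np.eye(len(VOCAB))[VOCAB[tok]]
        h = np.clip(W_rec @ h + W_in @ x, 0.0, 1.0)  # saturating unit
    return bool(h[0] > 0.5)

print(rnn_detects_code(["lgtm", "typo"]))          # False
print(rnn_detects_code(["typo", "def", "lgtm"]))   # True
```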
Learning to Ask: When LLMs Meet Unclear Instruction

The study delves into the issue of unclear user instructions and their impact on the effective use of tools by modern LLMs. Recognizing the limitations of LLMs in dealing with ambiguous instructions, the authors investigated the common error patterns present in real-world user instructions. Based on this analysis, they introduced Noisy ToolBench, a novel benchmark for evaluating an LLM's tool-using performance under unclear user instructions. They also developed Ask-when-Needed (AwN), a method that empowers LLMs to actively seek user input whenever they are uncertain about the instructions.
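
The AwN idea reduces to a simple control loop; the sketch below is schematic, with `llm` and the "ASK:" output convention as hypothetical placeholders rather than the paper's format:

```python
def ask_when_needed(instruction: str, llm, max_questions: int = 3) -> str:
    """Loop: the model either commits to a tool call or asks the user."""
    context = instruction
    for _ in range(max_questions):
        reply = llm(context)
        if not reply.startswith("ASK:"):
            return reply                  # confident: a concrete tool call
        answer = input(reply[4:] + " ")   # surface the question to the user
        context += f"\nQ: {reply[4:]}\nA: {answer}"
    return llm(context + "\nProceed with your best interpretation.")
```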
Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

The results indicate that Claude 3 tends to generate longer functions but shorter classes than humans, and this characteristic can be used to detect Claude 3-generated code with ML models, reaching 82% accuracy for function-level and 66% for class-level snippets.
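
A sketch of the feature-based detection setup the summary implies (length-style features plus an off-the-shelf classifier; the paper's actual feature set is richer, and the data below is a toy stand-in):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_features(snippet: str) -> list[float]:
    lines = snippet.splitlines()
    return [len(lines),                                     # total lines
            float(np.mean([len(l) for l in lines])),        # avg line length
            sum(l.strip().startswith("#") for l in lines)]  # comment count

# X: code snippets, y: 1 if LLM-generated, 0 if human-written (toy labels).
snippets = ["def f():\n    return 1\n", "# quick hack\nx = 1\n"]
X = np.array([length_features(s) for s in snippets])
y = np.array([1, 0])
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```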