DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
DeepSeek-Coder-V2 is an open-source MoE code language model that achieves performance comparable to GPT4-Turbo on code-specific tasks. Its pre-training dataset is composed of 60% source code, 10% math corpus, and 30% natural language corpus. DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 on an additional 6T tokens. The model expands its supported programming languages from 86 to 338 and extends the context length from 16K to 128K.
github: https://github.com/deepseek-ai/DeepSeek-Coder-V2
paper: https://arxiv.org/abs/2406.11931
16B Instruct: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
236B Instruct: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct
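For reference, a minimal sketch of querying the 16B Lite Instruct checkpoint through the standard Hugging Face transformers chat API; the prompt and generation settings are illustrative assumptions, not recommendations from the model card:
```python
# Minimal sketch: load DeepSeek-Coder-V2-Lite-Instruct via Hugging Face transformers.
# Assumes a GPU with enough memory and trust_remote_code for the custom MoE architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```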
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Can we systematically address jailbreak attacks?
It is difficult to prepare against all possible jailbreak queries (which current approaches like SFT attempt to do)—these queries usually elicit related harmful responses that rely on the same underlying knowledge (e.g., detailed steps to make a bomb).
Consequently, directly unlearning the harmful knowledge in the LLM prevents it from generating harmful responses, even when confronted with unseen jailbreak prompts.
The authors propose an unlearning method named Safe Unlearning, which combines three complementary objectives (sketched below):
- minimizing the probability of generating harmful responses,
- maximizing the probability of rejecting harmful queries, and
- maintaining general performance on harmless queries
github: https://github.com/thu-coai/SafeUnlearning
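A rough sketch of how these three objectives could be combined into one training loss; the term names, the weights, and the sign-flipped NLL used for the unlearning term are assumptions for illustration, not the authors' exact implementation (see the repository for that):
```python
# Illustrative sketch of a combined "safe unlearning" objective:
#   (1) push down the likelihood of harmful responses,
#   (2) push up the likelihood of refusals to harmful queries,
#   (3) preserve behaviour on harmless data via a standard LM loss.
# The weighting and the simple negated-NLL form of (1) are assumptions.
import torch
import torch.nn.functional as F

def token_nll(model, input_ids, labels):
    """Mean negative log-likelihood of `labels` given `input_ids` (prompt tokens masked with -100)."""
    logits = model(input_ids=input_ids).logits[:, :-1]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

def safe_unlearning_loss(model, batch, w_unlearn=1.0, w_reject=1.0, w_retain=1.0):
    # (1) unlearning term: maximize NLL of harmful responses (gradient ascent on their likelihood)
    l_unlearn = -token_nll(model, batch["harmful_ids"], batch["harmful_labels"])
    # (2) rejection term: ordinary NLL on refusal responses to harmful queries
    l_reject = token_nll(model, batch["refusal_ids"], batch["refusal_labels"])
    # (3) retention term: ordinary NLL on helpful responses to harmless queries
    l_retain = token_nll(model, batch["helpful_ids"], batch["helpful_labels"])
    return w_unlearn * l_unlearn + w_reject * l_reject + w_retain * l_retain
```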
Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes
In complex code generation tasks, exploiting the diversity encoded in LMs helps generate correct outputs.
RQs:
- Can recent code LMs generate sufficiently diverse solutions to specific problems?
- Is there a correlation between the diversity and correctness of the generated codes?
- Do advanced code generation strategies enhance both code diversity and correctness?
The authors observe that existing code LMs tend to generate functionally correct codes with limited diversity.
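A toy sketch of one way to quantify inter-sample diversity: sample several completions for the same problem and compute the average pairwise token-level dissimilarity. The metric below is a generic stand-in, not the measure proposed in the paper:
```python
# Toy sketch: estimate diversity of k sampled solutions to one problem as
# 1 - mean pairwise token-overlap similarity (a generic proxy metric).
from itertools import combinations

def token_set_similarity(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def sample_diversity(solutions: list[str]) -> float:
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(token_set_similarity(a, b) for a, b in pairs) / len(pairs)

solutions = [
    "def add(a, b): return a + b",
    "def add(x, y):\n    return x + y",
    "import operator\ndef add(a, b): return operator.add(a, b)",
]
print(f"diversity ≈ {sample_diversity(solutions):.2f}")
```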
Understanding Defects in Generated Codes by Language Models
LLMs sometimes generate code with defects.
RQs:
- What are the types of defects in the generated code, and how can they be classified based on their characteristics?
- Can existing prompt engineering techniques help in fixing the problematic code?
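On the second question, a generic prompt-based repair loop looks like the sketch below: run the generated code against a test, and if it fails, feed the traceback back to the model. This is a common self-repair pattern, not the specific protocol studied in the paper; `generate(prompt)` is a hypothetical wrapper around whatever LLM is used:
```python
# Generic sketch of a prompt-based repair loop for defective generated code.
import traceback

def run_tests(code: str, test: str) -> str | None:
    """Return None on success, otherwise the traceback string."""
    try:
        env: dict = {}
        exec(code, env)
        exec(test, env)
        return None
    except Exception:
        return traceback.format_exc()

def repair(generate, task: str, code: str, test: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        error = run_tests(code, test)
        if error is None:
            return code
        prompt = (
            f"Task:\n{task}\n\nCurrent solution:\n{code}\n\n"
            f"It fails with:\n{error}\nReturn a corrected version."
        )
        code = generate(prompt)
    return code
```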
Diffusion is spectral autoregression
Autoregression and diffusion are currently the two dominant generative modelling paradigms. And they aren’t all that different: diffusion models of images perform approximate autoregression in the frequency domain.
Colab: https://colab.research.google.com/drive/1siywvhvl1OxI1UmqRrJHiFUK0M5SHlcx
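The core observation can be reproduced numerically: natural images concentrate their power at low spatial frequencies, while Gaussian noise has a flat spectrum, so as noise is added, high frequencies are swamped first and denoising effectively reveals frequencies coarse-to-fine. A minimal sketch of the spectrum computation (the Colab does this with real photos and the actual diffusion noise schedule; here a smooth synthetic image stands in):
```python
# Rough sketch: radially averaged power spectrum of a (stand-in) image at
# increasing Gaussian noise levels. Noise has a flat spectrum, so it drowns
# out the high frequencies of the low-frequency-dominated image first.
import numpy as np

def radial_power_spectrum(img: np.ndarray, n_bins: int = 32) -> np.ndarray:
    """Radially averaged power spectrum, ordered from lowest to highest frequency."""
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2).ravel()
    bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
    idx = np.digitize(r, bins) - 1
    return np.array([power.ravel()[idx == i].mean() for i in range(n_bins)])

# Smooth low-frequency image as a stand-in; a real photo shows the same trend.
x = np.linspace(0, 4 * np.pi, 128)
img = np.outer(np.sin(x), np.cos(x))
rng = np.random.default_rng(0)
for sigma in (0.0, 0.3, 1.0):
    spectrum = radial_power_spectrum(img + sigma * rng.standard_normal(img.shape))
    print(f"sigma={sigma}: high/low frequency power ratio ≈ "
          f"{spectrum[24:].mean() / spectrum[:8].mean():.2e}")
```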
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
The authors explore a novel approach to building a time series forecasting foundation model using natural images (based on the intrinsic similarities between images and time series). The proposed VisionTS, without any training on time series data, outperforms the largest foundation model MOIRAI_Large in the zero-shot setting.
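To make the image/time-series analogy concrete, the sketch below folds a univariate series by its period into a 2D array ("image") whose right-hand columns (the forecast horizon) are left masked, which a visual masked autoencoder could then fill in. This folding step is an illustration of the idea only, not the official VisionTS code:
```python
# Sketch: fold a periodic time series into a 2D array with masked forecast columns.
import numpy as np

def series_to_image(series: np.ndarray, period: int, horizon: int) -> np.ndarray:
    """One period per column; NaN columns on the right mark the horizon to forecast."""
    n_cols = len(series) // period
    context = series[: n_cols * period].reshape(n_cols, period).T   # (period, n_cols)
    future = np.full((period, (horizon + period - 1) // period), np.nan)
    return np.concatenate([context, future], axis=1)

t = np.arange(24 * 30)
series = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.default_rng(0).standard_normal(len(t))
img = series_to_image(series, period=24, horizon=48)
print(img.shape)  # (24, 32): 30 history columns plus 2 masked forecast columns
```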
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Long context generalization depends on token distances set by position indices, which are then combined with token representations. LongRecipe is primarily focused on optimizing the learning process by efficiently handling both position indices and token representations.
The approach extends the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training on a single GPU with 80GB of memory.
code: https://github.com/zhiyuanhubj/LongRecipe
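The position-index side of this can be illustrated with a generic trick: train on short chunks but remap their position ids so the model sees relative distances up to the target window. The sketch below shows this general idea only; LongRecipe's actual position and token selection strategy is described in the paper and repository:
```python
# Generic illustration of position-index transformation for long-context training:
# split a short chunk in two and shift the second half's position ids by a random
# gap, so relative distances up to the target window appear while training on
# short sequences. Not LongRecipe's exact scheme.
import torch

def simulated_long_positions(seq_len: int, target_len: int,
                             generator: torch.Generator | None = None) -> torch.Tensor:
    """Position ids for a `seq_len` chunk that pretend it spans up to `target_len` tokens."""
    positions = torch.arange(seq_len)
    split = seq_len // 2
    gap = torch.randint(1, target_len - seq_len + 1, (1,), generator=generator).item()
    positions[split:] += gap  # insert a gap between the two halves
    return positions

pos = simulated_long_positions(seq_len=8192, target_len=131072)
print(pos.max().item())  # largest position index seen, at most target_len - 1
```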
My Python code is a neural network
Many programs we write can be embedded in an RNN, and a trained RNN can perform better than if we wrote the algorithm by hand. The author demonstrates this idea with a program that determines whether a message sent during code review clearly refers to the program code.
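A toy sketch of the kind of model the post has in mind: a small RNN that reads a code-review message token by token and outputs whether it refers to program code. The architecture and vocabulary handling here are illustrative only, not the author's implementation:
```python
# Toy sketch: a tiny GRU classifier over code-review message tokens.
import torch
import torch.nn as nn

class MessageClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(self.embed(token_ids))   # h: (num_layers, batch, hidden_dim)
        return self.head(h[-1]).squeeze(-1)      # one logit per message

vocab = {"<unk>": 0, "please": 1, "rename": 2, "this": 3, "variable": 4, "thanks": 5}
message = "please rename this variable"
ids = torch.tensor([[vocab.get(w, 0) for w in message.split()]])
model = MessageClassifier(vocab_size=len(vocab))
print(torch.sigmoid(model(ids)))  # untrained probability that the message refers to code
```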
Learning to Ask: When LLMs Meet Unclear Instruction
The study delves into the issue of unclear user instructions and their impact on the effective use of tools by modern LLMs. Recognizing the limitations of LLMs in dealing with ambiguous instructions, the authors conducted an investigation into the common error patterns present in real-world user instructions. Based on the analysis, they introduced the Noisy ToolBench dataset, a novel tool-using benchmark aimed to evaluate the LLM’s tool-using performance under unclear user instructions. Furthermore, they developed the Ask-when-Needed method (AwN), an approach that empowers LLMs to actively seek user input whenever they face uncertainty in instructions.
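A generic sketch of an ask-when-needed loop, where the model may either call a tool or ask a clarifying question when the instruction is ambiguous. The message format and the `llm`/`run_tool` helpers are hypothetical and not the paper's prompts or code:
```python
# Generic ask-when-needed loop: the model returns a JSON action that is either
# a clarifying question, a tool call, or a final answer.
import json

def ask_when_needed(llm, run_tool, instruction: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": instruction}]
    for _ in range(max_turns):
        reply = llm(messages)          # assumed to return a JSON string
        action = json.loads(reply)
        if action["type"] == "clarify":
            answer = input(f"Assistant asks: {action['question']}\nYour answer: ")
            messages += [{"role": "assistant", "content": reply},
                         {"role": "user", "content": answer}]
        elif action["type"] == "tool_call":
            result = run_tool(action["name"], action.get("arguments", {}))
            messages += [{"role": "assistant", "content": reply},
                         {"role": "tool", "content": str(result)}]
        else:                           # final answer
            return action["content"]
    return "Gave up after too many turns."
```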
Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku
The results indicate that Claude 3 tends to generate longer functions, but shorter classes than humans, and this characteristic can be used to detect Claude 3-generated code with ML models with 82% and 66% accuracies for function-level and class-level snippets, respectively.
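A simplified sketch of length-feature-based detection: fit a classifier on a few size statistics of a snippet. The features, model, and toy labels below are stand-ins for the feature set and training data actually used in the study:
```python
# Simplified sketch: detect LLM-generated code from length statistics.
from sklearn.ensemble import RandomForestClassifier

def length_features(snippet: str) -> list[float]:
    lines = snippet.splitlines() or [""]
    return [
        float(len(lines)),                          # number of lines
        sum(len(l) for l in lines) / len(lines),    # mean line length
        float(max(len(l) for l in lines)),          # longest line
    ]

# X: feature vectors for labeled snippets; y: 1 = LLM-generated, 0 = human-written (toy labels)
train_snippets = ["def f(x):\n    return x * 2\n",
                  "def g(a, b):\n    # add\n    return a + b\n"]
y = [1, 0]
X = [length_features(s) for s in train_snippets]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([length_features("def h(n):\n    return n - 1\n")]))
```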
Fixing Code Generation Errors for Large Language Models
The authors conducted ten rounds of tests on 14 LLMs using the HumanEval dataset. Through manual analysis of the test results, they found that these LLMs achieved an average of 84.07% of their reported performance.
They also investigated the relationship between Pass@1 results, model inference time, and model parameter size. The analysis revealed a positive correlation between Pass@1 results and model parameter size, while no significant correlation was observed between inference time and parameter size.
Subsequently, the authors performed an in-depth analysis of errors in the test results, extracting and categorizing 12,837 errors into 14 types. Through the analysis, they identified 19 specific causes leading to these errors.
The proposed fixing method can fix three types of errors, improving the performance of the 14 LLMs on the HumanEval and MBPP datasets by average increases of 9.5% and 5.4%, respectively.
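A sketch of the HumanEval-style execution and coarse error bucketing that underlies this kind of analysis: run the generated solution against the task's tests and record the exception class. Bucketing by exception class is only a crude stand-in for the paper's 14-type taxonomy:
```python
# Sketch: execute a HumanEval-style sample and bucket the outcome by error type.
from collections import Counter

def execute_and_categorize(solution: str, test_code: str, entry_point: str) -> str:
    env: dict = {}
    try:
        exec(solution, env)
        exec(test_code, env)
        env["check"](env[entry_point])   # HumanEval tests expose a check(candidate) function
        return "pass"
    except AssertionError:
        return "wrong_answer"
    except Exception as e:
        return type(e).__name__          # e.g. SyntaxError, NameError, TypeError, ...

results = Counter()
samples = [
    ("def add(a, b):\n    return a + b\n",
     "def check(candidate):\n    assert candidate(1, 2) == 3\n",
     "add"),
]
for solution, test_code, entry_point in samples:
    results[execute_and_categorize(solution, test_code, entry_point)] += 1
print(results)
```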