DeepSeek Coder: Let the Code Write Itself
DeepSeek Coder is a series of code language models.
- Pretrained on 2 trillion tokens across more than 80 programming languages.
- Various model sizes (1.3B, 5.7B, 6.7B, and 33B) to support different requirements.
- A 16K context window, supporting project-level code completion and infilling.
- State-of-the-art performance among open code models.
- Open source and free for research and commercial use.
Hugging Face: https://huggingface.co/deepseek-ai
GitHub: https://github.com/deepseek-ai/deepseek-coder/
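As a quick illustration of using one of these checkpoints, here is a minimal completion sketch with the Hugging Face transformers library. The model ID, prompt, and generation settings are illustrative assumptions, not taken from this post.

```python
# Minimal completion sketch for a DeepSeek Coder checkpoint via transformers.
# The model ID and generation settings below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"  # smallest of the sizes listed above
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "# Return True if n is prime\ndef is_prime(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```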
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
The paper evaluates the effectiveness of LLMs at detecting vulnerabilities in source code.
The authors compare different prompting approaches with each other. They also compare LLMs against a static analysis tool (CodeQL) and a classical deep learning method (LineVul). In addition, the authors draw some conclusions about the characteristics of synthetic and real-world datasets.
Two conclusions worth noting:
- Combining GPT-4 and CodeQL yields 96-97% (a sketch of one possible combination scheme follows below);
- Work on handling larger contexts looks promising.
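The post does not say how the two detectors are combined; below is a hedged sketch of one plausible scheme, flagging a function if either detector fires. Both detector helpers are hypothetical stand-ins; only the SARIF parsing follows a real format (CodeQL's standard report output).

```python
import json

def llm_predicts_vulnerable(source: str) -> bool:
    """Hypothetical stand-in: prompt GPT-4 with the function source
    and parse a yes/no verdict from its answer."""
    raise NotImplementedError

def codeql_flags(file_uri: str, sarif_path: str) -> bool:
    """Check whether a CodeQL SARIF report contains an alert for a file.
    Field names follow the SARIF 2.1.0 spec that CodeQL emits."""
    with open(sarif_path) as f:
        report = json.load(f)
    for run in report.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                uri = (loc.get("physicalLocation", {})
                          .get("artifactLocation", {})
                          .get("uri", ""))
                if uri == file_uri:
                    return True
    return False

def combined_verdict(source: str, file_uri: str, sarif_path: str) -> bool:
    # Union of the two detectors: higher recall, possibly lower precision.
    # The paper's exact combination rule may differ from this assumption.
    return codeql_flags(file_uri, sarif_path) or llm_predicts_vulnerable(source)
```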
Lost in the Middle: How Language Models Use Long Contexts
Language model performance is highest when relevant information occurs at the very beginning (primacy bias) or end of its input context (recency bias), and performance significantly degrades when models must access and use information in the middle of their input context.
Effective reranking of retrieved documents (pushing relevant information closer to the start of the input context) or ranked list truncation may be promising directions for improving how language-model-based readers use retrieved context.
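To make the reranking suggestion concrete, here is a small, purely illustrative ordering heuristic: place the highest-scoring passages at the edges of the context and push the weakest into the middle. The relevance scores are assumed to come from an upstream retriever.

```python
def order_for_long_context(passages: list[tuple[str, float]]) -> list[str]:
    """Order (text, relevance) pairs to counter the lost-in-the-middle effect:
    the most relevant passages go to the start and end of the context,
    the least relevant to the middle."""
    ranked = sorted(passages, key=lambda p: p[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

docs = [("irrelevant aside", 0.1), ("key fact", 0.9), ("background", 0.5)]
print(order_for_long_context(docs))
# ['key fact', 'irrelevant aside', 'background'] -- weakest passage in the middle
```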
New LLMs appear regularly. They are available for download or via API. We can't always trust these LLMs (toxicity, unsafe content, etc.).
Is there a future for services that check LLMs for these risks? Certification, online testing via API, etc.
garak, LLM vulnerability scanner
garak checks whether an LLM can be made to fail in a way we don't want. It probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses.
Docs: https://docs.garak.ai/garak/
AlphaCode 2 Technical Report
AlphaCode 2 is a new and enhanced system, powered by Gemini. AlphaCode 2 relies on the combination of language models and a bespoke search and reranking mechanism.
Optimum-NVIDIA on Hugging Face enables blazingly fast LLM inference in just 1 line of code
By changing just a single line of code, you can unlock up to 28x faster inference and 1,200 tokens/second on the NVIDIA platform.
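Based on the Optimum-NVIDIA announcement, the single-line change is the import: you swap the transformers import for the optimum.nvidia one and keep the rest of the workflow unchanged. The model ID below is illustrative, and use_fp8 assumes FP8-capable hardware (e.g., H100).

```python
# Before: from transformers import AutoModelForCausalLM
from optimum.nvidia import AutoModelForCausalLM  # the one-line change
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative supported model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```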
Naturalness of Attention: Revisiting Attention in Code Language Models
In this work, the authors revisit the mathematical definition of multi-head attention from prior work in natural language processing. They perform a layer-wise analysis of the trends of attention weights and of the transformation of the input for a code LM (CodeBERT), showing that the attention mechanism is not composed merely of the attention weights.
RQs (a toy sketch of the quantities involved follows below):
- How do the general trends across layers of the attention weights α and the scaled transformation norms ‖α f(x)‖ compare?
- How well does ‖α f(x)‖ align with the syntactic structure of source code, compared to attention weights?
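Here is a toy numpy sketch of these quantities (sizes and inputs are random and purely illustrative): compute the attention weights α for one head, the value transformation f(x) = xW_V, and the norms ‖α f(x)‖, which can rank source tokens differently from α alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                   # toy sequence length and hidden size
x = rng.normal(size=(n, d))                   # token representations
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(d)
alpha = np.exp(scores)
alpha /= alpha.sum(axis=-1, keepdims=True)    # attention weights, shape (n, n)

f_x = x @ W_v                                 # value transformation f(x)
# ||alpha_ij * f(x_j)||: norm of each weighted value vector, shape (n, n)
scaled_norms = np.linalg.norm(alpha[:, :, None] * f_x[None, :, :], axis=-1)

# When ||f(x_j)|| varies across tokens, the two matrices can disagree
# about which source tokens matter most.
print(alpha.round(2))
print(scaled_norms.round(2))
```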
NeurIPS 2023
Location: New Orleans Ernest N. Morial Convention Center
Dates: Sun, Dec 10, 2023 – Sat, Dec 16, 2023
Schedule: https://nips.cc/virtual/2023/calendar
Microsoft: https://www.microsoft.com/en-us/research/blog/neurips-2023-highlights-breadth-of-microsofts-machine-learning-innovation/
Stanford: https://ai.stanford.edu/blog/neruips-2023/
Google: https://blog.research.google/2023/12/google-at-neurips-2023.html
Meta: https://ai.meta.com/events/neurips-2023/
Salesforce: https://blog.salesforceairesearch.com/salesforce-research-at-neurips-2023/
promptbase [Microsoft]
promptbase is an evolving collection of resources, best practices, and example noscripts for eliciting the best performance from foundation models like GPT-4.
Medprompt blog:
- The Power of Prompting
- Steering at the Frontier: Extending the Power of Prompting
Do you use "please" when communicating with LLMs? Example: https://github.com/microsoft/promptbase/blob/main/src/promptbase/math/math.py#L12
Anonymous Poll
Yes: 65%
No: 35%
LLM360: Towards Fully Transparent Open-Source LLMs
Most open-source LLMs have only released partial artifacts, such as the final model weights or inference code. The authors present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. As a first step of LLM360, they release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses: https://www.llm360.ai
RealCode_eval
It is a benchmark for execution-based evaluation of LLM code generation on real GitHub repositories (a sketch of the evaluation loop follows the list below).
RealCode is a dataset of 219 Python functions from 22 GitHub repositories published between June and August 2023. All these functions are covered by tests in their respective repositories.
- Around 60% of the repositories are related to AI/LLMs/ML.
- The code in the RealCode repositories could not have been seen during the pretraining of StarCoder or CodeLlama, as these models were trained before summer 2023. DeepSeek Coder may have seen this code during pretraining.
- Repositories are rolled back to a specific commit during data preparation.
- Not all tests pass in the repositories.
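A hedged sketch of what the execution-based evaluation loop could look like: splice the generated function body into its file, run the repository's tests, and score pass/fail. The helper below is hypothetical; the benchmark's actual harness may differ.

```python
import subprocess
from pathlib import Path

def passes_repo_tests(repo_dir: str, target_file: str,
                      reference_body: str, generated_body: str) -> bool:
    """Replace the reference function body with the generated one,
    run the repository's tests, then restore the file."""
    path = Path(repo_dir) / target_file
    source = path.read_text()
    path.write_text(source.replace(reference_body, generated_body))
    try:
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                                capture_output=True, timeout=600)
        return result.returncode == 0   # success = the repo's own tests pass
    finally:
        path.write_text(source)         # always restore the original code
```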
ENASE 2024 (Deadline Extension)
Regular Paper Submission Extension: January 3, 2024
Position Paper Submission: January 25, 2024
Doctoral Consortium Paper Submission: February 29, 2024
The mission of ENASE (Evaluation of Novel Approaches to Software Engineering) is to be a prime international forum to discuss and publish research findings and IT industry experiences with relation to novel approaches to software engineering.
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
LMs may still produce a valid solution if they not only write code but also selectively "emulate" the interpreter by generating the expected output of lines of code that cannot be executed. In this work, the authors propose Chain of Code, a simple extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors and hand them off to an LM to simulate. Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought.
Project: https://chain-of-code.github.io/
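A toy sketch of the interpreter/LM split described above: run each line with Python's interpreter, and when a line raises (e.g., it calls a semantic function no interpreter can evaluate), hand it to an LM to simulate and bind the result. The ask_lm helper is a hypothetical stand-in for an actual model call.

```python
def ask_lm(expression: str, state: dict) -> object:
    """Hypothetical stand-in: ask an LM to simulate `expression`
    given the current program state and return the value it predicts."""
    raise NotImplementedError

def chain_of_code(lines: list[str]) -> dict:
    """Execute code line by line, deferring undefined behavior to the LM."""
    state: dict = {}
    for line in lines:
        try:
            exec(line, {}, state)   # the interpreter handles what it can
        except Exception:
            # e.g., `mood = is_sarcastic(comment)` has no real implementation,
            # so the LM simulates the right-hand side instead.
            target, _, expression = line.partition("=")
            state[target.strip()] = ask_lm(expression.strip(), state)
    return state
```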
Entity-Augmented Code Generation
LLMs are effective at generating high-quality text and encapsulate a broad spectrum of world knowledge. However, these models are not designed to utilize external information sources. In this paper, the authors apply retrieval-augmented LLMs to a new task: code generation using external entities. Existing retrieval-augmented LLMs fail to assign relevance scores between similar entity names, so the authors propose a novel end-to-end trainable architecture with a scalable entity retriever injected directly into the LLM decoder.
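A toy illustration of the failure mode the authors target: with a naive surface-similarity retriever, near-identical entity names receive almost indistinguishable relevance scores. The entity names below are made up; the paper's trained in-decoder retriever is what resolves this, not the heuristic shown here.

```python
from difflib import SequenceMatcher

query = "fetch a single user record by its id"
entities = ["get_user_by_id", "get_users_by_ids", "get_user_by_uid"]

# Naive string-similarity scores barely separate the candidates,
# so the retriever cannot reliably pick the correct entity.
for name in entities:
    score = SequenceMatcher(None, query, name).ratio()
    print(f"{name}: {score:.2f}")
```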
Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
Adapters is a new open-source library, built on the initial version of AdapterHub, that unifies parameter-efficient and modular transfer learning. The library integrates 10 diverse adapter methods into a unified interface for easy use, and makes it simple to leverage the modularity of adapters through composition blocks (a usage sketch follows below).
GitHub: https://github.com/adapter-hub/adapters
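A minimal usage sketch based on the library's documented interface; the base model and adapter names are illustrative.

```python
from adapters import AutoAdapterModel
from adapters.composition import Stack

model = AutoAdapterModel.from_pretrained("roberta-base")  # illustrative base model
model.add_adapter("lang", config="seq_bn")   # "seq_bn" = sequential bottleneck adapter
model.add_adapter("task", config="seq_bn")

# Composition block: pass hidden states through "lang", then "task".
# train_adapter activates the setup and freezes all base-model weights,
# so only the adapter parameters are updated during training.
model.train_adapter(Stack("lang", "task"))
```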