ml4se – Telegram
ml4se
505 subscribers
446 photos
1 file
524 links
Machine Learning for Software Engineering
Download Telegram
Self-consistency for open-ended generations

Although individual generations sampled from the large-scale pre-trained language models often yield high-quality results, multiple samplings can produce certain generations of substantially higher quality than the average output of the model.

Recently for the special case of problems that have fixed answer, a simple approach, called self-consistency was suggested for selecting the best answer from multiple generations (Wang et al. 2022). In that paper, the authors sample multiple generations from the LLM, extract the predicted answer from each generation and select the answer with the most number of votes. However, it is important to note that the self-consistency approach is not applicable to prompts that are open-ended and do not have fixed answers.

In this paper, the authors introduce a generalized framework for self-consistency that extends its applicability beyond problems that have fixed-answer answers.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

In the paper, the authors
- survey open problems and fundamental limitations of RLHF and related methods;
- overview techniques to understand, improve, and complement RLHF in practice; and
- propose auditing and disclosure standards to improve societal oversight of RLHF systems.
Patterns for Building LLM-based Systems & Products

The post is about practical patterns for integrating LLMs into systems and products:
- Evals: To measure performance
- RAG: To add recent, external knowledge
- Fine-tuning: To get better at specific tasks
- Caching: To reduce latency & cost
- Guardrails: To ensure output quality
- Defensive UX: To anticipate & manage errors gracefully
- Collect user feedback: To build our data flywheel
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

In this work, the authors conduct a study of subtokenization options for large LM pretraining on source code. They show that for large LMs pretrained on source code:
- Grouping punctuation chars in single tokens reduces the average length by 17% without downstream performance drop, and permitting more complex composite tokens reduces lengths by 40%, sometimes with quality drop;
- UnigramLM is generally preferable over BPE;
- Smaller vocabularies may improve quality with 3—19% length increase;
- Subtokenizers are well transferable between programming languages.
Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey

The study presented a comprehensive empirical evaluation of LLMs for automated code clone detection across diverse clone types, languages, and prompt formulations. The key findings demonstrate that advanced LLMs like GPT-3.5-Turbo and GPT-4 can achieve remarkably high recall and accuracy in detecting even complex semantic clones, outperforming existing techniques. Introducing intermediate reasoning steps through chain-of-thought prompting leads to noticeable gains by equipping models with a structured thought process.

- Can LLMs detect code clones with a simple prompt?
- How do LLMs perform by using one-step chain-of-thought prompts?
- Can LLMs perform better by using multi-step chain-of-thought prompts?
- How do LLMs perform using code embedding?
- How does the performance of LLMs in code clone detection vary across different programming languages?
PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback

In this paper, the authors introduce a novel framework, namely RRTF (Rank Responses to align Test&Teacher Feedback), and present a new Code LLM, namely PanGu-Coder2. Firstly, they adopt the Evol-Instruct technique to obtain a substantial amount of high-quality natural language instruction and code solution data pairs. Then, they train the base model by ranking candidate code solutions using feedback from test cases and heurstic preferences.

Through comprehensive evaluations on HumanEval, CodeEval, and LeetCode benchmarks, PanGu-Coder2 achieves new state-of-the-art performance among billion-parameter-level Code LLMs, surpassing all of the existing ones by a large margin.
notebook_whisperer

A coding assistant to help with the construction of Jupyter notebooks. With the Notebook Whisperer, you can enter a short sentence saying what you would like to do. It then populates the next cell in your Jupyter notebook with the code for performing that task. This is accomplished by sending the contents of your notebook to chatGPT and having it provide the code that it thinks will fulfill your request.
👍1
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

The quality of code produced by a code LLM varies significantly by programming languages. The paper presents an effective approach for boosting the performance of code LLMs on low-resource languages using semi-synthetic data.

Key ingredients:
1. The large volume of training data for high-resource programming languages includes a lot of well-documented code
2. Code LLMs are effective unit test generators, and we can check that generated tests pass
3. We can mechanically translate many unit tests to a low-resource language with a simple compiler
4. Code LLMs can translate code from one language to another, and we can test these translations with the aforementioned tests, and engineer a prompt to increase the likelihood of a successful translation

The MultiPL-T datasets, and links to the fine-tuned models are available at huggingface.co/datasets/nuprl/MultiPL-T
👍1
A Survey of Time Series Anomaly Detection Methods in the AIOps Domain

Internet-based services have seen remarkable success, generating vast amounts of monitored key performance indicators as univariate or multivariate time series. Monitoring and analyzing these time series are crucial for researchers, service operators, and on-call engineers to detect outliers or anomalies indicating service failures or significant events. Numerous advanced anomaly detection methods have emerged to address availability and performance issues.

The review offers a comprehensive overview of time series anomaly detection in Artificial Intelligence for IT operations (AIOps), which uses AI capabilities to automate and optimize operational workflows. Additionally, it explores future directions for real-world and next-generation time-series anomaly detection based on recent advancements.
👍1
OWASP Top 10 for LLM

The OWASP Top 10 for Large Language Model Applications project aims to educate developers, designers, architects, managers, and organizations about the potential security risks when deploying and managing Large Language Models (LLMs). The project provides a list of the top 10 most critical vulnerabilities often seen in LLM applications, highlighting their potential impact, ease of exploitation, and prevalence in real-world applications. Examples of vulnerabilities include prompt injections, data leakage, inadequate sandboxing, and unauthorized code execution, among others. The goal is to raise awareness of these vulnerabilities, suggest remediation strategies, and ultimately improve the security posture of LLM applications.

1 Prompt Injection
2 Insecure Output Handling
3 Training Data Poisoning
4 Model Denial of Service
5 Supply Chain Vulnerabilities
6 Sensitive Information Disclosure
7 Insecure Plugin Design
8 Excessive Agency
9 Overreliance
10 Model Theft

PDF
Forwarded from Empty Set of Ideas (Arsenii)
https://arxiv.org/abs/2308.10825v1

Algebraic Topology for Data Scientists

This book gives a thorough introduction to topological data analysis (TDA), the application of algebraic topology to data science. Algebraic topology is traditionally a very specialized field of math, and most mathematicians have never been exposed to it, let alone data scientists, computer scientists, and analysts. I have three goals in writing this book. The first is to bring people up to speed who are missing a lot of the necessary background. I will describe the topics in point-set topology, abstract algebra, and homology theory needed for a good understanding of TDA. The second is to explain TDA and some current applications and techniques. Finally, I would like to answer some questions about more advanced topics such as cohomology, homotopy, obstruction theory, and Steenrod squares, and what they can tell us about data. It is hoped that readers will acquire the tools to start to think about these topics and where they might fit in.
2👍1
Introducing Code Llama, a state-of-the-art large language model for coding

- Code Llama is a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts.
- Code Llama is free for research and commercial use.
- Code Llama is built on top of Llama 2 and is available in three models:
- Code Llama, the foundational code model;
- Code Llama - Python specialized for Python;
- Code Llama - Instruct, which is fine-tuned for understanding natural language instructions.

Github
🔥7
Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models

The results reveal the superiority and potential of PEFT over ICL (In-Context Learning) on a wide range of LLMs in reducing the computational burden and improving performance.

Main results:
- LLMs fine-tuned with PEFT techniques, i.e., a few millions of parameters, systematically outperform small language models fully fine-tuned with hundreds of millions of parameters
- Prompt tuning often outperforms LoRA even though it requires learning substantially fewer parameters
- LLMs fine-tuned using LoRA and Prompt tuning significantly outperform LLMs with ICL, even when increasing the number of prompt examples under the ICL setting
- PEFT techniques allow LLMs to better adapt to the task-specific dataset with low computational cost
🔥1
Forwarded from Consciousnesses
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness

The authors one of which is Yoshua Bengio derive a list of indicator properties from a survey of theories of consciousness. Each of these indicator properties is said to be necessary for consciousness by one or more theories, and some subsets are said to be jointly sufficient. The claim is that AI systems which possess more of the indicator properties are more likely to be conscious.

It is discussed how AI systems could be constructed, or have been constructed, with each of the indicator properties. Also the authors consider whether some specific existing AI systems possess the indicator properties. These include Transformer-based LLMs, the Perceiver architecture, DeepMind’s Adaptive Agent and PaLM-E. This work does not suggest that any existing AI system is a strong candidate for consciousness.
Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B

CodeLlama-34B and CodeLlama-34B-Python were fine-tuned on an internal Phind dataset that achieved 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67% according to their official technical report in March.

The Phind models were trained over two epochs, for a total of ~160k examples. LoRA was not used — both models underwent a native fine-tuning. The authors employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.

huggingface:
- Phind/Phind-CodeLlama-34B-v1
- Phind/Phind-CodeLlama-34B-Python-v1
LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning

In this study, the authors present LLaMA-Reviewer, a framework that leverages LLaMA for automating code review. Two PEFT methods — zero-init attention prefix-tuning and LoRA tuning are used to address the computational challenge of LLM fine-tuning.

RQs:
1. How effective is a LLM in automating code review tasks, compared to SoTA methods?
2. How does the representation of input data impact the performance of large language models?
3. How does instruction tuning influence the performance of subsequent sub-tasks?
4. What implications arise from different PEFT methods?

Code, models, results: https://zenodo.org/record/7991113
Understanding Llama 2 and Code Llama

In this edition of the newsletter: the release of the Llama 2 base and chat models, as well as CodeLlama, the latest advances in the open source AI large language model landscape.
👍3👎1
HumanEval_ru Dataset

This is the Russian version of the code generation HumanEval dataset.

Load dataset:

from datasets import load_dataset
load_dataset('NLPCoreTeam/humaneval_ru')

DatasetDict({
train: Dataset({
features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'signature', 'docstring', 'context', 'instruction', 'instruction_noexamples'],
num_rows: 164
})
})
1