Forwarded from Empty Set of Ideas (Arsenii)
https://arxiv.org/abs/2308.10825v1
Algebraic Topology for Data Scientists
This book gives a thorough introduction to topological data analysis (TDA), the application of algebraic topology to data science. Algebraic topology is traditionally a very specialized field of math, and most mathematicians have never been exposed to it, let alone data scientists, computer scientists, and analysts. I have three goals in writing this book. The first is to bring people up to speed who are missing a lot of the necessary background. I will describe the topics in point-set topology, abstract algebra, and homology theory needed for a good understanding of TDA. The second is to explain TDA and some current applications and techniques. Finally, I would like to answer some questions about more advanced topics such as cohomology, homotopy, obstruction theory, and Steenrod squares, and what they can tell us about data. It is hoped that readers will acquire the tools to start to think about these topics and where they might fit in.
Introducing Code Llama, a state-of-the-art large language model for coding
- Code Llama is a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts.
- Code Llama is free for research and commercial use.
- Code Llama is built on top of Llama 2 and is available in three models:
- Code Llama, the foundational code model;
- Code Llama - Python, specialized for Python;
- Code Llama - Instruct, which is fine-tuned for understanding natural language instructions (a short usage sketch follows the GitHub link below).
GitHub
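As a quick illustration (not from the announcement), the Instruct variant can be loaded through Hugging Face Transformers; the model id and generation settings below are assumptions, so check the model card for the exact prompt format:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Instruct-hf"   # assumed Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))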
Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models
The results show that parameter-efficient fine-tuning (PEFT) outperforms in-context learning (ICL) across a wide range of LLMs, reducing the computational burden while improving performance.
Main results:
- LLMs fine-tuned with PEFT techniques, i.e., with only a few million trainable parameters, systematically outperform small language models fully fine-tuned with hundreds of millions of parameters
- Prompt tuning often outperforms LoRA even though it learns substantially fewer parameters
- LLMs fine-tuned with LoRA or prompt tuning significantly outperform LLMs used with ICL, even when the number of prompt examples in the ICL setting is increased
- PEFT techniques allow LLMs to adapt to the task-specific dataset at low computational cost (a minimal LoRA sketch with the peft library follows this list)
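For context, this is roughly what LoRA-based PEFT looks like with Hugging Face's peft library; the base model, target modules, and hyperparameters here are illustrative placeholders rather than the paper's setup:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")  # placeholder base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank update dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # typically well under 1% of the base model's parameters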
Forwarded from Consciousnesses
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
The authors, one of whom is Yoshua Bengio, derive a list of indicator properties from a survey of theories of consciousness. Each indicator property is held to be necessary for consciousness by one or more theories, and some subsets are held to be jointly sufficient. The claim is that AI systems possessing more of the indicator properties are more likely to be conscious.
The paper discusses how AI systems could be, or already have been, constructed with each of the indicator properties. The authors also consider whether some specific existing AI systems possess them, including Transformer-based LLMs, the Perceiver architecture, DeepMind’s Adaptive Agent, and PaLM-E. The work does not suggest that any existing AI system is a strong candidate for consciousness.
Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B
CodeLlama-34B and CodeLlama-34B-Python were fine-tuned on an internal Phind dataset and achieved 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67% according to OpenAI's official technical report from March.
The Phind models were trained for two epochs, on a total of ~160k examples. LoRA was not used: both models underwent native full fine-tuning. The authors employed DeepSpeed ZeRO 3 and Flash Attention 2 to train the models in three hours on 32 A100-80GB GPUs, with a sequence length of 4096 tokens (an illustrative sketch of this kind of setup follows the model links below).
huggingface:
- Phind/Phind-CodeLlama-34B-v1
- Phind/Phind-CodeLlama-34B-Python-v1
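For orientation only, a full fine-tuning run with DeepSpeed ZeRO-3 can be wired up through the Hugging Face Trainer roughly as below; the config, hyperparameters, and train_dataset are placeholders, and this is not Phind's actual training code:

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

ds_config = {                                  # minimal DeepSpeed ZeRO stage-3 config
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-34b-hf")

args = TrainingArguments(
    output_dir="codellama-34b-finetuned",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,                       # shards optimizer state, gradients, and weights across GPUs
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset: tokenized examples (assumed)
trainer.train()

In practice such a run is launched with the deepspeed or torchrun launcher across all GPUs.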
LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning
In this study, the authors present LLaMA-Reviewer, a framework that leverages LLaMA to automate code review. Two PEFT methods, zero-init attention prefix-tuning and LoRA, are used to address the computational cost of LLM fine-tuning.
RQs:
1. How effective is an LLM at automating code review tasks, compared to SoTA methods?
2. How does the representation of input data impact the performance of large language models?
3. How does instruction tuning influence the performance of subsequent sub-tasks?
4. What implications arise from different PEFT methods?
Code, models, results: https://zenodo.org/record/7991113
Understanding Llama 2 and Code Llama
This edition of the newsletter covers the release of the Llama 2 base and chat models, as well as Code Llama, and the latest advances in the open-source large language model landscape.
HumanEval_ru Dataset
This is the Russian version of the code generation HumanEval dataset.
Load dataset:
from datasets import load_dataset

ds = load_dataset('NLPCoreTeam/humaneval_ru')
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'signature', 'docstring', 'context', 'instruction', 'instruction_noexamples'],
#         num_rows: 164
#     })
# })
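As a quick check (field names taken from the schema above), an individual task can be inspected like this:

example = ds['train'][0]
print(example['task_id'])     # task identifier
print(example['prompt'])      # Python function signature with a Russian-language docstring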
BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge
BioCoder is a benchmark for code generation incorporating 2269 bioinformatics-specific coding problems. It incorporates a fuzz-testing framework for evaluation. The authors have applied BioCoder to evaluate many models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT.
Dataset, benchmark, Docker images, and scripts: https://github.com/gersteinlab/biocoder
Releasing Persimmon-8B
Persimmon-8B is an open-source, fully permissively licensed model. It is trained from scratch with a context size of 16K. The model has 70k unused embeddings reserved for multimodal extensions and has sparse activations. The inference code combines the speed of C++ implementations (e.g. FasterTransformer) with the flexibility of naive Python inference.
Hidden size: 4096
Heads: 64
Layers: 36
Batch size: 120
Sequence length: 16384
Training iterations: 375K
Tokens seen: 737B
Code and weights: https://github.com/persimmon-ai-labs/adept-inference
DevGPT: Studying Developer-ChatGPT Conversations
Yet we know very little about how ChatGPT is actually used by software developers. What questions do developers present to ChatGPT? What are the dynamics of these interactions? What is the backdrop against which these conversations are held, and how do the conversations feed back into the artifacts of their work? To close this gap, the authors introduce DevGPT, a curated dataset of 17,913 prompts and ChatGPT responses, including 11,751 code snippets, coupled with the corresponding software development artifacts (source code, commits, issues, pull requests, discussions, and Hacker News threads) to enable analysis of the context and implications of these developer interactions with ChatGPT.
Dataset
Communicative Agents for Software Development
ChatDev is a chat-based end-to-end software development framework that leverages LLMs to facilitate effective communication and collaboration among multiple roles involved in the software development process. By decomposing the development process into sequential atomic subtasks through the use of the chat chain, ChatDev enables granular focus and promotes desired outputs for each subtask. The experimental results demonstrate the efficiency and cost-effectiveness of the automated software development process driven by ChatDev.
The Human-Agent-Interaction mode is now available. You can get involved with the ChatDev team by playing the role of reviewer and making suggestions to the programmer.
GitHub
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained LLMs at limited computation cost. It extends LLaMA2 7B from a 4k context to 100k, or LLaMA2 70B to 32k, on a single 8xA100 machine (a rough sketch of the attention-shift idea follows the repository link below).
Results:
1. The proposed shifted short attention is easy to implement, compatible with Flash-Attention, and not required during inference.
2. Released models: from 7B to 70B, context length from 8k to 100k, including LLaMA2-LongLoRA-7B-100k, LLaMA2-LongLoRA-13B-64k, and LLaMA2-LongLoRA-70B-32k.
3. A long-context QA dataset, LongQA, for supervised fine-tuning (SFT).
Repository: https://github.com/dvlab-research/LongLoRA
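As a rough, hedged illustration of the shift idea from point 1 (tensor layout and details are simplified and not taken from the authors' code): half of the attention heads are rolled by half a group along the sequence axis before grouped attention, so information can cross group boundaries.

import torch

def shift_and_group(qkv, group_size):
    # qkv: (batch, seq_len, num_heads, head_dim); seq_len assumed divisible by group_size
    B, S, H, D = qkv.shape
    shifted = qkv.clone()
    # roll the second half of the heads by half a group along the sequence dimension
    shifted[:, :, H // 2:] = torch.roll(qkv[:, :, H // 2:], shifts=-group_size // 2, dims=1)
    # split the sequence into groups; attention is then computed within each group
    return shifted.reshape(B * (S // group_size), group_size, H, D)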
78% MNIST accuracy using GZIP in under 10 lines of code
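The idea is a k-nearest-neighbour classifier that uses the Normalized Compression Distance (NCD), with gzip's compressed length standing in for Kolmogorov complexity. The snippet assumes training_set and test_set are already loaded as iterables of (NumPy image array, label) pairs and classifies each test digit by a majority vote among its 5 NCD-nearest training images.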
import gzip
import numpy as np
from collections import Counter

c = lambda z: len(gzip.compress(z.tobytes()))   # compressed length of an image's raw bytes

def ncd(x, y):                                  # Normalized Compression Distance
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

# Precompute compressed lengths, then take a majority vote among the 5 NCD-nearest training images.
cls = [(x, c(x), l) for x, l in training_set]
correct_predictions = sum(np.array_equal(Counter(
    [l for _, _, l in sorted([(ncd(x1, x), x, l) for x, _, l in cls],
                             key=lambda t: t[0])[:5]]).most_common(1)[0][0], label)
    for x1, label in test_set)