Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
PB is a general-purpose self-referential self-improvement mechanism for LLMs. Given a seed set of mutation-prompts, thinking-styles, and a domain-specific problem description, PB generates variations of the task-prompts and mutation-prompts, exploiting the fact that LLMs can be prompted to act as mutation operators. Based on the fitness of the evolved task-prompts, a subset of evolutionary units, each consisting of a task-prompt and its associated mutation-prompt, is selected for future generations. Over multiple generations, PB adapts the prompts to the domain at hand; e.g., in a mathematical domain, PB evolved the task-prompt "Show all your working. II. You should use the correct mathematical notation and vocabulary, where appropriate. III. You should write your answer in full sentences and in words. IV. You should use examples to illustrate your points and prove your answers. V. Your workings out should be neat and legible" on GSM8K.
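A minimal sketch of the evolutionary loop, assuming a hypothetical llm() completion function; population initialization, the fitness estimate, and the mutation operators are all heavily simplified relative to the paper:

```python
import random


def llm(prompt: str) -> str:
    """Hypothetical completion call; replace with a real LLM API."""
    raise NotImplementedError


def fitness(task_prompt: str, train_set) -> float:
    # Fraction of training questions answered correctly when the
    # task-prompt is prepended to each question.
    correct = sum(
        1 for question, answer in train_set
        if answer in llm(f"{task_prompt}\n{question}")
    )
    return correct / len(train_set)


def evolve(units, train_set, generations=20):
    """Each unit is a (task_prompt, mutation_prompt) pair."""
    for _ in range(generations):
        # Binary tournament selection: the winner's mutated offspring
        # overwrites the loser's slot.
        a, b = random.sample(range(len(units)), 2)
        if fitness(units[a][0], train_set) < fitness(units[b][0], train_set):
            a, b = b, a  # a now indexes the fitter unit
        task_prompt, mutation_prompt = units[a]
        # The LLM acts as the mutation operator on the task-prompt ...
        new_task = llm(f"{mutation_prompt}\nINSTRUCTION: {task_prompt}\nNew instruction:")
        # ... and, self-referentially, on the mutation-prompt itself.
        new_mutation = llm(f"Please improve this mutation-prompt:\n{mutation_prompt}")
        units[b] = (new_task.strip(), new_mutation.strip())
    return max(units, key=lambda u: fitness(u[0], train_set))
```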
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
L2CEval is a comprehensive evaluation of LLMs for natural language to code generation along several axes: model scale, training data, sensitivity to few-shot exemplars, and the impact of instruction tuning.
L2CEval covers a wide range of state-of-the-art models, specifically 54 models from 13 different organizations, all evaluated on 3 core domains of language-to-code generation tasks. It includes extensive evaluations of models ranging from 1B parameters to significantly larger ones such as OpenAI's davinci and GPT-4 models, with estimated sizes of 170B+ parameters.
The study should be useful to the community for applying LLMs to downstream code applications.
https://l2c-eval.github.io/
Think before you speak: Training Language Models With Pause Tokens
Language models generate responses by producing a series of tokens in immediate succession: the (K + 1)-th token is an outcome of manipulating K hidden vectors per layer, one vector per preceding token.
What happens if we delay a model's answer generation, and how can we execute these delays? What if we were to let the model manipulate, say, K + 10 hidden vectors before it outputs the (K + 1)-th token?
The authors operationalize this idea by performing training and inference on language models with a pause token, a sequence of which is appended to the input prefix. It allows the model to perform extra computation before committing to an answer.
The main finding is that such delays yield gains on downstream tasks covering reasoning, question answering, and general understanding, when the model is both pre-trained and fine-tuned with delays.
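A minimal inference-time sketch, assuming a checkpoint that was already pre-trained and fine-tuned with a <pause> token added to the tokenizer; the checkpoint path and the 10-token delay are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/pause-pretrained-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)


def generate_with_pauses(prefix: str, n_pause: int = 10, max_new_tokens: int = 64) -> str:
    # Append n_pause copies of <pause> to the prefix, giving the model extra
    # hidden vectors to manipulate before the (K + 1)-th answer token.
    pause_id = tok.convert_tokens_to_ids("<pause>")  # assumes the token was added during training
    ids = tok(prefix, return_tensors="pt").input_ids
    ids = torch.cat([ids, torch.full((1, n_pause), pause_id)], dim=1)
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    # Anything produced at the pause positions is ignored; decode only the new tokens.
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```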
ICAART 2024: International Conference on Agents and Artificial Intelligence
Conference Areas
1. Agents
2. Artificial Intelligence
Deadlines:
1. Regular Paper Submission Extension: October 26, 2023
2. Position Paper Submission: November 15, 2023
3. Doctoral Consortium Paper Submission: December 21, 2023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
SWE-bench is an evaluation framework comprising 2,294 software engineering problems drawn from real GitHub issues and the corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.
The evaluations show that both state-of-the-art proprietary models and the authors' fine-tuned SWE-Llama models can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever.
Leaderboard: http://www.swebench.com/
GitHub: https://github.com/princeton-nlp/SWE-bench
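A minimal sketch of turning one benchmark instance into a prompt; the Hugging Face dataset name matches the repository, but the field names and the diff-style output request are assumptions on my part rather than the official harness:

```python
from datasets import load_dataset

# Assumed instance layout: repo, problem_statement, patch, ... per row.
swe = load_dataset("princeton-nlp/SWE-bench", split="test")


def build_prompt(instance, retrieved_files: str) -> str:
    # "Oracle" retrieval in the paper means showing exactly the files touched
    # by the reference patch; retrieved_files stands in for whatever retrieval
    # strategy is actually used.
    return (
        f"You are given an issue from the {instance['repo']} repository.\n\n"
        f"Issue:\n{instance['problem_statement']}\n\n"
        f"Relevant files:\n{retrieved_files}\n\n"
        "Produce a unified diff that resolves the issue."
    )


print(build_prompt(swe[0], retrieved_files="<file contents omitted>"))
```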
Mistral 7B Paper
A paper about the Mistral 7B model appeared on arXiv.
Mistral 7B v0.1 is a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.
AutoAgents: A Framework for Automatic Agent Generation
In the paper, the authors propose a framework for agent orchestration. This multi-agent approach makes it possible to solve problems that are difficult for a single model to cope with. What distinguishes it from previous approaches is that it combines, at the same time:
- an unlimited number of dynamically generated agents,
- multi-agent conversation,
- self-refinement of agents,
- collaborative refinement actions.
github: https://github.com/Link-AGI/AutoAgents
huggingface: https://huggingface.co/spaces/LinkSoul/AutoAgents
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
CodeFuse-13B is an open-source pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages.
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
CrossCodeEval is a diverse and multilingual code completion benchmark that requires in-depth cross-file contextual understanding to complete the code accurately. It is built on a diverse set of real-world, open-source, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#.
github: https://github.com/amazon-science/cceval
Is code text, or is text code? (1)
The same approaches are often used to work with code and with text: namely, code is processed as a sequence of tokens. This is not the only way to work with code; for example, you can represent code as an AST, a DFG, etc., and there are approaches that combine different representations, e.g., CodeBERT or GraphCodeBERT. Nevertheless, the basic approach today is to treat code as if it were text. This makes it possible to use unified data-processing methods and to build models that work with natural language and programming languages simultaneously.
Okay, let's assume that code is text. Can we say that text is code? Are there fundamental differences between natural language and programming languages? At first glance, these languages are designed for different tasks: one for expression, the other for execution.
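As a quick illustration of the two views, here is the same function as a token sequence and as a tree, using only the Python standard library:

```python
import ast
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"

# Token view: the code is just a sequence of strings.
print([t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)])

# Tree view: the same code as an abstract syntax tree.
print(ast.dump(ast.parse(src), indent=2))
```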
Is code text, or is text code? (2)
One of the most important properties of code is functionality. Code runs on a computer and produces some result: a changed state of the machine, for example output to stdout or a file on disk. Speaking of functionality, it is necessary to mention the syntactic and semantic properties of code. Syntactic properties are what the code looks like; semantic properties are what it does. In particular, two functions can look different but implement the same functionality (this is where the clone detection problem arises: you need to determine whether two code fragments are clones of each other). At first glance, the concept of functionality is what distinguishes code from text, but if we consider text as a program that runs inside a person and changes their state, the difference between code and text becomes smaller.
So, code is a program, a set of instructions executed on a computer. A natural-language text is a program that is executed by a person: the person reads the text and their state changes.
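A toy example of the syntax/semantics distinction: the two functions below look different but compute the same thing, which is exactly what a clone detector has to recognize.

```python
def sum_to_n_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n: int) -> int:
    return n * (n + 1) // 2


# Syntactically different, semantically equivalent (for non-negative n);
# a clone detector should flag these as functional clones.
assert all(sum_to_n_loop(n) == sum_to_n_formula(n) for n in range(100))
```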
LPU: LLM as a processing unit
If text is a program for humans, then in the case of LLMs, text is a program for LLMs. The LLM, in turn, plays the role of a kind of computer: an LPU (Language Processing Unit).
In the case of an LLM, the input is often divided into an instruction (prompt) and a description of the specific problem being solved. For example, the instruction is "Let's think step by step" and the task description is "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" (from the paper).
By now there are many works confirming that the quality of the solution depends on the instruction. There are different approaches to what the instruction might look like (e.g., Scratchpads, CoT, or ToT).
All this suggests that it is convenient to call only the instruction, not the entire text, the program for the LPU. The task description is a parameter passed to the program.
Thus we have:
- LLM: the LPU, the computer,
- prompt, instruction: the program,
- task description: the parameter being passed.
Programming for the LPU means writing a suitable prompt (instruction) such that, when a specific task instance is passed as a parameter, the answer suits us as well as possible.
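A minimal sketch of this analogy, with a hypothetical llm() call standing in for the LPU itself:

```python
def llm(text: str) -> str:
    """Hypothetical LPU call; replace with a real LLM API."""
    raise NotImplementedError


# The "program": a fixed instruction that is written once and reused.
PROGRAM = "Let's think step by step."


def run_on_lpu(program: str, task_description: str) -> str:
    # The task description is the parameter passed to the program.
    return llm(f"{task_description}\n{program}")


answer = run_on_lpu(
    PROGRAM,
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
)
```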
What's next?
If we accept what is written above, then ML development now splits into two kinds of work:
- improving the LPU, and
- writing programs for the LPU.
Improving the LPU includes both work on hardware for LPUs and training the models. Both require large resources and expensive technology, and may become the prerogative of global companies. In addition, we can expect these two strands (hardware work and training) to converge.
The second direction, writing programs for the LPU, is more applied and less labor-intensive. From this point of view, it can be expected to remain accessible to local companies.
Training the models, i.e. changing their weights, may become less popular. The main potential reasons:
- models can be adapted without changing the weights by lengthening the context (see, e.g., the paper Why Can GPT Learn In-Context?, which describes a mechanism of implicit fine-tuning); a similar mechanism can provide the necessary adaptation through accumulated history,
- changing weights is labor-intensive: it requires data and training compute,
- we can expect LPUs for which the weights cannot be changed at all (hardware implementations for fast inference).
A term of length 4,523,659,424,929
Bourbaki suggest that their definition of the number 1 runs to some tens of thousands of symbols. Adrian Mathias shows that this is a considerable under-estimate: the true count is 4,523,659,424,929 symbols, not counting 1,179,618,517,981 disambiguatory links.
If the ordered pair (x, y) is introduced by definition rather than taken as a primitive, the term defining 1 will have 2409875496393137472149767527877436912979508338752092897 symbols, with 871880233733949069946182804910912227472430953034182177 links.
At 80 symbols per line, 50 lines per page, and 1,000 pages per book, the shorter version would occupy more than a million books, and the longer one about 6 × 10^47 books. If each book weighed a kilogram, these books would be about 200,000 times heavier than the Milky Way.
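A quick sanity check of the book counts (the Milky Way mass of roughly 3 × 10^42 kg is my own rough assumption):

```python
SYMBOLS_SHORT = 4_523_659_424_929
SYMBOLS_LONG = 2409875496393137472149767527877436912979508338752092897
SYMBOLS_PER_BOOK = 80 * 50 * 1000              # symbols/line * lines/page * pages/book

print(SYMBOLS_SHORT / SYMBOLS_PER_BOOK)        # ~1.1e6  -> more than a million books
print(SYMBOLS_LONG / SYMBOLS_PER_BOOK)         # ~6.0e47 books

MILKY_WAY_KG = 3e42                            # rough assumption, ~1.5e12 solar masses
print(SYMBOLS_LONG / SYMBOLS_PER_BOOK / MILKY_WAY_KG)  # ~2e5, i.e. about 200,000x
```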
GAIA: a benchmark for General AI Assistants
GAIA is a benchmark for AI systems built around general assistant questions. GAIA attempts to circumvent various pitfalls of LLM evaluation. It is composed of 466 questions designed and annotated by humans. The questions are text-based and sometimes come with a file (such as an image or a spreadsheet). They are designed to admit a short, single correct answer and are therefore easy to verify.
The authors show that human respondents obtain 92%, vs. 15% for GPT-4 equipped with plugins. This performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry.
Design choices:
- target questions that are conceptually simple although potentially tedious for humans
- interpretability
- robustness against memorization
- ease of use
Code: https://huggingface.co/gaia-benchmark