Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
PB is a general-purpose self-referential self-improvement mechanism for LLMs. Given a seed set of mutation-prompts, thinking-styles, and a domain-specific problem description, PB generates variations of the task-prompts and mutation-prompts, exploiting the fact that LLMs can be prompted to act as mutation operators. Based on the fitness of the evolved task-prompts, a subset of evolutionary units, each consisting of a task-prompt and its associated mutation-prompt, is selected for future generations. Over multiple generations, PB adapts the prompts to the domain at hand; e.g., in a mathematical domain, PB evolved the task-prompt "Show all your working. II. You should use the correct mathematical notation and vocabulary, where appropriate. III. You should write your answer in full sentences and in words. IV. You should use examples to illustrate your points and prove your answers. V. Your workings out should be neat and legible" on GSM8K.
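A minimal sketch of the evolutionary loop, assuming a hypothetical llm() completion function; population initialization, the fitness estimate, and the mutation operators are all heavily simplified relative to the paper:

```python
import random


def llm(prompt: str) -> str:
    """Hypothetical completion call; replace with a real LLM API."""
    raise NotImplementedError


def fitness(task_prompt: str, train_set) -> float:
    # Fraction of training questions answered correctly when the
    # task-prompt is prepended to each question.
    correct = sum(
        1 for question, answer in train_set
        if answer in llm(f"{task_prompt}\n{question}")
    )
    return correct / len(train_set)


def evolve(units, train_set, generations=20):
    """Each unit is a (task_prompt, mutation_prompt) pair."""
    for _ in range(generations):
        # Binary tournament selection: the winner's mutated offspring
        # overwrites the loser's slot.
        a, b = random.sample(range(len(units)), 2)
        if fitness(units[a][0], train_set) < fitness(units[b][0], train_set):
            a, b = b, a  # a now indexes the fitter unit
        task_prompt, mutation_prompt = units[a]
        # The LLM acts as the mutation operator on the task-prompt ...
        new_task = llm(f"{mutation_prompt}\nINSTRUCTION: {task_prompt}\nNew instruction:")
        # ... and, self-referentially, on the mutation-prompt itself.
        new_mutation = llm(f"Please improve this mutation-prompt:\n{mutation_prompt}")
        units[b] = (new_task.strip(), new_mutation.strip())
    return max(units, key=lambda u: fitness(u[0], train_set))
```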
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
L2CEval is a comprehensive evaluation of LLMs for natural language to code generation along several axes: model scale, training data, sensitivity to few-shot exemplars, and the impact of instruction tuning.
L2CEval covers a wide range of state-of-the-art models, specifically 54 models from 13 different organizations, all evaluated on 3 core domains of language-to-code generation tasks. It includes extensive evaluations of models ranging from 1B parameters to significantly larger ones such as OpenAI's davinci and GPT-4 models, with estimated sizes of 170B+ parameters.
The study should be useful to the community for applying LLMs to downstream code applications.
https://l2c-eval.github.io/
Think before you speak: Training Language Models With Pause Tokens
Language models generate responses by producing a series of tokens in immediate succession: the (K + 1)-th token is an outcome of manipulating K hidden vectors per layer, one vector per preceding token.
What happens if we delay a model's answer generation, and how can we execute these delays? What if we were to let the model manipulate, say, K + 10 hidden vectors before it outputs the (K + 1)-th token?
The authors operationalize this idea by performing training and inference on language models with a pause token, a sequence of which is appended to the input prefix. It allows the model to perform extra computation before committing to an answer.
The main finding is that such delays yield gains on downstream tasks covering reasoning, question answering, and general understanding, when the model is both pre-trained and fine-tuned with delays.
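A minimal inference-time sketch, assuming a checkpoint that was already pre-trained and fine-tuned with a <pause> token added to the tokenizer; the checkpoint path and the 10-token delay are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/pause-pretrained-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)


def generate_with_pauses(prefix: str, n_pause: int = 10, max_new_tokens: int = 64) -> str:
    # Append n_pause copies of <pause> to the prefix, giving the model extra
    # hidden vectors to manipulate before the (K + 1)-th answer token.
    pause_id = tok.convert_tokens_to_ids("<pause>")  # assumes the token was added during training
    ids = tok(prefix, return_tensors="pt").input_ids
    ids = torch.cat([ids, torch.full((1, n_pause), pause_id)], dim=1)
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    # Anything produced at the pause positions is ignored; decode only the new tokens.
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```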
ICAART 2024: International Conference on Agents and Artificial Intelligence
Conference Areas
1. Agents
2. Artificial Intelligence
Deadlines:
1. Regular Paper Submission Extension: October 26, 2023
2. Position Paper Submission: November 15, 2023
3. Doctoral Consortium Paper Submission: December 21, 2023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
SWE-bench is an evaluation framework comprising 2,294 software engineering problems drawn from real GitHub issues and the corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.
The evaluations show that both state-of-the-art proprietary models and the authors' fine-tuned SWE-Llama models can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever.
Leaderboard: http://www.swebench.com/
GitHub: https://github.com/princeton-nlp/SWE-bench
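A minimal sketch of turning one benchmark instance into a prompt; the Hugging Face dataset name matches the repository, but the field names and the diff-style output request are assumptions on my part rather than the official harness:

```python
from datasets import load_dataset

# Assumed instance layout: repo, problem_statement, patch, ... per row.
swe = load_dataset("princeton-nlp/SWE-bench", split="test")


def build_prompt(instance, retrieved_files: str) -> str:
    # "Oracle" retrieval in the paper means showing exactly the files touched
    # by the reference patch; retrieved_files stands in for whatever retrieval
    # strategy is actually used.
    return (
        f"You are given an issue from the {instance['repo']} repository.\n\n"
        f"Issue:\n{instance['problem_statement']}\n\n"
        f"Relevant files:\n{retrieved_files}\n\n"
        "Produce a unified diff that resolves the issue."
    )


print(build_prompt(swe[0], retrieved_files="<file contents omitted>"))
```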
Mistral 7B Paper
A paper about the Mistral 7B model appeared on arXiv.
Mistral 7B v0.1 is a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.
AutoAgents: A Framework for Automatic Agent Generation
In the paper, the authors propose a framework for agent orchestration. This multi-agent approach makes it possible to solve problems that are difficult for a single model to cope with. What distinguishes it from previous approaches is that it combines, at the same time:
- an unlimited number of dynamically generated agents,
- multi-agent conversation,
- self-refinement of agents,
- collaborative refinement actions.
github: https://github.com/Link-AGI/AutoAgents
huggingface: https://huggingface.co/spaces/LinkSoul/AutoAgents
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
CodeFuse-13B is an open-source pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages.
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
CrossCodeEval is a diverse and multilingual code completion benchmark that requires in-depth cross-file contextual understanding to complete the code accurately. It is built on a diverse set of real-world, open-source, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#.
github: https://github.com/amazon-science/cceval
Is code text, or is text code? (1)
The same approaches are often used to work with code and with text: namely, code is processed as a sequence of tokens. This is not the only way to work with code; for example, you can represent code as an AST, a DFG, etc., and there are approaches that combine different representations, e.g., CodeBERT or GraphCodeBERT. Nevertheless, the basic approach today is to treat code as if it were text. This makes it possible to use unified data-processing methods and to build models that work with natural language and programming languages simultaneously.
Okay, let's assume that code is text. Can we say that text is code? Are there fundamental differences between natural language and programming languages? At first glance, these languages are designed for different tasks: one for expression, the other for execution.
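As a quick illustration of the two views, here is the same function as a token sequence and as a tree, using only the Python standard library:

```python
import ast
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"

# Token view: the code is just a sequence of strings.
print([t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)])

# Tree view: the same code as an abstract syntax tree.
print(ast.dump(ast.parse(src), indent=2))
```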
Is code text, or is text code? (2)
One of the most important properties of code is functionality. Code runs on a computer and produces some result: a changed state of the machine, for example output to stdout or a file on disk. Speaking of functionality, it is necessary to mention the syntactic and semantic properties of code. Syntactic properties are what the code looks like; semantic properties are what it does. In particular, two functions can look different but implement the same functionality (this is where the clone detection problem arises: you need to determine whether two code fragments are clones of each other). At first glance, the concept of functionality is what distinguishes code from text, but if we consider text as a program that runs inside a person and changes their state, the difference between code and text becomes smaller.
So, code is a program, a set of instructions executed on a computer. A natural-language text is a program that is executed by a person: the person reads the text and their state changes.
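A toy example of the syntax/semantics distinction: the two functions below look different but compute the same thing, which is exactly what a clone detector has to recognize.

```python
def sum_to_n_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n: int) -> int:
    return n * (n + 1) // 2


# Syntactically different, semantically equivalent (for non-negative n);
# a clone detector should flag these as functional clones.
assert all(sum_to_n_loop(n) == sum_to_n_formula(n) for n in range(100))
```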
LPU: LLM as a processing unit
If text is a program for humans, then in the case of LLMs, text is a program for LLMs. The LLM, in turn, plays the role of a kind of computer: an LPU (Language Processing Unit).
In the case of an LLM, the input is often divided into an instruction (prompt) and a description of the specific problem being solved. For example, the instruction is "Let's think step by step" and the task description is "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" (from the paper).
By now there are many works confirming that the quality of the solution depends on the instruction. There are different approaches to what the instruction might look like (e.g., Scratchpads, CoT, or ToT).
All this suggests that it is convenient to call only the instruction, not the entire text, the program for the LPU. The task description is a parameter passed to the program.
Thus we have:
- LLM: the LPU, the computer,
- prompt, instruction: the program,
- task description: the parameter being passed.
Programming for the LPU means writing a suitable prompt (instruction) such that, when a specific task instance is passed as a parameter, the answer suits us as well as possible.
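A minimal sketch of this analogy, with a hypothetical llm() call standing in for the LPU itself:

```python
def llm(text: str) -> str:
    """Hypothetical LPU call; replace with a real LLM API."""
    raise NotImplementedError


# The "program": a fixed instruction that is written once and reused.
PROGRAM = "Let's think step by step."


def run_on_lpu(program: str, task_description: str) -> str:
    # The task description is the parameter passed to the program.
    return llm(f"{task_description}\n{program}")


answer = run_on_lpu(
    PROGRAM,
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
)
```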
What's next?
If we accept what is written above, then ML development now splits into two kinds of work:
- improving the LPU, and
- writing programs for the LPU.
Improving the LPU includes both work on hardware for LPUs and training the models. Both require large resources and expensive technology, and may become the prerogative of global companies. In addition, we can expect these two strands (hardware work and training) to converge.
The second direction, writing programs for the LPU, is more applied and less labor-intensive. From this point of view, it can be expected to remain accessible to local companies.
Training the models, i.e. changing their weights, may become less popular. The main potential reasons:
- models can be adapted without changing the weights by lengthening the context (see, e.g., the paper Why Can GPT Learn In-Context?, which describes a mechanism of implicit fine-tuning); a similar mechanism can provide the necessary adaptation through accumulated history,
- changing weights is labor-intensive: it requires data and training compute,
- we can expect LPUs for which the weights cannot be changed at all (hardware implementations for fast inference).
A term of length 4,523,659,424,929
Bourbaki suggest that their definition of the number 1 runs to some tens of thousands of symbols. Adrian Mathias shows that this is a considerable under-estimate: the true count is 4,523,659,424,929 symbols, not counting 1,179,618,517,981 disambiguatory links.
If the ordered pair (x, y) is introduced by definition rather than taken as a primitive, the term defining 1 will have 2409875496393137472149767527877436912979508338752092897 symbols, with 871880233733949069946182804910912227472430953034182177 links.
At 80 symbols per line, 50 lines per page, and 1,000 pages per book, the shorter version would occupy more than a million books, and the longer one about 6 × 10^47 books. If each book weighed a kilogram, these books would be about 200,000 times heavier than the Milky Way.
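A quick sanity check of the book counts (the Milky Way mass of roughly 3 × 10^42 kg is my own rough assumption):

```python
SYMBOLS_SHORT = 4_523_659_424_929
SYMBOLS_LONG = 2409875496393137472149767527877436912979508338752092897
SYMBOLS_PER_BOOK = 80 * 50 * 1000              # symbols/line * lines/page * pages/book

print(SYMBOLS_SHORT / SYMBOLS_PER_BOOK)        # ~1.1e6  -> more than a million books
print(SYMBOLS_LONG / SYMBOLS_PER_BOOK)         # ~6.0e47 books

MILKY_WAY_KG = 3e42                            # rough assumption, ~1.5e12 solar masses
print(SYMBOLS_LONG / SYMBOLS_PER_BOOK / MILKY_WAY_KG)  # ~2e5, i.e. about 200,000x
```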
GAIA: a benchmark for General AI Assistants
GAIA is a benchmark for AI systems built around general assistant questions. GAIA attempts to circumvent various pitfalls of LLM evaluation. It is composed of 466 questions designed and annotated by humans. The questions are text-based and sometimes come with a file (such as an image or a spreadsheet). They are designed to admit a short, single correct answer and are therefore easy to verify.
The authors show that human respondents obtain 92%, vs. 15% for GPT-4 equipped with plugins. This performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry.
Design choices:
- target questions that are conceptually simple although potentially tedious for humans
- interpretability
- robustness against memorization
- ease of use
Code: https://huggingface.co/gaia-benchmark