ml4se
Machine Learning for Software Engineering
Mistral 7B Paper

A paper about the Mistral 7B model has appeared on arxiv.org.

Mistral 7B v0.1 is a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.
👍5
AutoAgents: A Framework for Automatic Agent Generation

In the paper, the authors propose a framework for agent orchestration. This multi-agent approach makes it possible to solve problems that are difficult for a single model to handle. What distinguishes this approach from previous ones is that it combines, at the same time:
- an unlimited number of dynamically generated agents,
- multi-agent conversation,
- self-refinement agents,
- collaborative refinement actions.

github: https://github.com/Link-AGI/AutoAgents
huggingface: https://huggingface.co/spaces/LinkSoul/AutoAgents
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model

CodeFuse-13B is an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages.
👍3
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

CrossCodeEval is a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#.

github: https://github.com/amazon-science/cceval
👍1
Is code text, or is text code? (1)

The same approaches are often used to work with code and text: namely, the code is processed as a sequence of tokens. This is not the only way of working with code; for example, you can represent the code as an AST, a DFG, etc. There are approaches that combine different representations, e.g., CodeBERT or GraphCodeBERT. Still, the basic approach now is to treat code as if it were text. This allows you to use unified data processing methods and to build models that can work with natural language and programming languages at the same time.
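To make the two views concrete, here is a minimal Python sketch (standard library only; the snippet itself is an invented example) that shows the same function first as a flat token sequence and then as an AST:

import ast
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

# View 1: code as a flat token sequence (how most code LLMs consume it).
tokens = [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)
          if t.string.strip()]
print(tokens)
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# View 2: the same code as a syntax tree (structure rather than surface form).
print(ast.dump(ast.parse(source), indent=2))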

Okay, let's assume the code is text. Can we say that text is code? Are there fundamental differences between natural language and programming language? At first glance, these languages are designed for different tasks: one for expression, the other for execution.
Is code text, or is text code? (2)

One of the most important properties of code is the concept of functionality. The code runs on a computer and produces some result: a changed state of the computer, for example, output to stdout or a file on disk. Speaking of functionality, it is necessary to mention the syntactic and semantic properties of code. Syntactic properties are what the code looks like; semantic properties are what it does. In particular, two functions can look different but implement the same functionality (here the problem of clone detection arises: you need to understand whether two code fragments are clones of each other). At first glance, the concept of functionality is what distinguishes code from text, but if we consider that a text is a program that runs inside a person and changes their state, then the difference between code and text becomes smaller.
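A toy illustration of this split (both functions are invented for the example): the two implementations below differ syntactically but are semantically the same, and a naive behavioral check can only sample inputs, which is part of what makes clone detection hard.

# Two syntactically different implementations of the same functionality.
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n):
    return n * (n + 1) // 2

# A naive semantic check: compare behavior on sampled inputs.
# (Exhaustive testing is impossible; program equivalence is undecidable
# in general, which is why clone detection is a hard problem.)
assert all(sum_loop(n) == sum_formula(n) for n in range(100))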

So, code is a program, or set of instructions, that is executed on a computer. A natural language text is a program that is executed by a person: a person reads the text, and the text changes their state.
LPU: LLM as a processing unit

If text is a program for humans, then in the case of LLMs, text is a program for LLMs. The LLM, in turn, then plays the role of a kind of computer: an LPU (Language Processing Unit).

In the case of an LLM, the input to the model is often divided into an instruction (prompt) and the description of the problem being solved. For example, the instruction is "Let's think step by step" and the task description is "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" (from the paper).

At the moment, there are many works confirming that the quality of solving a problem depends on the instruction. There are different approaches to what exactly the instruction might look like (e.g., Scratchpads, CoT, or ToT).

All this suggests that it is convenient to call only the instruction, rather than the entire text, the program for the LPU. The task description is then a parameter passed to the program.

Thus we have:
- LLM: LPU, computer,
- prompt, instruction: program,
- task description: parameters to be passed.

Programming for the LPU means writing a suitable prompt (instruction) so that, when a specific instance of a task is passed as a parameter, the answer suits us as much as possible.
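A minimal sketch of this view (call_llm is a hypothetical stand-in for any LLM API; the instruction and task are the examples quoted above):

INSTRUCTION = "Let's think step by step."  # the "program"

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real model or API call.
    return f"<model output for: {prompt!r}>"

def run_on_lpu(instruction: str, task: str) -> str:
    # Compose the program (instruction) with its parameter (task)
    # and execute it on the LPU (the LLM).
    return call_llm(f"{task}\n{instruction}")

task = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?")
print(run_on_lpu(INSTRUCTION, task))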
What's next?

If we accept what is written above, then ML development is now divided into work on:
- improvement of LPU and
- writing programs for LPU.

Improving the LPU includes both work on hardware for the LPU and training models. Both require large resources and expensive technologies, and may become the prerogative of global companies. In addition, we can expect these two technologies (hardware work and training) to converge.

The second direction, writing programs for the LPU, is more applied and less labor-intensive. From this point of view, it can be expected to remain accessible to local companies.

Training models, i.e., changing their weights, may become less popular. The main potential reasons:
- models can be adapted without changing the weights by lengthening the context (see, e.g., the paper Why Can GPT Learn In-Context?, which describes a mechanism of implicit additional training). A similar mechanism can provide the necessary reinforcement through accumulated history,
- changing weights is labor-intensive: data and training compute are needed,
- we can expect that there will be LPUs whose weights cannot be changed at all (hardware implementations for fast inference).
🔥1
A term of length 4,523,659,424,929

Bourbaki suggest that their definition of the number 1 runs to some tens of thousands of symbols. Adrian Mathias shows that this is a considerable under-estimate: the true number of symbols is 4,523,659,424,929, not counting 1,179,618,517,981 disambiguatory links.

If the ordered pair (x, y) is introduced by definition rather than taken as a primitive, the term defining 1 will have 2409875496393137472149767527877436912979508338752092897 symbols, with 871880233733949069946182804910912227472430953034182177 links.

At 80 symbols per line, 50 lines per page, and 1,000 pages per book, the shorter version would occupy more than a million books, and the longer one about 6 * 10^47 books. If each book weighed a kilogram, these books would be about 200,000 times heavier than the Milky Way.
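The arithmetic is easy to re-check (the Milky Way mass below is a rough external estimate, not a figure from the note):

symbols = 2409875496393137472149767527877436912979508338752092897
symbols_per_book = 80 * 50 * 1000        # symbols/line * lines/page * pages/book
books = symbols // symbols_per_book
print(f"{books:.1e} books")              # ~6.0e+47 books

milky_way_kg = 3e42                      # assumed: ~1.5e12 solar masses * 2e30 kg
print(f"~{books / milky_way_kg:,.0f} times the Milky Way's mass (at 1 kg/book)")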
😁2
GAIA: a benchmark for General AI Assistants

GAIA is a benchmark for AI systems consisting of general assistant questions. GAIA attempts to circumvent various pitfalls of LLM evaluation. It is composed of 466 questions designed and annotated by humans. The questions are text-based and sometimes come with a file (such as an image or a spreadsheet). They are designed to admit a short, single correct answer, and are therefore easy to verify.

The authors show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills, e.g., in law or chemistry.

Design choices:
- target questions that are conceptually simple, although potentially tedious, for humans,
- interpretability,
- robustness against memorization,
- ease of use.

Code: https://huggingface.co/gaia-benchmark
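A hedged loading sketch, assuming the benchmark stays hosted as a (gated) dataset on the Hugging Face Hub; the config name and field names below are taken from the hub card and may change:

from datasets import load_dataset

# Access is gated: accept the terms on the hub page and log in first.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")
example = gaia["validation"][0]
print(example["Question"])       # text-based question with a short, verifiable answer
print(example["Final answer"])   # assumed field name; check the dataset schema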
👍2
New LLMs appear regularly. They are available for download or via API. We can't always trust these LLMs (toxicity, unsafe content, etc.).

Is there a future for services that check LLMs for these risks? Certification, online testing via API, etc.
Anonymous Poll
Yes: 70%
No: 30%
DeepSeek Coder: Let the Code Write Itself

DeepSeek Coder is composed of a series of code language models.
- Pretrained on 2 trillion tokens over more than 80 programming languages.
- Various model sizes (1.3B, 5.7B, 6.7B and 33B) to support different requirements.
- A window size of 16K, supporting project-level code completion and infilling.
- State-of-the-Art performance among open code models.
- Open source and free for research and commercial use.

Huggingface: https://huggingface.co/deepseek-ai
GitHub: https://github.com/deepseek-ai/deepseek-coder/
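A minimal completion sketch with transformers (the checkpoint name is one of the published sizes; the prompt and generation settings are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/deepseek-coder-1.3b-base"  # assumed: smallest base checkpoint
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

prompt = "# write a quick sort algorithm\ndef quick_sort(arr):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))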
🔥6
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

The paper evaluates the effectiveness of LLMs for the task of detecting vulnerabilities in source code.
The authors compare different prompting approaches with each other. They also compare LLMs with static analysis tools (CodeQL) and with classical deep learning methods (LineVul). In addition, the authors draw some conclusions about the characteristics of synthetic and real-world datasets.

From the conclusions, I would like to note:
- combining GPT-4 and CodeQL gives 96-97% (one possible combination scheme is sketched below);
- efforts aimed at the ability to work with a larger context look promising.
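The paper's exact combination rule isn't quoted here; one simple way to pair a static analyzer with an LLM is to let the LLM adjudicate each analyzer finding. A hypothetical sketch (run_codeql and ask_gpt4 are stand-ins, not real APIs):

def run_codeql(source: str) -> list[str]:
    # Stand-in: would run CodeQL and return findings with locations.
    return ["sql-injection at line 3"]

def ask_gpt4(prompt: str) -> str:
    # Stand-in: would query the LLM and return its verdict.
    return "yes: user input flows into the query unsanitized"

def confirmed_vulnerabilities(source: str) -> list[str]:
    # The LLM adjudicates each static-analysis finding, filtering false positives.
    confirmed = []
    for finding in run_codeql(source):
        question = (f"Code:\n{source}\n\nStatic analysis reports: {finding}\n"
                    "Is this a real vulnerability? Answer yes or no.")
        if ask_gpt4(question).lower().startswith("yes"):
            confirmed.append(finding)
    return confirmed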
🔥2
LLM Visualization

A 3D visualization and walkthrough of a GPT-style large language model.
🔥1