LPU: LLM as a processing unit
If text is a program for humans, then text is likewise a program for LLMs. The LLM, in turn, plays the role of a kind of computer: an LPU (Language Processing Unit).
In the case of an LLM, the model's input is often divided into an instruction (prompt) and a description of the problem being solved. For example, the instruction is "Let's think step by step" and the task description is "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" (an example from the chain-of-thought prompting literature).
By now, many works confirm that the quality of a solution depends on the instructions, and there are different approaches to what the instructions might look like (e.g., Scratchpads, CoT, or ToT).
All of this suggests that it is convenient to call only the instructions, rather than the entire text, the program for the LPU. The task description is then a parameter passed to the program.
Thus we have:
- LLM: LPU, computer,
- prompt, instruction: program,
- task description: the parameters to be passed.
Programming for the LPU means writing a suitable prompt (instruction) so that, when a specific instance of a task is passed as the parameter, the answer suits us as well as possible.
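A minimal sketch of this "prompt as program, task as parameter" view, using the OpenAI chat API as a stand-in for any LPU (the model choice and prompt wording are illustrative assumptions, not from the original post):

```python
# "Prompt as program, task description as parameter" on an LPU.
# Assumes the openai package and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
PROGRAM = "Let's think step by step."  # the instruction, i.e. the program

def run_on_lpu(task_description: str) -> str:
    # The LPU "executes" the program with the task description as its parameter.
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative choice of LPU
        messages=[{"role": "user",
                   "content": f"{task_description}\n{PROGRAM}"}],
    )
    return response.choices[0].message.content

print(run_on_lpu(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
))
```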
What's next?
If we accept the above, ML development now divides into two kinds of work:
- improving the LPU, and
- writing programs for the LPU.
Improving the LPU includes both work on hardware for the LPU and training models. Both require large resources and expensive technology, and may become the prerogative of global companies. We can also expect these two technologies (hardware work and training) to converge.
The second direction, writing programs for the LPU, is more applied and less labor-intensive, so it can be expected to remain accessible to local companies.
Training models, i.e., changing their weights, may become less popular. The main potential causes:
- models can be adapted without changing the weights by lengthening the context (see, e.g., the paper Why Can GPT Learn In-Context?, which describes a mechanism of implicit fine-tuning); a similar mechanism could provide the necessary reinforcement from accumulated history (see the sketch after this list),
- changing weights is labor-intensive: it needs data and training compute,
- we can expect LPUs whose weights cannot be changed at all (hardware implementations for fast inference).
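A small sketch of such weight-free adaptation: examples placed in the context play the role that fine-tuning on them would otherwise play (the task and examples are made up for illustration):

```python
# In-context adaptation: behavior is steered by examples in the context,
# with no gradient updates to the weights. Examples are made up.
FEW_SHOT_PREFIX = (
    "Convert the identifier to snake_case.\n"
    "Input: getUserName -> Output: get_user_name\n"
    "Input: parseHTTPResponse -> Output: parse_http_response\n"
)

def adapt_without_training(new_input: str) -> str:
    # Lengthening the context with accumulated examples substitutes
    # for changing the weights.
    return FEW_SHOT_PREFIX + f"Input: {new_input} -> Output:"

print(adapt_without_training("readFileSync"))
```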
A term of length 4,523,659,424,929
Bourbaki suggest that their definition of the number 1 runs to some tens of thousands of symbols. Adrian Mathias shows that this is a considerable underestimate: the true number of symbols is 4,523,659,424,929, not counting 1,179,618,517,981 disambiguatory links.
If the ordered pair (x, y) is introduced by definition rather than taken as a primitive, the term defining 1 will have 2409875496393137472149767527877436912979508338752092897 symbols, with 871880233733949069946182804910912227472430953034182177 links.
At 80 symbols per line, 50 lines per page, 1,000 pages per book, the shorter version would occupy more than a million books, and the longer, 6 * 10^47 books. If each book weighed a kilogram, these books would be about 200,000 times heavier than the Milky Way.
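The book arithmetic is easy to check (a small sketch; the Milky Way mass of about 3e42 kg is my assumption, taken from the common estimate of roughly 1.5 trillion solar masses):

```python
# Checking the book counts and the Milky Way comparison.
symbols_short = 4_523_659_424_929
symbols_long = 2409875496393137472149767527877436912979508338752092897

symbols_per_book = 80 * 50 * 1000  # symbols/line * lines/page * pages/book
print(f"{symbols_short / symbols_per_book:.2e} books")  # ~1.13e6, over a million
print(f"{symbols_long / symbols_per_book:.2e} books")   # ~6.02e47

MILKY_WAY_KG = 3e42  # assumed: ~1.5e12 solar masses at ~2e30 kg each
ratio = (symbols_long / symbols_per_book) / MILKY_WAY_KG  # 1 kg per book
print(f"~{ratio:.0e} times the Milky Way's mass")        # ~2e5, i.e. 200,000x
```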
GAIA: a benchmark for General AI Assistants
GAIA is a benchmark for AI systems that poses general assistant questions. GAIA attempts to avoid common pitfalls of LLM evaluation. It is composed of 466 questions designed and annotated by humans. The questions are text-based and sometimes come with a file (such as an image or a spreadsheet). They are designed to admit a short, single correct answer and are therefore easy to verify.
The authors show that human respondents score 92% versus 15% for GPT-4 equipped with plugins. This performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry.
Design choices:
- target questions that are conceptually simple although potentially tedious for humans
- interpretability
- robustness against memorization
- ease of use
Code: https://huggingface.co/gaia-benchmark
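Since each question admits a short, single correct answer, verification can be as simple as normalized exact match (a sketch of the idea only; GAIA's actual scorer may normalize differently):

```python
# Normalized exact-match scoring for short, single-answer questions.
def normalize(answer: str) -> str:
    # Lowercase and collapse whitespace; real scorers may strip units, etc.
    return " ".join(answer.strip().lower().split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

assert exact_match("  17 ", "17")
assert not exact_match("seventeen", "17")
```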
New LLMs appear regularly, available for download or via API. We can't always trust these LLMs (toxicity, unsafe content, etc.).
Is there a future for services that check LLMs for these risks? Certification, online testing via API, etc.
Anonymous poll: Yes 70%, No 30%.
DeepSeek Coder: Let the Code Write Itself
DeepSeek Coder is composed of a series of code language models.
- Pretrained on 2 trillion tokens over more than 80 programming languages.
- Various model sizes (1.3B, 5.7B, 6.7B and 33B) to support different requirements.
- A 16K context window, supporting project-level code completion and infilling.
- State-of-the-art performance among open code models.
- Open source and free for research and commercial use.
Huggingface: https://huggingface.co/deepseek-ai
GitHub: https://github.com/deepseek-ai/deepseek-coder/
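A minimal sketch of running one of the checkpoints with transformers (the model ID follows the naming on the Hugging Face page; the prompt and generation settings are illustrative):

```python
# Code completion with a DeepSeek Coder checkpoint via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "# write a quicksort function in python\ndef quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```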
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
The paper evaluates the effectiveness of LLMs for the task of detecting vulnerabilities in source code.
The authors compare different prompting approaches with each other. They also compare LLMs with a static analysis tool (CodeQL) and with a classical deep learning method (LineVul). In addition, the authors draw some conclusions about the differences between synthetic and real-world datasets.
From the conclusions I would like to note:
- Combining GPT-4 and CodeQL gives 96-97%;
- Efforts toward handling larger contexts look promising.
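The post doesn't say how the two are combined; the simplest scheme is to union the findings of the model and the static analyzer, which trades precision for recall. A purely hypothetical sketch:

```python
# Hypothetical combination of an LLM-based detector and CodeQL by unioning
# their flagged locations. Illustrative only; the paper's actual
# combination strategy may differ.
def combine_findings(llm_flags: set[str], codeql_flags: set[str]) -> set[str]:
    # Flag a location if either detector flags it.
    return llm_flags | codeql_flags

llm_flags = {"parse_header", "read_packet"}      # made-up function names
codeql_flags = {"read_packet", "copy_buffer"}
print(sorted(combine_findings(llm_flags, codeql_flags)))
```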
Lost in the Middle: How Language Models Use Long Contexts
Language model performance is highest when relevant information occurs at the very beginning (primacy bias) or end of its input context (recency bias), and performance significantly degrades when models must access and use information in the middle of their input context.
Effective reranking of retrieved documents (pushing relevant information closer to the start of the input context) or ranked list truncation may be promising directions for improving how language-model-based readers use retrieved context.
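A toy sketch of that reranking idea: put the highest-scoring retrieved documents at the edges of the context, where attention is strongest (the documents and scores are made up):

```python
# Reorder retrieved documents so the most relevant land at the start and
# end of the context rather than the poorly-attended middle.
def order_for_context(docs_with_scores):
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, (doc, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
print(order_for_context(docs))  # ['b', 'c', 'a', 'd']: strongest at the edges
```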
garak, LLM vulnerability scanner
garak checks whether an LLM can be made to fail in a way we don't want. It probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses.
Docs: https://docs.garak.ai/garak/
AlphaCode 2 Technical Report
AlphaCode 2 is a new and enhanced system, powered by Gemini. AlphaCode 2 relies on the combination of language models and a bespoke search and reranking mechanism.
Optimum-NVIDIA on Hugging Face enables blazingly fast LLM inference in just 1 line of code
By changing just a single line of code, you can unlock up to 28x faster inference and 1,200 tokens/second on the NVIDIA platform.
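The "single line" is the import: swap the transformers auto class for the optimum-nvidia one (a sketch based on the announcement; assumes the optimum-nvidia package and a supported NVIDIA GPU):

```python
# The one-line change: import the auto class from optimum.nvidia instead
# of transformers. Sketch per the announcement; needs optimum-nvidia
# installed and a supported NVIDIA GPU.
# from transformers import AutoModelForCausalLM   # before
from optimum.nvidia import AutoModelForCausalLM   # after

# The rest of the usual transformers workflow is unchanged, e.g.:
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
```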
Naturalness of Attention: Revisiting Attention in Code Language Models
In this work, the authors revisit the mathematical definition of multi-head attention from prior work in natural language processing. They perform a layer-wise analysis of the trends of attention weights and of the transformation of the input for a code LM (CodeBERT), showing that the attention mechanism is not merely the attention weights.
RQs:
- How do the general trends across layers compare between the attention weights α and the norms of the scaled transformations ‖α·f(x)‖?
- How well does ‖α·f(x)‖ align with the syntactic structure of source code, compared to the attention weights?
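A toy sketch of the norm-based view for a single head (random data; f(x) = x·W_v with the bias omitted; all shapes are assumptions for illustration):

```python
# Compare raw attention weights alpha[i, j] with the norms of the vectors
# they actually mix, ||alpha[i, j] * f(x_j)||, for one attention head.
import numpy as np

n, d = 4, 8
x = np.random.randn(n, d)        # token representations
W_v = np.random.randn(d, d)      # value projection: f(x) = x @ W_v
alpha = np.random.rand(n, n)
alpha /= alpha.sum(axis=1, keepdims=True)  # row-stochastic attention weights

f_x = x @ W_v
weighted_norms = alpha * np.linalg.norm(f_x, axis=1)  # ||alpha_ij * f(x_j)||
# A large alpha_ij need not mean a large contribution: ||f(x_j)|| can
# amplify or suppress what token j actually adds to the mix.
print(alpha.round(2))
print(weighted_norms.round(2))
```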
NeurIPS 2023
Location: New Orleans Ernest N. Morial Convention Center
Dates: Sun, Dec 10, 2023 – Sat, Dec 16, 2023
Schedule: https://nips.cc/virtual/2023/calendar
Microsoft: https://www.microsoft.com/en-us/research/blog/neurips-2023-highlights-breadth-of-microsofts-machine-learning-innovation/
Stanford: https://ai.stanford.edu/blog/neruips-2023/
Google: https://blog.research.google/2023/12/google-at-neurips-2023.html
Meta: https://ai.meta.com/events/neurips-2023/
Salesforce: https://blog.salesforceairesearch.com/salesforce-research-at-neurips-2023/
promptbase [Microsoft]
promptbase is an evolving collection of resources, best practices, and example scripts for eliciting the best performance from foundation models like GPT-4.
Medprompt blog:
- The Power of Prompting
- Steering at the Frontier: Extending the Power of Prompting
Do you use "please" when communicating with LLMs? Example: https://github.com/microsoft/promptbase/blob/main/src/promptbase/math/math.py#L12
Anonymous poll: Yes 65%, No 35%.