Mistral has released Mixtral 8x7B, a MoE (Mixture of Experts) model that reportedly beats GPT-3.5 out of the box. There is also an instruction-finetuned Mixtral 8x7B Instruct. Interesting.
https://mistral.ai/news/mixtral-of-experts/
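For the curious, a minimal sketch of poking at the Instruct variant via transformers (assumes a recent transformers with Mixtral support, i.e. v4.36+, and a box with enough GPU memory; the model id is the one published on the Hub):

```python
# Minimal sketch: chatting with Mixtral-8x7B-Instruct via transformers (v4.36+).
# Assumes enough GPU memory; device_map="auto" shards the model across devices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 roughly halves the memory footprint
    device_map="auto",
)

# The Instruct model expects its chat template; apply_chat_template handles it.
messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```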
Also of interest: support for AMD GPUs is growing and maturing in the latest huggingface transformers.
AMD's ROCm GPU architecture is now supported across the board and fully tested in our CI with MI210/MI250 GPUs. We further enable specific hardware acceleration for ROCm in Transformers, such as Flash Attention 2, GPTQ quantization and DeepSpeed.
* Add RoCm scheduled CI & upgrade RoCm CI to PyTorch 2.1 by @fxmarty in #26940
* Flash Attention 2 support for RoCm by @fxmarty in #27611
* Reflect RoCm support in the documentation by @fxmarty in #27636
* restructure AMD scheduled CI by @ydshieh in #27743
https://github.com/huggingface/transformers/releases/tag/v4.36.0
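If you want to try the ROCm path, a hedged sketch: in v4.36 the attention backend is selected via the `attn_implementation` argument, so Flash Attention 2 should be requestable the same way as on NVIDIA (assuming a flash-attn build that works on your ROCm box, e.g. an MI210/MI250):

```python
# Sketch: requesting Flash Attention 2 in transformers v4.36.
# On ROCm this assumes a flash-attn build for your AMD GPU (e.g. MI210/MI250).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",              # any FA2-enabled architecture
    torch_dtype=torch.float16,                # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```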
And since today is heavy on LLM news, here's one more for those who missed it.
Nexusflow have released NexusRaven-V2, a 13B-parameter model that beats GPT-4 (though apparently not Turbo) at zero-shot function calling. Now you can build even more kinds of copilots :)
Blog: https://nexusflow.ai/blogs/ravenv2
HF: https://huggingface.co/Nexusflow/NexusRaven-V2-13B
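A sketch of what zero-shot function calling looks like with it. The prompt shape below (a Python signature plus docstring, then the user query) follows the general approach described on the model card; treat the template details, including the `<human_end>` marker and the `Call:` output prefix, as assumptions and check the card for the exact format:

```python
# Sketch of zero-shot function calling with NexusRaven-V2 via transformers.
# Prompt template details are assumptions -- verify against the model card.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Nexusflow/NexusRaven-V2-13B",
    device_map="auto",
    torch_dtype="auto",
)

prompt = '''
Function:
def get_weather(city: str, unit: str = "celsius"):
    """
    Returns the current weather for a city.

    Args:
        city: name of the city.
        unit: "celsius" or "fahrenheit".
    """

User Query: What's the weather like in Prague, in fahrenheit?<human_end>
'''

# The model is trained to emit an executable call, e.g.:
#   Call: get_weather(city='Prague', unit='fahrenheit')
print(pipe(prompt, max_new_tokens=64, return_full_text=False)[0]["generated_text"])
```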
This one is simply a feast for the mind.
https://www.cerebras.net/blog/introducing-gigagpt-gpt-3-sized-models-in-565-lines-of-code/
GigaGPT is Cerebras’ implementation of Andrej Karpathy’s nanoGPT – the simplest and most compact code base to train and fine-tune GPT models. Whereas nanoGPT can train models in the 100M parameter range, gigaGPT trains models well over 100B parameters. We do this without introducing additional code or relying on third party frameworks – the entire repo is just 565 lines of code. Instead gigaGPT utilizes the large memory and compute capacity of Cerebras hardware to enable large scale training on vanilla torch.nn code. With no modifications, gigaGPT supports long context lengths and works with a variety of optimizers.
Though it seems to run only on Cerebras hardware. Still cool: the more hardware and cloud alternatives, the better!
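For a sense of what "vanilla torch.nn code" means here, a nanoGPT-style transformer block sketched in plain torch.nn (illustrative only, not gigaGPT's actual source):

```python
# Illustrative nanoGPT-style transformer block in plain torch.nn --
# the kind of code gigaGPT scales up; NOT the actual gigaGPT source.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 16, 64)    # (batch, seq, d_model)
print(Block(64, 8)(x).shape)  # torch.Size([2, 16, 64])
```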
And for anyone tired of LLMs, there's a fresh longread from our good old Stephen Wolfram.
https://writings.stephenwolfram.com/2023/12/observer-theory/
Continuing the small-models theme, Microsoft has announced phi-2.
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.
The open question is the license, though. The previous phi models were strictly non-commercial.
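A hedged sketch for poking at it locally (the `microsoft/phi-2` repo is on the Hub; at release it ships its own modeling code, hence `trust_remote_code=True` is assumed here):

```python
# Sketch: running phi-2 locally. The repo ships custom modeling code at
# release, so trust_remote_code=True is assumed; 2.7B fits on one consumer GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```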
And to round things off, in case anyone missed it: Zephyr 3B (not 7B!).
https://stability.ai/news/stablelm-zephyr-3b-stability-llm
Though it, too, is non-commercial :(
Gemini Pro is starting to become available (https://ai.google.dev/pricing).
For now it's mostly there to poke at; pay-as-you-go comes later. And pricing is finally per character rather than per token :)
Not the most interesting model: GPT-4 (the not-yet-Turbo one) is beaten only by the still-unavailable Ultra, but that was expected (https://news.1rj.ru/str/gonzo_ML/2118).
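A minimal sketch with Google's Python SDK of the moment (`google-generativeai`; the model id and setup follow the docs at ai.google.dev, treat the details as a sketch):

```python
# Sketch: calling Gemini Pro through the google-generativeai SDK.
# Assumes `pip install google-generativeai` and an API key from ai.google.dev.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("One-sentence summary of mixture-of-experts.")
print(response.text)
```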
An interesting post from Tomas Mikolov:
"Yesterday we received a Test of Time Award at NeurIPS for the word2vec paper from ten years ago. I'm really happy about it! I think it's the first "best paper" type of award I ever received. In fact, the original word2vec paper was rejected at the first ICLR conference in 2013 (despite the acceptance rate of around 70%), so it made me think how difficult it is for reviewers to predict future impact of research papers.
I've heard a lot of comments - both positive and negative - about word2vec during those years, and did not really comment online about it. Somehow I felt the research community is constantly flooded by propaganda-style PR from certain researchers who are hacking this way the citation counts and attention of others, and I did not want to be part of this. But after ten years, I think it could be entertaining to share some stories associated with this paper.
One frequent comment I've heard was that the code was difficult to understand to the point that some people thought I made it unreadable intentionally. But no, I'm not so evil :D The code ended up being over-optimized because I was waiting for many months for approval to publish it, and meanwhile I was trying to make it both faster and shorter. In fact, looking back, if there were not Greg and Jeff in the Brain team, I doubt I would ever get that approval - I think word2vec was likely the first widely known AI project that Google open-sourced.
There was also significant controversy around the GloVe project from Stanford NLP group that was published more than a year after word2vec. While it copied many tricks from our project, GloVe always felt like a step back to me: it was slower, required more memory, and the resulting vectors had lower quality than the original word2vec. However, it was published with word vectors pre-trained on much more data and thus gained a lot of popularity - although the comparison was really apples-to-oranges. We anyways did fix this later in the fastText project, where we did show that word2vec is much better than GloVe when trained on the same data.
I also received a lot of comments on the word analogies - from "I knew that too but forgot to publish it!" (Geoff Hinton, I believe you :) happens to everyone, and anyways I think everybody knows what the origin of Distributed Representations is) to "it's a total hack and I'm sure it doesn't work!" (random guys who didn't bother to read the papers and try it out themselves - including Ian Goodfellow raging about it on Twitter).
Despite word2vec being my most cited paper, I did never think of it as my most impactful project. In fact, word2vec code originally started as a subset of my previous project - RNNLM - which I think ended up forgotten too quickly. In my eyes, it was at least as revolutionary as AlexNet. Just to name ideas that were for the first time ever demonstrated within RNNLM already in 2010 (when it was still dark ages for deep learning): scalable training of recurrent neural networks (as I invented gradient clipping), first ever text generation from neural language model (I was showing examples of this since 2007), dynamic evaluation, character and sub-word level neural language modeling, neural language model adaptation (nowadays called fine-tuning), first publicly available LM benchmark (the modified Penn Treebank dataset - there really was nothing like this on the web when I started my PhD). I published the first ever study showing that neural nets beat n-gram language models increasingly more with more training data when everything is done correctly (today this sounds obvious, but back in the days this was widely considered impossible - even most Google guys did think that the more data you have, the more futile is to work on anything besides n-grams and smoothing techniques).
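As a footnote to the analogy debate above: the arithmetic is trivial to reproduce with gensim and any pre-trained word2vec-format vectors (the file path below is a placeholder for your local copy of, e.g., the classic GoogleNews vectors):

```python
# The classic word2vec analogy, king - man + woman ≈ queen, via gensim.
# The vectors file is a placeholder -- point it at any word2vec-format vectors.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True  # local path, adjust
)
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', ~0.71)] with the GoogleNews vectors
```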
"Yesterday we received a Test of Time Award at NeurIPS for the word2vec paper from ten years ago. I'm really happy about it! I think it's the first "best paper" type of award I ever received. In fact, the original word2vec paper was rejected at the first ICLR conference in 2013 (despite the acceptance rate of around 70%), so it made me think how difficult it is for reviewers to predict future impact of research papers.
I've heard a lot of comments - both positive and negative - about word2vec during those years, and did not really comment online about it. Somehow I felt the research community is constantly flooded by propaganda-style PR from certain researchers who are hacking this way the citation counts and attention of others, and I did not want to be part of this. But after ten years, I think it could be entertaining to share some stories associated with this paper.
One frequent comment I've heard was that the code was difficult to understand to the point that some people thought I made it unreadable intentionally. But no, I'm not so evil :D The code ended up being over-optimized because I was waiting for many months for approval to publish it, and meanwhile I was trying to make it both faster and shorter. In fact, looking back, if there were not Greg and Jeff in the Brain team, I doubt I would ever get that approval - I think word2vec was likely the first widely known AI project that Google open-sourced.
There was also significant controversy around the GloVe project from Stanford NLP group that was published more than a year after word2vec. While it copied many tricks from our project, GloVe always felt like a step back to me: it was slower, required more memory, and the resulting vectors had lower quality than the original word2vec. However, it was published with word vectors pre-trained on much more data and thus gained a lot of popularity - although the comparison was really apples-to-oranges. We anyways did fix this later in the fastText project, where we did show that word2vec is much better than GloVe when trained on the same data.
I also received a lot of comments on the word analogies - from "I knew that too but forgot to publish it!" (Geoff Hinton, I believe you :) happens to everyone, and anyways I think everybody knows what the origin of Distributed Representations is) to "it's a total hack and I'm sure it doesn't work!" (random guys who didn't bother to read the papers and try it out themselves - including Ian Goodfellow raging about it on Twitter).
Despite word2vec being my most cited paper, I did never think of it as my most impactful project. In fact, word2vec code originally started as a subset of my previous project - RNNLM - which I think ended up forgotten too quickly. In my eyes, it was at least as revolutionary as AlexNet. Just to name ideas that were for the first time ever demonstrated within RNNLM already in 2010 (when it was still dark ages for deep learning): scalable training of recurrent neural networks (as I invented gradient clipping), first ever text generation from neural language model (I was showing examples of this since 2007), dynamic evaluation, character and sub-word level neural language modeling, neural language model adaptation (nowadays called fine-tuning), first publicly available LM benchmark (the modified Penn Treebank dataset - there really was nothing like this on the web when I started my PhD). I published the first ever study showing that neural nets beat n-gram language models increasingly more with more training data when everything is done correctly (today this sounds obvious, but back in the days this was widely considered impossible - even most Google guys did think that the more data you have, the more futile is to work on anything besides n-grams and smoothing techniques).
❤39👍5