Mistral has released Mixtral 8x7B, a MoE (Mixture of Experts) model that supposedly beats GPT-3.5 out of the box. There is also an instruction-finetuned Mixtral 8x7B Instruct. Interesting stuff.
https://mistral.ai/news/mixtral-of-experts/
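The blog post says each layer has 8 expert feed-forward blocks with a router selecting 2 of them per token. Below is a minimal PyTorch sketch of that sparse top-2 routing idea; it is not Mistral's code, and the dimensions and the SiLU expert MLPs are purely illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTop2MoE(nn.Module):
    """Toy sparse MoE layer: a router picks 2 of n_experts FFNs per token."""
    def __init__(self, dim=512, hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # choose 2 experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):                      # only selected experts run
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

moe = SparseTop2MoE()
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```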
Also of note: AMD GPU support is growing and getting stronger in the latest Hugging Face transformers release.
AMD's ROCm GPU architecture is now supported across the board and fully tested in our CI with MI210/MI250 GPUs. We further enable specific hardware acceleration for ROCm in Transformers, such as Flash Attention 2, GPTQ quantization and DeepSpeed.
* Add RoCm scheduled CI & upgrade RoCm CI to PyTorch 2.1 by @fxmarty in #26940
* Flash Attention 2 support for RoCm by @fxmarty in #27611
* Reflect RoCm support in the documentation by @fxmarty in #27636
* restructure AMD scheduled CI by @ydshieh in #27743
https://github.com/huggingface/transformers/releases/tag/v4.36.0
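For reference, in v4.36 Flash Attention 2 is requested through the `attn_implementation` argument of `from_pretrained`; a minimal sketch (the model id is just an example, and per the notes above the same flag should work on ROCm builds):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example; any FA2-enabled model works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # FA2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",  # new in v4.36
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```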
And since today is full of LLM news, here is one more for anyone who missed it.
Nexusflow has released NexusRaven-V2 with 13B parameters. The model beats GPT-4 (though apparently not Turbo) at zero-shot function calling. Now you can build even more kinds of copilots :)
Blog: https://nexusflow.ai/blogs/ravenv2
HF: https://huggingface.co/Nexusflow/NexusRaven-V2-13B
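A rough sketch of poking at it through transformers. Caveat: NexusRaven expects a specific prompt template with Python-style function signatures, which I am only approximating here, so take the real format from the model card:

```python
from transformers import pipeline

# NexusRaven-V2 emits a function call for a user query, given function
# signatures in the prompt; the exact template lives on the model card.
generator = pipeline("text-generation", model="Nexusflow/NexusRaven-V2-13B",
                     device_map="auto", torch_dtype="auto")

prompt = '''Function:
def get_weather(city: str):
    """Return the current weather for a city."""

User Query: What's the weather like in Paris?'''  # placeholder prompt format

out = generator(prompt, max_new_tokens=100, do_sample=False)
print(out[0]["generated_text"])  # expect something like: get_weather(city='Paris')
```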
This is just a feast for the mind.
https://www.cerebras.net/blog/introducing-gigagpt-gpt-3-sized-models-in-565-lines-of-code/
GigaGPT is Cerebras’ implementation of Andrej Karpathy’s nanoGPT – the simplest and most compact code base to train and fine-tune GPT models. Whereas nanoGPT can train models in the 100M parameter range, gigaGPT trains models well over 100B parameters. We do this without introducing additional code or relying on third party frameworks – the entire repo is just 565 lines of code. Instead gigaGPT utilizes the large memory and compute capacity of Cerebras hardware to enable large scale training on vanilla torch.nn code. With no modifications, gigaGPT supports long context lengths and works with a variety of optimizers.
But apparently only on Cerebras hardware. Still cool though: more hardware and cloud alternatives!
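To give a feel for what "vanilla torch.nn code" means here, a tiny nanoGPT-style pre-norm transformer block; this is my own sketch, not anything from gigaGPT's 565 lines:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm GPT transformer block in plain torch.nn, nanoGPT-style."""
    def __init__(self, dim=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):  # x: (batch, seq, dim)
        seq = x.size(1)
        # boolean causal mask: True = not allowed to attend (future positions)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

print(Block()(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```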
And for anyone tired of LLMs, there is a fresh longread from our Stephen Wolfram.
https://writings.stephenwolfram.com/2023/12/observer-theory/
Continuing the small-models theme, Microsoft has announced phi-2.
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.
The license is a question, though. The previous phi models were strictly non-commercial.
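A quick sketch for trying it via transformers once the weights are on the Hub; the `microsoft/phi-2` id and the `trust_remote_code` flag are assumptions carried over from how the earlier phi releases worked:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed id, mirroring microsoft/phi-1_5

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # earlier phi models shipped custom modeling code
    device_map="auto",
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=60)[0]))
```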
And to round things out, in case anyone missed Zephyr 3B (not 7B!)
https://stability.ai/news/stablelm-zephyr-3b-stability-llm
It's also non-commercial, though :(
Gemini Pro is starting to become available (https://ai.google.dev/pricing).
For now it's more of a "try it out" thing; pay-as-you-go comes later. Pricing is finally per character rather than per token :)
Not the most interesting model. Only the still-unavailable Ultra beats GPT-4 (the pre-Turbo one), but that was expected (https://news.1rj.ru/str/gonzo_ML/2118).
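For anyone who wants to poke at it from Python, a minimal sketch with the `google-generativeai` SDK as documented at launch (model name `gemini-pro`; double-check against the current docs):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # free-tier key from ai.google.dev

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Explain mixture-of-experts in one paragraph.")
print(response.text)
```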
An interesting post from Tomáš Mikolov:
"Yesterday we received a Test of Time Award at NeurIPS for the word2vec paper from ten years ago. I'm really happy about it! I think it's the first "best paper" type of award I ever received. In fact, the original word2vec paper was rejected at the first ICLR conference in 2013 (despite the acceptance rate of around 70%), so it made me think how difficult it is for reviewers to predict future impact of research papers.
I've heard a lot of comments - both positive and negative - about word2vec during those years, and did not really comment online about it. Somehow I felt the research community is constantly flooded by propaganda-style PR from certain researchers who are hacking this way the citation counts and attention of others, and I did not want to be part of this. But after ten years, I think it could be entertaining to share some stories associated with this paper.
One frequent comment I've heard was that the code was difficult to understand to the point that some people thought I made it unreadable intentionally. But no, I'm not so evil :D The code ended up being over-optimized because I was waiting for many months for approval to publish it, and meanwhile I was trying to make it both faster and shorter. In fact, looking back, if there were not Greg and Jeff in the Brain team, I doubt I would ever get that approval - I think word2vec was likely the first widely known AI project that Google open-sourced.
There was also significant controversy around the GloVe project from Stanford NLP group that was published more than a year after word2vec. While it copied many tricks from our project, GloVe always felt like a step back to me: it was slower, required more memory, and the resulting vectors had lower quality than the original word2vec. However, it was published with word vectors pre-trained on much more data and thus gained a lot of popularity - although the comparison was really apples-to-oranges. We anyways did fix this later in the fastText project, where we did show that word2vec is much better than GloVe when trained on the same data.
I also received a lot of comments on the word analogies - from "I knew that too but forgot to publish it!" (Geoff Hinton, I believe you :) happens to everyone, and anyways I think everybody knows what the origin of Distributed Representations is) to "it's a total hack and I'm sure it doesn't work!" (random guys who didn't bother to read the papers and try it out themselves - including Ian Goodfellow raging about it on Twitter).
Despite word2vec being my most cited paper, I did never think of it as my most impactful project. In fact, word2vec code originally started as a subset of my previous project - RNNLM - which I think ended up forgotten too quickly. In my eyes, it was at least as revolutionary as AlexNet. Just to name ideas that were for the first time ever demonstrated within RNNLM already in 2010 (when it was still dark ages for deep learning): scalable training of recurrent neural networks (as I invented gradient clipping), first ever text generation from neural language model (I was showing examples of this since 2007), dynamic evaluation, character and sub-word level neural language modeling, neural language model adaptation (nowadays called fine-tuning), first publicly available LM benchmark (the modified Penn Treebank dataset - there really was nothing like this on the web when I started my PhD). I published the first ever study showing that neural nets beat n-gram language models increasingly more with more training data when everything is done correctly (today this sounds obvious, but back in the days this was widely considered impossible - even most Google guys did think that the more data you have, the more futile is to work on anything besides n-grams and smoothing techniques).
"Yesterday we received a Test of Time Award at NeurIPS for the word2vec paper from ten years ago. I'm really happy about it! I think it's the first "best paper" type of award I ever received. In fact, the original word2vec paper was rejected at the first ICLR conference in 2013 (despite the acceptance rate of around 70%), so it made me think how difficult it is for reviewers to predict future impact of research papers.
I've heard a lot of comments - both positive and negative - about word2vec during those years, and did not really comment online about it. Somehow I felt the research community is constantly flooded by propaganda-style PR from certain researchers who are hacking this way the citation counts and attention of others, and I did not want to be part of this. But after ten years, I think it could be entertaining to share some stories associated with this paper.
One frequent comment I've heard was that the code was difficult to understand to the point that some people thought I made it unreadable intentionally. But no, I'm not so evil :D The code ended up being over-optimized because I was waiting for many months for approval to publish it, and meanwhile I was trying to make it both faster and shorter. In fact, looking back, if there were not Greg and Jeff in the Brain team, I doubt I would ever get that approval - I think word2vec was likely the first widely known AI project that Google open-sourced.
There was also significant controversy around the GloVe project from Stanford NLP group that was published more than a year after word2vec. While it copied many tricks from our project, GloVe always felt like a step back to me: it was slower, required more memory, and the resulting vectors had lower quality than the original word2vec. However, it was published with word vectors pre-trained on much more data and thus gained a lot of popularity - although the comparison was really apples-to-oranges. We anyways did fix this later in the fastText project, where we did show that word2vec is much better than GloVe when trained on the same data.
I also received a lot of comments on the word analogies - from "I knew that too but forgot to publish it!" (Geoff Hinton, I believe you :) happens to everyone, and anyways I think everybody knows what the origin of Distributed Representations is) to "it's a total hack and I'm sure it doesn't work!" (random guys who didn't bother to read the papers and try it out themselves - including Ian Goodfellow raging about it on Twitter).
Despite word2vec being my most cited paper, I did never think of it as my most impactful project. In fact, word2vec code originally started as a subset of my previous project - RNNLM - which I think ended up forgotten too quickly. In my eyes, it was at least as revolutionary as AlexNet. Just to name ideas that were for the first time ever demonstrated within RNNLM already in 2010 (when it was still dark ages for deep learning): scalable training of recurrent neural networks (as I invented gradient clipping), first ever text generation from neural language model (I was showing examples of this since 2007), dynamic evaluation, character and sub-word level neural language modeling, neural language model adaptation (nowadays called fine-tuning), first publicly available LM benchmark (the modified Penn Treebank dataset - there really was nothing like this on the web when I started my PhD). I published the first ever study showing that neural nets beat n-gram language models increasingly more with more training data when everything is done correctly (today this sounds obvious, but back in the days this was widely considered impossible - even most Google guys did think that the more data you have, the more futile is to work on anything besides n-grams and smoothing techniques).
❤39👍5
It was really lucky for me to join Google Brain in 2012 where there were believers in large scale neural networks who allowed me to work on word2vec to demonstrate the potential. But I don't want to give the impression everything was always perfect - as a follow up project after word2vec, I wanted to popularize neural language models by improving Google Translate. I did start collaboration with Franz Och and his team, during which time I proposed a couple of models that could either complement the phrase-based machine translation, or even replace it. I came up (actually even before joining Google) with a really simple idea to do end-to-end translation by training a neural language model on pairs of sentences (say French - English), and then use the generation mode to produce translation after seeing the first sentence. It worked great on short sentences, but not so much on the longer ones. I discussed this project many times with others in Google Brain - mainly Quoc and Ilya - who took over this project after I moved to Facebook AI. I was quite negatively surprised when they ended up publishing my idea under now famous name "sequence to sequence" where not only I was not mentioned as a co-author, but in fact my former friends forgot to mention me also in the long Acknowledgement section, where they thanked personally pretty much every single person in Google Brain except me. This was the time when money started flowing massively into AI and every idea was worth gold. It was sad to see the deep learning community quickly turn into some sort of Game of Thrones. Money and power certainly corrupts people...
Anyhow, the interest in language models was growing maybe slowly over the years, but with the explosion of interest since ChatGPT was released it is really cool to see so many people finally making connection between AI and language. We're not there yet, and I personally believe we need to make new discoveries to push through generalization limits of neural models. We're certainly living in exciting times. But let's not put too much faith into individuals who want to monopolize technology that is based on the hard work of dozens, or even hundreds of scientists while making claims it's all for the good of humanity."
https://www.facebook.com/1533402400/posts/pfbid0ao3fqoznHoprc8FawH6p84bctobvpTPrrbwxtGUXmBz92CzWoG63U6VSjcWJCJJTl/
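Since the post brings up the word analogies, here is the classic word2vec party trick with gensim and the original GoogleNews vectors:

```python
import gensim.downloader as api

# ~1.6 GB download: the original vectors trained on Google News
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen, the analogy from the word2vec papers
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', 0.71...)]
```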