gonzo-обзоры ML статей
Authors:
Grisha Sapunov, formerly head of development at Yandex News, now CTO of Intento. Areas of interest: AI/ML/DL, bioinformatics.
Lyosha Tikhonov, formerly an analyst at Yandex, author of Autopoet, Neural Defense... Areas of interest: discrete domains, NLP, RL.
The leak of the day.

https://threadreaderapp.com/thread/1678545170508267522.html

GPT-4's details are leaked.

It is over.

Parameter count:

GPT-4 is more than 10x the size of GPT-3. We believe it has a total of ~1.8 trillion parameters across 120 layers.
Mixture Of Experts - Confirmed.

OpenAI was able to keep costs reasonable by utilizing a mixture of experts (MoE) model.
They utilize 16 experts within their model, each with about ~111B parameters for the MLP. Two of these experts are routed to per forward pass.

MoE Routing:

While the literature talks a lot about advanced routing algorithms for choosing which experts to route each token to, OpenAI’s is allegedly quite simple, for the current GPT-4 model.

There are roughly ~55B shared parameters for attention.

Inference:

Each forward pass inference (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOPs that would be required per forward pass of a purely dense model.
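A quick sanity check of those parameter counts (just arithmetic over the figures quoted above, nothing else assumed):

```python
# Rough sanity check of the leaked MoE parameter counts; all inputs are from the thread above.
EXPERTS_TOTAL = 16           # experts in the model
EXPERTS_PER_TOKEN = 2        # experts routed to per forward pass
PARAMS_PER_EXPERT = 111e9    # ~111B MLP parameters per expert
SHARED_ATTENTION = 55e9      # ~55B shared attention parameters

total_params = EXPERTS_TOTAL * PARAMS_PER_EXPERT + SHARED_ATTENTION
active_params = EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT + SHARED_ATTENTION

print(f"total  ≈ {total_params / 1e12:.2f}T params")            # ≈ 1.83T (quoted: ~1.8T)
print(f"active ≈ {active_params / 1e9:.0f}B params per token")  # ≈ 277B  (quoted: ~280B)
```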

Dataset:

GPT-4 is trained on ~13T tokens.

These are not unique tokens; repeated epochs are counted as additional tokens as well.

Epoch number: 2 epochs for text-based data and 4 for code-based data.

There are millions of rows of instruction fine-tuning data from ScaleAI and internal sources.

GPT-4 32K

There was an 8k context length (seqlen) for the pre-training phase. The 32k-seqlen version of GPT-4 is based on fine-tuning the 8k model after pre-training.

Batch Size:

The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million! This, of course, is "only" a batch size of 7.5 million tokens per expert, since not every expert sees every token.
For the real batch size:
Divide this number by the seqlen to get the real batch size. Just stop with these misleading numbers already.
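For concreteness, here is that arithmetic spelled out (assuming the 8k pre-training seqlen mentioned above):

```python
# Convert the quoted "batch size of 60 million" (tokens) into actual sequences.
BATCH_TOKENS = 60_000_000
SEQLEN = 8192                                  # the 8k pre-training seqlen quoted above
EXPERTS_PER_TOKEN, EXPERTS_TOTAL = 2, 16

tokens_per_expert = BATCH_TOKENS * EXPERTS_PER_TOKEN / EXPERTS_TOTAL
print(f"{tokens_per_expert / 1e6:.1f}M tokens per expert")          # 7.5M, as quoted
print(f"real batch size ≈ {BATCH_TOKENS / SEQLEN:,.0f} sequences")  # ≈ 7,324 sequences
```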

Parallelism Strategies

To parallelize across all their A100 GPUs, they utilized 8-way tensor parallelism, as that is the limit for NVLink.

Beyond that, they are using 15-way pipeline parallelism.

(They likely used ZeRO Stage 1. It is possible they used block-level FSDP.)

Training Cost

OpenAI's training compute for GPT-4 was ~2.15e25 FLOPs, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU.

Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints.

If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.

(Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour.)
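Those figures roughly hang together; a hedged back-of-the-envelope check (the ~312 TFLOPS BF16 peak for an A100 is my assumption, the rest comes from the numbers above):

```python
# Sanity check of the leaked training figures.
# Assumption: A100 BF16 dense peak of ~312 TFLOPS; everything else is from the thread.
TOTAL_FLOPS = 2.15e25
GPUS = 25_000
A100_PEAK = 312e12           # FLOPS
DAYS = 95                    # midpoint of the quoted 90-100 days
PRICE_PER_GPU_HOUR = 1.0     # the ~$1 per A100-hour assumption from the thread

seconds = DAYS * 24 * 3600
mfu = TOTAL_FLOPS / (GPUS * A100_PEAK * seconds)
cost = GPUS * DAYS * 24 * PRICE_PER_GPU_HOUR

print(f"MFU ≈ {mfu:.0%}")            # ≈ 34%, inside the quoted 32-36% band
print(f"cost ≈ ${cost / 1e6:.0f}M")  # ≈ $57M, same ballpark as the quoted ~$63M
```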

Mixture of Expert Tradeoffs

There are multiple MoE tradeoffs involved: for example, MoE is incredibly difficult to deal with at inference, because not every part of the model is utilized on every token generation.
This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.

Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that’s purely research.
There are multiple reasons to go with fewer experts. One reason OpenAI chose 16 experts is that it is harder to get more experts to generalize across many tasks. More experts can also make convergence harder to achieve.
With such a large training run, OpenAI instead chose to be more conservative on the number of experts.

GPT-4 Inference Cost

GPT-4 costs 3x as much as the 175B-parameter Davinci.
This is largely due to the larger clusters required for GPT-4 and the much lower utilization achieved.
An estimate of its cost is $0.0049 per 1k tokens for 128 A100s to serve GPT-4 at 8k seqlen, and $0.0021 per 1k tokens for 128 H100s to serve GPT-4 at 8k seqlen. It should be noted that this assumes decently high utilization and keeping batch sizes high.
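As a rough feel for what that price implies (a sketch; the ~$1/A100-hour rate is the same assumption as in the training-cost estimate, and the resulting throughput is derived, not leaked):

```python
# What throughput would $0.0049 per 1k tokens on 128 A100s imply?
GPUS = 128
PRICE_PER_GPU_HOUR = 1.0          # same ~$1/A100-hour assumption as above
COST_PER_1K_TOKENS = 0.0049

cluster_cost_per_hour = GPUS * PRICE_PER_GPU_HOUR                     # $128/hour
tokens_per_hour = cluster_cost_per_hour / COST_PER_1K_TOKENS * 1000
print(f"needed throughput ≈ {tokens_per_hour / 3600:,.0f} tokens/s")  # ≈ 7,256 tokens/s for the whole cluster
```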
Multi-Query Attention

OpenAI is using MQA just like everybody else.
Because of that, only 1 KV head is needed and memory capacity for the KV cache can be significantly reduced. Even then, the 32k-seqlen GPT-4 definitely cannot run on 40GB A100s, and the 8k version is capped on max batch size (bsz).
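To see why MQA matters here, a minimal KV-cache estimate (the head count and head dimension below are hypothetical, chosen only to show the scaling; only the 120 layers come from the leak):

```python
# KV-cache size per sequence: 2 (K and V) * layers * kv_heads * head_dim * seqlen * bytes.
# Head count and head_dim are hypothetical, just to show the scaling; 120 layers is from the leak.
def kv_cache_gb(layers, kv_heads, head_dim, seqlen, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seqlen * dtype_bytes / 2**30

LAYERS, HEAD_DIM, SEQLEN = 120, 128, 32_768

print(f"MHA, 96 KV heads: {kv_cache_gb(LAYERS, 96, HEAD_DIM, SEQLEN):.0f} GB per sequence")  # ~180 GB
print(f"MQA,  1 KV head : {kv_cache_gb(LAYERS, 1, HEAD_DIM, SEQLEN):.1f} GB per sequence")   # ~1.9 GB
# With full multi-head KV the cache alone dwarfs a 40GB A100; MQA cuts it ~96x.
```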

Continuous batching

OpenAI implements both variable batch sizes and continuous batching. This allows bounding the maximum latency while also optimizing inference costs.

Vision Multi-Modal

There is a separate vision encoder alongside the text encoder, with cross-attention. The architecture is similar to Flamingo. This adds more parameters on top of GPT-4's 1.8T. It is fine-tuned with another ~2 trillion tokens after the text-only pre-training.
As for the vision model, OpenAI wanted to train it from scratch, but it wasn't mature enough, so they chose to de-risk it by starting with text.
One of the primary purposes of this vision capability is autonomous agents that can read web pages and transcribe what's in images and video.
Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: sampling frames and running Whisper on the audio to get transcripts.
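For intuition, a toy sketch of Flamingo-style gated cross-attention (the dimensions and module names are mine; the leak only says "cross-attention, similar to Flamingo"):

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Toy Flamingo-style block: text hidden states attend to vision tokens."""

    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate, starts closed as in Flamingo

    def forward(self, text_tokens, vision_tokens):
        # text_tokens: (batch, text_len, d_model); vision_tokens: (batch, vis_len, d_model)
        attended, _ = self.attn(query=text_tokens, key=vision_tokens, value=vision_tokens)
        return text_tokens + torch.tanh(self.gate) * attended  # gated residual connection

# Usage: interleave such blocks between the transformer layers of the language model.
block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 16, 1024), torch.randn(2, 64, 1024))
```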

[Don't want to say "I told you so" but..]

Speculative Decoding

OpenAI might be using speculative decoding for GPT-4 inference (not 100% sure).

The idea is to use a smaller, faster model to decode several tokens in advance, and then feed them into the large oracle model as a single batch.
If the small model was right about its predictions, the larger model agrees and we can decode several tokens in a single batch.
But if the larger model rejects the tokens predicted by the draft model, then the rest of the batch is discarded and we continue with the larger model.
The conspiracy theory that the new GPT-4's quality has deteriorated might simply be because they are letting the oracle model accept lower-probability sequences from the speculative decoding model.
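A minimal greedy sketch of the idea (the draft_model / oracle_model callables are hypothetical; real implementations accept or reject draft tokens probabilistically rather than by exact match):

```python
# Toy greedy speculative decoding. `draft_model` and `oracle_model` are hypothetical
# callables mapping a token sequence to the next token (greedy argmax here).

def speculative_step(prompt, draft_model, oracle_model, k=4):
    # 1. Draft k tokens ahead with the small, fast model.
    draft = list(prompt)
    for _ in range(k):
        draft.append(draft_model(draft))

    # 2. Verify the drafted positions with the oracle in one batched forward pass.
    #    (Conceptually one call; shown position-by-position for clarity.)
    accepted = list(prompt)
    for i in range(len(prompt), len(draft)):
        oracle_token = oracle_model(accepted)
        if oracle_token == draft[i]:
            accepted.append(oracle_token)   # agreement: keep the cheap draft token
        else:
            accepted.append(oracle_token)   # disagreement: take the oracle token...
            break                           # ...and discard the rest of the draft
    return accepted
```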

Inference Architecture

The inference runs on a cluster of 128 GPUs.

There are multiple of these clusters in multiple datacenters in different locations.

It is done in 8-way tensor parallelism and 16-way pipeline parallelism.

Each node of 8 GPUs has only ~130B parameters, or…
The model has 120 layers, so it fits in 15 different nodes.
[Possibly there are fewer layers on the first node, since it also needs to compute the embeddings]
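Just to check that the layout adds up (figures from the thread only):

```python
# Sanity check of the quoted inference layout; all figures are from the thread.
GPUS_PER_CLUSTER = 128
TP, PP = 8, 16                        # 8-way tensor x 16-way pipeline parallelism
LAYERS, LAYER_NODES = 120, 15         # 120 layers spread over 15 nodes of 8 GPUs each

assert TP * PP == GPUS_PER_CLUSTER    # 8 * 16 = 128 GPUs per inference cluster
print(LAYERS // LAYER_NODES, "layers per node")   # 8 layers on each of the 15 nodes
```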
According to these numbers, OpenAI should have trained on 2x the tokens if they were trying to hit the Chinchilla optimum.

[let alone surpass it like we do]

This goes to show that they are struggling to get high-quality data.

Why no FSDP?

A possible reason for this could be that some of the hardware infra they secured is of an older generation.

This is pretty common at local compute clusters, as organisations usually upgrade their infra in several "waves" to avoid a complete pause of operations.
Dataset Mixture

They trained on 13T tokens.
CommonCrawl & RefinedWeb are both 5T.

Remove the duplication of tokens from multiple epochs and we get to a much more reasonable number of "unaccounted-for" tokens: the "secret" data.
By this point there are already rumors that parts of it came from Twitter, Reddit & YouTube.

[Rumors that start to become lawsuits]

Some speculations are:
- LibGen (4M+ books)
- Sci-Hub (80M+ papers)
- All of GitHub
My own opinion:

The missing dataset is a custom dataset of college textbooks, collected by hand for as many courses as possible.

This is very easy to convert to a txt file and then, with self-instruct, into instruction form.
This creates the "illusion" that GPT-4 "is smart" no matter who uses it.

Computer scientist? Sure! It can help you with your questions about P!=NP.
Philosophy major? It can totally talk to you about epistemology.

Don't you see?
It was trained on the textbooks. It is so obvious.
There are also papers that try to forcibly extract memorized parts of books from GPT-4 to figure out what it was trained on.

There are some books it knows so well that it has definitely seen them.

Moreover, if I remember correctly, it even knows the unique IDs of Project Euler exercises.
Meanwhile, Anthropic has announced the availability of Claude 2 via API, with a 100k-token context window

https://www.anthropic.com/index/claude-2
And also meanwhile, Google Bard has been rolled out to 40 languages and got support for images in prompts. Something we still can't get from GPT-4...

https://blog.google/products/bard/google-bard-new-features-update-july-2023/
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, Kfir Aberman
Paper: https://arxiv.org/abs/2307.06949
Site: https://hyperdreambooth.github.io/

A fun piece of work with hypernetworks, which I dearly love.

If anyone isn't familiar with the concept of Hypernetworks, in a nutshell it's when one neural network (the hypernetwork) generates the weights for another one. The direction was kicked off by the work of the same name by David Ha and co. (https://arxiv.org/abs/1609.09106). Hypernetwork papers keep coming in a fairly steady stream, but in my view the topic is still little known and feels under-explored, in the sense that, I'm convinced, there is a lot more interesting stuff hiding in there.

This time hypernetworks are used for personalizing text-to-image models, or more precisely for speeding up DreamBooth (https://arxiv.org/abs/2208.12242). DreamBooth could fine-tune pretrained models (Imagen in that work) on a small number of images (3-5) of a specific subject, so that the model learned a unique identifier for the subject (e.g., "[V]"), which could then be used to synthesize images of that subject in various new contexts ("A [V] dog in the beach"). Examples of DreamBooth in action are in the paper or on the companion site (https://dreambooth.github.io/).

The DreamBooth fine-tuning process of 1000 iterations took about 5 minutes for Imagen on a TPUv4 or Stable Diffusion on an A100. The process touched all the weights of the UNet and the text encoder, which in the case of Stable Diffusion required about 1GB per personalization.

HyperDreamBooth shrinks the customized model (Stable Diffusion is used in this work) and makes the process faster. It's not the only way to speed up personalization, there are other approaches too, but we won't cover them here.

It also works from a single photo.

The HyperDreamBooth recipe consists of three ingredients.

1) Lightweight DreamBooth (LiDB), which is essentially LoRA++.

First, the already well-known LoRA (Low-Rank Adaptation, https://arxiv.org/abs/2106.09685) is applied: the pretrained weights W_0 (of size n×m) are replaced by the sum of the frozen W_0 and a trainable ∆W, and then ∆W gets a low-rank approximation ∆W = AB (A is n×r, B is r×m, with r << min(n, m); it works even for r = 1). This factorized add-on is what is usually distributed as a "LoRA model"; it is not a full model by itself and needs the original pretrained weights to compute W_0 + AB.

This already shrinks the fine-tuned model by three orders of magnitude, leaving 386K parameters that take 1.6MB of space.

Then they go further and decompose the matrices A and B once more, into A = A_aux·A_train (A_aux is n×a, A_train is a×r) and B = B_train·B_aux (similarly r×b and b×m), where the aux matrices are initialized with random orthogonal vectors and frozen, and only the train matrices are learned. The hyperparameters a and b are chosen experimentally; the paper settles on a = 100, b = 50. This shrinks the model another 10x relative to LoRA. The final model (or rather the add-on relative to the base model) contains 28K weights and takes 120KB of space. Which is of course a big deal if you need to host lots of fine-tuned models.

These A_train, B_train matrices are produced for every attention layer (cross- or self-).
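A minimal sketch of how the LiDB parametrization of a single linear layer could look (PyTorch; r = 1, a = 100, b = 50 as in the paper, but the class and naming are mine, not the authors' code):

```python
import torch
import torch.nn as nn

class LiDBLinear(nn.Module):
    """W_0 + A_aux @ A_train @ B_train @ B_aux, with only A_train / B_train learnable."""

    def __init__(self, base_linear: nn.Linear, r=1, a=100, b=50):
        super().__init__()
        n, m = base_linear.out_features, base_linear.in_features
        self.base = base_linear                       # frozen W_0
        for p in self.base.parameters():
            p.requires_grad = False

        # Frozen, randomly-orthogonal auxiliary matrices.
        self.A_aux = nn.Parameter(torch.empty(n, a), requires_grad=False)
        self.B_aux = nn.Parameter(torch.empty(b, m), requires_grad=False)
        nn.init.orthogonal_(self.A_aux)
        nn.init.orthogonal_(self.B_aux)

        # The only trainable parameters (what the hypernetwork later predicts).
        self.A_train = nn.Parameter(torch.zeros(a, r))
        self.B_train = nn.Parameter(torch.zeros(r, b))

    def forward(self, x):
        delta_w = self.A_aux @ self.A_train @ self.B_train @ self.B_aux  # n x m
        return self.base(x) + x @ delta_w.T
```

Only A_train and B_train are ever trained (or, in step 2 below, predicted by the hypernetwork); the aux matrices and W_0 stay frozen.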

2) HyperNetwork for Fast Personalization of Text-to-Image Models -- the hypernetwork itself, which predicts the LiDB matrices A_train and B_train from an input image.

The hypernetwork is trained on a set of domain-specific images (they took CelebAHQ) with a mix of the usual diffusion denoising loss and a weight-space loss. The same prompt, "a [V] face", is used for all samples.

The hypernetwork architecture is a ViT-based image encoder (ViT-H) plus a transformer decoder (2 layers). The decoder iteratively refines the weight predictions, starting from zero values. The number of iterations s is a hyperparameter (and I couldn't figure out what it is set to). There are also trainable linear layers at the output.
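In pseudo-PyTorch, my reading of that iterative refinement loop looks roughly like this (a sketch; the encoder/decoder internals, dimensions and the additive update are assumptions, only "ViT-H encoder + 2-layer decoder + start from zero weights" comes from the paper):

```python
import torch
import torch.nn as nn

class WeightHyperNetwork(nn.Module):
    """Sketch: image -> flattened LiDB weights, refined iteratively from zero."""

    def __init__(self, img_encoder: nn.Module, n_weights: int, d_model=1024, s=4):
        super().__init__()
        self.img_encoder = img_encoder                     # e.g. a ViT-H backbone
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.to_tokens = nn.Linear(n_weights, d_model)     # embed the current weight estimate
        self.to_weights = nn.Linear(d_model, n_weights)    # trainable linear head at the output
        self.s = s                                         # number of refinement iterations

    def forward(self, image):
        img_tokens = self.img_encoder(image)               # (batch, n_img_tokens, d_model)
        weights = torch.zeros(image.shape[0], 1, self.to_tokens.in_features,
                              device=image.device)         # start from zero weights
        for _ in range(self.s):                            # iterative refinement
            query = self.to_tokens(weights)
            decoded = self.decoder(tgt=query, memory=img_tokens)
            weights = weights + self.to_weights(decoded)   # predict a correction
        return weights.squeeze(1)                          # flattened A_train / B_train values
```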
3) Rank-Relaxed Fast Finetuning, to help the model capture fine details. The weights predicted by the hypernetwork are fine-tuned with the diffusion denoising loss. The key part is rank-relaxed finetuning, where the rank of the LoRA model is raised from r = 1 to r > 1 before finetuning. The weights predicted by the hypernetwork are added to the main model weights, and LoRA finetuning is then run with the new, higher rank. The same prompt, "a [V] face", is used.

So the hypernetwork essentially provides a weight initialization from which finetuning converges quickly, in 40 iterations, which is 25x faster than DreamBooth and LoRA DreamBooth with their 1000 iterations.

The results beat Textual Inversion and the original DreamBooth (which, it seems, even had its iteration count bumped up a bit, to 1200). The pictures are fun.

So it goes. Startups, go for it!

And what's your favourite hypernetworks use case?
Classic DreamBooth
Meta just announced Llama 2

https://ai.meta.com/llama/

The good news: Llama 2 is available for free for research and commercial use (if you're not Twitter 😁) under their own licence, Llama 2 Community License Agreement.

some quotes from the licence:

a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.

2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

:)

Llama 2 is pretrained using publicly available online data. Llama 2 models are trained on 2 trillion tokens and have double the context length of Llama 1. Llama-2-chat models have additionally been trained on over 1 million new human annotations.

"We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing (due to a lack of time to sufficiently red team)"

An initial version of Llama-2-chat is then created through the use of supervised fine-tuning. Next, Llama-2-chat is iteratively refined using Reinforcement Learning from Human Feedback (RLHF), which includes rejection sampling and proximal policy optimization (PPO).

You can download the paper here: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Interestingly, Llama 2 is also available on Azure and in Windows.

Now Azure customers can fine-tune and deploy the 7B, 13B, and 70B-parameter Llama 2 models easily and more safely on Azure, the platform for the most widely adopted frontier and open models. In addition, Llama will be optimized to run locally on Windows. Windows developers will be able to use Llama by targeting the DirectML execution provider through the ONNX Runtime, allowing a seamless workflow as they bring generative AI experiences to their applications.

https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/