ml4se – Telegram
Machine Learning for Software Engineering
HumanEval_ru Dataset

This is the Russian version of the HumanEval code generation dataset.

Load dataset:

from datasets import load_dataset
load_dataset('NLPCoreTeam/humaneval_ru')

DatasetDict({
    train: Dataset({
        features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'signature', 'docstring', 'context', 'instruction', 'instruction_noexamples'],
        num_rows: 164
    })
})
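
A quick way to inspect one task; the field semantics are assumed here to mirror the original HumanEval layout, with Russian-language docstrings:

from datasets import load_dataset

ds = load_dataset('NLPCoreTeam/humaneval_ru')
sample = ds['train'][0]
print(sample['task_id'])
print(sample['prompt'])   # function signature plus Russian-language docstring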
BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge

BioCoder is a benchmark for code generation comprising 2269 bioinformatics-specific coding problems. It includes a fuzz-testing framework for evaluation. The authors applied BioCoder to evaluate many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT.

Dataset, benchmark, Docker images, and scripts: https://github.com/gersteinlab/biocoder
Releasing Persimmon-8B

Persimmon-8B is an open-source model with a fully permissive license. It is trained from scratch using a context size of 16K. The model has 70k unused embeddings for multimodal extensions and uses sparse activations. The inference code combines the speed of C++ implementations (e.g. FasterTransformer) with the flexibility of naive Python inference.

Hidden size: 4096
Heads: 64
Layers: 36
Batch size: 120
Sequence length: 16384
Training iterations: 375K
Tokens seen: 737B
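
For reference, the published hyperparameters as a small config object; a minimal sketch, with field names of my own choosing and values taken from the table above:

from dataclasses import dataclass

@dataclass(frozen=True)
class Persimmon8BConfig:
    # Values from the table above; field names are illustrative, not from the repo.
    hidden_size: int = 4096
    num_heads: int = 64
    num_layers: int = 36
    batch_size: int = 120
    seq_length: int = 16_384
    train_iters: int = 375_000
    tokens_seen: int = 737_000_000_000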

Code and weights: https://github.com/persimmon-ai-labs/adept-inference
DevGPT: Studying Developer-ChatGPT Conversations

Yet we know very little about how ChatGPT is actually used by software developers. What questions do developers present to ChatGPT? What are the dynamics of these interactions? What is the backdrop against which these conversations are held, and how do the conversations feed back into the artifacts of their work? To close this gap, the authors introduce DevGPT, a curated dataset of 17,913 prompts and ChatGPT's responses, including 11,751 code snippets, coupled with the corresponding software development artifacts (source code, commits, issues, pull requests, discussions, and Hacker News threads) to enable analysis of the context and implications of these developer interactions with ChatGPT.

Dataset
🔥3
Communicative Agents for Software Development

ChatDev is a chat-based end-to-end software development framework that leverages LLMs to facilitate effective communication and collaboration among multiple roles involved in the software development process. By decomposing the development process into sequential atomic subtasks through the use of the chat chain, ChatDev enables granular focus and promotes desired outputs for each subtask. The experimental results demonstrate the efficiency and cost-effectiveness of the automated software development process driven by ChatDev.

The Human-Agent-Interaction mode is now available. You can get involved with the ChatDev team by playing the role of reviewer and making suggestions to the programmer.

GitHub
🔥2
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained LLMs at limited computation cost. It extends LLaMA2 7B from a 4k context to 100k, or LLaMA2 70B to 32k, on a single 8xA100 machine.

Results:
1. The proposed shifted short attention is easy to implement, compatible with Flash-Attention, and not required during inference (a rough sketch of the shift idea follows the repository link below).
2. Released models: from 7B to 70B, context length from 8k to 100k, including LLaMA2-LongLoRA-7B-100k, LLaMA2-LongLoRA-13B-64k, and LLaMA2-LongLoRA-70B-32k.
3. A long-context QA dataset, LongQA, for supervised fine-tuning (SFT).

Repository: https://github.com/dvlab-research/LongLoRA
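
The shift idea itself is simple. Below is a rough sketch of group-wise attention where half the heads see groups shifted by half a group, assuming PyTorch tensors of shape (batch, heads, seq_len, head_dim). It ignores causal masking and the Flash-Attention path, and it is my own illustration, not the authors' implementation:

import torch

def shifted_group_attention(q, k, v, group_size):
    """Group-wise attention; half the heads attend within groups shifted by half
    a group, so information can flow between neighbouring groups.
    q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by group_size."""
    B, H, L, D = q.shape
    half, shift, n = H // 2, group_size // 2, L // group_size

    def group_attn(q, k, v):
        # Split the sequence into n groups and attend only within each group.
        q, k, v = (t.reshape(B, -1, n, group_size, D) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-1, -2) / D**0.5, dim=-1)
        return (attn @ v).reshape(B, -1, L, D)

    # First half of the heads: plain group-wise attention.
    out_a = group_attn(q[:, :half], k[:, :half], v[:, :half])
    # Second half: roll tokens by half a group, attend within groups, roll back.
    q_s, k_s, v_s = (t[:, half:].roll(-shift, dims=2) for t in (q, k, v))
    out_b = group_attn(q_s, k_s, v_s).roll(shift, dims=2)
    return torch.cat([out_a, out_b], dim=1)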
👍1
78% MNIST accuracy using GZIP in under 10 lines of code

import gzip
import numpy as np
from collections import Counter

# Compressed length of an array's raw bytes.
c = lambda z: len(gzip.compress(z.tobytes()))

def ncd(x, y):
    # Note: for NumPy arrays x + y is element-wise addition (as in the original snippet).
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

# training_set / test_set: sequences of (uint8 image array, hashable label) pairs.
cls = [(x, c(x), l) for x, l in training_set]

# Majority vote over the 5 nearest neighbours under NCD.
correct_predictions = sum([
    np.array_equal(
        Counter([l for _, _, l in sorted([(ncd(x1, x), x, l) for x, _, l in cls],
                                         key=lambda t: t[0])[:5]]).most_common(1)[0][0],
        label)
    for x1, label in test_set])
🤯3
AutoGen: Enabling next-generation large language model applications

AutoGen is a framework for simplifying the orchestration, optimization, and automation of LLM workflows. It offers customizable and conversable agents that leverage the strongest capabilities of the most advanced LLMs, like GPT-4, while addressing their limitations by integrating with humans and tools and having conversations between multiple agents via automated chat.
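
A minimal two-agent sketch, loosely following the project's documented quickstart; the model name, API key, and task message are placeholders:

from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# The assistant writes code/answers; the user proxy relays the task and executes code.
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",                      # fully automated chat
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

user_proxy.initiate_chat(
    assistant,
    message="Write a Python script that prints the first 10 prime numbers.",
)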

Paper: https://arxiv.org/abs/2308.08155
🔥1
Patterns for Building LLM-based Systems & Products:

- Evals: To measure performance
- RAG: To add recent, external knowledge
- Fine-tuning: To get better at specific tasks
- Caching: To reduce latency & cost (a minimal sketch follows this list)
- Guardrails: To ensure output quality
- Defensive UX: To anticipate & manage errors gracefully
- Collect user feedback: To build our data flywheel
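
As one concrete illustration of the caching pattern, a minimal exact-match response cache; llm_call is a hypothetical stand-in for whatever client you actually use:

import hashlib
import json

_cache: dict[str, str] = {}

def llm_call(prompt: str, model: str = "gpt-4") -> str:
    """Hypothetical stand-in for a real LLM client call."""
    raise NotImplementedError

def cached_llm_call(prompt: str, model: str = "gpt-4") -> str:
    # Exact-match key over the request; semantic (embedding-based) caching is a
    # common extension but is omitted here.
    key = hashlib.sha256(json.dumps({"model": model, "prompt": prompt},
                                    sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = llm_call(prompt, model)
    return _cache[key]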
👍4
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

PB is a general-purpose self-referential self-improvement mechanism for LLMs. Given a seed set of mutation-prompts, thinking-styles, and a domain-specific problem description, PB generates variations of the task-prompts and mutation-prompts, exploiting the fact that LLMs can be prompted to act as mutation operators. Based on the fitness of the evolved task-prompts, a subset of evolutionary units consisting of task-prompts and their associated mutation-prompt is selected for future generations. Over multiple generations of PB, prompts adapt to the domain at hand; e.g., in a mathematical domain, PB evolved the task-prompt "Show all your working. II. You should use the correct mathematical notation and vocabulary, where appropriate. III. You should write your answer in full sentences and in words. IV. You should use examples to illustrate your points and prove your answers. V. Your workings out should be neat and legible" on GSM8K.
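
A minimal sketch of one selection-and-mutation step under this scheme; llm and fitness are assumed callables supplied by the user, and the paper's full algorithm (which also mutates the mutation-prompts and uses several classes of mutation operators) is not reproduced here:

import random

def promptbreeder_generation(population, llm, fitness, rng=random):
    """One simplified generation: binary tournament selection plus LLM-driven mutation.

    population: list of (task_prompt, mutation_prompt) evolutionary units.
    llm:        callable str -> str, the language model used as the mutation operator.
    fitness:    callable task_prompt -> float, e.g. accuracy on a small training batch.
    """
    a, b = rng.sample(population, 2)
    winner, loser = (a, b) if fitness(a[0]) >= fitness(b[0]) else (b, a)
    task_prompt, mutation_prompt = winner
    # The LLM itself acts as the mutation operator: the mutation-prompt asks it
    # to produce a variant of the winning task-prompt.
    mutated = llm(f"{mutation_prompt}\nINSTRUCTION: {task_prompt}\nNew instruction:").strip()
    new_population = list(population)
    new_population[new_population.index(loser)] = (mutated, mutation_prompt)
    return new_population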
🤯3👍1
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

L2CEval is a comprehensive evaluation of LLMs for natural-language-to-code generation along a variety of axes, such as model scale, training data, sensitivity to few-shot exemplars, and the impact of instruction tuning.

L2CEval covers a wide range of state-of-the-art models, specifically 54 models from 13 different organizations, all evaluated on 3 core domains of language-to-code generation tasks. It ranges from models as small as 1B parameters to significantly larger ones such as OpenAI's davinci and GPT-4 models, with an estimated size of 170B+ parameters.

The study can be useful for the community in applying LLMs for downstream code applications.

https://l2c-eval.github.io/
👍1
Think before you speak: Training Language Models With Pause Tokens

Language models generate responses by producing a series of tokens in immediate succession: the (K + 1)-th token is an outcome of manipulating K hidden vectors per layer, one vector per preceding token.

What happens if we delay a model’s answer generation, and how can we execute these delays? What if we were to let the model manipulate, say, K + 10 hidden vectors before it outputs the (K + 1)-th token?

The authors operationalize this idea by performing training and inference on language models with a pause token, a sequence of which is appended to the input prefix, allowing the model to perform extra computation before committing to an answer.

The main finding is that such delays yield gains on downstream tasks covering reasoning, question answering, and general understanding, when the model is both pre-trained and fine-tuned with delays.
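
A rough inference-time sketch of the idea, assuming a HuggingFace-style causal LM that has already been pre-trained and fine-tuned with a special "<pause>" token; the checkpoint path, token name, and prompt below are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/pause-finetuned-model"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

pause_id = tokenizer.convert_tokens_to_ids("<pause>")   # assumed added special token
prompt_ids = tokenizer("Q: ...", return_tensors="pt").input_ids

# Append K = 10 pause tokens: the model gets 10 extra hidden vectors per layer
# to manipulate before it must commit to the first answer token.
padded = torch.cat([prompt_ids, torch.full((1, 10), pause_id)], dim=1)

out = model.generate(padded, max_new_tokens=64)
print(tokenizer.decode(out[0][padded.shape[1]:], skip_special_tokens=True))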
👍2
ICAART 2024: International Conference on Agents and Artificial Intelligence

Conference Areas
1. Agents
2. Artificial Intelligence

Deadlines:
1. Regular Paper Submission Extension: October 26, 2023
2. Position Paper Submission: November 15, 2023
3. Doctoral Consortium Paper Submission: December 21, 2023