SecretBench: A Dataset of Software Secrets
SecretBench is a labeled dataset containing 97,479 secrets (of which 15,084 are true secrets) of various secret types, extracted from source code in 818 public GitHub repositories. The dataset covers 49 programming languages and 311 file types.
Dataset: https://github.com/setu1421/SecretBench
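Since the dataset is labeled with both true and false positives, a typical first step is filtering to the true secrets. A minimal sketch, assuming a CSV export with hypothetical `candidate`/`label` columns (not SecretBench's documented schema):

```python
import csv
import io

def true_secrets(rows, label_col="label"):
    """Keep only rows labeled as true secrets; the column names here are
    assumptions for illustration, not SecretBench's documented schema."""
    return [r for r in rows if r.get(label_col, "").strip().lower() == "true"]

sample = """candidate,label
AKIA_FAKE_EXAMPLE,true
not_really_a_key,false
"""
rows = list(csv.DictReader(io.StringIO(sample)))
print(len(true_secrets(rows)))  # 1
```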
GitHub Code Dataset
* 115M code files from GitHub
* 32 programming languages
* 1TB of data
The dataset was created from the public GitHub dataset on Google BigQuery.
from datasets import load_dataset
ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
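Because streaming yields records lazily, you can filter on the fly without downloading the full 1 TB dump. A sketch, where the `language` and `code` field names follow the dataset card and the helper function is ours:

```python
# pip install datasets  (the streaming loader comes from Hugging Face `datasets`)

def take_python_files(stream, n):
    """Collect the source text of the first n Python records from an
    iterable of dicts with 'language' and 'code' keys."""
    out = []
    for rec in stream:
        if rec.get("language") == "Python":
            out.append(rec["code"])
            if len(out) == n:
                break
    return out

# With the real dataset (requires network):
# from datasets import load_dataset
# ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
# print(take_python_files(iter(ds), 3))

# Offline demo with stand-in records:
demo = [{"language": "Python", "code": "print(1)"},
        {"language": "Go", "code": "package main"}]
print(len(take_python_files(demo, 5)))  # 1
```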
Tracking the Fake GitHub Star Black Market with Dagster, dbt and BigQuery
This is a simple Dagster project to analyze the number of fake GitHub stars on any GitHub repository:
https://github.com/dagster-io/fake-star-detector
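The core signal such analyses use is that purchased stars tend to come from newly created, low-activity accounts. A toy heuristic in that spirit, with illustrative thresholds that are not the project's actual dbt models:

```python
def looks_fake(profile):
    """Flag accounts matching the low-activity profile typical of purchased
    stars; the thresholds are illustrative, not the project's actual rules."""
    return (profile.get("followers", 0) <= 1
            and profile.get("public_repos", 0) <= 1
            and profile.get("created_then_starred_same_day", False))

def fake_star_ratio(profiles):
    """Share of a repo's stargazers that look fake under the heuristic."""
    if not profiles:
        return 0.0
    return sum(looks_fake(p) for p in profiles) / len(profiles)

stargazers = [
    {"followers": 0, "public_repos": 0, "created_then_starred_same_day": True},
    {"followers": 120, "public_repos": 30, "created_then_starred_same_day": False},
]
print(fake_star_ratio(stargazers))  # 0.5
```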
ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design
This paper presents prompt design techniques for software engineering, in the form of patterns, that address common problems when using LLMs such as ChatGPT to automate software engineering activities, for example ensuring code is decoupled from third-party libraries or creating an API specification from a requirements list.
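A prompt pattern is essentially a reusable template. A sketch of one in the spirit of the paper's decoupling pattern, where the wording is illustrative rather than quoted from the paper:

```python
def intermediate_abstraction_prompt(library, code):
    """Build a reusable prompt asking the LLM to decouple business logic
    from a third-party library; the wording is illustrative, not the
    paper's exact pattern text."""
    return (
        f"Whenever you generate code that calls {library}, route those calls "
        f"through an intermediate interface so my business logic stays "
        f"decoupled from the third-party library. Refactor this code:\n{code}"
    )

prompt = intermediate_abstraction_prompt("requests", "r = requests.get(url)")
print("decoupled" in prompt)  # True
```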
DACOS: A Manually Annotated Dataset of Code Smells
DACOS (DAtaset of COde Smells) is a manually annotated dataset containing 10,267 annotations for 5,192 code snippets. The dataset targets three kinds of code smells at different granularities:
* multifaceted abstraction,
* complex method, and
* long parameter list.
Dataset: https://zenodo.org/record/7570428#.ZBrxX6JBycw
Tagman (a web platform to create a manually annotated dataset of smells): https://github.com/SMART-Dal/Tagman
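Of the three smells, long parameter list is the easiest to detect mechanically. A sketch using Python's `ast` module for illustration (DACOS itself annotates Java code, and the threshold here is an arbitrary choice, not the dataset's annotation rule):

```python
import ast

def long_parameter_lists(source, threshold=5):
    """Flag functions whose parameter count exceeds a threshold; the
    threshold is an illustrative choice, not DACOS's annotation rule."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n = len(node.args.args) + len(node.args.kwonlyargs)
            if n > threshold:
                smells.append((node.name, n))
    return smells

code = "def ok(a, b): pass\ndef smelly(a, b, c, d, e, f): pass\n"
print(long_parameter_lists(code))  # [('smelly', 6)]
```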
Mirror: A Natural Language Interface for Data Querying, Summarization, and Visualization
Mirror is an open-source platform for data exploration and analysis powered by large language models. Mirror offers an intuitive natural language interface for querying databases, and automatically generates executable SQL commands to retrieve relevant data and summarize it in natural language. In addition, users can preview and manually edit the generated SQL commands to ensure the accuracy of their queries. Mirror also generates visualizations to facilitate understanding of the data. Designed with flexibility and human input in mind, Mirror is suitable for both experienced data analysts and non-technical professionals looking to gain insights from their data.
Mirror: https://github.com/mirror-data/mirror
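The preview-then-execute loop is the key design choice: the generated SQL is shown for review before it runs. A minimal sketch of that flow with `sqlite3`, where the "generated" SQL is hardcoded rather than produced by an LLM:

```python
import sqlite3

def run_query(conn, sql):
    """Mirror-style flow: show the generated SQL for review, then execute.
    Here the 'generated' SQL is hardcoded; Mirror obtains it from an LLM."""
    print("Generated SQL (editable before running):", sql)
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("eu", 10), ("us", 30)])
# e.g. for the question "total sales per region":
rows = run_query(conn, "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
print(rows)  # [('eu', 10), ('us', 30)]
```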
Measuring The Impact Of Programming Language Distribution (Google)
Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, the authors present BabelCode, a framework for execution-based evaluation of any benchmark in any language.
BabelCode: https://github.com/google-research/babelcode
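"Execution-based" means a candidate program passes only if it actually runs and produces the expected output. A sketch of that check for Python only (BabelCode itself transpiles test harnesses into each target language):

```python
import subprocess
import sys

def passes(candidate_source, stdin_text, expected_stdout):
    """Run a candidate program in a subprocess and compare its output to
    the expected result; handles only Python, unlike BabelCode."""
    proc = subprocess.run(
        [sys.executable, "-c", candidate_source],
        input=stdin_text, capture_output=True, text=True, timeout=10,
    )
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()

solution = "print(sum(int(x) for x in input().split()))"
print(passes(solution, "1 2 3", "6"))  # True
```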
ENASE'23 Technical Program
Conference Areas
1. Theory and Practice of Systems and Applications Development
2. Challenges and Novel Approaches to Systems and Software Engineering (SSE)
3. Systems and Software Quality
4. Systems and Software Engineering (SSE) for Emerging Domains
Improving Code Generation by Training with Natural Language Feedback
Imitation learning from language feedback (ILF) is an algorithm for learning from natural language feedback at training time. ILF requires only a small amount of human-written feedback during training and none at test time, making it both user-friendly and sample-efficient. ILF can be viewed as minimizing the KL divergence to the ground-truth distribution, and the authors demonstrate a proof of concept on a neural program synthesis task.
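The KL framing can be sketched as follows (our paraphrase of the objective, not the paper's exact notation): fine-tuning on feedback-guided refinements approximately solves

```latex
\min_{\theta} \; \mathbb{E}_{t \sim \mathcal{D}}
  \left[ \mathrm{KL}\!\left( p^{*}(\cdot \mid t) \,\middle\|\, p_{\theta}(\cdot \mid t) \right) \right]
```

where $p^{*}(\cdot \mid t)$ is a ground-truth distribution over correct programs for task $t$ and $p_{\theta}$ is the model being fine-tuned.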
An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction
The authors propose a novel representation of source code changes, the Code Change Tree, designed to keep only the differences between two abstract syntax trees of Java source code. The approach was evaluated on predicting whether a code change introduces a vulnerability, compared against multiple representation types, with a number of machine learning models as baselines. The evaluation uses a novel dataset, VIC.
RQ1. Can a vulnerability-introducing database generated from a vulnerability-fixing commit database be used for vulnerability prediction?
RQ2. How effective are Code Change Trees in representing source code changes?
RQ3. Are source code metrics sufficient to represent code changes?
dataset paper
VIC dataset
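The "keep only the differences between two ASTs" idea can be illustrated in miniature. A toy analogue using Python's `ast` module that keeps only the top-level definitions whose subtrees changed (the paper works on Java ASTs and keeps the differing subtrees themselves, not just names):

```python
import ast

def changed_top_level(before, after):
    """Toy analogue of a Code Change Tree: report top-level definitions
    whose AST dump differs between two versions of a module."""
    def dump(src):
        return {n.name: ast.dump(n) for n in ast.parse(src).body
                if isinstance(n, (ast.FunctionDef, ast.ClassDef))}
    a, b = dump(before), dump(after)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))

v1 = "def f(x): return x\ndef g(): return 1\n"
v2 = "def f(x): return x + 1\ndef g(): return 1\n"
print(changed_top_level(v1, v2))  # ['f']
```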
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X
CodeGeeX is a multilingual model with 13 billion parameters for code generation. It is pre-trained on 850 billion tokens of 23 programming languages.
- Multilingual Code Generation: CodeGeeX has good performance for generating executable programs in several mainstream programming languages, including Python, C++, Java, JavaScript, Go, etc.
- Crosslingual Code Translation: CodeGeeX supports the translation of code snippets between different languages.
- Customizable Programming Assistant: CodeGeeX is available for free in the VS Code extension marketplace. It supports code completion, explanation, summarization, and more, giving users a better coding experience.
- Open-Source and Cross-Platform: All code and model weights are publicly available for research purposes. CodeGeeX supports both Ascend and NVIDIA platforms and runs inference on a single Ascend 910, NVIDIA V100, or A100.
GitHub
Natural Language Reasoning, A Survey
This survey paper provides a definition of natural language reasoning in NLP, grounded in both philosophy and NLP scenarios, discusses what types of tasks require reasoning, and introduces a taxonomy of reasoning.
BloombergGPT: A Large Language Model for Finance
The work presents BloombergGPT, a 50-billion-parameter language model trained on a wide range of financial data. The authors construct a 363-billion-token dataset based on Bloomberg's extensive data sources. Mixed-dataset training yields a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks.
CONAN: Diagnosing Batch Failures for Cloud Systems (Microsoft)
Failure diagnosis is critical to the maintenance of large-scale cloud systems and has attracted tremendous attention from academia and industry over the last decade. In this paper, the authors focus on diagnosing batch failures, which affect a batch of instances of the same subject (e.g., API requests, VMs, or nodes), degrading service availability and performance. CONAN is an efficient and flexible framework that automatically extracts contrast patterns (failed vs. succeeded, slow vs. normal, etc.) from contextual data.
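A contrast pattern is a condition far more frequent in the failed group than in the succeeded one. A toy single-attribute miner in that spirit (CONAN's actual pattern search is richer, but the contrast idea is the same):

```python
from collections import Counter

def contrast_patterns(failed, succeeded, min_lift=2.0):
    """Find (key, value) pairs whose support among failed instances is at
    least min_lift times their support among succeeded instances."""
    f_counts = Counter(kv for inst in failed for kv in inst.items())
    s_counts = Counter(kv for inst in succeeded for kv in inst.items())
    patterns = []
    for kv, c in f_counts.items():
        support_f = c / len(failed)
        support_s = s_counts.get(kv, 0) / len(succeeded)
        lift = support_f / support_s if support_s else float("inf")
        if lift >= min_lift:
            patterns.append(kv)
    return patterns

failed = [{"node": "n7", "api": "CreateVM"}, {"node": "n7", "api": "DeleteVM"}]
succeeded = [{"node": "n1", "api": "CreateVM"}, {"node": "n2", "api": "DeleteVM"}]
print(contrast_patterns(failed, succeeded))  # [('node', 'n7')]
```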
ICCQ'23: The Third International Conference on Code Quality
- What IS Code Quality: from “ilities” to QWAN
- Mutant Selection Strategies in Mutation Testing
- Understanding Software Performance Challenges - An Empirical Study on Stack Overflow
- Applying Machine Learning Analysis for Software Quality Test
- Test-based and metric-based evaluation of code generation models for practical question answering
Accepted papers
Live
ICCQ.ru
ICCQ-2023: 3rd International Conference on Code Quality
In cooperation with IEEE Computer Society the event is focused on static analysis, program verification, bug detection, and software maintenance.
Federated Learning with Flexible Control (IBM)
Federated learning (FL) enables distributed model training from local data collected by users. Existing works have separately considered different configurations to make FL more efficient, such as infrequent transmission of model updates, client subsampling, and compression of update vectors. However, an important open problem is how to jointly apply and tune these control knobs in a single FL algorithm.
Is it possible to jointly apply a wide range of control options in a single FL algorithm, to support heterogeneous and time-varying costs of multiple types of resources?
FlexFL is an FL algorithm, which allows flexible configurations in the amount of computation at each client and the amount of communication between clients and the server. This algorithm provides a high degree of freedom in adapting the FL procedure to heterogeneous and dynamically changing resource costs.
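Two of those control knobs, client subsampling and the number of local steps, are easy to make concrete. A 1-D least-squares FedAvg toy that exposes both as parameters (illustrative, not the paper's algorithm):

```python
import random

def fedavg_round(global_w, clients, local_steps=5, sample_frac=0.5, lr=0.1):
    """One FedAvg round on a 1-D least-squares toy: subsample clients,
    run local gradient steps, then average the resulting weights."""
    sampled = random.sample(clients, max(1, int(sample_frac * len(clients))))
    updates = []
    for points in sampled:
        w = global_w
        for _ in range(local_steps):
            for x, y in points:
                w -= lr * (w * x - y) * x  # gradient of (w*x - y)^2 / 2
        updates.append(w)
    return sum(updates) / len(updates)

random.seed(0)
clients = [[(1.0, 2.0)], [(1.0, 2.2)], [(1.0, 1.8)]]  # per-client (x, y) data
w = 0.0
for _ in range(20):
    w = fedavg_round(w, clients, local_steps=3, sample_frac=0.67)
print(w)  # settles near 2.0, the mean of the client targets
```

Raising `local_steps` or lowering `sample_frac` trades computation against communication, which is exactly the kind of joint tuning the paper studies.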
DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection
The paper presents a new dataset, DiverseVul, for detecting software vulnerabilities using deep learning. The dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 nonvulnerable functions extracted from 7,861 commits, which is more diverse and twice the size of the previous largest and most diverse dataset, CVEFixes. The authors plan to publish the DiverseVul dataset.