SecretBench: A Dataset of Software Secrets
SecretBench is a labeled dataset containing 97,479 secrets (of which 15,084 are true secrets) of various secret types, extracted from source code in 818 public GitHub repositories. The dataset covers 49 programming languages and 311 file types.
Dataset: https://github.com/setu1421/SecretBench
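Since the dataset is labeled with both true and false positives, a typical first step is filtering to the true secrets. A minimal sketch, assuming a CSV export with hypothetical `candidate`/`label` columns (not SecretBench's documented schema):

```python
import csv
import io

def true_secrets(rows, label_col="label"):
    """Keep only rows labeled as true secrets; the column names here are
    assumptions for illustration, not SecretBench's documented schema."""
    return [r for r in rows if r.get(label_col, "").strip().lower() == "true"]

sample = """candidate,label
AKIA_FAKE_EXAMPLE,true
not_really_a_key,false
"""
rows = list(csv.DictReader(io.StringIO(sample)))
print(len(true_secrets(rows)))  # 1
```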
GitHub Code Dataset
* 115M code files from GitHub
* 32 programming languages
* 1TB of data
The dataset was created from the public GitHub dataset on Google BigQuery.
from datasets import load_dataset
ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
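Because streaming yields records lazily, you can filter on the fly without downloading the full 1 TB dump. A sketch, where the `language` and `code` field names follow the dataset card and the helper function is ours:

```python
# pip install datasets  (the streaming loader comes from Hugging Face `datasets`)

def take_python_files(stream, n):
    """Collect the source text of the first n Python records from an
    iterable of dicts with 'language' and 'code' keys."""
    out = []
    for rec in stream:
        if rec.get("language") == "Python":
            out.append(rec["code"])
            if len(out) == n:
                break
    return out

# With the real dataset (requires network):
# from datasets import load_dataset
# ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
# print(take_python_files(iter(ds), 3))

# Offline demo with stand-in records:
demo = [{"language": "Python", "code": "print(1)"},
        {"language": "Go", "code": "package main"}]
print(len(take_python_files(demo, 5)))  # 1
```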
Tracking the Fake GitHub Star Black Market with Dagster, dbt and BigQuery
This is a simple Dagster project to analyze the number of fake GitHub stars on any GitHub repository:
https://github.com/dagster-io/fake-star-detector
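The core signal such analyses use is that purchased stars tend to come from newly created, low-activity accounts. A toy heuristic in that spirit, with illustrative thresholds that are not the project's actual dbt models:

```python
def looks_fake(profile):
    """Flag accounts matching the low-activity profile typical of purchased
    stars; the thresholds are illustrative, not the project's actual rules."""
    return (profile.get("followers", 0) <= 1
            and profile.get("public_repos", 0) <= 1
            and profile.get("created_then_starred_same_day", False))

def fake_star_ratio(profiles):
    """Share of a repo's stargazers that look fake under the heuristic."""
    if not profiles:
        return 0.0
    return sum(looks_fake(p) for p in profiles) / len(profiles)

stargazers = [
    {"followers": 0, "public_repos": 0, "created_then_starred_same_day": True},
    {"followers": 120, "public_repos": 30, "created_then_starred_same_day": False},
]
print(fake_star_ratio(stargazers))  # 0.5
```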
ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design
This paper presents prompt design techniques for software engineering, in the form of patterns, that address common problems when using LLMs such as ChatGPT to automate software engineering activities, for example ensuring code is decoupled from third-party libraries or creating an API specification from a requirements list.
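A prompt pattern is essentially a reusable template. A sketch of one in the spirit of the paper's decoupling pattern, where the wording is illustrative rather than quoted from the paper:

```python
def intermediate_abstraction_prompt(library, code):
    """Build a reusable prompt asking the LLM to decouple business logic
    from a third-party library; the wording is illustrative, not the
    paper's exact pattern text."""
    return (
        f"Whenever you generate code that calls {library}, route those calls "
        f"through an intermediate interface so my business logic stays "
        f"decoupled from the third-party library. Refactor this code:\n{code}"
    )

prompt = intermediate_abstraction_prompt("requests", "r = requests.get(url)")
print("decoupled" in prompt)  # True
```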
DACOS: A Manually Annotated Dataset of Code Smells
DACOS (DAtaset of COde Smells) is a manually annotated dataset containing 10,267 annotations for 5,192 code snippets. The dataset targets three kinds of code smells at different granularities:
* multifaceted abstraction,
* complex method, and
* long parameter list.
Dataset: https://zenodo.org/record/7570428#.ZBrxX6JBycw
Tagman (a web platform to create a manually annotated dataset of smells): https://github.com/SMART-Dal/Tagman
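Of the three smells, long parameter list is the easiest to detect mechanically. A sketch using Python's `ast` module for illustration (DACOS itself annotates Java code, and the threshold here is an arbitrary choice, not the dataset's annotation rule):

```python
import ast

def long_parameter_lists(source, threshold=5):
    """Flag functions whose parameter count exceeds a threshold; the
    threshold is an illustrative choice, not DACOS's annotation rule."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n = len(node.args.args) + len(node.args.kwonlyargs)
            if n > threshold:
                smells.append((node.name, n))
    return smells

code = "def ok(a, b): pass\ndef smelly(a, b, c, d, e, f): pass\n"
print(long_parameter_lists(code))  # [('smelly', 6)]
```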
Mirror: A Natural Language Interface for Data Querying, Summarization, and Visualization
Mirror is an open-source platform for data exploration and analysis powered by large language models. Mirror offers an intuitive natural language interface for querying databases, and automatically generates executable SQL commands to retrieve relevant data and summarize it in natural language. In addition, users can preview and manually edit the generated SQL commands to ensure the accuracy of their queries. Mirror also generates visualizations to facilitate understanding of the data. Designed with flexibility and human input in mind, Mirror is suitable for both experienced data analysts and non-technical professionals looking to gain insights from their data.
Mirror: https://github.com/mirror-data/mirror
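The preview-then-execute loop is the key design choice: the generated SQL is shown for review before it runs. A minimal sketch of that flow with `sqlite3`, where the "generated" SQL is hardcoded rather than produced by an LLM:

```python
import sqlite3

def run_query(conn, sql):
    """Mirror-style flow: show the generated SQL for review, then execute.
    Here the 'generated' SQL is hardcoded; Mirror obtains it from an LLM."""
    print("Generated SQL (editable before running):", sql)
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("eu", 10), ("us", 30)])
# e.g. for the question "total sales per region":
rows = run_query(conn, "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
print(rows)  # [('eu', 10), ('us', 30)]
```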
Measuring The Impact Of Programming Language Distribution (Google)
Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, the authors present BabelCode, a framework for execution-based evaluation of any benchmark in any language.
BabelCode: https://github.com/google-research/babelcode
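"Execution-based" means a candidate program passes only if it actually runs and produces the expected output. A sketch of that check for Python only (BabelCode itself transpiles test harnesses into each target language):

```python
import subprocess
import sys

def passes(candidate_source, stdin_text, expected_stdout):
    """Run a candidate program in a subprocess and compare its output to
    the expected result; handles only Python, unlike BabelCode."""
    proc = subprocess.run(
        [sys.executable, "-c", candidate_source],
        input=stdin_text, capture_output=True, text=True, timeout=10,
    )
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()

solution = "print(sum(int(x) for x in input().split()))"
print(passes(solution, "1 2 3", "6"))  # True
```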
ENASE'23 Technical Program
Conference Areas
1. Theory and Practice of Systems and Applications Development
2. Challenges and Novel Approaches to Systems and Software Engineering (SSE)
3. Systems and Software Quality
4. Systems and Software Engineering (SSE) for Emerging Domains
Improving Code Generation by Training with Natural Language Feedback
Imitation learning from language feedback (ILF) is an algorithm for learning from natural language feedback at training time. ILF requires only a small amount of human-written feedback during training and none at test time, making it both user-friendly and sample-efficient. ILF can be viewed as minimizing the KL divergence to the ground-truth distribution, and the authors demonstrate a proof of concept on a neural program synthesis task.
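The KL framing can be sketched as follows (our paraphrase of the objective, not the paper's exact notation): fine-tuning on feedback-guided refinements approximately solves

```latex
\min_{\theta} \; \mathbb{E}_{t \sim \mathcal{D}}
  \left[ \mathrm{KL}\!\left( p^{*}(\cdot \mid t) \,\middle\|\, p_{\theta}(\cdot \mid t) \right) \right]
```

where $p^{*}(\cdot \mid t)$ is a ground-truth distribution over correct programs for task $t$ and $p_{\theta}$ is the model being fine-tuned.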
An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction
The authors propose a novel representation of source code changes, the Code Change Tree, designed to keep only the differences between two abstract syntax trees of Java source code. The approach was evaluated on predicting whether a code change introduces a vulnerability, compared against multiple representation types, with a number of machine learning models as baselines. The evaluation uses a novel dataset, VIC.
RQ1. Can a vulnerability-introducing database generated from a vulnerability-fixing commit database be used for vulnerability prediction?
RQ2. How effective are Code Change Trees in representing source code changes?
RQ3. Are source code metrics sufficient to represent code changes?
dataset paper
VIC dataset
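The "keep only the differences between two ASTs" idea can be illustrated in miniature. A toy analogue using Python's `ast` module that keeps only the top-level definitions whose subtrees changed (the paper works on Java ASTs and keeps the differing subtrees themselves, not just names):

```python
import ast

def changed_top_level(before, after):
    """Toy analogue of a Code Change Tree: report top-level definitions
    whose AST dump differs between two versions of a module."""
    def dump(src):
        return {n.name: ast.dump(n) for n in ast.parse(src).body
                if isinstance(n, (ast.FunctionDef, ast.ClassDef))}
    a, b = dump(before), dump(after)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))

v1 = "def f(x): return x\ndef g(): return 1\n"
v2 = "def f(x): return x + 1\ndef g(): return 1\n"
print(changed_top_level(v1, v2))  # ['f']
```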
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X
CodeGeeX is a multilingual model with 13 billion parameters for code generation. It is pre-trained on 850 billion tokens of 23 programming languages.
- Multilingual Code Generation: CodeGeeX has good performance for generating executable programs in several mainstream programming languages, including Python, C++, Java, JavaScript, Go, etc.
- Crosslingual Code Translation: CodeGeeX supports the translation of code snippets between different languages.
- Customizable Programming Assistant: CodeGeeX is available for free in the VS Code extension marketplace. It supports code completion, explanation, summarization, and more, giving users a better coding experience.
- Open-Source and Cross-Platform: All code and model weights are publicly available for research purposes. CodeGeeX supports both Ascend and NVIDIA platforms and runs inference on a single Ascend 910, NVIDIA V100, or A100.
GitHub
Natural Language Reasoning, A Survey
This survey paper provides a definition of natural language reasoning in NLP, grounded in both philosophy and NLP scenarios, discusses what types of tasks require reasoning, and introduces a taxonomy of reasoning.
BloombergGPT: A Large Language Model for Finance
The work presents BloombergGPT, a 50-billion-parameter language model trained on a wide range of financial data. The authors construct a 363-billion-token dataset based on Bloomberg's extensive data sources. Mixed-dataset training yields a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks.
CONAN: Diagnosing Batch Failures for Cloud Systems (Microsoft)
Failure diagnosis is critical to the maintenance of large-scale cloud systems and has attracted tremendous attention from academia and industry over the last decade. In this paper, the authors focus on diagnosing batch failures, which affect a batch of instances of the same subject (e.g., API requests, VMs, or nodes), degrading service availability and performance. CONAN is an efficient and flexible framework that automatically extracts contrast patterns (failed vs. succeeded, slow vs. normal, etc.) from contextual data.
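A contrast pattern is a condition far more frequent in the failed group than in the succeeded one. A toy single-attribute miner in that spirit (CONAN's actual pattern search is richer, but the contrast idea is the same):

```python
from collections import Counter

def contrast_patterns(failed, succeeded, min_lift=2.0):
    """Find (key, value) pairs whose support among failed instances is at
    least min_lift times their support among succeeded instances."""
    f_counts = Counter(kv for inst in failed for kv in inst.items())
    s_counts = Counter(kv for inst in succeeded for kv in inst.items())
    patterns = []
    for kv, c in f_counts.items():
        support_f = c / len(failed)
        support_s = s_counts.get(kv, 0) / len(succeeded)
        lift = support_f / support_s if support_s else float("inf")
        if lift >= min_lift:
            patterns.append(kv)
    return patterns

failed = [{"node": "n7", "api": "CreateVM"}, {"node": "n7", "api": "DeleteVM"}]
succeeded = [{"node": "n1", "api": "CreateVM"}, {"node": "n2", "api": "DeleteVM"}]
print(contrast_patterns(failed, succeeded))  # [('node', 'n7')]
```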
ICCQ'23: The Third International Conference on Code Quality
- What IS Code Quality: from “ilities” to QWAN
- Mutant Selection Strategies in Mutation Testing
- Understanding Software Performance Challenges - An Empirical Study on Stack Overflow
- Applying Machine Learning Analysis for Software Quality Test
- Test-based and metric-based evaluation of code generation models for practical question answering
Accepted papers
Live
ICCQ.ru
ICCQ-2023: 3rd International Conference on Code Quality
In cooperation with IEEE Computer Society the event is focused on static analysis, program verification, bug detection, and software maintenance.
Federated Learning with Flexible Control (IBM)
Federated learning (FL) enables distributed model training from local data collected by users. Existing works have separately considered different configurations to make FL more efficient, such as infrequent transmission of model updates, client subsampling, and compression of update vectors. However, an important open problem is how to jointly apply and tune these control knobs in a single FL algorithm.
Is it possible to jointly apply a wide range of control options in a single FL algorithm, to support heterogeneous and time-varying costs of multiple types of resources?
FlexFL is an FL algorithm, which allows flexible configurations in the amount of computation at each client and the amount of communication between clients and the server. This algorithm provides a high degree of freedom in adapting the FL procedure to heterogeneous and dynamically changing resource costs.
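Two of those control knobs, client subsampling and the number of local steps, are easy to make concrete. A 1-D least-squares FedAvg toy that exposes both as parameters (illustrative, not the paper's algorithm):

```python
import random

def fedavg_round(global_w, clients, local_steps=5, sample_frac=0.5, lr=0.1):
    """One FedAvg round on a 1-D least-squares toy: subsample clients,
    run local gradient steps, then average the resulting weights."""
    sampled = random.sample(clients, max(1, int(sample_frac * len(clients))))
    updates = []
    for points in sampled:
        w = global_w
        for _ in range(local_steps):
            for x, y in points:
                w -= lr * (w * x - y) * x  # gradient of (w*x - y)^2 / 2
        updates.append(w)
    return sum(updates) / len(updates)

random.seed(0)
clients = [[(1.0, 2.0)], [(1.0, 2.2)], [(1.0, 1.8)]]  # per-client (x, y) data
w = 0.0
for _ in range(20):
    w = fedavg_round(w, clients, local_steps=3, sample_frac=0.67)
print(w)  # settles near 2.0, the mean of the client targets
```

Raising `local_steps` or lowering `sample_frac` trades computation against communication, which is exactly the kind of joint tuning the paper studies.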
DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection
The paper presents a new dataset, DiverseVul, for detecting software vulnerabilities using deep learning. The dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 nonvulnerable functions extracted from 7,861 commits, which is more diverse and twice the size of the previous largest and most diverse dataset, CVEFixes. The authors plan to publish the DiverseVul dataset.