COSCO: On Contrastive Learning of Semantic Similarity for Code to Code Search
The paper introduces a novel code-to-code search technique that enhances the performance of LLMs by including both static and dynamic features and by using both similar and dissimilar examples during training. The authors present a code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time. The proposed approach outperforms the state-of-the-art cross-language search tool by up to 44.7%. A minimal sketch of the contrastive setup follows the research questions below.
COSCO (github)
RQ1. How does COSCO’s performance compare to the performance of other cross-language code search techniques?
RQ2. Does COSCO’s methodology and performance generalize across different models?
RQ3. Does including semantic similarity scores during training improve code search?
RQ4. How does changing the number of positive and negative comparison samples available for training affect COSCO’s performance?
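As referenced above, here is a minimal sketch of a contrastive training step for code-to-code search in the spirit of COSCO's positive/negative setup. The encoder checkpoint, the mean pooling, and the margin loss are illustrative assumptions, not the paper's exact recipe, and the semantic-similarity supervision from RQ3 is omitted.

```python
# Hedged sketch of a generic contrastive step over similar (positive) and
# dissimilar (negative) code pairs; encoder choice and margin loss are
# illustrative stand-ins, not COSCO's exact objective.
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
encoder = AutoModel.from_pretrained("microsoft/graphcodebert-base")

def embed(snippets):
    # Mean-pool the last hidden state into one vector per snippet.
    batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

def contrastive_loss(query, positives, negatives, margin=0.5):
    q = F.normalize(embed([query]), dim=-1)              # (1, H)
    pos = F.normalize(embed(positives), dim=-1)          # (P, H)
    neg = F.normalize(embed(negatives), dim=-1)          # (N, H)
    pos_sim = q @ pos.T                                  # similar pairs -> high
    neg_sim = q @ neg.T                                  # dissimilar pairs -> low
    # Push positives above negatives by at least `margin`.
    return F.relu(margin - pos_sim.mean() + neg_sim.mean())
```

At inference time only the learned embeddings are compared, which is what allows the approach to avoid executing the query or the corpus.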
GitHub code search is generally available
New code search and code view are generally available to all users on GitHub.com.
The world’s code is now at your fingertips.
Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering
This book contains the Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2023). The conference is sponsored by the Institute for Systems and Technologies of Information, Control and Communication (INSTICC), held in cooperation with the ACM Special Interest Group on Management Information Systems (ACM SIGMIS), and technically co-sponsored by the IEEE SMC Technical Committee on Enterprise Information Systems. This year’s ENASE is held in Prague, Czech Republic, on April 24−25.
StarCoder: may the source be with you!
The BigCode community, an open-scientific collaboration working on the responsible development of Code LLMs, introduces StarCoder and StarCoderBase:
- 15.5B parameter models
- 8K context length
- StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process
- StarCoderBase is fine-tuned on 35B Python tokens, resulting in the creation of StarCoder
StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model.
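A hedged usage sketch of loading StarCoder for code completion through the Hugging Face transformers API. The checkpoint id is an assumption (the models are published by the BigCode community, and access may require accepting the model license on the Hub).

```python
# Hedged sketch: code completion with StarCoder via `transformers`.
# The checkpoint id below is an assumed Hub identifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # assumed Hub id; may require license acceptance
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```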
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
The Vault is an open-source large-scale code-text dataset designed to enhance the training of code-focused LLMs. Existing open-source datasets for training code-based LLMs often face challenges in terms of size, quality, and format. The Vault overcomes these limitations by providing 40M code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and various levels of code-text pairings, including class, function, and line levels.
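A hedged sketch of how such a dataset is typically consumed with the datasets library. The Hub id, split name, and record fields below are assumptions based on the paper's description; check the dataset card for the actual identifiers.

```python
# Hedged sketch: streaming one function-level code-text pair from The Vault.
# The Hub id and split name are assumptions; consult the dataset card.
from datasets import load_dataset

vault = load_dataset("Fsoft-AIC/the-vault-function",  # assumed Hub id
                     split="train", streaming=True)
for example in vault.take(1):
    print(example)  # one code-text pair; field names per the dataset card
```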
Introducing 100K Token Context Windows
- approximately 75K words
- hundreds of pages
- a book, for example "The Great Gatsby" (about 72K tokens)
- a text that will take approximately 5 hours to read
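The arithmetic behind these figures, with the ratios as stated assumptions: at roughly 0.75 English words per token, 100K tokens come to about 75K words, and at a typical reading speed of about 250 words per minute that is roughly 300 minutes, i.e. about 5 hours.

```python
# Back-of-the-envelope check of the figures above; the words-per-token ratio
# and reading speed are rough assumptions, not exact tokenizer statistics.
TOKENS = 100_000
WORDS_PER_TOKEN = 0.75   # rough average for English prose
READING_WPM = 250        # typical silent reading speed

words = TOKENS * WORDS_PER_TOKEN      # ≈ 75,000 words
hours = words / READING_WPM / 60      # ≈ 5 hours
print(f"{words:,.0f} words, ~{hours:.1f} hours to read")
```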
Visualization in the Era of Artificial Intelligence: Experiments for Creating Structural Visualizations by Prompting LLMs
Experiments with 2D/3D visualization using LLMs.
Measuring the Runtime Performance of Code Produced with GitHub Copilot
GitHub Copilot is an AI programming assistant used by many developers. The authors evaluate the runtime performance of code produced when developers use GitHub Copilot versus when they do not. To this end, they conducted a user study with 32 participants in which each participant solved two C++ programming problems, one with Copilot and one without, and measured the runtime performance of the resulting solutions (a minimal timing sketch follows the research questions). The results suggest that using Copilot may produce code with significantly slower runtime performance.
RQ0: Does using Copilot influence program correctness?
RQ1: Is there a runtime performance difference in code when using GitHub Copilot?
RQ2: Do Copilot’s suggestions sway developers towards or away from code with faster runtime performance?
RQ3: Do characteristics of Copilot users influence the run-time performance when it is used?
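As mentioned above, a minimal sketch of the kind of measurement such a study relies on: compiling a C++ solution and timing repeated runs. The file names, compiler flags, and run count are illustrative assumptions; the paper's actual benchmarking harness is not reproduced here.

```python
# Hedged sketch: timing a compiled C++ solution over repeated runs.
# File names and run count are illustrative.
import subprocess, time, statistics

subprocess.run(["g++", "-O2", "-o", "solution", "solution.cpp"], check=True)

runs = []
for _ in range(30):                      # repeat to smooth out noise
    start = time.perf_counter()
    subprocess.run(["./solution"], input=b"", check=True)
    runs.append(time.perf_counter() - start)

print(f"median {statistics.median(runs)*1e3:.1f} ms over {len(runs)} runs")
```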
RLocator: Reinforcement Learning for Bug Localization
The authors propose RLocator, an RL-based technique that, given a bug report, ranks the source code files where the bug may reside. The contribution of the study is the formulation of bug localization as a Markov Decision Process, which makes it possible to optimize the evaluation measures directly. RLocator is evaluated on 8,316 bug reports. The authors find that it outperforms other state-of-the-art techniques on MAP and performs competitively on MRR in most cases, and they conclude that RL for bug localization is a promising avenue for future exploration. A sketch of the two measures follows.
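Both measures are computed over ranked file lists and are not differentiable, which is what motivates optimizing them with RL rather than a surrogate loss. In this minimal sketch, rankings and buggy are hypothetical inputs mapping each bug report to its ranked files and its ground-truth buggy files.

```python
# Minimal sketch of the two evaluation measures named above, computed over
# ranked lists of files per bug report; `rankings` and `buggy` are
# hypothetical inputs for illustration.
def mean_reciprocal_rank(rankings, buggy):
    rr = []
    for report, files in rankings.items():
        # Rank of the first truly buggy file, or None if absent.
        rank = next((i + 1 for i, f in enumerate(files) if f in buggy[report]), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def mean_average_precision(rankings, buggy):
    aps = []
    for report, files in rankings.items():
        hits, precisions = 0, []
        for i, f in enumerate(files, start=1):
            if f in buggy[report]:
                hits += 1
                precisions.append(hits / i)   # precision at each hit
        aps.append(sum(precisions) / max(len(buggy[report]), 1))
    return sum(aps) / len(aps)
```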
Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models
In this work, the authors conduct the first large-scale study of the effectiveness of LLMs in helping engineers root-cause and mitigate production incidents. A human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.
Code Execution with Pre-trained Language Models
Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pretrained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, the authors aim to teach pretrained models the real-world code execution process. They propose CodeExecutor, a Transformer-based model that learns to execute arbitrary programs and predict their execution traces.
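To make "execution trace" concrete, here is a sketch that records one for a toy program with Python's built-in sys.settrace: the executed line numbers together with the variable state at each step. CodeExecutor learns to predict information of this kind from source alone; its exact trace format is defined in the paper and not reproduced here.

```python
# Sketch of what an execution trace looks like, collected with the
# standard-library tracing hook; illustrative only.
import sys

trace = []

def tracer(frame, event, arg):
    if event == "line":
        trace.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

def toy(n):
    total = 0
    for i in range(n):
        total += i
    return total

sys.settrace(tracer)
toy(3)
sys.settrace(None)

for lineno, local_vars in trace:
    print(lineno, local_vars)   # executed line with the variables at that point
```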
Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets
The authors argue that using a code snippet (and possibly an associated traceback) as a query and looking for answers with bugfixing instructions and code samples is a natural use case that is not covered by existing approaches. The paper presents a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data; it turns out that in this setting, existing architectures fall short of the simplest BM25 baseline even after fine-tuning.
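For context on the baseline the fine-tuned retrievers fail to beat: BM25 scores documents by lexical overlap with the query, weighted by term rarity and document length. A minimal sketch using the rank_bm25 package; the toy corpus and whitespace tokenization are illustrative, not the paper's preprocessing.

```python
# Minimal BM25 retrieval sketch with the `rank_bm25` package; the toy
# corpus of bugfix answers and the tokenization are illustrative.
from rank_bm25 import BM25Okapi

corpus = [
    "TypeError: 'NoneType' object is not iterable -- fix: check for None",
    "IndexError: list index out of range -- fix: validate the index first",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "for item in data: TypeError NoneType not iterable".split()
scores = bm25.get_scores(query)        # higher score = better match
print(max(range(len(corpus)), key=scores.__getitem__))
```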
CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
Understanding semantic similarity is an important aspect of language processing. The authors present a new method, CCT-LM, that improves this ability via a novel cross-consistency training (CCT) pretraining approach and demonstrate its viability on clone detection and code search tasks. The proposed CCT-LM model outperforms strong baselines on all presented tasks, indicating that CCT pretraining gives a language model a better understanding of semantic similarity.