ML Research Hub
32.7K subscribers
3.99K photos
226 videos
23 files
4.29K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Article Title:
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

Article Date: 24 May 2025

Article Description:
Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o's exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose OmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to the commercial state-of-the-art model GPT-4o.
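
A minimal sketch of the image-to-image stylization setup that a plugin like OmniConsistency targets: a Flux backbone with an arbitrary style LoRA, run through the diffusers library. The model ID, LoRA path, and sampler settings below are placeholder assumptions, and loading OmniConsistency's own consistency weights is not shown (that follows the authors' repo).

```python
# Baseline Flux image-to-image stylization with a style LoRA (diffusers).
# Paths, model IDs, and hyperparameters are illustrative assumptions.
import torch
from diffusers import FluxImg2ImgPipeline
from diffusers.utils import load_image

pipe = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Any off-the-shelf style LoRA can be attached; OmniConsistency is designed to
# remain plug-and-play alongside such adapters.
pipe.load_lora_weights("path/to/style_lora.safetensors", adapter_name="style")

src = load_image("photo.png")  # source image whose identity/composition should be preserved
out = pipe(
    prompt="illustration in the target style",
    image=src,
    strength=0.8,              # how strongly to restyle the source
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
out.save("stylized.png")
```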

PDF Download Link:
https://arxiv.org/pdf/2505.18445v1.pdf

GitHub:
https://github.com/showlab/omniconsistency

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Article Date: 27 Apr 2025

Article Description:
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models are increasingly used in high-stakes domains such as healthcare and finance, effective hallucination detection becomes crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
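
The core idea of response-level confidence scorers can be illustrated without the toolkit itself. Below is a hedged sketch (not the UQLM API) of a black-box consistency scorer plus a tunable weighted ensemble, both returning scores in [0, 1]; `fake_llm` stands in for any real generation call.

```python
# Illustrative black-box consistency scorer and tunable ensemble (not UQLM's API):
# sample several responses, measure agreement with the majority answer, and
# combine scorer outputs into a single confidence in [0, 1].
from collections import Counter

def black_box_confidence(generate, prompt, k=5):
    """generate(prompt) -> str is any LLM call; confidence = agreement rate."""
    answers = [generate(prompt).strip().lower() for _ in range(k)]
    primary, count = Counter(answers).most_common(1)[0]
    return primary, count / k

def ensemble_confidence(scores, weights):
    """Weighted average of individual scorer outputs (each already in [0, 1])."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# Toy usage with a stub generator.
def fake_llm(prompt):
    return "42"

answer, bb_score = black_box_confidence(fake_llm, "What is 6 * 7?")
judge_score = 0.9   # e.g. an LLM-as-a-Judge scorer, stubbed here
conf = ensemble_confidence([bb_score, judge_score], weights=[0.6, 0.4])
print(answer, round(conf, 3))
```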

PDF Download Link:
https://arxiv.org/pdf/2504.19254v2.pdf

GitHub:
https://github.com/cvs-health/uqlm

Datasets:
• GSM8K
• SVAMP
• PopQA
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
This channel is for Programmers, Coders, and Software Engineers.

0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages

https://news.1rj.ru/str/addlist/8_rRW2scgfRhOTc0

https://news.1rj.ru/str/Codeprogrammer
Article Title:
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Article Date: 21 Apr 2025

Article Description:
Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose PCDController, a plug-and-play control module trained with a frozen video generative backbone, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundation models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.
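
The geometric step PCDController builds on, unprojecting a monocular depth map into a point cloud with pinhole intrinsics, can be sketched in a few lines. The intrinsic values and the random depth map below are placeholders; Uni3C's actual conditioning pipeline lives in the authors' repository.

```python
# Unproject a depth map into a camera-frame point cloud (illustrative values).
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth. Returns (H*W, 3) points in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.random.uniform(1.0, 5.0, size=(480, 640)).astype(np.float32)
points = depth_to_pointcloud(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(points.shape)  # (307200, 3): points that can be re-projected under a new camera pose
```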

PDF Download Link:
https://arxiv.org/pdf/2504.14899v1.pdf

GitHub:
https://github.com/ewrfcas/uni3c

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
30k 🌟
If you would like to donate stars, please send them as gifts instead; that works better for us.

Gifts are available starting from 10 stars.
Article Title:
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Article Date: 22 Feb 2025

Article Description:
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
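
A conceptual sketch of what embedding rigor into automated experimentation can look like in code, not Curie's actual implementation: a proposed experiment plan must pass basic methodological checks before execution, and every run is logged for interpretability. All names and checks below are illustrative assumptions.

```python
# Rigor-gated experiment loop (illustrative, not Curie's code).
import json, random

def rigor_check(plan):
    """Reject plans missing basic methodological ingredients."""
    required = ["seed", "baseline", "treatment", "metric"]
    missing = [k for k in required if k not in plan]
    return len(missing) == 0, missing

def run_experiment(plan):
    random.seed(plan["seed"])                      # reproducibility
    baseline = random.gauss(0.70, 0.01)            # stand-in measurements
    treatment = random.gauss(0.74, 0.01)
    return {"baseline": baseline, "treatment": treatment, "metric": plan["metric"]}

plan = {"seed": 0, "baseline": "vanilla", "treatment": "new-method", "metric": "accuracy"}
ok, missing = rigor_check(plan)
if ok:
    result = run_experiment(plan)
    print(json.dumps({"plan": plan, "result": result}, indent=2))  # experiment knowledge log
else:
    print("Rejected: missing", missing)
```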

PDF Download Link:
https://arxiv.org/pdf/2502.16069v1.pdf

GitHub:
https://github.com/just-curieous/curie

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Article Date: 23 May 2025

Article Description:
In Transformer architectures, tokens, the discrete units derived from raw data, are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, among other benefits. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
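
For readers who want the mechanics, here is a generic token-reduction sketch (not a method proposed by the survey): tokens are ranked by an importance score, here the mean attention they receive, and only the top-k are kept.

```python
# Generic top-k token pruning by attention importance (illustrative only).
import torch

def prune_tokens(x, attn, keep_ratio=0.5):
    """x: (B, N, D) token embeddings; attn: (B, N, N) attention weights."""
    importance = attn.mean(dim=1)                     # (B, N): attention each token receives
    k = max(1, int(x.shape[1] * keep_ratio))
    idx = importance.topk(k, dim=1).indices           # indices of kept tokens
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
    return torch.gather(x, 1, idx_exp), idx

x = torch.randn(2, 196, 768)                          # e.g. ViT patch tokens
attn = torch.softmax(torch.randn(2, 196, 196), dim=-1)
pruned, kept = prune_tokens(x, attn, keep_ratio=0.25)
print(pruned.shape)                                   # torch.Size([2, 49, 768])
```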

PDF Download Link:
https://arxiv.org/pdf/2505.18227v1.pdf

GitHub:
https://github.com/zlkong/awesome-token-compression-reduction

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Article Date: 29 May 2025

Article Description:
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
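
The outer loop the DGM describes, sample from an archive, mutate with a foundation model, evaluate on a benchmark, and archive the result, can be sketched schematically. The mutate and evaluate functions below are stubs, not the paper's implementation.

```python
# Schematic archive-based self-improvement loop (stubs, not the DGM codebase).
import random

archive = [{"code": "baseline_agent", "score": 0.20}]

def mutate(parent_code):
    # In the DGM this is a foundation-model edit of the agent's own code.
    return parent_code + "+edit"

def evaluate(agent_code):
    # In the DGM this is SWE-bench / Polyglot; here a noisy stand-in score.
    return min(1.0, 0.20 + 0.01 * agent_code.count("+") + random.uniform(0, 0.02))

for step in range(20):
    parent = random.choice(archive)        # open-ended: any archived agent can seed a branch
    child_code = mutate(parent["code"])
    child = {"code": child_code, "score": evaluate(child_code)}
    archive.append(child)                  # the archive grows; no single lineage is enforced

print(len(archive), max(a["score"] for a in archive))
```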

PDF Download Link:
https://arxiv.org/pdf/2505.22954v1.pdf

GitHub:
https://github.com/jennyzzt/dgm

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Article Date: 30 May 2025

Article Description:
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77× training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
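
A schematic sketch (not AReaL's code) of the decoupling the paper describes: a rollout worker keeps generating without waiting for the trainer, and the trainer discards samples whose policy version is older than a staleness budget before applying an update.

```python
# Decoupled asynchronous rollout/training loop with a staleness filter (schematic).
import queue, threading, time

MAX_STALENESS = 2
buffer = queue.Queue()
policy_version = 0

def rollout_worker():
    while policy_version < 5:
        sample = {"version": policy_version, "tokens": "rollout..."}  # generation stub
        buffer.put(sample)
        time.sleep(0.01)

def trainer(batch_size=8):
    global policy_version
    while policy_version < 5:
        batch = [buffer.get() for _ in range(batch_size)]
        fresh = [s for s in batch if policy_version - s["version"] <= MAX_STALENESS]
        # A staleness-aware PPO update would go here; we only bump the version.
        policy_version += 1
        print(f"update {policy_version}: kept {len(fresh)}/{batch_size} samples")

t1 = threading.Thread(target=rollout_worker)
t2 = threading.Thread(target=trainer)
t1.start(); t2.start(); t1.join(); t2.join()
```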

PDF Download Link:
https://arxiv.org/pdf/2505.24298v2.pdf

GitHub:
https://github.com/inclusionai/areal

Datasets:
• MATH
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents

Article Date: 9 Feb 2025

Article Description:
Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision-making, driving the widespread adoption of agent development frameworks such as LangChain and AutoGen. However, these frameworks predominantly serve developers with extensive technical expertise - a significant limitation considering that only 0.03% of the global population possesses the necessary programming skills. This stark accessibility gap raises a fundamental question: can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? To address this challenge, we introduce AutoAgent, a fully automated and highly self-developing framework that enables users to create and deploy LLM agents through natural language alone. Operating as an autonomous Agent Operating System, AutoAgent comprises four key components: i) Agentic System Utilities, ii) LLM-powered Actionable Engine, iii) Self-Managing File System, and iv) Self-Play Agent Customization module. This lightweight yet powerful system enables efficient and dynamic creation and modification of tools, agents, and workflows without coding requirements or manual intervention. Beyond its code-free agent development capabilities, AutoAgent also serves as a versatile multi-agent system for General AI Assistants. Comprehensive evaluations on the GAIA benchmark demonstrate AutoAgent's effectiveness in generalist multi-agent tasks, surpassing existing state-of-the-art methods. Furthermore, AutoAgent's Retrieval-Augmented Generation (RAG)-related capabilities have shown consistently superior performance compared to many alternative LLM-based solutions.

PDF Download Link:
https://arxiv.org/pdf/2502.05957v2.pdf

GitHub:
https://github.com/hkuds/autoagent
https://github.com/hkuds/auto-deep-research

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

Article Date: 26 May 2025

Article Description:
Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.
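
The SynLogic recipe, synthesize a task at a chosen difficulty and verify any answer with a simple rule, can be illustrated with a toy generator. This is a sketch under assumed task semantics, not the authors' pipeline; the binary verifier output is exactly the kind of verifiable reward used for RL.

```python
# Toy verifiable-reasoning task generator and rule-based verifier (illustrative).
import random

def make_task(difficulty=3):
    """Boolean assignment puzzle; difficulty controls the number of variables."""
    values = {f"x{i}": random.choice([True, False]) for i in range(difficulty)}
    clues = [f"x{i} is {values[f'x{i}']}" for i in range(difficulty)]
    question = f"Given: {', '.join(clues)}. How many variables are True?"
    answer = sum(values.values())
    return question, answer

def verify(prediction, answer):
    return 1.0 if prediction == answer else 0.0      # verifiable binary reward

q, gold = make_task(difficulty=5)
print(q)
print("reward:", verify(prediction=3, answer=gold))   # a model's answer would replace 3
```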

PDF Download Link:
https://arxiv.org/pdf/2505.19641v1.pdf

GitHub:
https://github.com/minimax-ai/synlogic

Datasets:
• MATH
• BBH
• GPQA
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs

Article Date: 26 Feb 2025

Article Description:
In large language models (LLMs), code and reasoning reinforce each other: code offers an abstract, modular, and logic-driven structure that supports reasoning, while reasoning translates high-level goals into smaller, executable steps that drive more advanced code intelligence. In this study, we examine how code serves as a structured medium for enhancing reasoning: it provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We also explore how improvements in reasoning have transformed code intelligence from basic completion to advanced capabilities, enabling models to address complex software engineering tasks through planning and debugging. Finally, we identify key challenges and propose future research directions to strengthen this synergy, ultimately improving LLMs' performance in both areas.
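
The survey's central notion of code as a verifiable reasoning path is easy to make concrete: execute the program a model wrote and compare the result with its stated answer. The generated snippet below is a stand-in for model output.

```python
# Runtime validation of model-written reasoning (illustrative sketch).
generated_code = """
def solve():
    # reasoning expressed as executable steps
    apples = 3 * 4        # 3 bags of 4 apples
    return apples - 5     # 5 eaten
"""

namespace = {}
exec(generated_code, namespace)           # run the reasoning instead of trusting it
result = namespace["solve"]()
claimed_answer = 7                         # what the model asserted in prose
print("verified" if result == claimed_answer else "mismatch", result)
```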

PDF Download Link:
https://arxiv.org/pdf/2502.19411v1.pdf

GitHub:
https://github.com/dayuyang1999/awesome-code-reasoning

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
Article Title:
Advanced long-term earth system forecasting by learning the small-scale nature

Article Date: 26 May 2025

Article Description:
Reliable long-term forecasting of Earth system dynamics is heavily hampered by instabilities in current AI models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. We present Triton, an AI framework designed to address this fundamental challenge. Inspired by increasing grid resolution to explicitly resolve small scales in numerical models, Triton employs a hierarchical architecture that processes information across multiple resolutions to mitigate spectral bias and explicitly model cross-scale dynamics. We demonstrate Triton's superior performance on challenging forecast tasks, achieving stable year-long global temperature forecasts, skillful Kuroshio eddy predictions up to 120 days, and high-fidelity turbulence simulations that preserve fine-scale structures, all without external forcing, significantly surpassing baseline AI models in long-term stability and accuracy. By effectively suppressing high-frequency error accumulation, Triton offers a promising pathway towards trustworthy AI-driven simulation for climate and Earth system science.
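
A toy sketch of multi-resolution processing, the basic idea behind mitigating spectral bias with hierarchical grids; this is not Triton's actual architecture, and the channel counts and pooling factor are arbitrary assumptions.

```python
# Coarse + fine branches fused at full resolution (illustrative, not Triton).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fine = nn.Conv2d(channels, channels, 3, padding=1)
        self.coarse = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        fine = self.fine(x)                                   # small-scale structure
        down = F.avg_pool2d(x, 4)                             # coarse grid
        coarse = F.interpolate(self.coarse(down), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([fine, coarse], dim=1))

field = torch.randn(1, 8, 128, 256)        # e.g. a gridded geophysical field
print(MultiResBlock(8)(field).shape)        # torch.Size([1, 8, 128, 256])
```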

PDF Download Link:
https://arxiv.org/pdf/2505.19432v1.pdf

GitHub:
https://github.com/easylearningscores/triton_ai4earth

Datasets:
• No datasets information available
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
🔹 Title:
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

🔹 Publication Date: Published on Sep 12, 2023

🔹 Abstract:
AI-generated summary: Rectified Flow is used to develop an ultra-fast one-step text-to-image generator named InstaFlow, achieving high image quality with significantly reduced inference time compared to existing methods.

Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its reflow procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Frechet Inception Distance) of 23.3 on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin (37.2 → 23.3 in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to 22.4. We call our one-step models InstaFlow. On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 second, the best in the ≤ 0.1 second regime, outperforming the recent StyleGAN-T (13.9 in 0.1 second). Notably, the training of InstaFlow only costs 199 A100 GPU days. Project page: https://github.com/gnobitab/InstaFlow.
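
Why a rectified (straightened) flow enables one-step generation can be shown with a minimal sampler: if the learned velocity field is nearly constant along each trajectory, a single Euler step from noise approximates the full ODE integration. The two-layer velocity network below is a stand-in, not Stable Diffusion.

```python
# Minimal rectified-flow sampler: many Euler steps vs. the one-step regime.
import torch
import torch.nn as nn

velocity = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))

def sample(n_steps):
    x = torch.randn(16, 2)                        # x_0 ~ noise
    for i in range(n_steps):
        t = torch.full((16, 1), i / n_steps)
        x = x + velocity(torch.cat([x, t], dim=1)) / n_steps   # Euler step: dx = v(x, t) dt
    return x

multi_step = sample(n_steps=50)   # ordinary ODE integration
one_step = sample(n_steps=1)      # the InstaFlow regime: one step after reflow + distillation
print(multi_step.shape, one_step.shape)
```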

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2309.06380
• PDF: https://arxiv.org/pdf/2309.06380

🔹 Datasets citing this paper:
https://huggingface.co/datasets/diffusers/community-pipelines-mirror

🔹 Spaces citing this paper:
https://huggingface.co/spaces/FlowChef/FlowChef-InstaFlow-Edit
https://huggingface.co/spaces/FlowChef/FlowChef-InstaFlow-InverseProblem-Inpainting
https://huggingface.co/spaces/XCLiu/InstaFlow
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
🔹 Title:
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

🔹 Publication Date: Published on May 23

🔹 Abstract:
AI-generated summary: Mutarjim is a compact Arabic-English translation model that outperforms larger models on established benchmarks and achieves state-of-the-art performance on the new, comprehensive Tarjama-25 benchmark.

We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2505.17894
• PDF: https://arxiv.org/pdf/2505.17894

🔹 Datasets citing this paper:
https://huggingface.co/datasets/Misraj/Arabic-Image-Captioning_100M
https://huggingface.co/datasets/Misraj/Tarjama-25

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT
🔹 Title:
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

🔹 Publication Date: Published on May 27

🔹 Abstract:
AI-generated summary: EverGreenQA, a multilingual QA dataset with evergreen labels, is introduced to benchmark LLMs on temporality encoding and to assess their performance through verbalized judgments and uncertainty signals.

Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
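
As an illustration of the classification task (not the EG-E5 model itself), one can embed questions with a multilingual E5-style encoder and fit a tiny classifier to separate evergreen from mutable questions. The encoder name and the handful of training examples below are assumptions for the sketch.

```python
# Toy evergreen-vs-mutable question classifier (illustrative, not EG-E5).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("intfloat/multilingual-e5-small")

questions = [
    "What is the boiling point of water at sea level?",   # evergreen
    "Who is the current president of France?",            # mutable
    "How many continents are there?",                      # evergreen
    "What is the latest iPhone model?",                     # mutable
]
labels = [1, 0, 1, 0]   # 1 = evergreen, 0 = mutable

X = encoder.encode(["query: " + q for q in questions])     # E5 models expect a "query:" prefix
clf = LogisticRegression().fit(X, labels)

test = encoder.encode(["query: What year did World War II end?"])
print("evergreen probability:", clf.predict_proba(test)[0, 1])
```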

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2505.21115
• PDF: https://arxiv.org/pdf/2505.21115
• Github: https://github.com/s-nlp/Evergreen-classification

🔹 Datasets citing this paper:
https://huggingface.co/datasets/s-nlp/EverGreen-Multilingual

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://news.1rj.ru/str/DataScienceT