Aspiring Data Science
An economist's notes on programming, forecasting, decision making, and the scientific method of inquiry.
Contact: @fingoldo

I call myself a data scientist because I know just enough math, economics & programming to be dangerous.
#news #calibration #teaser

I'm in the middle of writing a small article, a mini-study on the benefits of post-calibration for binary classifiers. As the work progresses, new questions keep popping up that I want to answer 😅 At first it was 1 dataset, 2 models, 3 libraries; a day later it was already 5 models; a day after that an extra dataset with a "different kind" of miscalibration got added (which, by the way, noticeably reshuffled the rankings); another day later the number of libraries in the comparison had grown to 7. Now I also want to add some calibrator implementation of my own to the zoo, just for the heck of it.
#hpo #hpt #metalearning #books

After rewatching the talk, I went looking for the book it mentioned, "Automated Machine Learning - Methods, Systems, Challenges". It turned out to be freely available and looks good; I'm studying it. I also browsed that site at random for interesting ML literature, and it paid off.
I found another book there, "Metalearning - Applications to Automated Machine Learning and Data Mining".

It turns out the concept behind the cool HPT optimizer I'm working on has long had a name: metalearning.

The general aim of metalearning systems is to learn from the usage of prior models (how they were constructed and how well they performed) in order to model a target dataset better. If the base-level task is classification, this implies that the system can predict the value of the target variable, i.e. the class value in this case. Ideally, it does this better (or more efficiently) by leveraging information besides the training data itself.

Algorithm selection (AS): Given a set of algorithms and a dataset (target dataset), determine which algorithm is most appropriate to model the target dataset.
Hyperparameter optimization (HPO): Given an algorithm with specific hyperparameters and a target dataset, determine the best hyperparameter settings of the given algorithm to model the target dataset.
Combined algorithm selection and hyperparameter optimization (CASH): Given a set of algorithms, each with its own set of hyperparameters, and a target dataset, determine which algorithm to use and how to set its hyperparameters to model the target dataset. Some CASH systems address also the more complex pipeline synthesis task discussed next.
Workflow (pipeline) synthesis: Given a set of algorithms, each with its own set of hyperparameters, and a target dataset, design a workflow (pipeline) consisting of one or more algorithms to model the target dataset. The inclusion of a particular algorithm and its hyperparameter settings in the workflow can be seen as a CASH problem.
Architecture search and/or synthesis: This problem type can be seen as a generalization of the problem type above. In this setting the individual constituents do not need to be organized in a sequence, as it is done in workflows (pipelines). The architecture can include, for instance, partially ordered or tree-like structures. The neural network architecture design can be seen as a problem that falls into this category.
Few-shot learning: Given a target dataset with few examples and various datasets that are very similar, but include many examples, retrieve a model that has been pretrained on prior datasets and fine-tune it to perform well on the target dataset.
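Of these, CASH maps most directly onto everyday tooling: the search space is simply conditional on the choice of algorithm. A minimal sketch of such a space in Optuna (my own illustration, not from the book; the algorithms and ranges are arbitrary):

```python
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    algo = trial.suggest_categorical("algo", ["svm", "rf"])   # algorithm selection part
    if algo == "svm":                                          # hyperparameters conditional on the algorithm
        model = SVC(C=trial.suggest_float("C", 1e-3, 1e3, log=True), gamma="scale")
    else:
        model = RandomForestClassifier(
            n_estimators=trial.suggest_int("n_estimators", 50, 500),
            max_depth=trial.suggest_int("max_depth", 2, 16),
        )
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```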
#llms #gpt #ai #fun

My version of a micro-test for AGI is a question that Claude, ChatGPT, and Gemini still can't answer correctly:

"A graphical depiction of which machine learning concept resembles a string of sausages?"
#hpo #hpt #multifidelity

"Increasing dataset sizes and increasingly complex models are a major hurdle in HPO since they make blackbox performance evaluation more expensive. Training a single hyperparameter configuration on large datasets can nowadays easily exceed several hours and take up to several days. A common technique to speed up manual tuning is therefore to probe an algorithm/hyperparameter configuration on a small subset of the data, by training it only for a few iterations, by running it on a subset of features, by only using one or a few of the cross-validation folds, or by using down-sampled images in computer vision. Multi-fidelity methods cast such manual heuristics into formal algorithms, using so-called low fidelity approximations of the actual loss function to minimize. These approximations introduce a tradeoff between optimization performance and runtime, but in practice, the obtained speedups often outweigh the approximation error.

First, we review methods which model an algorithm’s learning curve during training and can stop the training procedure if adding further resources is predicted to not help. Second, we discuss simple selection methods which only choose
one of a finite set of given algorithms/hyperparameter configurations. Third, we discuss multi-fidelity methods which can actively decide which fidelity will provide most information about finding the optimal hyperparameters. We also refer to Chap. 2 (which discusses how multi-fidelity methods can be used across datasets) and Chap. 3 (which describes low-fidelity approximations for neural architecture search)."
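For intuition, here is a toy successive-halving loop (my own sketch, not from the book): every configuration gets a small budget, the worse half is dropped, and the budget doubles for the survivors.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(config, budget):
    """Stand-in for training `config` with a limited budget (epochs / subsample size).
    Returns a noisy loss; the noise shrinks as the budget grows (the low-fidelity idea)."""
    return config["true_loss"] + rng.normal(scale=1.0 / np.sqrt(budget))

configs = [{"id": i, "true_loss": rng.uniform(0, 1)} for i in range(16)]

budget = 1
while len(configs) > 1:
    scores = [(evaluate(c, budget), c) for c in configs]
    scores.sort(key=lambda t: t[0])                        # lower loss is better
    configs = [c for _, c in scores[: len(scores) // 2]]   # keep the best half
    budget *= 2                                            # double the fidelity for survivors

print("selected config:", configs[0]["id"])
```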

In short, all (or almost all) of the ideas I came up with and considered unique and cool have already been invented by someone. But then why do I hear nothing about successful implementations, why is it this damn Optuna everywhere? Maybe I've simply worked too little with commercial AutoML platforms, and all of this has long been there? But then why don't they publish comparisons against "plain" Bayesian optimizers, why don't they advertise their advantages?
#hpo #hpt

"In 2009, Escalante et al. [30] extended the HPO problem to the Full Model Selection problem, which includes selecting a preprocessing algorithm, a feature selection algorithm, a classifier and all their hyperparameters. By being able to construct a machine learning pipeline from multiple off-the-shelf machine learning algorithms using HPO, the authors empirically found that they can apply their method to any data set as no domain knowledge is required, and demonstrated the applicability of their approach to a variety of domains [32, 49]. Their proposed method, particle swarm model selection (PSMS), uses a modified particle swarm optimizer to handle the conditional configuration space. To avoid overfitting, PSMS was extended with a custom ensembling strategy which combined the best solutions from multiple generations [31]. Since particle swarm optimization was originally designed to work on continuous configuration spaces, PSMS was later also extended to use a genetic algorithm to optimize the pipeline structure and only use particle swarm optimization to optimize the hyperparameters of each pipeline [145]."
#hpo #hpt #robustness

"An open problem in HPO is overfitting. As noted in the problem statement (see Sect. 1.2), we usually only have a finite number of data points available for calculating the validation loss to be optimized and thereby do not necessarily optimize for generalization to unseen test datapoints. Similarly to overfitting a machine learning algorithm to training data, this problem is about overfitting the hyperparameters to the finite validation set; this was also demonstrated to happen experimentally [20, 81]

A different approach to combat overfitting might be to find stable optima instead of sharp optima of the objective function [112]. The idea is that for stable optima, the function value around an optimum does not change for slight perturbations of the hyperparameters, whereas it does change for sharp optima. Stable optima lead to better generalization when applying the found hyperparameters to a new, unseen set of datapoints (i.e., the test set). An acquisition function built around this was shown to only slightly overfit for support vector machine HPO, while regular Bayesian optimization exhibited strong overfitting [112]."
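A toy sketch of the "stable vs. sharp optima" idea (my own illustration, not the acquisition function from [112]): instead of taking the raw minimum of a validation-loss landscape, average the loss over small random perturbations of the hyperparameter; sharp minima get washed out, flat ones survive.

```python
import numpy as np

rng = np.random.default_rng(0)

def val_loss(hp):
    """Stand-in validation loss: a deep but very sharp minimum near 0.2
    and a slightly shallower, much flatter one near 0.7."""
    return -1.2 * np.exp(-((hp - 0.2) ** 2) / 0.0005) - np.exp(-((hp - 0.7) ** 2) / 0.02)

def stabilized_loss(hp, sigma=0.03, n=50):
    """Average loss under small perturbations of the hyperparameter:
    stable (flat) optima keep a low value, sharp ones do not."""
    return np.mean([val_loss(hp + rng.normal(scale=sigma)) for _ in range(n)])

grid = np.linspace(0, 1, 201)
print("raw argmin:       ", grid[np.argmin([val_loss(h) for h in grid])])        # picks the sharp one
print("stabilized argmin:", grid[np.argmin([stabilized_loss(h) for h in grid])]) # picks the flat one
```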


https://arxiv.org/pdf/1902.07846
#hpo #hpt #metalearning

"Meta-learning, or learning to learn, is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible. Not only does this dramatically speed up and improve the design of machine learning pipelines or neural architectures, it also allows us to replace hand-engineered algorithms with novel approaches learned in a data-driven way. In this chapter, we provide an overview of the state of the art in this fascinating and continuously evolving field.


When we learn new skills, we rarely – if ever – start from scratch. We start from skills learned earlier in related tasks, reuse approaches that worked well before, and focus on what is likely worth trying based on experience [82]. With every skill learned, learning new skills becomes easier, requiring fewer examples and less trial-and-error. In short, we learn how to learn across tasks. Likewise, when building machine learning models for a specific task, we often build on experience with related tasks, or use our (often implicit) understanding of the behavior of machine learning techniques to help make the right choices.

The challenge in meta-learning is to learn from prior experience in a systematic, data-driven way. First, we need to collect meta-data that describe prior learning tasks and previously learned models. They comprise the exact algorithm configurations used to train the models, including hyperparameter settings, pipeline compositions and/or network architectures, the resulting model evaluations, such as accuracy and training time, the learned model parameters, such as the trained weights of a neural net, as well as measurable properties of the task itself, also known as meta-features."
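A toy illustration of what such meta-data could look like when collected into a table (my own made-up example, not from the chapter):

```python
import pandas as pd

# One row per (dataset, configuration) pair: task meta-features, the configuration
# that was tried, and how it performed. A meta-learner is then trained on rows like
# these to predict which configuration will do well on a new dataset.
meta_data = pd.DataFrame(
    [
        {"dataset": "credit", "n_rows": 30_000, "n_features": 23, "class_imbalance": 0.22,
         "algo": "xgboost", "max_depth": 6, "learning_rate": 0.10, "accuracy": 0.873, "train_seconds": 41},
        {"dataset": "credit", "n_rows": 30_000, "n_features": 23, "class_imbalance": 0.22,
         "algo": "logreg", "max_depth": None, "learning_rate": None, "accuracy": 0.842, "train_seconds": 2},
        {"dataset": "churn", "n_rows": 7_000, "n_features": 19, "class_imbalance": 0.26,
         "algo": "xgboost", "max_depth": 4, "learning_rate": 0.05, "accuracy": 0.801, "train_seconds": 9},
    ]
)
print(meta_data)
```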
#hpo #hpt #papers

"Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to chose adequate hyperparameter spaces for tuning."

Wow, they have a neat web dashboard with the computed results. Now that's what I call authors genuinely trying to get their results across to the reader. It practically sets a new quality standard: https://philipppro.shinyapps.io/tunability/

The only thing is that they chose the units of improvement poorly. I read it twice and still didn't fully get it.

https://arxiv.org/pdf/1802.09596
#hpt #hpo #landmarking #multifidelity

It turns out everything has already been invented before us.

Landmarking in machine learning is a meta-learning technique that evaluates the performance of simple, computationally inexpensive models (called landmarkers) on a given dataset to extract useful meta-features. These meta-features can then be used to guide model selection, hyperparameter tuning, or other meta-learning tasks.

Key Ideas Behind Landmarking:

Using Simple Models: Instead of training complex models on a dataset, landmarking quickly evaluates the performance of basic classifiers or regressors (e.g., decision stumps, k-nearest neighbors with k=1, Naive Bayes).
Extracting Meta-Features: The accuracy, F1-score, or other performance metrics of these landmarkers act as meta-features that describe the dataset’s properties.
Guiding Model Selection: If a simple model performs well, it may indicate that more complex models of the same type will work well too. For example, if a 1-NN classifier performs well, then SVMs or Random Forests might also be suitable.


Example:

Suppose you have a new dataset and want to determine whether a neural network or a tree-based model would work best.
You could train a decision stump (a tree with one split) and a 1-nearest neighbor classifier.
If the decision stump performs well, this suggests that the dataset has simple, rule-based decision boundaries, making tree-based models a good choice.
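A minimal scikit-learn sketch of turning landmarker scores into meta-features (my own illustration; the dataset and the choice of landmarkers are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

landmarkers = {
    "decision_stump": DecisionTreeClassifier(max_depth=1),
    "1nn": KNeighborsClassifier(n_neighbors=1),
    "naive_bayes": GaussianNB(),
    "majority_class": DummyClassifier(strategy="most_frequent"),
}

# The cheap models' CV accuracies become meta-features describing the dataset.
meta_features = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in landmarkers.items()
}
print(meta_features)
```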

Pros & Cons:

Fast & Low-Cost: Landmarking is computationally cheap compared to training complex models.
Useful for Cold Start Problems: Helps when no prior knowledge about dataset properties is available.
Limited Predictive Power: Landmarkers are simple models and may not always capture dataset complexity well.
Scalability Issues: The choice of landmarkers needs to be carefully designed for different types of datasets.

Landmarking and multi-fidelity methods are related but not synonyms. They both aim to reduce computation in machine learning, but they serve different purposes:

Landmarking (Meta-Learning Technique)

Landmarking is a meta-learning approach that uses the performance of simple models (landmarkers) to infer the difficulty of a dataset.
It is mainly used for model selection and meta-feature extraction.
Example: Training a 1-Nearest Neighbor model or a Decision Stump and using their accuracy as a meta-feature to predict which complex model will work best.

Multi-Fidelity (Optimization & Surrogate Modeling)

Multi-fidelity methods aim to speed up optimization by using low-fidelity approximations before running expensive high-fidelity computations.
They are used in hyperparameter optimization, Bayesian optimization, and neural architecture search (NAS).
Example: Instead of training a deep neural network for 100 epochs, a multi-fidelity approach might first evaluate it on 10 epochs to estimate its final performance.

In some cases, landmarking can be seen as a type of multi-fidelity method if we view simple models as low-fidelity approximations of complex models. However, multi-fidelity methods usually involve explicit modeling of the tradeoff between low- and high-fidelity evaluations, while landmarking is more heuristic.


In the selection process, we can use either the landmarkers’ absolute accuracy measures or the relationship of the landmarkers’ accuracy relative to each other (Fürnkranz and Petrak 2001). Often, a landmarker’s accuracy cannot represent the original algorithm’s accuracy well, causing a less effective algorithm to be selected. Sampling-based landmark (Petrak 2000; Fürnkranz and Petrak 2001; Soares et al. 2001) is another method for automatically selecting machine learning algorithms. To quickly obtain a rough estimate of each algorithm’s accuracy on a data set, the method applies the algorithm on a sample of the data set. The accuracy estimates are used to select the algorithm to be used on the whole data set.
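A quick sketch of the sampling-based variant (again, my own illustration): estimate each candidate's accuracy on a small random subsample, then run only the winner on the full data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Rough accuracy estimates on a small random sample of the data.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=150, replace=False)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200),
}
sample_scores = {name: cross_val_score(m, X[idx], y[idx], cv=3).mean() for name, m in candidates.items()}

# Only the winner is evaluated on the whole dataset.
best = max(sample_scores, key=sample_scores.get)
full_score = cross_val_score(candidates[best], X, y, cv=5).mean()
print(sample_scores, "->", best, round(full_score, 4))
```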
#llms #ai #gpt #math #ensembling #diversity

Recently I was watching a YouTube video on using a computer (specifically, the Julia language) to study the 196 problem. In the comments I asked the mathematician (clearly of the Soviet school) who had written a paper on a modification of the problem whether he had tried AI for generating code, or just for kicking ideas around, and got a seemingly arrogant reply along the lines of "this AI of yours is too weak for my problems". A strong whiff of Surdin, in short ) And I'm thinking: what kind of intellectual giants must you be that it's too weak and imprecise for you, while I work with it just fine and am veeery happy. And here's an example.


Q.: Suppose i'm running HPT/HPO for some model on a particular dataset, and I plan to later use more than one HP combinations to form an ensemble. Now, am I looking for HPs producing un-correlated predictions (zero MI), or anti-correlated predictions (corrcoeff=-1)?

A.: You're looking for hyperparameter (HP) combinations that produce uncorrelated predictions (zero mutual information, MI ≈ 0) rather than strictly anti-correlated ones (corrcoeff=−1).

Why uncorrelated predictions?
Ensemble Diversity: The key to a strong ensemble is combining diverse but individually strong models. If two models are perfectly anti-correlated (ρ=−1), their predictions will cancel each other out when averaged, leading to little to no improvement.
Reduction in Variance: The goal of an ensemble is to reduce variance and improve generalization. If the models are uncorrelated, they make different kinds of errors, so their combination will smooth out those errors.
Avoiding Overcompensation: A correlation of −1 means that one model's overestimation is exactly another model's underestimation, which could lead to instability in the ensemble.

How to achieve this?
Optimize for performance first: Choose hyperparameters that result in good individual models.
Select diverse models: Instead of maximizing anti-correlation, look for models that make errors in different places (measured by MI or correlation).
Use decorrelation techniques: Cluster models based on prediction similarity and pick those with the lowest pairwise correlation.
In short, low correlation (not necessarily negative) is desirable, but enforcing perfect anti-correlation is counterproductive.
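A sketch of that selection idea (my own illustration): take the candidates' out-of-fold predictions, compute the pairwise correlation matrix, and greedily add the model least correlated with the ones already chosen.

```python
import numpy as np

# preds: out-of-fold probability predictions of K candidate models on the same rows
# (synthetic here: some models are near-duplicates, others are noisier variants).
rng = np.random.default_rng(0)
base = rng.uniform(size=1000)
preds = np.stack([np.clip(base + rng.normal(scale=s, size=1000), 0, 1)
                  for s in (0.05, 0.06, 0.30, 0.32, 0.50)])

corr = np.corrcoef(preds)  # K x K pairwise correlation matrix

# Greedy selection: start from the best model (index 0 here for simplicity),
# then repeatedly add the candidate least correlated with those already chosen.
chosen = [0]
while len(chosen) < 3:
    rest = [i for i in range(len(preds)) if i not in chosen]
    nxt = min(rest, key=lambda i: max(corr[i, j] for j in chosen))
    chosen.append(nxt)
print("selected models:", chosen)
```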


Here's an interesting idea I hadn't thought of: cluster models based on prediction similarity and pick those with the lowest pairwise correlation. That could make for an interesting visualization, too. At least some real application for unsupervised ML 😅

Q.: What are other practical means to look for models that make errors in different places, except MI?

A.: Aside from mutual information (MI), there are several practical ways to identify models that make errors in different places when forming an ensemble:

Disagreement Rate (Prediction Diversity Index)
For classification, measure how often two models make different predictions on the same data points.
Higher values indicate greater diversity.

Jaccard Similarity for Misclassified Samples
Compute the Jaccard index for misclassified examples between two models: J = |A ∩ B| / |A ∪ B|, where A and B are the sets of samples each model misclassifies.
Lower values mean models make mistakes on different samples.

Double-Fold Cross-Validation Differences
Train models on different folds and analyze how much their predictions differ on holdout sets.
If a model performs well on Fold 1 but poorly on Fold 2, and another model shows the opposite behavior, they might complement each other.

Shapley Value or Feature Importance Differences
Compare the features each model relies on using SHAP, permutation importance, or Gini importance.
If two models prioritize different features, they likely capture different aspects of the data.

Orthogonality of Prediction Vectors (in High-Dimensional Spaces)
Compute cosine similarity between prediction vectors: cos(a, b) = a·b / (‖a‖·‖b‖).
Lower cosine similarity implies more independent predictions.
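A small sketch computing a few of these measures for a pair of classifiers (my own illustration, toy labels):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
pred_a = np.array([0, 1, 0, 0, 1, 1, 1, 1])
pred_b = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# 1. Disagreement rate: share of samples where the two models differ.
disagreement = np.mean(pred_a != pred_b)

# 2. Jaccard similarity of the two error sets: |A ∩ B| / |A ∪ B|,
#    where A, B are the index sets of misclassified samples.
err_a, err_b = set(np.where(pred_a != y_true)[0]), set(np.where(pred_b != y_true)[0])
jaccard = len(err_a & err_b) / max(len(err_a | err_b), 1)

# 3. Cosine similarity of the raw prediction vectors.
cosine = pred_a @ pred_b / (np.linalg.norm(pred_a) * np.linalg.norm(pred_b))

print(disagreement, jaccard, cosine)
```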
#transferlearning

"Many people hold the view that learning should not be viewed as an isolated task that starts from scratch with every new problem. Instead, a learning algorithm should exhibit the ability to exploit the results of previous learning processes to new tasks. This area is often referred to as transfer of knowledge across tasks, or simply transfer learning. The term learning to learn is also sometimes used in this context.

The term representational transfer is used to denote cases when the target and source models are trained at different times and the transfer takes place after one or more source models have already been trained. In this case there is an explicit form of knowledge transferred directly to the target model or to a meta-model that captures the relevant part of the knowledge obtained in past learning episodes.

The term functional transfer is used to denote cases where two or more models are trained simultaneously. This situation is sometimes referred to as multi-task learning. In this case the models share (possibly a part of) their internal structure during learning."
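A minimal PyTorch sketch of functional transfer / multi-task learning as described above (my own illustration; sizes and tasks are made up): two heads trained simultaneously on top of a shared trunk.

```python
import torch
import torch.nn as nn

# Two tasks trained at the same time while sharing part of the internal structure.
shared_trunk = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head_task_a = nn.Linear(64, 1)   # e.g., a regression target
head_task_b = nn.Linear(64, 3)   # e.g., a 3-class classification target

params = list(shared_trunk.parameters()) + list(head_task_a.parameters()) + list(head_task_b.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(128, 32)
y_a = torch.randn(128, 1)
y_b = torch.randint(0, 3, (128,))

for _ in range(100):
    h = shared_trunk(x)  # shared representation feeds both heads
    loss = nn.functional.mse_loss(head_task_a(h), y_a) + nn.functional.cross_entropy(head_task_b(h), y_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
```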
#calibration #chainedlearning

I've been wondering why a separate calibration set helps a classifier even when a validation set and early stopping on a "proper scoring" metric (like log loss) are already in use.

Here's what I managed to squeeze out of the AI:

"Many ML models tend to produce poorly calibrated probabilities, often being overconfident.
Even if the larger validation set includes the same calibration data, training directly on it does not explicitly adjust probability estimates for better calibration.
A separate calibration step explicitly corrects systematic miscalibration, which improves probability-based metrics (e.g., log loss, Brier score, AUC in some cases).

Splitting into validation + calibration introduces an implicit regularization effect because:
The model's probability outputs on the validation set remain unchanged.
The calibration step learns an additional transformation to correct systematic biases, which is less likely to overfit than if the model had seen all the data at once.
If you train a calibration model (like isotonic regression) on all validation data at once, it could overfit the calibration mapping and be less effective OOS.

Standard model training optimizes classification accuracy (e.g., cross-entropy loss) but does not directly optimize for probability calibration.
Calibration methods, on the other hand, explicitly adjust probability distributions to minimize metrics like Brier loss or expected calibration error (ECE).
This difference means that models that look equally good on raw validation metrics may behave differently after calibration.

By calibrating on a held-out subset, you're reducing variance in probability estimates.
This smoothing effect makes your probability distributions more stable and generalizable OOS."

- So wait, does that mean I could bolt on a couple more chains of useful transformations, and that would work better than one combined model?

- Yes, that makes sense! If calibration works as an additional corrective layer, the same approach can be applied to other useful transformations. It resembles layer-wise learning, or a separation of optimization objectives.

"Why can several separate transformations be better than one combined one?

Separation of concerns → lower risk of overfitting

The main model learns the principal patterns in the data.
Additional corrective transformations (e.g., calibration, removal of systematic errors) can improve quality without the risk of overfitting the whole model.
A single combined model has to solve both tasks at once → this can lead to compromises.

Adapting to different aspects of the error

For example, you can separately:
Remove a systematic shift (bias correction).
Adjust the scale of the predictions (variance correction).
Calibrate the confidence intervals (quantile adjustment).
Each such transformation fixes one specific type of error instead of trying to fix everything at once.

Impact on generalization

When a model learns on the whole dataset at once, it can fit the noise and overfit.
Splitting the process into several sequential stages adds a regularization effect.

How to apply this in practice?

The main model → predicts raw probabilities.
Calibration (Platt scaling, isotonic regression) → fixes systematic errors.
An additional model can correct rare cases (e.g., heavily imbalanced classes)."
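A minimal sketch of the "separate calibration split" pattern described above (my own illustration, with scikit-learn's isotonic regression as the calibrator):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, weights=[0.8], random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# The corrective mapping is fit on the held-out calibration split only.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(model.predict_proba(X_calib)[:, 1], y_calib)

raw = model.predict_proba(X_test)[:, 1]
calibrated = np.clip(calibrator.predict(raw), 1e-6, 1 - 1e-6)
print("log loss raw:       ", log_loss(y_test, raw))
print("log loss calibrated:", log_loss(y_test, calibrated))
```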
#calibration #regression

Okay, moving on: can't this logic be generalized to regression? The electronic dunce agrees.

Yes! The same logic that makes post-calibration beneficial for binary classification can also be applied to regression tasks—though it is rarely discussed. Here’s why post-calibration can be valuable for regression and how it can be implemented effectively.


Why Would Regression Benefit from Post-Calibration?
Regression models output continuous values, but just like classification models, they can suffer from systematic miscalibration in their predictions. The key idea is that a model’s predicted values might be biased or have the wrong spread compared to the actual target distribution.

There are two main ways a regression model can be miscalibrated:

Mean Shift (Bias)

The model’s predictions systematically overestimate or underestimate the true values.
Example: A house price prediction model always underestimates expensive houses.

Spread Distortion (Misestimated Variance)

The model’s predictions have the wrong level of dispersion:
Too narrow → Underestimated uncertainty (overconfident)
Too wide → Overestimated uncertainty (underconfident)
Example: A temperature prediction model might always predict values within ±2°C, even when actual deviations are much larger.

How to Perform Post-Calibration in Regression?

Instead of calibrating probabilities like in classification, you adjust the magnitude and distribution of the predicted values in regression. This can be done in a few ways:

1. Post-hoc Mean Correction (Bias Calibration)

Take a held-out calibration set (separate from validation).
Compute the difference between the predicted and actual values.
Fit a simple correction model (e.g., linear regression) to adjust the predictions.

This works like Platt scaling in classification but for correcting systematic bias.

2. Quantile Regression Calibration (Fixing Spread)

Instead of just predicting the mean, we fit a secondary model to predict quantiles (e.g., 10th, 50th, 90th percentile).
This helps in cases where the model’s uncertainty is wrong.
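A minimal sketch of the post-hoc mean correction described above (my own illustration): fit a simple corrector on a held-out calibration split that maps raw predictions back to targets.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=4000, n_features=15, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Post-hoc mean correction: a simple model mapping raw predictions -> targets,
# fit on the calibration split only, then applied to new predictions.
corrector = LinearRegression().fit(model.predict(X_calib).reshape(-1, 1), y_calib)

raw = model.predict(X_test)
corrected = corrector.predict(raw.reshape(-1, 1))
print("MAE raw:      ", mean_absolute_error(y_test, raw))
print("MAE corrected:", mean_absolute_error(y_test, corrected))
```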
#books #trading #chan

Chan has a new book out, co-authored with no fewer than four people: "Hands-On AI Trading with Python, QuantConnect, and AWS".
I looked at a few of the shared pages and was surprised by the low quality of the ML code. From an ML standpoint I can't recommend it; for the purpose of "understanding what kinds of problems ML can solve in trading" it's probably fine. It amazes me how people can make money from trading and models for years and still have middle-school-level ML knowledge. Scalers applied outside the pipeline, GridSearch in use. I get that these are all just examples, but why use bad practices in your own examples?
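For reference, the leakage-safe pattern I'd expect in such examples (my own sketch): the scaler lives inside a Pipeline, so it is re-fit on each CV training fold only, and a randomized search replaces the exhaustive grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler inside the pipeline: no information from the validation fold leaks into the scaling.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=5000))])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": [0.01, 0.1, 1, 10, 100]},
    n_iter=5, cv=5, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```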
#ensembling #cascading #delegating #arbitrating

Читаю "Metalearning - Applications to Automated Machine Learning and Data Mining". В главе про ансамбли с удивлением обнаружил весьма интересные алгоритмы.


1. Cascade Generalization

Cascade generalization is a type of stacked ensembling where models are arranged in a sequential, layered fashion. Predictions from earlier layers become additional features for models in later layers. Unlike traditional stacking, cascade generalization emphasizes using higher-level models to refine or augment lower-level predictions.

Key idea: Each layer’s models pass their predictions as additional inputs to models in the next layer.
Advantage: Can iteratively refine weak models, allowing later models to correct errors made earlier.
Example: A first-layer model predicts probabilities, and a second-layer model uses those probabilities along with the original features to make the final decision.

2. Cascading

Cascading refers to a progressive model selection strategy where simpler (cheaper) models are used first, and only ambiguous cases are passed to more complex (expensive) models.

Key idea: Reduce computational cost by filtering easy cases early.
Example: A decision tree quickly filters obvious negative cases, and only uncertain cases are sent to a more sophisticated deep learning model.

3. Delegating (or Selective Routing)

Delegating is a framework where multiple models exist, and an intelligent mechanism decides which model should handle each instance. This is also known as an expert selection approach.

Key idea: Different models specialize in different regions of the feature space, and a routing function determines which model should process a given input.
Example: In fraud detection, a rule-based system handles typical transactions, while an anomaly detection model analyzes suspicious ones.

4. Arbitrating

Arbitrating is a meta-learning approach where an additional model (arbitrator) decides which base model’s prediction to trust more. Unlike delegation, where models specialize in different regions, arbitration combines predictions but gives more weight to the most confident model.

Key idea: Instead of picking a single expert, the arbitrator dynamically adjusts confidence in different models.
Example: A reinforcement learning agent (arbitrator) learns which base model performs best in specific scenarios.

5. Meta-Decision Trees (MDTs)

Meta-decision trees (MDTs) are decision trees that learn when to trust each base model in an ensemble, instead of directly predicting the target.

Key idea: The decision tree’s leaves represent which model should be used for a given input, rather than the final prediction itself.
Advantage: Unlike traditional stacking, it explicitly learns a strategy for model selection.
Example: An MDT may learn that Model A performs best for low-income customers, while Model B works better for high-income customers.

Masters also wrote about something similar to arbitrating, where models can specialize in different regions of the input domain. Offhand, I couldn't find any open-source Python implementations.
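A hand-rolled cascading sketch is simple enough, though (my own illustration): a cheap model settles the confident cases, and only the ambiguous ones get delegated to an expensive model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=8000, n_features=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

cheap = LogisticRegression(max_iter=2000).fit(X_train, y_train)
expensive = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

proba = cheap.predict_proba(X_test)[:, 1]
confident = (proba < 0.1) | (proba > 0.9)   # the cheap model decides the easy cases

preds = (proba > 0.5).astype(int)
preds[~confident] = expensive.predict(X_test[~confident])   # delegate the ambiguous ones
print(f"share delegated to the expensive model: {np.mean(~confident):.2%}")
```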
#russia #law

It looks like the recipe for high prosperity for Russians has finally been found.

"By now, the total amount claimed by Russian TV channels against Google over the blocking of their YouTube accounts has reached 1.8 duodecillion rubles. That sum exceeds the entire world GDP."

https://3dnews.ru/1117573/arbitragniy-sud-moskvi-zapretil-kompanii-google-prodolgat-sudebnie-razbiratelstva-v-kalifornii