Aspiring Data Science

#automl #tabpfn

I can confirm that their library (and the approach) does seem to work. On several datasets such as breast cancer, and on synthetic data such as make_classification(n_samples=40_000, n_classes=2, n_features=40, n_informative=10), TabPFN always beat CatBoost on ROC AUC and accuracy in my runs. Not by much, but always. I did not spend 4 hours tuning CatBoost's hyperparameters; I only tried 1_000 and 10_000 trees, with and without early stopping.

On my aging RTX 2070:

train, test sets shape= (30000, 40)

ROC AUC: 0.9945391499218594
Accuracy 0.9890666666666666

train+inference time: 4min 7s

For comparison, the CatBoost results with CatBoostClassifier(verbose=0, task_type='CPU', eval_fraction=0.1, early_stopping_rounds=200, num_trees=10000):

ROC AUC: 0.9939392120134284
Accuracy 0.9802

For train, test sets shape= (20000, 40) the time was 1min 56s.

With a good GPU it could probably handle a dataset of 100k-200k rows.
On the other hand, I suspect that CatBoost could catch up with TabPFN if it were tuned properly.

For the PHE version, device has to be passed explicitly: AutoTabPFNClassifier(max_time=60*9, ignore_pretraining_limits=True, device="cuda")

Interestingly, it downloads weights, logging lines such as Attempting HuggingFace download: tabpfn-v2-classifier-od3j1g5m.ckpt, Attempting HuggingFace download: tabpfn-v2-classifier-llderlii.ckpt.

As a result, .fit now runs for a long time (as set by max_time), and on top of that .predict_proba took 5 minutes on the same dataset. ROC AUC improved slightly compared to the non-PHE version, while accuracy dipped a little.

ROC AUC: 0.9948400465955762
Accuracy 0.9888666666666667

❤3

120 views, Anatoly Alekseev, edited 07:34

Aspiring Data Science

#python #speed

A good rundown of the main approaches to optimizing Python code.

https://medium.com/@yashwanthnandam/think-python-is-slow-try-these-hacks-for-3x-faster-noscripts-today-fbe258ec93bd

Medium

Think Python Is Slow? Try These Hacks for 3x Faster Scripts Today

Why investing time in optimization pays off big

105 views, Anatoly Alekseev, 11:26

Aspiring Data Science

Forwarded from Artem Ryblov’s Data Science Weekly

The Kaggle Book by Konrad Banachewicz and Luca Massaron

Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with an amazing community of data scientists, and gain valuable experience to help grow your career.

The first book of its kind, The Kaggle Book assembles in one place the techniques and skills you'll need for success in competitions, data science projects, and beyond. Two Kaggle Grandmasters walk you through modeling strategies you won't easily find elsewhere, and the knowledge they've accumulated along the way. As well as Kaggle-specific tips, you'll learn more general techniques for approaching tasks based on image, tabular, textual data, and reinforcement learning. You'll design better validation schemes and work more comfortably with different evaluation metrics.

Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ml #machinelearning #featureengineering #kaggle #metrics #validation #hyperparameters #tabular #cv #nlp

@data_science_weekly

101 views, Anatoly Alekseev, 14:36

Aspiring Data Science

#youtube

https://3dnews.ru/1118303/segodnya-youtube-ispolnilos-20-let

3DNews - Daily Digital Digest

YouTube turned 20 today

Exactly 20 years ago, three former PayPal employees launched YouTube.

🫡1

102 views, Anatoly Alekseev, 01:46

Aspiring Data Science

#automl #hpo #hpt #openml #diogenes

I ran a feasibility study of a meta-learning-based hyperparameter optimizer on OpenML data. The results are very encouraging.

For a specific ML algorithm (I took the xgboost classifier), given the dataset's meta-features and the hyperparameters themselves, it is possible to predict the achievable ROC AUC with a mean absolute percentage error (MAPE) of 3-5%, without running training each time.

This is the few-shot setting: the model has been trained on 1% of the hyperparameter combinations for the dataset in question (with no combinations from that dataset at all, i.e. pretrain only, MAPE = 14%).
With meta-features from the ~~property developer~~ authors of OpenML.
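For reference, the MAPE quoted above is the mean absolute percentage error between the predicted ROC AUC and the ROC AUC actually reached after training. A minimal sketch of the metric follows; the function name and the numbers are made up for illustration and are not results from the study.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Average of |true - predicted| / |true| over all pairs, scaled to percent.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values: ROC AUC measured after real training runs
# versus the ROC AUC predicted by a meta-model for the same configurations.
actual_auc = [0.90, 0.80, 0.95]
predicted_auc = [0.87, 0.84, 0.95]

error = mape(actual_auc, predicted_auc)
print(f"MAPE: {error:.2f}%")
```

Since ROC AUC lies between 0 and 1 and is normally well above zero, dividing by the true value is safe here; for targets that can be zero, MAPE is undefined.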

And I still have a ton of ideas for improvements. I wanted to implement this idea 2 years ago, but oh well, at least I have started now.

There is a sea of work ahead, of course, but the preliminary results are positive. To hell with "Bayesian optimization". There must be a better way!

105 views, Anatoly Alekseev, 08:11

Aspiring Data Science

#books #kagglebook #ctf

I'm reading The Kaggle Book; it turns out Anthony Goldbloom is an economist by training, like me )
And Jeremy Howard worked at Kaggle before founding fast.ai.

Hinton did not escape taking part in Kaggle competitions either, and even won MerckActivity. Already interesting!

"Professor Donoho does not refer to Kaggle specifically, but to all data science competition platforms. Quoting computational linguist Mark Liberman, he refers to data science competitions and platforms as being part of a Common Task Framework (CTF) paradigm that has been silently and steadily progressing data science in many fields during the last decades. He states that a CTF can work incredibly well at improving the solution of a problem in data science from an empirical point of view, quoting the Netflix competition and many DARPA competitions as successful examples. The CTF paradigm has contributed to reshaping the best-in-class solutions for problems in many fields.

The system works the best if the task is well defined and the data is of good quality. In the long run, the performance of solutions improves by small gains until it reaches an asymptote. The process can be sped up by allowing a certain amount of sharing among participants (as happens on Kaggle by means of discussions, and sharing Kaggle Notebooks and extra data provided by the datasets found in the Datasets section). According to the CTF paradigm, competitive pressure in a competition suffices to produce always-improving solutions. When the competitive pressure is paired with some degree of sharing among participants, the improvement happens at an even faster rate – hence why Kaggle introduced many incentives for sharing."

106 views, Anatoly Alekseev, edited 09:46

Aspiring Data Science

#cv

"Actually, if you watch carefully the data, it seems like data distributions are segregated into specific portions of space, something reminiscent to me of the Madelon dataset created by Isabelle Guyon.

I therefore tried to stratify my folds based on a k-means clustering of the non-noisy data and my local cv has become more reliable (very correlated with the public leaderboard) and my models are performing much better with cv prediction."

https://www.kaggle.com/code/lucamassaron/are-you-doing-cross-validation-the-best-way
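The idea in the quote, stratifying folds by k-means cluster id, can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the data is a synthetic stand-in generated with make_blobs, the number of clusters (4) is arbitrary, and all features are clustered, whereas the quote clusters only the non-noisy ones.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): 300 points in 4 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# Cluster the feature space; the cluster id becomes the stratification label.
n_clusters = 4  # assumption: would need tuning on real data
clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

# Stratify on cluster id so every fold covers every region of feature space.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, clusters))

for i, (train_idx, valid_idx) in enumerate(folds):
    # Share of each cluster in the validation part of this fold.
    share = np.bincount(clusters[valid_idx], minlength=n_clusters) / len(valid_idx)
    print(f"fold {i}: cluster shares in validation = {np.round(share, 2)}")
```

For a classification task one would usually want to stratify on the target as well, for example by combining the class label and the cluster id into a single stratification key.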

Kaggle

Are you doing cross-validation the best way?

Explore and run machine learning code with Kaggle Notebooks | Using data from 30 Days of ML

🔥1

102 views, Anatoly Alekseev, edited 22:57

Aspiring Data Science

#umap #tsne #dimreducers #manifold

I liked the interactive visualization of the clusters in the clothing dataset. And the mammoth, of course.

"UMAP is an incredibly powerful tool in the data scientist's arsenal, and offers a number of advantages over t-SNE.

While both UMAP and t-SNE produce somewhat similar output, the increased speed, better preservation of global structure, and more understandable parameters make UMAP a more effective tool for visualizing high dimensional data.

Finally, it's important to remember that no dimensionality reduction technique is perfect - by necessity, we're distorting the data to fit it into lower dimensions - and UMAP is no exception.

However, by building up an intuitive understanding of how the algorithm works and understanding how to tune its parameters, we can more effectively use this powerful tool to visualize and understand large, high-dimensional datasets."

https://pair-code.github.io/understanding-umap/

103 views, Anatoly Alekseev, edited 00:43

Aspiring Data Science

I think it was a convolutional neural network rather than "AI".

"At USP an experiment was conducted in which an AI independently analyzed photographs of horses taken before and after surgery, as well as before and after the administration of painkillers. The AI studied the horses' eyes, ears, and mouths to determine whether pain was present. According to the study's results, the AI was able to identify signs indicating pain with 88% accuracy, which confirms the effectiveness of this approach and opens up prospects for further research."

https://3dnews.ru/1118376/ii-nauchilsya-raspoznavat-emotsii-givotnih-po-virageniyu-mordi

3DNews - Daily Digital Digest

AI has learned to recognize animals' emotions from their facial expressions

Scientists have developed AI systems capable of detecting pain, stress, and disease in animals by analyzing photographs of their faces.

👍1

96 views, Anatoly Alekseev, edited 03:13

Aspiring Data Science

#hustles

I especially like the Digital Product option, but it just hasn't worked out for me so far (

https://medium.com/the-data-entrepreneurs/data-side-hustles-you-can-start-today-844863769827

Medium

Data Side Hustles You Can Start Today

How you can monetise your data science skills for extra income

84 views, Anatoly Alekseev, edited 04:54

Aspiring Data Science

#ai #gpt #llms

An interesting idea: talking to an AI is more productive than typing to it.

https://medium.com/the-efficient-entrepreneur/why-your-deepseek-prompts-are-falling-short-and-how-to-close-the-gap-5b517caad388

Medium

Why Your DeepSeek Prompts Are Falling Short (And How to Close the Gap)

The invisible friction between human thought and AI understanding

83 views, Anatoly Alekseev, 04:56

Aspiring Data Science

#hustles #ai #gpt #llms #rag

https://medium.com/towards-data-science/5-ai-projects-you-can-build-this-weekend-with-python-c57724e9c461

Medium

5 AI Projects You Can Build This Weekend (with Python)

From beginner-friendly to advanced

80 views, Anatoly Alekseev, edited 05:07

Aspiring Data Science

#optuna #optunahub #hpo #hpt #smac3

Interestingly, the smac3 optimizer, which I recently discovered for myself independently, has been added to Optuna via the hub.

https://medium.com/optuna/announcing-optuna-4-2-98148689e626

Medium

Announcing Optuna 4.2

We are pleased to announce the release of Optuna 4.2! Optuna 4.2 now supports several new optimization algorithms, a gRPC storage proxy for…

🔥1

81 views, Anatoly Alekseev, edited 05:11

Aspiring Data Science

#interviews

https://medium.com/towards-data-science/mathematics-i-look-for-in-data-scientist-interviews-7c7cb1aaebe5

Medium

Mathematics I Look for in Data Scientist Interviews

Let’s rebuild our data science foundation.

78 views, Anatoly Alekseev, 05:19

Aspiring Data Science

#python #codestyle #codegems

Some food for thought here, such as constructs like for elem in mylist or []:

https://levelup.gitconnected.com/12-production-grade-python-code-styles-ive-picked-up-from-work-ad32d8ae630d
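A small illustration of the `for elem in mylist or []:` construct mentioned above. The function and its argument names are hypothetical and not taken from the article; the point is only that `or []` lets the loop skip a None argument instead of raising TypeError.

```python
def total_length(words=None):
    """Sum the lengths of the given words; None is treated as an empty list."""
    total = 0
    # `words or []` evaluates to [] when words is None (or any other falsy
    # value), so the loop body simply never runs in that case.
    for word in words or []:
        total += len(word)
    return total


print(total_length(["ab", "cde"]))  # 5
print(total_length(None))           # 0
```

The catch is that every falsy value is replaced, not just None, and objects with an ambiguous truth value, such as NumPy arrays with more than one element, raise ValueError in this expression. An explicit `if words is None` check is the stricter alternative.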

Medium

12 Production-Grade Python Code Styles I’ve Picked Up From Work

Read Free…

91 views, Anatoly Alekseev, edited 05:24

Aspiring Data Science

#dash #panel #streamlit #dashboards

https://pub.towardsai.net/which-python-dashboard-is-better-dash-panel-and-streamlit-showdown-8d4f8bf744f9

Medium

Which Python Dashboard Is Better? Dash, Panel And Streamlit Showdown

Prompting GPT-4 for multi-visual interactive dashboard creation

102 views, Anatoly Alekseev, 05:27

Aspiring Data Science

Forwarded from Генерал СВР

Dear subscribers and guests of the channel! Both the American and the Russian sides are forcing the pace around organizing a meeting between US President Donald Trump and the man who has been appointed president of Russia and resembles Vladimir Putin. At times it becomes unclear who needs this meeting more. Trump tries to solve every problem by sheer pressure, without giving much thought to trifles such as "plans". No Trump plan for ending the war exists; there is only a desire to bring the negotiating positions of Russia and Ukraine close enough for the sides to agree to a truce. In essence, the negotiating teams at every venue are doing nothing but "bringing positions closer". Trump is already aware of the Russian leadership's wishes, which have been heavily revised as of today, and believes that in a personal meeting with "Putin" he can, as they say, squeeze out the necessary minimum of concessions. The speed at which events are unfolding is truly impressive. Only last week a visit to Moscow by Elon Musk, including a meeting with the "president", had been agreed, and by the weekend a meeting between "Putin" and Trump in Saudi Arabia had been all but arranged, with Musk asked to move his visit to other dates. The politburo was interested in Musk's meeting with "Putin" as a way to influence Trump ahead of the personal meeting with the US president, but now there is no clarity on whether contact with Musk is needed at all. The politburo is awaiting this meeting mainly with one calculation in mind: an agreement to lift most sanctions on Russia in the near future, and it is ready to make concessions for that. This is the main thing "Putin" is being prepared for.

😁2🤮2🤡1

96 views, Anatoly Alekseev, 06:12

Aspiring Data Science

#politics

I wonder whether the information about Musk's involvement is true. This source has shown more than once that it is well informed, and circumstantial evidence does suggest that Musk's political position favors Russia's current leadership.

😁1🤮1🤡1

100 views, Anatoly Alekseev, 06:14

Aspiring Data Science

#books #kagglebook

I have finished reading The Kaggle Book (English Edition). General remarks on the book:

A lot of space is wasted on dubious "academic breadth". Why spend dozens of pages defining metrics? It would have been better to describe the trick of ensembling models tuned for different metrics instead. The authors themselves said at the start that the book assumes a certain background and that they would not explain "the basics of linear regression".

And the tuners? Why include code for using GridSearchCV? What is this academic breadth for; wouldn't it have been better to advise which tuner to use and what its practical advantages are? Why advertise skopt, which had had no commits for 2 years when the book was written (and for 5 years as of now)?

Fine, since you spent dozens of pages describing these tuners (80% of which nobody will use in real life) along with code examples, why didn't you bother to run them all on some dataset and compare them, at least as an illustration?

Now, their explanations of which boosting parameters to tune are, honestly, at schoolchild level, not Kaggle Grandmaster level.

At the same time, some chapters really do abound in valuable personal experience and advice, especially the chapter on ensembles; that is exactly what I was hoping for.

I liked the chapters on computer vision (CV) and text processing (NLP): the former pays a lot of attention to image augmentation, and the latter gives good examples of pipelines.

Overall the strengths outweigh the weaknesses, and I recommend the book to readers at a beginner or intermediate level in DS.

Next I will post several pieces with ideas that I liked or found useful or unexpected. Sometimes they will include my comments. The main content is in English.

Custom losses in boosting
Metrics, Dimensionality reduction, Pseudo-labeling
Denoising with autoencoders, Neural networks for tabular competitions
Ensembling part 1
Ensembling part 2
Stacking variations

I also liked the series of posts/mini-interviews with Kaggle grandmasters; here are the interesting bits:

Part 1
Part 2
Part 3
Part 4

199 views, Anatoly Alekseev, edited 01:41

Aspiring Data Science

#books #kagglebook #interviews

Paweł Jankiewicz

I tend to always build a framework for each competition that allows me to create as many experiments as possible.
You should create a framework that allows you to change the most sensitive parts of the pipeline quickly.

I am also trying to build my own framework so as not to start from scratch every time. In DS this inevitably turns into automl.

What Kaggle competitions taught me is the importance of validation, data leakage prevention, etc. For example, if data leaks happen in so many competitions, when people who prepare them are the best in the field, you can ask yourself what percentage of production models have data leaks in training; personally, I think 80%+ of production models are probably not validated correctly, but don’t quote me on that.

Software engineering skills are probably underestimated a lot. Every competition and problem is slightly different and needs some framework to streamline the solution (look at https://github.com/bestfitting/instance_level_recognition and how well their code is organized). Good code organization helps you to iterate faster and eventually try more things.

Andrew Maranhão

While libraries are great, I also suggest that at some point in your career you take the time to implement it yourself. I first heard this advice from Andrew Ng and then from many others of equal calibre. Doing this creates very in-depth knowledge that sheds new light on what your model does and how it responds to tuning, data, noise, and more.

Over the years, the things I wished I realized sooner the most were:
1. Absorbing all the knowledge at the end of a competition
2. Replication of winning solutions in finished competitions

This is a very strong idea. Taking it further, one could say that you should also learn from finished competitions you did NOT take part in, and even from synthetic ones you created yourself!

In the pressure of a competition drawing to a close, you can see the leaderboard shaking more than ever before. This makes it less likely that you will take risks and take the time to see things in all their detail. When a competition is over, you don’t have that rush and can take as long as you need; you can also replicate the rationale of the winners who made their solutions known.

If you have the discipline, this will do wonders for your data science skills, so the bottom line is: stop when you are done, not when the competition ends. I have also heard this advice from an Andrew Ng keynote, where he recommended replicating papers as one of his best ways to develop yourself as an AI practitioner.

Martin Henze

In many cases, after those first few days we’re more than 80% on the way to the ultimate winner’s solution, in terms of scoring metric. Of course, the fun and the challenge of Kaggle are to find creative ways to get those last few percent of, say, accuracy. But in an industry job, your time is often more efficiently spent in tackling a new project instead.

I don’t know how often a hiring manager would actually look at those resources, but I frequently got the impression that my Grandmaster title might have opened more doors than my PhD did. Or maybe it was a combination of the two. In any case, I can much recommend having a portfolio of public Notebooks.

Even if you’re a die-hard Python aficionado, it pays off to have a look beyond pandas and friends every once in a while. Different tools often lead to different viewpoints and more creativity.

Andrada Olteanu

I believe the most overlooked aspect of Kaggle is the community. Kaggle has the biggest pool of people, all gathered in one convenient place, from which one could connect, interact, and learn from. The best way to leverage this is to take, for example, the first 100 people from each Kaggle section (Competitions, Datasets, Notebooks – and if you want, Discussions), and follow on Twitter/LinkedIn everybody that has this information shared on their profile. This way, you can start interacting on a regular basis with these amazing people, who are so rich in insights and knowledge.

81 views, Anatoly Alekseev, edited 01:57
