ML Research Hub – Telegram
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
---
#76. Our user engagement metric is flat. What could this mean and how would you investigate?
A: Flat engagement can be both good and bad.
Good: It could mean we have a mature, stable product with a loyal user base.
Bad: It could mean we are failing to attract new users or that existing users are losing interest (stagnation).
Investigation:
1. Segment: Break down the overall metric. Is engagement flat for all user segments (new vs. old, by country, by platform)? You might find that new user engagement is rising while old user engagement is falling, resulting in a flat overall number (see the sketch below).
2. Feature Usage: Analyze engagement with specific features. Are users shifting their behavior from one feature to another?
3. Competitive Analysis: Look at what competitors are doing. Is the entire market flat?
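A minimal pandas sketch of the segmentation step, assuming a hypothetical events file with user_id, event_date, signup_date, and a binary engaged flag (the file and column names are illustrative, not from any specific product):

    import pandas as pd

    # Hypothetical export: one row per user per day with an engagement flag
    df = pd.read_csv("engagement_events.csv", parse_dates=["event_date", "signup_date"])

    # Tag users who signed up in the last 90 days of data as "new"
    cutoff = df["event_date"].max() - pd.Timedelta(days=90)
    df["cohort"] = (df["signup_date"] >= cutoff).map({True: "new", False: "existing"})

    # Monthly engagement rate per cohort: a flat overall line can hide
    # one segment rising while another falls
    monthly = (
        df.groupby([pd.Grouper(key="event_date", freq="M"), "cohort"])["engaged"]
          .mean()
          .unstack("cohort")
    )
    print(monthly)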

#77. What is a "good" data visualization? What are some common mistakes?
A:
Good Visualization: It is accurate, easy to understand, tells a clear story, and is not misleading. It has clear labels, a descriptive title, and an appropriate chart type for the data (e.g., line chart for time series, bar chart for comparison).
Common Mistakes:
- Using the wrong chart type.
- Misleading axes (e.g., not starting a bar chart at zero).
- "Chart junk": too many colors, 3D effects, or unnecessary elements that distract from the data.
- Lack of context or clear labels.
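A quick matplotlib sketch of the basics: a bar chart with a zero baseline, clear labels, and no chart junk (the numbers are made up for illustration):

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]      # illustrative data only
    signups = [1200, 1250, 1230, 1400]

    fig, ax = plt.subplots()
    ax.bar(months, signups)
    ax.set_ylim(bottom=0)                      # bar charts should start at zero
    ax.set_xlabel("Month")
    ax.set_ylabel("New signups")
    ax.set_title("New signups per month")      # clear title, no 3D effects or extra colors
    plt.show()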

#78. What's more important: having perfect data or getting quick insights from imperfect data?
A: It depends on the context.
• For financial reporting or critical system decisions, data accuracy is paramount.
• For initial exploratory analysis, identifying trends, or making quick business decisions, getting timely insights from reasonably good data is often more valuable. The goal is often directionally correct insights to inform the next step, not perfect precision. It's a trade-off between speed and accuracy.

#79. How do you handle outliers in a dataset?
A:
Identify: Use visualization (box plots, scatter plots) or statistical methods (Z-score, IQR method) to detect them.
Investigate: Understand why they exist. Are they data entry errors, or are they legitimate but extreme values?
Handle:
- Remove: If they are errors, they can be removed.
- Transform: Apply a transformation (like log transformation) to reduce their impact.
- Cap: Cap the values at a certain threshold (e.g., replace all values above the 99th percentile with the 99th percentile value).
- Keep: If they are legitimate and important (e.g., fraudulent transactions), they should be kept and studied separately.
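A short pandas sketch of the IQR detection and percentile-capping steps, assuming a hypothetical transactions file with an amount column:

    import pandas as pd

    df = pd.read_csv("transactions.csv")       # hypothetical input
    x = df["amount"]

    # Detect candidates with the IQR rule
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
    print(f"{len(outliers)} potential outliers to investigate")

    # Cap (winsorize) at the 1st/99th percentiles instead of dropping
    low, high = x.quantile([0.01, 0.99])
    df["amount_capped"] = x.clip(lower=low, upper=high)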

#80. Our CEO wants to know the "average session duration" for our app. What are the potential pitfalls of this metric?
A:
It's an average: It can be heavily skewed by a small number of users with extremely long sessions (outliers). The median session duration might be a more robust metric.
Doesn't measure quality: A long session isn't necessarily a good session. A user could be struggling to find something, or they could have left the app open in the background.
Definition is key: How do we define the end of a session? 30 minutes of inactivity? This definition can significantly change the metric.
• I would present the median alongside the mean and also provide a distribution of session lengths to give a more complete picture.
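A small sketch of that comparison, assuming a hypothetical sessions table with a duration_seconds column:

    import pandas as pd

    sessions = pd.read_csv("sessions.csv")     # hypothetical: one row per session
    dur = sessions["duration_seconds"]

    print("mean:  ", dur.mean())
    print("median:", dur.median())             # robust to a few very long sessions
    print(dur.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99]))

    # The full distribution tells the real story, not a single number
    dur.plot(kind="hist", bins=50, title="Session duration distribution")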

---
Part 5: Technical & Behavioral Questions (Q81-100)
#81. What tools are you proficient in for data analysis?
A: I am highly proficient in SQL for data extraction and querying from relational databases like PostgreSQL and MySQL. For data cleaning, manipulation, analysis, and visualization, I primarily use Python with libraries such as pandas, NumPy, Matplotlib, and Seaborn. I also have experience using BI tools like Tableau/Power BI for creating interactive dashboards.

#82. Describe a data analysis project you are proud of.
A: Use the STAR method:
Situation: "At my previous company, we were experiencing a higher-than-expected customer churn rate."
Task: "I was tasked with identifying the key drivers of churn to help the product team develop a retention strategy."
Action: "I extracted user activity data, subnoscription information, and customer support tickets using SQL. In Python, I cleaned the data and engineered features like 'days since last login' and 'number of support tickets'. I then performed an exploratory data analysis, built a logistic regression model to identify the most significant predictors of churn, and visualized the findings in a Tableau dashboard."
Result: "My analysis revealed that customers who didn't use a key feature within their first week were 50% more likely to churn. This insight led the product team to redesign the onboarding flow to highlight this feature, which contributed to a 15% reduction in first-month churn over the next quarter."

#83. How do you ensure the quality of your data and analysis?
A:
Data Validation: I always start by checking for missing values, duplicates, and outliers. I write validation scripts and perform exploratory data analysis to understand the data's distribution and sanity-check its values (a minimal example follows below).
Code Reviews: I ask peers to review my SQL queries and Python scripts for logic errors and efficiency.
Documentation: I thoroughly document my methodology, assumptions, and steps so my work is reproducible and transparent.
Triangulation: I try to verify my findings using different data sources or analytical approaches if possible.
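As a minimal example of the validation step, assuming hypothetical user_id and event_date columns:

    import pandas as pd

    df = pd.read_csv("raw_data.csv")           # hypothetical input

    # Fail fast on the most basic quality problems
    assert df["user_id"].notna().all(), "missing user_id values"
    assert not df.duplicated(subset=["user_id", "event_date"]).any(), "duplicate rows"

    # Share of nulls per column and basic range sanity checks
    print(df.isna().mean().sort_values(ascending=False).head())
    print(df.describe())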

#84. How do you communicate your findings to a non-technical audience?
A: I focus on the "so what" rather than the "how". I start with the key insight or recommendation upfront. I use clear, simple language and avoid jargon. I rely heavily on effective data visualizations to tell the story. For example, instead of saying "the p-value was 0.01," I would say, "the data shows with high confidence that our new feature is increasing user engagement."

#85. Describe a time you made a mistake in your analysis. What did you do?
A: "In one project, I was analyzing marketing campaign performance and initially concluded that a specific campaign had a very high ROI. However, a colleague reviewing my query pointed out that I had forgotten to exclude test user accounts from my analysis. I immediately acknowledged the mistake, re-ran the query with the correct filters, and presented the updated, more modest results to my manager. I learned the importance of having a peer review process for critical queries and now I always build a data validation checklist for every project."
---
#86. What is bias-variance tradeoff?
A: It's a central concept in machine learning.
Bias is the error from erroneous assumptions in the learning algorithm. High bias can cause a model to miss relevant relations between features and target outputs (underfitting).
Variance is the error from sensitivity to small fluctuations in the training set. High variance can cause a model to model the random noise in the training data (overfitting).
The tradeoff is that models with low bias tend to have high variance, and vice-versa. The goal is to find a balance that minimizes total error.
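A small scikit-learn sketch of the tradeoff on synthetic data: a degree-1 polynomial underfits (high bias), a very high degree overfits (high variance), and a moderate degree balances the two:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 80)

    for degree in (1, 4, 15):                  # underfit, balanced, overfit
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        cv_mse = -cross_val_score(model, X, y, cv=5,
                                  scoring="neg_mean_squared_error").mean()
        train_mse = ((model.fit(X, y).predict(X) - y) ** 2).mean()
        print(f"degree={degree:2d}  train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")

The high-degree model shows a near-zero training error but a much larger cross-validated error, which is the signature of overfitting.
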
#87. What is feature engineering?
A: Feature engineering is the process of using domain knowledge to create new features (predictor variables) from raw data. The goal is to improve the performance of machine learning models. Examples include combining features, creating dummy variables from categorical data, or extracting components from a date.
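A tiny pandas sketch of those three examples (the DataFrame is made up for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "signup_date": pd.to_datetime(["2024-01-05", "2024-03-20"]),
        "plan": ["free", "pro"],
        "visits": [3, 42],
        "purchases": [1, 7],
    })

    # Extract components from a date
    df["signup_month"] = df["signup_date"].dt.month
    df["signup_dayofweek"] = df["signup_date"].dt.dayofweek

    # Combine two features into a ratio
    df["purchases_per_visit"] = df["purchases"] / df["visits"]

    # Dummy variables from a categorical column
    df = pd.get_dummies(df, columns=["plan"], drop_first=True)
    print(df)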

#88. How would you handle a very large dataset that doesn't fit into your computer's memory?
A:
Chunking: Read and process the data in smaller chunks using the chunksize option in pd.read_csv (see the sketch after this list).
Data Types: Optimize data types to use less memory (e.g., using int32 instead of int64).
Cloud Computing: Use cloud-based platforms like AWS, GCP, or Azure that provide scalable computing resources (e.g., using Spark with Databricks).
Sampling: Work with a representative random sample of the data for initial exploration.
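A rough sketch of the chunking and data-type ideas together, assuming a hypothetical big_file.csv with user_id and amount columns:

    import pandas as pd

    total = 0.0
    rows = 0

    # Stream the file in 1M-row chunks with smaller dtypes instead of loading it all
    for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000,
                             dtype={"user_id": "int32", "amount": "float32"}):
        total += chunk["amount"].sum()
        rows += len(chunk)

    print("mean amount:", total / rows)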

#89. Where do you go to stay up-to-date with the latest trends in data analysis?
A: I actively read blogs like Towards Data Science on Medium, follow key data scientists and analysts on LinkedIn and Twitter, and listen to podcasts. I also enjoy browsing Kaggle competitions to see how others approach complex problems and occasionally review documentation for new features in libraries like pandas and Scikit-learn.

#90. What is a key performance indicator (KPI)?
A: A KPI is a measurable value that demonstrates how effectively a company is achieving key business objectives. Organizations use KPIs to evaluate their success at reaching targets. For a data analyst, it's crucial to understand what the business's KPIs are in order to align analysis with business goals.
---
#91. What is the difference between structured and unstructured data?
A:
Structured Data: Highly organized and formatted in a way that is easily searchable in relational databases (e.g., spreadsheets, SQL databases).
Unstructured Data: Data that has no predefined format or organization, making it more difficult to collect, process, and analyze (e.g., text in emails, images, videos, social media posts).

#92. Why is data cleaning important?
A: Data cleaning (or data cleansing) is crucial because "garbage in, garbage out." Raw data is often messy, containing errors, inconsistencies, and missing values. If this data is not cleaned, it will lead to inaccurate analysis, flawed models, and unreliable conclusions, which can result in poor business decisions.

#93. Tell me about a time you had to work with ambiguous instructions or unclear data.
A: "I was once asked to analyze 'user engagement'. This term was very broad. I scheduled a meeting with the stakeholders (product manager, marketing lead) to clarify. I asked questions like: 'What business question are we trying to answer with this analysis?', 'Which users are we most interested in?', and 'What actions on the platform do we consider valuable engagement?'. This helped us collaboratively define engagement with specific metrics (e.g., likes, comments, session duration), which ensured my analysis was relevant and actionable."

#94. What is the difference between a dashboard and a report?
A:
Report: A static presentation of data for a specific time period (e.g., a quarterly sales report). It's meant to inform.
Dashboard: A dynamic, interactive BI tool that provides a real-time, at-a-glance view of key performance indicators. It's meant for monitoring and exploration.

#95. What is statistical power?
A: Statistical power is the probability that a hypothesis test will correctly reject the null hypothesis when the null hypothesis is false (i.e., the probability of avoiding a Type II error). In A/B testing, higher power means you are more likely to detect a real effect if one exists.
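A quick statsmodels sketch of a power calculation for a two-proportion A/B test (the baseline and lift values are illustrative):

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Detect a lift from a 10% to a 12% conversion rate at alpha=0.05 with 80% power
    effect = proportion_effectsize(0.12, 0.10)   # Cohen's h for the 10% -> 12% change
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
    )
    print(f"~{n_per_group:.0f} users needed per variant")
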
#96. How do you know if your sample is representative of the population?
A: The best way is through proper sampling techniques. Random sampling is the gold standard, where every member of the population has an equal chance of being selected. You can also use stratified sampling, where you divide the population into subgroups (strata) and then take a random sample from each subgroup to ensure all groups are represented proportionally.
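A brief pandas sketch of both approaches, assuming a hypothetical users file with a country column to stratify on:

    import pandas as pd

    df = pd.read_csv("users.csv")              # hypothetical population

    # Simple random sample: 10% of all users
    simple = df.sample(frac=0.10, random_state=42)

    # Stratified sample: 10% from each country so every group keeps its share
    stratified = df.groupby("country", group_keys=False).sample(frac=0.10, random_state=42)

    # Sanity check: strata proportions should roughly match the population
    print(pd.concat({"population": df["country"].value_counts(normalize=True),
                     "sample": stratified["country"].value_counts(normalize=True)}, axis=1))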

#97. What is your favorite data visualization and why?
A: "I find the box plot to be incredibly powerful and efficient. In a single, compact chart, it visualizes the distribution of data, showing the median, quartiles (25th and 75th percentiles), and potential outliers. It's excellent for comparing distributions across multiple categories and is much more informative than a simple bar chart of means."

#98. What is survivorship bias?
A: Survivorship bias is a logical error where you concentrate on the people or things that "survived" some process and inadvertently overlook those that did not because of their lack of visibility. A classic example is analyzing the habits of successful startup founders without considering the thousands who failed, which can lead to flawed conclusions about what it takes to succeed.

#99. You are given two datasets. How would you figure out if they can be joined?
A: I would first inspect the columns in both datasets to look for a common key or field. This field should ideally be a unique identifier (like user_id, product_id). I would check that the data types of these key columns are the same. Then, I would check the overlap of values between the key columns to understand how many records would match in a join.
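A short pandas sketch of those checks, assuming two hypothetical files that may share a user_id key:

    import pandas as pd

    orders = pd.read_csv("orders.csv")         # hypothetical datasets
    users = pd.read_csv("users.csv")

    # 1. Look for shared columns and compare their dtypes
    print(set(orders.columns) & set(users.columns))
    print(orders["user_id"].dtype, users["user_id"].dtype)

    # 2. Measure how many key values overlap
    overlap = orders["user_id"].isin(users["user_id"]).mean()
    print(f"{overlap:.1%} of orders have a matching user")

    # 3. Trial join with an indicator column to see how many rows actually match
    merged = orders.merge(users, on="user_id", how="left", indicator=True)
    print(merged["_merge"].value_counts())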

#100. Why do you want to be a data analyst?
A: "I am passionate about being a data analyst because I enjoy the process of transforming raw data into actionable insights that can drive real business decisions. I love the blend of technical skills like SQL and Python with the problem-solving and storytelling aspects of the role. I find it incredibly rewarding to uncover hidden patterns and help a company grow by making data-informed choices."

━━━━━━━━━━━━━━━
By: @DataScienceT
🔹 Title: Continuous Autoregressive Language Models

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27688
• PDF: https://arxiv.org/pdf/2510.27688
• Project Page: https://shaochenze.github.io/blog/2025/CALM/
• Github: https://shaochenze.github.io/blog/2025/CALM

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

🔹 Publication Date: Published on Oct 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27492
• PDF: https://arxiv.org/pdf/2510.27492
• Project Page: https://thinkmorph.github.io/
• Github: https://github.com/ThinkMorph/ThinkMorph

🔹 Datasets citing this paper:
https://huggingface.co/datasets/ThinkMorph/Jigsaw_Assembly
https://huggingface.co/datasets/ThinkMorph/Visual_Search
https://huggingface.co/datasets/ThinkMorph/Chart_Refocus
https://huggingface.co/datasets/ThinkMorph/Spatial_Navigation

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27684
• PDF: https://arxiv.org/pdf/2510.27684

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27623
• PDF: https://arxiv.org/pdf/2510.27623
• Project Page: https://zqs1943.github.io/BEAT/

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27607
• PDF: https://arxiv.org/pdf/2510.27607

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27266
• PDF: https://arxiv.org/pdf/2510.27266

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: The Denario project: Deep knowledge AI agents for scientific discovery

🔹 Publication Date: Published on Oct 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.26887
• PDF: https://arxiv.org/pdf/2510.26887
• Github: https://github.com/AstroPilot-AI/Denario

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

🔹 Publication Date: Published on Oct 29

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.25602
• PDF: https://arxiv.org/pdf/2510.25602
• Github: https://github.com/ChenMnZ/INT_vs_FP

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows

🔹 Publication Date: Published on Oct 28

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.24411
• PDF: https://arxiv.org/pdf/2510.24411
• Github: https://github.com/OS-Copilot/OS-Sentinel

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: Defeating the Training-Inference Mismatch via FP16

🔹 Publication Date: Published on Oct 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.26788
• PDF: https://arxiv.org/pdf/2510.26788
• Github: https://github.com/sail-sg/Precision-RL

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: Higher-order Linear Attention

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27258
• PDF: https://arxiv.org/pdf/2510.27258
• Project Page: https://yifanzhang-pro.github.io/HLA
• Github: https://github.com/yifanzhang-pro/HLA

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27606
• PDF: https://arxiv.org/pdf/2510.27606
• Github: https://github.com/InternLM/Spatial-SSRL

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

🔹 Publication Date: Published on Oct 29

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.25889
• PDF: https://arxiv.org/pdf/2510.25889
• Project Page: https://rlinf.readthedocs.io/en/latest/rst_source/examples/pi0.html
• Github: https://github.com/RLinf/RLinf

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

🔹 Title: Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

🔹 Publication Date: Published on Oct 31

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.27224
• PDF: https://arxiv.org/pdf/2510.27224

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

Forwarded from Kaggle Data Hub
Unlock premium learning without spending a dime! ⭐️ @DataScienceC is the first Telegram channel dishing out free Udemy coupons daily—grab courses on data science, coding, AI, and beyond. Join the revolution and boost your skills for free today! 📕

What topic are you itching to learn next? 😊
https://news.1rj.ru/str/DataScienceC 🌟
🔹 Title: Agent Lightning: Train ANY AI Agents with Reinforcement Learning

📝 Summary:
Agent Lightning is a flexible RL framework for training LLMs in any AI agent. It uniquely decouples agent execution from training, allowing seamless integration with diverse existing agents with minimal code changes. This enables robust training for complex interactions and shows stable performan...

🔹 Publication Date: Published on Aug 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.03680
• PDF: https://arxiv.org/pdf/2508.03680
• Project Page: https://www.microsoft.com/en-us/research/project/agent-lightning/
• Github: https://github.com/microsoft/agent-lightning

==================================

🔹 Title: Kimi Linear: An Expressive, Efficient Attention Architecture

📝 Summary:
Kimi Linear is a new hybrid linear attention architecture that, for the first time, outperforms full attention across various contexts. It achieves superior performance and efficiency, reducing KV cache and increasing decoding throughput, making it a powerful drop-in replacement.

🔹 Publication Date: Published on Oct 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.26692
• PDF: https://arxiv.org/pdf/2510.26692
• Github: https://github.com/MoonshotAI/Kimi-Linear

🔹 Models citing this paper:
https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Base
https://huggingface.co/aiqtech/Kimi-Linear-48B-A3B-Instruct

==================================

For more data science resources:
https://news.1rj.ru/str/DataScienceT