Data science/ML/AI
Data science and machine learning hub

Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.

For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels

DMCA: @disclosure_bds
Contact: @mldatascientist
📚 Data Science Riddle - Model Selection

Two models have similar accuracy, but one is far simpler. Which should you choose?
Anonymous Quiz
18% – The complex one
71% – The simpler one
4% – Neither
7% – Both
The Real Reason PCA Works: Variance as Signal

Students memorize PCA as “dimensionality reduction.”
But the deeper insight is: PCA assumes variance = information.

If a direction in the data has high variance, PCA considers it meaningful.
If variance is small, PCA considers it noise.

This is not always true in real systems.

PCA fails when:
▪️important signals have low variance
▪️noise has high variance
▪️relationships are nonlinear

That’s why modern methods (autoencoders, UMAP, t-SNE) outperform PCA on many datasets.
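To see the assumption bite, here's a minimal sketch (synthetic data, invented for illustration): the direction that separates the classes has tiny variance, so PCA's first component keeps the high-variance noise and drops the signal.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, n)

# High-variance direction: pure noise
noise = rng.normal(0, 10.0, n)
# Low-variance direction: actually separates the classes
signal = labels + rng.normal(0, 0.1, n)

X = np.column_stack([noise, signal])
pca = PCA(n_components=1).fit(X)

print(pca.components_)  # ~[[1, 0]]: nearly all weight on the noise axis
proj = pca.transform(X).ravel()
print(np.corrcoef(proj, labels)[0, 1])  # near 0: the class signal is gone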
📚 Data Science Riddle - Probability

A classifier outputs 0.9 probability for class A, but the real frequency is only 0.7. What is the model lacking?
Anonymous Quiz
28% – Regularization
26% – Early stopping
28% – Normalization
18% – Calibration
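The answer here is calibration, and it's easy to check. A quick sketch with scikit-learn's calibration_curve on synthetic data mirroring the riddle (the model always says 0.9 while events happen 70% of the time):

import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.7, 10_000)  # events really occur ~70% of the time
y_prob = np.full(10_000, 0.9)          # an overconfident model

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(prob_pred, prob_true)  # ~0.9 predicted vs ~0.7 observed: miscalibrated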
Why Feature Drift Is Harder Than Data Drift

Data drift = the input data distribution changes
Feature drift = the logic that generates the feature changes

Example:
Your “active user” feature used to be “clicked in last 7 days.”
Marketing redefines it to “clicked in last 3 days.”
Your model silently dies because the underlying concept changed.

Feature drift is more dangerous:
it happens inside your system, not in external data.

Production ML must version:
▪️feature definitions
▪️transformation logic
▪️data contracts

Otherwise the same model receives different features week to week.
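Here's a minimal sketch of what that versioning could look like (hypothetical names, plain Python): each feature definition is pinned to a version, so a redefinition shows up as a new version instead of a silent change.

from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDef:
    name: str
    version: int
    contract: str  # human-readable definition the model was trained against

REGISTRY = {
    ("active_user", 1): FeatureDef("active_user", 1, "clicked in last 7 days"),
    ("active_user", 2): FeatureDef("active_user", 2, "clicked in last 3 days"),
}

# The model pins the versions it was trained on; marketing's new definition
# becomes v2 and cannot silently replace v1.
MODEL_FEATURES = [("active_user", 1)]
for key in MODEL_FEATURES:
    print(REGISTRY[key])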
📚 Data Science Riddle - Feature Engineering

A model's performance drops because some features have extreme outliers. What helps most?
Anonymous Quiz
16% – Label smoothing
43% – Robust scaling
19% – Bagging
22% – Increasing k-fold splits
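Robust scaling wins here because it centers on the median and scales by the IQR, so one extreme value barely moves the rest. A tiny sketch with scikit-learn (made-up numbers):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

print(StandardScaler().fit_transform(X).ravel())  # outlier squashes the inliers together
print(RobustScaler().fit_transform(X).ravel())    # inliers keep a usable spread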
🧵 Thread Series on:

Mastering Pandas for Data Manipulation!


Pandas is the go-to library for handling tabular data in Python. Whether you're analyzing sales, surveys, or logs, start every project the same way:

import pandas as pd

# Load CSV
df = pd.read_csv('sales_data.csv')

# Quick look
df.head()     # First 5 rows
df.info()     # Structure & data types
df.describe() # Basic stats


Next up 👉 Selecting Columns & Rows
Selecting Columns & Rows

Need specific columns or rows? Pandas makes selection intuitive and fast:

# Single column (Series)
df['name']

# Multiple columns (DataFrame)
df[['name', 'age', 'sales']]

# Row selection with .loc (label-based)
df.loc[0:5]                    # Rows 0 to 5 (.loc slices are inclusive)
df.loc[df['sales'] > 1000]     # Conditional

# .iloc (position-based)
df.iloc[0:5, 1:4]              # Rows 0-4, columns 1-3


Next up 👉 Filtering and Querying
Filtering and Querying

Want to zoom in on specific data?

Filtering in Pandas is incredibly powerful. Check the code below:

# Multiple conditions
high_sales = df[(df['sales'] > 1000) & (df['region'] == 'West')]

# Using .query() – cleaner syntax!
high_performers = df.query("sales > 1000 and region == 'West'")

# Find missing values
df[df['email'].isna()]

# Contains substring (na=False skips missing values instead of erroring)
df[df['product'].str.contains('Pro', case=False, na=False)]


Next up 👉 Adding and Removing Columns
Adding and Removing Columns

DataFrames are flexible! Easily create new columns or remove unnecessary ones:

# Add new column
df['revenue'] = df['sales'] * df['price']

# From existing columns
df['full_name'] = df['first_name'] + ' ' + df['last_name']

# Drop columns
df.drop(columns=['temp_col'], inplace=True)

# Or create a new DF without modifying original
clean_df = df.drop(columns=['old_col1', 'old_col2'])


Next up 👉 Dealing with Missing Values
Dealing with Missing Values

Real-world data is messy; missing values are common.

Here's how to handle them cleanly:

# Check for nulls
df.isnull().sum()

# Drop rows with any missing values
df_clean = df.dropna()

# Fill missing values (assign back; chained inplace fillna is deprecated)
df['age'] = df['age'].fillna(df['age'].median())
df['category'] = df['category'].fillna('Unknown')

# Forward fill (great for time series); use .bfill() for backward fill
df['value'] = df['value'].ffill()


Next up 👉 Using GroupBy
Using GroupBy

GroupBy is where Pandas shines brightest. It summarizes data by categories in one line.

# Total sales by region
df.groupby('region')['sales'].sum()

# Multiple aggregations
df.groupby('region').agg({
    'sales': 'sum',
    'customer_id': 'nunique',
    'order_date': 'max'
})

# Group by multiple columns
df.groupby(['region', 'product'])['sales'].mean()


Next up 👉 Sorting and Ranking
📚 Data Science Riddle - Evaluation

You're measuring performance on a dataset with heavy class imbalance. What metric is most reliable?
Anonymous Quiz
19% – Accuracy
48% – F1 Score
14% – Precision
19% – AUC
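To see why accuracy misleads under imbalance, here's a small sketch (synthetic 99:1 labels): a classifier that always predicts the majority class scores 99% accuracy while its F1 on the minority class is 0.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)  # 99:1 imbalance
y_pred = np.zeros_like(y_true)           # always predict the majority class

print(accuracy_score(y_true, y_pred))             # 0.99, looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0, reveals the failure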
Sorting and Ranking

Order matters! Sort your data to find top performers or trends:

# Sort by one column
df.sort_values('sales', ascending=False)

# Sort by multiple columns
df.sort_values(['region', 'sales'], ascending=[True, False])

# Reset index after sorting
df = df.sort_values('sales', ascending=False).reset_index(drop=True)

# Add rank
df['sales_rank'] = df['sales'].rank(ascending=False)


Next up 👉 Merging and Joining Data
OnSpace Mobile App Builder: Build AI Apps in Minutes

With OnSpace, you can build websites or AI mobile apps by chatting with AI, and publish them to the Play Store or App Store.

🔥 What you get:
🤖 Create an app or website by chatting with AI;
🧠 Integrate any top AI model just by asking (like Sora2, Nanobanan Pro & Gemini 3 Pro);
📦 Download APK/AAB files and publish to the App Store;
💳 Add payments and monetize with in-app purchases and Stripe;
🔐 Functional login & signup;
🗄 Database + dashboard in minutes;
🎥 Full tutorial on YouTube and customer service within 1 day.

🌐 Visit website:
👉 https://www.onspace.ai/?via=tg_bigdata

📲 Or Download app:
👉 https://onspace.onelink.me/za8S/h1jb6sb9?c=bigdata
Merging and Joining Data

Working with multiple datasets? Combine them just like you would in SQL:

# Inner join (default)
merged = pd.merge(df_sales, df_customers, on='customer_id')

# Left join
pd.merge(df_sales, df_customers, on='customer_id', how='left')

# Concatenate vertically
all_data = pd.concat([df_2023, df_2024], ignore_index=True)

# Join on index (on='date' matches df1['date'] to df2's index)
df1.join(df2, on='date')


This wraps up our Data Manipulation Using Pandas Series.

Hit ❤️ if you liked this series. It will help us tailor more content based on what you like.

👉Join @datascience_bds for more
Part of the @bigdataspecialist family