Data science/ML/AI
Hey everyone 👋 Tomorrow we are kicking off a new short & free series called: 📊 Data Importing Series 📊 We’ll go through all the real ways to pull data into Python: → CSV, Excel, JSON and more → Databases & SQL → APIs, Google Sheets, even PDFs…
Hey Everyone 👋
Should we continue another series on "Data Manipulation with Pandas" just like the previous series?
Anonymous Poll
99%
Yes
1%
No
Data Cleaning in Python
Data cleaning is the process of detecting and correcting inaccurate, incomplete, or inconsistent data to improve data quality for analysis and modeling. It is a crucial step in any data science workflow.
Handling Missing Values
df.isnull().sum() # Check missing values
df.dropna() # Remove rows with missing values
df.fillna(0) # Replace missing values
Removing Duplicate Data
df.duplicated() # Identify duplicates
df.drop_duplicates() # Remove duplicates
Correcting Data Types
df.dtypes # Identify data types
df["age"] = df["age"].astype(int) # Convert age column to integer
df["date"] = pd.to_datetime(df["date"]) # Convert date column to datetime
Renaming Columns
df.columns = df.columns.str.lower().str.replace(" ", "_") # standardize column names
Handling Inconsistent Data
df["gender"] = df["gender"].str.lower() #convert to lower case
df["name"] = df["name"].str.strip() #remove leading/trailing spaces
Clean data leads to more accurate analysis and reliable models. Python’s pandas library simplifies cleaning tasks such as handling missing values, duplicates, incorrect types, and inconsistencies.
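Putting the snippets above together, here is a minimal end-to-end sketch on a made-up DataFrame (column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical messy data
df = pd.DataFrame({
    "Name ": ["  Alice", "Bob", "Bob", None],
    "Age": ["25", "30", "30", "40"],
    "Gender": ["F", "m", "m", "F"],
})

# Rename columns to a consistent style
df.columns = df.columns.str.strip().str.lower()
# Remove duplicate rows
df = df.drop_duplicates()
# Correct data types
df["age"] = df["age"].astype(int)
# Handle inconsistent data
df["gender"] = df["gender"].str.lower()
df["name"] = df["name"].str.strip()
# Handle missing values
df["name"] = df["name"].fillna("unknown")

print(df)
```

Running the steps in this order (dedup before filling) keeps you from treating two copies of the same bad row as distinct records.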
📚 Data Science Riddle - Model Selection
Two models have similar accuracy, but one is far simpler. Which should you choose?
Anonymous Quiz
18%
The complex one
71%
The simpler one
4%
Neither
7%
Both
The Real Reason PCA Works: Variance as Signal
Students memorize PCA as “dimensionality reduction.”
But the deeper insight is: PCA assumes variance = information.
If a direction in the data has high variance, PCA considers it meaningful.
If variance is small, PCA considers it noise.
This is not always true in real systems.
PCA fails when:
➖important signals have low variance
➖noise has high variance
➖relationships are nonlinear
That’s why modern methods (autoencoders, UMAP, t-SNE) often outperform PCA on such datasets.
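The variance-as-signal failure mode is easy to demonstrate. A minimal NumPy sketch (synthetic data, hypothetical scales): one axis is high-variance pure noise, the other is low-variance but carries the class label — and PCA ranks the noise first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Axis 1: high variance, pure noise
noise = rng.normal(scale=10.0, size=n)
# Axis 2: tiny variance, but it encodes the class label
label = rng.integers(0, 2, size=n)
signal = 0.1 * label + rng.normal(scale=0.01, size=n)

X = np.column_stack([noise, signal])
Xc = X - X.mean(axis=0)

# PCA via SVD: rows of Vt are principal directions, sorted by variance
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# PC1 aligns with the noisy axis; projecting onto it discards the label
print(np.abs(np.round(Vt[0], 3)))  # ≈ [1, 0]
```

Keeping only the first component here would throw away everything a classifier needs.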
📚 Data Science Riddle - Probability
A classifier outputs 0.9 probability for class A, but the real frequency is only 0.7. What is the model lacking?
Anonymous Quiz
28%
Regularization
26%
Early stopping
28%
Normalization
18%
Calibration
Why Feature Drift Is Harder Than Data Drift
Data drift = inputs change
Feature drift = the logic that generates the feature changes
Example:
Your “active user” feature used to be “clicked in last 7 days.”
Marketing redefines it to “clicked in last 3 days.”
Your model silently dies because the underlying concept changed.
Feature drift is more dangerous:
it happens inside your system, not in external data.
Production ML must version:
▪️feature definitions
▪️transformation logic
▪️data contracts
Otherwise the same model receives different features week to week.
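The versioning point can be sketched as a tiny feature registry where the model pins the exact definition it was trained on (all names here are hypothetical, not a real library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDef:
    name: str
    version: int
    logic: str  # human-readable definition; part of the data contract

# Hypothetical registry of versioned feature definitions
REGISTRY = {
    ("active_user", 1): FeatureDef("active_user", 1, "clicked in last 7 days"),
    ("active_user", 2): FeatureDef("active_user", 2, "clicked in last 3 days"),
}

def resolve(name: str, pinned_version: int) -> FeatureDef:
    """Fail loudly if the pinned version no longer exists,
    instead of silently serving a redefined feature."""
    try:
        return REGISTRY[(name, pinned_version)]
    except KeyError:
        raise RuntimeError(f"feature {name} v{pinned_version} missing: retrain or migrate")

print(resolve("active_user", 1).logic)
```

When marketing redefines the feature, that becomes v2 in the registry; a model pinned to v1 keeps its original semantics or fails loudly, rather than dying silently.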
📚 Data Science Riddle - Feature Engineering
A model's performance drops because some features have extreme outliers. What helps most?
Anonymous Quiz
16%
Label smoothing
43%
Robust scaling
19%
Bagging
22%
Increasing k-fold splits
🧵 Thread Series on:
Mastering Pandas for Data Manipulation!
Pandas is the go-to library for handling tabular data in Python. Whether you're analyzing sales, surveys, or logs, start every project the same way:
import pandas as pd
# Load CSV
df = pd.read_csv('sales_data.csv')
# Quick look
df.head() # First 5 rows
df.info() # Structure & data types
df.describe() # Basic stats
Next up 👉 Selecting Columns & Rows
Selecting Columns & Rows
Need specific columns or rows? Pandas makes selection intuitive and fast:
# Single column (Series)
df['name']
# Multiple columns (DataFrame)
df[['name', 'age', 'sales']]
# Row selection with .loc (label-based)
df.loc[0:5] # Rows 0 to 5
df.loc[df['sales'] > 1000] # Conditional
# .iloc (position-based)
df.iloc[0:5, 1:4] # Rows 0-4, columns 1-3
Next up 👉 Filtering and Querying
Filtering and Querying
Want to zoom in on specific data?
Filtering in Pandas is incredibly powerful. Check the code below:
# Multiple conditions
high_sales = df[(df['sales'] > 1000) & (df['region'] == 'West')]
# Using .query() – cleaner syntax!
high_performers = df.query("sales > 1000 and region == 'West'")
# Find missing values
df[df['email'].isna()]
# Contains substring
df[df['product'].str.contains('Pro', case=False)]
Next up 👉 Adding and Removing Columns
Adding and Removing Columns
DataFrames are flexible! Easily create new columns or remove unnecessary ones:
# Add new column
df['revenue'] = df['sales'] * df['price']
# From existing columns
df['full_name'] = df['first_name'] + ' ' + df['last_name']
# Drop columns
df.drop(columns=['temp_col'], inplace=True)
# Or create a new DF without modifying original
clean_df = df.drop(columns=['old_col1', 'old_col2'])
Next up 👉 Dealing with Missing Values
Dealing with Missing Values
Real-world data is messy, missing values are common.
Here's how to handle them cleanly:
# Check for nulls
df.isnull().sum()
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing values
df['age'] = df['age'].fillna(df['age'].median())  # assign back; chained inplace is unreliable
df['category'] = df['category'].fillna('Unknown')
# Forward or backward fill (great for time series)
df['value'] = df['value'].ffill()
Next up 👉 Using GroupBy
Using GroupBy
GroupBy is where Pandas shines brightest. It summarizes data by categories in one line.
# Total sales by region
df.groupby('region')['sales'].sum()
# Multiple aggregations
df.groupby('region').agg({
'sales': 'sum',
'customer_id': 'nunique',
'order_date': 'max'
})
# Group by multiple columns
df.groupby(['region', 'product'])['sales'].mean()
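A variation on the `.agg` dictionary worth knowing: named aggregation, which gives clean output column names. A self-contained sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["West", "West", "East"],
    "sales": [100, 200, 50],
})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
).reset_index()

print(summary)
```

Compared to the dictionary form, you control the result column names directly instead of renaming afterwards.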
Next up 👉 Sorting and Ranking
📚 Data Science Riddle - Evaluation
You're measuring performance on a dataset with heavy class imbalance. What metric is most reliable?
You're measuring performance on a dataset with heavy class imbalance. What metric is most reliable?
Anonymous Quiz
19%
Accuracy
48%
F1 Score
14%
Precision
19%
AUC
Sorting and Ranking
Order matters! Sort your data to find top performers or trends:
# Sort by one column
df.sort_values('sales', ascending=False)
# Sort by multiple columns
df.sort_values(['region', 'sales'], ascending=[True, False])
# Reset index after sorting
df = df.sort_values('sales', ascending=False).reset_index(drop=True)
# Add rank
df['sales_rank'] = df['sales'].rank(ascending=False)
Next up 👉 Merging and Joining Data
OnSpace Mobile App builder: Build AI Apps in minutes
With OnSpace, you can build websites or AI mobile apps by chatting with AI, and publish them to the PlayStore or AppStore.
🔥 What you'll get:
• 🤖 Create an app or website by chatting with AI;
• 🧠 Integrate any top AI model just by asking (like Sora2, Nanobanan Pro & Gemini 3 Pro);
• 📦 Download APK/AAB files, publish to the AppStore.
• 💳 Add payments and monetize with in-app purchases and Stripe.
• 🔐 Functional login & signup.
• 🗄 Database + dashboard in minutes.
• 🎥 Full tutorial on YouTube and customer service within 1 day.
🌐 Visit website:
👉 https://www.onspace.ai/?via=tg_bigdata
📲 Or Download app:
👉 https://onspace.onelink.me/za8S/h1jb6sb9?c=bigdata
Merging and Joining Data
Working with multiple datasets? Combine them just like SQL:
# Inner join (default)
merged = pd.merge(df_sales, df_customers, on='customer_id')
# Left join
pd.merge(df_sales, df_customers, on='customer_id', how='left')
# Concatenate vertically
all_data = pd.concat([df_2023, df_2024], ignore_index=True)
# Join df2 (indexed by date) to df1's 'date' column
df1.join(df2, on='date')
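One extra trick when auditing joins: `indicator=True` on `pd.merge` adds a `_merge` column showing where each row came from, which makes unmatched keys easy to spot. A small sketch with hypothetical frames:

```python
import pandas as pd

df_sales = pd.DataFrame({"customer_id": [1, 2, 3], "sales": [10, 20, 30]})
df_customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bo"]})

# indicator=True tags each row: 'both', 'left_only', or 'right_only'
merged = pd.merge(df_sales, df_customers, on="customer_id",
                  how="left", indicator=True)

print(merged["_merge"].value_counts())
```

Customer 3 has no match, so it shows up as `left_only` — a quick sanity check before trusting a join.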
This wraps up our Data Manipulation Using Pandas Series.
Hit ❤️ if you liked this series. It will help us tailor more content based on what you like.
👉Join @datascience_bds for more
Part of the @bigdataspecialist family