Which free resources would you like more of?
Anonymous Poll
Python: 18%
Artificial Intelligence: 22%
Machine Learning: 16%
Data Science: 20%
Data Engineering: 6%
Programming languages: 4%
Projects: 15%
Important Topics to become a data scientist
[Advanced Level]
👇👇
1. Mathematics
Linear Algebra
Analytic Geometry
Matrix
Vector Calculus
Optimization
Regression
Dimensionality Reduction
Density Estimation
Classification
2. Probability
Introduction to Probability
1D Random Variable
Functions of One Random Variable
Joint Probability Distribution
Discrete Distribution
Normal Distribution
3. Statistics
Introduction to Statistics
Data Description
Random Samples
Sampling Distribution
Parameter Estimation
Hypothesis Testing
Regression
4. Programming
Python:
Python Basics
List
Set
Tuples
Dictionary
Function
NumPy
Pandas
Matplotlib/Seaborn
R Programming:
R Basics
Vector
List
Data Frame
Matrix
Array
Function
dplyr
ggplot2
Tidyr
Shiny
Databases:
SQL
MongoDB
Data Structures
Web scraping
Linux
Git
5. Machine Learning
How Models Work
Basic Data Exploration
First ML Model
Model Validation
Underfitting & Overfitting
Random Forest
Handling Missing Values
Handling Categorical Variables
Pipelines
Cross-Validation (R)
XGBoost (Python/R)
Data Leakage
6. Deep Learning
Artificial Neural Network
Convolutional Neural Network
Recurrent Neural Network
TensorFlow
Keras
PyTorch
A Single Neuron
Deep Neural Network
Stochastic Gradient Descent
Overfitting and Underfitting
Dropout and Batch Normalization
Binary Classification
7. Feature Engineering
Baseline Model
Categorical Encodings
Feature Generation
Feature Selection
8. Natural Language Processing
Text Classification
Word Vectors
9. Data Visualization Tools
BI (Business Intelligence):
Tableau
Power BI
QlikView
Qlik Sense
10. Deployment
Microsoft Azure
Heroku
Google Cloud Platform
Flask
Django
Join @datasciencefun to learn important data science and machine learning concepts
ENJOY LEARNING 👍👍
Machine Learning with Decision Trees and Random Forest 📝.pdf (1.8 MB)
Machine Learning Algorithm
Data Scientist Interview Questions
1. How would you test whether a given dataset follows a normal distribution?
2. Explain the difference between Type I and Type II errors. How do they impact hypothesis testing?
3. You roll two dice. What is the probability that the sum is at least 8?
4. Given a biased coin that lands on heads with probability p, how can you generate a fair coin flip using it?
5. How would you detect and handle outliers in a dataset?
6. How do you deal with an imbalanced dataset in classification problems?
7. Explain how the Gradient Boosting Algorithm works. How is it different from Random Forest?
8. You are given a trained model with poor performance on new data. How would you debug the issue?
9. What is the curse of dimensionality? How do you mitigate its effects?
10. How do you choose the best number of clusters in K-means clustering?
11. Given a table of transactions, write an SQL query to find the top 3 customers with the highest total purchase amount.
12. How would you optimize a slow SQL query that joins multiple large tables?
13. Write an SQL query to calculate the rolling average of sales over the past 7 days.
14. How would you handle NULL values in an SQL dataset when performing aggregations?
15. How would you design a real-time recommendation system for an e-commerce website?
Answering these questions requires in-depth knowledge of data science concepts; a short sketch for question 4 follows below.
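For question 4, the classic approach is the von Neumann trick: flip the biased coin twice and keep the result only when the two flips differ. A minimal Python sketch, where the `biased_flip` helper and the probability `p` are hypothetical stand-ins for whatever coin you are given:

```python
import random

def biased_flip(p: float = 0.7) -> int:
    """Hypothetical biased coin: returns 1 (heads) with probability p."""
    return 1 if random.random() < p else 0

def fair_flip(p: float = 0.7) -> int:
    """Von Neumann trick: HT -> heads, TH -> tails, HH/TT -> flip again.
    Both kept outcomes occur with probability p*(1-p), so the result is unbiased."""
    while True:
        a, b = biased_flip(p), biased_flip(p)
        if a != b:
            return a  # 1 for heads, 0 for tails

if __name__ == "__main__":
    flips = [fair_flip() for _ in range(10_000)]
    print(sum(flips) / len(flips))  # should land close to 0.5
```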
5 data science questions you should be able to answer for a data scientist role.
𝐌𝐞𝐝𝐢𝐮𝐦 𝐥𝐞𝐯𝐞𝐥
1. Name ML algorithms that do not use Gradient Descent for optimization.
2. Explain how you construct an ROC-AUC curve.
3. Give examples of business cases where precision is more important than recall, and vice versa.
4. What’s the difference between bagging and boosting, and when would you use one over the other?
5. How do MLE and MAP differ?
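For question 2, a hedged sketch of how the ROC curve is built in practice with scikit-learn: sweep the decision threshold over the predicted scores, compute the true-positive and false-positive rates at each threshold, then plot TPR against FPR. The synthetic data and the logistic regression model below are placeholders, not part of any particular interview answer:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data and model; substitute your own.
X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# roc_curve sweeps thresholds over the scores and returns FPR/TPR pairs;
# plotting TPR against FPR gives the ROC curve, and the area under it is the AUC.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```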
Many data scientists don't know how to push ML models to production. Here's the recipe 👇
𝗞𝗲𝘆 𝗜𝗻𝗴𝗿𝗲𝗱𝗶𝗲𝗻𝘁𝘀
🔹 𝗧𝗿𝗮𝗶𝗻 / 𝗧𝗲𝘀𝘁 𝗗𝗮𝘁𝗮𝘀𝗲𝘁 - Ensure Test is representative of Online data
🔹 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 - Generate features in real-time
🔹 𝗠𝗼𝗱𝗲𝗹 𝗢𝗯𝗷𝗲𝗰𝘁 - Trained scikit-learn or TensorFlow model
🔹 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 𝗖𝗼𝗱𝗲 𝗥𝗲𝗽𝗼 - Save model project code to GitHub
🔹 𝗔𝗣𝗜 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 - Use FastAPI or Flask to build a model API
🔹 𝗗𝗼𝗰𝗸𝗲𝗿 - Containerize the ML model API
🔹 𝗥𝗲𝗺𝗼𝘁𝗲 𝗦𝗲𝗿𝘃𝗲𝗿 - Choose a cloud service, e.g. AWS SageMaker
🔹 𝗨𝗻𝗶𝘁 𝗧𝗲𝘀𝘁𝘀 - Test inputs & outputs of functions and APIs
🔹 𝗠𝗼𝗱𝗲𝗹 𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴 - Evidently AI, a simple, open-source tool for ML monitoring
𝗣𝗿𝗼𝗰𝗲𝗱𝘂𝗿𝗲
𝗦𝘁𝗲𝗽 𝟭 - 𝗗𝗮𝘁𝗮 𝗣𝗿𝗲𝗽𝗮𝗿𝗮𝘁𝗶𝗼𝗻 & 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
Don't push a model just because it scores 90% accuracy on the training set. Judge it on the test set, and only if the test set is representative of the online data. Use a scikit-learn Pipeline to chain your preprocessing steps, such as null handling, to the model.
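A minimal sketch of such a preprocessing-plus-model pipeline, assuming purely numeric features; the synthetic data and the random forest are placeholders for your own project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data; in practice X_train/X_test come from your project.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain null handling, scaling, and the model so the exact same
# transformations run at training time and at inference time.
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),    # null handling
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)
print("test accuracy:", pipeline.score(X_test, y_test))  # judge on the test set
```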
𝗦𝘁𝗲𝗽 𝟮 - 𝗠𝗼𝗱𝗲𝗹 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁
Train your model with frameworks like scikit-learn or TensorFlow. Push the model code, including the preprocessing, training, and validation scripts, to GitHub for reproducibility.
𝗦𝘁𝗲𝗽 𝟯 - 𝗔𝗣𝗜 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁 & 𝗖𝗼𝗻𝘁𝗮𝗶𝗻𝗲𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻
Your model needs a "/predict" endpoint, which receives a JSON object in the request and returns a JSON object containing the model score in the response. You can use frameworks like FastAPI or Flask. Containerize this API so that it is agnostic to the server environment.
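A minimal FastAPI sketch of such a "/predict" endpoint. The feature names, the model file path (model.joblib), and the assumption that the saved object is a classifier with predict_proba are all illustrative, not part of the recipe itself:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path to the trained pipeline

class Features(BaseModel):
    # Hypothetical input schema; replace with your model's actual features.
    age: float
    income: float

@app.post("/predict")
def predict(features: Features) -> dict:
    # Request JSON -> feature row -> model score -> response JSON.
    row = [[features.age, features.income]]
    score = float(model.predict_proba(row)[0][1])
    return {"score": score}

# Run locally with: uvicorn main:app --reload
# Then POST {"age": 35, "income": 52000} to http://127.0.0.1:8000/predict
```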
𝗦𝘁𝗲𝗽 𝟰 - 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 & 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁
Write tests that validate the inputs and outputs of the API functions to prevent errors. Push the containerized API to a remote service like AWS SageMaker.
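A hedged example of such tests using pytest and FastAPI's TestClient, assuming the app from the sketch above lives in a module called main and uses the same hypothetical payload fields:

```python
from fastapi.testclient import TestClient
from main import app  # hypothetical module containing the FastAPI app

client = TestClient(app)

def test_predict_returns_score():
    # A valid payload should return HTTP 200 and a score between 0 and 1.
    response = client.post("/predict", json={"age": 35, "income": 52000})
    assert response.status_code == 200
    assert 0.0 <= response.json()["score"] <= 1.0

def test_predict_rejects_bad_input():
    # Payloads that violate the schema should be rejected with HTTP 422.
    response = client.post("/predict", json={"age": "not-a-number"})
    assert response.status_code == 422
```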
𝗦𝘁𝗲𝗽 𝟱 - 𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴
Set up monitoring tools like Evidently AI, or use the built-in monitoring within AWS SageMaker. I use such tools to track performance metrics and data drift on online data.
AI Agents Course
by Hugging Face 🤗
This free course will take you on a journey, from beginner to expert, in understanding, using and building AI agents.
https://huggingface.co/learn/agents-course/unit0/introduction
How do you handle null, 0, and blank values in your data during the cleaning process?
Interview questions are sometimes based on this topic. Many data aspirants, and even some professionals, make the mistake of simply deleting missing values or filling them in without proper analysis. This can damage the integrity of the analysis. It's essential to find out the reason behind the missing values, whether from the project head, the client, or through your own investigation.
𝘼𝙣𝙨𝙬𝙚𝙧:
Handling null, 0, and blank values is crucial for ensuring the accuracy and reliability of data analysis. Here’s how to approach it:
1. 𝙄𝙙𝙚𝙣𝙩𝙞𝙛𝙮𝙞𝙣𝙜 𝙖𝙣𝙙 𝙐𝙣𝙙𝙚𝙧𝙨𝙩𝙖𝙣𝙙𝙞𝙣𝙜 𝙩𝙝𝙚 𝘾𝙤𝙣𝙩𝙚𝙭𝙩:
- 𝙉𝙪𝙡𝙡 𝙑𝙖𝙡𝙪𝙚𝙨: These represent missing or undefined data. Identify them using functions like 'ISNULL' or filters in Power Query.
- 0 𝙑𝙖𝙡𝙪𝙚𝙨: These can be legitimate data points but may also indicate missing data in some contexts. Understanding the context is important.
- 𝘽𝙡𝙖𝙣𝙠 𝙑𝙖𝙡𝙪𝙚𝙨: These can be spaces or empty strings. Identify them using 'LEN', 'TRIM', or filters.
2. 𝙃𝙖𝙣𝙙𝙡𝙞𝙣𝙜 𝙏𝙝𝙚𝙨𝙚 𝙑𝙖𝙡𝙪𝙚𝙨 𝙐𝙨𝙞𝙣𝙜 𝙋𝙧𝙤𝙥𝙚𝙧 𝙏𝙚𝙘𝙝𝙣𝙞𝙦𝙪𝙚𝙨:
- 𝙉𝙪𝙡𝙡 𝙑𝙖𝙡𝙪𝙚𝙨: Typically decide whether to impute, remove, or leave them based on the dataset’s context and the analysis requirements. Common imputation methods include using mean, median, or a placeholder.
- 0 𝙑𝙖𝙡𝙪𝙚𝙨: If 0s are valid data, leave them as is. If they indicate missing data, treat them similarly to null values.
- 𝘽𝙡𝙖𝙣𝙠 𝙑𝙖𝙡𝙪𝙚𝙨: Convert blanks to nulls or handle them as needed. This involves using 'IF' statements or Power Query transformations.
3. 𝙐𝙨𝙞𝙣𝙜 𝙀𝙭𝙘𝙚𝙡 𝙖𝙣𝙙 𝙋𝙤𝙬𝙚𝙧 𝙌𝙪𝙚𝙧𝙮:
- 𝙀𝙭𝙘𝙚𝙡: Use formulas like 'IFERROR', 'IF', and 'VLOOKUP' to handle these values.
- 𝙋𝙤𝙬𝙚𝙧 𝙌𝙪𝙚𝙧𝙮: Use transformations to filter, replace, or fill null and blank values. Steps like 'Fill Down', 'Replace Values', and custom columns help automate the process.
By carefully considering the context and using appropriate methods, the data cleaning process maintains the integrity and quality of the data.
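The advice above is phrased for Excel and Power Query, but the same decisions translate directly to pandas. A minimal sketch with a made-up DataFrame; the column names and the imputation choices are illustrative, not a prescription:

```python
import numpy as np
import pandas as pd

# Illustrative data: blanks, zeros, and missing values mixed together.
df = pd.DataFrame({
    "revenue": [1200.0, 0.0, np.nan, 980.0],   # 0 may be a real value here
    "discount": [5.0, np.nan, 0.0, np.nan],    # suppose 0/NaN both mean "no discount"
    "region": ["North", "  ", "", "South"],    # blanks and empty strings
})

# 1. Blanks: convert empty or whitespace-only strings to proper nulls.
df["region"] = df["region"].replace(r"^\s*$", np.nan, regex=True)

# 2. Nulls: impute, drop, or flag depending on context (median shown here).
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# 3. Zeros: only treat them as missing if the business context says so.
df["discount"] = df["discount"].fillna(0.0)  # here NaN genuinely means "none"

print(df)
```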
Hope it helps :)