Data Science Interview Questions
[PART - 7]
𝐐1. 𝐩-𝐯𝐚𝐥𝐮𝐞?
𝐀ns. A p-value is the probability of observing a result at least as extreme as the one actually observed, assuming the null hypothesis is true. The lower the p-value, the stronger the evidence against the null hypothesis, i.e. the greater the statistical significance of the observed difference. The p-value can be reported instead of, or in addition to, a decision based on a pre-selected significance level (for example 0.05).
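As a small illustration, here is a sketch of an exact one-sided p-value for a coin-flip experiment. The scenario (8 heads in 10 flips, null hypothesis of a fair coin) and the function name are assumptions made for this example only.

```python
from math import comb

def one_sided_binomial_p_value(n_flips: int, n_heads: int, p_null: float = 0.5) -> float:
    """Probability of seeing n_heads or more heads if the null hypothesis is true."""
    return sum(
        comb(n_flips, k) * p_null**k * (1 - p_null) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )

# Hypothetical experiment: 8 heads out of 10 flips of a supposedly fair coin.
p_value = one_sided_binomial_p_value(10, 8)
print(round(p_value, 4))  # 0.0547
```

With a significance level of 0.05, this result would narrowly fail to reject the null hypothesis.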
𝐐2. 𝐈𝐧𝐭𝐞𝐫𝐩𝐨𝐥𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐄𝐱𝐭𝐫𝐚𝐩𝐨𝐥𝐚𝐭𝐢𝐨𝐧?
𝐀ns. Interpolation estimates an unknown value that lies within the range of the known data points, whereas extrapolation estimates unknown values that lie beyond that range. Extrapolation is generally riskier because it assumes the observed trend continues outside the data.
𝐐3. 𝐔𝐧𝐢𝐟𝐨𝐫𝐦 𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧 & 𝐧𝐨𝐫𝐦𝐚𝐥 𝐝𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧?
𝐀ns. The normal distribution is bell-shaped, which means values near the center of the distribution are more likely to occur than values in the tails. The uniform distribution is rectangular, which means every value in its range is equally likely to occur.
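A minimal sketch of the difference, assuming a straight line through two known points. The points (1, 10) and (3, 30) and the helper name are made up for the illustration.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Known points: (1, 10) and (3, 30).
print(linear_estimate(1, 10, 3, 30, 2))  # 20.0  (x=2 is inside the known range: interpolation)
print(linear_estimate(1, 10, 3, 30, 5))  # 50.0  (x=5 is outside the known range: extrapolation)
```

The same formula is used in both cases; only the position of x relative to the known range changes.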
𝐐4. 𝐑𝐞𝐜𝐨𝐦𝐦𝐞𝐧𝐝𝐞𝐫 𝐒𝐲𝐬𝐭𝐞𝐦𝐬?
𝐀ns. A recommender system models the likes and dislikes of its users. Its main objective is to recommend items that a user is likely to enjoy or need, based on signals such as previous purchases. It is like having a personal team that understands our preferences and helps us decide on an item without personal bias, drawing on the large amount of data that accumulates in repositories every day.
𝐐5. 𝐉𝐎𝐈𝐍 𝐟𝐮𝐧𝐜𝐭𝐢𝐨𝐧 𝐢𝐧 𝐒𝐐𝐋
𝐀ns. The SQL JOIN clause combines rows from two or more tables in a database, based on a related column between them.
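A small sketch of an INNER JOIN, run through Python's built-in sqlite3 module so it is self-contained. The customers and orders tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 40.0), (12, 2, 99.0);
""")

# Combine each order with the customer it belongs to via the related column.
rows = conn.execute("""
SELECT c.name, o.amount
FROM customers AS c
INNER JOIN orders AS o ON o.customer_id = c.id
ORDER BY o.id
""").fetchall()
print(rows)  # [('Asha', 250.0), ('Asha', 40.0), ('Ravi', 99.0)]
```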
𝐐6. 𝐒𝐪𝐮𝐚𝐫𝐞𝐝 𝐞𝐫𝐫𝐨𝐫 𝐚𝐧𝐝 𝐚𝐛𝐬𝐨𝐥𝐮𝐭𝐞 𝐞𝐫𝐫𝐨𝐫?
𝐀ns. Mean squared error (MSE) and mean absolute error (MAE) are both used to evaluate the accuracy of a regression model. The squared error is differentiable everywhere, while the absolute error is not (its derivative is undefined at 0), which makes the squared error more amenable to gradient-based optimization. The squared error also penalizes large errors more heavily, so MSE is more sensitive to outliers than MAE.
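A short sketch of both metrics in plain Python. The sample values are invented; one prediction is off by 4 to show how the two metrics weight a large error.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 8]  # a single prediction is off by 4
print(mean_squared_error(y_true, y_pred))   # 4.0
print(mean_absolute_error(y_true, y_pred))  # 1.0
```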
ENJOY LEARNING 👍👍
Today's Question on Probability
Two candidates Aman and Mohan appear for a Data Science Job interview. The probability of Aman cracking the interview is 1/8 and that of Mohan is 5/12. What is the probability that at least one of them will crack the interview?
The probability of Aman getting selected is P(A) = 1/8, and the probability of Mohan getting selected is P(B) = 5/12.
Now, the probability of at least one of them getting selected is the probability of the union of A and B:
P(A ∪ B) = P(A) + P(B) – P(A ∩ B) ………………………(1)
where P(A ∩ B) is the probability of both Aman and Mohan getting selected. Assuming the two outcomes are independent, P(A ∩ B) = P(A) * P(B)
= 1/8 * 5/12
= 5/96
Now, put the value of P(A ∩ B) into equation (1):
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
= 1/8 + 5/12 – 5/96
= 12/96 + 40/96 – 5/96
So, the answer is 47/96.
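The arithmetic can be checked with exact fractions. This sketch assumes, as the solution does, that the two interview outcomes are independent; the variable names are illustrative.

```python
from fractions import Fraction

p_aman = Fraction(1, 8)
p_mohan = Fraction(5, 12)

# Independence assumption: P(A and B) = P(A) * P(B).
p_both = p_aman * p_mohan
p_at_least_one = p_aman + p_mohan - p_both

# Cross-check via the complement: 1 - P(neither is selected).
p_via_complement = 1 - (1 - p_aman) * (1 - p_mohan)

print(p_at_least_one)  # 47/96
```

Both routes give the same value, which is a useful sanity check in an interview setting.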
ENJOY LEARNING 👍👍
Data Science Interview Questions
[Part -8]
Q. How would you build a model to predict credit card fraud?
A. Start from a labeled dataset such as Kaggle's credit card fraud dataset and begin with EDA (Exploratory Data Analysis). Apply a train/test split, then choose a model such as logistic regression, XGBoost, or Random Forest. After hyperparameter tuning and fitting the model, the final step is evaluating its performance. Because fraud cases are rare, the classes are heavily imbalanced, so evaluate with precision and recall rather than accuracy alone.
Q. How would you derive new features from features that already exist?
A. Feature engineering is applied first to generate additional features, and then feature selection is done to eliminate irrelevant, redundant, or highly correlated features. Typical techniques include binning and combining or transforming existing columns.
Q. If you’re attempting to predict a customer’s gender, and you only have 100 data points, what problems could arise?
A. Overfitting: with such a small sample the model may latch onto patterns that are specific to these 100 points, so it loses the ability to generalize to other data.
Q. Suppose you were given two years of transaction history. What features would you use to predict credit risk?
A. The following features can be used in such a case:
Transaction amount,
Transaction count,
Transaction frequency,
Transaction category: bar, grocery, jewelry, etc.
Transaction channel: credit card, debit card, international wire transfer, etc.
Distance between transaction address and mailing address,
Fraud/risk score
Q. Explain overfitting and what steps you can take to prevent it.
A. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. Some steps that we can take to avoid it:
1. Data augmentation
2. L1/L2 Regularization
3. Reduce model complexity (remove layers or reduce the number of units per layer)
4. Cross-validation
Q. Why does SVM need to maximize the margin between support vectors?
A. The hyperplane with the maximum margin is considered the optimal hyperplane because it leaves the most room between the decision boundary and the nearest training points of each class (the support vectors), which tends to generalize better to unseen data. SVM therefore places the decision boundary so that the separation between the two classes is as wide as possible.
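For reference, the standard hard-margin formulation can be written as follows. The symbols w (weight vector), b (bias), x_i (training points), and y_i in {-1, +1} (labels) are notation introduced here, not taken from the answer above.

```latex
\text{margin} = \frac{2}{\lVert w \rVert}, \qquad
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 \ \ \text{for all } i
```

Minimizing the norm of w is equivalent to maximizing the margin, and the points that satisfy the constraint with equality are the support vectors.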
ENJOY LEARNING 👍👍
Data Science Interview Questions
[Part - 9]
Q. Difference between array and list
A. The main difference is the kind of elements they hold and the operations they support. Lists can hold elements of different data types, whereas arrays hold elements of a single data type, which makes them more compact and better suited to numeric operations.
Q. Which is faster for lookup, a dictionary or a list?
A. A dictionary is faster because of how it finds an item. A dictionary uses a hash lookup, which takes roughly constant time on average, while a list has to be walked from the beginning until the item is found, which takes time proportional to its length.
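A quick sketch of the difference in Python, assuming "array" refers to the standard library's typed array module (NumPy arrays behave similarly with respect to a single element type).

```python
from array import array

mixed = [1, "two", 3.0]        # a list accepts elements of different types
ints = array("i", [1, 2, 3])   # a typed array of signed integers

try:
    ints.append("four")        # a string does not fit the declared type
    accepted = True
except TypeError:
    accepted = False

print(accepted)  # False
```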
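A rough way to see the gap is to time a membership test on both structures. The collection size and repeat count below are arbitrary choices, and the actual timings depend on the machine.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list)
target = n - 1  # worst case for the list: the item is at the very end

# The list is scanned element by element; the dictionary hashes the key.
list_time = timeit.timeit(lambda: target in as_list, number=100)
dict_time = timeit.timeit(lambda: target in as_dict, number=100)

print(f"list: {list_time:.4f}s  dict: {dict_time:.6f}s")
```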
Q. How much time does SVM take to complete if 1 iteration takes 10 sec for the 1st class, and there are 4 classes?
A. With the one-vs-all (one-vs-rest) method, one classifier is trained per class, so it would take 4 * 10 = 40 seconds.
Q. Kernels in SVM and their differences
A. A kernel function in SVM takes the input data and transforms it into the form required for processing, typically by computing similarity in a higher-dimensional feature space.
Gaussian Kernel / Radial Basis Function (RBF): a general-purpose choice when there is no prior knowledge about the data; similarity depends on the distance between two points.
Sigmoid Kernel: equivalent to a two-layer perceptron model of a neural network; the same function is used as an activation function for artificial neurons.
Polynomial Kernel: represents the similarity of training vectors in a feature space over polynomials of the original variables.
Linear Kernel: used when the data is linearly separable.
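The four kernels can be sketched as plain functions of two vectors. The parameter names gamma, coef0, and degree and their default values are assumptions chosen for this illustration, not values given in the answer.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # Similarity decays with the squared distance between the points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

x, y = [1, 2], [3, 4]
print(linear_kernel(x, y))      # 11
print(polynomial_kernel(x, y))  # 144.0
print(rbf_kernel(x, x))         # 1.0
```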
ENJOY LEARNING 👍👍
Today's Probability Question
Three zebras are sitting at the corners of an equilateral triangle, one zebra per corner. Each zebra randomly picks a direction and runs along the outline of the triangle toward one of the other two corners. What is the probability that none of the zebras collide?
• Let's imagine all of the zebras on an equilateral triangle. Each has two possible directions to run in along the outline. Since each choice is random, let's count the outcomes in which they do not collide.
• There are really only two such outcomes: the zebras either all run clockwise or all run counter-clockwise. In any other outcome, at least two zebras run toward each other.
• Let's calculate the probability of each. The probability that every zebra chooses clockwise is the product of each zebra's probability of choosing clockwise. Given there are two choices (counter-clockwise or clockwise), that is 1/2 * 1/2 * 1/2 = 1/8.
• The probability of every zebra going counter-clockwise is the same, 1/8. Therefore, summing the two probabilities gives 1/8 + 1/8 = 1/4, or 25%.
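The result can be confirmed by brute-force enumeration of all eight direction combinations. This sketch assumes each zebra picks its direction independently with probability 1/2.

```python
from fractions import Fraction
from itertools import product

directions = ("clockwise", "counterclockwise")

# Every combination of choices for the three zebras: 2 * 2 * 2 = 8 outcomes.
outcomes = list(product(directions, repeat=3))

# No collision happens only when all three pick the same direction.
no_collision = [o for o in outcomes if len(set(o)) == 1]

probability = Fraction(len(no_collision), len(outcomes))
print(probability)  # 1/4
```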
Data Science Interview Questions
[PART- 10]
Q. Difference between WHERE and HAVING in SQL
A. The main difference is that the WHERE clause filters individual records before any grouping is done, while the HAVING clause filters groups after aggregation, typically using a condition on an aggregate such as SUM or COUNT.
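A small sketch showing both clauses in one query, run through Python's built-in sqlite3 module. The sales table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES
  ('north', 100), ('north', 300), ('south', 50), ('south', 60), ('east', 500);
""")

rows = conn.execute("""
SELECT region, SUM(amount) AS total
FROM sales
WHERE amount > 55          -- filters individual rows before grouping
GROUP BY region
HAVING SUM(amount) > 100   -- filters whole groups after aggregation
ORDER BY region
""").fetchall()
print(rows)  # [('east', 500), ('north', 400)]
```

The 'south' group survives the WHERE filter with a single row of 60 but is then dropped by HAVING because its total is not above 100.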
Q. Explain confusion matrix ?
A. A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class.
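A minimal sketch of building a binary confusion matrix by counting (actual, predicted) pairs. The labels below are invented, and the layout [[TN, FP], [FN, TP]] is one common convention.

```python
from collections import Counter

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count how often each (actual, predicted) pair occurs.
counts = Counter(zip(y_true, y_pred))
tp = counts[(1, 1)]  # true positives
fn = counts[(1, 0)]  # false negatives
fp = counts[(0, 1)]  # false positives
tn = counts[(0, 0)]  # true negatives

print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
```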
Q. Explain PCA
A. PCA (principal component analysis) is a dimensionality reduction technique that projects the data onto the directions of maximum variance, called principal components. The principal components are eigenvectors of the data's covariance matrix. Thus, they are often computed by eigendecomposition of the covariance matrix or by singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis.
Q. How do you cut a cake into 8 equal parts using only 3 straight cuts ?
A. Cut the cake through the middle first, then stack one piece on the other and cut straight through the middle again, which leaves you with 4 pieces. Finally, stack all 4 pieces and cut a third time. This way, 3 straight cuts give 8 equal pieces.
Q. Explain kmeans clustering
A. K-means clustering aims to partition data into k clusters in a way that data points in the same cluster are similar and data points in the different clusters are farther apart. Similarity of two points is determined by the distance between them.
Q. How is KNN different from k-means clustering?
A. K-means is an unsupervised algorithm used for clustering, while KNN is a supervised learning algorithm used for classification (and also regression): it predicts the label of a point from the labels of its k nearest neighbors.
Q. Stock market prediction: You would like to predict whether or not a certain company will declare bankruptcy within the next 7 days (by training on data of similar companies that had previously been at risk of bankruptcy). Would you treat this as a classification or a regression problem?
A. It is a classification problem, because the target is a yes/no outcome (bankruptcy declared or not) rather than a continuous value.
ENJOY LEARNING 👍👍
Which job title is known as the sexiest job in the world?
Anonymous Quiz
13%
Software Engineer
12%
Blockchain Developer
8%
Data Engineer
68%
Data Scientist
Using loop inside loop is known as?
Anonymous Quiz
5%
Sub Loop
90%
Nested Loop
4%
Double Loop
2%
Series loop
Python can be used for?
Anonymous Quiz
4%
Data Analytics
1%
Web Development
4%
Machine Learning
90%
All of the above
SQL can be used for?
Anonymous Quiz
31%
Analytics
5%
Web development
3%
Machine Learning
61%
All of the above
Matplotlib can be used for?
Anonymous Quiz
3%
Web Development
87%
Data Visualization
6%
Data Extraction
4%
None of the above
Which of the following is not used for Machine Learning/Deep Learning?
Anonymous Quiz
5%
Scikit-learn
6%
Tensorflow
8%
Keras
81%
JavaScript
Data Science Interview Questions
[Part - 11]
Q1. Difference between R square and Adjusted R Square.
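Ans. R² never decreases when a predictor is added to the model, even if that predictor has no real relationship with the dependent variable, so it implicitly treats every variable as explaining variation. Adjusted R² corrects for the number of predictors: it increases only when a new variable improves the model more than would be expected by chance, so it better reflects the variation explained by the independent variables that actually affect the dependent variable.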
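The usual adjustment formula is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A short sketch, with made-up input values:

```python
def adjusted_r_squared(r_squared: float, n_samples: int, n_predictors: int) -> float:
    """Penalize R^2 for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical model: R^2 = 0.9 from 11 observations and 2 predictors.
print(round(adjusted_r_squared(0.9, 11, 2), 3))  # 0.875
```

Adding predictors while R² stays the same lowers the adjusted value, which is the penalty the answer describes.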
Q2. Difference between Precision and Recall.
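Ans. Precision is the number of true positives divided by the true positives plus the false positives: TP / (TP + FP). Recall is the number of true positives divided by the true positives plus the false negatives: TP / (TP + FN). Precision answers how many of the predicted positives are correct; recall answers how many of the actual positives were found.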
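Both definitions translate directly into code. The counts used here are invented for the example.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model found."""
    return tp / (tp + fn)

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```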
Q3. Assumptions of Linear Regression.
Ans. There are four assumptions associated with a linear regression model:
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, the residuals are normally distributed.
Q4. Difference between Random Forest and Decision Tree.
Ans. A decision tree makes predictions through a single sequence of decisions, whereas a random forest combines the predictions of many decision trees. A random forest is therefore slower to train and to predict with, but it is usually more accurate and less prone to overfitting. A single decision tree is fast, easy to interpret, and operates easily on large data sets.
Q5. How does K-means work?
Ans. K-means clustering uses "centroids", K different randomly-initiated points in the data, and assigns every data point to the nearest centroid. After every point has been assigned, each centroid is moved to the average of all of the points assigned to it. These two steps are repeated until the assignments stop changing.
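The assign-then-update loop can be sketched in a few lines. This version works on one-dimensional points and takes fixed starting centroids instead of random ones so the run is reproducible; both simplifications are assumptions made for the example.

```python
def k_means_1d(points, centroids, max_iters=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iters):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```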
Q6. How do you generally choose among different classification models to decide which one is performing the best?
Ans. Here are some important considerations while choosing an algorithm:
Size of the training data, Accuracy and/or Interpretability of the output, Speed or Training time, Linearity and number of features.
Q7. How do you perform feature selection?
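Ans. Unsupervised methods do not use the target variable; they remove redundant variables, for example by dropping one of each pair of highly correlated features.
Supervised methods use the target variable to remove irrelevant variables. Wrapper methods, such as recursive feature elimination (RFE), search for well-performing subsets of features.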
Q8. What is an intercept in a Linear Regression? What is its significance?
Ans. The intercept (often labeled the constant) is the point where the regression line crosses the y-axis, i.e. the expected mean value of Y when all X = 0. Start with a regression equation with one predictor, X. If X sometimes equals 0, the intercept is simply the expected mean value of Y at that value. If X never equals 0, the intercept has no intrinsic meaning. In some analyses, the regression model only becomes significant when the intercept is removed, and the regression line reduces to Y = b*X + error.
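For a single predictor, the least-squares intercept is mean(Y) - slope * mean(X). A minimal sketch, using made-up data that lies exactly on the line Y = 2X + 1:

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```

Here X = 0 is part of the data, so the intercept of 1.0 is simply the fitted value of Y at X = 0.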
ENJOY LEARNING 👍👍
Today's Question - What are some ways I can make my model more robust to outliers?
There are several ways to make a model more robust to outliers, from different points of view (data preparation or model building). Here an outlier is assumed to be an unwanted, unexpected, or must-be-wrong value according to current human knowledge (e.g. no one is 200 years old), rather than an event that is possible but rare.
Outliers are usually defined in relation to the distribution. They can therefore be removed in the pre-processing step (before any learning step). If the data are approximately normal, a standard-deviation threshold such as mean +/- 2*SD can be used. If the distribution is not normal or is unknown, thresholds based on the interquartile range (IQR = Q3 - Q1) can be used instead, where Q1 is the "middle" value in the first half of the rank-ordered data set and Q3 is the "middle" value in the second half.
Moreover, data transformation (e.g. log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record small values, Winsorization may be useful. This type of transformation has the same effect as clipping signals (i.e. it replaces extreme data values with less extreme values). Another option to reduce the influence of outliers is to use mean absolute error rather than mean squared error.
For model building, some models are resistant to outliers (e.g. tree-based approaches), as are non-parametric tests. Similar to the median effect, tree models divide each node into two at each split. Thus, at each split, all data points in a bucket are treated equally regardless of any extreme values they may have.
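The IQR thresholds and the clipping idea can be sketched with the standard library. The multiplier k = 1.5 is the common Tukey convention and is an assumption here, as are the sample ages and the helper names.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Lower and upper thresholds at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]

ages = [21, 25, 27, 30, 31, 33, 35, 38, 40, 200]
low, high = iqr_fences(ages)

outliers = [a for a in ages if a < low or a > high]
clipped = winsorize(ages, low, high)

print(outliers)  # [200]
```

Removing the flagged values corresponds to the pre-processing option, while `clipped` keeps every row and only pulls the extreme value back to the upper threshold.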