Data Science & Machine Learning – Telegram
Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free

For collaborations: @love_data
Do you want a daily quiz to enhance your knowledge?
Anonymous Poll
97%
Yes
3%
No
Data Science Interview Questions
[Part - 11]

Q1.  Difference between R square and Adjusted R Square.

Ans. R2 measures the proportion of variation in the dependent variable explained by the model, but it never decreases when more predictors are added, even irrelevant ones. Adjusted R2 corrects for the number of predictors: it increases only when a new variable improves the model more than would be expected by chance, so it reflects only the independent variables that actually affect the dependent variable.
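As a minimal sketch (the data and predictor counts below are invented for illustration), both statistics can be computed by hand:

```python
# Sketch: R-squared and adjusted R-squared from scratch.
# y_true/y_pred and the predictor counts are made-up illustration values.

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # n = number of observations, p = number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_true = [3.0, 4.5, 6.1, 7.9, 10.2]
y_pred = [3.2, 4.4, 6.0, 8.1, 10.0]
r2 = r_squared(y_true, y_pred)

# For the same fit quality, adjusted R2 shrinks as predictors are added.
print(round(r2, 4))
print(round(adjusted_r_squared(r2, n=5, p=1), 4))
print(round(adjusted_r_squared(r2, n=5, p=3), 4))
```

Note how adjusted R2 with p=3 is lower than with p=1 even though R2 is unchanged: that is the penalty for extra predictors.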


Q2. Difference between Precision and Recall.

Ans. Precision is the fraction of predicted positives that are truly positive: true positives over true positives plus false positives. Recall is the fraction of actual positives the model finds: true positives over true positives plus false negatives.
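A small sketch of both formulas (the labels below are invented):

```python
# Sketch: precision = TP / (TP + FP), recall = TP / (TP + FN).

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # one miss (FN), one false alarm (FP)
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.75 0.75
```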


Q3.  Assumptions of Linear Regression.
Ans. There are four assumptions associated with a linear regression model. Linearity: the relationship between X and the mean of Y is linear. Homoscedasticity: the variance of the residuals is the same for any value of X. Independence: observations are independent of each other. Normality: for any fixed value of X, the residuals are normally distributed.


Q4. Difference between Random Forest and Decision Tree.

Ans. A decision tree is a single model built from a sequence of decisions, whereas a random forest combines many decision trees trained on bootstrapped samples of the data. A single decision tree is fast to train and easy to interpret, but prone to overfitting. A random forest requires more rigorous training and is slower, but is usually more accurate and more robust.


Q5. How does K-means work?

Ans. K-means clustering uses “centroids”: K different randomly-initiated points in the data. Every data point is assigned to its nearest centroid; after every point has been assigned, each centroid is moved to the average of all the points assigned to it. These two steps repeat until the assignments no longer change.
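A minimal one-dimensional sketch of that loop (function name, data, and seed are all invented for illustration):

```python
import random

# Minimal 1-D K-means sketch: assign each point to its nearest centroid,
# then move each centroid to the mean of its points, until stable.

def kmeans_1d(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K randomly-initiated points
    while True:
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # assignments stable -> converged
            return sorted(new_centroids)
        centroids = new_centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.8]  # two obvious groups
centers = kmeans_1d(points, k=2)
print(centers)  # centroids settle near 1.0 and 9.1
```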


Q6.  How do you generally choose among different classification models to decide which one is performing the best?

Ans. Here are some important considerations while choosing an algorithm:

Size of the training data, accuracy and/or interpretability of the output, training speed, linearity, and the number of features.


Q7. How do you perform feature selection?

Ans. Unsupervised methods do not use the target variable, e.g. removing redundant variables via correlation.

Supervised methods use the target variable, e.g. removing irrelevant variables. Wrapper methods such as RFE (recursive feature elimination) search for well-performing subsets of features.


Q8. What is an intercept in a Linear Regression? What is its significance?

Ans. The intercept (often labeled the constant) is the point where the regression line crosses the y-axis: the expected mean value of Y when all X = 0. Starting with a regression equation with one predictor X: if X sometimes equals 0, the intercept is simply the expected mean value of Y at that value; if X never equals 0, the intercept has no intrinsic meaning. In some analyses, the regression model only becomes significant when we remove the intercept, and the regression line reduces to Y = b*X + error.
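A quick sketch with invented data (generated exactly as y = 2 + 3x) to see the intercept fall out of a closed-form simple regression fit:

```python
# Sketch: closed-form simple linear regression; the intercept `a`
# is the predicted Y at X = 0.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    b = num / den        # slope
    a = my - b * mx      # intercept: expected Y when X = 0
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]  # exactly y = 2 + 3x
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 3.0
```

Since X = 0 is inside the observed range here, the intercept 2.0 is directly interpretable as the expected Y at X = 0.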

ENJOY LEARNING 👍👍
In which type of learning is a target variable not required?
Anonymous Quiz
22%
Supervised
78%
Unsupervised
Today's Question -  What are some ways I can make my model more robust to outliers?


There are several ways to make a model more robust to outliers, from different points of view (data preparation or model building). An outlier in this question and answer is assumed to be an unwanted, unexpected, or must-be-wrong value given current human knowledge (e.g. no one is 200 years old), rather than a rare event which is possible but unlikely.

Outliers are usually defined in relation to the distribution, so they can be removed in the pre-processing step (before any learning step). For roughly normal data, standard deviations can be used as thresholds (e.g. flag points outside mean +/- 2*SD). For non-normal or unknown distributions, use the interquartile range: IQR = Q3 - Q1, where Q1 is the "middle" value of the first half of the rank-ordered data and Q3 is the "middle" value of the second half; points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged.
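The IQR rule can be sketched like this (sample data invented; quartile conventions vary slightly between libraries):

```python
import statistics

# Sketch of the IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

def iqr_bounds(data):
    ordered = sorted(data)
    mid = len(ordered) // 2
    lower_half = ordered[:mid]                      # first half of ranked data
    upper_half = ordered[mid + len(ordered) % 2:]   # second half (median excluded)
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 11, 10, 95]  # 95 looks like a bad reading
lo, hi = iqr_bounds(data)
kept = [x for x in data if lo <= x <= hi]
print(kept)  # the 95 is dropped
```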

Moreover, data transformation (e.g. a log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record small values, Winsorization may be useful. This type of transformation has the same effect as clipping a signal: it replaces extreme data values with less extreme ones. Another option to reduce the influence of outliers is using mean absolute error rather than mean squared error.

For model building, some models are resistant to outliers (e.g. tree-based approaches) or non-parametric tests. Similar to the median effect, tree models divide each node into two in each split. Thus, at each split, all data points in a bucket could be equally treated regardless of extreme values they may have.
DATA SCIENCE INTERVIEW QUESTIONS
[PART -12]

Q. What are Entropy and Information gain in Decision tree algorithm?

A. Entropy is a measure of impurity or uncertainty in a set of data, used in information theory; it determines how a decision tree splits the data. Information gain is the reduction in entropy achieved by splitting a node on a given attribute: at each node, the tree chooses the split with the highest information gain.
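A short sketch of both quantities on invented toy labels:

```python
from math import log2
from collections import Counter

# Sketch: entropy of a label set, and the information gain of a split
# (parent entropy minus the weighted entropy of the children).

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ['yes'] * 4 + ['no'] * 4   # maximally impure: entropy = 1.0 bit
split = [['yes'] * 4, ['no'] * 4]   # a perfect split: pure children

print(entropy(parent))              # 1.0
print(information_gain(parent, split))  # 1.0 -- all uncertainty removed
```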



Q. What Will Happen If the Learning Rate Is Set inaccurately (Too Low or Too High)?

A. A learning rate that is too high will cause gradient descent to jump over the global minimum, whereas a learning rate that is too low will cause learning to take too long to converge or to become stuck in an unwanted local minimum.
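This is easy to see on the toy function f(x) = x**2, whose gradient is 2x (the function, step counts, and rates below are invented for illustration):

```python
# Sketch: gradient descent on f(x) = x**2 with three learning rates,
# showing convergence, slow progress, and divergence.

def descend(lr, steps=50, x0=5.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x**2 is 2x
    return x

good = descend(lr=0.1)     # shrinks toward the minimum at 0
slow = descend(lr=0.001)   # barely moves in 50 steps
high = descend(lr=1.5)     # overshoots every step and diverges

print(abs(good), abs(slow), abs(high))
```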



Q. What is meant by ‘curse of dimensionality’?

A. The problem produced by the exponential growth in volume when extra dimensions are added to Euclidean space is known as the "curse of dimensionality." As the number of features grows, the error tends to grow as well. High-dimensional algorithms are more difficult to build and often have running times that grow with the number of dimensions. A higher number of dimensions theoretically allows more information to be stored, but in practice it rarely helps, because real-world data contain more noise and redundancy.



Q. Difference between remove, del and pop?

A. remove() deletes the first matching value/object and does not touch the indexing. del removes the item at a specific index. pop() removes the item at a given index (the last by default) and returns it.
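A quick demo of the three operations on an invented list:

```python
# Demo of the three list-removal operations described above.

nums = [10, 20, 30, 20, 40]

nums.remove(20)      # removes the first matching value
print(nums)          # [10, 30, 20, 40]

del nums[0]          # removes the item at index 0, returns nothing
print(nums)          # [30, 20, 40]

last = nums.pop()    # removes and returns the item at an index (default -1)
print(last, nums)    # 40 [30, 20]
```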


ENJOY LEARNING 👍👍
Which of the following may be involved in a data science project?
Anonymous Quiz
3%
Data Cleaning
4%
Data Visualization
2%
Feature selection
3%
Exploratory data analysis
89%
All of the above
Which of the following is not a machine learning algorithm?
Anonymous Quiz
2%
Linear Regression
6%
K-means clustering
87%
Data Cleaning
5%
Logistic Regression
DATA SCIENCE INTERVIEW QUESTIONS
[ PART - 13]


Q1. How to identify a cause vs. a correlation? Give examples.

Ans. While causation and correlation can exist at the same time, correlation does not imply causation. Causation applies explicitly to cases where action A causes outcome B; correlation is simply a relationship. For example, ice cream sales and sunglasses sales are correlated: as ice cream sales increase, so do sunglasses sales, yet neither causes the other (hot, sunny weather drives both). Causation takes a step further than correlation.

Q2. Precision, accuracy and recall?

Ans. Recall is the ratio of the relevant results returned by the search engine to the total number of relevant results that could have been returned. Precision is the proportion of relevant results in the list of all returned search results. Accuracy is the proportion of all predictions, both positive and negative, that the model gets right.

Q3. How to choose k in k-means?

Ans. A popular method known as the elbow method is used to determine the optimal value of K for the K-Means clustering algorithm. The basic idea is to plot the cost (within-cluster sum of squares) for varying values of K: as K increases, the cost decreases because each cluster holds fewer elements. The "elbow", where the rate of decrease sharply slows, marks a good choice of K.

Q4. Word2vec methods?

Ans. Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.


Q5. Pruning in case of decision trees?

Ans. Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances.

ENJOY LEARNING 👍👍
Which of the following is used to read a CSV file in Python using pandas?
import pandas as pd
Anonymous Quiz
10%
pd.readcsv(file.csv)
80%
pd.read_csv("file.csv")
4%
pd(read_csv.file)
Today's interview Questions & Answers

DATA SCIENCE INTERVIEW QUESTIONS
[PART - 14]

Q1. Feature selection methods for selecting the right variables for building efficient predictive models?

Ans. Some of the Feature selection techniques are: Information Gain, Chi-square test, Correlation Coefficient, Mean Absolute Difference (MAD), Exhaustive selection, Forward selection, Regularization.


Q2. How to treat missing values?

Ans. They are:
1. Listwise (case) deletion
2. Pairwise deletion
3. Mean substitution
4. Regression imputation
5. Maximum likelihood.
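Mean substitution (option 3) is the simplest to sketch; here None stands in for a missing entry and the data are invented. Note that this method shrinks the variance of the variable, which is one reason the other options exist:

```python
# Sketch: mean substitution -- replace each missing value with the
# mean of the observed values.

def mean_impute(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, 30, None, 35, None]
filled = mean_impute(ages)
print(filled)  # [25, 30, 30.0, 35, 30.0]
```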


Q3. Assumptions used in linear regression? What would happen if they are violated?

Ans. 1. Linear relationship.
2. Multivariate normality.
3. No or little multicollinearity.
4. No auto-correlation.
5. Homoscedasticity.

If the data analyzed by linear regression violate one or more of these assumptions, the results of the analysis may be incorrect or misleading.


Q4. How is the grid search parameter different from the random search tuning strategy?

Ans. Random search differs from grid search in that we no longer provide an explicit set of possible values for each hyperparameter; rather, we provide a statistical distribution for each hyperparameter from which values are sampled. Essentially, we define a sampling distribution for each hyperparameter to carry out a randomized search.
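A contrast sketch on a toy objective (the objective, hyperparameter names, and ranges are all invented):

```python
import itertools
import random

# Toy objective with two hyperparameters; higher score is better,
# with the optimum placed at lr = 0.1, reg = 0.01.
def score(lr, reg):
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)

# Grid search: explicit candidate values, exhaustive combinations.
lr_grid = [0.01, 0.1, 1.0]
reg_grid = [0.001, 0.01, 0.1]
best_grid = max(itertools.product(lr_grid, reg_grid),
                key=lambda p: score(*p))

# Random search: sample each hyperparameter from a distribution
# (uniform here) instead of listing explicit values.
rng = random.Random(42)
candidates = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 0.1))
              for _ in range(20)]
best_random = max(candidates, key=lambda p: score(*p))

print(best_grid)
print(best_random)
```

With real models the trade-off is the same: the grid only wins if the optimum happens to lie on it, while random search explores many more distinct values per hyperparameter for the same budget.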


Q5. Is it good to do dimensionality reduction before fitting a Support Vector Model?
Ans. Yes. Support Vector Machines tend to perform better in a reduced space; it is beneficial to perform dimensionality reduction before fitting an SVM when the number of features is large compared to the number of observations.


Q6. ROC curve?
Ans. A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds, plotting the true positive rate against the false positive rate.

ENJOY LEARNING 👍👍
DATA SCIENCE INTERVIEW QUESTIONS
[PART -15]

Q1. Deal with unbalanced binary classification?

Ans. Techniques to handle unbalanced data:
1. Use the right evaluation metrics 
2. Use K-fold Cross-Validation in the right way 
3. Ensemble different resampled datasets 
4. Resample with different ratios 
5. Design your own models


Q2. Activation function?

Ans. Activation functions are mathematical equations that determine the output of a neural network model. An activation function is a non-linear transformation applied to the input before it is sent to the next layer of neurons or finalized as output.



Q3. Dimension reduction?

Ans. Dimensionality reduction reduces the feature space to a smaller set of principal features while preserving as much of the relevant information as possible.

Q4. Why is mean square error a bad measure of model performance?

Ans. Mean Squared Error (MSE) gives a relatively high weight to large errors, so MSE tends to put too much emphasis on large deviations.
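A tiny numeric demo (error values invented) of how squaring inflates one large deviation relative to the mean absolute error (MAE):

```python
# Sketch: MSE vs. MAE on errors containing one large deviation.

errors = [1, 1, 1, 1, 10]

mse = sum(e ** 2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)

print(mse)  # 20.8 -- dominated by the single error of 10
print(mae)  # 2.8
```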


Q5. Remove multicollinearity?

Ans. To remove multicollinearity, we can do two things:
1. Create new features that combine the correlated variables, or
2. Remove some of the correlated variables from the data.


Q6. Long-tailed distribution?

Ans. A long-tailed distribution is one with many occurrences far from the "head", or central part, of the distribution. Most occurrences are concentrated at the early values on the x-axis, while the tail stretches over many rare, extreme values.


Q7. Outlier? Deal with it?

Ans. An outlier is an object that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution error.
Removing outliers is legitimate only for specific reasons, since they can be very informative about the subject area and the data-collection process. If an outlier does not change the results but does affect assumptions, you may drop it. Alternatively, trim the data set by replacing outliers with the nearest "good" values, rather than truncating them completely.


Q8. Example where the median is a better measure than the mean?

Ans. If your data contain outliers, you would typically prefer the median, because otherwise the mean would be dominated by the outliers rather than by the typical values. In short: if you are considering the mean, check your data for outliers first; if any are present, the median is usually the better choice.
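The classic example is salaries (figures invented): one extreme value drags the mean far from the typical values while the median stays put:

```python
import statistics

# Sketch: one extreme salary distorts the mean but not the median.

salaries = [40_000, 45_000, 50_000, 55_000, 1_000_000]

print(statistics.mean(salaries))    # 238000 -- distorted by the outlier
print(statistics.median(salaries))  # 50000 -- still a typical salary
```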


ENJOY LEARNING 👍👍
Which of the following method/s can be used to handle missing values?
Anonymous Quiz
16%
Mean Substitution
6%
Pairwise deletion
11%
Regression imputation
66%
All of the above