Data Science & Machine Learning – Telegram
You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

Answer: This question has enough hints for you to start thinking! Since the data is spread around the median, let's assume a normal distribution, in which the mean, median and mode coincide. In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean, so those values are the affected ones, which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by missing values.
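The ~68% figure can be sanity-checked from the standard normal CDF; a minimal sketch using only the standard library:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

# Fraction of a normal distribution within 1 standard deviation of the mean
within_one_sd = normal_cdf(1) - normal_cdf(-1)   # ~0.6827
unaffected = 1 - within_one_sd                   # ~0.3173
print(round(within_one_sd, 4), round(unaffected, 4))
```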
DATA SCIENCE INTERVIEW QUESTIONS
[PART-20]

1. What relationships exist between a logistic regression’s coefficient and the Odds Ratio?

The odds ratio for a predictor is the exponential of its coefficient: OR = e^β, so a one-unit increase in that predictor multiplies the odds of the outcome by e^β. The coefficients and the odds ratios then represent the effect of each independent variable while controlling for all of the other independent variables in the model, and each coefficient can be tested for significance.
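A tiny sketch of the relationship, with a hypothetical coefficient value chosen for illustration:

```python
import math

# Hypothetical fitted logistic-regression coefficient for one predictor
beta = 0.693  # change in log-odds per unit increase in the predictor

# The odds ratio is the exponentiated coefficient: here the odds roughly
# double for each one-unit increase in the predictor.
odds_ratio = math.exp(beta)
print(round(odds_ratio, 3))
```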

2. What’s the relationship between Principal Component Analysis (PCA) and Linear & Quadratic Discriminant Analysis (LDA & QDA)

LDA focuses on finding a feature subspace that maximizes the separability between the groups. Principal component analysis, by contrast, is an unsupervised dimensionality-reduction technique: it ignores the class labels and focuses on capturing the directions of maximum variation in the data set. PC1, the first principal component formed by PCA, accounts for the maximum variation in the data; PC2 does the second-best job at capturing variation, and so on.

LD1, the first new axis created by Linear Discriminant Analysis, captures the most variation between the groups or categories, followed by LD2, and so on.
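The PCA side of this can be sketched by hand with NumPy on synthetic, label-free data: the eigenvectors of the covariance matrix are the principal components, and the eigenvalue of PC1 accounts for the largest share of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with most of its variance along the first axis
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 1.0]])

# PCA by hand: eigendecomposition of the covariance matrix (no labels used)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
explained = eigvals[order] / eigvals.sum()    # PC1 explains the most variance
print(explained)
```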


3. What’s the difference between logistic and linear regression? How do you avoid local minima?


Linear regression is used to handle regression problems, whereas logistic regression is used to handle classification problems.
Linear regression provides a continuous output, but logistic regression provides a discrete output.
The purpose of linear regression is to find the best-fitted line, while logistic regression goes one step further and passes the line’s values through the sigmoid function.
The loss function in linear regression is the mean squared error, whereas logistic regression is fitted by maximum likelihood estimation.
We can try to prevent our loss function from getting stuck in a local minimum by adding a momentum term: it gives the update a basic impulse in a specific direction and helps it roll past narrow or shallow local minima. Using stochastic gradient descent, whose noisy updates can also jump out of local minima, helps as well.
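A minimal sketch of gradient descent with momentum, using a toy one-dimensional loss f(w) = (w - 3)^2 chosen purely for illustration:

```python
# Toy loss f(w) = (w - 3)^2 with gradient f'(w) = 2(w - 3)
def grad(w: float) -> float:
    return 2 * (w - 3)

w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9   # learning rate and momentum coefficient
for _ in range(200):
    velocity = beta * velocity - lr * grad(w)  # accumulate an "impulse"
    w += velocity

print(round(w, 4))  # converges near the minimum at w = 3
```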

4. Explain the difference between type 1 and type 2 errors.

A Type 1 error is a false positive: it ‘claims’ that an incident has occurred when, in fact, nothing has happened. The classic example is a false fire alarm – the alarm rings when there’s no fire. Conversely, a Type 2 error is a false negative: it ‘claims’ nothing has occurred when something definitely has. Telling a pregnant woman that she isn’t carrying a baby would be a Type 2 error.

ENJOY LEARNING 👍👍
Q. What do you understand by Recall and Precision?

A. Precision is defined as the fraction of relevant instances among all retrieved instances. Recall, sometimes referred to as ‘sensitivity’, is the fraction of retrieved instances among all relevant instances. A perfect classifier has precision and recall both equal to 1.
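Both metrics fall out of the confusion-matrix counts; a small sketch on made-up binary labels:

```python
# Hypothetical true labels and predictions for a binary classifier
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # relevant among retrieved
recall = tp / (tp + fn)     # retrieved among relevant
print(precision, recall)
```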
DATA SCIENCE INTERVIEW QUESTIONS WITH ANSWERS


1. What are the assumptions required for linear regression? What if some of these assumptions are violated?

Ans: The assumptions are as follows:

The sample data used to fit the model is representative of the population

The relationship between X and the mean of Y is linear

The variance of the residual is the same for any value of X (homoscedasticity)

Observations are independent of each other

For any value of X, Y is normally distributed.

Extreme violations of these assumptions will make the results unreliable. Small violations will introduce greater bias or variance into the estimate.


2. What is multicollinearity and how to remove it?

Ans: Multicollinearity exists when an independent variable is highly correlated with another independent variable in a multiple regression equation. This can be problematic because it undermines the statistical significance of an independent variable.

You could use the Variance Inflation Factors (VIF) to determine if there is any multicollinearity between independent variables — a standard benchmark is that if the VIF is greater than 5 then multicollinearity exists.
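The VIF for predictor j is 1/(1 - R²_j), where R²_j comes from regressing that predictor on all the others. A hand-rolled sketch on synthetic data (two nearly collinear columns, one independent):

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance Inflation Factor for each column of X (no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on the remaining columns (plus an intercept)
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r2 = 1 - (y - Z @ beta).var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = a + 0.1 * rng.normal(size=100)   # nearly collinear with a
c = rng.normal(size=100)             # independent of both
print(vif(np.column_stack([a, b, c])))  # first two VIFs blow past 5
```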


3. What is overfitting and how to prevent it?

Ans: Overfitting is an error where the model ‘fits’ the data too well, resulting in a model with high variance and low bias. As a consequence, an overfit model will inaccurately predict new data points even though it has a high accuracy on the training data.

A few approaches to prevent overfitting are:

- Cross-Validation: Cross-validation is a powerful preventative measure against overfitting. Here we use our initial training data to generate multiple mini train-test splits, and then use these splits to tune our model.

- Train with more data: It won’t work every time, but training with more data can help the algorithm detect the signal better and learn the general trends rather than the noise.

- Remove noise: We can remove irrelevant information or noise from our dataset.

- Early Stopping: When you’re training a learning algorithm iteratively, you can measure how well each iteration of the model performs.

Up until a certain number of iterations, new iterations improve the model. After that point, however, the model’s ability to generalize can weaken as it begins to overfit the training data.

Early stopping refers to stopping the training process before the learner passes that point.

- Regularization: It refers to a broad range of techniques for artificially forcing your model to be simpler. There are mainly 3 types of regularization techniques: L1 (lasso), L2 (ridge), and elastic net.

- Ensembling: Here we combine a number of weak learners to obtain a strong model. Ensembles come in two main flavors: bagging and boosting.
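To make the regularization bullet concrete, here is a sketch of L2 (ridge) regression in closed form on synthetic data: the penalty term shrinks the coefficient vector relative to plain least squares, trading a little bias for lower variance.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: only the first of 10 features actually matters
X = rng.normal(size=(50, 10))
w_true = np.zeros(10); w_true[0] = 2.0
y = X @ w_true + 0.5 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)   # lam = 0 recovers ordinary least squares
w_reg = ridge(X, y, 10.0)  # penalty shrinks the coefficient norm
print(np.linalg.norm(w_ols), np.linalg.norm(w_reg))
```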


4. Given two fair dice, what is the probability of getting scores that sum to 4 and to 8?

Ans: There are 3 combinations of rolling a 4 (1+3, 3+1, 2+2):
P(rolling a 4) = 3/36 = 1/12

There are 5 combinations of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(rolling an 8) = 5/36
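Brute-force enumeration of the 36 equally likely outcomes confirms both counts:

```python
from itertools import product

# All 36 ordered outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

p4 = sum(1 for a, b in outcomes if a + b == 4) / 36   # 3/36 = 1/12
p8 = sum(1 for a, b in outcomes if a + b == 8) / 36   # 5/36
print(p4, p8)
```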

ENJOY LEARNING 👍👍
Which models do you know for solving time series problems?

Simple Exponential Smoothing: approximates the time series with an exponential function.

Trend-Corrected Exponential Smoothing (Holt’s Method): exponential smoothing that also models the trend.

Trend- and Seasonality-Corrected Exponential Smoothing (Holt-Winters’ Method): exponential smoothing that also models trend and seasonality.

Time Series Decomposition: decomposes a time series into four components: trend, seasonal variation, cyclic variation and an irregular component.

Autoregressive models: similar to multiple linear regression, except that the dependent variable y_t depends on its own previous values rather than on other independent variables.

Deep learning approaches (RNN, LSTM, etc.)
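Simple exponential smoothing is short enough to sketch by hand: each smoothed value is a weighted average of the newest observation and the previous smoothed value, with a smoothing factor alpha chosen here purely for illustration.

```python
def ses(series, alpha=0.3):
    """Simple exponential smoothing: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(ses([10, 12, 11, 13, 12]))
```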
How is kNN different from k-means clustering?
kNN, or k-nearest neighbors, is a supervised classification algorithm, where k is an integer giving the number of neighboring data points that vote on the classification of a given observation. K-means is an unsupervised clustering algorithm, where k is an integer giving the number of clusters to be created from the given data. The two accomplish entirely different tasks.
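A toy one-dimensional kNN classifier makes the "neighbors vote" idea concrete; the points and labels here are invented for illustration (k-means, by contrast, would take the same points without labels and partition them):

```python
from collections import Counter

# Hypothetical labeled 1-D training points
points = [(1.0, "a"), (1.5, "a"), (5.0, "b"), (5.5, "b"), (6.0, "b")]

def knn_predict(x: float, k: int = 3) -> str:
    """Classify x by majority vote among its k nearest labeled neighbors."""
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict(1.2), knn_predict(5.2))
```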
DATA SCIENCE INTERVIEW QUESTIONS WITH ANSWERS


1. What is a logistic function? What is the range of values of a logistic function?

f(z) = 1/(1 + e^(-z))
The values of a logistic function range between 0 and 1, while the values of z vary from -infinity to +infinity.
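A two-line sketch verifies the range: the output is 0.5 at z = 0 and squeezes toward 0 or 1 for large negative or positive z.

```python
import math

def logistic(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(0), logistic(10), logistic(-10))
```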


2. What is the difference between R square and adjusted R square?

R square and adjusted R square values are used for model validation in the case of linear regression. R square indicates the proportion of variation in the dependent variable explained by all the independent variables together; it never decreases when another variable is added, whether or not that variable is useful. Adjusted R square corrects for this by penalizing the number of predictors, so it only increases when a new variable improves the model more than would be expected by chance.

Thus adjusted R square is always less than or equal to R square.
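The penalty is visible in the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), sketched here with illustrative numbers:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R^2 of 0.80, but the penalty grows with the number of predictors
print(round(adjusted_r2(0.80, n=100, k=5), 4))
print(round(adjusted_r2(0.80, n=100, k=50), 4))
```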


3. What is stratify in train_test_split?

Stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset. So if the input data has 60% 0’s and 40% 1’s as class labels, then the train and test datasets will each have similar proportions.
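What stratify=y does under the hood can be imitated by splitting each class separately, so both subsets keep the class mix; a hand-rolled sketch (no shuffling, synthetic labels):

```python
from collections import Counter

# Synthetic labels: 60% class 0, 40% class 1
y = [0] * 60 + [1] * 40

# Split each class separately so the 60/40 mix survives in both subsets
train, test = [], []
for label in set(y):
    idx = [i for i in range(len(y)) if y[i] == label]
    cut = int(len(idx) * 0.75)   # 75/25 split within each class
    train += idx[:cut]
    test += idx[cut:]

print(Counter(y[i] for i in train))  # 60/40 mix preserved in train
print(Counter(y[i] for i in test))   # ...and in test
```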


4. What is Backpropagation in Artificial Neuron Network?

Backpropagation is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and make the model reliable by increasing its generalization.
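The weight update can be sketched on the smallest possible network, a single sigmoid neuron with squared-error loss; the input, target, and learning rate are chosen for illustration. Backpropagation here is just the chain rule applied to get dL/dw:

```python
import math

x, target = 1.0, 1.0   # one training example
w = 0.0                # initial weight

def forward(w: float) -> float:
    """Single sigmoid neuron: out = sigmoid(w * x)."""
    return 1 / (1 + math.exp(-w * x))

for _ in range(100):
    out = forward(w)
    # Chain rule for L = 0.5*(out - target)^2:
    # dL/dw = (out - target) * out*(1 - out) * x
    grad = (out - target) * out * (1 - out) * x
    w -= 0.5 * grad    # gradient-descent step on the weight

print(forward(w))  # output has moved toward the target
```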

ENJOY LEARNING 👍👍
Machine learning .pdf
5.3 MB
Core machine learning concepts explained through memes and simple charts created by Mihail Eric.
🔰 Python for Machine Learning & Data Science Masterclass

44 Hours 📦 170 Lessons

Learn about Data Science and Machine Learning with Python! Including Numpy, Pandas, Matplotlib, Scikit-Learn and more!

Taught By: Jose Portilla

Download Full Course: https://news.1rj.ru/str/datasciencefree/69
Download All Courses: https://news.1rj.ru/str/datasciencefree/2