Some interview questions related to data science
1- What is the difference between structured and unstructured data?
2- What is multicollinearity, and how do you remove it?
3- Which algorithms do you use to find the most correlated features in a dataset?
4- Define entropy.
5- What is the workflow of principal component analysis?
6- What are the applications of principal component analysis other than dimensionality reduction?
7- What is a convolutional neural network? Explain how it works.
Decision trees and Random forests?
A decision tree is a supervised learning algorithm (it needs a pre-defined target variable) that is used mostly for classification, although it also handles regression. It works with both categorical and continuous input and output variables. The technique splits the population or sample into two or more homogeneous sets (sub-populations) based on the most significant splitter / differentiator among the input variables.
Random forest is a versatile machine learning method that can perform both regression and classification. Its feature importance scores can guide feature selection (a form of dimensionality reduction), and it tends to be fairly robust to outliers; some implementations also handle missing values. It is an ensemble learning method: many decision trees, each trained on a random sample of the data, are combined into one stronger model.
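To make "most significant splitter" concrete, here is a minimal sketch of how a tree might pick one split. It assumes Gini impurity as the homogeneity measure (the post does not say which measure is used), and the toy data and the names gini and best_split are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        # Try every observed value of the feature as a candidate threshold.
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty groups
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical toy data: [age, income]; the label says whether the person bought.
rows = [[25, 40], [30, 60], [35, 45], [45, 80], [50, 42], [55, 90]]
labels = ["no", "no", "no", "yes", "yes", "yes"]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

On this toy data the split "age <= 35" separates the two labels completely, so it wins over every income threshold. A full tree repeats the same search inside each resulting group, and a random forest repeats the whole procedure on many random samples of the rows and combines the trees' votes.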
Top free Data Science resources
@datasciencefun
1. CS109 Data Science
http://cs109.github.io/2015/pages/videos.html
2. Data Science Essentials
https://www.edx.org/course/data-science-essentials
3. Learning From Data from California Institute of Technology
http://work.caltech.edu/telecourse
4. Mathematics for Machine Learning by University of California, Berkeley
https://gwthomas.github.io/docs/math4ml.pdf?fbclid=IwAR2UsBgZW9MRgS3nEo8Zh_ukUFnwtFeQS8Ek3OjGxZtDa7UxTYgIs_9pzSI
5. Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravindran Kannan
https://www.cs.cornell.edu/jeh/book.pdf?fbclid=IwAR19tDrnNh8OxAU1S-tPklL1mqj-51J1EJUHmcHIu2y6yEv5ugrWmySI2WY
6. Python Data Science Handbook
https://jakevdp.github.io/PythonDataScienceHandbook/?fbclid=IwAR34IRk2_zZ0ht7-8w5rz13N6RP54PqjarQw1PTpbMqKnewcwRy0oJ-Q4aM
7. CS 221 ― Artificial Intelligence
https://stanford.edu/~shervine/teaching/cs-221/
8. Ten Lectures and Forty-Two Open Problems in the Mathematics of Data Science
https://ocw.mit.edu/courses/mathematics/18-s096-topics-in-mathematics-of-data-science-fall-2015/lecture-notes/MIT18_S096F15_TenLec.pdf
9. Python for Data Analysis by Boston University
https://www.bu.edu/tech/files/2017/09/Python-for-Data-Analysis.pptx
10. Data Mining by University of Buffalo
https://cedar.buffalo.edu/~srihari/CSE626/index.html?fbclid=IwAR3XZ50uSZAb3u5BP1Qz68x13_xNEH8EdEBQC9tmGEp1BoxLNpZuBCtfMSE
Share the channel link with friends
http://t.me/datasciencefun
#freecourses
You are given a data set with missing values that are spread within 1 standard deviation of the median. What percentage of the data would remain unaffected? Why?
Answer: This question has enough hints to get you started. Since the data is spread around the median, let’s assume a normal distribution, where the mean, median, and mode coincide. In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean, which leaves ~32% outside that range. Therefore, ~32% of the data would remain unaffected by missing values.
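The ~68% / ~32% figures can be checked with the error function. This is a small sketch under the same normal-distribution assumption as the answer; the helper name fraction_within is made up for illustration.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

within_one_sd = fraction_within(1.0)
outside_one_sd = 1.0 - within_one_sd
print(f"within 1 SD: {within_one_sd:.4f}, outside: {outside_one_sd:.4f}")
# prints "within 1 SD: 0.6827, outside: 0.3173"
```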
DATA SCIENCE INTERVIEW QUESTIONS
[PART-20]
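1. What relationship exists between a logistic regression’s coefficients and the odds ratio?
Logistic regression models the log-odds of the outcome as a linear function of the inputs, so exponentiating a coefficient gives the odds ratio for a one-unit increase in that variable. The coefficients and odds ratios therefore represent the effect of each independent variable while controlling for all of the other independent variables in the model, and each coefficient can be tested for significance.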
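A quick numeric check of the coefficient / odds ratio link, as a sketch: the intercept and coefficient below are made-up values for a one-variable model, not fitted to any data.

```python
import math

def predict_proba(x, intercept, coef):
    """Logistic model: probability of the positive class for input x."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

# Hypothetical model parameters, chosen only for illustration.
intercept, coef = -1.0, 0.7

# Compare the odds at x = 2 and x = 3, a one-unit increase.
odds_ratio = odds(predict_proba(3.0, intercept, coef)) / odds(predict_proba(2.0, intercept, coef))
print(odds_ratio, math.exp(coef))  # the two numbers agree up to rounding
```

The ratio is the same for any starting value of x, which is why a single odds ratio per variable summarizes its effect.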
2. What’s the relationship between Principal Component Analysis (PCA) and Linear & Quadratic Discriminant Analysis (LDA & QDA)?
PCA is an unsupervised dimensionality reduction technique: it ignores the class labels and captures the directions of maximum variation in the data set. PC1, the first principal component, accounts for the most variation in the data; PC2 does the second-best job, and so on. LDA is supervised: it looks for a feature subspace that maximizes the separability between the groups.
LD1, the first new axis created by LDA, captures the most variation between the groups or categories, then comes LD2, and so on. QDA is a related classifier that lets each class have its own covariance matrix, which gives quadratic rather than linear decision boundaries.
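The contrast between PC1 and LD1 shows up clearly on a tiny 2-D example. The sketch below is pure Python and only handles two features and two classes; the data points and function names are made up for illustration. The classes are stretched along x but separated along y.

```python
import math

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points):
    """Return (sxx, sxy, syy): sums of squared deviations around the mean."""
    mx, my = mean(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxx, sxy, syy

def pca_first_axis(points):
    """PC1: top eigenvector of the 2x2 scatter matrix; class labels are not used."""
    sxx, sxy, syy = scatter(points)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(angle), math.sin(angle))

def lda_first_axis(class0, class1):
    """LD1 for two classes: inverse within-class scatter times the mean difference."""
    a0, b0, c0 = scatter(class0)
    a1, b1, c1 = scatter(class1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1
    m0, m1 = mean(class0), mean(class1)
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    det = a * c - b * b
    # Inverse of [[a, b], [b, c]] is (1 / det) * [[c, -b], [-b, a]].
    wx, wy = (c * dx - b * dy) / det, (-b * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

# Hypothetical data: wide spread along x, class separation along y.
class0 = [(-4, 0.1), (-2, -0.1), (0, 0.0), (2, 0.1), (4, -0.1)]
class1 = [(-4, 1.1), (-2, 0.9), (0, 1.0), (2, 1.1), (4, 0.9)]

pc1 = pca_first_axis(class0 + class1)
ld1 = lda_first_axis(class0, class1)
print("PC1:", pc1)
print("LD1:", ld1)
```

PC1 comes out almost parallel to the x-axis, where the overall variance is largest, while LD1 comes out almost parallel to the y-axis, the direction that actually separates the two classes.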
3. What’s the difference between logistic and linear regression? How do you avoid local minima?
Linear regression is used for regression problems, whereas logistic regression is used for classification problems.
Linear regression produces a continuous output, while logistic regression produces a probability that is thresholded into a discrete class.
Linear regression finds the best-fitting line, while logistic regression goes one step further and passes the line’s values through the sigmoid curve.
Linear regression is usually fitted by minimizing the mean squared error, whereas logistic regression is fitted by maximum likelihood estimation, which is equivalent to minimizing the log loss.
Both of these losses are convex, so local minima matter mainly for non-convex models such as neural networks. There, we can try to keep the optimizer from getting stuck in a local minimum by adding a momentum term. It gives the parameter updates an impulse in a specific direction and helps them roll past narrow or shallow local minima. Stochastic gradient descent helps as well, because the noise in its updates can push the parameters out of such minima.
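The pieces named in this answer can be sketched in a few lines. The weights, inputs, learning rate, and momentum value below are made-up values for illustration. The momentum loop only shows the update rule on a simple convex function; it does not demonstrate escaping a local minimum.

```python
import math

def sigmoid(z):
    """Squash a line value into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def mse(y_true, y_pred):
    """Mean squared error, the usual linear regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred):
    """Negative log-likelihood, the loss minimized by logistic regression."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)

# Hypothetical weight and bias shared by both models.
w, b = 2.0, -1.0
xs = [-1.0, 0.0, 0.5, 2.0]

line_values = [w * x + b for x in xs]               # linear regression output
probabilities = [sigmoid(z) for z in line_values]   # logistic regression output
classes = [1 if p >= 0.5 else 0 for p in probabilities]

def minimize_with_momentum(grad, x, lr=0.1, momentum=0.9, steps=500):
    """Gradient descent where each step keeps part of the previous step."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(x)
        x += velocity
    return x

# Minimize f(x) = x^2 (gradient 2x), starting from x = 5.
x_min = minimize_with_momentum(lambda x: 2 * x, x=5.0)
print(line_values, classes, x_min)
```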
4. Explain the difference between Type 1 and Type 2 errors.
A Type 1 error is a false positive: it ‘claims’ that an incident has occurred when, in fact, nothing has occurred. The classic example is a false fire alarm, where the alarm rings when there’s no fire. In contrast, a Type 2 error is a false negative: it ‘claims’ that nothing has occurred when something definitely has. Telling a pregnant woman that she isn’t carrying a baby would be a Type 2 error.
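Counting the two error types from predictions is straightforward. The fire / alarm readings below are made up to match the fire alarm example, and error_counts is a hypothetical helper name.

```python
def error_counts(actual, predicted):
    """Count the outcomes of binary predictions (1 = event, 0 = no event)."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["true_positive"] += 1
        elif p == 1 and a == 0:
            counts["false_positive"] += 1   # Type 1 error: alarm without fire
        elif p == 0 and a == 0:
            counts["true_negative"] += 1
        else:
            counts["false_negative"] += 1   # Type 2 error: fire without alarm
    return counts

# Hypothetical readings: was there a fire, and did the alarm ring?
fire = [0, 0, 1, 1, 0, 1]
alarm = [1, 0, 1, 0, 0, 1]

counts = error_counts(fire, alarm)
print(counts)
```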
ENJOY LEARNING 👍👍