DATA SCIENCE INTERVIEW QUESTIONS
[PART-15]
𝐐1. 𝐃𝐞𝐚𝐥 𝐰𝐢𝐭𝐡 𝐮𝐧𝐛𝐚𝐥𝐚𝐧𝐜𝐞𝐝 𝐛𝐢𝐧𝐚𝐫𝐲 𝐜𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧?
𝐀ns. Techniques to handle unbalanced data:
1. Use the right evaluation metrics
2. Use K-fold Cross-Validation in the right way
3. Ensemble different resampled datasets
4. Resample with different ratios
5. Design your own models
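As a sketch of point 1, the toy numbers below (an assumed 95/5 class split) show why accuracy is the wrong metric on unbalanced data: a model that always predicts the majority class scores 95% accuracy yet has zero recall on the minority class.

```python
# Toy illustration: 100 samples, 95 negative and 5 positive (assumed split).
# A "classifier" that always predicts the majority class looks accurate
# but completely misses the minority class.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

tn, fn = 95, 5          # always predicting "negative": no positives at all
tp, fp = 0, 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall = precision_recall(tp, fp, fn)
print(accuracy)   # 0.95 -- looks impressive
print(recall)     # 0.0  -- the model never finds a positive case
```

This is why metrics such as precision, recall, and F1 are preferred over raw accuracy on imbalanced problems.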
𝐐2. 𝐀𝐜𝐭𝐢𝐯𝐚𝐭𝐢𝐨𝐧 𝐟𝐮𝐧𝐜𝐭𝐢𝐨𝐧?
𝐀ns. Activation functions are mathematical equations that determine the output of a neural network's nodes. They apply a non-linear transformation to a neuron's input before passing the result to the next layer or producing the final output; without this non-linearity, a stack of layers would collapse into a single linear model.
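For instance, two common activation functions, sigmoid and ReLU, can be written in a few lines of plain Python (a minimal sketch; real frameworks apply these element-wise to whole tensors):

```python
import math

def sigmoid(x):
    # Squashes any real input into (0, 1); common for output probabilities.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Passes positive inputs through unchanged, zeroes out negatives.
    return max(0.0, x)

print(sigmoid(0.0))           # 0.5
print(relu(-3.0), relu(2.0))  # 0.0 2.0
```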
𝐐3. 𝐃𝐢𝐦𝐞𝐧𝐬𝐢𝐨𝐧 𝐫𝐞𝐝𝐮𝐜𝐭𝐢𝐨𝐧?
𝐀ns. Dimensionality reduction reduces the feature space to a smaller set of principal features while preserving as much of the relevant information as possible, which speeds up training and can reduce overfitting.
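One of the simplest illustrations is dropping near-constant features with a variance threshold (the toy data and 0.1 cut-off below are assumptions for the sketch); techniques such as PCA follow the same spirit of keeping only informative directions:

```python
from statistics import pvariance

# Toy dataset: three features per row; the middle feature is constant,
# so it carries no information and can be dropped.
rows = [[2.0, 1.0, 5.0],
        [4.0, 1.0, 1.0],
        [6.0, 1.0, 3.0]]
columns = list(zip(*rows))
keep = [i for i, col in enumerate(columns) if pvariance(col) > 0.1]
reduced = [[row[i] for i in keep] for row in rows]
print(keep)     # [0, 2] -- the constant feature (index 1) is removed
print(reduced)
```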
𝐐4. 𝐖𝐡𝐲 𝐢𝐬 𝐦𝐞𝐚𝐧 𝐬𝐪𝐮𝐚𝐫𝐞 𝐞𝐫𝐫𝐨𝐫 𝐚 𝐛𝐚𝐝 𝐦𝐞𝐚𝐬𝐮𝐫𝐞 𝐨𝐟 𝐦𝐨𝐝𝐞𝐥 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞?
𝐀ns. Mean Squared Error (MSE) gives a relatively high weight to large errors, so a few large deviations (for example, from outliers) can dominate the metric and misrepresent how the model performs on typical data.
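A small numeric sketch (the values are made up) shows how a single large error dominates MSE far more than it dominates MAE:

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
small_errors = [1.1, 2.1, 3.1, 4.1]  # four small errors of 0.1
one_spike    = [1.0, 2.0, 3.0, 8.0]  # a single large error of 4.0

print(mse(y_true, small_errors), mse(y_true, one_spike))  # ~0.01 vs 4.0
print(mae(y_true, small_errors), mae(y_true, one_spike))  # ~0.1  vs 1.0
```

Under MAE the spike is 10x worse than the small errors; under MSE it is 400x worse, because squaring amplifies large deviations.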
𝐐5. 𝐑𝐞𝐦𝐨𝐯𝐞 𝐦𝐮𝐥𝐭𝐢𝐜𝐨𝐥𝐥𝐢𝐧𝐞𝐚𝐫𝐢𝐭𝐲?
𝐀ns. To remove multicollinearity, we can do one of two things:
1. Combine the correlated features into new features.
2. Drop the redundant features from our data.
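A quick way to spot candidates for removal is to compute pairwise correlations; the two columns below (height in centimetres and in inches, invented numbers) are an assumed example of features carrying the same information:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    # Pearson correlation coefficient from first principles.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [59.1, 63.0, 66.9, 70.9]  # same measurement in different units

r = pearson(height_cm, height_in)
print(round(r, 4))  # ~1.0 -> keep only one of the two columns
```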
𝐐6. 𝐥𝐨𝐧𝐠-𝐭𝐚𝐢𝐥𝐞𝐝 𝐝𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧 ?
𝐀ns. A long-tailed distribution is one with many occurrences far from the "head", or central part, of the distribution. Most occurrences are concentrated at the early values on the x-axis, while a long tail of progressively rarer values extends to the right.
𝐐7. 𝐎𝐮𝐭𝐥𝐢𝐞𝐫? 𝐃𝐞𝐚𝐥 𝐰𝐢𝐭𝐡 𝐢𝐭?
𝐀ns. An outlier is an observation that deviates significantly from the rest of the data. Outliers can be caused by measurement or execution errors.
Removing outliers is legitimate only for specific reasons, because outliers can be very informative about the subject area and the data collection process. If an outlier does not change the results but does affect assumptions, you may drop it. Alternatively, rather than truncating outliers completely, you can trim the dataset by replacing them with the nearest "good" values (winsorizing).
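A common rule of thumb for flagging outliers is the 1.5×IQR fence; the data below is invented for the sketch and the quartile lookup is deliberately crude:

```python
def iqr_bounds(values):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]  # crude quartiles, fine for a sketch
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 11, 95]    # 95 looks like an execution error
low, high = iqr_bounds(data)
outliers = [x for x in data if x < low or x > high]
cleaned = [min(max(x, low), high) for x in data]  # winsorize instead of drop
print(outliers)  # [95]
```

The last line shows the "replace with the nearest good value" option: 95 is clipped to the upper fence instead of being deleted.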
𝐐8. 𝐄𝐱𝐚𝐦𝐩𝐥𝐞 𝐰𝐡𝐞𝐫𝐞 𝐭𝐡𝐞 𝐦𝐞𝐝𝐢𝐚𝐧 𝐢𝐬 𝐚 𝐛𝐞𝐭𝐭𝐞𝐫 𝐦𝐞𝐚𝐬𝐮𝐫𝐞 𝐭𝐡𝐚𝐧 𝐭𝐡𝐞 𝐦𝐞𝐚𝐧 ?
𝐀ns. If your data contains outliers, you would typically use the median, because otherwise the mean would be dominated by the outliers rather than by the typical values. In short: if you are considering the mean, first check your data for outliers; if there are any, the median is usually the better choice.
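Salaries are the classic example (the numbers below are invented):

```python
from statistics import mean, median

# Annual salaries in thousands; one executive salary skews the data.
salaries = [40, 42, 45, 48, 50, 900]
print(mean(salaries))    # 187.5 -- dominated by the single outlier
print(median(salaries))  # 46.5  -- much closer to a typical salary
```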
ENJOY LEARNING 👍👍
Quiz: Which of the following methods can be used to handle missing values?
1. Mean substitution
2. Pairwise deletion
3. Regression imputation
4. All of the above
Answer: All of the above
Quiz: Which of the following is not a feature selection technique?
1. Information Gain
2. Forward Selection
3. Regularisation
4. K-means clustering
Answer: K-means clustering (it is a clustering algorithm, not a feature selection technique)
Data Science Interview Questions
[PART-16]
Q. How can outlier values be treated?
A. An outlier is an observation in a dataset that differs significantly from the rest of the data; it is much larger or smaller than most other values.
Some common methods of treating outliers: trimming or removing the outlier, quantile-based flooring and capping, and mean/median imputation.
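A minimal sketch of quantile-based flooring and capping (the 25th/75th-percentile cut-offs, the crude percentile lookup, and the data are all assumptions; real code would usually use finer percentiles such as the 5th/95th):

```python
def floor_and_cap(values, low_q=0.25, high_q=0.75):
    s = sorted(values)
    lo = s[int(low_q * (len(s) - 1))]   # crude percentile lookup
    hi = s[int(high_q * (len(s) - 1))]
    # Values below lo are floored to lo; values above hi are capped to hi.
    return [min(max(v, lo), hi) for v in values]

data = [1, 10, 11, 12, 13, 14, 200]     # 1 and 200 are the extremes
print(floor_and_cap(data))  # [10, 10, 11, 12, 13, 13, 13]
```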
Q. What is root cause analysis?
A. A root cause is a factor that contributed to a nonconformance and should be eradicated permanently through process improvement. The root cause is the most fundamental problem, the most fundamental reason, that sets in motion the entire cause-and-effect chain leading to the problem(s). Root cause analysis (RCA) is a term that refers to a variety of approaches, tools, and procedures used to identify the root causes of problems. Some RCA approaches are aimed squarely at uncovering true root causes, others are more general problem-solving procedures, and still others simply support the core activity of root cause analysis.
Q. What is bias and variance in Data Science?
A. Bias arises from the model's simplifying assumptions, which simplify the target function and make it easier to estimate. In its most basic form, bias is the difference between the predicted value and the expected value. Variance refers to how much the estimate of the target function will fluctuate with different training data. In contrast to bias, high variance occurs when the model fits the fluctuations, or noise, in the data.
Q. What is a confusion matrix?
A. A confusion matrix is a method of summarising a classification algorithm's performance. Calculating a confusion matrix helps you understand what your classification model is getting right and where it is going wrong. It gives us four counts: true positives (event values correctly predicted), false positives (event values incorrectly predicted), true negatives (no-event values correctly predicted), and false negatives (no-event values incorrectly predicted).
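The four counts can be computed directly from paired labels; the example labels below are invented:

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count each of the four confusion-matrix cells for a binary problem.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```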
ENJOY LEARNING 👍👍
Quiz: Which of the following is not a Python library?
1. Pandas
2. NumPy
3. Matplotlib
4. Scikit-learn
5. Array
Answer: Array
Quiz: Which of the following is not a machine learning algorithm?
1. Linear Regression
2. Random Forest
3. Standard Scaler
4. Decision Tree
5. Logistic Regression
Answer: Standard Scaler (it is a feature-scaling preprocessing step, not a learning algorithm)
Quiz: Which of the following is not a supervised algorithm?
1. Linear Regression
2. Logistic Regression
3. Clustering
4. Decision Tree
Answer: Clustering
Quiz: Which of the following tools can be used for data visualization?
1. Tableau
2. Matplotlib
3. Power BI
4. All of the above
Answer: All of the above
Data Science & Machine Learning
Do you want a daily quiz to enhance your knowledge?
That's an amazing response from you guys ❤️👍
Quiz: Which of the following cannot give 10 as an answer?
1. 5*2
2. 2+5*2-2
3. 2+5*(2-2)
4. 3*2+9//2
Answer: 2+5*(2-2), which evaluates to 2
Which of the following cannot give 10 as an answer?
Well done guys!!
Explanation for those who marked the wrong answer:
Read the question again. The answer to 9//2 is 4, not 4.5, because // is floor (integer) division.
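The distinction is between Python's true division (/) and floor division (//):

```python
print(9 / 2)            # 4.5 -- true division keeps the fraction
print(9 // 2)           # 4   -- floor division discards it
print(3 * 2 + 9 // 2)   # 6 + 4 = 10
print(2 + 5 * (2 - 2))  # 2 + 0 = 2, the only option that cannot give 10
```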
Mathematics for Machine Learning
Published by Cambridge University Press, April 2020
https://mml-book.com
PDF: https://mml-book.github.io/book/mml-book.pdf
Neural Networks and Learning Machines, Third Edition (Simon Haykin)
👇👇
https://cours.etsmtl.ca/sys843/REFS/Books/ebook_Haykin09.pdf
Quiz: Which of the following is not an unsupervised algorithm?
1. K-means clustering
2. Hierarchical Clustering
3. Anomaly detection
4. Logistic Regression
Answer: Logistic Regression
How can a fresher get a job as a data scientist?
The Indian job market is highly resistant to hiring data scientists as freshers. Everyone out there asks for at least two years of experience, but then the question is: where will we get those two years of experience from?
The important thing here is to build a portfolio. As a fresher, you have probably learnt data science through online courses. These only teach you the basics; the analytical skills required to clean data and apply machine learning algorithms come only from practice.
Do some real-world data science projects and participate in Kaggle competitions. Kaggle provides datasets for practice as well. Whatever projects you do, create a GitHub repository for them. Place all your projects there, so that when a recruiter looks at your profile they can see you have hands-on practice and know the basics. This will take you a long way.
Most data science jobs for freshers are only available through off-campus interviews.
Some companies that hire data scientists are:
Siemens
Accenture
IBM
Cerner
Creating a technical portfolio showcases the knowledge you have already gained, and that is essential when you go out there as a fresher and try to find a data scientist job.
7 Steps of the Machine Learning Process
Data Collection: The process of extracting raw datasets for the machine learning task. This data can come from a variety of places, ranging from open-source online resources to paid crowdsourcing. The first step of the machine learning process is arguably the most important. If the data you collect is poor quality or irrelevant, then the model you train will be poor quality as well.
Data Processing and Preparation: Once you’ve gathered the relevant data, you need to process it and make sure that it is in a usable format for training a machine learning model. This includes handling missing data, dealing with outliers, etc.
Feature Engineering: Once you’ve collected and processed your dataset, you will likely need to transform some of the features (and sometimes even drop some features) in order to optimize how well a model can be trained on the data.
Model Selection: Based on the dataset, you will choose which model architecture to use. This is one of the main tasks of industry engineers. Rather than attempting to come up with a completely novel model architecture, most tasks can be thoroughly performed with an existing architecture (or combination of model architectures).
Model Training and Data Pipeline: After selecting the model architecture, you will create a data pipeline for training the model. This means creating a continuous stream of batched data observations to efficiently train the model. Since training can take a long time, you want your data pipeline to be as efficient as possible.
Model Validation: After training the model for a sufficient amount of time, you will need to validate the model’s performance on a held-out portion of the overall dataset. This data needs to come from the same underlying distribution as the training dataset, but needs to be different data that the model has not seen before.
Model Persistence: Finally, after training and validating the model’s performance, you need to be able to properly save the model weights and possibly push the model to production. This means setting up a process with which new users can easily use your pre-trained model to make predictions.
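The validation step above hinges on holding out data the model never trains on. A minimal sketch of such a split (the function name, 80/20 ratio, and fixed seed are assumptions for the example):

```python
import random

def train_val_split(rows, val_fraction=0.2, seed=0):
    # Shuffle a copy of the data, then hold out the last val_fraction of rows.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, val = train_val_split(data)
print(len(train), len(val))   # 80 20
print(set(train) & set(val))  # set() -- no leakage between the splits
```

Fixing the seed makes the split reproducible, which matters when you want to compare models against the same held-out data.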
Machine learning notes in 15 pages (PDF, 11.1 MB)