1. Explain the Gradient Descent algorithm.
Ans. Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.
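For illustration, here is a minimal sketch of batch gradient descent for simple linear regression; the toy data, learning rate, and iteration count are invented for the example:

```python
import numpy as np

# Toy data: y ≈ 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3 * X + 2 + rng.normal(0, 1, size=100)

w, b = 0.0, 0.0          # parameters (slope and intercept)
lr = 0.01                # learning rate
for _ in range(5000):
    error = (w * X + b) - y
    # Gradients of the mean squared error with respect to w and b
    dw = 2 * np.mean(error * X)
    db = 2 * np.mean(error)
    # Step in the direction of the negative gradient
    w -= lr * dw
    b -= lr * db

print(w, b)              # should end up close to 3 and 2
```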
2. Why is logistic regression used for classification instead of linear regression?
Ans. With linear regression, we could treat all predictions >= 0.5 as 1 and all predictions < 0.5 as 0. So why can't classification simply be done this way? Suppose we are classifying an email as spam or not spam and our output y can be 0 (spam) or 1 (not spam). With linear regression, hθ(x) can be greater than 1 or less than 0; even though the prediction should lie between 0 and 1, the model can predict values outside that range. That is why, for a classification task, logistic regression (which passes the linear output through the sigmoid) is used instead.
3. What is the Gini Index?
Ans. The Gini Index is a score that evaluates how good a split is among the classified groups. It lies between 0 and 1, where 0 means all observations in a node belong to one class and values close to 1 mean the elements are randomly distributed across the classes. We therefore want the Gini Index to be as low as possible; it is the impurity measure commonly used to decide the splits in a Decision Tree model.
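A small sketch of how the Gini impurity of a single node could be computed from its labels (the example labels are made up):

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of one node: 1 - sum(p_k^2) over the classes."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_index(["spam"] * 10))                 # 0.0  -> pure node
print(gini_index(["spam"] * 5 + ["ham"] * 5))    # 0.5  -> maximally mixed (2 classes)
```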
4. Why is DBSCAN used over K means and other clustering methods?
Ans. Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data.
Real life data may contain irregularities, like:
Clusters can be of arbitrary shape, such as non-convex clusters.
Data may contain noise.
Given such data, the k-means algorithm has difficulty identifying clusters with these arbitrary shapes, whereas DBSCAN handles them naturally.
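As a rough illustration with scikit-learn on synthetic "two moons" data (non-convex clusters; the eps and min_samples values are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaving half-circles: non-convex clusters with a little noise
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means splits the moons with a roughly straight boundary, while DBSCAN
# tends to recover each moon as one dense cluster (label -1 marks noise points).
print(set(kmeans_labels), set(dbscan_labels))
```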
ENJOY LEARNING 👍👍
Data Science Interview Q&A
1. What are the different types of Pooling? Explain their characteristics.
Max pooling: Once we obtain the feature map of the input, we slide a kernel of a fixed shape across it and keep the maximum value from each region it covers. It is also known as subsampling because, from the entire portion of the feature map covered by the kernel, we sample a single maximum value.
Average pooling: Computes the average value of the region of the feature map covered by the kernel (implementations that work on integer feature maps may floor the result).
Sum pooling: Computes the sum of all elements in that window.
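A small NumPy sketch of 2×2 max, average, and sum pooling with stride 2 on an invented feature map:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [3, 8, 4, 6]], dtype=float)

def pool2d(fm, size=2, stride=2, op=np.max):
    h, w = fm.shape
    out = np.empty(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = fm[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = op(window)   # max, mean or sum of the window
    return out

print(pool2d(feature_map, op=np.max))    # max pooling
print(pool2d(feature_map, op=np.mean))   # average pooling
print(pool2d(feature_map, op=np.sum))    # sum pooling
```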
2. What is a Moving Average Process in Time series?
In time-series analysis, the moving-average (MA) process is a common approach for modeling univariate time series. The moving-average model specifies that the output variable depends linearly on the current and various past values of a stochastic (white-noise) term.
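For example, an MA(1) process can be simulated directly from its definition x_t = μ + ε_t + θ·ε_{t−1}, where ε is white noise (the values of μ and θ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, theta = 500, 10.0, 0.6
eps = rng.normal(0, 1, size=n + 1)      # white-noise shocks

# MA(1): each observation is a linear function of the current and previous shock
x = mu + eps[1:] + theta * eps[:-1]
print(x[:5])
```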
3. What is the difference between SQL having vs where?
The WHERE clause specifies the criteria that individual records must meet to be selected by a query, and it can be used without a GROUP BY clause. The HAVING clause cannot be used without a GROUP BY clause. WHERE filters rows before grouping, while HAVING filters groups after grouping. The WHERE clause cannot contain aggregate functions; the HAVING clause can.
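A small runnable illustration using Python's built-in sqlite3 module (the table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 50), ("alice", 200), ("bob", 30), ("bob", 20), ("carol", 500)])

# WHERE filters individual rows before grouping; HAVING filters the groups afterwards.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 25          -- row-level filter (no aggregates allowed here)
    GROUP BY customer
    HAVING SUM(amount) > 100   -- group-level filter (aggregates allowed)
""").fetchall()
print(rows)   # alice (250.0) and carol (500.0) survive both filters
```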
4. What is Relative cell referencing in excel?
In relative referencing, the cell references in a formula change when the formula is copied to another cell, relative to the destination cell's address. In absolute referencing, the references do not change when the formula is copied, irrespective of the destination cell. Relative referencing is the default and does not require a dollar sign in the formula.
ENJOY LEARNING 👍👍
Machine Learning Glossary | Google Developers
Compilation of key machine-learning and TensorFlow terms, with beginner-friendly definitions. 🤓
https://developers.google.com/machine-learning/glossary/
1. What do you understand by the term silhouette coefficient?
The silhouette coefficient measures how well a data point fits within its cluster: how similar the point is to the other points in its own cluster and how dissimilar it is to points in other clusters. It ranges from -1 to 1, with 1 being the best possible score and -1 the worst.
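With scikit-learn, the mean silhouette coefficient of a clustering can be computed like this (toy blob data, k chosen arbitrarily):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all points: values near 1 indicate tight,
# well-separated clusters; values near 0 or below suggest overlapping clusters.
print(silhouette_score(X, labels))
```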
2. What is the difference between trend and seasonality in time series?
Trends and seasonality are two characteristics of time series metrics that break many models. Trends are continuous increases or decreases in a metric’s value. Seasonality, on the other hand, reflects periodic (cyclical) patterns that occur in a system, usually rising above a baseline and then decreasing again.
3. What is Bag of Words in NLP?
Bag of Words is a commonly used model that depends on word frequencies or occurrences to train a classifier. It creates an occurrence matrix for documents or sentences irrespective of their grammatical structure or word order.
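A quick sketch using scikit-learn's CountVectorizer (the two example sentences are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate the cat food"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # occurrence matrix (documents x vocabulary)

print(vectorizer.get_feature_names_out())  # the vocabulary; word order is lost
print(bow.toarray())                       # raw word counts per document
```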
4. What is the difference between bagging and boosting?
Bagging combines homogeneous weak learners that are trained independently of each other, in parallel, and averages (or votes on) their outputs. Boosting also combines homogeneous weak learners but works differently: the learners are trained sequentially and adaptively, with each new learner focusing on the errors of the previous ones, to improve the predictions of the overall model.
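As a rough scikit-learn illustration, with decision stumps as the homogeneous weak learner (the dataset and hyperparameters are arbitrary; a recent scikit-learn version is assumed, where the parameter is named estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)

# Bagging: many trees trained independently on bootstrap samples, then combined
bagging = BaggingClassifier(estimator=stump, n_estimators=100, random_state=0)
# Boosting: trees trained sequentially, each focusing on the previous ones' mistakes
boosting = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```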
ENJOY LEARNING 👍👍
Commonly used Python libraries are:
👉🏻NumPy: This library is used for scientific computing and working with arrays of data. It provides functions for working with arrays of data, including mathematical operations, linear algebra, and random number generation.
👉🏻Pandas: This library is used for data manipulation and analysis. It provides tools for importing, cleaning, and transforming data, as well as tools for working with time series data and performing statistical analysis.
👉🏻Matplotlib: This library is used for data visualization. It provides functions for creating a wide range of plots, including scatter plots, line plots, bar plots, and histograms.
👉🏻Scikit-learn: This library is used for machine learning. It provides a range of algorithms for classification, regression, clustering, and dimensionality reduction, as well as tools for model evaluation and selection.
👉🏻TensorFlow: This library is used for deep learning. It provides a range of tools and libraries for building and training neural networks, including support for distributed training and hardware acceleration.
1. Compare SVM and Logistic Regression in handling outliers
For Logistic Regression, outliers can have an unusually large effect on the estimates of the regression coefficients: the model will try to find a linear boundary that accommodates them if one exists. The sigmoid does bound the predicted probabilities to (0, 1), which limits the effect of extreme values on the predictions themselves, but outliers should still be examined and handled before fitting.
For SVM, outliers can make the decision boundary deviate severely from the optimal hyperplane. One way for SVM to get around the problem is to introduce slack variables. There is a penalty involved with using slack variables, and how SVM handles outliers depends on how this penalty is imposed.
2. Can you explain how to implement a simple kNN algorithm in code?
The kNN algorithm can be used for a variety of tasks (classification & regression). To implement it, you will need to first calculate the distance between the new data point and all of the training data points. Once you have the distances, you will then need to find the k nearest neighbors and take the majority vote of those neighbors to determine the class of the new data point.
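A minimal from-scratch sketch of kNN classification using Euclidean distance and a majority vote (the toy data is invented):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distances from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))   # -> 0
print(knn_predict(X_train, y_train, np.array([7, 8])))   # -> 1
```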
3. What type of node is considered Pure in the decision tree?
If the Gini Index of the data is 0 then it means that all the elements belong to a specific class. When this happens it is said to be pure.
When all of the data belongs to a single class (pure) then the leaf node is reached in the tree.
The leaf node represents the class label in the tree (which means that it gives the final output).
4. What is Space and Time Complexity of the Hierarchical Clustering Algorithm?
Space complexity: hierarchical clustering requires a lot of memory when the number of observations in the dataset is large, since the similarity (proximity) matrix must be stored in RAM. The space complexity is therefore of the order of the square of n: O(n²), where n is the number of observations.
Time complexity: since we have to perform n iterations, and in each iteration the proximity matrix must be updated and stored again, the time complexity is also very high, of the order of the cube of n: O(n³), where n is the number of observations.
ENJOY LEARNING 👍👍
How to start learning Data Science?
There are many resources available to help you start learning data science, depending on your background and goals.
Here are a few steps you can take:
Develop a strong understanding of the basics of statistics and programming.
Learn Python or R; both are popular among data scientists.
Learn the basics of data manipulation and visualization with tools such as pandas and matplotlib.
Learn the basics of machine learning, such as linear regression and k-nearest neighbors, and practice applying them to real-world datasets.
Take online courses and tutorials, such as those offered by Coursera, edX, and DataCamp.
Practice by working on projects and participating in online data science competitions.
Get familiar with popular data science libraries such as numpy, scikit-learn, tensorflow, keras and pytorch.
It's a good idea to start with a solid foundation in statistics and programming, and then build on that foundation by learning the specific tools and techniques used in data science. As you gain experience, you can start working on more complex projects and exploring specialized areas of the field.
1. What is DBSCAN Clustering?
DBSCAN groups ‘densely grouped’ data points into a single cluster. It can identify clusters in large spatial datasets by looking at the local density of the data points. The most exciting feature of DBSCAN clustering is that it is robust to outliers. It also does not require the number of clusters to be told beforehand, unlike K-Means, where we have to specify the number of centroids.
2. What are the different forms of joins in a table?
SQL has many kinds of different joins including INNER JOIN, SELF JOIN, CROSS JOIN, and OUTER JOIN. In fact, each join type defines the way two tables are related in a query. OUTER JOINS can further be divided into LEFT OUTER JOINS, RIGHT OUTER JOINS, and FULL OUTER JOINS.
3. How is grid search different from random search?
Hyperparameter tuning is very useful for enhancing the performance of a machine learning model. The main difference between the two approaches is that in grid search we define all the combinations to be trained and evaluated, whereas in randomized search (RandomizedSearchCV) the combinations are sampled at random. Both are effective ways of tuning the hyperparameters to increase model generalizability.
Random search tries random combinations of the hyperparameters to find the best solution for the built model. Its drawback is that it yields higher variance in the results: since the selection of parameter combinations is completely random and no intelligence is used to sample them, luck plays a role.
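For example, with scikit-learn (the model, grid, and distributions are arbitrary choices):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
model = RandomForestClassifier(random_state=0)

# Grid search: every combination in the grid is trained and evaluated
grid = GridSearchCV(model, {"n_estimators": [50, 100], "max_depth": [3, 5, None]}, cv=3)
grid.fit(X, y)

# Randomized search: a fixed number of combinations sampled from distributions
rand = RandomizedSearchCV(model,
                          {"n_estimators": randint(50, 200), "max_depth": [3, 5, None]},
                          n_iter=5, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```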
4. How should you maintain a deployed model?
A deployed model needs to be retrained after a while to maintain or improve its performance. After deployment, a record should be kept of the predictions made by the model and of the ground-truth values; this data can later be used to retrain the model on new data. A root-cause analysis of wrong predictions should also be carried out.
1. How does a Decision Tree handle continuous (numerical) features?
Ans.
Decision Trees handle continuous features by converting these continuous features to a threshold-based boolean feature.
To decide the threshold value, we use the concept of Information Gain, choosing the threshold that maximizes the information gain.
2. What are Loss Function and Cost Functions?
Ans. The loss function captures the difference between the actual and predicted values for a single record, whereas the cost function aggregates this difference over the entire training dataset.
The most commonly used loss functions are mean squared error and hinge loss.
3. What is the difference between Python Arrays and lists?
Ans. Arrays in Python (from the array module) can only contain elements of the same data type, i.e., an array is homogeneous. An array is a thin wrapper around C arrays and consumes far less memory than a list.
Lists in Python can contain elements of different data types, i.e., a list can be heterogeneous. The trade-off is that lists consume more memory.
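A quick illustration with Python's built-in array module:

```python
from array import array

nums = array("i", [1, 2, 3])      # homogeneous: signed integers only
nums.append(4)                    # fine
# nums.append("five")             # would raise TypeError: an integer is required

mixed = [1, "two", 3.0, [4]]      # lists can hold heterogeneous objects
mixed.append("five")
print(nums, mixed)
```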
4. What is root cause analysis? What is a causation vs. a correlation?
Ans. Root cause analysis is a method of problem-solving used for identifying the root cause(s) of a problem.
Correlation measures the relationship between two variables, ranging from -1 to 1. Causation is when a first event appears to have caused a second event. Causation essentially looks at direct relationships, while correlation can look at both direct and indirect relationships.
ENJOY LEARNING 👍👍
1. Explain the difference between L1 and L2 regularization.
Answer: L2 regularization tends to spread the error among all the terms, while L1 is more binary/sparse, driving many of the weights to exactly zero. L1 corresponds to placing a Laplace prior on the coefficients, while L2 corresponds to a Gaussian prior.
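A rough scikit-learn illustration of the sparsity difference on synthetic data where only a few features matter (the alpha values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, but only 5 are actually informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)     # L2: shrinks all coefficients a little
lasso = Lasso(alpha=10.0).fit(X, y)    # L1: drives many coefficients to exactly zero

print("zero coefficients (ridge):", np.sum(ridge.coef_ == 0))
print("zero coefficients (lasso):", np.sum(lasso.coef_ == 0))
```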
2. What is deep learning, and how does it contrast with other machine learning algorithms?
Answer: Deep learning is a subset of machine learning concerned with neural networks: it uses backpropagation and certain principles loosely inspired by neuroscience to model large sets of labelled, unlabelled, or semi-structured data more accurately. In contrast to classical algorithms that rely on hand-engineered features, deep learning learns representations of the data itself through the use of neural nets.
3. Name an example where ensemble techniques might be useful.
Answer: Ensemble techniques use a combination of learning algorithms to achieve better predictive performance. They typically reduce overfitting and make the model more robust (unlikely to be influenced by small changes in the training data).
You could list some examples of ensemble methods (bagging, boosting, the “bucket of models” method) and demonstrate how they could increase predictive power.
4. What’s the “kernel trick” and how is it useful?
Answer: The kernel trick involves kernel functions that enable learning in higher-dimensional spaces without explicitly calculating the coordinates of points in that space: instead, kernel functions compute the inner products between the images of all pairs of data points in the feature space. This gives them the very useful property of working with higher-dimensional representations while being computationally cheaper than the explicit calculation of those coordinates.
Three different learning styles in machine learning algorithms:
1. Supervised Learning
Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include: Logistic Regression and the Back Propagation Neural Network.
2. Unsupervised Learning
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and K-Means.
3. Semi-Supervised Learning
Input data is a mixture of labeled and unlabelled examples.
There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.
Example problems are classification and regression.
Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.
1. What is the Impact of Outliers on Logistic Regression?
The estimates of logistic regression are sensitive to unusual observations such as outliers, high-leverage points, and influential observations. The sigmoid bounds the predicted probabilities to (0, 1), which limits the effect of extreme values on the predictions, but outliers can still distort the coefficient estimates and should therefore be examined and treated before fitting.
2. What is the difference between vanilla RNNs and LSTMs?
The main difference between vanilla RNNs and LSTMs is that LSTMs are better at remembering long-term dependencies, while vanilla RNNs tend to forget them. This is because LSTMs have a special memory cell, controlled by input, forget, and output gates, that can retain information over long periods, whereas a vanilla RNN only has a simple recurrent hidden state.
3. What is Masked Language Model in NLP?
A masked language model learns deep, bidirectional representations by taking a corrupted input, in which some tokens are masked, and training the model to predict the masked tokens from their surrounding context. These representations are then reused in downstream NLP tasks.
4. Why is the KNN Algorithm known as Lazy Learner?
When the KNN algorithm receives the training data, it does not learn a model; it simply stores the data. Instead of finding a discriminative function from the training data, it follows instance-based learning and uses the stored training data only when it actually needs to make a prediction on unseen data. Because KNN delays learning until prediction time, it is referred to as a lazy learner.
1. What are the disadvantages of the linear regression model?
One of the most significant drawbacks of the linear model is that it is sensitive to outliers, which can affect the overall result. Another notable drawback is overfitting; similarly, underfitting (when the true relationship is not linear) is also a significant disadvantage of the linear model.
2. Why is Naive Bayes called Naive?
We call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications:
we consider that these predictors are independent
we consider that all the predictors have an equal effect on the outcome (like the day being windy does not have more importance in deciding to play golf or not)
3. How does Random Forest handle missing values?
The Random Forest methods encourage two ways of handling missing values:
Drop data points with missing values. This is not recommended, because not all of the available data is then used.
Fill in the missing values with the median (for numerical values) or mode (for categorical values). This method will brush too broad a stroke for datasets with many gaps and significant structure.
There are other methods of filling in missing values such as calculating the similarity between the missing features, and the missing values estimated by weighting.
4. Why does XGBoost perform better than SVM?
XGBoost is internally designed to handle missing values: they are interpreted in such a way that if there is any pattern in the missing values, it is captured by the model. Users can supply a value different from other observations and pass it as a parameter; XGBoost tries different directions when it encounters a missing value at each node and learns which path to take for missing values in the future. Support Vector Machines (SVMs), on the other hand, do not perform well with missing data, so it is always better to impute the missing values before running an SVM.
ENJOY LEARNING 👍👍
7 Baby steps to start with Machine Learning:
1. Start with Python
2. Learn to use Google Colab
3. Take a Pandas tutorial
4. Then a Seaborn tutorial
5. Decision Trees are a good first algorithm
6. Finish Kaggle's "Intro to Machine Learning"
7. Solve the Titanic challenge
1. What do you understand by a random forest model?
It combines multiple models together to get the final output or, to be more precise, it combines multiple decision trees together to get the final output. So, decision trees are the building blocks of the random forest model.
2. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two terms that are closely related but are often misunderstood. Both of them deal with data. Data Science is a broad field that deals with large volumes of data and allows us to draw insights out of this voluminous data. Machine Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals with data, but here, we are solely focused on learning how to convert the processed data into a functional model, which can be used to map inputs to outputs, e.g., a model that can expect an image as an input and tell us if that image contains a flower as an output.
3. What is a kernel function in SVM?
In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a kernel function takes data as input and converts it into a required form. This transformation of the data is based on something called a kernel trick, which is what gives the kernel function its name. Using the kernel function, we can transform the data that is not linearly separable (cannot be separated using a straight line) into one that is linearly separable.
4. Explain TF/IDF vectorization.
The expression ‘TF/IDF’ stands for Term Frequency–Inverse Document Frequency. It is a numerical measure that allows us to determine how important a word is to a document in a collection of documents called a corpus. TF/IDF is used often in text mining and information retrieval.
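A short sketch with scikit-learn's TfidfVectorizer (the example corpus is made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are great"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # rows = documents, columns = terms

# Words that appear in many documents (e.g. "the", "sat") receive a lower weight
# than words that are distinctive for a single document.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```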
ENJOY LEARNING 👍👍
1. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of the given dataset?
One-hot encoding is the representation of categorical variables as binary vectors. Label encoding converts labels/words into numeric form. One-hot encoding increases the dimensionality of the dataset, whereas label encoding does not: one-hot encoding creates a new binary variable for each level of the categorical variable, while label encoding encodes the levels of a variable as integers (0, 1, 2, …) in a single column.
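A small sketch with pandas and scikit-learn (the column values are invented):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["delhi", "mumbai", "delhi", "chennai"]})

one_hot = pd.get_dummies(df["city"], prefix="city")   # three new binary columns
label = LabelEncoder().fit_transform(df["city"])      # one column of integers 0..2

print(one_hot)   # dimensionality grows with the number of levels
print(label)     # dimensionality unchanged
```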
2. When does regularization come into play in Machine Learning?
Regularization becomes necessary when a model begins to overfit. It is a technique that shrinks (regularizes) the coefficient estimates towards zero, reducing the model's flexibility and discouraging it from memorizing the training data. The model complexity is reduced and it becomes better at generalizing to new data.
3. How can we relate standard deviation and variance?
Standard deviation refers to the spread of your data around the mean. Variance is the average squared deviation of each point from the mean. The two are related because standard deviation is the square root of variance.
4. What is the exploding gradient problem while using the back propagation technique?
When large error gradients accumulate and result in large changes in the neural network weights during training, it is called the exploding gradient problem. The values of weights can become so large as to overflow and result in NaN values. This makes the model unstable and the learning of the model to stall just like the vanishing gradient problem.
1. Mention The Different Types Of Data Structures In pandas?
There are two main data structures in the pandas library: Series and DataFrame. Both are built on top of NumPy. A Series is a one-dimensional data structure and a DataFrame is the two-dimensional data structure in pandas. There was also a third, three-dimensional structure called Panel (with items, major_axis, and minor_axis), but it has been deprecated and removed from recent versions of pandas.
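For example:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])        # 1-D labelled array
df = pd.DataFrame({"price": [10, 20, 30],
                   "qty":   [1, 2, 3]},
                  index=["a", "b", "c"])                   # 2-D labelled table

print(s["b"])          # label-based access on a Series
print(df.loc["b"])     # row access on a DataFrame
```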
2. Why is KNN a non-parametric Algorithm?
The term “non-parametric” refers to not making any assumptions on the underlying data distribution. These methods do not have any fixed numbers of parameters in the model.
Similarly in KNN, the model parameters grow with the training data by considering each training case as a parameter of the model. So, KNN is a non-parametric algorithm.
3. Explain the CART Algorithm for Decision Trees.
CART, which stands for Classification and Regression Trees, is a variation of the decision tree algorithm that can handle both classification and regression tasks. It is a greedy algorithm: it searches for an optimum split at the top level and then repeats the same process at each subsequent level. It does not check whether the chosen split will lead to the lowest possible impurity several levels down, so the solution produced by the greedy algorithm is not guaranteed to be optimal; it is, however, often reasonably good, since finding the optimal tree is an NP-complete problem that would require exponential time.
4. Explain leave-p-out cross validation.
When using this exhaustive method, we take p number of points out from the total number of data points in the dataset(say n). While training the model we train it on these (n – p) data points and test the model on p data points. We repeat this process for all the possible combinations of p from the original dataset. Then to get the final accuracy, we average the accuracies from all these iterations.
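A scikit-learn sketch with p = 2 on a small synthetic dataset (leave-p-out is exhaustive, so it is only practical for small n):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

# Keep n small: C(12, 2) = 66 train/test splits
X, y = make_classification(n_samples=12, n_features=4, random_state=0)

lpo = LeavePOut(p=2)                       # train on n - p points, test on p points
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=lpo)
print(len(scores), scores.mean())          # 66 iterations, averaged accuracy
```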
ENJOY LEARNING 👍👍