How to start learning Data Science?
There are many resources available to help you start learning data science, depending on your background and goals.
Here are a few steps you can take:
Develop a strong understanding of the basics of statistics and programming.
Learn Python or R; both languages are popular among data scientists.
Learn the basics of data manipulation and visualization with tools such as pandas and matplotlib.
Learn the basics of machine learning, such as linear regression and k-nearest neighbors, and practice applying them to real-world datasets.
Take online courses and tutorials, such as those offered by Coursera, edX, and DataCamp.
Practice by working on projects and participating in online data science competitions.
Get familiar with popular data science libraries such as NumPy, scikit-learn, TensorFlow, Keras, and PyTorch.
It's a good idea to start with a solid foundation in statistics and programming, and then build on that foundation by learning the specific tools and techniques used in data science. As you gain experience, you can start working on more complex projects and exploring specialized areas of the field. A minimal starter sketch tying a few of these tools together is shown below.
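As a rough illustration of the workflow in the steps above, here is a minimal sketch (assuming pandas, matplotlib, and scikit-learn are installed) that loads a small built-in dataset, inspects it with pandas, fits a simple model, and plots predictions:

```python
# Minimal starter sketch: pandas for inspection, matplotlib for plotting,
# scikit-learn for a first model. Dataset and model choices are illustrative.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes(as_frame=True)          # small built-in regression dataset
df = data.frame
print(df.describe())                         # basic data inspection with pandas

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))

plt.scatter(y_test, model.predict(X_test))   # predicted vs. actual values
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.show()
```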
1. What is DBSCAN Clustering?
DBSCAN groups densely packed data points into a single cluster. It can identify clusters in large spatial datasets by looking at the local density of the data points. A particularly useful feature of DBSCAN is that it is robust to outliers. It also does not require the number of clusters to be specified beforehand, unlike K-Means, where we have to specify the number of centroids.
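A minimal scikit-learn sketch of this behaviour; the eps and min_samples values below are illustrative, not universal defaults:

```python
# DBSCAN: no cluster count is specified; low-density points are labelled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", np.sum(labels == -1))
```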
2. What are the different forms of joins in a table?
SQL supports several kinds of joins, including INNER JOIN, SELF JOIN, CROSS JOIN, and OUTER JOIN. Each join type defines the way two tables are related in a query. OUTER JOINs can further be divided into LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN.
3. How is grid search different from random search?
Hyperparameter tuning is very useful for improving the performance of a machine learning model. The key difference between the two approaches is that in grid search we define all the combinations and train the model on each of them, whereas in random search (RandomizedSearchCV) the combinations are sampled randomly. Both are effective ways of tuning hyperparameters to improve model generalizability.
Random search tries random combinations of the hyperparameters to find the best configuration for the model. Its drawback is higher variance in the results: since the combinations are sampled completely at random, with no intelligence guiding the search, luck plays a role.
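A short sketch contrasting the two search strategies on the same estimator; the parameter ranges here are purely illustrative:

```python
# Grid search evaluates every listed combination; random search samples n_iter of them.
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(random_state=42)

# Exhaustive: every combination we define is trained and evaluated.
grid = GridSearchCV(rf, {"n_estimators": [50, 100], "max_depth": [3, 5, None]}, cv=3)
grid.fit(X, y)

# Random: combinations are drawn from the given distributions/lists.
rand = RandomizedSearchCV(
    rf, {"n_estimators": randint(50, 200), "max_depth": [3, 5, None]},
    n_iter=5, cv=3, random_state=42)
rand.fit(X, y)

print("grid best:", grid.best_params_)
print("random best:", rand.best_params_)
```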
4. How should you maintain a deployed model?
A deployed model needs to be retrained periodically to maintain or improve its performance. After deployment, keep track of the predictions made by the model alongside the ground-truth values; this data can later be used to retrain the model on new data. Root cause analysis should also be performed for wrong predictions.
1. How does a Decision Tree handle continuous(numerical) features?
Ans. Decision Trees handle continuous features by converting them into threshold-based Boolean features. To decide the threshold value, we use the concept of information gain, choosing the threshold that maximizes it; a toy sketch of this follows below.
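A toy, from-scratch sketch of threshold selection by information gain; real decision-tree libraries do this internally (often with Gini impurity rather than entropy):

```python
# Pick the threshold on a continuous feature that maximizes information gain.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    base = entropy(y)
    best_t, best_gain = None, -1.0
    for t in np.unique(x)[:-1]:                 # candidate split points
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = base - weighted                  # information gain of this split
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

x = np.array([2.1, 3.5, 4.0, 5.2, 6.8, 7.1])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))                     # splits at 4.0 with gain 1.0
```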
2. What are Loss Function and Cost Functions?
Ans. The loss function captures the difference between the actual and predicted values for a single record, whereas the cost function aggregates this difference over the entire training dataset.
The most commonly used loss functions are mean squared error and hinge loss.
3. What is the difference between Python Arrays and lists?
Ans. Arrays in Python (the built-in array module) can only contain elements of the same data type, i.e., the array is homogeneous. An array is a thin wrapper around C arrays and consumes far less memory than a list.
Lists in Python can contain elements of different data types, i.e., they are heterogeneous, at the cost of consuming more memory.
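A short sketch of the array module versus a plain list; the sizes printed by sys.getsizeof cover container overhead only and will vary by Python version:

```python
# array.array is homogeneous (C ints here); a list accepts mixed types.
import array
import sys

nums_array = array.array("i", [1, 2, 3, 4, 5])   # homogeneous: C integers only
nums_list = [1, 2, 3, 4, 5]                      # heterogeneous allowed

print(sys.getsizeof(nums_array), sys.getsizeof(nums_list))

nums_list.append("six")                          # fine for a list
try:
    nums_array.append("six")                     # rejected: wrong type
except TypeError as e:
    print("array error:", e)
```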
4. What is root cause analysis? What is a causation vs. a correlation?
Ans. Root cause analysis is a method of problem-solving used for identifying the root cause(s) of a problem [5].
Correlation measures the relationship between two variables and ranges from -1 to 1. Causation is when a first event appears to have caused a second event. Causation looks at direct relationships, while correlation can capture both direct and indirect relationships.
ENJOY LEARNING 👍👍
1. Explain the difference between L1 and L2 regularization.
Answer: L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, driving many weights to exactly zero. L1 corresponds to placing a Laplacian prior on the terms, while L2 corresponds to a Gaussian prior.
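A small sketch comparing Lasso (L1) and Ridge (L2) coefficients on a built-in dataset; the alpha value is an arbitrary illustrative choice:

```python
# L1 (Lasso) produces sparse coefficients; L2 (Ridge) shrinks but rarely zeroes them.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)   # many coefficients driven exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # coefficients shrunk but kept nonzero

print("L1 zero coefficients:", sum(c == 0 for c in lasso.coef_))
print("L2 zero coefficients:", sum(c == 0 for c in ridge.coef_))
```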
2. What is deep learning, and how does it contrast with other machine learning algorithms?
Answer: Deep learning is a subset of machine learning concerned with neural networks: it uses backpropagation and certain principles inspired by neuroscience to model large sets of unlabelled or semi-structured data more accurately. Unlike classical algorithms that rely on hand-engineered features, deep learning learns layered representations of the data through neural nets and can be applied in supervised, unsupervised, and semi-supervised settings.
3. Name an example where ensemble techniques might be useful.
Answer: Ensemble techniques use a combination of learning algorithms to optimize better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data).
You could list some examples of ensemble methods (bagging, boosting, the “bucket of models” method) and demonstrate how they could increase predictive power.
4. What’s the “kernel trick” and how is it useful?
Answer: The kernel trick uses kernel functions that enable operating in higher-dimensional spaces without explicitly calculating the coordinates of points in those spaces: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space. This gives them the very useful property of working with higher-dimensional representations while being computationally cheaper than the explicit calculation of those coordinates.
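A minimal sketch of the kernel trick in practice, using an RBF-kernel SVM on data that is not linearly separable in its original two dimensions:

```python
# Concentric circles: no separating line exists in 2D, but an RBF kernel separates them.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)

linear_svm = SVC(kernel="linear").fit(X, y)   # struggles on this data
rbf_svm = SVC(kernel="rbf").fit(X, y)         # implicit higher-dimensional inner products

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))
```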
Three different learning styles in machine learning algorithms:
1. Supervised Learning
Input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a point in time.
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include: Logistic Regression and the Back Propagation Neural Network.
2. Unsupervised Learning
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and K-Means.
3. Semi-Supervised Learning
Input data is a mixture of labeled and unlabelled examples.
There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.
Example problems are classification and regression.
Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.
1. What is the Impact of Outliers on Logistic Regression?
The estimates of logistic regression are sensitive to unusual observations such as outliers, high-leverage points, and influential observations. The sigmoid function bounds the predicted probabilities between 0 and 1, which limits the effect of extreme predictions, but outliers can still distort the coefficient estimates, so they should be detected and treated during preprocessing.
2. What is the difference between vanilla RNNs and LSTMs?
The main difference is that LSTMs are better at remembering long-term dependencies, while vanilla RNNs tend to forget them. This is because LSTMs have a gated memory cell that can retain information over long periods, whereas a vanilla RNN relies on a simple recurrent hidden state whose gradients tend to vanish over long sequences.
3. What is Masked Language Model in NLP?
Masked language modelling is a pre-training task in which some tokens of the input are masked (corrupted) and the model learns to predict them from the surrounding context. The deep representations learned this way transfer well to downstream NLP tasks.
4. Why is the KNN Algorithm known as Lazy Learner?
When the KNN algorithm receives the training data, it does not learn a model; it simply stores the data. Instead of deriving a discriminative function from the training data, it follows instance-based learning and uses the stored data only when it actually needs to make a prediction on unseen data. Because KNN delays learning until prediction time, it is referred to as a lazy learner.
1. What are the disadvantages of the linear regression model?
One of the most significant drawbacks of the linear model is its sensitivity to outliers, which can distort the overall result. Overfitting is another notable drawback, and underfitting is likewise a significant disadvantage, since the model assumes a linear relationship between the variables.
2. Why Naive Bayes is called Naive?
We call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications:
we consider that these predictors are independent
we consider that all the predictors have an equal effect on the outcome (like the day being windy does not have more importance in deciding to play golf or not)
3. How does Random Forest handle missing values?
The Random Forest method encourages two ways of handling missing values:
Drop data points with missing values. This is not recommended because it discards available data.
Fill in the missing values with the median (for numerical features) or the mode (for categorical features), as shown in the sketch below. This paints with too broad a brush for datasets with many gaps and significant structure.
There are other methods of filling in missing values, such as estimating a missing value from similar observations and weighting those estimates by proximity.
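A minimal sketch of the median imputation described above, using scikit-learn's SimpleImputer before fitting a random forest; the tiny DataFrame is made-up illustrative data:

```python
# Median imputation for numeric columns before training a random forest.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "income": [50, 60, np.nan, 80]})
y = [0, 1, 1, 0]

imputer = SimpleImputer(strategy="median")   # use strategy="most_frequent" for categorical columns
X = imputer.fit_transform(df)

model = RandomForestClassifier(random_state=0).fit(X, y)
print(imputer.statistics_)                   # the medians used to fill the gaps
```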
4. Why does XGBoost perform better than SVM?
XGBoost is internally designed to handle missing values: if there is any pattern in the missing values, the model captures it. Users can also supply a value different from other observations and pass it as a parameter. When XGBoost encounters a missing value at a node, it tries both branches and learns which path to take for missing values in the future. Support Vector Machines (SVMs), on the other hand, do not handle missing data well, so it is always better to impute the missing values before running an SVM.
ENJOY LEARNING 👍👍
7 Baby steps to start with Machine Learning:
1. Start with Python
2. Learn to use Google Colab
3. Take a Pandas tutorial
4. Then a Seaborn tutorial
5. Decision Trees are a good first algorithm
6. Finish Kaggle's "Intro to Machine Learning"
7. Solve the Titanic challenge
1. What do you understand by a random forest model?
It combines multiple models together to get the final output or, to be more precise, it combines multiple decision trees together to get the final output. So, decision trees are the building blocks of the random forest model.
2. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two terms that are closely related but often misunderstood. Both of them deal with data. Data Science is a broad field that deals with large volumes of data and allows us to draw insights out of this voluminous data. Machine Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals with data, but here we are solely focused on learning how to convert the processed data into a functional model that maps inputs to outputs, e.g., a model that takes an image as input and tells us whether that image contains a flower.
3. What is a kernel function in SVM?
In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a kernel function takes data as input and converts it into a required form. This transformation of the data is based on something called a kernel trick, which is what gives the kernel function its name. Using the kernel function, we can transform the data that is not linearly separable (cannot be separated using a straight line) into one that is linearly separable.
4. Explain TF/IDF vectorization.
The expression ‘TF/IDF’ stands for Term Frequency–Inverse Document Frequency. It is a numerical measure that allows us to determine how important a word is to a document in a collection of documents called a corpus. TF/IDF is used often in text mining and information retrieval.
ENJOY LEARNING 👍👍
1. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of the given dataset?
One-hot encoding represents a categorical variable as binary vectors, creating a new binary column for each level of the variable, so it increases the dimensionality of the dataset. Label encoding converts each label/level into an integer (0, 1, 2, ...) in a single column, so it does not affect the dimensionality of the dataset.
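A quick sketch of both encodings on one categorical column; note how the one-hot version adds a column per level while label encoding keeps a single integer column:

```python
# One-hot vs. label encoding of a single categorical feature.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

one_hot = pd.get_dummies(df, columns=["city"])        # 3 new binary columns
label = LabelEncoder().fit_transform(df["city"])      # one column of integers 0..2

print(one_hot.shape, one_hot.columns.tolist())
print(label)
```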
2. When does regularization come into play in Machine Learning?
Regularization becomes necessary when the model begins to overfit. It is a technique that shrinks (regularizes) the coefficient estimates towards zero. It reduces the model's flexibility and discourages it from fitting noise, thereby avoiding the risk of overfitting. The model complexity is reduced and it becomes better at generalizing to new data.
3. How can we relate standard deviation and variance?
Standard deviation refers to the spread of the data around the mean. Variance is the average squared deviation of each point from the mean. The two are related because standard deviation is the square root of variance.
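A quick numeric check of that relationship with NumPy:

```python
# Standard deviation is the square root of variance.
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(np.var(data))   # 4.0
print(np.std(data))   # 2.0 == sqrt(4.0)
```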
4. What is the exploding gradient problem while using the back propagation technique?
When large error gradients accumulate and result in large changes to the neural network weights during training, it is called the exploding gradient problem. The weights can become so large that they overflow and result in NaN values. This makes the model unstable and causes learning to stall, much like the vanishing gradient problem.
1. Mention The Different Types Of Data Structures In pandas?
There are two primary data structures in the pandas library: Series and DataFrame. Both are built on top of NumPy. A Series is a one-dimensional data structure and a DataFrame is a two-dimensional data structure. There was also a three-dimensional structure called Panel (with items, major_axis, and minor_axis axes), but it has been deprecated and removed from recent versions of pandas.
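A minimal look at the two current structures; three-dimensional data is now usually handled with a MultiIndex or the xarray library rather than Panel:

```python
# Series is 1-D and labeled; DataFrame is 2-D with labeled rows and columns.
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])       # one-dimensional
df = pd.DataFrame({"price": [10, 20], "qty": [3, 5]})     # two-dimensional

print(s.ndim, df.ndim)   # 1 2
print(df.dtypes)
```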
2. Why is KNN a non-parametric Algorithm?
The term “non-parametric” refers to not making any assumptions on the underlying data distribution. These methods do not have any fixed numbers of parameters in the model.
Similarly in KNN, the model parameters grow with the training data by considering each training case as a parameter of the model. So, KNN is a non-parametric algorithm.
3. Explain the CART Algorithm for Decision Trees.
CART (Classification and Regression Trees) is a variation of the decision tree algorithm that can handle both classification and regression tasks. It is a greedy algorithm: it searches for the best split at the top level and then repeats the same process at each subsequent level. It does not check whether a split will lead to the lowest possible impurity several levels down, so the greedy solution is not guaranteed to be optimal; it usually produces a reasonably good tree, however, since finding the optimal tree is an NP-complete problem requiring exponential time.
4. Explain leave-p-out cross validation.
This exhaustive method takes p points out of the total n data points in the dataset. The model is trained on the remaining (n - p) data points and tested on the p held-out points. We repeat this process for every possible combination of p points from the original dataset and average the accuracies from all these iterations to get the final accuracy.
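A tiny sketch with scikit-learn's LeavePOut splitter; with p=2 and 6 samples there are C(6, 2) = 15 train/test splits, and the dataset here is made-up illustrative data:

```python
# Leave-p-out cross-validation: every combination of p points becomes a test set once.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

lpo = LeavePOut(p=2)
scores = cross_val_score(LogisticRegression(), X, y, cv=lpo)
print(len(scores), scores.mean())   # 15 folds, averaged accuracy
```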
ENJOY LEARNING 👍👍
Sharing 20+ Diverse Datasets📊 for Data Science and Analytics practice!
1. How Much Did It Rain? II: https://www.kaggle.com/c/how-much-did-it-rain-ii/overview
2. Grupo Bimbo Inventory Demand: https://www.kaggle.com/c/grupo-bimbo-inventory-demand
3. Property Inspection Prediction: https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction
4. Restaurant Revenue Prediction: https://www.kaggle.com/c/restaurant-revenue-prediction/data
5. Customer Satisfaction (Santander): https://www.kaggle.com/c/santander-customer-satisfaction
6. Iris Dataset: https://archive.ics.uci.edu/ml/datasets/iris
7. Titanic Dataset: https://www.kaggle.com/c/titanic
8. Wine Quality Dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
9. Heart Disease Dataset: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
10. Bengaluru House Price Dataset: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data
11. Breast Cancer Dataset: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
12. Credit Card Fraud Detection: https://www.kaggle.com/mlg-ulb/creditcardfraud
13. Netflix Movies and TV Shows: https://www.kaggle.com/shivamb/netflix-shows
14. Trending YouTube Video Statistics: https://www.kaggle.com/datasnaek/youtube-new
15. Walmart Store Sales Forecasting: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
16. FIFA 19 Complete Player Dataset: https://www.kaggle.com/karangadiya/fifa19
17. World Happiness Report: https://www.kaggle.com/unsdsn/world-happiness
18. TMDB 5000 Movie Dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata
19. Students Performance in Exams: https://www.kaggle.com/spscientist/students-performance-in-exams
20. Twitter Sentiment Analysis Dataset: https://www.kaggle.com/kazanova/sentiment140
21. Digit Recognizer: https://www.kaggle.com/c/digit-recognizer
💻🔍 Don't miss out on these valuable resources for advancing your data science journey!
1. What is the primary difference between R square and adjusted R square?
In linear regression, both values are used for model validation, but there is a clear distinction between them. R square measures the proportion of variation in the dependent variable explained by all the independent variables together, and it never decreases when more variables are added, even if they add no real explanatory power. Adjusted R square corrects for the number of predictors: it increases only when a newly added variable improves the model more than would be expected by chance, which makes it the better measure when comparing models with different numbers of variables.
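For reference, the standard formula relating the two (added here as a sketch; n is the number of observations and k the number of predictors) is:

```latex
% Adjusted R^2 penalizes predictors that do not improve the fit enough.
\bar{R}^{2} = 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - k - 1}
```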
2. What is the curse of dimensionality?
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset corresponds to the number of attributes/features that exist in a dataset. A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high dimensional data. Some of the difficulties that come with high dimensional data manifest during analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models. The difficulties related to training machine learning models due to high dimensional data are referred to as the ‘Curse of Dimensionality’.
3. What are some Stopping Criteria for k-Means Clustering?
a. Convergence. No further changes, points stay in the same cluster.
b. The maximum number of iterations. When the maximum number of iterations has been reached, the algorithm will be stopped. This is done to limit the runtime of the algorithm.
c. Variance did not improve by at least x times the initial variance. (A short sketch of how these criteria appear in scikit-learn's KMeans follows below.)
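A minimal sketch, assuming scikit-learn: max_iter caps the number of iterations and tol sets the minimum improvement (shift in cluster centres) required to keep iterating; the values below are illustrative.

```python
# KMeans stops when centres converge (within tol) or max_iter is reached.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=42).fit(X)
print("iterations actually run:", km.n_iter_)   # usually far fewer than max_iter
```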
4. What are hard margin and soft Margin SVMs?
Hard-margin SVMs work only if the data is linearly separable, and they are quite sensitive to outliers. Our main objective, however, is to find a good balance between keeping the margin as large as possible and limiting margin violations, i.e., instances that end up inside the margin or even on the wrong side; this approach is called the soft-margin SVM.
1. What is the Difference Between a Shallow Copy and Deep Copy in python?
A deep copy creates a new object and recursively copies the child objects of the original, so changes to the original are not reflected in the copy; copy.deepcopy() creates a deep copy. A shallow copy creates a new object but populates it with references to the child objects of the original, so changes to those child objects are reflected in the copy; copy.copy() creates a shallow copy.
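A quick demonstration of the difference: the nested list is shared after a shallow copy but independent after a deep copy.

```python
# Shallow copy shares inner objects; deep copy duplicates them.
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)       # new outer list, same inner lists
deep = copy.deepcopy(original)      # new outer list and new inner lists

original[0].append(99)
print(shallow[0])   # [1, 2, 99]  -> change is visible through the shallow copy
print(deep[0])      # [1, 2]      -> deep copy is unaffected
```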
2. How can you remove duplicate values in a range of cells?
1. To delete duplicate values in a column, select the highlighted cells, and press the delete button. After deleting the values, go to the ‘Conditional Formatting’ option present in the Home tab. Choose ‘Clear Rules’ to remove the rules from the sheet.
2. You can also delete duplicate values by selecting the ‘Remove Duplicates’ option under Data Tools present in the Data tab.
3. Define shelves and sets in Tableau?
Shelves: Every worksheet in Tableau will have shelves such as columns, rows, marks, filters, pages, and more. By placing filters on shelves we can build our own visualization structure. We can control the marks by including or excluding data.
Sets: Sets are custom fields that group data based on a condition; records that satisfy the condition are grouped together. The fields responsible for the grouping are known as sets, for example, students having grades of more than 70%.
4. Given a table Employee having columns empName and empId, what will be the result of the SQL query below?
select empName from Employee order by 2 asc;
"ORDER BY 2" is valid only when at least two columns are used in the SELECT statement. Here the query will throw an error because only one column is used in the SELECT statement.
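A small sketch reproducing this behaviour with SQLite from Python; the sample rows are made up, and the exact error text differs between database engines, but the positional ORDER BY reference fails the same way:

```python
# ORDER BY 2 refers to the second column of the SELECT list, which does not exist here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (empId INTEGER, empName TEXT)")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha'), (2, 'Ravi')")

try:
    conn.execute("SELECT empName FROM Employee ORDER BY 2 ASC").fetchall()
except sqlite3.OperationalError as e:
    print("error:", e)   # ORDER BY term out of range: only 1 column selected
```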
ENJOY LEARNING 👍👍
Amazing Hackathon Solved Data Science/ML Project Collection
https://github.com/analyticsindiamagazine/MachineHack/tree/master/Hackathon_Solutions
𝗘𝗡𝗝𝗢𝗬 𝗟𝗘𝗔𝗥𝗡𝗜𝗡𝗚 👍👍