Three different learning styles in machine learning algorithms:
1. Supervised Learning
Input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a point in time.
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include: Logistic Regression and the Back Propagation Neural Network.
2. Unsupervised Learning
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and K-Means.
3. Semi-Supervised Learning
Input data is a mixture of labeled and unlabeled examples.
There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.
Example problems are classification and regression.
Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.
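To make the contrast concrete, here's a minimal sketch (assuming scikit-learn is installed; the Iris dataset and parameters are just for illustration) that fits a supervised classifier on labeled data and an unsupervised clusterer on the same features without labels:

```python
# Sketch: supervised vs. unsupervised learning on the same features.
# Assumes scikit-learn; dataset and hyperparameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model sees features AND labels and is corrected during training.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised predictions:", clf.predict(X[:3]))

# Unsupervised: the model sees only features and must deduce structure itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:   ", km.labels_[:3])
```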
1. What is the Impact of Outliers on Logistic Regression?
The estimates of logistic regression are sensitive to unusual observations such as outliers, high-leverage points, and influential observations, which can distort the fitted coefficients. The sigmoid does bound predicted probabilities to (0, 1), which softens the influence of extreme feature values compared with linear regression, but it does not remove the problem: influential points should still be detected (for example with standardized residuals or Cook's distance) and handled before fitting.
2. What is the difference between vanilla RNNs and LSTMs?
The main difference is that LSTMs can remember long-term dependencies, while vanilla RNNs tend to forget them. A vanilla RNN carries only a single hidden state that is overwritten at every time step, so gradients flowing back through many steps vanish or explode. An LSTM adds a gated memory cell (with input, forget, and output gates) that can retain information over long spans, which makes long-range dependencies much easier to learn.
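As a rough illustration (assuming PyTorch; shapes are arbitrary), the two layers are drop-in siblings in code, and the visible difference is the extra cell state c_n the LSTM carries alongside the hidden state h_n:

```python
# Sketch: a vanilla RNN carries only a hidden state; an LSTM also carries
# a gated cell state. Sizes here are illustrative assumptions.
import torch
import torch.nn as nn

x = torch.randn(1, 20, 8)  # (batch, sequence length, input features)

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
out, h_n = rnn(x)            # vanilla RNN: hidden state only

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, (h_n, c_n) = lstm(x)    # LSTM: hidden state plus gated cell state
print(h_n.shape, c_n.shape)  # both torch.Size([1, 1, 16])
```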
3. What is Masked Language Model in NLP?
A masked language model (MLM) is pre-trained by corrupting the input: some tokens are randomly replaced with a special [MASK] token, and the model is trained to predict the original tokens from the surrounding context. Because the model must use context on both sides of the mask, it learns deep bidirectional representations that transfer well to downstream NLP tasks; BERT is the best-known example.
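For example, here's a hedged sketch using the Hugging Face transformers library and the bert-base-uncased checkpoint (which was pre-trained with exactly this masked-token objective); it downloads the model on first run:

```python
# Sketch: querying a masked language model to fill in a blanked-out token.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The goal of machine learning is to [MASK] from data."):
    print(pred["token_str"], round(pred["score"], 3))  # top candidate tokens
```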
4. Why is the KNN Algorithm known as Lazy Learner?
When the KNN algorithm receives the training data, it does not learn a model; it simply stores the data. Instead of fitting a discriminative function from the training data, it follows instance-based learning: the stored training data is used only when the algorithm actually needs to make a prediction on unseen examples. Because KNN defers all computation until query time rather than learning up front, it is referred to as a lazy learner.
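A small sketch (scikit-learn assumed; the toy data is illustrative) makes the laziness visible: fit() merely stores the training set, and all distance computation is deferred to predict():

```python
# Sketch: KNN "training" just memorizes the data; work happens at query time.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [9.0], [10.0]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)    # near-instant: no model is induced, data is stored

print(knn.predict([[8.5]]))  # neighbors are searched only now -> [1]
```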
1. What are the disadvantages of the linear regression model?
One of the most significant demerits of the linear model is its sensitivity to outliers, which can skew the fitted line and distort the overall result. Linear models are also limited to linear relationships, so they underfit when the true relationship between features and target is non-linear; conversely, with many (or highly correlated) features they can overfit the training data.
2. Why Naive Bayes is called Naive?
We call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are optimistic and rarely true in most real-world applications (a worked sketch of the resulting factorization follows the list):
we assume that the predictors are independent of one another
we assume that all the predictors have an equal effect on the outcome (for example, the day being windy carries no more weight than any other feature in deciding whether to play golf)
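Here's that factorization as a small worked sketch, using stand-in counts in the style of the classic play-golf frequency table (all numbers are illustrative):

```python
# Sketch: the "naive" factorization P(y | x1..xn) ∝ P(y) * Π P(xi | y).
# Counts below are illustrative stand-ins for a play-golf frequency table.
p_play = 9 / 14                  # prior P(play = yes)
p_windy_given_play = 3 / 9       # likelihood P(windy = yes | play = yes)
p_sunny_given_play = 2 / 9       # likelihood P(outlook = sunny | play = yes)

# Independence assumption: multiply per-feature likelihoods as if unrelated.
score_yes = p_play * p_windy_given_play * p_sunny_given_play
print(score_yes)  # unnormalized; compare with the same product for play = no
```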
3. How does Random Forest handle missing values?
Random Forest methods support two simple ways of handling missing values:
Drop data points with missing values. This is not recommended, because not all of the available data gets used.
Fill in the missing values with the median (for numerical features) or the mode (for categorical features). This paints with too broad a brush for datasets with many gaps and significant structure.
There are also more refined approaches, such as estimating a missing value from similar observations, weighting them by proximity. A pandas sketch of the simple median/mode fill follows.
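```python
# Sketch: median imputation for numerical columns, mode for categorical ones.
# Column names and values are hypothetical; assumes pandas is installed.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],        # numerical feature with a gap
    "city": ["NY", "SF", None, "NY"],   # categorical feature with a gap
})

df["age"] = df["age"].fillna(df["age"].median())      # numerical -> median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical -> mode
print(df)
```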
4. Why does XGBoost perform better than SVM?
XGBoost is internally designed to handle missing values: at each split it learns a default direction (left or right branch) for missing values from the training data, so if there is any pattern in the missingness, the model captures it, and it routes unseen missing values down the learned default path at prediction time. Users can optionally specify which value should be treated as missing via the `missing` parameter. Support Vector Machines (SVMs), on the other hand, do not handle missing data at all, so it is always better to impute the missing values before running an SVM.
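A hedged sketch of the contrast (assumes the xgboost and scikit-learn packages; the tiny dataset is illustrative): XGBoost accepts NaNs as-is, while the SVM needs imputation first:

```python
# Sketch: XGBoost consumes NaNs natively; SVM requires imputation first.
import numpy as np
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 1.0], [4.0, 5.0]])
y = np.array([0, 1, 0, 1])

XGBClassifier(n_estimators=10).fit(X, y)  # OK: a default branch is learned for NaN

X_filled = SimpleImputer(strategy="median").fit_transform(X)  # required for SVM
SVC().fit(X_filled, y)                    # SVC would fail on the raw NaNs
```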
ENJOY LEARNING 👍👍
7 Baby steps to start with Machine Learning:
1. Start with Python
2. Learn to use Google Colab
3. Take a Pandas tutorial
4. Then a Seaborn tutorial
5. Decision Trees are a good first algorithm
6. Finish Kaggle's "Intro to Machine Learning"
7. Solve the Titanic challenge
1. What do you understand by a random forest model?
A random forest combines multiple models to get the final output; to be more precise, it combines multiple decision trees, each trained on a random sample of the data, and aggregates their outputs into the final prediction. Decision trees are therefore the building blocks of the random forest model.
2. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two closely related terms that are often confused. Both of them deal with data. Data Science is a broad field that deals with large volumes of data and allows us to draw insights out of this voluminous data. Machine Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals with data, but here we are solely focused on learning how to convert the processed data into a functional model that maps inputs to outputs, e.g., a model that takes an image as input and tells us whether that image contains a flower.
3. What is a kernel function in SVM?
In the SVM algorithm, a kernel function is a special mathematical function that takes data as input and implicitly maps it into a higher-dimensional feature space. The "kernel trick" is that the SVM only needs inner products between points in that space, and the kernel computes those directly, without ever materializing the transformation. Using a suitable kernel, data that is not linearly separable in its original space (cannot be separated using a straight line) can become linearly separable in the feature space.
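A quick sketch of the idea (scikit-learn assumed): concentric-circle data cannot be separated by a straight line, so a linear kernel struggles while an RBF kernel separates it almost perfectly:

```python
# Sketch: an RBF kernel separating data that no straight line can.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

print("linear kernel:", SVC(kernel="linear").fit(X, y).score(X, y))  # poor
print("rbf kernel:   ", SVC(kernel="rbf").fit(X, y).score(X, y))     # ~1.0
```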
4. Explain TF/IDF vectorization.
'TF-IDF' stands for Term Frequency-Inverse Document Frequency. It is a numerical measure of how important a word is to a document within a collection of documents called a corpus: the score grows with how often the word appears in the document and shrinks with how many documents in the corpus contain it. TF-IDF is used often in text mining and information retrieval.
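A minimal sketch (scikit-learn assumed; the two-document corpus is illustrative):

```python
# Sketch: TF-IDF weighting of a tiny corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)       # sparse (2 documents x vocabulary) matrix

print(vec.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))         # corpus-wide words are down-weighted by IDF
```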
ENJOY LEARNING 👍👍
1. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of the given dataset?
One-hot encoding represents a categorical variable as binary vectors, creating a new 0/1 indicator column for each level of the variable, so it increases the dimensionality of the dataset. Label encoding converts each level into a single integer (0, 1, 2, ...), keeping one column, so it does not affect the dimensionality of the dataset.
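A quick sketch of the dimensionality difference (pandas and scikit-learn assumed; the color column is hypothetical):

```python
# Sketch: one-hot adds a column per level; label encoding keeps one column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["red", "green", "blue", "green"], name="color")

one_hot = pd.get_dummies(s)               # 3 new binary columns, one per level
labels = LabelEncoder().fit_transform(s)  # one integer column

print(one_hot.shape)  # (4, 3) -> dimensionality grew
print(labels)         # [2 1 0 1] -> dimensionality unchanged
```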
2. When does regularization come into play in Machine Learning?
Regularization comes into play when the model begins to overfit. It adds a penalty that shrinks (regularizes) the coefficient estimates towards zero, reducing the model's flexibility and discouraging it from fitting noise in the training data. The model complexity is reduced and it becomes better at predicting on unseen data; note that too strong a penalty can tip the model into underfitting instead.
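A hedged sketch (scikit-learn assumed; the data is synthetic): ridge regression adds an L2 penalty whose strength alpha shrinks the coefficients toward zero:

```python
# Sketch: L2 regularization shrinking coefficients as alpha grows.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=50)

for alpha in [0.01, 1.0, 100.0]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, round(float(np.abs(coefs).sum()), 3))  # total size shrinks
```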
3. How can we relate standard deviation and variance?
Standard deviation measures the spread of your data around the mean. Variance is the average squared deviation of each point from the mean. The two are directly related: standard deviation is the square root of variance.
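A one-line check with NumPy (the sample values are arbitrary):

```python
# Sketch: standard deviation is the square root of variance.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean = 5
print(np.var(x))                      # 4.0 (mean squared deviation)
print(np.std(x), np.sqrt(np.var(x)))  # 2.0 == 2.0
```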
4. What is the exploding gradient problem while using the back propagation technique?
When large error gradients accumulate and result in large changes in the neural network weights during training, it is called the exploding gradient problem. The values of weights can become so large as to overflow and result in NaN values. This makes the model unstable and can cause learning to stall, much as the vanishing gradient problem does.
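A common mitigation is gradient clipping; here's a hedged sketch (PyTorch assumed; the model and data are stand-ins for a real deep network):

```python
# Sketch: clipping the global gradient norm to tame exploding gradients.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for a deep network prone to exploding grads
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the norm
opt.step()
```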
1. Mention The Different Types Of Data Structures In pandas?
The pandas library supports two core data structures, Series and DataFrame, both built on top of NumPy. A Series is the one-dimensional data structure in pandas and a DataFrame is the two-dimensional one. Older versions of pandas also had a three-dimensional structure called Panel (with items, major_axis, and minor_axis), but it was deprecated and removed in pandas 1.0 in favor of DataFrames with a MultiIndex.
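A quick sketch of the two current structures:

```python
# Sketch: pandas' two core data structures.
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])  # 1-D labeled array
df = pd.DataFrame({"price": s, "qty": [1, 2, 3]})   # 2-D labeled table

print(s.ndim, df.ndim)  # 1 2
```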
2. Why is KNN a non-parametric Algorithm?
The term “non-parametric” refers to not making any assumptions on the underlying data distribution. These methods do not have any fixed numbers of parameters in the model.
Similarly in KNN, the model parameters grow with the training data by considering each training case as a parameter of the model. So, KNN is a non-parametric algorithm.
3. Explain the CART Algorithm for Decision Trees.
CART, which stands for Classification and Regression Trees, is a variation of the decision tree algorithm that can handle both classification and regression tasks. It is a greedy algorithm: it searches for the optimum split at the top level, then repeats the same process at each of the subsequent levels. It does not check whether a split will lead to the lowest possible impurity several levels further down, so the solution is not guaranteed to be optimal; it is usually reasonably good, though, and the trade-off is accepted because finding the optimal tree is an NP-complete problem requiring exponential time.
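As a sketch (scikit-learn, whose decision trees implement an optimized version of CART, is assumed; max_depth is illustrative):

```python
# Sketch: greedy, impurity-driven splitting in a CART-style decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(tree))  # each split is the locally best (greedy) choice
```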
4. Explain leave-p-out cross validation.
When using this exhaustive method, we take p points out of the total number of data points in the dataset (say n). We train the model on the remaining (n - p) data points and test it on the p held-out points. We repeat this process for all possible combinations of p points from the original dataset, and then average the accuracies from all these iterations to get the final score.
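A sketch with scikit-learn's LeavePOut (toy data; note how quickly the C(n, p) combinations grow):

```python
# Sketch: leave-p-out cross-validation enumerates all C(n, p) splits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut

X = np.array([[0], [1], [2], [3], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

scores = []
for train_idx, test_idx in LeavePOut(p=2).split(X):  # C(7, 2) = 21 splits
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(len(scores), np.mean(scores))  # 21 iterations, averaged accuracy
```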
ENJOY LEARNING 👍👍
Sharing 20+ Diverse Datasets📊 for Data Science and Analytics practice!
1. How much did it rain: https://www.kaggle.com/c/how-much-did-it-rain-ii/overview
2. Inventory Demand: https://www.kaggle.com/c/grupo-bimbo-inventory-demand
3. Property Inspection Prediction: https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction
4. Restaurant Revenue Prediction: https://www.kaggle.com/c/restaurant-revenue-prediction/data
5. Customer Satisfaction: https://www.kaggle.com/c/santander-customer-satisfaction
6. Iris Dataset: https://archive.ics.uci.edu/ml/datasets/iris
7. Titanic Dataset: https://www.kaggle.com/c/titanic
8. Wine Quality Dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
9. Heart Disease Dataset: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
10. Bengaluru House Price Dataset: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data
11. Breast Cancer Dataset: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
12. Credit Card Fraud Detection: https://www.kaggle.com/mlg-ulb/creditcardfraud
13. Netflix Movies and TV Shows: https://www.kaggle.com/shivamb/netflix-shows
14. Trending YouTube Video Statistics: https://www.kaggle.com/datasnaek/youtube-new
15. Walmart Store Sales Forecasting: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
16. FIFA 19 Complete Player Dataset: https://www.kaggle.com/karangadiya/fifa19
17. World Happiness Report: https://www.kaggle.com/unsdsn/world-happiness
18. TMDB 5000 Movie Dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata
19. Students Performance in Exams: https://www.kaggle.com/spscientist/students-performance-in-exams
20. Twitter Sentiment Analysis Dataset: https://www.kaggle.com/kazanova/sentiment140
21. Digit Recognizer: https://www.kaggle.com/c/digit-recognizer
💻🔍 Don't miss out on these valuable resources for advancing your data science journey!
1. What is the primary difference between R square and adjusted R square?
In linear regression, you use both these values for model validation. However, there is a clear distinction between the two. R square measures the proportion of variation in the dependent variable explained by the independent variables, and it never decreases when you add a variable, even a useless one. Adjusted R square corrects for this by penalizing the number of predictors: it increases only when a new variable improves the model more than would be expected by chance. The formula and a worked sketch follow.
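```python
# Sketch: adjusted R-squared penalizes R-squared for the number of predictors
# (n = number of observations, k = number of predictors).
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adj. R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same R^2 of 0.80 on n = 50 rows looks worse as predictors pile up:
print(adjusted_r2(0.80, n=50, k=2))   # ~0.791
print(adjusted_r2(0.80, n=50, k=20))  # ~0.662
```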
2. What is the curse of dimensionality?
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset corresponds to the number of attributes/features that exist in a dataset. A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high dimensional data. Some of the difficulties that come with high dimensional data manifest during analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models. The difficulties related to training machine learning models due to high dimensional data are referred to as the ‘Curse of Dimensionality’.
3. What are some Stopping Criteria for k-Means Clustering?
a. Convergence: no further changes; points stay in the same cluster.
b. The maximum number of iterations: the algorithm stops once the maximum number of iterations has been reached. This is done to limit the runtime of the algorithm.
c. The variance did not improve by at least x * the initial variance.
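In scikit-learn's KMeans these criteria surface as the max_iter and tol parameters (a sketch; the values shown are the library defaults, the data is synthetic):

```python
# Sketch: k-means stopping criteria exposed as KMeans parameters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(
    n_clusters=3,
    max_iter=300,  # criterion b: hard cap on iterations
    tol=1e-4,      # criteria a/c: stop once centroids barely move
    n_init=10,
).fit(X)
print(km.n_iter_)  # iterations actually used before convergence
```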
4. What are hard margin and soft Margin SVMs?
Hard-margin SVMs work only if the data is linearly separable, and these SVMs are quite sensitive to outliers. In practice the main objective is a good balance between keeping the margin as large as possible and limiting margin violations, i.e., instances that end up inside the margin or even on the wrong side of it; this method is called soft-margin SVM.
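In scikit-learn that balance is the C hyperparameter: a small C tolerates more margin violations (softer margin), a large C punishes them (approaching a hard margin). A sketch on overlapping blobs:

```python
# Sketch: C trades margin width against margin violations in a soft-margin SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    svc = SVC(kernel="linear", C=C).fit(X, y)
    # smaller C typically keeps more support vectors -> wider, softer margin
    print(C, len(svc.support_))
```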
1. What is the Difference Between a Shallow Copy and Deep Copy in python?
copy.deepcopy() creates a new object and recursively copies the child objects of the original, so changes to the original object (including its nested children) are not reflected in the deep copy. copy.copy() creates a shallow copy: a new object populated with references to the original's child objects, so changes made to mutable children of the original are reflected in the copy.
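A minimal sketch of the difference:

```python
# Sketch: shallow copies share child objects; deep copies do not.
import copy

original = {"scores": [1, 2, 3]}

shallow = copy.copy(original)
deep = copy.deepcopy(original)

original["scores"].append(4)  # mutate a child object in place

print(shallow["scores"])  # [1, 2, 3, 4] -> change leaks into the shallow copy
print(deep["scores"])     # [1, 2, 3]    -> the deep copy is unaffected
```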
2. How can you remove duplicate values in a range of cells?
1. To delete duplicate values in a column, first highlight them via Conditional Formatting → Highlight Cells Rules → Duplicate Values in the Home tab, then select the highlighted cells and press the Delete key. After deleting the values, go back to the 'Conditional Formatting' option and choose 'Clear Rules' to remove the highlighting rules from the sheet.
2. You can also remove duplicate values directly by selecting the 'Remove Duplicates' option under Data Tools in the Data tab.
3. Define shelves and sets in Tableau?
Shelves: Every worksheet in Tableau will have shelves such as columns, rows, marks, filters, pages, and more. By placing filters on shelves we can build our own visualization structure. We can control the marks by including or excluding data.
Sets: Sets are used to compute a condition on which the dataset will be prepared; data meeting the condition is grouped together. The fields responsible for this grouping are known as sets. For example, students having grades of more than 70%.
4. Given a table Employee having columns empName and empId, what will be the result of the SQL query below?
select empName from Employee order by 2 asc;
“Order by 2” is valid only when there are at least 2 columns in the SELECT statement, because the number refers to a column's position in the select list. Here the query will throw an error because only one column is used in the SELECT statement; “order by 1” (or “order by empName”) would work.
ENJOY LEARNING 👍👍
Amazing Hackathon-Solved Data Science/ML Project Collection (⭐️ 167 stars on GitHub)
https://github.com/analyticsindiamagazine/MachineHack/tree/master/Hackathon_Solutions
𝗘𝗡𝗝𝗢𝗬 𝗟𝗘𝗔𝗥𝗡𝗜𝗡𝗚 👍👍
FREE DATASETS FOR BUILDING YOUR PORTFOLIO ⭐
1. Supermarket Sales - https://lnkd.in/e86UpCMv
2. Credit Card Fraud Detection - https://lnkd.in/eFTsZDCW
3. FIFA 22 complete player dataset - https://lnkd.in/eDScdUUM
4. Walmart Store Sales Forecasting - https://lnkd.in/eVT6h-CT
5. Netflix Movies and TV Shows - https://lnkd.in/eZ3cduwK
6. LinkedIn Data Analyst jobs listings - https://lnkd.in/ezqxcmrE
7. Top 50 Fast-Food Chains in USA - https://lnkd.in/esBjf5u4
8. Amazon and Best Buy Electronics - https://lnkd.in/e4fBZvJ3
9. Forecasting Book Sales - https://lnkd.in/eXHN2XsQ
10. Real / Fake Job Posting Prediction - https://lnkd.in/e5SDDW9G
Harvard University offers a ton of FREE online courses.
From Computer Science to Artificial Intelligence.
Here are 10 FREE courses you don't want to miss
1. Introduction to Computer Science
An introduction to the intellectual enterprises of computer science and the art of programming.
Check here 👇
https://pll.harvard.edu/course/cs50-introduction-computer-science?delta=0
2. Web Programming with Python and JavaScript
This course takes you deeply into the design and implementation of web apps with Python, JavaScript, and SQL using frameworks like Django, React, and Bootstrap.
Check here 👇
https://pll.harvard.edu/course/cs50s-web-programming-python-and-javascript?delta=0
3. Introduction to Programming with Scratch
A gentle introduction to programming that prepares you for subsequent courses in coding.
Check here 👇
https://pll.harvard.edu/course/cs50s-introduction-programming-scratch?delta=0
4. Introduction to Programming with Python
An introduction to programming using Python, a popular language for general-purpose programming, data science, web programming, and more.
Check here 👇
https://edx.org/course/cs50s-introduction-to-programming-with-python
5. Understanding Technology
This is CS50’s introduction to technology for students who don’t (yet!) consider themselves computer persons.
Check here 👇
https://pll.harvard.edu/course/cs50s-understanding-technology-0?delta=0
6. Introduction to Artificial Intelligence with Python
Learn to use machine learning in Python in this introductory course on artificial intelligence.
Check here 👇
https://pll.harvard.edu/course/cs50s-introduction-artificial-intelligence-python?delta=0
7. Introduction to Game Development
Learn about the development of 2D and 3D interactive games in this hands-on course, as you explore the design of games such as Super Mario Bros., Pokémon, Angry Birds, and more.
Check here 👇
https://pll.harvard.edu/course/cs50s-introduction-game-development?delta=0
8. CS50's Computer Science for Business Professionals
This is CS50’s introduction to computer science for business professionals.
Check here 👇
https://pll.harvard.edu/course/cs50s-computer-science-business-professionals-0?delta=0
9. Mobile App Development with React Native
Learn about mobile app development with React Native, a popular framework maintained by Facebook that enables cross-platform native apps using JavaScript without Java or Swift.
Check here 👇
https://pll.harvard.edu/course/cs50s-mobile-app-development-react-native?delta=0
10. Introduction to Data Science with Python
Join Harvard University instructor Pavlos Protopapas in this online course to learn how to use Python to harness and analyze data.
Check here 👇
https://pll.harvard.edu/course/introduction-data-science-python?delta=0