Data Science & Machine Learning
Join this channel to learn data science, artificial intelligence and machine learning with fun quizzes, interesting projects and amazing resources for free

For collaborations: @love_data
👉 The Ultimate Guide to the Pandas Library for Data Science in Python
👇👇

https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/amp/

A Visual Intro to NumPy and Data Representation

Link : 👇👇
https://jalammar.github.io/visual-numpy/

Matplotlib Cheatsheet 👇👇

https://github.com/rougier/matplotlib-cheatsheet

SQL Cheatsheet 👇👇

https://websitesetup.org/sql-cheat-sheet/

Seeing Theory : A visual introduction to probability and statistics

Link :👇👇
https://seeing-theory.brown.edu/

“The Projects You Should Do to Get a Data Science Job” by Ken Jee
👇👇
https://link.medium.com/Q2DnxSGRO6

Precision is one indicator of a machine learning model's performance – the quality of a positive prediction made by the model. What is its formula?
Anonymous Quiz
43%
True Positive divided by actual yes
10%
True Positive divided by actual no
43%
True Positive divided by predicted yes
4%
True Positive divided by predicted no
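
For reference, precision is true positives divided by predicted yes, i.e. TP / (TP + FP). A minimal sketch on made-up labels, cross-checked against scikit-learn:

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (toy data)
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]  # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)  # true positives: 3
predicted_yes = sum(y_pred)                                 # all predicted "yes": 5
print(tp / predicted_yes)                                   # 0.6, computed by hand
print(precision_score(y_true, y_pred))                      # 0.6, via scikit-learn
```
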
👉 A handy notebook on handling missing values

Link : 👇👇
https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python

A list of NLP Tutorials

Link : 👇👇
https://github.com/lyeoni/nlp-tutorial


“An Implementation and Explanation of the Random Forest in Python” by Will Koehrsen 👇👇
https://link.medium.com/GCWFv81v95

“How to analyse 100s of GBs of data on your laptop with Python” by Jovan Veljanoski 👇👇
https://link.medium.com/V8xS82Cax6
The F-beta score is always
Anonymous Quiz
27%
Greater than 1
70%
Between 0 and 1
3%
Less than 0
Here is the explanation for the quiz

The F-beta score is the weighted harmonic mean of precision and recall. Its best possible value is 1.0, indicating perfect precision and recall, and its worst possible value is 0, which occurs when either precision or recall is zero. The F-score is commonly used for evaluating information retrieval systems such as search engines, and also for many kinds of machine learning models, particularly in natural language processing.
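
A quick way to verify the 0-to-1 range yourself is scikit-learn's fbeta_score (the labels below are toy data):

```python
from sklearn.metrics import fbeta_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), always between 0 and 1
print(fbeta_score(y_true, y_pred, beta=0.5))  # beta < 1 weights precision higher
print(fbeta_score(y_true, y_pred, beta=2.0))  # beta > 1 weights recall higher
print(f1_score(y_true, y_pred))               # beta = 1: harmonic mean of P and R
```
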
What is the full form of LSTM?

Hint: The LSTM algorithm is used for processing and making predictions based on time-series data
Anonymous Quiz
7%
Long story total memory
72%
Long short-term memory
16%
Long short-term machine
4%
None of the three
Amazing response from you guys in this poll

Let's start with project #1

Fake News Detection

This is an example of text classification, since we need to classify whether a news article is real or fake.

You can use a dataset from Kaggle to work on this amazing project:
https://bit.ly/3FGcyoJ
Or
https://www.kaggle.com/c/fake-news/data

Before you work on this project, you should have a fair understanding of the topics below.

Concepts: Stopwords, Porter Stemmer, Tokenisation, TfidfVectorizer, LSTM, NLP

Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, re, nltk, TensorFlow

Steps:
1. Go through the dataset
2. Import the libraries
3. Exploratory Data Analysis [EDA]
4. Data Visualization
5. Data Preparation using Tokenisation and padding
6. Apply theoretical concepts to reduce unnecessary words using Stopwords and the Porter Stemmer, then convert text to vectors using TfidfVectorizer (or CountVectorizer).
7. Split the dataset into training and test sets
8. Build and train the model using ML algorithms
9. Evaluate the model using accuracy, recall, precision, the confusion matrix and other metrics

Algorithms you can apply:
Logistic Regression, Support Vector Machine, Multilayer Perceptron, KNN, Random Forest, Linear SVM, etc. (a minimal sketch using one of them follows below)
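
Here's a minimal sketch of steps 5–9 using TF-IDF and Logistic Regression. The file name 'train.csv' and the column names 'text' and 'label' are assumptions – rename them to match whichever dataset you download:

```python
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

nltk.download("stopwords")

df = pd.read_csv("train.csv").dropna(subset=["text"])  # assumed columns: text, label

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def clean(doc):
    # keep letters only, lowercase, drop stopwords, stem what remains
    words = re.sub("[^a-zA-Z]", " ", doc).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stops)

X = TfidfVectorizer(max_features=5000).fit_transform(df["text"].apply(clean))
y = df["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = model.predict(X_te)
print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred))  # accuracy, precision, recall per class
```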

ENJOY LEARNING 👍👍
Overview of some important concepts:

👉 Natural Language Processing, or NLP, is a subfield of Artificial Intelligence that enables machines to understand human language. Its goal is to build systems that can make sense of text and automatically perform tasks like translation, spell check, or text classification.
NLP analyzes the grammatical structure of sentences and the individual meanings of words, then uses algorithms to extract meaning and deliver outputs. In other words, it makes sense of human language so that tasks can be performed on it automatically.

👉 Tokenization is part of syntactic analysis; it breaks a text up into smaller parts called tokens (which can be sentences or words) to make the text easier to handle.

👉 The stop-word removal technique removes frequently occurring words that don't add any semantic value, such as I, they, have, like, yours, etc.

👉 The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the common morphological and inflectional endings from words in English. For example, words such as “likes”, “liked”, “likely” and “liking” will all be reduced to “like” after stemming.

👉 TfidfVectorizer is used to transform text into feature vectors that can be used as input to an estimator.

👉 LSTM (Long Short-Term Memory) networks are well-suited for classifying, processing and making predictions based on time-series data.
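
A tiny NLTK demo tying tokenization, stop-word removal and stemming together (the sentence is made up; the outputs in the comments are what the Porter stemmer typically produces):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

text = "She likes films, he liked them, and they will most likely keep liking them"

tokens = word_tokenize(text.lower())                                 # tokenization
tokens = [t for t in tokens if t.isalpha()]                          # drop punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stop-word removal

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# ['like', 'film', 'like', 'like', 'keep', 'like'] – likes/liked/likely/liking -> like
```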

ENJOY LEARNING 👍👍
Some important questions to crack a data science interview

Q. Describe how Gradient Boosting works.

A. Gradient boosting is a type of machine learning boosting. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. If a small change in the prediction for a case causes no change in error, then the next target outcome for that case is zero. Gradient boosting produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
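
A hedged sketch of this with scikit-learn's GradientBoostingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# an ensemble of shallow trees, each one fit to the errors of the ensemble so far
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gb.fit(X_tr, y_tr)
print(gb.score(X_te, y_te))  # mean accuracy on held-out data
```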


Q. Describe the decision tree model.

A. Decision Trees are a type of supervised machine learning model where the data is continuously split according to certain parameters. The leaves are the decisions or final outcomes. In other words, a decision tree is a machine learning algorithm that partitions the data into subsets.
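
You can see the splitting directly by printing a fitted tree, e.g. on the classic Iris dataset (a small sketch with depth capped for readability):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# each internal node splits the data on one feature; the leaves are the outcomes
print(export_text(tree, feature_names=list(iris.feature_names)))
```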


Q. What is a neural network?

A. Neural networks, also known as Artificial Neural Networks, are a set of algorithms, modeled loosely on the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. Deep learning is the subfield of machine learning built on neural networks with many layers.


Q. Explain the Bias-Variance Tradeoff

A. The bias–variance tradeoff is the property of a model that the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters. In practice, simple models tend to have high bias and low variance (underfitting), while complex models tend to have low bias and high variance (overfitting); the goal is to balance the two.


Q. What’s the difference between L1 and L2 regularization?

A. The main intuitive difference between L1 and L2 regularization is that L1 regularization tries to estimate the median of the data, while L2 regularization tries to estimate the mean of the data, in order to avoid overfitting. Concretely, L1 (lasso) adds the sum of the absolute values of the coefficients as a penalty and can shrink some coefficients exactly to zero, whereas L2 (ridge) adds the sum of their squares and only shrinks coefficients toward zero.
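
A small sketch contrasting the two on the same synthetic data – L1 (Lasso) zeroes some coefficients out, L2 (Ridge) only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: sum of |coefficients|
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: sum of squared coefficients

print("coefficients set to zero by L1:", np.sum(lasso.coef_ == 0))  # typically several
print("coefficients set to zero by L2:", np.sum(ridge.coef_ == 0))  # typically none
```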

ENJOY LEARNING 👍👍
Which of the following is not a python library?
Anonymous Quiz
1%
Pandas
2%
Numpy
4%
Matplotlib
80%
Dictionary
13%
Seaborn
Which of the following is used specifically for applying machine learning algorithms?
Anonymous Quiz
14%
Matplotlib
71%
Scikit-learn
7%
Seaborn
8%
Scipy
Some important questions to crack a data science interview – Part 2

𝐐1. 𝐩-𝐯𝐚𝐥𝐮𝐞?

𝐀ns. The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true – in other words, how likely the observed difference is to have occurred by random chance alone. The lower the p-value, the greater the statistical significance of the observed difference. The p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing.
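
For example, a two-sample t-test in SciPy returns the p-value directly (the groups below are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=10, size=100)  # true mean 50
group_b = rng.normal(loc=53, scale=10, size=100)  # true mean 53

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)  # a small p-value means the difference is unlikely to be pure chance
```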


𝐐2. 𝐈𝐧𝐭𝐞𝐫𝐩𝐨𝐥𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐄𝐱𝐭𝐫𝐚𝐩𝐨𝐥𝐚𝐭𝐢𝐨𝐧?

𝐀ns. Interpolation is the process of estimating an unknown value that lies within the range of the known values, whereas extrapolation is the process of estimating unknown values beyond the given data points.
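
A quick SciPy illustration – the same fitted function interpolates inside the known range and extrapolates beyond it (the data here is a made-up y = 2x):

```python
from scipy.interpolate import interp1d

x = [0, 1, 2, 3, 4]
y = [0, 2, 4, 6, 8]  # y = 2x

f = interp1d(x, y, fill_value="extrapolate")
print(f(2.5))  # 5.0  -> interpolation: inside the known range
print(f(6.0))  # 12.0 -> extrapolation: beyond the given data points
```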



𝐐3. 𝐔𝐧𝐢𝐟𝐨𝐫𝐦 𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧 & 𝐧𝐨𝐫𝐦𝐚𝐥 𝐝𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧?

𝐀ns. The normal distribution is bell-shaped, which means values near the center of the distribution are more likely to occur than values on the tails. The uniform distribution is rectangular-shaped, which means every value in the distribution is equally likely to occur.

𝐐4. 𝐑𝐞𝐜𝐨𝐦𝐦𝐞𝐧𝐝𝐞𝐫 𝐒𝐲𝐬𝐭𝐞𝐦𝐬?

𝐀ns. A recommender system deals with the likes and dislikes of users. Its major objective is to recommend an item to a user that the user has a high chance of liking or needing, based on their previous purchases. It is like having a personalized assistant that understands our likes and dislikes and helps us make unbiased decisions about a particular item, by making use of the large amounts of data generated in data repositories every day.

𝐐5. 𝐉𝐎𝐈𝐍 𝐟𝐮𝐧𝐜𝐭𝐢𝐨𝐧 𝐢𝐧 𝐒𝐐𝐋

𝐀ns. The SQL JOIN clause is used to combine records from two or more tables in a database, based on a related column between them.
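
A self-contained demo with Python's built-in sqlite3 module (the tables and values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# JOIN combines records from both tables on the related column
rows = con.execute("""
    SELECT c.name, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
""").fetchall()
print(rows)  # [('Asha', 99.0), ('Asha', 25.0), ('Ben', 40.0)]
```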

𝐐6. 𝐒𝐪𝐮𝐚𝐫𝐞𝐝 𝐞𝐫𝐫𝐨𝐫 𝐚𝐧𝐝 𝐚𝐛𝐬𝐨𝐥𝐮𝐭𝐞 𝐞𝐫𝐫𝐨𝐫?

𝐀ns. Mean squared error (MSE) and mean absolute error (MAE) are both used to evaluate accuracy on regression problems. The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization; it also penalizes large errors more heavily.
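
Both metrics in a few lines – note how the single large error dominates MSE but not MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([3.5, 5.0, 7.5, 19.0])  # one large error

print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 0.5 + 10) / 4 = 2.75
print(mean_squared_error(y_true, y_pred))   # (0.25 + 0 + 0.25 + 100) / 4 = 25.125
```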

ENJOY LEARNING 👍👍
Interview Questions with Answers
Part-3
👇👇

Q. What is the loss function SVM tries to minimize?


A. Although there is no "loss function" for hard-margin SVMs, a loss does exist when solving soft-margin SVMs: the hinge loss. The hinge loss is a loss function used in machine learning to train classifiers; it is utilised for "maximum-margin" classification, most notably for support vector machines (SVMs).
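
The hinge loss itself is one line: max(0, 1 − y·f(x)), with labels in {−1, +1}. A NumPy sketch on made-up scores:

```python
import numpy as np

def hinge_loss(y_true, scores):
    # y_true in {-1, +1}; scores are the raw decision values f(x)
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.5, -3.0, 0.2])  # the last point is on the wrong side
print(hinge_loss(y, f))  # (0 + 0.5 + 0 + 1.2) / 4 = 0.425
```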



Q. How do you detect heteroscedasticity in simple linear regression?

A. Heteroscedasticity refers to the situation where the spread of the residuals changes in a systematic way over the range of observed values. A fitted-values vs. residuals plot is the simplest technique for detecting heteroscedasticity: a "cone" shape, where the residuals become significantly more spread out as the fitted values get larger, is a clear marker of it. The Breusch-Pagan test is a more formal, mathematical method of detecting heteroscedasticity.
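
A sketch of the Breusch-Pagan test with statsmodels, on synthetic data whose noise grows with x (so the test should flag heteroscedasticity):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(0, x)  # residual spread grows with x

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(lm_pvalue)  # a small p-value is evidence of heteroscedasticity
```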


Q. Explain ANOVA?

A. The analysis of variance (ANOVA) is a statistical technique for determining if the means of two or more groups differ significantly. One-way ANOVA, two-way ANOVA, and multivariate ANOVA are the three types. An ANOVA's null hypothesis is that there is no significant difference between the groups; the alternative hypothesis proposes that there is at least one substantial difference between them. If the p-value associated with the F statistic is less than 0.05, the null hypothesis is rejected in favour of the alternative, and one concludes that the means of the groups are not all equal.
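
A one-way ANOVA is a single SciPy call (the three groups below are toy data):

```python
from scipy import stats

group1 = [85, 90, 88, 75, 95]
group2 = [70, 65, 80, 72, 68]
group3 = [88, 92, 90, 85, 94]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(p_value)  # below 0.05 -> reject H0 that all group means are equal
```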


Q. How do you determine the number of neighbors in KNN?

A. The number of neighbors (K) in KNN is a hyperparameter that must be chosen during model construction. There is no ideal number of neighbors for all types of data sets. A small number of neighbors gives the most flexible fit, resulting in low bias but high variance, whereas a large number of neighbors gives a smoother decision boundary, resulting in lower variance but higher bias. If the number of classes is even, data scientists usually choose an odd K to avoid ties. You can also test the model's performance by building it with different values of K and comparing the results; the elbow technique is another option.
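
One common recipe: cross-validate over a range of odd K values and pick the best (a sketch on a built-in dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# small K: low bias / high variance; large K: smoother boundary / higher bias
for k in range(1, 22, 2):  # odd values avoid ties in binary classification
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(score, 3))
```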



Q. What do you mean by central tendency?

A. Central tendency is a description of a dataset by a single value that represents the center of its distribution. The following measures can be used to describe the central tendency of a dataset: the mean is the sum of all values divided by the total number of values; the median is the middle value of the dataset sorted in ascending order; and the mode is the most frequently occurring value. Although these are the most commonly used measures, there are others, such as the geometric mean, harmonic mean, midrange, and geometric median.
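
All three standard measures are in Python's statistics module (toy data):

```python
import statistics

data = [2, 3, 3, 5, 7, 7, 7, 10]

print(statistics.mean(data))    # 5.5 -> sum of values / number of values
print(statistics.median(data))  # 6.0 -> middle of the sorted values
print(statistics.mode(data))    # 7   -> most frequently occurring value
```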


ENJOY LEARNING 👍👍
▪️11 MACHINE LEARNING METHODS YOU SHOULD LEARN

1. Regression
2. Classification
3. Clustering
4. Dimensionality Reduction
5. Ensemble Methods
6. Neural Networks and Deep Learning
7. Transfer learning
8. Reinforcement Learning
9. Natural Language Processing
10. Computer Vision
11. Word Embeddings