Data Science & Machine Learning – Telegram
Data Science & Machine Learning
72.1K subscribers
768 photos
1 video
68 files
677 links
Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free

For collaborations: @love_data
Amazon is hiring Data Scientist Intern!
Qualifications: Bachelor's/ Master's Degree
Salary: 5.4 LPA (Expected)
Batch: 2019/2020/2021/2022/2023
Experience: Freshers
Location: Bangalore, India

📌Apply Link: https://www.amazon.jobs/en/jobs/2213292/data-scientist-intern
👍3
Every ML project should keep the following documentation:

• Change log
• Tech debt log
• Potential risks
• Experiment logs
• Future work ideas
• List of assumptions
• ETL pipeline description
👍101
Advanced Data Analytics Using Python.pdf
2.2 MB
Advanced Data Analytics Using Python
With Machine Learning, Deep Learning and NLP Examples
#book #Ml
👍4
Do you want a roadmap for becoming a data scientist in this channel?
Anonymous Poll
96%
Yes
4%
No
🤩64👍1👏1🎉1
Important Topics to become a data scientist
[Advanced Level]
👇👇

1. Mathematics

Linear Algebra
Analytic Geometry
Matrix
Vector Calculus
Optimization
Regression
Dimensionality Reduction
Density Estimation
Classification

2. Probability

Introduction to Probability
1D Random Variable
The function of One Random Variable
Joint Probability Distribution
Discrete Distribution
Normal Distribution

3. Statistics

Introduction to Statistics
Data Description
Random Samples
Sampling Distribution
Parameter Estimation
Hypotheses Testing
Regression

4. Programming

Python:

Python Basics
List
Set
Tuples
Dictionary
Function
NumPy
Pandas
Matplotlib/Seaborn

R Programming:

R Basics
Vector
List
Data Frame
Matrix
Array
Function
dplyr
ggplot2
Tidyr
Shiny

DataBase:
SQL
MongoDB

Data Structures

Web scraping

Linux

Git

5. Machine Learning

How Model Works
Basic Data Exploration
First ML Model
Model Validation
Underfitting & Overfitting
Random Forest
Handling Missing Values
Handling Categorical Variables
Pipelines
Cross-Validation(R)
XGBoost(Python|R)
Data Leakage

6. Deep Learning

Artificial Neural Network
Convolutional Neural Network
Recurrent Neural Network
TensorFlow
Keras
PyTorch
A Single Neuron
Deep Neural Network
Stochastic Gradient Descent
Overfitting and Underfitting
Dropout and Batch Normalization
Binary Classification

7. Feature Engineering

Baseline Model
Categorical Encodings
Feature Generation
Feature Selection

8. Natural Language Processing

Text Classification
Word Vectors

9. Data Visualization Tools

BI (Business Intelligence):
Tableau
Power BI
Qlik View
Qlik Sense

10. Deployment

Microsoft Azure
Heroku
Google Cloud Platform
Flask
Django

Join @datasciencefun to learn important data science and machine learning concepts

ENJOY LEARNING 👍👍
👍307
Some of the essential libraries of Python that are used in Data Science

Numpy

SciPy

Pandas

Matplotlib

Keras

TensorFlow

Scikit-learn
👍14
Python Machine Learning Projects
👇👇
https://news.1rj.ru/str/Programming_experts/151
👍1🥰1
You don't need to buy a GPU for machine learning work!

There are other alternatives. Here are some:

1. Google Colab
2. Kaggle
3. Deepnote
4. AWS SageMaker
5. GCP Notebooks
6. Azure Notebooks
7. Cocalc
8. Binder
9. Saturncloud
10. Datalore
11. IBM Notebooks

Spend your time focusing on your problem.💪💪
👍13
1. What is Dimensionality Reduction?

In practice, machine learning models are built on features, which can be numerous and high-dimensional. Some features are irrelevant or redundant, and high-dimensional data is hard to visualize. Dimensionality reduction cuts the feature set down to a smaller set of principal variables that preserve most of the information in the original features. These principal variables are either a subset of the original variables (feature selection) or combinations of them (feature extraction, as in PCA).
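A minimal sketch of feature extraction, assuming PCA as the technique (the post does not name one). It finds the direction of greatest variance for 2-D points by power iteration on the covariance matrix and projects the points onto it. The function name and the sample points are made up for illustration.

```python
import math

def first_principal_component(points, iterations=100):
    """Return the unit direction of greatest variance and the centered points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Power iteration converges to the dominant eigenvector
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        nx = sxx * vx + sxy * vy
        ny = sxy * vx + syy * vy
        norm = math.hypot(nx, ny)
        vx, vy = nx / norm, ny / norm
    return (vx, vy), centered

# Hypothetical data lying roughly along the line y = 2x
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, centered = first_principal_component(points)
# Two features reduced to one coordinate per point
reduced = [x * direction[0] + y * direction[1] for x, y in centered]
```

Here two correlated features collapse into a single coordinate along the dominant direction, which keeps almost all of the variance.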


2. What is a bin in Tableau?

Bins in Tableau are equal-sized containers that group the values of a measure into ranges. For example, with a bin size of 10, the values 3 and 7 fall into the same bin. Binning turns a continuous measure into a discrete dimension, which makes the distribution of the data easier to view, for instance as a histogram. Any discrete field in Tableau can also be thought of as a set of bins.
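The same idea outside Tableau, as a rough Python sketch: each value is mapped to the start of its equal-width bin and the bins are counted. The helper name, the bin size of 10 and the values are assumptions for illustration, not Tableau's implementation.

```python
from collections import Counter

def bin_start(value, bin_size):
    """Map a value to the lower edge of its equal-width bin."""
    return (value // bin_size) * bin_size

values = [3, 7, 12, 18, 25]
# Count how many values land in each bin of width 10
counts = Counter(bin_start(v, 10) for v in values)
```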


3. What’s a Fourier transform?

A Fourier transform is a general method for decomposing a function into a superposition of sinusoids (sines and cosines). As one intuitive tutorial puts it: given a smoothie, it’s how we find the recipe. The Fourier transform finds the set of cycle speeds, amplitudes, and phases that match a time signal. It converts a signal from the time domain to the frequency domain, which makes it a very common way to extract features from audio signals or other time series such as sensor data.
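To make the "find the recipe" idea concrete, here is a small sketch that uses a naive discrete Fourier transform (O(n²), fine for a demo; real code would use an FFT library). The test signal mixing 5 and 12 cycles per window is an assumption chosen for illustration.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

n = 64
# A signal made of two sinusoids: 5 cycles (amplitude 1) and 12 cycles (amplitude 0.5)
signal = [
    math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
spectrum = dft(signal)
# Only the first half of the spectrum is needed for a real signal
magnitudes = [abs(c) for c in spectrum[: n // 2]]
top_two = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:2]
```

The two largest magnitudes sit at frequency bins 5 and 12, which recovers the ingredients of the mixed signal.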


4. What are a super key and a candidate key in SQL?

A super key is a single attribute or a combination of attributes that uniquely identifies a record in a table. A super key may contain attributes that are not needed for identifying the record.

A candidate key is a minimal super key: a set of one or more attributes that uniquely identifies a record and from which no attribute can be removed without losing that uniqueness. Unlike a general super key, every attribute of a candidate key is necessary to identify the records.
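A small sketch of the difference, checked against sample rows. Keys are really schema-level constraints, so treating a handful of rows as the whole table is a simplifying assumption. The table contents and function names are hypothetical.

```python
from itertools import combinations

def is_super_key(rows, attributes):
    """True if the given attributes identify every row uniquely."""
    projected = [tuple(row[a] for a in attributes) for row in rows]
    return len(set(projected)) == len(rows)

def candidate_keys(rows, attributes):
    """Minimal super keys: no proper subset is itself a super key."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            contains_smaller_key = any(set(k) <= set(combo) for k in keys)
            if not contains_smaller_key and is_super_key(rows, combo):
                keys.append(combo)
    return keys

rows = [
    {"id": 1, "email": "a@x.com", "name": "Ann"},
    {"id": 2, "email": "b@x.com", "name": "Ann"},
    {"id": 3, "email": "c@x.com", "name": "Bob"},
]
keys = candidate_keys(rows, ["id", "email", "name"])
```

In this data ("id", "name") is a super key but not a candidate key, because "id" alone already identifies each row.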
👍74
You don't need to spend several $𝟭𝟬𝟬𝟬𝘀 to learn Data Science.

Stanford University, Harvard University and the Massachusetts Institute of Technology provide free courses.💥

Here are 8 free courses that'll teach you better than the paid ones:


1. CS50’s Introduction to Artificial Intelligence with Python (Harvard)

https://pll.harvard.edu/course/cs50s-introduction-artificial-intelligence-python

2. Data Science: Machine Learning (Harvard)

https://pll.harvard.edu/course/data-science-machine-learning

3. Artificial Intelligence (MIT)

https://lnkd.in/dG5BCPen

4. Introduction to Computational Thinking and Data Science (MIT)

https://lnkd.in/ddm5Ckk9

5. Machine Learning (MIT)

https://lnkd.in/dJEjStCw

6. Matrix Methods in Data Analysis, Signal Processing, and Machine Learning (MIT)

https://lnkd.in/dkpyt6qr

7. Statistical Learning (Stanford)

https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning

8. Mining Massive Data Sets (Stanford)

📍https://online.stanford.edu/courses/soe-ycs0007-mining-massive-data-sets

ENJOY LEARNING
👏8👍54🥰1
1. What does weight initialization mean in neural networks?

Weight initialization is the choice of starting values for a network's weights before training, and it is one of the essential factors in training. A bad initialization can prevent a network from learning. A good one gives quicker convergence and a lower overall error. Biases can be initialized to zero, but weights should not all be zero, because identical weights receive identical updates and the neurons never differentiate. The standard rule is to set the weights to small random values: close to zero without being too small.
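One common way to get weights that are "close to zero without being too small" is Xavier (Glorot) uniform initialization. The post does not name a scheme, so this choice and the layer sizes below are assumptions for illustration.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)               # fixed seed so the sketch is repeatable
weights = xavier_uniform(256, 128, rng)
biases = [0.0] * 128                 # biases can start at zero
```

The limit shrinks as the layer gets wider, which keeps the scale of activations roughly stable from layer to layer.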

2. What is cross-validation in machine learning?

Cross-validation is a resampling technique for estimating how well a machine learning model will perform on unseen data, which in turn helps in choosing models and hyperparameters. The dataset is split into smaller parts with roughly the same number of rows; one part is held out as the test set and the rest are used as the training set. In k-fold cross-validation this is repeated so that each part serves as the test set once, and the scores are averaged. Cross-validation includes the following techniques:
• Holdout method
• K-fold cross-validation
• Stratified k-fold cross-validation
• Leave p-out cross-validation
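As a sketch of the k-fold variant from the list above, the generator below shuffles the row indices and yields one train/test split per fold. The function name and the choice of 10 samples with 5 folds are illustrative assumptions.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
```

Each row lands in the test set exactly once, so averaging the per-fold scores uses every row for both training and evaluation.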

3. What is a self-join?

A self-join is a join in which a table is joined with itself, so it represents a unary relationship. The table appears twice in the query under different aliases, and each row is matched against the rows of the same table that satisfy the join condition. A self-join is mostly used to combine and compare rows from the same database table.
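A quick way to try a self-join is an in-memory SQLite database. The employees table and its rows below are hypothetical; the table is joined to itself to pair each employee with their manager.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", None), (2, "Ben", 1), (3, "Chen", 1), (4, "Dina", 2)],
)
# The same table appears twice under the aliases e (employee) and m (manager)
rows = conn.execute(
    """
    SELECT e.name, m.name
    FROM employees AS e
    JOIN employees AS m ON e.manager_id = m.id
    ORDER BY e.id
    """
).fetchall()
conn.close()
```

Asha has no manager, so the inner join drops her row; a LEFT JOIN would keep it with a NULL manager.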

4. What are the types of views in SQL?

In SQL, the views are classified into four types. They are:
Simple View: A view that is based on a single table and does not have a GROUP BY clause or other features.
Complex View: A view that is built from several tables and includes a GROUP BY clause as well as functions.
Inline View: A view that is built on a subquery in the FROM clause, which provides a temporary table and simplifies a complicated query.
Materialized View: A view that stores both the query definition and the result data. It keeps a physical copy of the data, which must be refreshed when the underlying tables change.
👍7
1. What do you understand by the term silhouette coefficient?

The silhouette coefficient is a measure of how well clustered together a data point is with respect to the other points in its cluster. It is a measure of how similar a point is to the points in its own cluster, and how dissimilar it is to the points in other clusters. The silhouette coefficient ranges from -1 to 1, with 1 being the best possible score and -1 being the worst possible score.
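For a single point the usual formula is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. A minimal sketch with made-up points and labels:

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient of point i, using Euclidean distance."""
    def mean_distance_to(cluster):
        dists = [math.dist(points[i], p)
                 for j, p in enumerate(points)
                 if labels[j] == cluster and j != i]
        return sum(dists) / len(dists)

    own = labels[i]
    if labels.count(own) == 1:
        return 0.0  # common convention for single-point clusters
    a = mean_distance_to(own)
    b = min(mean_distance_to(c) for c in set(labels) if c != own)
    return (b - a) / max(a, b)

# Two tight, well-separated clusters (hypothetical data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
scores = [silhouette(i, points, labels) for i in range(len(points))]
```

Because the clusters are compact and far apart, every score comes out close to 1.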


2. What is the difference between trend and seasonality in time series?

Trends and seasonality are two characteristics of time series metrics that break many models. Trends are continuous increases or decreases in a metric’s value. Seasonality, on the other hand, reflects periodic (cyclical) patterns that occur in a system, usually rising above a baseline and then decreasing again.
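One simple way to separate the two, sketched below, is to estimate the trend with a centered moving average whose window equals the seasonal period and treat what is left as seasonality. The synthetic series (slope 0.5 plus a repeating pattern of period 3) is an assumption made for the demo.

```python
def centered_moving_average(series, window):
    """Average over an odd-sized window centered on each interior point."""
    half = window // 2
    return [sum(series[t - half:t + half + 1]) / window
            for t in range(half, len(series) - half)]

period = 3
pattern = [1.0, 0.0, -1.0]                      # seasonal component, sums to zero
series = [0.5 * t + pattern[t % period] for t in range(12)]

trend = centered_moving_average(series, period)  # aligned with t = 1 .. 10
# Subtracting the trend leaves the seasonal component
seasonal = [series[t + 1] - trend[t] for t in range(len(trend))]
```

Averaging over one full period cancels the seasonal pattern, so the moving average follows the trend line and the remainder repeats every three steps.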


3. What is Bag of Words in NLP?

Bag of Words is a commonly used model that relies on word frequencies or occurrences to train a classifier. It creates an occurrence matrix for documents or sentences, irrespective of their grammatical structure or word order.
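A minimal sketch of the occurrence matrix, using whitespace tokenization and two made-up sentences (real pipelines usually add punctuation handling and stop-word removal):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all words seen in any document
vocab = sorted({word for doc in tokenized for word in doc})
# One row per document, one column per vocabulary word, values are counts
matrix = [[Counter(doc)[word] for word in vocab] for doc in tokenized]
```

Word order is lost: "the cat sat on the mat" and "the mat sat on the cat" would produce the same row.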


4. What is the difference between bagging and boosting?

Bagging trains homogeneous weak learners independently and in parallel, each on a bootstrap sample of the data, and combines them by averaging (or voting on) their predictions. Boosting also uses homogeneous weak learners but works differently: the learners are trained sequentially and adaptively, each one focusing on the errors of the previous ones, to improve the predictions of the combined model.
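A rough sketch of both ideas with one-split regression stumps as the weak learner. The boosting variant here fits each new stump to the current residuals (gradient boosting for squared error), which is one of several boosting schemes. The data, the number of models and the learning rate are assumptions for illustration.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regressor that predicts the mean on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the overall mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def bagging(xs, ys, n_models, rng):
    """Independent stumps on bootstrap samples, combined by averaging."""
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return lambda x: sum(m(x) for m in models) / len(models)

def boosting(xs, ys, n_models, learning_rate):
    """Sequential stumps, each fitted to the residuals left by the previous ones."""
    models = []
    residuals = list(ys)
    for _ in range(n_models):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: learning_rate * sum(m(x) for m in models)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]
single = fit_stump(xs, ys)
bagged = bagging(xs, ys, n_models=25, rng=random.Random(0))
boosted = boosting(xs, ys, n_models=50, learning_rate=0.5)
```

The bagged stumps could be trained in any order or in parallel, while each boosted stump depends on the residuals of all the stumps before it.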

ENJOY LEARNING 👍👍
👍4
1. Explain Gradient Descent algorithm.

Ans. Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.
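A minimal sketch for linear regression with a mean-squared-error loss: each step moves the slope and intercept against the gradient. The learning rate, step count and toy data (generated from y = 2x + 1) are assumptions for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step in the direction of the negative gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

With a learning rate that is too large the updates overshoot and diverge; with one that is too small, convergence needs many more steps.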

2. Why is logistic regression used for classification instead of linear regression?

Ans. With linear regression, all predictions >= 0.5 could be treated as class 1 and all predictions < 0.5 as class 0. So why can’t classification be performed this way? Suppose we are classifying a mail as spam or not spam, and our output y is 0 (spam) or 1 (not spam). With linear regression, hθ(x) can be > 1 or < 0, although a prediction interpreted as a probability should lie between 0 and 1. Logistic regression passes the linear output through the sigmoid function, which maps it into the range (0, 1), and that is why it is used for classification tasks.
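The range problem is easy to see numerically. In this sketch the weight, bias and inputs are hypothetical; the linear output leaves [0, 1], while the sigmoid of the same output always stays inside (0, 1).

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.8, -1.0                        # hypothetical model parameters
inputs = [-2.0, 0.0, 1.0, 3.0, 5.0]

linear_outputs = [w * x + b for x in inputs]              # can be < 0 or > 1
logistic_outputs = [sigmoid(z) for z in linear_outputs]   # always in (0, 1)
predictions = [1 if p >= 0.5 else 0 for p in logistic_outputs]
```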


3. What is the Gini Index?

Ans. The Gini index (Gini impurity) is a score that evaluates how pure the groups produced by a split are. It is 0 when all observations in a group belong to one class and grows as the classes become more evenly mixed, reaching a maximum of 1 - 1/k for k equally frequent classes (0.5 for two classes), so it always lies in the range between 0 and 1. We want the Gini index of a split to be as low as possible. It is one of the criteria used to choose splits when building a decision tree.
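The Gini impurity of a group is 1 minus the sum of squared class proportions, and a split is scored by the size-weighted impurity of its two sides. A short sketch with made-up labels:

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of a two-way split."""
    n = len(left) + len(right)
    return (len(left) / n * gini_impurity(left)
            + len(right) / n * gini_impurity(right))

pure = gini_impurity(["spam"] * 4)                     # one class only
mixed = gini_impurity(["spam", "spam", "ham", "ham"])  # two classes, 50/50
perfect_split = split_gini(["spam", "spam"], ["ham", "ham"])
```

A decision tree that uses this criterion picks, at each node, the split with the lowest weighted impurity.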

4. Why is DBSCAN used over K means and other clustering methods?

Ans. Partitioning methods (K-means, PAM clustering) and hierarchical clustering work well for finding spherical or convex clusters. In other words, they are best suited to compact and well-separated clusters. They are also severely affected by noise and outliers in the data.
Real-life data may contain irregularities, such as:
• Clusters of arbitrary shape, such as non-convex clusters
• Noise
Given such data, the k-means algorithm has difficulty identifying clusters with arbitrary shapes. DBSCAN groups points by density instead, so it can find clusters of arbitrary shape and labels points in sparse regions as noise.
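A compact, unoptimized sketch of the DBSCAN idea: a point with at least min_pts neighbors within eps (counting itself) is a core point, clusters grow outward from core points, and points reachable from no core point are marked as noise. The parameter values and sample points are assumptions for illustration.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may be claimed later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
        cluster += 1
    return labels

# Two elongated (non-spherical) groups and one isolated outlier
points = [(0, 0), (1, 0), (2, 0), (3, 0),
          (0, 5), (1, 5), (2, 5), (3, 5),
          (10, 10)]
labels = dbscan(points, eps=1.1, min_pts=2)
```

Unlike k-means, the number of clusters is not given up front; it falls out of eps and min_pts.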

ENJOY LEARNING 👍👍
👍9
Data Science Interview Q&A

1. What are the different types of pooling? Explain their characteristics.

Max pooling: Once we have the feature map of the input, we slide a window (filter) of fixed shape across it and take the maximum value from each portion the window covers. It is also known as subsampling, because from each portion of the feature map covered by the filter or kernel we sample a single maximum value.
Average pooling: Computes the average value of the portion of the feature map covered by the kernel or filter.
Sum pooling: Computes the sum of all elements in that window.
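All three variants differ only in the function applied to each window, as this sketch shows. It assumes non-overlapping square windows (stride equal to the window size) and a made-up 4x4 feature map.

```python
def pool2d(feature_map, size, reducer):
    """Apply reducer to each non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reducer(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, lambda w: sum(w) / len(w))
sum_pooled = pool2d(feature_map, 2, sum)
```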


2. What is a Moving Average Process in Time series?

In time-series analysis, the moving-average (MA) process is a common approach for modeling univariate time series. The moving-average model specifies that the output variable depends linearly on the current and past values of a stochastic (white-noise) error term. An MA(q) model uses the current error and the q previous ones.
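In symbols, an MA(q) process is x_t = mean + e_t + theta_1 * e_(t-1) + ... + theta_q * e_(t-q). The sketch below builds the series from a given noise sequence; the coefficients and the single-impulse noise are assumptions chosen to make the finite memory visible.

```python
def moving_average_process(noise, thetas, mean=0.0):
    """Generate x_t = mean + e_t + sum_k thetas[k-1] * e_(t-k)."""
    series = []
    for t in range(len(noise)):
        value = mean + noise[t]
        for k in range(1, len(thetas) + 1):
            if t - k >= 0:
                value += thetas[k - 1] * noise[t - k]
        series.append(value)
    return series

# A single shock at t = 0 and an MA(2) model with hypothetical coefficients
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = moving_average_process(impulse, thetas=[0.6, 0.3])
```

The shock influences the series for exactly q = 2 further steps and then vanishes, unlike an autoregressive process where it decays gradually.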

3. What is the difference between HAVING and WHERE in SQL?

The WHERE clause specifies the criteria that individual records must meet to be selected by a query. It can be used without the GROUP BY clause. The HAVING clause is normally used together with GROUP BY. The WHERE clause filters rows before grouping; the HAVING clause filters groups after grouping. The WHERE clause cannot contain aggregate functions; the HAVING clause can.
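Both clauses can appear in one query, which makes the order of operations visible. This sketch uses an in-memory SQLite database with a hypothetical sales table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 50), ("north", 150), ("south", 200),
     ("south", 300), ("east", 20), ("east", 500)],
)
rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 100          -- filters individual rows before grouping
    GROUP BY region
    HAVING SUM(amount) > 200     -- filters whole groups after aggregation
    ORDER BY region
    """
).fetchall()
conn.close()
```

WHERE first drops the two small sales; HAVING then drops the north group, whose remaining total is only 150.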


4. What is relative cell referencing in Excel?

In relative referencing, the cell addresses in a formula change when the formula is copied from one cell to another, shifting relative to the destination cell. This type of referencing is the default, and it doesn’t require a dollar sign in the formula. In absolute cell referencing, by contrast, the addresses do not change when a formula is copied, irrespective of the destination cell; absolute references are written with dollar signs, as in $A$1.

ENJOY LEARNING 👍👍
👍4👎1
Machine Learning Glossary  |  Google Developers

Compilation of key machine-learning and TensorFlow terms, with beginner-friendly definitions. 🤓

https://developers.google.com/machine-learning/glossary/
👍9
Important metrics to track when monitoring a machine learning model
A high-level overview of becoming a machine learning engineer
🔥9👍5