Most Machine Learning articles on Medium are low quality and repetitive. Titles are usually clickbait. Most start with a story that is unnecessary and adds nothing. In maybe 5-10% of them the content is useful, but the rest are of little value. Sorry if I hurt feelings. Agree 👍
Important Topics to become a data scientist
[Advanced Level]
👇👇
1. Mathematics
Linear Algebra
Analytic Geometry
Matrix
Vector Calculus
Optimization
Regression
Dimensionality Reduction
Density Estimation
Classification
2. Probability
Introduction to Probability
1D Random Variable
Functions of One Random Variable
Joint Probability Distribution
Discrete Distribution
Normal Distribution
3. Statistics
Introduction to Statistics
Data Description
Random Samples
Sampling Distribution
Parameter Estimation
Hypothesis Testing
Regression
4. Programming
Python:
Python Basics
List
Set
Tuples
Dictionary
Function
NumPy
Pandas
Matplotlib/Seaborn
R Programming:
R Basics
Vector
List
Data Frame
Matrix
Array
Function
dplyr
ggplot2
Tidyr
Shiny
Database:
SQL
MongoDB
Data Structures
Web scraping
Linux
Git
5. Machine Learning
How Models Work
Basic Data Exploration
First ML Model
Model Validation
Underfitting & Overfitting
Random Forest
Handling Missing Values
Handling Categorical Variables
Pipelines
Cross-Validation (R)
XGBoost (Python | R)
Data Leakage
6. Deep Learning
Artificial Neural Network
Convolutional Neural Network
Recurrent Neural Network
TensorFlow
Keras
PyTorch
A Single Neuron
Deep Neural Network
Stochastic Gradient Descent
Overfitting and Underfitting
Dropout and Batch Normalization
Binary Classification
7. Feature Engineering
Baseline Model
Categorical Encodings
Feature Generation
Feature Selection
8. Natural Language Processing
Text Classification
Word Vectors
9. Data Visualization Tools
BI (Business Intelligence):
Tableau
Power BI
Qlik View
Qlik Sense
10. Deployment
Microsoft Azure
Heroku
Google Cloud Platform
Flask
Django
Join @datasciencefun to learn important data science and machine learning concepts
ENJOY LEARNING 👍👍
1. What are the conditions for Overfitting and Underfitting?
Ans:
• In overfitting, the model performs well on the training data but fails to generalize to any new data. In underfitting, the model is too simple to capture the underlying relationship. The bias and variance conditions are as follows.
• Overfitting – low bias and high variance result in an overfitted model. Decision trees are more prone to overfitting.
• Underfitting – high bias and low variance. Such a model doesn’t perform well on the test data either. For example, linear regression is more prone to underfitting.
2. Which models are more prone to Overfitting?
Ans: Complex models, like random forests, neural networks, and XGBoost, are more prone to overfitting. Simpler models, like linear regression, can overfit too – this typically happens when there are more features than instances in the training data.
3. When should feature scaling be done?
Ans: We need to perform feature scaling when we are dealing with gradient-descent-based algorithms (linear and logistic regression, neural networks) and distance-based algorithms (KNN, K-means, SVM), as these are very sensitive to the range of the data points.
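To make this concrete, here is a minimal sketch of both common scaling options using scikit-learn; the toy numbers are made up, and in practice the scaler should be fit on the training split only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (made-up data)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: zero mean, unit variance (a good default for gradient-based models)
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: squeeze every feature into [0, 1] (often used for distance-based models)
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```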
4. What is a logistic function? What is the range of values of a logistic function?
Ans. f(z) = 1 / (1 + e^(-z))
The values of a logistic function will range from 0 to 1. The values of z will vary from -infinity to +infinity.
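As a quick sanity check, here is the logistic function written out in NumPy; the sample z values are arbitrary.

```python
import numpy as np

def logistic(z):
    # f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 5)      # arbitrary sample inputs
print(logistic(z))               # all outputs lie strictly between 0 and 1; f(0) = 0.5
```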
5. What are the drawbacks of a linear model?
Ans. There are a few drawbacks of a linear model:
A linear model makes some strong assumptions that may not hold in practice: a linear relationship, multivariate normality, little or no multicollinearity, no auto-correlation, and homoscedasticity.
A plain linear model can’t be used directly for discrete or binary outcomes.
You can’t vary the model flexibility of a linear model.
Roadmap for becoming a data analyst
https://news.1rj.ru/str/sqlspecialist/379
Learn Fundamentals:
Gain a strong foundation in mathematics and statistics.
Develop proficiency in a programming language like Python or R.
Familiarize yourself with data manipulation and analysis libraries such as pandas and NumPy.
Understand Data Analysis Concepts:
Learn about exploratory data analysis (EDA) techniques.
Study data visualization principles using tools like Matplotlib or Tableau.
Become familiar with statistical concepts and hypothesis testing.
Master SQL:
Learn Structured Query Language (SQL) for data querying and manipulation.
Understand database management systems and relational database concepts.
Gain Domain Knowledge:
Specialize in a specific industry or domain to understand its data requirements.
Learn about the relevant metrics and key performance indicators (KPIs) in that domain.
Develop Data Cleaning and Preprocessing Skills:
Learn techniques to handle missing data, outliers, and data inconsistencies.
Gain experience in data preprocessing tasks such as data transformation and feature engineering.
Learn Data Analysis Techniques:
Study various statistical analysis methods and models.
Explore predictive modeling techniques, such as regression and classification algorithms.
Understand time series analysis and forecasting.
Master Data Visualization:
Learn advanced data visualization techniques to effectively communicate insights.
Utilize tools like Tableau, Power BI, or Matplotlib for creating impactful visualizations.
Acquire Business Intelligence Skills:
Understand the basics of business intelligence tools and dashboards.
Learn to create interactive dashboards for data reporting and analysis.
Gain Practical Experience:
Apply your skills through internships, projects, or Kaggle competitions.
Work on real-world datasets to gain hands-on experience in data analysis.
Continuously Learn and Stay Updated:
Keep up with the latest trends and advancements in data analysis and analytics tools.
Participate in online courses, workshops, and webinars to enhance your skills.
Remember, the roadmap may vary depending on individual preferences and career goals. It is important to adapt and continuously learn as the field of data analysis evolves.
Q1. You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
Ans.
If the data set is large, we can simply remove the rows with missing values. It is the quickest way, and we still have enough of the remaining data to build the model.
For smaller data sets, we can substitute missing values with the mean or mode, or use forward or backward fill. In pandas this can be done with, for example, df.fillna(df.mean()) or df.ffill().
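A small sketch of both options with pandas (the DataFrame here is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with missing values in a numeric and a categorical column
df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["Pune", "Delhi", None, "Delhi"]})

# Option 1: drop incomplete rows (fine when the data set is large)
df_dropped = df.dropna()

# Option 2: impute - mean for the numeric column, mode for the categorical one
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])

print(df_dropped)
print(df_filled)
```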
Q2. Hypothesis Testing. Null and Alternative hypothesis
Ans. Hypothesis testing is the process of choosing between hypotheses for a particular probability distribution on the basis of observed data. It is a core and important topic in statistics.
A null hypothesis is a statistical hypothesis in which no significant difference exists between the sets of variables. It is the original or default statement, with no effect, often represented by H0 (H-zero). It is always the hypothesis that is tested.
The alternative hypothesis is a statistical hypothesis used in hypothesis testing which states that there is a significant difference between the sets of variables. It is the hypothesis other than the null hypothesis, often denoted by H1 (H-one). Accepting the alternative hypothesis depends on rejecting the null hypothesis, i.e., unless the null hypothesis is rejected, the alternative hypothesis cannot be accepted.
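As an illustration, here is a minimal two-sample t-test with SciPy; the two groups are simulated purely to show the H0 / H1 decision.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=100)   # H0: both groups share the same mean
group_b = rng.normal(loc=52, scale=5, size=100)   # in truth the means differ slightly

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 in favour of H1")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```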
Q3. Why use Decision Trees?
Ans. First, a decision tree is a visual representation of a decision situation (and hence aids communication). Second, the branches of a tree explicitly show all those factors within the analysis that are considered relevant to the decision (and implicitly those that are not).
Q4. What is the difference between observational and experimental data?
Observational data comes from observational studies, in which you observe certain variables and try to determine whether there is any correlation.
Experimental data comes from experimental studies, in which you control certain variables and hold them constant to determine whether there is any causality.
An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.
Q5. Central Limit Theorem?
Ans. The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.
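A quick simulation of the theorem in NumPy; the exponential population is an arbitrary choice of a clearly non-normal distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)   # skewed population, mean ≈ 2

# Draw many samples of size 50 (with replacement) and record each sample mean
sample_means = [rng.choice(population, size=50, replace=True).mean() for _ in range(5_000)]

print(np.mean(sample_means))   # close to the population mean (≈ 2)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2 / sqrt(50) ≈ 0.28
# A histogram of sample_means would look approximately normal despite the skewed population.
```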
Q6. Overfitting and Underfitting
Ans. Overfitting is a modeling error which occurs when a function is too closely fit to a limited set of data points. Underfitting refers to a model that can neither model the training data nor generalize to new data.
Q7. How do you deal with imbalanced data in classification modelling?
Ans: Follow these techniques (a small sketch of a couple of them follows the list):
1. Use the right evaluation metrics.
2. Use K-fold cross-validation in the right way.
3. Ensemble different resampled datasets.
4. Resample with different ratios.
5. Cluster the abundant class.
6. Design your own models.
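For illustration, a minimal sketch combining two of the ideas above – class weighting instead of resampling, plus an evaluation metric that is not plain accuracy – on a synthetic imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 95% of one class, 5% of the other
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the minority class instead of resampling it
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# F1 on the minority class is far more informative than plain accuracy here
print("F1 (minority class):", f1_score(y_te, clf.predict(X_te)))
```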
1. What is the AdaBoost Algorithm?
AdaBoost, also called Adaptive Boosting, is a technique in Machine Learning used as an ensemble method. The most common base learner used with AdaBoost is a decision tree with one level, i.e., a decision tree with only one split. These trees are also called decision stumps. The algorithm builds a model and gives equal weights to all the data points. It then assigns higher weights to points that are wrongly classified, so the points with higher weights get more importance in the next model. It keeps training models until the error becomes low enough.
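A minimal sketch of this in scikit-learn on a synthetic dataset; note that the estimator= parameter name assumes a recent scikit-learn (older versions call it base_estimator=).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_depth=1 gives one-split trees (decision stumps), the classic AdaBoost base learner
stump = DecisionTreeClassifier(max_depth=1)
model = AdaBoostClassifier(estimator=stump, n_estimators=200, learning_rate=0.5, random_state=0)
model.fit(X_tr, y_tr)

print("Test accuracy:", model.score(X_te, y_te))
```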
2. What is the Sliding Window method for Time Series Forecasting?
Time series can be phrased as supervised learning. Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem.
In the sliding window method, the previous time steps can be used as input variables, and the next time steps can be used as the output variable.
In statistics and time series analysis, this is called a lag or lag method. The number of previous time steps is called the window width or size of the lag. This sliding window is the basis for how we can turn any time series dataset into a supervised learning problem.
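A small sketch of the idea with pandas; the series is made up, and the window width of 2 is arbitrary.

```python
import pandas as pd

series = pd.Series([10, 12, 13, 15, 18, 21], name="y")   # toy time series

window = 2                       # lag / window width
df = pd.DataFrame({"y": series})
for k in range(1, window + 1):
    df[f"lag_{k}"] = df["y"].shift(k)   # previous time steps become input features

df = df.dropna()                 # the first `window` rows have no complete history
X = df[[f"lag_{k}" for k in range(1, window + 1)]]   # supervised inputs
y = df["y"]                                          # supervised target
print(df)
```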
3. What do you understand by sub-queries in SQL?
A subquery is a query inside another query, where a query is defined to retrieve data or information back from the database. In a subquery, the outer query is called the main query, whereas the inner query is called the subquery. Subqueries are always executed first, and the result of the subquery is passed on to the main query. It can be nested inside a SELECT, UPDATE or any other query. A subquery can also use comparison operators such as >, < or =.
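A tiny, self-contained example run through Python's built-in sqlite3 module (the table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary REAL);
    INSERT INTO employees VALUES ('A', 50000), ('B', 70000), ('C', 90000);
""")

# The inner query (the subquery) runs first and its result feeds the outer (main) query
rows = conn.execute("""
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
""").fetchall()

print(rows)   # employees earning more than the average salary
```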
4. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and Workbook?
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A workbook contains sheets, which can be a worksheet, dashboard, or a story.
A worksheet contains a single view along with shelves, legends, and the Data pane.
A dashboard is a collection of views from multiple worksheets.
A story contains a sequence of worksheets or dashboards that work together to convey information.
5. How is a Random Forest related to Decision Trees?
Random forest is an ensemble learning method that works by constructing a multitude of decision trees. A random forest can be constructed for both classification and regression tasks.
A random forest usually outperforms a single decision tree, and it is much less prone to overfitting the data than a decision tree is.
A decision tree trained on a specific dataset will become very deep and cause overfitting. To create a random forest, decision trees can be trained on different subsets of the training dataset, and then the different decision trees can be averaged with the goal of decreasing the variance.
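To see this in action, a minimal comparison on synthetic data (exact numbers will vary from run to run):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)             # one deep tree
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("Tree   train/test accuracy:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("Forest train/test accuracy:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
# Typically the single tree fits the training set perfectly but generalizes worse than the forest.
```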
6. What are some disadvantages of using Naive Bayes Algorithm?
Some disadvantages of using Naive Bayes Algorithm are:
It relies on a very big assumption that the independent variables are not related to each other.
It is generally not suitable for datasets with large numbers of numerical attributes.
If a category or rare case appears in the testing dataset but not in the training dataset, the model assigns it zero probability and will most likely get it wrong (the zero-frequency problem).
Andrew Ng's new course on ChatGPT Prompt Engineering for Developers, created together with OpenAI, is available now for free!
👇👇
https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/
Some useful PYTHON libraries for data science
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transforms, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat maps. You can use the pylab feature in IPython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, pylab converts the IPython environment to one very similar to MATLAB. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python’s usage in the data science community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of Numpy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home url and then dig through web-pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code with. You will find subtle differences from urllib2, but for beginners, Requests might be more convenient.
Additional libraries you might need:
os for Operating system and file operations
networkx and igraph for graph based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for web scraping. It is more limited than Scrapy, as it extracts information from just a single webpage in a run.
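To tie a few of these together, here is a tiny end-to-end sketch using NumPy, pandas, Matplotlib and scikit-learn; the data is randomly generated purely for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3 * x + rng.normal(scale=2, size=200)             # noisy linear relationship

df = pd.DataFrame({"x": x, "y": y})                    # pandas for tabular handling
model = LinearRegression().fit(df[["x"]], df["y"])     # scikit-learn for modelling
print("estimated slope:", model.coef_[0])              # should be close to 3

df.plot.scatter(x="x", y="y", alpha=0.4)               # plotting via pandas/Matplotlib
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.show()
```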