🔹 Title: Exploring Conditions for Diffusion models in Robotic Control
🔹 Publication Date: Published on Oct 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.15510
• PDF: https://arxiv.org/pdf/2510.15510
• Project Page: https://orca-rc.github.io/
• Github: https://orca-rc.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
🔹 Title: ChartAB: A Benchmark for Chart Grounding & Dense Alignment
🔹 Publication Date: Published on Oct 30
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.26781
• PDF: https://arxiv.org/pdf/2510.26781
• Project Page: https://huggingface.co/datasets/umd-zhou-lab/ChartAlignBench
• Github: https://github.com/tianyi-lab/ChartAlignBench
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
🔹 Publication Date: Published on Oct 29
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.25897
• PDF: https://arxiv.org/pdf/2510.25897
• Project Page: https://nicolas-dufour.github.io/miro/
• Github: https://nicolas-dufour.github.io/miro/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
🔹 Publication Date: Published on Oct 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.19949
• PDF: https://arxiv.org/pdf/2510.19949
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs
🔹 Publication Date: Published on Oct 29
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.25364
• PDF: https://arxiv.org/pdf/2510.25364
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/colinglab/CLASS_IT
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: The End of Manual Decoding: Towards Truly End-to-End Language Models
🔹 Publication Date: Published on Oct 30
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.26697
• PDF: https://arxiv.org/pdf/2510.26697
• Github: https://github.com/Zacks917/AutoDeco
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs
🔹 Publication Date: Published on Oct 29
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.25867
• PDF: https://arxiv.org/pdf/2510.25867
• Project Page: https://ucsc-vlaa.github.io/MedVLSynther/
• Github: https://ucsc-vlaa.github.io/MedVLSynther/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
🔹 Publication Date: Published on Oct 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.22282
• PDF: https://arxiv.org/pdf/2510.22282
• Github: https://github.com/tsinghua-fib-lab/CityRiSE
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: PORTool: Tool-Use LLM Training with Rewarded Tree
🔹 Publication Date: Published on Oct 29
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.26020
• PDF: https://arxiv.org/pdf/2510.26020
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: L^2M^3OF: A Large Language Multimodal Model for Metal-Organic Frameworks
🔹 Publication Date: Published on Oct 23
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.20976
• PDF: https://arxiv.org/pdf/2510.20976
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: Performance Trade-offs of Optimizing Small Language Models for E-Commerce
🔹 Publication Date: Published on Oct 24
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.21970
• PDF: https://arxiv.org/pdf/2510.21970
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title: POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
🔹 Publication Date: Published on Oct 28
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.24992
• PDF: https://arxiv.org/pdf/2510.24992
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
Top 100 Data Analyst Interview Questions & Answers
#DataAnalysis #InterviewQuestions #SQL #Python #Statistics #CaseStudy #DataScience
Part 1: SQL Questions (Q1-30)
#1. What is the difference between DELETE, TRUNCATE, and DROP?
A:
• DELETE is a DML command that removes rows from a table, optionally filtered by a WHERE clause. It is slower because it logs each row deletion, and it can be rolled back.
• TRUNCATE is a DDL command that quickly removes all rows from a table. It is faster than DELETE, and in some databases (e.g., MySQL, SQL Server) it resets the identity counter. Whether it can be rolled back depends on the database: PostgreSQL and SQL Server allow it inside a transaction, while MySQL and Oracle commit implicitly.
• DROP is a DDL command that removes the entire table, including its structure, data, and indexes.
#2. Select all unique departments from the employees table.
A: Use the DISTINCT keyword.
SELECT DISTINCT department
FROM employees;
#3. Find the top 5 highest-paid employees.
A: Use ORDER BY and LIMIT.
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;
#4. What is the difference between WHERE and HAVING?
A:
• WHERE is used to filter records before any groupings are made (i.e., it operates on individual rows).
• HAVING is used to filter groups after aggregations (GROUP BY) have been performed.
-- Find departments with more than 10 employees
SELECT department, COUNT(employee_id)
FROM employees
GROUP BY department
HAVING COUNT(employee_id) > 10;
#5. What are the different types of SQL joins?
A:
• (INNER) JOIN: Returns records that have matching values in both tables.
• LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table.
• RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table.
• FULL (OUTER) JOIN: Returns all records from both tables, pairing rows where a match exists and filling the unmatched side with NULLs.
• SELF JOIN: A regular join, but the table is joined with itself.
#6. Write a query to find the second-highest salary.
A: Use OFFSET or a subquery.
-- Method 1: Using OFFSET (DISTINCT keeps a tie for the top salary from being returned)
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
-- Method 2: Using a Subquery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
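Both methods can be checked against a small table where two people tie for the top salary. The following is a minimal sketch using Python's built-in sqlite3 module; the table contents and function names are made up for illustration. The OFFSET variant here selects DISTINCT salaries, since without DISTINCT a tie at the top would return the top salary again.

```python
import sqlite3

# Hypothetical sample data: two employees tie for the top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 900), ("Bob", 900), ("Cy", 700), ("Di", 500)],
)


def second_highest_offset(conn):
    # DISTINCT collapses the tie before OFFSET skips the top value.
    row = conn.execute(
        "SELECT DISTINCT salary FROM employees "
        "ORDER BY salary DESC LIMIT 1 OFFSET 1"
    ).fetchone()
    return row[0] if row else None


def second_highest_subquery(conn):
    # MAX() over the salaries strictly below the overall maximum.
    row = conn.execute(
        "SELECT MAX(salary) FROM employees "
        "WHERE salary < (SELECT MAX(salary) FROM employees)"
    ).fetchone()
    return row[0]


print(second_highest_offset(conn), second_highest_subquery(conn))
```

With this data both functions return 700, the second-highest distinct salary.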
#7. Find duplicate emails in a customers table.
A: Group by the email column and use HAVING to find groups with a count greater than 1.
SELECT email, COUNT(email)
FROM customers
GROUP BY email
HAVING COUNT(email) > 1;
#8. What is a primary key vs. a foreign key?
A:
• A Primary Key is a constraint that uniquely identifies each record in a table. It must contain unique values and cannot contain NULL values.
• A Foreign Key is a key used to link two tables together. It is a field (or collection of fields) in one table that refers to the Primary Key in another table.
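A short sketch of how these two constraints behave in practice, using Python's built-in sqlite3 module. The table layout and the `try_insert` helper are assumptions for illustration; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE employees ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " department_id INTEGER NOT NULL REFERENCES departments(id))"
)
conn.execute("INSERT INTO departments VALUES (1, 'Sales')")
conn.execute("INSERT INTO employees VALUES (1, 'Ann', 1)")  # valid reference


def try_insert(sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Reusing primary key 1 violates uniqueness.
duplicate_pk_ok = try_insert("INSERT INTO departments VALUES (1, 'Marketing')")
# Department 99 does not exist, so the foreign key check fails.
dangling_fk_ok = try_insert("INSERT INTO employees VALUES (2, 'Bob', 99)")
print(duplicate_pk_ok, dangling_fk_ok)
```

Both inserts are rejected: the first by the primary key, the second by the foreign key.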
#9. Explain Window Functions. Give an example.
A: Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, they do not collapse rows.
-- Rank employees by salary within each department
SELECT
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;
#10. What is a CTE (Common Table Expression)?
A: A CTE is a temporary, named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. It helps improve readability and break down complex queries.
WITH DepartmentSales AS (
    SELECT department, SUM(sale_amount) as total_sales
    FROM sales
    GROUP BY department
)
SELECT department, total_sales
FROM DepartmentSales
WHERE total_sales > 100000;
---
#11. Difference between UNION and UNION ALL?
A:
• UNION combines the result sets of two or more SELECT statements and removes duplicate rows.
• UNION ALL also combines result sets but includes all rows, including duplicates. It is faster because it doesn't check for duplicates.
#12. How would you find the total number of employees in each department?
A: Use COUNT() with GROUP BY.
SELECT department, COUNT(employee_id) as number_of_employees
FROM employees
GROUP BY department;
#13. What is the difference between RANK() and DENSE_RANK()?
A:
• RANK() assigns a rank to each row within a partition. If there are ties, it skips the next rank(s). (e.g., 1, 2, 2, 4)
• DENSE_RANK() also assigns ranks, but it does not skip any ranks in case of ties. (e.g., 1, 2, 2, 3)
#14. Write a query to get the Nth highest salary.
A: Use DENSE_RANK() in a CTE.
WITH SalaryRanks AS (
    SELECT
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
    FROM employees
)
SELECT DISTINCT salary
FROM SalaryRanks
WHERE rnk = 5; -- For the 5th highest salary (DISTINCT collapses ties)
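The difference between the two ranking functions is easiest to see on a table with a tie. This is a small sketch using Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 900), ("Bob", 800), ("Cy", 800), ("Di", 700)],  # Bob and Cy tie
)

rows = conn.execute(
    "SELECT name, salary,"
    " RANK() OVER (ORDER BY salary DESC) AS rnk,"
    " DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk"
    " FROM employees ORDER BY salary DESC, name"
).fetchall()

ranks = [r[2] for r in rows]        # RANK leaves a gap after the tie
dense_ranks = [r[3] for r in rows]  # DENSE_RANK does not
print(ranks, dense_ranks)
```

Here `ranks` is `[1, 2, 2, 4]` and `dense_ranks` is `[1, 2, 2, 3]`, which is why DENSE_RANK() is the safer choice for "Nth highest" questions.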
#15. What is COALESCE() used for?
A: The COALESCE() function returns the first non-NULL value in a list of expressions. It's useful for providing default values for nulls.
SELECT name, COALESCE(commission, 0) as commission
FROM employees; -- Replaces NULL commissions with 0
---
#16. How would you select all employees whose name starts with 'A'?
A: Use the LIKE operator with a wildcard (%).
SELECT name
FROM employees
WHERE name LIKE 'A%';
#17. Get the current date and time.
A: The function depends on the SQL dialect.
• PostgreSQL/MySQL: NOW()
• SQL Server: GETDATE()
SELECT NOW();
#18. How can you extract the month from a date?
A: Use the EXTRACT function or MONTH().
-- Standard SQL (the DATE keyword types the literal)
SELECT EXTRACT(MONTH FROM DATE '2023-10-27');
-- MySQL
SELECT MONTH('2023-10-27');
#19. What is a subquery? What are the types?
A: A subquery is a query nested inside another query.
• Scalar Subquery: Returns a single value (one row, one column).
• Multi-row Subquery: Returns multiple rows.
• Correlated Subquery: An inner query that depends on the outer query for its values. It is evaluated once for each row processed by the outer query.
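A scalar subquery and a correlated subquery can be compared side by side. The sketch below uses Python's built-in sqlite3 module with an invented employees table; the column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "Sales", 900), ("Bob", "Sales", 500),
     ("Cy", "IT", 700), ("Di", "IT", 650)],
)

# Scalar subquery: returns a single value and does not depend on any outer row.
company_avg = conn.execute("SELECT AVG(salary) FROM employees").fetchone()[0]

# Correlated subquery: the inner query refers to e.department from the outer
# row, so it is logically evaluated once per employee.
above_dept_avg = [
    row[0]
    for row in conn.execute(
        "SELECT e.name FROM employees e "
        "WHERE e.salary > (SELECT AVG(salary) FROM employees "
        "                  WHERE department = e.department) "
        "ORDER BY e.name"
    )
]
print(company_avg, above_dept_avg)
```

Ann and Cy are returned because each earns more than the average of their own department, not the company-wide average of 687.5.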
#20. Write a query to find all employees who work in the 'Sales' department.
A: Use a JOIN or a subquery.
-- Using JOIN (preferred)
SELECT e.name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE d.name = 'Sales';
---
#21. How would you calculate the month-over-month growth rate of sales?
A: Use the LAG() window function to get the previous month's sales and then apply the growth formula.
-- PostgreSQL syntax; NULLIF avoids division by zero when the previous month had no sales
WITH MonthlySales AS (
    SELECT
        DATE_TRUNC('month', order_date)::DATE as sales_month,
        SUM(sale_amount) as total_sales
    FROM sales
    GROUP BY 1
)
SELECT
    sales_month,
    total_sales,
    (total_sales - LAG(total_sales, 1) OVER (ORDER BY sales_month)) * 100.0 / NULLIF(LAG(total_sales, 1) OVER (ORDER BY sales_month), 0) as growth_rate
FROM MonthlySales;
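The same idea can be tried locally without a PostgreSQL server. The following sketch adapts the query to SQLite through Python's built-in sqlite3 module: `strftime('%Y-%m', ...)` stands in for `DATE_TRUNC`, the sample rows are invented, and SQLite 3.25 or newer is assumed for `LAG()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_date TEXT, sale_amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-05", 100.0), ("2024-01-20", 100.0),
     ("2024-02-10", 250.0), ("2024-03-15", 200.0)],
)

query = """
WITH MonthlySales AS (
    SELECT strftime('%Y-%m', order_date) AS sales_month,
           SUM(sale_amount) AS total_sales
    FROM sales
    GROUP BY sales_month
)
SELECT sales_month,
       total_sales,
       (total_sales - LAG(total_sales) OVER (ORDER BY sales_month)) * 100.0
           / NULLIF(LAG(total_sales) OVER (ORDER BY sales_month), 0) AS growth_rate
FROM MonthlySales
ORDER BY sales_month
"""
growth = conn.execute(query).fetchall()
for month, total, rate in growth:
    print(month, total, rate)
```

The first month has no predecessor, so its growth rate is NULL (`None` in Python); February grows by 25% and March shrinks by 20%.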
#22. What is an index in a database? Why is it useful?
A: An index is a special lookup table that the database search engine can use to speed up data retrieval. It works like an index in the back of a book. It improves the speed of SELECT queries but can slow down data modification (INSERT, UPDATE, DELETE).
#23. Difference between VARCHAR and CHAR?
A:
• CHAR is a fixed-length string data type. CHAR(10) will always store 10 characters, padding with spaces if necessary.
• VARCHAR is a variable-length string data type. VARCHAR(10) can store up to 10 characters, but only uses the storage needed for the actual string.
#24. What is a CASE statement?
A: The CASE statement goes through conditions and returns a value when the first condition is met (like an if-then-else statement).
SELECT
    name,
    salary,
    CASE
        WHEN salary > 100000 THEN 'High Earner'
        WHEN salary > 50000 THEN 'Mid Earner'
        ELSE 'Low Earner'
    END as salary_category
FROM employees;
#25. Find the cumulative sum of sales over time.
A: Use a SUM() window function.
SELECT
    order_date,
    sale_amount,
    SUM(sale_amount) OVER (ORDER BY order_date) as cumulative_sales
FROM sales;
---
#26. What does GROUP_CONCAT (MySQL) or STRING_AGG (PostgreSQL) do?
A: These functions concatenate strings from a group into a single string with a specified separator.
-- PostgreSQL example
SELECT department, STRING_AGG(name, ', ') as employee_names
FROM employees
GROUP BY department;
#27. What is data normalization? Why is it important?
A: Data normalization is the process of organizing columns and tables in a relational database to minimize data redundancy. It is important because it reduces storage space, eliminates inconsistent data, and simplifies data management.
#28. Write a query to find users who made a purchase in January but not in February.
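A: Use EXCEPT, as in the query here, or an anti-join written with LEFT JOIN or NOT IN. Filtering on the month alone mixes years together, so add a year condition when the table spans more than one year.
SELECT user_id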
FROM sales
WHERE EXTRACT(MONTH FROM order_date) = 1
EXCEPT
SELECT user_id
FROM sales
WHERE EXTRACT(MONTH FROM order_date) = 2;
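The query above uses EXCEPT; the LEFT JOIN and NOT IN alternatives mentioned in the answer can be sketched as follows. This example runs on SQLite through Python's built-in sqlite3 module, with invented rows and `strftime('%Y-%m', ...)` used to pin both the year and the month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "2024-01-03"), (1, "2024-02-11"),   # bought in both months
     (2, "2024-01-15"), (2, "2024-01-28"),   # January only
     (3, "2024-02-02")],                     # February only
)

# Anti-join: keep January rows that find no February partner.
left_join_sql = """
SELECT DISTINCT jan.user_id
FROM sales AS jan
LEFT JOIN sales AS feb
       ON feb.user_id = jan.user_id
      AND strftime('%Y-%m', feb.order_date) = '2024-02'
WHERE strftime('%Y-%m', jan.order_date) = '2024-01'
  AND feb.user_id IS NULL
"""

# NOT IN: exclude anyone who appears in the February set.
not_in_sql = """
SELECT DISTINCT user_id
FROM sales
WHERE strftime('%Y-%m', order_date) = '2024-01'
  AND user_id NOT IN (
      SELECT user_id FROM sales
      WHERE strftime('%Y-%m', order_date) = '2024-02')
"""

jan_only_join = [row[0] for row in conn.execute(left_join_sql)]
jan_only_not_in = [row[0] for row in conn.execute(not_in_sql)]
print(jan_only_join, jan_only_not_in)
```

Both queries return only user 2. One caveat with NOT IN: if the subquery can yield NULL user IDs, the comparison returns no rows, which is a common reason to prefer the anti-join or NOT EXISTS.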
#29. What is a self-join?
A: A self-join is a join in which a table is joined to itself. This is useful for querying hierarchical data or comparing rows within the same table.
-- Find employees who have the same manager
SELECT e1.name as employee1, e2.name as employee2, e1.manager_id
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.manager_id AND e1.id < e2.id; -- '<' lists each pair once
#30. What is the execution order of a SQL query?
A: The logical processing order is generally:
• FROM / JOIN
• WHERE
• GROUP BY
• HAVING
• SELECT
• DISTINCT
• ORDER BY
• LIMIT / OFFSET
---
Part 2: Python (Pandas/NumPy) Questions (Q31-50)
#31. How do you select a column named 'age' from a pandas DataFrame df?
A: There are two common ways.
# Method 1 (preferred, handles column names with spaces)
age_column = df['age']
# Method 2 (dot notation)
age_column = df.age
#32. How do you filter a DataFrame df to get rows where 'age' is greater than 30?
A: Use boolean indexing.
filtered_df = df[df['age'] > 30]
#33. What's the difference between .loc and .iloc?
A:
• .loc is a label-based indexer. You use row and column names to select data.
• .iloc is an integer-position-based indexer. You use integer indices (like in Python lists) to select data.
# .loc example (select row with index 'a')
df.loc['a']
# .iloc example (select first row)
df.iloc[0]
#34. How do you handle missing values in a DataFrame?
A: Several methods:
• df.isnull().sum(): To count missing values per column.
• df.dropna(): To remove rows/columns with missing values.
• df.fillna(value): To fill missing values with a specific value (e.g., 0, mean, median).
# Fill missing age values with the mean age
mean_age = df['age'].mean()
# Assign the result back; inplace=True on a column selection triggers warnings in recent pandas
df['age'] = df['age'].fillna(mean_age)
#35. How would you create a new column 'age_group' based on the 'age' column?
A: Use pd.cut or a custom function with .apply.
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)
#36. How do you merge two DataFrames, df1 and df2, on a common column 'user_id'?
A: Use pd.merge().
merged_df = pd.merge(df1, df2, on='user_id', how='inner') # 'how' can also be 'left', 'right', 'outer'
#37. How can you group a DataFrame by 'department' and calculate the average 'salary'?
A: Use .groupby() and .agg() or a direct aggregation function.
avg_salary_by_dept = df.groupby('department')['salary'].mean()
#38. What is the purpose of the .apply() method in pandas?
A: .apply() lets you apply a function along an axis of a DataFrame, or to each element of a Series. It is used for complex, custom operations that are not covered by built-in pandas functions.
# Create a new column by applying a custom function to the 'salary' column
def categorize_salary(salary):
    if salary > 100000:
        return 'High'
    return 'Low'
df['salary_category'] = df['salary'].apply(categorize_salary)
#39. How do you remove duplicate rows from a DataFrame?
A: Use the .drop_duplicates() method.
# Keep the first occurrence of each duplicate row
unique_df = df.drop_duplicates()
# Keep the last occurrence
unique_df_last = df.drop_duplicates(keep='last')
#40. Explain the difference between join() and merge() in pandas.
A:
• merge() is more versatile and is the main entry point for database-style join operations. It can join on columns or indices.
• join() is a convenience method for joining DataFrames primarily on their indices. It can also join on a column of the calling DataFrame to the index of the other.
In most cases, merge() is the more powerful and flexible choice.
---
#41. How do you convert a column's data type, e.g., 'date_string' to datetime?
A: Use pd.to_datetime().
df['date'] = pd.to_datetime(df['date_string'])
#42. What is a pivot table and how do you create one in pandas?
A: A pivot table is a data summarization tool. It reshapes data by aggregating values based on one or more grouping keys along rows and columns. Use df.pivot_table().
# Create a pivot table to see average sales by region and product
pivot = df.pivot_table(values='sales', index='region', columns='product', aggfunc='mean')
#43. How would you select rows with multiple conditions, e.g., 'age' > 30 and 'city' == 'New York'?
A: Use boolean indexing with & for AND, and | for OR. Wrap each condition in parentheses.
filtered_df = df[(df['age'] > 30) & (df['city'] == 'New York')]
#44. How can you find the number of unique values in a column?
A: Use the .nunique() method.
unique_cities_count = df['city'].nunique()
#45. What is the difference between a pandas Series and a DataFrame?
A:
• A Series is a one-dimensional labeled array, capable of holding any data type. It's like a single column in a spreadsheet.
• A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It's like a whole spreadsheet or an SQL table.
#46. How do you sort a DataFrame by the 'salary' column in descending order?
A: Use .sort_values().
sorted_df = df.sort_values(by='salary', ascending=False)
#47. What is method chaining in pandas?
A: Method chaining is the practice of calling methods on a DataFrame sequentially. It improves code readability by reducing the need for intermediate variables.
# Example of method chaining
result = (df[df['age'] > 30]
          .groupby('department')
          ['salary']
          .mean()
          .sort_values(ascending=False))
#48. How do you rename the column 'user_name' to 'username'?
A: Use the .rename() method.
df.rename(columns={'user_name': 'username'}, inplace=True)
#49. How do you get the correlation matrix for all numerical columns in a DataFrame?
A: Use the .corr() method.
correlation_matrix = df.corr(numeric_only=True)
#50. When would you use NumPy over pandas?
A:
• Use NumPy for performing complex mathematical operations on numerical data, especially in machine learning, where data is often represented as arrays (matrices). It is faster for numerical computations.
• Use pandas when you need to work with tabular data, handle missing values, use labeled axes, and perform data manipulation, cleaning, and preparation tasks. Pandas is built on top of NumPy.
---
Part 3: Statistics & Probability Questions (Q51-65)
#51. What is the difference between mean, median, and mode?
A:
• Mean: The average of all data points. It is sensitive to outliers.
• Median: The middle value of a dataset when sorted. It is robust to outliers.
• Mode: The most frequently occurring value in a dataset.
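To see how an outlier affects each measure, here is a small sketch with Python's standard `statistics` module. The salary figures are invented, with one deliberately extreme value.

```python
import statistics

# Hypothetical salaries (in thousands); 500 is an outlier.
salaries = [30, 32, 32, 35, 40, 500]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle of the sorted values
mode_salary = statistics.mode(salaries)      # most frequent value

print(mean_salary, median_salary, mode_salary)
```

The mean is 111.5, far from any typical salary in the list, while the median (33.5) and mode (32) stay close to the bulk of the data.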
#52. Explain p-value.
A: The p-value is the probability of observing results at least as extreme as what was actually observed, assuming the null hypothesis is true. A small p-value (conventionally ≤ 0.05) is taken as evidence against the null hypothesis, so you reject it.
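As a worked example, consider testing whether a coin is fair after seeing 9 heads in 10 tosses. The sketch below computes an exact two-sided p-value by counting every outcome at least as far from the expected 5 heads; the function name and the scenario are illustrative assumptions, not part of the original answer.

```python
from fractions import Fraction
from math import comb


def two_sided_binomial_p(heads, tosses):
    """Exact two-sided p-value under the null hypothesis of a fair coin.

    Counts all outcomes whose distance from the expected number of heads
    is at least as large as the observed distance.
    """
    expected = Fraction(tosses, 2)
    observed_gap = abs(heads - expected)
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - expected) >= observed_gap
    )
    return Fraction(extreme, 2 ** tosses)


p_value = two_sided_binomial_p(9, 10)
print(p_value, float(p_value))
```

The outcomes 0, 1, 9, and 10 heads are at least as extreme, giving 22/1024 = 11/512, roughly 0.021. That is below 0.05, so at that threshold the fair-coin hypothesis would be rejected.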
#53. What are Type I and Type II errors?
A:
• Type I Error (False Positive): Rejecting the null hypothesis when it is actually true. (e.g., concluding a new drug is effective when it is not).
• Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false. (e.g., concluding a new drug is not effective when it actually is).
#54. What is a confidence interval?
A: A confidence interval is a range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter. For example, a 95% confidence interval means that if we were to repeat the experiment many times, 95% of the calculated intervals would contain the true population parameter.
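The "repeat the experiment many times" reading can be simulated directly. This is a rough sketch under simplifying assumptions: normally distributed data, a known population standard deviation (so a z-interval with 1.96 applies), and made-up parameter values.

```python
import random
import statistics


def ci_coverage(true_mean=50.0, sigma=10.0, n=40, repeats=1000, seed=0):
    """Fraction of simulated 95% z-intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5  # interval is sample_mean +/- half_width
    hits = 0
    for _ in range(repeats):
        sample_mean = statistics.fmean(
            rng.gauss(true_mean, sigma) for _ in range(n)
        )
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / repeats


coverage = ci_coverage()
print(coverage)
```

The printed fraction should land close to 0.95, varying slightly with the random seed.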
#55. Explain the Central Limit Theorem (CLT).
A: The Central Limit Theorem states that the sampling distribution of the mean of independent observations is approximately normal, regardless of the original population's distribution, provided the population has finite variance and the sample size is sufficiently large (n ≥ 30 is a common rule of thumb, though heavily skewed populations need more). This is fundamental to hypothesis testing.
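A quick simulation illustrates the theorem. The sketch below draws samples from a strongly skewed exponential distribution (mean 1, standard deviation 1) and looks at the distribution of their means; the sample size, repeat count, and seed are arbitrary choices.

```python
import random
import statistics


def sample_means(n=50, repeats=2000, seed=0):
    """Means of `repeats` samples of size `n` from an Exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]


means = sample_means()
center = statistics.fmean(means)   # should be near the population mean, 1.0
spread = statistics.stdev(means)   # should be near 1 / sqrt(50), about 0.14
print(round(center, 3), round(spread, 3))
```

Even though individual draws are far from normal, the sample means cluster symmetrically around 1.0 with a spread close to sigma / sqrt(n), as the theorem predicts.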
#56. What is the difference between correlation and causation?
A:
• Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates that the variables increase or decrease in parallel; a negative correlation indicates that as one variable increases, the other decreases.
• Causation indicates that one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events. Correlation does not imply causation.
#57. What is A/B testing?
A: A/B testing is a randomized experiment with two variants, A and B. It is a method of comparing two versions of a webpage, app, or feature against each other to determine which one performs better. A key metric is chosen (e.g., click-through rate), and statistical tests are used to determine if the difference in performance is statistically significant.
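One common way to check significance for a conversion-rate metric is a two-proportion z-test. The following is a minimal sketch, assuming large samples so the normal approximation holds; the function name and the conversion counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null of no difference
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)


# Hypothetical experiment: A converts 500 of 10,000 users, B converts 600 of 10,000.
z, p_value = two_proportion_z_test(500, 10_000, 600, 10_000)
print(round(z, 2), round(p_value, 4))
```

With these made-up counts the z statistic is about 3.1 and the p-value is well below 0.05, so the one-point lift for variant B would be considered statistically significant.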
#58. What are confounding variables?
A: A confounding variable is a variable outside the study design that is related to both the independent and dependent variables. Left uncontrolled, it creates a spurious association between them and can invalidate an experiment's conclusions.
#59. What is selection bias?
A: Selection bias is the bias introduced when individuals, groups, or data are selected for analysis without proper randomization, so that the sample obtained is not representative of the population intended to be analyzed.
#60. You are rolling two fair six-sided dice. What is the probability of rolling a sum of 7?
A:
• Total possible outcomes: 6 * 6 = 36.
• Favorable outcomes for a sum of 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1). There are 6 favorable outcomes.
• Probability = Favorable Outcomes / Total Outcomes = 6 / 36 = 1/6.
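The count above can be verified by brute-force enumeration. This short sketch lists all 36 outcomes with `itertools.product` and keeps the probability exact with `fractions.Fraction`.

```python
from fractions import Fraction
from itertools import product

# Every ordered outcome of two six-sided dice.
rolls = list(product(range(1, 7), repeat=2))
favorable = [roll for roll in rolls if sum(roll) == 7]

probability = Fraction(len(favorable), len(rolls))
print(len(rolls), len(favorable), probability)
```

The enumeration finds 6 favorable outcomes out of 36, which reduces to 1/6.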
---
#61. What is standard deviation and variance?
A:
• Variance measures how far a set of numbers is spread out from their average value. It is the average of the squared differences from the mean (the sample variance divides by n − 1 instead of n).
• Standard Deviation is the square root of the variance. It is expressed in the same units as the data, making it more interpretable than variance. It quantifies the amount of variation or dispersion of a set of data values.
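A small numeric example makes the relationship concrete. The sketch below uses Python's standard `statistics` module on an invented dataset with mean 5, and shows both the population versions (divide by n) and the sample variance (divide by n − 1).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5; squared deviations sum to 32

population_variance = statistics.pvariance(data)  # 32 / 8
population_std = statistics.pstdev(data)          # square root of the variance
sample_variance = statistics.variance(data)       # 32 / 7

print(population_variance, population_std, sample_variance)
```

The population variance is 4 and the standard deviation is 2, in the same units as the data; the sample variance is slightly larger at about 4.57.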
#62. Explain conditional probability.
A: Conditional probability is the probability of an event occurring, given that another event has already occurred. It is denoted as P(A|B), the probability of event A given event B. The formula is P(A|B) = P(A and B) / P(B), defined when P(B) > 0.
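The formula can be verified on a small example. The sketch below picks two illustrative events for a pair of fair dice (A: the sum is 8, B: the first die is even); the events are chosen only for demonstration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event over the 36 equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def is_a(o):          # event A: the two dice sum to 8
    return o[0] + o[1] == 8

def is_b(o):          # event B: the first die shows an even number
    return o[0] % 2 == 0

p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))
p_a_given_b = p_a_and_b / p_b   # P(A|B) = P(A and B) / P(B)

print(prob(is_a), p_a_given_b)
```

Here knowing that B occurred raises the probability of A from 5/36 to 1/6.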
#63. What is the law of large numbers?
A: The law of large numbers is a theorem stating that as the number of independent trials of a random process increases, the average of the results gets closer to the expected value.
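A simulation of repeated die rolls shows the idea. This is a sketch with an arbitrary seed and arbitrary checkpoints; the expected value of a fair six-sided die is 3.5.

```python
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
checkpoints = (10, 1_000, 100_000)
running_means = {}

total = 0
for i in range(1, max(checkpoints) + 1):
    total += rng.randint(1, 6)   # one roll of a fair die
    if i in checkpoints:
        running_means[i] = total / i
        print(f"after {i:>7} rolls: mean = {running_means[i]:.4f}")
```

The mean after a handful of rolls can be far from 3.5, while the mean after many rolls should sit very close to it.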
#64. What is a normal distribution? What are its key properties?
A: A normal distribution, also known as a Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.
• Properties: It is bell-shaped, symmetric, and defined by its mean (μ) and standard deviation (σ). About 68% of data falls within one standard deviation of the mean, 95% within two, and 99.7% within three.
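The 68-95-99.7 figures can be reproduced from the normal CDF. The sketch below uses `statistics.NormalDist` from the Python standard library (available since Python 3.8) with a standard normal distribution.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability mass within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sigma: {p:.1%}")
```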
#65. What is regression analysis? What are some types?
A: Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors' or 'features').
• Types: Linear Regression, Logistic Regression, Polynomial Regression, Ridge Regression.
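For the simplest case, linear regression with one predictor, the least-squares line has a closed form. The sketch below fits it by hand on made-up data points; the function name is illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```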
---
Part 4: Product Sense & Case Study Questions (Q66-80)