📚 Data Science Riddle
Your Spark job fails due to executor memory pressure. Most effective optimization?
Anonymous Quiz
14%
Broadcast variables
29%
Larger cluster
41%
More shuffle partitions
16%
Persist fewer objects
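A rough back-of-the-envelope sketch of why the partition count matters for executor memory. The shuffle size and partition counts are made-up numbers, and the even-distribution assumption ignores skew, so treat it as an estimate only. In Spark SQL the count is set through `spark.sql.shuffle.partitions` (default 200).

```python
def avg_partition_mb(shuffle_gb: float, num_partitions: int) -> float:
    """Average shuffle partition size in MB, assuming evenly spread data."""
    return shuffle_gb * 1024 / num_partitions


shuffle_gb = 400  # hypothetical total shuffle size, not from the quiz

for partitions in (200, 2000):
    size = avg_partition_mb(shuffle_gb, partitions)
    print(f"{partitions} partitions -> about {size:.1f} MB per task")
```

Smaller partitions mean each task holds less data in memory at once, at the cost of more tasks to schedule.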
BigDataAnalytics-Lecture.pdf
10.2 MB
Notes on HDFS, MapReduce, YARN, Hadoop vs. traditional systems and much more... from Columbia University.
❤7
📚 Data Science Riddle
You fit a forecasting model and residuals show increasing variance. What is needed?
Anonymous Quiz
20%
Differencing
46%
Smoothing
27%
Decomposition
7%
Box-Cox
👍3❤1
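Box-Cox, one of the options above, is a family of power transforms commonly used to stabilize variance. A minimal sketch of the transform itself, on a made-up series whose swings grow with its level. The data and the choice of lambda = 0 (the log case) are assumptions for illustration, not a fitted result.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)  # limiting case as lambda approaches 0
    return (y ** lam - 1) / lam


def spread(values):
    """Range (max minus min) as a crude measure of variability."""
    return max(values) - min(values)


# Hypothetical series: three levels, swings proportional to the level
series = [10, 12, 9, 100, 120, 90, 1000, 1200, 900]
blocks = [series[i:i + 3] for i in range(0, len(series), 3)]

raw_spreads = [spread(b) for b in blocks]
log_spreads = [spread([box_cox(v, 0) for v in b]) for b in blocks]

print("raw spreads:", raw_spreads)
print("transformed spreads:", [round(s, 3) for s in log_spreads])
```

On this toy series the raw spread grows tenfold per level, while the transformed spread stays the same across levels.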
📚 Data Science Riddle
A numeric feature has many repeated exact values with occasional jumps. What type of variable is this?
Anonymous Quiz
28%
Discrete
22%
Ordinal
17%
Continuous
33%
Interval
❤4
Machine Learning Notes.pdf
226.8 KB
Stanford CS lecture notes covering supervised and unsupervised algorithms, neural networks, and SVMs, with math proofs and Python pseudocode.
❤7
📚 Data Science Riddle
Two team members run the same notebook but get different results. What's the culprit?
Anonymous Quiz
6%
Loss Curves
12%
Batch shapes
61%
Random seeds
22%
Metric choice
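A small sketch of how an unfixed seed can make the same code give different results on different machines. The `sample_rows` helper is hypothetical and uses only the standard library; real notebooks may also need seeds set for NumPy or their ML framework.

```python
import random


def sample_rows(data, k, seed=None):
    """Pick k rows at random; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)  # seed=None draws from system entropy
    return rng.sample(data, k)


rows = list(range(100))

# Same seed: both "team members" get the same sample
alice = sample_rows(rows, 5, seed=42)
bob = sample_rows(rows, 5, seed=42)
print(alice == bob)

# No seed: two runs will almost certainly differ
print(sample_rows(rows, 5), sample_rows(rows, 5))
```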
📚 Data Science Riddle
A query runs slowly due to large table scans. What's the most targeted fix?
Anonymous Quiz
56%
Add indexes
17%
Use aliases
16%
Add DISTINCT
11%
Increase RAM
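A minimal sketch of the index option using Python's built-in `sqlite3` module. The `orders` table, its columns, and the index name are made up for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail text of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # detail is the last column


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = query_plan(conn, sql)  # no index yet: full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql)  # the planner can now use the index

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans is a quick way to confirm that an index is actually being used for the query you care about.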
📚 Data Science Riddle
You want to detect extreme values visually in one plot. Which one is best?
Anonymous Quiz
54%
Box plot
29%
Heatmap
9%
Line chart
7%
Area plot
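Box plots usually draw their whiskers with the 1.5 × IQR rule and show anything beyond them as individual points. A sketch of that rule in plain Python, on made-up readings. Quartile conventions differ between libraries, so the fences may not match a given plotting tool exactly.

```python
import statistics


def iqr_outliers(values):
    """Return values outside the 1.5 * IQR fences (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical data
print(iqr_outliers(readings))
```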
Mining of Massive Datasets (Leskovec, Stanford).pdf
2.9 MB
The Big Data bible from Stanford: MapReduce, Spark, recommendation systems, PageRank, locality-sensitive hashing, large-scale machine learning, and mining social networks and streams, all explained clearly with real algorithms you can code today. 500 pages of pure gold.
❤3
📚 Data Science Riddle
You want to prevent inconsistent data across environments. What helps most?
Anonymous Quiz
30%
Checkpoints
18%
Contracts
40%
Indexes
13%
Sharding
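For readers unfamiliar with the "Contracts" option: a data contract is an agreed schema that every environment checks records against. A very small sketch of the idea, where the `CONTRACT` fields and the `violations` helper are hypothetical. Production setups typically rely on a schema or validation library rather than hand-rolled checks.

```python
# Hypothetical contract: field name -> required Python type
CONTRACT = {"user_id": int, "email": str, "signup_ts": float}


def violations(record: dict, contract: dict = CONTRACT) -> list:
    """List every way a record breaks the contract (empty list = valid)."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


good = {"user_id": 1, "email": "a@example.com", "signup_ts": 1700000000.0}
bad = {"user_id": "1", "email": "a@example.com"}

print(violations(good))
print(violations(bad))
```

Running the same check in dev, staging, and production surfaces schema drift before it reaches downstream consumers.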
🛠️ Running Code in Jupyter Notebooks
Jupyter Notebooks let you write & run code interactively.
Here’s a quick guide to make your workflow smoother:
▶️ Kernel & Code Cells
- Each notebook is tied to a single kernel (e.g. IPython).
- Code cells are where you write and execute code.
⌨️ Useful Shortcuts
- Shift + Enter → run current cell, move to next
- Alt + Enter → run current cell, insert new one below
- Ctrl + Enter → run current cell, stay in place
🔄 Kernel Management
- Interrupt the kernel if code hangs.
- Restart kernel to reset memory & variables.
🖥️ Output Handling
- Results & errors appear directly under the cell.
- Long-running code outputs appear as they’re generated.
- Large outputs can be scrolled or collapsed for clarity.
💡 Pro Tip:
Always “Restart & Run All” before sharing or saving a notebook.
This ensures reproducibility and clean results.
👉 Explore
❤2
📚 Data Science Riddle
You need fast reads of small files. Which storage option fits best?
Anonymous Quiz
21%
Distributed FS
10%
Cold storage
21%
Object Storage
48%
Local SSD
❤4
📚 Data Science Riddle
A feature has low importance but domain experts insist it matters. What do you do?
Anonymous Quiz
25%
Encode it differently
21%
Scale it
11%
Drop the feature
43%
Check interaction effects
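A toy sketch of how a feature can look useless on its own yet matter through an interaction. The data is synthetic: the target is simply the product of two ±1 features, an assumption chosen to make the effect obvious.

```python
from itertools import product


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# All four combinations of two features taking values -1 or +1
rows = list(product([-1, 1], repeat=2))
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a * b for a, b in rows]  # target depends only on the interaction

interaction = [a * b for a, b in zip(x1, x2)]

print("corr(x1, y):     ", pearson(x1, y))
print("corr(x1 * x2, y):", pearson(interaction, y))
```

Here `x1` alone has zero correlation with the target, while the product term explains it completely, which is the kind of pattern single-feature importance scores can miss.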
Advanced Data Science on Spark.pdf
1.8 MB
From Stanford University: covers Spark for ML, graph processing (GraphFrames), and integration with Hadoop.
❤4