Data Science from Scratch: A Complete Data Science Learning Roadmap
A project-based curriculum covering the full data-science stack.
Each topic has its own Jupyter notebook built on a classic public dataset, with bilingual (Chinese / English) explanations, math derivations, and end-to-end code.
📐 Unified math notation: all notebooks share one consistent set of math symbols, defined in NOTATION.md. If you have used Bishop / ESL / Andrew Ng / Goodfellow before, that file ends with a translation table.
DataScience-from-scratch/
├── README.md ← this file
├── part00_foundations/ ← Python / NumPy / Pandas / Polars / visualization / math
├── part01_sql_databases/ ← SQL / NoSQL / data warehousing
├── part02_statistics/ ← probability + inferential statistics
├── part03_eda_preprocessing/ ← EDA + data cleaning + feature engineering
├── part04_supervised_regression/
├── part05_supervised_classification/
├── part06_unsupervised/
├── part07_model_evaluation/
├── part08_ensemble/
├── part09_deep_learning/
├── part10_computer_vision/
├── part11_classic_nlp/
├── part12_modern_nlp_llms/
├── part13_generative_models/ ← GAN / VAE / Diffusion / Flow
├── part14_time_series/
├── part15_recommender/
├── part16_graph_gnn/
├── part17_reinforcement_learning/
├── part18_bayesian/
├── part19_causal_inference/
├── part20_advanced_topics/ ← survival / geospatial / audio / anomaly
├── part21_big_data/ ← Spark / Dask / Polars at scale
├── part22_mlops/
├── part23_cloud_for_ds/
└── part24_ml_system_design/ ← big-tech ML system design + case interviews
Each chapter folder contains:
- README.md: chapter intro
- One XX_topic_name.ipynb per topic
- data/: datasets (or download scripts)
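The folder layout above can be checked mechanically. The following is a minimal sketch, not part of the repository: `check_chapter` is a hypothetical helper, and it assumes that "XX" in `XX_topic_name.ipynb` stands for a two-digit index followed by a lowercase snake_case name.

```python
import re
from pathlib import Path

# Assumption: "XX" is a two-digit index, e.g. 03_missing_values.ipynb.
NOTEBOOK_PATTERN = re.compile(r"^\d{2}_[a-z0-9_]+\.ipynb$")


def check_chapter(chapter_dir):
    """Return a list of human-readable problems found in one chapter folder."""
    chapter = Path(chapter_dir)
    problems = []
    if not (chapter / "README.md").is_file():
        problems.append("missing README.md")
    if not (chapter / "data").is_dir():
        problems.append("missing data/ folder")
    notebooks = sorted(p.name for p in chapter.glob("*.ipynb"))
    if not notebooks:
        problems.append("no notebooks found")
    for name in notebooks:
        if not NOTEBOOK_PATTERN.match(name):
            problems.append(f"unexpected notebook name: {name}")
    return problems
```

Calling `check_chapter("part00_foundations")` from the repository root would return an empty list when the folder follows the convention.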
Part 0 · Foundations
The data scientist's toolbox plus the math foundations.
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 0.1 | Python for Data Science | - | list comp, generator, decorator, OOP, typing, pathlib |
| 0.2 | NumPy in Depth | - | ndarray, broadcasting, vectorization, einsum, memory layout |
| 0.3 | Pandas in Depth | Titanic | Series/DataFrame, groupby, merge, pivot, MultiIndex, window |
| 0.4 | Polars Basics | NYC Taxi (subset) | lazy frames, expressions, performance vs pandas |
| 0.5 | Matplotlib & Seaborn | Iris, Tips | figure/axes, subplot grid, styling |
| 0.6 | Plotly & Interactive Viz | COVID time series | Plotly Express, Dash basics |
| 0.7 | Linear Algebra | - | vector, matrix, rank, eigen-decomp, SVD, projection |
| 0.8 | Calculus | - | derivative, gradient, chain rule, Jacobian, Hessian, Taylor |
| 0.9 | Probability | - | distributions, conditional, Bayes, joint/marginal, expectation |
| 0.10 | Numerical Optimization | - | GD, Newton's, BFGS, convexity, KKT, Lagrangian |
| 0.11 | Information Theory | - | entropy, cross-entropy, KL, mutual information |
| 0.12 | Tooling: Git / venv / Jupyter / VSCode | - | reproducibility, environments, notebook hygiene |
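As a small taste of the information-theory topic (0.11), here is a minimal sketch of entropy, cross-entropy, and KL divergence for discrete distributions, measured in bits. The function names and the example distributions are illustrative assumptions, and the code assumes `q` is nonzero wherever `p` is.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p); non-negative, zero only when p == q."""
    return cross_entropy(p, q) - entropy(p)


# Hypothetical distributions: a fair coin p and a biased model q.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(f"H(p) = {entropy(p):.3f} bits, KL(p||q) = {kl_divergence(p, q):.3f} bits")
```

The identity used in `kl_divergence` is the reason minimizing cross-entropy against fixed labels is equivalent to minimizing KL divergence.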
Part 1 · SQL & Databases ⭐⭐⭐ (interview core)
Almost every DS / DA interview starts with SQL, so it is treated as a first-class section.
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 1.1 | SQL Basics | Chinook | SELECT, WHERE, ORDER BY, LIMIT |
| 1.2 | Multi-table JOINs | Chinook | INNER / LEFT / RIGHT / FULL / CROSS / SELF |
| 1.3 | Aggregation & GROUP BY | Sakila | aggregate funcs, HAVING, ROLLUP, CUBE |
| 1.4 | Subqueries & CTEs | Sakila | scalar, correlated, WITH RECURSIVE |
| 1.5 | Window Functions | Sales | ROW_NUMBER, RANK, LAG/LEAD, running totals |
| 1.6 | Advanced Patterns | LeetCode-style | nth-highest, pivot, sessionization, funnel |
| 1.7 | Indexes & EXPLAIN | Custom | B-tree, hash, EXPLAIN ANALYZE |
| 1.8 | Python + SQL Integration | any | SQLAlchemy, pandas.read_sql, duckdb |
| 1.9 | NoSQL Overview | Movies JSON | MongoDB (document), Redis (KV), Cassandra (wide-column) |
| 1.10 | Data Warehouse | TPC-H | star vs snowflake schema, fact/dim, SCD |
| 1.11 | dbt Basics | toy warehouse | models, refs, tests, lineage |
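To illustrate the window-function topic (1.5), the sketch below runs ROW_NUMBER, LAG, and a running total through Python's built-in `sqlite3` module. The `sales` table and its rows are made up for illustration and are not the dataset used in the notebook; the query also assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical toy table, invented for this example.
rows = [
    ("north", "2024-01", 100),
    ("north", "2024-02", 150),
    ("north", "2024-03", 120),
    ("south", "2024-01", 80),
    ("south", "2024-02", 60),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Each window restarts per region and walks the months in order.
query = """
SELECT region,
       month,
       amount,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY month) AS month_idx,
       LAG(amount)  OVER (PARTITION BY region ORDER BY month) AS prev_amount,
       SUM(amount)  OVER (PARTITION BY region ORDER BY month) AS running_total
FROM sales
ORDER BY region, month
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

The first month of each region has no previous row, so `prev_amount` comes back as `None` there.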
Part 2 · Statistics & Probability
Round two of DS interviews: statistics fundamentals.
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 2.1 | Descriptive Stats | Tips | mean/median/mode, variance, skewness, kurtosis |
| 2.2 | Distributions | Synthetic | Bernoulli, Binomial, Poisson, Normal, Exp, Gamma, Beta |
| 2.3 | LLN & CLT | Simulated | convergence, sampling distribution |
| 2.4 | Sampling | Census | SRS, stratified, cluster, reservoir sampling |
| 2.5 | Confidence Intervals | Heights | t-CI, bootstrap CI, coverage |
| 2.6 | Hypothesis Testing | Tips | t-test, z-test, chi-square, ANOVA, Wilcoxon, KS |
| 2.7 | Multiple Testing | Microarray | Bonferroni, BH-FDR |
| 2.8 | Power & Sample Size | Synthetic | type I/II error, MDE, power curves |
| 2.9 | MLE | Coin flips, Gaussian | likelihood, log-likelihood, Fisher info |
| 2.10 | Bayesian Estimation | Beta-Binomial | prior, posterior, conjugate, MAP |
| 2.11 | Bootstrap & Jackknife | Boston-like | resampling, percentile, BCa |
| 2.12 | Monte Carlo | - | importance sampling, MCMC primer |
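The bootstrap CI that appears in topics 2.5 and 2.11 can be sketched with the standard library alone. This is a minimal percentile-bootstrap example under stated assumptions: `bootstrap_ci` is a hypothetical helper, the height values are invented, and the simple percentile method is used rather than BCa.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` at level 1 - alpha."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample with replacement, recompute the statistic, then sort the estimates.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical sample (heights in cm), made up for illustration.
heights = [162.0, 171.5, 168.2, 175.0, 159.8, 180.3, 166.7, 172.1, 169.4, 177.6]
low, high = bootstrap_ci(heights)
print(f"mean = {statistics.mean(heights):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

Passing `stat=statistics.median` gives an interval for the median with no other changes, which is the main appeal of resampling over closed-form intervals.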
Part 3 · EDA & Preprocessing
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 3.1 | EDA | Titanic | univariate / bivariate / multivariate |
| 3.2 | Missing Values | Titanic | MCAR/MAR/MNAR, mean/median/KNN/MICE |
| 3.3 | Outliers | House Prices | z-score, IQR, Isolation Forest, LOF, Mahalanobis |
| 3.4 | Feature Scaling | Wine | standardization, normalization, robust, quantile |
| 3.5 | Categorical Encoding | Adult Income | label, one-hot, ordinal, target, frequency, hashing |
| 3.6 | Feature Engineering | House Prices (Ames) | interaction, polynomial, binning, datetime, geo |
| 3.7 | Text Features | SMS Spam | tf-idf, n-gram, embeddings |
| 3.8 | Image Features | Digits | HOG, SIFT, color histograms |
| 3.9 | Data Leakage | Credit | target leak, train-test contamination, pipeline fix |
| 3.10 | Splitting & CV | Iris | hold-out, K-fold, stratified, group, time-series |
| 3.11 | Imbalanced Data | Fraud | SMOTE, ADASYN, class weight, threshold tuning |
| 3.12 | Pipelines | Titanic | sklearn Pipeline, ColumnTransformer, FeatureUnion |
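The train-test contamination mentioned under Data Leakage (3.9) is easiest to see with feature scaling. The sketch below is a standard-library illustration with invented numbers: `fit_scaler` and `transform` are hypothetical stand-ins for what a fitted scaler inside a pipeline does, not the notebook's code.

```python
import statistics


def fit_scaler(values):
    """Learn standardization parameters from the given values."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Apply previously learned parameters without refitting."""
    return [(v - mean) / std for v in values]


# Hypothetical feature column, split into train and test.
train = [10.0, 12.0, 11.0, 13.0, 14.0]
test = [30.0, 28.0]  # the test split has drifted upward

# Correct: statistics come from the training split only.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Leaky: test rows influence the scaler, so test information reaches training.
leaky_mean, leaky_std = fit_scaler(train + test)
print(f"train-only mean = {mean:.2f}, leaky mean = {leaky_mean:.2f}")
```

Wrapping preprocessing and model in a single pipeline object enforces the first pattern automatically, because the transformer is refit on each training fold.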
Part 4 · Supervised Regression
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 4.1 | Linear Regression | California Housing | OLS, normal equation, GD, Gauss-Markov |
| 4.2 | Regression Diagnostics | California Housing | residual plots, heteroscedasticity, multicollinearity, VIF |
| 4.3 | Polynomial Regression | Salary–Position | basis expansion, underfit/overfit |
| 4.4 | Ridge (L2) | California Housing | L2 penalty, bias-variance |
| 4.5 | Lasso (L1) | California Housing | L1 sparsity, coordinate descent, LARS |
| 4.6 | Elastic Net | California Housing | L1+L2 hybrid |
| 4.7 | GLM | Insurance Claims | link function, exp family, Poisson, Gamma |
| 4.8 | Nonlinear Regression | Curve-fitting | Levenberg-Marquardt, scipy.optimize |
| 4.9 | SVR | Boston-like | kernel trick, ε-insensitive |
| 4.10 | KNN Regression | Diamonds | weighted average, distance metric |
| 4.11 | Decision Tree Regressor | Diamonds | CART, MSE split, pruning |
| 4.12 | Random Forest Regressor | Diamonds | bagging, OOB, importance |
| 4.13 | Gradient Boosting Regressor | House Prices | additive model, learning rate |
| 4.14 | Quantile Regression | Engel curves | pinball loss, prediction interval |
| 4.15 | Robust Regression | Outlier data | Huber, RANSAC, Theil-Sen |
| 4.16 | Isotonic Regression | Calibration | PAV algorithm |
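Topic 4.1 pairs the normal equation with gradient descent. A minimal one-feature sketch of both is shown below; the data points, the learning rate, and the step count are illustrative assumptions, and the closed form is the normal equation specialized to a single feature plus intercept.

```python
def ols_closed_form(xs, ys):
    """Slope and intercept from the normal equation, specialized to one feature."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def ols_gradient_descent(xs, ys, lr=0.05, n_steps=5000):
    """Minimize mean squared error with batch gradient descent."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(n_steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2 / n * sum(residuals)
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept


# Hypothetical toy data, roughly y = 2x + 1 with small noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
print(ols_closed_form(xs, ys))
print(ols_gradient_descent(xs, ys))
```

Both routes minimize the same convex loss, so they agree up to numerical tolerance; gradient descent only matters once the feature count makes the closed form expensive.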
Part 5 · Supervised Classification
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 5.1 | Logistic Regression | Breast Cancer | sigmoid, log-loss, decision boundary |
| 5.2 | Softmax Regression | Iris | multinomial logit, cross-entropy |
| 5.3 | KNN | Iris | distance metric, K choice, curse of dim |
| 5.4 | Naive Bayes | SMS Spam | Gaussian / Multinomial / Bernoulli NB |
| 5.5 | SVM | MNIST (subset) | hinge loss, kernels, C, gamma, SMO |
| 5.6 | Decision Tree | Titanic | Gini, entropy, max_depth |
| 5.7 | Random Forest | Titanic | bootstrapping, randomness |
| 5.8 | GBDT | Adult Income | functional gradient |
| 5.9 | XGBoost | Adult Income | second-order Taylor, regularization |
| 5.10 | LightGBM | Adult Income | leaf-wise growth, histogram, categorical |
| 5.11 | CatBoost | Adult Income | ordered boosting, native categoricals |
| 5.12 | LDA & QDA | Wine | Bayes-optimal under Gaussian |
| 5.13 | Multi-class & Multi-label | 20 Newsgroups | OvR, OvO, classifier chains |
| 5.14 | Imbalanced Classification | Credit Card Fraud | SMOTE, class_weight, threshold |
| 5.15 | Online Learning | Streaming | SGD partial_fit, perceptron, Vowpal Wabbit |
Part 6 · Unsupervised Learning
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 6.1 | K-Means | Mall Customers | Lloyd's, K-means++, elbow, silhouette |
| 6.2 | Mini-batch K-Means | Large synthetic | scalable clustering |
| 6.3 | Hierarchical Clustering | Mall Customers | linkage, dendrogram |
| 6.4 | DBSCAN | Synthetic moons | ε, minPts, density |
| 6.5 | HDBSCAN | Geo points | hierarchical density |
| 6.6 | GMM | Old Faithful | EM, soft assignment |
| 6.7 | Spectral Clustering | Synthetic graph | graph Laplacian |
| 6.8 | PCA | Iris, MNIST | eigen-decomp, explained variance |
| 6.9 | Kernel PCA | Swiss roll | kernel trick for nonlinear DR |
| 6.10 | Factor Analysis | Psych data | latent variable model |
| 6.11 | ICA | Audio mixing | source separation |
| 6.12 | t-SNE | MNIST | perplexity, KL divergence |
| 6.13 | UMAP | MNIST, single-cell | manifold learning, fast |
| 6.14 | LDA as DR | Wine | supervised dim reduction |
| 6.15 | Autoencoder | Fashion-MNIST | encoder-decoder, reconstruction |
| 6.16 | Anomaly Detection | KDD subset | One-Class SVM, Isolation Forest, LOF |
| 6.17 | Association Rules | Online Retail | Apriori, FP-Growth, support/confidence/lift |
| 6.18 | NMF | Documents | non-negative matrix factorization |
Part 7 · Model Evaluation & Tuning
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 7.1 | Regression Metrics | California Housing | MAE, MSE, RMSE, R², adj R², MAPE, sMAPE |
| 7.2 | Classification Metrics | Breast Cancer | accuracy, precision, recall, F1, ROC-AUC, PR-AUC, MCC |
| 7.3 | Bias-Variance | Synthetic | learning curves, validation curves |
| 7.4 | CV Strategies | Iris | K-fold, stratified, group, time-series, nested |
| 7.5 | Grid / Random / Bayesian Tuning | Wine | GridSearchCV, RandomizedSearchCV, Optuna (TPE), Hyperopt |
| 7.6 | Multi-objective / Pareto | Custom | trade-off frontiers |
| 7.7 | Feature Selection | Madelon | filter, wrapper, embedded, RFE |
| 7.8 | Model Interpretability | Adult Income | SHAP, LIME, permutation importance, PDP/ICE |
| 7.9 | Calibration | Credit | Platt, isotonic, reliability diagram |
| 7.10 | Fairness | COMPAS | demographic parity, equalized odds, debiasing |
| 7.11 | Robustness | Image classifier | adversarial examples, FGSM |
| 7.12 | Concept Drift | Streaming | drift detection (DDM, ADWIN) |
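Several of the classification metrics in topic 7.2 follow directly from the four confusion-matrix counts. The sketch below computes them for binary 0/1 labels; the function names and the label vectors are illustrative assumptions, and undefined ratios are mapped to 0.0 as a convention.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and MCC from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical labels: 4 positives, 6 negatives, one miss and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this example accuracy is 0.8 while MCC is lower, which is the usual pattern when the classes are not balanced.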
Part 8 · Ensemble Learning
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 8.1 | Bagging | Titanic | bootstrap aggregation |
| 8.2 | Random Forest Deep Dive | Titanic | feature randomness, OOB |
| 8.3 | AdaBoost | Titanic | exponential loss, weighted samples |
| 8.4 | Gradient Boosting Derivation | Synthetic | functional gradient descent |
| 8.5 | XGBoost / LightGBM / CatBoost Comparison | Adult Income | implementation differences, when to use |
| 8.6 | Voting & Averaging | Wine | hard / soft voting |
| 8.7 | Stacking & Blending | House Prices | meta-learner, OOF predictions |
Part 9 · Deep Learning Foundations
Full, standalone coverage with no dependence on external repositories.
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 9.1 | Perceptron & MLP | XOR, MNIST | activation, layer |
| 9.2 | NN from Scratch (NumPy) | MNIST | forward, backprop, SGD |
| 9.3 | PyTorch Basics | MNIST | tensor, autograd, nn.Module, DataLoader |
| 9.4 | TensorFlow / Keras Basics | MNIST | Sequential, Functional API |
| 9.5 | Activations | Synthetic | sigmoid, tanh, ReLU, GELU, Swish, leaky ReLU |
| 9.6 | Loss Functions | various | MSE, CE, focal, contrastive, triplet |
| 9.7 | Optimizers | MNIST | SGD, momentum, NAG, Adam, AdamW, RMSProp, Adagrad |
| 9.8 | LR Schedulers | MNIST | step, cosine, warmup, ReduceLROnPlateau, one-cycle |
| 9.9 | Initialization | MNIST | Xavier, He, orthogonal |
| 9.10 | Regularization | CIFAR-10 (subset) | dropout, BN, LN, weight decay, early stopping, mixup |
| 9.11 | Training Tricks | CIFAR | gradient clipping, accumulation, AMP/half precision |
| 9.12 | Distributed Training | CIFAR | DataParallel, DDP, ZeRO basics |
| 9.13 | Transfer Learning | Flowers | feature extraction, fine-tuning |
| 9.14 | NN Visualization | MNIST | activation, filter, Grad-CAM |
Part 10 · Computer Vision
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 10.1 | Image Processing | any | OpenCV, PIL, color spaces, filters |
| 10.2 | CNN | CIFAR-10 | conv, pooling, receptive field |
| 10.3 | Classic CNNs | CIFAR | LeNet, AlexNet, VGG, GoogLeNet |
| 10.4 | ResNet & Skip Connections | CIFAR | residual block, identity mapping |
| 10.5 | Data Augmentation | CIFAR | flip/crop, mixup, cutout, AutoAugment |
| 10.6 | Object Detection | Pascal VOC (subset) | sliding window, R-CNN family, YOLO, SSD |
| 10.7 | Semantic & Instance Segmentation | Cityscapes (subset) | FCN, U-Net, Mask R-CNN |
| 10.8 | Keypoint Detection / Pose Estimation | COCO (subset) | heatmap regression |
| 10.9 | Vision Transformer (ViT) | CIFAR | patch embedding, class token |
| 10.10 | Multimodal (CLIP) | Custom image-text | contrastive learning |
| 10.11 | Self-Supervised Learning | CIFAR | SimCLR, MoCo, MAE |
Part 11 · Classic NLP
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 11.1 | Text Preprocessing | 20 Newsgroups | tokenize, stopwords, stem/lemma, regex |
| 11.2 | BoW & TF-IDF | SMS Spam | n-gram, sparse |
| 11.3 | Word Embeddings | Text8 | Word2Vec (CBOW, Skip-gram), GloVe, FastText |
| 11.4 | Text Classification | IMDB | logistic + TF-IDF, FastText |
| 11.5 | Sentiment Analysis | Twitter | lexicon + ML |
| 11.6 | Topic Modeling | NYT | LDA, NMF |
| 11.7 | NER | CoNLL-2003 | BIO tagging, CRF, spaCy |
| 11.8 | Sequence Labeling | PoS | HMM, CRF, BiLSTM-CRF |
| 11.9 | Text Similarity | Quora pairs | edit distance, cosine, Jaccard, BM25 |
Part 12 · Modern NLP & LLMs ⭐⭐⭐
LLMs are table stakes for data scientists today, so this part covers them end to end.
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 12.1 | RNN, LSTM, GRU | IMDB | sequence modeling, vanishing gradient |
| 12.2 | Seq2Seq & Attention | Translation toy | encoder-decoder, Bahdanau, Luong |
| 12.3 | Transformer from Scratch | Toy translation | Q/K/V, multi-head, positional encoding |
| 12.4 | BERT & Encoder Models | GLUE (subset) | MLM, NSP, fine-tuning |
| 12.5 | GPT & Decoder Models | TinyStories | causal LM, autoregressive |
| 12.6 | T5 & Encoder-Decoder | Summarization | text-to-text framework |
| 12.7 | Tokenization | - | BPE, WordPiece, SentencePiece, tiktoken |
| 12.8 | Pretraining vs Fine-tuning | - | concepts + when to use |
| 12.9 | PEFT | small LLM | LoRA, QLoRA, prefix tuning |
| 12.10 | RLHF / DPO | Preference data | reward model, PPO, DPO |
| 12.11 | Prompt Engineering | - | few-shot, CoT, ReAct, self-consistency |
| 12.12 | RAG / Retrieval-Augmented Generation | Wiki | embeddings + vector DB + reranker |
| 12.13 | Vector Databases | - | FAISS, Chroma, Pinecone, HNSW |
| 12.14 | LLM Agents | - | tool use, function calling, planning |
| 12.15 | LLM Evaluation | MT-Bench style | perplexity, BLEU, ROUGE, LLM-as-judge |
| 12.16 | LLM Inference Optimization | - | KV cache, quantization, vLLM, speculative decoding |
Part 13 · Generative Models
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 13.1 | Autoencoder Recap | MNIST | reconstruction |
| 13.2 | VAE | MNIST | ELBO, reparameterization |
| 13.3 | GAN Basics | MNIST | minimax, mode collapse |
| 13.4 | DCGAN / WGAN | CelebA (subset) | conv GAN, Wasserstein loss |
| 13.5 | Conditional GAN (cGAN), pix2pix | edges→shoes | conditional generation |
| 13.6 | Normalizing Flows | Toy 2D | invertible NN, RealNVP |
| 13.7 | Diffusion | MNIST | forward/reverse process, DDPM |
| 13.8 | Stable Diffusion Internals | - | latent diffusion, U-Net, CLIP cond |
| 13.9 | Generative Eval | - | FID, IS, perceptual metrics |
Part 14 · Time Series
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 14.1 | TS Basics | Air Passengers | trend, seasonality, stationarity, ACF/PACF |
| 14.2 | Decomposition | Air Passengers | additive vs multiplicative, STL |
| 14.3 | Smoothing | Sales | MA, EWMA, Holt-Winters |
| 14.4 | ARIMA / SARIMA / SARIMAX | Air Passengers | AR, MA, I, seasonal, exogenous |
| 14.5 | Prophet | Wikipedia pageviews | Bayesian additive |
| 14.6 | LSTM / GRU for TS | Stock prices | sequence-to-one, windowing |
| 14.7 | Temporal CNN / TCN | Energy load | dilated conv |
| 14.8 | Transformers for TS | Electricity | Informer, PatchTST concepts |
| 14.9 | Multivariate & Multi-step | M5 (subset) | VAR, direct vs recursive |
| 14.10 | TS Anomaly Detection | NAB | STL residual, Twitter ESD |
| 14.11 | Granger Causality | Macro | VAR, Granger test |
Part 15 · Recommender Systems
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 15.1 | Content-based | MovieLens | TF-IDF + cosine |
| 15.2 | Collaborative Filtering | MovieLens | user-user, item-item, KNN |
| 15.3 | Matrix Factorization | MovieLens | SVD, ALS, SGD |
| 15.4 | Implicit Feedback | Last.fm | BPR, weighted MF |
| 15.5 | FM & FFM | Avazu (subset) | factorization machines |
| 15.6 | Wide & Deep | Census | memorization + generalization |
| 15.7 | DeepFM, DCN | Criteo (subset) | feature interactions |
| 15.8 | Two-Tower | MovieLens | sampled softmax, retrieval |
| 15.9 | Sequential Recommendation | MovieLens | SASRec, GRU4Rec, BERT4Rec |
| 15.10 | Multi-armed Bandits | Synthetic | ε-greedy, UCB, Thompson |
| 15.11 | Evaluation | MovieLens | precision@k, recall@k, NDCG, MAP, hit rate |
Part 16 · Graph Data & GNN
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 16.1 | Graph Basics with NetworkX | Karate | nodes, edges, degree, paths |
| 16.2 | Graph Algorithms | Karate | BFS/DFS, shortest path, centrality |
| 16.3 | Community Detection | Karate | modularity, Louvain |
| 16.4 | PageRank & HITS | Web graph | random walk |
| 16.5 | Node Embeddings | Cora | DeepWalk, node2vec |
| 16.6 | GCN | Cora | spectral GNN |
| 16.7 | GraphSAGE | Reddit (subset) | inductive learning |
| 16.8 | GAT | Citation | attention on graphs |
| 16.9 | Knowledge Graphs | FB15k-237 | TransE, RotatE |
Part 17 · Reinforcement Learning
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 17.1 | MDP & Bellman | GridWorld | state, action, reward, policy, value |
| 17.2 | Dynamic Programming | GridWorld | policy iter, value iter |
| 17.3 | Monte Carlo | Blackjack | every-visit, first-visit |
| 17.4 | TD Learning | GridWorld | TD(0), SARSA |
| 17.5 | Q-Learning | Taxi-v3 | off-policy, ε-greedy |
| 17.6 | DQN | CartPole | replay buffer, target net |
| 17.7 | Policy Gradient | CartPole | REINFORCE, baseline |
| 17.8 | Actor-Critic, A2C, A3C | CartPole | advantage estimation |
| 17.9 | PPO | LunarLander | clipped surrogate |
| 17.10 | Bandits & Contextual Bandits | News rec | exploration vs exploitation |
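The Bellman backup behind topics 17.1 and 17.2 fits in a few lines. The sketch below runs value iteration on a made-up one-dimensional corridor rather than the GridWorld used in the notebooks; the environment, reward scheme, and discount factor are all illustrative assumptions.

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-10):
    """Value iteration on a deterministic 1-D corridor.

    States are 0..n_states-1 and the last state is terminal. Actions move
    left (-1) or right (+1), clipped at the walls. Stepping into the terminal
    state yields reward 1; every other step yields 0.
    """
    goal = n_states - 1
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(goal):  # the terminal state keeps value 0
            candidates = []
            for action in (-1, +1):
                nxt = min(max(s + action, 0), goal)
                reward = 1.0 if nxt == goal else 0.0
                candidates.append(reward + gamma * values[nxt])
            best = max(candidates)  # Bellman optimality backup
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


values = value_iteration()
print([round(v, 3) for v in values])
```

Each state's value ends up as the discount factor raised to the number of extra steps needed to reach the goal, which is a quick sanity check for any implementation.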
Part 18 · Bayesian Methods
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 18.1 | Bayesian Linear Regression | Synthetic | conjugate prior, posterior |
| 18.2 | Bayesian Logistic Regression | Synthetic | Laplace approximation |
| 18.3 | MCMC | Beta-Binomial | Metropolis-Hastings, Gibbs |
| 18.4 | Variational Inference | Mixture | ELBO, mean-field |
| 18.5 | PyMC / Stan / NumPyro in Practice | Hierarchical model | probabilistic programming |
| 18.6 | Gaussian Processes | 1D regression | kernel, posterior over functions |
| 18.7 | Bayesian Optimization | Hyperparam tuning | acquisition functions |
Part 19 · Causal Inference & Experimentation
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 19.1 | A/B Test Design | Synthetic | randomization, MDE, power |
| 19.2 | A/B Test Analysis | Synthetic | t-test, CUPED, bootstrap CI |
| 19.3 | Bandits vs A/B | Synthetic | when to use which |
| 19.4 | DAGs & do-calculus | - | confounding, mediation, collider |
| 19.5 | Propensity Score Matching (PSM) | LaLonde | propensity, matching |
| 19.6 | Difference-in-Differences (DiD) | Card-Krueger | parallel trends |
| 19.7 | Instrumental Variables (IV) | Education–wage | 2SLS |
| 19.8 | Regression Discontinuity (RDD) | Election | sharp / fuzzy |
| 19.9 | Causal Forest | Synthetic | heterogeneous treatment effects |
| 19.10 | Uplift Modeling | Marketing | T-learner, S-learner, X-learner, R-learner |
| 19.11 | Network Experiments | Social graph | SUTVA violation, cluster randomization |
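For the MDE and power concepts in A/B test design (19.1), one common normal-approximation formula gives the sample size per arm for comparing two conversion rates. The sketch below is an illustration under that approximation: `sample_size_per_arm` is a hypothetical helper, the baseline rate and lift are invented, and the unpooled-variance variant is an assumption, not necessarily the one used in the notebook.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p_treat = p_control + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms (unpooled).
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Hypothetical scenario: 10% baseline conversion, detect a 2-point absolute lift.
n = sample_size_per_arm(p_control=0.10, mde=0.02)
print(f"about {n} users per arm")
```

Because `mde` enters squared in the denominator, halving the detectable effect roughly quadruples the required traffic.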
Part 20 · Advanced / Specialized Topics
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 20.1 | Survival Analysis | Lung | Kaplan-Meier, Cox PH |
| 20.2 | Geospatial Analysis | NYC Taxi, GeoJSON | GeoPandas, H3, shapely |
| 20.3 | Audio & Speech | UrbanSound8K | spectrogram, MFCC, CNN audio |
| 20.4 | Advanced Anomaly Detection | NAB, MVTec | autoencoder, deep SVDD, PaDiM |
| 20.5 | Semi-supervised Learning | CIFAR | label propagation, self-training, FixMatch |
| 20.6 | Active Learning | MNIST | uncertainty sampling |
| 20.7 | Meta-Learning | Omniglot | MAML, prototypical net |
| 20.8 | Federated Learning | MNIST partitioned | FedAvg |
| 20.9 | Privacy-Preserving ML | MNIST | differential privacy, DP-SGD |
| 20.10 | Synthetic Data Generation | Tabular | SMOTE, CTGAN |
| 20.11 | Multi-task Learning | Multi-output | shared backbone |
| 20.12 | Learning to Rank | LETOR | RankNet, LambdaMART |
Part 21 · Big Data ⭐⭐ (must-have for big tech)
Once data reaches TB / PB scale, single-machine pandas no longer copes, so these tools are must-knows.
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 21.1 | PySpark Basics | NYC Taxi | RDD, DataFrame, SparkSession |
| 21.2 | Spark SQL & Tuning | NYC Taxi | partitioning, broadcast join, Catalyst |
| 21.3 | Spark MLlib | Titanic-scale | pipeline, distributed training |
| 21.4 | Spark Streaming / Structured Streaming | Kafka topic | micro-batch, watermark |
| 21.5 | Dask | Large CSV | task graph, dask.dataframe, dask.delayed |
| 21.6 | Polars (Lazy) at Scale | Multi-GB | streaming engine |
| 21.7 | DuckDB | Parquet | analytical SQL on local data |
| 21.8 | Storage Formats | various | CSV, Parquet, Avro, ORC, Arrow |
| 21.9 | Hadoop Overview | - | HDFS, YARN, Hive |
| 21.10 | Kafka Basics | Toy stream | producer, consumer, topic |
| 21.11 | Lakehouse: Delta / Iceberg / Hudi | - | ACID on object storage |
| 21.12 | Distributed Training at Scale | - | Horovod, Ray, DeepSpeed concepts |
Part 22 · MLOps & Deployment ⭐⭐
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 22.1 | Model Persistence | any | pickle, joblib, ONNX |
| 22.2 | Scikit-learn Pipelines | Titanic | ColumnTransformer, end-to-end fit |
| 22.3 | Deployment with FastAPI | Iris | REST API, pydantic |
| 22.4 | Streamlit Dashboards | any | quick UI |
| 22.5 | Docker Basics | model service | image, container, compose |
| 22.6 | K8s Basics for ML | - | pod, service, deployment, KServe |
| 22.7 | CI/CD for ML | GitHub Actions | lint, test, model release |
| 22.8 | Experiment Tracking | any | MLflow, Weights & Biases |
| 22.9 | Feature Store | toy | Feast basics |
| 22.10 | Monitoring & Drift | Synthetic | data drift, concept drift, PSI, KS, Evidently |
| 22.11 | A/B Infra | - | bucketing, traffic split, shadow deploy |
| 22.12 | Edge Deployment | TFLite/ONNX | quantization, pruning, distillation |
Part 23 · Cloud for Data Science
| # | Topic | Dataset | Key Concepts |
|---|---|---|---|
| 23.1 | AWS for DS | - | S3, EC2, SageMaker, Lambda, Athena, Redshift |
| 23.2 | GCP for DS | - | GCS, BigQuery, Vertex AI, Dataflow |
| 23.3 | Azure for DS | - | Blob, Synapse, Azure ML |
| 23.4 | Databricks | - | notebooks, Delta Lake, jobs |
| 23.5 | Snowflake | - | warehouse, Snowpark |
| 23.6 | Airflow / Prefect / Dagster | toy DAG | scheduling, task graph |
Part 24 · ML System Design & Interviews ⭐⭐⭐
The final stage: senior DS / MLE interviews at big tech.
| # | Topic | Content |
|---|---|---|
| 24.1 | Design a Recommender | YouTube / Netflix scale |
| 24.2 | Design a Search Ranker | Google / Amazon search |
| 24.3 | Design a News Feed / Timeline | Facebook / Twitter |
| 24.4 | Design an Ads CTR System | Meta / Google Ads |
| 24.5 | Design Fraud Detection | Stripe / Visa |
| 24.6 | Design ETA / Route Prediction | Uber / DoorDash |
| 24.7 | Design Content Moderation | Reddit / TikTok |
| 24.8 | Design a RAG System | enterprise knowledge base |
| 24.9 | Case Interview Patterns | metric design, root cause analysis |
| 24.10 | Behavioral & DS Storytelling | STAR framework |
| 24.11 | SQL Interview Cram | LeetCode hard SQL |
| 24.12 | ML Concepts Cram | classic interview Q&A |
Standard Notebook Template
Every notebook follows the same structure, so the experience stays consistent:
1. Background & Problem Statement: bilingual (Chinese / English)
2. Math Derivation: LaTeX formulas
3. Data Loading & EDA
4. From-scratch Implementation: NumPy only (where applicable)
5. Using Standard Libraries: sklearn / PyTorch / etc.
6. Evaluation: multi-metric comparison
7. Visualization: conclusions presented as charts
8. Summary: recap of the key points + how they are used in real-world work
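The eight-section template above can be turned into an empty notebook skeleton programmatically. The sketch below builds a minimal nbformat-4 structure with the standard `json` module; `make_skeleton` and the example title are hypothetical, and the repository may well create its notebooks differently.

```python
import json

# Section titles taken from the template list above.
SECTIONS = [
    "Background & Problem Statement",
    "Math Derivation",
    "Data Loading & EDA",
    "From-scratch Implementation",
    "Using Standard Libraries",
    "Evaluation",
    "Visualization",
    "Summary",
]


def make_skeleton(title):
    """Build a minimal nbformat-4 notebook dict with one markdown cell per section."""
    cells = [{"cell_type": "markdown", "metadata": {}, "source": [f"# {title}"]}]
    for i, section in enumerate(SECTIONS, start=1):
        cells.append(
            {"cell_type": "markdown", "metadata": {}, "source": [f"## {i}. {section}"]}
        )
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = make_skeleton("4.1 Linear Regression")
text = json.dumps(notebook, indent=1)  # write this string to an .ipynb file to use it
print(text[:80])
```

Saving `text` under a name such as `01_linear_regression.ipynb` (a made-up example) should give a file that Jupyter opens with the eight headings in place.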