Megatron reading notes

Entry point

graph TD
  n1(main)
  
  subgraph pretrain_gpt.py
    n1---n3[/forward_step\]
  end
  subgraph training.py
    n2(pretrain)  
  end
  n3-->n2

The arguments to pretrain are as follows:

# Called with actual arguments in pretrain_gpt.py; formal parameters are defined in training.py
pretrain(train_valid_test_datasets_provider,
         model_provider,
         ModelType.encoder_or_decoder,
         forward_step,  # formal parameter: forward_step_func
         args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
# Returns None

pretrain does four things:

Initialize Megatron
Set up the model, optimizer, and learning rate
Get the datasets
Train the model
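The four steps can be sketched as a plain Python skeleton. This is a minimal sketch, not Megatron's actual code: `initialize`, `build_datasets`, and `run_training` are hypothetical stand-ins, the bodies are stubs, and only the order of the calls is taken from the list above.

```python
# Hypothetical stand-ins: only the call order mirrors the notes.
calls = []

def initialize(args_defaults):
    calls.append("initialize")

def setup_model_and_optimizer(model_provider, model_type):
    calls.append("setup_model_and_optimizer")
    model = model_provider()
    return model, "optimizer", "lr_scheduler"  # placeholder objects

def build_datasets(datasets_provider):
    calls.append("build_datasets")
    return datasets_provider()

def run_training(forward_step_func, model, optimizer, lr_scheduler, datasets):
    calls.append("train")

def pretrain_sketch(datasets_provider, model_provider, model_type,
                    forward_step_func, args_defaults=None):
    initialize(args_defaults or {})                              # step 1
    model, optimizer, lr_scheduler = setup_model_and_optimizer(  # step 2
        model_provider, model_type)
    datasets = build_datasets(datasets_provider)                 # step 3
    run_training(forward_step_func, model, optimizer,            # step 4
                 lr_scheduler, datasets)
    # Implicitly returns None, matching the note on the real call.

result = pretrain_sketch(lambda: ([], [], []), lambda: "model",
                         "encoder_or_decoder", lambda it, m: None)
```

The caller only supplies providers and a forward step; everything else is driven from inside `pretrain_sketch`.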

Training

Inside pretrain (training.py) there are two important functions: setup_model_and_optimizer, which configures the model, and train, which does the actual training.

graph LR
  n1(setup_model_and_optimizer)-->n2(train)

Inside train (training.py), update_num_microbatches and train_step (training.py) are called in a while loop controlled by iteration.

flowchart TB
  subgraph whileloop_iter
    n2(update_num_microbatches)-->n3(train_step)
  end
  n1(train)-->whileloop_iter
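The loop shape can be sketched as follows. This is a simplified sketch under assumptions: `train_iters` is a hypothetical name for the stopping bound, and the two callbacks are stubs that only record that they were called.

```python
def train_sketch(train_iters, update_num_microbatches, train_step):
    """Run the two calls once per iteration until train_iters is reached."""
    iteration = 0
    while iteration < train_iters:
        update_num_microbatches(iteration)  # first, refresh the micro-batch count
        train_step(iteration)               # then run one training step
        iteration += 1
    return iteration

log = []
final_iteration = train_sketch(
    2,
    lambda it: log.append(("update_num_microbatches", it)),
    lambda it: log.append(("train_step", it)),
)
```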

Inside train_step:

flowchart TB
  subgraph after_other_function
    n1(forward_backward_func)-->n2(allreduce_gradients)
    n2-->n4(...)
  end
  n3(train_step)-->after_other_function

Terminology: `DDP` = DistributedDataParallel. There are only two choices, ['local', 'torch'], and the default is local.

forward_backward_func actually branches on whether pipeline parallelism is used. When pipeline parallelism is used, it branches again on whether the interleaved schedule is used.

flowchart LR
  n1(forward_backward_func)-->n2(with pipeline)
  n1-->n3(no pipeline)
  n2-->n4(interleaved)
  n2-->n5(no interleaved)
  n3-->n6(interleaved)
  n3-->n7(no interleaved)
  n4---n8[branch1]
  n5---n9[branch2]
  n6---n10[branch3]
  n7---n11[branch4]
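The branching described in the prose can be sketched as a small selector. This is a sketch under assumptions, not Megatron's code: the parameter names `pipeline_parallel_size` and `virtual_pipeline_size` and the returned labels are hypothetical, and the sketch follows the prose, where only the pipeline case splits on interleaving.

```python
def select_schedule(pipeline_parallel_size, virtual_pipeline_size=None):
    """Pick a schedule label from two hypothetical settings."""
    if pipeline_parallel_size > 1:             # pipeline parallelism in use
        if virtual_pipeline_size is not None:  # interleaved schedule requested
            return "pipelining_with_interleaving"
        return "pipelining_without_interleaving"
    return "no_pipelining"                     # interleaving is not consulted here

label = select_schedule(4, 2)
```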

In branch1:

The explanation of the micro batch size is quoted from the # BERT Pretraining section of README.md:

While this is single GPU training, the batch size specified by --micro-batch-size is a single forward-backward path batch-size and the code will perform gradient accumulation steps until it reaches global-batch-size which is the batch size per iteration.

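In other words, the micro batch size is the batch size of each actual forward and backward pass. Gradients are accumulated over several such passes until global-batch-size samples have been processed.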
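A small numeric sketch of gradient accumulation on a single device. It uses a toy one-parameter squared-error loss rather than anything from Megatron, and it assumes each micro-batch gradient is scaled by 1 / num_microbatches so that the accumulated result equals the gradient over the whole global batch; that scaling is one common convention, not something the notes confirm.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients over one global batch."""
    global_batch_size = len(xs)
    assert global_batch_size % micro_batch_size == 0
    num_microbatches = global_batch_size // micro_batch_size
    total = 0.0
    for i in range(num_microbatches):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # One forward/backward pass over a single micro-batch.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_microbatches
    return total, num_microbatches

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = grad_mse(0.5, xs, ys)                    # one pass over the global batch
acc, steps = accumulated_grad(0.5, xs, ys, 2)   # four passes of size 2
```

With a global batch of 8 and a micro batch of 2, the loop runs four accumulation steps before the optimizer would update the weights.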

forward_backward_func has an important parameter, forward_step_func, which is passed all the way down the call chain. The original argument is forward_step, supplied in the pretrain call in pretrain_gpt.py, and that function is also defined in pretrain_gpt.py.

`def forward_step(data_iterator, model): # formal parameters`

flowchart TB
  subgraph schedule.py
    direction TB
    n1(forward_backward_pipelining_with_interleaving)-->n2(forward_step_helper self-func)
    n2-->n3(forward_step)
    end

Note that this `forward_step` is not the same function as the `forward_step` in pretrain_gpt.py!

This forward_step makes the following call:

# The function being called is forward_step from pretrain_gpt.py
forward_step_func(data_iterator,
                  model  # formal parameter: model
                  )
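How a callback gets threaded through two layers with the same-sounding name can be sketched like this. It is a simplified sketch: `user_forward_step`, `schedule_forward_step`, and `forward_backward_sketch` are hypothetical names, the "model" is a plain function, and the real functions take more arguments than shown here.

```python
def user_forward_step(data_iterator, model):
    """Plays the role of forward_step in pretrain_gpt.py."""
    batch = next(data_iterator)
    return model(batch)

def schedule_forward_step(forward_step_func, data_iterator, model):
    """Plays the role of forward_step in schedule.py: it invokes the callback."""
    return forward_step_func(data_iterator, model)

def forward_backward_sketch(forward_step_func, data_iterator, model,
                            num_microbatches):
    """Call the schedule-level step once per micro-batch."""
    return [schedule_forward_step(forward_step_func, data_iterator, model)
            for _ in range(num_microbatches)]

outputs = forward_backward_sketch(user_forward_step, iter([1, 2, 3]),
                                  lambda x: x * 10, 3)
```

The schedule-level function never knows what the user function does; it only forwards the iterator and the model.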

Next, trace where the model argument comes from.

flowchart TB 
  subgraph training.py
    direction TB
    n1(pretrain)-->n2(setup_model_and_optimizer)
  end

# training.py, lines 116-117
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider,
                                                           model_type)

Now look at setup_model_and_optimizer:

flowchart TB
  subgraph training.py
    direction TB
    n1(setup_model_and_optimizer)-->n2(get_model)
  end

The model is obtained through get_model. get_model takes an important parameter, model_provider, whose actual argument is supplied in pretrain_gpt.py.

    pretrain(train_valid_test_datasets_provider,
             model_provider,  # this is the argument in question
             ModelType.encoder_or_decoder,
             forward_step, args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})

That function is defined in pretrain_gpt.py:

def model_provider(pre_process=True, post_process=True):

    """Build the model."""
    print_rank_0('building GPT model ...')
    model = GPTModel(
        num_tokentypes=0,
        parallel_output=True,
        pre_process=pre_process,
        post_process=post_process
    )

    return model
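The provider pattern can be sketched without any Megatron imports. This is a sketch under assumptions: `get_model_sketch` and `toy_model_provider` are hypothetical, the "model" is just a dict, and deriving `pre_process` and `post_process` from first-stage and last-stage flags is an assumption about how the caller chooses them, not something the notes state.

```python
def toy_model_provider(pre_process=True, post_process=True):
    """Stand-in for model_provider: builds a 'model' from the two flags."""
    return {"pre_process": pre_process, "post_process": post_process}

def get_model_sketch(model_provider_func, is_first_stage=True,
                     is_last_stage=True):
    """Stand-in for get_model: the caller decides the flags, the provider builds."""
    return model_provider_func(pre_process=is_first_stage,
                               post_process=is_last_stage)

model = get_model_sketch(toy_model_provider, is_first_stage=True,
                         is_last_stage=False)
```

Passing a provider function instead of a model instance lets the training code decide when and with which flags the model is constructed.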

Megatron code analysis

http://home.ustc.edu.cn/~ustcxwy0271/2022/04/27/yuan-code-analysis/

Author

Xu Weiye

Published

April 27, 2022
