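With the rapid development of large language models, distributed training has become a key technique for training models at scale. Megatron-LM and DeepSpeed are two widely used frameworks; combined, they can substantially improve training efficiency and reduce GPU memory usage.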

This article shows how to use Megatron-DeepSpeed to train a 345-million-parameter (345M) GPT-2 model, focusing on the key steps of data preprocessing and training.

Installation

Requirements

According to the repository's README.md:

Install PyTorch >= 1.9
Install the CUDA Toolkit by following "Network Repo Installation for Ubuntu"; it includes NVCC (the NVIDIA CUDA Compiler)
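
The two requirements above can be checked before going further. The following is a minimal sketch using only the Python standard library; the helper names `parse_major_minor` and `check_environment` are hypothetical, and the check only reports what it finds (it assumes PyTorch is installed under the distribution name `torch` and that `nvcc` should be on `PATH`).

```python
import re
import shutil
from importlib import metadata


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(min_torch=(1, 9)):
    """Return a dict of findings; nothing is installed or modified."""
    report = {}
    try:
        torch_version = metadata.version("torch")
        report["torch"] = torch_version
        report["torch_ok"] = parse_major_minor(torch_version) >= min_torch
    except metadata.PackageNotFoundError:
        # PyTorch is not installed in the current environment.
        report["torch"] = None
        report["torch_ok"] = False
    # shutil.which returns None when nvcc is not on PATH.
    report["nvcc"] = shutil.which("nvcc")
    return report


print(check_environment())
```

If `torch_ok` is false or `nvcc` is `None`, the corresponding requirement should be revisited before building apex.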

Install the dependency packages

$ git clone https://github.com/NVIDIA/apex
$ cd apex  
$ python setup.py install

Install DeepSpeed and set up the Megatron-DeepSpeed environment

$ pip install deepspeed transformers pybind11 nltk ipython matplotlib
$ export home_dir=$HOME/cyc_train
$ mkdir -p $home_dir
$ cd $home_dir
$ git clone https://github.com/microsoft/Megatron-DeepSpeed.git

Preparing the OpenWebText dataset

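A subset of the OpenWebText dataset is used. To avoid polluting the main environment, create a virtual environment and install the packages needed to generate the dataset there: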

$ git clone https://github.com/yet-another-account/openwebtext.git
$ cd openwebtext
$ python -m venv menv
$ source menv/bin/activate
$ pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract lxml_html_clean
$ git clone https://github.com/mattilyra/LSH && cd LSH && python setup.py install

From OpenWebText MEGA, download one file to serve as the list of URLs to crawl, for example RS_2018-10.xz.deduped.txt, and place it under $home_dir/openwebtext/urls/

Modify the code in openwebtext/filter.py so that it works with newer versions of the tldextract library¹:

- domain = '.'.join([x for x in ext if x])
- basedomain = '.'.join(ext[-2:])
+ domain = ext.fqdn
+ basedomain = ext.registered_domain
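
To illustrate what the two attributes in the patch are expected to return, here is a small sketch. `FakeExtractResult` is a hypothetical stand-in written for this example, not the real tldextract class; it is assumed to mirror how `fqdn` and `registered_domain` behave, so the result should be checked against the tldextract version actually installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeExtractResult:
    """Hypothetical stand-in for tldextract's result object (not the real class)."""
    subdomain: str
    domain: str
    suffix: str

    @property
    def registered_domain(self) -> str:
        # Domain plus public suffix, or "" when either part is missing.
        if self.domain and self.suffix:
            return f"{self.domain}.{self.suffix}"
        return ""

    @property
    def fqdn(self) -> str:
        # All non-empty parts joined by dots, or "" when the name is incomplete.
        if self.domain and self.suffix:
            parts = (self.subdomain, self.domain, self.suffix)
            return ".".join(p for p in parts if p)
        return ""


ext = FakeExtractResult(subdomain="forums", domain="bbc", suffix="co.uk")
domain = ext.fqdn                   # full host name
basedomain = ext.registered_domain  # domain + public suffix only
print(domain, basedomain)
```

Accessing named attributes does not depend on the positional layout of the result object, which is why the patched lines are less fragile than slicing with `ext[-2:]`.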

Next, crawl the data:

$ cd $home_dir/openwebtext
$ python blacklist_urls.py urls cleanup_urls.txt
$ python download.py ../openwebtext/cleanup_urls.txt --save_uncompressed

After crawling, a series of txt files appears under $home_dir/openwebtext/scraped/data/. The next step is to clean and merge the data, which requires the tokenizer.py file that has since been deleted²

$ python merge_jsons.py --data_path scraped/data
$ python cleanup_dataset.py merged_output.json merged_clean.json

This yields merged_clean.json, a dataset of valid text documents with more than 128 tokens each
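
The length filter described above can be sketched as follows. This is an illustration only, not the code of cleanup_dataset.py: it assumes the merged file holds one JSON object per line with a `text` field, and it uses whitespace splitting as a stand-in for the real tokenizer. The function name `filter_short_documents` is hypothetical.

```python
import json


def filter_short_documents(lines, min_tokens=128):
    """Yield records whose "text" field has more than `min_tokens` tokens.

    Assumption: whitespace splitting stands in for the real tokenizer.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        text = record.get("text", "")
        if len(text.split()) > min_tokens:
            yield record


sample = [
    json.dumps({"text": "too short"}),
    json.dumps({"text": " ".join(["word"] * 200)}),
]
kept = list(filter_short_documents(sample))
print(len(kept))
```

Because a real BPE tokenizer usually produces more tokens than whitespace splitting, the counts from this sketch are only a rough lower bound on document length.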

Data preprocessing first requires downloading the vocabulary file gpt2-vocab.json and the BPE (Byte Pair Encoding) merge rules gpt2-merges.txt

$ cd dataset
$ bash download_vocab.sh

Next, preprocess the data:

$ export BASE_SRC_PATH=$home_dir/Megatron-DeepSpeed
$ export BASE_DATA_PATH=${BASE_SRC_PATH}/dataset
$ python3 ${BASE_SRC_PATH}/tools/preprocess_data.py --input ${BASE_DATA_PATH}/merged_clean.json --output-prefix ${BASE_DATA_PATH}/my-gpt2 --vocab-file ${BASE_DATA_PATH}/gpt2-vocab.json --dataset-impl mmap --tokenizer-type GPT2BPETokenizer --merge-file ${BASE_DATA_PATH}/gpt2-merges.txt --append-eod --workers 8

This produces the final training dataset files dataset/my-gpt2_text_document.idx and dataset/my-gpt2_text_document.bin
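
A quick sanity check on the two output files can catch a failed preprocessing run before training starts. The sketch below is hypothetical: `check_dataset_pair` is a name made up for this example, and the magic bytes are an assumption about the mmap index format that should be verified against megatron/data/indexed_dataset.py in your checkout.

```python
import tempfile
from pathlib import Path

# Assumed magic bytes at the start of an mmap-format .idx file;
# verify this value against the Megatron-DeepSpeed source you are using.
ASSUMED_IDX_MAGIC = b"MMIDIDX\x00\x00"


def check_dataset_pair(prefix):
    """Return True if `<prefix>.idx` and `<prefix>.bin` both look usable."""
    idx_path = Path(f"{prefix}.idx")
    bin_path = Path(f"{prefix}.bin")
    if not (idx_path.is_file() and bin_path.is_file()):
        return False
    if bin_path.stat().st_size == 0:
        return False  # an empty .bin means no tokens were written
    with idx_path.open("rb") as f:
        return f.read(len(ASSUMED_IDX_MAGIC)) == ASSUMED_IDX_MAGIC


# Demonstration on throwaway files, not on a real dataset.
with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "my-gpt2_text_document"
    Path(f"{prefix}.idx").write_bytes(ASSUMED_IDX_MAGIC + b"\x01")
    Path(f"{prefix}.bin").write_bytes(b"\x00\x01")
    demo_ok = check_dataset_pair(prefix)
    demo_missing = check_dataset_pair(Path(tmp) / "missing")
print(demo_ok, demo_missing)
```

For the files produced in this article, the prefix would be dataset/my-gpt2_text_document.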

Last modified on 2024-11-19