llm.c annotations - 2024-04-21

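llm.c is LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. Training GPT-2 (CPU, fp32) takes ~1,000 lines of clean code in the single file train_gpt2.c, and training it on the GPU takes ~2,000 lines (adding the CUDA kernels) in train_gpt2.cu. The code compiles and runs instantly, exactly matches the PyTorch reference implementation, and roughly matches the speed of (compiled) PyTorch (fp32, no flash attention). I chose GPT-2 as the first working example because it is the grandfather of LLMs, the first time the modern stack was put together.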

See karpathy/llm.c: LLM training in simple, raw C/CUDA (github.com) for more details.

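The code below is taken mainly from train_gpt2.c and test_gpt2.c, with added annotations that were generated using ChatGPT-4.

The annotations follow the order of the functions in the source file. To stay consistent with the source code and make comparison easy, the original English comments are left unchanged.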

1. train_gpt2.c

train_gpt2.c is the C version of the training code.

1.1 encoder_forward

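This code is the forward pass of the encoder (here the name refers to the embedding layer, not to a Transformer encoder stack). It converts the input token sequence into the corresponding sequence of embedding vectors, each of which carries both the identity of the token and its position in the sequence. The annotated function follows: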

// all the individual layers' forward and backward passes
// B = batch_size, T = sequence_length, C = channels, V = vocab_size
// B = batch_size: the number of samples fed to the model at once during training.
//     A larger batch can improve memory utilization and training speed, but it needs
//     more memory and may affect how well the model generalizes.
// T = sequence_length: the number of elements in a sequence. For text this is the
//     number of tokens; for time series it is the number of consecutive observations.
// C = channels: the meaning depends on context. For images it is the number of color
//     channels (3 for RGB); for text and other sequences it is the embedding dimension.
// V = vocab_size: the number of distinct tokens in the vocabulary, i.e. how many
//     different tokens the model can represent.

void encoder_forward(float* out,
                   int* inp, float* wte, float* wpe,
                   int B, int T, int C) {
    // out is (B,T,C). At each position (b,t), a C-dimensional vector summarizing token & position
    // inp is (B,T) of integers, holding the token ids at each (b,t) position
    // wte is (V,C) of token embeddings, short for "weight token embeddings"
    // wpe is (maxT,C) of position embeddings, short for "weight positional embedding"
    // loop over the batch
    for (int b = 0; b < B; b++) {
        // loop over the positions in the sequence
        for (int t = 0; t < T; t++) {
            // seek to the output position in out[b,t,:]
            float* out_bt = out + b * T * C + t * C;
            // get the index of the token at inp[b, t]
            int ix = inp[b * T + t];
            // seek to the position in wte corresponding to the token
            float* wte_ix = wte + ix * C;
            // seek to the position in wpe corresponding to the position
            float* wpe_t = wpe + t * C;
            // add the two vectors and store the result in out[b,t,:]
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte_ix[i] + wpe_t[i];
            }
        }
    }
}

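During training, every layer of a deep learning model runs a forward pass, computing its output for the next layer until the final output is produced. Once the loss (the discrepancy between the model output and the true labels) has been computed, backpropagation computes the gradient of the loss with respect to every parameter, and those gradients are then used to update the parameters. This cycle repeats over many training epochs until the model performs well enough.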
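To make the lookup concrete, here is a minimal sketch that runs the same computation on toy data. The names embed_forward and embed_forward_demo, the extra V parameter used for bounds checking, and all sizes and values are assumptions made for illustration; none of them come from llm.c.

```c
#include <assert.h>

// Compact restatement of the lookup: out[b,t,:] = wte[inp[b,t],:] + wpe[t,:].
// V is an extra, hypothetical parameter used only to bounds-check token ids.
void embed_forward(float* out, const int* inp, const float* wte, const float* wpe,
                   int B, int T, int C, int V) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            int ix = inp[b * T + t];
            assert(ix >= 0 && ix < V);  // an out-of-range id would read past wte
            float* out_bt = out + (b * T + t) * C;
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte[ix * C + i] + wpe[t * C + i];
            }
        }
    }
}

// Toy sizes (made up for illustration): B=1, T=2, C=2, V=3.
// out must have room for B*T*C = 4 floats.
void embed_forward_demo(float* out) {
    const float wte[3 * 2] = {1.0f, 2.0f,   3.0f, 4.0f,   5.0f, 6.0f};
    const float wpe[2 * 2] = {0.5f, 0.5f,   0.25f, 0.25f};
    const int inp[1 * 2] = {2, 0};  // token 2 at position 0, token 0 at position 1
    embed_forward(out, inp, wte, wpe, 1, 2, 2, 3);
    // out[0,0,:] = wte[2] + wpe[0] = {5.5, 6.5}
    // out[0,1,:] = wte[0] + wpe[1] = {1.25, 2.25}
}
```

The bounds check is the only behavioral difference from the annotated function, which trusts every token id to be smaller than V.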

1.2 encoder_backward

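This code performs the backward pass of the encoder layer. It accumulates the gradients of the word embeddings and the positional embeddings, a key step when training models on sequence data such as text. The annotated function follows: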

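void encoder_backward(float* dwte, float* dwpe,
                      float* dout, int* inp,
                      int B, int T, int C) {
    // loop over every position in the batch and the sequence
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the output gradient at this batch index and time step
            float* dout_bt = dout + b * T * C + t * C;
            // get the token index at this position
            int ix = inp[b * T + t];
            // seek to the gradient row of the token embedding for this token
            float* dwte_ix = dwte + ix * C;
            // seek to the gradient row of the position embedding for this position
            float* dwpe_t = dwpe + t * C;
            // accumulate the gradients of both embeddings
            for (int i = 0; i < C; i++) {
                // the incoming gradient for this channel
                float d = dout_bt[i];
                // accumulate into the token embedding gradient
                dwte_ix[i] += d;
                // accumulate into the position embedding gradient
                dwpe_t[i] += d;
            }
        }
    }
}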

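During training, the forward pass produces the final output, and the loss is computed by comparing the prediction with the actual result. The backward pass then computes the gradient of the loss with respect to the model parameters, which is used to update them and reduce future prediction error. This function handles the gradients of the two embeddings that matter for sequence models, the word embeddings and the positional embeddings, which let the model capture the meaning and the structure of the input sequence. Note that it only accumulates gradients with +=: the gradient buffers must be zeroed before each backward pass, and the parameter update itself happens in a separate step.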
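The accumulation is easiest to see when the same token appears more than once. The sketch below is a minimal illustration under assumed toy sizes; embed_backward, embed_backward_demo, and the extra V parameter are hypothetical names introduced here, not part of llm.c.

```c
#include <assert.h>
#include <string.h>

// Same accumulation as the annotated backward pass, with a bounds check added.
// V is a hypothetical extra parameter used only for that check.
void embed_backward(float* dwte, float* dwpe, const float* dout, const int* inp,
                    int B, int T, int C, int V) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            int ix = inp[b * T + t];
            assert(ix >= 0 && ix < V);
            const float* dout_bt = dout + (b * T + t) * C;
            for (int i = 0; i < C; i++) {
                dwte[ix * C + i] += dout_bt[i];  // token row: one hit per occurrence
                dwpe[t * C + i] += dout_bt[i];   // position row: one hit per batch row
            }
        }
    }
}

// Toy sizes (made up for illustration): B=1, T=2, C=2, V=3.
// dwte must hold V*C = 6 floats and dwpe must hold T*C = 4 floats.
void embed_backward_demo(float* dwte, float* dwpe) {
    const int inp[2] = {1, 1};  // the same token at both positions
    const float dout[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    // the function accumulates with +=, so the buffers must start at zero
    memset(dwte, 0, 3 * 2 * sizeof(float));
    memset(dwpe, 0, 2 * 2 * sizeof(float));
    embed_backward(dwte, dwpe, dout, inp, 1, 2, 2, 3);
    // dwte row 1 receives both contributions: {2, 2}; rows 0 and 2 stay {0, 0}
    // dwpe rows 0 and 1 each receive one contribution: {1, 1}
}
```

Skipping the memset calls would add the new gradients on top of whatever the buffers already held.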

1.3 layernorm_forward

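This code implements the forward pass of layer normalization (Layer Normalization). Layer normalization is widely used in deep learning, especially in sequence models such as recurrent neural networks (RNNs) and Transformers. It helps stabilize training by normalizing the activations of each layer, which is often described as reducing internal covariate shift. The annotated function follows: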

void layernorm_forward(float* out, float* mean, float* rstd,
                       float* inp, float* weight, float* bias,
                       int B, int T, int C) {
    // reference: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
    // both inp and out are (B,T,C) of the activations
    // mean and rstd are (B,T) buffers, to be used later in backward pass
    // at each position (b,t) of the input, the C-dimensional vector
    // of activations gets normalized, then scaled and shifted
    // eps avoids division by zero
    float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the input position inp[b,t,:]
            float* x = inp + b * T * C + t * C;
            // calculate the mean
            float m = 0.0f;
            for (int i = 0; i < C; i++) {
                m += x[i];
            }
            m = m/C;
            // calculate the variance (without any bias correction)
            float v = 0.0f;
            for (int i = 0; i < C; i++) {
                float xshift = x[i] - m;
                v += xshift * xshift;
            }
            v = v/C;
            // calculate the rstd (reciprocal standard deviation)
            float s = 1.0f / sqrtf(v + eps);
            // seek to the output position in out[b,t,:]
            float* out_bt = out + b * T * C + t * C;
            for (int i = 0; i < C; i++) {
                float n = (s * (x[i] - m)); // normalize
                float o = n * weight[i] + bias[i]; // scale and shift
                out_bt[i] = o; // write
            }
            // cache the mean and rstd for the backward pass later
            mean[b * T + t] = m;
            rstd[b * T + t] = s;
        }
    }
}

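The function computes the mean and standard deviation of the activation vector at each position, normalizes the vector (subtract the mean, divide by the standard deviation), and then applies a linear transform (scale by the weight, shift by the bias) to produce the output. Keeping the activation distribution of every layer stable in this way helps numerical stability and convergence speed during training.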
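Written as equations for a single position (b,t), the steps above look as follows. The symbols are chosen here for illustration and map onto the code variables as: \mu is m, \sigma^2 is v, r is s (the cached rstd), \hat{x}_i is n, y_i is o, w_i and b_i are weight[i] and bias[i], and \epsilon is eps = 1e-5.

```latex
\mu = \frac{1}{C}\sum_{i=1}^{C} x_i, \qquad
\sigma^2 = \frac{1}{C}\sum_{i=1}^{C} (x_i - \mu)^2, \qquad
r = \frac{1}{\sqrt{\sigma^2 + \epsilon}}

\hat{x}_i = r\,(x_i - \mu), \qquad
y_i = \hat{x}_i\, w_i + b_i
```

The variance divides by C rather than C - 1, which is what the "without any bias correction" comment in the code refers to.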

1.4 layernorm_backward

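This code implements the backward pass of layer normalization. During training, backpropagation computes the gradient of the loss with respect to every parameter, and those gradients are then used to update the parameters. The layer normalization backward pass has two jobs: it passes the gradient arriving from later layers back to earlier layers, and it accumulates the gradients of the layer's own weight and bias. The annotated function follows: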

void layernorm_backward(float* dinp, float* dweight, float* dbias,
                        float* dout, float* inp, float* weight, float* mean, float* rstd,
                        int B, int T, int C) {
    // loop over every batch index and time step
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the output gradient at (b,t)
            float* dout_bt = dout + b * T * C + t * C;
            // seek to the matching input and input gradient
            float* inp_bt = inp + b * T * C + t * C;
            float* dinp_bt = dinp + b * T * C + t * C;
            // load the cached mean and reciprocal standard deviation
            float mean_bt = mean[b * T + t];
            float rstd_bt = rstd[b * T + t];

            // first: two reduce operations
            float dnorm_mean = 0.0f;
            float dnorm_norm_mean = 0.0f;
            for (int i = 0; i < C; i++) {
                // recompute the normalized value, and scale the output gradient by the weight
                float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
                float dnorm_i = weight[i] * dout_bt[i];
                dnorm_mean += dnorm_i;
                dnorm_norm_mean += dnorm_i * norm_bti;
            }
            // turn both sums into means
            dnorm_mean = dnorm_mean / C;
            dnorm_norm_mean = dnorm_norm_mean / C;

            // now iterate again and accumulate all the gradients
            for (int i = 0; i < C; i++) {
                // recompute the normalized value, and scale the output gradient by the weight
                float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
                float dnorm_i = weight[i] * dout_bt[i];
                // gradient contribution to bias
                dbias[i] += dout_bt[i];
                // gradient contribution to weight
                dweight[i] += norm_bti * dout_bt[i];
                // gradient contribution to input
                float dval = 0.0f;
                dval += dnorm_i; // term 1
                dval -= dnorm_mean; // term 2
                dval -= norm_bti * dnorm_norm_mean; // term 3
                dval *= rstd_bt; // final scale
                dinp_bt[i] += dval;
            }
        }
    }
}

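The function first computes two intermediate values: dnorm_mean, the mean of dnorm_i = weight[i] * dout_bt[i] (the output gradient scaled by the weight), and dnorm_norm_mean, the mean of dnorm_i multiplied by the normalized value. It then uses them to compute the gradient with respect to the input, while the weight and bias gradients are accumulated directly from dout. In this way the gradient accounts for the effect of normalization: every input element influences the mean and the variance, so it affects every output element at the same position.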
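The three terms in the input gradient can be written out for one position (b,t) as below. The notation is introduced here for illustration: g_i is dout_bt[i], \hat{x}_i is norm_bti, r is rstd_bt, d_i is dnorm_i, \bar{d} is dnorm_mean, and \overline{d\hat{x}} is dnorm_norm_mean. The arrow denotes accumulation with +=, and the weight and bias gradients are summed over all positions.

```latex
d_i = w_i\, g_i, \qquad
\bar{d} = \frac{1}{C}\sum_{j=1}^{C} d_j, \qquad
\overline{d\hat{x}} = \frac{1}{C}\sum_{j=1}^{C} d_j\, \hat{x}_j

\frac{\partial L}{\partial b_i} \mathrel{+}= g_i, \qquad
\frac{\partial L}{\partial w_i} \mathrel{+}= \hat{x}_i\, g_i

\frac{\partial L}{\partial x_i} \mathrel{+}=
r \left( \underbrace{d_i}_{\text{term 1}}
 - \underbrace{\bar{d}}_{\text{term 2}}
 - \underbrace{\hat{x}_i\, \overline{d\hat{x}}}_{\text{term 3}} \right)
```

Term 2 comes from the dependence of the mean on x_i, and term 3 from the dependence of the reciprocal standard deviation on x_i.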

1.5 matmul_forward

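This code is the forward pass of matrix multiplication, used for the linear (fully connected) layers of the model. It multiplies the input by the weight matrix and adds the bias to produce the output. The annotated function follows: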

void matmul_forward(float* out,
                    float* inp, float* weight, float* bias,
                    int B, int T, int C, int OC) {
    // most of the running time is spent here and in matmul_backward
    // OC is short for "output channels"
    // inp is (B,T,C), weight is (OC, C), bias is (OC)
    // out will be (B,T,OC)
    // loop over the batch and the time steps, in parallel
    #pragma omp parallel for collapse(2)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the output row out[b,t,:]
            float* out_bt = out + b * T * OC + t * OC;
            // seek to the input row inp[b,t,:]
            float* inp_bt = inp + b * T * C + t * C;
            // loop over the output channels
            for (int o = 0; o < OC; o++) {
                // start from the bias if there is one, otherwise from 0
                float val = (bias != NULL) ? bias[o] : 0.0f;
                // seek to row o of the weight matrix
                float* wrow = weight + o*C;
                // dot product of the input row with the weight row
                for (int i = 0; i < C; i++) {
                    val += inp_bt[i] * wrow[i];
                }
                // store the result in the output
                out_bt[o] = val;
            }
        }
    }
}

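The function is parallelized with OpenMP to improve throughput. The #pragma omp parallel for collapse(2) directive merges the two outer loops (over b and t) into a single iteration space of B*T iterations and distributes it across threads.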

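Matrix multiplication is one of the most common operations in deep learning and underlies both fully connected and convolutional layers. Multiplying the input by a weight matrix and adding a bias applies a linear transformation to the data, the basic building block from which neural networks learn complex representations.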
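Because weight is stored row-major as (OC, C), each output channel is the dot product of the input row with one weight row, so the layer computes inp * W^T + bias. The sketch below shows this for a single (b,t) position on toy data. linear_row, linear_row_demo, and the values are assumptions made for illustration, and the OpenMP pragma is left out to keep the example single-threaded.

```c
#include <assert.h>
#include <stddef.h>

// One (b,t) position of the linear layer:
// out[o] = bias[o] + sum_i inp[i] * weight[o*C + i], with weight stored as (OC, C).
void linear_row(float* out, const float* inp, const float* weight, const float* bias,
                int C, int OC) {
    assert(C > 0 && OC > 0);
    for (int o = 0; o < OC; o++) {
        float val = (bias != NULL) ? bias[o] : 0.0f;  // bias is optional
        const float* wrow = weight + o * C;
        for (int i = 0; i < C; i++) {
            val += inp[i] * wrow[i];
        }
        out[o] = val;
    }
}

// Toy sizes (made up for illustration): C=2, OC=3. Both outputs hold OC = 3 floats.
void linear_row_demo(float* with_bias, float* without_bias) {
    const float inp[2] = {1.0f, 2.0f};
    const float weight[3 * 2] = {1.0f, 0.0f,   0.0f, 1.0f,   1.0f, 1.0f};
    const float bias[3] = {0.5f, 0.5f, 0.5f};
    linear_row(with_bias, inp, weight, bias, 2, 3);     // {1.5, 2.5, 3.5}
    linear_row(without_bias, inp, weight, NULL, 2, 3);  // {1.0, 2.0, 3.0}
}
```

Passing NULL for bias mirrors the (bias != NULL) check in the annotated function, which lets the same routine serve layers without a bias term.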

1.6 matmul_backward

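This code implements the backward pass of the matrix multiplication. It accumulates the gradients of the input, the weight, and the bias. In training, backpropagation is the key step that computes the gradient of the loss with respect to the network parameters, which gradient descent then uses to update them. The annotated function follows: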

void matmul_backward(float* dinp, float* dweight, float* dbias,
                     float* dout, float* inp, float* weight,
                     int B, int T, int C, int OC) {
    // most of the running time is spent here and in matmul_forward
    // this backward could be done in a single "round" of loops
    // but that doesn't afford an efficient parallelization strategy

    // backward into inp first, parallelize over B,T
    #pragma omp parallel for collapse(2)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the output gradient at (b,t)
            float* dout_bt = dout + b * T * OC + t * OC;
            // seek to the matching input gradient
            float* dinp_bt = dinp + b * T * C + t * C;
            for (int o = 0; o < OC; o++) {
                // seek to row o of the weight matrix
                float* wrow = weight + o*C;
                // the output gradient for channel o
                float d = dout_bt[o];
                for (int i = 0; i < C; i++) {
                    // accumulate into the input gradient
                    dinp_bt[i] += wrow[i] * d;
                }
            }
        }
    }
    // backward into weight/bias, parallelize over output channels OC
    #pragma omp parallel for
    for (int o = 0; o < OC; o++) {
        for (int b = 0; b < B; b++) {
            for (int t = 0; t < T; t++) {
                // seek to the output gradient at (b,t)
                float* dout_bt = dout + b * T * OC + t * OC;
                // seek to the input at (b,t)
                float* inp_bt = inp + b * T * C + t * C;
                // seek to row o of the weight gradient
                float* dwrow = dweight + o*C;
                // the output gradient for channel o
                float d = dout_bt[o];
                // accumulate into the bias gradient, if there is a bias
                if (dbias != NULL) { dbias[o] += d; }
                for (int i = 0; i < C; i++) {
                    // accumulate into the weight gradient
                    dwrow[i] += inp_bt[i] * d;
                }
            }
        }
    }
}

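The function first computes the input gradient dinp, in parallel over positions, by multiplying the output gradient dout with the weight matrix weight. The weight gradient dweight and the bias gradient dbias are then computed from the input inp and the output gradient dout: dweight accumulates the products of dout and inp, and dbias accumulates dout itself. This reflects a property of matrix multiplication: its backward pass is itself a pair of matrix multiplications, one producing the input gradient and one producing the weight gradient.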

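This makes it possible to backpropagate efficiently through the linear layers of the model so that their parameters can be updated during training. The OpenMP directives put multi-core processors to work to speed up the gradient computation. The two loops are parallelized over different indices so that no two threads write to the same element: each (b,t) pair owns one row of dinp, and each output channel o owns one row of dweight and one entry of dbias.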
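For comparison, the single "round" of loops mentioned in the code comments can be sketched as follows. It computes the same three gradients but is only safe single-threaded, since different positions n would write to the same dweight and dbias entries. linear_backward, linear_backward_demo, the flattened position count N = B*T, and the toy values are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

// Backward of out = inp * W^T + bias over N = B*T positions, in one round of loops.
// All three gradients are accumulated with +=, as in the annotated function.
void linear_backward(float* dinp, float* dweight, float* dbias,
                     const float* dout, const float* inp, const float* weight,
                     int N, int C, int OC) {
    assert(N > 0 && C > 0 && OC > 0);
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < OC; o++) {
            float d = dout[n * OC + o];
            if (dbias != NULL) { dbias[o] += d; }
            for (int i = 0; i < C; i++) {
                dinp[n * C + i] += weight[o * C + i] * d;   // dinp = dout * W
                dweight[o * C + i] += inp[n * C + i] * d;   // dweight = dout^T * inp
            }
        }
    }
}

// Toy sizes (made up for illustration): N=2, C=2, OC=2.
// dinp and dweight hold 4 floats each, dbias holds 2 floats.
void linear_backward_demo(float* dinp, float* dweight, float* dbias) {
    const float inp[2 * 2]    = {1.0f, 2.0f,    3.0f, 4.0f};
    const float weight[2 * 2] = {1.0f, 2.0f,    3.0f, 4.0f};
    const float dout[2 * 2]   = {1.0f, -1.0f,   0.0f, 2.0f};
    memset(dinp, 0, 4 * sizeof(float));
    memset(dweight, 0, 4 * sizeof(float));
    memset(dbias, 0, 2 * sizeof(float));
    linear_backward(dinp, dweight, dbias, dout, inp, weight, 2, 2, 2);
    // dinp    = {-2, -2,  6, 8}
    // dweight = { 1,  2,  5, 6}
    // dbias   = { 1,  1}  (column sums of dout)
}
```

A small hand-checkable case like this is also a convenient reference when testing a parallel or GPU version of the same layer.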

1.7 attention_forward

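This code implements the forward pass of the attention mechanism in the Transformer model. Attention is the core of the Transformer: it lets the model dynamically attend to different parts of the sequence as it processes it. The annotated function follows: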

void attention_forward(float* out, float* preatt, float* att,
                       float* inp,
                       int B, int T, int C, int NH) {
    // input is (B, T, 3C) holding the query, key, value (Q, K, V) vectors
    // preatt, att are (B, NH, T, T). NH = number of heads, T = sequence length
    // that holds the pre-attention and post-attention scores (used in backward)
    // output is (B, T, C)
    // attention is the only layer that mixes information across time
    // every other operation is applied at every (b,t) position independently
    // (and of course, no layer mixes information across batch)
    int C3 = C*3;
    int hs = C / NH; // head size
    // factor used to scale the dot products
    float scale = 1.0 / sqrtf(hs);

    #pragma omp parallel for collapse(3)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            for (int h = 0; h < NH; h++) {
                float* query_t = inp + b * T * C3 + t * C3 + h * hs;
                float* preatt_bth = preatt + b*NH*T*T + h*T*T + t*T;
                float* att_bth = att + b*NH*T*T + h*T*T + t*T;

                // pass 1: calculate query dot key and maxval
                // maxval is tracked for numerical stability
                float maxval = -10000.0f; // TODO something better
                for (int t2 = 0; t2 <= t; t2++) {
                    // location of the key vector
                    float* key_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C; // +C because it's key

                    // (query_t) dot (key_t2)
                    // compute the dot product, then scale it
                    float val = 0.0f;
                    for (int i = 0; i < hs; i++) {
                        val += query_t[i] * key_t2[i];
                    }
                    val *= scale;
                    if (val > maxval) {
                        maxval = val;
                    }

                    preatt_bth[t2] = val;
                }

                // pass 2: calculate the exp and keep track of sum
                // maxval is being calculated and subtracted only for numerical stability
                float expsum = 0.0f;
                for (int t2 = 0; t2 <= t; t2++) {
                    float expv = expf(preatt_bth[t2] - maxval);
                    expsum += expv;
                    att_bth[t2] = expv;
                }
                float expsum_inv = expsum == 0.0f ? 0.0f : 1.0f / expsum;

                // pass 3: normalize to get the softmax
                for (int t2 = 0; t2 < T; t2++) {
                    if (t2 <= t) {
                        att_bth[t2] *= expsum_inv;
                    } else {
                        // causal attention mask. not strictly necessary to set to zero here
                        // only doing this explicitly for debugging and checking to PyTorch
                        att_bth[t2] = 0.0f;
                    }
                }

                // pass 4: accumulate weighted values into the output of attention
                float* out_bth = out + b * T * C + t * C + h * hs;
                for (int i = 0; i < hs; i++) { out_bth[i] = 0.0f; }
                for (int t2 = 0; t2 <= t; t2++) {
                    // location of the value vector
                    float* value_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C*2; // +C*2 because it's value
                    float att_btht2 = att_bth[t2];
                    for (int i = 0; i < hs; i++) {
                        out_bth[i] += att_btht2 * value_t2[i];
                    }
                }
            }
        }
    }
}

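The function first computes the scaled dot product between the query and each key, then applies softmax to obtain the attention weights, and finally takes the weighted sum of the value vectors to produce the output of the attention layer. Because the inner loops only run over t2 <= t, each position attends only to itself and to earlier positions (causal attention). This mechanism lets the model focus dynamically on different parts of the input sequence and capture long-range dependencies within it. The OpenMP directive again uses multiple cores to speed up the computation.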
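The pointer arithmetic in this function is the part that is easiest to misread, so the sketch below isolates it. It assumes the layout stated in the code comments: inp is (B, T, 3C) with Q, K, V packed one after another in each row, and preatt/att are (B, NH, T, T). The helper names qkv_offset and att_row_offset and the QKV_* constants are hypothetical and introduced only for illustration.

```c
#include <assert.h>

// Which of the three packed vectors to address inside a (B, T, 3C) row.
enum { QKV_QUERY = 0, QKV_KEY = 1, QKV_VALUE = 2 };

// Offset of the head-h slice of Q, K or V at position (b, t) in inp.
// Matches the expressions in the code: "+ C" selects the key, "+ C*2" the value.
int qkv_offset(int b, int t, int h, int which, int T, int C, int NH) {
    assert(C % NH == 0);  // each head gets hs = C / NH channels
    int C3 = C * 3;
    int hs = C / NH;
    return b * T * C3 + t * C3 + which * C + h * hs;
}

// Offset of row t of head h in preatt/att, laid out as (B, NH, T, T).
// The row holds one score per key position t2 in [0, T).
int att_row_offset(int b, int h, int t, int T, int NH) {
    return b * NH * T * T + h * T * T + t * T;
}
```

With the made-up sizes T=4, C=6, NH=2 (so hs=3 and each row of inp has 18 floats), the query slice of head 1 at (b=0, t=1) starts at offset 21, the key slice at 27, and the value slice at 33.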

1.8 attention_backward

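This code implements the backward pass of the attention mechanism, a core component of the Transformer model. When training a neural network, backpropagation computes the gradient of the loss with respect to the network parameters, which is then used to update the model. The annotated function follows: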

void attention_backward(float* dinp, float* dpreatt, float* datt,
                        float* dout, float* inp, float* att,
                        int B, int T, int C, int NH) {
    // inp/dinp are (B, T, 3C) Q,K,V
    // att/datt/dpreatt are (B, NH, T, T)
    // dout is (B, T, C)
    int C3 = C*3;
    int hs = C / NH; // head size
    // factor used to scale the query-key dot products
    float scale = 1.0 / sqrtf(hs);

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            for (int h = 0; h < NH; h++) {
                // pointers to the rows of att, datt and dpreatt
                float* att_bth = att + b*NH*T*T + h*T*T + t*T;
                float* datt_bth = datt + b*NH*T*T + h*T*T + t*T;
                float* dpreatt_bth = dpreatt + b*NH*T*T + h*T*T + t*T;
                // pointers to the query gradient and the query itself
                float* dquery_t = dinp + b * T * C3 + t * C3 + h * hs;
                float* query_t = inp + b * T * C3 + t * C3 + h * hs;

                // backward pass 4, through the value accumulation
                float* dout_bth = dout + b * T * C + t * C + h * hs;
                for (int t2 = 0; t2 <= t; t2++) {
                    // pointer to the value vector
                    float* value_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C*2; // +C*2 because it's value
                    // pointer to the value gradient
                    float* dvalue_t2 = dinp + b * T * C3 + t2 * C3 + h * hs + C*2;
                    for (int i = 0; i < hs; i++) {
                        // in the forward pass this was:
                        // out_bth[i] += att_bth[t2] * value_t2[i];
                        // so now we have:
                        // accumulate the gradients of the attention weight and of the value
                        datt_bth[t2] += value_t2[i] * dout_bth[i];
                        dvalue_t2[i] += att_bth[t2] * dout_bth[i];
                    }
                }

                // backward pass 2 & 3, the softmax
                // note that softmax (like e.g. tanh) doesn't need the input (preatt) to backward
                for (int t2 = 0; t2 <= t; t2++) {
                    for (int t3 = 0; t3 <= t; t3++) {
                        // indicator function: 1 when t2 == t3, otherwise 0
                        float indicator = t2 == t3 ? 1.0f : 0.0f;
                        // local derivative of softmax output t2 with respect to input t3
                        float local_derivative = att_bth[t2] * (indicator - att_bth[t3]);
                        dpreatt_bth[t3] += local_derivative * datt_bth[t2];
                    }
                }

                // backward pass 1, the query @ key matmul
                for (int t2 = 0; t2 <= t; t2++) {
                    // pointer to the key vector
                    float* key_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C; // +C because it's key
                    // pointer to the key gradient
                    float* dkey_t2 = dinp + b * T * C3 + t2 * C3 + h * hs + C; // +C because it's key
                    for (int i = 0; i < hs; i++) {
                        // in the forward pass this was:
                        // preatt_bth[t2] += (query_t[i] * key_t2[i]) * scale;
                        // so now we have:
                        // accumulate the gradients of the query and of the key
                        dquery_t[i] += key_t2[i] * dpreatt_bth[t2] * scale;
                        dkey_t2[i] += query_t[i] * dpreatt_bth[t2] * scale;
                    }
                }
            }
        }
    }
}

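This backward pass first handles the value (V) accumulation, then the softmax, and finally the query (Q) and key (K) dot product. The order is the reverse of the forward computation, which scores each query against the keys, turns the scores into attention weights with softmax, and uses those weights to form a weighted sum of the value vectors. Starting from the output gradient dout, the function accumulates the gradients dinp (of the packed Q, K, V input inp), datt (of the attention weights att) and dpreatt (of the pre-softmax scores preatt).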
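As a hedged illustration of the softmax step, the sketch below restricts it to a single attention row and compares the double loop over (t2, t3) with the closed form dpre[j] = att[j] * (datt[j] - sum_k att[k] * datt[k]), which follows from summing the loop over t2. The helper names softmax_backward_loops and softmax_backward_closed are invented for this example and are not part of llm.c.

```c
// Hypothetical helper: the O(n^2) form used in attention_backward,
// restricted to one row of length n. dpre must be zero-initialized.
void softmax_backward_loops(float* dpre, const float* att, const float* datt, int n) {
    for (int t2 = 0; t2 < n; t2++) {
        for (int t3 = 0; t3 < n; t3++) {
            float indicator = (t2 == t3) ? 1.0f : 0.0f;
            float local_derivative = att[t2] * (indicator - att[t3]);
            dpre[t3] += local_derivative * datt[t2];
        }
    }
}

// Hypothetical helper: the same gradient in O(n), obtained by summing
// the loop above over t2 by hand. dpre must be zero-initialized.
void softmax_backward_closed(float* dpre, const float* att, const float* datt, int n) {
    float dot = 0.0f;
    for (int k = 0; k < n; k++) {
        dot += att[k] * datt[k];
    }
    for (int j = 0; j < n; j++) {
        dpre[j] += att[j] * (datt[j] - dot);
    }
}
```

Both functions should agree up to floating-point rounding for any row att that sums to 1. The listing above keeps the explicit double loop, which mirrors the softmax Jacobian term by term.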

1.9 gelu_forward

This code implements the forward pass of the Gaussian Error Linear Unit (GeLU) activation. GeLU is the non-linearity used in the multilayer perceptron (MLP) block of the Transformer. The function is annotated below:

#define GELU_SCALING_FACTOR sqrtf(2.0f / M_PI)
void gelu_forward(float* out, float* inp, int N) {
    // (approximate) GeLU elementwise non-linearity in the MLP block of Transformer
    for (int i = 0; i < N; i++) {
        // value of the current element
        float x = inp[i];
        // 0.044715 times x cubed
        float cube = 0.044715f * x * x * x;
        // evaluate the tanh approximation of GeLU and store it in the output
        out[i] = 0.5f * x * (1.0f + tanhf(GELU_SCALING_FACTOR * (x + cube)));
    }
}

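GeLU is defined as GELU(x) = x * Phi(x), where Phi is the cumulative distribution function (CDF) of the standard normal distribution. The code uses the common tanh approximation of this function, 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). GELU_SCALING_FACTOR is the constant sqrt(2/pi), and the 0.044715f * x * x * x term is the cubic correction that makes the tanh curve track Phi closely. For large positive inputs the output approaches x (an almost linear region), for large negative inputs it approaches 0, and the transition in between is smooth.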

1.10 gelu_backward

This code implements the backward pass of the GeLU activation: it computes the derivative of GeLU with respect to its input and, by the chain rule, accumulates the resulting gradient into the gradient of the previous layer. The function is annotated below:

// we want to use -Ofast optimization, but sadly GeLU breaks, so disable this flag just for it (#168)
#pragma float_control(precise, on, push) // On msvc /fp:fast is a lot faster, but the expf inside coshf breaks the model
__attribute__((optimize("no-finite-math-only"))) // same for gcc -Ofast
void gelu_backward(float* dinp, float* inp, float* dout, int N) {
    for (int i = 0; i < N; i++) {
        // input value
        float x = inp[i];
        // cubic term
        float cube = 0.044715f * x * x * x;
        // argument of tanh
        float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
        // tanh of that argument
        float tanh_out = tanhf(tanh_arg);
        // cosh of the same argument, needed for sech
        float coshf_out = coshf(tanh_arg);
        // sech_out holds sech^2(tanh_arg) = 1 / cosh^2(tanh_arg), the derivative of tanh
        float sech_out = 1.0f / (coshf_out * coshf_out);
        // local gradient d GeLU / dx (product rule plus chain rule)
        float local_grad = 0.5f * (1.0f + tanh_out) + x * 0.5f * sech_out * GELU_SCALING_FACTOR * (1.0f + 3.0f * 0.044715f * x * x);
        // accumulate into the input gradient
        dinp[i] += local_grad * dout[i];
    }
}
#pragma float_control(pop)

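The compiler directives (#pragma float_control for MSVC and __attribute__((optimize("no-finite-math-only"))) for GCC) switch the aggressive floating-point optimizations off for this one function. According to the source comments, the expf inside coshf breaks the model under MSVC's /fp:fast, and GCC's -Ofast has the same problem (issue #168). For each element the function computes the local gradient, multiplies it by the gradient dout coming from the following layer, and accumulates the product into dinp, the gradient with respect to the GeLU input.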

1.11 residual_forward

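This code implements the forward pass of a residual connection. In deep networks such as ResNet and the Transformer, a residual connection adds a layer's input to its output, which eases training by mitigating vanishing gradients. The function is annotated below:

void residual_forward(float* out, float* inp1, float* inp2, int N) {
    // visit every element
    for (int i = 0; i < N; i++) {
        // add the two inputs and store the sum in the output
        out[i] = inp1[i] + inp2[i];
    }
}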

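Here inp1 and inp2 are the two input arrays, typically the input of a sub-layer and the output that the sub-layer computed from it. out receives their elementwise sum, and N is the total number of elements. Because the addition carries the input forward unchanged, the sub-layer only has to learn a correction on top of the identity mapping, which helps with the vanishing-gradient problem in deeper networks.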

1.12 residual_backward

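This code implements the backward pass of a residual connection. Because the forward pass is a plain addition, the backward pass is equally simple. The function is annotated below:

void residual_backward(float* dinp1, float* dinp2, float* dout, int N) {
    // visit every element
    for (int i = 0; i < N; i++) {
        // pass the output gradient straight through to both input gradients
        dinp1[i] += dout[i];
        dinp2[i] += dout[i];
    }
}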

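Here dout is the gradient array coming back from the following layer, dinp1 and dinp2 are the two input gradient arrays to update, and N is the total number of elements. Since the forward pass only adds the two inputs, the derivative with respect to each input is 1, so dout is simply accumulated into both dinp1 and dinp2.

This gives the gradient a direct path through the network, which helps against vanishing gradients and makes deep networks more practical to train.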
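The += in residual_backward matters because the same tensor usually receives gradient from more than one path. The following toy sketch assumes a made-up block out = x + s * x (the function toy_block_backward and the scale s are hypothetical) and shows the skip path and the branch path both adding into dx.

```c
// Copied from the listing above.
void residual_backward(float* dinp1, float* dinp2, float* dout, int N) {
    for (int i = 0; i < N; i++) {
        dinp1[i] += dout[i];
        dinp2[i] += dout[i];
    }
}

// Hypothetical toy block: out = x + branch, with branch = s * x.
// dx and dbranch must be zero-initialized by the caller.
void toy_block_backward(float* dx, float* dbranch, float* dout, float s, int N) {
    // the addition sends dout to both of its inputs
    residual_backward(dx, dbranch, dout, N);
    // the branch then sends its share back to x, on top of the skip-path gradient
    for (int i = 0; i < N; i++) {
        dx[i] += s * dbranch[i];
    }
}
```

With s = 3 the total derivative of out with respect to x is 1 + 3 = 4, so each dx[i] ends up as 4 * dout[i]. The skip path alone contributes the 1.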

1.13 softmax_forward

This code implements the forward pass of softmax. Softmax is commonly used for multi-class outputs: it turns a vector of unnormalized scores (logits) into a probability distribution by exponentiating each score and dividing by the sum of all exponentials, so that the outputs sum to 1. The function is annotated below:

void softmax_forward(float* probs, float* logits, int B, int T, int V) {
    // output: probs are (B,T,V) of the probabilities (sums to 1.0 in each b,t position)
    // input: logits is (B,T,V) of the unnormalized log probabilities
    #pragma omp parallel for collapse(2)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // probs <- softmax(logits)
            // locate the logits of this batch index and time step
            float* logits_bt = logits + b * T * V + t * V;
            // locate the matching output probabilities
            float* probs_bt = probs + b * T * V + t * V;

            // maxval is only calculated and subtracted for numerical stability
            // start from a very small value
            float maxval = -10000.0f; // TODO something better
            for (int i = 0; i < V; i++) {
                if (logits_bt[i] > maxval) {
                    // track the largest logit
                    maxval = logits_bt[i];
                }
            }
            float sum = 0.0f;
            for (int i = 0; i < V; i++) {
                // exponentiate after subtracting the maximum, for numerical stability
                probs_bt[i] = expf(logits_bt[i] - maxval);
                // accumulate the sum of the exponentials
                sum += probs_bt[i];
            }
            for (int i = 0; i < V; i++) {
                // normalize so that the probabilities sum to 1
                probs_bt[i] /= sum;
            }
        }
    }
}

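Subtracting the maximum logit from every element is a standard numerical-stability trick: the result is mathematically unchanged, but the largest exponent becomes 0, so expf cannot overflow. The OpenMP directive (#pragma omp parallel for collapse(2)) distributes the (b, t) iterations over the available CPU cores. Each row of V logits (all class scores at one position) is thereby turned into a probability distribution for the later training or prediction steps.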

1.14 crossentropy_forward

This code implements the forward pass of the cross-entropy loss, a standard loss function for multi-class problems. It measures how far the predicted probability distribution is from the true label. The function is annotated below:

void crossentropy_forward(float* losses,
                          float* probs, int* targets,
                          int B, int T, int V) {
    // output: losses is (B,T) of the individual losses at each position
    // input: probs are (B,T,V) of the probabilities
    // input: targets is (B,T) of integers giving the correct index in logits
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // loss = -log(probs[target])
            // locate the probabilities of this batch index and time step
            float* probs_bt = probs + b * T * V + t * V;
            // index of the true label
            int ix = targets[b * T + t];
            // compute and store the loss
            losses[b * T + t] = -logf(probs_bt[ix]);
        }
    }
}

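Here probs holds the predicted probability of every class, targets holds the index of the true class at each position, and losses receives the loss of each position. The cross-entropy loss of a position is the negative logarithm of the probability assigned to the true class. If that probability is high (close to 1), the loss is small. If it is low (close to 0), the loss is large.

Cross-entropy is a common training objective for classifiers because it acts directly on the predicted distribution: minimizing it pushes the probability of the correct class up.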

1.15 crossentropy_softmax_backward

This code implements the combined backward pass of softmax and the cross-entropy loss. The pairing is very common in classification models: the output layer turns logits into a probability distribution with softmax, and cross-entropy then measures the gap between that distribution and the true label. The function is annotated below:

void crossentropy_softmax_backward(float* dlogits,
                           float* dlosses, float* probs, int* targets,
                           int B, int T, int V) {
    // backwards through both softmax and crossentropy
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // locate the logit gradients of this batch index and time step
            float* dlogits_bt = dlogits + b * T * V + t * V;
            // locate the probabilities of this batch index and time step
            float* probs_bt = probs + b * T * V + t * V;
            // gradient flowing into the loss of this position
            float dloss = dlosses[b * T + t];
            // index of the true class
            int ix = targets[b * T + t];

            for (int i = 0; i < V; i++) {
                // predicted probability of class i
                float p = probs_bt[i];
                // indicator: 1 if i is the true class, otherwise 0
                float indicator = i == ix ? 1.0f : 0.0f;
                // accumulate the gradient of logit i
                dlogits_bt[i] += (p - indicator) * dloss;
            }
        }
    }
}

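Here dlogits accumulates the gradient of the loss with respect to each logit, dlosses holds the gradient of the final loss with respect to each per-position loss, probs holds the predicted probability distributions, and targets holds the index of the true class at each position. The goal is dlogits, which is then propagated further back through the network to obtain the parameter gradients.

For each class i, the logit gradient is the predicted probability p minus the indicator (1 if and only if i is the true class, 0 otherwise), multiplied by the loss gradient dloss of that position. This compact expression is the combined gradient of softmax followed by cross-entropy.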

1.16 ParameterTensors

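This code defines ParameterTensors, the struct that holds the GPT-2 parameters. GPT-2 is a Transformer-based pretrained language model. The struct has one pointer for each of the model's 16 weight and bias tensors. Each member is described briefly below:

#define NUM_PARAMETER_TENSORS 16
typedef struct {
    // wte: token embedding weights, shape (V, C),
    // where V is the vocabulary size and C is the embedding dimension.
    float* wte; // (V, C)
    // wpe: position embedding weights, shape (maxT, C),
    // where maxT is the maximum sequence length the model can handle.
    float* wpe; // (maxT, C)
    // ln1w, ln1b: weight and bias of the first LayerNorm in each block, each of shape (L, C).
    // L is the number of Transformer layers.
    float* ln1w; // (L, C)
    float* ln1b; // (L, C)
    // qkvw, qkvb: weight and bias of the projection that produces query, key and value
    // for self-attention, shapes (L, 3C, C) and (L, 3C).
    // Each layer outputs 3C channels: C for the query, C for the key and C for the value.
    float* qkvw; // (L, 3*C, C)
    float* qkvb; // (L, 3*C)
    // attprojw, attprojb: weight and bias of the attention output projection, shapes (L, C, C) and (L, C).
    float* attprojw; // (L, C, C)
    float* attprojb; // (L, C)
    // ln2w, ln2b: weight and bias of the second LayerNorm in each block, each of shape (L, C).
    float* ln2w; // (L, C)
    float* ln2b; // (L, C)
    // fcw, fcb: weight and bias of the first feed-forward layer, shapes (L, 4C, C) and (L, 4C).
    // The feed-forward network expands to an inner dimension of 4*C.
    float* fcw; // (L, 4*C, C)
    float* fcb; // (L, 4*C)
    // fcprojw, fcprojb: weight and bias of the feed-forward output projection, shapes (L, C, 4*C) and (L, C).
    float* fcprojw; // (L, C, 4*C)
    float* fcprojb; // (L, C)
    // lnfw, lnfb: weight and bias of the final LayerNorm, each of shape (C).
    float* lnfw; // (C)
    float* lnfb; // (C)
} ParameterTensors;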

This struct is central to building and running the model: it gives every parameter tensor a named pointer. The values behind these pointers are updated throughout training.

1.17 malloc_and_point_parameters

This code allocates the memory for the GPT-2 parameters and points each parameter tensor at its place inside that memory. All parameters are stored back to back in a single block, and the pointers in ParameterTensors refer to the position of each tensor within the block. This keeps memory use compact and parameter management simple. The function is annotated below:

// allocate memory for the parameters and point the individual tensors to the right places
float* malloc_and_point_parameters(ParameterTensors* params, size_t* param_sizes) {
    size_t num_parameters = 0;
    // sum the sizes of all parameter tensors
    for (size_t i = 0; i < NUM_PARAMETER_TENSORS; i++) {
        num_parameters += param_sizes[i];
    }
    // malloc all parameters all at once
    float* params_memory = (float*)malloc(num_parameters * sizeof(float));
    // assign all the tensors
    // (the order of ptrs must match the order of param_sizes)
    float** ptrs[] = {
        &params->wte, &params->wpe, &params->ln1w, &params->ln1b, &params->qkvw, &params->qkvb,
        &params->attprojw, &params->attprojb, &params->ln2w, &params->ln2b, &params->fcw, &params->fcb,
        &params->fcprojw, &params->fcprojb, &params->lnfw, &params->lnfb
    };
    // walk through the block with an iterator and hand out one slice per tensor
    float* params_memory_iterator = params_memory;
    for (size_t i = 0; i < NUM_PARAMETER_TENSORS; i++) {
        // point the tensor at the current position
        *(ptrs[i]) = params_memory_iterator;
        // advance the iterator to the start of the next tensor
        params_memory_iterator += param_sizes[i];
    }
    // return the block pointer so that it can be freed later
    return params_memory;
}

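The key point is that all tensors in ParameterTensors are served by a single allocation instead of one malloc per tensor, which reduces fragmentation and simplifies memory management. The param_sizes array gives the number of float elements (not bytes) in each tensor, and the pointer array ptrs is used to aim each struct member at the right offset inside the block. The function returns the pointer to the block, so the whole parameter set can later be released with one free.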
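The same pattern can be tried in miniature. The sketch below uses a made-up three-tensor struct (ToyTensors, NUM_TOY_TENSORS and malloc_and_point_toy are hypothetical names) and adds a NULL check on malloc, which the listing above does not have.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_TOY_TENSORS 3

// Hypothetical stand-in for ParameterTensors with only three members.
typedef struct {
    float* a;
    float* b;
    float* c;
} ToyTensors;

// One allocation for all tensors. sizes[i] is an element count, not a byte count.
float* malloc_and_point_toy(ToyTensors* t, const size_t* sizes) {
    size_t total = 0;
    for (size_t i = 0; i < NUM_TOY_TENSORS; i++) {
        total += sizes[i];
    }
    float* memory = (float*)malloc(total * sizeof(float));
    if (memory == NULL) {
        return NULL;  // added for this sketch; the caller must check
    }
    float** ptrs[] = { &t->a, &t->b, &t->c };
    float* iterator = memory;
    for (size_t i = 0; i < NUM_TOY_TENSORS; i++) {
        *(ptrs[i]) = iterator;  // tensor i starts here
        iterator += sizes[i];   // skip over its elements
    }
    return memory;
}
```

With sizes {4, 2, 6}, a points at the start of the block, b at offset 4 and c at offset 6, and a single free on the returned pointer releases all three.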

1.18 ActivationTensors

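The ActivationTensors struct defines the activation tensors used in GPT-2 and similar Transformer models. They store the intermediate results of the forward pass, such as the embeddings, the LayerNorm outputs and the self-attention outputs. Each member is described briefly below:

#define NUM_ACTIVATION_TENSORS 23
typedef struct {
    // encoded: token plus position embeddings, shape (B, T, C); B is the batch size, T the sequence length, C the hidden size.
    float* encoded; // (B, T, C)
    // ln1, ln1_mean, ln1_rstd: output of the first LayerNorm in each block, with its mean and reciprocal standard deviation.
    float* ln1; // (L, B, T, C)
    float* ln1_mean; // (L, B, T)
    float* ln1_rstd; // (L, B, T)
    // qkv: packed query, key and value tensor used by self-attention.
    float* qkv; // (L, B, T, 3*C)
    // atty: output of self-attention, before the output projection.
    float* atty; // (L, B, T, C)
    // preatt, att: attention scores before softmax and attention weights after softmax.
    float* preatt; // (L, B, NH, T, T)
    float* att; // (L, B, NH, T, T)
    // attproj: self-attention output after the projection.
    float* attproj; // (L, B, T, C)
    // residual2: residual stream after the attention sub-block (block input + attproj).
    float* residual2; // (L, B, T, C)
    // ln2, ln2_mean, ln2_rstd: output of the second LayerNorm in each block, with its statistics.
    float* ln2; // (L, B, T, C)
    float* ln2_mean; // (L, B, T)
    float* ln2_rstd; // (L, B, T)
    // fch, fch_gelu: output of the first feed-forward layer, before and after the GeLU activation.
    float* fch; // (L, B, T, 4*C)
    float* fch_gelu; // (L, B, T, 4*C)
    // fcproj: output of the feed-forward projection back to C channels.
    float* fcproj; // (L, B, T, C)
    // residual3: residual stream after the feed-forward sub-block (residual2 + fcproj).
    float* residual3; // (L, B, T, C)
    // lnf, lnf_mean, lnf_rstd: output of the final LayerNorm, with its statistics.
    float* lnf; // (B, T, C)
    float* lnf_mean; // (B, T)
    float* lnf_rstd; // (B, T)
    // logits: final model output, the unnormalized scores.
    float* logits; // (B, T, V)
    // probs: probability distribution obtained from the logits with softmax.
    float* probs; // (B, T, V)
    // losses: loss at each (b, t) position.
    float* losses; // (B, T)
} ActivationTensors;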

This struct captures the intermediate state of the model from input to output. The forward pass fills it in, and the backward pass reads it back.

1.19 malloc_and_point_activations

This code allocates the memory for the activation tensors and points each of them at its place inside that memory. As with the parameters, all activations live in one contiguous block, which keeps their management simple. The function is annotated below:

float* malloc_and_point_activations(ActivationTensors* acts, size_t* act_sizes) {
    size_t num_activations = 0;
    // sum the sizes of all activation tensors
    for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++) {
        num_activations += act_sizes[i];
    }
    // allocate one contiguous block for all activation tensors
    float* acts_memory = (float*)malloc(num_activations * sizeof(float));
    // the order of ptrs must match the order of act_sizes
    float** ptrs[] = {
        &acts->encoded, &acts->ln1, &acts->ln1_mean, &acts->ln1_rstd, &acts->qkv, &acts->atty,
        &acts->preatt, &acts->att, &acts->attproj, &acts->residual2, &acts->ln2, &acts->ln2_mean,
        &acts->ln2_rstd, &acts->fch, &acts->fch_gelu, &acts->fcproj, &acts->residual3, &acts->lnf,
        &acts->lnf_mean, &acts->lnf_rstd, &acts->logits, &acts->probs, &acts->losses
    };
    // walk through the block with an iterator and hand out one slice per tensor
    float* acts_memory_iterator = acts_memory;
    for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++) {
        // point the tensor at the current position
        *(ptrs[i]) = acts_memory_iterator;
        // advance the iterator to the start of the next tensor
        acts_memory_iterator += act_sizes[i];
    }
    // return the block pointer so that it can be freed later
    return acts_memory;
}

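All tensors in ActivationTensors are thus served by a single allocation instead of one malloc per tensor, which reduces fragmentation and simplifies memory management. The act_sizes array gives the number of float elements in each activation tensor, and the pointer array ptrs is used to aim each struct member at the right offset inside the block. The function returns the pointer to the block, so all activations can later be released with one free.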

1.20 GPT2Config

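The GPT2Config struct holds the configuration of the GPT-2 model. These are the basic settings needed to build and train the model, and they determine its size and capacity. Each member is described briefly below:

typedef struct {
    // max_seq_len: the longest input the model can process.
    // For example, 1024 means at most 1024 tokens per sequence.
    int max_seq_len; // max sequence length, e.g. 1024
    // vocab_size: the number of distinct tokens the model knows.
    // For example, 50257 means the vocabulary has 50257 different tokens.
    int vocab_size; // vocab size, e.g. 50257
    // num_layers: the number of Transformer layers.
    // For example, 12 means the model is a stack of 12 Transformer layers.
    int num_layers; // number of layers, e.g. 12
    // num_heads: the number of heads in self-attention.
    // For example, 12 means each layer's self-attention has 12 heads.
    int num_heads; // number of heads in attention, e.g. 12
    // channels: the number of channels, i.e. the hidden dimension.
    // For example, 768 means each Transformer layer outputs 768-dimensional vectors.
    int channels; // number of channels, e.g. 768
} GPT2Config;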

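Together these settings define the architecture of the model and determine its expressive power, its parameter count and its computational cost. In practice they are chosen to fit the task and the available compute.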
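To see how the configuration drives the parameter count, the sketch below adds up the tensor shapes listed in ParameterTensors (section 1.16) for a given config. The function count_parameters is a hypothetical helper written for this example, and GPT2Config is repeated only so that the snippet stands alone.

```c
#include <stddef.h>

// Same fields as GPT2Config above, repeated to keep the sketch self-contained.
typedef struct {
    int max_seq_len;
    int vocab_size;
    int num_layers;
    int num_heads;
    int channels;
} GPT2Config;

// Hypothetical helper: total number of floats across all 16 parameter tensors.
size_t count_parameters(GPT2Config cfg) {
    size_t maxT = (size_t)cfg.max_seq_len;
    size_t V = (size_t)cfg.vocab_size;
    size_t L = (size_t)cfg.num_layers;
    size_t C = (size_t)cfg.channels;
    size_t per_layer =
        2 * C                 // ln1w, ln1b
      + 3 * C * C + 3 * C     // qkvw, qkvb
      + C * C + C             // attprojw, attprojb
      + 2 * C                 // ln2w, ln2b
      + 4 * C * C + 4 * C     // fcw, fcb
      + C * 4 * C + C;        // fcprojw, fcprojb
    return V * C              // wte
         + maxT * C           // wpe
         + L * per_layer
         + 2 * C;             // lnfw, lnfb
}
```

With the example values from the struct comments (1024, 50257, 12, 12, 768) the total is 124,439,808 floats. num_heads does not enter the count: the heads only partition the C channels.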

1.21 GPT2

The GPT2 struct describes the model as a whole: configuration, parameters, activations, their gradients and the optimizer state. It gathers everything needed for training and inference in one place. Each member is explained below:

typedef struct {
    // config: model configuration (maximum sequence length, vocabulary size, number of layers, number of attention heads, channels).
    GPT2Config config;
    // the weights (parameters) of the model, and their sizes
    // params: the weights, organized as ParameterTensors
    // (token and position embeddings, self-attention and feed-forward weights, and so on).
    ParameterTensors params;
    // param_sizes: number of elements of each parameter tensor, used for memory allocation.
    size_t param_sizes[NUM_PARAMETER_TENSORS];
    // params_memory: pointer to the contiguous block that stores all parameters.
    float* params_memory;
    // num_parameters: total number of parameters.
    size_t num_parameters;
    // gradients of the weights
    // grads: parameter gradients, same layout as params.
    ParameterTensors grads;
    // grads_memory: pointer to the contiguous block that stores all parameter gradients.
    float* grads_memory;
    // buffers for the AdamW optimizer
    // m_memory, v_memory: first- and second-moment estimates kept by AdamW.
    float* m_memory;
    float* v_memory;
    // the activations of the model, and their sizes
    // acts: activations, organized as ActivationTensors (embeddings, attention outputs, LayerNorm outputs, and so on).
    ActivationTensors acts;
    // act_sizes: number of elements of each activation tensor.
    size_t act_sizes[NUM_ACTIVATION_TENSORS];
    // acts_memory: pointer to the contiguous block that stores all activations.
    float* acts_memory;
    // num_activations: total number of activation values.
    size_t num_activations;
    // gradients of the activations
    // grads_acts: activation gradients, same layout as acts.
    ActivationTensors grads_acts;
    // grads_acts_memory: pointer to the contiguous block that stores all activation gradients.
    float* grads_acts_memory;
    // other run state configuration
    int batch_size; // the batch size (B) of current forward pass
    int seq_len; // the sequence length (T) of current forward pass
    int* inputs; // the input tokens for the current forward pass
    int* targets; // the target tokens for the current forward pass
    float mean_loss; // after a forward pass with targets, will be populated with the mean loss
} GPT2;

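This struct is the framework on which training and inference are built: parameter management, the forward and backward passes and the parameter update all operate on it.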

1.22 gpt2_build_from_checkpoint

这段代码实现了从检查点文件读取GPT-2模型的功能。检查点文件通常用于保存训练过程中的模型状态，包括模型的参数和配置，以便于后续的恢复训练或推理使用。以下是对这个函数的详细中文注释：

void gpt2_build_from_checkpoint(GPT2 *model, char* checkpoint_path) {

    // read in model from a checkpoint file
    // 从检查点文件中读取模型
    FILE *model_file = fopen(checkpoint_path, "rb");
    if (model_file == NULL) { printf("Error opening model file\n"); exit(1); }
    int model_header[256];
    fread(model_header, sizeof(int), 256, model_file);
    // 检查文件的魔数和版本，确保文件格式正确
    if (model_header[0] != 20240326) { printf("Bad magic model file"); exit(1); }
    if (model_header[1] != 1) { printf("Bad version in model file"); exit(1); }

    // read in hyperparameters
   // 读取超参数
    int maxT, V, L, NH, C;
    model->config.max_seq_len = maxT = model_header[2];
    model->config.vocab_size = V = model_header[3];
    model->config.num_layers = L = model_header[4];
    model->config.num_heads = NH = model_header[5];
    model->config.channels = C = model_header[6];
    printf("[GPT-2]\n");
    printf("max_seq_len: %d\n", maxT);
    printf("vocab_size: %d\n", V);
    printf("num_layers: %d\n", L);
    printf("num_heads: %d\n", NH);
    printf("channels: %d\n", C);

    // allocate space for all the parameters and read them in
    // record the element count of each of the 16 parameter tensors; the memory itself is allocated further down
    model->param_sizes[0] = V * C; // wte
    model->param_sizes[1] = maxT * C; // wpe
    model->param_sizes[2] = L * C; // ln1w
    model->param_sizes[3] = L * C; // ln1b
    model->param_sizes[4] = L * (3 * C) * C; // qkvw
    model->param_sizes[5] = L * (3 * C); // qkvb
    model->param_sizes[6] = L * C * C; // attprojw
    model->param_sizes[7] = L * C; // attprojb
    model->param_sizes[8] = L * C; // ln2w
    model->param_sizes[9] = L * C; // ln2b
    model->param_sizes[10] = L * (4 * C) * C; // fcw
    model->param_sizes[11] = L * (4 * C); // fcb
    model->param_sizes[12] = L * C * (4 * C); // fcprojw
    model->param_sizes[13] = L * C; // fcprojb
    model->param_sizes[14] = C; // lnfw
    model->param_sizes[15] = C; // lnfb

    // count the number of parameters
    // the total is needed for the single fread below
    size_t num_parameters = 0;
    for (size_t i = 0; i < NUM_PARAMETER_TENSORS; i++) {
        num_parameters += model->param_sizes[i];
    }
    printf("num_parameters: %zu\n", num_parameters);
    model->num_parameters = num_parameters;

    // read in all the parameters from file
    // allocate one contiguous block for all tensors, then fill it with a single read
    model->params_memory = malloc_and_point_parameters(&model->params, model->param_sizes);
    fread(model->params_memory, sizeof(float), num_parameters, model_file);
    fclose(model_file);

    // other inits
    // none of the buffers below exist yet, so they start out as NULL or 0
    model->acts_memory = NULL;
    model->grads_memory = NULL;
    model->m_memory = NULL;
    model->v_memory = NULL;
    model->grads_acts_memory = NULL;
    model->inputs = NULL;
    model->targets = NULL;
    model->batch_size = 0;
    model->seq_len = 0;
    // -1.0f marks that no loss has been computed yet
    model->mean_loss = -1.0f; // -1.0f will designate no loss
}

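In this way gpt2_build_from_checkpoint restores a GPT-2 model from a previously saved checkpoint: it reads the configuration from the header and the parameters from the rest of the file, which is what continued training or inference needs. Note that only the parameters are allocated here; the activation, gradient, and optimizer buffers are left as NULL and are allocated later, when they are first needed.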
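
A minimal sketch of the two computations in this function: validating the header and counting the parameters. DemoConfig, demo_parse_header, and demo_count_parameters are hypothetical names; the header layout and the size formulas follow the function above. The GPT-2 small hyperparameters mentioned in the usage note (max_seq_len 1024, vocab_size 50257, 12 layers, 12 heads, 768 channels) are an assumption about which checkpoint is being loaded.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for the config fields filled from the header.
typedef struct {
    int max_seq_len, vocab_size, num_layers, num_heads, channels;
} DemoConfig;

// Returns 0 on success, -1 if the magic number or the version do not match.
int demo_parse_header(const int* header, DemoConfig* cfg) {
    assert(header != NULL && cfg != NULL);
    if (header[0] != 20240326) { return -1; } // magic number
    if (header[1] != 1) { return -1; }        // version
    cfg->max_seq_len = header[2];
    cfg->vocab_size  = header[3];
    cfg->num_layers  = header[4];
    cfg->num_heads   = header[5];
    cfg->channels    = header[6];
    return 0;
}

// Total number of float parameters, summing the same 16 sizes as param_sizes.
size_t demo_count_parameters(const DemoConfig* cfg) {
    size_t maxT = (size_t)cfg->max_seq_len;
    size_t V = (size_t)cfg->vocab_size;
    size_t L = (size_t)cfg->num_layers;
    size_t C = (size_t)cfg->channels;
    size_t per_layer = 2 * C              // ln1w, ln1b
                     + 3 * C * C + 3 * C  // qkvw, qkvb
                     + C * C + C          // attprojw, attprojb
                     + 2 * C              // ln2w, ln2b
                     + 4 * C * C + 4 * C  // fcw, fcb
                     + C * 4 * C + C;     // fcprojw, fcprojb
    return V * C        // wte
         + maxT * C     // wpe
         + L * per_layer
         + 2 * C;       // lnfw, lnfb
}
```

With the GPT-2 small hyperparameters assumed above, demo_count_parameters returns 124439808, i.e. about 124M floats or roughly 498 MB in fp32.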

1.23 gpt2_forward

This function implements the GPT-2 forward pass. Given the input tokens it computes the model's output and, if target tokens are supplied, the loss as well. The annotated function follows:

void gpt2_forward(GPT2 *model, int* inputs, int* targets, int B, int T) {
    // targets are optional and could be NULL
    // when targets is NULL, only logits and probs are computed and mean_loss is set to -1.0f

    // ensure the model was initialized or error out
    // params_memory is set by gpt2_build_from_checkpoint, so NULL here means no checkpoint was loaded
    if (model->params_memory == NULL) {
        printf("Error: model was not initialized properly.\n");
        exit(1);
    }

    // convenience parameters
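    // pull the frequently used hyperparameters out of the model config
    // vocabulary size
    int V = model->config.vocab_size;
    // number of layers
    int L = model->config.num_layers;
    // number of attention heads
    int NH = model->config.num_heads;
    // number of channels (hidden size)
    int C = model->config.channels;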

    // validate inputs, all indices must be in the range [0, V)
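    // the ids index rows of wte (inputs) and entries of probs (targets), so an out-of-range id would access memory out of bounds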
    for(int i = 0; i < B * T; i++) {
        assert(0 <= inputs[i] && inputs[i] < V);
        if (targets != NULL) {
            assert(0 <= targets[i] && targets[i] < V);
        }
    }

    // allocate space for all the activations if needed (done here, lazily)
    // the activation buffers are allocated on the first call only, sized for this call's B and T
    if(model->acts_memory == NULL) {
        // record the current B,T as well
        // later calls must use the same B and T (checked in the else branch below)
        model->batch_size = B;
        model->seq_len = T;
        // and now allocate the space
        // each entry is an element count; per-layer tensors carry a leading factor of L
        model->act_sizes[0] = B * T * C; // encoded
        model->act_sizes[1] = L * B * T * C; // ln1
        model->act_sizes[2] = L * B * T;  // ln1_mean
        model->act_sizes[3] = L * B * T;  // ln1_rstd
        model->act_sizes[4] = L * B * T * 3*C; // qkv
        model->act_sizes[5] = L * B * T * C;  // atty
        model->act_sizes[6] = L * B * NH * T * T;  // preatt
        model->act_sizes[7] = L * B * NH * T * T;  // att
        model->act_sizes[8] = L * B * T * C; // attproj
        model->act_sizes[9] = L * B * T * C; // residual2
        model->act_sizes[10] = L * B * T * C; // ln2
        model->act_sizes[11] = L * B * T; // ln2_mean
        model->act_sizes[12] = L * B * T; // ln2_rstd
        model->act_sizes[13] = L * B * T * 4*C; // fch
        model->act_sizes[14] = L * B * T * 4*C; // fch_gelu
        model->act_sizes[15] = L * B * T * C; // fcproj
        model->act_sizes[16] = L * B * T * C; // residual3
        model->act_sizes[17] = B * T * C; // lnf
        model->act_sizes[18] = B * T; // lnf_mean
        model->act_sizes[19] = B * T; // lnf_rstd
        model->act_sizes[20] = B * T * V; // logits
        model->act_sizes[21] = B * T * V; // probs
        model->act_sizes[22] = B * T; // losses
        size_t num_activations = 0;
        for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++) {
            num_activations += model->act_sizes[i];
        }
        printf("num_activations: %zu\n", num_activations);
        model->num_activations = num_activations;
        model->acts_memory = malloc_and_point_activations(&model->acts, model->act_sizes);
        // also create memory for caching inputs and targets
        // the cached copies are read again in the backward pass (e.g. model->targets)
        model->inputs = (int*)malloc(B * T * sizeof(int));
        model->targets = (int*)malloc(B * T * sizeof(int)); // might be unused if we never have targets but it's small
    } else {
        // validate B,T is consistent with how we've allocated the memory before
        // in principle we could get more clever here in the future, for now this is safest
        // a different B or T after allocation is treated as a fatal error
        if (B != model->batch_size || T != model->seq_len) {
            printf("Model: B=%d T=%d, Desired: B=%d T=%d\n", model->batch_size, model->seq_len, B, T);
            exit(EXIT_FAILURE);
        }
    }

    // cache the inputs/targets
    // copy the ids into the model's own buffers, so the caller's arrays need not outlive this call
    memcpy(model->inputs, inputs, B * T * sizeof(int));
    if (targets != NULL) {
        memcpy(model->targets, targets, B * T * sizeof(int));
    }

    // forward pass
    // the struct copies below are shallow: params and acts hold pointers into the model's memory blocks
    ParameterTensors params = model->params; // for brevity
    ActivationTensors acts = model->acts;
    float* residual;
    // encoder_forward turns each input token id into a C-dimensional vector:
    // the token embedding (params.wte) plus the position embedding (params.wpe),
    // which carry what the token is and where it sits in the sequence.
    // The result is stored in acts.encoded.
    encoder_forward(acts.encoded, inputs, params.wte, params.wpe, B, T, C); // encoding goes into residual[0]
    // Loop over the layers, from layer 0 to layer L-1.
    // The first step inside the loop selects this layer's input, residual.
    // For layer 0 it is the encoded input, acts.encoded;
    // for every other layer it is the previous layer's output in acts.residual3.
    // The offset (l-1) * B * T * C skips to the slice written by layer l-1.
    for (int l = 0; l < L; l++) {

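        // residual is this layer's input: acts.encoded for layer 0, otherwise the
        // previous layer's residual3 slice. Each sublayer's output is added back onto
        // its input (a residual connection), which helps gradients flow through deep networks.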
        residual = l == 0 ? acts.encoded : acts.residual3 + (l-1) * B * T * C;

        // get the pointers of the weights for this layer
        // each parameter tensor stores all L layers back to back, so layer l starts at l times the per-layer size
        float* l_ln1w = params.ln1w + l * C;
        float* l_ln1b = params.ln1b + l * C;
        float* l_qkvw = params.qkvw + l * 3*C * C;
        float* l_qkvb = params.qkvb + l * 3*C;
        float* l_attprojw = params.attprojw + l * C * C;
        float* l_attprojb = params.attprojb + l * C;
        float* l_ln2w = params.ln2w + l * C;
        float* l_ln2b = params.ln2b + l * C;
        float* l_fcw = params.fcw + l * 4*C * C;
        float* l_fcb = params.fcb + l * 4*C;
        float* l_fcprojw = params.fcprojw + l * C * 4*C;
        float* l_fcprojb = params.fcprojb + l * C;

        // get the pointers of the activations for this layer
        // the per-layer activation tensors are laid out the same way
        float* l_ln1 = acts.ln1 + l * B * T * C;
        float* l_ln1_mean = acts.ln1_mean + l * B * T;
        float* l_ln1_rstd = acts.ln1_rstd + l * B * T;
        float* l_qkv = acts.qkv + l * B * T * 3*C;
        float* l_atty = acts.atty + l * B * T * C;
        float* l_preatt = acts.preatt + l * B * NH * T * T;
        float* l_att = acts.att + l * B * NH * T * T;
        float* l_attproj = acts.attproj + l * B * T * C;
        float* l_residual2 = acts.residual2 + l * B * T * C;
        float* l_ln2 = acts.ln2 + l * B * T * C;
        float* l_ln2_mean = acts.ln2_mean + l * B * T;
        float* l_ln2_rstd = acts.ln2_rstd + l * B * T;
        float* l_fch = acts.fch + l * B * T * 4*C;
        float* l_fch_gelu = acts.fch_gelu + l * B * T * 4*C;
        float* l_fcproj = acts.fcproj + l * B * T * C;
        float* l_residual3 = acts.residual3 + l * B * T * C;

        // now do the forward pass
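        // Run this layer's forward pass.
        //
        // Layer normalization:
        // normalize the layer input residual over its C channels, which keeps the scale of
        // the activations stable; the mean and rstd are saved for the backward pass.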
        layernorm_forward(l_ln1, l_ln1_mean, l_ln1_rstd, residual, l_ln1w, l_ln1b, B, T, C);
        // QKV projection:
        // multiply l_ln1 by the weights l_qkvw and add the bias l_qkvb, producing
        // the concatenated query, key, and value vectors l_qkv (3*C values per position).
        matmul_forward(l_qkv, l_ln1, l_qkvw, l_qkvb, B, T, C, 3*C);
        // Self-attention:
        // compute the attention-weighted output l_atty from l_qkv, using NH heads.
        // GPT-2 attention is causal: each position attends only to itself and earlier positions.
        attention_forward(l_atty, l_preatt, l_att, l_qkv, B, T, C, NH);
        // Attention output projection:
        // pass l_atty through a linear layer (weights l_attprojw, bias l_attprojb)
        // to get the attention sublayer's output l_attproj.
        matmul_forward(l_attproj, l_atty, l_attprojw, l_attprojb, B, T, C, C);
        // First residual connection:
        // add the attention output l_attproj to the layer input residual,
        // giving l_residual2.
        // Residual connections ease the training of deep networks by giving gradients a direct path.
        residual_forward(l_residual2, residual, l_attproj, B*T*C);
        // Second layer normalization:
        // normalize l_residual2 to get l_ln2, the input to the feed-forward sublayer.
        layernorm_forward(l_ln2, l_ln2_mean, l_ln2_rstd, l_residual2, l_ln2w, l_ln2b, B, T, C);
        // Feed-forward network, first linear layer:
        // expand l_ln2 from C to 4*C channels, giving l_fch (the GELU is applied in the next step).
        matmul_forward(l_fch, l_ln2, l_fcw, l_fcb, B, T, C, 4*C);
        // GELU activation:
        // apply GELU elementwise to l_fch to get l_fch_gelu.
        // This is the nonlinearity of the feed-forward sublayer.
        gelu_forward(l_fch_gelu, l_fch, B*T*4*C);
        // Feed-forward network, second linear layer:
        // project l_fch_gelu back from 4*C to C channels (weights l_fcprojw, bias l_fcprojb),
        // giving l_fcproj.
        matmul_forward(l_fcproj, l_fch_gelu, l_fcprojw, l_fcprojb, B, T, 4*C, C);
        // Second residual connection:
        // add l_fcproj to l_residual2,
        // giving l_residual3, this layer's output. That completes one Transformer layer.
        residual_forward(l_residual3, l_residual2, l_fcproj, B*T*C);
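        // The steps above are repeated for every Transformer layer;
        // each layer's output l_residual3 is the next layer's input.
        // Stacking layers this way lets the model capture increasingly complex dependencies in the sequence.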
    }
    // The final residual stream is in residual3.
    // The expression acts.residual3 + (L-1) * B * T * C
    // selects the output of the last Transformer layer, which feeds the final layer normalization.
    // The index is L-1 because layers are numbered from 0.
    residual = acts.residual3 + (L-1) * B * T * C; // last residual is in residual3
    // Final layer normalization:
    // layernorm_forward normalizes the last layer's output into acts.lnf.
    layernorm_forward(acts.lnf, acts.lnf_mean, acts.lnf_rstd, residual, params.lnfw, params.lnfb, B, T, C);
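    // Output projection:
    // matmul_forward multiplies acts.lnf by the token embedding matrix params.wte,
    // with no bias (NULL); the output layer shares its weights with the input embedding.
    // The result is acts.logits: for every position, one unnormalized score per vocabulary entry.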
    matmul_forward(acts.logits, acts.lnf, params.wte, NULL, B, T, C, V);
    // Softmax:
    // softmax_forward turns acts.logits into the probability distribution acts.probs;
    // at each position the V probabilities sum to 1.
    softmax_forward(acts.probs, acts.logits, B, T, V);

    // also forward the cross-entropy loss function if we have the targets
    // with targets we can also compute the loss at every position
    if (targets != NULL) {
        // crossentropy_forward computes the cross-entropy between the predicted probabilities
        // (model->acts.probs) and the target tokens: one loss value per (b,t) position, stored in model->acts.losses.
        crossentropy_forward(model->acts.losses, model->acts.probs, targets, B, T, V);
        // for convenience also evaluate the mean loss
        // the mean over all positions summarizes the batch in one number
        float mean_loss = 0.0f;
        // Sum the per-position losses; dividing by the number of positions B*T below
        // gives the average loss per token for this batch.
        for (int i=0; i<B*T; i++) { mean_loss += model->acts.losses[i]; }
        // Divide by B*T to finish the mean, then store it in the model
        // so that callers can read it after the forward pass.
        mean_loss /= B*T;
        model->mean_loss = mean_loss;
    } else {
        // if we don't have targets, we don't have a loss
        // -1.0f is the sentinel that gpt2_backward checks to reject a backward pass without targets
        model->mean_loss = -1.0f;
    }
}

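This function first checks that the model has been initialized and validates the input and target token ids. On the first call it allocates the activation memory and records the batch size and sequence length; later calls must use the same values. It then runs the forward pass proper: the encoding, each Transformer layer in turn, the final layer normalization, the output projection, and the softmax.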
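
The size of the lazily allocated activation memory can be worked out ahead of time with the same formulas as act_sizes. The following is a sketch under assumptions: demo_count_activations is a hypothetical helper, and the example values in the usage note (B=4, T=64 with the GPT-2 small hyperparameters) are chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

// Total number of activation floats, summing the same 23 sizes as act_sizes.
size_t demo_count_activations(size_t B, size_t T, size_t C,
                              size_t L, size_t NH, size_t V) {
    assert(B > 0 && T > 0);
    size_t BT = B * T;
    size_t per_layer = BT * C              // ln1
                     + 2 * BT              // ln1_mean, ln1_rstd
                     + BT * 3 * C          // qkv
                     + BT * C              // atty
                     + 2 * B * NH * T * T  // preatt, att
                     + BT * C              // attproj
                     + BT * C              // residual2
                     + BT * C              // ln2
                     + 2 * BT              // ln2_mean, ln2_rstd
                     + 2 * BT * 4 * C      // fch, fch_gelu
                     + BT * C              // fcproj
                     + BT * C;             // residual3
    size_t shared = BT * C                 // encoded
                  + BT * C + 2 * BT        // lnf, lnf_mean, lnf_rstd
                  + 2 * BT * V             // logits, probs
                  + BT;                    // losses
    return L * per_layer + shared;
}
```

For B=4, T=64, C=768, L=12, NH=12, V=50257 this gives 73323776 floats, about 293 MB in fp32. The two B*T*V tensors (logits and probs) account for roughly a third of it.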

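The forward pass of GPT-2 is the computation that takes the input data (a token sequence) to the predicted output (a probability distribution over the next token at each position). It activates the model's layers in order, using each layer's parameters. The key steps of the code above are explained again below: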

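Getting the weight pointers: At the start of each layer we take pointers to all of that layer's weights and biases, including the query/key/value weights of self-attention (qkvw, qkvb) and the attention output projection (attprojw, attprojb). The computation can then use them directly instead of recomputing their offsets at every step.
Getting the activation pointers: In the same way we take pointers to the activations that hold this layer's intermediate results, such as the self-attention output (atty) and the layer-normalization outputs. They are written at different stages of the forward pass, consumed by later steps, and kept for the backward pass.
Running the forward pass: With the weights and activations in hand, we compute layer by layer, following the architecture:

Encode the input sequence into embeddings (once, before the first layer).
Apply layer normalization, then self-attention, and add the result to the layer input through a residual connection.
Apply layer normalization again, then the feed-forward network (FFN), and add the result through a second residual connection.
The residual connections and the layer normalization are what keep training stable.

Handling the final residual: Each layer's output is combined with its input through a residual connection, and the result is stored in residual3. The last layer's slice then goes through the final layer normalization and the output linear layer to produce the logits.
Computing the loss: If targets are supplied (the correct next tokens), the model computes the cross-entropy loss between its predictions and the targets. Training then uses this loss: backpropagation computes gradients from it, and the parameter update uses them to improve the predictions.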
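
How the residual stream is threaded through the layers can be shown with a toy version of the loop. Everything here is an illustrative assumption: demo_sublayer_identity stands in for the attention and feed-forward sublayers (it just copies its input), and the demo_ names are hypothetical. Only the pointer arithmetic and the two residual additions per layer mirror the code above.

```c
#include <assert.h>
#include <stddef.h>

// out = inp1 + inp2, elementwise.
void demo_residual_forward(float* out, const float* inp1, const float* inp2, size_t N) {
    for (size_t i = 0; i < N; i++) { out[i] = inp1[i] + inp2[i]; }
}

// Toy stand-in for a sublayer (attention or feed-forward): out = inp.
void demo_sublayer_identity(float* out, const float* inp, size_t N) {
    for (size_t i = 0; i < N; i++) { out[i] = inp[i]; }
}

// Runs L toy layers over N = B*T*C values.
// encoded and scratch hold N floats; residual2 and residual3 hold L*N floats.
// Returns a pointer to the last layer's output, residual3 + (L-1)*N.
float* demo_forward_layers(const float* encoded, float* residual2, float* residual3,
                           float* scratch, int L, size_t N) {
    assert(L >= 1);
    for (int l = 0; l < L; l++) {
        // layer 0 reads the encoding, layer l > 0 reads the output of layer l-1
        const float* residual = (l == 0) ? encoded : residual3 + (size_t)(l - 1) * N;
        float* l_residual2 = residual2 + (size_t)l * N;
        float* l_residual3 = residual3 + (size_t)l * N;
        demo_sublayer_identity(scratch, residual, N);     // stands in for ln1 -> qkv -> attention -> attproj
        demo_residual_forward(l_residual2, residual, scratch, N);
        demo_sublayer_identity(scratch, l_residual2, N);  // stands in for ln2 -> fc -> gelu -> fcproj
        demo_residual_forward(l_residual3, l_residual2, scratch, N);
    }
    return residual3 + (size_t)(L - 1) * N;
}
```

Because the stand-in sublayer is the identity, each residual addition doubles the values, so two layers multiply the input by 16. With real sublayers the structure is the same, but the added term comes from attention or the feed-forward network.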

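The forward pass thus moves through the model one layer at a time, each step building on the output of the one before, and ends with the model's prediction for the input.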
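
The output projection at the end of the forward pass can be sketched as a plain matrix multiply in which the weight matrix is the (V,C) token embedding table and the bias is NULL. demo_matmul_forward is a hypothetical, unoptimized helper written for this example; its argument order follows the matmul_forward calls above, but it flattens B and T into a single row count N.

```c
#include <assert.h>
#include <stddef.h>

// out is (N,OC), inp is (N,C), weight is (OC,C), bias is (OC) or NULL.
// out[n][o] = bias[o] + sum over i of inp[n][i] * weight[o][i]
void demo_matmul_forward(float* out, const float* inp, const float* weight,
                         const float* bias, size_t N, size_t C, size_t OC) {
    assert(out != NULL && inp != NULL && weight != NULL);
    for (size_t n = 0; n < N; n++) {
        for (size_t o = 0; o < OC; o++) {
            float val = (bias != NULL) ? bias[o] : 0.0f;
            for (size_t i = 0; i < C; i++) {
                val += inp[n * C + i] * weight[o * C + i];
            }
            out[n * OC + o] = val;
        }
    }
}
```

Calling it with weight set to wte and OC set to V yields one score per vocabulary entry: row v of wte is both the embedding of token v on the way in and the weight vector that scores token v on the way out.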

1.24 gpt2_zero_grad

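This function resets the model's gradients during training. The gradients from the previous iteration must be cleared before each new one, so that the new iteration accumulates its gradients from zero. The function does the following: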

void gpt2_zero_grad(GPT2 *model) {
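    // Weight gradients (grads_memory):
    // if the weight-gradient memory has been allocated,
    // memset sets every weight gradient to 0.
    // model->num_parameters is the total number of parameters and
    // sizeof(float) is the size in bytes of one gradient value.
    // After this, a new training step starts from all-zero weight gradients.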
    if(model->grads_memory != NULL) { memset(model->grads_memory, 0, model->num_parameters * sizeof(float)); }
    // Activation gradients (grads_acts_memory):
    // likewise, if the activation-gradient memory has been allocated,
    // every activation gradient is set to 0 as well.
    // model->num_activations is the total number of activation values.
    if(model->grads_acts_memory != NULL) { memset(model->grads_acts_memory, 0, model->num_activations * sizeof(float)); }
}

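This is a standard step in deep learning training. The backward functions add into the gradient buffers rather than overwrite them, so without this reset the gradients of the previous iteration would leak into the current one.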
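
Why the reset matters can be seen in a toy example. The sketch below assumes a backward function that accumulates with +=; demo_backward and demo_zero_grad are hypothetical names, and the "model" is just an elementwise product y = w * x.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Toy backward step for y[i] = w[i] * x[i]: dL/dw[i] = x[i] * dy[i].
// The result is added into dw, not assigned.
void demo_backward(float* dw, const float* x, const float* dy, size_t N) {
    assert(dw != NULL && x != NULL && dy != NULL);
    for (size_t i = 0; i < N; i++) { dw[i] += x[i] * dy[i]; }
}

// Reset a gradient buffer to all zeros, if it has been allocated.
void demo_zero_grad(float* grads, size_t n) {
    if (grads != NULL) { memset(grads, 0, n * sizeof(float)); }
}
```

Calling demo_backward twice without a reset in between leaves twice the gradient in dw; calling demo_zero_grad between the two calls restores the single-step value.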

1.25 gpt2_backward

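This code is the backward pass of GPT-2. Its key steps are:

Checking that the forward pass had targets: model->mean_loss is compared with -1.0f to confirm that a forward pass with target tokens has been run. If not, the program prints an error and exits.
Lazily allocating gradient memory: If the gradient memory has not been allocated yet (model->grads_memory is NULL), both the weight and the activation gradient buffers are allocated and then zeroed with gpt2_zero_grad(model).
Defining convenience variables: The batch size (B), sequence length (T), vocabulary size (V), number of layers (L), number of heads (NH), and number of channels (C) get short names to keep the code readable.
Seeding the gradient: grads_acts.losses is filled with 1.0f/(B*T), which starts the chain rule.
Backpropagating through cross-entropy and softmax: The fused cross-entropy/softmax backward runs first and writes grads_acts.logits; the output projection and the final layer normalization follow.
Backpropagating layer by layer: Starting from the last layer and going backwards, each layer does the following:

residual_backward handles the residual connections.
matmul_backward, gelu_backward, layernorm_backward and related functions backpropagate through the layer's linear transforms, GELU nonlinearity, and layer normalizations.
Along the way the gradients of the layer's weights and activations are accumulated.

Accumulating weight and activation gradients: Every step of the backward pass adds to the gradients of the parameters and activations it touches.
Backpropagating through the encoder: Finally the encoder's backward pass computes the gradients of the token and position embeddings.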
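
The first two steps of the chain, seeding the gradient and going back through softmax plus cross-entropy, can be sketched as follows. The demo_ names are hypothetical. The formula dlogits = (probs - onehot(target)) * dloss is the standard gradient of cross-entropy composed with softmax, and accumulating with += is an assumption meant to match the convention of the backward functions used here.

```c
#include <assert.h>
#include <stddef.h>

// Seed the chain rule: the final loss is the mean of N = B*T per-position
// losses, so its derivative with respect to each of them is 1/N.
void demo_fill_dlosses(float* dlosses, int B, int T) {
    assert(B > 0 && T > 0);
    float dloss_mean = 1.0f / (float)(B * T);
    for (int i = 0; i < B * T; i++) { dlosses[i] = dloss_mean; }
}

// Backward through softmax and cross-entropy in one step.
// probs is (N,V), targets is (N), dlogits is (N,V) and is accumulated into.
void demo_crossentropy_softmax_backward(float* dlogits, const float* dlosses,
                                        const float* probs, const int* targets,
                                        int N, int V) {
    for (int n = 0; n < N; n++) {
        assert(0 <= targets[n] && targets[n] < V);
        for (int i = 0; i < V; i++) {
            float indicator = (i == targets[n]) ? 1.0f : 0.0f;
            dlogits[(size_t)n * V + i] += (probs[(size_t)n * V + i] - indicator) * dlosses[n];
        }
    }
}
```

Each row of dlogits sums to zero: the target entry is pushed up by exactly as much as the other entries together are pushed down, scaled by 1/(B*T).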

void gpt2_backward(GPT2 *model) {

    // double check we forwarded previously, with targets
    // mean_loss is -1.0f unless the last forward pass was given targets
    if (model->mean_loss == -1.0f) {
        printf("Error: must forward with targets before backward\n");
        exit(1);
    }

    // lazily allocate the memory for gradients of the weights and activations, if needed
    // the gradient buffers are allocated on the first backward call, with the same tensor sizes as the parameters and activations
    if (model->grads_memory == NULL) {
        model->grads_memory = malloc_and_point_parameters(&model->grads, model->param_sizes);
        model->grads_acts_memory = malloc_and_point_activations(&model->grads_acts, model->act_sizes);
        // start from all-zero gradients
        gpt2_zero_grad(model);
    }

    // convenience shortcuts
    // short names for the sizes; B and T are the values recorded by the forward pass
    int B = model->batch_size;
    int T = model->seq_len;
    int V = model->config.vocab_size;
    int L = model->config.num_layers;
    int NH = model->config.num_heads;
    int C = model->config.channels;

    // backward pass: go in the reverse order of the forward pass, and call backward() functions
    // as in the forward pass, the structs below are shallow copies holding pointers into the model's memory
    ParameterTensors params = model->params; // for brevity
    ParameterTensors grads = model->grads;
    ActivationTensors acts = model->acts;
    ActivationTensors grads_acts = model->grads_acts;

    // we kick off the chain rule by filling in dlosses with 1.0f/(B*T)
    // technically this is a small, inline backward() pass of calculating
    // total, final loss as the mean over all losses over all (B,T) positions in the batch
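    // The chain rule is started by filling dlosses with 1.0f/(B*T).
    // Strictly speaking this is a small inline backward pass of the final step that
    // takes the mean of the losses over all (B,T) positions in the batch:
    // the derivative of that mean with respect to each individual loss is 1/(B*T).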
    float dloss_mean = 1.0f / (B*T);
    for (int i = 0; i < B*T; i++) { grads_acts.losses[i] = dloss_mean; }

    // backward through the cross-entropy and softmax (fused), then through the output projection that shares wte
    crossentropy_softmax_backward(grads_acts.logits, grads_acts.losses, acts.probs, model->targets, B, T, V);
    matmul_backward(grads_acts.lnf, grads.wte, NULL, grads_acts.logits, acts.lnf, params.wte, B, T, C, V);
    // the last layer's residual output, i.e. the input of the final layer norm
    float* residual = acts.residual3 + (L-1) * B * T * C; // last layer's residual
    // its gradient, which the final layer norm's backward pass writes into
    float* dresidual = grads_acts.residual3 + (L-1) * B * T * C; // write to last layer's residual

    // backward through the final layer normalization
    layernorm_backward(dresidual, grads.lnfw, grads.lnfb, grads_acts.lnf, residual, params.lnfw, acts.lnf_mean, acts.lnf_rstd, B, T, C);

    // walk the layers in reverse order, from L-1 down to 0
    for (int l = L-1; l >= 0; l--) {

        residual = l == 0 ? acts.encoded : acts.residual3 + (l-1) * B * T * C;
        dresidual = l == 0 ? grads_acts.encoded : grads_acts.residual3 + (l-1) * B * T * C;

        // get the pointers of the weights for this layer
        // only the weight tensors are fetched here; the bias values are not needed by the backward functions
        float* l_ln1w = params.ln1w + l * C;
        float* l_qkvw = params.qkvw + l * 3*C * C;
        float* l_attprojw = params.attprojw + l * C * C;
        float* l_ln2w = params.ln2w + l * C;
        float* l_fcw = params.fcw + l * 4*C * C;
        float* l_fcprojw = params.fcprojw + l * C * 4*C;
        // get the pointers of the gradients of the weights for this layer
        // every weight and bias receives a gradient, so all twelve per-layer tensors appear here
        float* dl_ln1w = grads.ln1w + l * C;
        float* dl_ln1b = grads.ln1b + l * C;
        float* dl_qkvw = grads.qkvw + l * 3*C * C;
        float* dl_qkvb = grads.qkvb + l * 3*C;
        float* dl_attprojw = grads.attprojw + l * C * C;
        float* dl_attprojb = grads.attprojb + l * C;
        float* dl_ln2w = grads.ln2w + l * C;
        float* dl_ln2b = grads.ln2b + l * C;
        float* dl_fcw = grads.fcw + l * 4*C * C;
        float* dl_fcb = grads.fcb + l * 4*C;
        float* dl_fcprojw = grads.fcprojw + l * C * 4*C;
        float* dl_fcprojb = grads.fcprojb + l * C;
        // get the pointers of the activations for this layer
        // forward activations saved earlier, reused as inputs to the backward functions
        float* l_ln1 = acts.ln1 + l * B * T * C;
        float* l_ln1_mean = acts.ln1_mean + l * B * T;
        float* l_ln1_rstd = acts.ln1_rstd + l * B * T;
        float* l_qkv = acts.qkv + l * B * T * 3*C;
        float* l_atty = acts.atty + l * B * T * C;
        float* l_att = acts.att + l * B * NH * T * T;
        float* l_residual2 = acts.residual2 + l * B * T * C;
        float* l_ln2 = acts.ln2 + l * B * T * C;
        float* l_ln2_mean = acts.ln2_mean + l * B * T;
        float* l_ln2_rstd = acts.ln2_rstd + l * B * T;
        float* l_fch = acts.fch + l * B * T * 4*C;
        float* l_fch_gelu = acts.fch_gelu + l * B * T * 4*C;
        // get the pointers of the gradients of the activations for this layer
        float* dl_ln1 = grads_acts.ln1 + l * B * T * C;
        float* dl_qkv = grads_acts.qkv + l * B * T * 3*C;
        float* dl_atty = grads_acts.atty + l * B * T * C;
        float* dl_preatt = grads_acts.preatt + l * B * NH * T * T;
        float* dl_att = grads_acts.att + l * B * NH * T * T;
        float* dl_attproj = grads_acts.attproj + l * B * T * C;
        float* dl_residual2 = grads_acts.residual2 + l * B * T * C;
        float* dl_ln2 = grads_acts.ln2 + l * B * T * C;
        float* dl_fch = grads_acts.fch + l * B * T * 4*C;
        float* dl_fch_gelu = grads_acts.fch_gelu + l * B * T * 4*C;
        float* dl_fcproj = grads_acts.fcproj + l * B * T * C;
        float* dl_residual3 = grads_acts.residual3 + l * B * T * C;

        // backprop this layer, in the reverse order of the forward pass
        // second residual connection: residual3 = residual2 + fcproj
        residual_backward(dl_residual2, dl_fcproj, dl_residual3, B*T*C);
        // MLP output projection (the matmul applied after GELU)
        matmul_backward(dl_fch_gelu, dl_fcprojw, dl_fcprojb, dl_fcproj, l_fch_gelu, l_fcprojw, B, T, 4*C, C);
        // GELU activation
        gelu_backward(dl_fch, l_fch, dl_fch_gelu, B*T*4*C);
        // MLP input projection (the matmul that feeds GELU)
        matmul_backward(dl_ln2, dl_fcw, dl_fcb, dl_fch, l_ln2, l_fcw, B, T, C, 4*C);
        // second layernorm (ln2), which normalizes residual2 before the MLP
        layernorm_backward(dl_residual2, dl_ln2w, dl_ln2b, dl_ln2, l_residual2, l_ln2w, l_ln2_mean, l_ln2_rstd, B, T, C);
        // first residual connection: residual2 = residual + attproj
        residual_backward(dresidual, dl_attproj, dl_residual2, B*T*C);
        // attention output projection
        matmul_backward(dl_atty, dl_attprojw, dl_attprojb, dl_attproj, l_atty, l_attprojw, B, T, C, C);
        // the attention mechanism itself
        attention_backward(dl_qkv, dl_preatt, dl_att, dl_atty, l_qkv, l_att, B, T, C, NH);
        // QKV projection, whose input is the output of the first layernorm (ln1)
        matmul_backward(dl_ln1, dl_qkvw, dl_qkvb, dl_qkv, l_ln1, l_qkvw, B, T, C, 3*C);
        // first layernorm (ln1), which normalizes the layer's input residual before attention
        layernorm_backward(dresidual, dl_ln1w, dl_ln1b, dl_ln1, residual, l_ln1w, l_ln1_mean, l_ln1_rstd, B, T, C);
    }
    // backward through the encoder: gradients for the token and position embeddings
    encoder_backward(grads.wte, grads.wpe, grads_acts.encoded, model->inputs, B, T, C);
}

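This process is the core of deep-learning training: the backward pass computes the gradient of the loss with respect to every parameter, and the optimizer then uses those gradients to update the parameters so that the loss decreases and the model improves.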
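
Two of the smallest steps in the backward pass can be sketched in isolation: seeding the chain rule with 1/(B*T), and sending a gradient through a residual add. The function names below are hypothetical stand-ins, and the `+=` accumulation is an assumption about how the real helpers combine gradients, not a copy of them.

```c
#include <assert.h>
#include <stddef.h>

// Sketch of the chain-rule kick-off: the mean over n losses has gradient 1/n
// with respect to each individual loss.
void fill_mean_loss_grad_sketch(float* dlosses, int n) {
    assert(dlosses != NULL && n > 0);
    float dloss_mean = 1.0f / (float)n;
    for (int i = 0; i < n; i++) { dlosses[i] = dloss_mean; }
}

// Sketch of the backward pass of an elementwise residual add, out = inp1 + inp2.
// Assumption: gradients are accumulated with +=, so the gradient buffers must
// be zeroed before the backward pass starts.
void residual_backward_sketch(float* dinp1, float* dinp2, const float* dout, size_t n) {
    assert(dinp1 != NULL && dinp2 != NULL && dout != NULL);
    for (size_t i = 0; i < n; i++) {
        // d(out)/d(inp1) = d(out)/d(inp2) = 1, so both inputs receive dout
        dinp1[i] += dout[i];
        dinp2[i] += dout[i];
    }
}
```

Under that accumulation assumption, a buffer that receives gradient from two paths (such as a residual stream) ends up holding their sum, which is why the gradients are zeroed before each backward pass.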

1.26 gpt2_update

This code is the parameter update step for the GPT-2 model, using the AdamW optimizer.

void gpt2_update(GPT2 *model, float learning_rate, float beta1, float beta2, float eps, float weight_decay, int t) {
    // reference: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html

    // lazily allocate the memory for m_memory and v_memory
    if (model->m_memory == NULL) {
        // first-moment buffer, zero-initialized by calloc
        model->m_memory = (float*)calloc(model->num_parameters, sizeof(float));
        // second-moment buffer, zero-initialized by calloc
        model->v_memory = (float*)calloc(model->num_parameters, sizeof(float));
    }

    for (int i = 0; i < model->num_parameters; i++) {
        // current parameter value
        float param = model->params_memory[i];
        // current gradient value
        float grad = model->grads_memory[i];

        // update the first moment (momentum)
        float m = beta1 * model->m_memory[i] + (1.0f - beta1) * grad;
        // update the second moment (RMSprop)
        float v = beta2 * model->v_memory[i] + (1.0f - beta2) * grad * grad;
        // bias-correct both moments
        float m_hat = m / (1.0f - powf(beta1, t));
        float v_hat = v / (1.0f - powf(beta2, t));

        // update
        // store the new first and second moments
        model->m_memory[i] = m;
        model->v_memory[i] = v;
        // AdamW step: adaptive gradient term plus weight decay, both scaled by the learning rate
        model->params_memory[i] -= learning_rate * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * param);
    }
}

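This function updates the model parameters from the current gradients and the running first-moment (momentum) and second-moment (RMSprop) estimates. It first checks whether memory for the two moments (m_memory and v_memory) has been allocated, and allocates it if not. Then, for each parameter, it updates the first and second moments, applies bias correction, and takes a step that combines the adaptive gradient term with weight decay. Repeating this step during training gradually moves the parameters toward lower loss.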

1.27 gpt2_free

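This code frees all of the dynamically allocated memory used by the GPT-2 model. Specifically, it frees the buffers named in the comments below: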

void gpt2_free(GPT2 *model) {
    // free the parameters
    free(model->params_memory);
    // free the parameter gradients
    free(model->grads_memory);
    // free the first moments (momentum)
    free(model->m_memory);
    // free the second moments (RMSprop)
    free(model->v_memory);
    // free the activations
    free(model->acts_memory);
    // free the activation gradients
    free(model->grads_acts_memory);
    // free the input tokens
    free(model->inputs);
    // free the target tokens
    free(model->targets);
}

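This function is typically called once training has finished or the model is no longer needed, so that resources are reclaimed promptly and memory leaks are avoided.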
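
Some of these buffers are allocated lazily, so they may never have been allocated when the free function runs. The sketch below uses a hypothetical two-field `TinyModel` to show why freeing them unconditionally is still safe, assuming unallocated pointers are initialized to NULL (as the `== NULL` checks in the lazy-allocation code suggest).

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical miniature of a model with one eager and one lazy buffer.
typedef struct {
    float* params_memory; // allocated up front
    float* m_memory;      // allocated lazily, may still be NULL
} TinyModel;

void tiny_model_free(TinyModel* model) {
    assert(model != NULL);
    free(model->params_memory);
    free(model->m_memory);       // free(NULL) is defined to do nothing
    // resetting the pointers is an addition of this sketch: it makes a
    // second call harmless instead of a double free
    model->params_memory = NULL;
    model->m_memory = NULL;
}
```

The C standard defines `free(NULL)` as a no-op, which is what makes the unconditional calls safe. Resetting the pointers afterwards is a defensive habit added here for illustration; the original function does not do it.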

1.28 DataLoader

This struct defines a data loader (DataLoader), which loads training or validation data from a file for use in training or evaluating the model. Its fields are:

typedef struct {
    // hyperparameters
    // B: the number of sequences processed together in one training or evaluation step
    int B; // batch size
    // T: the number of tokens in each sequence
    int T; // sequence length
    // input handling and its state
    // handle of the file that holds the training or validation tokens
    FILE* tokens_file;
    // size of the file, in bytes
    long file_size;
    // current read offset into the file, in bytes
    long current_position;
    // output memory
    // batch: buffer of B*T+1 token ids read from the file
    int* batch;
    // inputs: the input token ids of the current batch (points at the start of batch)
    int* inputs;
    // targets: the target token ids of the current batch,
    // i.e. the next token at each position (points one element into batch)
    int* targets;
    // convenience variables
    // num_batches: number of batches in the file, derived from the file size, B and T
    int num_batches;
} DataLoader;

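This struct is a key part of preparing a dataset for training and evaluation. Because data is read from the file one batch at a time, only a single batch is held in memory, which keeps memory use small and makes it possible to train on large datasets.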
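
The relationship between batch, inputs and targets can be shown with a short sketch: one buffer of n + 1 token ids serves as both the inputs and the shifted targets. The helper name `make_views_sketch` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Illustrative only: inputs and targets are two overlapping views of one
// buffer of n + 1 token ids, offset by one element.
void make_views_sketch(int* batch, int** inputs, int** targets) {
    assert(batch != NULL && inputs != NULL && targets != NULL);
    *inputs = batch;      // tokens 0 .. n-1
    *targets = batch + 1; // tokens 1 .. n, the "next token" for each input
}
```

With this layout targets[i] is always inputs[i + 1], so no second copy of the data is needed; the price is the one extra token at the end of the buffer.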

1.29 dataloader_init

This code initializes a data loader that reads training or validation data from a file.

void dataloader_init(DataLoader *loader, char* filename, int B, int T) {
    // store the batch size (B) and sequence length (T)
    loader->B = B;
    loader->T = T;

    // open the input file for reading; if it cannot be opened, report an error and exit
    loader->tokens_file = fopen(filename, "rb");
    if (loader->tokens_file == NULL) {
        printf("Error opening tokens file\n");
        exit(1);
    }

    // determine the file size, so we know how much data is available
    fseek(loader->tokens_file, 0, SEEK_END);
    loader->file_size = ftell(loader->tokens_file);
    fseek(loader->tokens_file, 0, SEEK_SET);
    // the file must hold at least one batch (B*T + 1 tokens); otherwise report an error and exit
    if (loader->file_size < (B * T + 1) * sizeof(int)) {
        printf("Error: file size is too small for the batch size and sequence length\n");
        exit(1);
    }
    loader->current_position = 0; // start at the beginning

    // allocate space for B*T + 1 integers to store the inputs and targets
    // the one extra integer is needed because the targets are the inputs shifted by one
    loader->batch = (int*) malloc((B * T + 1) * sizeof(int));
    loader->inputs = loader->batch;
    loader->targets = loader->batch + 1; // targets are shifted by one
    // number of whole batches in the file, from the file size, batch size and sequence length
    loader->num_batches = loader->file_size / (B * T * sizeof(int));
}

After this initialization, the data loader is ready to read data from the file for training or evaluation.
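
The two size computations in the initializer are easy to check by hand. The helpers below are hypothetical and only mirror the arithmetic; they are not part of the original file.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper mirroring the size check in dataloader_init: returns 1
// if the file can supply B*T inputs plus the one extra token needed for the
// shifted targets, 0 otherwise.
int file_holds_one_batch(long file_size, int B, int T) {
    assert(B > 0 && T > 0);
    return file_size >= (long)(((size_t)B * T + 1) * sizeof(int));
}

// Number of whole batches of B*T tokens (integer division drops the remainder).
int count_batches(long file_size, int B, int T) {
    assert(B > 0 && T > 0);
    return (int)(file_size / (long)((size_t)B * T * sizeof(int)));
}
```

For example, with B = 4 and T = 64 a batch is 256 tokens. Assuming a 4-byte int, that is 1024 bytes per batch, so a 10,240-byte file yields 10 batches, and a file of exactly 1024 bytes is rejected because the extra target token is missing.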

1.30 dataloader_reset

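This code resets the data loader's current read position to the beginning of the file. It is typically called at the start of each new pass over the data, so that reading starts again from the beginning.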

void dataloader_reset(DataLoader *loader) {
    // move the read position back to the start of the file
    loader->current_position = 0;
}

1.31 dataloader_next_batch

This code reads the next batch of data from the data file.

void dataloader_next_batch(DataLoader *loader) {
    // batch size
    int B = loader->B;
    // sequence length
    int T = loader->T;
    // if we are at the end of the file, loop back to the beginning
    // (that is, when fewer than B*T+1 integers remain, reset the position to the start)
    if (loader->current_position + (B*T+1) * sizeof(int) > loader->file_size) {
        loader->current_position = 0;
    }
    // read the B*T+1 integers from the file into batch
    fseek(loader->tokens_file, loader->current_position, SEEK_SET);
    fread(loader->batch, sizeof(int), B*T+1, loader->tokens_file);
    // advance the current position by B*T integers
    loader->current_position += B*T * sizeof(int);
}

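This code implements cyclic reading: when the data remaining at the end of the file cannot fill a complete batch, reading automatically continues from the beginning of the file, which guarantees a continuous stream of data for training. This is useful when training makes multiple passes over the dataset.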
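
The wrap-around rule can be isolated into a small pure function, which makes its edge cases easier to see. The helper name `next_read_position` is hypothetical; the condition is the one used in the function above.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: given the current byte offset, return the offset the
// next read should start from, wrapping to 0 when fewer than B*T+1 ints remain.
long next_read_position(long current_position, long file_size, int B, int T) {
    assert(B > 0 && T > 0 && current_position >= 0);
    long needed = (long)(((size_t)B * T + 1) * sizeof(int));
    if (current_position + needed > file_size) {
        return 0; // not enough data left for a full batch: start over
    }
    return current_position;
}
```

With a file of 10 tokens, B = 1 and T = 4, reads start at token 0 and token 4. At token 8 only 2 tokens remain but 5 are needed, so the position wraps to 0 and the last few tokens are never used as inputs.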

1.32 dataloader_free

This code releases the resources used by the data loader (DataLoader).

void dataloader_free(DataLoader *loader) {
    // close the tokens file
    fclose(loader->tokens_file);
    // free the buffer that holds the batch data
    free(loader->batch);
}

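Calling this function when the data loader is no longer needed ensures that the file handle and the batch buffer are released properly and nothing is leaked.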

1.33 random_u32

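This code implements a simple random number generator (RNG) using the xorshift algorithm, specifically the xorshift* variant linked in the code comment. Xorshift is a fast, good-quality pseudorandom number generator that is widely used. This version keeps a 64-bit state variable and produces each new number through a series of shifts and XOR operations followed by one multiplication.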

// the GPT-2 end-of-text token id
#define GPT2_EOT 50256

// a simple xorshift* random number generator (RNG)
// it keeps a 64-bit state and derives each new number from shifts, XORs and one multiplication
unsigned int random_u32(unsigned long long *state) {
    // xorshift rng: https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
    // XOR the state with itself shifted right by 12 bits
    *state ^= *state >> 12;
    // XOR the state with itself shifted left by 25 bits
    *state ^= *state << 25;
    // XOR the state with itself shifted right by 27 bits
    *state ^= *state >> 27;
    // multiply the state by a magic constant and keep the upper 32 bits as the result
    return (*state * 0x2545F4914F6CDD1Dull) >> 32;
}

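Because the state variable changes on every call, the function returns a new unsigned 32-bit integer each time. The approach is fast and simple to implement, but since it is pseudorandom the generated sequence is predictable, so it is not suitable where cryptographically secure random numbers are required.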
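
The predictability has a practical upside: the same seed always reproduces the same stream, which keeps sampling runs repeatable. The sketch below copies the generator under a different name so it stands alone; `same_seed_same_stream` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

// Copy of the xorshift* step from the text, renamed so it can stand alone.
unsigned int xorshift_star_u32(unsigned long long* state) {
    // an all-zero state would stay zero forever, so it must be avoided
    assert(state != NULL && *state != 0);
    *state ^= *state >> 12;
    *state ^= *state << 25;
    *state ^= *state >> 27;
    return (unsigned int)((*state * 0x2545F4914F6CDD1Dull) >> 32);
}

// Returns 1 if two generators seeded identically produce identical streams.
int same_seed_same_stream(unsigned long long seed, int n) {
    unsigned long long a = seed, b = seed;
    for (int i = 0; i < n; i++) {
        if (xorshift_star_u32(&a) != xorshift_star_u32(&b)) { return 0; }
    }
    return 1;
}
```

Note the one seed to avoid: shifting and XOR-ing zero always gives zero, so a state of 0 would make the generator return 0 on every call.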

1.34 random_f32

Generates a random float32 number in the interval [0, 1).

float random_f32(unsigned long long *state) { // random float32 in [0,1)
    return (random_u32(state) >> 8) / 16777216.0f;
}
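
The constants in this one-liner are not arbitrary: 16777216 is 2^24, and a float32 has a 24-bit significand. The sketch below isolates the integer-to-float step in a hypothetical helper, assuming a 32-bit unsigned int.

```c
#include <assert.h>

// Hypothetical helper isolating the integer-to-float step of random_f32.
// Keeping the top 24 bits matches the 24-bit significand of a float32, so
// every value k / 2^24 (k in 0 .. 2^24 - 1) is exactly representable.
float u32_to_unit_float(unsigned int r) {
    float x = (float)(r >> 8) / 16777216.0f; // 16777216 = 2^24
    assert(x >= 0.0f && x < 1.0f);           // holds for every 32-bit input
    return x;
}
```

The largest possible numerator is 2^24 - 1, so the result can never reach 1.0. That is the property the sampler relies on when it assumes coin is strictly below 1.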

1.35 sample_mult

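Samples an index from a probability distribution. The probabilities must sum to 1! The parameter coin is a random number in [0, 1), usually produced by random_f32(). The function accumulates the cumulative distribution function (CDF) and returns the first index at which coin is less than the running CDF value. If, because of rounding error, coin is not less than any CDF value, it falls back to returning the last index.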

int sample_mult(float* probabilities, int n, float coin) {
    // sample index from probabilities (they must sum to 1!)
    // coin is a random number in [0, 1), usually from random_f32()
    // running value of the cumulative distribution function (CDF)
    float cdf = 0.0f;
    for (int i = 0; i < n; i++) {
        // add this probability to the CDF
        cdf += probabilities[i];
        if (coin < cdf) {
            // the coin landed inside this index's interval, so return it
            return i;
        }
    }
    return n - 1; // in case of rounding errors
}
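
A three-outcome example shows how the coin maps onto indices. The function below repeats the same inverse-CDF logic under a hypothetical name so that the block is self-contained.

```c
#include <assert.h>
#include <stddef.h>

// Same inverse-CDF sampling as sample_mult, renamed so the block stands alone.
int sample_index_sketch(const float* probabilities, int n, float coin) {
    assert(probabilities != NULL && n > 0);
    assert(coin >= 0.0f && coin < 1.0f);
    float cdf = 0.0f;
    for (int i = 0; i < n; i++) {
        cdf += probabilities[i];
        if (coin < cdf) { return i; }
    }
    return n - 1; // only reached if rounding keeps the running total at or below coin
}
```

For probabilities {0.1, 0.2, 0.7} the running CDF is 0.1, 0.3, 1.0, so a coin of 0.05 selects index 0, 0.25 selects index 1 and 0.95 selects index 2. Each index owns an interval whose width equals its probability.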

1.36 Tokenizer

// Tokenizer (only supports decoding)

typedef struct {
    // vocabulary size
    uint32_t vocab_size;
    // token table: the string for each token id
    char **token_table;
    // initialization status: 1 if initialized successfully, 0 otherwise
    int init_ok;
} Tokenizer;

1.37 safe_printf

void safe_printf(const char *piece) {
    // the tokens are raw bytes, and we only want to print the printable ones
    // many bytes can be various control codes, backspace, etc.
    if (piece == NULL) { return; }
    if (piece[0] == '\0') { return; }
    // handle individual byte tokens
    // every token is asserted to be at least one byte so doing piece[1] is ok
    if (piece[1] == '\0') {
        unsigned char byte_val = piece[0];
        if (!(isprint(byte_val) || isspace(byte_val))) {
            return; // weird byte, don't print it
        }
    }
    printf("%s", piece);
}
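
The filtering rule is easier to test when it is separated from the printing. The predicate below is a hypothetical refactoring that applies the same checks and reports whether a piece would be printed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

// Hypothetical predicate with the same filtering rule as safe_printf:
// returns 1 if the piece would be printed, 0 if it would be skipped.
int piece_is_printable(const char* piece) {
    if (piece == NULL || piece[0] == '\0') { return 0; }
    if (piece[1] == '\0') {
        // single-byte token: convert to unsigned char first, because passing
        // a negative char value to isprint/isspace is undefined behavior
        unsigned char byte_val = (unsigned char)piece[0];
        return isprint(byte_val) || isspace(byte_val);
    }
    return 1; // multi-byte tokens are printed without inspection
}
```

Only single-byte tokens are inspected. A multi-byte token passes through unchanged even if it contains control bytes, which matches the behavior of the function above.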

1.38 tokenizer_init

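This function initializes the Tokenizer struct by loading token information from the given file. It first tries to open the file; on failure it prints a warning and sets the init_ok flag to 0 to signal that initialization failed. If the file opens successfully, it reads the file header to obtain the vocabulary size, then reads the length and contents of each token and stores each token as a null-terminated string. Finally, it closes the file and sets init_ok to 1 to signal success.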

void tokenizer_init(Tokenizer *tokenizer, const char *filename) {
    FILE *file = fopen(filename, "rb");
    if (file == NULL) {
        // try to be more helpful as we just added this feature, erase later
        printf("---\n");
        printf("WARNING: Failed to open the tokenizer file %s\n", filename);
        printf("The Tokenizer is a new feature added April 14 2024.\n");
        printf("Re-run `python train_gpt2.py` to write it\n");
        printf("---\n");
        tokenizer->init_ok = 0;
        return;
    }
    // read in the header
    uint32_t header[256];
    fread(header, sizeof(uint32_t), 256, file);
    // check the magic number that identifies the file format
    assert(header[0] == 20240328);
    // check that the file format version is 1
    assert(header[1] == 1);
    // the vocabulary size is stored in the third header entry
    tokenizer->vocab_size = header[2];
    // read in all the tokens
    // length of the current token, in bytes
    unsigned char length;
    tokenizer->token_table = (char **)malloc(tokenizer->vocab_size * sizeof(char *));
    for (uint32_t i = 0; i < tokenizer->vocab_size; i++) {
        fread(&length, sizeof(unsigned char), 1, file);
        assert(length > 0); // every token should be at least one character
        // allocate length bytes plus one for the terminator
        char *token_bytes = (char *)malloc(length + 1);
        fread(token_bytes, sizeof(char), length, file);
        token_bytes[length] = '\0';  // Add null terminator for printing
        // store the token string in the table
        tokenizer->token_table[i] = token_bytes;
    }
    // cleanups
    fclose(file);
    // mark the tokenizer as successfully initialized
    tokenizer->init_ok = 1;
}
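
After the header, the loop above reads each token as one length byte followed by that many raw bytes. The sketch below parses the same record layout from a memory buffer instead of a file; the function `read_token_record` and its bounds checks are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical in-memory version of the per-token read in tokenizer_init.
// A record is one length byte followed by that many raw bytes. On success the
// function returns a newly allocated, null-terminated copy and advances *pos;
// on a truncated or empty record it returns NULL and leaves *pos unchanged.
char* read_token_record(const unsigned char* buf, size_t buf_len, size_t* pos) {
    assert(buf != NULL && pos != NULL);
    if (*pos >= buf_len) { return NULL; } // no length byte left
    unsigned char length = buf[*pos];
    if (length == 0 || *pos + 1 + length > buf_len) { return NULL; } // malformed
    char* token = (char*)malloc((size_t)length + 1);
    if (token == NULL) { return NULL; }
    memcpy(token, buf + *pos + 1, length);
    token[length] = '\0'; // terminator so the token can be printed with %s
    *pos += 1 + (size_t)length;
    return token;
}
```

Because the length is stored in a single byte, a token in this format can be at most 255 bytes long. Unlike the original loop, which asserts and otherwise trusts the file, this sketch returns NULL on malformed input; that is a design choice of the example, not something the original does.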

1.39 tokenizer_decode

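This function decodes a token ID into its token string. It first checks whether the tokenizer was initialized successfully and returns NULL if not. If initialization succeeded and the token ID is within the vocabulary, it returns the corresponding token string. If the token ID is outside the vocabulary, it prints an error message and returns NULL.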

const char *tokenizer_decode(Tokenizer *tokenizer, uint32_t token_id) {
    if (tokenizer->init_ok == 0) {
        // the tokenizer was not initialized successfully
        return NULL;
    }
    if (token_id < tokenizer->vocab_size) {
        // the token id is within the vocabulary: return its string
        return tokenizer->token_table[token_id];
    } else {
        // the token id is out of range: report the error and return NULL
        printf("invalid token id %d!\n", token_id);
        return NULL;
    }
}

1.40 tokenizer_free

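This function releases the resources allocated by the tokenizer. It first checks whether the tokenizer was initialized successfully; if so, it frees the memory of each token string and finally the array that holds the string pointers. This is an important step for avoiding memory leaks once the tokenizer is no longer needed.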

void tokenizer_free(Tokenizer *tokenizer) {
    if (tokenizer->init_ok) {
        // only free if the tokenizer was initialized successfully
        for (uint32_t i = 0; i < tokenizer->vocab_size; i++) {
            // free the memory of each token string
            free(tokenizer->token_table[i]);
        }
        // free the token table array itself
        free(tokenizer->token_table);
    }
}

1.41 main

// main training loop
int main() {

    // build the GPT-2 model from a checkpoint
    GPT2 model;
    gpt2_build_from_checkpoint(&model, "gpt2_124M.bin");

    // build the DataLoaders from tokens files. for now use tiny_shakespeare if available, else tiny_stories
    char* tiny_stories_train = "data/TinyStories_train.bin";
    char* tiny_stories_val = "data/TinyStories_val.bin";
    char* tiny_shakespeare_train = "data/tiny_shakespeare_train.bin";
    char* tiny_shakespeare_val = "data/tiny_shakespeare_val.bin";
    char* train_tokens = access(tiny_shakespeare_train, F_OK) != -1 ? tiny_shakespeare_train : tiny_stories_train;
    char* val_tokens = access(tiny_shakespeare_val, F_OK) != -1 ? tiny_shakespeare_val : tiny_stories_val;
    int B = 4; // batch size 4 (i.e. 4 independent token sequences will be trained on)
    int T = 64; // sequence length 64 (i.e. each sequence is 64 tokens long). must be <= maxT, which is 1024 for GPT-2
    DataLoader train_loader;
    dataloader_init(&train_loader, train_tokens, B, T);
    printf("train dataset num_batches: %d\n", train_loader.num_batches);
    DataLoader val_loader;
    dataloader_init(&val_loader, val_tokens, B, T);
    printf("val dataset num_batches: %d\n", val_loader.num_batches);
    int val_num_batches = 5;

    // build the Tokenizer
    Tokenizer tokenizer;
    tokenizer_init(&tokenizer, "gpt2_tokenizer.bin");

    // some memory for generating samples from the model
    unsigned long long rng_state = 1337;
    int* gen_tokens = (int*)malloc(B * T * sizeof(int));
    const int genT = 64; // number of steps of inference we will do

    // train
    struct timespec start, end;
    for (int step = 0; step <= 40; step++) {

        // once in a while estimate the validation loss
        if (step % 10 == 0) {
            float val_loss = 0.0f;
            // rewind the validation loader so reading starts from the beginning of the file
            dataloader_reset(&val_loader);
            // evaluate on the first val_num_batches validation batches (not the whole file)
            for (int i = 0; i < val_num_batches; i++) {
                // fetch the next validation batch
                dataloader_next_batch(&val_loader);
                // run the forward pass on the validation batch
                gpt2_forward(&model, val_loader.inputs, val_loader.targets, B, T);
                // accumulate this batch's mean loss
                val_loss += model.mean_loss;
            }
            // average over the evaluated batches
            val_loss /= val_num_batches;
            printf("val loss %f\n", val_loss);
        }

        // once in a while do model inference to print generated text
        // 偶尔进行模型推理以打印生成的文本
        if (step > 0 && step % 20 == 0) {
            // fill up gen_tokens with the GPT2_EOT, which kicks off the generation
            // 使用GPT2_EOT填充gen_tokens，这将触发生成过程
            for(int i = 0; i < B * T; ++i) {
                gen_tokens[i] = GPT2_EOT;
            }
            // now sample from the model autoregressively
            // 现在从模型中自回归地进行抽样
            printf("generating:\n---\n");
            for (int t = 1; t < genT; t++) {
                // note that inference is very wasteful here because for each token
                // we re-calculate the forward pass for all of (B,T) positions from scratch
                // but the inference here is just for sanity checking anyway
                // and we can maybe optimize a bit more later, with careful tests
                gpt2_forward(&model, gen_tokens, NULL, B, T);
                // furthermore, below we're only using b=0 (i.e. the first row) of all B rows
                // we're in principle running B "inference streams" in parallel here
                // but only using position 0
                // get the V-dimensional vector probs[0, t-1, :]
                float* probs = model.acts.probs + (t-1) * model.config.vocab_size;
                float coin = random_f32(&rng_state);
                int next_token = sample_mult(probs, model.config.vocab_size, coin);
                gen_tokens[t] = next_token;
                // print the generated token, either using the Tokenizer or a fallback
                if (tokenizer.init_ok) {
                    const char* token_str = tokenizer_decode(&tokenizer, next_token);
                    safe_printf(token_str);
                } else {
                    // fall back to printing the token id
                    printf("%d ", next_token);
                }
                fflush(stdout);
            }
            printf("\n---\n");
        }

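        // do a training step
        // record the start time
        clock_gettime(CLOCK_MONOTONIC, &start);
        // fetch the next batch of training data
        dataloader_next_batch(&train_loader);
        // forward pass on the training batch (targets are given, so model.mean_loss is computed)
        gpt2_forward(&model, train_loader.inputs, train_loader.targets, B, T);
        // zero the gradients
        gpt2_zero_grad(&model);
        // backward pass: compute the gradients
        gpt2_backward(&model);
        // update the parameters with the AdamW optimizer
        // (as passed here: learning rate 1e-4, beta1 0.9, beta2 0.999, eps 1e-8, weight decay 0.0, step counter step+1)
        gpt2_update(&model, 1e-4f, 0.9f, 0.999f, 1e-8f, 0.0f, step+1);
        // record the end time
        clock_gettime(CLOCK_MONOTONIC, &end);
        // time taken by this training step, in seconds
        double time_elapsed_s = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
        // print the training loss and the elapsed time for this step
        printf("step %d: train loss %f (took %f ms)\n", step, model.mean_loss, time_elapsed_s * 1000);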
    }

    // free
    dataloader_free(&train_loader);
    dataloader_free(&val_loader);
    tokenizer_free(&tokenizer);
    gpt2_free(&model);
    free(gen_tokens);
    return 0;
}
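
The pointer arithmetic model.acts.probs + (t-1) * model.config.vocab_size in the sampling loop works because probs is stored as one flat, row-major (B,T,V) array, so the slice for row b=0 at position t-1 starts at offset (t-1)*V. The sketch below spells out that indexing. The helper name probs_row is hypothetical and chosen only for illustration; it is not part of train_gpt2.c.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not in train_gpt2.c): returns a pointer to the
// V-dimensional slice probs[b, t, :] of a flat, row-major (B,T,V) array.
float* probs_row(float* probs, int b, int t, int T, int V) {
    assert(probs != NULL && b >= 0 && t >= 0 && t < T && V > 0);
    // element (b, t, v) lives at flat index (b*T + t)*V + v
    return probs + ((size_t)b * T + t) * V;
}
```

With b = 0 the offset reduces to t*V, so probs_row(model.acts.probs, 0, t-1, T, V) would point at the same slice that the loop reads before sampling the token for position t.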

2. test_gpt2.c

test_gpt2.c is the C accuracy test for the model; it includes the code of train_gpt2.c.

2.1 check_tensor

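This function checks whether two tensors are approximately equal within a given tolerance. It is very useful for verifying that a neural network implementation is correct, especially when comparing the results of the forward and backward passes against reference values. The annotated code follows: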

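// poor man's tensor checker
// compares two tensors and reports whether they are approximately equal within a tolerance
int check_tensor(float *a, float *b, int n, char* label) {
    // number of leading elements to print, as a debugging aid
    int print_upto = 5;
    // flag: 1 while every element seen so far is within tolerance
    int ok = 1;
    // largest absolute difference seen across all elements
    float maxdiff = 0.0f;
    // tolerance: two elements count as equal when their absolute difference is <= tol
    float tol = 2e-2;
    // print the label of the tensor being checked
    printf("%s\n", label);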
    // compare element by element
    for (int i = 0; i < n; i++) {
        // look at the difference at position i of these two tensors
        float diff = fabsf(a[i] - b[i]);

        // keep track of the overall error
        // ok stays 1 only if this element is within tolerance; once it drops to 0 it stays 0
        ok = ok && (diff <= tol);
        // update the maximum difference
        if (diff > maxdiff) { maxdiff = diff; }

        // for the first few elements of each tensor, pretty print
        // the actual numbers, so we can do a visual, qualitative proof/assessment
        if (i < print_upto) {
            if (diff <= tol) {
                if (i < print_upto) { printf("OK "); }
            } else {
                if (i < print_upto) { printf("NOT OK "); }
            }
            printf("%f %f\n", a[i], b[i]);
        }
    }
    // print the final result for this tensor
    if (ok) {
        // every element is within tolerance
        printf("TENSOR OK, maxdiff = %e\n", maxdiff);
    } else {
        // at least one element exceeds the tolerance
        printf("TENSOR NOT OK, maxdiff = %e\n", maxdiff);
    }
    // return 1 if all elements are within tolerance, 0 if at least one is not
    return ok;
}

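check_tensor takes two tensors a and b (as flat one-dimensional arrays), their element count n, and a label string that identifies the tensor in the printed output. It walks over every element, computes the absolute difference, and checks that each difference is within the tolerance tol (2e-2). It returns 1 if all differences are within tolerance, meaning the tensors are approximately equal, and 0 if at least one difference exceeds it.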
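
As a compact illustration of the comparison described above, the sketch below keeps only the core of check_tensor: the per-element absolute difference, the running maximum, and the all-within-tolerance flag. The name tensors_close and the maxdiff_out parameter are assumptions made for this example. The printing done by the original is omitted, and the tolerance is passed in as an argument rather than fixed at 2e-2.

```c
#include <assert.h>
#include <stddef.h>

// Illustrative sketch, not the function from test_gpt2.c.
// Returns 1 if every |a[i] - b[i]| <= tol, otherwise 0.
// If maxdiff_out is non-NULL, the largest absolute difference is stored there.
int tensors_close(const float* a, const float* b, int n, float tol, float* maxdiff_out) {
    assert(a != NULL && b != NULL && n >= 0 && tol >= 0.0f);
    int ok = 1;
    float maxdiff = 0.0f;
    for (int i = 0; i < n; i++) {
        // absolute difference, written without <math.h>
        float diff = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        // written as !(diff <= tol) so that a NaN difference also fails,
        // matching the behaviour of ok && (diff <= tol)
        if (!(diff <= tol)) { ok = 0; }
        if (diff > maxdiff) { maxdiff = diff; }
    }
    if (maxdiff_out != NULL) { *maxdiff_out = maxdiff; }
    return ok;
}
```

Because the tolerance is absolute, a difference of 0.5 fails at tol = 0.25 and passes at tol = 0.5, whatever the magnitude of the values being compared.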

2.2 main

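This is the main function of the test program. It loads the GPT-2 model, runs the forward and backward passes, updates the model parameters, and performs a series of accuracy checks against expected results, which is a common way to confirm that a model implementation is correct. The annotated code follows: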

int main(int argc, char *argv[]) {

    // build the GPT-2 model from a checkpoint
    GPT2 model;
    gpt2_build_from_checkpoint(&model, "gpt2_124M.bin");

    // read some basic sizes from the model configuration
    int C = model.config.channels;
    int V = model.config.vocab_size;
    int maxT = model.config.max_seq_len;
    int L = model.config.num_layers;

    // load additional information that we will use for debugging and error checking
    FILE *state_file = fopen("gpt2_124M_debug_state.bin", "rb");
    if (state_file == NULL) { printf("Error opening state file\n"); return 1; }
    int state_header[256];
    fread(state_header, sizeof(int), 256, state_file);
    if (state_header[0] != 20240327) { printf("Bad magic state file"); return 1; }
    if (state_header[1] != 1) { printf("Bad version in state file"); return 1; }
    int B = state_header[2]; // batch size, e.g. 4
    int T = state_header[3]; // time / sequence length (e.g. 64, up to maxT)
    printf("[State]\n");
    printf("batch_size: %d\n", B);
    printf("seq_len: %d\n", T);

    // expected gradients (the reference values to compare against)
    ParameterTensors expected_grads;
    float* expected_grads_memory = malloc_and_point_parameters(&expected_grads, model.param_sizes);

    // inputs and expected outputs, only used for error checking
    int* x = (int*) malloc(B * T * sizeof(int));
    int* y = (int*) malloc(B * T * sizeof(int));
    float* expected_logits = (float*) malloc(B * T * V * sizeof(float));
    float* expected_loss = (float*) malloc(1 * sizeof(float));

    // read reference information from Python
    fread(x, sizeof(int), B*T, state_file);
    fread(y, sizeof(int), B*T, state_file);
    fread(expected_logits, sizeof(float), B*T*V, state_file);
    fread(expected_loss, sizeof(float), 1, state_file);
    fread(expected_grads_memory, sizeof(float), model.num_parameters, state_file);
    fclose(state_file);

    // overall OK signal for the test
    int allok = 1;

    // let's do 10 training iterations, following the pytorch code
    float expected_losses[10] = {
        5.270007133483887,
        4.059706687927246,
        3.3751230239868164,
        2.8007826805114746,
        2.315382242202759,
        1.8490285873413086,
        1.3946564197540283,
        0.9991465210914612,
        0.6240804195404053,
        0.37651097774505615
    };
    for (int step = 0; step < 10; step++) {

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);

        // forward pass, zero the gradients, backward pass
        gpt2_forward(&model, x, y, B, T);
        gpt2_zero_grad(&model);
        gpt2_backward(&model);

        clock_gettime(CLOCK_MONOTONIC, &end);
        double time_elapsed_s = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;

        if (step == 0) {
            // error checking at step 0 for reference activations/gradients

            // at this point, target should be equal to expected_logits, let's compare
            // ("target" here means the logits just computed, model.acts.logits)
            int logits_ok = 1;
            // compare the logits element by element, stopping at the first mismatch
            for (int i=0; i<B*T*V; i++) {
                if(i < 3) {
                    printf("%f %f\n", expected_logits[i], model.acts.logits[i]);
                }
                if (fabsf(expected_logits[i] - model.acts.logits[i]) >= 1e-2) {
                    printf("MISMATCH AT INDEX %d: ", i);
                    printf("%f %f\n", expected_logits[i],model.acts.logits[i]);
                    logits_ok = 0;
                    break;
                }
            }
            if(!logits_ok) { printf("NOT "); }
            printf("OK (LOGITS)\n");
            allok = allok && logits_ok;

            // compare the achieved loss
            if (fabsf(model.mean_loss - *expected_loss) >= 1e-2) {
                printf("LOSS MISMATCH: %f %f\n", model.mean_loss, *expected_loss);
                allok = 0;
            } else {
                printf("LOSS OK: %f %f\n", model.mean_loss, *expected_loss);
            }

            // finally check all the gradients
            int gradoks[16];
            // check each of the 16 gradient tensors against its expected values
            ParameterTensors grads = model.grads;
            gradoks[0] = check_tensor(grads.wte, expected_grads.wte, V*C, "dwte");
            gradoks[1] = check_tensor(grads.wpe, expected_grads.wpe, maxT*C, "dwpe");
            gradoks[2] = check_tensor(grads.ln1w, expected_grads.ln1w, L*C, "dln1w");
            gradoks[3] = check_tensor(grads.ln1b, expected_grads.ln1b, L*C, "dln1b");
            gradoks[4] = check_tensor(grads.qkvw, expected_grads.qkvw, L*3*C*C, "dqkvw");
            gradoks[5] = check_tensor(grads.qkvb, expected_grads.qkvb, L*3*C, "dqkvb");
            gradoks[6] = check_tensor(grads.attprojw, expected_grads.attprojw, L*C*C, "dattprojw");
            gradoks[7] = check_tensor(grads.attprojb, expected_grads.attprojb, L*C, "dattprojb");
            gradoks[8] = check_tensor(grads.ln2w, expected_grads.ln2w, L*C, "dln2w");
            gradoks[9] = check_tensor(grads.ln2b, expected_grads.ln2b, L*C, "dln2b");
            gradoks[10] = check_tensor(grads.fcw, expected_grads.fcw, L*4*C*C, "dfcw");
            gradoks[11] = check_tensor(grads.fcb, expected_grads.fcb, L*4*C, "dfcb");
            gradoks[12] = check_tensor(grads.fcprojw, expected_grads.fcprojw, L*C*4*C, "dfcprojw");
            gradoks[13] = check_tensor(grads.fcprojb, expected_grads.fcprojb, L*C, "dfcprojb");
            gradoks[14] = check_tensor(grads.lnfw, expected_grads.lnfw, C, "dlnfw");
            gradoks[15] = check_tensor(grads.lnfb, expected_grads.lnfb, C, "dlnfb");
            for (int i = 0; i < 16; i++) {
                allok = allok && gradoks[i];
            }
        }

        // update the model parameters
        gpt2_update(&model, 1e-4f, 0.9f, 0.999f, 1e-8f, 0.01f, step+1);

        // compare the losses
        // (this local float shadows the outer expected_loss pointer from here to the end of the loop body)
        float expected_loss = expected_losses[step];
        float actual_loss = model.mean_loss;
        int step_loss_ok = fabsf(expected_loss - actual_loss) < 1e-2;
        allok = allok && step_loss_ok;

        // print the timing information at the end
        printf("step %d: loss %f (took %f ms) OK = %d\n", step, model.mean_loss, time_elapsed_s * 1000, step_loss_ok);
    }

    // final judgement
    printf("overall okay: %d\n", allok);

    // free everything
    free(x);
    free(y);
    free(expected_logits);
    free(expected_loss);
    free(expected_grads_memory);
    gpt2_free(&model);
    return 0;
}
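
Both main functions time a step by reading CLOCK_MONOTONIC before and after the work and then combining the seconds and nanoseconds fields. That subtraction can be isolated in a small helper, sketched below. The name elapsed_ms is hypothetical and is not a function in llm.c. The nanosecond difference can be negative when the end time falls in a later second, and the formula still gives the right result because the seconds difference compensates.

```c
#include <assert.h>
#include <time.h>

// Hypothetical helper: milliseconds elapsed between two timespec readings,
// using the same formula as the training and test loops.
double elapsed_ms(const struct timespec* start, const struct timespec* end) {
    assert(start != NULL && end != NULL);
    double seconds = (double)(end->tv_sec - start->tv_sec)
                   + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
    return seconds * 1000.0;
}
```

For example, a start of 1 s + 900,000,000 ns and an end of 2 s + 100,000,000 ns gives 1 + (-0.8) = 0.2 seconds, that is, about 200 ms.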

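The main function of test_gpt2.c does the following:

Loads the model, the input data, the expected outputs, and the expected gradients.
Runs 10 training iterations, updating the model parameters after each one and comparing the loss at every step with the expected value.
At the first iteration (step 0), compares the computed logits, loss, and gradients with the expected values to verify the implementation.
Combines the results of these checks into an overall verdict on the correctness of the implementation.
Finally, frees all allocated resources.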
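
The loading step reads a 256-int header and checks a magic number (20240327) and a version (1) before using the batch size and sequence length stored in it. The test as listed does not inspect the return value of fread. The sketch below shows one way the header check could be made stricter. The function name read_state_header and its return codes are assumptions for illustration only; they are not part of test_gpt2.c.

```c
#include <assert.h>
#include <stdio.h>

// Hypothetical helper: reads the 256-int header of a debug state file and
// extracts B and T. Returns 0 on success, a negative code on failure.
int read_state_header(FILE* f, int* B, int* T) {
    assert(f != NULL && B != NULL && T != NULL);
    int header[256];
    // fread returns the number of items read; a short read means a truncated file
    if (fread(header, sizeof(int), 256, f) != 256) { return -1; }
    if (header[0] != 20240327) { return -2; }            // bad magic number
    if (header[1] != 1) { return -3; }                   // unsupported version
    if (header[2] <= 0 || header[3] <= 0) { return -4; } // implausible sizes
    *B = header[2];
    *T = header[3];
    return 0;
}
```

Distinct return codes let the caller report which check failed, while B and T are left untouched unless every check passes.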
