Have you ever run into loss jitter like this during training?

Problem description

Background: this is about a loss jitter that appears at the start of every training epoch. Whenever the data loader feeds the first batch of an epoch into the network, the loss suddenly spikes, then returns to normal from the second batch onward and keeps decreasing. The curve looks really strange. I need your help!

    for epoch in range(begin_epoch, end_epoch):
        print('PROGRESS: %.2f%%' % (100.0 * epoch / end_epoch))

        # set epoch as random seed of sampler while distributed training
        if train_sampler is not None and hasattr(train_sampler, 'set_epoch'):
            train_sampler.set_epoch(epoch)

        # reset metrics
        metrics.reset()

        # set net to train mode
        net.train()

        # clear the parameter gradients
        # optimizer.zero_grad()

        # init end time
        end_time = time.time()

        if isinstance(lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            name, value = validation_monitor.metrics.get()
            val = value[name.index(validation_monitor.host_metric_name)]
            lr_scheduler.step(val, epoch)

        # training
        train_loader_iter = iter(train_loader)
        for nbatch in range(total_size):
            try:
                batch = next(train_loader_iter)
            except StopIteration:
                print('reset loader .. ')
                train_loader_iter = iter(train_loader)
                batch = next(train_loader_iter)
            global_steps = total_size * epoch + nbatch

            os.environ['global_steps'] = str(global_steps)

            # record time
            data_in_time = time.time() - end_time

            # transfer data to GPU
            data_transfer_time = time.time()
            batch = to_cuda(batch)
            data_transfer_time = time.time() - data_transfer_time

            # forward
            forward_time = time.time()
            outputs, loss = net(*batch)
            loss = loss.mean()
            if gradient_accumulate_steps > 1:
                loss = loss / gradient_accumulate_steps
            forward_time = time.time() - forward_time

            # backward
            backward_time = time.time()
            if fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()
            backward_time = time.time() - backward_time

            optimizer_time = time.time()
            if (global_steps + 1) % gradient_accumulate_steps == 0:

                # clip gradient
                if clip_grad_norm > 0:
                    if fp16:
                        total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer),
                                                                    clip_grad_norm)
                    else:
                        total_norm = torch.nn.utils.clip_grad_norm_(net.parameters(),
                                                                    clip_grad_norm)
                    if writer is not None:
                        writer.add_scalar(tag='grad-para/Total-Norm',
                                        scalar_value=float(total_norm),
                                        global_step=global_steps)

                optimizer.step()
                
                # step LR scheduler
                if lr_scheduler is not None and not isinstance(lr_scheduler,
                                                            torch.optim.lr_scheduler.ReduceLROnPlateau):
                    lr_scheduler.step()

                # clear the parameter gradients
                optimizer.zero_grad()
            optimizer_time = time.time() - optimizer_time

            # update metric
            metric_time = time.time()
            metrics.update(outputs)
            if writer is not None and nbatch % 50 == 0:
                with torch.no_grad():
                    for group_i, param_group in enumerate(optimizer.param_groups):
                        writer.add_scalar(tag='Initial-LR/Group_{}'.format(group_i),
                                        scalar_value=param_group['initial_lr'],
                                        global_step=global_steps)
                        writer.add_scalar(tag='LR/Group_{}'.format(group_i),
                                        scalar_value=param_group['lr'],
                                        global_step=global_steps)
                    writer.add_scalar(tag='Train-Loss',
                                    scalar_value=float(loss.item()),
                                    global_step=global_steps)
                    name, value = metrics.get()
                    for n, v in zip(name, value):
                        if 'Logits' in n:
                            writer.add_scalar(tag='Train-Logits/' + n,
                                            scalar_value=v,
                                            global_step=global_steps)
                        else:
                            writer.add_scalar(tag='Train-' + n,
                                            scalar_value=v,
                                            global_step=global_steps)
                    for k, v in outputs.items():
                        if 'score' in k:
                            writer.add_histogram(tag=k,
                                                 values=v,
                                                 global_step=global_steps)

            metric_time = time.time() - metric_time

Tags: deep-learning, pytorch, tensorboard, loss, bert-language-model

Solution


There is a batch in your dataset with a very high loss, that's all.
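
If you want to confirm that, one hypothetical diagnostic (reusing `net`, `train_loader`, and `to_cuda` from your code; everything else here is illustrative) is to run a single pass without optimizer steps and rank the batches by loss:

    import torch

    # Record the loss of every batch without updating any weights.
    per_batch_loss = []
    net.eval()
    with torch.no_grad():
        for nbatch, batch in enumerate(train_loader):
            batch = to_cuda(batch)
            outputs, loss = net(*batch)
            per_batch_loss.append((nbatch, float(loss.mean().item())))
    net.train()

    # The top outliers are the batches responsible for the jumps in the curve.
    worst = sorted(per_batch_loss, key=lambda item: item[1], reverse=True)[:5]
    print('highest-loss batches:', worst)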

It is unusual to log metrics for every single batch; people usually log the average over an epoch (or over a window of several batch steps). If you log averages, you won't see spikes like this.
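
For example, a minimal sketch of a running-average helper (the `AverageMeter` class and the `loss_meter` name are illustrative, not part of your code) that would smooth the Train-Loss curve written to TensorBoard:

    class AverageMeter:
        """Accumulates a scalar and reports its mean over a logging window."""
        def __init__(self):
            self.sum, self.count = 0.0, 0

        def update(self, value):
            self.sum += value
            self.count += 1

        def average_and_reset(self):
            avg = self.sum / max(self.count, 1)
            self.sum, self.count = 0.0, 0
            return avg

    # Inside the training loop, instead of writing float(loss.item()) every 50
    # batches, accumulate every batch and write the window mean:
    #
    #     loss_meter.update(loss.item())
    #     if (nbatch + 1) % 50 == 0:
    #         writer.add_scalar(tag='Train-Loss',
    #                           scalar_value=loss_meter.average_and_reset(),
    #                           global_step=global_steps)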

You can also reduce these spikes by shuffling the data, so that the problematic batch is spread out across the epoch. In general, shuffling at the start of every epoch is good practice.
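
A minimal sketch, assuming a plain in-memory dataset purely for illustration: in the single-process case shuffling is just a DataLoader flag, and in the distributed case the sampler handles it (your code already calls set_epoch for that):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(1000, 16))   # placeholder dataset

    # Single-process training: let the DataLoader reshuffle every epoch.
    train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

    # Distributed training: DistributedSampler does the shuffling (shuffle=True
    # is its default), and set_epoch(epoch) changes the permutation each epoch:
    #
    #     sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    #     train_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    #     for epoch in range(begin_epoch, end_epoch):
    #         sampler.set_epoch(epoch)
    #         ...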

