Why do problems keep occurring when training medical-record text data with the bio_clinicalbert model?

When I use the Bio_ClinicalBERT model to train on my medical-record text data, problems keep coming up. Please help me debug this into working code.
The code below came from ChatGPT. It keeps failing when I run it, and I have already been through several rounds of debugging with ChatGPT, so please do not reply with another GPT answer.
The prompt I gave was:
I have text data of more than 1000 patients' medical records, classified into two types, healthy and unhealthy. Please use the Bio_ClinicalBERT model to train on these text data and debug an optimal model. Divide the data into a training set, a validation set and a test set. During model training, train with the training set and evaluate with the validation set to arrive at the optimal hyperparameter combination, and plot the accuracy and loss of each epoch with matplotlib.pyplot in the Python code. Then merge the training and validation sets into a new training set, use it to train the Bio_ClinicalBERT model with the optimal hyperparameters obtained before, and save the trained model. For model testing, load the saved model and measure accuracy on the test set. I have a GPU in my computer; please use the GPU for training. Please give specific executable code that explains each step and parameter.

The code ChatGPT gave:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.utils import shuffle

os.chdir('E:/CNSI_project/demo_and_data_3.0/aim2_tfModel_CliBertToken/')
# Set device to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load and preprocess the data
df = pd.read_csv('InputData_short_test_filtered.csv')
df = shuffle(df, random_state=2023)
texts = df['2']
labels = df['1']

# Split the data into train, validation, and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=2023)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.25, random_state=2023)

# Load the tokenizer and encode the data
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
train_encodings = tokenizer(list(train_texts), padding=True, truncation=True, return_tensors='pt')
val_encodings = tokenizer(list(val_texts), padding=True, truncation=True, return_tensors='pt')
test_encodings = tokenizer(list(test_texts), padding=True, truncation=True, return_tensors='pt')

# Create PyTorch datasets
class MedicalRecordDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MedicalRecordDataset(train_encodings, train_labels)
val_dataset = MedicalRecordDataset(val_encodings, val_labels)
test_dataset = MedicalRecordDataset(test_encodings, test_labels)

# Create PyTorch data loaders
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Define the model
num_labels = len(train_labels.unique())
model = AutoModelForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)

# Define the optimizer and learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Define the number of epochs and initialize lists to store losses and accuracies
num_epochs = 10
train_losses, val_losses, train_accs, val_accs = [], [], [], []

# Train the model (this is the part where the errors keep occurring)
for epoch in range(num_epochs):
    # Train the model
    model.train()
    train_loss = 0
    correct_train = 0
    total_train = 0
    for data, labels in train_loader:
        optimizer.zero_grad()
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)

        labels = torch.tensor(labels).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_train += (preds == labels).sum().item()
        total_train += data['input_ids'].size(0)
    train_losses.append(train_loss / total_train)
    train_accs.append(correct_train / total_train)

    # Evaluate the model on the validation set
    model.eval()
    val_loss = 0
    correct_val = 0
    total_val = 0
    with torch.no_grad():
        for data, labels in val_loader:
            data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
            labels = torch.tensor(labels.values).to(device)
            outputs = model(**data, labels=labels)
            loss = outputs.loss
            val_loss += loss.item() * data['input_ids'].size(0)
            preds = torch.argmax(outputs.logits, axis=1)
            correct_val += (preds == labels).sum().item()
            total_val += data['input_ids'].size(0)
    val_losses.append(val_loss / total_val)
    val_accs.append(correct_val / total_val)

    # Print the training and validation loss and accuracy for each epoch
    print(f'Epoch {epoch+1}/{num_epochs}:')
    print(f'Train loss: {train_losses[-1]:.4f}, Train acc: {train_accs[-1]*100:.2f}%')
    print(f'Val loss: {val_losses[-1]:.4f}, Val acc: {val_accs[-1]*100:.2f}%')

    # Plot the training and validation loss and accuracy for each epoch
    plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    plt.plot(train_losses, label='Training Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.subplot(1,2,2)
    plt.plot(train_accs, label='Training Accuracy')
    plt.plot(val_accs, label='Validation Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

# Merge training and validation sets and retrain model with optimal hyperparameters
train_val_data = pd.concat([train_data, val_data], axis=0)
train_val_loader = create_data_loader(train_val_data, batch_size)

# Train model with optimal hyperparameters on merged training and validation set
model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
optimizer = AdamW(model.parameters(), lr=lr, eps=eps)
total_steps = len(train_val_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=total_steps)

for epoch in range(num_epochs):
    # Train the model
    model.train()
    train_loss = 0
    correct_train = 0
    total_train = 0
    for data, labels in train_val_loader:
        optimizer.zero_grad()
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
        labels = torch.tensor(labels.values).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_train += (preds == labels).sum().item()
        total_train += data['input_ids'].size(0)
        
        train_losses.append(train_loss / total_train)
        train_accs.append(correct_train / total_train)

# Save the trained model.
torch.save(model.state_dict(), 'clinical_bert_model.pth')

# Load the saved model and perform accuracy testing on the test set
model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
model.load_state_dict(torch.load('clinical_bert_model.pth'))
model.eval()
test_loss = 0
correct_test = 0
total_test = 0
with torch.no_grad():
    for data, labels in test_loader:
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
        labels = torch.tensor(labels.values).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        test_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_test += (preds == labels).sum().item()
        total_test += data['input_ids'].size(0)

print(f'Test loss: {test_loss / total_test:.4f}, Test acc: {correct_test / total_test*100:.2f}%')

Just ask it to debug the code for you.

This answer references GPTᴼᴾᴱᴺᴬᴵ
From the code you provided, I can see that you are using the Bio_ClinicalBERT model for text classification. When using this model, you need to pay attention to the following points:

1. For a text-classification task, you need to specify the number of labels the model outputs (num_labels); here it should be set to 2, i.e. healthy and unhealthy.

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)


2. The labels in your dataset should be 0 or 1, not strings; you can use LabelEncoder to convert them, as in the sketch below.
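A minimal sketch of that conversion, assuming the label column is df['1'] as in the posted code and holds the two class strings:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform maps the class strings to integers, e.g.
# ['healthy', 'unhealthy'] -> [0, 1]; le.classes_ records the mapping.
labels = le.fit_transform(df['1'])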

3. When training the model, you need to change the following line accordingly:

data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)


to:

data = {k: v.to(device) for k, v in data.items()}


4. Inside the train_loader loop, labels should be converted to torch.LongTensor to avoid dtype errors:

labels = torch.LongTensor(labels).to(device)
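Points 3 and 4 together also imply changing how the loop unpacks batches: the DataLoader built from MedicalRecordDataset yields one dict per batch (the labels are already folded in under the 'labels' key), not (data, labels) pairs, and the texts were already tokenized when the encodings were built, so they must not go through the tokenizer again. A minimal sketch of the corrected epoch loop, assuming the model, optimizer and train_loader from the posted code:

model.train()
for batch in train_loader:
    optimizer.zero_grad()
    # Point 3: move every tensor in the batch dict to the GPU.
    batch = {k: v.to(device) for k, v in batch.items()}
    # Point 4: the cross-entropy loss expects integer class labels.
    batch['labels'] = batch['labels'].long()
    # Passing 'labels' makes the model compute and return the loss itself.
    outputs = model(**batch)
    outputs.loss.backward()
    optimizer.step()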


Drawing on GPT and on my own reasoning: there are many possible causes for this code failing, and it needs further step-by-step debugging. Here are some common errors and their fixes; a consolidated code sketch covering them follows the list:

1. ValueError: logits and labels must have the same shape ((batch_size, num_labels) vs (batch_size,)). This is a shape mismatch between logits and labels when the loss is computed. Check what your loss setup expects: for single-label classification with AutoModelForSequenceClassification, the labels should be a 1-D tensor of class indices with shape (batch_size,); one-hot labels of shape (batch_size, num_labels) are only appropriate for a multi-label setup or a hand-written BCE-style loss.

2. RuntimeError: CUDA out of memory. The GPU has run out of memory. Two fixes: reduce batch_size, or fall back to running the model and data on the CPU.

3. TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first. This occurs when converting a tensor that lives on the GPU into a numpy array; call .cpu() on it first to copy it back to host memory.

4. AttributeError: 'str' object has no attribute 'size'. This usually means raw text reached code that expects tensors, e.g. a string or list of strings was passed to the model or to a .size() call. Make sure every batch has been run through the tokenizer and converted to tensors before it is handed to the model.

5. ValueError: Expected input batch_size (x) to match target batch_size (y). This appears when the inputs and labels in a batch end up with mismatched sizes, typically because the texts were not padded to a common length when the dataset was built. Fix: tokenize with padding=True (or padding='max_length') and a fixed max_length.
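Putting these fixes together with the posted code, here is a hedged sketch of the pieces most likely to be failing, namely the dataset construction and an evaluation loop. It assumes the MedicalRecordDataset class, tokenizer, model, device, train_texts/train_labels and test_loader from the question are in scope, and that the labels have already been encoded to 0/1 as in point 2 above:

import torch
from torch.utils.data import DataLoader

# Item 5: pad/truncate every record to one fixed length so batches collate cleanly.
train_encodings = tokenizer(list(train_texts), padding='max_length',
                            truncation=True, max_length=512, return_tensors='pt')

# Pass the labels as a plain Python list: after train_test_split a pandas Series
# keeps its original index, so self.labels[idx] inside the Dataset would raise
# a KeyError for positional indices.
train_dataset = MedicalRecordDataset(train_encodings, list(train_labels))

# Item 2: if you hit CUDA out of memory, shrink the batch size here.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Item 3: move predictions back to the CPU before any numpy use.
model.eval()
all_preds = []
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(**batch).logits
        all_preds.extend(torch.argmax(logits, dim=1).cpu().numpy())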