How do I sample an imbalanced dataset?

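The code below preprocesses my sample data, which contains nine fault types and one normal type. It splits the data into training, test, and validation sets in a 7:2:1 ratio, with 1000 samples per class. For imbalanced classification, I want the ratio of normal samples to fault samples to be 200:1. How should I write the code?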

    for data_type in range(10):
        fs = 12000
        t = 0.1
        opt = "0-"
        N = 1024
        data = all_data[data_type]

        for load_type in range(1):
            load_data = data

            start = 0
            for i in range(700):
                temp = load_data[start: start + N]
           
                start += 100
                
                temp = ([i for arr in temp for i in arr])
                f, t, nd = signal.stft(temp, fs=fs, window='hann', nperseg=64, noverlap=None, nfft=None,
                                       detrend=False, return_onesided=True, boundary='zeros', padded=True, axis=-1)
                Gaussian = cv2.GaussianBlur(abs(nd), (3, 3), 1)
                stfts_train_pics.append(Gaussian)
                stfts_train_labels.append(data_type)

            for i in range(200):
                temp = load_data[start: start + N]
                
                start += 100
                
                temp = ([i for arr in temp for i in arr])
                f, t, nd = signal.stft(temp, fs=fs, window='hann', nperseg=64, noverlap=None, nfft=None,
                                       detrend=False, return_onesided=True, boundary='zeros', padded=True, axis=-1)
                Gaussian = cv2.GaussianBlur(abs(nd), (3, 3), 1)
                stfts_test_pics.append(Gaussian)
                stfts_test_labels.append(data_type)

            for i in range(100):
                temp = load_data[start: start + N]
                start += 100
               
                temp = ([i for arr in temp for i in arr])
                f, t, nd = signal.stft(temp, fs=fs, window='hann', nperseg=64, noverlap=None, nfft=None,
                                       detrend=False, return_onesided=True, boundary='zeros', padded=True, axis=-1)
                Gaussian = cv2.GaussianBlur(abs(nd), (3, 3), 1)
                stfts_validation_pics.append(Gaussian)
                stfts_validation_labels.append(data_type)

[The following answer was generated by GPT]

The key question is how to subsample so that the ratio of normal samples to fault samples is 200:1. Here is one possible solution with example code:

  1. Decide how many samples each class needs. With 1000 normal samples and a normal-to-fault ratio of 200:1, each fault class keeps 1000 / 200 = 5 samples. This assumes the ratio applies to each fault class separately and that label 0 is the normal class; adjust both to match your data:

    NORMAL_LABEL = 0  # assumption: index of the normal class in all_data
    num_normal_samples = 1000
    num_fault_samples = num_normal_samples // 200  # 5 samples per fault class

  2. Create an empty list to hold the samples of every class. Despite its name, the list will hold the imbalanced set:

    balanced_data = []

  3. Loop over the data types, pick the sample count for each class, and append the samples to the balanced_data list:

    for data_type in range(10):
        fs = 12000
        N = 1024
        data = all_data[data_type]

        # The normal class keeps all its samples; each fault class keeps only a few
        wanted = num_normal_samples if data_type == NORMAL_LABEL else num_fault_samples
        # Never ask for more windows than the signal can supply (length N, stride 100)
        available = max(0, (len(data) - N) // 100 + 1)
        num_samples = min(wanted, available)

        start = 0
        for _ in range(num_samples):
            temp = data[start: start + N]
            start += 100

            # Flatten the (N, 1) slice into a plain list of values
            temp = [x for arr in temp for x in arr]
            f, t, nd = signal.stft(temp, fs=fs, window='hann', nperseg=64, noverlap=None, nfft=None,
                                   detrend=False, return_onesided=True, boundary='zeros', padded=True, axis=-1)
            Gaussian = cv2.GaussianBlur(abs(nd), (3, 3), 1)
            balanced_data.append((Gaussian, data_type))

  4. The loop above stores each spectrogram in the balanced_data list together with its data type, as a tuple.
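The counting logic behind these steps can be checked on its own, without loading any signals. The sketch below is only an illustration: `class_quota` and `window_starts` are hypothetical helper names, and it assumes label 0 is the normal class, that the 200:1 ratio applies to each fault class separately, and that windows are 1024 points long with a stride of 100 as in the question.

```python
def class_quota(label, normal_label=0, n_normal=1000, ratio=200):
    """Number of windows to keep for `label` under a normal:fault ratio of `ratio`:1."""
    if label == normal_label:
        return n_normal
    # Keep at least one window per fault class so the class does not vanish
    return max(1, n_normal // ratio)


def window_starts(signal_len, wanted, window=1024, stride=100):
    """Start offsets of up to `wanted` full-length windows inside a signal."""
    if signal_len < window:
        return []
    available = (signal_len - window) // stride + 1
    return [k * stride for k in range(min(wanted, available))]


# Quota for each of the ten classes (label 0 assumed to be normal)
quotas = {label: class_quota(label) for label in range(10)}
# Offsets for one fault class on a signal of 120000 points (made-up length)
starts = window_starts(signal_len=120_000, wanted=quotas[1])
print(quotas)
print(starts)
```

Each offset `s` corresponds to the slice `data[s: s + 1024]`, so the STFT and Gaussian blur from the question can be applied to exactly those slices.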

The balanced_data list now holds the samples in the required ratio, and it can then be split into training, test, and validation sets.
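One way to do that split is class by class, so that every subset keeps roughly the same normal-to-fault ratio. The following is a minimal sketch, not a definitive implementation: `stratified_split` is a hypothetical helper, the toy list stands in for the real (spectrogram, label) tuples, and items are taken in their original order (as the question's code does) rather than shuffled.

```python
from collections import defaultdict


def stratified_split(samples, parts=(7, 2, 1)):
    """Split (features, label) pairs into train/test/validation lists, class by class."""
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))

    total = sum(parts)
    train, test, val = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        # Integer arithmetic avoids floating-point rounding surprises
        n_train = len(items) * parts[0] // total
        n_test = len(items) * parts[1] // total
        train += items[:n_train]
        test += items[n_train:n_train + n_test]
        val += items[n_train + n_test:]  # remainder, so nothing is dropped
    return train, test, val


# Toy data: 1000 "normal" items (label 0) and 5 items for each of two fault labels
toy = [(None, 0)] * 1000 + [(None, 1)] * 5 + [(None, 2)] * 5
train, test, val = stratified_split(toy)
print(len(train), len(test), len(val))
```

With only 5 samples in a fault class, the 7:2:1 split gives that class 3, 1, and 1 samples, so the ratio in each subset is only approximately 200:1. Because consecutive windows overlap heavily (stride 100, length 1024), keeping them in order limits how much nearly identical data lands in different subsets.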

Note that this example is only one possible approach, and the code may need adjusting for your situation. It can still serve as a starting point for further refinement.


