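PaddlePaddle PP-PicoDet training reports illegal dataset files

While training PP-PicoDet on Baidu PaddlePaddle AI Studio, the following output appears during the training step: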

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/__init__.py:107: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import MutableMapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/rcsetup.py:20: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Iterable, Mapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/colors.py:53: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Sized
[05/12 14:02:15] ppdet.data.source.voc WARNING: Illegal xml file: /home/aistudio/data/voc_voc/Annotations/0000195.JPG, and it will be ignored
[05/12 14:02:15] ppdet.data.source.voc WARNING: Illegal xml file: /home/aistudio/data/voc_voc/Annotations/0000025.JPG, and it will be ignored
[05/12 14:02:15] ppdet.data.source.voc WARNING: Illegal xml file: /home/aistudio/data/voc_voc/Annotations/0000115.JPG, and it will be ignored
[05/12 14:02:15] ppdet.data.source.voc WARNING: Illegal xml file: /home/aistudio/data/voc_voc/Annotations/0000130.JPG, and it will be ignored
[05/12 14:02:15] ppdet.data.source.voc WARNING: Illegal xml file: /home/aistudio/data/voc_voc/Annotations/0000084.JPG, and it will be ignored





How can I fix this?

Answer drawing on GPT:

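First, the DeprecationWarning messages. They come from matplotlib, which imports abstract base classes (MutableMapping, Iterable, Mapping, Sized) from 'collections' instead of 'collections.abc'. On the Python 3.7 shown in the log this still works and does not affect your program, but the old import path is deprecated and later Python versions remove it. To avoid problems in the future, upgrade the library (here, matplotlib), or in your own code import these classes from 'collections.abc'.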
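If the warnings are only noise, they can be filtered with the standard warnings module. The following is a minimal sketch, assuming you only want to hide DeprecationWarning messages attributed to matplotlib modules; the helper name silence_matplotlib_deprecations is made up for this example.

```python
import warnings


def silence_matplotlib_deprecations():
    """Ignore DeprecationWarning attributed to matplotlib and its submodules."""
    warnings.filterwarnings(
        "ignore",
        category=DeprecationWarning,
        # Regex matched against the start of the module name that triggers the warning.
        module=r"matplotlib(\.|$)",
    )
```

Call it before importing matplotlib in the same process. When training is started as a separate command, the equivalent is Python's own -W option, for example python -W ignore::DeprecationWarning followed by the script path, which hides all DeprecationWarning messages rather than only matplotlib's.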

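Second, you appear to be training with PaddleDetection, PaddlePaddle's object detection library, and it is skipping some annotation entries in your VOC dataset. Note that every path in the warnings ends in .JPG even though it sits under Annotations, so the annotation column of your list file may be pointing at image names rather than .xml files; as far as I know, PaddleDetection prints this warning whenever the annotation path is not an existing file. Check the listed files: do they exist, are they complete, and are they correctly formatted? If those samples do not matter, you can ignore the warnings, because the entries are simply skipped. If they contain important training data, you need to fix or regenerate the XML files, or the list file that references them.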
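One way to see which entries will be skipped is to audit the list file yourself. The sketch below assumes the usual PaddleDetection VOC list layout, one "image path, xml path" pair per line with both paths relative to the dataset directory; the function name audit_voc_list and the reason strings are invented for this example.

```python
import os


def audit_voc_list(list_path, dataset_dir):
    """Return (line_number, reason, line) for entries that look unusable.

    Assumes each non-blank line is "<image path> <xml path>", both
    relative to dataset_dir.
    """
    problems = []
    with open(list_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if len(parts) < 2:
                problems.append((lineno, "missing annotation column", line.strip()))
                continue
            img_rel, xml_rel = parts[:2]
            if not xml_rel.lower().endswith(".xml"):
                problems.append((lineno, "annotation path is not an .xml file", line.strip()))
            elif not os.path.isfile(os.path.join(dataset_dir, xml_rel)):
                problems.append((lineno, "xml file not found", line.strip()))
            elif not os.path.isfile(os.path.join(dataset_dir, img_rel)):
                problems.append((lineno, "image file not found", line.strip()))
    return problems
```

For the dataset in the question, you might call audit_voc_list with the path of your training list file and '/home/aistudio/data/voc_voc', then print each returned tuple. The list file name is not shown in the log, so substitute your own.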

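The general steps for repairing an XML file are:

1. Open the XML file and look for obvious format errors, such as missing or unclosed tags and misspelled element names.

2. Paste the file's contents into an online XML validator such as https://www.xmlvalidation.com/ and check for reported errors.

3. If the XML files were generated by a tool, check whether that tool has known issues, or try regenerating the files with different settings.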
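The same well-formedness check can be run locally on a whole directory instead of pasting files into a website one by one. This is a small sketch using Python's built-in xml.etree.ElementTree; the function name find_malformed_xml is made up here, and it only checks that the XML parses, not that it follows the VOC layout.

```python
import glob
import os
import xml.etree.ElementTree as ET


def find_malformed_xml(ann_dir):
    """Return {path: error message} for every .xml file that fails to parse."""
    bad = {}
    for path in sorted(glob.glob(os.path.join(ann_dir, "*.xml"))):
        try:
            ET.parse(path)
        except ET.ParseError as exc:
            # ParseError carries the line and column of the first syntax error
            bad[path] = str(exc)
    return bad
```

An empty result means every .xml file in the directory is at least syntactically valid XML.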

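This answer combines my own reasoning with a GPT (OpenAI) search; the details are as follows:
The last lines of the log you posted show that the dataset contains annotation entries that PaddleDetection treats as illegal xml files. These are warnings, and the affected samples are dropped, so training is left with fewer samples or none at all. Check that the xml files in the dataset are correct and match the format the dataset loader expects.

Normally each image has an xml file holding its annotations, such as the class, position, and size of each object. You can parse the xml files with an XML parser and check that they contain the expected annotation fields.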

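You can also load the dataset the same way the trainer does. In PaddleDetection 2.x, as far as I recall, the ppdet.data.source.voc module defines the VOCDataSet class (I am not aware of a VocDetection class or a check_dataset parameter there), and its parse_dataset() method reads the list file. Verify the names against your installed version; the paths and file names below are placeholders:

from ppdet.data.source.voc import VOCDataSet

dataset = VOCDataSet(
    dataset_dir='path/to/data',
    anno_path='train.txt',        # list file, relative to dataset_dir
    label_list='label_list.txt',  # class names, one per line
    data_fields=['image', 'gt_bbox', 'gt_class'],
)
dataset.parse_dataset()

parse_dataset() logs the same "Illegal xml file" warning for every entry whose annotation file it cannot find and skips that entry, so you can reproduce the check without starting a training run.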

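Alternatively, you can check by hand that each xml file contains the expected annotations, for example:

import xml.etree.ElementTree as ET

def parse_xml(xml_path):
    """Return a list of (name, xmin, ymin, xmax, ymax) from a VOC xml file."""
    tree = ET.parse(xml_path)  # raises ET.ParseError if the file is not valid XML
    root = tree.getroot()
    boxes = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        bbox = obj.find('bndbox')
        xmin = float(bbox.find('xmin').text)
        ymin = float(bbox.find('ymin').text)
        xmax = float(bbox.find('xmax').text)
        ymax = float(bbox.find('ymax').text)
        # Collect the annotation; process it further as needed
        boxes.append((name, xmin, ymin, xmax, ymax))
    return boxes

In the code above, parse_xml uses Python's built-in xml.etree.ElementTree module to parse an xml file and extract the annotations. A missing name or bndbox element raises AttributeError, which also tells you the file is incomplete. Process the returned annotations according to your needs.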
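A file can parse cleanly and still hold unusable boxes. The sketch below extends the manual check to report objects whose box is missing, non-numeric, or empty; the function name find_bad_boxes and the message wording are invented for illustration, and the rule "xmax must exceed xmin, ymax must exceed ymin" is an assumption about what counts as a valid box.

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_path):
    """Return descriptions of objects whose bounding box is missing or degenerate."""
    root = ET.parse(xml_path).getroot()
    problems = []
    for i, obj in enumerate(root.findall("object")):
        name = obj.findtext("name", default="<unnamed>")
        bbox = obj.find("bndbox")
        if bbox is None:
            problems.append(f"object {i} ({name}): no bndbox")
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(bbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            # findtext returns None for a missing tag; float() rejects None and non-numbers
            problems.append(f"object {i} ({name}): non-numeric or missing coordinate")
            continue
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"object {i} ({name}): empty box")
    return problems
```

Running it over every file in Annotations and printing the non-empty results gives a quick list of annotations worth re-labelling.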

In short, check that the xml files in the dataset are correct and fix them as needed. That keeps the dataset valid and avoids problems during training.


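The following was generated jointly by ChatGPT and 阿里嘎多学长:

The problem is caused by illegal xml files in the training dataset. You can resolve it with the following steps:

1. Check that the xml files in the training dataset are well-formed; a VOC annotation tool can be used for this.
2. Delete the illegal xml files, or re-annotate to produce correct ones.
3. Update the dataset path in the training config file and make sure it is correct.
4. Rerun the training command.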

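Example commands follow; adjust them to your setup. All paths are placeholders, and voc_annotation.py stands for a checking script of your own, since to my knowledge PaddleDetection does not ship a script by that name with a --check flag:

# Check that the xml files are well-formed (placeholder script)
!python /path/to/voc_annotation.py --check /path/to/Annotations

# Delete the illegal entries reported in the log
# (if rm says a file does not exist, the list file itself is wrong)
!rm /path/to/Annotations/0000195.JPG
!rm /path/to/Annotations/0000025.JPG

# Update the dataset path in the training config file
!sed -i 's#/path/to/dataset#/path/to/new/dataset#g' /path/to/train_config.yml

# Rerun the training command
!python /path/to/train.py --config /path/to/train_config.yml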
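If the list file turns out to be the culprit, it is often easier to regenerate it than to patch it. The sketch below pairs every image with an xml file of the same base name. The directory names JPEGImages and Annotations, the set of image extensions, and the function name build_voc_list are assumptions for this example; match them to your dataset.

```python
import os

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def build_voc_list(dataset_dir, img_subdir="JPEGImages", ann_subdir="Annotations"):
    """Return "<image> <xml>" lines for images that have a matching xml file."""
    lines = []
    img_dir = os.path.join(dataset_dir, img_subdir)
    for fname in sorted(os.listdir(img_dir)):
        stem, ext = os.path.splitext(fname)
        if ext.lower() not in IMAGE_EXTS:
            continue  # not an image
        xml_rel = f"{ann_subdir}/{stem}.xml"
        if os.path.isfile(os.path.join(dataset_dir, xml_rel)):
            # Paths are written relative to dataset_dir
            lines.append(f"{img_subdir}/{fname} {xml_rel}")
    return lines
```

Write the returned lines, one per line, to the list file named in your training config. Images without an annotation are left out of the list but are not deleted.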

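The following answer was written jointly by the GPT-3.5 model and the blogger 波罗歌:
The DeprecationWarning messages only indicate that deprecated imports are in use; they are not what causes training to fail. The real problem is that some dataset entries have the wrong format or content when the dataset is read. Check that the format and content of the dataset meet the requirements.

For this problem, first check that every JPG file under the dataset path has a corresponding XML annotation file. Remove any image that has none, or create a new annotation file for it, so that images and annotation files correspond one to one.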

Code:

import os

def check_ann(img_path, ann_path):
    """Delete every image in img_path that has no matching .xml in ann_path."""
    files = os.listdir(img_path)
    print('Number of images:', len(files))
    for image in files:
        # splitext keeps file names that contain extra dots intact
        img_name = os.path.splitext(image)[0]
        ann_file = os.path.join(ann_path, img_name + '.xml')
        if not os.path.exists(ann_file):
            # Destructive: back up the dataset before running this
            os.remove(os.path.join(img_path, image))
            print('Annotation missing:', ann_file)

if __name__ == '__main__':
    # Dataset path
    data_path = './data/VOCdevkit/VOC2007'
    img_path = os.path.join(data_path, 'JPEGImages')
    ann_path = os.path.join(data_path, 'Annotations')
    check_ann(img_path, ann_path)

This deletes the images that have no annotation file; after that, rerun the training code.
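Because deleting images cannot be undone, a dry run that only reports mismatches may be a safer first step. The sketch below lists mismatches in both directions and removes nothing; the function name find_orphans is made up, and it assumes every non-xml file in the image directory is an image.

```python
import os


def find_orphans(img_dir, ann_dir):
    """Return (images without xml, xml files without image) as sorted base names."""
    img_stems = {
        os.path.splitext(f)[0]
        for f in os.listdir(img_dir)
        if not f.lower().endswith(".xml")
    }
    ann_stems = {
        os.path.splitext(f)[0]
        for f in os.listdir(ann_dir)
        if f.lower().endswith(".xml")
    }
    return sorted(img_stems - ann_stems), sorted(ann_stems - img_stems)
```

Review both lists first, then decide whether to delete, move, or re-annotate the affected files.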

PaddlePaddle deep learning bootcamp: learning notes
This may be a useful reference:
https://blog.csdn.net/CHICLEEXXX/article/details/104484113/