【python】【文字识别】百度文档识别返回的结果,整理为dataframe表格

一、背景
pdf文档中有一个表格(请见附一),已获得百度文档识别返回的结果(请见附二,节选红框所示的三行)。

二、目标
将返回的结果整理成python_dataframe表格

三、请帮助
如何使用python实现?谢谢

附一:原始表格

img

附二:初步识别代码(json)

a = [
    {
        "words_location": {"top": 382, "left": 124, "width": 52, "height": 12},
        "word": "豪华客房",
    },
    {
        "words_location": {"top": 383, "left": 280, "width": 59, "height": 11},
        "word": "14501530",
    },
    {
        "words_location": {"top": 383, "left": 425, "width": 23, "height": 10},
        "word": "450",
    },
    {
        "words_location": {"top": 383, "left": 553, "width": 28, "height": 11},
        "word": "510",
    },
    {
        "words_location": {"top": 383, "left": 689, "width": 25, "height": 10},
        "word": "NA ",
    },
    {
        "words_location": {"top": 412, "left": 113, "width": 76, "height": 13},
        "word": "高级豪华客房",
    },
    {
        "words_location": {"top": 414, "left": 277, "width": 61, "height": 11},
        "word": "5001580",
    },
    {
        "words_location": {"top": 413, "left": 424, "width": 23, "height": 11},
        "word": "500",
    },
    {
        "words_location": {"top": 413, "left": 554, "width": 26, "height": 11},
        "word": "560",
    },
    {
        "words_location": {"top": 413, "left": 690, "width": 22, "height": 10},
        "word": "NA ",
    },
    {
        "words_location": {"top": 442, "left": 111, "width": 76, "height": 12},
        "word": "行攻豪华客房",
    },
    {
        "words_location": {"top": 444, "left": 278, "width": 60, "height": 12},
        "word": "1700/1700",
    },
    {
        "words_location": {"top": 444, "left": 424, "width": 25, "height": 10},
        "word": "600",
    },
    {
        "words_location": {"top": 444, "left": 554, "width": 27, "height": 10},
        "word": "600",
    },
    {
        "words_location": {"top": 444, "left": 689, "width": 22, "height": 10},
        "word": "NA ",
    },
]


import pprint
import pandas as pd
df=pd.DataFrame(columns=['B','C','D','E'])
dic = [
    {
        "words_location": {"top": 382, "left": 124, "width": 52, "height": 12},
        "word": "豪华客房",
    },
    {
        "words_location": {"top": 383, "left": 280, "width": 59, "height": 11},
        "word": "14501530",
    },
    {
        "words_location": {"top": 383, "left": 425, "width": 23, "height": 10},
        "word": "450",
    },
    {
        "words_location": {"top": 383, "left": 553, "width": 28, "height": 11},
        "word": "510",
    },
    {
        "words_location": {"top": 383, "left": 689, "width": 25, "height": 10},
        "word": "NA ",
    },
    {
        "words_location": {"top": 412, "left": 113, "width": 76, "height": 13},
        "word": "高级豪华客房",
    },
    {
        "words_location": {"top": 414, "left": 277, "width": 61, "height": 11},
        "word": "5001580",
    },
    {
        "words_location": {"top": 413, "left": 424, "width": 23, "height": 11},
        "word": "500",
    },
    {
        "words_location": {"top": 413, "left": 554, "width": 26, "height": 11},
        "word": "560",
    },
    {
        "words_location": {"top": 413, "left": 690, "width": 22, "height": 10},
        "word": "NA ",
    },
    {
        "words_location": {"top": 442, "left": 111, "width": 76, "height": 12},
        "word": "行攻豪华客房",
    },
    {
        "words_location": {"top": 444, "left": 278, "width": 60, "height": 12},
        "word": "1700/1700",
    },
    {
        "words_location": {"top": 444, "left": 424, "width": 25, "height": 10},
        "word": "600",
    },
    {
        "words_location": {"top": 444, "left": 554, "width": 27, "height": 10},
        "word": "600",
    },
    {
        "words_location": {"top": 444, "left": 689, "width": 22, "height": 10},
        "word": "NA ",
    },

]
k=0
for i in range(len(dic)//5):
    a=dic[5*k+0]["word"]
    b=dic[5*k+1]['word']
    c=dic[5*k+2]['word']
    d=dic[5*k+3]['word']
    e = dic[5 * k +4]['word']
    df.loc[a]=[b,c,d,e]
    k+=1
pprint.pprint(df)

识别结果返回 json 也太抽象了,应该有直接返回 Excel 的选项

Python有个 json 库,说不定你可以用到。