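Walk a folder tree, find the files that contain "媒体质疑" ("questioned by the media"), and print their names.
Partway through, a few errors are reported and skipped; at the end a flood of errors appears and the script stops.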
import os
import openpyxl
import fitz  # PyMuPDF

KEYWORD = '媒体质疑'  # search term: "questioned by the media"
path = r'D:\下载\数据3 - 副本'
os.chdir(path)

my = openpyxl.Workbook()
mywb = my["Sheet"]
row = 0
count = 0


def count_hits(folder):
    # Count the pages under `folder` that contain KEYWORD and print
    # the path of the file once for every matching page.
    hits = 0
    for file in os.listdir(folder):
        full = os.path.join(folder, file)
        # Close each document when done; the original never closed them.
        with fitz.open(full) as doc:
            for page in doc:
                if KEYWORD in page.get_text():
                    hits += 1
                    print(full)
    return hits


# Each top-level entry is one company folder.
for d1 in os.listdir():
    check1 = os.path.join(d1, '信息披露', '注册稿')
    check2 = os.path.join(d1, '问询与回复')
    tmpcount = count_hits(check1) + count_hits(check2)
    row += 1
    mywb.cell(row, 1, d1)
    # Column 2 is 1 if the company has at least one hit, otherwise 0.
    if tmpcount >= 1:
        count += 1
        mywb.cell(row, 2, 1)
    else:
        mywb.cell(row, 2, 0)

print("Number of companies questioned by the media:", count)
my.save(r'C:\Users\huang\Desktop' + "\\媒体质疑.xlsx")
mupdf: xref generation number missing
mupdf: expected object number
mupdf: cannot find startxref
mupdf: object out of range (1358 0 R); xref size 1353
mupdf: object is not a stream
mupdf: invalid ICC colorspace
mupdf: realloc (257379 bytes) failed
mupdf: malloc of 332758 bytes failed
RuntimeError: malloc of 115172 bytes failed
I can't make sense of these errors.
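One possible reading of the log above: the final RuntimeError is an ordinary Python exception, so it can be caught per file instead of ending the whole run. Below is a minimal sketch of that idea. `scan_files` and `read_text` are hypothetical names, not part of PyMuPDF; the reader is passed in as a function so the skipping logic can be tried without any PDFs.

```python
from typing import Callable, Iterable, List, Tuple


def scan_files(
    paths: Iterable[str],
    read_text: Callable[[str], str],
    keyword: str,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (paths whose text contains keyword, [(path, error)] for unreadable files)."""
    matches: List[str] = []
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            text = read_text(path)
        except Exception as exc:  # e.g. the RuntimeError raised for a damaged PDF
            # Record the failure and move on to the next file.
            failures.append((path, f"{type(exc).__name__}: {exc}"))
            continue
        if keyword in text:
            matches.append(path)
    return matches, failures
```

With PyMuPDF, `read_text` could open the file in a `with fitz.open(path) as doc:` block and join `page.get_text()` over the pages, so that each document is closed before the next one is opened. Whether unclosed documents are what led to the malloc failures here is only a guess.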
All files were read successfully.
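Did you ever solve this? I've run into something similar. I wanted to catch this error but couldn't. I even went into the source code and couldn't find where the message comes from.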
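A likely reason the "mupdf: ..." lines cannot be caught is that they are diagnostics printed to stderr by the underlying MuPDF library, not Python exceptions. Here is a sketch of one way to silence them temporarily, assuming your PyMuPDF version exposes `fitz.TOOLS.mupdf_display_errors`. `quiet_mupdf` is a hypothetical helper name; the tools object is passed in, so the helper itself does not import fitz.

```python
from contextlib import contextmanager


@contextmanager
def quiet_mupdf(tools):
    """Temporarily stop MuPDF from printing its error messages to stderr.

    `tools` is assumed to behave like fitz.TOOLS: calling
    mupdf_display_errors() with no argument returns the current setting,
    and calling it with a bool changes the setting.
    """
    previous = tools.mupdf_display_errors()  # remember the current setting
    tools.mupdf_display_errors(False)        # silence the messages
    try:
        yield tools
    finally:
        tools.mupdf_display_errors(previous)  # restore it, even after an exception
```

Usage would look like `with quiet_mupdf(fitz.TOOLS): doc = fitz.open(path)`. The silenced text should still be retrievable afterwards through `fitz.TOOLS.mupdf_warnings()`, if your version provides it.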
What problem are you running into now when reading the PDFs?