from boilerpipe.extract import Extractor
# html is assumed to already hold the page source as a string
extractor = Extractor(extractor='ArticleExtractor', html=html)
text = extractor.getText()  # extracted article text
I scraped some links and saved them in a txt file, like this:
https://www.bsfuji.tv/akitamogul2020/pub/index.html
https://www.sponichi.co.jp/sports/news/2020/02/21/kiji/20200221s00048000307000c.html
https://www.sponichi.co.jp/sports/news/2020/02/21/kiji/20200221s00048000307000c.html
How can I strip the html parts from these so they become links like www.bsfuji.tv/akitamogul2020/pub?
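import re
url = 'https://www.bsfuji.tv/akitamogul2020/pub/index.html'
# Keep everything between the "://" of the scheme and the trailing "/<name>.html"
match = re.search(r'(?<=://).+(?=/[^/]+\.html$)', url)
if match:  # search() returns None when the link does not end in .html
    print(match.group())  # www.bsfuji.tv/akitamogul2020/pub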
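
If you would rather not rely on a regex, here is a minimal sketch of the same idea using the standard library's urllib.parse, applied to every line instead of a single string. The names strip_link, strip_links, strip_links_in_file and the file name links.txt are my own placeholders, not anything from the question. Dropping duplicate links (the sample list contains one) is an assumption about what you want, so remove that step if you need to keep repeats.

```python
import posixpath
from urllib.parse import urlsplit


def strip_link(url):
    """Return host + directory path, without the scheme or a trailing *.html file name."""
    parts = urlsplit(url.strip())
    path = parts.path
    if path.endswith('.html'):
        # Drop the last path segment, e.g. "/pub/index.html" -> "/pub"
        path = posixpath.dirname(path)
    return parts.netloc + path.rstrip('/')


def strip_links(lines):
    """Strip every non-empty line; duplicates are dropped, first-seen order is kept (assumption)."""
    stripped = (strip_link(line) for line in lines if line.strip())
    return list(dict.fromkeys(stripped))


def strip_links_in_file(path):
    """Read one link per line from a text file (hypothetical layout) and strip each one."""
    with open(path, encoding='utf-8') as f:
        return strip_links(f)


# The three sample links from the question
sample = [
    'https://www.bsfuji.tv/akitamogul2020/pub/index.html',
    'https://www.sponichi.co.jp/sports/news/2020/02/21/kiji/20200221s00048000307000c.html',
    'https://www.sponichi.co.jp/sports/news/2020/02/21/kiji/20200221s00048000307000c.html',
]
for link in strip_links(sample):
    print(link)
```

For the real file you would call something like strip_links_in_file('links.txt'). Unlike the regex, this also handles http:// links, and it leaves links that do not end in .html with their full path.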