How can I use Python boilerpipe to strip the HTML markup from crawled links?

from boilerpipe.extract import Extractor

extractor = Extractor(extractor='ArticleExtractor', html=html)

extractor.getText()

I crawled some links and saved them in a txt file. They look like this:

https://www.bsfuji.tv/akitamogul2020/pub/index.html

https://www.sponichi.co.jp/sports/news/2020/02/21/kiji/20200221s00048000307000c.html

https://www.azcentral.com/story/sports/high-school/2020/02/21/trevor-browne-copper-canyon-fill-head-football-coaching-vacancies/4836472002/

https://www.sponichi.co.jp/sports/news/2020/02/21/kiji/20200221s00048000307000c.html

How do I strip the HTML parts from these so that they become links like www.bsfuji.tv/akitamogul2020/pub?

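import re

string = 'https://www.bsfuji.tv/akitamogul2020/pub/index.html'
# Keep everything between the "https://" prefix and the trailing "/<name>.html".
match = re.search(r'(?<=https://).+(?=/[^/]+\.html)', string)
# re.search returns None when the URL does not end in ".html", so check before calling .group().
result = match.group() if match else string
print(result)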

www.bsfuji.tv/akitamogul2020/pub
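
Not every link in the list ends in .html (the azcentral.com one ends in a slash), so a pattern that requires ".html" will not match it. As a possible alternative, here is a minimal sketch that uses the standard library's urllib.parse instead of a regex. The function name strip_url is made up for illustration, and the sketch assumes that "removing the HTML part" means dropping the scheme and a trailing *.html file name:

```python
from urllib.parse import urlsplit


def strip_url(url):
    """Return host + path, without the scheme and without a trailing *.html file name."""
    parts = urlsplit(url.strip())
    # Split the path into non-empty segments, which also drops a trailing slash.
    segments = [s for s in parts.path.split('/') if s]
    # Drop the last segment only when it is an .html file name.
    if segments and segments[-1].endswith('.html'):
        segments.pop()
    return '/'.join([parts.netloc] + segments)


urls = [
    'https://www.bsfuji.tv/akitamogul2020/pub/index.html',
    'https://www.sponichi.co.jp/sports/news/2020/02/21/kiji/20200221s00048000307000c.html',
]
for url in urls:
    print(strip_url(url))
```

To process the txt file, the same function could be applied to each non-empty line read from it.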