As shown in the screenshot (image not reproduced here), the href I want sits inside a node in the page's listing. I want to scrape that href value, i.e. /tjgb/20gx/36169.html. But when I write `content_all = soup.find_all.table(class_="box")`, nothing gets scraped and the result is an empty list. How can I accurately locate the node that contains the href?

The site is http://tjcn.org/tjgb/20gx/index.html

Here is my code:
```python
import re
import requests
from bs4 import BeautifulSoup

for page in range(0, 10):
    url = f"http://www.tjcn.org/tjgb/20gx/index_{page}.html"
    if page == 0:
        url = "http://www.tjcn.org/tjgb/20gx/index.html"
    print(url)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    soup = BeautifulSoup(html, "lxml")
    content_all = soup.find_all.table(class_="box")  # problem line: comes back empty
    print(content_all)
```
Just change how you locate the elements. `soup.find_all` is a method, so chaining `.table` onto it is not valid; find the `<ul>` tags and read the href from the `<a>` inside each one:
```python
import requests
from bs4 import BeautifulSoup

for page in range(0, 10):
    url = f"http://www.tjcn.org/tjgb/20gx/index_{page}.html"
    if page == 0:
        url = "http://www.tjcn.org/tjgb/20gx/index.html"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    soup = BeautifulSoup(html, "lxml")
    # locate the <ul> tags
    items = soup.find_all('ul')
    for ul in items:
        # get the href from the first <a> inside each <ul>
        href = ul.find('a').get('href')
        print(href)
```
Your program scrapes the href from every `<ul>` tag on the whole page, so every page yields the same duplicate href, namely /tjgb/20gx/32947.html. That duplicate comes from the sidebar (class='slider'). To avoid it, I need to scrape only the content under class='box'. How can I locate the box part? Here is the output I get:
```
tjgb/20gx/36536.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_1.html
http://www.tjcn.org/tjgb/20gx/index_2.html
/
/tjgb/20gx/36086.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_3.html
/
/tjgb/20gx/35648.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_4.html
/
/tjgb/20gx/35161.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_5.html
/
/tjgb/20gx/28688.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_6.html
/
/tjgb/20gx/27693.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_7.html
/
/tjgb/20gx/26955.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_8.html
/
/tjgb/20gx/24447.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_9.html
/
/tjgb/20gx/18801.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
```
Bro, use XPath instead; the bs4 code above walks every `<ul>` on the page. With XPath you can anchor directly on the table with class="box":
```python
import requests
from lxml import etree

for page in range(0, 10):
    url = f"http://www.tjcn.org/tjgb/20gx/index_{page}.html"
    if page == 0:
        url = "http://www.tjcn.org/tjgb/20gx/index.html"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    response = requests.get(url=url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    tree = etree.HTML(html)
    # restrict the search to the listing table with class="box"
    lis = tree.xpath('//table[@class="box"]//ul/li')
    for li in lis:
        # take the href of the first <a> under each <li>
        href = li.xpath('./a/@href')[0]
        print(href)
```
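For anyone who wants to stay with BeautifulSoup, the same scoping trick works there too. This is a minimal, untested sketch; it assumes the listing table still carries class="box" on every index page and that the relative hrefs should be resolved against the page URL:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}

for page in range(0, 10):
    url = f"http://www.tjcn.org/tjgb/20gx/index_{page}.html"
    if page == 0:
        url = "http://www.tjcn.org/tjgb/20gx/index.html"
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "lxml")
    # search only inside the box table, so the sidebar
    # (class="slider") links never appear in the results
    box = soup.find("table", class_="box")
    if box is None:  # assumption: skip pages whose markup differs
        continue
    for a in box.find_all("a", href=True):
        # turn the relative href into an absolute URL
        print(urljoin(url, a["href"]))
```

The key point is the same in both versions: run the search on the box element (or anchor the XPath on it) instead of on the whole document, and the duplicate slider link goes away.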