Python scraper: node selector not working

(Screenshots omitted: they show the page and its HTML source, with the target href sitting inside a node.)
I want to scrape the href values, i.e. /tjgb/20gx/36169.html.
But when I write content_all = soup.find_all.table(class_="box"), nothing gets scraped; the result is an empty list.
How can I accurately locate the node that contains the href?
The site is http://tjcn.org/tjgb/20gx/index.html
Here is my code:
```python
import re
import requests
from bs4 import BeautifulSoup

for page in range(0,10):
    url = f"http://www.tjcn.org/tjgb/20gx/index_{page}.html"
    if page == 0:
        url = "http://www.tjcn.org/tjgb/20gx/index.html"
    print(url)

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}

    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text

    soup = BeautifulSoup(html, "lxml")

    content_all = soup.find_all.table(class_="box")
    print(content_all)
```

You just need to change how you locate the nodes. soup.find_all.table(class_="box") is not valid BeautifulSoup syntax: find_all is a method, so the tag name has to be passed as an argument, like soup.find_all("table", class_="box"). The following version locates the links:


```python
import re
import requests
from bs4 import BeautifulSoup

for page in range(0,10):
    url = f"http://www.tjcn.org/tjgb/20gx/index_{page}.html"
    if page == 0:
        url = "http://www.tjcn.org/tjgb/20gx/index.html"
    #print(url)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    #print(html)
    soup = BeautifulSoup(html, "lxml")
    # content_all = soup.find_all.table(class_="box")
    # print(content_all)
    
    # locate the <ul> tags
    items = soup.find_all('ul')
    for ul in items:
        # get the href of the first <a> inside each <ul>
        href = ul.find('a').get('href')
        print(href)
       


```

Your program scrapes the hrefs under every tag on the whole page, so every page yields the same duplicate href, namely /tjgb/20gx/32947.html. That duplicate comes from the sidebar area (class='slider'); to avoid it, only the content inside class='box' should be scraped. How can I locate the box node? The output is:
```
tjgb/20gx/36536.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_1.html
http://www.tjcn.org/tjgb/20gx/index_2.html
/
/tjgb/20gx/36086.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_3.html
/
/tjgb/20gx/35648.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_4.html
/
/tjgb/20gx/35161.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_5.html
/
/tjgb/20gx/28688.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_6.html
/
/tjgb/20gx/27693.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_7.html
/
/tjgb/20gx/26955.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_8.html
/
/tjgb/20gx/24447.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
http://www.tjcn.org/tjgb/20gx/index_9.html
/
/tjgb/20gx/18801.html
/tjgb/20gx/36217.html
/tjgb/20gx/32947.html
```
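If you want to stay with bs4, you can scope the search to the box table before collecting links. A minimal sketch for a single page, assuming the list really does sit inside a <table class="box"> as in your first attempt (untested against the live page):

```python
import requests
from bs4 import BeautifulSoup

url = "http://www.tjcn.org/tjgb/20gx/index.html"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")

# find_all/find take the tag name as an argument; narrow the search
# to the box table first, then collect only the links inside it
box = soup.find("table", class_="box")
if box is not None:
    for a in box.find_all("a"):
        print(a.get("href"))
```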

Bro, use XPath instead; the bs4 code in the first answer has some issues. This version restricts the search to the box table directly:



```python
import re
import requests
from lxml import etree

for page in range(0,10):
    url = f"http://www.tjcn.org/tjgb/20gx/index_{page}.html"
    if page == 0:
        url = "http://www.tjcn.org/tjgb/20gx/index.html"
    #print(url)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    response = requests.get(url=url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    #print(html)
    tree = etree.HTML(html)
    # search only inside the box table so the slider links are skipped
    items = tree.xpath('//table[@class="box"]//ul/li')
    for li in items:
        href = li.xpath('./a/@href')[0]
        print(href)

```
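Either way, the hrefs come back as site-relative paths (e.g. /tjgb/20gx/36169.html). If the next step is to download those article pages, they can be resolved to absolute URLs with urljoin from the standard library; a small sketch:

```python
from urllib.parse import urljoin

base = "http://www.tjcn.org/tjgb/20gx/index.html"

# urljoin resolves a relative href against the page it was scraped from
print(urljoin(base, "/tjgb/20gx/36169.html"))
# -> http://www.tjcn.org/tjgb/20gx/36169.html
```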