Could anyone give me some pointers on how to scrape the data from this kind of page?

http://fish.haidiao.com
Any guidance appreciated; I've been staring at this until my eyes are red.

I captured the traffic with Fiddler and it's actually quite simple.

First, the HTML you fetch contains:


var subval2 = new Array();  // define an array here


       subval2[0] = new Array('盲鳗科','蒲氏粘盲鳗');

       subval2[1] = new Array('盲鳗科','紫粘盲鳗');

       subval2[2] = new Array('盲鳗科','福尔摩沙盲鳗');

       subval2[3] = new Array('盲鳗科','郭氏盲鳗');

       subval2[4] = new Array('盲鳗科','陈氏副盲鳗');

       subval2[5] = new Array('盲鳗科','费氏副盲鳗');

       subval2[6] = new Array('盲鳗科','纽氏副盲鳗');

       subval2[7] = new Array('盲鳗科','沈氏副盲鳗');

       subval2[8] = new Array('盲鳗科','台湾副盲鳗');

       subval2[9] = new Array('盲鳗科','怀氏副盲鳗');

       subval2[10] = new Array('盲鳗科','杨氏副盲鳗');
...

Iterate over this array and send one POST request per entry:
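The extraction step can be sketched like this, using the snippet above as sample input; the same regex works on the full page once you have fetched and decoded it:

```python
import re

# A fragment of the JavaScript embedded in the index page (from the capture above)
html = """
var subval2 = new Array();
subval2[0] = new Array('盲鳗科','蒲氏粘盲鳗');
subval2[1] = new Array('盲鳗科','紫粘盲鳗');
"""

# Each match is a (family, species) pair
pattern = re.compile(r"subval2\[\d+\]\s*=\s*new Array\('([^']*)','([^']*)'\)")
entries = pattern.findall(html)
print(entries)  # [('盲鳗科', '蒲氏粘盲鳗'), ('盲鳗科', '紫粘盲鳗')]
```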

POST http://fish.haidiao.com/indexlist.asp HTTP/1.1
Host: fish.haidiao.com
Connection: keep-alive
Content-Length: 60
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Origin: http://fish.haidiao.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: http://fish.haidiao.com/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cookie: yunsuo_session_verify=ad5bf876c9d468da81f5f25f62c9f96d; ASPSESSIONIDCQSQQTRA=IAOPKAODOODIOCJKAPHNPACE; UM_distinctid=16991ea57ddb5-0cf04ace8af874-39475561-1aeaa0-16991ea57df4ac; CNZZDATA577583=cnzz_eid%3D771781776-1552931247-null%26ntime%3D1552931247

SelectItem=%B2%E6%B3%DD%F7%5E%BF%C6&submit.x=101&submit.y=11

Set SelectItem to each of the entries above in turn. Note that the value is URL-encoded in the page's native encoding (the %B2%E6-style escapes look like GBK/GB2312 rather than UTF-8).
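That loop can be sketched with the standard library. This is only a sketch: the field names come from the capture above, and the GBK encoding of SelectItem is an assumption inferred from the %B2%E6-style escapes, so verify it against the site's actual charset:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_form(family):
    # GBK is an assumption, inferred from the escapes in the captured body
    return {"SelectItem": family.encode("gbk"), "submit.x": "101", "submit.y": "11"}

def fetch_family(family):
    """POST one family name to indexlist.asp, mirroring the captured request."""
    body = urlencode(build_form(family)).encode("ascii")
    req = Request("http://fish.haidiao.com/indexlist.asp", data=body,
                  headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urlopen(req, timeout=10) as resp:
        # decode the response with the same (assumed) site encoding
        return resp.read().decode("gbk", errors="replace")
```

You would then call `fetch_family(family)` for each family name pulled out of the `subval2` array and parse the returned HTML.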

If you want to scrape the details of individual fish, here's a simple example:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()
browser.maximize_window()
for i in range(1, 10):
    browser.get("http://fish.haidiao.com/view.asp?id=" + str(i))
    time.sleep(3)  # give the page time to load
    # XPath of the fish-name cell, taken from the page source
    name = browser.find_element(By.XPATH, '/html/body/table/tbody/tr/td[1]/table/tbody/tr/td[2]/table/tbody/tr/td/table[2]/tbody/tr/td/table/tbody/tr/td/table[1]/tbody/tr/td[1]/table[1]/tbody/tr/td[2]').text
    print(name)

Actually, the simplest approach I can recommend is Selenium + Python, driving a real browser directly. Performance isn't great and it's arguably overkill, but for an ordinary one-off need like this it's very quick to get working.

Recommended: Python + Selenium

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Safari()  # works on a Mac; enable "Allow Remote Automation" in Safari first
# driver = webdriver.Chrome()  # on Windows, use Chrome instead
url = "http://fish.haidiao.com"  # the page to scrape
driver.get(url)  # open the page
time.sleep(2)

contents = dict()
# XPath of the family-name <option> list, taken from the page source
objs_xpath = "/html/body/table/tbody/tr/td[1]/table/tbody/tr/td[2]/table/tbody/tr/td/table[2]/tbody/tr[2]/td[1]/form/select/option[%s]"
objs_num = 229  # the family list has 229 entries in total

# the XPath of the fish-name <select> list never changes
fishes_xpath = "/html/body/table/tbody/tr/td[1]/table/tbody/tr/td[2]/table/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/form/select"

for i in range(1, objs_num + 1):
    obj = driver.find_element(By.XPATH, objs_xpath % i)  # get the family option
    obj.click()
    fishes = driver.find_element(By.XPATH, fishes_xpath).find_elements(By.XPATH, "*")
    contents[obj.text] = []
    for fish in fishes:
        contents[obj.text].append(fish.text)

print(contents)