Fetching site data via POST: categories and pagination

The problem and its background

I need to collect data from the following site:
https://ggzy.jiangxi.gov.cn/jyxx/002006/002006001/trade.html
The listings are loaded through a wrapped POST endpoint: https://ggzy.jiangxi.gov.cn/XZinterface/rest/esinteligentsearch/getFullTextDataNew

I cannot find a URL path that returns the announcement JSON directly.

Summary
URL: https://ggzy.jiangxi.gov.cn/XZinterface/rest/esinteligentsearch/getFullTextDataNew
Status: 200 OK
Source: Network
Address: 218.87.21.50:443
Initiator:
jquery.min.js:4:25687

Request
POST /XZinterface/rest/esinteligentsearch/getFullTextDataNew HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Host: ggzy.jiangxi.gov.cn
Origin: https://ggzy.jiangxi.gov.cn/
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15
Connection: keep-alive
Referer: https://ggzy.jiangxi.gov.cn/jyxx/002006/trade.html
Content-Length: 531
Cookie: noOauthAccessToken=68dcea2e8cd7348bb7a52fb4f7522fc8; noOauthRefreshToken=9574d7d45188e88b4fa6849b5e3e6d6f; oauthClientId=admin; oauthLoginUrl=http://192.168.164.241:96/membercenter/login.html?redirect_uri=; oauthLogoutUrl=; oauthPath=http://10.4.157.26:8080/XZEWB-FRONT; userGuid=1965633006
X-Requested-With: XMLHttpRequest

Response
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Content-Type: application/json;charset=UTF-8
Vary: Origin
Access-Control-Expose-Headers: X-Test-2, X-Test-1
Connection: keep-alive
Date: Mon, 14 Nov 2022 13:11:30 GMT
X-Frame-Options: SAMEORIGIN
Content-Length: 16437
Access-Control-Allow-Origin: https://ggzy.jiangxi.gov.cn/
Server: nginx

Request Data
MIME Type: application/x-www-form-urlencoded; charset=UTF-8
{"token":"","pn":50,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"","sort":"{\"webdate\":\"0\",\"id\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"002","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"2022-10-15 00:00:00","endTime":"2022-11-14 23:59:59"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}

Response preview (catedata arrays returned for three different category queries):

002001 (房建及市政工程) sub-categories:
  002001006 招标计划
  002001001 招标公告
  002001002 答疑澄清
  002001003 文件下载
  002001004 中标公示

Top-level 002 trade categories:
  002001 房建及市政工程
  002002 交通工程
  002003 水利工程
  002004 重点工程
  002005 外贷工程
  002006 政府采购
  002007 国土资源交易
  002008 产权交易
  002009 林权交易
  002010 医药采购
  002011 排污权交易
  002013 其它项目
  002015 疫苗采购

002006 (政府采购) sub-categories:
  002006007 采购意向
  002006001 采购公告
  002006002 变更公告
  002006003 答疑澄清
  002006004 结果公示
  002006005 单一来源公示
  002006006 合同公示
  002006008 中小企业执行情况
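
For reference when building the categorynum filter, the 002006 (政府采购, government procurement) sub-category codes above can be kept in a small lookup table. The codes and names are copied from the response above; the class name is my own:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GgzyCategories {
    // 002006 (government procurement) sub-categories, copied verbatim
    // from the catedata response shown above.
    static final Map<String, String> GOV_PROCUREMENT = new LinkedHashMap<>();
    static {
        GOV_PROCUREMENT.put("002006007", "采购意向");
        GOV_PROCUREMENT.put("002006001", "采购公告");
        GOV_PROCUREMENT.put("002006002", "变更公告");
        GOV_PROCUREMENT.put("002006003", "答疑澄清");
        GOV_PROCUREMENT.put("002006004", "结果公示");
        GOV_PROCUREMENT.put("002006005", "单一来源公示");
        GOV_PROCUREMENT.put("002006006", "合同公示");
        GOV_PROCUREMENT.put("002006008", "中小企业执行情况");
    }

    public static void main(String[] args) {
        // 结果公示 (result announcements) is the category the question is after.
        System.out.println(GOV_PROCUREMENT.get("002006004"));
    }
}
```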
Result and errors

The data cannot be retrieved; the request fails.

My approach and what I have tried

I request the data with the following payload:
{"token":"","pn":50,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"","sort":"{\"webdate\":\"0\",\"id\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"002","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"2022-10-15 00:00:00","endTime":"2022-11-14 23:59:59"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}

Request Data
MIME Type: application/x-www-form-urlencoded; charset=UTF-8
{"token":"","pn":90,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"","sort":"{\"webdate\":\"0\",\"id\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"002002","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"2022-10-15 00:00:00","endTime":"2022-11-14 23:59:59"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}
This does not work as needed: I can filter by category and by time range, but I cannot page through the results.

What I want to achieve

How do I set up a Java program that fetches the announcement JSON arrays from getFullTextDataNew by category, by time range, and by page?

Thanks.
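
Since the question asks for Java, here is a minimal sketch using the java.net.http client shipped with Java 11+. The endpoint, headers, and JSON field names are copied from the request captured above; the class and method names are my own, and main only prints a sample body rather than hitting the server, so treat this as a starting point rather than a verified client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GgzyFetchSketch {

    static final String ENDPOINT =
        "https://ggzy.jiangxi.gov.cn/XZinterface/rest/esinteligentsearch/getFullTextDataNew";

    // Build the JSON body seen in the captured request. pn is the record
    // offset (pageIndex * rn); categorynum and the webdate range are the
    // two filters the site itself sends. Note that "sort" is a JSON string
    // nested inside the JSON body, hence the escaped quotes.
    static String buildPayload(String categoryNum, String startTime,
                               String endTime, int pageIndex, int rn) {
        int pn = pageIndex * rn;
        return "{\"token\":\"\",\"pn\":" + pn + ",\"rn\":" + rn
            + ",\"sdt\":\"\",\"edt\":\"\",\"wd\":\"\",\"inc_wd\":\"\",\"exc_wd\":\"\","
            + "\"fields\":\"\",\"cnum\":\"\","
            + "\"sort\":\"{\\\"webdate\\\":\\\"0\\\",\\\"id\\\":\\\"0\\\"}\","
            + "\"ssort\":\"\",\"cl\":10000,\"terminal\":\"\","
            + "\"condition\":[{\"fieldName\":\"categorynum\",\"equal\":\"" + categoryNum
            + "\",\"notEqual\":null,\"equalList\":null,\"notEqualList\":null,"
            + "\"isLike\":true,\"likeType\":2}],"
            + "\"time\":[{\"fieldName\":\"webdate\",\"startTime\":\"" + startTime
            + "\",\"endTime\":\"" + endTime + "\"}],"
            + "\"highlights\":\"\",\"statistics\":null,\"unionCondition\":[],"
            + "\"accuracy\":\"\",\"noParticiple\":\"1\",\"searchRange\":null,\"noWd\":true}";
    }

    // One page fetch; a real crawler would loop pageIndex and may also need
    // the Cookie header shown in the capture.
    static String fetchPage(String categoryNum, String start, String end,
                            int pageIndex) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create(ENDPOINT))
            .header("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
            .header("X-Requested-With", "XMLHttpRequest")
            .header("Referer", "https://ggzy.jiangxi.gov.cn/jyxx/002006/002006001/trade.html")
            .POST(HttpRequest.BodyPublishers.ofString(
                buildPayload(categoryNum, start, end, pageIndex, 10)))
            .build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Offline demo: the body for page index 5 (offset 50) of 采购公告.
        System.out.println(buildPayload("002006001",
            "2022-10-15 00:00:00", "2022-11-14 23:59:59", 5, 10));
    }
}
```

To actually download a page you would call fetchPage("002006001", "2022-10-15 00:00:00", "2022-11-14 23:59:59", 0) and parse the returned JSON with a library such as Jackson or Gson.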

pn divided by 10 is the page number: pn is the record offset and rn the page size (10 here), so to move to the next page you add rn to pn. The captured pn values 50 and 90 correspond to the sixth and tenth pages.
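
In other words, paging is just a matter of computing the offset. A tiny helper (zero-based page index; the names are mine):

```java
public class PnOffset {
    // pn = zero-based page index times page size (rn); consistent with the
    // captured payloads pn:50 and pn:90 at rn:10.
    static int pnFor(int pageIndex, int rn) {
        return pageIndex * rn;
    }

    public static void main(String[] args) {
        System.out.println(pnFor(5, 10)); // the pn:50 request above
        System.out.println(pnFor(9, 10)); // the pn:90 request above
    }
}
```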

You could try Crawler, a Java crawling tool; it has some parameter settings you can use.

Have a look at this:
https://blog.csdn.net/muumian123/article/details/81747053

You could also refer to this link:
https://b23.tv/xRl309e

```python
 1 #coding=utf-8
 2
 3 import urllib2
 4 import urllib
 5 import json
 6
 9 output = open('huizho.json', 'w')
11 for page in range(1, 30):                # pages 1 through 29
12     request = urllib2.Request('http://www.hdgtjy.com/Index/PublicResults')
13     request.add_header('X-Requested-With', 'XMLHttpRequest')
14     request.add_header('Content-Type', 'application/x-www-form-urlencoded')
15     values = 'page=%d&size=10' % page    # or: values = 'page=' + str(page) + '&size=10'
21     request.add_data(values)
22     response = urllib2.urlopen(request)
25     resHtml = response.read()
27     line = json.dumps(resHtml, ensure_ascii=False) + '\n'
28
29     output.write(line)
30 output.close()
```
This snippet implements pagination for a POST request; the scraped content comes from a car-sales site. Two problems came up while writing it:

1. If lines 12-14 (building the request and adding its headers) are hoisted above the for loop, the scrape breaks once the page number gains a digit (only one- and two-digit pages were tested here): most of the content is lost and what does come back is not the target content. The cause appears to be that calling request.add_data(values) on the same request object lets data pile up across iterations, so a fresh request must be built for every page. (I do not fully understand this part; corrections from experts are welcome.)
2. json.dumps() defaults to ensure_ascii=True; when the scraped content contains Chinese, as on line 27, it must be set to False.