I need to scrape data from the following site:
https://ggzy.jiangxi.gov.cn/jyxx/002006/002006001/trade.html
The page loads its listings through a wrapped endpoint: https://ggzy.jiangxi.gov.cn/XZinterface/rest/esinteligentsearch/getFullTextDataNew
but I can't work out the right way to fetch the announcement JSON from it.
Summary
URL: https://ggzy.jiangxi.gov.cn/XZinterface/rest/esinteligentsearch/getFullTextDataNew
Status: 200 OK
Source: Network
Address: 218.87.21.50:443
Initiator:
jquery.min.js:4:25687
Request
POST /XZinterface/rest/esinteligentsearch/getFullTextDataNew HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Host: ggzy.jiangxi.gov.cn
Origin: https://ggzy.jiangxi.gov.cn/
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15
Connection: keep-alive
Referer: https://ggzy.jiangxi.gov.cn/jyxx/002006/trade.html
Content-Length: 531
Cookie: noOauthAccessToken=68dcea2e8cd7348bb7a52fb4f7522fc8; noOauthRefreshToken=9574d7d45188e88b4fa6849b5e3e6d6f; oauthClientId=admin; oauthLoginUrl=http://192.168.164.241:96/membercenter/login.html?redirect_uri=; oauthLogoutUrl=; oauthPath=http://10.4.157.26:8080/XZEWB-FRONT; userGuid=1965633006
X-Requested-With: XMLHttpRequest
Response
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Content-Type: application/json;charset=UTF-8
Vary: Origin
Access-Control-Expose-Headers: X-Test-2, X-Test-1
Connection: keep-alive
Date: Mon, 14 Nov 2022 13:11:30 GMT
X-Frame-Options: SAMEORIGIN
Content-Length: 16437
Access-Control-Allow-Origin: https://ggzy.jiangxi.gov.cn/
Server: nginx
Request Data
MIME Type: application/x-www-form-urlencoded; charset=UTF-8
{"token":"","pn":50,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"","sort":"{"webdate":"0","id":"0"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"002","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"2022-10-15 00:00:00","endTime":"2022-11-14 23:59:59"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}
catedata: [
  {"categorynum": "002001006", "categoryname": "招标计划"},
  {"categorynum": "002001001", "categoryname": "招标公告"},
  {"categorynum": "002001002", "categoryname": "答疑澄清"},
  {"categorynum": "002001003", "categoryname": "文件下载"},
  {"categorynum": "002001004", "categoryname": "中标公示"},
  …
]
That request fails; I can't get the data.
Data can be fetched with the following payload:
{"token":"","pn":50,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"","sort":"{"webdate":"0","id":"0"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"002","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"2022-10-15 00:00:00","endTime":"2022-11-14 23:59:59"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}
Request Data
MIME Type: application/x-www-form-urlencoded; charset=UTF-8
{"token":"","pn":90,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"","sort":"{"webdate":"0","id":"0"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"002002","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"2022-10-15 00:00:00","endTime":"2022-11-14 23:59:59"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}
This still doesn't fully work: I can filter by category and by time range, but I can't page through the results.
How should a Java program be set up so that it can fetch the announcement JSON arrays from getFullTextDataNew by category, by time range, and page by page?
Thanks.
(pn divided by 10 gives the page number.)
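Building on the note that pn divided by 10 gives the page number: with rn=10 results per page, pn looks like a record offset, i.e. pn = (page - 1) * rn. Here is a minimal Java sketch that builds the POST body for one page; the field names and constant values are copied from the captured request above, while the offset formula itself is an assumption:

```java
// Sketch: build the getFullTextDataNew POST body for one page.
// Field names and constants come from the captured request; the
// pn = (page - 1) * rn offset formula is an assumption based on
// the observation that pn / 10 corresponds to the page number.
public class PayloadBuilder {

    public static String buildPayload(String categoryNum, String startTime,
                                      String endTime, int page, int pageSize) {
        int pn = (page - 1) * pageSize; // record offset of the requested page
        return "{\"token\":\"\",\"pn\":" + pn + ",\"rn\":" + pageSize
            + ",\"sdt\":\"\",\"edt\":\"\",\"wd\":\"\",\"inc_wd\":\"\",\"exc_wd\":\"\","
            + "\"fields\":\"\",\"cnum\":\"\","
            + "\"sort\":\"{\\\"webdate\\\":\\\"0\\\",\\\"id\\\":\\\"0\\\"}\","
            + "\"ssort\":\"\",\"cl\":10000,\"terminal\":\"\","
            + "\"condition\":[{\"fieldName\":\"categorynum\",\"equal\":\"" + categoryNum
            + "\",\"notEqual\":null,\"equalList\":null,\"notEqualList\":null,"
            + "\"isLike\":true,\"likeType\":2}],"
            + "\"time\":[{\"fieldName\":\"webdate\",\"startTime\":\"" + startTime
            + "\",\"endTime\":\"" + endTime + "\"}],"
            + "\"highlights\":\"\",\"statistics\":null,\"unionCondition\":[],"
            + "\"accuracy\":\"\",\"noParticiple\":\"1\",\"searchRange\":null,\"noWd\":true}";
    }
}
```

Calling buildPayload("002006001", "2022-10-15 00:00:00", "2022-11-14 23:59:59", 6, 10) would reproduce the captured request with pn=50, category and time range swapped in per page.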
You could try Crawler, a Java crawler tool; it has a number of configurable parameters you can use.
Have a look at:
https://blog.csdn.net/muumian123/article/details/81747053
This link may also be useful as a reference:
https://b23.tv/xRl309e
#coding=utf-8
import urllib2
import json

output = open('huizho.json', 'w')
for page in range(1, 30):  # pages 1 through 29
    # Build a fresh Request on every iteration: reusing one Request
    # object makes add_data() pile new POST bodies on top of old ones.
    request = urllib2.Request('http://www.hdgtjy.com/Index/PublicResults')
    request.add_header('X-Requested-With', 'XMLHttpRequest')
    request.add_header('Content-Type', 'application/x-www-form-urlencoded')
    values = 'page=%d&size=10' % page  # or: values = 'page=' + str(page) + '&size=10'
    request.add_data(values)
    response = urllib2.urlopen(request)
    resHtml = response.read()
    # The scraped content contains Chinese, so ensure_ascii cannot be
    # left at its default of True.
    line = json.dumps(resHtml, ensure_ascii=False) + '\n'
    output.write(line)
output.close()

This (Python 2) code implements paging over a POST endpoint; the content scraped here is from a certain car website. I hit two problems while writing it:
1. If the three lines that create the Request and add its headers are moved above the for loop, the results break as soon as the page number grows from one digit to two (I could only verify one- and two-digit pages on this site): most of the content is lost and what is fetched is not the target content. The cause is request.add_data(values): called repeatedly on the same Request object, it keeps accumulating data, so the request has to be rebuilt for every page to avoid the bug. (My understanding here is not solid; corrections from experts are welcome.)
2. json.dumps() defaults to ensure_ascii=True; the default has to be changed whenever the scraped content contains Chinese.
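The same paging-by-POST idea can be done in Java, which is what the question asks for, using only java.net.HttpURLConnection. A hedged sketch: the endpoint, Content-Type, X-Requested-With, and Referer headers are copied from the capture above, but whether the server additionally checks the cookies shown there is untested. pageOffsets simply enumerates the pn offsets for a page loop:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FullTextFetcher {
    static final String ENDPOINT =
        "https://ggzy.jiangxi.gov.cn/XZinterface/rest/esinteligentsearch/getFullTextDataNew";

    // Enumerate the pn offsets for the first `pages` pages of size `rn`,
    // on the assumption that pn is a record offset (0, rn, 2*rn, ...).
    public static int[] pageOffsets(int pages, int rn) {
        int[] offsets = new int[pages];
        for (int i = 0; i < pages; i++) offsets[i] = i * rn;
        return offsets;
    }

    // POST one JSON body to the endpoint and return the raw response text.
    public static String post(String body) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(ENDPOINT).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Headers mirroring the captured browser request
        conn.setRequestProperty("Content-Type",
            "application/x-www-form-urlencoded; charset=UTF-8");
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setRequestProperty("Referer",
            "https://ggzy.jiangxi.gov.cn/jyxx/002006/trade.html");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line);
        }
        return sb.toString();
    }
}
```

Looping the categorynum values from catedata, the desired webdate ranges, and the offsets from pageOffsets into successive request bodies then covers category × time × page, with each response parsed as a JSON array of announcements.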