Python: crawler or regular expressions for selecting content from a web page

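I want to extract the data rows from the page source below. Please advise how I should write this: with a scraping tool, or with regular expressions?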



<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<META http-equiv="Content-Style-Type" content="text/css" charset="UTF-8">
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico" />
<style type="text/css">
legend { font-family:arial,verdana; font-size:80%;}
h1 {font-family:arial,verdana; font-size:100%; color:#444444;}
font.general {font-family:arial,verdana; font-size:80%;}
fieldset{margin-bottom:12px;margin-left:20px;}
input {font-family:arial,verdana; font-size:100%;}
a.help {text-decoration:none;font-family:arial,verdana; color:#FF7200;background-color:#dddddd;}
code {font-family: monospace;text-align: left;}
table {width:900px;border: medium solid #777777;border-collapse: collapse}
td, th {border: thin solid #6495ed;}
table.menu {width:900px;border:0px;margin-bottom:10px;background-color:#FFFFFF;empty-cells:show;}
td.menu {border:0px;background-color:#FFFFFF;text-align:center;padding:0px;}
th.menu {border:0px;padding:0px;}
div.disclaimer {width:900px;margin-bottom:15px;min-height:20px;}
span.disclaimer {text-align:center;font-size:85%;color:#C63306;background-color:#f5f5f5;padding:5px;border:2px solid #cccccc;border-radius: 10px 10px; -moz-border-radius: 10px;}
span.disclaimer:hover {text-align:center;font-size:85%;color:#C63306;padding:5px;border:2px solid #c63306;border-radius: 10px 10px; -moz-border-radius: 10px;}
a.disclaimer {text-decoration:none;color:#675849;outline:0;}
div#logdiv {display:none;margin-bottom:10px;margin-top:5px;}
div#linkdiv {margin-bottom:10px;margin-top:5px;padding:4px;width:886px;max-height:200px;overflow:auto;font-family:monospace;background-color:#F3F3F3;border:#777777 solid 3px;line-height:1.8;}
div#hidelink{display:none;}
div#hidequery{display:none;}
div#logquery {margin-bottom:10px;margin-top:5px;padding:4px;width:886px;max-height:200px;overflow:auto;font-family:monospace;background-color:#F3F3F3;border:#777777 solid 3px;line-height:1.8;}
.func_class{color:#6D3386;}
.statement{color:#08961F;}
div#hideacknow{display:none;}
a.prevbutton {border: #555555 solid 1px;text-decoration:none;padding:2px;background-color:#EAE9D7;color:#666666;font-family: arial, sans-serif;font-weight:bold;}
a:hover.prevbutton {background-color:#EDEAA2;}
a.zoombutton {border: #555555 solid 1px;text-decoration:none;padding:2px;background-color:#DED9E2;color:#666666;font-family:arial,sans-serif;font-weight:bold;}
a:hover.zoombutton {background-color:#CDBAD9;}
a.asciibutton {cursor:pointer;border: #555555 solid 1px;text-decoration:none;padding:2px;background-color:#E3CDD3;color:#666666;font-family:arial,sans-serif;font-weight:bold;}
a:hover.asciibutton {background-color:#F0C7D0;}
a.logbutton {cursor:pointer;border: #555555 solid 1px;text-decoration:none;padding:2px;background-color:#D0E3CF;color:#666666;font-family:arial,sans-serif;font-weight:bold;}
a:hover.logbutton {background-color:#B9E6B6;}
a.eventbutton {cursor:pointer;border: #555555 solid 1px;text-decoration:none;padding:2px;background-color:#CDE4E5;color:#666666;font-family:arial,sans-serif;font-weight:bold;}
a:hover.eventbutton {background-color:#B1E6E8;}
a.backbutton {cursor:pointer;border: #555555 solid 1px;text-decoration:none;padding:2px;background-color:#BFBFBF;color:#111111;font-family:arial,sans-serif;font-weight:bold;}
a:hover.backbutton {background-color:#EFEFEF;}
a.navig_meta {cursor:pointer;border: #555555 solid 1px;text-decoration:none;padding:0px;background-color:#EAE9D7;color:#666666;font-family:arial,sans-serif;}
a:hover.navig_meta {background-color:#EDEAA2;}
font.disablebutton {border: #999999 solid 1px;text-decoration:none;padding:2px;background-color:#EEEEEE;color:#999999;font-family:arial,sans-serif;font-weight:bold;}
a.gletab {cursor:pointer;text-decoration:underline;color:#222222;font-family:arial,sans-serif;text-align:left;vertical-align:middle}
</style>
<script language="javascript" type="text/javascript">

<!-- 1 click -> hide div, next click show div etc. -->
function displaydiv(divid){
    var elementdiv = document.getElementById(divid);
    var divstyle = elementdiv.style.display;
    if(divstyle.toLowerCase()=="block"){
        elementdiv.style.display = 'none';
    } else {
        elementdiv.style.display = 'block';
        elementdiv.style.height = 'auto';
    }
}
<!-- 1 click -> hide div -->
function displaymetadiv(divid){
    var elementdiv = document.getElementById(divid);
    var divstyle = elementdiv.style.display;
    if(divstyle.toLowerCase()=="none"){
        elementdiv.style.display = 'block';
        elementdiv.style.height = 'auto';
    }
}

<!--from spacefrog-->
document.getElementsByReg=function(reg,attr){
var tabReg=new Array();
var tabElts=document.body.getElementsByTagName('*');
var TEL=tabElts.length;
if(! (reg instanceof RegExp)){return tabReg;}
i=0;
while(tabElts[i]){
        if(tabElts[i][attr]){
    
             if(reg.test(tabElts[i][attr])){tabReg.push(tabElts[i]);}
             }
    i++;
    }
return tabReg;
}

<!-- hide all meta div (metadiv1, metadiv2 etc.)-->
function hideallmetadivn(){
    var elementdiv = document.getElementsByReg(/metadiv[1-9]+/,'id');
    var TEL=elementdiv.length;
    i=0;
    while(i<TEL){
        var divstyle = elementdiv[i].style.display;
        if(divstyle.toLowerCase()=="block"){
            elementdiv[i].style.display = 'none';
        }
    i++;
    }
}

</script>

<body><font class="general">

<br><small>Connected to:read.nmdb.eu</small><br><table class="menu"><tr><th colspan="14" class="menu"><img style="vertical-align:bottom;" src=IMG/top_top.jpg></img></th></tr><tr><td class="menu" style="border-left: 1px #222222 solid"><img style="vertical-align:bottom;" SRC="IMG/nmdb-est-logo-7mini.png" width="78" border="0"></img></td><td class="menu"><a class="backbutton" href="back_and_reset.php">&nbsp;back&reset&nbsp;</a></td><td class="menu"><a class="backbutton" onClick='history.back()'>&nbsp;back&nbsp;</a></td><td class="menu"></td><td class="menu"></td><td class="menu"></td><td class="menu"></td><td class="menu"><a class="logbutton" onClick="displaydiv('logdiv');">&nbsp;log&nbsp;</a></td><td class="menu"><a class="prevbutton" href='draw_graph.php?navigate=prevstep'>prev&nbsp;step</a></td><td class="menu"><a class="zoombutton" href='draw_graph.php?navigate=zoominl'>zoom&nbsp;left</a></td><td class="menu"><a class="zoombutton" href='draw_graph.php?navigate=zoominc'>zoom&nbsp;center</a></td><td class="menu"><a class="zoombutton" href='draw_graph.php?navigate=zoominr'>zoom&nbsp;right</a></td><td class="menu"><a class="zoombutton" href='draw_graph.php?navigate=zoomout'>zoomout</a></td><td class="menu" style="border-right: 1px #222222 solid"><a class="prevbutton" href='draw_graph.php?navigate=nextstep'>next&nbsp;step</a></td></tr><tr><th colspan="14" class="menu"><img style="vertical-align:bottom;" src=IMG/top_down.jpg></img></th></tr></table><div class="disclaimer"><span class="disclaimer" style="float:left;"><a class="disclaimer" onClick="displaydiv('hidequery');" style="cursor:pointer;outline:0;"><b>the MySQL query</b></a></span><span style="float:left;width:8px;">&nbsp;</span><span class="disclaimer" style="float:left;"><a class="disclaimer" onClick="displaydiv('hidelink');" style="cursor:pointer;outline:0;"><b>http link</b></a></span><span class="disclaimer" style="float:right;"><a class="disclaimer" onClick="displaydiv('hideacknow');" 
style="cursor:pointer;outline:0;"><b>acknowledgements & disclaimer</b></a></span></div><div name ="logdiv" id="logdiv"><table><tr><td><font class="general">&nbsp;NO LOG</font></td></tr></table></div><div name="hidequery" id="hidequery"><div name ="logquery" id="logquery"><font style='color:#653AE3'>The following query for<b> KIEL</b> returned 7200 rows</font><br><font class="statement">SELECT</font> start_date_time,measured_corr_for_efficiency <font class="statement">FROM</font> KIEL_ori <font class="statement">WHERE</font> start_date_time >='2009-09-01 00:00:00' <font class="statement">AND</font> start_date_time < '2009-09-06 00:00:00' <font class="statement">ORDER</font> <font class="statement">BY</font> start_date_time <font class="statement">ASC</font><br><br><font style='color:#653AE3'>The following query for<b> KERG</b> returned 7200 rows</font><br><font class="statement">SELECT</font> start_date_time,measured_corr_for_efficiency <font class="statement">FROM</font> KERG_ori <font class="statement">WHERE</font> start_date_time >='2009-09-01 00:00:00' <font class="statement">AND</font> start_date_time < '2009-09-06 00:00:00' <font class="statement">ORDER</font> <font class="statement">BY</font> start_date_time <font class="statement">ASC</font></div></div><div name="hidelink" id="hidelink"><div name ="linkdiv" id="linkdiv">UNDER DEV !<br> <font color="blue">http://nest.nmdb.eu/draw_graph.php?formchk=1&stations[]=KIEL&stations[]=KERG&tabchoice=ori&dtype=corr_for_efficiency&yunits=0&date_choice=bydate&start_day=01&start_month=09&start_year=2009&start_hour=00&start_min=00&end_day=05&end_month=09&end_year=2009&end_hour=23&end_min=59&output=ascii</font></div></div><div name="hideacknow" id="hideacknow"><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<META http-equiv="Content-Style-Type" content="text/css">

<style type="text/css">

font.general {font-family:arial,verdana; font-size:85%;}
input {font-family:arial,verdana; font-size:85%;}
h1 {font-family:arial,verdana; font-size:100%; color:#444444;}
div.ways {
width: 885px;
padding: 5px;
margin-left:0px;
border: solid;
border-width: 3px;
border-color: #777777;
background-color:#FAFAFA;
text-align:justify;
 }

</style>

</head>
<body>
<font class="general">
<br>
<div class="ways">
<h1>Proprietary rights and acknowledgements:</h1>

NMDB distributes official data provided by the PIs of the neutron monitor stations. Data of different origin may not have been validated or authorised by the respective PI, and any deviation with respect to the authorised data is not his/her responsibility.  <br><br>

Data retrieved via NMDB are the property of the individual data providers. These data are free for non commercial use within the restrictions imposed by the providers. If you use such data for your research or applications, please acknowledge the origin by a sentence like "<i>We acknowledge the NMDB database (www.nmdb.eu), founded under the European Union's FP7 programme (contract no. 213007) for providing data, and the PIs of individual neutron monitors at: 
Kiel (Institut fur Experimentelle und Angewandte Physik, Christian-Albrechts-Universitat zu Kiel, Germany), Kerguelen (Observatoire de Paris and the French Polar Institute IPEV, France)"</i></div><br>

<div class="ways">
<h1>Acknowledgement of other data sources where relevant</h1>

Smoothed Sunspot Numbers and Monthly Sunspot Numbers are provided by the Solar Influences Data Analysis Center (<a href="http://sidc.oma.be">SIDC</a>), Royal Observatory of Belgium.<br><br>
Kp 3-hourly index is provided by the Deutsches GeoForschungsZentrum Postdam (<a href="http://www.gfz-potsdam.de">GFZ</a>) in cooperation with the International Association of Geomagnetism and Aeronomy (<a href="http://www.iugg.org/IAGA/">IAGA</a>) and the International Service of Geomagnetic Indices (<a href="http://isgi.unistra.fr/">ISGI</a>). <a href="https://doi.org/10.5880/Kp.0001">DOI of the Kp dataset</a><br><br>
GOES proton flux data is provided by the United States NOAA/National Geophysical Data Center (<a href="http://www.ngdc.noaa.gov">NGDC</a>)<br>
</div><br>
<div class="ways">
<h1>"original" and "revised" data in NMDB:</h1>

Neutron monitor count rates are available through NMDB in various versions. Data called  "original" are the count rates of a given monitor as provided by this monitor's registration system. Depending on the neutron monitor station these data <i>may</i> have been modified by real-time quality checking procedures, which correct obvious short-lived instrumental effects such as spikes or data gaps produced by hardware and software disturbances, as well as erroneous atmospheric pressure measurements. These corrections are made under the responsibility of the data provider who also archives the raw data. "Revised" data contain further corrections made during subsequent analyses. <br>More information on the NMDB format can be found <a
href="http://nest.nmdb.eu/help.php#helptable">here</a>.
</div>

</font>
<br>
</body>
</html>
</div></font><pre><code>#_____________ QUERY RESULTS SUMMARY ____________________________________
#
#      STATION: KIEL, KERG
#   START TIME: 2009-09-01 00:00:00 UTC
#     END TIME: 2009-09-05 23:59:00 UTC
#   NMDB TABLE: original
#    DATA TYPE: corr_for_efficiency
#    AVERAGING: No
# ORIGINAL RES: 1 min
#________________________________________________________________________
#
# Timestamps always correspond to the beginning of the time interval 
#________________________________________________________________________
#
# ____________________________________________________________________________________________________
#|                                                                                                    |
#| Data retrieved via NMDB are the property of the individual data providers. These data are free for |
#| non commercial use to within the restriction imposed by the providers. If you use such data for    |
#| your research or applications, please acknowledge the origin by a sentence like 'We acknowledge    |
#| the NMDB database (www.nmdb.eu) founded under the European Union's FP7 programme (contract no. 213 |
#| 007), and the PIs of individual neutron monitors at: Kiel (Institut fur Experimentelle und Angewan |
#| dte Physik, Christian-Albrechts-Universitat zu Kiel, Germany), Kerguelen (Observatoire de Paris an |
#| d the French Polar Institute IPEV, France)                                                         |
#|____________________________________________________________________________________________________|
#
                       KIEL    KERG
2009-09-01 00:00:00;178.267;232.781
2009-09-01 00:01:00;182.117;228.353
2009-09-01 00:02:00;169.233;237.209
2009-09-01 00:03:00;177.750;238.082
2009-09-01 00:04:00;178.317;237.841
2009-09-01 00:05:00;175.617;234.679
2009-09-01 00:06:00;178.683;235.311
</code></pre><br>Total Execution Time:2.482 sec (1.041 sec for mysql query)<br></font></body></html>




You can parse this with Python.
Install Python first, along with the required libraries, requests and bs4:
pip install requests
pip install bs4
The code is as follows:

import re

import requests
from bs4 import BeautifulSoup

# Each station is passed as stations[] (URL-encoded as stations%5B%5D)
url = "https://www.nmdb.eu/nest/draw_graph.php?formchk=1&stations%5B%5D=KERG&stations%5B%5D=KIEL&output=ascii&tabchoice=ori&dtype=corr_for_efficiency&date_choice=bydate&start_year=2009&start_month=09&start_day=01&start_hour=00&start_min=00&end_year=2009&end_month=09&end_day=05&end_hour=23&end_min=59&yunits=0"
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# The data rows sit inside the page's <code> element
code_tag = soup.find('code')
if code_tag:
    code_content = code_tag.text.strip()
    # A timestamp followed by one ';'-separated value per station
    pattern = r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}(?:;\d+\.\d+)+'
    matches = re.findall(pattern, code_content)
    # "w" rather than "a", so that rerunning does not append duplicate rows
    with open("output.txt", "w") as file:
        for match in matches:
            file.write(match)
            file.write("\n")
else:
    print("Failed to parse the page")

After it runs, an output.txt file is created in the directory the script was run from, containing the parsed rows.
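If the rows are needed as separate columns rather than raw lines, the text of the <code> block can also be split without any regular expression. The sketch below is only an illustration: it works on a short sample copied from the page source in the question, and the names `parse_nmdb_ascii` and `to_csv` are made up for this example. It assumes that comment lines start with `#`, that the station header is the only data line without a `;`, and that the first column can be labelled `start_date_time` (the column name shown in the page's MySQL query).

```python
import csv
import io

# Sample taken from the <code> block shown in the question
SAMPLE = """\
#      STATION: KIEL, KERG
#   START TIME: 2009-09-01 00:00:00 UTC
                       KIEL    KERG
2009-09-01 00:00:00;178.267;232.781
2009-09-01 00:01:00;182.117;228.353
2009-09-01 00:02:00;169.233;237.209
"""


def parse_nmdb_ascii(text):
    """Return (stations, rows) from the text of the <code> block."""
    stations = []
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        if ";" not in stripped:
            stations = stripped.split()  # header line with station names
            continue
        timestamp, *values = stripped.split(";")
        rows.append([timestamp, *values])
    return stations, rows


def to_csv(stations, rows):
    """Serialise the parsed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start_date_time", *stations])
    writer.writerows(rows)
    return buffer.getvalue()


stations, rows = parse_nmdb_ascii(SAMPLE)
print(to_csv(stations, rows))
```

With the real page, `code_tag.text` from the answer above would be passed in place of `SAMPLE`.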


You could ask this expert to take a look: @以山河作礼.

The page content is returned by the endpoint below. If someone provides a good solution I can increase the bounty. Thanks.
http://nest.nmdb.eu/draw_graph.php?formchk=1&stations[]=KERG&stations[]=KIEL&output=ascii&tabchoice=ori&dtype=corr_for_efficiency&date_choice=bydate&start_year=2009&start_month=09&start_day=01&start_hour=00&start_min=00&end_year=2009&end_month=09&end_day=05&end_hour=23&end_min=59&yunits=0
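Because every station has to appear as its own `stations[]` parameter, it may be safer to build the query string in code than to type it by hand. The following is a minimal sketch using only the standard library. The parameter names and values are copied from the "http link" shown in the page source above, while `build_nest_url` and its argument layout are hypothetical names chosen for this example.

```python
from urllib.parse import urlencode

BASE_URL = "http://nest.nmdb.eu/draw_graph.php"


def build_nest_url(stations, start, end):
    """Build the ASCII-output URL.

    start and end are (year, month, day, hour, minute) tuples of strings,
    e.g. ("2009", "09", "01", "00", "00").
    """
    params = [("formchk", "1")]
    # One stations[] entry per station; urlencode escapes the brackets
    params += [("stations[]", name) for name in stations]
    params += [
        ("tabchoice", "ori"),
        ("dtype", "corr_for_efficiency"),
        ("yunits", "0"),
        ("date_choice", "bydate"),
    ]
    for prefix, (year, month, day, hour, minute) in (("start", start), ("end", end)):
        params += [
            (f"{prefix}_day", day),
            (f"{prefix}_month", month),
            (f"{prefix}_year", year),
            (f"{prefix}_hour", hour),
            (f"{prefix}_min", minute),
        ]
    params.append(("output", "ascii"))
    return BASE_URL + "?" + urlencode(params)


url = build_nest_url(
    ["KIEL", "KERG"],
    ("2009", "09", "01", "00", "00"),
    ("2009", "09", "05", "23", "59"),
)
print(url)
```

The brackets come out percent-encoded as `stations%5B%5D=...`, which is the same form used in the first answer's URL.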

Your data is already very tidy; pretty much any approach will get it out.


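import re

import requests

# Every station needs its own stations[] parameter, with both brackets
res = requests.get('http://nest.nmdb.eu/draw_graph.php?formchk=1&stations[]=KERG&stations[]=KIEL&output=ascii&tabchoice=ori&dtype=corr_for_efficiency&date_choice=bydate&start_year=2009&start_month=09&start_day=01&start_hour=00&start_min=00&end_year=2009&end_month=09&end_day=05&end_hour=23&end_min=59&yunits=0', timeout=30)
data = res.text
# Two stations give two values per row, so capture both:
# each match is a (timestamp, first value, second value) tuple
d = re.findall(r'(?<=\n)(\d{4}-\d+-\d+ \d+:\d+:\d+);(\d+(?:\.\d+)?);(\d+(?:\.\d+)?)(?=[\r\n])', data)
print(d[:5])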
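The regex-only approach can also be written so that it does not depend on the number of stations and returns typed values. The sketch below runs on a small sample copied from the page source in the question instead of the live response. `ROW` and `extract_rows` are names made up for this example, and it assumes every value in a row is a plain decimal number.

```python
import re
from datetime import datetime

# Sample shaped like the raw response text, including mixed line endings
SAMPLE = (
    "#     END TIME: 2009-09-05 23:59:00 UTC\r\n"
    "                       KIEL    KERG\r\n"
    "2009-09-01 00:00:00;178.267;232.781\r\n"
    "2009-09-01 00:01:00;182.117;228.353\n"
    "</code></pre><br>Total Execution Time:2.482 sec"
)

# A timestamp at the start of a line, then one or more ';'-separated numbers
ROW = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})((?:;\d+(?:\.\d+)?)+)\r?$",
    re.MULTILINE,
)


def extract_rows(text):
    """Return a list of (datetime, [float, ...]) tuples, one per data row."""
    rows = []
    for stamp, tail in ROW.findall(text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        values = [float(v) for v in tail.lstrip(";").split(";")]
        rows.append((when, values))
    return rows


for when, values in extract_rows(SAMPLE):
    print(when.isoformat(sep=" "), values)
```

Anchoring on the start of the line keeps the timestamps in the `#` summary lines from being picked up as data.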
