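URL: http://georoc.mpch-mainz.gwdg.de/georoc/Start.asp
I want to scrape all the csv files listed under "rock" in the left-hand navigation of this page, but the source I fetch contains no csv addresses. How can I solve this? Thanks in advance!
Here is my code: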
import requests
from bs4 import BeautifulSoup

archive_url = "http://georoc.mpch-mainz.gwdg.de/georoc/Start.asp"  # page URL

def get_video_links():
    r = requests.get(archive_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup)
    item = soup.find_all('td', class_="arialtbl2")  # cells where the csv links should be
    print(item)
    return item

if __name__ == "__main__":
    video_links = get_video_links()
Here is the HTML I got:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<!-- saved from url=(0049)http://georoc.mpch-mainz.gwdg.de/georoc/Start.asp -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="description" content="GEOROC - Geochemical Database on magmatic Rocks">
<meta name="keywords" lang="en-us" content="database, analyses, volcanic rocks, mantle xenoliths, major elements, trace element,
concentrations, radiogenic, nonradiogenic, isotope ratios, analytical, ages, whole rocks, volcanic, glasses, minerals, inclusions">
<meta name="keywords" lang="en" content="database, analyses, volcanic, rocks, mantle xenoliths, major elements, trace element,
concentrations, radiogenic, nonradiogenic, isotope ratios, igneous, analytical, ages, whole rocks, volcanic, glasses, minerals, inclusions">
<meta name="keywords" lang="de" content="Datenbank, geochemie, spurenelement, oxid, gehalte, vulkanite, isotopenverhältnisse, magmatite, xenolithe, analytik, minerale">
<meta name="keywords" lang="it" content="Database , geochimica , contenuti, ossido, oligoelementi , vulcanici , rapporti isotopici , rocce ignee , xenoliti , analisi , minerali">
<title>Geochemical Rock Database-Query</title>
</head>
<frameset rows="100%" cols="18%,82%">
<frame frameborder="0" marginwidth="0" src="./Geochemical Rock Database-Query_files/Query.html" name="Query">
<frame frameborder="0" marginwidth="0" src="./Geochemical Rock Database-Query_files/QueryBlank.html" name="Search">
<noframes>
<body>
<table style="background-color:#99CCFF; padding:1px; border-width:3px; border-color: #000099; height: 8%; width: 40%; margin-left:auto; margin-right:auto; ">
<tr>
<td style="text-align:center;">
<a href="Start.asp"><b>Home</b> |</a> <a href="Content.htm"><b>Content</b> |</a>
</td>
</tr>
</table>
<h1 style="text-align:center;"><b>Query by</b></h1>
<p style="text-align:center;">
</p>
<h2 style="text-align:center;">
<a href="Authors.asp?Frames=no">1. Bibliography</a></h2>
<br/>
<h2 style="text-align:center;">
<a href="QueryLoc.asp?Frames=no">2. Location</a></h2>
<br/>
<h2 style="text-align:center;">
<a href="QueryChem.asp?Frames=no">3. Chemistry</a></h2>
<br/>
<br/>
<br/>
<table style="height: 8%; width:auto; margin-left:auto; margin-right:auto;">
<tr>
<td>
<a href="http://www.mpic.de">© MPI für Chemie, Mainz, Germany</a>
</td>
</tr>
</table>
<p style="text-align:center;">
<span style="font-family: Arial; color: #000033"> State: 05/01/2018 </span></p>
</body>
</noframes>
</frameset>
</html>
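Check whether the content on this page is updated dynamically by JavaScript that loads external JSON data.
requests only fetches the static source of a page; it cannot see content that is filled in dynamically.
For dynamically updated content, use selenium to scrape it.
Alternatively, open the F12 developer tools, look at which links the page loads its data from, find the real address of the JSON data, and scrape that directly.
Right-click on the page and choose "View page source" from the context menu.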
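Before reaching for selenium, it may be worth checking whether the page is simply a frameset, as the HTML posted in the question suggests. In that case requests returns only the outer shell, and the real content lives in the documents named by the frame src attributes. A minimal sketch of that check, using only the standard library (html.parser instead of BeautifulSoup); the helper names and the example.com sample page are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return absolute URLs of all frames found in the given HTML text."""
    parser = FrameSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


# Hypothetical sample page, not the real GEOROC markup
sample_page = """
<frameset cols="18%,82%">
<frame src="left.html" name="Query">
<frame src="right.html" name="Search">
</frameset>
"""

for url in frame_urls(sample_page, "http://example.com/dir/start.html"):
    print(url)
```

If this prints any URLs for the real page, those are the addresses to request next; if the list is empty, the content is more likely loaded by JavaScript and the advice above applies.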
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Download Compiled Data Files </title>
<link rel="stylesheet" type="text/css" href="StyleSheet.css" />
<style type="text/css">
td,
th {
border-style: solid;
border-color: #00507d;
border-width: 1px;
padding: 5px;
}
.arial12 {
font-family: arial, helvetica;
font-size: 12px;
font-weight: bold;
text-align: justify;
text-indent: 20px;
}
.arialtbl {
color: #ffffff;
font-family: arial, helvetica;
font-size: 16px;
font-weight: bold;
background-color: #00507d;
}
.arialtblk {
color: #ffffff;
font-family: arial, helvetica;
font-size: 12px;
font-weight: bold;
background-color: #00507d;
}
.arialtbl2 {
font-family: arial, helvetica;
font-size: 12px;
font-weight: bold;
text-align: left;
}
.arialtbl3 {
font-family: arial, helvetica;
font-size: 12px;
font-weight: bold;
color: red;
}
a:link {
text-decoration: none;
color: #0000cc;
}
a:visited {
text-decoration: none;
color: #0000cc;
}
a:hover {
text-decoration: underline;
color: #0000cc;
}
</style>
</head>
<body>
<h2>
Download Precompiled Datasets <a href="CompRules.html" target="_top">?</a>
</h2>
<table
style="width:50%; border-style: none; border-spacing:0px; background-color: #00507f; color:#ffffff; margin-left:auto; margin-right:auto;">
<tr>
<td class="arialtbl2" style="text-align:left;"> Note: Each file is downloaded (use
<span style="color:#ffff66;">'save as'</span>) as comma separated text file with the
extension '.csv'. This file may be opened and saved by EXCEL
</td>
</tr>
</table>
<br/>
<h2>Available Files</h2>
<form method="post" action="./CompFiles.aspx" id="fcompfiles">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJNzczNjYyOTY4ZGS141vTb2Mj+J93WU0NnlTsVqoS9xRcV/+YaorZY9kEFw==" />
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="5589E234" />
<table id="tcompfiles" cellpadding="4" width="80%"
style="border-color:#00507d; border-collapse:collapse; margin-left:auto; margin-right:auto;">
<tr>
<td class="arialtbl" align="center" colspan="3">Archean Cratons </td>
</tr>
<tr>
<td class="arialtblk" align="left">Download</td>
<td class="arialtblk" align="left">Size (KB)</td>
<td class="arialtblk" align="left">Last Actualization</td>
</tr>
<tr>
<td class="arialtbl2"><a
href="/georoc/Csv_Downloads/Archean_Cratons_comp/ALDAN_SHIELD_-_ARCHEAN.csv">ALDAN SHIELD -
ARCHEAN.csv</a></td>
<td class="arialtbl2">212</td>
<td class="arialtbl2">10/13/2021</td>
</tr>
</table>
</form>
<!-- Piwik -->
<script type="text/javascript">
var pkBaseURL = (("https:" == document.location.protocol) ? "https://piwik.gwdg.de/" : "http://piwik.gwdg.de/");
document.write(unescape("%3Cscript src='" + pkBaseURL + "piwik.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var piwikTracker = Piwik.getTracker(pkBaseURL + "piwik.php", 31);
piwikTracker.trackPageView();
piwikTracker.enableLinkTracking();
} catch( err ) {}
</script>
<noscript>
<p><img
src="http://piwik.gwdg.de/piwik.php?idsite=31" style="border:0"
alt=""/></p>
</noscript>
<!-- End Piwik Tag -->
</body>
</html>
The file paths are in this page: http://georoc.mpch-mainz.gwdg.de/georoc/CompFiles.aspx
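Since the csv links are plain `<a href="...">` tags in the source of CompFiles.aspx, they can be pulled out without a browser. The following is a rough sketch under a few assumptions: it uses only the standard library (html.parser and urllib in place of BeautifulSoup and requests), the helper names are invented here, and the sample string is the single table row from the source shown above rather than a live download. The `fetch` helper is defined but not called, and the utf-8 decoding in it is a guess about the page encoding.

```python
import posixpath
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

PAGE_URL = "http://georoc.mpch-mainz.gwdg.de/georoc/CompFiles.aspx"


class CsvLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .csv files."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".csv"):
                self.hrefs.append(href)


def csv_links(html, base_url=PAGE_URL):
    """Return absolute URLs of all csv links in the given HTML text."""
    parser = CsvLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def local_name(url):
    """File name to save a csv URL under (last path component)."""
    return posixpath.basename(urlparse(url).path)


def fetch(url):
    """Download a page as text; not called in this sketch."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


# One row copied from the page source above
sample = '''<td class="arialtbl2"><a
href="/georoc/Csv_Downloads/Archean_Cratons_comp/ALDAN_SHIELD_-_ARCHEAN.csv">ALDAN SHIELD -
ARCHEAN.csv</a></td>'''

for link in csv_links(sample):
    print(local_name(link), link)
```

For the live site, `csv_links(fetch(PAGE_URL))` should give the full list, and each URL can then be saved with `urllib.request.urlretrieve(link, local_name(link))`. It is probably polite to pause between downloads, since the page lists many files.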