Python question: the source code returned by requests.get is incomplete

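URL: http://georoc.mpch-mainz.gwdg.de/georoc/Start.asp
I want to scrape all the CSV files listed under "rock" in the page's left-hand navigation, but the source code I get back contains no CSV addresses. How can I solve this? Thanks, everyone!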

Here is my code:

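import requests
from bs4 import BeautifulSoup

archive_url = "http://georoc.mpch-mainz.gwdg.de/georoc/Start.asp"  # page to fetch

def get_video_links():
    r = requests.get(archive_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup)
    # Table cells are <td> (not 'tb'), and the class used on the download page
    # is 'arialtbl2' (lowercase L, not the digit 1).
    item = soup.find_all('td', class_="arialtbl2")     # cells that hold the CSV links
    print(item)
    return item

if __name__ == "__main__":
    video_links = get_video_links()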

Here is the HTML I got back:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<!-- saved from url=(0049)http://georoc.mpch-mainz.gwdg.de/georoc/Start.asp -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">



<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="description" content="GEOROC - Geochemical Database on magmatic Rocks">
<meta name="keywords" lang="en-us" content="database, analyses, volcanic rocks, mantle xenoliths, major elements, trace element,
     concentrations, radiogenic, nonradiogenic, isotope ratios, analytical, ages, whole rocks, volcanic, glasses, minerals, inclusions">
<meta name="keywords" lang="en" content="database, analyses, volcanic, rocks, mantle xenoliths, major elements, trace element,
     concentrations, radiogenic, nonradiogenic, isotope ratios, igneous, analytical, ages, whole rocks, volcanic, glasses, minerals, inclusions">
<meta name="keywords" lang="de" content="Datenbank, geochemie, spurenelement, oxid, gehalte, vulkanite, isotopenverhältnisse, magmatite, xenolithe, analytik, minerale">
<meta name="keywords" lang="it" content="Database , geochimica , contenuti, ossido, oligoelementi , vulcanici , rapporti isotopici , rocce ignee , xenoliti , analisi , minerali">

<title>Geochemical Rock Database-Query</title>

</head>

<frameset rows="100%" cols="18%,82%">
<frame frameborder="0" marginwidth="0" src="./Geochemical Rock Database-Query_files/Query.html" name="Query">
<frame frameborder="0" marginwidth="0" src="./Geochemical Rock Database-Query_files/QueryBlank.html" name="Search">
<noframes>
<body>

<table style="background-color:#99CCFF; padding:1px; border-width:3px; border-color: #000099; height: 8%; width: 40%; margin-left:auto; margin-right:auto; ">
<tr>
<td style="text-align:center;">
<a href="Start.asp"><b>Home</b> |</a> <a href="Content.htm"><b>Content</b> |</a>
</td>
</tr>
</table>

<h1 style="text-align:center;"><b>Query by</b></h1>
<p style="text-align:center;">
&nbsp;</p>
<h2 style="text-align:center;">
<a href="Authors.asp?Frames=no">1. Bibliography</a></h2>
<br/>
<h2 style="text-align:center;">
<a href="QueryLoc.asp?Frames=no">2. Location</a></h2>
<br/>
<h2 style="text-align:center;">
<a href="QueryChem.asp?Frames=no">3. Chemistry</a></h2>
<br/>
<br/>
<br/>

<table style="height: 8%; width:auto; margin-left:auto; margin-right:auto;">
<tr>
<td>
<a href="http://www.mpic.de">&copy; MPI f&uuml;r Chemie, Mainz, Germany</a>&nbsp;
</td>
</tr>
</table>

<p style="text-align:center;">
<span style="font-family: Arial; color: #000033">&nbsp;&nbsp;&nbsp;State: 05/01/2018    </span></p>
</body>
</noframes>
</frameset>

</html>

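Check whether the content on this page is filled in dynamically, that is, by JavaScript reading external JSON data.
requests only retrieves a page's static source code; it cannot see content that is added dynamically.
For dynamically loaded content, scrape with selenium.

Alternatively, use the F12 developer tools to inspect the requests the page makes while loading, find the URL that actually serves the JSON data, and scrape that directly.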
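
The HTML returned in the question is a frameset, so the visible content lives in separate frame documents rather than in Start.asp itself. Below is a minimal sketch, using only the standard library, that lists the frame URLs so each one can be fetched on its own. The names `FrameSrcParser` and `frame_urls` are illustrative, and the `src` values in the sample are placeholders, not the site's confirmed paths.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return the absolute URL of each frame found in an HTML string."""
    parser = FrameSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


# Placeholder frameset shaped like the one in the question (src values assumed).
sample = """
<frameset rows="100%" cols="18%,82%">
  <frame src="Query.html" name="Query">
  <frame src="QueryBlank.html" name="Search">
</frameset>
"""
print(frame_urls(sample, "http://georoc.mpch-mainz.gwdg.de/georoc/Start.asp"))
```

In practice the `html` argument would be `requests.get(archive_url).text`, and each returned URL would be requested in turn to see which frame holds the navigation links.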

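Right-click on the page and choose "View page source" from the context menu.

What you see there is the page's static source code.
If the static source contains the content you want to scrape, the page has no dynamic content and requests is enough.
Otherwise the content is loaded dynamically and you need selenium to scrape it.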
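
The same check can be done in code instead of by eye. This is a rough sketch: the helper name `in_static_source` and both sample strings are made up for illustration, and a plain substring test is only a quick heuristic.

```python
def in_static_source(html, needle):
    """Return True if needle occurs anywhere in the raw HTML (case-insensitive)."""
    return needle.lower() in html.lower()


# Two made-up snippets: a frameset shell and a table row with a CSV link.
frameset_page = '<frameset cols="18%,82%"><frame src="Query.html"></frameset>'
table_page = '<td><a href="/data/EXAMPLE.csv">EXAMPLE.csv</a></td>'

print(in_static_source(frameset_page, ".csv"))
print(in_static_source(table_page, ".csv"))
```

With a live page the first argument would be `requests.get(url).text`. If the check fails for a string that is clearly visible in the browser, the content comes from somewhere else, such as a script or a separate frame document.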


Scrape this page instead:

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
    <title>Download Compiled Data Files </title>
    <link rel="stylesheet" type="text/css" href="StyleSheet.css" />
    <style type="text/css">
        td,
        th {
            border-style: solid;
            border-color: #00507d;
            border-width: 1px;
            padding: 5px;
        }

        .arial12 {
            font-family: arial, helvetica;
            font-size: 12px;
            font-weight: bold;
            text-align: justify;
            text-indent: 20px;
        }

        .arialtbl {
            color: #ffffff;
            font-family: arial, helvetica;
            font-size: 16px;
            font-weight: bold;
            background-color: #00507d;
        }

        .arialtblk {
            color: #ffffff;
            font-family: arial, helvetica;
            font-size: 12px;
            font-weight: bold;
            background-color: #00507d;
        }

        .arialtbl2 {
            font-family: arial, helvetica;
            font-size: 12px;
            font-weight: bold;
            text-align: left;
        }

        .arialtbl3 {
            font-family: arial, helvetica;
            font-size: 12px;
            font-weight: bold;
            color: red;
        }

        a:link {
            text-decoration: none;
            color: #0000cc;
        }

        a:visited {
            text-decoration: none;
            color: #0000cc;
        }

        a:hover {
            text-decoration: underline;
            color: #0000cc;
        }
    </style>
</head>

<body>
    <h2>
        Download Precompiled Datasets &nbsp; <a href="CompRules.html" target="_top">?</a>
    </h2>

    <table
        style="width:50%; border-style: none; border-spacing:0px; background-color: #00507f; color:#ffffff; margin-left:auto; margin-right:auto;">
        <tr>
            <td class="arialtbl2" style="text-align:left;"> Note: Each file is downloaded (use
                <span style="color:#ffff66;">'save as'</span>) as comma separated text file with the
                extension '.csv'. This file may be opened and saved by EXCEL
            </td>
        </tr>
    </table>
    <br/>

    <h2>Available Files</h2>
    <form method="post" action="./CompFiles.aspx" id="fcompfiles">
        <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJNzczNjYyOTY4ZGS141vTb2Mj+J93WU0NnlTsVqoS9xRcV/+YaorZY9kEFw==" />

        <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="5589E234" />
        <table id="tcompfiles" cellpadding="4" width="80%"
            style="border-color:#00507d; border-collapse:collapse; margin-left:auto; margin-right:auto;">
            <tr>
                <td class="arialtbl" align="center" colspan="3">Archean Cratons </td>
            </tr>
            <tr>
                <td class="arialtblk" align="left">Download</td>
                <td class="arialtblk" align="left">Size (KB)</td>
                <td class="arialtblk" align="left">Last Actualization</td>
            </tr>
            <tr>
                <td class="arialtbl2"><a
                        href="/georoc/Csv_Downloads/Archean_Cratons_comp/ALDAN_SHIELD_-_ARCHEAN.csv">ALDAN SHIELD -
                        ARCHEAN.csv</a></td>
                <td class="arialtbl2">212</td>
                <td class="arialtbl2">10/13/2021</td>
            </tr>
        
            </table>
    </form>
    <!-- Piwik -->
    <script type="text/javascript">
        var pkBaseURL = (("https:" == document.location.protocol) ? "https://piwik.gwdg.de/" : "http://piwik.gwdg.de/"); 
document.write(unescape("%3Cscript src='" + pkBaseURL + "piwik.js' type='text/javascript'%3E%3C/script%3E"));
    </script>
    <script type="text/javascript">
        try { 
var piwikTracker = Piwik.getTracker(pkBaseURL + "piwik.php", 31); 
piwikTracker.trackPageView();
piwikTracker.enableLinkTracking();
} catch( err ) {}
    </script>
    <noscript>
        <p><img
src="http://piwik.gwdg.de/piwik.php?idsite=31" style="border:0"
alt=""/></p>
    </noscript>
    <!-- End Piwik Tag -->

</body>

</html>

The file paths are listed on this page: http://georoc.mpch-mainz.gwdg.de/georoc/CompFiles.aspx
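
A minimal sketch of pulling the CSV links out of that page's HTML, assuming the links stay plain `<a href="...csv">` anchors as in the source pasted above. `CsvLinkParser` and `csv_urls` are illustrative names, and only the standard library is used so the snippet runs without extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

COMP_FILES_URL = "http://georoc.mpch-mainz.gwdg.de/georoc/CompFiles.aspx"


class CsvLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target ends in .csv."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".csv"):
                self.hrefs.append(href)


def csv_urls(html, base_url=COMP_FILES_URL):
    """Return absolute URLs for all CSV links found in an HTML string."""
    parser = CsvLinkParser()
    parser.feed(html)
    # hrefs on the page are site-relative, so resolve them against the page URL.
    return [urljoin(base_url, href) for href in parser.hrefs]


# Table row copied from the CompFiles.aspx source pasted above.
sample_row = """
<td class="arialtbl2"><a
        href="/georoc/Csv_Downloads/Archean_Cratons_comp/ALDAN_SHIELD_-_ARCHEAN.csv">ALDAN SHIELD -
        ARCHEAN.csv</a></td>
"""
print(csv_urls(sample_row))
```

To download the files, pass `requests.get(COMP_FILES_URL).text` to `csv_urls` and save `requests.get(url).content` for each result. The site layout may have changed since this thread was written, so check the URLs first.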
