How to fetch / grab a Polymer SPA web page using Python on a headless server with no GUI

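Question

I am trying to fetch the content of the following URL: https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html

My goal is to grab the page content (source) as a visitor sees it, that is, after all the JavaScript etc. has been rendered.

To do this, I used the example described here: http://techstonia.com/scraping-with-phantomjs-and-python.html

That example works on my server. The challenge is to make it work for the Polymer-based SPA site mentioned above as well. These are sites that are rendered entirely by JavaScript.

My code is as follows: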

import platform
from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS files have different extensions
# under different operating systems
if platform.system() == 'Windows':
    PHANTOMJS_PATH = './phantomjs.exe'
else:
    PHANTOMJS_PATH = './phantomjs'


# here we'll use pseudo browser PhantomJS,
# but browser can be replaced with browser = webdriver.Firefox(),
# which is good for debugging.
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')
# Print the rendered HTML; print(browser) would only show the driver object
print(browser.page_source)

The problem is that it produces the following result:

<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<meta content="width=device-width, minimum-scale=1.0, initial-scale=1.0, user-scalable=yes" name="viewport">
<title>Single page app using Polymer</title>
<script async="" src="//www.google-analytics.com/analytics.js"></script><script src="/webcomponents.min.js"></script>
<!-- vulcanized version of imported elements --
       see "elements.html" for unvulcanized list of imports. -->
<link href="vulcanized.html" rel="import">
<link href="styles.css" rel="stylesheet" shim-shadowdom="">
</link></link></meta></meta></head>
<body fullbleed="" unresolved="">
<template id="t" is="auto-binding">
<!-- Route controller. -->
<flatiron-director autohash="" route="{{route}}"></flatiron-director>
<!-- Keyboard nav controller. -->
<core-a11y-keys id="keys" keys="up down left right space space+shift" on-keys-pressed="{{keyHandler}}" target="{{parentElement}}"></core-a11y-keys>
<core-scaffold id="scaffold">
<nav>
<core-toolbar>
<span>Single Page Polymer</span>
</core-toolbar>
<core-menu on-core-select="{{menuItemSelected}}" selected="{{route}}" selectedmodel="{{selectedPage}}" valueattr="hash">
<template repeat="{{page, i in pages}}">
<paper-item hash="{{page.hash}}" noink="">
<core-icon icon="label{{route != page.hash ? '-outline' : ''}}"></core-icon>
<a href="#{{page.hash}}">{{page.name}}</a>
</paper-item>
</template>
</core-menu>
</nav>
<core-toolbar flex="" tool="">
<div flex="">{{selectedPage.page.name}}</div>
<core-icon-button icon="refresh"></core-icon-button>
<core-icon-button icon="add"></core-icon-button>
</core-toolbar>
<div center-center="" fit="" horizontal="" layout="">
<core-animated-pages id="pages" on-tap="{{cyclePages}}" selected="{{route}}" transitions="slide-from-right" valueattr="hash">
<template repeat="{{page, i in pages}}">
<section center-center="" hash="{{page.hash}}" layout="" vertical="">
<div>{{page.name}}</div>
</section>
</template>
</core-animated-pages>
</div>
</core-scaffold>
</template>
<script src="app.js"></script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-43475701-2', 'auto'); // ebidel's
  ga('create', 'UA-39334307-1', 'auto'); // pp.org
  ga('send', 'pageview');
</script>
</body></html>

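This is far from what you actually see when viewing the page in a browser. My question: what am I doing wrong, and where should I look for a solution, if one is possible?

Recommended answer

I think what you are after is in the Selenium WebDriver documentation. You can get the content of a dynamic page, but you have to make sure the element you are searching for is present and visible on the page: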

import platform
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')

# Getting content of the first slide
res1 = browser.find_element_by_xpath('//*[@id="pages"]/section[1]/div')

# Save a screenshot so you can see why it is failing (if it is)
browser.save_screenshot('screen_test.png')

# Print the text within the div
print (res1.text)
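
As a minimal sketch of the "present and visible" check described above, the hypothetical helper below polls until an element exists, is displayed, and has non-empty text. It is not part of the original answer: the name wait_for_visible_text, the timeout and poll defaults, and the broad except clause are assumptions made for illustration. It relies only on the Selenium 3 style find_element_by_xpath() call already used above.

```python
import time


def wait_for_visible_text(driver, xpath, timeout=10.0, poll=0.5):
    """Poll until the element at `xpath` exists, is displayed, and has text.

    `driver` is any object exposing find_element_by_xpath() (Selenium 3 style).
    Returns the element text, or raises TimeoutError if it never appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            element = driver.find_element_by_xpath(xpath)
            if element.is_displayed() and element.text.strip():
                return element.text
        except Exception:
            # Assumption: any lookup error means "not rendered yet", so retry.
            # With Selenium imported, NoSuchElementException would be narrower.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no visible text at %r after %s seconds" % (xpath, timeout))
        time.sleep(poll)
```

With the browser object from the answer, this could be called as wait_for_visible_text(browser, '//*[@id="pages"]/section[1]/div') in place of the direct lookup. Selenium also ships WebDriverWait together with expected_conditions (for example visibility_of_element_located), which is likely the better choice when Selenium is already imported.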

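If you also need the text of the other slides, you need to click (using the webdriver) wherever is needed to make the second slide visible before getting the text from it.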
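
The click-then-read step could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: the helper name read_slides is hypothetical, and it assumes that a click on the element with id "pages" advances to the next slide (the markup in the question binds on-tap to cyclePages, but app.js is not shown), and that a fixed pause is long enough for the slide transition.

```python
import time


def read_slides(driver, count, pause=1.0):
    """Return the text of the first `count` slides, clicking to advance.

    Assumes `driver` offers the Selenium 3 style find_element_by_id() and
    find_element_by_xpath(), and that clicking #pages shows the next slide.
    """
    texts = []
    for index in range(1, count + 1):
        if index > 1:
            # Click the pages container, then let the transition finish.
            driver.find_element_by_id('pages').click()
            time.sleep(pause)
        slide = driver.find_element_by_xpath(
            '//*[@id="pages"]/section[%d]/div' % index)
        texts.append(slide.text)
    return texts
```

With the browser object from the answer, read_slides(browser, 2) would attempt to collect the text of the first two slides. If the page cycles in a different order, or the click target differs, the XPath index and the clicked element would need adjusting.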

Article URL: https://www.itbaoku.cn/post/1740179.html