Trouble Parsing Text using BeautifulSoup and Python

Question

I am trying to retrieve the comment section on regulations.gov pages. An example is the paragraph "Restrictions on Proprietary Trading... with free market driven valuations." on http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032.

I am using BeautifulSoup and Python and have the following code:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
# The URL has to be passed as a quoted string
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
# Snapshot the page source right after get() returns
source = driver.page_source.encode('ascii', 'replace')
soup = BeautifulSoup(source, "html.parser")
print(soup)
# Look for the div that holds the comment text
commentHolder = soup.find("div", {"class": "GGAAYMKDDNE"})
print(commentHolder)

When I print soup I get output (albeit messy), but when I print commentHolder I get None. I am not sure why this happens and would appreciate any help. Thank you.

Note: I used Selenium WebDriver to try to get around the JavaScript. Is this the correct approach?

Recommended answer

You need to have the driver wait explicitly for the element to be present before reading page_source. This worked for me:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")

# Block for up to 10 seconds until the comment container is in the DOM
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.GGAAYMKDGNE")))
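
To make the idea of an explicit wait more concrete, here is a minimal, standard-library-only sketch of the polling loop that such a wait performs conceptually: call a condition repeatedly until it returns something truthy or a timeout expires. This is not Selenium's implementation. The names `wait_until`, `WaitTimeout` and `FakePage` are hypothetical and exist only for illustration, and `FakePage` merely simulates a page whose element appears after a few polls.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy within the timeout."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    (Hypothetical helper, not part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose content is rendered late."""

    def __init__(self, ready_after_polls):
        self.polls = 0
        self.ready_after_polls = ready_after_polls

    def find_comment(self):
        # Returns None until the simulated JavaScript has "rendered"
        # the element, similar to a lookup on an unfinished page.
        self.polls += 1
        if self.polls < self.ready_after_polls:
            return None
        return "<div class='comment'>Restrictions on Proprietary Trading...</div>"


page = FakePage(ready_after_polls=3)
element = wait_until(page.find_comment, timeout=2.0, poll_interval=0.01)
print(element)
```

A single lookup made immediately, as in the question, corresponds to calling `page.find_comment()` once and getting None. Polling with a deadline lets the lookup succeed as soon as the element exists and fails loudly if it never appears.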