How can I make the PhantomJS WebDriver wait until a specific HTML element has loaded and then return the page.source?

Problem description

I have developed the code below as a web-crawling object.

It takes two dates as input. It then builds a list of every date between those two dates and appends each date to the URL of a webpage that contains the weather information for a location. It converts the HTML table of data into a DataFrame and saves the data as a CSV file (the base link is https://www.wunderground.com/history/daily/ir/mashhad/oimm/date/2019-1-3, and as you can see in this example the date is 2019-1-3):

from datetime import timedelta, date
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
from furl import furl
import os
import time

class WebCrawler():
    def __init__(self, st_date, end_date):
        if not os.path.exists('Data'):
            os.makedirs('Data')
        self.path = os.path.join(os.getcwd(), 'Data')
        self.driver = webdriver.PhantomJS()
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        # Create list of dates between two dates given as inputs.
        dates = []
        total_days = int((self.end_date - self.st_date).days + 1)

        for i in range(total_days):
            date = self.st_date + timedelta(days=i)
            dates.append(date.strftime('%Y-%m-%d'))

        return dates

    def create_link(self, attachment):
        # Attach dates to base link
        f = furl(self.base_url)
        f.path /= attachment
        f.path.normalize()

        return f.url

    def open_link(self, link):
        # Opens link and visits page and returns html source code of page
        self.driver.get(link)
        html = self.driver.page_source

        return html

    def table_to_df(self, html):
        # Finds table of weather data and converts it into pandas dataframe and returns it
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find("table",{"class":"tablesaw-sortable"})

        dfs = pd.read_html(str(table))
        df = dfs[0]

        return df

    def to_csv(self, name, df):
        # Save the dataframe as csv file in the defined path
        filename = name + '.csv'
        df.to_csv(os.path.join(self.path,filename), index=False)

This is how I want to use the WebCrawler object:

date1 = date(2018, 12, 29)
date2 = date(2019, 1, 1)

# Initialize WebCrawler object
crawler = WebCrawler(st_date=date1, end_date=date2)
dates = crawler.date_list()

for day in dates:
    print('**************************')
    print('PROCESSING : ', day)
    link = crawler.create_link(day)
    print('WAITING... ')
    time.sleep(3)
    print('VISIT WEBPAGE ... ')
    html = crawler.open_link(link)
    print('DATA RETRIEVED ... ')
    df = crawler.table_to_df(html)
    print(df.head(3))
    crawler.to_csv(day, df)
    print('DATA SAVED ...')

The problem is that the first iteration of the loop runs perfectly, but the second one stops with an error saying No tables were found (raised at the line table = soup.find("table",{"class":"tablesaw-sortable"})). This happens because WebCrawler.open_link returns the page source before the webpage has finished loading its contents, including the table with the weather information. There is also a chance that the website rejects the request because it is keeping the server too busy.

Is there any way to build a loop that keeps trying to open the link until it can find the table, or at least waits until the table has loaded, and then returns the table? A rough sketch of that retry idea follows below.
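A minimal sketch of the retry loop described above, assuming the WebCrawler class from the question; the helper name fetch_table_html and the max_attempts parameter are introduced here purely for illustration (the answers below describe a cleaner approach with explicit waits):

import time
from bs4 import BeautifulSoup

def fetch_table_html(crawler, link, max_attempts=5):
    # Keep re-opening the link until the weather table shows up in the page source.
    for attempt in range(max_attempts):
        html = crawler.open_link(link)
        soup = BeautifulSoup(html, 'lxml')
        if soup.find("table", {"class": "tablesaw-sortable"}) is not None:
            return html
        time.sleep(3)  # back off a little before requesting the page again
    raise RuntimeError('Table not found after {} attempts'.format(max_attempts))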

Recommended answer

I used an explicit wait and fetched the weather information without any problems. I guess the delay between each request led to better performance. The line myElem = WebDriverWait(self.driver, self.delay_for_page)\.until(EC.presence_of_element_located((By.CLASS_NAME, 'tablesaw-sortable'))) also improved the speed.
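A minimal sketch of an open_link rewritten around that wait, assuming the crawler holds a per-page timeout in self.delay_for_page; the delay_for_page attribute and max_retries parameter are assumed names, not confirmed code from the answer:

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def open_link(self, link, max_retries=3):
    # Open the link, but only return the page source once the weather table
    # (class "tablesaw-sortable") is present in the DOM; retry on timeout.
    for attempt in range(max_retries):
        self.driver.get(link)
        try:
            WebDriverWait(self.driver, self.delay_for_page).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'tablesaw-sortable')))
            return self.driver.page_source
        except TimeoutException:
            time.sleep(3)  # give the server a moment before retrying
    return None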

Other answer

You can use Selenium to wait for a specific element. In your case it will be the table with the class name "tablesaw-sortable". I highly recommend that you use a CSS selector to find this element, since it is fast and less error-prone than fetching all table elements.

Here is the CSS selector, written specifically for your table: table.tablesaw-sortable. Set Selenium to wait until that element has loaded.
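For example, an explicit wait on that selector might look like the sketch below; the 30-second timeout is an arbitrary value chosen for illustration:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get('https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/2019-1-3')

# Wait (up to 30 seconds) for the weather table matched by the CSS selector,
# then read the fully loaded page source.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table.tablesaw-sortable')))
html = driver.page_source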

Source: https://stackoverflow.com/a/26567563/21567563/2159473
