Leveraging GPT for On-Demand Web Scraping

GPT Auto Scraper uses LLMs to automatically create and run web-scraping code. It generates Python code tailored to the target HTML source, then executes that code automatically to gather the requested information. By driving ChatGPT through a browser automation library, the tool can work around the token limits of the official GPT APIs — making large-scale scraping practical.

Free?

Yes. Unfortunately, as of 2023, parsing sizable HTML pages can exceed the token limits of GPT's official API. So instead, we automate the copy-paste workflow inside a ChatGPT session in a web browser via Selenium. Note that this is meant for personal use and should not be used for commercial purposes — it simply automates the copy-paste process you would otherwise perform by hand when web scraping.

Usage

  1. Download Chrome:
    Chrome is recommended. Alternatively, install Edge or Chromium and update server.py to point to that browser's path.
  2. Install Dependencies:
    pip install -r requirements.txt
    Make sure you have both Selenium and undetected-chromedriver installed.
  3. Run the ChatGPT Browser Server:
    python server.py
    
    # Log into your OpenAI account in the automated browser
    # Press Enter in the terminal once logged in to start the server
    
    This creates a local mock API that routes prompts through ChatGPT.
  4. Run the Scraper Script:
    python scraper.py
    
    Modify scraper.py to specify the desired URL and the structure of data you want to gather.

Future

  • Websites.txt: A list of multiple websites to scrape in batch mode.
  • Settings.json: Easily configure output formats + scraping intervals.

Example

Below is an example of how scraper.py might look when retrieving images from a Google Images search. The snippet cleans the DOM, then prompts GPT to produce specialized Python scraping code on the fly:

import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.google.com/search?q=hello+world&tbm=isch'

driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)

html = driver.page_source
driver.quit()  # close the browser once we have the page source

# Clean up the HTML: strip scripts, styles, images, and other non-content tags
soup = BeautifulSoup(html, 'html.parser')
for tag in soup(["script", "style", "img", "svg", "meta", "link", "a", "nav", "input"]):
    tag.extract()

# HTML comments are not tags, so remove them via BeautifulSoup's Comment class
from bs4 import Comment
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# Remove extraneous attributes
for tag in soup.find_all(True):
    for attr in tag.attrs.copy():
        if attr not in ['class', 'id']:
            del tag[attr]

# Remove empty elements
for tag in soup.find_all(True):
    if len(tag.get_text(strip=True)) == 0:
        tag.extract()

# Final cleaned HTML
html = str(soup)

# Prepare the GPT prompt
pythoncode = f"""from bs4 import BeautifulSoup
from selenium import webdriver
import json
import time

driver = webdriver.Chrome()
url = '{url}'
driver.get(url)
time.sleep(5)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')"""

query = f"Write Python code to scrape images at {url} using Selenium, then save them to a JSON file. Extend from here:\n\n{pythoncode}\n\nCleaned HTML:\n\n{html}..."

response = requests.post('http://localhost:8000', data={'q': query})
code = response.text

# Try executing the generated code; on failure, feed the error back to ChatGPT for a fix (up to 3 attempts)
for i in range(3):
    try:
        exec(code)
        break
    except Exception as e:
        # Send error feedback to ChatGPT
        response = requests.post('http://localhost:8000', data={'q': f"{code}\n{e}"})
        code = response.text

with open('webscraper.py', 'w') as f:
    f.write(code)