GPT Auto Scraper
Leveraging GPT for On-Demand Web Scraping
GPT Auto Scraper uses LLMs to automatically create and run web-scraping code. It generates Python code tailored to the target HTML source, then executes that code automatically to gather the requested information. By driving ChatGPT through a browser automation library, the tool can work past the token limits of the official GPT APIs, making large-scale scraping practical to accomplish.
Free?
Yes. Unfortunately, as of 2023, generating and parsing sizable HTML pages can exceed the token limits of GPT's API. So instead, this tool automates the copy-paste steps inside a ChatGPT session running in a web browser via Selenium. Note that this is meant for personal use and should not be used for commercial purposes, as it is simply a means of automating the copy-paste process you would otherwise perform by hand when web scraping.
Usage
- Download Chrome:

  Chrome is recommended. Alternatively, install Edge or Chromium and update `server.py` to point to that browser's path.

- Install Dependencies:

  ```
  pip install -r requirements.txt
  ```

  Make sure you have both Selenium and undetected-chromedriver installed.

- Run the ChatGPT Browser Server:

  ```
  python server.py
  ```

  Log into your OpenAI account in the automated browser, then press Enter in the terminal to start the server. This creates a local mock API that routes prompts through ChatGPT.

- Run the Scraper Script:

  ```
  python scraper.py
  ```

  Modify `scraper.py` to specify the desired URL and the structure of the data you want to gather.
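Once the server is running, any script can talk to the local mock API with a plain HTTP POST. The endpoint and the `q` form field below match the ones used in the scraper example later in this README; the `ask_gpt` wrapper itself is a hypothetical helper, not part of the repo:

```python
import requests

def ask_gpt(prompt, endpoint='http://localhost:8000'):
    """Send a prompt through the local mock API started by server.py.

    Hypothetical convenience wrapper: the endpoint URL and the 'q' form
    field come from the scraper example; the function name is ours.
    """
    response = requests.post(endpoint, data={'q': prompt})
    response.raise_for_status()
    return response.text
```

This keeps the rest of your code ignorant of whether prompts are served by a real API or by the Selenium-driven browser session.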
Future
- Websites.txt: A list of multiple websites to scrape in batch mode.
- Settings.json: Easily configure output formats and scraping intervals.
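The planned Settings.json might look something like the sketch below. Since the format is not yet implemented, every field name here is hypothetical:

```json
{
  "output_format": "json",
  "scrape_interval_seconds": 3600,
  "max_retries": 3
}
```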
Example
Below is an example of how `scraper.py` might look when retrieving images for a Google search. This snippet cleans the DOM and prompts GPT to produce specialized Python scraping code in real time:
```python
import time

import requests
from bs4 import BeautifulSoup, Comment
from selenium import webdriver

url = 'https://www.google.com/search?q=hello+world&tbm=isch'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)
html = driver.page_source
driver.quit()

# Clean up the HTML: remove scripts, styles, images, and other non-content tags
soup = BeautifulSoup(html, 'html.parser')
for tag in soup(["script", "style", "img", "svg", "meta", "link", "a", "nav", "input"]):
    tag.extract()
# HTML comments are text nodes, not tags, so they are removed separately
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# Remove extraneous attributes, keeping only class and id
for tag in soup.find_all(True):
    for attr in list(tag.attrs):
        if attr not in ['class', 'id']:
            del tag[attr]

# Remove empty elements
for tag in soup.find_all(True):
    if len(tag.get_text(strip=True)) == 0:
        tag.extract()

# Final cleaned HTML
html = str(soup)

# Prepare the GPT prompt: a code stub for GPT to extend
pythoncode = f"""from bs4 import BeautifulSoup
from selenium import webdriver
import json
import time

driver = webdriver.Chrome()
url = '{url}'
driver.get(url)
time.sleep(5)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')"""

query = (f"Write Python code to scrape images at {url} using Selenium, "
         f"then save them to a JSON file. Extend from here:\n\n{pythoncode}\n\n"
         f"Cleaned HTML:\n\n{html}...")
response = requests.post('http://localhost:8000', data={'q': query})
code = response.text

# Attempt to execute the generated code, retrying with error feedback
for i in range(3):
    try:
        exec(code)
        break
    except Exception as e:
        # Send the failing code and its error back to ChatGPT for a fix
        response = requests.post('http://localhost:8000', data={'q': f"{code}\n{e}"})
        code = response.text

with open('webscraper.py', 'w') as f:
    f.write(code)
```
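The cleaning pass is the part worth understanding in isolation: it shrinks the page so the prompt fits in a ChatGPT message while keeping the structure GPT needs to write selectors. Here is a standalone sketch of that same BeautifulSoup logic applied to a tiny page (the sample HTML is invented for illustration):

```python
from bs4 import BeautifulSoup

sample = """
<html><head><script>var x = 1;</script><style>p { color: red }</style></head>
<body>
  <nav><a href="/">home</a></nav>
  <div id="results" data-ved="abc123"><p class="caption">hello world</p></div>
  <span></span>
</body></html>
"""

soup = BeautifulSoup(sample, 'html.parser')

# Strip noisy tags, as in scraper.py
for tag in soup(["script", "style", "img", "svg", "meta", "link", "a", "nav", "input"]):
    tag.extract()

# Keep only class and id attributes
for tag in soup.find_all(True):
    for attr in list(tag.attrs):
        if attr not in ['class', 'id']:
            del tag[attr]

# Drop elements with no visible text
for tag in soup.find_all(True):
    if not tag.get_text(strip=True):
        tag.extract()

cleaned = str(soup)
print(cleaned)  # only the div/p carrying actual text survive, with class/id intact
```

Only `id` and `class` survive because those are usually what GPT needs to target elements; everything else is dead weight in the prompt.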