Unlocking AI Web Agents: GPT-4V and Puppeteer Empower Autonomous Browsing

Unlock powerful AI web agents with GPT-4V and Puppeteer! Explore autonomous browsing, web scrapers, and sophisticated AI-driven web research. Discover how these advanced AI assistants can revolutionize tasks, from RPA to customer support.

July 14, 2024

This approach combines GPT-4V and Puppeteer for AI-driven web automation. You'll see how to build an AI agent that can browse the web, extract data, and complete multi-step tasks with far less hand-built automation.

Use Case and Market Opportunities for AI Web Agents

One way to assess the use cases and market opportunities for AI web agents is to examine previous attempts at building similar systems, where they fell short, and how new technology could change the picture.

One direct market category is Robotic Process Automation (RPA), a category of software that helps enterprises build automated bots to handle repetitive and standardized tasks like invoice processing or data entry. However, the limitations of RPA solutions are clear: they struggle with non-standardized or ever-changing processes, and each specific automation carries a high setup cost.

In contrast, AI web agents that can directly control the computer and browser are exciting because they can theoretically handle much more complex situations with much less setup cost. Instead of building specific automations, the agent can simply navigate websites, extract data, and complete tasks regardless of format changes, as the agent can make the necessary decisions.

These AI agents can also go beyond traditional RPA to complete more intelligent tasks like customer support, sales, and marketing. By accessing more systems and leveraging their decision-making abilities, these AI "workers" can be deployed for a wider range of use cases, including lower-volume consumer applications.

However, delivering a useful AI worker takes more than technical capability; it also requires end-to-end knowledge of the workflow for a specific job function. A recent HubSpot research report that surveyed over 1,400 global sales leaders provides valuable insights into the modern sales workflow and AI use cases, which can be very helpful for building AI agents for sales functions.

In summary, the key opportunities for AI web agents include:

  • Handling more complex, non-standardized tasks compared to traditional RPA
  • Reducing setup costs for automations
  • Expanding beyond just automation to more intelligent tasks like customer support and sales
  • Leveraging deep workflow knowledge for specific job functions to build more effective AI agents

Two Approaches to Building AI Web Agents

Approach 1: GPT-4V Powered Web Scraper

  1. Use a Node.js library like Puppeteer to take screenshots of web pages and control the web browser.
  2. Create a Python script that calls the JavaScript file to take screenshots and then uses GPT-4V to extract data from the screenshots.
  3. The Python script defines functions to convert the image to base64, take screenshots, and use GPT-4V to extract information from the screenshots.
  4. The script connects these functions together to create a web scraper that reads pages visually, so it can often handle websites that block conventional scraping services.

Approach 2: Building a Web AI Agent

  1. Create a Node.js file that imports various libraries and sets up an OpenAI instance and a command-line interface.
  2. Implement a highlightLinks function that identifies all the interactive elements on a web page and adds a special attribute to them.
  3. Define a main function that creates a Puppeteer browser, sets up a system message for GPT-4V, and enters a loop where it:
    • Gets a response from GPT-4V based on the user's prompt and the current state of the web page.
    • If the response indicates a link should be clicked, it finds the corresponding element and clicks it.
    • If the response indicates a new URL should be visited, it navigates to that URL and highlights the links.
    • If the response is a regular message, it displays the result to the user.
  4. This web AI agent can navigate through multiple websites, click on links, and complete complex research tasks by leveraging the capabilities of GPT-4V.
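
The decision step in item 3 can be sketched in Python, the same language as the vision scraper script. This is a minimal sketch, not the agent's actual code: it assumes the system message tells GPT-4V to embed a small JSON object such as {"click": "link text"} or {"url": "https://..."} in its reply when it wants to act, and to answer in plain text otherwise. That reply format and the name parse_action are assumptions made for illustration.

```python
import json
import re

# Assumed reply format (not specified in the article): the model embeds a
# flat JSON object like {"click": "Pricing"} or {"url": "https://example.com"}
# when it wants to act, and replies with plain text when it is done.
ACTION_PATTERN = re.compile(r"\{[^{}]*\}")


def parse_action(reply: str) -> tuple:
    """Classify a reply as ("click", text), ("url", address) or ("message", reply)."""
    for candidate in ACTION_PATTERN.findall(reply):
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # braces that do not hold valid JSON are ignored
        if isinstance(data.get("click"), str):
            return ("click", data["click"])
        if isinstance(data.get("url"), str):
            return ("url", data["url"])
    # No recognizable action: treat the whole reply as the answer for the user.
    return ("message", reply.strip())


if __name__ == "__main__":
    print(parse_action('{"url": "https://en.wikipedia.org"}'))
    print(parse_action('I will open it: {"click": "Featured article"}'))
    print(parse_action("The article was featured on 14 July."))
```

Returning a (kind, value) pair keeps the browser-control code separate from the parsing, so the same function works whether the browser is driven by Puppeteer or another tool.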

Both approaches demonstrate how you can leverage large language models like GPT-4V to build powerful web automation and research tools. The first approach focuses on web scraping, while the second approach creates a more interactive web agent that can navigate and complete tasks on the web.

Building a GPT-4V Powered Web Scraper

To build a GPT-4V powered web scraper, we'll use a Node.js library called Puppeteer to control the web browser and take screenshots. Here's a step-by-step guide:

  1. Create a new file called screenshot.js, import the libraries, and register the stealth plugin:

    const puppeteerExtra = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');

    // The stealth plugin is what makes the automated browser harder to detect.
    puppeteerExtra.use(StealthPlugin());

  2. Define the URL you want to scrape and a timeout value:

    // Use the URL passed on the command line, or fall back to a default.
    const url = process.argv[2] || 'https://en.wikipedia.org/wiki/Main_Page';
    const timeout = 60000; // 60 seconds

  3. Create an asynchronous function to launch the browser, navigate to the URL, and take a screenshot:

    async function takeScreenshot() {
      const browser = await puppeteerExtra.launch();
      try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1920, height: 1080 });
        await page.goto(url, { waitUntil: 'networkidle0', timeout });
        await page.screenshot({ path: 'screenshot.jpg', fullPage: true });
      } finally {
        await browser.close(); // always release the browser, even on errors
      }
    }

     In this example, puppeteer-extra with its stealth plugin makes the browser less detectable by websites.

  4. Run takeScreenshot() to capture the screenshot, exiting with a non-zero status on failure so that a calling script can detect it:

    takeScreenshot().catch((err) => {
      console.error(err);
      process.exit(1);
    });

Now, you can run the script with node screenshot.js, and it will save a screenshot of the Wikipedia homepage to the screenshot.jpg file.

Next, we'll create a Python script that uses the screenshot and GPT-4V to extract data from the website:

  1. Create a new file called vision_scraper.py and import the necessary libraries:

    import os
    import subprocess
    import base64

    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()
    # Uses the openai>=1.0 client; the key is read from the .env file.
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

  2. Define functions to convert the image to base64 and take a screenshot using the screenshot.js script:

    def image_to_b64(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")

    def url_to_screenshot(url):
        # Remove any stale screenshot so a failed run cannot return old data.
        if os.path.exists("screenshot.jpg"):
            os.remove("screenshot.jpg")
        try:
            # Hand the URL to screenshot.js as a command-line argument.
            subprocess.run(["node", "screenshot.js", url], check=True)
        except subprocess.CalledProcessError:
            return None
        return "screenshot.jpg" if os.path.exists("screenshot.jpg") else None

  3. Create a function to use GPT-4V to extract information from the screenshot. The image must be sent as an image_url content part to a vision-capable model, not as plain text:

    def vision_extract(image_b64, prompt):
        response = client.chat.completions.create(
            # GPT-4V model name; swap in any vision-capable model you have access to.
            model="gpt-4-vision-preview",
            messages=[
                {"role": "system", "content": "You are a web scraper. Your job is to extract information based on a screenshot of a website and user instructions."},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        # The screenshot travels as a base64 data URL.
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    ],
                },
            ],
            max_tokens=2048,
            temperature=0.7,
        )
        return response.choices[0].message.content.strip()

  4. Tie everything together in a vision_query() function:

    def vision_query(url, prompt):
        screenshot_path = url_to_screenshot(url)
        if screenshot_path:
            image_b64 = image_to_b64(screenshot_path)
            return vision_extract(image_b64, prompt)
        else:
            return "Error: Unable to capture screenshot."

  5. You can now use the vision_query() function to extract information from a website:

    result = vision_query("https://www.linkedin.com/in/your-profile-url", "Extract the work experience section from the profile.")
    print(result)

This example will take a screenshot of the specified LinkedIn profile and use GPT-4V to extract the work experience section. You can customize the prompt to extract different types of information from the website.
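
As one illustration of customizing the prompt, the sketch below asks for the extracted fields as JSON and parses the reply. The helper names build_extraction_prompt and parse_json_reply are hypothetical, and the sketch assumes the model follows the instruction to return a JSON object (possibly wrapped in a Markdown code fence), which is not guaranteed.

```python
import json

# Markdown code-fence marker (three backticks), built programmatically.
FENCE = "`" * 3


def build_extraction_prompt(fields: list) -> str:
    """Build a prompt asking for the named fields as a single JSON object."""
    wanted = ", ".join(f'"{name}"' for name in fields)
    return (
        "From the screenshot, extract the following fields and reply with "
        f"one JSON object whose keys are exactly: {wanted}. "
        "Use null for anything that is not visible."
    )


def parse_json_reply(reply: str) -> dict:
    """Parse a JSON object from a reply, tolerating a surrounding code fence."""
    text = reply.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)].strip()
        if text.startswith("json"):
            text = text[len("json"):]  # drop the fence's language tag
    data = json.loads(text)  # raises json.JSONDecodeError on malformed replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data
```

With the functions from this section, a call might look like parse_json_reply(vision_query(url, build_extraction_prompt(["title", "company"]))). Because the model may still reply with prose or an error string, wrapping that call in a try/except is advisable.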

Building an AI Web Agent that Browses the Web Like a Human

One way to look at the use cases and market opportunities for self-operating computer systems is to examine previous attempts at building similar systems, their limitations, and how new technologies can potentially change the dynamics.

One such market category is Robotic Process Automation (RPA), which is a category of software that helps enterprises build automated bots to handle repetitive and standardized tasks like invoice processing and data entry. While RPA solutions like UiPath have provided significant value to enterprises, they are limited in their ability to handle non-standardized or ever-changing processes, as well as tasks that involve complex decision-making.

The emergence of agents built on multimodal models like GPT-4V, which can directly control the computer and browser, presents an exciting opportunity to address these limitations. In theory, these AI agents can handle much more complex situations with much less setup cost. Instead of requiring a specific automation for each website structure, the agent can navigate the websites, take screenshots, and extract data regardless of format changes, because it makes the necessary decisions itself.

Moreover, these AI agents can go beyond just automation and complete intelligent tasks like customer support, sales, and marketing, as they have the ability to access and interact with various systems.

To demonstrate how to build such a web AI agent, let's walk through a step-by-step example. We'll start with a simple GPT-4V-powered web scraper that takes screenshots of web pages and uses GPT-4V to extract data. Then, we'll create a more advanced web AI agent that browses the internet like a human, finding and clicking links to navigate and gather information.

For the web scraper, we'll use a Node.js library called Puppeteer to control the web browser and take screenshots. We'll then create a Python script that calls the JavaScript file to take the screenshot and uses GPT-4V to extract the data.

For the web AI agent, we'll create a more sophisticated JavaScript file that can highlight interactive elements on the web page, click on links, and navigate to new URLs based on the instructions from GPT-4V. The agent will continuously loop through this process, taking screenshots and passing them to GPT-4V, until it believes it has found the necessary information to answer the user's query.
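
The loop described above can be sketched as follows. This is a hedged outline in Python rather than the JavaScript implementation the article refers to: the model call and the browser are passed in as plain objects so the control flow can be read and tested without an API key or a running browser. The names run_agent, ask_model, and Browser, the JSON action format, and the max_steps cap are all assumptions.

```python
import json
from typing import Callable, Optional, Protocol


class Browser(Protocol):
    """Hypothetical minimal browser interface; a real one would wrap Puppeteer."""

    def goto(self, url: str) -> None: ...
    def click(self, link_text: str) -> None: ...
    def screenshot(self) -> str: ...  # returns a base64-encoded image


def run_agent(
    task: str,
    ask_model: Callable[[str, Optional[str]], str],
    browser: Browser,
    max_steps: int = 10,
) -> str:
    """Ask the model, act on its reply, then feed back a fresh screenshot."""
    screenshot = None  # no page has been opened yet
    for _ in range(max_steps):
        reply = ask_model(task, screenshot)
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text: the model is answering the user
        if not isinstance(action, dict):
            return reply
        if "url" in action:
            browser.goto(action["url"])
        elif "click" in action:
            browser.click(action["click"])
        else:
            return reply  # JSON without a known action is treated as an answer
        screenshot = browser.screenshot()
    return "Stopped: step limit reached before the model gave an answer."
```

In a real agent, ask_model would wrap the GPT-4V call and the Browser object would wrap the Puppeteer page. The step cap is a safety net against a model that keeps clicking without ever answering.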

By the end of this example, you'll have a good understanding of how to build an AI agent that can control your web browser or computer to complete complex tasks.

FAQ