This project uses AutoGen, a powerful open-source framework from Microsoft that lets multiple AI "agents" talk to each other and work together to solve complex tasks. Here's a breakdown of the main building blocks involved:
Think of agents as intelligent team members, each with a specific role. They are role-based components powered by LLMs (Large Language Models) that collaborate through natural language conversations. Agents can be equipped with tools (like web browsers) to perform actions, and optionally with memory to retain context across messages. AutoGen coordinates their interactions, enabling agents to delegate, respond, and adapt dynamically. This allows you to build flexible, multi-agent workflows that simulate real-world teamwork.
In this project, we will focus initially on two types of AutoGen agents:
- WebSurfer Agent: It can open real websites, click buttons, type in search boxes, and interact like a real user.
- UserProxy Agent: Represents you. It acts as your voice in the system and lets you send or receive messages in the agent conversation.
Agents talk to each other using natural language (text), and AutoGen coordinates this as a chat.
Some agents can use tools: pieces of code that help them perform specific actions, such as:
- Search a database
- Call an API
- Extract info from a document
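To make the idea of a tool concrete, here is a minimal sketch in plain Python. It assumes a tool is nothing more than a typed, documented function that an agent can call by name. The `CATALOG` data, the function names, and the `call_tool` dispatcher are all hypothetical and exist only for illustration; this is not AutoGen's own tool-registration API.

```python
from typing import Callable

# Hypothetical in-memory "database", used only for this illustration.
CATALOG = {
    "laptop": 999.0,
    "mouse": 25.0,
    "keyboard": 70.0,
}


def search_catalog(query: str) -> list[str]:
    """Return product names that contain the query (case-insensitive), sorted."""
    needle = query.lower()
    return sorted(name for name in CATALOG if needle in name)


def get_price(product: str) -> float:
    """Return the price of a product; raises KeyError for unknown products."""
    return CATALOG[product]


# Registry mapping a tool name to the function that implements it.
TOOLS: dict[str, Callable] = {
    "search_catalog": search_catalog,
    "get_price": get_price,
}


def call_tool(name: str, **kwargs):
    """Look up a tool by name and call it with the given keyword arguments."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

An agent that decides it needs a price would then request something like `call_tool("get_price", product="mouse")`, and the result would be fed back into the conversation as text.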
💡 In this project we will not build any custom tools. Instead, the WebSurfer comes with a built-in "tool": it controls a real browser through a library called Playwright. This lets it click buttons, navigate websites, and act as a human user would.
Memory in AutoGen allows agents to remember what happened earlier. This is useful for:
- Keeping track of what items were already viewed or added
- Referencing earlier steps
- Maintaining conversation history
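As a rough illustration of the uses listed above, the sketch below keeps a bounded conversation history and lets an agent look up earlier messages by keyword. The `Message` and `ConversationMemory` names are hypothetical, and this is a simplified stand-in rather than AutoGen's memory API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    text: str


class ConversationMemory:
    """Keeps the most recent messages; older ones are dropped automatically."""

    def __init__(self, max_messages: int = 50) -> None:
        # A deque with maxlen discards the oldest entry once it is full.
        self._history: deque[Message] = deque(maxlen=max_messages)

    def remember(self, sender: str, text: str) -> None:
        """Store one message at the end of the history."""
        self._history.append(Message(sender, text))

    def history(self) -> list[Message]:
        """Return the retained messages, oldest first."""
        return list(self._history)

    def recall(self, keyword: str) -> list[Message]:
        """Return retained messages whose text mentions the keyword."""
        needle = keyword.lower()
        return [m for m in self._history if needle in m.text.lower()]
```

For example, after a few `remember(...)` calls, `recall("cart")` would return every retained message that mentions the cart, which is the kind of lookup needed to track items already viewed or added.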
💡 In this project, for simplicity, we will not use memory.
The MultimodalWebSurfer is a special agent that can:
- Open real websites in a visible browser window (not just simulate clicks invisibly)
- See and interact with what's on the page
- Make decisions (e.g., what to click or where to type), using an LLM such as GPT-4o
It uses a tool called Playwright under the hood to control the browser, and GPT-4o to decide what actions to take.
Agents don't work alone; they're grouped into teams that define how they interact. In AutoGen, teams are managed using GroupChat types.
For this hackathon, we will keep it simple and use RoundRobinGroupChat to alternate between:
- The WebSurfer, who browses and clicks
- The UserProxy, who gives instructions and ends the chat
This means agents take turns speaking one after the other: like a conversation where everyone gets a chance to respond.
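The setup described above could be wired together roughly as follows. This is a sketch, not the project's final code: it assumes the AutoGen 0.4-style packages (`autogen-agentchat` and `autogen-ext` with its OpenAI and web-surfer extras), an `OPENAI_API_KEY` in the environment, and Playwright browsers already installed. The agent names and the `exit` keyword are choices made for this example.

```python
import asyncio

# Word the human types to end the chat (an assumption for this sketch).
TERMINATION_KEYWORD = "exit"


def build_team():
    """Create a WebSurfer + UserProxy team that takes turns in order."""
    # Imported inside the function so this file can be loaded even before
    # the AutoGen packages are installed.
    from autogen_agentchat.agents import UserProxyAgent
    from autogen_agentchat.conditions import TextMentionTermination
    from autogen_agentchat.teams import RoundRobinGroupChat
    from autogen_ext.agents.web_surfer import MultimodalWebSurfer
    from autogen_ext.models.openai import OpenAIChatCompletionClient

    # The client reads OPENAI_API_KEY from the environment.
    model_client = OpenAIChatCompletionClient(model="gpt-4o")

    web_surfer = MultimodalWebSurfer(
        name="web_surfer",
        model_client=model_client,
        headless=False,  # show the browser window instead of hiding it
    )
    # The user proxy asks the human for input on the console.
    user_proxy = UserProxyAgent(name="user_proxy", input_func=input)

    team = RoundRobinGroupChat(
        [web_surfer, user_proxy],
        # Stop as soon as any message contains the keyword.
        termination_condition=TextMentionTermination(TERMINATION_KEYWORD),
    )
    return team, web_surfer


async def main(task: str) -> None:
    """Run the team on one task and stream the conversation to the console."""
    from autogen_agentchat.ui import Console

    team, web_surfer = build_team()
    try:
        await Console(team.run_stream(task=task))
    finally:
        # Always close the browser that the WebSurfer controls.
        await web_surfer.close()


def run(task: str) -> None:
    """Synchronous entry point for scripts."""
    asyncio.run(main(task))
```

Calling `run("Open example.com and summarize the page")` would start the chat: the WebSurfer acts first, then the UserProxy prompts you for the next instruction, and typing `exit` ends the session.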
ℹ️ Note: AutoGen also supports other group chat types such as:
- SelectorGroupChat: Chooses the most relevant agent to respond, instead of going in order.
- OrderedGroupChat: Follows a strict sequence you define.
- MagenticOneGroupChat: An advanced setup for coordinating complex, multi-agent workflows with tracing and logic.
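The difference between taking turns in order and picking the most relevant agent can be sketched in a few lines of plain Python. This is only a conceptual illustration: the keyword-overlap heuristic below is a hypothetical stand-in for the model-based choice a selector-style chat makes, and none of these functions are part of AutoGen.

```python
from itertools import cycle
from typing import Iterator


def round_robin(agents: list[str]) -> Iterator[str]:
    """Yield agent names in a fixed order, repeating forever."""
    return cycle(agents)


def keyword_selector(skills: dict[str, set[str]], message: str, default: str) -> str:
    """Pick the agent whose skill keywords overlap the message the most.

    Falls back to `default` when no agent matches any word.
    """
    words = set(message.lower().split())
    best = max(skills, key=lambda name: len(skills[name] & words))
    return best if skills[best] & words else default


# Hypothetical agents and skill keywords, for illustration only.
SKILLS = {
    "web_surfer": {"open", "click", "search"},
    "user_proxy": {"approve", "stop"},
}
```

With round-robin, the next speaker depends only on position in the list; with a selector, it depends on the content of the latest message, so `keyword_selector(SKILLS, "please open the cart page", default="user_proxy")` picks the `web_surfer`.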