Overview
Web sources allow you to import content directly from your website into your agent’s knowledge base. This is the most common way to train your agent on existing website content like product pages, documentation, blog posts, and service descriptions.
Discovery Methods
The platform offers four methods to discover and import web content:
Quick Scan
Fast domain mapping that quickly discovers pages across your website.
Deep Scan
Thorough crawling with advanced options for precise control over what gets imported.
Sitemap Import
Import URLs directly from your website’s sitemap.xml file.
Manual Entry
Paste specific URLs when you know exactly which pages to import.
Quick Scan
Quick Scan is the fastest way to discover pages on your website. It uses intelligent domain mapping to find pages without fully crawling each one.
How to Use
- Select Quick Scan as your discovery method
- Enter your website URL (e.g., https://example.com)
- Click Scan Domain
- Review the discovered URLs in the pending list
- Save the pages you want to your agent’s knowledge base
Advanced Options
URL Limit
By default, Quick Scan will discover unlimited pages. You can set a limit to cap the number of URLs discovered:
- Unlimited: Discover all available pages
- Custom limit: Set a specific number (e.g., 100 pages)
Deep Scan
Deep Scan provides thorough crawling with fine-grained control over the crawling process. Use this when you need precise control over which pages are discovered.
How to Use
- Select Deep Scan as your discovery method
- Enter your starting URL (e.g., https://example.com/docs)
- Configure advanced options (optional)
- Click Scan Domain
- Monitor the crawl progress in real-time
- Review and save discovered URLs
Advanced Options
Deep Scan offers several configuration options:
Max Depth
Controls how many levels deep the crawler will follow links.
Default: 2 levels
| Depth | Behavior |
|---|---|
| 0 | Only the starting URL |
| 1 | Starting URL + pages linked from it |
| 2 | Starting URL + 2 levels of linked pages |
| 3+ | Continues following links to the specified depth |
Higher depth values result in more pages but longer crawl times.
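Conceptually, depth-limited crawling works like the breadth-first sketch below. This is an illustrative approximation, not the platform’s actual crawler; the `requests` dependency, the `crawl` helper, and its parameters are assumptions. It also includes the per-request delay described under Wait Time below.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests  # assumed third-party dependency (pip install requests)


class LinkParser(HTMLParser):
    """Collect href values from anchor tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_depth=2, wait_ms=200):
    """Breadth-first crawl: depth 0 keeps only the start URL,
    and each additional level follows links one more hop."""
    discovered = {start_url}
    frontier = [start_url]
    for _level in range(max_depth):
        next_frontier = []
        for url in frontier:
            time.sleep(wait_ms / 1000)  # polite delay between requests (see Wait Time)
            parser = LinkParser()
            parser.feed(requests.get(url, timeout=10).text)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute not in discovered:
                    discovered.add(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return discovered
```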
Wait Time
Time in milliseconds to wait between requests. This helps avoid overwhelming your server and prevents rate limiting.
Default: 200ms
Increase this value if your server has rate limiting or if you’re experiencing timeout errors.
URL Limit
Maximum number of URLs to discover during the crawl.
- Unlimited: No cap on discovered URLs
- Custom limit: Stop after discovering specified number of pages
Domain Restriction
Controls whether the crawler stays on your domain or follows external links.
Default: Same Domain Only
| Option | Behavior |
|---|---|
| Same Domain Only | Only crawl pages on the same domain as the starting URL |
| All Domains | Follow links to external websites too |
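In effect, Same Domain Only is a host comparison between each discovered link and the starting URL. A minimal sketch of that check (illustrative, not the platform’s code):

```python
from urllib.parse import urlparse


def allowed_by_domain(url, start_url, same_domain_only=True):
    """With Same Domain Only, keep a link only if its host matches the start URL's host."""
    if not same_domain_only:
        return True
    return urlparse(url).netloc == urlparse(start_url).netloc
```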
Subpath Restriction
Limit crawling to specific paths on your website. Enter comma-separated paths to restrict the crawler.
Example: /docs, /blog, /products
This would only crawl URLs that contain /docs, /blog, or /products in their path.
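A minimal sketch of how a subpath restriction can be applied to each discovered URL (the helper name and example paths are illustrative):

```python
from urllib.parse import urlparse


def allowed_by_subpath(url, subpaths=("/docs", "/blog", "/products")):
    """Keep a URL only if its path contains one of the configured subpaths."""
    path = urlparse(url).path
    return any(subpath in path for subpath in subpaths)
```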
Filtering Options
Additional filters to exclude unwanted URLs:
All filters are enabled by default.
| Filter | What it excludes |
|---|---|
| Skip Social Media | Links to Facebook, Twitter, LinkedIn, etc. |
| Skip File URLs | Links to PDFs, images, downloads, etc. |
| Skip Anchor Links | URLs with # fragments |
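The three filters amount to per-URL checks along these lines. The domain and extension lists below are illustrative examples, not the platform’s actual lists:

```python
from urllib.parse import urlparse

SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "x.com", "linkedin.com", "instagram.com"}
FILE_EXTENSIONS = (".pdf", ".jpg", ".png", ".zip", ".docx")


def passes_filters(url, skip_social=True, skip_files=True, skip_anchors=True):
    """Apply the three default exclusion filters to a single URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    if skip_social and host in SOCIAL_DOMAINS:
        return False  # Skip Social Media
    if skip_files and parsed.path.lower().endswith(FILE_EXTENSIONS):
        return False  # Skip File URLs
    if skip_anchors and parsed.fragment:
        return False  # Skip Anchor Links
    return True
```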
Canceling a Crawl
During a Deep Scan, you can click Cancel at any time to stop the crawl. Any URLs discovered up to that point will still be available in your pending list.
Sitemap Import
If your website has a sitemap.xml file, you can import all URLs from it directly. This is often the most reliable method for well-maintained websites.
How to Use
- Select Sitemap as your discovery method
- Enter your sitemap URL (e.g., https://example.com/sitemap.xml)
- Click Import Sitemap
- Review the parsed URLs
- Save the pages you want
Finding Your Sitemap
Common sitemap locations:
- https://yoursite.com/sitemap.xml
- https://yoursite.com/sitemap_index.xml
- https://yoursite.com/sitemap/sitemap.xml
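If you’re not sure where your sitemap lives, a small script can probe the common locations above. This is a sketch assuming the `requests` library; the helper name is illustrative:

```python
import requests  # assumed third-party dependency

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]


def find_sitemap(base_url):
    """Return the first common sitemap location that responds with HTTP 200."""
    for path in COMMON_PATHS:
        url = base_url.rstrip("/") + path
        try:
            if requests.head(url, timeout=10, allow_redirects=True).status_code == 200:
                return url
        except requests.RequestException:
            continue
    return None
```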
Nested Sitemaps
The platform automatically handles sitemap index files - sitemaps that reference other sitemaps. When you import a sitemap index, it will:
- Detect that it’s an index file
- Fetch each nested sitemap automatically
- Combine all URLs into a single list
- Support up to 3 levels of nesting
If your sitemap has more than 3 levels of nesting, some deeper sitemaps may be skipped. This limit helps prevent excessively long import times.
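Conceptually, resolving a sitemap index looks like the recursive sketch below. This is an illustration rather than the platform’s implementation, and the exact way nesting levels are counted here is an assumption:

```python
import xml.etree.ElementTree as ET

import requests  # assumed third-party dependency

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, level=0, max_levels=3):
    """Fetch a sitemap; if it is an index, recurse into nested sitemaps up to max_levels deep."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    urls = []
    if root.tag.endswith("sitemapindex"):
        if level >= max_levels:
            return urls  # deeper sitemaps are skipped
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(loc.text.strip(), level + 1, max_levels))
    else:
        for loc in root.findall("sm:url/sm:loc", NS):
            urls.append(loc.text.strip())
    return urls
```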
Manual URL Entry
When you know exactly which pages you want to import, manual entry is the fastest option.
How to Use
- Select Manual as your discovery method
- Paste your URLs into the text area (one per line)
- Click Add URLs
- Review and save
Supported Formats
The manual entry field accepts:
- Plain URLs (one per line)
- URLs with or without the https:// prefix
- Pasted HTML content (URLs will be automatically extracted)
Extracting URLs from HTML
If you copy HTML content (like from a webpage source), the platform will automatically extract all valid URLs from anchor tags and plain text.
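One simple way this kind of extraction can work is a URL regex over the pasted text, which catches both href values and bare links. This is an illustrative sketch, not the exact parser the platform uses:

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def extract_urls(pasted_text):
    """Pull absolute URLs out of pasted HTML or plain text, keeping first-seen order."""
    return list(dict.fromkeys(URL_PATTERN.findall(pasted_text)))


extract_urls('<a href="https://example.com/docs">Docs</a> see https://example.com/blog')
# ['https://example.com/docs', 'https://example.com/blog']
```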
Managing Pending Sources
After discovering URLs using any method, they appear in the Pending Sources list where you can review and manage them before saving.
Filtering Pending Sources
| Filter | Purpose |
|---|---|
| Search | Find URLs containing specific text |
| Exclude | Remove URLs matching patterns (e.g., /admin, .pdf) |
| Type | Filter by discovery method (Quick Scan, Deep Scan, Sitemap, Manual) |
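The Exclude filter behaves, at least conceptually, like a substring match against each pending URL; the exact matching rules and the example URLs below are assumptions:

```python
pending = [
    "https://example.com/docs/getting-started",
    "https://example.com/admin/settings",
    "https://example.com/guide.pdf",
]


def apply_exclude_filter(urls, patterns):
    """Drop URLs that contain any exclude pattern as a substring."""
    return [url for url in urls if not any(pattern in url for pattern in patterns)]


print(apply_exclude_filter(pending, ["/admin", ".pdf"]))
# ['https://example.com/docs/getting-started']
```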
Duplicate Detection
The platform automatically detects duplicates:
| Status | Meaning |
|---|---|
| NEW | URL not in your knowledge base |
| Duplicate (in agent) | URL already exists in your agent’s sources |
| Duplicate (in pending) | Same URL already in your pending list |
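The statuses correspond to a straightforward membership check against your existing sources and the current pending list. A minimal sketch (illustrative; the real check may also normalize URLs before comparing):

```python
def classify(url, agent_sources, pending_urls):
    """Illustrative classification mirroring the duplicate statuses above."""
    if url in agent_sources:
        return "Duplicate (in agent)"
    if url in pending_urls:
        return "Duplicate (in pending)"
    return "NEW"
```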
Saving Sources
Once you’ve reviewed your pending URLs:
- Use filters to exclude unwanted pages
- Click Save to Agent to add them to your knowledge base
- Sources will begin processing automatically
Best Practices
Start focused, then expand
Begin with your most important pages (product pages, key documentation, FAQs). Test your agent, then add more content as needed.
Use sitemaps when available
Sitemaps are maintained by your website and provide the most accurate list of pages. They’re also faster than crawling.
Use exclusion filters liberally
Exclude admin pages, login pages, and irrelevant sections. Use patterns like /admin, /login, and /cart in the exclude filter.
Be patient with large sites
Deep scans of large websites can take several minutes. The progress indicator shows real-time status.
Re-import when content changes
When you update your website content, re-import the affected pages to keep your agent’s knowledge current.
Common Issues
Crawl Times Out
If your crawl times out:
- Reduce the Max Depth setting
- Increase the Wait Time between requests
- Set a lower URL Limit
- Use Subpath Restriction to focus on specific sections
Sitemap Won’t Load
If sitemap import fails:
- Verify the sitemap URL is accessible in your browser
- Check that the sitemap is valid XML
- Ensure your server isn’t blocking automated requests
- Try the direct sitemap URL (not the robots.txt reference)
Missing Pages
If expected pages aren’t discovered:
- Check if pages are linked from your starting URL
- Increase the Max Depth setting
- Verify pages aren’t blocked by robots.txt
- Try using Manual Entry for specific pages