I’ve been building a lot of OCI Generative AI Agents for customer demos recently. One demo that typically resonates well with customers is a RAG agent that uses text scraped from their public website. For example, when working with a council, this can demonstrate how residents can use a Generative AI Agent to quickly get answers to their questions about council services, without the hassle of navigating their maze of a website.
For reference, here’s how an OCI Gen AI Agent works at a high level.

In the real world a Gen AI Agent would use internal data that isn’t publicly accessible. However, I typically don’t have access to customers’ data, so crawling their public website works well to showcase the capabilities of a Gen AI Agent and begin a conversation on real-world use cases that use internal data.
I wrote a very hacky Python script to crawl a site and dump the content to a text file, which can then be ingested into a Gen AI Agent. However, this is super unreliable, as the script is held together with sticking plasters and constantly needs to be updated to work around issues experienced when crawling.
I recently stumbled across a fantastic Python package named Trafilatura, which can reliably and easily scrape a site, enabling me to retire my hacky Python script.
Trafilatura can be installed using the instructions here (basically pip install trafilatura).
Once it had been installed, I was able to scrape my own blog (which you are currently reading) using two commands!
trafilatura --sitemap "https://brendg.co.uk/" --list >> URLs.txt
trafilatura -i URLs.txt -o txtfiles/
The first command grabs the sitemap for https://brendg.co.uk and writes a list of all URLs found to URLs.txt. Note that >> appends to the file, so delete any existing URLs.txt before re-running to avoid duplicate entries.

The second command takes the URLs.txt file as input and, for each URL within, downloads the page and writes the extracted text to a text file within the folder txtfiles.
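If you would rather drive this from Python than from the command line, the two commands above can be roughly approximated with Trafilatura's Python API. This is only a minimal sketch, not a replacement for the CLI: the function names scrape_site and url_to_filename are my own hypothetical helpers, and the file naming scheme is an assumption that will not match the names the CLI generates.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Build a stable, filesystem-safe .txt name for a URL (hypothetical helper)."""
    slug = urlparse(url).path.strip("/").replace("/", "-") or "index"
    # A short hash of the full URL keeps names unique even if slugs collide.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug[:80]}-{digest}.txt"


def scrape_site(homepage: str, out_dir: str = "txtfiles") -> int:
    """Save the main text of every page listed in the site's sitemap.

    Returns the number of pages written to out_dir.
    """
    # Imported here so url_to_filename can be used without trafilatura installed.
    import trafilatura
    from trafilatura.sitemaps import sitemap_search

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    saved = 0
    for url in sitemap_search(homepage):   # rough equivalent of --sitemap ... --list
        html = trafilatura.fetch_url(url)  # None if the download fails
        if html is None:
            continue
        text = trafilatura.extract(html)   # None if no main text is found
        if not text:
            continue
        (target / url_to_filename(url)).write_text(text, encoding="utf-8")
        saved += 1
    return saved
```

Calling scrape_site("https://brendg.co.uk/") should leave one text file per successfully scraped page in txtfiles, skipping any pages that fail to download or yield no text.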

Below is an example of one of the output text files; you can clearly see the text scraped from the blog post.
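To sanity-check the output before ingesting it into an agent, a quick preview of each file can help. The snippet below is a small sketch that uses only the standard library; preview_outputs is a hypothetical helper name, and it assumes the files were written to txtfiles with a .txt extension.

```python
from pathlib import Path


def preview_outputs(folder: str = "txtfiles", chars: int = 200) -> dict[str, str]:
    """Return the first `chars` characters of every .txt file in `folder`."""
    previews = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" avoids crashing on any oddly encoded file.
        text = path.read_text(encoding="utf-8", errors="replace")
        previews[path.name] = text[:chars]
    return previews
```

Printing the returned dictionary gives a quick way to spot empty or boilerplate-only pages.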

Such a useful tool, which will save me a ton of time!
