Details
Automating Data Collection for Various Purposes Using AI-Driven Web Scraping

Year: 2025
Term: Winter
Student Name: Zachary Hollmann
Supervisor: Olga Baysal / David Veldhuizen
Abstract: This thesis presents the design and development of an AI-driven proof-of-concept system that automates the collection, validation, and organization of data. Motivated by the significant manual effort observed in earlier workplace development projects, where building datasets accounted for up to 75% of total development time, this work proposes a scalable and intelligent solution to streamline the data retrieval process. The system integrates Large Language Models (LLMs) to interpret user-created prompts, generate extraction criteria, and guide web crawling activities using tools such as Scrapy and Playwright. Extracted data is subjected to rigorous filtering and relevance validation before being stored in a structured SQLite database with high-quality metadata. The prototype demonstrates substantial gains in efficiency, quality, and scalability compared to traditional manual approaches, while adhering to ethical data collection practices. Results show that the system reduces human workload, enhances the reliability of collected data, and provides a foundation for future research applications.
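
The final stage described in the abstract (relevance validation followed by storage in SQLite with metadata) might look roughly like the sketch below. This is an illustration under stated assumptions, not the thesis implementation: the table schema, the keyword-based `is_relevant` check, and the `store_records` helper are all hypothetical, and a simple keyword match stands in for whatever LLM-generated extraction criteria the actual system uses.

```python
import sqlite3
from datetime import datetime, timezone


def is_relevant(text, keywords, min_hits=1):
    """Hypothetical relevance check: count criteria keywords found in the text."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits >= min_hits


def store_records(conn, records, keywords):
    """Filter scraped records and store the relevant ones with basic metadata.

    The schema is an assumption made for illustration only.
    Returns the number of rows actually inserted.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               id INTEGER PRIMARY KEY,
               url TEXT NOT NULL UNIQUE,
               content TEXT NOT NULL,
               collected_at TEXT NOT NULL
           )"""
    )
    stored = 0
    for rec in records:
        if not is_relevant(rec["content"], keywords):
            continue  # drop records that fail the relevance check
        cur = conn.execute(
            "INSERT OR IGNORE INTO records (url, content, collected_at) "
            "VALUES (?, ?, ?)",
            (rec["url"], rec["content"], datetime.now(timezone.utc).isoformat()),
        )
        stored += cur.rowcount  # 0 when the URL was already stored
    conn.commit()
    return stored


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = [
        {"url": "https://example.com/a", "content": "Notes on web crawling"},
        {"url": "https://example.com/b", "content": "Unrelated page"},
    ]
    print(store_records(conn, sample, ["crawling"]))
```

The `UNIQUE` constraint on `url` combined with `INSERT OR IGNORE` is one simple way to avoid duplicate rows when the same page is crawled more than once; the collection timestamp is a minimal example of the kind of metadata the abstract mentions.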