Result: Sensitive Data Exposure & Web Scraping with Python

Title:
Sensitive Data Exposure & Web Scraping with Python
Contributors:
Teresa Macklin, Michael Rogers, Ali Ahmadinia
Publisher Information:
California State University, San Marcos
Science, Technology, Engineering & Math
Computer Science & Information Systems
Publication Year:
2022
Subject Terms:
Document Type:
Dissertation/Thesis
Language:
English
Accession Number:
edsbas.E6BD70DB
Database:
BASE

Further Information

We live in a digital era in which almost every aspect of our lives generates or uses data. This data captures who we are, what we do, where we go, and more. Organizations around the globe collect, store, purchase, and sell it. Companies use this data for everything from marketing to product design [4], while governments use it to identify and track us [21]. For these reasons and more, our data has value and is inherently sensitive. Despite recent regulatory attempts to better identify and classify our data, the digital frontier remains a wild place. Compliance regulations vary from organization to organization, and laws struggle to extend beyond borders. All the while, malicious actors work to collect our sensitive data and use it for nefarious purposes. There have been several significant steps toward better security of our data, including the General Data Protection Regulation (GDPR) and initiatives like bug bounty programs. However, even with these improvements, our sensitive data remains a challenge to manage and is inevitably exposed. The focus of this semester-in-residence project is to acknowledge the state of data security while also developing a potential fix for accidental (or intentional) exposures of sensitive data on the Internet. Using the Python programming language, I've written an easy-to-use program called WebDataScraper. The program takes a uniform resource locator (URL) as input and scrapes the target URL for potentially sensitive data. It offers several scraping options: scraping a single URL, scraping the complete directory of a URL, or scraping an entire fully qualified domain name (FQDN). The program is offered for free on the open-source software platform GitHub. There, the source code is available, and cybersecurity professionals can review the code or "fork" the project to make their own custom spin-off version.
Using this program, a wide variety of individuals can better audit websites for ...
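
The three scraping options named in the abstract (single URL, directory, entire FQDN) can be pictured as a scope rule that decides which links a crawler may follow, paired with a pattern scan over each page's text. The sketch below is illustrative only and is not taken from the WebDataScraper source: the function names in_scope and find_sensitive, the mode strings, and the two regular expressions are all assumptions, since the abstract does not say which data types the program looks for or how it is structured.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns: the abstract does not list what the real tool detects.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def in_scope(start_url: str, candidate_url: str, mode: str) -> bool:
    """Return True if candidate_url falls within the scrape begun at start_url.

    mode is one of "single", "directory", or "domain" (assumed names).
    """
    if mode not in ("single", "directory", "domain"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "single":
        return candidate_url == start_url

    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    # Both remaining modes require the same host name.
    if (start.hostname or "").lower() != (cand.hostname or "").lower():
        return False
    if mode == "domain":
        return True

    # "directory": keep only paths under the directory of the start URL,
    # e.g. /docs/index.html -> /docs/
    base = start.path.rpartition("/")[0] + "/"
    return (cand.path or "/").startswith(base)


def find_sensitive(text: str) -> dict[str, list[str]]:
    """Return every pattern name that matched, mapped to its matches."""
    hits = {name: rx.findall(text) for name, rx in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


if __name__ == "__main__":
    start = "https://example.com/docs/index.html"
    print(in_scope(start, "https://example.com/docs/a/b.html", "directory"))
    print(find_sensitive("contact: alice@example.com, id 123-45-6789"))
```

A real crawler would also need to fetch pages, extract links, honor robots.txt, and limit request rates; those parts are left out so the scope and matching logic stay easy to read.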