Selenium Webscraper
2023.
Do you need a web scraping solution? Email me: inquire@automatedinnovations.com
Technologies used: Python, Selenium, Docker, SQLite, PostgreSQL, Google Cloud Run, Google Cloud SQL
For a scraper, there are two main considerations:
1 - Making the scraper.
2 - Deploying the scraper.
1- Making the Scraper: How to get the data?
One choice here is to use a library like requests to fetch the page, then make a soup with your favorite parsing library (like BeautifulSoup) and pull out the information you want. This simple approach has pros and cons. If the data is loaded dynamically by JavaScript (a React app, for example) after certain elements are interacted with, it may fail. In that case you'll want to either use Selenium to simulate a browser, or open the Network tab in the browser's developer tools, find the URLs the data is loaded from, and request that data directly.
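A minimal sketch of the static-page approach looks something like this; the URL and CSS selector are placeholders, not the site this project scrapes:

```python
# Fetch the HTML with requests and parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/schedule", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Pull the text out of each table row (placeholder selector).
for row in soup.select("table.schedule tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)
```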
Here I needed to load dynamic data (arrival and departure times for certain vessels), so I went with Selenium: it doesn't require digging through the Network tab, and I felt the data on the rendered page was unlikely to change and would in fact be easier to parse than trying to keep track of vague or even randomized data URLs. A rough sketch of this approach is below.
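The sketch assumes a headless Chrome session and uses explicit waits so the JavaScript-rendered content has time to appear; the URL and selectors are illustrative only:

```python
# Drive a real browser so JavaScript can render the dynamic content,
# then read the rendered data from the DOM.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/vessel-schedule")  # placeholder URL
    # Wait for the dynamically loaded table instead of parsing the raw HTML.
    rows = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.arrivals tr"))
    )
    for row in rows:
        print(row.text)
finally:
    driver.quit()
```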
Choosing Selenium does create some challenges for deployment on GCP, though.
2- Deploying the Scraper: Make it run in the cloud.
I wanted the cloud deployment to be relatively modular, pay-per-use, and support Selenium well. Unfortunately, Google Cloud Run meets the first two requirements but not the last: hosting the standalone Selenium Docker image won't work behind Cloud Run's authentication mechanism, because the Selenium Grid server does not support authenticated requests.
To work around this, I modified Selenium's connection class in my Python client code to attach a Google authentication token to the requests it sends to the Selenium server running on Cloud Run. I then packaged this Python client code, which drives the Selenium server, into a separate Docker image that can be scheduled on Cloud Run as a job.
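A minimal sketch of that workaround, assuming a Selenium 4 client and the google-auth library. The service URL and class name are placeholders, and the exact override point (`get_remote_connection_headers`) may differ between Selenium versions; fetching a fresh ID token on every request is also simplistic and could be cached in a real deployment:

```python
# Attach a Google-signed ID token to every Selenium request so a standalone
# Selenium server behind Cloud Run's IAM authentication will accept it.
import google.auth.transport.requests
import google.oauth2.id_token
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.remote.remote_connection import RemoteConnection

# Placeholder Cloud Run service URL; the ID token audience must match it.
AUDIENCE = "https://selenium-grid-xxxxx-uc.a.run.app"


class AuthorizedRemoteConnection(RemoteConnection):
    """RemoteConnection that adds a Google ID token to each request's headers."""

    @classmethod
    def get_remote_connection_headers(cls, parsed_url, keep_alive=False):
        headers = super().get_remote_connection_headers(parsed_url, keep_alive)
        # Requires credentials with permission to invoke the Cloud Run service,
        # e.g. the Cloud Run job's service account or GOOGLE_APPLICATION_CREDENTIALS.
        token = google.oauth2.id_token.fetch_id_token(
            google.auth.transport.requests.Request(), AUDIENCE
        )
        headers["Authorization"] = f"Bearer {token}"
        return headers


options = Options()
driver = webdriver.Remote(
    command_executor=AuthorizedRemoteConnection(AUDIENCE, keep_alive=True),
    options=options,
)
```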