Creating a bot that scrapes website data using Python and Selenium

I have been meaning for a long time to start playing around with Python, and I also wanted to try machine learning to predict lottery numbers. Even though the machine learning part is out of scope this time, I found that Python and Selenium were a perfect fit for building a database of historical lottery numbers. The code is freely available in a GitHub repository.

Requirements

  • Google Chrome browser
  • ChromeDriver, placed in a directory that is on the PATH.
  • Selenium, installed with pip install selenium.

Good to have

I have only recently started looking into Python, but as I understand it, packages installed with pip end up globally available rather than scoped to each project. Python's virtual environments solve this by giving each project its own set of packages, so I set one up for this project.
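
One quick way to confirm that a virtual environment is active before running pip install is to compare the interpreter prefixes. This is only a small sketch, and the function name in_virtualenv is my own choice.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```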

What should I do? What is the plan? And the goal, what is it?

First of all, I want to get to know Python and Selenium by making a simple bot. The plan is to use the "Norsk Tipping" website to scrape the winning numbers for the week.

From the screenshot, it is clear that each number is in its own HTML element, so the plan is to use Selenium to scrape these numbers. In the beginning the numbers will just be written to the terminal; later this may be extended to write them to a file and, in the end, maybe to a database. The goal of the project is to get to know Python and Selenium a lot better by getting my hands dirty and solving a problem.

Setup Selenium

One of the requirements was to download ChromeDriver, and I copied the file to a directory that is on the PATH, /usr/local/bin. ChromeDriver does not need to be in a directory on the PATH, but if it is placed elsewhere, its location needs to be specified when the Selenium driver instance is created.

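from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Only needed when ChromeDriver is not in a directory on the PATH.
EXE_PATH = r'/Users/username/Downloads/ChromeDriver'
# Selenium 4 takes the location via a Service object instead of executable_path on Chrome().
driver = webdriver.Chrome(service=Service(executable_path=EXE_PATH))
driver.get('https://www.norsk-tipping.no/lotteri/eurojackpot/resultater')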

Declaring the EXE_PATH variable and passing it to the driver as executable_path is not needed if the executable is moved to a directory on the PATH and is marked as executable (chmod +x ChromeDriver).
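
A small check like the following can tell whether the driver is already reachable through the PATH. It is only a sketch: the binary name chromedriver and the function name find_chromedriver are assumptions, so adjust them to match your download.

```python
import shutil
from typing import Optional


def find_chromedriver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver if it is on the PATH, else None."""
    # shutil.which only returns files that exist and are marked executable,
    # so it covers both conditions described above.
    return shutil.which(name)


driver_path = find_chromedriver()
if driver_path is None:
    print("driver not found on the PATH; pass its location explicitly")
else:
    print("using driver at", driver_path)
```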

Let's scrape the lottery numbers

To locate an HTML element, I found the easiest way was to open the page in Chrome and inspect the element I was interested in. In the developer tools, you can right-click the element and choose Copy XPath. That XPath is passed to Selenium, which uses it to locate the element.

from selenium import webdriver
from selenium.webdriver.common.by import By

bot = webdriver.Chrome()
bot.get('https://www.norsk-tipping.no/lotteri/eurojackpot/resultater')
# XPath positions are 1-based: li[1] is the first number, li[5] the fifth.
number_1 = bot.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[2]/div/div[4]/li[1]/div').text
number_2 = bot.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[2]/div/div[4]/li[2]/div').text
number_3 = bot.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[2]/div/div[4]/li[3]/div').text
number_4 = bot.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[2]/div/div[4]/li[4]/div').text
number_5 = bot.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[2]/div/div[4]/li[5]/div').text
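
To get a feel for how an XPath with a position index picks out one element, it can help to try it without a browser. The sketch below uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath and stands in for Selenium here. The markup is made up and much simpler than the real page; only the id is borrowed from it.

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration; the real page is nested more deeply.
SAMPLE = """
<html><body>
  <div id="LotteryGameBoard">
    <ul>
      <li><div>7</div></li>
      <li><div>19</div></li>
      <li><div>23</div></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

numbers = []
for index in range(1, 4):  # XPath positions start at 1, not 0
    node = root.find(f".//*[@id='LotteryGameBoard']/ul/li[{index}]/div")
    numbers.append(node.text)

print(numbers)  # ['7', '19', '23']
```

Building the path in a loop like this also avoids repeating the same lookup line once per number.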

Let's scrape the date

I also wanted to get the date from the page. The date is in a format that is easy for humans to read but not for machines, for example Fredag 29. mai (Friday, May 29). It is scraped from the website in the same way as the numbers, then parsed and converted into a datetime. I add the time 19:00 because I know the numbers are made available at that time on the draw date. Since I also attach a time zone, I needed the pytz library, which I installed using pip install pytz.

from datetime import datetime

from pytz import timezone
from selenium.webdriver.common.by import By

# Norwegian month names mapped to month numbers.
months = { 'januar': 1, 'februar': 2, 'mars': 3, 'april': 4, 'mai': 5, 'juni': 6, 'juli': 7, 'august': 8, 'september': 9, 'oktober': 10, 'november': 11, 'desember': 12 }

# bot is the webdriver.Chrome() instance created earlier.
now = datetime.now()
date_str = bot.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[1]/div/div').text
# date_str looks like 'Fredag 29. mai': weekday, day with a trailing dot, month name.
day = int(date_str.split(' ')[1][:-1])
month = months[date_str.split(' ')[2]]

# The scraped text has no year, so the current year is used.
dt = datetime(now.year, month, day, 19, 0, 0)
oslo = timezone('Europe/Oslo')
print(oslo.localize(dt))
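
The parsing step can also be pulled out into a small function that does not need a browser, which makes it easy to try on its own. This is a sketch under a few assumptions: the name parse_draw_date is my own, the year is passed in explicitly instead of being taken from the clock, and 2020 is just an example value. The result is a naive datetime, so the time zone would still be attached afterwards with oslo.localize as shown above.

```python
from datetime import datetime

# Norwegian month names as they appear in the scraped text.
MONTHS = {
    'januar': 1, 'februar': 2, 'mars': 3, 'april': 4,
    'mai': 5, 'juni': 6, 'juli': 7, 'august': 8,
    'september': 9, 'oktober': 10, 'november': 11, 'desember': 12,
}


def parse_draw_date(date_str: str, year: int) -> datetime:
    """Turn text such as 'Fredag 29. mai' into a datetime at 19:00."""
    _weekday, day_part, month_name = date_str.split()
    day = int(day_part.rstrip('.'))      # '29.' -> 29
    month = MONTHS[month_name.lower()]   # 'mai' -> 5
    return datetime(year, month, day, 19, 0)


print(parse_draw_date('Fredag 29. mai', 2020))  # 2020-05-29 19:00:00
```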

Automating clicking the previous button

To perform a click on a button, I had to add a new import statement. After that, it works the same as before: select the HTML element using the XPath copied from Chrome.

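from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# bot is the webdriver.Chrome() instance created earlier.
previousBtn = bot.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[1]/div/button[1]')

# Queue a click on the button and run it in the browser.
ActionChains(bot).click(previousBtn).perform()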

Then I can create an ActionChains object from the browser instance and call its click method with the HTML element. This automates clicking the previous button.

When testing the website, I could see that the button looks disabled when it is not possible to go further back in time. The next step is to stop the automation at that point. The button is not actually disabled: the site sets opacity: 0.3 to give it a disabled look, while the normal value is opacity: 1. I found that the easiest way to stop when there was no more historical data was a simple if comparison: if the date is the same as in the previous iteration, the click had no effect and the loop ends.

previously_lottery_date = ''

while True:
    lottery_date = get_date()
    # If the date did not change, the last click had no effect: no older draws are left.
    if lottery_date == previously_lottery_date:
        break
    lottery_numbers = get_numbers()
    lottery_extra_numbers = get_extra_numbers()
    get_previous_lottery()
    previously_lottery_date = lottery_date
    print(lottery_date)
    print(lottery_numbers)
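
The stop condition can be tried out without the website by swapping the browser for a small stand-in. In the sketch below, collect_history and FakePage are hypothetical names, and the draws are made-up values. The fake previous button does nothing once the oldest draw is shown, which is how I assume the real one behaves.

```python
from typing import Callable, List, Tuple


def collect_history(read_date: Callable[[], str],
                    read_numbers: Callable[[], List[str]],
                    go_previous: Callable[[], None]) -> List[Tuple[str, List[str]]]:
    """Walk backwards until the date stops changing, collecting each draw."""
    results = []
    previous_date = None
    while True:
        current_date = read_date()
        if current_date == previous_date:
            break  # the click had no effect, so there is no older draw
        results.append((current_date, read_numbers()))
        go_previous()
        previous_date = current_date
    return results


class FakePage:
    """Stand-in for the website: three made-up draws, newest first."""

    def __init__(self):
        self.draws = [('day 3', ['1', '2']), ('day 2', ['3', '4']), ('day 1', ['5', '6'])]
        self.position = 0

    def date(self):
        return self.draws[self.position][0]

    def numbers(self):
        return self.draws[self.position][1]

    def previous(self):
        # Stay on the oldest draw when there is nothing further back.
        self.position = min(self.position + 1, len(self.draws) - 1)


page = FakePage()
history = collect_history(page.date, page.numbers, page.previous)
print(history)
```

Passing the three actions in as functions keeps the loop itself independent of Selenium, so the same function could later be given the real scraping methods.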

Complete example

The complete example below fetches the latest lottery numbers and prints them to the console, then goes back in time for as long as possible and prints those numbers as well.

from datetime import datetime

from pytz import timezone
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By


class EurojackpotBot:
    def __init__(self):
        self.driver = webdriver.Chrome()
        # Norwegian month names mapped to month numbers.
        self.months = { 'januar': 1, 'februar': 2, 'mars': 3, 'april': 4, 'mai': 5, 'juni': 6, 'juli': 7, 'august': 8, 'september': 9, 'oktober': 10, 'november': 11, 'desember': 12 }

    def login(self):
        # Despite the name, this only opens the results page.
        self.driver.get('https://www.norsk-tipping.no/lotteri/eurojackpot/resultater')

    def get_date(self):
        # The scraped text looks like 'Fredag 29. mai'.
        now = datetime.now()
        date_str = self.driver.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[1]/div/div').text
        day = int(date_str.split(' ')[1][:-1])
        month = self.months[date_str.split(' ')[2]]
        # The scraped text has no year, so the current year is used.
        dt = datetime(now.year, month, day, 19, 0, 0)
        oslo = timezone('Europe/Oslo')
        return oslo.localize(dt)

    def get_numbers(self):
        # The five main numbers; XPath positions start at 1.
        lottery_numbers = []
        for index_number in range(1, 6):
            number = self.driver.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[2]/div/div[2]/li[' + str(index_number) + ']/div').text
            lottery_numbers.append(number)
        return lottery_numbers

    def get_extra_numbers(self):
        # The two extra numbers.
        lottery_numbers = []
        for index_number in range(1, 3):
            number = self.driver.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[2]/div/div[4]/li[' + str(index_number) + ']/div').text
            lottery_numbers.append(number)
        return lottery_numbers

    def get_previous_lottery(self):
        previousBtn = self.driver.find_element(By.XPATH, '//*[@id="LotteryGameBoard"]/div/div[1]/div/button[1]')
        ActionChains(self.driver).click(previousBtn).perform()

    def get_all_numbers(self):
        previously_lottery_date = ''

        while True:
            lottery_date = self.get_date()
            # If the date did not change, the last click had no effect: no older draws are left.
            if lottery_date == previously_lottery_date:
                break
            lottery_numbers = self.get_numbers()
            lottery_extra_numbers = self.get_extra_numbers()
            self.get_previous_lottery()
            previously_lottery_date = lottery_date
            print(lottery_date)
            print(lottery_numbers)


bot = EurojackpotBot()
bot.login()
bot.get_all_numbers()
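
One of the next steps mentioned earlier is writing the numbers to a file instead of the terminal. A minimal sketch of how that could look with the built-in csv module is shown here. The function name write_draws, the column names and the sample row are all made up for illustration, and an in-memory buffer is used in place of a real file.

```python
import csv
import io


def write_draws(rows, out_file):
    """Write (date, numbers, extra_numbers) tuples as CSV rows."""
    writer = csv.writer(out_file)
    writer.writerow(['date', 'numbers', 'extra_numbers'])
    for date, numbers, extra_numbers in rows:
        # Join each list with spaces so it stays in a single column.
        writer.writerow([date, ' '.join(numbers), ' '.join(extra_numbers)])


# Made-up values, only to show the shape of the data.
sample = [('2020-05-29 19:00:00+02:00', ['1', '2', '3', '4', '5'], ['6', '7'])]
buffer = io.StringIO()
write_draws(sample, buffer)
print(buffer.getvalue())
```

For a real file, the buffer would be replaced with open('some_name.csv', 'w', newline=''), where the file name is again only a placeholder.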

Conclusion

Getting information off a website using Selenium is really easy and straightforward. Getting things done and being productive in Python does not seem like a problem at all. Python is really easy to get started with, and as long as you have programming experience in other languages, it should not be a problem. I am looking forward to playing around more with Python.

Teis Lindemark
