How I scrape job data from indeed.com with Python

My sister is looking for a part-time job while she studies at uni. A few weeks ago she asked me whether there was a way to automate the job search, because she didn’t want to waste her invaluable student time wading through an endless list of job posts that, in the end, wouldn’t match what she’s looking for.

Since I work primarily on Python projects and automate testing with Python, I gave web scraping a try and managed to build a bot that crawls indeed.com, a popular jobs site. The scraper has three main functions: 1. To search for jobs according to query parameters given by the user; 2. To scrape the important data from the jobs found; and 3. To filter out inaccurate results by matching terms (also provided by the user).

The bot is made up of several Python scripts and config files but, since I want to focus this article exclusively on data scraping, I will only briefly describe how the bot performs the search (without going deep into details) and then go through the process of data extraction. For a full understanding of the code you can view the whole project itself.

Disclaimer

Web scraping is a bit of a grey area in legal terms. Many sites are not perfectly happy about having their data scraped. Some sites have a section dedicated to web scraping where they specify the restrictions that might apply in this regard.

Disclaimer 2

The code shown in this article is not exactly the same as in the actual project stored on GitHub. There’s a bit more logic involved in the development of this scraper but, for the sake of clarity, I adapted it to talk exclusively about web data extraction. That being said, let’s jump right into it!

A high-level glance at search functionality

To build up the bot search I studied how indeed’s url (and its query params) changes as I searched for jobs and added different filters. Here’s an example:

https://ie.indeed.com/jobs?q=qa+engineer&l=dublin&%2460%2C000&fromage=15&sort=date

This is a search url that looks for QA Engineer jobs in Ireland, specifically in Dublin, that pay 60k a year and were posted at most 15 days ago. Let’s break it down.

  • The “ie” before “.indeed.com/” is the country top-level domain, and it tells indeed that the search is focused on companies located in Ireland. I found that the country top-level domain can also come after the “indeed” chunk of the url, depending on the country searched. For example, for the UK the url starts with https://www.indeed.co.uk/, with the country top-level domain after “indeed”. What I did was to always put the country top-level domain after “indeed” first, make a request (explained in a later section) and catch a possible ConnectionError. If a ConnectionError is raised, I put the country top-level domain before “indeed” instead.
  • “q=” is where you specify the job title. So, if we look for QA Engineer jobs, the parameter is “q=qa+engineer”.
  • “l=” is the specific location of the job. Although we might be looking for jobs in Ireland, the user can use this parameter to further refine their search. In our case it is “l=dublin”.
  • “%2460%2C000” refers to a base salary of 60k. It always starts with “%24” (the URL-encoded “$”) and ends with “%2C000” (the URL-encoded “,000”), so to look for a $60,000 salary we just put 60 between those strings.
  • Lastly, “fromage=15” limits the results to posts that are at most 15 days old.
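The domain fallback described in the first bullet could be sketched roughly as follows. This is not the project’s code: first_reachable is a hypothetical helper, and it takes the fetch function as a parameter (defaulting to the standard library’s urllib.request.urlopen instead of requests) so it can be tried without network access. It catches OSError, which is a base class of both urllib.error.URLError and requests.exceptions.ConnectionError.

```python
from urllib.request import urlopen


def first_reachable(candidates, fetch=urlopen):
    """Return the first url in `candidates` that `fetch` can open.

    `fetch` is any callable that takes a url and raises an OSError
    subclass when the connection fails (hypothetical helper).
    """
    for url in candidates:
        try:
            fetch(url)
        except OSError:
            continue  # could not connect, try the next form of the url
        return url
    raise ValueError('none of the candidate urls could be reached')


# Hypothetical usage for Ireland: suffix form first, then prefix form.
# base_url = first_reachable(['https://www.indeed.ie', 'https://ie.indeed.com'])
```

Note that with urlopen an HTTP error status also counts as a failure here, since urllib.error.HTTPError is an OSError subclass too.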

Those are the most important elements of the search. So without going into details, I wrote code that would prompt the user to provide these parameters and then added logic to concatenate them to form a full search url. As a plus, this info is saved in a config file that will be reused in future queries, so the user doesn’t have to configure the bot again for the same searches. Let’s now jump into the data extraction part. 
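As a minimal sketch of that concatenation step (not the project’s actual code), the parameters can be encoded with the standard library’s urllib.parse.quote_plus. The function name build_search_url and its arguments are made up for illustration, and the sketch assumes the country code goes before “indeed”, as in the Irish example.

```python
from urllib.parse import quote_plus


def build_search_url(country, job_title, location, salary_k, max_age_days):
    """Assemble a search url shaped like the example url in this article."""
    # quote_plus turns '$60,000' into '%2460%2C000' and spaces into '+'
    salary = quote_plus(f'${salary_k},000')
    return (
        f'https://{country}.indeed.com/jobs'
        f'?q={quote_plus(job_title)}'
        f'&l={quote_plus(location)}'
        f'&{salary}'
        f'&fromage={max_age_days}'
        '&sort=date'
    )


print(build_search_url('ie', 'qa engineer', 'dublin', 60, 15))
# https://ie.indeed.com/jobs?q=qa+engineer&l=dublin&%2460%2C000&fromage=15&sort=date
```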

Let’s scrape some data

Once my sister configures the bot, she will have a string representing an indeed url. From then on, I wrote code to make a request to it, like a human would do in a browser, and then extract the HTML code of the page. To do this, I combine the requests, BeautifulSoup, and Selenium libraries.

Accessing the site

To get to the site and extract its HTML we have to create a soup object (from the BeautifulSoup library). The soup object holds the parsed HTML of the page, so we can query it every time we want to target specific data. We first import the libraries and then call requests.get with the url, which returns a response object (called link here). I then use the .text attribute of that response to get the HTML as plain text. Lastly, this text is passed to BeautifulSoup along with 'html.parser' to create the soup object.

import requests
from bs4 import BeautifulSoup

# Create a soup object
link = requests.get(url)
site = BeautifulSoup(link.text, 'html.parser')

The soup object, called site, is used throughout to find data that is available right away in the original HTML. To find it, I inspected the page to study its structure and identify HTML elements in a unique way. Note that finding unique elements is key to ensuring the data we get is accurate. I recommend using Google Chrome, since its developer tools are quite handy for the job. Just open the url, right-click on the page and choose “Inspect”.

Getting the job title

The job title is in <a> elements with an attribute called “data-tn-element” whose value, “jobTitle”, is used only for job titles. Pretty easy. Here’s where the magic of BeautifulSoup comes into play. I used the find_all method of the soup object and targeted every <a> tag with the attribute “data-tn-element”: “jobTitle”.

jobs_a = site.find_all(name='a', attrs={'data-tn-element': 'jobTitle'})

This will yield a list (called jobs_a) with all <a> HTML elements containing the job title. Once we have this list it’s just a matter of iterating over it to extract the actual title. Now, the job title itself could be found in various places. It was available as text, but I found it more reliable to extract it directly from an attribute named “title”.

In this case, I get all attributes through the .attrs attribute of the element, which returns a dictionary (job_attrs) with the attribute names as keys. Lastly, I access the key “title” and store its value in a list.

scraped_job_titles = []
for job in jobs_a:
    job_attrs = job.attrs
    scraped_job_titles.append(job_attrs['title'])

Where will you work?

Let’s scrape the job location. The same principles apply here. The location element is a <div> tag with “class”: “recJobLoc”. The location value can be found in the “data-rc-loc” attribute. I apply the same technique as with the title:

scraped_job_locations = []

loc_div = site.find_all('div', attrs={'class': 'recJobLoc'})

for loc in loc_div:
    loc_attrs = loc.attrs
    scraped_job_locations.append(loc_attrs['data-rc-loc'])

The company name

The company name can be found in a <span> tag with “class”:”company”. In this case, there’s no attribute with the company name so I extract the text of every element containing such information.

scraped_company_names = []

company_span = site.find_all('span', attrs={'class': 'company'})

for span in company_span:
    scraped_company_names.append(span.text.strip())

Note that here I access the element text with the .text attribute and then clean it up with .strip() to remove any spaces at the beginning and at the end of the string.

Counting the bills

Although the salary might be one of the most interesting pieces of data to scrape, job posts don’t often show it. When the salary is shown, it’s located in a <span> tag with “class”: “salaryText”. The problem is that whenever the salary is not shown, the program will crash when trying to pull it. Therefore I targeted a parent element that is always present and added logic to scrape the salary only when the wanted child element is present. The parent element is the job card that holds all the relevant info. It’s a <div> with “class”: “jobsearch-SerpJobCard”.

scraped_salaries = []

jobs_divs = site.find_all('div', attrs={'class': 'jobsearch-SerpJobCard'})

for div in jobs_divs:
    salary_span = div.find('span', attrs={'class': 'salaryText'})
    if salary_span:
        scraped_salaries.append(salary_span.text.strip())  # .text also works when the span has nested tags
    else:
        scraped_salaries.append('Not shown')

Note here I added an if-else statement in which the salary is only extracted if the <span> tag mentioned before exists. Otherwise the salary will be saved as “Not shown”.
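The job-card idea can be taken one step further. Because each list in this article is built with its own find_all call, the lists can drift out of step whenever a card lacks a field. One possible alternative, sketched below, reads every field from the card itself so each job stays in one dictionary. The sample HTML is made up to mimic the class names described in this article, and parse_card and text_or_default are hypothetical helpers, not part of the project.

```python
from bs4 import BeautifulSoup

# Made-up markup that only mimics the class names described in the article.
SAMPLE_HTML = """
<div class="jobsearch-SerpJobCard" data-jk="abc123">
  <a data-tn-element="jobTitle" title="QA Engineer">QA Engineer</a>
  <span class="company"> Acme </span>
  <span class="salaryText"> 60,000 a year </span>
</div>
<div class="jobsearch-SerpJobCard" data-jk="def456">
  <a data-tn-element="jobTitle" title="Test Analyst">Test Analyst</a>
  <span class="company"> Globex </span>
</div>
"""


def text_or_default(card, tag, css_class, default):
    """Return the stripped text of a child element, or `default` if absent."""
    element = card.find(tag, attrs={'class': css_class})
    return element.get_text(strip=True) if element else default


def parse_card(card):
    """Collect the fields of one job card into a single dictionary."""
    title_link = card.find('a', attrs={'data-tn-element': 'jobTitle'})
    return {
        'id': card.get('data-jk'),
        'title': title_link.get('title') if title_link else None,
        'company': text_or_default(card, 'span', 'company', None),
        'salary': text_or_default(card, 'span', 'salaryText', 'Not shown'),
    }


site = BeautifulSoup(SAMPLE_HTML, 'html.parser')
cards = site.find_all('div', attrs={'class': 'jobsearch-SerpJobCard'})
jobs = [parse_card(card) for card in cards]
```

With this shape, a missing salary on the second card cannot shift the salaries of the cards that follow it.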

Is it worth working there?

A pretty useful thing to look at is the average rating shown for the job. I found this information in a <span> element with “class”: “ratingsContent”. As with the salary, many jobs don’t have ratings so, again, I make the soup object navigate to the job card and then write an if-else statement to extract the rating depending on whether it exists for the current job or not.

scraped_ratings = []

jobs_divs = site.find_all('div', attrs={'class': 'jobsearch-SerpJobCard'})

for div in jobs_divs:
    rating_span = div.find('span', attrs={'class': 'ratingsContent'})
    if rating_span:
        # normalise a decimal comma to a dot before converting to float
        scraped_ratings.append(float(rating_span.text.strip().replace(',', '.')))
    else:
        scraped_ratings.append(None)

It’s worth noting this chunk of code: float(rating_span.text.strip().replace(',', '.')). The rating is present as text (a string), but I want to extract it as a float. I first strip the text to remove possible extra spaces and use .replace to replace commas with dots. This is because, depending on the country, the decimals can be separated by either a comma or a dot, so I always normalise the text to use a dot between the whole part and the fractional part. Then I use the float function to convert the string to a float.
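The same conversion can be wrapped in a small helper, sketched here with a hypothetical name, parse_rating. Returning None for text that is not a number is my own assumption and not something the project is known to do.

```python
def parse_rating(raw_text):
    """Convert rating text such as ' 4,3 ' or '4.3' to a float.

    Returns None when the text is not a number (an assumption made
    for this sketch).
    """
    cleaned = raw_text.strip().replace(',', '.')
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_rating(' 4,3 '))  # 4.3
print(parse_rating('n/a'))    # None
```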

OK, but like…where do I upload my resume?

The application url was not straightforward to get at all. It took me several tries to get it right. There are several ways to approach this, but I only found one that has been stable across multiple scenarios. By default, the job info is displayed on the same page after clicking on the job card. But we can also cmd/ctrl + click on the job card to open it in a different tab, which allows us to examine the job url. Here’s an example:

https://ie.indeed.com/viewjob?jk=23b22ada48a31916&q=qa+engineer&l=dublin&tk=1eg6p1nvor9ru800&from=web&vjs=3

See that whole bunch of characters? After experimenting a bit, I found that only the part that comes before the first “&” is important. That leaves us with:

https://ie.indeed.com/viewjob?jk=23b22ada48a31916

In this case, the “jk=” value is the id of the job, which can be found in the job card element itself as an attribute called “data-jk”. The solution here was to save “https://ie.indeed.com/viewjob?jk=” as a string and then concatenate it with the id of the current job.

Since the job id is an attribute of the job card element, I used the same technique as for the job title to extract it.

view_job_url = 'https://ie.indeed.com/viewjob?jk='

scraped_apply_urls = []

jobs_div = site.find_all(name='div', attrs={'class': 'jobsearch-SerpJobCard'})

for div in jobs_div:
    job_id = div.attrs['data-jk']
    apply_url = view_job_url + job_id
    scraped_apply_urls.append(apply_url)

Bear in mind that view_job_url is of little use if we hardcode the country top-level domain. I hardcoded it in this article for illustration purposes, but in the actual project I added further logic to build a country-specific job url.
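One possible way to avoid the hardcoded domain, sketched below, is to derive the job url prefix from the search url the bot already has. This is only an illustration and not necessarily how the project does it; view_job_base is a hypothetical helper built on the standard library’s urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit


def view_job_base(search_url):
    """Reuse the scheme and host of the search url for the job url prefix."""
    parts = urlsplit(search_url)
    # Keep scheme and host, swap the path for /viewjob, drop the query.
    return urlunsplit((parts.scheme, parts.netloc, '/viewjob', '', '')) + '?jk='


print(view_job_base('https://ie.indeed.com/jobs?q=qa+engineer&l=dublin'))
# https://ie.indeed.com/viewjob?jk=
print(view_job_base('https://www.indeed.co.uk/jobs?q=qa+engineer'))
# https://www.indeed.co.uk/viewjob?jk=
```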

How old is the job post?

Each job card has a text indicating when the job was posted, present as a string in a <span> element with “class”: “date”. However, this string can be shown in various ways. Usually there’s a number in it, so I extract the text and parse out any digits present. I used the re (regular expressions) library for that. That number can mean anything, though: hours, days, weeks or months. Therefore I added another if-else block to handle each case.

import re

scraped_days = []

days_spans = site.find_all('span', attrs={'class': 'date'})

for day in days_spans:
    day_string = day.text.strip()

    if re.findall('[0-9]+', day_string):
        parsed_day = re.findall('[0-9]+', day_string)[0]
        if 'hour' in day_string:
            job_posted_since = str(parsed_day) + ' hours ago'
        elif 'day' in day_string:
            job_posted_since = str(parsed_day) + ' days ago'
        elif 'week' in day_string:
            job_posted_since = str(parsed_day) + ' weeks ago'
        elif 'month' in day_string:
            job_posted_since = str(parsed_day) + ' months ago'
        else:
            job_posted_since = str(day_string)
    else:
        job_posted_since = 'today'

    scraped_days.append(job_posted_since)

It can also be the case that none of the words “hour”, “day”, etc. is present (for instance, the string might be in a language other than English), in which case I just keep the string as it is. There are surely ways to translate and normalise it, but I leave it untouched to keep things simple.
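If a numeric age is more useful than a label (for example, to compare it against the “fromage” value), the same parsing can return a number of days. The sketch below is a variation on the code above, not the project’s code: posted_days_ago and UNIT_IN_DAYS are hypothetical names, a month is approximated as 30 days, and unrecognised units yield None.

```python
import re

# Approximate length of each unit in days (a month is taken as 30 days).
UNIT_IN_DAYS = {'hour': 1 / 24, 'day': 1, 'week': 7, 'month': 30}


def posted_days_ago(day_string):
    """Return the approximate age of a post in days.

    Returns 0 when the string has no digits (treated as "today", as in
    the article) and None when the unit is not recognised.
    """
    match = re.search(r'[0-9]+', day_string)
    if match is None:
        return 0
    amount = int(match.group())
    lowered = day_string.lower()
    for unit, days in UNIT_IN_DAYS.items():
        if unit in lowered:
            return amount * days
    return None


print(posted_days_ago('3 days ago'))   # 3
print(posted_days_ago('2 weeks ago'))  # 14
print(posted_days_ago('Just posted'))  # 0
```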

The job description

indeed.com shows a summary in the job card. This information can be easily extracted with BeautifulSoup. However, it didn’t seem descriptive enough to me, so I decided to scrape the full description. To do this, it was necessary to click on the job card so the full description would show up in a pop-up “side card”. And here is where Selenium comes into play. Here are the libraries I imported in my scraper:

from selenium import webdriver
from selenium.common.exceptions import ElementClickInterceptedException, TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

Put simply, I’m importing selenium libraries to create an object which will allow python to interact with the page through the web browser’s driver. Also, I imported extra libraries to handle exceptions and make the process pause if necessary.

To automate browser actions with Selenium you need a browser and its driver to “manipulate” it. In my case, I use Google Chrome and chromedriver. You can download the driver and store it near your Python script. The first step is to create a driver object that we can later use to interact with elements of the page.

driver = webdriver.Chrome('{path where the driver is saved}')

I also created a wait object, since the site often takes time to load, and if we try to interact with elements that haven’t loaded yet the code will crash. Therefore I use a wait object to pause the script until the element I want to interact with is ready. To create the wait object we have to pass the driver object and a number of seconds to wait as arguments (I set the waiting time to 10 seconds).

wait = WebDriverWait(driver, 10)

The next step is to access the page, locate each job card and click on it. I used the .get method to access the url. Then I make the script wait until all elements with “class”: “jobsearch-SerpJobCard” (the job cards) are present. Next, I made a list containing all the job cards, which I later iterate over to click on each of them.

driver.get('https://ie.indeed.com/jobs?q=qa+engineer&l=dublin&%2460%2C000&fromage=15&sort=date')

wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'jobsearch-SerpJobCard')))

jobs = driver.find_elements(By.CLASS_NAME, 'jobsearch-SerpJobCard')  # the find_elements_by_* helpers were removed in newer Selenium releases

Now that all job cards elements are stored in a list, we can iterate over it and click on each of them. To prevent the code from crashing I use the wait object to wait until the element I want to click is clickable. Then I introduced a try-except block to click and handle exceptions if they arise. 

scraped_descriptions = []

for job in jobs:
    wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'jobsearch-SerpJobCard')))
    try:  # click on the job card and add its description to the descriptions list
        job.click()
        wait.until(EC.presence_of_element_located((By.ID, 'vjs-content')))
        scraped_descriptions.append(driver.find_element(By.ID, 'vjs-content').text)
    except ElementClickInterceptedException:
        # if the click is intercepted, scroll away and try again
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        job.click()
        scraped_descriptions.append(driver.find_element(By.ID, 'vjs-content').text)

Let’s break down this code block a little further. For each job, once the job card is clickable, we click on it and wait until the description (i.e. the element with “id”: “vjs-content”) is loaded. Then we extract its text and append it to a list of descriptions. If an ElementClickInterceptedException is raised, I make Selenium scroll away and click again. Elements often can’t be clicked because another element receives the click by accident. Scrolling away is usually enough to handle this issue, and that is achieved by passing window.scrollTo(0, document.body.scrollHeight) to the .execute_script method. Then I extract the description text and add it to the list.
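The “click, and if that fails recover and click once more” pattern can also be written as a small helper that doesn’t depend on Selenium, which makes it easy to try in isolation. This is a sketch only: click_with_retry and its parameters are hypothetical names, and the Selenium call in the trailing comment shows how it might be wired up rather than how the project does it.

```python
def click_with_retry(click, recover, retry_on):
    """Call `click`; if it raises `retry_on`, call `recover` and click once more.

    A second failure is not caught, so it propagates to the caller.
    """
    try:
        click()
    except retry_on:
        recover()
        click()


# Possible wiring with the Selenium objects used in this article:
# click_with_retry(
#     job.click,
#     lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"),
#     ElementClickInterceptedException,
# )
```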

Conclusion

That’s pretty much it for web data extraction! After all this data is gathered, there’s more code to filter the results by matching titles and descriptions against terms given by the user. At the end of the process the bot puts the remaining jobs’ data in a data frame and saves it to a csv file, so the user ends up with search-relevant jobs ready to apply to. If you want to get into the details of how it’s built, you can access my GitHub repository through my portfolio page.
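To give a rough idea of that last step, here is a simplified sketch and not the project’s code. It uses the standard library’s csv module instead of a data frame, assumes a job is kept when any of the user’s terms appears in its title or description, and uses hypothetical names (matches_terms, jobs_to_csv).

```python
import csv
import io


def matches_terms(job, wanted_terms):
    """True if any wanted term appears in the title or description."""
    haystack = f"{job['title']} {job['description']}".lower()
    return any(term.lower() in haystack for term in wanted_terms)


def jobs_to_csv(jobs, fieldnames):
    """Serialise a list of job dictionaries to csv text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()


# Made-up sample data for illustration.
sample_jobs = [
    {'title': 'QA Engineer', 'description': 'Selenium and Python'},
    {'title': 'Chef', 'description': 'Kitchen work'},
]
relevant = [job for job in sample_jobs if matches_terms(job, ['python'])]
print(jobs_to_csv(relevant, ['title', 'description']))
```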

I hope you found the article entertaining, see you in the next post!
