Python Web Scraping

Creating My Own Dataset using BeautifulSoup (Web Scraping)

Description

The goal of this project is to create my own dataset using web scraping: extracting Boston apartment rental data from the RentHop site and saving it into a CSV file.

I will follow 4 simple steps:

  1. Access Web Page
  2. Locate Specific Information
  3. Retrieve Data
  4. Save Data

Loading Libraries

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

Verify that requests, BeautifulSoup, and html5lib (the parser used below) are already installed. If not, the following commands are required to run the project code.

pip install beautifulsoup4
pip install requests
pip install html5lib

Accessing Web Page

I will use the requests library to access the RentHop site.

r = requests.get('https://www.renthop.com/boston-ma/apartments-for-rent')
r.content
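
As a quick sanity check, it is worth confirming the request succeeded before parsing the response; a 200 status code means the page was returned.

# Checking the HTTP status code of the response (200 = success)
print(r.status_code)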

I am going to use BeautifulSoup to do the HTML parsing. I’ll create a BeautifulSoup object and apply a filter to get the <div> tags that hold the listing information.

# Creating an instance of BeautifulSoup
soup = BeautifulSoup(r.content, "html5lib")

listing_divs = soup.select('div[class*=search-info]')
print(listing_divs)
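
To confirm the filter worked, I can check how many listing <div>s were captured; a single results page should contain around 20 of them.

# Number of listing <div>s found on this page (expected: about 20)
print(len(listing_divs))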

Locate Specific Information

The reason for having all the <div> tags in my list listing_divs is that they contain the listing data of 20 apartments. Now I should pull out the individual data points for each apartment.

This is the information I want to target:

  • URL of the listing
  • Address of the apartment
  • Neighborhood
  • Number of bedrooms
  • Number of bathrooms
# Retrieving data from one record

url = listing_divs[0].select('a[id*=title]')[0]['href']
address = listing_divs[0].select('a[id*=title]')[0].string
neighborhood = listing_divs[0].select('div[id*=hood]')[0].string.replace('\n','')

print(url)
print(address)
print(neighborhood)
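
The price, bedroom, and bathroom details live in a small table inside each listing <div>. As a minimal sketch for this first record, they can be pulled out with the same table[id*=info] selector used later in the retrieval function:

# Retrieving the specs from the first listing's info table
listing_specs = listing_divs[0].select('table[id*=info] tr')
for spec in listing_specs:
    # Each table row's text holds details such as price, bedrooms, and bathrooms
    print(spec.text.strip().replace(' ', '_').split())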

Retrieving Data

To get data from more than 20 records, it is necessary to iterate over several pages. Using the Search option on the site, I got the URL below; it will help navigate between pages through the last parameter, page.

https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page=0

Let’s write some simple code to generate the URLs from page 1 to 4:

url_prefix = "https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page="

# Generating the target URLs for pages 1 through 4
for page_number in range(1, 5):
    target_page = url_prefix + str(page_number)
    print(target_page + '\n')
# Creating a function to retrieve data from a list of <div>s

def retrieve_data(listing_divs):
    listing_list = []
    for current_listing in listing_divs:
        each_listing = []

        # Basic fields: listing URL, address, and neighborhood
        url = current_listing.select('a[id*=title]')[0]['href']
        address = current_listing.select('a[id*=title]')[0].string
        neighborhood = current_listing.select('div[id*=hood]')[0].string.replace('\n', '')

        each_listing.append(url)
        each_listing.append(address)
        each_listing.append(neighborhood)

        # Specs table: price, bedrooms, and bathrooms
        listing_specs = current_listing.select('table[id*=info] tr')

        for spec in listing_specs:
            try:
                each_listing.extend(spec.text.strip().replace(' ', '_').split())
            except Exception:
                each_listing.append(np.nan)
        listing_list.append(each_listing)

    return listing_list
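
A quick way to sanity-check the function is to run it on the single page already downloaded, reusing the listing_divs from the earlier request:

# Trying the function on the listings of the first page
one_page = retrieve_data(listing_divs)
print(len(one_page))   # number of listings parsed from this page
print(one_page[0])     # first parsed record: url, address, neighborhood, specs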
# Looping and getting data from 350 pages (part of the result of searching)

url_prefix = "https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page="

all_pages_parsed = []
pages = 350

for page_number in range(1, pages + 1):
    target_page = url_prefix + str(page_number)

    r = requests.get(target_page)

    # Getting a BeautifulSoup instance to be able to retrieve data
    soup = BeautifulSoup(r.content, "html5lib")

    listing_divs = soup.select('div[class*=search-info]')

    one_page_parsed = retrieve_data(listing_divs)
    all_pages_parsed.extend(one_page_parsed)
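
Once the loop finishes, a quick count confirms how many records were collected in total (roughly 20 listings per page).

# Total number of listing records gathered across all pages
print(len(all_pages_parsed))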

Save Data to a CSV File

df = pd.DataFrame(all_pages_parsed, columns=['url','address','neighborhood','price','rooms','baths','none'])
df.head()

And now, the last step!

# Writing a comma-separated values (CSV) file
df.to_csv('apartments_leasing.csv', index=False)
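
As a final check, the file can be read back to confirm it was written correctly.

# Reading the CSV file back to verify its contents
df_check = pd.read_csv('apartments_leasing.csv')
df_check.head()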

Conclusions:

  • Applying web scraping allows us to create our own datasets for future analysis. This is just one example; there are many other sources on the web
  • To ease the extraction and location of the data we need, it is crucial to understand the HTML structure of the page being scraped
  • BeautifulSoup is a helpful and powerful tool for web scraping; it is easy to learn and has very good documentation that you can check out on this link
  • BeautifulSoup requires an external library to make the request to the website; in this case, I used requests, and that dependency did not represent any disadvantage for this specific project
  • I invite you to review the complete code of this project on my GitHub repository