Basics of web scraping in R with rvest

Web scraping may seem very difficult, but with some basic R knowledge you can easily scrape your first website. In this article I explain how to scrape information from TripAdvisor, in particular information about the best restaurants in New York, including their ratings, type of cuisine and location.

 

Content

  1. Select CSS classes
  2. Web scraping with rvest
  3. Loop through objects and pages
  4. Full code

 

Select CSS classes

The easiest way to scrape data from web pages is to use the rvest library in R. With just a few simple lines of code you can scrape and structure your data, which I will show you in an example. It helps to have some basic knowledge of HTML and CSS, but if you don't, there are tools that can help you out.

With rvest you can scrape data based on the CSS selectors of elements in a web page. Most websites use CSS stylesheets, which define styles such as background colors, font sizes and the positioning of elements. You can find the CSS for each element in your browser's developer tools, which open when you press F12 on your keyboard. A side window will open which looks similar to the screenshot below.

Each element on a web page has its own CSS styles

A new sidebar with the source code of the website appears on the right side. The tab ‘Styles’ shows the styling of the elements on the web page. There is an arrow in the top left corner of the developer sidebar. After clicking the arrow you can hover over the elements on the web page and inspect the class names of each element, which is exactly what we need for our web scraping.

If you are not familiar with CSS, try using SelectorGadget. This tool helps you find the CSS class names of the elements on a web page. It is also available as a Google Chrome extension, which makes it even easier.

For the TripAdvisor case, I want to scrape some specific elements from the web page such as the name of the restaurant, the location, the rating and number of reviews. In the screenshot below I took restaurant Daniels as an example and added the class names of the elements.

Selected CSS classes for web scraping tutorial

In the next steps we need these CSS classes to scrape the data.

 

Web scraping with rvest

Now that we know which elements to scrape from the web page, it is time to open R and get rvest running. First we need to install the rvest package and load it.
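Installing and loading the package could look like the lines below. The install line is commented out because it only has to run once per machine.

```r
# Install rvest once per machine (uncomment if it is not installed yet)
# install.packages("rvest")

# Load the package in every new R session
library(rvest)
```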

The next step is to read the full web page, which can be done with the function read_html() by passing the URL as its argument. You can store the output in a new object; in this example I call it tripadvisor_restaurant. This object now holds all the information we need, so we only need to scrape the right elements from it. We already know the CSS classes of the elements we are interested in, and we can select them with the function html_nodes(). It returns the matching elements together with everything attached to them, for example a reference link, an image source, a mouse-over event or formatting styles. By adding html_text() we get the text inside the element rather than its metadata.
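Here is a minimal sketch of these steps. To keep it runnable without an internet connection, it parses a small made-up HTML string instead of the real restaurant page; on a real run you would pass the restaurant URL to read_html(). The class names (.reviews_count, .cuisine, .locality) and the page content are placeholders of my own, not necessarily the ones TripAdvisor uses.

```r
library(rvest)

# Made-up stand-in for a restaurant page; replace the string with the
# restaurant URL to scrape the real page. Class names are hypothetical.
tripadvisor_restaurant <- read_html('
  <html><body>
    <h1 class="heading_title">Example Restaurant</h1>
    <a class="reviews_count" href="#REVIEWS">123 reviews</a>
    <span class="cuisine">French</span>
    <span class="locality">New York City</span>
  </body></html>')

# html_nodes() returns the matching elements, including tag and attributes
review_nodes <- tripadvisor_restaurant %>% html_nodes(".reviews_count")

# html_text() keeps only the text inside those elements
reviews  <- review_nodes %>% html_text()
cuisine  <- tripadvisor_restaurant %>% html_nodes(".cuisine") %>% html_text()
location <- tripadvisor_restaurant %>% html_nodes(".locality") %>% html_text()
```

Printing review_nodes shows the whole element with its href attribute, while reviews contains only the text.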

In this way we can get the number of reviews the restaurant has received, and we can do the same for the cuisine and location of the restaurant.

Finally, I also want to scrape the rating per service. Getting the rating is slightly different, because the rating is not shown as a number but as an image. The number of colored bubbles is the actual rating. Luckily, this information is stored in the alt text of the image element, which we can read with html_attr().
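The sketch below shows the idea on a made-up HTML fragment. The class name .rating_img and the wording of the alt text are assumptions for illustration, so check the real page in the developer tools before relying on them.

```r
library(rvest)

# Made-up fragment with two image-based ratings (hypothetical class names)
page <- read_html('
  <div class="ratingRow"><span class="label">Food</span>
    <img class="rating_img" alt="4.5 of 5 bubbles"></div>
  <div class="ratingRow"><span class="label">Service</span>
    <img class="rating_img" alt="4 of 5 bubbles"></div>')

# html_attr() reads an attribute instead of the text of the element
rating_alt <- page %>% html_nodes(".rating_img") %>% html_attr("alt")

# Keep only the leading number of each alt text and convert it to numeric
rating <- as.numeric(sub(" of.*$", "", rating_alt))
```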

I have done the same thing for the name of the restaurant and the overall rating. I want to combine all the scraped data into a data frame with one row.

Now we have one record for the restaurant in our data frame. For readability, I wrap all previous steps in the function getRestaurant(tripadvisor_restaurant).
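A possible shape for such a function is sketched below. The helper first_text() and all class names are hypothetical; the function takes a page that was already parsed with read_html() and returns a data frame with one row.

```r
library(rvest)

# Text of the first element matching `css`; NA when nothing matches
first_text <- function(page, css) {
  page %>% html_node(css) %>% html_text(trim = TRUE)
}

# Collect all fields of one parsed restaurant page in a one-row data frame.
# All class names are hypothetical placeholders.
getRestaurant <- function(tripadvisor_restaurant) {
  data.frame(
    name     = first_text(tripadvisor_restaurant, ".heading_title"),
    reviews  = first_text(tripadvisor_restaurant, ".reviews_count"),
    cuisine  = first_text(tripadvisor_restaurant, ".cuisine"),
    location = first_text(tripadvisor_restaurant, ".locality"),
    rating   = tripadvisor_restaurant %>%
      html_node(".rating_img") %>%
      html_attr("alt"),
    stringsAsFactors = FALSE
  )
}

# Made-up page to try the function on
example_page <- read_html('
  <h1 class="heading_title">Example Restaurant</h1>
  <a class="reviews_count">123 reviews</a>
  <span class="cuisine">French</span>
  <span class="locality">New York City</span>
  <img class="rating_img" alt="4.5 of 5 bubbles">')

record_restaurant <- getRestaurant(example_page)
```

Using html_node() (singular) means a missing element gives NA instead of an error, so every restaurant still produces exactly one row.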

 

Loop through objects and pages

It’s nice that we have scraped the information of one restaurant without spending much time on coding. But what if we want to collect information for hundreds of restaurants? Rerunning the script while changing the URL by hand would not be very efficient. Luckily, the rvest package has some functions which allow you to navigate through pages.

By default, TripAdvisor shows 30 restaurants on one page. First, I will explain how to loop through these 30 restaurants on a single page. After that, we will loop through pages each containing 30 restaurants.

List of restaurants on one page

On TripAdvisor, I want to loop through the first 30 restaurants in New York. If you scroll down the web page, you find all the restaurants including a clickable name which redirects you to the TripAdvisor page of the restaurant. The name of the CSS class is .property_title and the attribute href contains the link to the restaurant page on TripAdvisor.
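Collecting the links could look like the sketch below. The listing is a made-up fragment with two restaurants, and the assumption that the href values are relative paths which need the site's base URL in front is mine.

```r
library(rvest)

# Made-up stand-in for the listing page; on a real run, pass its URL instead
listing <- read_html('
  <div><a class="property_title" href="/Restaurant_Review-example-1.html">First Example</a></div>
  <div><a class="property_title" href="/Restaurant_Review-example-2.html">Second Example</a></div>')

# The href attribute of every .property_title element holds the link
restaurant_paths <- listing %>%
  html_nodes(".property_title") %>%
  html_attr("href")

# Turn the relative paths into full URLs
restaurant_urls <- paste0("https://www.tripadvisor.com", restaurant_paths)
```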

Now you can simply loop through the 30 URLs to get the information per restaurant and combine this into one data frame.
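One way to write this loop is sketched below, using lapply() and do.call(rbind, ...) rather than an explicit for loop. The getRestaurant() here is a simplified stand-in that only extracts the name, and the two HTML strings take the place of the restaurant URLs so the example runs offline; read_html() accepts either.

```r
library(rvest)

# Simplified stand-in for getRestaurant(): only the name, hypothetical class
getRestaurant <- function(page) {
  data.frame(
    name = page %>% html_node(".heading_title") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# On a real run this would be the vector of restaurant URLs
restaurant_pages <- c(
  '<h1 class="heading_title">First Example</h1>',
  '<h1 class="heading_title">Second Example</h1>'
)

# Scrape every page, giving a list of one-row data frames
records <- lapply(restaurant_pages, function(p) getRestaurant(read_html(p)))

# Stack the rows into one data frame
restaurants <- do.call(rbind, records)
```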

That wasn’t too difficult! Finally, I want to have the data of 300 restaurants in one data frame, which means going through 10 pages of 30 restaurants each. The CSS class .nav.next contains a link to the next page with another 30 restaurants. The function jump_to() follows the link given as its argument. In order to use jump_to() we first need to set up a session with html_session(), which takes the URL as its argument just like read_html().
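The navigation could be sketched as follows. Note that in rvest 1.0 and later, html_session() and jump_to() are deprecated in favor of session() and session_jump_to(), which is what this sketch uses. The function names get_next_url() and scrape_pages() are my own, and scrape_pages() needs an internet connection, so it is only defined here and not run.

```r
library(rvest)

# Full URL behind the "next" button of a parsed listing page, or NA if absent
get_next_url <- function(page, base = "https://www.tripadvisor.com") {
  href <- page %>% html_node(".nav.next") %>% html_attr("href")
  if (is.na(href)) NA_character_ else paste0(base, href)
}

# Visit up to n_pages listing pages and apply scrape_page() to each one.
# scrape_page() is any function that turns a parsed page into a data frame.
scrape_pages <- function(start_url, n_pages, scrape_page) {
  s <- session(start_url)
  results <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    page <- read_html(s)
    results[[i]] <- scrape_page(page)
    if (i < n_pages) {
      next_url <- get_next_url(page)
      if (is.na(next_url)) break   # no further pages available
      s <- session_jump_to(s, next_url)
    }
  }
  do.call(rbind, results)
}
```

With 30 restaurants per page, a call with n_pages = 10 would cover the 300 restaurants mentioned above.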

Now we have a data frame with 300 restaurants, including their names, locations, cuisines and ratings!

 

Full code

 


Who I am


Hi! My name is Claudia, a freelance data analyst/scientist. This is my space on the internet where I share knowledge and experience with everyone who wants to become a better analyst. Read more about my work as a freelancer here.


1 Comment

  1. Francesco Zany

    November 21, 2017 at 21:14

    hi Claudia, thank you very much for this article. I am not a coder but I am trying to understand how to scrape tripadvisor reviews with Rvest. I need data about the content of a review, the rating etc. I tried to imitate your code but it does not work… how can I do?

    library(rvest)
    library(plyr)

    tripadvisor_home <- html_session("https://www.tripadvisor.ie/ShowUserReviews-g187870-d7623031-r541994961-Hotel_Ai_Cavalieri_di_Venezia-Venice_Veneto.html")

    getRestaurant <- function(tripadvisor_restaurant){
    reviews <- tripadvisor_restaurant %>%
    read_html() %>%
    html_nodes("#REVIEWS .innerBubble")

    id <- reviews %>%
    html_node(".quote a") %>%
    html_attr("id")

    quote <- reviews %>%
    html_node(".quote span") %>%
    html_text()

    rating <- reviews %>%
    html_node(".rating span") %>%
    html_attr("class")

    mobile <- reviews %>%
    html_node(".rating a") %>%
    html_attr("class")

    date <- reviews %>%
    html_node(".rating .ratingDate") %>%
    html_attr("title")

    review <- reviews %>%
    html_node(".entry .partial_entry") %>%
    html_text()

    stayed <- reviews %>%
    html_node(".recommend span") %>%
    html_text()

    record_restaurant <- data.frame(id = id,
    quote = quote,
    mobile = mobile,
    date=date,
    review=review,
    rating = rating
    stringsAsFactors = FALSE
    )
    record_restaurant
    }

    get30RestaurantsXpages <- function(tripadvisor_home, X){
    data <- data.frame()
    i = 1
    for(i in 1:X){
    if(i != 1){ # Go to next page but don't skip the first page
    next_URL <- tripadvisor_home %>%
    html_nodes("nav.next.taLnk") %>%
    html_attr("href")
    tripadvisor_home <- jump_to(tripadvisor_home, paste0("https://www.tripadvisor.com", next_URL))
    }
    dfRestaurants <- getRestaurant(tripadvisor_home)
    data <- rbind.fill(data, dfRestaurants)
    print(paste0("Page ", i))
    }
    data
    }

    restaurants <- get30RestaurantsXpages(tripadvisor_home, 2)

