How to do Web Scraping (Easy Explanation)

Written by Nitish Jha


Hello fellow enthusiasts! If you are eager to learn about machine learning or want to explore the field of data science, it is worth following the complete life-cycle of a data science project. Yes, from beginning to end! So, without wasting any time, let's start with data collection. It is the second step you perform, after defining the problem and deciding what data matters for the task.


Imagine yourself as a data scientist working at XYZ Inc. You have a problem statement in front of you, and after analyzing it you have decided that you need to analyze news articles. The problem is that you don't have the dataset. Now, you can either collect the data through an API, if the data it provides matches your requirements, or collect publicly available data from the internet yourself.

The best solution is a Python script that gets the task done for you automatically. For this, you will create a web scraping program that extracts the desired data.


Web scraping refers to the process of using bots (automated software scripts) to extract data from the world wide web.

Today, we will use the BeautifulSoup Python package for this, along with urllib.request to open the desired URL.


Let’s install the library first:


Open your command prompt and enter the following command to install BeautifulSoup:


pip install beautifulsoup4

Now, open IDLE (or any Python shell) and import the library as follows:


from bs4 import BeautifulSoup

If you don't see any error, it means you have successfully installed the bs4 package.
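If you want to confirm which version was installed, bs4 exposes a __version__ attribute; this quick check is just a convenience, not a required step:

import bs4

# Print the installed BeautifulSoup version as a quick sanity check
print(bs4.__version__)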

For more in-depth details, visit the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


The other package we are going to use, urllib, is part of the Python standard library and comes pre-installed, so we can move on to our code.


Let's import our libraries:

from bs4 import BeautifulSoup       # parses HTML documents
from urllib.request import urlopen  # opens URLs over HTTP

Now, let's look at the news article we want to extract data from. To keep things simple, we will scrape a single news article, which you can find here: https://www.bbc.com/sport/football/46897172


We will store this URL as a string in Python as follows:


URL = "https://www.bbc.com/sport/football/46897172"

After this, we need to open the URL from our code, which we can do as follows.

(We are using a try/except block in case the connection fails; URLError, imported from urllib.error, is what urlopen raises on network problems.)


from urllib.error import URLError

try:
    page = urlopen(URL)
    print("Connection successful")
except URLError as e:
    print("Unable to connect:", e)

Our connection was successful, so we can move on.

If you try to print the page at this point, you will get output similar to this:


print(page)
<http.client.HTTPResponse object at 0x7f4ee147e390>

It is an HTTPResponse object; to read it, we will use BeautifulSoup:

soup = BeautifulSoup(page, "html.parser")

Here, we have passed our HTTP response object along with "html.parser", because the response is written in HTML. You can see the whole HTML of the page by printing the 'soup' variable.
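For a more readable dump, BeautifulSoup's prettify() method prints the parsed tree with one tag per line, indented by nesting depth:

# Show the parsed HTML indented by nesting depth
print(soup.prettify())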

You will notice that we have extracted the whole page, and it contains a lot of data that we don't need.

Let us now focus only on the desired data. For this, we will use a neat little trick called inspecting a webpage.

Go to the webpage that we are trying to extract and press "Ctrl+Shift+I", or right-click and choose "Inspect".


You will see the browser's developer tools open alongside the page.


It is an element inspector that tells you about each element present on a webpage.

Cool! Isn’t it?


It's time to find where our data lies on this web page; you will need a little knowledge of HTML for this.


Press "Ctrl+Shift+C" to enter element-selection mode. Now you can hover your mouse over the webpage and see the tags and attributes of each element.

Our data lies in a <div> tag with the class "story-body sp-story-body gel-body-copy".

Let's fetch the data present in this div using Python code:

content = soup.find("div", attrs={"class": "story-body sp-story-body gel-body-copy"})

We are using the find method to locate the part of the HTML where our data lies. In this case, it is a div tag whose class acts as a unique attribute.

Now, content holds only the required elements; you can check by printing the content variable.
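One caveat: if the BBC ever changes the page layout, find() will return None and the next step would crash with an AttributeError. A small defensive check, purely as a precaution:

# find() returns None when no matching tag exists, so fail early with a clear message
if content is None:
    raise SystemExit("Article body not found; the page layout may have changed.")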

We can see that our data lies in multiple <p></p> tags, and we want it in text format. We can get that by looping over all the <p></p> tags as follows:

article = ""
for para in content.findAll("p"):
  article += " " + para.text

Here, we loop through all the <p> tags and append each one's text to the article string.
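As a side note, the same result can be produced more compactly with str.join, which also avoids the leading space the loop adds at the start:

# Join the text of every <p> tag with single spaces in one pass
article = " ".join(para.text for para in content.find_all("p"))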

You can print the article and check your data; it is a plain Python string.

print(article)
output: Cristiano Ronaldo's header was enough for Juventus to beat AC Milan and claim a record eighth Supercoppa Italiana in a game played in Jeddah...

If you want to store this data in a file, you can do it like this:

with open("ScrappedArticle.txt","w") as f:
  f.write(article)

You will get a new text file, "ScrapedArticle.txt", with the data in it.
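To double-check that the write succeeded, you can read the file back and preview the beginning:

# Read the saved article back and print the first 100 characters
with open("ScrapedArticle.txt") as f:
    print(f.read()[:100])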


Voila! You have accomplished web scraping.


If you want to scrape multiple pages, try to find the patterns in them: most pages of a site follow a single template, which lets you loop through the pages and scrape them all, as in the sketch below.
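Here is a minimal sketch of that idea. The URL list is a hypothetical placeholder, and it assumes every page uses the same div class as our example article:

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Placeholder URLs; substitute real article links that share one layout
urls = [
    "https://www.example.com/articles/1",
    "https://www.example.com/articles/2",
]

for url in urls:
    soup = BeautifulSoup(urlopen(url), "html.parser")
    content = soup.find("div", attrs={"class": "story-body sp-story-body gel-body-copy"})
    if content is None:
        continue  # skip pages that don't match the expected layout
    article = " ".join(para.text for para in content.find_all("p"))
    print(article[:80])  # preview the first 80 characters of each article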


I hope you found this useful; do share it with fellow ML/DS enthusiasts.


About the person behind the keyboard: Nitish is a passionate Machine Learning Engineer and an amazing blogger. He is currently pursuing a B.Tech.

