What is a Web Crawler?
A web crawler, also known as a spider or spiderbot, is a software program that systematically browses the internet, visiting web pages and reading their data so that the content can be indexed. Search engines rely on these automated agents to discover new pages and keep their index of existing content up to date.
How do Search Engines Crawl Websites to Index Web Pages?
Search engines employ web crawlers to navigate the web and index information. These crawlers retrieve a webpage, parse its content to understand the topics covered, and follow its links to other pages. The details they collect are later processed and indexed by the search engine, which enables it to quickly retrieve relevant information in response to user queries.
BeautifulSoup
BeautifulSoup is a Python library designed to parse HTML and XML documents. It simplifies navigating, searching, and modifying the parse tree, and makes it easy to locate HTML elements by tag name or attribute, which makes it an essential tool for web crawling applications.
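As a quick, minimal illustration of the API (the HTML string here is invented just for the example):

from bs4 import BeautifulSoup

# A tiny, made-up document just to show the basic calls
html = "<html><body><p class='intro'>Hello</p><p>World</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find('p').text)                       # Hello  (first matching tag)
print([p.text for p in soup.find_all('p')])      # ['Hello', 'World']
print(soup.find('p', attrs={'class': 'intro'}))  # <p class="intro">Hello</p>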
Now, let’s implement a web crawler in Python that collects search results from Bing, DuckDuckGo, and Yahoo.
Code Implementation
from bs4 import BeautifulSoup
import requests

# Function to fetch Bing search results
def get_bing_results(search_query):
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(f"https://www.bing.com/search?q={search_query}", headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for item in soup.find_all('li', attrs={'class': 'b_algo'}):
        title = item.find('h2').text if item.find('h2') else ''
        link = item.find('a')['href'] if item.find('a') else ''
        snippet = item.find('p').text if item.find('p') else ''
        results.append({'title': title, 'link': link, 'snippet': snippet})
    return results
# Function to fetch DuckDuckGo search results
def get_duckduckgo_results(search_query):
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(f"https://duckduckgo.com/html/?q={search_query}", headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for item in soup.find_all('div', attrs={'class': 'result__body'}):
        # Look up each tag once, then guard against missing elements
        title_tag = item.find('a', attrs={'class': 'result__a'})
        snippet_tag = item.find('a', attrs={'class': 'result__snippet'})
        title = title_tag.text if title_tag else ''
        link = title_tag['href'] if title_tag else ''
        snippet = snippet_tag.text if snippet_tag else ''
        results.append({'title': title, 'link': link, 'snippet': snippet})
    return results
# Function to fetch Yahoo search results
def get_yahoo_results(search_query):
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(f"https://search.yahoo.com/search?p={search_query}", headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for item in soup.find_all('div', attrs={'class': 'Sr'}):
        title = item.find('h3').text if item.find('h3') else ''
        link = item.find('a')['href'] if item.find('a') else ''
        snippet = item.find('p').text if item.find('p') else ''
        results.append({'title': title, 'link': link, 'snippet': snippet})
    return results
# Function to compare search results from Bing, DuckDuckGo, and Yahoo
def compare_search_results(search_query):
    bing_results = get_bing_results(search_query)
    duckduckgo_results = get_duckduckgo_results(search_query)
    yahoo_results = get_yahoo_results(search_query)
    # Combine results for comparison
    combined_results = {
        'Bing': bing_results,
        'DuckDuckGo': duckduckgo_results,
        'Yahoo': yahoo_results
    }
    return combined_results
# Main function to execute the search comparison
def main():
    search_query = "learn python"
    results = compare_search_results(search_query)
    for search_engine, result_list in results.items():
        print(f"Results from {search_engine}:")
        for result in result_list:
            print(f"Title: {result['title']}")
            print(f"Link: {result['link']}")
            print(f"Snippet: {result['snippet']}")
            print("\n")

if __name__ == "__main__":
    main()
The output will look something like this:
Results from Bing:
Title: Learn Python - Free Interactive Python Tutorial
Link: https://www.learnpython.org/
Snippet: WEBThis site is generously supported by DataCamp. DataCamp offers online interactive Python Tutorials for Data Science. Join over a million other learners and get started learning Python for data science today! Take the Test. learnpython.org is a free interactive Python tutorial for people who want to learn Python, fast.
Title: Python Tutorial - W3Schools
Link: https://www.w3schools.com/python/
Snippet: WEBPython Examples. Learn by examples! This tutorial supplements all
...
Results from DuckDuckGo:
Title: python basics for beginners - Find the Books Of Your Choice
...
Results from Yahoo:
Title: www.python.org › about › gettingstartedPython For Beginners | Python.org
Link: https://r.search.yahoo.com/
...
Steps to Execute the Script
Install Dependencies
Ensure you have the necessary Python libraries installed. You can install them using pip.
pip install requests beautifulsoup4
Run the Script
Save the script to a file, e.g., search_comparison.py, and run it using Python.
python search_comparison.py
Note: When using tools like Requests and BeautifulSoup to fetch search results from search engines, you might encounter "no results" messages (for example, “There are no results for learn python”). This usually happens because the site has detected the automated request and blocked it.
To address this:
1. Ensure Proper URL Encoding: Verify that your search query is correctly encoded in the URL (see the sketch after this note).
2. Moderate Request Frequency: If you run queries in quick succession, space them out with a delay (e.g., time.sleep()) to avoid triggering automated-traffic limits.
3. Check Query Validity: Experiment with different queries to make sure the issue isn’t caused by overly specific search terms.
Following these practices will help maintain effective and respectful use of web services in your automation tasks.
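As a minimal sketch of the first two points (the Bing URL and the queries are just examples):

import time
from urllib.parse import quote_plus

import requests

headers = {"User-Agent": "Mozilla/5.0"}

for query in ["learn python", "python web crawler"]:
    encoded = quote_plus(query)  # "learn python" -> "learn+python"
    response = requests.get(f"https://www.bing.com/search?q={encoded}", headers=headers)
    print(query, response.status_code)
    time.sleep(5)  # pause between requests to avoid bursts of automated traffic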
Code Explanation
We’ll only be discussing the get_bing_results(search_query) function, since the other two functions (get_duckduckgo_results(search_query) and get_yahoo_results(search_query)) work similarly.
While the code for requesting the HTML of search pages is nearly identical across Bing, DuckDuckGo, and Yahoo (with the URLs being the only difference), the code that extracts ‘link title,’ ‘link URL,’ and ‘link description’ from search results varies for each search engine. This is due to the distinct HTML structures of each website. However, once you’ve understood the code for extracting link information from Bing’s search results, you can easily adapt these techniques to handle the HTML of DuckDuckGo and Yahoo as well.
Let’s go through the get_bing_results(search_query) function line by line.
headers = {"User-Agent": "Mozilla/5.0"}
- This line sets ‘Mozilla/5.0’ as the user agent for our HTTP requests, making them appear as if they originate from a web browser. Websites often use the “User-Agent” HTTP header to identify the type of device and browser making a request; this helps them tailor content appropriately and also detect automated scripts or bots. Sending a browser-like value can therefore help prevent your script from being blocked by websites that restrict automated traffic.
This technique is a common practice in web crawling to bypass simple bot detection mechanisms employed by many websites.
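To see what the header changes, the short sketch below compares the User-Agent that Requests sends by default with the custom one (the exact default value depends on your installed version of Requests):

import requests

default = requests.get("https://www.bing.com")
custom = requests.get("https://www.bing.com", headers={"User-Agent": "Mozilla/5.0"})

# Inspect the headers that were actually sent with each request
print(default.request.headers["User-Agent"])  # something like python-requests/2.x.x
print(custom.request.headers["User-Agent"])   # Mozilla/5.0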
response = requests.get(f"https://www.bing.com/search?q={search_query}", headers=headers)
- This line uses the requests.get method to perform an HTTP GET request to the URL https://www.bing.com/search?q={search_query}. The search_query variable specifies the search term, and the headers dictionary is passed to customize the HTTP headers sent with the request.
Here, the response object contains all the information returned by the server, including the status code, headers, and the body of the response. You can access the body using response.text for text content or response.json() for JSON content.
Further, you can check the status code of the response using response.status_code. A code in the 2xx range means the request was successfully received, understood, and accepted.
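For example, a standalone check might look like this (the query is hard-coded just for illustration):

import requests

headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get("https://www.bing.com/search?q=learn+python", headers=headers)

print(response.status_code)      # e.g. 200 on success
if response.ok:                  # True for any status code below 400
    print(len(response.text), "characters of HTML received")
else:
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses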
soup = BeautifulSoup(response.text, "html.parser")
- This line initializes a BeautifulSoup object with the content of the response. The “html.parser” argument specifies that Python’s built-in HTML parser should be used to parse this content.
for item in soup.find_all('li', attrs={'class': 'b_algo'}):
- This line iterates over each element in the parsed HTML (soup) that matches the criteria: an <li> tag with the class attribute b_algo. This class is specific to Bing and identifies individual search result entries on Bing’s results page. The find_all() method finds every tag matching the given name and attributes and returns them as a list of bs4 Tag objects.
Extracting HTML Tags
If you right-click a search result link on Bing’s results page and choose Inspect (in Chrome) to open the developer tools, you can examine the page’s HTML. To inspect a specific search result, look for an <li> element with the b_algo class; you should see a structure roughly like the sketch below.
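Since the original screenshots aren’t reproduced here, the snippet below uses a simplified, invented approximation of that structure (real Bing markup has many more attributes and nested wrappers) and parses it the same way the crawler does:

from bs4 import BeautifulSoup

# Invented, simplified markup -- for illustrating the structure only
html = """
<li class="b_algo">
  <h2><a href="https://www.learnpython.org/">Learn Python - Free Interactive Python Tutorial</a></h2>
  <p>learnpython.org is a free interactive Python tutorial ...</p>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all('li', attrs={'class': 'b_algo'})

print(len(items))                # 1 -- one matching result entry
print(items[0].find('h2').text)  # Learn Python - Free Interactive Python Tutorial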
Here, we’ll focus only on the <h2>, <a>, and <p> elements inside each <li> tag with the b_algo class. We’ll use these HTML elements to extract the link information, i.e.:
- link title
- link URL
- link description or snippet
title = item.find('h2').text if item.find('h2') else ''
- Extract Title: Inside the loop, this line tries to find an <h2> tag within each item (which represents a single search result). If it finds an <h2> tag, it extracts the tag’s text content to get the title of the search result. If no <h2> tag is found, it sets title to an empty string.
link = item.find('a')['href'] if item.find('a') else ''
- Extract Link: This line finds the first <a> tag within each item and retrieves the value of its href attribute, which contains the URL of the search result. If no <a> tag is found, it sets link to an empty string.
snippet = item.find('p').text if item.find('p') else ''
- Extract Snippet: This line attempts to find a <p> tag within each item. If a <p> tag is found, it extracts its text, which is typically a summary or snippet of the page’s content. If no <p> tag is found, it sets snippet to an empty string.
results.append({'title': title, 'link': link, 'snippet': snippet})
- Store Result: This line appends a dictionary to the results list. Each dictionary contains the keys 'title', 'link', and 'snippet', holding the data extracted from one search result.
Overall, the function retrieves search results from Bing and returns them as a list of dictionaries, where each dictionary includes the title, link, and snippet of a search result. Similarly, the other two functions perform the same tasks for DuckDuckGo and Yahoo search results.
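For instance, with get_bing_results from the script above in scope, calling it directly returns that list of dictionaries (the printed values depend on whatever Bing serves at the time):

results = get_bing_results("learn python")
print(len(results))  # number of result entries parsed from the page
if results:
    print(results[0]['title'])
    print(results[0]['link'])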
Some additional points
The script can be extended further, for example to:
- Compare the ranking of websites across the different search engines.
- Examine how the sources and content types returned by each engine differ.
- Identify the unique results each engine offers (a minimal overlap check is sketched below).
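As one example, a minimal sketch of the last point, built on the compare_search_results() function from the script above, counts how many engines returned each link:

from collections import Counter

def link_overlap(combined_results):
    # Count how many engines returned each non-empty link
    counts = Counter(
        result['link']
        for engine_results in combined_results.values()
        for result in engine_results
        if result['link']
    )
    # Keep only links that appear in more than one engine's results
    return {link: n for link, n in counts.items() if n > 1}

# Example usage:
# overlap = link_overlap(compare_search_results("learn python"))
# print(overlap)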
Practice Project
To further explore web crawling, consider developing a crawler for Google search results. Given Google’s advanced anti-crawling measures, it’s advisable to use Selenium, a tool that automates a real browser and thereby mimics human browsing behavior. This helps avoid the detection and blocking that plain HTTP requests made with libraries like Requests and BeautifulSoup often run into, allowing for effective data collection and analysis.
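A rough starting point might look like the sketch below. Note that the CSS selectors are placeholders that would need to be verified against Google’s current markup (it changes frequently), the code assumes Selenium 4 with a locally available Chrome, and you should review Google’s terms of service before crawling it:

from selenium import webdriver
from selenium.webdriver.common.by import By

def get_google_results(search_query):
    driver = webdriver.Chrome()  # Selenium 4 can locate a matching chromedriver automatically
    try:
        driver.get(f"https://www.google.com/search?q={search_query}")
        results = []
        # "div.g" is a commonly cited container for organic results; treat it as a placeholder
        for item in driver.find_elements(By.CSS_SELECTOR, "div.g"):
            titles = item.find_elements(By.TAG_NAME, "h3")
            links = item.find_elements(By.TAG_NAME, "a")
            title = titles[0].text if titles else ''
            link = links[0].get_attribute("href") if links else ''
            results.append({'title': title, 'link': link})
        return results
    finally:
        driver.quit()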
This project highlights good programming practices such as respectful crawling and careful handling of HTTP headers, balancing efficient data collection with ethical considerations. From here, you can move on to handling dynamic web pages and managing more complex crawling scenarios, such as real-time price monitoring of products on e-commerce websites or sentiment analysis of product reviews, while continuing to follow web crawling best practices.