I get all the tags inside the div. How can I limit the reach of the for loop to include only the content I want?

Issue related to: Web Scraping

Module used: BeautifulSoup, Requests

Issue link: https://www.reddit.com/r/learnpython/comments/fulqjy/web_scraping_issue/

Issue:

Hi there,

I’m trying to get a few fields from https://www.merriam-webster.com/word-of-the-day

I need the “word”, the “definitions”, and the “Did you know?” section. I was able to get the word, and the word “Definition” by targeting h1 and h2 tags.

Problem 1: I need the definitions, but if I target the <p> tags with a for loop I get all the tags inside the div. How can I limit the reach of the for loop to include only the definitions? Bear in mind that sometimes there are 2 definitions and some days there are 5.

Problem 2: Very similar to problem 1. I only need the content inside the “Did you know?” section but by targeting the <p> tag it spits out content I don’t need.

Problem 3: I’m a beginner, please be patient.

Thank you!!

Here is my code:

from bs4 import BeautifulSoup
import requests
sauce = requests.get('https://www.merriam-webster.com/word-of-the-day').text
soup = BeautifulSoup(sauce, 'lxml')
article = soup.find('article')
word = article.find('div', class_='word-and-pronunciation').h1.text
definition = article.find('div', class_='wod-definition-container').h2.text
print(word)
print(definition)
for s in soup.find_all('article')[0].find('div', class_='wod-definition-container').find_all('p'):
 print(s.text)

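Short answer to “I get all the tags inside the div. How can I limit the reach of the for loop to include only the content I want?”:

Call find_all() on the container div (stored below in the variable definition) with recursive=False, so that only the direct children of the div are searched: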

definition.find_all('p', recursive=False)
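To see the difference that recursive=False makes, here is a minimal sketch run against a small inline snippet rather than the live page. The markup is a hypothetical, simplified stand-in for the real page, and the parser is 'html.parser' (bundled with Python) rather than the 'lxml' used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has more tags and attributes.
html = """
<div class="wod-definition-container">
  <p>1 : first definition</p>
  <p>2 : second definition</p>
  <div class="did-you-know-wrapper">
    <p>Background text that is not a definition.</p>
  </div>
</div>
"""

# 'html.parser' ships with Python; the original code uses 'lxml' instead.
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="wod-definition-container")

# Default behaviour: every descendant p tag, including the nested one.
all_paragraphs = [p.text for p in container.find_all("p")]

# recursive=False: only p tags that are direct children of the container.
direct_paragraphs = [p.text for p in container.find_all("p", recursive=False)]

print(len(all_paragraphs))
print(direct_paragraphs)
```

The first call picks up the nested paragraph as well, which is the behaviour the question describes; the second call leaves it out.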

Solution, Walk through and Explanation:

First we need to break down the problem. The user would like to get hold of three sections from the Merriam-Webster web page:

  • The word
  • The definition
  • The “Did you know?”

To tackle this we will use two modules commonly used for web scraping, BeautifulSoup and Requests:

from bs4 import BeautifulSoup
import requests

Next we need to get the web page. requests.get() fetches the page and returns a response object. As we are only interested in the HTML content, we use .text and store the result in the variable sauce:

sauce = requests.get('https://www.merriam-webster.com/word-of-the-day').text
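The one-line request above works when everything goes well. As a hedged sketch, the same step can be wrapped in a small helper that sets a timeout and fails loudly on HTTP errors. The function name fetch_html and the 10-second default are assumptions made for illustration; they are not part of the original code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of a page as text (hypothetical helper).

    The timeout stops the call from hanging indefinitely, and
    raise_for_status() raises requests.HTTPError on 4xx/5xx responses.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

With this helper, sauce = fetch_html('https://www.merriam-webster.com/word-of-the-day') could stand in for the one-liner.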

To navigate the page and get hold of specific sections, we need to create a soup. We pass the variable sauce and the name of a parser ('lxml', which requires the lxml package to be installed) to BeautifulSoup and store the result in the variable soup:

soup = BeautifulSoup(sauce, 'lxml')

We have done the preliminary work and can now get hold of the word. We first need to inspect the web page by going to the site and viewing the page source:

<main>
<article>
<div class="article-header-container wod-article-header">
<span class="w-a-title margin-lr-0 margin-tb-1875em">Word of the Day : April 4, 2020</span>
<div class="under-widget-title-line"></div>
<div class="quick-def-box">
<div class="word-header">
<div class="word-and-pronunciation">
<h1>solecism</h1>
<a class="play-pron wod-autoplay" data-lang="en_us" data-file="soleci01" data-dir="s" href="javascript:void(0)" title="Listen to the pronounciation of solecism">play <span class="play-box"></span></a>
</div>
</div>
<span class="scrollDepth" data-eventName="wotd-headword"></span>
<!-- end simple definition header -->

We can see that the word is in the tags: main > article > div class="word-and-pronunciation" > h1

We can now use this information. The page has only one h1 tag, so we can target it directly with soup.find('h1'). As we are only interested in the content we use .text, and we store the result in the variable word:

word = soup.find('h1').text
print(word)

Printing word should give us the word of the day (solecism in this example).
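soup.find() returns None when nothing matches, so calling .text on the result raises an AttributeError if the page layout ever changes. A minimal sketch of a more defensive version follows. The helper name extract_word and the inline test pages are assumptions for illustration, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup


def extract_word(soup):
    """Return the text of the first h1 tag, or None if there is no h1."""
    heading = soup.find("h1")
    if heading is None:
        return None
    # get_text(strip=True) also removes surrounding whitespace.
    return heading.get_text(strip=True)


# Hypothetical stand-ins for a page with and without an h1 tag.
page = BeautifulSoup("<article><h1> solecism </h1></article>", "html.parser")
empty = BeautifulSoup("<article></article>", "html.parser")

print(extract_word(page))
print(extract_word(empty))
```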

The next task is to get the definition. This is similar to getting the word. First we inspect the page and locate the part we want.

In this case it was main > article > div class="lr-cols-area clearfix sticky-column state-middle" > div class="left-content" > div class="wod-article-container" > div class="wod-definition-container" > p

The issue with this task is that a word can have several definitions, so instead of finding a single p tag we will find the div that contains them. We do this with soup.find(), passing the 'div' tag and class_='wod-definition-container' to select that specific div. We store the result in the variable definition:

definition = soup.find('div', class_='wod-definition-container')

Now we need to find the p tags directly inside that div and print them out. To do this we call find_all() on definition, passing 'p' and recursive=False. With recursive=False only the direct children of the div are searched, so p tags nested inside other div tags (such as the “Did you know?” section) are skipped. We should now have:

definition.find_all('p', recursive=False)

This will collect the matching p tags and add them to a list. We can now use a for loop to print out all the definitions like so:

for p in definition.find_all('p', recursive=False):
    print(p.text)

Note that we are printing p.text as we only want the content and not the tag.
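The same "direct children only" selection can also be written as a CSS selector with the child combinator (>), using BeautifulSoup's select() method. This is an alternative to the recursive=False approach, not what the walkthrough itself uses. The markup below is a hypothetical, simplified stand-in for the real page, parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
html = """
<div class="wod-definition-container">
  <p>1 : first definition</p>
  <p>2 : second definition</p>
  <div class="did-you-know-wrapper">
    <p>Not a definition.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.wod-definition-container > p" matches only p tags that are
# direct children of the container div, not deeper descendants.
definitions = [
    p.get_text(strip=True)
    for p in soup.select("div.wod-definition-container > p")
]

print(definitions)
```

Collecting the text into a list, rather than printing inside the loop, also makes it easy to handle two definitions one day and five the next.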

The final task is to get the contents of the “Did you know?” section. Again this is similar to finding the word. We locate the part we want: main > article > div class="lr-cols-area clearfix sticky-column state-middle" > div class="left-content" > div class="wod-article-container" > div class="wod-definition-container" > div class="did-you-know-wrapper" > div class="left-content-box" > p

The p tag is inside the div class="left-content-box", which is inside the div class="wod-definition-container". Fortunately we already have the div class="wod-definition-container" stored in the variable definition, so we can reuse it to find the div class="left-content-box" with definition.find('div', class_='left-content-box'). We then add .find('p'), which returns the first p tag inside that div, so that we now have definition.find('div', class_='left-content-box').find('p'). We will store this in the variable dyk:

dyk = definition.find('div', class_='left-content-box').find('p')

We can then print the contents

print(dyk.text)
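Note that find('p') returns only the first matching tag. If the “Did you know?” box ever contains more than one paragraph, find_all('p') collects them all. The sketch below illustrates this on hypothetical, simplified markup (the two-paragraph box is an assumption for illustration), parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs in the "Did you know?" box.
html = """
<div class="wod-definition-container">
  <p>1 : a definition</p>
  <div class="did-you-know-wrapper">
    <div class="left-content-box">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
definition = soup.find("div", class_="wod-definition-container")
box = definition.find("div", class_="left-content-box")

# find() stops at the first match.
first_only = box.find("p").text

# find_all() returns every p tag in the box; join them into one string.
dyk_text = "\n".join(p.get_text(strip=True) for p in box.find_all("p"))

print(first_only)
print(dyk_text)
```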

Final Code:

from bs4 import BeautifulSoup
import requests
# Download the page and parse it
sauce = requests.get('https://www.merriam-webster.com/word-of-the-day').text
soup = BeautifulSoup(sauce, 'lxml')
# The word of the day is in the page's only h1 tag
word = soup.find('h1').text
print("\n" + word + "\n")
# The definitions are the p tags directly inside the container div
definition = soup.find('div', class_='wod-definition-container')
for p in definition.find_all('p', recursive=False):
    print(p.text)
# The first paragraph of the "Did you know?" box
dyk = definition.find('div', class_='left-content-box').find('p')
print("\n" + dyk.text)

Output:

solecism

1 : an ungrammatical combination of words in a sentence; also : a minor blunder in speech
2 : something deviating from the proper, normal, or accepted order
3 : a breach of etiquette or decorum

The city of Soloi had a reputation for bad grammar. Located in Cilicia, an ancient coastal nation in Asia Minor, it was populated by Athenian colonists called soloikos (literally “inhabitant of Soloi”). According to historians, the colonists of Soloi allowed their native Athenian Greek to be corrupted and started using words incorrectly. As a result, soloikos gained a new meaning: “speaking incorrectly.” The Greeks used that sense as the basis of soloikismos, meaning “an ungrammatical combination of words.” That root, in turn, gave rise to the Latin soloecismus, the direct ancestor of the English word solecism. Nowadays, solecism can refer to social blunders as well as sloppy syntax.
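As a closing sketch, the parsing steps of the final code can be gathered into one function that takes HTML text and returns the three fields, which makes them easy to try out without hitting the website. The function name parse_word_of_the_day, the returned dictionary keys, and the sample markup are all assumptions for illustration, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup


def parse_word_of_the_day(html):
    """Extract the word, definitions and "Did you know?" text from HTML.

    Hypothetical helper: it assumes the class names seen in the walkthrough.
    """
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    container = soup.find("div", class_="wod-definition-container")
    if heading is None or container is None:
        raise ValueError("page layout not recognised")

    # Only p tags that are direct children of the container are definitions.
    definitions = [
        p.get_text(strip=True)
        for p in container.find_all("p", recursive=False)
    ]

    # The "Did you know?" paragraphs sit in a nested div.
    box = container.find("div", class_="left-content-box")
    did_you_know = []
    if box is not None:
        did_you_know = [p.get_text(strip=True) for p in box.find_all("p")]

    return {
        "word": heading.get_text(strip=True),
        "definitions": definitions,
        "did_you_know": did_you_know,
    }


# Hypothetical, simplified markup standing in for the downloaded page.
sample = """
<article>
  <h1>solecism</h1>
  <div class="wod-definition-container">
    <p>1 : first definition</p>
    <p>2 : second definition</p>
    <div class="did-you-know-wrapper">
      <div class="left-content-box"><p>Some background.</p></div>
    </div>
  </div>
</article>
"""

result = parse_word_of_the_day(sample)
print(result["word"])
print(result["definitions"])
print(result["did_you_know"])
```

To run it against the live page, pass it the text returned by requests.get() in the final code above.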