
[Python] Beautiful Soup4 module

by llHoYall 2020. 10. 9.

※ 2020.10.09

 

Beautiful Soup is a Python library for pulling data out of HTML and XML documents. It makes it easy to scrape information from web pages.

In this post, I will focus on the features that are used most frequently.

Installation

$ pip install beautifulsoup4

Usage

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

Import the BeautifulSoup module, and give the data and parser to it.

Usually, we get the data with the requests module and handle it with the HTML parser.

If you need information about the requests module, please see this post:

2020/10/08 - [Python] - [Python] requests module
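Before scraping a live site, here is a minimal, self-contained sketch; the inline HTML snippet is made up for illustration so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# A tiny inline document instead of a real web page.
html = "<html><body><h1>Hello</h1><p class='intro'>Welcome</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.string)   # Hello
print(soup.p["class"])  # ['intro']
```

Note that multi-valued attributes like class come back as a list, not a plain string.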

Attributes

- Get tag with the tag name

You can get an HTML tag with the tag name.

soup.tag

As this example shows, you can get more data from the tag:

tag = soup.div

print(tag.name)
print(tag['id'])
print(tag.get('class'))
print(tag.attrs)

As you can see, you can get the name of the tag, the value of an attribute by its name, and the full attribute dictionary of the tag.
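Here is the same idea as a runnable sketch, using a made-up div for illustration:

```python
from bs4 import BeautifulSoup

html = '<div id="main" class="wrapper box">content</div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.div

print(tag.name)          # div
print(tag["id"])         # main
print(tag.get("class"))  # ['wrapper', 'box']
print(tag.attrs)         # {'id': 'main', 'class': ['wrapper', 'box']}
```

tag.get() returns None when the attribute is missing, while tag["..."] raises a KeyError, so get() is the safer choice for optional attributes.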

- Get the contents of the tag

title = soup.h1.string

for string in soup.h2.strings:
    print(repr(string))

for string in soup.h3.stripped_strings:
    print(repr(string))

The string attribute gets the content of the tag.

The strings attribute gives the contents of the tag as an iterable, including whitespace.

The stripped_strings attribute gives the contents of the tag as an iterable with the whitespace stripped.
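The difference is easy to see with a small made-up snippet that contains extra whitespace:

```python
from bs4 import BeautifulSoup

html = "<div>\n  Hello\n  <b>World</b>\n</div>"
soup = BeautifulSoup(html, "html.parser")

with_ws = list(soup.div.strings)             # keeps newlines and indentation
without_ws = list(soup.div.stripped_strings)  # whitespace removed

print(without_ws)  # ['Hello', 'World']
```

stripped_strings also skips strings that become empty after stripping, which is why it is usually the more convenient of the two.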

- Get the children tags of the tag

tag = soup.head

print(tag.contents)
for child in tag.children:
    print(child)

The contents attribute gives the child tags as a list.

The children attribute gives the child tags as an iterable.

- Get the sibling tags of the tags

prev_tag = soup.img.previous_sibling
next_tag = soup.img.next_sibling
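The following sketch shows children and sibling navigation together, on an invented list so it runs standalone:

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# Iterate over the direct children of the <ul>.
print([child.string for child in ul.children])  # ['one', 'two', 'three']

# Navigate sideways from the middle <li>.
second = ul.contents[1]
print(second.previous_sibling.string)  # one
print(second.next_sibling.string)      # three
```

Be aware that in real pages, whitespace between tags also counts as a sibling node, so next_sibling may be a plain string rather than a tag.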

Functions

- Find a tag

span = soup.find("span")

The find() function returns the first tag that matches the given name.

items = soup.find_all('li')

The find_all() function returns a list of all tags that match the given name.

 

These functions can also be used with attribute filters.

soup.find('a', {'class': 'link'})

soup.find_all('h1', {'class': 'title'})

 

You can limit the results of the find_all() function.

soup.find_all('div', limit=2)
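Putting find_all() with an attribute filter and the limit option together, on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = (
    '<h1 class="title">First</h1>'
    '<h1 class="title">Second</h1>'
    '<h1 class="other">Third</h1>'
)
soup = BeautifulSoup(html, "html.parser")

# Only tags whose class matches the filter are returned.
titles = soup.find_all("h1", {"class": "title"})
print([t.string for t in titles])  # ['First', 'Second']

# limit stops the search after the given number of matches.
limited = soup.find_all("h1", limit=2)
print(len(limited))  # 2
```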

 

You can limit the scope of where to find.

soup.find('section', recursive=False)

soup.find_all('nav', recursive=False)

With recursive=False, BeautifulSoup searches only the direct children of the tag, not all of its descendants.
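A small invented example makes the effect of recursive=False visible:

```python
from bs4 import BeautifulSoup

html = "<div><p>direct</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: search the whole subtree under <div>.
all_p = soup.div.find_all("p")
# recursive=False: only look at the direct children of <div>.
direct_p = soup.div.find_all("p", recursive=False)

print(len(all_p))     # 2
print(len(direct_p))  # 1
```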

- Use Selector

You can also use CSS selectors.

soup.select('p:nth-of-type(3)')
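A runnable sketch of the selector syntax, again on an invented snippet; select() returns a list, while select_one() returns the first match:

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

third = soup.select("p:nth-of-type(3)")[0]
first = soup.select_one("div > p")

print(third.string)  # three
print(first.string)  # one
```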

- Get text inside a tag

soup.get_text()

You can specify a string to be used to join the bits of text together.

soup.get_text(',')

You can also strip whitespace from the text.

soup.get_text(strip=True)
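The separator and strip options can be combined, as this sketch on a made-up paragraph shows:

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>World</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

text = soup.get_text()                   # Hello World!
joined = soup.get_text(",", strip=True)  # Hello,World,!

print(text)
print(joined)
```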

- Check the tag

tag.has_attr('href')

The has_attr() function checks whether the tag has the given attribute.
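For example, with two invented anchor tags, one with and one without an href:

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com">link</a><a>no link</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
print(links[0].has_attr("href"))  # True
print(links[1].has_attr("href"))  # False
```

This is handy for skipping tags that are missing an attribute before you try to read it.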

- Show the HTML document as human-readable

print(soup.a.prettify())

Example

Let's get the contents from The Hacker News site as practice.

Open the browser and go to the site.

Open the inspector and inspect the site.

 

 

Now, let's get the titles.

import requests
from bs4 import BeautifulSoup


URL = "https://thehackernews.com/"


result = requests.get(URL)
soup = BeautifulSoup(result.text, "html.parser")

titles = soup.find_all("h2", {"class": "home-title"})
for title in titles:
    print(title.string)

How about the posted date?

 

 

Aha~ it is also so easy.

dates = soup.find_all("i", {"class": "icon-calendar"})
for date in dates:
    print(date.next_sibling.string)

I used the next_sibling attribute because the date string is a sibling of the i tag.

Let's get one more thing.

 

 

Getting the author works the same way as the date.

authors = soup.find_all("i", {"class": "icon-user"})
for author in authors:
    print(author.next_sibling.string.strip())

The one difference is the whitespace, so I used the strip() function to remove it.

Since this is practice, let's group the information into dictionaries.

import requests
from bs4 import BeautifulSoup


URL = "https://thehackernews.com/"


result = requests.get(URL)
soup = BeautifulSoup(result.text, "html.parser")

titles = soup.find_all("h2", {"class": "home-title"})
dates = soup.find_all("i", {"class": "icon-calendar"})
authors = soup.find_all("i", {"class": "icon-user"})

scraped_data = []
for title, date, author in zip(titles, dates, authors):
    scraped_data.append(
        {
            "title": title.string,
            "date": date.next_sibling.string,
            "author": author.next_sibling.string.strip(),
        }
    )
print(scraped_data)

This is the full source.

Now, I'm sure you will be able to scrape the data that you want.

Try making whatever you want!!
