※ 2020.10.09
Beautiful Soup is a Python library for pulling data out of HTML and XML documents, which makes it easy to scrape information from web pages.
In this post, I will focus on the features that are frequently used.
Installation
$ pip install beautifulsoup4
Usage
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
Import the BeautifulSoup class, and pass the data and a parser name to it.
Usually, we get the data with the requests module and handle it with the HTML parser.
If you need information about the requests module, please see this post.
2020/10/08 - [Python] - [Python] requests module
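Putting it together, here is a minimal sketch of the usual flow (the URL is just a placeholder for illustration):
import requests
from bs4 import BeautifulSoup

# Placeholder URL, just for illustration.
result = requests.get("https://example.com/")
soup = BeautifulSoup(result.text, "html.parser")
print(soup.title)  # the <title> tag of the fetched page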
Attributes
- Get a tag by its name
You can get an HTML tag by its name.
soup.tag
Look at this example; you can get more data from the tag.
tag = soup.div
print(tag.name)          # the tag's name, e.g. 'div'
print(tag['id'])         # the value of the 'id' attribute
print(tag.get('class'))  # like tag['class'], but returns None if the attribute is missing
print(tag.attrs)         # all attributes of the tag as a dictionary
As you can see, you can get the name of the tag, the value of an attribute by its name, and all of the attributes of the tag as a dictionary.
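For instance, with a small made-up snippet, it looks like this:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<div id="main" class="box wide">Hello</div>'
soup = BeautifulSoup(data, 'html.parser')
tag = soup.div
print(tag.name)          # div
print(tag['id'])         # main
print(tag.get('class'))  # ['box', 'wide'] (class is a multi-valued attribute)
print(tag.attrs)         # {'id': 'main', 'class': ['box', 'wide']}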
- Get the contents of the tag
title = soup.h1.string
for string in soup.h2.strings:
    print(repr(string))
for string in soup.h3.stripped_strings:
    print(repr(string))
The string attribute helps to get the content of the tag.
The strings attribute gives the contents of the tag as an iterable, including the whitespace.
The stripped_strings attribute gives the same contents with the extra whitespace stripped.
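Note that string returns None when the tag has more than one child. A small sketch with made-up HTML shows the difference between the three attributes:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<h2>\n  <a>First</a>\n  <a>Second</a>\n</h2>'
soup = BeautifulSoup(data, 'html.parser')
print(soup.h2.string)  # None: the h2 has more than one child
for s in soup.h2.strings:
    print(repr(s))     # 'First' and 'Second', plus the '\n  ' whitespace strings
for s in soup.h2.stripped_strings:
    print(repr(s))     # 'First', 'Second'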
- Get the children tags of the tag
tag = soup.head
print(tag.contents)
for child in tag.children:
    print(child)
The contents attribute gives the children of the tag as a list.
The children attribute gives the same children as an iterator.
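A quick sketch with a made-up head tag:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<head><title>Demo</title><meta charset="utf-8"/></head>'
soup = BeautifulSoup(data, 'html.parser')
tag = soup.head
print(tag.contents)         # [<title>Demo</title>, <meta charset="utf-8"/>]
for child in tag.children:  # the same children, one by one
    print(child)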
- Get the sibling tags of the tag
prev_tag = soup.img.previous_sibling
next_tag = soup.img.next_sibling
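Keep in mind that the text and whitespace between tags count as siblings too, so next_sibling is often a string rather than a tag in real documents. A sketch with made-up HTML where the siblings are tags:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<p><b>prev</b><img src="a.png"/><i>next</i></p>'
soup = BeautifulSoup(data, 'html.parser')
print(soup.img.previous_sibling)  # <b>prev</b>
print(soup.img.next_sibling)      # <i>next</i>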
Functions
- Find a tag
span = soup.find("span")
The find() function returns the first tag that matches the given name.
items = soup.find_all('li')
The find_all() function returns all of the tags that match the given name, as a list.
These functions can also be used with the attributes or the contents of the tags.
soup.find('a', {'class': 'link'})
soup.find_all('h1', {'class': 'title'})
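Also note that find() returns None when nothing matches, while find_all() returns an empty list. A quick sketch with made-up HTML:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<ul><li class="item">one</li><li class="item">two</li></ul>'
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('li'))                         # the first <li> only
print(soup.find_all('li', {'class': 'item'}))  # a list of both <li> tags
print(soup.find('span'))                       # None: no match
print(soup.find_all('span'))                   # []: no matches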
You can limit the results of the find_all() function.
soup.find_all('div', limit=2)
You can also limit the scope of the search.
soup.find('section', recursive=False)
soup.find_all('nav', recursive=False)
With recursive=False, Beautiful Soup only searches the direct children of the tag, not all of its descendants.
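For example, with a made-up nested list, recursive=False only sees the direct children:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<ul><li>outer<ul><li>inner</li></ul></li></ul>'
soup = BeautifulSoup(data, 'html.parser')
outer = soup.ul
print(len(outer.find_all('li')))                   # 2: searches all descendants
print(len(outer.find_all('li', recursive=False)))  # 1: direct children only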
- Use selectors
You can also use CSS selectors with the select() function.
soup.select('p:nth-of-type(3)')
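Any valid CSS selector works. A few more patterns, sketched with made-up HTML:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<div class="article"><a href="https://a.com">A</a></div><p id="content">hi</p>'
soup = BeautifulSoup(data, 'html.parser')
print(soup.select('div.article > a'))   # direct <a> children of the div
print(soup.select('#content'))          # select by id
print(soup.select('a[href^="https"]'))  # attribute prefix match
print(soup.select_one('a'))             # the first match only, like find()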
- Get text inside a tag
soup.get_text()
You can specify a string to be used to join the bits of text together.
soup.get_text(',')
You can also strip whitespace from each bit of text.
soup.get_text(strip=True)
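Here is a quick sketch of all three calls on a made-up snippet:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<p>Hello <b>world</b> !</p>'
soup = BeautifulSoup(data, 'html.parser')
print(soup.get_text())                 # 'Hello world !'
print(soup.get_text(','))              # 'Hello ,world, !'
print(soup.get_text(',', strip=True))  # 'Hello,world,!'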
- Check the tag
tag.has_attr('href')
The has_attr() function checks whether the tag has the given attribute.
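A quick sketch with made-up HTML:
from bs4 import BeautifulSoup

# Made-up HTML, just for illustration.
data = '<a href="https://example.com">link</a><a>no link</a>'
soup = BeautifulSoup(data, 'html.parser')
for tag in soup.find_all('a'):
    print(tag.has_attr('href'))  # True, then False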
- Show the HTML document in a human-readable form
print(soup.a.prettify())
Example
As a practice, let's get the contents from The Hacker News site.
Open the browser and go to the site.
Open the developer tools and inspect the page.
Now, let's get the titles.
import requests
from bs4 import BeautifulSoup
URL = "https://thehackernews.com/"
result = requests.get(URL)
soup = BeautifulSoup(result.text, "html.parser")
titles = soup.find_all("h2", {"class": "home-title"})
for title in titles:
    print(title.string)
How about the posted date?
Aha~ it is also so easy.
dates = soup.find_all("i", {"class": "icon-calendar"})
for date in dates:
    print(date.next_sibling.string)
I used the next_sibling attribute because the date string is a sibling of the i tag.
Let's get one more thing.
Getting the author works the same way as getting the date.
authors = soup.find_all("i", {"class": "icon-user"})
for author in authors:
    print(author.next_sibling.string.strip())
The one difference is whitespace, so I use the strip() function to remove it.
This is a practice, so let's group this information into a list of dictionaries.
import requests
from bs4 import BeautifulSoup
URL = "https://thehackernews.com/"
result = requests.get(URL)
soup = BeautifulSoup(result.text, "html.parser")
titles = soup.find_all("h2", {"class": "home-title"})
dates = soup.find_all("i", {"class": "icon-calendar"})
authors = soup.find_all("i", {"class": "icon-user"})
scraped_data = []
for i in range(len(titles)):
    scraped_data.append(
        {
            "title": titles[i].string,
            "date": dates[i].next_sibling.string,
            "author": authors[i].next_sibling.string.strip(),
        }
    )
print(scraped_data)
This is the full source.
Now, I'm sure you will be able to scrape the data that you want.
Try making whatever you want!!