A library for extracting data from web pages tmayt

Author: Taha Ayatollahi

Last updated: 1400/10/18

Reading time: 6 minutes

Installing the Beautiful Soup Python library

This Python library can be installed easily with the command pip install beautifulsoup4.

To start working with the library we need some HTML data, so we store an HTML string in a variable.

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Now, to parse this data, we use the BeautifulSoup class from the bs4 library and pass our variable to it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

Then, with the prettify method, we can print a neatly formatted (indented) version of the document.
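As a small sketch of what prettify gives back: it returns an ordinary string, so the result can be printed, saved to a file, or inspected line by line. The HTML snippet and variable names below are made up for illustration and are not part of the article's html_doc.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical document written on a single line.
snippet = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns a str with the markup spread over several indented lines.
pretty = soup.prettify()
print(pretty)

# Because it is a plain string, it can be processed like any other text.
line_count = len(pretty.splitlines())
print("number of lines:", line_count)
```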

Now, using the soup object we created, we can query it for values.

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

As you can see, calling find_all with a tag name returns every matching tag as a list. Likewise, the get_text method extracts all of the text content of this HTML.

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
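A common next step after find_all is to pull an attribute out of each tag. The following is a minimal sketch, assuming a shortened stand-in for the html_doc above; the variable name links is chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the html_doc used in this article.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list of Tag objects; get("href") reads one attribute
# and returns None when the attribute is missing.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

for text, url in links:
    print(f"{text}: {url}")
```

Using get("href") instead of a["href"] is a defensive choice: indexing raises KeyError for a tag without that attribute, while get simply returns None.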

Loading data into bs4 from a file

There are several ways to build the soup variable and prepare the HTML data.
One of them is to read from an HTML file, which can be written as follows.

from bs4 import BeautifulSoup

# Pass an open file handle to BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Alternatively, pass an HTML string directly
soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
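When the file contains non-ASCII text, it may help to open it with an explicit encoding rather than rely on the platform default. The sketch below assumes a UTF-8 file; it writes a throwaway index.html into a temporary directory only so that the example runs on its own, and the file contents are invented.

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")

    # Create a small sample file (hypothetical content).
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("<html><head><title>Café</title></head>"
                 "<body><p>a web page</p></body></html>")

    # Read it back with the same encoding and parse it while the file is open.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

# The parsed tree stays usable after the file has been closed and removed.
page_title = soup.title.string
print(page_title)
```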

Loading data into bs4 with requests

Another approach is to send a request to a website, fetch the HTML of the page, and pass it to bs4.

import requests
from bs4 import BeautifulSoup

page = requests.request('GET', 'http://google.com')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
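In practice a request can hang or return an error page, so it may be worth adding a timeout and a status check before parsing. The following is a sketch under those assumptions; fetch_soup and page_title are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def page_title(soup):
    """Return the stripped text of <title>, or None if the page has no title tag."""
    return soup.title.get_text(strip=True) if soup.title else None
```

With these helpers, page_title(fetch_soup("http://example.com")) would fetch the page and return its title, and network failures surface as exceptions from requests rather than as an empty or misleading soup.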