Get Text from a Website with Python 3 (Removing HTML)

Last time I created a program that displays only the title, keywords, and description from the meta tags.
This time I created a program that extracts the text of the top page and of the pages it links to.
Modifications:
1. Add the HTML parsing and CSV file libraries
Add the imports below.
==================
from urllib.parse import urljoin
from bs4 import Comment
import csv
===================
2. Open a file for saving the text
3. Remove the "comment", "script", "style", and "href" sections
4. Extract the text
5. Write to the CSV file
6. Add exception handling, because an error occurred on a site that requires authorization (login)
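Steps 3 and 4 can be tried on a small inline HTML string before running them against a real site. This is only a minimal sketch, assuming Beautiful Soup 4.4 or later (for the string= argument). SAMPLE_HTML and extract_visible_text are hypothetical names used for illustration, not part of the program.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample</title>
    <style>p { color: red; }</style>
    <script>console.log("hidden");</script>
  </head>
  <body>
    <!-- a comment that should disappear -->
    <p>Hello <a href="/next">world</a></p>
  </body>
</html>
"""


def extract_visible_text(html):
    """Return the non-empty text fragments left after removing comments, script, and style."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove comment nodes.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Remove script and style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Collect the remaining text nodes, skipping whitespace-only ones.
    return [s.strip() for s in soup.find_all(string=True) if s.strip()]


fragments = extract_visible_text(SAMPLE_HTML)
print(fragments)
```

Attribute values such as the href of a link are not text nodes, so they do not appear in the extracted fragments. Only the link text ("world" in this sample) does.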
Next time, I will create the authorization and analysis parts.
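Until the authorization part exists, the exception handling from step 6 could be wrapped in a small helper so that one failing page does not stop the whole run. This is a sketch under assumptions: fetch is a hypothetical helper name and the 10 second timeout is an arbitrary choice. Neither comes from the program itself.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None when the request fails.

    fetch is a hypothetical helper used for illustration only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as res:
            return res.read()
    except urllib.error.HTTPError as e:
        # The server answered with an error status,
        # e.g. 401 or 403 for login-only pages, or 404.
        print("HTTP error:", e.code, url)
    except urllib.error.URLError as e:
        # No usable answer at all: DNS failure, refused connection,
        # missing local file, and so on.
        print("URL error:", e.reason, url)
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first. Otherwise the more general handler would swallow it and the status code would not be printed.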

*The following is the code I added this time.

# Get links
# (url and html are assumed to be defined earlier in the program)
soup = BeautifulSoup(html, 'html.parser')
url_items = soup.select('li')
# File setting
file_path = "webscraping.csv"
csv_file = open(file_path, 'wt', newline='', encoding='utf-8')
csv_write = csv.writer(csv_file)

for url2 in url_items:
    try:
        # Initialize the row
        csv_data = []
        a = url2.a
        # Skip list items without a link or without an href attribute
        if a is not None and a.get("href"):
            url3 = a.attrs["href"]
            # Resolve a relative link against the top page URL
            url4 = urljoin(url, url3)
            print(url4)
            res = urllib.request.urlopen(url4)
            soup = BeautifulSoup(res, 'html.parser')
            # Title
            title1 = soup.find('title')
            print("Title = ", title1)
            # Header (may be missing on some pages)
            header = soup.find("head")
            # Description
            description = header.find("meta", attrs={"name": "description"}) if header else None
            # Keywords
            keywords = header.find("meta", attrs={"name": "keywords"}) if header else None
            print("Description = ", description)
            print("Keyword = ", keywords)
            # Add title, description, and keywords
            csv_data.append(title1)
            csv_data.append(description)
            csv_data.append(keywords)
            csv_write.writerow(csv_data)
            # Remove comments
            # (string= requires Beautiful Soup 4.4 or later; older versions call it text=)
            for comment in soup(string=lambda x: isinstance(x, Comment)):
                comment.extract()
            # Remove script and style tags
            for script in soup(['script', 'style']):
                script.decompose()
            # Extract text only
            for text in soup.find_all(string=True):
                # Initialize the row
                csv_data = []
                if text.strip():
                    print(text)
                    # Add contents
                    csv_data.append(text)
                    csv_write.writerow(csv_data)
    except urllib.error.HTTPError as e:
        print(e.code)  # e.g. 404
# Close
csv_file.close()
##===================================================
print("end")
#====================================================
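
The link resolution and CSV writing in the code above can also be checked without any network access. The following is a minimal sketch that uses only the standard library. BASE_URL and the href values are made-up examples, and the rows go to an in-memory buffer instead of webscraping.csv.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://example.com/blog/index.html"  # hypothetical top page

# href values as they might appear in the <li><a> links of the top page.
hrefs = ["post1.html", "/about", "../contact.html", "https://other.example.org/"]

# urljoin resolves each href against the page it was found on.
absolute = [urljoin(BASE_URL, href) for href in hrefs]

# Write one row per link into an in-memory buffer instead of a file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["href", "absolute_url"])
for href, resolved in zip(hrefs, absolute):
    writer.writerow([href, resolved])

print(buffer.getvalue())
```

Relative paths are resolved against the directory of the base page, a leading slash restarts from the site root, and an href that is already absolute is returned unchanged.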