Tag Archives: scraping

Open the browser with "Selenium"

Let's do some simple scraping with Selenium. This time, the task is only opening the browser.
"Selenium" is a library that drives the browser for testing and automates its operation.

If you use "Selenium" normally, a browser window is opened.

If you use "Headless" mode, you can scrape and operate the browser without opening a window.

 

Preparation

1. Install Selenium

pip install selenium

2. Install the Chrome web driver

pip install chromedriver-binary
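
As a note, importing chromedriver_binary adds the bundled driver to PATH, so in the simplest setup no explicit driver path is needed. A minimal sketch:

===========================

# Importing chromedriver_binary puts the bundled chromedriver on PATH,
# so webdriver.Chrome() can find it without an explicit path.
import chromedriver_binary
from selenium import webdriver

browser = webdriver.Chrome()
browser.quit()

===========================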

However, an error occurred after I executed the program.

It seemed that the version of Chrome was different from that of the driver.

The driver was the latest, ver. 74, but the browser was ver. 73.

But when I looked at Chrome's help, the message was "No update, latest version."

It seemed that the ver. 74 beta had been released just one week earlier.

To solve the problem, I downloaded the appropriate version of the driver executable from the website given in the error message, and used it in the program.
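
(As an alternative sketch, chromedriver-binary can also be pinned to a release series that matches the installed browser, e.g. for Chrome 73:

pip install "chromedriver-binary<74"

This pin is only illustrative; it is not what I did here.)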

After that, I opened the webpage with the article above and took a screenshot.
Below is a simple program to open a browser with "Selenium."

===========================

# import chromedriver_binary  # commented out: the driver path is passed explicitly below
from selenium import webdriver

# URL setting
url = "https://www.excite.co.jp/news/article/Cobs_1890436/"

# Headless mode
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# Run Chrome (modified after downloading the matching driver)
browser = webdriver.Chrome(executable_path=r"d:\driver\Chromedriver.exe", options=options)

# Load URL
browser.get(url)

# Save a screenshot
browser.save_screenshot("website.png")

# End
browser.quit()

===================
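
One note on the code above: save_screenshot also works in headless mode, and if the capture looks cut off, a fixed window size can be requested with an extra (illustrative) option such as options.add_argument('--window-size=1280,800').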

Result

Below is a screenshot taken by Selenium.
#Scraping #Selenium #Python #DataAnalysis #ComputerScience #Web #Webdriver
#Chrome #test #automation

Web Scraping and Cleansing with Python, Part 2

This time, I added a cleansing step that removes the useless sections (tags such as "script", "a", "class", ...),

and created images that show the frequency of words with WordCloud. (March 22nd, 2019)

(Environment: Windows 10 + Python 3.6.5 + Visual Studio Code)

 

# Word-cloud image creation section
from wordcloud import WordCloud

def WordCheck(texts, num):
    # On Windows, encode the text with CP932 first ("ignore" drops
    # characters that cannot be represented); otherwise an error occurs
    b = texts.encode("CP932", "ignore")
    wc = WordCloud(background_color="white",
                   font_path=r"d:\fonts\noto\NotoSansCJKjp-Regular.otf",
                   width=800, height=600).generate(texts)
    wc.to_file(r"d:\wordcloud_" + str(num) + ".png")

 

# Header impersonation (you should add the User-Agent below to prevent urlopen errors)
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) ' \
     'AppleWebKit/537.36 (KHTML, like Gecko) ' \
     'Chrome/55.0.2883.95 Safari/537.36'
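
For reference, a minimal sketch of attaching the ua string above to a request with urllib (the exact call site is not shown in this excerpt; url is assumed to be defined):

import urllib.request

# Attach the spoofed User-Agent header to the request
req = urllib.request.Request(url, headers={'User-Agent': ua})
html = urllib.request.urlopen(req).read()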

# Get linked URLs (creating the images for all linked pages)

# Remove the "<a href=...>" texts in the footer section
for script in soup(['script', 'style', 'a', 'img']):
    script.decompose()
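
After the cleansing above, only the visible text remains. A minimal sketch of feeding it to the WordCheck function defined earlier (this call is not shown in the original excerpt):

# Extract the remaining visible text and generate a word-cloud image
texts = soup.get_text()
WordCheck(texts, 0)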

 

The image of frequent words

Left (before cleansing) => Right (after cleansing)

"Art" and "Video" in the footer section disappeared!

Get Text from a Website with Python 3 (Remove HTML)

Last time, I created a program that displays only the title, keywords, and description from the Meta tags.
This time, I created text extraction for the top page and the linked pages.
Modifications:
1. Add the HTML parser and CSV file libraries; add the imports below.
==================
from urllib.parse import urljoin
from bs4 import Comment
import csv
===================
2. Open a file for saving the texts.
3. Remove the "comment", "script", "style", and "href" tag sections.
4. Extract the text.
5. Write it to the CSV file.
6. Add exception handling because of an error (it occurred on authorization sites (login required)).
Next time, I will create the authorization and analysis parts.

*The following is the code I added this time.

# Get links
soup = BeautifulSoup(html, 'html.parser')
url_items = soup.select('li')

# File setting
file_path = "webscraping.csv"
csv_file = open(file_path, 'wt', newline='', encoding='utf-8')

 

for url2 in url_items:
    try:
        # Initialize
        csv_data = []
        csv_write = csv.writer(csv_file)
        a = url2.a
        if a is not None:
            url3 = a.attrs["href"]
            url4 = urljoin(url, url3)
            print(url4)
            res = urllib.request.urlopen(url4)
            soup = BeautifulSoup(res, 'html.parser')
            # Title
            title1 = soup.find('title')
            print("Title = ", title1)
            # Header
            header = soup.find("head")
            # Description
            description = header.find("meta", attrs={"name": "description"})
            # Keywords
            keywords = header.find("meta", attrs={"name": "keywords"})
            print("Description = ", description)
            print("Keyword = ", keywords)
            # Append the title, description, and keywords
            csv_data.append(title1)
            csv_data.append(description)
            csv_data.append(keywords)
            csv_write.writerow(csv_data)
            # Remove comment tags
            for comment in soup(text=lambda x: isinstance(x, Comment)):
                comment.extract()
            # Remove script/style tags
            for script in soup(['script', 'style']):
                script.decompose()
            # Extract only the text
            for text in soup.find_all(text=True):
                # Initialize
                csv_data = []
                if text.strip():
                    print(text)
                    # Add the contents
                    csv_data.append(text)
                    csv_write.writerow(csv_data)
    except urllib.error.HTTPError as e:
        print(e.code)  # e.g., 404

# Close
csv_file.close()
##===================================================
print("end")
#====================================================