Web Scraping


Simple Web Scraping with Python 3

(Getting the links on the top page)

  • Environment

IDE: PyCharm Community 2018.2

Language: Python 3.6.5

Libraries: BeautifulSoup

OS: Windows 10

I collected the links on the top page of a base site and extracted each linked page's title and META tags (keywords and description).

I hadn't used PyCharm for a long time, and I found its start-up is very slow. Why is it so heavy? (It may be due to my laptop's performance, but Visual Studio Code doesn't start up this slowly, so next time I may use VS Code.)

I want to buy a new PC. (TT).

..Or expand the memory.

Well, below is the program.

It's a simple one that just collects the links found on the base site's top page.

 

CODE ========================================

from urllib.parse import urljoin
import urllib.request
from bs4 import BeautifulSoup

url = 'https://edition.cnn.com/'
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) ' \
     'AppleWebKit/537.36 (KHTML, like Gecko) ' \
     'Chrome/55.0.2883.95 Safari/537.36'

req = urllib.request.Request(url, headers={'User-Agent': ua})
html = urllib.request.urlopen(req)

# Link retrieval
soup = BeautifulSoup(html, 'html.parser')
url_items = soup.select('li')

for url2 in url_items:
    a = url2.a
    if a is not None:
        url3 = a.attrs['href']
        url4 = urljoin(url, url3)
        print(url4)
        res = urllib.request.urlopen(url4)
        page = BeautifulSoup(res, 'html.parser')

        # Title tag
        title1 = page.find('title')
        print('Title =', title1)
        # Head
        header = page.find('head')
        # Description
        description = header.find('meta', attrs={'name': 'description'})
        print('Details =', description)
        # Keywords
        keywords = header.find('meta', attrs={'name': 'keywords'})
        print('Keyword =', keywords)

============================================

 

A lot of errors occurred despite the small size of the program, such as "unknown url type".

Searching on Google, I found many sample programs, but none of them ran smoothly.

Websites introducing basic, simple programs appeared on the first page of Google results, but it was difficult to find solutions to the critical issues through a Google search.

In the end, I combined parts of programs from several websites and a reference book.
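One common cause of "unknown url type" errors is that some hrefs are not real HTTP links (for example javascript: or mailto:), so opening them crashes the loop. A minimal sketch of one way to skip such links with try/except (the `fetch` helper name is my own, not from the original program):

```python
import urllib.error
import urllib.request


def fetch(url, ua='Mozilla/5.0'):
    """Hypothetical helper: return the response, or None if the
    link cannot be opened (e.g. "unknown url type" for a
    javascript: or mailto: href)."""
    try:
        req = urllib.request.Request(url, headers={'User-Agent': ua})
        return urllib.request.urlopen(req)
    except (urllib.error.URLError, ValueError) as err:
        # URLError covers unknown schemes and network failures;
        # ValueError covers malformed URLs with no scheme at all.
        print('Skipped', url, '-', err)
        return None


# A javascript: href would crash a plain urlopen() loop;
# fetch() just reports it and moves on.
result = fetch('javascript:void(0)')
```

Calling `fetch()` for each joined link instead of `urlopen()` lets the scraper finish even when a few links are broken.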

Below is a video of the sample program running.

There were surprisingly many links.

Next time, I would like to export them to a file and analyze it.
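Exporting the collected links to a file could be done with the standard `csv` module. A minimal sketch, assuming the links have already been gathered into a list (the sample values here are hypothetical, not actual scraper output):

```python
import csv

# Assume this is the list of absolute URLs the scraper printed
# (hypothetical sample values for illustration).
links = [
    'https://edition.cnn.com/world',
    'https://edition.cnn.com/sport',
]

# Write one URL per row, with a header line.
with open('links.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url'])
    writer.writerows([link] for link in links)

# Read the file back for a quick check.
with open('links.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
```

A CSV file like this can then be opened in Excel or loaded with pandas for analysis.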

I'll stop writing here for this time.

See you next time.

Bye!! Chao! Auf Wiedersehen!

March 10, 2019

* Writing the date, I've just remembered that 8 years have passed since the Great East Japan Earthquake (March 11, 2011).

I experienced the Kumamoto Earthquake 3 years ago, so earthquakes are always on my mind.

My relatives can't come shopping because a mountain road in my town hasn't been repaired yet.

As time goes by, everybody forgets……

Reference URL :

https://teratail.com/questions/130155

https://note.nkmk.me/python-beautiful-soup-scraping-yahoo/

Reference Book: “Scraping & machine learning by Python”

#Python #PyCharm #Webscraping #development #windows #link