Web Scraping and Cleansing with Python, Part 2

This time, I added a cleansing step that removes the useless sections (tags such as "script", "style", "a", "img", ...),

and created images that show the frequency of words using WordCloud. (March 22, 2019)

(Environment: Windows 10 + Python 3.6.5 + Visual Studio Code)


# WordCloud image-creation section

from wordcloud import WordCloud

def WordCheck(texts, num):
    # On Windows, you should re-encode the UTF-8 text (here to "CP932", ignoring unmappable characters); otherwise an error occurs
    b = texts.encode("CP932", "ignore")
    wc = WordCloud(background_color="white",
                   font_path=r"d:\fonts\noto\NotoSansCJKjp-Regular.otf",
                   width=800, height=600).generate(texts)
    wc.to_file(r"d:\wordcloud_" + str(num) + ".png")
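As a quick usage sketch (the sample string and the index 0 are placeholders, not from the original program), each call writes one numbered PNG:

sample_text = "web scraping cleansing python wordcloud sample"   # placeholder text
WordCheck(sample_text, 0)   # writes d:\wordcloud_0.png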


# Header spoofing (you should add the User-Agent string below to prevent urlopen errors)
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) ' \
     'AppleWebKit/537.36 (KHTML, like Gecko) ' \
     'Chrome/55.0.2883.95 Safari/537.36'
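As a minimal sketch of how this string can be used (the target URL below is a placeholder, not from the original post), the User-Agent is passed to urllib.request.Request so that urlopen is not rejected as the default Python client:

import urllib.request
from bs4 import BeautifulSoup

url = "https://example.com/"   # placeholder URL
req = urllib.request.Request(url, headers={"User-Agent": ua})
with urllib.request.urlopen(req) as res:
    soup = BeautifulSoup(res.read(), "html.parser")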

# Get linked URLs (create the images for all linked pages)

↓ Remove the text of the <a href=".."> links in the footer section

for script in soup(['script', 'style', 'a', 'img']):
    script.decompose()
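Putting the pieces together, here is one possible sketch of the "all linked pages" step (the variable names and the use of urljoin are my own assumptions, not the original program); note that the links have to be collected from the top page before its <a> tags are decomposed:

from urllib.parse import urljoin

# Collect absolute link URLs from the top page (do this before decomposing its <a> tags)
links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

# Fetch each linked page, apply the same cleansing, and write one word-cloud image per page
for num, link in enumerate(links):
    req = urllib.request.Request(link, headers={"User-Agent": ua})
    with urllib.request.urlopen(req) as res:
        page = BeautifulSoup(res.read(), "html.parser")
    for tag in page(["script", "style", "a", "img"]):
        tag.decompose()
    WordCheck(page.get_text(separator=" "), num)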


Images of the most frequent words

Left (before cleansing) => Right (after cleansing)

"Art" and "Video" from the footer section have disappeared!