Sunday, 19 June 2016

Issues Saving Scraped Data to CSV Properly

I'm having an issue in Python with saving data that I've scraped into a CSV-type format.

My code currently visits a website, finds the latest n articles, collects their links, then iteratively visits each link, scrapes the content from each article, and saves that content to separate lists depending on the type of data. Below is the code:

from bs4 import BeautifulSoup
import requests
import csv

#DEFINE SOME VARIABLES
url = "http://blacklistednews.com"
count = 5 #NUMBER OF ARTICLES TO GET

#FETCH URL & CREATE SOUP OBJECT
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')

#CREATE LIST TO STORE LINKS
link_list = []

#GET LATEST ARTICLE LINKS & STORE IN LIST
for link in soup.find_all('header', {'class': 'meta'})[0:count]:
    #LINKS
    links = (link.find('a')['href'])
    link_list.append(links)

#CREATE LISTS TO STORE CONTENT
t_list = ["Title"]
c_list = ["Content"]
s_list = ["Source"]

#TAKE LINKS LIST & APPEND CONTENT
for url in link_list:
    #FETCH URL, CREATE SOUP OBJECT
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'lxml')

    #GET TARGET CONTENT
    title = soup.find('h1').text
    content = soup.find('div', {'id': 'newsdetail'}).text
    source = soup.find(text=lambda text: text and "Source:" in text).find_next_sibling("a")["href"]

    #APPEND CONTENT TO LISTS
    t_list.append(title)
    c_list.append(content)
    s_list.append(source)


with open("outputdata.txt", "w") as output:
    newwriter = csv.writer(output, delimiter='|')
    for i in range(len(t_list)):
        newwriter.writerow([t_list[i], c_list[i], s_list[i]])

The above approach works fine for generic test lists I created while proving out the concept, but the output gets garbled with the scraped content. I suspect it has something to do with the "\n" newline characters in the scraped text, but am not knowledgeable enough to know how to diagnose it.
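If the newlines are indeed the culprit, one fix I've seen suggested is to strip them out of each field before writing, so every record stays on one physical line. A minimal sketch with made-up row data (the cleaning step is the replace() call):

```python
import csv

# Made-up stand-ins for the scraped lists
rows = [["Title", "Content", "Source"],
        ["First article", "line one\nline two\n\nmore text", "http://example.com/1"]]

with open("outputdata.txt", "w", newline='', encoding="utf-8") as output:
    writer = csv.writer(output, delimiter='|')
    for row in rows:
        # Collapse embedded newlines so each record is one physical line
        writer.writerow([field.replace('\n', ' ').strip() for field in row])
```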

My goal, using the code above, is to create a list in the following format, using a "|" pipe character as a delimiter:

Title[0] | Content[0] | Source[0]
Title[1] | Content[1] | Source[1]
Title[2] | Content[2] | Source[2]
Title[3] | Content[3] | Source[3]
   ...   |    ...     |   ...
Title[n] | Content[n] | Source[n]

The result should be interpretable by Excel as a three-column CSV with one record per row. Right now, the first row with the headings outputs fine, but the rest gets spit all over the place.

Currently, the output is as follows:

Title|Content|Source

The first row consists of the first items of each list, which I specify. After that, the first column (when interpreted by importing into Excel) is the only one that gets any content, and a new row is started for each "\n" in the content. It looks, generally, like the following:

Title of the first article|"
beginning text of the first article
continued text of the first article

first block of the main content of article 1
second block of the main content of article 1
third...
fourth...
so-on...
so-forth...

"|source url of first article

Title of second article|"

The first row outputs fine, separating the values of each list as intended. From then on, each block of text up to a "\n" character is saved on a new row, so the Title, Content, and Source items all end up in the first column, spanning many rows.
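From what I can tell, this matches how Python's csv module handles embedded newlines: a field containing "\n" gets wrapped in quotes, but the record still spans multiple physical lines, which is what Excel then misreads on import. A minimal demonstration with made-up field values:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter='|')
# The middle field contains a newline, so csv quotes it, but the
# record still occupies two physical lines in the output.
writer.writerow(["Title of the first article", "line one\nline two", "http://example.com"])
print(buf.getvalue())
```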

UPDATE:

After reading "UnicodeEncodeError: 'charmap' codec can't encode characters", I took the following line:

newwriter.writerow([t_list[i], c_list[i], s_list[i]])

And changed it to read as follows:

newwriter.writerow([t_list[i].encode("utf-8"), c_list[i].encode("utf-8"), s_list[i].encode("utf-8")])

This gets much closer to my desired output, but the following issues remain:

Title is written as b'Title'

I'm getting a lot of content that saves like b'\n\n\n\n\nNot smiling as much these days.\n\nThis is from February', and I'm not entirely sure how to tidy such output into a more readable format.

It does, however, now write in the one-item-per-column, three-items-per-row format I intended, so the only issue I'm having now is cleaning things up.
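One thing I'm considering for the b'...' problem: instead of calling .encode() on each field, open the file with an explicit encoding so the csv writer can take plain strings. A sketch with placeholder list contents standing in for the scraped data:

```python
import csv

# Placeholder values standing in for the real scraped lists
t_list = ["Title", "Café article"]
c_list = ["Content", "Body text"]
s_list = ["Source", "http://example.com"]

# An explicit encoding on open() avoids the UnicodeEncodeError without
# the b'...' prefixes that per-field .encode() produces; newline=''
# lets the csv module control line endings itself.
with open("outputdata.csv", "w", newline='', encoding="utf-8") as output:
    newwriter = csv.writer(output, delimiter='|')
    for row in zip(t_list, c_list, s_list):
        newwriter.writerow(row)
```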
