Extracting text by looking for classes - using lxml

Using Scraperwiki to extract text from a webpage:

This uses the lxml module - documentation here: http://lxml.de/lxmlhtml.html#parsing-html
 

# import a module (library) that helps us do scraping
import scraperwiki
# import another that helps us extract things from the scraped data
import lxml.html
# use that module's scrape function to grab the contents of a URL and put it in the variable html
html = scraperwiki.scrape("http://www.nhs.uk/Services/Trusts/GPs/DefaultView.aspx?id=5PG")
# use lxml.html's fromstring function to turn that into structured data, and put it in a variable called gplist
gplist = lxml.html.fromstring(html)
# get every <a> tag inside an element with class="child-org-name" (e.g. <p class="child-org-name">) and put them in a list called gpname. The period before the name indicates a class.
gpname = gplist.cssselect(".child-org-name a")
# loop through the list of items and...
for gp in gpname:
    record = {"gp": gp.text}  # create a column name and store the text of each occurrence
    scraperwiki.sqlite.save(["gp"], record)  # save the records one by one

Python classes

# Classes save you time when creating new objects by giving them pre-set qualities
# So for example if instead of having to write script that said 
#'this MP gets a certain number of votes and represents this party etc. etc.'
# You can create an MP 'class' which says *all* MPs have votes, parties, gender, age, etc.
# Then when you create a new MP you only have to fill in the values of those votes etc.
# You can save further time by having default values, e.g. saying your MPs represent Labour unless otherwise specified.
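
# As a rough sketch of that idea, the code below defines a hypothetical MP class whose party defaults to Labour.
# The class name, its attributes and the example MPs are all invented for illustration; the syntax itself is explained further on.

```python
class MP(object):
    # party is set to "Labour" unless another value is passed in
    def __init__(self, name, votes, party="Labour"):
        self.name = name
        self.votes = votes
        self.party = party

# only the name and votes are filled in, so the default party is used
first_mp = MP("Example MP One", 21000)
# here the default is overridden
second_mp = MP("Example MP Two", 18500, party="Conservative")

print(first_mp.party)
print(second_mp.party)
```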

# Classes are created with the class keyword:

class Food:

#the colon starts the indented section that will define this class
#that section will include the variables, functions and constructor that define the class:

class Food(object):
    def __init__(self, calories=0, weight=0):
        self.calories = calories
        self.weight = weight

#__init__ is a 'class constructor' that 'initialises' (sets) particular qualities of each new object of the class
#again the def line needs to end with a colon
#and the further indented lines set the starting values of those qualities
#the 'self' bit refers to the individual instance (object) of this class that is being created, so self.calories means 'this object's calories'
# - Python passes 'self' in automatically, so you never supply it yourself when you create an instance. Just include it as the first parameter.
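
# To make 'self' more concrete, here is a small sketch in which two objects are built from one class and each keeps its own values.
# The variant of Food used here simply stores whatever is passed in; the foods and their numbers are made up for illustration.

```python
class Food(object):
    def __init__(self, calories, weight):
        # self is the object being built, so these lines store the values on that object
        self.calories = calories
        self.weight = weight

# two separate instances of the same class
apple = Food(52, 150)
bread = Food(265, 400)

# each object holds its own calories; changing one does not affect the other
apple.calories = 60
print(apple.calories, bread.calories)
```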

#when you create a new instance of that class, e.g. a new item of food, it will have those qualities at those default values
#or at values defined in the parameters you pass in creating that new instance, e.g.

cheese = Food(34, 100)

# This code creates a new object of the 'Food' class and stores it in a new variable called 'cheese'
# Essentially this line goes to look for the code that defines the Food class, finds that it has calories and weight, and gives those qualities to 'cheese'
# The 2 numbers in brackets - 34 and 100 - are assigned ('passed') to those qualities in turn, so cheese has 34 calories and a weight of 100.
# If no numbers were given - i.e. if the brackets were empty - then the new object 'cheese' would have the default values of the class 'Food' - in this case, 0 and 0.
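
# The passed-versus-default behaviour can be checked with a short sketch.
# It assumes a version of Food whose __init__ gives each parameter a default of 0; the variable names 'mystery' and 'light' are invented for illustration.

```python
class Food(object):
    # each parameter falls back to 0 when no value is passed
    def __init__(self, calories=0, weight=0):
        self.calories = calories
        self.weight = weight

cheese = Food(34, 100)    # both values passed in order
mystery = Food()          # empty brackets, so both defaults are used
light = Food(weight=20)   # only weight is named, calories keeps its default

print(cheese.calories, cheese.weight)
print(mystery.calories, mystery.weight)
print(light.calories, light.weight)
```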

#Like any variable, once created you can do various things with cheese - ask how many calories it has, change its weight, add its weight to the weight of other objects, and so on.
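
# A brief sketch of those three operations follows. The second object, 'crackers', and its numbers are made up for illustration.

```python
class Food(object):
    def __init__(self, calories=0, weight=0):
        self.calories = calories
        self.weight = weight

cheese = Food(34, 100)
crackers = Food(120, 30)

# ask how many calories cheese has
calories_in_cheese = cheese.calories
# change its weight
cheese.weight = 150
# add its weight to the weight of another object
total_weight = cheese.weight + crackers.weight

print(calories_in_cheese, cheese.weight, total_weight)
```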