python xmltv grabber for Brazil (source - http://tudonoar.uol.com.br)

TV Grab Brazil (source - http://tudonoar.uol.com.br/)

This grabber requires the Python TV grabber library, pytvgrab-lib
(see http://pytvgrab.sourceforge.net). It extracts information from
the source web page and outputs it in the xmltv format (version
0.5.15, see http://xmltv.sourceforge.net).



The guide provider, http://tudonoar.uol.com.br, serves the guide
in 3 levels:

Channel Listing: here we get the channel name and the url of its
program list
  +-> Channel Programs: here we get the program names and times
      for a channel
        +-> Program Information: here we get the program info


My approach to grab the guide
=============================
(For developers of new grabbers)

To parse the web guide I use a customized HTMLParser
(customizedparser). It ignores most tags and parses only the
wanted ones, like <table>, <tr>, <td>, <a>, and only some
attributes, like href. This class maps the HTML to an equivalent
Python structure based on the Tag class, which has a name,
attributes, contents data and children.
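
The filtering idea above can be sketched with Python 3's standard
html.parser module. This is only an illustration, not the
customizedparser shipped with pytvgrab-lib: the names Tag,
FilteringParser, parse_html, WANTED_TAGS and WANTED_ATTRS are
assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical names: nothing here is the real pytvgrab-lib API.
WANTED_TAGS = {"table", "tr", "td", "a"}
WANTED_ATTRS = {"href"}


class Tag:
    """A node with a name, attributes, text contents and children."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}
        self.contents = ""
        self.children = []


class FilteringParser(HTMLParser):
    """Builds a Tag tree from the wanted tags only; other tags are skipped."""

    def __init__(self):
        super().__init__()
        self.root = Tag("root")
        self._stack = [self.root]

    def handle_starttag(self, name, attrs):
        if name not in WANTED_TAGS:
            return
        kept = {key: value for key, value in attrs if key in WANTED_ATTRS}
        tag = Tag(name, kept)
        self._stack[-1].children.append(tag)
        self._stack.append(tag)

    def handle_endtag(self, name):
        if name not in WANTED_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].name == name:
                del self._stack[index:]
                break

    def handle_data(self, data):
        # Text inside an ignored tag is credited to the nearest wanted tag.
        self._stack[-1].contents += data


def parse_html(html):
    """Parse an HTML string and return the root of the Tag tree."""
    parser = FilteringParser()
    parser.feed(html)
    parser.close()
    return parser.root
```

Calling parse_html() on a page returns a root Tag whose children are
the outermost wanted tags, so the extraction functions never see the
ignored markup.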

I then pass the structure to one of 3 functions, each of which
knows how to get the data we want from one kind of document:

- get_channels(): this function knows how to get the channel
name and url and returns a list of tuples ( name, url );

- get_programs(): this one knows how to get the programs of
a given channel and returns a list of tuples with each program's
name, start time and url;

- get_program_info(): this one knows how to get the program
information and returns a dict with the parsed xmltv data.
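
As a rough sketch of the first of these functions, the code below
walks a Tag tree and returns ( name, url ) tuples. The Tag dataclass
and iter_tags() are hypothetical stand-ins, and treating every
linked <a> as a channel is an assumption; the real page layout
decides which links count.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    # Hypothetical stand-in for the parsed-HTML node described above.
    name: str
    attrs: dict = field(default_factory=dict)
    contents: str = ""
    children: list = field(default_factory=list)


def iter_tags(tag, name):
    """Yield every descendant of `tag`, depth-first, whose name matches."""
    for child in tag.children:
        if child.name == name:
            yield child
        yield from iter_tags(child, name)


def get_channels(root):
    """Return [(name, url), ...], assuming each linked <a> is a channel."""
    channels = []
    for link in iter_tags(root, "a"):
        url = link.attrs.get("href")
        name = link.contents.strip()
        if url and name:
            channels.append((name, url))
    return channels
```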

So, getting the whole guide is easy: first, I grab the first
page and use get_channels() to parse it and get the channel names
and urls. Then, for each channel, I go to its url and use
get_programs() to get the program names, start times and urls. If
a program has a url, I grab it and use get_program_info() to get
its xmltv information.
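
The walk over the 3 levels could look like the loop below. It is a
sketch only: grab_guide() and its fetch parameter are hypothetical,
the get_* functions are passed in so the loop stays independent of
any page layout, and the ( name, start, url ) order of the program
tuples is an assumption.

```python
def grab_guide(fetch, get_channels, get_programs, get_program_info, start_url):
    """Walk channel listing -> channel programs -> program information.

    `fetch(url)` returns a parsed document; the three get_* callables
    follow the return shapes described above.
    """
    guide = []
    for channel_name, channel_url in get_channels(fetch(start_url)):
        for prog_name, start, prog_url in get_programs(fetch(channel_url)):
            entry = {"channel": channel_name, "title": prog_name, "start": start}
            if prog_url:  # not every program has a detail page
                entry.update(get_program_info(fetch(prog_url)))
            guide.append(entry)
    return guide
```

Passing fetch in as a parameter also makes it possible to run the
loop against saved pages instead of the live site.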

Each get_{channels,programs,program_info}() has its own get_url_
and process_ functions. This low coupling makes it easier to print
out information and debug: the main function gets the url,
downloads the data, processes it and returns the results, and each
of these steps is a separate function.
When an error happens, the grabber dumps the file and exits
with -1.
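
A minimal sketch of that split, for the channel level only. The
function bodies are assumptions: the 'name|url' line format stands
in for the real parsed page, download is injected by the caller,
and dump_and_exit() is a hypothetical helper for the dump-and-exit
behaviour.

```python
import sys
import tempfile


def get_url_channels(base_url):
    """Build the url of the channel listing (hypothetical layout)."""
    return base_url.rstrip("/") + "/"


def process_channels(document):
    """Extract (name, url) tuples; here `document` is 'name|url' lines."""
    channels = []
    for line in document.splitlines():
        if line:
            name, url = line.split("|", 1)  # ValueError on a malformed line
            channels.append((name, url))
    return channels


def dump_and_exit(data, error):
    """Write the offending document to a temp file and exit with -1."""
    with tempfile.NamedTemporaryFile("w", suffix=".dump", delete=False) as fh:
        fh.write(data)
    print(f"error: {error}; document dumped to {fh.name}", file=sys.stderr)
    sys.exit(-1)


def get_channels(base_url, download):
    """Get the url, download the data, process it, return the results."""
    url = get_url_channels(base_url)
    data = download(url)
    try:
        return process_channels(data)
    except Exception as error:
        dump_and_exit(data, error)
```

Because each step is its own function, process_channels() can be
run by hand on a dumped file to reproduce a failure.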


NOTICE: You may define your own functions!!!

NOTICE: re_clean and clear_html() are fully optional! They are
just something I use to get rid of junk and maybe fix
some html errors.
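
For illustration, a cleaning step of this kind might look like the
sketch below. The pattern and the function body are assumptions,
not the grabber's actual re_clean or clear_html(): this version
only drops comments, <script> and <style> blocks and collapses
whitespace.

```python
import re

# Hypothetical pattern: drops comments, <script> and <style> blocks.
re_clean = re.compile(
    r"<!--.*?-->|<script\b.*?</script\s*>|<style\b.*?</style\s*>",
    re.DOTALL | re.IGNORECASE,
)


def clear_html(html):
    """Strip noise and collapse whitespace runs before parsing."""
    html = re_clean.sub("", html)
    return re.sub(r"\s+", " ", html).strip()
```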
