pytvgrab-br_uol

python xmltv grabber for Brazil (source - http://tudonoar.uol.com.br)

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Developers
- End Users/Desktop
License
- OSI Approved :: GNU General Public License (GPL)
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python

Project description

TV Grab Brazil (source - http://tudonoar.uol.com.br/)

This grabber extracts the listings from the source web pages and
writes them out in the XMLTV format (version 0.5.15, see
http://xmltv.sourceforge.net).
It requires the Python TV grabber library, pytvgrab-lib
(see http://pytvgrab.sourceforge.net).
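
As a rough illustration of what "XMLTV format" means here, the sketch
below builds a tiny XMLTV document with the standard library. It is
not how pytvgrab-lib writes its output; build_xmltv() and the shape of
its arguments are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def build_xmltv(channels, programmes):
    """Return an XMLTV document as a string.

    channels:   list of (channel_id, display_name) tuples (assumed shape)
    programmes: list of dicts with "channel", "start", "title" and an
                optional "desc" key (assumed shape)
    """
    tv = ET.Element("tv")
    for channel_id, display_name in channels:
        channel = ET.SubElement(tv, "channel", id=channel_id)
        ET.SubElement(channel, "display-name").text = display_name
    for programme in programmes:
        element = ET.SubElement(
            tv,
            "programme",
            start=programme["start"],
            channel=programme["channel"],
        )
        ET.SubElement(element, "title").text = programme["title"]
        if programme.get("desc"):
            ET.SubElement(element, "desc").text = programme["desc"]
    return ET.tostring(tv, encoding="unicode")
```

Calling build_xmltv([("globo", "Globo")], [...]) gives one <channel>
element per channel followed by one <programme> element per program.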

The guide provider, http://tudonoar.uol.com.br, serves the guide
in 3 levels:

Channel Listing: gives the channel name and the URL of its
                 program list
  +-> Channel Programs: gives the program name and time for a
                        channel
      +-> Program Information: gives the program details

My approach to grab the guide
=============================
(For developers of new grabbers)

To parse the web guide I use a customized HTMLParser
(customizedparser). It ignores most tags and parses only the
wanted ones, like <table>, <tr>, <td>, <a>, and only some
attributes, like href. This class maps the HTML to an equivalent
Python structure based on the Tag class, which has a name,
attributes, contents data and children.
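
A minimal sketch of this idea, built on the standard library's
html.parser, might look like the following. It is not the actual
customizedparser code: the class layout, WANTED_TAGS, WANTED_ATTRS and
parse_html() are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class Tag:
    """One HTML element: name, attributes, contents data and children."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.data = ""
        self.children = []


class CustomizedParser(HTMLParser):
    # Only these tags and attributes are kept; the rest is ignored.
    WANTED_TAGS = {"table", "tr", "td", "a"}
    WANTED_ATTRS = {"href"}

    def __init__(self):
        super().__init__()
        self.root = Tag("root")
        self._stack = [self.root]

    def handle_starttag(self, name, attrs):
        if name not in self.WANTED_TAGS:
            return
        kept = {key: value for key, value in attrs
                if key in self.WANTED_ATTRS}
        tag = Tag(name, kept)
        self._stack[-1].children.append(tag)
        self._stack.append(tag)

    def handle_endtag(self, name):
        if name not in self.WANTED_TAGS:
            return
        # Close the nearest open tag with this name, so an unclosed
        # inner tag does not break the rest of the tree.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].name == name:
                del self._stack[index:]
                break

    def handle_data(self, data):
        # Text inside an ignored tag is credited to the enclosing
        # wanted tag.
        self._stack[-1].data += data


def parse_html(html):
    """Parse html and return the root Tag of the simplified tree."""
    parser = CustomizedParser()
    parser.feed(html)
    parser.close()
    return parser.root
```

Because ignored tags are skipped entirely, text such as
<a href="/c/1"><b>Globo</b></a> ends up as the data of the <a> tag.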

Then I get the structure and call one of the 3 functions that
'knows' how to get the data we want from each document. They are:

- get_channels(): this function knows how to get the channel
name and url and returns a list of tuples ( name, url );

- get_programs(): this one knows how to get the programs from
a given channel and returns a list of tuples
( name, start time, url );

- get_program_info(): this knows how to get the program
information and returns a dict with the parsed xmltv data.

So getting the whole guide is easy. First, I grab the first
page and use get_channels() to parse it and get the channel names
and URLs. Then, for each channel, I fetch its URL and use
get_programs() to get the program names, start times and URLs. If
a program has a URL, I fetch it and use get_program_info() to get
its xmltv information.
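
The three-level walk described above can be sketched as a pair of
nested loops. This is only an outline under assumptions: grab_guide()
and fetch() are hypothetical names, the parsing functions are passed in
as arguments, and each program tuple is assumed to be
( name, start time, url ).

```python
def grab_guide(start_url, fetch, get_channels, get_programs,
               get_program_info):
    """Walk channel listing -> channel programs -> program information.

    fetch(url) is a hypothetical helper that returns the document
    found at url. The three get_* callables parse one document each.
    """
    guide = []
    for channel_name, channel_url in get_channels(fetch(start_url)):
        programs = get_programs(fetch(channel_url))
        for title, start, program_url in programs:
            entry = {
                "channel": channel_name,
                "title": title,
                "start": start,
            }
            # The third level is optional: not every program has a page.
            if program_url:
                entry.update(get_program_info(fetch(program_url)))
            guide.append(entry)
    return guide
```

Passing fetch and the parsers as arguments keeps the walk easy to try
out with canned pages instead of the live site.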

Each get_{channels,programs,program_info}() has its own get_url_
and process_ functions. This low coupling makes it easy to print
out information and debug. The main functions get the URL,
download the data, process it and return the results, and each of
these parts is a separate function.
When an error happens, the grabber dumps the downloaded file and
exits with -1.
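
One way to picture this split is the sketch below. It is an
assumption-laden illustration, not the grabber's code: run_step() and
dump_document() are hypothetical names, and the dump goes to a
temporary file because the original does not say where it is written.

```python
import os
import sys
import tempfile


def dump_document(document, prefix="pytvgrab-dump-"):
    """Write the failing document to a temporary file, return its path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(document)
    return path


def run_step(get_url, download, process):
    """Get the URL, download it, process it and return the results.

    Each part is a separate callable, so any of them can be swapped
    out or called alone while debugging.
    """
    url = get_url()
    document = download(url)
    try:
        return process(document)
    except Exception as error:
        path = dump_document(document)
        sys.stderr.write(
            "failed to process %s (%s); document dumped to %s\n"
            % (url, error, path)
        )
        sys.exit(-1)
```

A get_channels() built this way would just be
run_step(get_url_channels, download, process_channels), with those
three names standing in for the real functions.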

NOTICE: You may define your own functions!

NOTICE: re_clean and clear_html() are fully optional! They are
just something I use to get rid of junk and maybe fix
some HTML errors.
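
For a sense of what such a cleanup step could look like, here is a
small sketch. The pattern and the behaviour of clear_html() are
assumptions for illustration; the real re_clean and clear_html() may
remove different things.

```python
import re

# Assumed pattern: strip comments, scripts and styles before parsing.
re_clean = re.compile(
    r"<!--.*?-->"
    r"|<script\b.*?</script\s*>"
    r"|<style\b.*?</style\s*>",
    re.DOTALL | re.IGNORECASE,
)


def clear_html(html):
    """Remove junk matched by re_clean and collapse runs of whitespace."""
    html = re_clean.sub("", html)
    return re.sub(r"[ \t\r\n]+", " ", html).strip()
```

Running the cleanup before the parser sees the page keeps markup
embedded in scripts or comments from being mistaken for guide data.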