A set of utilities for processing XML documents and converting to other formats
xmlutils.py is a set of Python utilities for processing XML files serially, namely for converting them to other formats (SQL, CSV, JSON). The scripts use ElementTree.iterparse() to iterate through the nodes of an XML file, so the whole DOM never has to be loaded into memory. The scripts can be used to churn through large XML files (albeit taking a long time :P) without memory hiccups.
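As a rough illustration of the streaming approach described above, here is a minimal sketch using the standard library's `ElementTree.iterparse()`. It is not the xmlutils source; the helper name `iter_records` and the sample data are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; a real run would pass a filename instead.
SAMPLE = b"""<fruits>
  <item><name>apple</name><color>red</color></item>
  <item><name>banana</name><color>yellow</color></item>
</fruits>"""


def iter_records(source, tag):
    """Yield one dict per <tag> element without building the full tree first."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Collect the text of each direct child as a field.
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Release the record's children and text once it has been read.
            elem.clear()


records = list(iter_records(io.BytesIO(SAMPLE), "item"))
print(records)
```

Each record is handed to the caller as soon as its closing tag is seen, which is why memory use does not depend on the size of the whole document.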
Blind conversion of XML to CSV and SQL is not recommended. It only works if the structure of the XML document is simple (flat). On the other hand, xml2json supports complex XML documents with multiple nested hierarchies. Lastly, the XML files are not validated at the time of conversion.
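One way to judge whether a document is simple enough for CSV or SQL conversion is to check that no field of a record has children of its own. The sketch below is only an illustration; `is_flat` is a hypothetical helper and not part of xmlutils.

```python
import xml.etree.ElementTree as ET


def is_flat(record):
    """Return True if no child of the record element has children of its own."""
    return all(len(child) == 0 for child in record)


flat = ET.fromstring("<item><name>apple</name><color>red</color></item>")
nested = ET.fromstring(
    "<item><name>apple</name><origin><country>IN</country></origin></item>"
)

print(is_flat(flat))    # True
print(is_flat(nested))  # False
```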
Kailash Nadh, October 2011
License: MIT License
Documentation: http://nadh.in/code/xmlutils.py
Installation
With pip or easy_install
pip install xmlutils or easy_install xmlutils
Command-line utilities
xml2csv
Convert an XML document to a CSV file.
xml2csv --input "samples/fruits.xml" --output "samples/fruits.csv" --tag "item"
Arguments
- --input
Input XML document’s filename*
- --output
Output CSV file’s filename*
- --tag
The tag of the node that represents a single record (e.g. item, record)*
- --delimiter
Delimiter for separating items in a row. Default is ", " (a comma followed by a space)
- --ignore
A space separated list of element tags in the XML document to ignore.
- --header
Whether to print the CSV header (list of fields) in the first line; 1=yes, 0=no. Default is 1.
- --encoding
Character encoding of the document. Default is utf-8
- --limit
Limit the number of records to be processed from the document to a particular number. Default is no limit (-1)
- --buffer
The number of records to be kept in memory before it is written to the output CSV file. Helps reduce the number of disk writes. Default is 1000.
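The idea behind --buffer can be sketched as follows: records are collected in a list and written out in batches. This is an assumption about the general technique, not the xmlutils implementation; `write_buffered` and its parameters are hypothetical names.

```python
import csv
import io


def write_buffered(records, out, fieldnames, buffer_size=1000, delimiter=","):
    """Write dict records to `out`, flushing every `buffer_size` rows."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=delimiter,
                            extrasaction="ignore")
    writer.writeheader()
    pending, written = [], 0
    for record in records:
        pending.append(record)
        if len(pending) >= buffer_size:
            # Buffer is full: write the batch in one call and start over.
            writer.writerows(pending)
            written += len(pending)
            pending = []
    if pending:
        # Write whatever is left after the last full batch.
        writer.writerows(pending)
        written += len(pending)
    return written


out = io.StringIO()
rows = [
    {"name": "apple", "color": "red"},
    {"name": "banana", "color": "yellow"},
    {"name": "cherry", "color": "red"},
]
count = write_buffered(rows, out, ["name", "color"], buffer_size=2)
print(count)
print(out.getvalue())
```

A larger buffer means fewer write calls at the cost of holding more records in memory at once.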
xml2sql
Convert an XML document to an SQL file.
xml2sql --input "samples/fruits.xml" --output "samples/fruits.sql" --tag "item" --table "myfruits"
Arguments
- --tag
the record tag. eg: item
- --table
table name
- --ignore
list of tags to ignore
- --limit
maximum number of records to process
- --packet
maximum size of an insert query in MB (MySQL’s max_allowed_packet)
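To show what a packet limit means in practice, the sketch below groups rows into multi-row INSERT statements that stay under a size limit. It is an illustration under assumptions, not the xmlutils code: `build_inserts` and `sql_quote` are hypothetical, the limit is given in bytes rather than MB, and the quoting is deliberately simplified.

```python
def sql_quote(value):
    """Quote a string for SQL by doubling single quotes (simplified)."""
    return "'" + value.replace("'", "''") + "'"


def build_inserts(table, columns, rows, max_bytes):
    """Group rows into multi-row INSERTs no larger than `max_bytes` each."""
    prefix = "INSERT INTO {} ({}) VALUES ".format(table, ", ".join(columns))
    prefix_size = len(prefix.encode("utf-8"))
    statements, current, size = [], [], prefix_size
    for row in rows:
        values = "(" + ", ".join(sql_quote(v) for v in row) + ")"
        # +2 accounts for the ", " separator (or the trailing ";").
        cost = len(values.encode("utf-8")) + 2
        if current and size + cost > max_bytes:
            # Adding this row would exceed the limit: close the statement.
            statements.append(prefix + ", ".join(current) + ";")
            current, size = [], prefix_size
        current.append(values)
        size += cost
    if current:
        statements.append(prefix + ", ".join(current) + ";")
    return statements


rows = [("apple", "red"), ("banana", "yellow"), ("cherry", "red")]
statements = build_inserts("myfruits", ["name", "color"], rows, max_bytes=90)
for statement in statements:
    print(statement)
```

Note that a single row larger than the limit is still emitted as its own statement in this sketch; a real tool would need to decide how to handle that case.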
xml2json
Convert XML to JSON. xml2json supports hierarchies nested to any number of levels.
xml2json --input "samples/fruits.xml" --output "samples/fruits.json"
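For a feel of how nested XML can map onto JSON, here is a small recursive sketch. It is not how xml2json is implemented; `element_to_dict` is a hypothetical helper that ignores attributes and, for brevity, parses a small in-memory string rather than streaming.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an element to nested dicts, lists and strings."""
    children = list(elem)
    if not children:
        # Leaf node: keep only its text.
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags become a list of values.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


root = ET.fromstring(
    "<fruits>"
    "<item><name>apple</name><origin><country>IN</country></origin></item>"
    "<item><name>banana</name></item>"
    "</fruits>"
)
data = {root.tag: element_to_dict(root)}
print(json.dumps(data, indent=2))
```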
Modules
xmlutils.xml2sql
from xmlutils.xml2sql import xml2sql
converter = xml2sql("samples/fruits.xml", "samples/fruits.sql", encoding="utf-8")
converter.convert(tag="item", table="table")
Arguments
tag -- the record tag. eg: item
table -- table name
ignore -- list of tags to ignore
limit -- maximum number of records to process
packet -- maximum size of an insert query in MB (MySQL's max_allowed_packet)

Returns: { num: number of records converted, num_insert: number of sql insert statements generated }
xmlutils.xml2csv
from xmlutils.xml2csv import xml2csv
converter = xml2csv("samples/fruits.xml", "samples/fruits.csv", encoding="utf-8")
converter.convert(tag="item")
Arguments
tag -- the record tag. eg: item
delimiter -- csv field delimiter
ignore -- list of tags to ignore
limit -- maximum number of records to process
buffer -- number of records to keep in buffer before writing to disk

Returns: number of records converted
xmlutils.xml2json
from xmlutils.xml2json import xml2json
converter = xml2json("samples/fruits.xml", "samples/fruits.json", encoding="utf-8")
converter.convert()

# to get a json string
converter = xml2json("samples/fruits.xml", encoding="utf-8")
print(converter.get_json())
Arguments
pretty -- pretty print?