I’m currently working on a new project in the evenings and weekends that involves playing with merchant datafeeds.
The big lesson I’ve learnt in the last 48 hours is that XML is a pig to work with.
The data from merchant X weighs in at 84 megs as a CSV. If you grab the same data as XML you end up with a massive 261 megs. Sure, hard drives are cheap, but the server load goes through the roof when it has to process the larger XML file…
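The blow-up is easy to see if you put the same record side by side in both formats. A quick sketch (the product record and tag names here are made up, not from any real feed):

```python
# The same hypothetical product record in CSV and in XML.
csv_row = '12345,"Widget, large",19.99,EUR\n'
xml_row = (
    '<product>'
    '<id>12345</id>'
    '<name>Widget, large</name>'
    '<price>19.99</price>'
    '<currency>EUR</currency>'
    '</product>\n'
)

# Every field carries an opening and closing tag, so the XML version
# is several times the size before any real data growth happens.
print(len(csv_row), len(xml_row))
```

A ratio of roughly 3x per record lines up with 84 megs of CSV ballooning to 261 megs of XML.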
Moral of the story – stick to CSV
hostyle says
Not sure of original author, but to quote someone on the internet: “XML is like violence: if a little doesn’t solve the problem, use more.”
Dominykas says
Um. I have to completely disagree with you. The problem is that CSV is not yet a worldwide standard – Germans use “,” (comma) as the decimal separator, so their Excel exports (and imports) use “;” (semicolon) as the value separator, rather than the English/American “.” (dot) for decimals and “,” (comma) between values. Consider that other countries use still other decimal separator symbols – I haven’t fully done my research there, but I’d suppose there might be a problem or two elsewhere. Sure, CSV is quick’n’easy’n’dirty, but XML gives you the real thing. That of course does not apply to “local market” products.
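The German-style export Dominykas describes is readable as long as you know the delimiter up front. A minimal sketch with Python's stdlib `csv` module, using a made-up row:

```python
import csv
import io

# Hypothetical German-locale export: ";" between values, "," as the decimal mark.
german = io.StringIO('12345;Widget;19,99\n')

rows = list(csv.reader(german, delimiter=';'))
price = float(rows[0][2].replace(',', '.'))  # normalise the decimal mark
print(rows[0], price)
```

The catch, of course, is that nothing in the file itself tells you it uses “;” – that knowledge has to come from the merchant, which is exactly the standardisation gap being pointed out.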
Michele Neylon says
@Dominykas – the software I’m using can handle multiple formats, but the HUGE XML files are not making it happy
hostyle says
Dominykas: how is that a problem for CSV?
US/UK: “12,345.678”,”whatever”,”blah”
European: “12.345,678”,”whatever”,”blah”
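Both of those rows parse cleanly with any quoting-aware CSV reader; the quotes keep the embedded commas out of harm's way. A quick check with Python's stdlib:

```python
import csv
import io

# The two example rows above: US/UK and European number formatting,
# each with a comma inside a quoted field.
row_us = next(csv.reader(io.StringIO('"12,345.678","whatever","blah"\n')))
row_eu = next(csv.reader(io.StringIO('"12.345,678","whatever","blah"\n')))

print(row_us)
print(row_eu)
```

Each comes back as three fields, with the number left intact for whatever locale-aware conversion you want to apply afterwards.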
Hugh says
Michele,
Some prefer XML for constantly updated items like news feeds etc., as it’s easier to figure out and parse small chunks of it.
I’m working on a price comparison site at the moment, and we’ll be pulling in hundreds of datafeeds from loads of merchants. For this it makes sense for us to use CSV – we download each feed once a day using cron, and run a shell script once a day to unzip and import each feed into the database. Currently it takes about 2 hours for a full update, but based on what I’ve seen testing XML feeds, it’d take at least double that using XML.
CSV – simple and effective.
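Hugh's download-unzip-import loop could be sketched in a few lines. This is a guess at the shape of such a pipeline, not his actual script – the feed contents, column layout, and table schema here are all invented:

```python
import csv
import gzip
import io
import sqlite3

def import_feed(merchant, fh, db):
    """Load one merchant's CSV feed (sku, name, price) into the products table."""
    rows = ((merchant, sku, name, float(price))
            for sku, name, price in csv.reader(fh))
    db.executemany('INSERT INTO products VALUES (?, ?, ?, ?)', rows)
    db.commit()

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE products (merchant TEXT, sku TEXT, name TEXT, price REAL)')

# In the real setup cron fetches each gzipped feed once a day; here a
# small in-memory gzip stands in for a downloaded file.
feed = gzip.compress(b'sku1,Widget,19.99\nsku2,Gadget,5.00\n')
with gzip.open(io.BytesIO(feed), 'rt') as fh:
    import_feed('merchant_a', fh, db)

print(db.execute('SELECT COUNT(*), SUM(price) FROM products').fetchone())
```

Because the reader is a generator feeding `executemany`, the feed streams into the database row by row instead of being loaded into memory whole – which is part of why CSV imports stay cheap at this scale.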
Michele Neylon says
Hugh
You’re in the same boat as me so 🙂
Michele
Ken Stanley says
CSV is great for keeping file sizes down if there’s a linear pattern to the content, like a SQL dump. As Dominykas said, it’s not a standard, and if the data needs to be portable this can cause problems. XML is great for storing non-linear data that doesn’t fit a fixed pattern, and it has its uses too – but the nature of it, where each piece of data is marked up/tagged, means that it’s seriously bloated. XML and CSV are very different in my opinion. I’ll rarely use XML where CSV will suffice.
Tom Gleeson says
I once wrote a post about the great data lingua franca debate (http://blog.gobansaor.com/2007/03/03/tables-vs-xml-the-data-lingua-franca-debate/), but of course there was no debate, at least back then. Good to see others appreciating the “power” of the humble CSV table 😉
Tom
Michele Neylon says
@Ken – the data I’m working with is provided by various merchants. Using the XML version simply means bloated files and an expensive processing overhead. The CSV files are relatively light by comparison
@Tom – The right tool for the job 🙂
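For what it's worth, when a merchant only offers XML, a streaming parser at least keeps memory flat rather than loading a multi-hundred-meg file as one tree. A minimal sketch with Python's stdlib `iterparse`, over an invented feed fragment:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical feed fragment standing in for a huge merchant file.
feed = io.BytesIO(
    b'<feed>'
    b'<product><id>1</id><price>9.99</price></product>'
    b'<product><id>2</id><price>4.50</price></product>'
    b'</feed>'
)

total = 0.0
for event, elem in ET.iterparse(feed, events=('end',)):
    if elem.tag == 'product':
        total += float(elem.findtext('price'))
        elem.clear()  # discard each subtree once processed, so memory stays flat
print(total)
```

It doesn't shrink the file or remove the tag-parsing overhead, so the CSV version is still the lighter option – this only softens the blow when XML is all you can get.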