Voidstar - On writing spiders and scutters for FOAF

On writing spiders and scutters for FOAF

We've learnt a lot from RSS about how to play nicely if you're writing aggregators, spiders and scutters. All this is relevant to FOAF along with some other issues.

Bandwidth
The first thing to be aware of is bandwidth. Both yours and the sites your collecting FOAF from. The simplest tweak is to support gzip encoding. foaf, rdf and xml files are essentially text so they compress well. And most web servers out there support gzip encoding if the client tells them that it understands gzip. This should be completely transparent to your code, but make sure your http collection toolkit supports gzip and uses it.
See Mark Pilgrim's page about this.

There are two standards for applications to make known to a client that data hasn't changed. etag and last-modified. Your scutter should store these against the feed data and put them back into the headers when requesting data. If the server is properly setup it should use these to return a 304 not modified header rather than the full file. See this tutorial on using them

Even if you don't get these headers, it's useful to know if a feed has changed. One way is to generate an MD5 hash and store it. Then compare it next time you visit.

Use HTTP Codes
There are a set of well specified HTTP codes that servers generate. Use them. Particularly important are 301 Permenanent redirect and 404 not found. If the server is telling you that the data has permanently moved, use that fact. And if the file is no longer there, don't go on asking for it repeatedly.
Mark Pilgirm has a set of test files for all the different codes.

Play Nicely
You're writing a robot. So pay attention to robots.txt

Some sites, like Ecademy serve large numbers of foaf files. It's not fair on the server to hammer requests at them as fast as you can. This is effectively a denial of service attack and webmasters who notice this may well band you permanently. It's good practice to wait a second or two between successive requests to the same host. Note that some hosts use sub-domains like mine.domain.com, your.domain.com. So Wait 1 between calls to domain.com

Refresh times
RSS can change quite quickly whereas FOAF probably won't. So a refresh strategy is less important. One good approach is to start with a refresh of say 6 hours. If the data hasn't changed (see etag, last-modified) then double the refresh to 12 hours up to a maximum of a few days. If the data does change go back to your minimum refresh period.

There's an RSS problem where lots of client aggregators all collect feeds at :00:00 after each hour. If they're clocks are reasonably well synced, popular sites will get hit by large numbers of requests at :00:00 So when you set up your cron job pick a random(ish) time in the hour to start.

Caching the data and default pages
I think a lot of FOAF applications are going to collect the data on the fly. This means that every hit on your application is going to generate a hit on the data source. Now given that you probably have lots of alternate pages to look at I don't think this is going to be that much of a problem. But think carefully about the bandwidth implications for the source. If you get slashdotted and you,ve coded well to withstand it, you don't want to slashdot the source in turn. Probably the most obvious danger is having a default page with a link to a default source.

The alternative to this is to cache the data locally. On the surface this looks like a good approach for several reasons. But it leads you into a set of issue to do with how current the data is. And these issues are still being worked out. Take an example. Say you collect data that says that A knows B. A week later A falls out with B and they both remove the foaf:knows statement. Your data is now wrong.

Mirrored from the Ecademy FOAF club

[ << TypePad is coming ] [ XML-RDF Validity >> ]
[ 03-Aug-03 12:38pm ]