Friday, December 30, 2005

Subversion with Python 2.4

Buried deep in the comments on one of Simon Brunner's posts I found a reference to Eugene Lazutkin's setup where he hacked the Python SVN bindings to work with Python 2.4. I am pleased to report that his hack worked beautifully for me and I am now the proud owner of a Trac installation running on Subversion v1.2.3 and Python 2.4.

The trick is to download the Python 2.3 bindings in ZIP format, then unzip them and drop them into your site-packages folder. Once this is done, fire up a hex editor and edit each of the DLLs that are installed in the libsvn package and replace all instances of the string python23.dll with the string python24.dll. That's all there is to it.
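
For anyone who would rather script the hex-editor step, here is a minimal sketch in current Python 3. It assumes the DLLs sit directly inside the libsvn folder and that the string appears in lowercase ASCII exactly as written above; the function names are made up for illustration. Because both strings are the same length, the replacement does not shift any offsets in the file.

```python
from pathlib import Path

OLD = b"python23.dll"
NEW = b"python24.dll"  # same length as OLD, so file offsets are unchanged


def patch_bytes(data: bytes) -> bytes:
    """Return data with every occurrence of OLD replaced by NEW."""
    assert len(OLD) == len(NEW)
    return data.replace(OLD, NEW)


def patch_file(path: Path) -> int:
    """Patch one file in place and return the number of replacements made."""
    data = path.read_bytes()
    count = data.count(OLD)
    if count:
        # Keep a backup next to the original before overwriting it
        path.with_name(path.name + ".bak").write_bytes(data)
        path.write_bytes(patch_bytes(data))
    return count


def patch_package(libsvn_dir: Path) -> dict:
    """Patch every DLL directly inside libsvn_dir; map file name -> count."""
    return {p.name: patch_file(p) for p in sorted(libsvn_dir.glob("*.dll"))}
```

Calling patch_package(Path("site-packages/libsvn")) (the path is a placeholder) returns how many strings were rewritten in each file, which makes it easy to spot a DLL that was missed.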

Saturday, December 10, 2005

YAXL v0.0.16 released

I just released version 0.0.16 of YAXL, tightening up QName manipulation and overall documentation. Check out the website.

Update: the source in the website looks horrible. I'm working on fixing it.
Update (20051219): the source in the website is fixed.

Tuesday, November 22, 2005

typecheck v0.2.0 released

I just released version 0.2.0 of the typecheck module to the Python Package Index. The module's website has been updated to reflect the new version number and link to the newest build but has not yet been modified to document the changed interface after the merge with Collin's code. He tagged the 0.2.0 version yesterday so we still need to catch up on a couple of minor documentation points.

The new version of the module maintains 90% compatibility with my old API - the only change is that you can no longer elect to not typecheck a parameter; a feature that I hope we will be able to bring back in a subsequent release.

Wednesday, November 16, 2005

YAXL v0.0.15 released

I just released a maintenance version of YAXL. This release contains a couple of passing test cases that had been waiting for release for a little while. I also fixed an embarrassing bug in the Python Egg distro that screwed up the builtin help.

YAXL's Subversion repository is now public. You can browse version 0.0.15's tag or just snoop around in the trunk to find the bleeding-edge stuff.

New code updates for typecheck

I will be merging the code in my typechecking module with Collin Winter's codebase. The new subversion repository for the continued development of this module is at http://www.ilowe.net/software/typecheck/svn.

We should be releasing a new version shortly with a whole whack of other/new features that Collin has implemented such as return-type checking and logical operators on types (and and or). Look for version 0.2.0 in your local Cheeseshop RSS Feed.

Wednesday, October 19, 2005

Email tacks

Rob Bray suggests a different tack to take for spam prevention/reduction.

Rob blogged about his resistance to the penny tax for emails that Tim Bray suggested in a recent blog post. In this installment, however, he changes his mind and instead comes up with a brilliant solution: only un-answered emails cost you anything. Most of us write emails to people and receive answers from those same people; spammers on the other hand send emails to thousands of people and only a few (idiots) reply. If we could make it really costly to send un-answered emails and cheap to send answered emails we might have the beginnings of a really nice, low-tech solution.

This suggestion of Rob's reminds me of a socio-economic system called Stone Society described by Peter Merel. The system involves creating an artificial resource that is then exchanged and manipulated by participants in order to allow decision-making to proceed.

In the world that Rob describes, you would exchange tokens freely with people with whom you have a back-and-forth. Spammers would simply send you tokens which you could accumulate. In other words, spam would be beneficial to you even if you didn't want to receive it.

If the tokens here were indeed pennies you would actually get paid to receive spam. You could still have email filters to make sure you don't have to read it. This plan is all about raising the barrier to entry for spammers.

Tuesday, October 18, 2005

Hacking the iPod nano

I hacked together a little Python script that scrapes Atom feeds and transforms them to iPod "contacts" allowing me to read blog posts from my nano. It uses html2text right now but a lot of entries with embedded HTML come out looking pretty weird. I'll post the script once I clean up the text generation.

I managed to get a couple of the vCalendar features working and found out that even though Apple says you need iSync or iCal to use the nano to add TODO items, you can just add standard vTodo objects to a vCalendar stream and it works just fine.

Also, the rdate property is what controls repeating tasks/events - not any of the other ones that purportedly serve the same purpose.
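
As a rough sketch of what such a stream could look like, the Python 3 snippet below builds a vCalendar with one VTODO entry and an optional RDATE line. The version number, the choice of SUMMARY and DUE as properties, and the timestamp format are assumptions made for illustration, not a record of exactly what the nano accepts.

```python
def vtodo(summary, due, rdate=None):
    """Return the lines of one VTODO object (all values are plain strings)."""
    lines = ["BEGIN:VTODO", "SUMMARY:" + summary, "DUE:" + due]
    if rdate:
        # RDATE is the property that (per the note above) drives repetition
        lines.append("RDATE:" + rdate)
    lines.append("END:VTODO")
    return lines


def vcalendar(todos):
    """Wrap VTODO line lists in a vCalendar stream with CRLF line endings."""
    lines = ["BEGIN:VCALENDAR", "VERSION:1.0"]  # version is an assumption
    for todo in todos:
        lines.extend(todo)
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


stream = vcalendar([
    vtodo("Buy nano skins", "20051020T090000", rdate="20051027T090000"),
])
```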

Now it's time to go to bed...

Sunday, October 16, 2005

String/XML representations in YAXL

Damian Cugley asks


Could you eliminate the need for the text member variable by using str(elt) instead? Or would that interfere with using str to return the XML representation for the whole element?


In YAXL v0.0.14 I've split the representation of an element between str and repr so that the former would be a shortcut to return the text property and the latter would return the XML representation of the element.

I'm pleased with the separation since now if you evaluate an element at the interactive prompt, it returns the XML for the element but if you print or interpolate the element its string value will be used. There is also a nice parallel with the way nodes are handled in XPath.
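
The split can be illustrated with a toy class. This is a minimal sketch, not YAXL's actual implementation: the Element class below is a stand-in that only knows about a name and some text.

```python
from xml.sax.saxutils import escape


class Element:
    """Toy element: str() gives the text, repr() gives the XML."""

    def __init__(self, name, text=""):
        self.name = name
        self.text = text

    def __str__(self):
        # Used by print and by string interpolation
        return self.text

    def __repr__(self):
        # Used when the object is evaluated at the interactive prompt
        if not self.text:
            return "<%s />" % self.name
        return "<%s>%s</%s>" % (self.name, escape(self.text), self.name)


title = Element("title", "Hello & bye")
greeting = "Title: %s" % title
```

Evaluating title at the prompt shows the escaped XML, while greeting contains only the raw text.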

YAXL elements retain the text property in order to continue to support the oodles of screaming fans who have already shipped billion-dollar projects based on my little library.

Thanks for the suggestion Damian!

Fall cleaning

Justin and I cleaned out some of the more cobwebby parts of mystic yesterday. We wanted to install Python 2.4 with subversion, trac and apache 2. It took a fair bit of wrangling (i.e. building from source) to get everything working together after which Justin realised that we hadn't upgraded to Sarge (duh!). He rebuilt everything using the latest packages and it looks OK so far.

The transition from Apache 1.3 to 2.x was pretty smooth with the SSL stuff at the high end weighing in at about 20-30 minutes to figure out. Although the new sites-available/sites-enabled organization is nicely architected, it does make for some pretty painful contortions when performing wholesale changes on a group of virtual hosts. For example, we started off testing the new installation of Apache 2.x on port 81 (which required that we split out the vhost declarations into files of their own) and then proceeded to switch it to port 80 (which ended up requiring a quick cat/sed/re-direct combination). Without a good knowledge of shell scripts or some scripting language this would have been really painful.
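
The cat/sed step could equally be done from Python. The sketch below is an assumption about what such a rewrite might look like, not the commands we actually ran: it only touches Listen, NameVirtualHost and opening VirtualHost lines, and the function names are invented.

```python
import re
from pathlib import Path


def switch_port(text, old=81, new=80):
    """Rewrite the port on Listen/NameVirtualHost/<VirtualHost> lines."""
    pattern = re.compile(
        r"^(\s*(?:Listen\s+|(?:<VirtualHost|NameVirtualHost)\s+\S+?:))%d\b" % old,
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(r"\g<1>%d" % new, text)


def switch_port_in_dir(directory, old=81, new=80):
    """Apply switch_port to every file in directory; return changed names."""
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        original = path.read_text()
        updated = switch_port(original, old, new)
        if updated != original:
            path.write_text(updated)
            changed.append(path.name)
    return changed
```

Pointing switch_port_in_dir at a copy of sites-available first is the safer way to try it, since the files are rewritten in place.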

The other rough item on our plate was permissions when using the Berkeley DB version of a subversion repository. My advice: don't go there. Just stick to FSFS and you should be OK. This being said, I had originally created the repository in FSFS but had subsequently dumped it and loaded it onto mystic. It seems that I did something wrong because it loaded in Berkeley DB mode. All that is fixed now and taken care of.

We also tossed in Apache subversion support for good measure so now we can set up anonymous browsing and reading of our SVN repositories.

Saturday, October 15, 2005

iPod nano in da house

Vem and I finally both got our iPod nanos! What a cool gadget. These things are too slick to be real!

My one small gripe is that we ordered a bunch of stuff from Apple and they shipped it all separately with the predictable result that the iPod armband for Vem's gym sessions arrived first (wow, look - an armband) followed by her iPod, followed several days and a trip to the local Fedex detention centre for un-claimed packages later by mine own.

Rumors about the nano being scratch-prone seem to have some truth to them as I can already (after a day and a half) detect small scoring on the exterior plastic. Of course, we ordered protective "skins" but since Apple hasn't shipped them yet (and nobody in their right mind would leave an un-opened nano sitting on the shelf for three weeks) we may find that our nanos are not as pristine as they could be. Ah well...

Still, the photo feature is pretty sweet (useless but sweet), although it would be nice to be able to set a photo that displays automatically when you power on (sort of like Vem's digital camera). The rating feature is very cool (I was just ranting to Vem the other day about needing one of these). iTunes is not all it could be but I haven't started playing around with the iPod libraries out there to see what I can hack up.

Anyways, overall we love our nanos and would buy them again: 9.8/10

Monday, October 10, 2005

YAXL v0.0.12 released

Over the weekend I've managed to work in a number of bugfixes, add some more XPath support, add sequence-style access for children and implement a whole slew of namespace features. The latest source version includes 78 unit-tests that cover all this fancy functionality.

I'm still debating the question of whether or not to allow access to children via property names (like xmltramp). The problem is that you end up having to mangle either child names or "real" property names. Also, you need to have a mangled way for dealing with node-sets. YAXL currently supports XPath even in XML fragments so theoretically you should use that, but it's hard not to like root.title as opposed to root('title'). More complex queries, however, reveal the power of XPath: compare [x for x in root.head.children if x.localname == 'style'] to root('head//style') to retrieve all style elements of an HTML document. Or even root('//p') to retrieve all the paragraphs.
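
The same contrast can be reproduced with the standard library, which may help if you don't have YAXL installed. This sketch uses xml.etree.ElementTree rather than YAXL, so the spelling differs: children are filtered by tag, and since ElementTree rejects absolute paths on an element, the paragraph query is written as .//p.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
  <head>
    <title>Example</title>
    <style>p { margin: 0 }</style>
    <style>h1 { color: red }</style>
  </head>
  <body><p>one</p><div><p>two</p></div></body>
</html>"""

root = ET.fromstring(DOC)

# Manual traversal: filter the direct children of <head> by tag name
head = root.find("head")
styles_by_loop = [child for child in head if child.tag == "style"]

# Path query: every <style> at any depth below <head>
styles_by_path = root.findall("head//style")

# Path query: every <p> anywhere below the root, in document order
paragraphs = root.findall(".//p")
```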

Ideally, YAXL would support both methods but I still need to come to terms with all the mangling required.

Friday, October 07, 2005

Expat doesn't support namespace prefixes

This via "Crest" who sent in some nice stack traces generated by YAXL. I wrote in a fix yesterday to handle a missing QName but I didn't realise that expat would throw a SAXException. As a result, nobody using expat can use the parse function until I add a try/except for that exception.
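
The guard could look something like the sketch below (current Python 3, standard library only). It is not YAXL's code: the NameCollector handler and parse_names function are hypothetical, and the fallback of using the local name when no QName is reported is my assumption about a sensible default.

```python
import io
from xml.sax import SAXException, make_parser
from xml.sax.handler import (
    ContentHandler,
    feature_namespace_prefixes,
    feature_namespaces,
)


class NameCollector(ContentHandler):
    """Record one name per start tag, preferring the QName when given."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElementNS(self, name, qname, attrs):
        uri, localname = name
        # Fall back to the local name when the parser reports no QName
        self.names.append(qname if qname is not None else localname)


def parse_names(xml_bytes):
    parser = make_parser()
    parser.setFeature(feature_namespaces, True)
    try:
        # Some parsers refuse this feature; carry on without QNames
        parser.setFeature(feature_namespace_prefixes, True)
    except SAXException:
        pass
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names
```

Catching the SAXException base class keeps the guard independent of which specific subclass a given parser raises.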

Thursday, October 06, 2005

SAX Parser Funny Stuff

It turns out that the startElementNS method on a SAX ContentHandler is not necessarily passed the QName of the element. The name is passed as a URI/element name tuple, but the QName is not mandatory unless the feature_namespace_prefixes feature is enabled. The following code shows how to do this:


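from xml.sax.handler import ContentHandler, feature_namespaces, feature_namespace_prefixes
from xml.sax import make_parser

class MyContentHandler(ContentHandler):
    # Placeholder handler: show the (URI, local name) pair and the qname
    def startElementNS(self, name, qname, attrs):
        print(name, qname)

contentHandler = MyContentHandler()

parser = make_parser()

# Perform namespace processing
parser.setFeature(feature_namespaces, True)

# Report original prefixed names (i.e. qname is passed to startElementNS)
parser.setFeature(feature_namespace_prefixes, True)

parser.setContentHandler(contentHandler)

# inputSource is whatever you are parsing: a filename, URL or file-like object
parser.parse(inputSource)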


I was made aware of this issue when YAXL failed to parse XML source properly on somebody else's machine; the elements were being named "None". That tip led to this fix. Merci Marc!

YAXL Reviewed

Jeremy Jones blogs about YAXL (already!). I agree with Jeremy that ElementTree is pretty much the best XML tool there is out there for Python. I also agree that the children attribute should not contain descendants. It never really did but since YAXL Elements are represented as XML fragments, it may have seemed that way.

Here's what an interactive session looks like:


>>> import yaxl
>>> x = yaxl.Element('x')
>>> x.append('y').append('z').append('w')
<w />
>>> x.children
[<y><z><w /></z></y>]


It may seem from the output that the z and w elements are returned as part of children but they are just part of the XML fragment that y outputs.

In the latest release (0.0.6) you can also "call" the element with an XPath of child::* and get back a list of immediate children.

Wednesday, October 05, 2005

YAXL gets XPath

I added some basic XPath support to YAXL this morning. It now supports both abbreviated and unabbreviated XPath queries on Element objects. Currently only attribute-value and node/node-set selections are supported. Also, only the following axes are allowed in queries: self, parent, ancestor, ancestor-or-self, child, descendant and descendant-or-self.

Especially cool (IMO) is the fact that YAXL supports two methods of performing XPath queries: you can call select on an instance of Element or you can call an instance of Element and supply the XPath as the only parameter.
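
The two calling styles boil down to making the object callable. The sketch below shows the idea with a hypothetical Queryable wrapper around xml.etree.ElementTree; it is not YAXL's Element class, and it only understands the path subset that ElementTree supports (so immediate children are selected with * rather than child::*).

```python
import xml.etree.ElementTree as ET


class Queryable:
    """Wrap an ElementTree element so it can be queried two ways."""

    def __init__(self, element):
        self._element = element

    @property
    def name(self):
        return self._element.tag

    def select(self, path):
        """Explicit style: elt.select('y/z')."""
        return [Queryable(e) for e in self._element.findall(path)]

    def __call__(self, path):
        """Shortcut style: elt('y/z') simply delegates to select."""
        return self.select(path)


root = Queryable(ET.fromstring("<x><y><z/></y><y/></x>"))
children = [e.name for e in root("*")]
grandchildren = [e.name for e in root.select("y/z")]
```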

Tuesday, October 04, 2005

Something new, something old

I just launched a new site at http://www.ilowe.net. I will be retiring the schmeez.org site (the permanent redirects are already in place for the 3 people that visit regularly) for many reasons (not least of which is having to spell it every time I tell somebody where my "homepage" is).

I also released the first version of YAXL, a Python module for reading, writing and manipulating XML. It will form the basis for an Atom library I am currently writing.

Thursday, September 29, 2005

New version of typecheck module

I've released version 0.1.4 of my typechecking module for python. This version contains a fix that allows you to use doctest tests within typechecked functions.

This doesn't work out of the box since the doctest module looks for tests by recursing through a set of objects and looking for docstrings. If it finds a function, it checks that the function's func_globals object is the same object as the function's module's __dict__ property. This is not true of any decorated function where the decorator is defined outside the module the function is defined in.

Anyway, the new typecheck.doctest module allows you to bypass the check that doctest usually uses and replaces it with a much more liberal one that just verifies that a function's __module__ property is the same as the name of the module in which the function is defined.
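
The difference between the two checks is easy to see in isolation. The sketch below is written for current Python 3, where func_globals is spelled __globals__. It does not use typecheck.doctest at all: the passthrough decorator and the throwaway "client" module are invented purely for the demonstration.

```python
import functools
import types


def passthrough(func):
    """A do-nothing decorator defined outside the client module."""
    @functools.wraps(func)  # copies __name__, __doc__ and __module__
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


# Build a throwaway module named "client" whose function uses the decorator
client = types.ModuleType("client")
client.passthrough = passthrough
exec(
    "@passthrough\n"
    "def double(x):\n"
    "    return x * 2\n",
    client.__dict__,
)

# Strict check: the wrapper's globals belong to the decorator's module,
# so they are not the client module's namespace
strict_check = client.double.__globals__ is client.__dict__

# Liberal check: compare the recorded module name instead
liberal_check = client.double.__module__ == client.__name__
```

Note that the liberal check only works here because functools.wraps copies __module__ onto the wrapper; a decorator that skips that step would fail both checks.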

Tuesday, September 13, 2005

All locked up and no place to go

Dave Pollard asks


What does it take to make someone so dissatisfied, so unable to bear the life they lead, that they can walk away and start a new, better life, a different way, elsewhere? And what does it take before we realize that the prisons in which each of us live, societally and metaphorically, are prisons without locks, just waiting for us to let ourselves out?


Personally what keeps me here are the links I have with other people. I have frequently said that if I had none of these I would quickly enroll myself in a monastery somewhere for 10 years. In the same way that Scott Peck (The Road Less Traveled) says that everybody could benefit from psychotherapy I believe that everybody could benefit from some solitude and meditation.

I pick 10 years since I figure the schedule goes something like this:


  1. Spend the first 3-4 years getting used to my new surroundings, learning to be "disconnected" from all the things I am accustomed to.

  2. The next 2-3 years are spent learning how to be alone with myself and how to think when I'm alone. I currently feel a need to bounce ideas off others and this would not be possible in an ideal circumstance.

  3. The final years are spent in quiet contemplation of myself, the world and the divine. Hopefully I can then choose whether or not to "go back" with some sort of idea both of what I am leaving and of where I am going.



I used to have nightmares where somebody was pressing me for an answer I didn't have and all I could muster was "Wait! Wait!" over and over. The frustration of not being able to stand back, take a deep breath, and see things the way they are is a constant enemy of tranquility and clear vision.

I suppose it's possible to become dissatisfied with bits of your life and work to change them, but the same obstacles rear their ugly heads every time: fear, inertia and peer pressure in its various forms.

Saturday, September 10, 2005

It's all about the user experience

Kudos to ArgoUML for having a Java WebStart version of their software. Launching this over the web is so easy it makes you wonder why you bother actually "installing" anything.

Monday, August 29, 2005

What's good for the goose

I discovered last night that Microsoft's web servers display an error for the URL http://www.microsoft.com/.net. Now, I may be a bit demanding but don't you think that a company like that should support tacking ".net", "office", "visualstudio", etc. to their base URL?

As a side note, using any additional path element that starts with a "." causes the same error. Requests for http://www.microsoft.com/visualstudio and the others return a reasonable error page that at least allows you to search the site.

Wednesday, August 10, 2005

Why do we treat email differently than a phone call?

Jason at 37signals.com asks "Why do we treat email differently than a phone call?"

I think the answer is a lot simpler than the dozens of commenters imply: we treat email differently because text can be manipulated in ways that audio content cannot.

I could record all the phone calls I take or make but what would be the point? I couldn't go back and search through them, it's impossible to "scan through" a conversation to see the interesting/important bits, etc.

Although it's true that you could do all of this with transcripts of phone calls the fact remains that transcribing those calls is a Herculean task; either you need a dedicated secretary or you need to be very understanding of the mistakes made by your voice recognition software. All of this doesn't even get into the difficulty of either the secretary or the software dealing with accents, voice modulation and so on.

A lot of the meaning in a conversation (even on the phone) comes from things other than the actual words used. With email, the entirety of the conversation is captured. You see what you saw when the email first arrived.

Bottom line, we treat phone calls and email differently because the two media (voice and text) are divergent in so many aspects.

Monday, August 01, 2005

Ignorance and judgement

David Toub blogs about snobbery, elitism and "atty-tood". Sadly, I think most elitism and snobbery comes from simple ignorance. Most people lose at least some part of their superiority when they are (properly) exposed to the thing that they deride.

In the world of programmers nowhere is this more prevalent than in the language/editor/platform wars. A lot of these wars are started in various fora by somebody who has a certain degree of knowledge about the language/editor/platform whose virtues they extol (enough to know which features to laud at least) but little knowledge of the competing technologies.

Although editors and platforms may indeed provide different features, languages, like music, can say anything to those who want to listen. Sure they have different flavours; that's why I prefer Python to Ruby (down in the back row). But at the end of the day they are all capable of moving the machine to do the same things (notwithstanding the fact that some languages cannot do certain things for architectural reasons).

Those who refuse to acknowledge certain basic qualities of these forms of expression (music, languages, fashion, painting, sculpture) do so out of ignorance and myopia. And prejudice is never far behind ignorance and judgement.

Saturday, July 23, 2005

Weekend of the WOD

It's been a busy weekend so far. Today I cleaned up the logging code in the WOD and added a setuptools script for building eggs and so forth. I moved a huge amount of code around and consolidated a number of dangling modules at the wodfs.* level into the wodfs package.

I also threw together a wodfs.SimpleSystem class that can be easily used whenever you want to hack with the WOD programmatically (i.e. not through the filesystem). I also shifted the code from wodfs.server so you can call wodfs.start(port=9000) to get a self-documenting XMLRPC server backed by the WOD.
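
For readers who haven't seen a self-documenting XML-RPC server, the standard library's DocXMLRPCServer is one way to get that behaviour. The sketch below is not the wodfs code: the SimpleSystem class here is a toy in-memory stand-in, and make_server and start are hypothetical names that only mirror the call shape described above.

```python
from xmlrpc.server import DocXMLRPCServer


class SimpleSystem:
    """Toy in-memory stand-in for a programmatic storage API."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        """Store data under path and return the number of versions kept."""
        self._files.setdefault(path, []).append(data)
        return len(self._files[path])

    def read(self, path):
        """Return the most recently written data for path."""
        return self._files[path][-1]


def make_server(port=9000, host="127.0.0.1"):
    """Create (but do not run) a documented XML-RPC server."""
    server = DocXMLRPCServer((host, port), logRequests=False)
    server.set_server_title("Storage API (sketch)")
    server.register_introspection_functions()
    # Public methods and their docstrings show up on the generated page
    server.register_instance(SimpleSystem())
    return server


def start(port=9000):
    """Serve until interrupted."""
    server = make_server(port)
    try:
        server.serve_forever()
    finally:
        server.server_close()
```

With the server running, pointing a browser at its address shows an HTML page generated from the method docstrings.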

I then shuffled the website around and moved everything to wodfs.org. The old address now issues a permanent redirect to the new one.

Type-checking module for Python

I just published version 0.1.0 of typecheck, a type-checking module for Python. It is registered at the Python Package Index and everything.

I wrote the setup scripts in about fifteen minutes and tweaked them for an hour or so using setuptools which I am now convinced kicks some pretty major ass. It was really easy to package everything up and upload it to PyPI. setuptools can even do special tricks like allowing you to deploy a "development" version of your package that you can edit from your checkout directory but that still gets included in sys.path - really neat stuff.

Thursday, July 21, 2005

Pattern recognition

Jeff Atwood describes the practice of Just Try Again where coders sometimes re-execute code that fails to see if it "does it again".

I think that one of the benefits of running code multiple times without modification is that it allows us to see patterns. It's a bit like trying to pick out the lyrics in a song: listening to the song once is not enough. In fact, even several times may not be enough to allow you to isolate the different ways the singer has of pronouncing certain words. Depending on which words are grouped with which others the pronunciation may change. This is all very similar to the way that different components in a software system act together to increase the overall complexity of the task of isolating an error.

So when I re-run broken code without changes what I'm really doing is looking for patterns in different parts of the code on each run. Doing so allows me to break down the run into a set of behaviours which I can then analyse to figure out what is out of place.

Universalism and a machine existence

In his article WWW and UU and I Tim Berners-Lee says:

The whole spread of the Web happened not because of a decision and a mandate from any authority, but because a whole bunch of people across the 'Net picked it up and brought up Web clients and servers, it actually happened.


This really captures (for me) the way we should be living our lives: not as subjects to an external authority but by working together towards a "good" that can benefit us. Our own internal authority should allow us to interoperate and co-exist much better than machines. And yet it seems they have the upper hand.

Tuesday, July 19, 2005

Myofilms Launch

I just launched the Myofilms Blog with Olivier. He promises to write and keep in touch.

Thursday, July 14, 2005

Python gets CPAN

Phillip J. Eby has put together a set of extensions to distutils called setuptools. One of the included scripts, easy_install.py is a shot at duplicating CPAN/apt-get and other package management tools.

setuptools also works with Python Eggs, a package format that extends the distutils format.

Although I haven't tried to package anything as an Egg yet I have tried the setuptools package (which incidentally comes with a bootstrap script that downloads and installs the package automatically). Everything ran straight out of the box and I was able to install and upgrade several packages already installed on my system very easily. Great work! This is something Python has sorely needed for a while.

Tuesday, June 14, 2005

Sometimes I feel so alone

All I really want is to use Python to connect to an MS SQL 2000 database without using a DSN or ODBC. Is it really that complicated? Is there really nobody else that has ever wanted this before (and moved beyond complaining on mailing lists)?

Friday, June 10, 2005

Different strokes

As I was randomly flipping through pages on Ward's Wiki filled with code examples comparing Ruby and Python I felt the urge to fire off this message into cyberspace: don't ever think it impresses me that you can write code in your language that I can write in mine. It should be obvious to anybody with an internet connection and a propensity for coding that any two Turing-complete languages are equivalent in what they can compute. I obviously like my language and you like yours. It is just plain asinine to think that by showing me you can write code that does the same thing as my code you can convince me that your language is better. We all use the languages we use because a) we were born to it (i.e. it's all we know) or b) we like it best. Either way, it's not impressive to see the same algorithm implemented in a different language alongside a claim of superiority. 'Nuff said.

Tuesday, June 07, 2005

What to do, what to do

What do you do when you ask somebody "can you guys handle XMLRPC" and they respond "yeah, sure, XMLRPC, LMNOP, whatever you like"?

Monday, May 30, 2005

AJAX encapsulation and some gooey musings

Jon Udell blogs about AJAX encapsulation with TIBCO General Interface. This is one seriously cool product and I'm now officially itching to get my hands on it.

On the other hand it makes me wonder why we persist in trying to rebuild the functionality of the OS (windowing systems and so on) inside the browser. More and more it seems to me that we need to have a complete re-write of the OS windowing system to take advantage of all the things we have learned about GUIs in the past little while.

Imagine if the "windowing" software for the OS was actually just a giant browser. You'd get automatic "Active Desktop". You'd be able to open frames and dialogs and file choosers and all the other crunchy goodness that we have painstakingly baked into the browser. You'd also stop worrying about whether your application's super-duper feature was going to work on MacOS, Windows and Linux since all three could run this "super-browser" as the base object in the windowing system.

There's probably a really good reason (other than the phenomenal effort required) to not do this... but I can't think of it.

Saturday, May 28, 2005

Interviewed by Richard Jones

I asked Richard Jones to interview me.

I'm sure that my answers (found below) are a bit longer than most people's attention spans. To alleviate my guilt, please follow these instructions and further the meme:


  1. Leave me a comment saying, "Interview me."

  2. I will respond by asking you five questions. I get to pick the questions.

  3. You will update your weblog with the answers to the questions.

  4. You will include this explanation and an offer to interview someone else in the same post.

  5. When others comment asking to be interviewed, you will ask them five questions.



1. Why do you write a weblog?



I write to try to capture my thoughts. I suppose that I don't really care about other people reading them but on some level I like the fact that they do. There is an ego trip associated with putting your thoughts out there and having other people read them. At the same time, I get to use readers as a sounding board: anybody who disagrees with what I write is welcome to comment and hopefully give me some insight that I hadn't previously seen.

I decided a while ago that I didn't want to have the kind of blog that goes on and on about what flavour of ice cream I ate this morning or which of my friends I will be seeing tonight. I think that in the future I will come back and read my thoughts from this year and last year and try to see what changes took place in my head and when (and maybe even why). Being able to see what I was thinking when certain events took place is kind of cool. I don't want to look back and see a list of what I was doing - I want to see what I was thinking about.

Finally, I blog because I'm fundamentally a yapper. I have an answer to every question (even though it might not be a great answer) and I sometimes feel the need to get that answer off my chest even if nobody is listening. That's why I sometimes blog about morality or society. My friends laugh at me (kindly) because I always have something to say about everything so I figure that some of that should end up here.

2. What's the coolest online community enabling system right now?



I think that wikis hold the #1 spot for me. They allow users to collaborate, chat and create together which is something that most other systems lack for one reason or another. A lot of features from other systems could be folded into wikis but even the basic (simplest) wiki system is far more powerful than most other systems.

Blogs are cool but they don't really create tight communities. They create clouds of people that read each other's blogs and leave comments for each other but they remain fundamentally un-collaborative publishing mechanisms - a way for me to tell you what I think/feel/do.

IRC has more immediate interaction but loses completely on the asynchronous level. I know, I know, there are bots that will tell you when the last time foxy6969 was in the channel but it's not the same. Wikis are always now.

Usenet has asynchronicity but is more like mailing lists. In fact, let me just lump mailing lists in here as well (even though there are access restrictions to most mailing lists that don't exist for most newsgroups) and say that both allow you to post articles (like a blog) on a given topic. I would say that Usenet fosters more community spirit than blogs but still loses to wikis.

Online games (like MMORPGs and so on) also foster great community spirit but are very limited in their focus. Even though most wikis have a theme, almost all have a very heterogeneous collection of pages.

Tagging services like Flickr and del.icio.us don't create communities. They expose existing interest groups. You and I may tag the same resource the same way but have nothing else to say to each other. Also, tagging is not conversation (no matter what the pundits say) and is an incredibly low-bandwidth medium of communication. The lack of consistent commenting means that tagging services are more like personality quizzes than they are like a community blackboard.

So wikis is my final answer.

3. What on Earth is WOD?



The WODFS is a Write-Once Distributed FileSystem. I wrote it in the winter of last year with my friend Justin and we put a lot of work into our "baby". The system is based on the design of archival storage systems and distributed filesystems and its main goal is to allow users to simulate an infinite, indestructible hard-drive.

The WOD allows you to keep saving data forever to a network of computers that store only the unique parts of your data. This means that each "chunk" of data is only ever stored once (conceptually). Each node in the network caches the most frequently used blocks so that you can achieve a disk access speed that is reasonable.
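
The "store each chunk only once" idea can be sketched with a dictionary keyed by content hash. This is an illustration of the general technique, not the WOD's code: the ChunkStore class, the fixed chunk size and the choice of SHA-256 are all assumptions, and there is no networking or caching here.

```python
import hashlib


class ChunkStore:
    """Write-once store: identical chunks are kept a single time."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self._chunks = {}  # hex digest -> chunk bytes

    def put(self, data):
        """Split data into chunks, store new ones, return the key list."""
        keys = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            # setdefault never overwrites, so existing chunks are untouched
            self._chunks.setdefault(key, chunk)
            keys.append(key)
        return keys

    def get(self, keys):
        """Reassemble the original data from its key list."""
        return b"".join(self._chunks[k] for k in keys)

    def stored_bytes(self):
        """Total size of the unique chunks actually held."""
        return sum(len(c) for c in self._chunks.values())
```

A file is then just a list of keys, and two files that share content share the underlying chunks.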

What's really cool is that since the WOD is write-once, you never actually "lose" any data. Even deleted data can be easily retrieved.

The system includes a set of extensions to the filesystem that allow you to create special folders and access files that contain data from the past. For example, if you have a folder called myprojects where you put all your code, you can open (either by creating it in Windows or by cd'ing to it in UNIX) a folder called myprojects@2004-01-01 and you will see that folder's contents exactly as they were on the first of January, 2004. This works with any folder (even the root) in your filesystem. Individual files track every single saved modification and allow you to browse through the history of the file using dates (as explained above) or by appending @ and a negative number representing the number of steps backwards to take.
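
To make the naming scheme concrete, here is a sketch of how such names could be parsed and resolved against a file's history. It is not the WOD's implementation: the function names are invented, and reading a date as "the latest version saved on or before that day" is my assumption for the example.

```python
import datetime
import re

_VERSIONED = re.compile(r"^(?P<base>.+)@(?P<version>\d{4}-\d{2}-\d{2}|-\d+)$")


def parse_versioned_name(name):
    """Split 'base@version' into (base, date, steps_back)."""
    match = _VERSIONED.match(name)
    if match is None:
        return name, None, None
    base, version = match.group("base"), match.group("version")
    if version.startswith("-"):
        return base, None, int(version[1:])
    try:
        year, month, day = map(int, version.split("-"))
        return base, datetime.date(year, month, day), None
    except ValueError:
        # Not a real calendar date, so treat the whole thing as a plain name
        return name, None, None


def pick_version(history, when=None, steps_back=None):
    """history is a list of (date, content) tuples, oldest first."""
    if steps_back is not None:
        return history[-1 - steps_back][1]
    if when is not None:
        candidates = [content for saved, content in history if saved <= when]
        return candidates[-1]
    return history[-1][1]
```

For example, parse_versioned_name("myprojects@2004-01-01") yields the base name plus a date, and the result can be fed straight into pick_version.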

I'm working hard right now to finish the code to hook the system into the Windows explorer but right now things run only on UNIX. Performance is not as good as we would like but the whole thing is written in un-optimized Python so that is to be expected.

4. What's your favourite Python hack?



I think that right now, my favourite hack has to be the soft reference module that we developed for the WOD. We hooked into the GC routines of the Python runtime to manage a set of objects that gets collected only when the memory they use is needed.

Close runners-up include hacks that involve meta-classes or decorators.

I'm not a big fan of cramming as much code as possible into one line. I like to see elegance and consistency. By consistency I mean that even hacks should be done in the spirit of the language used.

I've spent the last years coding in Java and so I really like hacks that show flexible (sometimes even functional) capabilities of languages. I like to see languages pushed to do things that they were not necessarily meant to do in clean and cogent ways.

I've always been a big fan of LISP and Scheme so I like to see hacks that modify language structure or build "small languages" or that modify a set of entities (objects or functions or whatever) transparently and orthogonally. That's why I like playing with decorators and meta-classes.

5. Blogs vs. Usenet, fight to the death. Who wins?



Blogs win. Honestly I think that blogs and usenet are pretty similar and if bloggers had some more tools (other than Trackback and comments) to allow a threaded style of linking between their posts (like metadata that allowed a tool to generate lists of related posts) then blogs would be a lot like Usenet but without the crappy flamefests and trolling and religious wars.

Blogs mean I get to choose whose writings I read. I like that choice. There's no spam and in general most blogs are consistent enough that I know after a little while whether or not to keep reading the blog.

Usenet can go any way. There are often spam messages posted to newsgroups. There are trolls who spend their time wasting other people's lives. Also, newsgroup organization is more centralized than blog management (says the guy who blogs on blogspot.com).

All in all, blogs are more flexible and allow more personal choice and freedom. They get my leftist vote.

Thursday, May 26, 2005

Tag maintenance

Clay Shirky blogs about the Dynamic growth of Tag Clouds. He put together a script that showed how the cloud of tags associated to a URL on del.icio.us grows over time.

An interesting aspect (mentioned later in his post) is that some of the top tags for a resource only percolate to the top later in the tagging process. The example he uses is the tag "ajax" being applied to the "original Adaptive Path article". The "ajax" tag wasn't used until after more than 1/3 of taggers had already tagged the resource.

This begs the questions: how are users who are unaware of a certain term ever going to use that term as a tag; and how will that term ever be added to the list of tags used by those who have already tagged the resource?

To be clear: I tagged the original article. I did not however use the tag "ajax" since at the time the term meant nothing to me. It had not yet caught on as a global term for the practice of using javascript to communicate with a server behind the scenes. Today, I might very well attempt to find that article using the "ajax" tag. Of course, this wouldn't work since I never used that tag. It would be nice if I could search my tagged resources using others' tags when I don't have the tag I'm searching for. Obviously my own tags should have priority but if I search my resources for the "ajax" tag, the resources that I have tagged and that others have tagged as "ajax" should show up in my search.
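The search described above could look something like the sketch below. The two dictionaries are hypothetical stand-ins for del.icio.us data (no real API is used), and the function name is made up for the example.

```python
def search_by_tag(tag, my_tags, others_tags):
    """Return my bookmarked URLs that match `tag`, my own taggings first.

    my_tags:     {url: set of tags I applied}
    others_tags: {url: set of tags other users applied}
    Both mappings are hypothetical stand-ins for del.icio.us data.
    """
    # Resources I tagged with the term myself get priority.
    mine = [url for url in my_tags if tag in my_tags[url]]
    # Resources I bookmarked under other tags, but that others tagged with the term.
    borrowed = [
        url for url in my_tags
        if tag not in my_tags[url] and tag in others_tags.get(url, set())
    ]
    return sorted(mine) + sorted(borrowed)
```

Only resources that I have bookmarked are ever returned; other people's tags just widen the ways I can find them.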

This points to a general maintenance problem with tagging (and I would venture to guess with classification schemes overall): how is the metadata maintained cheaply? Currently, tagging a resource into del.icio.us takes me less than 30 seconds. It would be a real pain if I had to review my resources all the time and add new tags.

Maybe another solution (other than the one listed above) would be to produce a feed of resources that I have tagged and an interface that would guess which tags from others would be most useful to me.

It feels like we are moving beyond tagging to searching. So is the issue the technology behind tagging/classification or behind searching?

Thursday, May 19, 2005

Sparklines

I discovered a fantastic page containing a draft chapter from Edward Tufte's new book via Pensieri di un lunatico minori.

Digging deeper as is my wont I found a whole whack of cool articles about sparklines and their implementations.

http://bitworking.org/news/Sparklines_in_data_URIs_in_Python
http://www.ietf.org/rfc/rfc2397
http://www.bissantz.de/sparklines/
http://agiletesting.blogspot.com/2005/04/sparkplot-creating-sparklines-with.html
http://dealmeida.net/en/Projects/PyTextile/sparklines.html
http://www.waler.com/flint10.exe

[Update: fixed link to Pensieri di un lunatico minori]

WODFS storage architecture

An off-the-cuff description of the WOD's storage system

OK, I admit it. We shamelessly ripped off the idea for how to break up and store data from the Venti filesystem papers.

So the basic idea is that each file is broken up into a number of blocks. Each block is 1024 bytes in size (for now, more later on variable-sized blocks) and its fingerprint is the SHA-1 hash of the data in the block.
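As a rough illustration of that idea (a sketch only, with a plain dictionary standing in for the block store, not the WOD's actual code):

```python
import hashlib

BLOCK_SIZE = 1024  # block size given in the post


def split_blocks(data, size=BLOCK_SIZE):
    """Yield consecutive blocks of at most `size` bytes."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def fingerprint(block):
    """Return the SHA-1 hex digest that identifies a block."""
    return hashlib.sha1(block).hexdigest()


def store_blocks(data, store):
    """Put each block into `store` keyed by fingerprint.

    `store` is a dict used here as an assumed stand-in for the real storage.
    Returns the fingerprints in file order.
    """
    prints = []
    for block in split_blocks(data):
        fp = fingerprint(block)
        store.setdefault(fp, block)  # identical blocks are kept only once
        prints.append(fp)
    return prints
```

Because the key is derived from the content, writing the same block twice costs nothing extra, which is where the "store only the unique parts" property comes from.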

A given block can be either a data block or a pointer block. Pointer blocks have pointers to other pointer blocks or to data blocks. Each file is conceptually a tree of pointer blocks that end up pointing to leaf data blocks. Because each block has as its fingerprint the hash of the data in the block, each block can be interpreted as either a data block or a pointer block. This is not an issue to begin with but as the system fills with data, we project that this will result in considerable savings in terms of space since the data and the data structure are expressed in the same way.

In addition, each file is uniquely identified by a fingerprint which is the hash of its contents: the leaf data blocks are referenced by the pointer blocks all the way up to a root block for the file; that block's fingerprint is the file's fingerprint. Similarly, the entire filesystem is represented by a single root fingerprint. This root fingerprint is stored between sessions and is, in fact, the only item necessary to rebuild the entire system from scratch.
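A sketch of how a root fingerprint can fall out of such a tree. The fan-out (how many fingerprints fit in one pointer block) is an assumption derived from the 1024-byte block size and the 20-byte SHA-1 digest, the type flags are left out, and a dictionary again stands in for the store.

```python
import hashlib

BLOCK_SIZE = 1024
DIGEST_SIZE = 20  # bytes in a raw SHA-1 digest
FANOUT = BLOCK_SIZE // DIGEST_SIZE  # assumed number of pointers per pointer block


def root_fingerprint(data, store):
    """Store `data` as a tree of blocks and return the root block's fingerprint."""
    # Leaf level: raw data blocks (an empty file still gets one empty block).
    level = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)] or [b""]
    while True:
        prints = []
        for block in level:
            fp = hashlib.sha1(block).digest()
            store.setdefault(fp, block)
            prints.append(fp)
        if len(prints) == 1:
            # A single block at this level is the root of the file.
            return prints[0].hex()
        # Next level up: pointer blocks made of concatenated child fingerprints.
        level = [b"".join(prints[i:i + FANOUT]) for i in range(0, len(prints), FANOUT)]
```

Two files with identical contents always produce the same root fingerprint, and changing any byte changes the fingerprint of every block on the path up to the root.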

For the moment the type of a block is decided by a flag in the pointer block that points to it. This is a bit fragile since there is a limited amount of space for flags and thus the scheme could become unwieldy at some point in the future.

Since blocks are context-neutral they can be shared by different instances of the WOD running on the same machine.

The blocks are stored on the disk contiguously without any specific order (this sucks because retrieval is via random-access only). To speed up block retrieval, an index maps fingerprints to offsets in the data "file". This index can be rebuilt from the data "file" at any time and so it is not crucial to the functioning of the system. Theoretically the index could be stored within the system itself and loaded after the system is initialized. In this way the bootstrap routines would be a bit slower but once the index was loaded the whole system would speed up.
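The relationship between the data "file" and its index can be sketched as follows. The record layout (a 4-byte length followed by the block) is an assumption made so that the index can be rebuilt by scanning; the post does not say how the real data file is laid out.

```python
import hashlib
import io
import struct

HEADER = struct.Struct(">I")  # assumed record layout: 4-byte length, then the block


def append_block(datafile, index, block):
    """Append `block` unless it is already stored; return its fingerprint."""
    fp = hashlib.sha1(block).hexdigest()
    if fp not in index:  # writes of blocks that already exist are discarded
        datafile.seek(0, io.SEEK_END)
        index[fp] = datafile.tell()
        datafile.write(HEADER.pack(len(block)) + block)
    return fp


def read_block(datafile, index, fp):
    """Fetch a block by fingerprint using the offset index."""
    datafile.seek(index[fp])
    (length,) = HEADER.unpack(datafile.read(HEADER.size))
    return datafile.read(length)


def rebuild_index(datafile):
    """Recreate the fingerprint -> offset index by scanning the data file."""
    index = {}
    datafile.seek(0)
    while True:
        offset = datafile.tell()
        header = datafile.read(HEADER.size)
        if len(header) < HEADER.size:
            return index
        (length,) = HEADER.unpack(header)
        index[hashlib.sha1(datafile.read(length)).hexdigest()] = offset
```

Any seekable binary file object works here; an io.BytesIO is enough to try it out, and rebuild_index returns the same mapping that append_block maintained along the way.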

Right now the big hit for performance is in what we call the chainer. This part of the system takes a contiguous stream of bytes and breaks it up into a tree of blocks. It would be really nice to figure out a way to derive the "chained fingerprint" (i.e. the root fingerprint for the processed file) from the initial data in order to be able to skip the chaining phase when it is not required. Already the system discards write hits for blocks that are pre-extant in the data "file"; it would be nice to bring that up a level so that entire files could be discarded if their contents were known (this could be trivially done by adding the original hash to the root block since the original hash of all files with the same root block would be identical).

Moral objections to medical procedures

Medical decisions should not be coloured by a doctor's personal beliefs

David Toub blogs about "compulsory abortion training" for OB/GYNs. He mentions that at Yale it is not compulsory for students to learn abortion procedures although they did have to learn how to handle post-abortion patients.

It sickens me to think that there are doctors in the world that may be (by hook or by crook) imposing their own moral views on their patients.

The idea that I would have to change doctors because mine refuses to perform a certain procedure on moral grounds seems to me like a direct violation of the Hippocratic Oath which in theory involves providing the best care possible for a sick patient. It must be said that the original oath contained a clause that specifically forbade abortion; but that clause is absent from most modern versions of the oath.

Should a Jewish doctor be allowed to refuse to perform a porcine heart implant on the grounds that they do not believe that a pig's heart should be in a human body? Should any doctor refuse to perform a blood transfusion because they believe that their blood is their own and that to accept another's blood is to import impurities?

Science and scientists have always been on the razor edge of this knife: can I refuse to employ my knowledge because of a known outcome that I personally find reprehensible?

In many cases the answer is "yes". Engineers can decide to not build bridges that they believe will cause environmental damage, atomic physicists can refuse to build a better bomb, etc. But doctors, even more so than other scientists, are in the tricky business of directly saving lives.

To say that a doctor should be able to pick and choose which conditions he treats would be like saying that a policeman could pick and choose whom he protects; or like saying that a fireman could decide to not put out a fire based on what business took place in the building; or like a teacher being allowed to pick and choose which of his students he teaches.

When your profession places you in a position of responsibility for the well-being of somebody else, I think that your decisions for the treatment of that person (or that person's property, in the case of the fireman) should be made based on the person's system of beliefs and not on your own. It's one thing for a doctor to advise against a certain operation (although the morality of the thing becomes very grey indeed); it's another entirely to relegate the patient to a lesser (or even just another) doctor just to satisfy one's own sense of moral values.

When somebody depends on you, putting them at risk to satisfy your own subjective reality can hardly be said to be the moral high ground.

Tuesday, May 17, 2005

Serendipity in RSS-land

In Jon Aquino's post Tip: Random-number generator for Firefox, he lists some of the things that he uses the random number for. He makes a choice between emailing a random contact or reading a page out of any one of a number of different sources. It would be kind of cool to have an RSS feed with a random page from a number of sources that would provide a little serendipity.

On a slightly related note, sometimes it's nice to let a couple of days go by without reading a given feed. It lets the juicy bits pile up so you can read them all at once.

Wednesday, May 11, 2005

Bitten by indentation

Python should raise a more appropriate exception when the indentation in a class definition is not consistent.

Python allows you to use either spaces or tabs to indent your code; but you can't use both. You can't mix and match. Now this is not a big deal (although it does sort of violate the "there is one way to do things" rule that Python seems to adhere to fairly consistently) but it does mean that if you cut and paste some code from another file you could be lining yourself up for problems.

The thing is that Python, for some reason, doesn't complain when your indentation is not consistent within the definition of a class; instead, it gives an error that has nothing to do with the actual problem. For example, the following code:


1 class t:
2 [tab]def f():
3 [tab][tab]x = 5
4 [space][space]print x
5
6 [tab]def x():
7 [tab][tab]pass
8
9 t()


Will raise a NameError and say that 'x' is not defined on line 4. I don't understand why this doesn't raise an IndentationError the same way it would outside the class definition.

Tuesday, May 10, 2005

RSS Reading in Firefox

I just installed the Firefox Bookmark Synchronizer. I have had Sage installed for a while but now I can actually have my feeds synchronized between home and the office. Next on the list is getting my browser running off my USB stick so I can literally take my browser anywhere I go.

Monday, May 09, 2005

The Whisper Campaign

Dave writes that he has had a change in philosophy but I'm not sure this is really a change but maybe more of a refinement of previous ideas: there are certain types of praise that are more suited to being given in private.

I think the key factor may be intimacy; if I have intimate praise for somebody, sharing it publicly may work against me more often than not. Public praise should focus on shared goals whereas private praise should focus on personal and/or intimate aspects of the relationship. When you tell somebody that they are "amazing" you are referring to some aspect of your relationship with that person that causes you to be amazed. This is not something that needs to be bandied about or else it loses its subtlety.

In social gatherings my girlfriend and I often spend a lot of time apart, flitting from one group of people to the next. We always manage to catch the other's eye however and deliver a small, subtle gesture of caring even across a crowded room.

I think private praise gives a feeling of belonging and closeness that is not achieved via public praise.

Thursday, April 28, 2005

Exception-based Switch-Case and the flavour of Python

There's been a lot of discussion around switch-case flow-control in Python lately. With this recipe, Zoran Isailovski weighs in with what I find is a nice, clean, pythonic way of handling switch statements. He references another recipe by Brian Beck that handles the switch with a for loop (very cool).

I notice a comment on Zoran's recipe that references the C2 wiki's page on not using exceptions for flow control. Now I spend most of my time switching between Java and Python and it's true that I would never dream of using exceptions to control the flow in my Java application, but it's different in Python. In standard Python, iterators raise an exception to stop iteration and this is a pattern that is fairly well-recognized in the community.
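For readers coming from Java, here is a small sketch of that pattern, written for Python 3 (where the method is spelled `__next__`; in the Python 2 of this post it was `next`). The Countdown class and the drain helper are made up for the example.

```python
class Countdown:
    """Iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):  # spelled `next` in Python 2
        if self.current <= 0:
            raise StopIteration  # the exception is the normal "done" signal
        self.current -= 1
        return self.current + 1


def drain(iterator):
    """Do by hand what a for loop does: call next() until StopIteration."""
    items = []
    while True:
        try:
            items.append(next(iterator))
        except StopIteration:
            return items
```

Nothing has gone wrong when StopIteration is raised; the for statement simply catches it and ends the loop, which is why exception-driven control flow feels ordinary in Python.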

All of which just points to the differences in flavour between Python and Java.

Lexical Analysis, Python-style

I've always wanted to be able to do lexical analysis Python-style and I've always had the feeling that it should be done with regexps instead of character by character. Now Jason Diamond and Fredrik Lundh have presented recipes for doing just that. I'll have to try out these techniques and see if a re-write of the parser for Founder makes things cleaner.
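To give a flavour of the approach, here is a generic sketch of a regexp-driven tokenizer. It is not a copy of either recipe, and the token set is hypothetical; the only trick is combining one named group per token kind into a single pattern and reading `lastgroup` to see which one matched.

```python
import re

# Hypothetical token set, chosen only for illustration.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
# One big alternation with a named group per token kind.
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, value) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

The regexp engine does the character-by-character work, so adding a token kind is a one-line change to the table.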

Wednesday, April 27, 2005

Embedding data in HTML documents with the "data" URL scheme

The "data" URL scheme allows you to embed data directly in an HTML document. By encoding the data in Base64 you can even embed binary data straight in the page. The following example illustrates this by embedding the source of an image directly in the image tag itself.


<img src="

AB6P612AAAAmklEQVR42uWWSw6AIAxE7VXg/keqV8FPE12AdUYgamDREFomfVBIJaU0jTSEA

xbZbI8zaqXs6OwuUdVbkRDjamfVY4IEUzHILmQ4OubygHPOzwIjGV4CO5y4bkNgSpAApjiR/

CgGZwVxEcD2ulnO/06gT6uylp5dbE1W5SoIoQFwZaL9gAvkgwCfygb8cqeV9wn9ehuTH621X

AArWx8EprTSmAAAAABJRU5ErkJggg==">


The code above produces the following graphic
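If you want to generate such a tag rather than paste it by hand, a minimal sketch in Python (the function names are made up for this example) could look like this:

```python
import base64


def to_data_url(payload, mime_type):
    """Encode raw bytes as a Base64 "data" URL (RFC 2397)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return "data:%s;base64,%s" % (mime_type, encoded)


def img_tag(payload, mime_type="image/png"):
    """Return an <img> tag whose source is embedded in the tag itself."""
    return '<img src="%s">' % to_data_url(payload, mime_type)
```

Pass in the bytes read from an image file opened in binary mode; the result can be dropped straight into a page.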

Monday, April 18, 2005

How well do you know Python -- Name mangling

In his article Spyced: How well do you know Python, part 1, Jonathan Ellis points out the danger of using exec. This principle shows up in inheritance too if you're not careful (or you come from Java):


>>> class A:
...     def __init__(self):
...         self.__x = 5
...     def getx(self):
...         return self.__x
...
>>> class B(A):
...     def getx(self):
...         return self.__x + 5
...
>>> b = B()
>>> b.getx()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 3, in getx
AttributeError: B instance has no attribute '_B__x'
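What happens is that any name starting with two underscores inside a class body is rewritten to include the name of the class it appears in, so the attribute set in A is really `_A__x` while B goes looking for `_B__x`. A sketch of one way around it, delegating to the parent's accessor:

```python
class A:
    def __init__(self):
        self.__x = 5  # stored on the instance as _A__x

    def getx(self):
        return self.__x  # compiled as self._A__x


class B(A):
    def getx(self):
        # Inside B, self.__x would be compiled as self._B__x, which does not
        # exist. Delegating to A (or using a single underscore) avoids that.
        return A.getx(self) + 5
```

Looking at `vars(B())` shows the mangled key `_A__x`, which makes the rewriting easy to see for yourself.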

Friday, April 15, 2005

Duck Typing dilemmas - A strawman for "pure" Python interfaces

The Duck Typing debate is less about type-checking and more about expectations

Cedric Beust blogs about The Perils of Duck Typing and he's not alone. Many people have been talking about dynamic and static typing for quite a while now and it's very close to being an Emacs vs. vi argument.

I don't think the problem here is really the absence of static typing but more the absence of "good" design or reasonable expectation management. Let me give an example (so you can make fun of me if I'm wrong).

In Python, a number of methods and functions take "file-like objects" as parameters. Now some of these methods actually want all the methods that files implement and others are content with just having read and readline. This interface is obviously a little hard to code to since in order to implement it you need to read the specific requirements of the method/function you're calling.

Now the only thing wrong with this whole setup is that you really have no idea what methods to implement and as Cedric points out, you may omit the implementation of a method that is not called directly from the function you're calling. So here's my strawman proposal to those naysayers who demand more static typing: just implement an empty base class that raises NotImplementedError from your methods. Note that in Python this class wouldn't even need to be bundled with the base distribution since Duck Typing works whether you like it or not.

In the function below it doesn't matter what type the fileobject parameter is. It will work fine with a file from the open or file functions as well as with any other iterable object.


def line_count(fileobject):
    count = 0

    for i in fileobject:
        count += 1

    return count


Now let's assume that we want to make it explicit that you need to pass an iterable object. Most pythonistas will tell you to re-write the function like this.


def line_count(iterable_object):
    count = 0

    for i in iterable_object:
        count += 1

    return count


Now that's nice and clear. We can tell (as long as we have two brain cells to rub together) from the name of the argument what the type should be. Of course then the whole discussion takes a nasty swerve towards the "Hungarian notation debate" side of things.

Putting the notion of encoding the type in the argument name aside we can still provide a nice way to show what type is expected. All we have to do is write our function like this.



class Iterable(object):
    def next(self):
        # Subclasses are expected to override this.
        raise NotImplementedError

    def __iter__(self):
        return self

def line_count(itobj):
    """Pass me an instance of Iterable, please."""

    count = 0

    for i in itobj:
        count += 1

    return count


Now, because Python supports multiple inheritance, we can just use the Iterable "interface" as a mixin. It doesn't really matter whether we use it or not because the methods that it defines are clearly described and listed in the class definition. In addition, the function will still accept non-Iterable objects that adhere to the "Iterator types" description in the Python Library Reference. It's just that designing like this makes it simpler on the client to use the library you've developed. Note also that any class that implements the "Iterator types" set of methods could also just mix in the Iterable class for the hell of it.
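Here is a sketch of what mixing in the "interface" might look like, written for Python 3 (so the method is `__next__` rather than `next`). The Lines class is a hypothetical client class invented for the example.

```python
class Iterable(object):
    """Documents the iterator protocol; subclasses override __next__."""

    def __next__(self):  # `next` in Python 2
        raise NotImplementedError

    def __iter__(self):
        return self


class Lines(Iterable):
    """Hypothetical client class that mixes in Iterable."""

    def __init__(self, text):
        self._lines = iter(text.splitlines())

    def __next__(self):
        # StopIteration from the underlying iterator ends the loop.
        return next(self._lines)


def line_count(itobj):
    """Pass me an instance of Iterable, please."""
    count = 0
    for _ in itobj:
        count += 1
    return count
```

The function accepts a Lines instance and a plain list equally happily; the base class only documents what is expected.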

I know that this seems like a lot of flailing for nothing (since there is still no type-checking until runtime at which point it still happens using Duck Typing) but the point here is that I think the Duck Typing argument is a design argument and not a dynamic vs. static typing argument. You can implement Java-like interfaces in a dynamic language, they just won't be checked at compile-time; but maybe that's OK. The real goal is to: a) make sure that the implementor knows all the methods they need to support and b) make sure that you are not passing the wrong instance by mistake to your function. I think that by adhering to a few simple design guidelines both of those goals are pretty easy to accomplish.

NTP on Windows - Keeping up with the time

You can easily keep your local clock synchronized on recent versions of Windows

The NTP (Network Time Protocol) allows you to synchronize your local time to the time maintained by a set of servers. All those servers are fed (eventually, somewhere up the chain) by a clock of incredible precision (no, not Big Ben). UNIX users have had this feature since its creation and now it's time (no pun intended) for Windows users to have their day in the sun.

Just go to your start menu, select "Run..." and type "cmd" into the little text box. Click OK and type this line into the resulting command line:


net time /setsntp:pool.ntp.org


Your clock should now remain synchronized and up-to-date until the end of time.

Friday, April 08, 2005

Version 0.1.0 of WODFS Released

We finally released a version of the WOD. It still requires a tremendous amount of effort for others to get it to run (you need to install a bunch of dependencies) but we wanted to "release early, release often". The next release should contain a couple of improvements such as Windows support, limited WebDAV support, XMLRPC support and a re-structuring of some of the filesystem level code.

Wednesday, April 06, 2005

Podcasting meets blogging

Jon Aquino has started playing with an automatic blog to podcast script that looks pretty cool.

Tuesday, April 05, 2005

One box where two will do

By adding a keyword search for Google in Firefox you can remove the need for the extra Google search box

All you have to do is go to http://www.google.com and right-click on the form. Select "Add a keyword for this Search" and fill in the boxes in the popup. You can now use the keyword you just created in the address bar instead of using the right-hand-side Google search box.

Saturday, April 02, 2005

Reading between the lines

Reality can never be communicated, only experienced.

Words can never cause another person to experience anything. Only the receiver can perform the act of experiencing. We speak in order to draw smaller and smaller circles around the experience we are trying to provoke in the other person. When the circles are as small as we can go, we have a reasonable assurance that the other person has experienced the same thing that we have.

Writing (or saying) something using a lot of words allows us to refine the various possible meanings of the communication to a point where we not only have been able to express ourselves (i.e. we feel like we have accurately represented the experience) but also have been understood (i.e. we feel that the recipient has accurately reproduced the experience for themselves). This is easy. The way to do it is to start talking (or writing) and not stop until the other person is able to communicate back that they have experienced the same thing.

When we read religious or spiritual texts we often are confronted with the fact that the text is much shorter than what we really need in order to be able to reproduce the author's experience. This is simply because it would be impossible to expect an author to be able to write a text that could be understood by anybody at any time in any situation.

Writers of these texts therefore resort to a form of compression that allows them to accomplish two goals.

The first goal is to express themselves in a form that can be understood. It is not important how long it takes for the understanding to occur and indeed many of the authors (of these texts) that are read today have been dead for some time. The notion that it is irrelevant how much time and effort are required to understand the text is an interesting one since it underlines the fact that it is the understanding of the communication that is the ultimate goal.

The second goal is to allow the communication of the concept to be passed on even by those who do not understand it. In order for this to be possible it is important for the concept to be expressed simply. If the simplicity of the expression is great enough then even the greatest fools of the earth will propagate the message until such time as someone with the capacity to understand it can hear it.

So we have two conflicting goals: understanding and simplicity. How then are we to formulate a concept that is inherently complicated and deep in a way that is simple? Recall that simple in this context refers to the complexity of the message, not the complexity of the concept.

The Tao Te Ching expresses complicated concepts in simple verse. The Bible expresses the truth in the words of Jesus via parables. A highschool student could learn and recite each of these stories with little difficulty. However without the key to what is locked inside, they remain opaque and confusing.

The key is to "explode" the concepts hidden within each story; to tease each sentence apart and to push our understanding of each aspect of the work to its limit. We need to read between the lines. Unfortunately, there are not very many lines so there appears to be a huge number of possible interpretations for each of these esoteric texts. The trick then is to draw as many lines as we can so that the space between them becomes smaller and smaller until, one day, perhaps, we will experience reality.

Wednesday, March 30, 2005

The saga of USENET: The insanity continues

This is beautiful. A thread about acquaintance spam spiraled off into a furious bout of invective and harsh words. It amazes me that people who should know better don't. Now it may be true that somebody "started it", but at least they had the tact and intelligence to back off and not involve themselves in the comment war that ensued.

I just found it hilarious that a supposedly high signal/noise ratio medium (blogging/RSS/contemporary "collaboration" tools) can be reduced to a tangled USENET thread in less time than it takes to click on the "BlogThis" bookmarklet in your toolbar.

Saturday, March 26, 2005

Python typing part II

Again, I still don't understand why we really need this type checking stuff. It seems to me that since we are adding type checking that will be done at runtime anyway the whole thing is a bit of a moot point. I may be missing something critical here but AFAIK the BDFL is not considering modifying the compiler/interpreter to handle type-safety. Why not just have a library with decorators that allows you to do type checking?

I've hacked together a strawman typechecking decorator that I would appreciate comments and flames on. I'm just trying to get a grip on what the issue with keeping the language small and consistent is.
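To show the general shape of the idea, here is a minimal sketch of a decorator that checks positional argument types at call time. It is an illustration written for this purpose, not the strawman linked above, and the name `accepts` is an assumption.

```python
import functools


def accepts(*types):
    """Check positional argument types when the decorated function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            if len(args) != len(types):
                raise TypeError("%s expects %d arguments" % (func.__name__, len(types)))
            for position, (arg, expected) in enumerate(zip(args, types)):
                if not isinstance(arg, expected):
                    raise TypeError(
                        "argument %d of %s must be %s, got %s"
                        % (position, func.__name__, expected.__name__, type(arg).__name__)
                    )
            return func(*args)
        return wrapper
    return decorator


@accepts(int, int)
def add(a, b):
    return a + b
```

All of the checking lives in an ordinary library function, so nothing about the language itself has to change.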

Friday, March 25, 2005

Static typing and Interfaces in Python

OK. So here's the obligatory weigh-in on static typing in Python. I don't see why we need to introduce a new syntax for something that could be handled very cleanly with decorators (and I'm sure many people have argued this point).

I also choke on the whole "Interface" thing. We don't need those either. Just create a class with a bunch of raise 'Not Implemented' statements for all the method implementations. Anybody subclassing the class would be forced to implement the methods. We could even have a meta-class for it that checks that all the methods are implemented - again, no need for extended syntax.
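A sketch of what such a meta-class could look like, using Python 3 syntax. The names InterfaceMeta, abstract and FileLike are hypothetical; the check runs when a class is instantiated and refuses classes that still carry unimplemented interface methods.

```python
class InterfaceMeta(type):
    """Refuse to instantiate classes that leave interface methods unimplemented."""

    def __call__(cls, *args, **kwargs):
        # Collect every attribute still carrying the marker set by @abstract.
        missing = [
            name for name in dir(cls)
            if getattr(getattr(cls, name, None), "_abstract", False)
        ]
        if missing:
            raise TypeError(
                "%s must implement: %s" % (cls.__name__, ", ".join(sorted(missing)))
            )
        return super().__call__(*args, **kwargs)


def abstract(func):
    """Mark a method as part of the interface (hypothetical marker)."""
    func._abstract = True
    return func


class FileLike(metaclass=InterfaceMeta):
    @abstract
    def read(self):
        raise NotImplementedError

    @abstract
    def readline(self):
        raise NotImplementedError


class OnlyRead(FileLike):
    def read(self):
        return ""


class Complete(OnlyRead):
    def readline(self):
        return ""
```

Instantiating Complete works, while FileLike and OnlyRead raise a TypeError that names the missing methods, all without any new syntax.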

One of the things I like most about Python is its consistency: things usually work the same way. Let's keep it nice, simple and predictable.

On a slightly related note I see that the BDFL is considering wiping out map, filter and reduce. I applaud this since the list comp syntax neatly handles these functions' roles.

Friday, March 11, 2005

Claiming my blog at blogshares

So I joined Blogshares to see what the fuss is about.

Listed on BlogShares

Wednesday, March 09, 2005

Reading the fine print

Reading the printer-friendly version of articles removes distractions.

To my great chagrin a number of signal-heavy sites now have ads scattered all over their articles, breaking up the flow and generally getting in the way. Not to mention the Flash ads that whirl and blink or the DHTML ads that seem to want to take over the entire screen. I'm looking at you IBM, and ITworld.com and cnet.com and all the others that for years have had great technical/business/whatever-you-want articles but who now waste my time and visual bandwidth with their crappy, non-targeted ads.

No more.

From now on I will be reading only the "printer-friendly" version of these articles. This version contains no ads (well, there may be some but they are usually at the top or bottom) and in general has a better layout since they have removed the ugly navigation that these providers have managed to plaster all over 2/3 of my screen.

On a related note, check out Greasemonkey - a Firefox extension that allows you to run custom DHTML for each site you visit. By default this extension ships with a script that disables Google ads. When I say disables, I actually mean hides since the original GET to the server actually returns the HTML page with the ads built in but Greasemonkey strips them out on the client side. So your favourite blogger is still reaping the benefits of the ad bar but you don't have to suffer the indignity of giving up screen real-estate.

This is so wrong!

Now I'm not usually a namby-pamby "love the animals" kind of guy; and I've eaten (and enjoyed) my fair share of rabbits, but this is just too much.

Introducing the main idea

Start each article/post with a single phrase that briefly describes the idea being communicated.

I've been reading Dave Pollard's excellent weblog for a little while now and I've noticed that at the beginning of each article he writes "The Idea" followed by a description of what he wants to communicate in the article.

This is a great tool for focussing the contents of an article and it also allows readers to read the full story with the basic idea in mind. It provides a sort of scaffolding for the rest of the article so that additional ideas can glom onto the original one to create a (hopefully) coherent whole.

Monday, March 07, 2005

Automatic filing with del.icio.us

From Otaku, Cedric's weblog: Automatic filing with del.icio.us. I made a couple of modifications to allow the launch to happen in another window. Drag this bookmarklet to your toolbar to apply the toread tag to a page.

Cedric also made a delete from del.icio.us bookmarklet.

Wednesday, March 02, 2005

Online medical records - strawman

David Toub writes that the three main obstacles to having online medical records for patients are
  1. Concerns about security (getting better, but reports of major academic institutions being hacked don't help. Regardless, EMRs are still inherently more secure than paper records)

  2. Cost (big issue-a small practice just doesn't have $20k to blow on a new system)

  3. Startup: How to input thousands of paper-based records into an EMR fast and inexpensively
So let's take a look at these items one by one.

Security concerns

It is normal, I think, in this day and age, to worry about the accessibility of your data online. With so many scandals and so much spam and hacking it is easy to become overly anxious about exposing your information to a potentially hostile environment. The answer to this is a liberal application of strong encryption. I'm not talking about weak SSL with its pansy 128 bit keys (many implementations of which have failed to demonstrate that there is no redundancy in the key bits), I'm talking about public key cryptography combined with some sort of DRM tool that would allow doctors to access a patient's information once, in the controlled context of a visit.

I recently discovered a feature in PGP that allows you to encrypt a text document to a built-in viewer that is distributed with PGP and that in addition to requiring you to enter the appropriate passphrase for the key also displays the text in a non-copyable, tempest technology resistant window. Something like this could be used so that a patient would be able to authorize a given physician to access their data once (and only that one time). The physician would have full access to the patient's file for the duration of the consultation but afterwards they would be unable (even for their own purposes) to retrieve information other than
  1. The fact that the given patient visited them
  2. The treatment that they provided
  3. Perhaps (but not necessarily) a general description of the problems that the patient presented with
Obviously, it would be important to allow physicians to update the patient's records in such a way that the next time the patient visited, the physician would be able to view their own notes on the previous treatment.

The key point here is that it is definitely possible to build a system where the patient controls all access to their information. This would also limit physicians' liability in certain situations. For example, staff from a doctor's office would be categorically unable to share any details about patients other than their name (or other identifier) and the fact that they visited. It would even be possible to visit a doctor anonymously. As long as the physician has access to your records, they don't really need to know who you are. This would encourage people to use these records for things that could be traced back to them disfavourably such as abortions, HIV/AIDS treatments and so on.

Cost of a new system

In order for the system described above to really work it would need to be de-centralized and managed by each patient individually. Any attempt to centralize the system would probably open the door for abuse (if for no other reason than administrators of the central system could track usage statistics in a non-anonymous way).

I don't think that physicians would need to pay anything at all to use the system (other than the fees for a computer and an internet connection).

We have seen from systems like blogging, FOAF, bookmarks and so on that privately owned and managed resources can still be shared profitably.

Cost of inputting older records

I don't think this is really a big deal either. The fact of the matter is that right now (most) medical records are maintained by each healthcare institution individually. When a patient goes for X-Rays or for blood tests or any other tests, the testing body needs to send the results of the tests back to the physician. How easy is it really for a given physician to find out everything about the patient that is presenting? Not very, I'll wager. This is mainly due to poor organization and not any malice on anybody's part.

There isn't really a need to input all prior records at once. It would (or should anyway) be enough to input them as they become pertinent to the current activities of the patient.

For example, let's say that a patient had a colonoscopy at the request of their physician last year. Whatever condition prompted the test was resolved weeks after the test. The results of that test (and even to a certain extent the fact that it was performed) are of no import the next year when the patient presents with a broken wrist. In that situation, the physician would simply enter the new information without worrying about the lack of the results of the colonoscopy (never mind the fact that the patient should be able to selectively allow the physician to only access the parts of their record that are pertinent).

If on the other hand the same patient presented with bloody stool and the physician wanted to order a fecal occult blood test they should be (perhaps verbally) made aware of the colonoscopy the previous year. At that point the physician would order the results from the lab (or get them from the patient). So far I think this is how things work today. The difference is that once the test results are obtained the physician would enter the new information into the system.

In this way the system would be built up incrementally over time and would not require a huge amount of up-front investment.

Tuesday, March 01, 2005

The future flavour of webservices

You've probably heard about webservices: they're those things that use SOAP, XML-RPC and a whole alphabet soup of technologies better left untouched. Another acronym to know is REST. It is the red-headed stepchild of webservices - the technology that IBM, Microsoft, BEA and all the others don't want you to know about. Why? Because it exists already. Your browser already uses REST to browse the web. Certain tools (like WebDAV) use REST to store and modify resources on the web. And where would we be without POSTs from forms?

SOAP and XML-RPC are both suites that allow you to exchange XML-encoded messages with other hosts on the network. Let's say you want to find all the top scoring players in the NHL. You prepare an XML document that describes your request and you POST it to a specific host that you know will actually respond. The URL you post to is called the "endpoint" and theoretically represents a resource or process. The host returns another (specially prepared) XML document that (once you un-wrap it) contains the list of players.

In REST things are much simpler. All you do is GET the URL. No fancy documents or anything. GET the URL and the server returns the list.
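To make the contrast concrete, here is a rough sketch in Python of what each side has to prepare before anything goes over the wire. The endpoint, the method name stats.topScorers and the resource URL are all made up for illustration; no such service exists as far as I know.

```python
import xmlrpc.client
from urllib.parse import urlencode

# Hypothetical endpoint and resource: neither is a real service.
ENDPOINT = "http://stats.example.com/RPC2"
RESOURCE = "http://stats.example.com/nhl/players"


def xmlrpc_request_body(league, limit):
    """Build the XML document an XML-RPC client would POST to ENDPOINT."""
    # "stats.topScorers" is an invented method name.
    return xmlrpc.client.dumps((league, limit), methodname="stats.topScorers")


def rest_url(limit):
    """Build the URL a REST client would simply GET."""
    return RESOURCE + "?" + urlencode({"sort": "points", "limit": limit})


body = xmlrpc_request_body("NHL", 10)
url = rest_url(10)
print(body)  # a <methodCall> document wrapping the method name and parameters
print(url)   # just a URL
```

The XML-RPC side produces a whole document that the server has to unwrap; the REST side produces nothing but a URL that names the list.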

What is interesting however is the difference in viewpoint between what are now being called the WS-* camp (the SOAP and XML-RPC guys) and the RESTafarians.

The RESTafarians are very focussed on the semantic nature of URLs, often talking about the relationship between URLs and the resource representations they refer to. WS-* guys often talk about how the latest envelope spec can be enhanced.

In case you're wondering, I'm pretty much sold on RESTafarianism although I do like XML-RPC since it has a much simpler design/interface than SOAP.

Here is a very cogent introduction to the principles and philosophies of REST: http://naeblis.cx/rtomayko/2004/12/12/rest-to-my-wife

Here is the home of XML-RPC: http://www.xmlrpc.com

Here is the home of the SOAP working group: http://www.w3c.org/2000/xp/Group/

Saturday, February 26, 2005

Security in the WOD's networking layer

When you use the WOD, you are connected to the global network of WOD users. You immediately begin exchanging data packets with all your peers on the network to increase redundancy for your packets and to ensure that your data is stored permanently. In addition, clients may request specific packets at any time - requests that must be replied to if the network is supposed to work properly.

In order to guarantee a certain level of privacy for each WOD user, we use a packet-level encryption scheme. This scheme needs to allow the WOD's similarity-based capabilities to continue working and therefore encrypting a given cleartext packet must produce the same ciphertext for all clients that perform the encryption. This is of critical importance since the benefits gained from the self-similarity of data in the network are lost immediately if there are two possible ciphertext results for the same cleartext.

To circumvent this issue, each packet is encrypted with its own MD5 hash. This guarantees that every client encrypting a given packet will encrypt it exactly the same way. Pointer blocks/packets maintain both the fingerprint of the packet (the SHA-1 hash of the cleartext) as well as the password of the packet (the MD5 hash of the cleartext). The security of MD5 is not really an issue here since collisions in the hashing function do not significantly reduce the strength of the generated password.

Packets are encrypted using AES - a symmetric encryption algorithm. This is because the key for a given encrypted block is determined by the block's data itself. Generating public/private keypairs might enhance security but would make things a lot more complicated to implement.
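A minimal sketch of how the fingerprint and password for a packet could be derived, using only Python's hashlib. The function names are mine, not the WOD's, and the AES step itself is left out because the standard library has no AES; the 16-byte MD5 digest would be handed to whatever AES implementation is used as a 128-bit key.

```python
import hashlib


def packet_fingerprint(cleartext: bytes) -> bytes:
    """SHA-1 of the cleartext: the name under which the packet is stored."""
    return hashlib.sha1(cleartext).digest()  # 20 bytes


def packet_password(cleartext: bytes) -> bytes:
    """MD5 of the cleartext: 16 bytes, the size of a 128-bit AES key."""
    return hashlib.md5(cleartext).digest()


def pointer_entry(cleartext: bytes) -> tuple:
    """What a pointer block would keep for one packet (hypothetical layout)."""
    return (packet_fingerprint(cleartext), packet_password(cleartext))


# Two clients holding the same cleartext derive exactly the same entry.
client_a = pointer_entry(b"the same block of data")
client_b = pointer_entry(b"the same block of data")
print(client_a == client_b)
```

Because both values depend only on the cleartext, no key exchange is needed between clients for them to agree on the ciphertext.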

This encryption scheme is not proof against chosen plaintext attacks: if an attacker has a file and wants to prove that you have the file too, they are able to, regardless of the encryption. This does mean, however, that only people with the same files as you can prove you have those files. Users would be well advised to employ third-party (higher level) encryption for their files if additional security is needed at this level. The rationale for not implementing per-user security is that, for the most part, as long as I own the same file as you, I don't care if you see that I own it.
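The weakness is easy to illustrate. In this sketch I assume, for simplicity, that a file fits in a single packet and that the attacker can somehow see which fingerprints a node holds or serves; both are my assumptions, not statements about how the WOD exposes data.

```python
import hashlib


def fingerprint(cleartext: bytes) -> bytes:
    """SHA-1 of the cleartext, as used for packet fingerprints."""
    return hashlib.sha1(cleartext).digest()


def can_confirm(candidate_file: bytes, observed_fingerprints: set) -> bool:
    """True if someone holding candidate_file can show the node has it too."""
    return fingerprint(candidate_file) in observed_fingerprints


# Fingerprints the attacker has seen associated with some node (made-up data).
observed = {fingerprint(b"my tax return"), fingerprint(b"holiday photo")}

print(can_confirm(b"my tax return", observed))   # attacker owns this file
print(can_confirm(b"something else", observed))  # attacker's file is not there
```

Someone who does not already have the file learns nothing from the fingerprint alone, which is the trade-off described above.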

Tuesday, February 22, 2005

ISBN Linker Bookmarklet

Drag and drop the [ISBN Linker] into your toolbar and click it to transform ISBN numbers in the current page into links to Amazon.com.

Thursday, February 17, 2005

The WOD's network design

Each system on the network is a node and each node is connected to zero or more peers.

Nodes


When a node is connected to fewer peers than its threshold, it broadcasts a packet onto a multicast address with the port on which it would like to receive data. It also listens on the multicast address and picks up advertisements for other nodes until it has reached its threshold. Once the threshold is reached the node stops advertising itself and stops listening on the multicast IP. If the number of connected peers ever drops below the threshold, the node will begin to advertise itself and listen again for other peers.
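The advertise/listen rule boils down to a small piece of state. Here is a sketch with no real networking in it; the class and method names are invented for illustration.

```python
class Node:
    """Tracks peers and decides whether to advertise on the multicast address."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.peers = set()

    @property
    def advertising(self):
        # Advertise (and listen) only while below the peer threshold.
        return len(self.peers) < self.threshold

    def on_advertisement(self, peer_address):
        # Advertisements are only picked up while we are still listening.
        if self.advertising:
            self.peers.add(peer_address)

    def on_disconnect(self, peer_address):
        # Dropping below the threshold turns advertising back on automatically.
        self.peers.discard(peer_address)


node = Node(threshold=2)
node.on_advertisement(("10.0.0.2", 4000))
node.on_advertisement(("10.0.0.3", 4000))
node.on_advertisement(("10.0.0.4", 4000))  # ignored: threshold already reached
print(len(node.peers), node.advertising)
```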

Peers

Once a packet is received from a prospective peer, the node examines the requested port and then connects to the peer on that port. Packets sent by any peer are gathered together and sent into a central pipe.

The central pipe

The central pipe of packets coming in from the network is watched by a series of functions that have registered to be notified of incoming packets. Each watcher may operate on a packet before it is passed on to the next watcher.
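A minimal sketch of the pipe, assuming that a watcher is just a function that takes a packet and returns the (possibly modified) packet for the next watcher. Packets are plain dicts here purely for illustration.

```python
class CentralPipe:
    """Passes each incoming packet through the registered watchers in order."""

    def __init__(self):
        self.watchers = []

    def register(self, watcher):
        self.watchers.append(watcher)

    def deliver(self, packet):
        for watcher in self.watchers:
            # Each watcher may operate on the packet before the next one sees it.
            packet = watcher(packet)
        return packet


seen = []


def stamp(packet):
    # Example watcher that annotates the packet.
    return dict(packet, stamped=True)


def record(packet):
    # Example watcher that only observes the packet.
    seen.append(packet)
    return packet


pipe = CentralPipe()
pipe.register(stamp)
pipe.register(record)
result = pipe.deliver({"type": "request", "fingerprint": "abc"})
print(result)
```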

By default, a watcher is set to pick up request packets and handle them. If the cache contains the fingerprint requested, the data for the fingerprint is packaged up and the node connects straight back to the requestor to deliver the data. If the cache does not contain the fingerprint, the request is rebroadcast to a single peer. In this way requests move across the graph of nodes and are replied back to their sources once they can be fulfilled.
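The request handling could look roughly like this. The deliver and forward callbacks stand in for the real network calls, and picking the first peer in the list is my own placeholder, since how the single peer gets chosen is not decided yet.

```python
def handle_request(fingerprint, cache, peers, deliver, forward):
    """Answer a request from the cache, or pass it on to a single peer."""
    if fingerprint in cache:
        # Connect straight back to the requestor with the data.
        deliver(cache[fingerprint])
        return "delivered"
    if peers:
        # Placeholder choice: forward to the first known peer.
        forward(peers[0], fingerprint)
        return "forwarded"
    return "dropped"


delivered, forwarded = [], []
cache = {"fp-1": b"block one"}
peers = ["peer-a", "peer-b"]

hit = handle_request("fp-1", cache, peers, delivered.append,
                     lambda peer, fp: forwarded.append((peer, fp)))
miss = handle_request("fp-2", cache, peers, delivered.append,
                      lambda peer, fp: forwarded.append((peer, fp)))
print(hit, miss)
```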

This means that the requesting of a fingerprint on the network acts a little like a future (we should probably have explicit support for that).

Another default watcher receives data packets and processes them.

Data Packet Processing

Each data packet is picked off the network and examined; if the TTL (a stupid name but I'll explain it in a second) is 1 then the data is stored on the node and not re-transmitted. If the TTL is greater than 1 then the node stores the data and re-transmits the packet after decrementing its TTL. In this way data blocks get stored on several nodes in the network to provide redundancy. Each packet will be stored by a number of nodes equal to the TTL set by the original emitter. So if a packet is emitted with a TTL of 5, it is guaranteed to be stored on 5 nodes other than the original before the packet is removed from the network and no longer broadcast.
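Here is a toy simulation of the rule. It assumes each node re-transmits to exactly one peer and that the chain of peers never loops back on itself; the real network would have to deal with both, and the names are made up.

```python
class StorageNode:
    """Stores every data packet it receives and forwards while TTL > 1."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next_peer = None  # simplification: a single downstream peer

    def receive(self, fingerprint, data, ttl):
        self.store[fingerprint] = data
        if ttl > 1 and self.next_peer is not None:
            # Re-transmit after decrementing the TTL.
            self.next_peer.receive(fingerprint, data, ttl - 1)


# A chain of seven nodes other than the emitter.
nodes = [StorageNode("node-%d" % i) for i in range(7)]
for upstream, downstream in zip(nodes, nodes[1:]):
    upstream.next_peer = downstream

# The emitter sends the packet to its first peer with a TTL of 5.
nodes[0].receive("fp-1", b"some block", ttl=5)

holders = [n.name for n in nodes if "fp-1" in n.store]
print(holders)
```

With a TTL of 5 the packet stops at the fifth node in the chain, so the last two nodes never see it.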

Wednesday, February 16, 2005

SHA-1 Broken

From Schneier on Security: SHA-1 Broken:

SHA-1 has been broken. Not a reduced-round version. Not a simplified version. The real thing.

The research team of Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu (mostly from Shandong University in China) have been quietly circulating a paper describing their results:

  • collisions in the full SHA-1 in 2**69 hash operations, much less than the brute-force attack of 2**80 operations based on the hash length.
  • collisions in SHA-0 in 2**39 operations.
  • collisions in 58-round SHA-1 in 2**33 operations.

This attack builds on previous attacks on SHA-0 and SHA-1, and is a major, major cryptanalytic result. It pretty much puts a bullet into SHA-1 as a hash function for digital signatures (although it doesn't affect applications such as HMAC where collisions aren't important).

This is pretty annoying since the WOD uses SHA-1 to calculate the fingerprints of the blocks in the system. We pretty much depend on the fact that the SHA-1 algorithm requires 2**80 comparisons in order to generate a data block whose fingerprint collides with another block's. The reduction in the overall namespace for fingerprints is a real pain. I'm already starting to look for another option that will fit in 160 bits and will have the original unrestricted namespace.

It also occurs to me at this point that we will need to build in some support for changing the hash algorithm used for fingerprints to allow for eventual upgrades to the hashing algorithms we are using. Maybe we need to add a couple of bits to a fingerprint that we can use to specify the algorithm used to generate the fingerprint. Of course, this would add a certain amount of overhead to the whole operation since we would have to decide which algorithm to use each time. We also run into the problem of figuring out how/when to change the algorithm, in other words, how do we figure out that a given algorithm is generating collisions now?
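One way the tagging could work, as a sketch only: prefix each fingerprint with an identifier for the algorithm that produced it. I use a whole byte rather than a couple of bits to keep the code simple, and the identifier values and the choice of SHA-256 as the second algorithm are arbitrary.

```python
import hashlib

# Hypothetical algorithm identifiers; the values are arbitrary.
ALGORITHMS = {0x01: "sha1", 0x02: "sha256"}


def make_fingerprint(data: bytes, algorithm_id: int = 0x01) -> bytes:
    """Return one tag byte followed by the digest of the data."""
    digest = hashlib.new(ALGORITHMS[algorithm_id], data).digest()
    return bytes([algorithm_id]) + digest


def verify_fingerprint(data: bytes, fingerprint: bytes) -> bool:
    """Recompute the digest with whichever algorithm the tag byte names."""
    algorithm_id = fingerprint[0]
    if algorithm_id not in ALGORITHMS:
        raise ValueError("unknown fingerprint algorithm: %d" % algorithm_id)
    return make_fingerprint(data, algorithm_id) == fingerprint


old_style = make_fingerprint(b"a block", 0x01)
new_style = make_fingerprint(b"a block", 0x02)
print(len(old_style), len(new_style))
```

Old and new fingerprints can then coexist on the network, at the cost of the extra tag and a lookup on every verification. It does nothing to answer the harder question of when to switch.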

Friday, February 04, 2005

Tip for URLs in Podcasts

Try using TinyURL to create URLs for all the links you want to give out in the program.

It would be nice to have a service similar to TinyURL that would produce mnemonic (pronounceable) URLs as well as returning the same "tiny" URL for a given URL every time it is entered.
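Both properties can be had by deriving the short name from a hash of the URL. This is only a sketch of the idea: the alphabet, the slug length and the use of SHA-256 are my own choices, and a real service would still need a lookup table to map slugs back to URLs and to deal with collisions.

```python
import hashlib

CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"


def mnemonic_slug(url: str, syllables: int = 4) -> str:
    """Derive a pronounceable consonant-vowel slug from the URL's hash."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    parts = []
    for i in range(syllables):
        # Two digest bytes per syllable: one picks the consonant, one the vowel.
        parts.append(CONSONANTS[digest[2 * i] % len(CONSONANTS)])
        parts.append(VOWELS[digest[2 * i + 1] % len(VOWELS)])
    return "".join(parts)


first = mnemonic_slug("http://www.example.com/some/very/long/path?id=12345")
second = mnemonic_slug("http://www.example.com/some/very/long/path?id=12345")
print(first, first == second)
```

The same URL always yields the same slug, and the alternating consonants and vowels make it something you can say out loud in a podcast.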

Thursday, February 03, 2005

who's important? who cares?

David Toub asks who's important? who cares?

I think that people always want to categorise things; it's part of our basic pattern recognitory mindset. When we try to classify, however, we often get caught up in our own constructs. I'm talking here about hierarchical classification versus categorization versus any other method of classifying. A debate is raging right now about "tagging" given the proliferation of services such as flickr and del.icio.us. The debate centers around the idea that tagging provides a flat namespace that gets corrupted/polluted when the same tag can have multiple semantic values. Tim Bray gives the example of military "drills" versus oil "drills". Because the system that we invented (tagging in this case) is built in a certain way it constrains us to act (and even think sometimes) in a way that is congruent with its physical design.

I was talking to a friend the other day and he mentioned that a guy we have known for years and years (and never particularly liked) happened to be around and started beatboxing (vocal percussion where you make the sound of drums and cymbals with your mouth). It turns out that the guy was quite good at it. So my friend asked the guy for his number and then felt all bad that he had known the guy for years and never asked.

Each person presents a number of facets to the world and some of them may be interesting to you and others won't be. I don't think it's fair to beat yourself up just because you didn't like someone before discovering a certain facet of theirs. I'm sure that had I met Beethoven or Mozart when they were kids I wouldn't have hung out with them much: they would have been too much into music and I would have wanted to talk about and do other things.

Our faults are what make us unique. If we all were perfect we would all be identical (assuming there is a single definition of perfect). The difference between a master and an apprentice is that the master selects which faults to show and which to hide, which gives their practice a unique flavour that is hard or impossible to duplicate. The apprentice spends so much time fussing with the tools that they are not able to choose which of their faults to hide, which makes their practice unpolished and rough.

So to categorize people is to either deny that there is anything more to them than the single aspect which we are considering as a basis for categorization or to say that we don't care about the other aspects of the person. Now, when we are doing studies on fertility rates by geographical area it is obvious that we don't care about professions, but if we are looking at ranking people there are often a lot more aspects involved in the ranking than simple seniority or even a simple qualitative assessment of skills.

Monday, January 31, 2005

Tweak Firefox to work faster

Go to about:config, set network.http.pipelining to true and change network.http.pipelining.maxrequests from 4 to 10.