From the hip: February 2005

Saturday, February 26, 2005

Security in the WOD's networking layer

When you use the WOD, you are connected to the global network of WOD users. You immediately begin exchanging data packets with all your peers on the network to increase redundancy for your packets and to ensure that your data is stored permanently. In addition, clients may request specific packets at any time - requests that must be replied to if the network is supposed to work properly.

In order to guarantee a certain level of privacy for each WOD user, we use a packet-level encryption scheme. This scheme needs to allow the WOD's similarity-based capabilities to continue working and therefore encrypting a given cleartext packet must produce the same ciphertext for all clients that perform the encryption. This is of critical importance since the benefits gained from the self-similarity of data in the network are lost immediately if there are two possible ciphertext results for the same cleartext.

To circumvent this issue, each packet is encrypted with it's own MD5 hash. This guarantees that every client encrypting a given packet will encrypt it exactly the same way. Pointer blocks/packets maintain both the fingerprint of the packet (the SHA-1 hash of the cleartext) as well as the password of the packet (the MD5 hash of the cleartext). The security of MD5 is not really an issue here since collisions in the hashing function do not significantly reduce the strength of the generated password.

Packets are encrypted using AES - a symmetric encryption algorithm. This is because the key for a given encrypted block is determined by the block's data itself. Generating public/private keypairs might enhance security but would make things a lot more complicated to implement.

This encryption scheme is not proof against chosen plaintext attacks: if an attacker has a file and wants to prove that you have the file too they are able to regardless of the encryption. This does mean however that only people with the same files as you can prove you have those files. Users would be well advised to employ third-party (higher level) encryption for their files if additional security is needed at this level. The rationale for not implementing per-user security is that, for the most part, as long as I own the same file as you, I don't care if you see that I own it.

Tuesday, February 22, 2005

ISBN Linker Bookmarklet

Drag and drop the [ISBN Linker] into your toolbar and click it to transform ISBN numbers in the current page into links to Amazon.com.

Thursday, February 17, 2005

The WOD's network design

Each system on the network is a node and each node is connected to zero or more peers.

Nodes

When a node is connected to less peers than its threshold, it broadcasts a packet onto a multicast address with the port on which it would like to receive data. It also listens on the multicast address and picks up advertisements for other nodes until it has reached its threshold. Once the threshold is reached the node stops advertising itself and stops listening on the multicast IP. If the number of connected peers ever drops below the threshold, the node will begin to advertise itself and listen again for other peers.

Peers

Once a packet is received from a prospective peer, the node examines the requested port and then connects to the peer on that port. Packets sent by any peer are gathered together and sent into a central pipe.

The central pipe

The central pipe of packets coming in from the network is watched by a series of functions that have registered to be notified of incoming packets. Each watcher may operate on a packet before it is passed on to the next watcher.

By default, a watcher is set to pick up request packets and handle them. If the cache contains the fingerprint requested, the data for the fingerprint is packaged up and the node connects straight back to the requestor to deliver the data. If the cache does not contain the fingerprint, the request is rebroadcast to a single peer. In this way requests move across the graph of nodes and are replied back to their sources once they can be fulfilled.

This means that the requesting of a fingerprint on the network acts a little like a future (we should probably have explicit support for that).

Another default watcher receives data packets and processes them.

Data Packet Processing

Each data packet is picked off the network and examined; if the TTL (a stupid name but I'll explain it in a second) is 1 then the data is stored on the node and not re-transmitted. If the TTL is greater than 1 then the node stores the data and re-transmits the packet after decrementing its TTL. In this way data blocks get stored on several nodes in the network to provide redundancy. Each node will be stored by a number of nodes equal to the TTL set by the original emitter. So if a packet is emitted with a TTL of 5, it is guaranteed to be stored on 5 nodes other than the original before the packet is removed from the network and no longer broadcast.

Wednesday, February 16, 2005

SHA-1 Broken

From Schneier on Security: SHA-1 Broken:

SHA-1 has been broken. Not a reduced-round version. Not a simplified version. The real thing.

The research team of Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu (mostly from Shandong University in China) have been quietly circulating a paper describing their results:

collisions in the the full SHA-1 in 2**69 hash operations, much less than the brute-force attack of 2**80 operations based on the hash length.
collisions in SHA-0 in 2**39 operations.
collisions in 58-round SHA-1 in 2**33 operations.

This attack builds on previous attacks on SHA-0 and SHA-1, and is a major, major cryptanalytic result. It pretty much puts a bullet into SHA-1 as a hash function for digital signatures (although it doesn't affect applications such as HMAC where collisions aren't important).

This is pretty annoying since the WOD uses SHA-1 to calculate the fingerprints of the blocks in the system. We pretty much depend on the fact that the SHA-1 algorithm requires 2⁸⁰ comparisons in order to generate a data block whose fingerprint collides with another block's. The reduction in the overall namespace for fingerprints is a real pain. I'm already starting to look for another option that will fit in 160 bits and will have the original unrestricted namespace.

It also occurs to me at this point that we will need to build in some support for changing the hash algorithm used for fingerprints to allow for eventual upgrades to the hashing algorithms we are using. Maybe we need to add a couple of bits to a fingerprint that we can use to specify the algorithm used to generate the fingerprint. Of course, this would add a certain amount of overhead to the whole operation since we would have to decide which algorithm to use each time. We also run into the problem of figuring out how/when to change the algorithm, in other words, how do we figure out that a given algorithm is generating collisions now?

Friday, February 04, 2005

Tip for URLs in Podcasts

Try using TinyURL to create URLs for all the links you want to give out in the program.

It would be nice to have a service similar to TniyURL that would produce mnemonic (pronounceable) URLs as well as returning the same "tiny" URL for a given URL every time it is entered.

Thursday, February 03, 2005

who's important? who cares?

David Toub asks who's important? who cares?

I think that people always want to categorise things; it's part of our basic pattern recognitory mindset. When we try to classify however we often get caught up in our own constructs. I'm talking here about hierarchical classification versus categorization versus any other method of classifying. A debate is raging right now about "tagging" given the proliferation of services such as flickr and del.icio.us. The debate centers around the idea that tagging provides a flat namespace that gets corrupted/polluted when the same tag can have multiple semantic values. Tim Bray gives the example of military "drills" versus oil "drills". Because the system that we invented (tagging in this case) is built in a certain way it constrains us to act (and even think sometimes) in a way that is congruent with it's physical design.

I was talking to a friend the other day and he mentioned that a guy we have know for years and years (and never particularly liked) happened to be around and start beatboxing (vocal percussion where you make the sound of drums and cymbals with your mouth). It turns out that the guy was quite good at it. So my friend asked the guy for his number and then felt all bad that he had known the guy for years and never asked.

Each person presents a number of facets to the world and some of them may be interesting to you and others won't be. I don't think it's fair to beat yourself up just because you didn't like someone before discovering a certain facet of theirs. I'm sure that had I met Beethoven or Mozart when they were kids I wouldn't have hung out with them much: they would have been too much into music and I would have wanted to talk about and do other things.

Our faults are what make us unique. If we all were perfect we would all be identical (assuming there is a single definition of perfect). The difference between a master and an apprentice is that the mast selects which faults to show and which to hide which gives their practice a unique flavour that is hard or impossible to duplicate. The apprentice spends so much time fussing with the tools and are not able to choose which of their faults to hide which makes their practice unpolished and rough.

So to categorize people is to either deny that there is anything more to them than the single aspect which we are considering as a basis for categorization or to say that we don't care about the other aspects of the person. Now, when we are doing studies on fertility rates by geographical area it is obvious that we don't care about professions, but if we are looking at ranking people there are often a lot more aspects involved in the ranking than simple seniority or even a simple qualitative assessment of skills.

From the hip