More recently, I sat down over lunch in San Francisco with Martin and Mudassir Azeemi, a UX designer whom I know, ostensibly to discuss hypervoice. We also ended up discussing the theoretical underpinnings of networks, the psychological and social pressures of communications, and the future of the Internet.
Steven: What’s wrong with traditional voice telecommunications—what we’re all used to today? Voice usage has been dropping, apparently in favor of other channels, for a few years now. Is this the end, or does it just mark a need for change?
Martin: I’ll answer your question by flipping it on its head. What does voice telecommunications do right? There has clearly been value in it for over 100 years. It tries to recreate the experience of being there with somebody else. The highest aspiration a telephone call had was to be as good as being there. It falls short in that it creates a session-based interaction that doesn’t match how we interact when we are physically in the same space.
We can have an ongoing relationship, a conversation, across several media. I was outside on the street, checking my email to see where we were supposed to meet. We bumped into each other in the reception area, but there was no obvious place the session started.
Humans are not session-based creatures. When we met each other here, I didn’t recognize you from behind at first. Then you turned around and saw me, and we had a shared context, which was that we were both now in the same restaurant, where we had intended to be.
But a telephone call is not contextual; there is no context that it brings into an interaction. So, in many ways, it’s quite unnatural. The early telephone companies had to teach their users what to do and say. Before this, people regarded speaking to someone to whom they hadn’t been introduced as impolite. It was a wholly unnatural experience in that cultural context.
So, a whole bunch of social expectations were slowly established as to how we should talk on the phone. A dialect emerged around telephone calling that hadn’t previously existed.
In the meantime, the world has moved on: we’re now all carrying portable computers, and there is a multiplicity of ways for us to communicate with one another—and in particular, we’ve gotten used to these little minicomputers being amplifiers for every thought and idea that we have.
The things that we haven’t yet been able to amplify—aside from merely recreating voice at a distance—are our spoken thoughts, ideas, and gestures. If I text you, you can forward the text; if I tweet, you can retweet it. Twitter is a classic amplification engine for ideas.
But every spoken word is ephemeral and unindexed. So, in a world where you’ve got a choice between amplified thoughts and unplugged thoughts, the rock-and-roll version of communications will gather a lot more attention and use than the purely acoustic version.
So, not surprisingly, telecommunications’ comparative advantage starts to fade away.
Mudassir: And that’s the reason text messaging and other text-based communications are on the rise? That’s why the mobile industry is capping minutes?
Martin: Yeah. The ratio of cost to benefits is changing. In some ways, telecommunications is becoming more costly, because people are spending more of their time in telephony-unfriendly environments. But its relative benefits compared to other things are improving.
Steven: So, what should telecommunications—which I guess we can interpret as meaning telepresent communications—be like? What attributes do they need to express for us to bring them more adequately into the digital era?
Martin: If you are a phone company, you are caught in a very awkward place. The mantra for the last ten years has been that all the over-the-top (OTT) players are coming, and telecoms will just become a dumb pipe. The world is a lot more complex than that. Spoken voice has been anchored to telephony for a very long time, and there is a complex and sophisticated set of technologies, as well as social norms and government systems, to keep it there. There’s no particular reason why vertical integration is evil.
Apple does lots of vertical integration, and people don’t regard it as evil. So, if phone companies can deliver a better human communications experience based on their long history of delivering this stuff, why not?
There’s value in the networks and relationships and systems that we’ve got. If we were to start again, we wouldn’t design things the way they are. Like in the joke about the drunk who loses his keys and looks for them under the streetlamp, this is where we are.
Steven: A few months ago, you characterized Twitter and Facebook as hyper-messaging platforms: platforms whose primary purpose is to link to or enable access to more information. Hypervoice seems like not just a great thing in itself, but a herald for a new age of hyper-everything. What other media are failing in their promise and so need to become more connected and a more integral part of our conversations?
Martin: YouTube is really a kind of hyper-video. You can overlay bits of information onto videos. Something becomes hyper when it gets a URL. If it doesn’t have a URL, it doesn’t exist. Email has traditionally not had URLs, so I can’t easily point to and publish an email message.
The necessary part of hypervoice is that voice objects now have URLs. But that’s not sufficient. Linking is the sufficient part: linking what people said to what they did. So, all the objects we touch or interact with throughout a voice conversation, all the PowerPoint decks or notes or trouble tickets or Web pages that we view, and all the gestures that we make get linked back to what we say.
All together that enables a whole new way of thinking about voice. Rather than an ephemeral thing, it is a permanent, digital asset that we are creating.
This is a prototype of a hypervoice conversation. Steven will replay it, mark it up in various ways, and the notes you take—if you were using a LiveScribe pen—would be tied back to the moment at which I was speaking.
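As a rough illustration of the linking Martin describes, the sketch below models a recorded conversation as an object that has a URL and a list of time-stamped links to notes, slides, or Web pages. It is not drawn from any actual hypervoice implementation: the class names, fields, and example.com addresses are all hypothetical, and the `#t=` suffix simply borrows the temporal syntax of W3C Media Fragments as one plausible way to address a moment in a recording.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """One thing a participant did, tied to the moment it happened."""
    offset_s: float  # seconds from the start of the recording
    kind: str        # illustrative labels such as "note" or "slide"
    target: str      # URL or identifier of the linked object


@dataclass
class VoiceObject:
    """A recorded conversation that can be addressed and linked to."""
    url: str          # hypothetical address of the recording
    duration_s: float
    links: list = field(default_factory=list)

    def add_link(self, offset_s, kind, target):
        # Reject links that fall outside the recording.
        if not 0 <= offset_s <= self.duration_s:
            raise ValueError("offset outside the recording")
        self.links.append(Link(offset_s, kind, target))
        # Keep links in playback order.
        self.links.sort(key=lambda link: link.offset_s)

    def moment_url(self, offset_s):
        # Address a single moment using a "#t=<seconds>" fragment.
        return f"{self.url}#t={offset_s:g}"

    def links_between(self, start_s, end_s):
        # Everything that was done while a given stretch was being spoken.
        return [l for l in self.links if start_s <= l.offset_s <= end_s]


if __name__ == "__main__":
    lunch = VoiceObject("https://example.com/voice/lunch", 3600.0)
    lunch.add_link(754.5, "note", "https://example.com/notes/17")
    lunch.add_link(120, "slide", "https://example.com/deck#3")
    print(lunch.moment_url(754.5))  # https://example.com/voice/lunch#t=754.5
    print(lunch.links_between(0, 200))
```

In this sketch, a note taken with a pen or any other tool would be stored as one more `Link`, so replaying the conversation from `moment_url(offset)` returns to what was being said when the note was written.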
Steven: There’s no reason you can’t tie a service like this to an arbitrary device, so if someone like LiveScribe gets on board, their pens would become standards-compliant, and you could use that data within your existing workflow.
Martin: Yes, you should take the hypervoice stuff and embed it into other things like Microsoft Office or Oracle Social Network, so people could continue to take their notes in the tools they prefer and are comfortable with, and still tie them back to the conversation, rather than being forced to use a whole new tool.
Everything has to fit within the existing workflows people use, with minimal changes. For example, I use a Kanban system for managing all my tasks. I have a very busy life, so I have about 12 different swimlanes and different statuses these things go into. It’s okay, but I want a system that tells me what I need to be doing next. Our relationships with computers are not quite adversarial, but oppositional. The machine should be working in lockstep with us.
Steven: Reading through your presentations and papers, I was less struck by how radical the ideas were than by how natural they seemed, and by the fact that they were proposed decades ago. If I may, let me show you two quotations from Vannevar Bush’s 1945 article “As We May Think,” in The Atlantic:
“One can now picture a future investigator in his laboratory. His hands are free, and he is not anchored. As he moves about and observes, he photographs and comments. Time is automatically recorded to tie the two records together. If he goes into the field, he may be connected by radio to his recorder. As he ponders over his notes in the evening, he again talks his comments into the record. His typed record, as well as his photographs, may both be in miniature, so that he projects them for examination.”
“All our steps in creating or absorbing material of the record proceed through one of the senses—the tactile when we touch keys, the oral when we speak or listen, the visual when we read. Is it not possible that some day the path may be established more directly?”
The popular mythology of the information age is that people demand something better, then go build it. But, in fact, as you were just talking about, there are a lot of regulatory frameworks and even inertia.
How did we get here, or more usefully maybe, how do we get out of where we are to make this future happen? How do you see things evolving so these kinds of services aren’t little, fun niches in specialized enterprise systems, but ways we can all get to a hyper-everything world?
Martin: The conference I am here in San Francisco to attend is the WebRTC Expo, which is allegedly about putting voice on the Web. But it comes from such a narrow and unimaginative place, which is that the highest aspiration is to put real-time, streaming, two-way audio and video into the browser. This totally misunderstands what the Web is about, which is about linking. It isn’t Web voice at all; it’s just browser voice.
It’s useful, but it’s neither necessary nor sufficient. We’ve got these things called telcos, and they deliver really quite good, high-quality voice already. To make hypervoice work, you don’t need browser voice at all. It’s irrelevant.
The question might be: where does this all go? Our idea of a browser is currently very limited. It’s still locked into the 1990s paradigm of basically client-server scaled up to global scale.
There are these two paradigms, one of which is the document hyperlinking paradigm; the other, activity streams. Twitter is all about activity streams. It’s not about documents. You can turn an individual tweet into a document, but it’s really about the temporal relationship.
And so the current Web is time blind. It’s like someone who knows only spatial metaphors, but there’s this other dimension of living called time. And that is important. And the Web doesn’t understand it. Today, the Internet and the Web are both prototypes, each with severe holes and faults. Neither of them is temporal.
Steven: Just the other day, I was outlining the limitations of current voice communications as a reason why we use texting and other means of sending data instead. Voice is, today, transient. You can’t tie it to data. You can’t review it. So if you mishear something, it’s just gone.
But there are bits and pieces that have pushed the boundaries for some time. For example, LiveScribe is a consumer product that comes to mind as being conceptually similar to hypervoice. But their approach is not universal or standard. I cannot press *22 on just any phone to jump back 10 seconds. My LiveScribe notes are in a proprietary format.