As part of the continuing series covering our experience with the RealSense technology from Intel, I’ve been thinking about gestures…
I’ve been saying for a long time that one of the keys to Apple’s success in getting developer buy-in for iOS was the very approachable and well designed tool kit they provided in Xcode. It was as if they polled 100 random potential coders and asked, “If you made an iPhone app, what’s the first thing you would want to tinker with?” and then made all of those APIs easy to find and easy to use. The result was a tool kit that rewarded modest effort early, and thereby encouraged developers to try more, get better, and keep exploring the tool kit for the next cool thing. It made adoption of something totally new feel manageable and rewarding. That encouraged not only the curiosity crowd, but also the business-minded crowd, who have to ask, “How long will it take to adopt this tech? And is it likely to be worth it?” So long as the first answer is “Not too long,” the second question is less acute.
The point being: it enabled early adopters to show off quickly. That drew in the early followers and the dominoes fell from there.
RealSense would benefit greatly from this lesson. Hardware appears to be in the pipe and we’re adequately impressed by the capability – check. A Unity3d SDK (among several others) is looking really sharp – check. So now I’m thinking about the question, “…What’s the first thing I want to tinker with?” and probably 75% of my ideas revolve around gestures. In fact, gestures are probably the essential component of this input schema, and as such it will be make-or-break for Intel to make gestures easy to get started with and also deep enough to explore, experiment, and mod. But Easy needs to come first…
It’s safe to say that a lot of developers will start by taking some app they already have and mapping its touch or mouse actions to RealSense equivalents. A hash map that connected common touch gestures automatically to RealSense gestures would be a great entry-level tool. Out of the box, a Pat gets mapped to a click or tap, a Grab to a double-tap. Other gestures are almost effortless to conceive: Swipe, Pinch, Spread and Drag are no-brainers, but ONLY if RealSense can recognize these basic gestures right out of the box.
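That hash-map idea is almost trivially small in code. A minimal sketch, assuming made-up gesture names (the SDK’s real identifiers would differ), might look like this:

```python
# Hypothetical lookup table mapping common touch gestures to RealSense-style
# hand gestures, so an existing touch app could be adapted with minimal code.
# All gesture names here are illustrative, not actual SDK identifiers.
TOUCH_TO_REALSENSE = {
    "tap": "pat",
    "double_tap": "grab",
    "swipe": "swipe",
    "pinch": "pinch",
    "spread": "spread",
    "drag": "drag",
}

def translate_gesture(touch_event: str) -> str:
    """Return the RealSense gesture equivalent for a touch event name."""
    try:
        return TOUCH_TO_REALSENSE[touch_event]
    except KeyError:
        raise ValueError(f"No RealSense mapping for touch event: {touch_event!r}")
```

The point isn’t the dictionary itself, of course; it’s that the SDK would have to recognize the right-hand column reliably for the left-hand column to be worth anything.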
Since this kind of root interactivity is foundational to basically every app in the world, I hope the SDK gets these right from day 1. Expecting a “merely curious” developer to define their own gestures is likely a bridge too far and might encourage the less motivated to put RealSense down before they get started.
Expanding on this, making the gestures easy for developers to insert is the key, but of course someone might say they also must be easy for users to adopt. While this is true, I’d be cautious about overemphasizing that aspect. Sure, the gestures need to be intuitive and forgiving enough to accommodate variations in individual use. However, the adoption of touch teaches us that users are willing to learn how to do a new thing so long as the reward is reasonable. People interested in using this UI will also be willing to practice a little to get the results they are after.
The most obvious place to go after this is to define recognizable positions for the hand: an Open Palm, a Peace Sign, a Thumbs Up. Tying a set of base gestures to specific API calls would be a good shortcut. OnPeaceSign(), for instance, is the kind of thing that would pique my interest as I perused the SDK and would get me thinking and experimenting. Expand that to something extensible like OnGesture.[myGesture]() and I start to think big.
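The OnPeaceSign()-to-OnGesture.[myGesture]() progression is really just a callback registry keyed by gesture name. A sketch of what that extensible shape could look like, assuming the SDK can report recognized signs by name (none of these names are real SDK APIs):

```python
# Sketch of an extensible gesture-callback registry. The fixed OnPeaceSign()
# style hook falls out of a generic on_gesture(name, handler) registration.
# Gesture names and the dispatch mechanism are assumptions, not SDK features.
class GestureDispatcher:
    def __init__(self):
        self._handlers = {}

    def on_gesture(self, name, handler):
        """Register a callback for a named sign or gesture."""
        self._handlers.setdefault(name, []).append(handler)

    def dispatch(self, name, **event):
        """Called by the (hypothetical) recognizer when a gesture fires."""
        for handler in self._handlers.get(name, []):
            handler(**event)

# Usage: the built-in signs and my custom ones register the same way.
dispatcher = GestureDispatcher()
dispatcher.on_gesture("peace_sign", lambda **e: print("peace!", e))
```

The appeal of this shape is that the curated, out-of-the-box signs and a developer’s home-grown ones live behind one uniform API, which is exactly the easy-first, deep-later path described above.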
As a side note, it would help make the jargon more precise if the terminology distinguished between a “sign,” which I would define as a static hand configuration, and a “gesture,” which may or may not require a sign first and then adds motion.
Combining a simple sign like a Pointing Finger with the ability to “draw” on an imaginary plane perpendicular to the camera opens the door to somatic symbols that trigger other actions. A Perceptual Computing game at GDC included this kind of feature to cast spells in a wizard duel, but the ground here is fruitful for all manner of things. From the SDK perspective the important part is a simple ability to import the desired symbol in a common format, like SVG for instance, that can encode both the desired path AND the sequence needed to do it right. For instance, the definition of a Square might include the notion that it starts in the upper right corner and proceeds clockwise. Ideally, I could draw a shape in Illustrator, import that as my symbol, and RealSense knows how to recognize it from there.
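To make the “path AND sequence” point concrete, here is a toy sketch of order-aware symbol matching, assuming the drawn path arrives as a list of 2D points (screen coordinates, y increasing downward). A real SVG importer would feed the same kind of template; this representation is entirely made up:

```python
# Toy sketch of order-aware symbol matching. A template encodes both the
# shape and the stroke order -- e.g. a square that starts in the upper right
# corner and is drawn clockwise. The point-path input and the compass-direction
# encoding are assumptions for illustration, not a real SDK format.
def stroke_directions(points):
    """Reduce a point path to coarse compass directions between samples."""
    dirs = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if abs(dx) >= abs(dy):
            d = "right" if dx > 0 else "left"
        else:
            d = "down" if dy > 0 else "up"  # y grows downward on screen
        if not dirs or dirs[-1] != d:
            dirs.append(d)  # collapse consecutive repeats
    return dirs

# A Square template: start upper right, proceed clockwise.
SQUARE_CW = ["down", "left", "up", "right"]

def matches_square(points):
    return stroke_directions(points) == SQUARE_CW
```

The same drawing traced counter-clockwise would produce a different direction sequence and fail the match, which is exactly the sequencing the SVG-plus-order idea is after.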
Achieving a kind of feature parity with 2D UI standards is a big deal and almost certainly a requirement for adoption, but if RealSense stops there it’ll be missing its real potential: 3D UI. Z-axis gestures that I want to see built into the system would include Push, Pull, Spin, Stack and Throw, just for starters.
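Even the simplest z-axis gestures could be sketched from depth data alone. Assuming the camera reports palm distance in meters over a short window (a pure assumption about the data the SDK exposes, with illustrative thresholds), Push versus Pull is just the sign of the depth change:

```python
# Sketch of z-axis gesture classification from palm-depth samples, assuming
# the camera reports palm distance (meters) over a short time window.
# The 10 cm threshold is illustrative; a real classifier would also handle
# noise, velocity, and timing.
def classify_z_gesture(depth_samples, threshold=0.10):
    """Label motion toward the camera as 'push', away as 'pull', else None."""
    delta = depth_samples[-1] - depth_samples[0]
    if delta <= -threshold:
        return "push"  # hand moved closer to the camera
    if delta >= threshold:
        return "pull"  # hand moved away from the camera
    return None
```

Spin, Stack and Throw obviously need more than one axis, but the point stands: the raw depth stream already carries the signal, and the SDK’s job is to package it into named gestures.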
After answering the immediate issues of mapping an old UI to the new technology, you really want to encourage devs to start doing things that are totally impossible without RealSense. For there to be a Killer RealSense App, it must do something that cannot be adequately reproduced in another format, and 3D gestures are a powerful differentiator.
Again, these base gestures need to be in the SDK when I open the box. I want to see those methods available in the lists and ask, “Hmmmm…what could I do with a Pull effect on this object?”
Another obvious capability to facilitate is tracking both hands (or more?) simultaneously. In the previous generation (PerC) we were practically limited to a single hand for input. That led to obvious comparisons to a mouse or a finger in a touch environment. But with two hands you’re in uncharted waters, and giving developers a few starting points will encourage development rather than waiting for folks to figure it out on their own – they need a starting point. A simple feature that automatically distinguishes the right hand from the left, for instance, is the kind of thing I really would expect to see as a gimme.
With that, an example 2-handed function might be one that draws an imaginary line connecting my palms. Now we have a ready made “handle” that can be used to manipulate a wide range of things like zoom, object rotation, camera angle, etc.
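That palm-to-palm “handle” is just geometry once the SDK hands over 3D palm positions tagged left and right (an assumption about what it exposes). A minimal sketch: the line’s length can drive zoom and its angle in the camera plane can drive rotation:

```python
# Sketch of the "imaginary line between palms" handle, assuming the SDK
# provides 3D palm positions (x, y, z) tagged left/right. Length can drive
# zoom and the in-plane angle can drive rotation; this is plain geometry,
# not a real API.
import math

def hand_handle(left_palm, right_palm):
    """Return (length, angle_degrees) of the line connecting the palms."""
    dx = right_palm[0] - left_palm[0]
    dy = right_palm[1] - left_palm[1]
    dz = right_palm[2] - left_palm[2]
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    angle = math.degrees(math.atan2(dy, dx))  # roll in the camera plane
    return length, angle
```

Spreading the hands apart then maps naturally to zooming in, and tilting them maps to rotating the object or camera, all from two tracked points.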
Selection could also be facilitated with two-handed gestures. Imagine drawing a selection marquee with my hands in a T-Square sign. It would be natural and intuitive, and could easily be rotated or spun in space to create unique selection tools.
Extensibility and Gesture Creation
Once this set of hand-holding APIs is mastered, I expect to be able to roll my own. The ability to create custom gestures is a big deal and it’ll be interesting to see how this gets done. To my mind, the simplest and most useful tool would be one that simply used the camera itself to input a sign or gesture. Click record, hold up your new sign, and click stop. Assuming all of my finger positions can be compared to another set of finger positions with enough flexibility to account for morphology, it’s hard to imagine an easier way to get this done. Teaching American Sign Language seems like an obvious use case. Show the computer how to form an E and save it. Now the app can teach anybody how to Fingerspell just by comparing their sign to the one you taught it.
I think a simple tool like this would be much preferable to some esoteric data format that defines a gesture with numbers and angles and rotations. That said, for certain purposes, the ability to modify an existing gesture would be valuable. Going back to ASL, for instance, perhaps there are levels of expertise that tolerate more or less deviation from the model in memory. As a beginner, you can get “close enough,” but to advance to expert level the tolerances tighten.
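The record-then-compare idea, with tolerances that tighten by skill level, reduces to a template comparison. A sketch, assuming a recorded sign is captured as a vector of joint angles (the representation, the angle values, and the tolerance tiers are all assumptions for illustration):

```python
# Sketch of record-by-example sign matching. A recorded template is assumed
# to be a vector of joint angles (degrees); a live pose matches if every
# joint is within the tolerance for the user's skill level. The angles and
# tolerance tiers are made up for illustration, not SDK behavior.
TOLERANCE_BY_LEVEL = {"beginner": 25.0, "intermediate": 12.0, "expert": 5.0}

def matches_sign(template_angles, live_angles, level="beginner"):
    """True if every joint angle is within the level's tolerance."""
    tol = TOLERANCE_BY_LEVEL[level]
    return all(abs(t - a) <= tol for t, a in zip(template_angles, live_angles))

# A recorded "E" template and a learner's attempt (made-up joint angles).
asl_e = [95.0, 90.0, 92.0, 88.0, 40.0]
attempt = [90.0, 80.0, 100.0, 85.0, 45.0]
```

The same attempt that passes at beginner level can fail at expert level, which is exactly the graded-feedback behavior a Fingerspelling tutor would want.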
Gestures and signs will be the entry-level bread and butter of RealSense as people attempt to make practical use of the new input technology. The closer Intel wants to bring this to mainstream developers, the more it will need to provide easy, intuitive APIs that let programmers focus on cool uses of the tech while minimizing the effort needed to make it “just work.”