(This white paper is also published in Intel Developer Zone here)
If virtual input devices, such as those built with Intel® RealSense™ technology and paired with an appropriate Natural User Interface (NUI), are to compete with or replace established physical inputs like the mouse and keyboard, they must address the matter of text input. Current-gen NUI technology has done a reasonable job of competing with the mouse, a visual and spatially contextualized input method, but it has fallen notably short of competing with the keyboard.
The primary problem facing a virtual keyboard replacement is speed of input, and speed is necessarily related to accuracy. However, as sensor accuracy continues to improve, we can see other challenges arise that may prove more difficult to address.
In this paper I start with the assumption that sensor accuracy does, or soon will, allow the detection of small hand movements of a degree similar to that of keystrokes. Then I examine the opportunities, challenges, and some possible solutions for virtualized textual input without the need for a physical peripheral beyond the camera sensor.
A Few Ground Rules For This Discussion
For the sake of this paper, I’ll be discussing the potential replacement of a Western-style keyboard. The specific configuration, QWERTY or otherwise, is irrelevant to the main point. With that in mind, keyboards can be considered in a hierarchy of complexity, from a numeric 10-key pad through extended computer keyboards that include letters, numbers, punctuation, and various macro and function keys.
As noted above, the factors most likely to make or break any proposed replacement for the keyboard are speed, followed by accuracy. Sources disagree on the average typing speed of a modern tech employee, and Words Per Minute (WPM) may be an improper measure of keyboarding skills for code writing, but it will serve as a useful comparative metric. I will assume that 40 WPM is a reasonable speed, and methods that cannot realistically reach that goal given an experienced user should be discarded.
I will also be focused on using the Latin alphabet; however, other alphabetic languages are similar in application. What is not considered here, though well worth exploring, is the virtualization of logographic input. It’s conceivable that gestural encoding of conceptual content would be faster and superior to gestures that encode phonemes and may even represent a linguistic evolutionary advance from logographic systems like Kanji. That said, a very likely use case for this kind of technology is writing computer code, in which case letter-by-letter input is an overarching consideration.
What “Works” On a Keyboard?
Let’s start by looking at the characteristics that have made keyboards so practical as to remain relatively unchanged for almost 150 years.
Skilled keyboardists can type 120 WPM, with world-class typists exceeding 180 WPM. While the average speed is considerably less, it remains true that the humble keyboard provides a fast way to enter alphanumeric characters. This is a function of a layered layout that utilizes a two-dimensional arrangement that is modified with transformation keys (Shift, Ctrl) and accessed by 10 discrete actuators (fingers) traveling an average of 1 to 3 inches per keystroke. Leveraging the remarkable natural dexterity of human hands, keyboards are compact and fast. Any competing method must achieve comparable speed to be widely adopted.
The physical and tactile nature of the keyboard makes for a powerfully accurate input method. On the mechanical level, although I may hit a key I didn’t intend to hit, pressing the “H” key, for example, always delivers an “H” character. On the human interface level, the physical nature of the individual buttons, the way they click and push back, provides second- and third-sense reinforcement to the visual feedback. The subtle motion of fingers over the keys and the gaps between them subconsciously keeps the practiced typist’s hands aligned and attuned to the correct key. Indeed, skilled typists enter text fully by touch without ever looking down.
Easily overlooked, the fact that a keyboard shows me an alphabet I know makes the tool easy to adopt by even pre-literate users. By using the characters we already know and use in other contexts, the keyboard benefits from a kind of path dependency. Other input schemes like stenography may offer speed increases, but they require specialized learning and therefore remain out of the mainstream.
More important today than ever before, a keyboard affords users privacy. Keystrokes are fast, discreet, and audibly indistinguishable from one another. Even if I am directly watching somebody typing, it is hard to determine what a typist is entering. This fact is what makes text-masking password fields functional for the public input of sensitive information.
Overall the keyboard has stood the test of time because it works remarkably well for what it does.
Some Previously Considered Approaches
Virtual / Projected Keyboard
Probably the most obvious approach to this question is the 1-to-1 virtualization of the physical keyboard. Examples of this include laser projections that can be purchased at stores like Sharper Image. This kind of device really is a step backward, not forward. It tries to mimic the physical peripheral in most ways and only adds the dubious value of portability. While ‘virtual’ in one sense, it isn’t natural or flexible, and with current technology the user experience is worse than that of a physical keyboard: the error rate is high, and the environmental requirements, like a flat, empty plane for projection, are more stringent than those of a physical keyboard. The lack of tactile feedback is also more important than it might seem at first glance. Touch-typing is impossible on this kind of device because kinesthesia is too inaccurate to act on inertial motion alone. The physical and responsive touch of a keyboard gives the typist second-sense awareness of finger position and acts as an ongoing error-correction mechanism. Without that, one’s fingers tend to drift off target, and slight positioning errors compound quickly, requiring a ‘reset’ to the home keys.
It’s reasonable to assume that future sensor improvements could go a long way to ameliorate the first two issues, but even if a highly accurate holographic keyboard could be projected in most environments, it’s unlikely to be an improvement.
The first place we experimented with applying Intel RealSense Technology to the matter of text input was an effort to recognize American Sign Language finger-spelling.
As with other methods, current technology makes ASL impractical primarily because of the way finger occlusion so reliably confuses the sensors. In letters like “N” and “M” the occlusion of one or more fingers is the essential characteristic of the letterform and cannot be sloppily interpreted. That said, like other methods, we assume that technology will progress and accuracy will improve.
But accuracy will not solve the deeper problems with gestural input. The first problem is, again, speed. Proficient finger-spellers can flash about two letters per second, or 120 letters per minute. At an average of 5 letters per word that’s 24 WPM, which is considerably below our goal of 40 WPM to compete with typing. Another way to say it is that a good finger-speller is about half as fast as a so-so keyboarder.
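The arithmetic behind that comparison can be written out as a small sketch. The function name and the `TARGET_WPM` constant are illustrative choices, not part of any SDK; the 5-letters-per-word convention and the 40 WPM threshold are the ones assumed in this paper.

```python
def words_per_minute(letters_per_second: float, letters_per_word: float = 5.0) -> float:
    """Convert a sustained letter rate into words per minute."""
    letters_per_minute = letters_per_second * 60
    return letters_per_minute / letters_per_word


TARGET_WPM = 40  # working threshold assumed in this paper

# Two letters per second is the finger-spelling rate quoted above.
fingerspelling_wpm = words_per_minute(2)
print(fingerspelling_wpm)               # 24.0
print(fingerspelling_wpm < TARGET_WPM)  # True: below the threshold
```

The same function can be used to check any other letter-by-letter method against the 40 WPM bar: a method needs to sustain at least 40 * 5 / 60, or roughly 3.3, letters per second to qualify.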
Another problem is the need for the user to learn a new character set. One of the less-than-obvious values of a standard keyboard is that it comports to all of the other tools we use to write. The printed “T” learned in kindergarten and used throughout this document is the same “T” seen on the keyboard. Asking users to learn a new character set is at least a barrier to entry and likely a deal killer without some compelling improvement over what is already available.
This problem of learning a new character set is not only a barrier to entry, it is also a violation of a core NUI goal—that of a ‘metaphor free’ experience. Admittedly, each letter is a kind of metaphor in itself, but compounding that with an ASL metaphor for the alphanumeric metaphor feels like “a bridge too far.”
Joystick and Spatial Input
Game consoles regularly require text input for credit card numbers, passwords, character names, and other customizable values. The typical input method is to display a virtual keyboard on the screen and allow a spatially sensitive input to tap a given ‘key.’ If the laser-projected keyboard brings the virtual keys to the physical user, these in-game controls virtualize keystrokes.
There are many iterations of this concept. Many use a joystick to move a cursor. Others may use a hand-tracking technology like Intel RealSense or Kinect to do essentially the same thing. Stephen Hawking uses a conceptually similar input that tracks eye movements to move a cursor. But all of these systems create a worst-of-both-worlds scenario where a single-point spatial input, essentially a mouse pointer, is used to work a multi-touch device and is little different from using a pencil eraser to type one letter at a time.
Some interesting work has been done to make joystick text input faster and more flexible by people like Doug Naimo at Triggerfinger (http://triggerfingersoftware.com/), but the input fails the speed test by a large margin and is really only valuable when better/faster input methods are unavailable.
Vocal input has potentially the highest speed of all the methods considered. It’s easy to speak hundreds of words per minute. But speech inputs are impractical for other reasons. Currently, differences in pronunciation and the variability of background noise present challenges to reliable speech recognition beyond simple commands, but it’s realistic to assume future development will overcome these issues.
The real problem with speech input is privacy. Imagine a dozen office cubicles where each employee is speaking to their computer to get daily work done. Speech lacks the privacy and discretion of a keyboard, and multiple collocated users add up to an increasingly complex soundscape. But even if such a work environment could be managed (call centers are not much different in this regard), users in public spaces are likely to be reluctant to speak sensitive information. Technology like biometrics could make personal identification private and discreet, but making the content of an interaction private is a different problem.
When considering our use case of writing code, speech becomes even more problematic. While speaking entire words can be fast, speaking individual letters is not, and current programming best practices include concepts like CamelCase and unpronounceable function names that make letter-specific entry a requirement.
Knowing What Won’t Work and What Will?
All this talk about the weaknesses of alternate text-input methods implies that the humble keyboard has several strengths that are not easily replaced or improved upon. So the question comes back to how can these demonstrated strengths be conserved in a Natural UI schema where hardware peripherals are undesirable? I believe the answer lies in a few critical observations:
1. The ability to use as many as 10 actuators, i.e., fingers, is impossible to meet or beat with any single-point system.
2. The tight, layered, and customizable layout of the keyboard is remarkably efficient. But as a 2-D design it could be expanded by incorporating a third dimension.
A Possible Design for 10-Key Calculator Input
I started by conceptualizing a NUI calculator keyboard. Minimum operation required digits 0-9 and basic operators. Aiming for using all ten fingers I settled on breaking the keyboard into two 5-digit groups and representing these on screen as two square-footed pyramids.
Each finger corresponds to one face of each pyramid. Each face can be thought of as a ‘key’ on a keyboard, so I call them facekeys. The left hand enters digits 1-5 by flexing fingers individually, while the right enters digits 6-0. Flexing the same finger on both hands simultaneously, say both ring fingers, actuates a facekey on the operator pyramid, which includes ‘plus’, ‘minus’, ‘divide’, ‘multiply’, and ‘equals’. Non-digit but essential functions include a left fist to write the displayed value to memory, a right fist to read (and clear) memory, and two closed fists to clear the decks and start a new calculation.
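The facekey mapping can be sketched as a small lookup. This is a minimal illustration, not the code used in the experiment: the finger order (thumb through pinky), the assignment of 6, 7, 8, 9, 0 to the right hand, the operator layout, and the function name `decode_stroke` are all assumptions made for the example. Fist gestures for memory are left out.

```python
from typing import Optional

FINGERS = ("thumb", "index", "middle", "ring", "pinky")

# Assumed finger order: the text gives digit ranges per hand,
# not the exact finger for each digit.
LEFT_DIGITS = dict(zip(FINGERS, "12345"))
RIGHT_DIGITS = dict(zip(FINGERS, "67890"))

# Assumed operator layout for the third pyramid.
OPERATORS = dict(zip(FINGERS, "=+-*/"))


def decode_stroke(left: Optional[str], right: Optional[str]) -> Optional[str]:
    """Return the facekey for the finger flexed on each hand (None = no flex)."""
    if left is not None and right is not None:
        # The same finger on both hands selects the operator pyramid;
        # mismatched fingers are treated as no stroke in this sketch.
        return OPERATORS[left] if left == right else None
    if left is not None:
        return LEFT_DIGITS[left]
    if right is not None:
        return RIGHT_DIGITS[right]
    return None


print(decode_stroke("index", None))   # 2
print(decode_stroke(None, "pinky"))   # 0
print(decode_stroke("ring", "ring"))  # *
```

Keeping the three pyramids as separate tables makes it straightforward to swap an individual facekey without touching the decoding logic.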
First-pass plans had users holding their hands with palms downward, mimicking the operation of a desktop keyboard.
However, it was quickly apparent that a position with palms facing inward was more comfortable and allowed for both longer use and more speed.
Visual feedback from the screen is very important, especially when learning, and this is provided by familiar calculator-style digit readout in the main part, but the pyramids also rotate and animate with each stroke to establish and reinforce the connection between a finger and its corresponding facekey.
We found this system to be comfortable, easily learned, and easily extensible. We found ways to add additional keys or functions. For instance, the lack of a decimal key was noticed as soon as we were ready to move beyond the most basic calculations. Also missing was a backspace to correct errors. But these were easily accommodated with minor modifications. A right-handed wave acted as a backspace. The equals facekey was replaced with a decimal point for entry (two thumbs) and a “clap” gesture became the equals operator, which had the unexpected result of making calculations rhythmic and modestly fun.
Simulated operation indicated that significant speed was possible, but in reality the SDK struggled to track finger motions reliably in the palms-inward orientation. In practice the only reliable position was with the palms facing the camera, which is even less comfortable than downward-facing palms but is reasonable for short periods of time. By the same token, it’s easy to move faster than the camera can reliably recognize, so deliberate movement is needed at this point to avoid errors.
Extending the Keyboard
A peripheral-free calculator is one thing, but a replacement for a typical 80+ key keyboard is quite another. I did, however, find some very simple and practical ways to continue development around this keyfacing concept.
The standard keyboard is arranged in four rows of keys plus a spacebar: numbers and punctuation on top with three rows of letters beneath. Each row is defined by its position in space, and I used that concept here. Instead of moving my hands toward or away from a fixed point like the camera, a more flexible method would be to make the system self-referential. I define a comfortable distance between my palms, and the software sets this distance as 1-Delta. The actual distance between my hands is irrelevant and therefore customizable to the individual user and physical circumstances. The equivalent of reaching to different rows on the keyboard is moving my hands closer and farther apart from one another. A 2-Delta distance accesses “second row” keys, and 3-Delta reaches third row keys.
The ‘home keys’ are set to this 1-Delta distance, and keyfacing proceeds by mapping letters and other characters to a series of pyramids that sequentially cover the entire alphabet. Experimentation suggests 3-4 comfortable and easily reproducible Deltas exist between hands that are touching and shoulder-width. Skilled users may find many more, but the inherent inaccuracy of normal kinesthesia is likely to be a ceiling to this factor.
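One way to picture the self-referential Delta scheme is as a quantizer on the ratio between the current hand separation and the calibrated one. The sketch below is an assumption about how that could be done, not the method used in the experiment: it assumes rows sit at whole multiples of the calibrated distance, snaps to the nearest multiple, and caps the count at four rows. The names `row_for_distance` and `max_rows` are hypothetical.

```python
def row_for_distance(distance: float, delta: float, max_rows: int = 4) -> int:
    """Map the current palm-to-palm distance to a 1-based row index.

    `delta` is the user's calibrated comfortable separation (1-Delta).
    Only the ratio matters, so any consistent unit works.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    # Snap to the nearest whole multiple of delta (assumed quantization rule).
    row = int(distance / delta + 0.5)
    # Clamp to the rows a user can comfortably reproduce.
    return max(1, min(max_rows, row))


calibrated = 0.30  # e.g. meters, captured during a calibration step
print(row_for_distance(0.30, calibrated))  # 1
print(row_for_distance(0.62, calibrated))  # 2
print(row_for_distance(0.88, calibrated))  # 3
```

Because the row depends only on the ratio, the same code serves a user with a narrow or a wide comfortable separation. A real implementation would probably also need some hysteresis around the row boundaries so that small tremors do not flip rows.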
Simple gestures provide another axis of expansion. The keyboard’s shift key, for instance, transforms each key into two. Ctrl and Alt keys extend that even more. Simple, single-handed gestures would create exactly the same access to key layers while maintaining speed and flexibility. For instance, a fist could be the shift key. A ‘gun’ hand may access editing commands or any number of combinations. By using single-handed gestures to modify the keyfaces we can access different characters.
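The layering idea can be sketched as a table of keyfaces per modifier gesture. Everything in the table is hypothetical: the characters, the editing commands, and the `resolve` function are placeholders chosen for illustration, with only the ‘fist as shift’ and ‘gun as editing layer’ roles taken from the text above.

```python
from typing import Optional

# Hypothetical layers keyed by the modifier gesture held on one hand.
# None means no modifier; "fist" plays the role of Shift and
# "gun" selects an editing layer.
LAYERS = {
    None: {"index": "j", "middle": "k"},
    "fist": {"index": "J", "middle": "K"},
    "gun": {"index": "copy", "middle": "paste"},
}


def resolve(modifier: Optional[str], finger: str) -> Optional[str]:
    """Look up the keyface for a finger stroke under the active modifier."""
    return LAYERS.get(modifier, {}).get(finger)


print(resolve(None, "index"))    # j
print(resolve("fist", "index"))  # J
print(resolve("gun", "middle"))  # paste
```

Unknown gestures or unmapped fingers simply return `None`, so a new layer is added by inserting one more entry in the table.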
Experimentation revealed some things that are best noted as modifiers for future development but were not acted upon in this experiment.
• Fingers rarely move independently. This is especially true with the way the ring finger tends to follow motion in the middle finger. All users found it difficult to make fully discrete motions with any finger except the thumb. This may be a matter of practice more than anything else, but I found the observation had practical implications when I tried to actuate keyfaces by the recognizable motion of a single finger.
• Hanging one’s hands freely in space was less comfortable and created less accurate keyfacing than any configuration that provided a stable base, even if that base was the user’s other hand. This observation suggests that paired-finger gestures may be worth exploring. While we reduce the number of available actuators by half, we gain a significant amount of accuracy and comfort, which may increase speed enough to overcome the reduction of available fingers.
Along with the pyramid-based structure, I experimented briefly with other structures. Most were impractical due to limitations of the technology, but remain intriguing. With high keyboard usage leading to injuries like carpal tunnel syndrome, there is every reason to believe the technique explored here would suffer from the same threat. A two-handed twisting configuration was considered and seemed promising. In practice it is much like manipulating a Rubik’s Cube. The interface projects interconnected rings or discs that correspond to the user’s hand “normal” and it is in the relative positions of hands that interaction is initiated.
It should not be a foregone conclusion that some peripheral-free method will ever replace the keyboard if the keyboard is the right tool for the job. After all, the mouse didn’t replace the keyboard, it supplemented it. Touch UI has kept the concept intact, just in a virtualized rendition. NUI may be the same. But if NUI is to replace the keyboard, it must do so by beating it on its home turf: accurate, high-speed text input. I think there is reason to be hopeful for such replacement in the near future, but only time will tell.
About the Author
Chris Skaggs is a 15-year veteran of the web and mobile software industry. The founder and CEO of both Code-Monkeys and Soma Games LLC, Chris has delivered software applications to some of the country’s most discerning clients, like Intel, Four Seasons, Comcast, MGM, and Aruba Networks. In addition to corporate customers, Code-Monkeys and Soma Games have programmed many casual and mid-core games for iPhone, iPad, Android, and Mac/PC platforms. A Black Belt in Intel’s Software Developer Network, Chris also writes and speaks on topics surrounding the rapidly changing mobile application environment at venues like GDC Next, CGDC, Casual Connect, TechStart, Serious Play, and AppUp Elements. Soma Games recently acquired the rights to produce games based on the wildly popular Redwall series of books and is currently in production of their first Redwall title, The Warrior Reborn.