I am a post-doctoral research associate in the Virtual Environments and Computer Graphics group at University College London. My research interests are virtual, mixed and augmented realities, telepresence and 3D interaction.
The overall aim of this work was to build a stereo camera rig to support immersive video see-through augmented reality (AR) for the Oculus Rift. An immersive experience implies that the video frame captured by the cameras must match both the extents and distribution of the Rift’s field of view (FOV) so that virtual and video spaces are perceptually aligned, and that this full frame should be augmentable. The vision of the project is to support an immersive AR experience where the lines between what is real and what is virtual are blurred.
This is the first in a series of seven articles detailing the process of designing and building a stereo camera rig for the Oculus Rift and an AR showcase demonstrating the types of interactions the system makes possible. The video above gives an overview of the work and showcases the demonstrations.
The articles are split as follows:
There are several reasons why the Rift is a suitable head-mounted display (HMD) for immersive AR: it has a large field-of-view (FOV) of around 90° horizontal and 110° vertical per eye depending on the user’s inter-pupillary distance (IPD), it is light-weight at 379g, and its front-most surface is only ~5cm in front of the user’s eyes, meaning that cameras can be mounted at a relatively small offset.
The job of the cameras in a head-mounted video-see-through AR system is to capture the world from the perspective of the user’s eyes. This real-time video is processed and augmented with graphical content before it is displayed to the user. Thus, the optical and technical requirements of the cameras should not aim to be "as close to the human eye as possible", as the user’s vision is mediated by the optical characteristics of the HMD. Rather, the camera requirements should be defined according to the specifications of the HMD. This is a very good thing: HMDs are unlikely to reach the level of acuity afforded by the human eye any time soon, which greatly relaxes our requirements.
While our cameras should match the specifications of the display, they should not necessarily exceed them. For instance, if we were building a stereo camera rig for an HMD with a 50° FOV (much lower than that of the human eye and the Rift), then using a camera lens with a 75° FOV would be both unnecessary, as 25° would not be displayed, and would also reduce the effective resolution of the 50° that was displayed as the full camera sensor would not be used. We could of course compress the 75° camera view into the 50° display but this would distort the user’s view, distorting both scale and position (your hands would look too small and appear in the wrong place) and making alignment between the video image and the tracking coordinate system difficult. This would break a critical aspect of the immersive AR experience.
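To put a rough number on the resolution cost described above, the following sketch estimates how much of the sensor width would actually be displayed if a 75° lens fed a 50° display. It assumes an ideal rectilinear (pinhole) lens, which real wide-angle lenses only approximate; the function name and the 1280-pixel sensor width are illustrative and not part of the original setup.

```python
import math

def used_sensor_fraction(display_fov_deg, camera_fov_deg):
    """Fraction of the sensor width that covers the display FOV (ideal pinhole lens)."""
    half_display = math.radians(display_fov_deg) / 2
    half_camera = math.radians(camera_fov_deg) / 2
    return math.tan(half_display) / math.tan(half_camera)

fraction = used_sensor_fraction(50, 75)
sensor_width_px = 1280  # illustrative sensor width, not a measured value
print(f"fraction of sensor width used: {fraction:.2f}")
print(f"effective horizontal pixels: {round(sensor_width_px * fraction)}")
```

Under this assumption only about 61% of the sensor width would end up on the display, which is the loss of effective resolution the paragraph refers to.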
There are numerous technical requirements for cameras suitable for use with the Rift. The following list is not exhaustive and there are further aspects such as sensor type, HDR and synchronisation that I don’t go into, but here are some key requirements:
For the setup that you see in the videos and images in this article, I have used Logitech C310 cameras, which can be modified to satisfy the above requirements. The sensor resolution is 1280x960, thus exceeding the Rift’s per-eye resolution. This resolution has a 1.33:1 aspect ratio, thus satisfying our second requirement. The C310’s stock lens has a FOV half that of our 120° requirement, so I salvaged the lenses from two Genius WideCam F100 cameras. Higher quality lenses could be sourced, but the FOV is ideal. The fourth requirement of 60 frames-per-second (FPS) at a high resolution is a big ask for a consumer USB 2.0 camera (true USB 3.0 webcams don’t seem to exist yet), yet it seems that the C310 can deliver greater than the 30Hz stated in its specifications. I intend to verify the exact maximum frame rate it can deliver, but I estimate it to be ~50Hz. The 1280x960 resolution appears to be a “sweet spot” for matching the Rift’s resolution at a reasonably high refresh rate. If we were to go for a higher-spec camera such as the Logitech C920, we would have an unnecessarily high resolution, which would likely both impact the frame rate and start to hit the upper limits of USB 2.0 protocol bandwidth. The C310 satisfies our fifth and sixth requirements. When stripped to the board level and fitted with the lenses, the cameras look like the image at the top of this article.
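The bandwidth concern can be illustrated with some back-of-envelope arithmetic. The sketch below computes the data rate of uncompressed video and compares it with the 480 Mbit/s signalling rate of USB 2.0. The 2 bytes per pixel (as in a YUY2-style format) and the 1920x1080 comparison resolution are assumptions for illustration; webcams commonly compress their stream (for example with MJPEG) to fit, and what the C310 does internally is not covered here.

```python
USB2_SIGNALLING_MBIT = 480  # USB 2.0 high-speed signalling rate; usable throughput is lower

def raw_video_mbit_per_s(width, height, fps, bytes_per_pixel=2):
    """Uncompressed video data rate in megabits per second (10**6 bits)."""
    return width * height * bytes_per_pixel * 8 * fps / 1e6

# 1280x960 is the resolution used in this article; 1920x1080 is an illustrative higher one.
for name, (w, h) in {"1280x960": (1280, 960), "1920x1080": (1920, 1080)}.items():
    rate = raw_video_mbit_per_s(w, h, fps=50)
    print(f"{name} @ 50 FPS: {rate:.0f} Mbit/s ({rate / USB2_SIGNALLING_MBIT:.1f}x USB 2.0)")
```

Under these assumptions, even 1280x960 at 50 FPS would need roughly twice the USB 2.0 signalling rate if sent uncompressed, so a higher-resolution sensor would need either heavier compression or a lower frame rate.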
In summary, the modified C310 has proven to be an excellent choice of camera. There are certainly alternatives that may prove superior (I have my eye on the forthcoming IDS uEye USB 3.0 board-level sensor). As far as consumer webcams go, however, there doesn’t seem to be a much better choice. I have already mentioned the drawbacks of higher resolution cameras. Regarding the popular PS3 Eye camera, while it can push out 60FPS, it has a 640x480 (0.3MP) resolution, which is 64% of the Rift’s 800x600 (0.5MP) per-eye pixel count, and just 25% of the C310’s 1280x960 (1.2MP). The resolution of the PS3 Eye is too low to stretch across the large FOV without dramatically reducing the visual quality of the stereo video and with it the overall AR experience.
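The pixel-count comparison above can be reproduced in a few lines. The resolutions are the ones quoted in the text; the pixels-per-degree figure is an added rough estimate that assumes the 120° horizontal FOV requirement and pixels spread evenly over that angle, which a real lens will not do exactly.

```python
def pixel_count(width, height):
    """Total number of pixels in a frame."""
    return width * height

ps3_eye = pixel_count(640, 480)
rift_per_eye = pixel_count(800, 600)
c310 = pixel_count(1280, 960)

print(f"PS3 Eye vs Rift per-eye: {ps3_eye / rift_per_eye:.0%}")
print(f"PS3 Eye vs C310: {ps3_eye / c310:.0%}")

# Rough average angular resolution across an assumed 120 degree horizontal FOV.
for name, width in (("PS3 Eye", 640), ("C310", 1280)):
    print(f"{name}: {width / 120:.1f} px per degree")
```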
Users are particularly sensitive to imperfections in head-mounted display (HMD) optics. In our case, the cameras introduce another layer of potential error and wrongly mounted cameras could lead to stereo convergence issues, discomfort and motion sickness.
The cameras should be mounted in such a way that:
The video in the first part of this series of articles shows how I mounted the cameras. The two main components are the 3D-printed arms that slide onto the front of the Rift, and the camera moulds. The moulds were made from thermoplastic and provide a flat base surrounding the cameras for easier mounting and adjustment. The process of creating the moulds is shown below:
The moulds were cut to size and attached centrally on the 3D printed arms using M2 thread Nyloc nuts and bolts. Rubber washers between the mould and the arms allow for fine tilting adjustments.
There are two ways to mount the cameras in a stereo rig:
There are two types of parallel setups: with and without lens shift. Parallel without lens shift means that the optical axes of the two cameras overlap at infinity. Objects at infinity will be cast at the surface of the display, while everything else will be cast in front of the display, so nothing appears behind the screen surface (no positive screen parallax). This should be avoided.
A parallel setup with lens shift avoids these distortions but requires physical modification of the cameras by horizontally shifting the lens relative to the sensor. This results in a skewed frustum. Lens shift can be simulated in software if using standard cameras in parallel setups, but this has the disadvantage that the image must be cropped, thereby losing a portion of our precious field-of-view (FOV).
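As a rough illustration of what simulating lens shift in software costs, the sketch below estimates the per-camera shift needed to place zero parallax at a chosen distance, and the image width left after cropping. It assumes an ideal pinhole camera; the 65 mm baseline and 1 m convergence distance are hypothetical values, and the function names are illustrative.

```python
import math

def lens_shift_px(image_width_px, hfov_deg, baseline_m, convergence_m):
    """Per-camera principal-point shift (pixels) that puts zero parallax at convergence_m."""
    focal_px = (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    return focal_px * (baseline_m / 2) / convergence_m

def cropped_width_px(image_width_px, shift_px):
    """Width left after cropping one side so the principal point sits shift_px off-centre."""
    # Moving the principal point by s relative to the frame centre costs 2*s pixels.
    return image_width_px - 2 * shift_px

shift = lens_shift_px(1280, 120, baseline_m=0.065, convergence_m=1.0)
print(f"shift per camera: {shift:.1f} px")
print(f"remaining width: {cropped_width_px(1280, shift):.0f} px")
```

With these example values the shift is about 12 pixels per camera. The closer the convergence distance, the larger the shift and the more of the frame is lost to cropping.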
Toed-in cameras are rotated inwards so that their optical axes intersect at a point usually mid-way through the scene, so that some objects will appear behind the screen and some objects will appear in front of the screen. Toe-in produces keystone distortion and depth-plane curvature. Keystone distortion can lead to problems with vertical parallax particularly in the corners of the image. Depth-plane curvature produces a warping of 3D space leading to flat planes appearing bowed in the centre toward the camera.
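The vertical parallax caused by keystone distortion can be demonstrated with a small pinhole-projection sketch. It projects one 3D point into a left and a right camera and reports the difference in image height, once with the cameras toed in and once with them parallel. All numbers (65 mm baseline, 1 m convergence distance, 370-pixel focal length, the test points) are hypothetical, and lens distortion is ignored.

```python
import math

def project(point, cam_x, yaw, focal_px):
    """Pinhole projection into a camera at (cam_x, 0, 0) yawed by `yaw` radians toward +x."""
    px, py, pz = point[0] - cam_x, point[1], point[2]
    xc = px * math.cos(yaw) - pz * math.sin(yaw)
    zc = px * math.sin(yaw) + pz * math.cos(yaw)
    return focal_px * xc / zc, focal_px * py / zc

def vertical_parallax_px(point, baseline_m, convergence_m, focal_px, toed_in=True):
    """Difference in image height of `point` between the left and right cameras."""
    toe = math.atan((baseline_m / 2) / convergence_m) if toed_in else 0.0
    _, y_left = project(point, -baseline_m / 2, +toe, focal_px)
    _, y_right = project(point, +baseline_m / 2, -toe, focal_px)
    return y_left - y_right

corner = (0.5, 0.4, 1.0)  # hypothetical point up and to the right, in metres
centre = (0.0, 0.4, 1.0)  # same height, on the midline between the cameras
for label, pt in (("corner", corner), ("centre", centre)):
    toed = vertical_parallax_px(pt, 0.065, 1.0, 370.0)
    par = vertical_parallax_px(pt, 0.065, 1.0, 370.0, toed_in=False)
    print(f"{label}: toed-in {toed:+.2f} px, parallel {par:+.2f} px")
```

In this model the parallel rig shows no vertical parallax anywhere, while the toed-in rig shows none on the midline but several pixels toward the corner, matching the description above.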
Ideally, lens shift would always be used in physical stereo camera rigs (and the same goes for virtual stereo cameras), but toe-in is often used because it is easy and expedient. Moreover, when dealing with physical cameras, lens shift is impossible to achieve on a standard consumer camera and would require a custom-built system. For this reason, my setup uses toed-in cameras. None of the ~10 colleagues who have used the AR-Rift have noticed issues with vertical parallax or ability to fuse the stereoscopic video. The large FOV of the AR-Rift display results in a low pixel density, which may mean this issue is minimal. Depth-plane curvature is noticeable, but this is also noticeable when using the Rift as normal in pure VR, and is probably an artefact of the Rift’s physical lenses.
In future iterations of the AR-Rift, and specifically when suitable USB 3.0 sensors are available, I plan to investigate the use of a parallel setup using both an excessively high FOV lens and an excessively high resolution sensor. This would allow for software lens shift while retaining the required end resolution and FOV.
The video linked in the first article in this series goes into a wireless setup for the Rift, starting at 2:38. The components required for this setup are as follows:
The WAVI can be used to wirelessly transmit an HDMI signal to the Rift. The WAVI is powered by the portable battery. After performing the Rift USB Power Hack, the Rift can be powered by the WAVI through USB. Note that you will be restricted to using short USB cables following the power hack. Post-hack, the longest cable that seems to work is slightly less than 3 metres. Data from the Rift’s orientation sensor are also transmitted to the host machine by the WAVI. This equipment can be placed in a slim backpack such as this. In normal use, the battery can power the setup for around 7 hours. I have experienced drop-outs in USB data that seem to be due to the WAVI overheating after >30 minutes of uninterrupted use while in the backpack. When drop-outs occur they are critical rather than intermittent, and require the WAVI to be turned off for 10-15 minutes to cool down.
As shown in the video starting at 1:39, we are fortunate enough to have an OptiTrack optical motion capture system in our lab. The 12 cameras capture at 100FPS, and the Motive software computes position and orientation for a defined set of markers. We stream out the data describing the pose of the head and the two hands and pick it up in real-time in Unity.
The AR-Rift displays a live video background onto which virtual content is embedded. These two layers of media are treated separately in terms of software implementation (one is not reliant on the other). From a usability standpoint, however, it is of critical importance that these two representations of spaces and objects are aligned.
Alignment between the room-area tracking coordinate system and the stereo video image is core to making an immersive head-mounted AR experience work. Alignment implies that real tracked objects and their virtual counterparts appear in the same place, or are perceptually collocated, to the user. The video linked in the first post demonstrates this goal, starting at 3:34. There are three stages to achieving this:
The first step toward aligning video and virtual spaces is verifying our physical camera field-of-view (FOV) and determining its angular distribution over the full image frame. To do this, we need to establish the optical centre, or nodal point of the lens. The nodal point of the lens can be considered as the point at which the rays entering the lens converge. It can also be considered as the centre of perspective of the lens.
A method to find the nodal point is described below. This method was devised independently, extending the method discussed here through the use of a printed protractor. The method allows the entire camera FOV to be examined simultaneously and appears to be a robust approach. (I welcome advice or disapproval from any experts reading this article.)
The image below is divided into three columns, each with a top-bottom pair of images. The top image in each pair shows the position of a digital camera relative to a protractor, while the bottom images show the resulting photograph taken from that position. The left image pair shows the root of the protractor placed just in front of the nodal point of the camera lens. The resultant photograph shows the protractor lines converging inwards. This parallax error indicates that we are beyond the nodal point of the lens. The right image pair shows the protractor root well behind the nodal point. The lines in the corresponding photograph are seen to fan outwards, showing the inverse of the previous parallax error by diverging from the root of the protractor. Finally, in the central image pair, we see the protractor root and the nodal point of the lens approximately collocated. There is no (or minimal) parallax error in the photograph taken from this position, as the protractor lines are seen to be parallel. This is the nodal point, or optical centre of the lens.
Establishing the nodal point using a protractor has the advantage that we can then read the FOV directly. For instance, in the central image above, the horizontal FOV is approximately 72°.
The reason that we need to establish the nodal point of our modified C310 cameras is firstly to verify that they are indeed what we think they are, and secondly to accurately determine the angular distribution over the image. This needs to be done for both horizontal and vertical orientations. The FOV distribution can then be plotted. The image below illustrates this process for our modified C310 cameras. The resulting images and FOV taken from the C310 nodal point are shown. As we can see, the horizontal FOV is slightly less than the 120° advertised, but the vertical is slightly more than would be expected from the 1.33:1 aspect ratio. This may be due either to the lens itself or to its mounted position relative to the sensor.
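When reading angles off the protractor photographs, it helps to have a baseline to compare against. The sketch below gives what an ideal rectilinear (pinhole) lens would produce: the vertical FOV implied by a given horizontal FOV and aspect ratio, and the pixel column at which each protractor angle would land. The 120° figure is the advertised lens FOV; a real wide-angle lens will deviate from this ideal, and that deviation is what the measurement reveals. The function names are illustrative.

```python
import math

def vertical_fov_deg(hfov_deg, aspect_ratio):
    """Vertical FOV of an ideal rectilinear lens given horizontal FOV and width/height ratio."""
    half_h = math.radians(hfov_deg) / 2
    return math.degrees(2 * math.atan(math.tan(half_h) / aspect_ratio))

def pixel_for_angle(angle_deg, image_width_px, hfov_deg):
    """Horizontal pixel at which a ray `angle_deg` off-axis lands (ideal rectilinear lens)."""
    focal_px = (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    return image_width_px / 2 + focal_px * math.tan(math.radians(angle_deg))

print(f"expected vertical FOV: {vertical_fov_deg(120, 4 / 3):.1f} deg")
for angle in range(0, 61, 10):
    print(f"{angle:2d} deg -> x = {pixel_for_angle(angle, 1280, 120):.0f} px")
```

Plotting measured protractor marks against these ideal positions shows where, and by how much, the angular distribution of the real lens departs from the pinhole model.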
This information is required going into our second part of the alignment process: matching the perspective properties (FOV and angular distribution of FOV) of the physical camera image to those of the virtual camera.
Once we have verified the physical camera FOV and its distribution of FOV it must be matched to the properties of the virtual camera as displayed by the Rift. The image above shows an overlay of our established camera FOV grid together with virtual markers representing the Rift’s virtual camera FOV distribution. The grid has been rotated 90° from that shown in the previous image as this is the mounted orientation of the cameras. The Rift’s virtual camera FOV is marked by the black points (in fact geometric spheres in the 3D engine). There is general alignment between these two FOV distributions. It is only at the edges of the image (50° vertical and 30° horizontal) where mismatches are observed. These mismatches are not significant from a usability standpoint. They could be corrected by extending the fragment shader discussed in the final section: calibrating and undistorting our video images in real-time using a shader.
Consumer cameras generally suffer from radial distortion that causes straight lines to bow out toward the edges of the image and uniform grid squares to appear to vary in size. The wider the FOV of the lens, the more pronounced the distortion appears, with the most extreme distortion appearing at the corners and edges of the image.
For head-mounted AR it is important to correct for these distortions. If we do not, then the user will perceive a warped video space, while virtual objects will appear differently. This may lead to difficulties in object size estimation, virtual object misalignment, and a reduced sense of presence in the AR experience. The established method for camera rectification is to firstly calibrate your camera using one of a number of toolkits. The result will be a series of radial and tangential coefficients describing the behaviour of the lens.
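For reference, the radial and tangential coefficients produced by calibration are typically used in a polynomial lens model of the following form. This sketch uses the common Brown-Conrady formulation with two radial terms (k1, k2) and two tangential terms (p1, p2); the coefficient values are hypothetical and are not the calibration results for the C310.

```python
def distort(x, y, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to a normalised image point."""
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 * r2
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_d, y_d

# Hypothetical coefficients: a negative k1 pulls points toward the centre (barrel distortion).
print(distort(0.5, 0.0, k1=-0.3, k2=0.05, p1=0.0, p2=0.0))
print(distort(0.0, 0.0, k1=-0.3, k2=0.05, p1=0.0, p2=0.0))
```

Here x and y are normalised coordinates, that is, pixel offsets from the principal point divided by the focal length in pixels. The effect grows with distance from the centre, which is why the corners of a wide-angle image are affected most.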
OpenCV, a popular computer vision library, features functions to rectify a source image. This standard CPU-bound approach is slow, however, achieving ~15FPS on my Intel i7 CPU. For immersive head-mounted AR we need to render the rectified video images immediately at the source frame rate of our cameras (~50Hz for the C310). The image below shows the results from a custom fragment shader running on the GPU. The left side of the image shows the raw unprocessed C310 image frame, while the right side shows the post-shader image. The rectified image is not perfect, and distortion is still visible at the corners of the image frame, but it is significantly improved over the unprocessed image. The shader runs separately on each camera image and there is no impact on frame rate.
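To make the idea of per-pixel rectification concrete, here is a CPU-side Python sketch of the kind of lookup a fragment shader could perform for every output pixel: work out where the lens recorded that ray, then sample the source image there. It is not the CG shader used in this project. It assumes a single radial coefficient, a principal point at the image centre and nearest-neighbour sampling, and all parameter values are hypothetical.

```python
def rectify(image, focal_px, k1):
    """Undistort a 2D list of pixel values by backward mapping with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2, (height - 1) / 2
    out = [[0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            # Normalised coordinates of the output (undistorted) pixel.
            x, y = (u - cx) / focal_px, (v - cy) / focal_px
            # Where the lens actually recorded that ray (single radial term only).
            scale = 1 + k1 * (x * x + y * y)
            src_u = round(x * scale * focal_px + cx)
            src_v = round(y * scale * focal_px + cy)
            if 0 <= src_u < width and 0 <= src_v < height:
                out[v][u] = image[src_v][src_u]
    return out

# Tiny 5x5 test image with distinct values, and a hypothetical barrel coefficient.
frame = [[row * 5 + col for col in range(5)] for row in range(5)]
for row in rectify(frame, focal_px=2.0, k1=-0.2):
    print(row)
```

Each output pixel is computed independently of the others, which is what makes the operation a natural fit for the GPU.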
Written in CG and for Unity’s shader system, the fragment program is shown in the image below:
The output from the processes covered in this section is what makes the immersive head-mounted AR work as a usable and convincing experience. The stages ensure visual alignment between an object in the stereo video and the overlaid virtual objects. They are the nuts and bolts merging the two media spaces.
The AR-Rift puts the user in a 3D augmented space featuring stereoscopic video captured from wide field-of-view (FOV) cameras together with overlaid virtual content. Two hand markers allow for real-time 6DoF hand tracking. Users are free to move around the tracked volume naturally, much as they would if they were not using the system.
Given these characteristics, an embodied approach is appropriate when designing an interaction metaphor for use with the system. Just as mixing 3D input devices with 2D displays has produced interesting novelties rather than widely adopted paradigms, using a traditional mouse or gamepad to interact in immersive AR or VR is both limited and limiting.
The 3DUI (3D user interface) is described in the video linked in the first article in this series, starting at 7:21. It features a panel attached to the left hand marker and a manipulator in the position of the fingertips on the right hand (handedness can be swapped). The image below shows these two objects in the Unity editor. The green wireframe around the objects represents the attached collider components.
When the manipulator object enters the collider space that surrounds a button on the panel, an indicator visually joins the two objects. After 0.5 seconds of hovering in this position the panel button highlights blue, and after a further 1 second the button highlights green and the action is performed. This process can be clearly seen in the video.
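The dwell-based activation described above can be sketched as a small state machine. This is a Python illustration rather than the Unity implementation. The two delays come from the text, while the class name, state names, callback and the reset-on-exit behaviour are assumptions.

```python
HOVER_S = 0.5    # hover time before the button highlights blue
CONFIRM_S = 1.0  # further hover time before it highlights green and the action fires

class DwellButton:
    """Panel button that fires after the manipulator hovers inside its collider long enough."""

    def __init__(self, on_activate):
        self.on_activate = on_activate
        self.hover_time = 0.0
        self.state = "idle"

    def update(self, manipulator_inside, dt):
        """Advance by dt seconds (call once per frame) and return the current state."""
        if not manipulator_inside:
            # Assumed behaviour: leaving the collider resets the timer.
            self.hover_time, self.state = 0.0, "idle"
            return self.state
        self.hover_time += dt
        if self.hover_time >= HOVER_S + CONFIRM_S:
            if self.state != "activated":
                self.on_activate()     # perform the action exactly once
            self.state = "activated"   # green highlight
        elif self.hover_time >= HOVER_S:
            self.state = "armed"       # blue highlight
        else:
            self.state = "hovering"    # indicator joins manipulator and button
        return self.state

events = []
button = DwellButton(on_activate=lambda: events.append("fired"))
states = [button.update(True, 0.25) for _ in range(8)]
print(states, events)
```

Requiring a sustained hover rather than a single touch is one way to guard against accidental activation when the hand merely passes through a button.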
Several other actions can be performed and are seen in the video. Of particular interest are the gestures relating to the environment. These include docking objects at a specific position, placing virtual displays on the walls of the real room, deleting objects by touching the floor, and donning a virtual Oculus Rift to transition between AR and VR. Some of these actions are illustrated in the image below.
I have developed several AR demonstrations that can be seen in the videos below. Be sure to watch them while wearing a Rift if you have one! The demos include an animated avatar, creating and manipulating virtual objects and particles, creating virtual desktops, using those virtual desktops with the mouse and keyboard, and transitioning between AR and VR by donning a virtual Oculus Rift.
These simple demos hopefully depict how compelling an experience immersive head-mounted AR can be. I am particularly excited about the possibilities of virtual screens once the resolution of commercial head-mounted displays increases. It will be exciting to see how the field develops over the coming years.
The article is part of a special on VR, which appears in the latest issue of New Scientist.
I gave a Unity tutorial at UCL on 24th May 2013. Below you will find screen-capture videos. There are 13 videos in total, running almost 2 hours. They are all at 1024x768 resolution.
My CHI and IEEE VR papers are now available on my publications page. Both papers received honorable mentions for Best Paper.
PanoInserts: Mobile Spatial Teleconferencing appears in proceedings of the International Conference on Human Factors in Computing Systems (CHI) 2013.
Human Tails: Ownership and Control of Extended Humanoid Avatars is presented at IEEE Virtual Reality 2013 and will appear in the IEEE Transactions on Visualization and Computer Graphics journal.
The BBC have included some of our work from the BEAMING project as part of their technology highlights for 2012! Also covered is UCL-based startup chirp.io, which I’ve done some work for.
The cover of the latest issue of the Presence: Teleoperators and Virtual Environments journal features an image from our paper Multimodal Data Capture and Analysis of Interaction in Immersive Collaborative Virtual Environments, which can also be found in the issue. The BEAMING acting rehearsal paper can also be found in the issue. PDFs of both these papers can be found on my publications page.