Saturday, November 26, 2011

Techniques for Automatic Monitoring of Stereoscopic 3D Video


3D CineCast
________________________________

Techniques for Automatic Monitoring of Stereoscopic 3D Video

Posted: 25 Nov 2011 08:07 AM PST

Running a multi-channel TV broadcast installation brings new headaches when 3D is involved. Live monitoring of dozens of TV channels is difficult enough. Over the years several manufacturers have developed automated monitoring solutions covering a whole range of tasks of increasing complexity.

With the advent of 3D, there is literally a new dimension of monitoring tasks, because we have to check not only the integrity of individual video signals but also the correct relationship between the left and right video signals in a stereo pair.

In addition, manual monitoring of 3D is more difficult than 2D because the operator would need either to wear glasses or accept the limitations of autostereoscopic displays. For these reasons, there is a burgeoning interest and market in automatic monitoring of 3D television.

Overview of 3D Monitoring
Analysis and Correction
One of the purposes of automatic analysis is to provide information to enable correction of any problems encountered. The techniques for correction are beyond the scope of this paper, though it is important to point out that correction of an upstream problem may be necessary before monitoring of further aspects can be carried out.

Metadata
The correct use of metadata, for example to identify left and right channels or to signal how they are packed into a single container, can in theory remove the need for some analysis. However, metadata for 3D is not yet fully standardized, and even when it is there will still be cases of incorrect usage, so there will always be a place for techniques that avoid the requirement for metadata. Of course, the results of measurements performed at any point in the processing chain may in their turn be passed on downstream as metadata.

Format Detection
The first task when faced with a single video signal carrying a stereoscopic pair is to identify the format by which the two channels are packed into one container. For some formats this is an easy task, but there are some problems when the granularity of the packing is finer.

Matching Left and Right Images
Having unpacked the signal into left and right channels, the next task is to check whether the two channels are correctly matched, particularly as regards timing, grey scale and colour balance. Grey scale and colour balance can be aligned using histogramming techniques. Relative timing can be measured using fingerprinting techniques similar to those used for lip-sync measurement. It is important to note that a timing mismatch will not only be detrimental to the 3D viewing experience but will also have an adverse effect on downstream analysis, particularly of 3D depth. Relative timing is thus a good example of the need to correct a problem before further analysis can reliably be performed.

Depth or Disparity Analysis
A more algorithmically challenging analysis task is to measure the 3D depth across the picture, which is directly related, via screen size and resolution, to disparity or relative displacement between the left and right representations of objects in the scene. Horizontal disparity that is outside a certain range, as well as undue vertical disparity, are known to cause significant problems of eye strain for some viewers. Disparity analysis is also important for checking the overall relative geometric alignment of the two images.

Higher Level Analysis
Finally, we shall look at two examples of detection tasks which require a higher level of analysis. The first is deceptively simple to state: can we tell whether the left and right channels have been inadvertently swapped? The second is: can we tell whether the 3D pair has come from a simple 2D to 3D converter? Ultimately, 3D analysis can extend to detecting or measuring any process that has been carried out on 3D signals, either with a view to improving, modifying or reversing the process, or simply in order to report or record what has been done.

Format Detection
There are many ways in which left and right signals may be packed into a single video channel. These include left-right or top-bottom juxtaposition (with or without reflection of one of the channels), line interleaved, column interleaved, checkerboard and frame interleaved formats.

For the purposes of automatic detection, these formats may be classified into two groups. Left/right and top/bottom formats are "loose packed" because the two pictures are physically quite separate. The remaining formats are "close packed" because corresponding left and right channel pixels are close together in space or time.

Loose Packing
Loose packed formats are quite easy to detect. One way is to carry out a trial unpacking with an assumed format and then detect whether the two resulting images are sufficiently similar to be a stereoscopic pair. And if the two images turn out to be identical, we may conclude that a 2D image is being transported in a 3D container; this is a simple case of disparity estimation in which we look for zero disparity across the image.

Figure 1 shows the left-right differences for a small area of a picture when each of four possible trial formats is used to unpack each of four possible actual formats. Where the correct format has been used, the left-right difference contains only edge information arising from disparity.

We can summarise the detection of loose-packed formats by saying that we exploit the relative similarity of the left and right images when compared with unrelated, distant parts of the picture.

Figure 1 - Detection of loose packed formats
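
The trial-unpacking idea can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce Figure 1. The four format names, the mean-absolute-difference similarity measure and the function names are all assumptions made for the example, and a frame is simply a list of rows of luminance values.

```python
# Hypothetical trial formats; the four used in Figure 1 are not named in the text.
FORMATS = ("side_by_side", "side_by_side_mirrored",
           "top_bottom", "top_bottom_flipped")

def unpack(frame, fmt):
    """Split a packed frame into two candidate views for one trial format."""
    h, w = len(frame), len(frame[0])
    if fmt == "side_by_side":
        return [r[: w // 2] for r in frame], [r[w // 2:] for r in frame]
    if fmt == "side_by_side_mirrored":
        return [r[: w // 2] for r in frame], [r[w // 2:][::-1] for r in frame]
    if fmt == "top_bottom":
        return frame[: h // 2], frame[h // 2:]
    if fmt == "top_bottom_flipped":
        return frame[: h // 2], frame[h // 2:][::-1]
    raise ValueError(fmt)

def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized images."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def detect_loose_format(frame):
    """Return the trial format whose two halves are most alike, plus all scores."""
    scores = {fmt: mean_abs_diff(*unpack(frame, fmt)) for fmt in FORMATS}
    return min(scores, key=scores.get), scores
```

A winning score of exactly zero would suggest a 2D image carried in a 3D container. In practice the lowest score would also need to be compared against a threshold before declaring that the frame holds a stereoscopic pair at all.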

Close Packing
Close packed formats present more of a problem because the packed image looks increasingly like a single 2D image as the amount of 3D content in the scene decreases. So simply carrying out trial unpackings will often give a positive result, even if the wrong format is being tried. If there is significant 3D content, the detection becomes easier because a picture wrongly unpacked will look increasingly less like a pair of plausible images.

The left half of Figure 2 shows a small part of the left image for some different combinations of packing and unpacking formats, and the right half shows the combined energy of horizontal and vertical high pass filtered versions of those outputs. The energy is clearly significantly lower when the correct unpacking format has been used.

We can summarise the detection of these close-packed formats by saying that we exploit the relative difference of the left and right images when compared with adjacent pixels or lines.

Figure 2 - Detection of close packed formats
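
A rough sketch of the high-pass energy test follows. It assumes three spatially interleaved formats, uses simple first differences as the horizontal and vertical high-pass filters, and extracts only the candidate left view; the names and the choice of filter are illustrative assumptions rather than details taken from the article.

```python
# Hypothetical names for the spatially interleaved ("close packed") formats.
FORMATS = ("line_interleaved", "column_interleaved", "checkerboard")

def unpack_left(frame, fmt):
    """Extract the candidate left view from a packed frame for one trial format."""
    if fmt == "line_interleaved":
        return frame[0::2]                      # even lines
    if fmt == "column_interleaved":
        return [row[0::2] for row in frame]     # even columns
    if fmt == "checkerboard":
        # Pixels where (x + y) is even, squeezed horizontally.
        return [row[y % 2::2] for y, row in enumerate(frame)]
    raise ValueError(fmt)

def highpass_energy(img):
    """Mean squared horizontal plus vertical first difference (needs >= 2x2)."""
    horiz = [(r[x + 1] - r[x]) ** 2 for r in img for x in range(len(r) - 1)]
    vert = [(img[y + 1][x] - img[y][x]) ** 2
            for y in range(len(img) - 1) for x in range(len(img[0]))]
    return sum(horiz) / len(horiz) + sum(vert) / len(vert)

def detect_close_format(frame):
    """Return the trial format giving the smoothest unpacked view, plus energies."""
    energies = {fmt: highpass_energy(unpack_left(frame, fmt)) for fmt in FORMATS}
    return min(energies, key=energies.get), energies
```

Because each trial format subsamples the picture differently, the energies are not strictly comparable on flat or nearly 2D material, which is the weakness described above.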

Temporal interleaving presents further difficulties because there is a higher chance that motion can be confused with left-right disparity. This could be overcome using motion compensated high-pass filtering, though care would have to be taken to use information from a single channel (albeit subsampled) for the purposes of motion estimation.

Depth or Disparity Analysis
One of the most important monitoring or analysis tasks in stereoscopic 3D is to measure the perceived depth of the various objects in the scene. Perceived depth is a function of disparity (the horizontal distance between left and right representations of the object, measured in pixels), display size and resolution, and viewing distance. In the context of signal monitoring, we can only measure disparity and then relate it to perceived depth for different display configurations.
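
As an illustration of how a measured disparity might be related to perceived depth, the sketch below uses the standard similar-triangles viewing geometry. The 65 mm default eye separation, the sign convention (positive disparity places the object behind the screen) and the function name are assumptions of this example, not figures from the article.

```python
import math

def perceived_distance_m(disparity_px, screen_width_m, horizontal_resolution_px,
                         viewing_distance_m, eye_separation_m=0.065):
    """Distance from the viewer at which a point appears, in metres.

    disparity_px is right-image x minus left-image x: zero puts the point on
    the screen, negative values in front of it, positive values behind it.
    """
    # Convert the pixel disparity to a physical separation on the screen.
    separation_m = disparity_px * screen_width_m / horizontal_resolution_px
    if separation_m >= eye_separation_m:
        # The eyes would have to be parallel or diverge: no finite depth.
        return math.inf
    return viewing_distance_m * eye_separation_m / (eye_separation_m - separation_m)
```

The same pixel disparity therefore maps to very different depths on a phone, a living-room television and a cinema screen, which is why a monitor can only report disparity and then interpret it per display configuration.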

Disparity measurement is useful for many monitoring purposes, the most important being to provide a warning if the viewer is likely to suffer eye strain. Other reasons for measuring disparity are to verify that the sequence really is 3D rather than just being 2D in a 3D container, to detect and correct for global geometric distortions between the two channels, and to assist in the insertion of captions or subtitles at suitable depths.

Eye Strain
Eye strain can occur in 3D viewing when disparity exceeds certain limits – particularly if the eyes are being encouraged to diverge, an unnatural action. The limits depend on display size but it is also useful to measure how often and for how long extreme disparity values are observed, and possibly to identify where in the scene the extremes are occurring.

Disparity Measurement
One class of disparity measurement methods involves performing a local correlation between the left and right images to generate a sparse disparity map. This approach is ideal for looking at the behaviour of different objects in the scene and for determining to what extent limits have been exceeded.
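
A minimal block-matching sketch of this approach is shown below. It uses the sum of absolute differences (SAD) as a simple stand-in for a correlation measure, searches horizontally only, and treats images as lists of rows; the block size, search range and function names are assumptions for illustration.

```python
def block_disparity(left, right, x0, y0, block=4, search=4):
    """Horizontal disparity (right x minus left x) of the block at (x0, y0).

    Returns the shift d minimising the SAD between the left-image block and
    the right-image block displaced by d pixels, or None if no shift fits.
    """
    width = len(left[0])
    best_d, best_sad = None, None
    for d in range(-search, search + 1):
        if x0 + d < 0 or x0 + d + block > width:
            continue  # candidate block would fall outside the picture
        sad = sum(abs(left[y0 + j][x0 + i] - right[y0 + j][x0 + d + i])
                  for j in range(block) for i in range(block))
        if best_sad is None or sad < best_sad:
            best_d, best_sad = d, sad
    return best_d

def sparse_disparity_map(left, right, step=4, block=4, search=4):
    """Disparity at a coarse grid of block positions, keyed by (x, y)."""
    h, w = len(left), len(left[0])
    return {(x, y): block_disparity(left, right, x, y, block, search)
            for y in range(0, h - block + 1, step)
            for x in range(0, w - block + 1, step)}
```

A practical implementation would also discard blocks with little texture, where the match is ambiguous, and blocks near the picture edge, where the true match may lie outside the search range.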

Other methods seek to generate a dense disparity map, in which every pixel has an associated disparity value, or possibly an occlusion indicator if there is no corresponding point in the other picture. This approach would be necessary if the measurement were being used to drive post-processing, for example to change the effective camera spacing.

Finally, for some applications an approximate, region-based approach to disparity measurement might be sufficient, for example to gather statistics about typical depth ranges used across a programme, or to drive a global spatial transform to correct for camera misalignment.

Vertical Disparity
The impression of depth is conveyed by introducing horizontal disparity. If there is any vertical disparity present, it should be detected and corrected, both because it can be very disturbing to the eyes, and because it can interfere with correct measurement of horizontal disparity.

Of course, horizontal and vertical disparity can be measured jointly using conventional motion estimation methods. However, it would be preferable to exploit the constraints arising from stereoscopy. For example, we would expect vertical disparity to be a combination of two components: one directly related to horizontal disparity, such as might arise from a vertical displacement between the cameras, and one which fits a simple global model, such as might arise from different zoom factors or axis directions between the cameras.

Disparity Monitoring Display
Figure 3 shows an example of a monitoring display that provides information about the distribution of disparity in various ways, including a left-right difference, a disparity histogram, an indication of vertical disparity and a colour coded warning of the possibility of eye strain from near and far objects for different display sizes. Such a tool makes good use of automatic analysis coupled with an operator's skill in interpreting the results.

Figure 3 - Example of a disparity monitoring display

Dense Disparity Maps
Because of the difficulty and the usefulness of measuring dense disparity maps, there is some interest in standardising a format for dense disparity map metadata. For example, SMPTE has recently begun such an activity.

Higher Level Analysis
Left-Right Swap Detection
Many people viewing 3D demonstrations have encountered the situation where the left and right images have been inadvertently swapped over. The result is very disturbing, but it is not always obvious even to a human observer what is wrong. It would be useful to be able to detect the swap automatically, but this turns out to be quite a difficult problem.

Measurement of a disparity map is a good starting point, but a correctly arranged 3D pair will typically exhibit both negative disparity values for objects intended to be seen in front of the screen and positive values for objects behind the screen. So a simple analysis of the histogram of disparity values, for example, will not be enough.

One approach that works with reasonable reliability is based on the spatial distribution of disparity values. We observe that for most scenes objects at the centre and bottom of the screen are generally nearer than objects at the top and sides. Figure 4 shows the spatial disparity distribution measured over a set of varied clips comprising 6000 frames.

A possible left-right detection algorithm is to correlate measured disparity with the above template. A positive correlation indicates that the assumed left-right configuration is correct, while a negative correlation indicates that it is reversed.

Figure 4 - Spatial disparity distribution
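
The template correlation can be sketched as follows. The template values used in any real system would come from measured statistics such as those in Figure 4; here the template is simply an argument, and the function names and the use of a plain Pearson correlation coefficient are assumptions of the example.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat map carries no evidence either way
    return cov / math.sqrt(var_a * var_b)

def swap_score(disparity_map, template):
    """Correlate a measured regional disparity map with the expected template."""
    measured = [v for row in disparity_map for v in row]
    expected = [v for row in template for v in row]
    return pearson(measured, expected)

def is_swapped(disparity_map, template):
    """A negative correlation suggests the left and right views are reversed."""
    return swap_score(disparity_map, template) < 0
```

Swapping the two views negates every disparity value, which flips the sign of the correlation while leaving its magnitude unchanged. Averaging the score over many frames, as in the rolling averages described below, makes the decision far more stable than a per-frame test.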

Figure 5 shows the results of such an algorithm on 38,000 frames of (correctly ordered) 3D material. The blue line shows a 10-frame rolling average and the red line a 1000-frame rolling average of correlation coefficients between measured disparity and the template.

Figure 5 - Performance of left-right swap detection algorithm based on disparity distribution

Whenever the graph is positive, the algorithm is giving a correct result. The last third of the material is professionally produced, well-behaved 3D material whereas the first two-thirds consists of test sequences of varying quality. Clearly, there is always some material that will defeat the algorithm, but on "normal" material it is quite reliable.

A potentially more reliable method of left-right detection is based on the observation that closer objects are expected to occlude objects that are further away. A dense disparity estimator will usually have some kind of confidence output which indicates whether a pixel or region in one view has no equivalent in the other view and is therefore an occluded background region.

As shown in Figure 6, we would expect occluded regions to extend to the left of transitions in the left-eye view and to the right in the right-eye view. The bottom part of the diagram shows where the transitions between foreground (green) and background (blue) are observed to be in relation to occlusions (red) in the two views.

This observation allows us to determine automatically, on a statistical basis, which view is the left-eye view and which is the right-eye view. This approach is potentially more reliable than the method based on spatial disparity distribution, but it does depend on accurate dense disparity estimation including reliable location of occlusions.

Figure 6 - Use of occlusions in left-right swap detection

Reliable analysis of the local relationship between depth and occlusions may be employed for other high-level monitoring tasks, for example to provide a warning that captions might have been inserted at an inappropriate location or depth relative to the other objects in the scene.

2D to 3D Conversion Detection
Our final example concerns the automatic detection of automatic 2D to 3D conversion.

One common technique in simple 2D to 3D conversion is the use of a fixed spatial disparity profile; for example the bottom and centre of the picture are made to appear closer than the top and sides, much as shown by Figure 4 above.

Another technique is to introduce delay between two versions of the same moving sequence to give an impression of depth. This can work because a 3D camera rig tracking across a static scene will in fact generate two streams separated by a delay which corresponds to the time taken for the camera to move by the eye spacing distance.

The algorithm illustrated in Figure 7 detects the use of either or both of these techniques, to give a warning that a 2D to 3D converter might have been used.

Figure 7 - Block diagram of automatic 2D to 3D conversion detector

Fingerprints are calculated separately on the left and right input picture signals. These could be as simple as the average luminance value over each frame, an average over each of a few regions, or any measure which when applied to correctly co-timed left and right signals would be expected to be similar to each other.

A correlation process is then applied to the two fingerprint signals to produce an estimated temporal offset between the input channels. This estimated offset is applied to a temporal low pass filter, which may for example be designed to detect piecewise constant inputs. The filtered temporal offset value is used to control a temporal alignment process on the left and right images; this would be done by applying a delay to one or other of the two inputs.
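
The fingerprint and offset-estimation stages might look something like the sketch below. It uses the mean luminance of each frame as the fingerprint, a mean-squared-difference search as a simple stand-in for the correlation process, and a running median as one possible filter that preserves piecewise constant offsets. All names, the lag range and the window length are assumptions.

```python
import statistics

def frame_fingerprint(frame):
    """Average luminance of one frame (a list of rows of pixel values)."""
    return sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))

def estimate_offset(fp_left, fp_right, max_lag=5):
    """Lag (in frames) by which the right fingerprint trails the left one.

    Returns the lag for which fp_right[n + lag] best matches fp_left[n].
    """
    best_lag, best_err = 0, None
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(fp_left[n], fp_right[n + lag])
                 for n in range(len(fp_left)) if 0 <= n + lag < len(fp_right)]
        if not pairs:
            continue
        err = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
        if best_err is None or err < best_err:
            best_lag, best_err = lag, err
    return best_lag

def median_smooth(offsets, window=5):
    """Running median: rejects isolated outliers, keeps step changes sharp."""
    half = window // 2
    return [statistics.median(offsets[max(0, i - half): i + half + 1])
            for i in range(len(offsets))]
```

The smoothed offset would then select how many frames of delay to apply to the earlier channel before disparity is measured.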

A disparity map between the temporally aligned left and right images is then calculated, producing a number of disparity values across the picture. A temporal high pass filter is applied to the disparity values, thereby looking for variation in time of the disparity observed in each part of the picture. The mean square value, or other average energy value, of the high pass filter output is calculated. In parallel, a spatial regression process is applied to the disparity map to see if the map fits a fixed spatial model. A low mean square output from the temporal high pass filter and a close correlation to a fixed spatial model each provide evidence for a final decision that simple 2D to 3D conversion might have been performed.
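
The two pieces of evidence can be sketched as below, taking as input a sequence of regional disparity maps (one flat list of values per frame) that have already been temporally aligned. The first-difference high-pass filter, the use of a time-averaged map for the regression, the R-squared measure and both thresholds are assumptions chosen for illustration, not values from the article.

```python
def temporal_highpass_energy(disparity_frames):
    """Mean square frame-to-frame change of each regional disparity value."""
    diffs = [(cur[i] - prev[i]) ** 2
             for prev, cur in zip(disparity_frames, disparity_frames[1:])
             for i in range(len(cur))]
    return sum(diffs) / len(diffs)

def spatial_fit_r2(disparity_frames, model):
    """R-squared of a straight-line fit of the time-averaged map to the model."""
    n_frames = len(disparity_frames)
    avg = [sum(f[i] for f in disparity_frames) / n_frames
           for i in range(len(model))]
    mean_d, mean_m = sum(avg) / len(avg), sum(model) / len(model)
    cov = sum((d - mean_d) * (m - mean_m) for d, m in zip(avg, model))
    var_d = sum((d - mean_d) ** 2 for d in avg)
    var_m = sum((m - mean_m) ** 2 for m in model)
    if var_d == 0 or var_m == 0:
        return 0.0  # a flat map cannot be said to follow the model
    return cov * cov / (var_d * var_m)

def looks_converted(disparity_frames, model,
                    energy_threshold=0.01, r2_threshold=0.9):
    """True if either cue suggests a simple 2D to 3D converter (hypothetical thresholds)."""
    return (temporal_highpass_energy(disparity_frames) < energy_threshold
            or spatial_fit_r2(disparity_frames, model) > r2_threshold)
```

Real stereoscopic material normally shows disparity that changes as objects and cameras move and that does not follow one fixed profile, so both measures would be expected to fall on the other side of their thresholds.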

With automatic detection such as this, one can envisage a game of "cat and mouse" whereby detection algorithms have to become ever more sophisticated in order to keep up with the increasing complexity of automatic 2D to 3D converters.

By Mike Knee, Snell via 3Droundabout

MPEG Analysis and Measurement

Posted: 25 Nov 2011 08:52 AM PST

Broadcast engineering requires a unique set of skills and talents. Some audio engineers claim the ability to hear tiny nuances such as the difference between kinds of speaker wire. They are known as those with golden ears. Their video engineering counterparts can spot and obsess over a single deviant pixel during a Super Bowl touchdown pass or a "Leave it to Beaver" rerun in real time. They are known as eagle eyes or video experts.

Not all audio and video engineers are blessed with super-senses. Nor do we all have the talent to focus our brain's undivided processing power to discover and discern vague, cryptic and sometimes immeasurable sound or image anomalies with our bare eyes or ears on the fly, me included. Sometimes, the message can overpower the media. Fortunately for us and thanks to the Internet and digital video, more objective quality and measurement standards and tools have been developed.

One of those standards is Perceptual Evaluation of Video Quality (PEVQ). It is an End-to-End (E2E) measurement algorithm standard that grades picture quality of a video presentation by a five-point Mean Opinion Score (MOS), one being bad and five being excellent.

PEVQ can be used to analyze visible artifacts caused by digital video encoding/decoding or transcoding processes, RF- or IP-based transmission systems and viewer devices like set-top boxes. PEVQ is suited for next-generation networking and mobile services, including SD and HD IPTV, streaming video, mobile TV, video conferencing and video messaging.

Development of PEVQ began with still images. Evaluation models were later expanded to include motion video. PEVQ can be used to assess degradations of a decoded video stream from the network, such as that received by a TV set-top box, in comparison to the original reference picture as broadcast from the studio. This evaluation model is referred to as End-to-End (E2E) quality testing.

E2E exactly replicates how so-called average viewers would evaluate the video quality based on subjective comparison, so it addresses Quality-of-Experience (QoE) testing. PEVQ is based on modeling human visual behaviors. It is a full-reference algorithm that analyzes the picture pixel-by-pixel after a temporal alignment of corresponding frames of reference and test signal.

Besides an overall quality Mean Opinion Score figure of merit, abnormalities in the video signal are quantified by several Key Performance Indicators (KPI), such as Peak Signal-to-Noise Ratio (PSNR), distortion indicators and lip-sync delay.

PEVQ References
Video quality test algorithms can be divided into three categories according to how much of the original reference data is available to them.

A Full Reference (FR) algorithm has access to and makes use of the original reference sequence for a comparative difference analysis. It compares each pixel of the reference sequence to each corresponding pixel of the received sequence. FR measurements deliver the highest accuracy and repeatability but are processing intensive.
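
As a small example of a full-reference measurement, the sketch below computes PSNR, one of the KPIs mentioned above, by comparing every pixel of a received frame with the corresponding reference pixel. It is only the simplest FR metric, not the PEVQ algorithm itself; the function name and the 8-bit peak value are assumptions.

```python
import math

def psnr_db(reference, received, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are lists of rows of pixel values; peak is the maximum possible
    pixel value (255 is assumed here for 8-bit video).
    """
    count = len(reference) * len(reference[0])
    mse = sum((a - b) ** 2
              for row_ref, row_rx in zip(reference, received)
              for a, b in zip(row_ref, row_rx)) / count
    if mse == 0:
        return math.inf  # frames are identical
    return 10 * math.log10(peak ** 2 / mse)
```

The frames must already be temporally aligned, since comparing a frame with its neighbour would report motion as distortion.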

A Reduced Reference (RR) algorithm uses a reduced bandwidth side channel between the sender and the receiver, which is not capable of transmitting the full reference signal. Instead, parameters are extracted at the sending side, which help predict the quality at the receiving end. RR measurements are less accurate than FR and represent a working compromise if bandwidth for the reference signal is limited.

A No Reference (NR) algorithm only uses the degraded signal for the quality estimation and has no information of the original reference sequence. NR algorithms are low accuracy estimates only, because the original quality of the source reference is unknown. A common variant at the upper end of NR algorithms analyzes the stream at the packet level, but not the decoded video at the pixel level. The measurement is consequently limited to a transport stream analysis.

Another widely used MOS algorithm is VQmon. This algorithm was recently updated to VQmon for Streaming Video. It performs real-time analysis of video streamed using the key Adobe, Apple and Microsoft streaming protocols, analyzes video quality and buffering performance and reports detailed performance and QoE metrics. It uses packet/frame-based zero reference, with fast performance that enables real-time analysis of the impact that loss of I, B and P frames has on the content, both encrypted and unencrypted.

The 411 on MDI
The Media Delivery Index (MDI) measurement is specifically designed to monitor networks that are sensitive to arrival time and packet loss such as MPEG-2 video streams, and is described by the Internet Engineering Task Force document RFC 4445. It measures key video network performance metrics, including jitter, nominal flow rate deviations and instant data loss events for a particular stream.

MDI provides information to detect virtually all network-related impairments for streaming video, and it enables the measurement of jitter on fixed and variable bit-rate IP streams. MDI is typically shown as two values separated by a colon, the Delay Factor (DF) and the Media Loss Rate (MLR), i.e. DF:MLR.

DF is the number of milliseconds of streaming data that buffers must handle to eliminate jitter, something like a time-base corrector once did for baseband video. It is determined by first calculating the MDI virtual buffer depth of each packet as it arrives. In video streams, this value is sometimes called the Instantaneous Flow Rate (IFR). When calculating DF, it is known as DELTA.

To determine DF, DELTA is monitored to identify maximum and minimum virtual depths over time. Usually one or two seconds is enough time. The difference between maximum and minimum DELTA divided by the stream rate reveals the DF. In video streams, the difference is sometimes called the Instantaneous Flow Rate Deviation (IFRD). DF values less than 50ms are usually considered acceptable. An excellent white paper with much more detail on MDI is available from Agilent.

Figure 1 - The Delay Factor (DF) dictates buffer size needed to eliminate jitter

Using the formula in Figure 1, let's say a 3Mb/s MPEG video stream observed over a one-second interval fills a virtual buffer to a maximum of 3.005Mb and a minimum of 2.995Mb. The difference, in this case 10kb, is the buffer depth needed to absorb the jitter. Dividing that difference by the stream rate gives the DF: 10kb divided by 3Mb/s is 3.333 milliseconds, or 4 milliseconds rounded up. Thus, to avoid packet loss in the presence of the known jitter, the receiver must buffer at least 10kb; a 4 millisecond buffer holds 12kb at 3Mb/s and injects 4 milliseconds of delay. A device with an MDI rating of 4:0.003, for example, would indicate that the device has a 4 millisecond DF and an MLR of 0.003 media packets per second.
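
The division in this example can be checked with a few lines of Python. The function name and the choice of bits and bits per second as units are assumptions; the calculation itself is the difference between maximum and minimum virtual buffer depth divided by the stream rate, as described above.

```python
import math

def delay_factor_ms(vb_max_bits, vb_min_bits, stream_rate_bps):
    """Delay Factor in milliseconds over one measurement interval."""
    return (vb_max_bits - vb_min_bits) / stream_rate_bps * 1000.0

# Figures from the example: 3.005Mb maximum, 2.995Mb minimum, 3Mb/s stream.
df = delay_factor_ms(3_005_000, 2_995_000, 3_000_000)
print(f"DF = {df:.3f} ms, reported as {math.ceil(df)} ms")
```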

The MLR formula in Figure 2 is computed by dividing the number of lost or out-of-order media packets by observed time in seconds. Out-of-order packets are crucial because many devices don't reorder packets before handing them to the decoder. The best-case MLR is zero. The maximum acceptable MLR for HDTV is generally considered to be 0.0005. An MLR greater than zero adds time for viewing devices to lock into the higher MLR, which slows channel surfing and can introduce various ongoing anomalies when locked in.

Figure 2 - The Media Loss Rate (MLR) is used in the Media Delivery Index (MDI)
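
The MLR calculation and the DF:MLR notation can be sketched in the same way. The function names and the packet counts in the test figures are hypothetical; only the formula (lost plus out-of-order media packets divided by seconds observed) comes from the text.

```python
def media_loss_rate(lost_packets, out_of_order_packets, seconds):
    """Media Loss Rate in media packets per second."""
    return (lost_packets + out_of_order_packets) / seconds

def format_mdi(df_ms, mlr):
    """Render an MDI reading in the usual DF:MLR notation."""
    return f"{df_ms:g}:{mlr:g}"
```

For instance, two lost packets and one out-of-order packet over 1000 seconds give an MLR of 0.003, which together with a 4 millisecond DF is written 4:0.003.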

Watch That Jitter
Just as too much coffee can make you jittery, heavy traffic can make a network jittery, and jitter is a major source of video-related IP problems. Proactively monitoring jitter can alert you to impending QoE issues in time to avert them.

One way to overload an MPEG-2 stream is with excessive bursts. Packet bursts can cause a network-level or a set-top box buffer to overflow or under-run, resulting in lost packets or empty buffers, which cause macro blocking or black/freeze frame conditions, respectively. An overload of metadata such as video content PIDs can contribute to this problem.

Probing a streaming media network at various nodes and under different load conditions makes it possible to isolate and identify devices or bottlenecks that introduce significant jitter or packet loss to the transport stream. Deviations from nominal jitter or data loss benchmarks are indicative of an imminent or ongoing fault condition.

QoE is one of many subjective measurements used to determine how well a broadcaster's signal, whether on-air, online or on-demand, satisfies the viewer's perception of the sights and sounds as they are reproduced at his or her location. I can't help but find some humor in the idea that the ones-and-zeros of a digital video stream can be rated on a gray scale of 1-5 for quality.

Experienced broadcast engineers know the so-called quality of a digital image begins well before the light enters the lens, and with apologies to our friends in the broadcast camera lens business, the image is pre-distorted to some degree within the optical system before the photons hit the image sensors.

QoE or RST?
A scale of 1-5 is what ham radio operators have used for 100 years in the readability part of the Readability, Strength and Tone (RST) code system. While signal strength (S) could be objectively measured with an S-meter such as shown in Figure 3, readability (R) was purely subjective, and tone (T) could be subjective, objective or both.

Figure 3 - The S-meter was the first commonly used metric to objectively read and report signal strength at an RF receive site

Engineers and hams know that as S and/or T diminish, R follows, but that minimum acceptable RST values depend almost entirely on the minimum R figure the viewer or listener is willing to accept. In analog times, the minimum acceptable R figure often varied with the value of the message.

Digital technology and transport remove the viewer or listener's subjective reception opinion from the loop. Digital video and audio are either as perfect as the originator intended or practically useless. We don't need a committee to tell us that. It seems to me the digital cliff falls just south of a 4x5x8 RST. Your opinion may vary.

By Ned Soseman, Broadcast Engineering