 Open Access
 Authors : Harini S
 Paper ID : IJERTCONV2IS13118
 Volume & Issue : NCRTS – 2014 (Volume 2 – Issue 13)
Published (First Online): 30-07-2018
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Automatic Image Conversion From 2D to 3D Using Support Vector Machine
Harini S
Dept. of Computer Science
H.K.B.K College of Engineering, Bangalore-45
Abstract—3D image and video applications are becoming popular in our daily life, especially in home entertainment. Although more and more 3D movies are being made, 3D image and video content is still not rich enough to satisfy future market demand. There is therefore a rising demand for techniques that automatically convert 2D images to 3D. Existing approaches involve either a human operator or fully automatic conversion. The proposed system includes two learning-based conversion methods: a point mapping from local image attributes, chiefly color, location, and motion, and a global method that estimates the depth map of an image directly from a repository of 3D images (image+depth pairs) using a nearest-neighbor regression type idea. We demonstrate both the efficacy and the computational efficiency of these methods on numerous 2D images and discuss their drawbacks and benefits. Since these methods fall short in computation time, we also present a support vector machine (SVM) method that improves time efficiency while computing the depth map of an image, and we use median and cross-bilateral filters to produce high-quality images.
Keywords—Images, nearest-neighbor search, SVM, filters.
I. INTRODUCTION
Rapid development of 3D display technologies and imaging has brought 3D into our life. As more facilities and devices become 3D-capable, the demand for 3D image and video content is increasing sharply. However, the tremendous amount of current and past media data is in 2D format, and 3D stereo content is still scarce.
The availability of 3D-capable hardware today, such as TVs, Blu-ray players, gaming consoles, and smartphones, is not yet matched by 3D content production. Although constantly growing in number, 3D movies are still an exception rather than a rule, and 3D broadcasting (mostly sports) is still minuscule compared to 2D broadcasting. The gap between 3D hardware and 3D content availability is likely to close in the future, but today there exists an urgent need to convert existing 2D content to 3D. A typical 2D-to-3D conversion process consists of two steps: depth estimation for a given 2D image, and depth-based rendering of a new image in order to form a stereo pair.
The methods we propose in this paper carry the big-data philosophy of machine learning. In consequence, they apply to arbitrary scenes and require no manual annotation. Our data-driven approach to 2D-to-3D conversion has been inspired by the recent trend to use large image databases for various computer vision tasks, such as object recognition [18] and image saliency detection [19]. In particular, we propose a new class of methods that are based on the radically different approach of learning the 2D-to-3D conversion from examples. We develop two types of methods. The first is based on learning a point mapping from local image/video attributes, such as color, spatial position, and motion at each pixel, to scene depth at that pixel using a regression type idea. The second is based on globally estimating the entire depth map of a query image directly from a repository of 3D images (image+depth pairs or stereopairs) using a nearest-neighbor regression type idea. Early versions of our learning-based approach to 2D-to-3D image conversion either suffered from high computational complexity [8] or were tested on only a single dataset [9]. Here, we introduce the local method and evaluate the qualitative performance and the computational efficiency of both the local and global methods against those of the Make3D algorithm [14] and a recent method proposed by Karsch et al. [7]. We demonstrate the improved quality of the depth maps produced by our global method relative to state-of-the-art methods, together with up to four orders of magnitude reduction in computational effort. We also discuss weaknesses of both proposed methods.

II. EXISTING SYSTEM

A. Semi-Automatic Method
To reduce operator involvement in the process and, therefore, lower the cost while speeding up the conversion, research effort has recently focused on the most labor-intensive step of manual involvement, namely spatial depth assignment. Guttmann et al. [6] proposed dense depth recovery via diffusion from sparse depth assigned by the operator. In the first step, the operator assigns relative depth to image patches in some frames by scribbling. In the second step, a combination of depth diffusion, which accounts for local image saliency and local motion, and depth classification is applied. In the final step, disparity is computed from the depth field and two novel views are generated by applying half of the disparity amplitude. Phan et al. [12] propose a simplified and more efficient version of the Guttmann et al. [6] method using scale-space random walks that they solve with the help of graph cuts. Liao et al. [10] further simplify operator involvement by first computing optical flow, then applying structure-from-motion estimation, and finally extracting moving-object boundaries. The role of the operator is to correct errors in the automatically computed depth of moving objects and to assign depth in undefined areas.

B. Automatic Method
Several electronics manufacturers have developed real-time 2D-to-3D converters that rely on stronger assumptions and simpler processing than the methods discussed above; e.g., moving or larger objects are assumed to be closer to the viewer, higher-frequency texture is assumed to belong to objects located further away, etc. Although such methods may work well in specific scenarios, in general it is very difficult, if not impossible, to construct heuristic assumptions that cover all possible background and foreground combinations.
The problem of depth estimation from a single 2D image, which is the main step in 2D-to-3D conversion, can be formulated in various ways, for example as a shape-from-shading problem [20]. However, this problem is severely underconstrained; quality depth estimates can be found only for special cases. Other methods, often called multi-view stereo, attempt to recover depth by estimating scene geometry from multiple images not taken simultaneously. For example, a moving camera permits structure-from-motion estimation [17], while a fixed camera with varying focal length permits depth-from-defocus estimation [16]. Both are examples of the use of multiple images of the same scene captured at different times or under different exposure conditions.


III. PROPOSED SYSTEM
A. LOCAL POINT TRANSFORMATION
The first class of conversion methods we present is based on learning a point transformation that relates local low-level image or video attributes at a pixel to scene depth at that pixel. Once the point transformation is learned, it is applied to a monocular image, i.e., depth is assigned to a pixel based on its attributes. This is in contrast to the global method described later, where the entire depth map of a query is estimated directly from a repository of 3D images (image+depth pairs or stereopairs) using a nearest-neighbor regression type idea.
A pivotal element in this approach is the point transformation used to compute depth from image attributes. This transformation can be estimated either by training on a ground-truth dataset, the approach we take in this paper, or defined heuristically.
Let I = {(I_1, d_1), (I_2, d_2), …, (I_K, d_K)} denote a training dataset composed of K pairs (I_k, d_k), where I_k is a color image (usually in YUV format) and d_k is the corresponding depth field. We assume that all images and depth fields have the same spatial dimensions. Such a dataset can be constructed in various ways. One example is the Make3D dataset [13], [14], [21] that consists of images and depth fields captured outdoors by a laser range finder. Another example is the NYU Kinect dataset [15], [22] containing over 100k images and depth fields captured indoors using a Kinect camera.
Given a training set I consisting of K image+depth pairs, one can, in principle, learn a general regression function that maps a tuple of local features such as (color, location, motion) to a depth value, i.e.,

f : (color, location, motion) → depth.

However, to ensure low runtime memory and processing costs, we learn a more restricted, additive form of transformation:

f[color, x, motion] = w_c f_c[color] + w_l f_l[x] + w_m f_m[motion],

where f_c, f_l, and f_m are the color-depth, location-depth, and motion-depth transformations, and w_c, w_l, w_m are scalar weights. We now discuss how the individual transformations as well as the weights are learned.
Fig. 1 shows a sample video frame with depth maps estimated from color, location and motion cues separately, as well as the final combined depth map. In order to obtain a color-depth transformation f_c, we first transform the YUV space, commonly used in compressed images and videos, to the HSV color space. We found that the saturation component (S) provides little depth discrimination capacity and therefore we limit the transformation attributes to hue (H) and value (V). Let [H_k[x], S_k[x], V_k[x]]^T be the HSV components of a pixel at spatial location x, quantized to L levels. The depth mapping f_c[h, v], for h, v = 1, …, L, is computed as the average of depths at all pixels in I with hue h and value v:

f_c[h, v] = ( Σ_{k=1}^{K} Σ_x 1(H_k[x] = h, V_k[x] = v) d_k[x] ) / ( Σ_{k=1}^{K} Σ_x 1(H_k[x] = h, V_k[x] = v) ),   (1)

where 1(A) is the indicator function, which equals one if A is true and zero otherwise.
Fig. 1. Example of depth estimation from color, location and motion cues.
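The table-lookup structure of Eq. (1) can be sketched in a few lines of numpy: accumulate depth sums and pixel counts per (hue, value) bin, then divide. The function and the toy hue/value arrays below are illustrative inventions, not part of the paper's implementation.

```python
import numpy as np

def learn_color_depth(images_hv, depths, L=16):
    """Learn the color-depth lookup table f_c of Eq. (1).

    images_hv : list of (H, W, 2) arrays with hue/value indices in [0, L).
    depths    : list of (H, W) depth fields aligned with the images.
    Returns an (L, L) table; bins with no training pixels are NaN.
    """
    depth_sum = np.zeros((L, L))
    count = np.zeros((L, L))
    for hv, d in zip(images_hv, depths):
        h, v = hv[..., 0].ravel(), hv[..., 1].ravel()
        # Numerator and denominator of Eq. (1): per-bin depth sum and count.
        np.add.at(depth_sum, (h, v), d.ravel())
        np.add.at(count, (h, v), 1.0)
    with np.errstate(invalid="ignore"):
        return depth_sum / count  # average depth per (h, v) bin

# Toy check: two 2x2 "images" whose pixels all fall in bin (hue=1, value=2).
hv = np.tile([1, 2], (2, 2, 1))
fc = learn_color_depth([hv, hv],
                       [np.full((2, 2), 3.0), np.full((2, 2), 5.0)])
# fc[1, 2] is the mean of all depths observed in that bin: (3*4 + 5*4)/8 = 4
```

At query time, assigning depth to a pixel is then a constant-time lookup into this table, which is what makes the local method so fast.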
Fig. 2. Color-depth transformation f_c.
Fig. 2(a) shows the transformation f_c computed from a dataset of mostly outdoor scenes. Note the large dark patch around reddish colors, indicating that red elements of a scene are located closer to the camera. A large bright patch around bright bluish colors is indicative of a far-away sky. The bright patch around yellow-orange colors is more difficult to classify but may be due to the distant sun, as many videos have been captured outdoors. The location-depth transformation f_l is simply the average depth computed from all depth maps in I at the same location:

f_l[x] = (1/K) Σ_{k=1}^{K} d_k[x].   (2)

In addition to color and spatial attributes, video sequences may contain motion attributes relevant to depth recovery. In this case, local motion between consecutive video frames is of interest. The underlying assumption in the motion-depth transformation is that moving objects are closer to the viewer than the background. In order to estimate the motion-depth transformation f_m, the basic idea is to first compute local motion between consecutive video frames, then extract a moving-object mask from this motion, and, finally, assign a distinct depth (smaller than that of the background) to this mask. This brings the moving objects closer to the viewer. The estimation of local motion may be accomplished by any optical flow method, e.g., [2], but may also require global motion compensation, e.g., [5], in order to account for camera movements. A simple thresholding of the magnitude of local motion produces a moving-object mask. However, since such masks are often noisy, some form of smoothing may be needed. Cross-bilateral filtering [4] controlled by the luminance of the video frame, in which the estimated local motion is anchored, usually suffices. In the final step, the local transformation outputs are linearly combined to produce the final depth field.

B. GLOBAL NEAREST-NEIGHBOR DEPTH LEARNING
While 2D-to-3D conversion based on learning a local point transformation has the undisputed advantage of computational efficiency (the point transformation can be learned off-line and applied basically in real time), the same transformation is applied to images with potentially different global 3D scene structure. This is because this type of conversion, although learning-based, relies on purely local image/video attributes, such as color, spatial position, and motion at each pixel. To address this limitation, in this section we develop a second method that estimates the global depth map of a query image or video frame directly from a repository of 3D images (image+depth pairs or stereopairs) using a nearest-neighbor regression type idea. The approach we propose here is built upon a key observation and an assumption.
Fig. 3. Block diagram of global method.
The method consists of the following steps:

- search for representative depth fields: find k 3D images in the repository I that have depth most similar to the query image, for example by performing a k nearest-neighbor (kNN) search using a metric based on photometric properties;

- depth fusion: combine the k representative depth fields, for example by means of median filtering across depth fields;

- depth smoothing: process the fused depth field to remove spurious variations while preserving depth discontinuities, for example by means of cross-bilateral filtering;

- stereo rendering: generate the right image of a fictitious stereopair using the monocular query image and the smoothed depth field, followed by suitable processing of occlusions and newly-exposed areas.

These steps apply directly to 3D images represented as image+depth pairs. However, in the case of stereopairs, a disparity field needs to be computed first for each left/right image pair. Then each disparity field can be converted to a depth map.
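The location-depth transformation of Eq. (2) is just the per-pixel average of the training depth maps; a minimal numpy sketch (the toy depth fields below are invented for illustration):

```python
import numpy as np

def learn_location_depth(depths):
    """Location-depth transformation of Eq. (2): the per-pixel average
    of all training depth fields (equal spatial sizes assumed)."""
    return np.mean(np.stack(depths), axis=0)

# Toy example: depth tends to grow toward the top row (far-away sky).
d1 = np.array([[9.0, 9.0], [1.0, 1.0]])
d2 = np.array([[7.0, 7.0], [3.0, 3.0]])
fl = learn_location_depth([d1, d2])
# fl == [[8., 8.], [2., 2.]]
```

This captures the common photographic bias that upper image regions (sky, distant background) tend to be farther away than lower regions.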

kNN Search
There exist two types of images in a large 3D image repository: those that are relevant for determining depth in a 2D query image, and those that are irrelevant. Images that are not photometrically similar to the 2D query need to be rejected because they are not useful for estimating depth (as per our assumption). Note that although we might miss some depth-relevant images, we are effectively limiting the number of irrelevant images that could potentially be more harmful to the 2D-to-3D conversion process. The selection of a smaller subset of images provides the added practical benefit of computational tractability when the size of the repository is very large. One method for selecting a useful subset of depth-relevant images from a large repository is to select only the k images that are closest to the query, where closeness is measured by some distance function capturing global image properties such as color, texture, edges, etc. As this distance function, we use the Euclidean norm of the difference between histograms of oriented gradients (HOGs) [3] computed from two images. Each HOG consists of 144 real values (4×4 blocks with 9 gradient direction bins) that can be efficiently computed. We perform a search for top matches to our monocular query Q among all images I_k, k = 1, …, K, in the 3D database I. The search returns an ordered list of image+depth pairs, from the most to the least photometrically similar vis-à-vis the query. We discard all but the top k matches (kNNs) from this list.
Fig. 4 shows search results for two outdoor query images performed on the Make3D dataset #1. Although none of the four kNNs perfectly matches the corresponding 2D query, the general underlying depth is somewhat related to that expected in the query. In Fig. 5, we show search results for two indoor query images (office and dining room) performed on the NYU Kinect dataset; some of the retained images share local 3D structures with the query image. The average photometric similarity between a query and its k-th nearest neighbor usually decays with increasing k. While for large databases larger values of k may be appropriate, since there are many good matches, for smaller databases this may not be true. Therefore, a judicious selection of k is important. We denote by K the set of indices of image+depth pairs that are the top k photometric nearest neighbors of the query Q.
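The kNN search described above can be sketched with a simplified HOG-style descriptor (gradient-orientation histograms over a 4×4 block grid, 9 bins each, giving 144 values). This is a stand-in for the exact HOG of [3], and the function names are our own:

```python
import numpy as np

def hog_descriptor(img, blocks=4, bins=9):
    """Simplified HOG-style global descriptor (blocks*blocks*bins values)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned gradient orientation
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    H, W = img.shape
    by, bx = H // blocks, W // blocks
    desc = np.zeros((blocks, blocks, bins))
    for i in range(blocks):
        for j in range(blocks):
            m = mag[i*by:(i+1)*by, j*bx:(j+1)*bx].ravel()
            b = bin_idx[i*by:(i+1)*by, j*bx:(j+1)*bx].ravel()
            np.add.at(desc[i, j], b, m)       # magnitude-weighted orientation bins
    return desc.ravel()

def knn_indices(query, repo, k):
    """Indices of the k repository images closest to the query in
    Euclidean distance between descriptors."""
    q = hog_descriptor(query)
    dists = [np.linalg.norm(q - hog_descriptor(r)) for r in repo]
    return np.argsort(dists)[:k]
```

In practice the repository descriptors would be precomputed once, so each query costs only K distance evaluations plus a partial sort.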
2D Query: Buildings
Fig. 4. RGB image and depth field of two 2D queries (left column), and their four nearest neighbors (columns 2-5) retrieved using the Euclidean norm on the difference between histograms of oriented gradients.
2D Query: Dining room
Fig. 5. RGB image and depth field of two 2D queries (left column), and their four nearest neighbors (columns 2-5) retrieved using the Euclidean norm on the difference between histograms of oriented gradients.

Depth Fusion
In general, none of the kNN image+depth pairs (I_i, d_i), i ∈ K, match the query Q accurately (Figs. 4 and 5). However, the location of some objects (e.g., furniture) and parts of the background (e.g., walls) is quite consistent with those in the respective query. If a similar object (e.g., building, table) appears at a similar location in several kNN images, it is likely that such an object also appears in the query, and the depth field being sought should reflect this. We compute this depth field by applying the median operator across the kNN depths at each spatial location x as follows:

d̄[x] = median{ d_i[x] : i ∈ K }.   (3)
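The fusion step of Eq. (3) is a pixel-wise median across the retrieved depth fields, which one can sketch directly in numpy:

```python
import numpy as np

def fuse_depths(knn_depths):
    """Depth fusion of Eq. (3): pixel-wise median across the kNN depth
    fields, which votes down outlier depths from poor matches."""
    return np.median(np.stack(knn_depths), axis=0)

# One outlier depth field (9.0) is rejected by the median.
d = fuse_depths([np.full((2, 2), 1.0),
                 np.full((2, 2), 2.0),
                 np.full((2, 2), 9.0)])
# d == [[2., 2.], [2., 2.]]
```

The median, unlike the mean, is robust to a minority of badly matched neighbors, which is exactly the failure mode the kNN pre-selection cannot fully avoid.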

Cross-Bilateral Filtering (CBF) of Depth
While the median-based fusion helps make depth more consistent globally, the fused depth is overly smooth and locally inconsistent with the query image due to edge misalignment between the depth fields of the kNNs and the query image. This, in turn, often results in the lack of edges in the fused depth where sharp object boundaries should occur and/or the lack of fused-depth smoothness where smooth depth is expected. In order to correct this, similarly to Angot et al. [1], we apply cross-bilateral filtering (CBF). CBF is a variant of bilateral filtering, an edge-preserving image smoothing method that applies anisotropic diffusion controlled by the local content of the image itself [4]. In CBF, however, the diffusion is not controlled by the local content of the image under smoothing but by an external input. We apply CBF to the fused depth d̄ using the query image Q to control diffusion. This allows us to achieve two goals simultaneously: alignment of the depth edges with those of the luminance Y in the query image Q, and local noise/granularity suppression in the fused depth d̄. This is implemented as follows:
d̂[x] = ( Σ_y g_σs(x - y) g_σY(Y[x] - Y[y]) d̄[y] ) / ( Σ_y g_σs(x - y) g_σY(Y[x] - Y[y]) ),   (4)

where d̂ is the filtered depth field and g_σ(z) = exp(-z²/(2σ²))/(2πσ²) is a Gaussian weighting function. The directional smoothing of d̄ is controlled by the query image via the weight g_σY(Y[x] - Y[y]). For large luminance discontinuities, the weight g_σY(Y[x] - Y[y]) is small and thus the contribution of d̄[y] to the output is small. However, when Y[y] is similar to Y[x], then g_σY(Y[x] - Y[y]) is relatively large and the contribution of d̄[y] to the output is larger. In essence, depth filtering (smoothing) happens along (and not across) query edges.
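A brute-force reference implementation of the cross-bilateral filter of Eq. (4) follows; the parameter values are illustrative assumptions, and a production version would use a fast bilateral-filter approximation rather than this O(N·radius²) loop:

```python
import numpy as np

def cross_bilateral(depth, lum, sigma_s=2.0, sigma_y=0.1, radius=3):
    """Cross-bilateral filtering of Eq. (4): smooth the fused depth with
    weights driven by the *query* luminance, so depth edges snap to
    luminance edges. Brute-force reference version."""
    H, W = depth.shape
    out = np.zeros_like(depth, dtype=float)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Spatial Gaussian times luminance (range) Gaussian, as in Eq. (4).
            w = (np.exp(-((yy - y)**2 + (xx - x)**2) / (2 * sigma_s**2))
                 * np.exp(-(lum[y0:y1, x0:x1] - lum[y, x])**2 / (2 * sigma_y**2)))
            out[y, x] = np.sum(w * depth[y0:y1, x0:x1]) / np.sum(w)
    return out
```

Note that the normalization by the sum of weights makes the filter exact on constant depth: a flat depth field passes through unchanged regardless of the luminance content.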

Stereo Rendering
In order to generate an estimate of the right image Q_R from the monocular query Q, we need to compute a disparity from the estimated depth d̂. Assuming that the fictitious image pair (Q, Q_R) was captured by parallel cameras with baseline B and focal length f, the disparity is simply δ[x, y] = B f / d̂[x, y]. We forward-project the 2D query Q to produce the right image:

Q_R[x + δ[x, y], y] = Q[x, y],   (5)

while rounding the location coordinates (x + δ[x, y], y) to the nearest sampling grid point. We handle occlusions by depth ordering: if (x_i + δ[x_i, y_i], y_i) = (x_j + δ[x_j, y_j], y_j) for some i ≠ j, we assign to that location in Q_R the RGB value from the location (x_i, y_i) in Q whose disparity δ[x_i, y_i] is the largest. In newly-exposed areas, i.e., locations in Q_R to which no query pixel projects, we apply simple inpainting using inpaint_nans from MATLAB Central. Applying a more advanced depth-based rendering method would only improve this step of the proposed 2D-to-3D conversion.
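The forward projection of Eq. (5) with depth-ordered occlusion handling can be sketched as follows (grayscale for brevity; holes are left as NaN for a subsequent inpainting pass):

```python
import numpy as np

def render_right(query, depth, B=1.0, f=1.0):
    """Forward-project the query into the right view via Eq. (5):
    Q_R[x + disparity, y] = Q[x, y]. Occlusions are resolved by depth
    ordering: the larger disparity (nearer pixel) wins. Holes stay NaN."""
    H, W = query.shape
    disp = B * f / depth                    # disparity from depth (depth > 0)
    right = np.full((H, W), np.nan)
    zbuf = np.full((H, W), -np.inf)         # best disparity seen per target pixel
    for y in range(H):
        for x in range(W):
            xt = int(round(x + disp[y, x]))
            if 0 <= xt < W and disp[y, x] > zbuf[y, xt]:
                zbuf[y, xt] = disp[y, x]
                right[y, xt] = query[y, x]
    return right
```

The z-buffer array realizes the depth-ordering rule in the text: when two source pixels land on the same target column, the one with the larger disparity (i.e., the closer one) overwrites the other.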
Query image Q   Query depth d   Local method   Global method   Make3D
Fig. 6. Query images from Fig. 5 and depth fields: of the query, depth estimated by the local transformation method, depth estimated by the global transformation method (with CBF), and depth computed using the Make3D algorithm.
Query image Q   Query depth   Global (median)   Global (median+CBF)   Make3D
Fig. 7. Query images from Fig. 6 and depth fields: of the query, depth estimated by the global method after median-based fusion and after the same fusion and CBF, and depth computed using the Make3D algorithm.
In Fig. 6, we show an example of a median-fused depth field after cross-bilateral filtering. Clearly, the depth field is overall smooth (slowly varying) while depth edges, if any, are aligned with features in the query image. Fig. 7 compares the fused depth before and after cross-bilateral filtering. The filtered depth preserves the global properties captured by the unfiltered depth field d̄, and is smooth within objects and in the background. At the same time it keeps edges sharp and aligned with the query image structure.
In order to evaluate the performance of the proposed algorithms quantitatively, we first applied leave-one-out cross-validation (LOOCV) as follows. We selected one image+depth pair from a database as the 2D query (Q, d_Q), treating the remaining pairs as the 3D image repository based on which a depth estimate d̂ and a right-image estimate Q̂_R are computed. As the quality metric, we used the normalized cross-covariance between the estimated depth d̂ and the ground-truth depth d_Q, defined as follows:

C = (1/N) Σ_x (d̂[x] - μ_d̂)(d_Q[x] - μ_dQ) / (σ_d̂ σ_dQ),   (6)

where N is the number of pixels in d̂ and d_Q, μ_d̂ and μ_dQ are the empirical means of d̂ and d_Q, respectively, and σ_d̂ and σ_dQ are the corresponding empirical standard deviations. The normalized cross-covariance C takes values between -1 and +1: for values close to +1 the depths are very similar, and for values close to -1 they are complementary.
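The quality metric of Eq. (6) amounts to standardizing both depth fields and averaging their product, which makes its behavior easy to verify on toy inputs:

```python
import numpy as np

def ncc(d_est, d_gt):
    """Normalized cross-covariance of Eq. (6): +1 for depth fields that
    agree up to a positive affine transform, -1 for negated ones."""
    a = (d_est - d_est.mean()) / d_est.std()
    b = (d_gt - d_gt.mean()) / d_gt.std()
    return float(np.mean(a * b))

d = np.arange(12.0).reshape(3, 4)
# A scaled and shifted copy of the ground truth scores +1 ...
assert abs(ncc(2 * d + 5, d) - 1.0) < 1e-9
# ... while its negation scores -1.
assert abs(ncc(-d, d) + 1.0) < 1e-9
```

The invariance to positive affine transforms is convenient here, since monocular depth estimates are often only determined up to an unknown scale and offset.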

C. Support Vector Machine
The SVM method, in general, uses a set of labeled sample data in order to classify new sample data. To use an SVM, one trains the algorithm by providing it with example data grouped into a series of categories. Then, when the algorithm is given new, unknown data, it assigns that data to one of the given categories based on its resemblance to the known training data. In our system, the SVM distinguishes the objects in a given image using HOG features. Because the SVM uses only a subset of the training points, known as support vectors, to classify different objects, it is more efficient and helps convert 2D images to 3D in less time than the local and global methods described above.
Fig.8. Block Diagram of SVM and filters used for conversion of 2d to 3D images.
Fig. 8 shows how the SVM, together with a mask and cross-bilateral filters, is used to convert a 2D image to 3D. The advantage over the local and global methods is speed: during conversion, the SVM takes only about 5-6 seconds, whereas the global and local methods take 10-12 seconds.
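The paper does not specify which SVM formulation or library is used. As an illustration only, the sketch below trains a linear SVM by sub-gradient descent on the regularized hinge loss in plain numpy; the two 2-D clusters stand in for HOG descriptors of two object classes, and all data and names are invented for the example:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Toy linear SVM trained by sub-gradient descent on the hinge loss.
    X: (n, d) feature vectors (e.g. HOG descriptors), y: labels in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                  # margin violators drive the update
        if mask.any():
            gw = lam * w - (y[mask][:, None] * X[mask]).mean(axis=0)
            gb = -y[mask].mean()
        else:
            gw, gb = lam * w, 0.0
        w -= lr * gw
        b -= lr * gb
    return w, b

# Two linearly separable clusters as stand-ins for two object classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
```

Only the margin violators contribute to each update, which mirrors the property exploited in the text: the decision boundary is determined by a small subset of the training points (the support vectors), not the full set.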


IV. CONCLUSION
We have proposed a new class of methods for 2D-to-3D image conversion that are based on a learning approach. One method learns a point mapping from local image attributes to scene depth. The second method globally estimates the entire depth field of a query directly from a repository of image+depth pairs using nearest-neighbor-based regression. These methods overcome the disadvantages of the existing system. The local method performs extremely fast, as it is basically based on a table lookup, while our global method performed better than previous methods in terms of cumulative performance across two datasets and two testing methods, and did so at a fraction of the CPU time. The support vector machine provides further gains in computational efficiency. With the continuously increasing amount of 3D data online and with the rapidly growing computing power in the cloud, the proposed framework seems a promising alternative to operator-assisted 2D-to-3D image and video conversion.
REFERENCES

[1] L. Angot, W.-J. Huang, and K.-C. Liu, "A 2D to 3D video and image conversion technique based on a bilateral filter," Proc. SPIE, vol. 7526, p. 75260D, Feb. 2010.
[2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in Proc. Eur. Conf. Comput. Vis., 2004, pp. 25-36.
[3] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 886-893.
[4] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," ACM Trans. Graph., vol. 21, pp. 257-266, Jul. 2002.
[5] M. Grundmann, V. Kwatra, and I. Essa, "Auto-directed video stabilization with robust L1 optimal camera paths," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 225-232.
[6] M. Guttmann, L. Wolf, and D. Cohen-Or, "Semi-automatic stereo extraction from video footage," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2009, pp. 136-142.
[7] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 775-788.
[8] J. Konrad, G. Brown, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee, "Automatic 2D-to-3D image conversion using 3D examples from the Internet," Proc. SPIE, vol. 8288, p. 82880F, Jan. 2012.
[9] J. Konrad, M. Wang, and P. Ishwar, "2D-to-3D image conversion by learning depth from examples," in Proc. IEEE Comput. Soc. CVPRW, Jun. 2012, pp. 16-22.
[10] M. Liao, J. Gao, R. Yang, and M. Gong, "Video stereolization: Combining motion analysis with user interaction," IEEE Trans. Visualizat. Comput. Graph., vol. 18, no. 7, pp. 1079-1088, Jul. 2012.
[11] B. Liu, S. Gould, and D. Koller, "Single image depth estimation from predicted semantic labels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1253-1260.
[12] R. Phan, R. Rzeszutek, and D. Androutsos, "Semi-automatic 2D to 3D image conversion using scale-space random walks and a graph cuts based depth prior," in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 865-868.
[13] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2005.
[14] A. Saxena, M. Sun, and A. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824-840, May 2009.
[15] N. Silberman and R. Fergus, "Indoor scene segmentation using a structured light sensor," in Proc. Int. Conf. Comput. Vis. Workshops, Nov. 2011, pp. 601-608.
[16] M. Subbarao and G. Surya, "Depth from defocus: A spatial domain approach," Int. J. Comput. Vis., vol. 13, no. 3, pp. 271-294, 1994.
[17] R. Szeliski and P. H. S. Torr, "Geometrically constrained structure from motion: Points on planes," in Proc. Eur. Workshop 3D Struct. Multiple Images Large-Scale Environ., 1998, pp. 171-186.
[18] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958-1970, Nov. 2008.
[19] M. Wang, J. Konrad, P. Ishwar, K. Jing, and H. Rowley, "Image saliency: From intrinsic to extrinsic context," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 417-424.
[20] R. Zhang, P. S. Tsai, J. Cryer, and M. Shah, "Shape-from-shading: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, pp. 690-706, Aug. 1999.
[21] (2012). Make3D [Online]. Available: http://make3d.cs.cornell.edu/data.html
[22] (2012). NYU Depth V1 [Online]. Available: http://cs.nyu.edu/~silberman/datasets/nyu_depth_v1.html