US11763565B2 · Active · Utility · PatentIndex 61

Fine-grain object segmentation in video with deep features and multi-level graphical models

Assignee: INTEL CORP
Priority: Nov 8, 2019
Filed: Nov 8, 2019
Granted: Sep 19, 2023
Est. expiry: Nov 8, 2039 (~13.3 yrs left) · nominal 20-yr term from priority
Inventors: RHODES ANTHONY, GOEL MANAN
Classifications: G06N 3/09, G06N 3/0464, G06T 7/11, G06V 20/49, G06T 2207/10016, G06T 2207/20084, G06N 7/01, G06N 3/045, G06T 7/168, G06T 2207/20081, G06T 11/20, G06T 2210/12, G06F 17/18, G06V 20/46, G06N 3/047
PatentIndex Score: 61 · Cited by: 1 · References: 19 · Claims: 22

Abstract

Techniques related to automatically segmenting a video frame into fine grain object of interest and background regions using a ground truth segmentation of an object in a previous frame are discussed. Such techniques apply multiple levels of segmentation tracking and prediction based on color, shape, and motion of the segmentation to determine per-pixel object probabilities, and solve an energy summation model to generate a final segmentation for the video frame using the object probabilities.

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A system for providing segmentation in video comprising:
 a memory to store a first video frame; and 
 one or more processors coupled to the memory, the one or more processors to:
 train a color mixture model using a region within a first bounding box of the first video frame, the first bounding box surrounding a ground truth segmentation of an object from a background within the bounding box; 
 determine, based on an optical flow between at least the ground truth segmentation and a second bounding box of a second video frame, a first shape estimation of the object in the second video frame; 
 apply an affine transformation to the first shape estimation to generate a second shape estimation of the object in the second video frame, the affine transformation generated based on object landmark tracking between the first and second video frames; and 
 determine a final segmentation of the object in the second video frame based at least on the second shape estimation and application of the color mixture model to the second bounding box. 
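
Editorial illustration (not part of the granted claim text): the affine step of claim 1 derives a transformation from landmark tracking between two frames, but the claim does not say how the transformation is estimated. The sketch below assumes exactly three non-collinear landmark correspondences and solves for the six affine parameters directly. The names `solve3`, `fit_affine`, and `apply_affine` are hypothetical, and a practical tracker would more likely use many landmarks with a robust least-squares fit.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b with Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("landmarks are collinear; affine fit is undefined")
    solution = []
    for col in range(3):
        # Replace one column with the right-hand side, per Cramer's rule.
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det(m) / d)
    return solution


def fit_affine(src, dst):
    """Fit x' = a*x + b*y + tx, y' = c*x + d*y + ty from 3 point pairs.

    src, dst: lists of three (x, y) landmark positions in the first and
    second frame (assumed input format).
    """
    system = [[x, y, 1.0] for x, y in src]
    row_x = solve3(system, [p[0] for p in dst])
    row_y = solve3(system, [p[1] for p in dst])
    return row_x, row_y


def apply_affine(affine, point):
    """Map one (x, y) coordinate through the fitted transformation."""
    (a, b, tx), (c, d, ty) = affine
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)
```

Under these assumptions, warping the first shape estimation amounts to passing each pixel coordinate of the estimate through `apply_affine` and resampling.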
 
 
     
     
       2. The system of  claim 1 , wherein the second shape estimation comprises per-pixel shape and motion based probability scores indicative of a probability the pixel is part of the object, wherein application of the color mixture model generates a color based estimation of a segmentation of the object in the second video frame, the color based estimation comprising per-pixel color based probability scores indicative of a probability the pixel is part of the object, and wherein the one or more processors to determine the final segmentation comprises the one or more processors to merge the per-pixel shape and motion based probability scores and the per-pixel color based probability scores to generate final per-pixel probability scores. 
     
     
       3. The system of  claim 2 , wherein the one or more processors to merge the per-pixel shape and motion based probability scores and the per-pixel color based probability scores comprises the one or more processors to multiply the per-pixel shape and motion based probability scores and the per-pixel color based probability scores to generate the final per-pixel probability scores. 
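
Editorial illustration (not part of the granted claim text): claim 3 merges the two per-pixel score maps by multiplication. A minimal sketch, assuming both maps are equally sized nested lists of floats in [0, 1]; the function name `merge_probabilities` is hypothetical.

```python
def merge_probabilities(shape_motion, color):
    """Multiply two per-pixel probability maps element by element.

    shape_motion, color: 2D lists with identical dimensions (assumed layout).
    """
    if len(shape_motion) != len(color):
        raise ValueError("maps must have the same number of rows")
    merged = []
    for row_s, row_c in zip(shape_motion, color):
        if len(row_s) != len(row_c):
            raise ValueError("maps must have the same number of columns")
        merged.append([s * c for s, c in zip(row_s, row_c)])
    return merged
```

With a product, a pixel keeps a high final score only when both the shape and motion cue and the color cue agree that it belongs to the object.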
     
     
       4. The system of  claim 2 , wherein the one or more processors to determine the final segmentation for the object comprises the one or more processors to minimize a graph based energy summation model comprising a unary energy term based on the final per-pixel probability scores within the second bounding box, a pairwise energy term based on color differences between neighboring pixels within the second bounding box, and a super pixel energy term based on super pixel boundaries within the second bounding box. 
     
     
       5. The system of  claim 4 , wherein the one or more processors to minimize the graph based energy summation model comprises the one or more processors to determine the final segmentation within the second bounding box that minimizes a sum of the unary energy term, the pairwise energy term, and the super pixel energy term. 
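
Editorial illustration (not part of the granted claim text): claim 5 selects the segmentation that minimizes the sum of the energy terms. The sketch below shows that objective with an exhaustive search over a three-pixel strip, which is only feasible for toy inputs; claim 9 names a Boykov-Kolmogorov solver for the real problem. The toy terms `toy_unary` and `toy_pairwise` are hypothetical stand-ins, and the super pixel term is omitted here for brevity.

```python
from itertools import product


def minimize_energy(num_pixels, terms):
    """Return the binary labeling with the lowest summed energy.

    terms: list of functions, each mapping a label tuple to a float.
    Exhaustive search: 2**num_pixels candidates, so toy sizes only.
    """
    best_labels, best_energy = None, float("inf")
    for labels in product((0, 1), repeat=num_pixels):
        energy = sum(term(labels) for term in terms)
        if energy < best_energy:
            best_labels, best_energy = labels, energy
    return best_labels, best_energy


# Hypothetical per-pixel object probabilities for a 3-pixel strip.
PROBS = [0.9, 0.8, 0.1]


def toy_unary(labels):
    # Cost grows when a label disagrees with its probability score.
    return sum(abs(label - p) for label, p in zip(labels, PROBS))


def toy_pairwise(labels):
    # Fixed cost for every label change between neighboring pixels.
    return 0.5 * sum(a != b for a, b in zip(labels, labels[1:]))


best_labels, best_energy = minimize_energy(3, [toy_unary, toy_pairwise])
```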
     
     
       6. The system of  claim 4 , wherein the super pixel energy term provides, for a first candidate segmentation having all pixels of a super pixel within the object, a first super pixel energy value and, for a second candidate segmentation having at least one pixel of the super pixel outside the object and the remaining pixels within the object, a second super pixel energy value that is greater than the first super pixel energy value. 
     
     
       7. The system of  claim 4 , wherein the super pixel energy term provides, for a first candidate segmentation having all pixels of a super pixel within the object, a first super pixel energy value and, for a second candidate segmentation having a percentage of pixels of a super pixel within the object that exceeds a threshold and at least one pixel of the super pixel outside the object, a second super pixel energy value that is greater than the first super pixel energy value. 
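
Editorial illustration (not part of the granted claim text): claims 6 and 7 describe a super pixel term that costs more when a super pixel is split across the object boundary. A minimal sketch, assuming a flat list of 0/1 labels and super pixels given as lists of pixel indices. The constant `penalty`, the function name, and the choice to charge nothing for a super pixel lying wholly outside the object are assumptions not stated in the claims.

```python
def superpixel_energy(labels, superpixels, penalty=1.0, threshold=0.0):
    """Sum a penalty over super pixels that are split by the segmentation.

    labels: flat list of 0/1 labels (1 = object).
    superpixels: list of lists of pixel indices, one list per super pixel.
    threshold: fraction of in-object pixels that must be exceeded before a
        split super pixel is penalized (0.0 penalizes any split, as in
        claim 6; a higher value follows the claim 7 variant).
    """
    total = 0.0
    for members in superpixels:
        inside = sum(labels[i] for i in members)
        fraction = inside / len(members)
        is_split = 0 < inside < len(members)
        if is_split and fraction > threshold:
            total += penalty
    return total
```

The threshold variant only penalizes super pixels that are mostly inside the object, which tolerates super pixels that merely graze the boundary.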
     
     
       8. The system of  claim 4 , wherein the unary energy term provides for a greater unary energy value for a pixel in response to a candidate segmentation having a mismatch with respect to the final per-pixel probability scores for the pixel and the pairwise energy term provides for a greater pairwise energy value for a pair of pixels in response to the candidate segmentation having one of the pair of pixels within the object and the other outside the object and the pair of pixels having the same color. 
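
Editorial illustration (not part of the granted claim text): claim 8 states how the unary and pairwise terms behave but gives no formulas. The sketch below assumes a negative log-likelihood unary term and a contrast-sensitive pairwise term, both common choices in graph-cut segmentation; the parameters `weight`, `beta`, and `eps` and both function names are hypothetical.

```python
import math


def unary_energy(label, prob, eps=1e-6):
    """Higher cost when the label disagrees with the object probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p if label == 1 else 1.0 - p)


def pairwise_energy(label_a, label_b, color_a, color_b, weight=1.0, beta=0.01):
    """Cost for separating two neighbors, largest when colors match.

    color_a, color_b: color tuples such as (r, g, b) (assumed format).
    """
    if label_a == label_b:
        return 0.0
    dist2 = sum((a - b) ** 2 for a, b in zip(color_a, color_b))
    return weight * math.exp(-beta * dist2)
```

Under this assumed form, cutting between two identically colored neighbors costs the full `weight`, while cutting across a strong color edge costs almost nothing, so the minimizer prefers boundaries that follow color edges.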
     
     
       9. The system of  claim 4 , wherein the one or more processors to minimize the graph based energy summation comprises the one or more processors to apply a Boykov-Kolmogorov solver to the graph based energy summation and wherein the super pixel boundaries are generated by applying simple linear iterative clustering to the second bounding box. 
     
     
       10. The system of  claim 1 , wherein the one or more processors to determine the first shape estimation comprises the one or more processors to translate the ground truth segmentation based on the optical flow and to apply a distance transform to the translated ground truth segmentation, wherein the first and second shape estimations comprises per-pixel probability scores indicative of a probability the pixel is part of the object. 
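
Editorial illustration (not part of the granted claim text): claim 10 translates the ground truth segmentation by the optical flow and applies a distance transform to obtain per-pixel probability scores. The sketch below assumes the flow has already been reduced to one integer displacement `(dx, dy)`, uses a brute-force nearest-object distance in place of an efficient distance transform, and maps distance to a score with an exponential falloff. The `sigma` parameter and that mapping are assumptions; the claim does not specify them.

```python
import math


def translate_mask(mask, dx, dy):
    """Shift a binary mask by (dx, dy); pixels leaving the frame are dropped."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if mask[y][x] and 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = 1
    return out


def distance_to_mask(mask):
    """Euclidean distance from every pixel to the nearest object pixel."""
    obj = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not obj:
        raise ValueError("mask contains no object pixels")
    return [[min(math.hypot(y - oy, x - ox) for oy, ox in obj)
             for x in range(len(mask[0]))]
            for y in range(len(mask))]


def shape_probability(mask, dx, dy, sigma=2.0):
    """Score 1.0 on the translated object, decaying with distance from it."""
    moved = translate_mask(mask, dx, dy)
    return [[math.exp(-d / sigma) for d in row]
            for row in distance_to_mask(moved)]
```

The soft falloff lets later stages recover object pixels that a rigid translation misses, for example when the object deforms between frames.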
     
     
       11. The system of  claim 1 , the one or more processors to:
 determine the second bounding box of the second video frame by applying a pretrained convolutional Siamese tracker network based on a search region of the second video frame and the first bounding box as an exemplar. 
 
     
     
       12. A method for providing segmentation in video comprising:
 training a color mixture model using a region within a first bounding box of a first video frame, the first bounding box surrounding a ground truth segmentation of an object from a background within the bounding box; 
 determining, based on an optical flow between at least the ground truth segmentation and a second bounding box of a second video frame, a first shape estimation of the object in the second video frame; 
 applying an affine transformation to the first shape estimation to generate a second shape estimation of the object in the second video frame, the affine transformation generated based on object landmark tracking between the first and second video frames; and 
 determining a final segmentation of the object in the second video frame based at least on the second shape estimation and application of the color mixture model to the second bounding box. 
 
     
     
       13. The method of  claim 12 , wherein the second shape estimation comprises per-pixel shape and motion based probability scores indicative of a probability the pixel is part of the object, wherein application of the color mixture model generates a color based estimation of a segmentation of the object in the second video frame, the color based estimation comprising per-pixel color based probability scores indicative of a probability the pixel is part of the object, and wherein determining the final segmentation comprises merging the per-pixel shape and motion based probability scores and the per-pixel color based probability scores to generate final per-pixel probability scores. 
     
     
       14. The method of  claim 13 , wherein determining the final segmentation for the object comprises minimizing a graph based energy summation model comprising a unary energy term based on the final per-pixel probability scores within the second bounding box, a pairwise energy term based on color differences between neighboring pixels within the second bounding box, and a super pixel energy term based on super pixel boundaries within the second bounding box. 
     
     
       15. The method of  claim 14 , wherein the super pixel energy term provides, for a first candidate segmentation having all pixels of a super pixel within the object, a first super pixel energy value and, for a second candidate segmentation having at least one pixel of the super pixel outside the object and the remaining pixels within the object, a second super pixel energy value that is greater than the first super pixel energy value. 
     
     
       16. The method of  claim 14 , wherein the unary energy term provides for a greater unary energy value for a pixel in response to a candidate segmentation having a mismatch with respect to the final per-pixel probability scores for the pixel and the pairwise energy term provides for a greater pairwise energy value for a pair of pixels in response to the candidate segmentation having one of the pair of pixels within the object and the other outside the object and the pair of pixels having the same color. 
     
     
       17. The method of  claim 12 , further comprising:
 determining the second bounding box of the second video frame by applying a pretrained convolutional Siamese tracker network based on a search region of the second video frame and the first bounding box as an exemplar. 
 
     
     
       18. At least one non-transitory machine readable medium comprising a plurality of instructions that, in response to being executed on a computing device, cause the computing device to provide segmentation in video by:
 training a color mixture model using a region within a first bounding box of a first video frame, the first bounding box surrounding a ground truth segmentation of an object from a background within the bounding box; 
 determining, based on an optical flow between at least the ground truth segmentation and a second bounding box of a second video frame, a first shape estimation of the object in the second video frame; 
 applying an affine transformation to the first shape estimation to generate a second shape estimation of the object in the second video frame, the affine transformation generated based on object landmark tracking between the first and second video frames; and 
 determining a final segmentation of the object in the second video frame based at least on the second shape estimation and application of the color mixture model to the second bounding box. 
 
     
     
       19. The non-transitory machine readable medium of  claim 18 , wherein the second shape estimation comprises per-pixel shape and motion based probability scores indicative of a probability the pixel is part of the object, wherein application of the color mixture model generates a color based estimation of a segmentation of the object in the second video frame, the color based estimation comprising per-pixel color based probability scores indicative of a probability the pixel is part of the object, and wherein determining the final segmentation comprises merging the per-pixel shape and motion based probability scores and the per-pixel color based probability scores to generate final per-pixel probability scores. 
     
     
       20. The non-transitory machine readable medium of  claim 19 , wherein determining the final segmentation for the object comprises minimizing a graph based energy summation model comprising a unary energy term based on the final per-pixel probability scores within the second bounding box, a pairwise energy term based on color differences between neighboring pixels within the second bounding box, and a super pixel energy term based on super pixel boundaries within the second bounding box. 
     
     
       21. The non-transitory machine readable medium of  claim 20 , wherein the super pixel energy term provides, for a first candidate segmentation having all pixels of a super pixel within the object, a first super pixel energy value and, for a second candidate segmentation having at least one pixel of the super pixel outside the object and the remaining pixels within the object, a second super pixel energy value that is greater than the first super pixel energy value. 
     
     
       22. The non-transitory machine readable medium of  claim 20 , wherein the unary energy term provides for a greater unary energy value for a pixel in response to a candidate segmentation having a mismatch with respect to the final per-pixel probability score for the pixel and the pairwise energy term provides for a greater pairwise energy value for a pair of pixels in response to the candidate segmentation having one of the pair of pixels within the object and the other outside the object and the pair of pixels having the same color.
