US11176404B2 · Active · Utility · PatentIndex 71

Method and apparatus for detecting object in image, and storage medium thereof

Assignee: TENCENT TECH SHENZHEN CO LTD · Priority: Jul 11, 2018 · Filed: Aug 31, 2020 · Granted: Nov 16, 2021

Est. expiry: Jul 11, 2038 (~12 yrs left) · nominal 20-yr term from priority

Inventors: ZHAO SHIJIE, LI FENG, ZUO XIAOXIANG

G06V 10/82, G06V 10/454, G06V 10/25, G06V 10/764, G06F 18/24, G06T 2207/20084, G06T 7/73, G06N 3/02, G06T 7/246, G06T 2207/10004, G06K 9/6232, G06K 9/6267, G06K 9/46

Abstract

An embodiment of this application provides an image object detection method. The method may include obtaining a detection image, an n-level deep feature map framework, and an m-level non-deep feature map framework. The method may further include extracting a deep feature from an (i−1)-level feature of the detection image using an i-level deep feature map framework, to obtain an i-level feature of the detection image. The method may further include extracting a non-deep feature from a (j−1+n)-level feature of the detection image using a j-level non-deep feature map framework, to obtain a (j+n)-level feature of the detection image. The method may further include performing an information regression operation on an a-level feature through an (m+n)-level feature of the detection image, to obtain object type information and object position information of an object in the detection image. Here, a is an integer less than n and greater than or equal to 2.
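
The level indexing in the abstract can be easier to follow with concrete numbers. The following is a minimal sketch, not the patented implementation: the function name `level_schedule` and the example values n=4, m=4, a=3 are hypothetical and chosen only for illustration. It assumes that level 0 refers to the detection image itself (the input to the 1-level deep extraction), which the abstract implies but does not state outright.

```python
from typing import Dict, List


def level_schedule(n: int, m: int, a: int) -> Dict[str, List[int]]:
    """List which feature levels come from which stage.

    Levels 1..n are produced by deep feature extraction,
    levels n+1..n+m by non-deep feature extraction,
    and levels a..n+m are passed to the prediction stage.
    """
    if n < 2:
        raise ValueError("n must be an integer >= 2")
    if m < 1:
        raise ValueError("m must be an integer >= 1")
    # 2 <= a < n can only hold when n >= 3.
    if not (2 <= a < n):
        raise ValueError("a must satisfy 2 <= a < n")

    deep = list(range(1, n + 1))              # level i is built from level i-1
    non_deep = list(range(n + 1, n + m + 1))  # level j+n is built from level j-1+n
    predicted = list(range(a, n + m + 1))     # a-level through (m+n)-level
    return {"deep": deep, "non_deep": non_deep, "predicted": predicted}


# Hypothetical example values, not taken from the patent text.
schedule = level_schedule(n=4, m=4, a=3)
print(schedule["deep"])       # [1, 2, 3, 4]
print(schedule["non_deep"])   # [5, 6, 7, 8]
print(schedule["predicted"])  # [3, 4, 5, 6, 7, 8]
```

Under these assumed values, the two earliest deep levels are computed but not used for prediction, while every non-deep level is.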

Claims

exact text as granted — not AI-modified

What is claimed is: 
     
       1. A method for detecting an object in an image, performed by an electronic device, the method comprising:
 obtaining a detection image, an n-level deep feature map framework, and an m-level non-deep feature map framework, n being an integer greater than or equal to 2, m being an integer greater than or equal to 1, and the n-level deep feature map framework and the m-level non-deep feature map framework comprising a feature size and a feature dimension; 
 extracting, based on a deep feature extraction model, deep feature from an (i−1)-level feature of the detection image using an i-level deep feature map framework, to obtain an i-level feature of the detection image, i being a positive integer less than or equal to n; 
 extracting, based on a non-deep feature extraction model, non-deep feature from a (j−1+n)-level feature of the detection image using a j-level non-deep feature map framework, to obtain a (j+n)-level feature of the detection image, j being a positive integer less than or equal to m; and 
 performing, based on a feature prediction model, an information regression operation on an a-level feature to an (m+n)-level feature of the detection image, to obtain object type information and object position information of an object in the detection image, a being an integer less than n and greater than or equal to 2. 
 
     
     
       2. The method of  claim 1 , wherein the deep feature extraction model comprises a deep input convolution layer, a first deep nonlinear transformation convolution layer, a second deep nonlinear transformation convolution layer, and a deep output convolution layer; and
 the extracting the deep feature comprises: 
 increasing a dimension of the (i−1)-level feature of the detection image using the deep input convolution layer, to obtain an i-level dimension feature of the detection image; 
 extracting an i-level first convolved feature from the i-level dimension feature of the detection image using the first deep nonlinear transformation convolution layer; 
 extracting an i-level second convolved feature from the i-level first convolved feature using the second deep nonlinear transformation convolution layer; and 
 decreasing a dimension of the i-level second convolved feature using the deep output convolution layer, to obtain the i-level feature of the detection image. 
 
     
     
       3. The method of  claim 2 , wherein a convolution kernel size of the deep input convolution layer is 1*1, a convolution kernel size of the first deep nonlinear transformation convolution layer is 3*3, a convolution kernel size of the second deep nonlinear transformation convolution layer is 3*3, and a convolution kernel size of the deep output convolution layer is 1*1; and
 the deep input convolution layer is a standard convolution layer with a nonlinear activation function, the first deep nonlinear transformation convolution layer is a depthwise separable convolution layer with a nonlinear activation function, the second deep nonlinear transformation convolution layer is a depthwise separable convolution layer with a nonlinear activation function, and the deep output convolution layer is a standard convolution layer without an activation function. 
 
     
     
       4. The method of  claim 3 , wherein the second deep nonlinear transformation convolution layer is a depthwise separable atrous convolution layer with a nonlinear activation function. 
     
     
       5. The method of  claim 4 , wherein the depthwise separable atrous convolution layer sets a dilation rate in convolution operation, the dilation rate defines a spacing between data to be processed by the depthwise separable atrous convolution layer. 
     
     
       6. The method of  claim 1 , wherein the non-deep feature extraction model comprises a non-deep input convolution layer, a non-deep nonlinear transformation convolution layer, and a non-deep output convolution layer; and
 the extracting the non-deep feature comprises: 
 increasing a dimension of the (j−1+n)-level feature of the detection image using the non-deep input convolution layer, to obtain a (j+n)-level dimension feature of the detection image; 
 extracting a (j+n)-level convolved feature from the (j+n)-level dimension feature of the detection image using the non-deep nonlinear transformation convolution layer; and 
 decreasing a dimension of the (j+n)-level convolved feature using the non-deep output convolution layer, to obtain the (j+n)-level feature of the detection image. 
 
     
     
       7. The method of  claim 6 , wherein a convolution kernel size of the non-deep input convolution layer is 1*1, a convolution kernel size of the non-deep nonlinear transformation convolution layer is 3*3, and a convolution kernel size of the non-deep output convolution layer is 1*1; and
 the non-deep input convolution layer is a standard convolution layer with a nonlinear activation function, the non-deep nonlinear transformation convolution layer is a depthwise separable convolution layer with a nonlinear activation function, and the non-deep output convolution layer is a standard convolution layer without an activation function. 
 
     
     
       8. The method of  claim 7 , wherein the non-deep nonlinear transformation convolution layer is a depthwise separable atrous convolution layer with a nonlinear activation function. 
     
     
       9. The method according of  claim 1 , wherein the feature prediction model comprises a feature classification convolution layer and a feature output convolution layer; and
 the performing the information regression operation comprises: 
 extracting a classification recognition feature from the a-level feature to the (m+n)-level feature of the detection image using the feature classification convolution layer; and 
 decreasing a dimension of the classification recognition feature using the feature output convolution layer, to obtain the object type information and the object position information of the object in the detection image. 
 
     
     
       10. The method of  claim 9 , wherein a convolution kernel size of the feature classification convolution layer is 3*3, and a convolution kernel size of the feature output convolution layer is 1*1; and
 the feature classification convolution layer is a depthwise separable convolution layer without an activation function, and the feature output convolution layer is a standard convolution layer without an activation function. 
 
     
     
       11. The method of  claim 1 , wherein the object position information of the object in the detection image comprises center coordinates of the object or length and width of the object. 
     
     
       12. An apparatus for detecting an object in an image, comprising:
 a memory configured to store program code; and 
 a processor, to read the program code and configured to: 
 obtain a detection image, an n-level deep feature map framework, and an m-level non-deep feature map framework, n being an integer greater than or equal to 2, m being an integer greater than or equal to 1, and the n-level deep feature map framework and the m-level non-deep feature map framework comprising a feature size and a feature dimension; 
 extract, based on a deep feature extraction model, deep feature from an (i−1)-level feature of the detection image using an i-level deep feature map framework, to obtain an i-level feature of the detection image, i being a positive integer less than or equal to n; 
 extract, based on a non-deep feature extraction model, non-deep feature from a (j−1+n)-level feature of the detection image using a j-level non-deep feature map framework, to obtain a (j+n)-level feature of the detection image, j being a positive integer less than or equal to m; and 
 perform, based on a feature prediction model, an information regression operation on an a-level feature to an (m+n)-level feature of the detection image, to obtain an object type information and an object position information of an object in the detection image, a being an integer less than n and greater than or equal to 2. 
 
     
     
       13. The apparatus of  claim 12 , wherein the deep feature extraction model comprises a deep input convolution layer, a first deep nonlinear transformation convolution layer, a second deep nonlinear transformation convolution layer, and a deep output convolution layer; and
 the processor is configured to: 
 increase a dimension of the (i−1)-level feature of the detection image using the deep input convolution layer, to obtain an i-level dimension feature of the detection image; 
 extract an i-level first convolved feature from the i-level dimension feature of the detection image using the first deep nonlinear transformation convolution layer; 
 extract an i-level second convolved feature from the i-level first convolved feature using the second deep nonlinear transformation convolution layer; and 
 decrease a dimension of the i-level second convolved feature using the deep output convolution layer, to obtain the i-level feature of the detection image. 
 
     
     
       14. The apparatus of  claim 13 , wherein a convolution kernel size of the deep input convolution layer is 1*1, a convolution kernel size of the first deep nonlinear transformation convolution layer is 3*3, a convolution kernel size of the second deep nonlinear transformation convolution layer is 3*3, and a convolution kernel size of the deep output convolution layer is 1*1; and
 the deep input convolution layer is a standard convolution layer with a nonlinear activation function, the first deep nonlinear transformation convolution layer is a depthwise separable convolution layer with a nonlinear activation function, the second deep nonlinear transformation convolution layer is a depthwise separable convolution layer with a nonlinear activation function, and the deep output convolution layer is a standard convolution layer without an activation function. 
 
     
     
       15. The apparatus of  claim 14 , wherein the second deep nonlinear transformation convolution layer is a depthwise separable atrous convolution layer with a nonlinear activation function. 
     
     
       16. The apparatus of  claim 12 , wherein the non-deep feature extraction model comprises a non-deep input convolution layer, a non-deep nonlinear transformation convolution layer, and a non-deep output convolution layer; and
 the processor is configured to: 
 increase a dimension of the (j−1+n)-level feature of the detection image using the non-deep input convolution layer, to obtain a (j+n)-level dimension feature of the detection image; 
 extract a (j+n)-level convolved feature from the (j+n)-level dimension feature of the detection image using the non-deep nonlinear transformation convolution layer; and 
 decrease a dimension of the (j+n)-level convolved feature using the non-deep output convolution layer, to obtain the (j+n)-level feature of the detection image. 
 
     
     
       17. The apparatus of  claim 16 , wherein a convolution kernel size of the non-deep input convolution layer is 1*1, a convolution kernel size of the non-deep nonlinear transformation convolution layer is 3*3, and a convolution kernel size of the non-deep output convolution layer is 1*1; and
 the non-deep input convolution layer is a standard convolution layer with a nonlinear activation function, the non-deep nonlinear transformation convolution layer is a depthwise separable convolution layer with a nonlinear activation function, and the non-deep output convolution layer is a standard convolution layer without an activation function. 
 
     
     
       18. The apparatus of  claim 12 , wherein the feature prediction model comprises a feature classification convolution layer and a feature output convolution layer; and
 the processor is configured to: 
 extract a classification recognition feature from the a-level feature to the (m+n)-level feature of the detection image using the feature classification convolution layer; and 
 decrease a dimension of the classification recognition feature using the feature output convolution layer, to obtain the object type information and the object position information of the object in the detection image. 
 
     
     
       19. The apparatus of  claim 18 , wherein a convolution kernel size of the feature classification convolution layer is 3*3, and a convolution kernel size of the feature output convolution layer is 1*1; and
 the feature classification convolution layer is a depthwise separable convolution layer without an activation function, and the feature output convolution layer is a standard convolution layer without an activation function. 
 
     
     
       20. A non-transitory machine-readable media, having processor executable instructions stored thereon for causing a processor to:
 obtain a detection image, an n-level deep feature map framework, and an m-level non-deep feature map framework, n being an integer greater than or equal to 2, m being an integer greater than or equal to 1, and the n-level deep feature map framework and the m-level non-deep feature map framework comprising a feature size and a feature dimension; 
 extract, based on a deep feature extraction model, deep feature from an (i−1)-level feature of the detection image using an i-level deep feature map framework, to obtain an i-level feature of the detection image, i being a positive integer less than or equal to n; 
 extract, based on a non-deep feature extraction model, non-deep feature from a (j−1+n)-level feature of the detection image using a j-level non-deep feature map framework, to obtain a (j+n)-level feature of the detection image, j being a positive integer less than or equal to m; and 
 perform, based on a feature prediction model, an information regression operation on an a-level feature to an (m+n)-level feature of the detection image, to obtain an object type information and an object position information of an object in the detection image, a being an integer less than n and greater than or equal to 2.
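
The layer structure recited in claims 2 to 5 can be illustrated with a small weight-count sketch. This is an illustration under stated assumptions, not the patented implementation: the channel widths (32, 128, 48) and all function names are hypothetical, biases are ignored, and "depthwise separable convolution" is taken in its usual sense of a per-channel k×k convolution followed by a 1×1 pointwise convolution, which the claims do not spell out. The last helper gives the standard effective kernel extent of an atrous (dilated) convolution, where the dilation rate sets the spacing between sampled positions.

```python
def standard_conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_weights(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def effective_kernel(k: int, dilation: int) -> int:
    """Spatial extent covered by a k x k kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def deep_block_weights(c_in: int, c_mid: int, c_out: int) -> int:
    """Weight count for a block shaped like claims 2 and 3:
    1x1 standard conv (raise dimension), two 3x3 depthwise separable
    convs, then a 1x1 standard conv (lower dimension)."""
    return (
        standard_conv_weights(c_in, c_mid, 1)
        + 2 * depthwise_separable_weights(c_mid, c_mid, 3)
        + standard_conv_weights(c_mid, c_out, 1)
    )


# Hypothetical channel widths, chosen only for illustration.
print(deep_block_weights(32, 128, 48))           # 45312
print(standard_conv_weights(128, 128, 3))        # 147456
print(depthwise_separable_weights(128, 128, 3))  # 17536
print(effective_kernel(3, 2))                    # 5
```

With these assumed widths, a single standard 3×3 convolution at 128 channels would need more weights than the whole four-layer block, which is one common motivation for choosing depthwise separable layers. A dilation rate of 2 widens the 3×3 kernel's coverage to 5×5 without adding weights.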
