US12094208B2 (Active, Utility)

Video classification method, electronic device and storage medium

Assignee: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Priority: Mar 5, 2021
Filed: Oct 15, 2021
Granted: Sep 17, 2024
Est. expiry: Mar 5, 2041 (~14.7 yrs left), nominal 20-yr term from priority
Inventors: YANG HU, HE FENG, WANG QI, FENG ZHIFAN, CHAI CHUNGUANG, ZHU YONG
Classifications: G06N 3/08, G06V 20/70, G06F 18/241, G06V 10/806, G06V 10/768, G06N 20/00, G06V 10/40, G06V 10/764, G06V 10/82, G06V 10/22, G06F 18/253, G06F 18/214, G10L 15/08, G06V 20/635, G06V 20/41, G06V 30/10, G06V 30/1444, G06V 30/262, G06V 20/46
PatentIndex Score: 52
Cited by: 0
References: 48
Claims: 16

Abstract

The present disclosure provides a video classification method, an electronic device and a storage medium. It relates to the field of computer technologies, and particularly to artificial intelligence technologies such as knowledge graph technologies, computer vision technologies, deep learning technologies, or the like. The video classification method includes: extracting a keyword in a video according to multi-modal information of the video; acquiring background knowledge corresponding to the keyword, and determining a text to be recognized according to the keyword and the background knowledge; and classifying the text to be recognized to obtain a class of the video.
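
The three steps named in the abstract (keyword extraction, background-knowledge lookup, text classification) can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: `KNOWLEDGE_BASE`, `build_text`, `classify_video`, `toy_extract` and `toy_classify` are hypothetical names, the knowledge base is a made-up dictionary, and the extractor and classifier are trivial stand-ins rather than the models the patent describes.

```python
from typing import Callable, Dict, List

# Hypothetical toy knowledge base: keyword -> background knowledge.
KNOWLEDGE_BASE: Dict[str, str] = {
    "sourdough": "a bread leavened with wild yeast",
}


def build_text(keywords: List[str], kb: Dict[str, str]) -> str:
    """Combine each keyword with its background knowledge, when available."""
    parts = []
    for kw in keywords:
        background = kb.get(kw)
        parts.append(f"{kw}: {background}" if background else kw)
    return "; ".join(parts)


def classify_video(
    multimodal: Dict[str, str],
    extract_keywords: Callable[[Dict[str, str]], List[str]],
    classify_text: Callable[[str], str],
    kb: Dict[str, str] = KNOWLEDGE_BASE,
) -> str:
    """Keyword extraction -> text to be recognized -> class of the video."""
    keywords = extract_keywords(multimodal)
    text_to_recognize = build_text(keywords, kb)
    return classify_text(text_to_recognize)


def toy_extract(multimodal: Dict[str, str]) -> List[str]:
    """Stand-in extractor: keep words that appear in the toy knowledge base."""
    words = " ".join(multimodal.values()).lower().split()
    return [w for w in dict.fromkeys(words) if w in KNOWLEDGE_BASE]


def toy_classify(text: str) -> str:
    """Stand-in classifier: a single hand-written rule."""
    return "food" if "bread" in text else "other"


video = {"title": "My sourdough diary", "frame_text": "SOURDOUGH day 3"}
print(classify_video(video, toy_extract, toy_classify))
```

Passing the extractor and classifier as callables keeps the pipeline shape visible while leaving each stage replaceable by a real model.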

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A video classification method, comprising:
 extracting a keyword in a video according to multi-modal information of the video; 
 acquiring background knowledge corresponding to the keyword, and determining a text to be recognized according to the keyword and the background knowledge; and 
 classifying the text to be recognized to obtain a class of the video, 
 wherein the extracting a keyword in a video according to multi-modal information of the video comprises:
 performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information; 
 fusing the features corresponding to each piece of modal information to obtain a fused feature; and 
 performing a word labeling according to the fused feature in the video to determine the keyword in the video, 
 
 wherein the multi-modal information comprises text content and visual information, the visual information comprises first visual information and second visual information, the first visual information is visual information corresponding to a text in a video frame in the video, the second visual information is a key frame in the video, and the performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information comprises:
 performing a first text encoding operation on the text content to obtain a text feature; 
 performing a second text encoding operation on the first visual information to obtain a first visual feature; and 
 performing an image encoding operation on the second visual information to obtain a second visual feature. 
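
As a reading aid for the three encoding operations recited in claim 1, the following sketch produces one feature per modality. Everything in it is an assumption made for illustration: the claim does not specify the encoders, so a hashed bag-of-words stands in for both text encoders and a grayscale histogram stands in for the image encoder. It also assumes the first visual information is already available as a string (for example, text recognized in a frame) and that the key frame is a grid of 0..255 grayscale values. The names `encode_text`, `encode_image`, `extract_features` and `DIM` are hypothetical.

```python
import hashlib
from typing import Dict, List

DIM = 8  # hypothetical feature size for the toy text encoders


def encode_text(text: str, salt: str, dim: int = DIM) -> List[float]:
    """Hashed bag-of-words; `salt` keeps the two text encoders distinct."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5((salt + token).encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    return vec


def encode_image(pixels: List[List[int]], bins: int = 4) -> List[float]:
    """Normalized histogram over grayscale values in 0..255."""
    hist = [0.0] * bins
    count = 0
    for row in pixels:
        for value in row:
            hist[min(value * bins // 256, bins - 1)] += 1.0
            count += 1
    return [h / count for h in hist] if count else hist


def extract_features(
    text_content: str, frame_text: str, key_frame: List[List[int]]
) -> Dict[str, List[float]]:
    """One feature per piece of modal information."""
    return {
        "text_feature": encode_text(text_content, salt="text"),
        "first_visual_feature": encode_text(frame_text, salt="frame"),
        "second_visual_feature": encode_image(key_frame),
    }


features = extract_features(
    "how to bake bread", "BREAD 101", [[0, 255], [128, 64]]
)
print(features["second_visual_feature"])  # [0.25, 0.25, 0.25, 0.25]
```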
 
 
     
     
       2. The method according to  claim 1 , wherein the fusing the features corresponding to each piece of modal information to obtain a fused feature comprises:
 performing a vector stitching operation on the features corresponding to each piece of modal information, so as to obtain a stitched vector as the fused feature. 
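
One plausible reading of the vector stitching operation in claim 2 is plain concatenation of the per-modality feature vectors. The sketch below assumes that reading; `stitch` is a hypothetical helper name and the input values are made up.

```python
from typing import List, Sequence


def stitch(features: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-modality feature vectors into one fused vector."""
    stitched: List[float] = []
    for feature in features:
        stitched.extend(feature)
    return stitched


text_feature = [1.0, 2.0]
first_visual_feature = [3.0]
second_visual_feature = [4.0, 5.0]
fused = stitch([text_feature, first_visual_feature, second_visual_feature])
print(fused)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Under this reading the fused feature has a length equal to the sum of the input lengths, and the order of the modalities must stay fixed between training and inference.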
 
     
     
       3. The method according to  claim 1 , wherein the labeling the keyword according to the fused feature comprises:
 labeling the keyword according to the fused feature using a conditional random field. 
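
To illustrate how a conditional random field can label keywords, the sketch below runs Viterbi decoding for a linear-chain CRF over B/I/O tags and then collects the tagged spans. It is a sketch under assumptions, not the patented model: the tag set, the emission scores (which in practice would be computed from the fused feature) and the transition scores are all invented for the example, and `viterbi` and `keywords_from_tags` are hypothetical helper names.

```python
from typing import Dict, List

TAGS = ["B", "I", "O"]  # begin keyword, inside keyword, outside


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the highest-scoring tag sequence of a linear-chain CRF."""
    if not emissions:
        return []
    score = {tag: emissions[0][tag] for tag in TAGS}
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + transitions[p][cur])
            new_score[cur] = score[prev] + transitions[prev][cur] + emit[cur]
            pointers[cur] = prev
        score = new_score
        backpointers.append(pointers)
    tag = max(TAGS, key=lambda t: score[t])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


def keywords_from_tags(tokens: List[str], tags: List[str]) -> List[str]:
    """Join consecutive B/I tokens into keyword strings."""
    keywords: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "I" and current:
            current.append(token)
            continue
        if current:
            keywords.append(" ".join(current))
        current = [token] if tag == "B" else []
    if current:
        keywords.append(" ".join(current))
    return keywords


# Made-up scores for illustration only.
tokens = ["watch", "sourdough", "bread", "rise"]
emissions = [
    {"B": 0.1, "I": 0.0, "O": 2.0},
    {"B": 2.0, "I": 0.5, "O": 0.3},
    {"B": 1.0, "I": 1.2, "O": 0.4},
    {"B": 0.1, "I": 0.2, "O": 1.5},
]
transitions = {
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": -1.0, "I": 0.5, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 0.0},  # O -> I is strongly penalized
}
tags = viterbi(emissions, transitions)
print(tags)                              # ['O', 'B', 'I', 'O']
print(keywords_from_tags(tokens, tags))  # ['sourdough bread']
```

The transition scores are what distinguish a CRF from labeling each token independently: here the large penalty on O followed by I keeps a keyword from starting without a B tag.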
 
     
     
       4. The method according to  claim 1 , wherein the acquiring background knowledge corresponding to the keyword comprises:
 acquiring the background knowledge corresponding to the keyword from an existing knowledge base. 
 
     
     
       5. The method according to  claim 1 , wherein the classifying the text to be recognized comprises:
 classifying the text to be recognized using a classification model, the classification model being obtained after trained using broadcast television data. 
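
The claim states only that the classification model is trained using broadcast television data; it does not name a model type. As a stand-in, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing on labeled text. The class `NaiveBayesTextClassifier`, the training samples and the labels are all hypothetical and were invented for this example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def fit(self, samples: Iterable[Tuple[str, str]]) -> "NaiveBayesTextClassifier":
        for text, label in samples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text: str) -> str:
        total = sum(self.class_counts.values())
        best_label, best_score = "", -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented stand-in for labeled broadcast television transcripts.
training_data = [
    ("evening news anchor reports election results", "news"),
    ("weather forecast and headlines tonight", "news"),
    ("chef bakes bread in cooking show", "food"),
    ("recipe for sourdough bread and pastry", "food"),
]
model = NaiveBayesTextClassifier().fit(training_data)
print(model.predict("sourdough bread a bread leavened with wild yeast"))  # food
```

The input to `predict` here plays the role of the text to be recognized, that is, the keyword combined with its background knowledge.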
 
     
     
       6. The method according to  claim 1 , wherein the classifying the text to be recognized comprises:
 classifying the text to be recognized using a classification model, the classification model being obtained after trained using broadcast television data. 
 
     
     
       7. The method according to  claim 1 , wherein the classifying the text to be recognized comprises:
 classifying the text to be recognized using a classification model, the classification model being obtained after trained using broadcast television data. 
 
     
     
       8. The method according to  claim 2 , wherein the classifying the text to be recognized comprises:
 classifying the text to be recognized using a classification model, the classification model being obtained after trained using broadcast television data. 
 
     
     
       9. The method according to  claim 3 , wherein the classifying the text to be recognized comprises:
 classifying the text to be recognized using a classification model, the classification model being obtained after trained using broadcast television data. 
 
     
     
       10. The method according to  claim 4 , wherein the classifying the text to be recognized comprises:
 classifying the text to be recognized using a classification model, the classification model being obtained after trained using broadcast television data. 
 
     
     
       11. An electronic device, comprising:
 at least one processor; and 
 a memory communicatively connected with the at least one processor; 
 wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform video classification method, wherein video classification method comprises: 
 extracting a keyword in a video according to multi-modal information of the video; 
 acquiring background knowledge corresponding to the keyword, and determining a text to be recognized according to the keyword and the background knowledge; and 
 classifying the text to be recognized to obtain a class of the video, 
 wherein the extracting a keyword in a video according to multi-modal information of the video comprises:
 performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information; 
 fusing the features corresponding to each piece of modal information to obtain a fused feature; and 
 performing a word labeling according to the fused feature in the video to determine the keyword in the video, 
 
 wherein the multi-modal information comprises text content and visual information, the visual information comprises first visual information and second visual information, the first visual information is visual information corresponding to a text in a video frame in the video, the second visual information is a key frame in the video, and the performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information comprises:
 performing a first text encoding operation on the text content to obtain a text feature; 
 performing a second text encoding operation on the first visual information to obtain a first visual feature; and 
 performing an image encoding operation on the second visual information to obtain a second visual feature. 
 
 
     
     
       12. The electronic device according to  claim 11 , wherein the fusing the features corresponding to each piece of modal information to obtain a fused feature comprises:
 performing a vector stitching operation on the features corresponding to each piece of modal information, so as to obtain a stitched vector as the fused feature. 
 
     
     
       13. The electronic device according to  claim 11 , wherein the labeling the keyword according to the fused feature comprises:
 labeling the keyword according to the fused feature using a conditional random field. 
 
     
     
       14. The electronic device according to  claim 11 , wherein the acquiring background knowledge corresponding to the keyword comprises:
 acquiring the background knowledge corresponding to the keyword from an existing knowledge base. 
 
     
     
       15. The electronic device according to  claim 11 , wherein the classifying the text to be recognized comprises:
 classifying the text to be recognized using a classification model, the classification model being obtained after trained using broadcast television data. 
 
     
     
       16. A non-transitory computer readable storage medium with computer instructions stored thereon, wherein the computer instructions are used for causing a computer to perform a video classification method, wherein the video classification method comprises:
 extracting a keyword in a video according to multi-modal information of the video; 
 acquiring background knowledge corresponding to the keyword, and determining a text to be recognized according to the keyword and the background knowledge; and 
 classifying the text to be recognized to obtain a class of the video, 
 wherein the extracting a keyword in a video according to multi-modal information of the video comprises:
 performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information; 
 fusing the features corresponding to each piece of modal information to obtain a fused feature; and 
 performing a word labeling according to the fused feature in the video to determine the keyword in the video, 
 
 wherein the multi-modal information comprises text content and visual information, the visual information comprises first visual information and second visual information, the first visual information is visual information corresponding to a text in a video frame in the video, the second visual information is a key frame in the video, and the performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information comprises:
 performing a first text encoding operation on the text content to obtain a text feature; 
 performing a second text encoding operation on the first visual information to obtain a first visual feature; and 
 performing an image encoding operation on the second visual information to obtain a second visual feature.

Cited by (0)

No later patents cite this yet.
