US11170758B2 · Active · Utility · PatentIndex 82
Systems and methods for providing notifications within a media asset without breaking immersion

Assignee: ROVI GUIDES INC
Priority: Sep 27, 2018
Filed: Sep 27, 2018
Granted: Nov 9, 2021
Est. expiry: Sep 27, 2038 (~12.2 yrs left) · nominal 20-yr term from priority
Inventors: GUPTA VIKRAM MAKAM; VARSHNEY PRATEEK; SEETHARAM MADHUSUDHAN; SRIVASTAVA ASHISH KUMAR; SREEKANTH HARSHITH KUMAR GEJJEGONDANAHALLY
Classifications: H04M 1/72433, G10L 13/027, G10L 13/033, G06F 40/205, H04M 2201/39, H04W 68/005, H04M 1/72442, G06F 40/279, G10L 13/08, G10L 13/00
Abstract

Systems and methods for providing notifications without breaking media immersion. A notification delivery application receives notification data while a media device provides a media asset. In response to receiving the notification data while the media device provides the media asset, the notification delivery application generates a voice model based on a voice detected in the media asset. The notification delivery application converts the notification data to synthesized speech using the voice model and generates, by the media device, the synthesized speech for output at an appropriate point in the media asset based on contextual features of the media asset.
Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A method for providing notifications without breaking media immersion, the method comprising:
 receiving notification data while a media device provides a media asset; 
 in response to receiving the notification data while the media device provides the media asset:
 determining whether an audio component of the media asset comprises a voice; 
 in response to determining that the audio component comprises the voice, generating a text-to-voice model based on characteristics of the voice; 
 converting the notification data to synthesized speech using the text-to-voice model; 
 determining a playback position in the media asset for outputting the synthesized speech, based on the notification data; and 
 generating, for output at the playback position in the media asset by the media device, the synthesized speech. 
 
 
     
     
       2. The method of claim 1, wherein determining whether the audio component of the media asset comprises the voice comprises:
 extracting frequency and temporal characteristics from the audio component; 
 retrieving, from memory, vocal characteristics that comprise frequency and temporal information of speech; 
 comparing the frequency and temporal characteristics from the audio component with the vocal characteristics; and 
 in response to determining that the frequency and temporal characteristics correspond to the vocal characteristics, determining that the audio component comprises the voice. 
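
As an illustration only (this note and the code are not part of the claim text), the comparison in claim 2 can be sketched in Python. The sketch assumes a deliberately crude feature set: a dominant frequency estimated from zero crossings and the clip duration as the only temporal characteristic. `VocalProfile`, `extract_characteristics`, `comprises_voice`, `sine`, and the 85 to 255 Hz range are hypothetical choices, not details taken from the patent.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VocalProfile:
    """Stored vocal characteristics (hypothetical values chosen by the caller)."""
    min_hz: float
    max_hz: float
    min_duration_s: float


def extract_characteristics(samples, sample_rate):
    """Return (estimated dominant frequency in Hz, duration in seconds)."""
    # Count sign changes between consecutive samples; a pure tone crosses
    # zero about twice per cycle.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    duration_s = len(samples) / sample_rate
    est_hz = crossings / (2 * duration_s) if duration_s else 0.0
    return est_hz, duration_s


def comprises_voice(samples, sample_rate, profile):
    """Compare extracted characteristics against the stored vocal profile."""
    est_hz, duration_s = extract_characteristics(samples, sample_rate)
    in_band = profile.min_hz <= est_hz <= profile.max_hz
    return in_band and duration_s >= profile.min_duration_s


def sine(freq_hz, seconds, sample_rate=8000):
    """Generate a test tone (stand-in for a decoded audio component)."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# Assumed profile: rough fundamental-frequency band of adult speech.
SPEECH = VocalProfile(min_hz=85.0, max_hz=255.0, min_duration_s=0.2)
```

A real detector would use far richer features (for example spectral envelopes and energy contours); the point here is only the extract, retrieve, compare, decide sequence.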
 
     
     
       3. The method of claim 1, wherein converting the notification data to the synthesized speech using the text-to-voice model comprises:
 identifying textual information in the notification data; and 
 generating the synthesized speech based on the textual information, wherein the synthesized speech is an audio clip comprising a recitation, made by the text-to-voice model, of the textual information. 
 
     
     
       4. The method of claim 1, wherein determining the playback position in the media asset for outputting the synthesized speech further comprises:
 parsing the notification data into textual information; 
 identifying a keyword from the textual information; 
 retrieving, from memory, a plurality of priority keywords, wherein each priority keyword of the plurality of priority keywords is associated with a respective priority level; 
 comparing the keyword from the textual information to each priority keyword of the plurality of priority keywords; and 
 in response to determining that the keyword from the textual information matches a first priority keyword that is associated with a first priority level, determining the playback position in the media asset for outputting the synthesized speech, based on both the first priority level and the contextual features of the media asset. 
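
As an illustration only (not part of the claim text), the keyword matching in claim 4 might look like the following Python sketch. The keyword table, the convention that level 1 is the most urgent, and the policy mapping a priority level to a position (level 1 plays immediately, lower priorities wait for later silence periods) are all assumptions; the claim only requires that the position depend on both the priority level and contextual features.

```python
import re

# Hypothetical priority table: keyword -> priority level (1 is most urgent).
PRIORITY_KEYWORDS = {"emergency": 1, "urgent": 1, "meeting": 2, "sale": 3}


def match_priority(notification_text, priority_keywords):
    """Return the most urgent level matched by any word, or None if no match."""
    words = re.findall(r"[a-z']+", notification_text.lower())
    levels = [priority_keywords[w] for w in words if w in priority_keywords]
    return min(levels) if levels else None


def choose_position(current_s, silence_starts_s, priority_level):
    """Pick a playback position from the priority level and upcoming silences."""
    upcoming = sorted(s for s in silence_starts_s if s >= current_s)
    if priority_level == 1 or not upcoming:
        return current_s  # assumed policy: top priority is not delayed
    if priority_level is None:
        return upcoming[-1]  # assumed policy: unmatched text waits longest
    # Level 2 takes the first upcoming silence, level 3 the second, and so on.
    index = min(priority_level - 2, len(upcoming) - 1)
    return upcoming[index]
```

For example, `choose_position(100, [90, 120, 300], match_priority("Team meeting at 5", PRIORITY_KEYWORDS))` selects the first silence period that starts after the current position.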
 
     
     
       5. The method of claim 1, wherein determining the playback position in the media asset for outputting the synthesized speech comprises:
 retrieving notification access data from memory, wherein the notification access data is indicative of receipt times and access times for a plurality of notification types; 
 identifying a notification type associated with the notification data; 
 determining, based on the notification access data, an access delay for the notification type, wherein the access delay represents a time difference between when a notification of the notification type was received and when the notification of the notification type was accessed; 
 identifying a current play position of the media asset; and 
 determining that the playback position is a sum of the current play position and the access delay. 
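
As an illustration only (not part of the claim text), the arithmetic in claim 5 can be sketched in Python. The log layout `(notification_type, received_s, accessed_s)` and the use of a mean over past delays are assumptions; the claim itself only says the access delay is a time difference between receipt and access for the notification type.

```python
from statistics import mean


def access_delay_s(access_log, notification_type):
    """Average receipt-to-access delay for one notification type (0 if unseen)."""
    delays = [
        accessed - received
        for kind, received, accessed in access_log
        if kind == notification_type
    ]
    return mean(delays) if delays else 0.0


def playback_position_s(current_s, access_log, notification_type):
    """Playback position = current play position + access delay."""
    return current_s + access_delay_s(access_log, notification_type)


# Hypothetical history: the user opens SMS quickly and email slowly.
LOG = [("sms", 0, 30), ("sms", 100, 190), ("email", 0, 600)]
```

With this history, an SMS arriving at play position 500 s would be voiced at 560 s, while an email would be deferred to 1100 s.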
 
     
     
       6. The method of claim 1, wherein determining the playback position in the media asset for outputting the synthesized speech is further based on contextual features of the media asset, and further comprises:
 determining the contextual features of the media asset, wherein the contextual features comprise silence periods, by:
 retrieving metadata of the media asset; and 
 identifying, based on the metadata, a plurality of silence periods in the media asset, wherein a silence period of the plurality of silence periods is indicative of a time period in the media asset in which no voices are detected; 
 determining a candidate playback position in the media asset that is within the silence period; 
 identifying the candidate playback position as the playback position in the media asset for outputting the synthesized speech. 
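
As an illustration only (not part of the claim text), the silence-period lookup in claim 6 can be sketched in Python. The sketch assumes the metadata exposes dialogue intervals (for instance, subtitle cue timings) so that silence periods are the gaps between them, and it adds one requirement the claim does not state: the chosen period must be long enough to hold the synthesized clip. All names are hypothetical.

```python
def silence_periods(dialogue_intervals, asset_duration_s):
    """Derive (start, end) gaps in which no dialogue interval is active."""
    periods = []
    cursor = 0.0
    for start, end in sorted(dialogue_intervals):
        if start > cursor:
            periods.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping intervals
    if cursor < asset_duration_s:
        periods.append((cursor, asset_duration_s))
    return periods


def candidate_position(periods, current_s, clip_s):
    """First position at or after current_s where the clip fits in a silence."""
    for start, end in periods:
        begin = max(start, current_s)
        if end - begin >= clip_s:
            return begin
    return None  # no suitable silence period remains
```

For dialogue at `[(5, 20), (15, 25), (22, 40)]` in a 60-second asset, the silence periods are 0 to 5 s and 40 to 60 s, so a 4-second clip requested at 3 s lands at 40 s.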
 
 
     
     
       7. The method of claim 6, wherein the contextual features comprise keywords and wherein determining the playback position in the media asset for outputting the synthesized speech comprises:
 retrieving a keyword from memory; 
 retrieving metadata of the media asset; 
 identifying, based on the metadata, a time position in the media asset at which the keyword is recited; 
 identifying a silence period in the media asset that subsequently follows the time position at which the keyword is recited; 
 determining a candidate playback position in the media asset that is within the silence period; 
 identifying the candidate playback position as the playback position in the media asset for outputting the synthesized speech. 
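
As an illustration only (not part of the claim text), claim 7 can be sketched in Python under the assumption that the metadata contains timed transcript cues `(start_s, end_s, text)` and a list of silence periods. Treating the cue start as the time at which the keyword is recited, and taking the start of the next silence period as the candidate position, are simplifications made for this sketch.

```python
import re


def keyword_time(cues, keyword):
    """Start time of the first cue whose text contains the keyword as a word."""
    target = keyword.lower()
    for start, _end, text in sorted(cues):
        if target in re.findall(r"[a-z']+", text.lower()):
            return start
    return None


def position_after_keyword(cues, silences, keyword):
    """Start of the first silence period that follows the keyword's recitation."""
    recited_at = keyword_time(cues, keyword)
    if recited_at is None:
        return None
    for start, _end in sorted(silences):
        if start >= recited_at:
            return start
    return None


# Hypothetical metadata for a short scene.
CUES = [(0, 4, "Hello there"), (10, 14, "Well, goodbye."), (30, 33, "Later")]
SILENCES = [(4, 10), (14, 30), (33, 60)]
```

Here the keyword "goodbye" is recited in the cue starting at 10 s, and the silence period that follows begins at 14 s.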
 
     
     
       8. The method of claim 1, wherein determining the playback position in the media asset for outputting the synthesized speech comprises:
 detecting that a different voice is being outputted in the media asset; 
 determining a position in the media asset when the different voice ceases output; and 
 identifying the position as the playback position in the media asset for outputting the synthesized speech. 
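
As an illustration only (not part of the claim text), claim 8 can be sketched in Python assuming speaker-labelled segments `(start_s, end_s, speaker)` are available, for example from a diarization step that the claim does not describe. Returning the current position when no different voice is active is an assumption of the sketch.

```python
def position_when_other_voice_ceases(segments, current_s, model_speaker):
    """Return when the currently active, different voice stops speaking."""
    for start, end, speaker in sorted(segments):
        if start <= current_s < end and speaker != model_speaker:
            return end  # the different voice ceases output here
    return current_s  # assumed fallback: no different voice is active


# Hypothetical segments; the voice model was built from "alice".
SEGMENTS = [(0, 10, "alice"), (10, 18, "bob"), (18, 25, "alice")]
```

At play position 12 s the active voice is "bob", which differs from the modelled voice, so the notification would be scheduled for 18 s.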
 
     
     
       9. The method of claim 1, wherein generating, for output at the playback position in the media asset by the media device, the synthesized speech comprises:
 pausing the media asset at the playback position; 
 in response to pausing the media asset at the playback position, generating for output the synthesized speech; and 
 unpausing the media asset at the playback position in response to completing output of the synthesized speech. 
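
As an illustration only (not part of the claim text), the pause, output, unpause sequence in claim 9 can be sketched in Python. `FakePlayer` is a hypothetical stand-in for a media player, and `play_clip` is any callable that blocks until the synthesized speech has finished.

```python
class FakePlayer:
    """Minimal stand-in for a media player; records what happens to it."""

    def __init__(self, position_s):
        self.position_s = position_s
        self.paused = False
        self.events = []

    def pause(self):
        self.paused = True
        self.events.append("pause")

    def resume(self):
        self.paused = False
        self.events.append("resume")


def output_with_pause(player, play_clip, clip):
    """Pause the asset, output the clip, then resume at the same position."""
    player.pause()
    try:
        play_clip(clip)  # assumed to return only once output has completed
    finally:
        player.resume()  # resume even if clip output fails
```

Because the player never advances while paused, playback resumes at the same position at which it was paused.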
 
     
     
       10. The method of claim 1, wherein generating, for output at the playback position in the media asset by the media device, the synthesized speech, comprises outputting the synthesized speech at a higher frequency than a normal frequency of the voice. 
     
     
       11. A system for providing notifications without breaking media immersion, the system comprising:
 audio generating circuitry; 
 control circuitry configured to:
 receive notification data while a media device provides a media asset; 
 in response to receiving the notification data while the media device provides the media asset:
 determine whether an audio component of the media asset comprises a voice; 
 in response to determining that the audio component comprises the voice, generate a text-to-voice model based on characteristics of the voice; 
 convert the notification data to synthesized speech using the text-to-voice model; 
 determine a playback position in the media asset for outputting the synthesized speech, based on the notification data; and 
 generating, via audio generating circuitry, the synthesized speech for output at the playback position in the media asset by the media device. 
 
 
 
     
     
       12. The system of claim 11, wherein the control circuitry, when determining whether the audio component of the media asset comprises the voice, is further configured to:
 extract frequency and temporal characteristics from the audio component; 
 retrieve, from memory, vocal characteristics that comprise frequency and temporal information of speech; 
 compare the frequency and temporal characteristics from the audio component with the vocal characteristics; and 
 in response to determining that the frequency and temporal characteristics correspond to the vocal characteristics, determine that the audio component comprises the voice. 
 
     
     
       13. The system of claim 11, wherein the control circuitry, when converting the notification data to the synthesized speech using the text-to-voice model, is further configured to:
 identify textual information in the notification data; and 
 generate the synthesized speech based on the textual information, wherein the synthesized speech is an audio clip comprising a recitation, made by the text-to-voice model, of the textual information. 
 
     
     
       14. The system of claim 11, wherein the control circuitry, when determining the playback position in the media asset for outputting the synthesized speech, is further configured to:
 parse the notification data into textual information; 
 identify a keyword from the textual information; 
 retrieve, from memory, a plurality of priority keywords, wherein each priority keyword of the plurality of priority keywords is associated with a respective priority level; 
 compare the keyword from the textual information to each priority keyword of the plurality of priority keywords; and 
 in response to determining that the keyword from the textual information matches a first priority keyword that is associated with a first priority level, determine the playback position in the media asset for outputting the synthesized speech, based on both the first priority level and the contextual features of the media asset. 
 
     
     
       15. The system of claim 11, wherein the control circuitry, when determining the playback position in the media asset for outputting the synthesized speech, is further configured to:
 retrieve notification access data from memory, wherein the notification access data is indicative of receipt times and access times for a plurality of notification types; 
 identify a notification type associated with the notification data; 
 determine, based on the notification access data, an access delay for the notification type, wherein the access delay represents a time difference between when a notification of the notification type was received and when the notification of the notification type was accessed; 
 identify a current play position of the media asset; and 
 determine that the playback position is a sum of the current play position and the access delay. 
 
     
     
       16. The system of claim 11, wherein the control circuitry, when determining the playback position in the media asset for outputting the synthesized speech further based on contextual features of the media asset, is further configured to:
 determine the contextual features of the media asset, wherein the contextual features comprise silence periods, by:
 retrieving metadata of the media asset; and 
 identifying, based on the metadata, a plurality of silence periods in the media asset, wherein a silence period of the plurality of silence periods is indicative of a time period in the media asset in which no voices are detected; 
 determine a candidate playback position in the media asset that is within the silence period; 
 identify the candidate playback position as the playback position in the media asset for outputting the synthesized speech. 
 
 
     
     
       17. The system of claim 16, wherein the contextual features comprise keywords and wherein the control circuitry, when determining the playback position in the media asset for outputting the synthesized speech, is further configured to:
 retrieve a keyword from memory; 
 retrieve metadata of the media asset; 
 identify, based on the metadata, a time position in the media asset at which the keyword is recited; 
 identify a silence period in the media asset that subsequently follows the time position at which the keyword is recited; 
 determine a candidate playback position in the media asset that is within the silence period; 
 identify the candidate playback position as the playback position in the media asset for outputting the synthesized speech. 
 
     
     
       18. The system of claim 11, wherein the control circuitry, when determining the playback position in the media asset for outputting the synthesized speech, is further configured to:
 detect that a different voice is being outputted in the media asset; 
 determine a position in the media asset when the different voice ceases output; and 
 identify the position as the playback position in the media asset for outputting the synthesized speech. 
 
     
     
       19. The system of claim 11, wherein the control circuitry, when generating, via the audio generating circuitry, the synthesized speech for output at the playback position in the media asset by the media device, is further configured to:
 pause the media asset at the playback position; 
 in response to pausing the media asset at the playback position, generate for output the synthesized speech; and 
 unpause the media asset at the playback position in response to completing output of the synthesized speech. 
 
     
     
       20. The system of claim 11, wherein the control circuitry, when generating, via the audio generating circuitry, the synthesized speech for output at the playback position in the media asset by the media device, is further configured to output the synthesized speech at a higher frequency than a normal frequency of the voice.
Cited by (0)

No later patents cite this yet.
References (0)

No backward citations on record.