Sentence generation method, sentence generation apparatus, and smart device
Abstract
The present disclosure provides a sentence generation method, a sentence generation apparatus, and a smart device. The method includes: obtaining an input sentence; searching for one or more structurally similar sentences of the input sentence; finding one or more semantically similar sentences of the structurally similar sentence(s); parsing the input sentence and the structurally similar sentence(s) to obtain a subject block, a predicate block, and an object block of each; rewriting the semantically similar sentence(s) by substituting these blocks to generate a new sentence; filtering the new sentence based on a preset filtering condition; and labeling the filtered new sentence as a semantically similar sentence of the input sentence. In this manner, a plurality of new sentences with different sentence patterns can be generated from the same input sentence, which improves the controllability of sentence generation and saves labor cost.
Claims
What is claimed is:
1. A computer-implemented sentence generation method, comprising executing on a processor steps of:
obtaining an input sentence;
obtaining a first dependency tree of the input sentence and a second dependency tree of each sentence in a preset corpus, and searching for one or more structurally similar sentences of the input sentence based on a matching degree of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, wherein the one or more structurally similar sentences are structurally similar to the input sentence;
finding one or more semantically similar sentences of the one or more structurally similar sentences;
parsing the input sentence and the one or more structurally similar sentences to obtain a subject block, an object block, and a predicate block of each of the input sentence and the one or more structurally similar sentences, wherein the subject block is obtained by extracting each sentence based on a subject of the sentence and a dependency of the subject, and the object block is obtained by extracting each sentence based on an object of the sentence and a dependency of the object;
rewriting each of the one or more semantically similar sentences to generate at least one new sentence, by substituting the subject block in the each of the one or more semantically similar sentences with the subject block in the input sentence, substituting the object block in the each of the one or more semantically similar sentences with the object block in the input sentence, and substituting the predicate block in the each of the one or more semantically similar sentences with the predicate block in the input sentence;
filtering the at least one new sentence based on a preset filtering condition; and
labeling the at least one filtered sentence as a semantically similar sentence of the input sentence.
2. The method of claim 1, wherein the step of searching for the one or more structurally similar sentences of the input sentence based on the matching degree of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence comprises:
obtaining all sub-paths of the first dependency tree of the input sentence, wherein each sub-path is a line without a branch formed between any amount of adjacent nodes in the first dependency tree;
obtaining all sub-paths of the second dependency tree of any sentence in the corpus;
classifying the sub-paths with a same dependency into a same sub-path category; and
calculating a similarity of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence based on the following formula:
$$S = \sum_{i \in I} \frac{\mathrm{count}_i(S_1) \times \mathrm{count}_i(S_2)}{2^{\mathrm{meanDeep}(i)}};$$
where, $I$ is a set of all the sub-paths in the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, $\mathrm{count}_i(S_1)$ indicates an amount of occurrences of the sub-paths in the second dependency tree of the each sentence in the corpus and belonging to the sub-path category $i$, the $\mathrm{count}_i(S_2)$ indicates an amount of occurrences of the sub-paths in the first dependency tree of the input sentence and belonging to the sub-path category $i$, and $\mathrm{meanDeep}(i)$ indicates an average distance from a first node of each sub-path in the sub-path category $i$ to a root node of the corresponding dependency tree; and
determining the one or more structurally similar sentences of the input sentence based on the similarity of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, wherein the one or more structurally similar sentences are one or more of the sentences in the corpus having the second dependency tree with a similarity with the first dependency tree of the input sentence exceeding a preset similarity threshold.
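The sub-path similarity recited in claim 2 can be illustrated with a minimal Python sketch. Several details here are assumptions and are not taken from the claims: each tree is given as a pair of lists (`heads`, `labels`), a sub-path is read as a downward chain of one or more edges, a sub-path category is the tuple of dependency labels along that chain, and meanDeep is averaged over the sub-paths of both trees. The function names `sub_paths` and `tree_similarity` are hypothetical.

```python
from collections import defaultdict


def sub_paths(heads, labels):
    """Map each sub-path category to the depths of its starting nodes.

    heads[k] is the index of token k's head (-1 for the root) and
    labels[k] is the dependency label on the edge head -> k.
    """
    children = defaultdict(list)
    for k, h in enumerate(heads):
        if h >= 0:
            children[h].append(k)

    def depth(k):
        # Distance from node k up to the root.
        d = 0
        while heads[k] >= 0:
            k = heads[k]
            d += 1
        return d

    found = defaultdict(list)

    def extend(start_depth, node, path):
        # Grow an unbranched chain downward, recording every prefix.
        for child in children[node]:
            new_path = path + (labels[child],)
            found[new_path].append(start_depth)
            extend(start_depth, child, new_path)

    for k in range(len(heads)):
        extend(depth(k), k, ())
    return found


def tree_similarity(tree_a, tree_b):
    """Sum count_i(S1) * count_i(S2) / 2 ** meanDeep(i) over categories i."""
    paths_a, paths_b = sub_paths(*tree_a), sub_paths(*tree_b)
    score = 0.0
    # Categories missing from either tree have a zero count and add nothing.
    for category in set(paths_a) & set(paths_b):
        depths = paths_a[category] + paths_b[category]
        mean_deep = sum(depths) / len(depths)
        score += len(paths_a[category]) * len(paths_b[category]) / 2 ** mean_deep
    return score
```

Under these assumptions, sub-paths that start near the root contribute more to the score than those that start deep in the tree, since the divisor grows with the average starting depth.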
3. The method of claim 2, wherein when a number of the one or more structurally similar sentences is less than a preset first amount, the preset similarity threshold is reduced based on a preset adjustment parameter; and
when the number of the one or more structurally similar sentences is more than a preset second amount, the one or more structurally similar sentences are filtered in an ascending order of the similarity.
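One possible reading of the threshold adjustment in claim 3 is sketched below. It rests on assumptions: the threshold is lowered by a fixed `step` until enough sentences pass, and "filtered in an ascending order of the similarity" is interpreted as discarding the lowest-scoring sentences first. The function and parameter names are hypothetical.

```python
def select_structurally_similar(scores, threshold, min_count, max_count, step):
    """Pick sentences whose similarity exceeds an adjustable threshold.

    scores maps each corpus sentence to its similarity with the input.
    The result is ordered from most to least similar.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    selected = [s for s, v in ranked if v > threshold]
    # Too few hits: relax the threshold step by step (assumed policy).
    while len(selected) < min_count and threshold > 0:
        threshold -= step
        selected = [s for s, v in ranked if v > threshold]
    # Too many hits: drop the least similar ones first.
    return selected[:max_count]
```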
4. The method of claim 1, wherein the step of filtering the at least one new sentence based on the preset filtering condition comprises:
detecting whether there is a redundant content word in any new sentence, wherein the redundant content word is a content word that does not exist in the input sentence;
excluding the new sentence in response to there being the redundant content word in the new sentence; and
retaining the new sentence in response to there being no redundant content word in the new sentence.
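The redundant-content-word check of claim 4 might look like the following sketch. The claims do not say how content words are identified; here a small stop list of function words (`FUNCTION_WORDS`, a hypothetical placeholder) stands in for a part-of-speech tagger, and sentences are split on whitespace.

```python
# Hypothetical stop list; a real system would more likely use POS tags.
FUNCTION_WORDS = {
    "a", "an", "the", "is", "are", "do", "does", "to", "of", "in",
    "on", "for", "i", "you", "what", "how", "can", "please",
}


def content_words(sentence):
    """Return the set of words that are not in the function-word list."""
    return {w for w in sentence.lower().split() if w not in FUNCTION_WORDS}


def filter_redundant(input_sentence, new_sentences):
    """Keep only new sentences whose content words all occur in the input."""
    allowed = content_words(input_sentence)
    return [s for s in new_sentences if content_words(s) <= allowed]
```

For instance, with the input "the robot can dance", the candidate "can the robot dance" would be retained, while "the robot can dance tonight" would be excluded because "tonight" does not occur in the input.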
5. The method of claim 1, wherein the step of filtering the at least one new sentence based on the preset filtering condition comprises:
obtaining a sum of word vectors of any new sentence and a sum of word vectors of the input sentence;
calculating a cosine similarity of the sum of the word vectors of the new sentence and the sum of the word vectors of the input sentence;
sorting all the new sentences according to a descending order of the cosine similarity; and
retaining the first $X_1$ new sentences based on a result of the sorting, wherein $X_1$ is a preset positive integer.
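The cosine-similarity filter of claim 5 can be sketched in plain Python as follows. The embedding table, its dimensionality `dim`, the whitespace tokenization, and the treatment of unknown words as zero vectors are all assumptions made for illustration.

```python
import math


def sentence_vector(sentence, embeddings, dim):
    """Sum the word vectors of a sentence; unknown words count as zeros."""
    total = [0.0] * dim
    for word in sentence.lower().split():
        for j, x in enumerate(embeddings.get(word, [0.0] * dim)):
            total[j] += x
    return total


def cosine(u, v):
    """Cosine similarity of two vectors, or 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_by_cosine(input_sentence, new_sentences, embeddings, dim, x1):
    """Sort candidates by descending cosine similarity and keep the first x1."""
    target = sentence_vector(input_sentence, embeddings, dim)
    ranked = sorted(
        new_sentences,
        key=lambda s: cosine(sentence_vector(s, embeddings, dim), target),
        reverse=True,
    )
    return ranked[:x1]
```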
6. The method of claim 5, wherein $X_1$ has a positive proportional relationship with a total number of the at least one new sentence; or
wherein $X_1$ has a positive proportional relationship with a number of new sentences with the cosine similarity higher than a cosine similarity threshold.
7. The method of claim 1, wherein the step of filtering the at least one new sentence based on the preset filtering condition comprises:
calculating a perplexity of any new sentence based on a trained language model and a preset perplexity calculation formula, wherein the perplexity indicating a fluency degree of one sentence is calculated using a formula of:
$$PP(S_{new}) = p(w_1 w_2 \ldots w_M)^{-\frac{1}{M}} = \sqrt[M]{\frac{1}{p(w_1 w_2 \ldots w_M)}} = \sqrt[M]{\prod_{i=1}^{M} \frac{1}{p(w_i \mid w_1 w_2 \ldots w_{i-1})}};$$
where, $S_{new}$ indicates a new sentence, $M$ is the length of the new sentence $S_{new}$, $p(w_i)$ is a probability of the i-th word in the new sentence $S_{new}$, and the probability is obtained based on the language model;
sorting the new sentences according to an ascending order of the perplexity; and
retaining the first $X_2$ new sentences based on a result of the sorting, wherein $X_2$ is a preset positive integer.
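The perplexity filter of claim 7 can be sketched as below. The language model is abstracted as a callable `cond_prob(word, history)` that returns a strictly positive conditional probability; that interface, and the function names, are assumptions, since the claims only require "a trained language model". The product is evaluated in log space to avoid underflow on long sentences, and each sentence is assumed to be non-empty.

```python
import math


def perplexity(words, cond_prob):
    """Compute p(w_1 ... w_M) ** (-1 / M) from conditional probabilities."""
    log_sum = sum(
        math.log(cond_prob(word, words[:i])) for i, word in enumerate(words)
    )
    return math.exp(-log_sum / len(words))


def top_by_perplexity(sentences, cond_prob, x2):
    """Sort sentences by ascending perplexity and keep the first x2."""
    ranked = sorted(sentences, key=lambda s: perplexity(s.split(), cond_prob))
    return ranked[:x2]
```

As a sanity check on the formula, a model that assigns every word a probability of 0.25 gives a perplexity of 4 regardless of sentence length.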
8. The method of claim 7, wherein $X_2$ has a positive proportional relationship with a total number of the at least one new sentence; or
wherein $X_2$ is in a positive proportional relationship with a number of new sentences with the perplexity below a perplexity threshold.
9. The method of claim 1, wherein the step of filtering the at least one new sentence based on the preset filtering condition comprises:
filtering the at least one new sentence based on content words in the at least one new sentence, a cosine similarity between each of the at least one new sentence and the input sentence, and a perplexity of each of the at least one new sentence in order;
wherein, the step of filtering the at least one new sentence based on the content words in the at least one new sentence comprises:
detecting whether there is a redundant content word in any new sentence, wherein the redundant content word is a content word that does not exist in the input sentence;
excluding the new sentence in response to there being the redundant content word in the new sentence; and
retaining the new sentence in response to there being no redundant content word in the new sentence;
wherein, the step of filtering the at least one new sentence based on the cosine similarity between the each of the at least one new sentence and the input sentence comprises:
obtaining a sum of word vectors of the new sentence retained after filtering the new sentences based on the content words in the at least one new sentence and a sum of word vectors of the input sentence;
calculating a cosine similarity of the sum of the word vectors of the new sentence and the sum of the word vectors of the input sentence;
sorting all the new sentences according to a descending order of the cosine similarity; and
retaining the first $X_3$ new sentences based on a result of the sorting, wherein $X_3$ is a preset positive integer;
wherein, the step of filtering the at least one new sentence based on the perplexity of the each of the at least one new sentence comprises:
calculating a perplexity of the new sentence retained after filtering based on the cosine similarity of the new sentence and the input sentence based on a trained language model and a preset perplexity calculation formula, wherein the perplexity indicating a fluency degree of one sentence is as follows:
$$PP(S_{new}) = p(w_1 w_2 \ldots w_M)^{-\frac{1}{M}} = \sqrt[M]{\frac{1}{p(w_1 w_2 \ldots w_M)}} = \sqrt[M]{\prod_{i=1}^{M} \frac{1}{p(w_i \mid w_1 w_2 \ldots w_{i-1})}};$$
where, $S_{new}$ indicates a new sentence, $M$ is the length of the new sentence $S_{new}$, $p(w_i)$ is a probability of the i-th word in the new sentence $S_{new}$, and the probability is obtained based on the language model;
sorting the new sentences according to an ascending order of the perplexity; and
retaining the first $X_4$ new sentences based on a result of the sorting, wherein $X_4$ is a preset positive integer, and $X_4$ is smaller than $X_3$.
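The three-stage ordering recited in claim 9 (content words, then cosine similarity, then perplexity) can be sketched as a small cascade. The three scoring steps are passed in as callables so the sketch stays self-contained; their signatures and the name `cascade_filter` are assumptions for illustration only.

```python
def cascade_filter(input_sentence, candidates, has_redundant, similarity,
                   perplexity, x3, x4):
    """Apply the content-word, cosine, and perplexity filters in that order."""
    if not x4 < x3:
        raise ValueError("x4 must be smaller than x3")
    # Stage 1: drop candidates containing a redundant content word.
    kept = [s for s in candidates if not has_redundant(input_sentence, s)]
    # Stage 2: keep the x3 candidates most similar to the input.
    kept = sorted(
        kept, key=lambda s: similarity(input_sentence, s), reverse=True
    )[:x3]
    # Stage 3: keep the x4 most fluent (lowest perplexity) of those.
    return sorted(kept, key=perplexity)[:x4]
```

Running the cheap set-membership check first means the more expensive language-model scoring is applied only to the candidates that survive the earlier stages.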
10. The method of claim 1, wherein the subject block is obtained based on a subject-based adjective modification relationship, a subject-based conjunction relationship, or a subject-based direct object relationship in the sentence; and
the object block is obtained based on an object-based adjective modification relationship, an object-based conjunction relationship, or an object-based direct object relationship.
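A rough sketch of block extraction along the lines of claim 10 follows. It assumes a dependency parse given as parallel lists (`words`, `heads`, `labels`) and uses Stanford-style label names (`nsubj`, `dobj`, `amod`, `conj`) as stand-ins for the subject, object, adjective modification, conjunction, and direct object relationships; the claims do not name a tag set, so these labels and the function name are hypothetical.

```python
# Assumed labels for the relationships that pull a word into a block.
BLOCK_RELATIONS = {"amod", "conj", "dobj"}


def extract_block(words, heads, labels, anchor_label):
    """Collect the anchor word (e.g. the subject) plus attached dependents.

    heads[k] is the index of token k's head (-1 for the root) and
    labels[k] is the dependency label of token k.
    """
    anchors = [k for k, label in enumerate(labels) if label == anchor_label]
    if not anchors:
        return []
    block = {anchors[0]}
    changed = True
    while changed:
        changed = False
        for k, head in enumerate(heads):
            if k not in block and head in block and labels[k] in BLOCK_RELATIONS:
                block.add(k)
                changed = True
    # Emit the block in original word order.
    return [words[k] for k in sorted(block)]
```

For the parse of "small robots like red apples" with "like" as the root, this would yield "small robots" as the subject block (anchor `nsubj`) and "red apples" as the object block (anchor `dobj`).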
11. The method of claim 1, wherein the each of the one or more semantically similar sentences is rewritten based on correspondences between the subject block, the object block, and the predicate block of the each of the one or more structurally similar sentences, and the subject block, the object block, and the predicate block of the input sentence.
12. The method of claim 11, wherein the step of rewriting the each of the one or more semantically similar sentences to generate the at least one new sentence comprises:
determining a correspondence between a key sentence component of the each of the one or more structurally similar sentences and a key sentence component of the input sentence, wherein the key sentence component comprises the subject block, the object block, and the predicate block;
finding segments in the each of the one or more semantically similar sentences that are expressively consistent with the key sentence component of the each of the one or more structurally similar sentences to use as to-be-substituted segments; and
substituting the to-be-substituted segments in the each of the one or more semantically similar sentences with the corresponding key sentence component of the input sentence based on the correspondence, to generate the at least one new sentence.
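The substitution step of claim 12 can be illustrated with a simplified sketch. It assumes blocks are plain strings keyed by role and treats "expressively consistent" as an exact, word-bounded string match, which is a much narrower test than the claim language allows. All three replacements are made in a single pass so that an inserted block is never rewritten again. The function name `rewrite` is hypothetical.

```python
import re

ROLES = ("subject", "predicate", "object")


def rewrite(semantic_sentence, similar_blocks, input_blocks):
    """Swap the structurally similar sentence's blocks for the input's blocks.

    similar_blocks and input_blocks map each role to its text.
    """
    mapping = {
        similar_blocks[role]: input_blocks[role]
        for role in ROLES
        if similar_blocks.get(role) and input_blocks.get(role)
    }
    if not mapping:
        return semantic_sentence
    # Longest alternatives first so overlapping blocks match greedily.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], semantic_sentence)
```

For example, if the structurally similar sentence "the dog chased the ball" has the semantically similar sentence "the ball was chased by the dog", then substituting the blocks of the input "my cat found a toy" would produce "a toy was found by my cat".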
13. A sentence generation apparatus, comprising:
an obtaining unit configured to obtain an input sentence;
a searching unit configured to obtain a first dependency tree of the input sentence and a second dependency tree of each sentence in a preset corpus, and search for one or more structurally similar sentences of the input sentence based on a matching degree of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, wherein the one or more structurally similar sentences are structurally similar to the input sentence;
a finding unit configured to find one or more semantically similar sentences of the one or more structurally similar sentences;
a parsing unit configured to parse the input sentence and the one or more structurally similar sentences to obtain a subject block, an object block, and a predicate block of each of the input sentence and the one or more structurally similar sentences, wherein the subject block is obtained by extracting each sentence based on a subject of the sentence and a dependency of the subject, and the object block is obtained by extracting each sentence based on an object of the sentence and a dependency of the object;
a substituting unit configured to rewrite each of the one or more semantically similar sentences to generate at least one new sentence, by substituting the subject block in the each of the one or more semantically similar sentences with the subject block in the input sentence, substituting the object block in the each of the one or more semantically similar sentences with the object block in the input sentence, and substituting the predicate block in the each of the one or more semantically similar sentences with the predicate block in the input sentence;
a filtering unit configured to filter the at least one new sentence based on a preset filtering condition; and
a labeling unit configured to label the at least one filtered new sentence as a semantically similar sentence of the input sentence.
14. The apparatus of claim 13, wherein the searching unit comprises:
a sub-path obtaining subunit configured to obtain all sub-paths of the first dependency tree of the input sentence, and all sub-paths of the second dependency tree of any sentence in the corpus, wherein each sub-path is a line without a branch formed between any amount of adjacent nodes in the first dependency tree;
a category classifying subunit configured to classify the sub-paths with a same dependency into a same sub-path category; and
a similarity calculating subunit configured to calculate a similarity of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence based on the following formula:
$$S = \sum_{i \in I} \frac{\mathrm{count}_i(S_1) \times \mathrm{count}_i(S_2)}{2^{\mathrm{meanDeep}(i)}};$$
where, $I$ is a set of all the sub-paths in the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, $\mathrm{count}_i(S_1)$ indicates an amount of occurrences of the sub-paths in the second dependency tree of the each sentence in the corpus and belonging to the sub-path category $i$, the $\mathrm{count}_i(S_2)$ indicates an amount of occurrences of the sub-paths in the first dependency tree of the input sentence and belonging to the sub-path category $i$, and $\mathrm{meanDeep}(i)$ indicates an average distance from a first node of each sub-path in the sub-path category $i$ to a root node of the corresponding dependency tree; and
a structurally similar sentence determining subunit configured to determine the one or more structurally similar sentences of the input sentence based on the similarity of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, where a similarity of the second dependency tree of each of the one or more structurally similar sentences and the first dependency tree of the input sentence exceeds a preset similarity threshold.
15. A smart device, comprising:
a memory;
a processor; and
one or more computer programs stored in the memory and executable on the processor, wherein the one or more computer programs comprise:
instructions for obtaining an input sentence;
instructions for obtaining a first dependency tree of the input sentence and a second dependency tree of each sentence in a preset corpus, and searching for one or more structurally similar sentences of the input sentence based on a matching degree of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, wherein the one or more structurally similar sentences are structurally similar to the input sentence;
instructions for finding one or more semantically similar sentences of the one or more structurally similar sentences;
instructions for parsing the input sentence and the one or more structurally similar sentences to obtain a subject block, an object block, and a predicate block of each of the input sentence and the one or more structurally similar sentences, wherein the subject block is obtained by extracting each sentence based on a subject of the sentence and a dependency of the subject, and the object block is obtained by extracting each sentence based on an object of the sentence and a dependency of the object;
instructions for rewriting each of the one or more semantically similar sentences to generate at least one new sentence, by substituting the subject block in the each of the one or more semantically similar sentences with the subject block in the input sentence, substituting the object block in the each of the one or more semantically similar sentences with the object block in the input sentence, and substituting the predicate block in the each of the one or more semantically similar sentences with the predicate block in the input sentence;
instructions for filtering the at least one new sentence based on a preset filtering condition; and
instructions for labeling the at least one filtered new sentence as a semantically similar sentence of the input sentence.
16. The smart device of claim 15, wherein the instructions for searching for the one or more structurally similar sentences of the input sentence based on the matching degree of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence comprise:
instructions for obtaining all sub-paths of the first dependency tree of the input sentence, wherein each sub-path is a line without a branch formed between any amount of adjacent nodes in the first dependency tree;
instructions for obtaining all sub-paths of the second dependency tree of any sentence in the corpus;
instructions for classifying the sub-paths with a same dependency into a same sub-path category; and
instructions for calculating a similarity of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence based on the following formula:
$$S = \sum_{i \in I} \frac{\mathrm{count}_i(S_1) \times \mathrm{count}_i(S_2)}{2^{\mathrm{meanDeep}(i)}};$$
where, $I$ is a set of all the sub-paths in the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, $\mathrm{count}_i(S_1)$ indicates an amount of occurrences of the sub-paths in the second dependency tree of the each sentence in the corpus and belonging to the sub-path category $i$, the $\mathrm{count}_i(S_2)$ indicates an amount of occurrences of the sub-paths in the first dependency tree of the input sentence and belonging to the sub-path category $i$, and $\mathrm{meanDeep}(i)$ indicates an average distance from a first node of each sub-path in the sub-path category $i$ to a root node of the corresponding dependency tree; and
instructions for determining the one or more structurally similar sentences of the input sentence based on the similarity of the second dependency tree of the each sentence in the corpus and the first dependency tree of the input sentence, wherein the one or more structurally similar sentences are one or more of the sentences in the corpus having the second dependency tree with a similarity with the first dependency tree of the input sentence exceeding a preset similarity threshold.
17. The smart device of claim 15, wherein the instructions for filtering the at least one new sentence based on the preset filtering condition comprise:
instructions for detecting whether there is a redundant content word in any new sentence, wherein the redundant content word is a content word that does not exist in the input sentence;
instructions for excluding the new sentence in response to there being the redundant content word in the new sentence; and
instructions for retaining the new sentence in response to there being no redundant content word in the new sentence.
18. The smart device of claim 15, wherein the instructions for filtering the at least one new sentence based on the preset filtering condition comprise:
instructions for obtaining a sum of word vectors of any new sentence and a sum of word vectors of the input sentence;
instructions for calculating a cosine similarity of the sum of the word vectors of the new sentence and the sum of the word vectors of the input sentence;
instructions for sorting all the new sentences according to a descending order of the cosine similarity; and
instructions for retaining the first $X_1$ new sentences based on a result of the sorting, wherein $X_1$ is a preset positive integer.
19. The smart device of claim 15, wherein the instructions for filtering the at least one new sentence based on the preset filtering condition comprise:
instructions for calculating a perplexity of any new sentence based on a trained language model and a preset perplexity calculation formula, wherein the perplexity indicating a fluency degree of one sentence is calculated using a formula of:
$$PP(S_{new}) = p(w_1 w_2 \ldots w_M)^{-\frac{1}{M}} = \sqrt[M]{\frac{1}{p(w_1 w_2 \ldots w_M)}} = \sqrt[M]{\prod_{i=1}^{M} \frac{1}{p(w_i \mid w_1 w_2 \ldots w_{i-1})}};$$
where, $S_{new}$ indicates a new sentence, $M$ is the length of the new sentence $S_{new}$, $p(w_i)$ is a probability of the i-th word in the new sentence $S_{new}$, and the probability is obtained based on the language model;
instructions for sorting the new sentences according to an ascending order of the perplexity; and
instructions for retaining the first $X_2$ new sentences based on a result of the sorting, wherein $X_2$ is a preset positive integer.
20. The smart device of claim 15, wherein the instructions for filtering the at least one new sentence based on the preset filtering condition comprise:
instructions for filtering the at least one new sentence based on content words in the at least one new sentence, a cosine similarity between each of the at least one new sentence and the input sentence, and a perplexity of each of the at least one new sentence in order;
wherein, the instructions for filtering the at least one new sentence based on the content words in the at least one new sentence comprises:
instructions for detecting whether there is a redundant content word in any new sentence, wherein the redundant content word is a content word that does not exist in the input sentence;
instructions for excluding the new sentence in response to there being the redundant content word in the new sentence; and
instructions for retaining the new sentence in response to there being no redundant content word in the new sentence;
wherein, the instructions for filtering the at least one new sentence based on the cosine similarity between the each of the at least one new sentence and the input sentence comprises:
instructions for obtaining a sum of word vectors of the new sentence retained after filtering the new sentences based on the content words in the at least one new sentence and a sum of word vectors of the input sentence;
instructions for calculating a cosine similarity of the sum of the word vectors of the new sentence and the sum of the word vectors of the input sentence;
instructions for sorting all the new sentences according to a descending order of the cosine similarity; and
instructions for retaining the first $X_3$ new sentences based on a result of the sorting, wherein $X_3$ is a preset positive integer;
wherein, the instructions for filtering the at least one new sentence based on the perplexity of the each of the at least one new sentence comprise:
instructions for calculating a perplexity of the new sentence retained after filtering based on the cosine similarity of the new sentence and the input sentence based on a trained language model and a preset perplexity calculation formula, wherein the perplexity indicating a fluency degree of one sentence is as follows:
$$PP(S_{new}) = p(w_1 w_2 \ldots w_M)^{-\frac{1}{M}} = \sqrt[M]{\frac{1}{p(w_1 w_2 \ldots w_M)}} = \sqrt[M]{\prod_{i=1}^{M} \frac{1}{p(w_i \mid w_1 w_2 \ldots w_{i-1})}};$$
where, $S_{new}$ indicates a new sentence, $M$ is the length of the new sentence $S_{new}$, $p(w_i)$ is a probability of the i-th word in the new sentence $S_{new}$, and the probability is obtained based on the language model;
instructions for sorting the new sentences according to an ascending order of the perplexity; and
instructions for retaining the first $X_4$ new sentences based on a result of the sorting, wherein $X_4$ is a preset positive integer, and $X_4$ is smaller than $X_3$.