Method, electronic device, and computer program product for data anonymization
Abstract
Embodiments disclosed herein relate to a method, an electronic device, and a computer program product for data anonymization. The method includes performing classification on data by a classifier to obtain data types of the data, and performing anonymization on the data by a first anonymization model to obtain first anonymized data. The method further includes determining, by an anonymizer and based on the data types, whether re-anonymization needs to be performed on the first anonymized data. The method further includes performing, based on a determination that the re-anonymization needs to be performed, the re-anonymization on the first anonymized data by a second anonymization model to obtain second anonymized data. Accordingly, different anonymization models may be applied to different types of data to obtain the final anonymized data and to ensure that no data leakage occurs.
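The flow summarized in the abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not the disclosed implementation: `classify_column` is a simple regular-expression heuristic, `first_model` replaces numeric values with the column mean in place of a trained model, and `relabel_categories` and `redact_text` play the role of second anonymization models for the data types that the first model is assumed not to handle. Columns are assumed to be non-empty lists of strings.

```python
import re
from typing import Callable, Dict, List

NUMERIC, ENUMERATED, FREE_TEXT = "numeric", "enumerated", "free_text"
_NUMERIC_RE = re.compile(r"^-?\d+(\.\d+)?$")


def classify_column(values: List[str], max_enum_values: int = 5) -> str:
    """Hypothetical regex-based classifier returning one of three data types."""
    if all(_NUMERIC_RE.match(v) for v in values):
        return NUMERIC
    # Assumption: few distinct single-token values indicate an enumerated type.
    if len(set(values)) <= max_enum_values and all(" " not in v for v in values):
        return ENUMERATED
    return FREE_TEXT


# Assumption: the first model is only suitable for numeric columns.
FIRST_MODEL_TYPE = NUMERIC


def first_model(values: List[str]) -> List[str]:
    """Stand-in first model: replace every numeric value with the column mean."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return list(values)  # unsuitable type: values pass through unchanged
    mean = sum(numbers) / len(numbers)
    return [f"{mean:.2f}"] * len(values)


def relabel_categories(values: List[str]) -> List[str]:
    """Stand-in second model for enumerated data: opaque category labels."""
    labels: Dict[str, str] = {}
    for v in values:
        labels.setdefault(v, f"category_{len(labels)}")
    return [labels[v] for v in values]


def redact_text(values: List[str]) -> List[str]:
    """Stand-in second model for free text: redact every value."""
    return ["[REDACTED]"] * len(values)


SECOND_MODELS: Dict[str, Callable[[List[str]], List[str]]] = {
    ENUMERATED: relabel_categories,
    FREE_TEXT: redact_text,
}


def anonymize_table(table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Classify, anonymize with the first model, then re-anonymize if needed."""
    result: Dict[str, List[str]] = {}
    for name, values in table.items():
        data_type = classify_column(values)
        first_anonymized = first_model(values)
        if data_type != FIRST_MODEL_TYPE:  # first model not suitable
            result[name] = SECOND_MODELS[data_type](first_anonymized)
        else:
            result[name] = first_anonymized
    return result


table = {
    "amount": ["10", "20", "30"],
    "status": ["open", "closed", "open"],
    "note": ["call John at home", "ship to 5 Main St", "ok"],
}
anonymized = anonymize_table(table)
```

In this sketch only the type check drives the re-anonymization decision; the level-based checks recited in the dependent claims are left out for brevity.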
Claims
The invention claimed is:
1. A method for data anonymization, comprising:
performing classification on data that comprises a knowledge graph by a classifier to obtain data types of the data,
wherein, upon receiving a request from a querying party, the data is obtained from a graphical database using a client device because the data matches a query criteria,
wherein the data comprises company's sales data, customer data, and inventory data,
wherein the customer data specifies a unique identifier of the customer, an address of the customer, and a phone number of the customer, and
wherein the data types comprise a numeric data type, an enumerated data type, and a free text data type;
performing anonymization on the data by a first anonymization model of a plurality of anonymization models to obtain first anonymized data;
determining, based on the data types, using an anonymizer whether re-anonymization needs to be performed on the first anonymized data; and
performing, based on a determination that the re-anonymization needs to be performed, the re-anonymization on the first anonymized data by a second anonymization model of the plurality of anonymization models to obtain second anonymized data and to prevent a risk of the first anonymized data being leaked,
wherein the re-anonymization needs to be performed because the first anonymization model is not suitable for processing the data.
2. The method according to claim 1, wherein performing the classification on the data using the classifier comprises at least one of:
performing the classification on the data using a regular expression; performing the classification on the data using a dictionary base; or performing the classification on the data using a machine learning model.
3. The method according to claim 1, wherein the data comprises tabular data, the tabular data comprising a plurality of data columns that have different ones of said data types.
4. The method according to claim 3, wherein performing the anonymization on the data by the first anonymization model comprises:
learning data patterns of the plurality of data columns by training a generative adversarial network model; and performing the anonymization on the plurality of data columns separately using the trained generative adversarial network model, so as to generate the first anonymized data for each of the plurality of data columns.
5. The method according to claim 4, wherein determining using the anonymizer whether the re-anonymization needs to be performed on the first anonymized data comprises:
obtaining, through a data profile, a data anonymization level for each of the plurality of data columns; obtaining, by the anonymizer, a query level for the querying party that queries the plurality of data columns; and determining, based on the data types, the data anonymization level, and the query level, whether the re-anonymization needs to be performed on the first anonymized data of each of the plurality of data columns.
6. The method according to claim 5, wherein determining, based on the data types, the data anonymization level, and the query level, whether the re-anonymization needs to be performed on the first anonymized data of each of the plurality of data columns comprises:
determining that the re-anonymization does not need to be performed based on a determination that the data type of each of the plurality of data columns conforms to a data processing type of the first anonymization model; and determining that the re-anonymization does not need to be performed based on a determination that the data anonymization level of each of the plurality of data columns is lower than the query level.
7. The method according to claim 6, wherein performing the re-anonymization by using the second anonymization model comprises:
obtaining a profile of the plurality of anonymization models, wherein the profile indicates each anonymization model of the plurality of anonymization models and the data processing type corresponding to said each anonymization model; selecting, based on the profile and the data type, the second anonymization model from the plurality of anonymization models for the data type of each of the plurality of data columns; and performing the re-anonymization on the first anonymized data of the plurality of data columns using the second anonymization model.
8. The method according to claim 6, wherein the plurality of anonymization models comprises at least two of:
a pseudo-data generation model; a statistical model; or a text generative adversarial network model.
9. An electronic device, comprising:
a processor; and a memory coupled to the processor, wherein the memory has instructions stored therein which, when executed by the processor, cause the device to perform actions comprising:
performing classification on data that comprises a knowledge graph by a classifier to obtain data types of the data,
wherein, upon receiving a request from a querying party, the data is obtained from a graphical database using a client device because the data matches a query criteria,
wherein the data comprises company's sales data, customer data, and inventory data,
wherein the customer data specifies a unique identifier of the customer, an address of the customer, and a phone number of the customer, and
wherein the data types comprise a numeric data type, an enumerated data type, and a free text data type;
performing anonymization on the data by a first anonymization model of a plurality of anonymization models to obtain first anonymized data;
determining, based on the data types, using an anonymizer whether re-anonymization needs to be performed on the first anonymized data; and
performing, based on a determination that the re-anonymization needs to be performed, the re-anonymization on the first anonymized data by a second anonymization model of the plurality of anonymization models to obtain second anonymized data and to prevent a risk of the first anonymized data being leaked,
wherein the re-anonymization needs to be performed because the first anonymization model is not suitable for processing the data.
10. The electronic device according to claim 9, wherein performing the classification on the data using the classifier comprises at least one of:
performing the classification on the data using a regular expression; performing the classification on the data using a dictionary base; or performing the classification on the data using a machine learning model.
11. The electronic device according to claim 9, wherein the data comprises tabular data, the tabular data comprising a plurality of data columns that have different ones of said data types.
12. The electronic device according to claim 11, wherein performing the anonymization on the data by the first anonymization model comprises:
learning data patterns of the plurality of data columns by training a generative adversarial network model; and performing the anonymization on the plurality of data columns separately using the trained generative adversarial network model, so as to generate the first anonymized data for each of the plurality of data columns.
13. The electronic device according to claim 12, wherein determining using the anonymizer whether the re-anonymization needs to be performed on the first anonymized data comprises:
obtaining, through a data profile, a data anonymization level for each of the plurality of data columns; obtaining, by the anonymizer, a query level for the querying party that queries the plurality of data columns; and determining, based on the data types, the data anonymization level, and the query level, whether the re-anonymization needs to be performed on the first anonymized data of each of the plurality of data columns.
14. The electronic device according to claim 13, wherein determining, based on the data types, the data anonymization level, and the query level, whether the re-anonymization needs to be performed on the first anonymized data of each of the plurality of data columns comprises:
determining that the re-anonymization does not need to be performed based on a determination that the data type of each of the plurality of data columns conforms to a data processing type of the first anonymization model; and determining that the re-anonymization does not need to be performed based on a determination that the data anonymization level of each of the plurality of data columns is lower than the query level.
15. The electronic device according to claim 14, wherein performing the re-anonymization by using the second anonymization model comprises:
obtaining a profile of the plurality of anonymization models, wherein the profile indicates each anonymization model of the plurality of anonymization models and the data processing type corresponding to said each anonymization model; selecting, based on the profile and the data type, the second anonymization model from the plurality of anonymization models for the data type of each of the plurality of data columns; and performing the re-anonymization on the first anonymized data of the plurality of data columns using the second anonymization model.
16. The electronic device according to claim 14, wherein the plurality of anonymization models comprises at least two of:
a pseudo-data generation model; a statistical model; or a text generative adversarial network model.
17. A computer program product that is tangibly stored on a non-volatile non-transitory computer-readable medium and comprises machine-executable instructions, wherein the machine-executable instructions, when executed, cause a machine to perform the following actions:
performing classification on data that comprises a knowledge graph by a classifier to obtain data types of the data,
wherein, upon receiving a request from a querying party, the data is obtained from a graphical database using a client device because the data matches a query criteria,
wherein the data comprises company's sales data, customer data, and inventory data,
wherein the customer data specifies a unique identifier of the customer, an address of the customer, and a phone number of the customer, and
wherein the data types comprise a numeric data type, an enumerated data type, and a free text data type;
performing anonymization on the data by a first anonymization model to obtain first anonymized data;
determining, based on the data types, using an anonymizer whether re-anonymization needs to be performed on the first anonymized data; and
performing, based on a determination that the re-anonymization needs to be performed, the re-anonymization on the first anonymized data by a second anonymization model to obtain second anonymized data and to prevent a risk of the first anonymized data being leaked,
wherein the re-anonymization needs to be performed because the first anonymization model is not suitable for processing the data.
18. The computer program product according to claim 17, wherein performing the classification on the data using the classifier comprises at least one of:
performing the classification on the data using a regular expression; performing the classification on the data using a dictionary base; or performing the classification on the data using a machine learning model.
19. The computer program product according to claim 17, wherein the data comprises tabular data, the tabular data comprising a plurality of data columns that have different ones of said data types.
20. The computer program product according to claim 19, wherein performing the anonymization on the data by the first anonymization model comprises:
learning data patterns of the plurality of data columns by training a generative adversarial network model; and performing the anonymization on the plurality of data columns separately using the trained generative adversarial network model, so as to generate the first anonymized data for each of the plurality of data columns.
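The per-column decision and model selection recited in the dependent claims (data anonymization level, query level, and a profile of anonymization models) can be illustrated with a short Python sketch. It rests on stated assumptions and is not the claimed implementation: levels are modeled as integers, either of the two recited conditions is treated as sufficient to skip re-anonymization, and the mapping in `MODEL_PROFILE` between model names and data processing types is hypothetical, as are the names `ColumnInfo`, `needs_reanonymization`, and `select_second_model`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ColumnInfo:
    """Hypothetical data-profile entry for one data column."""
    data_type: str            # "numeric", "enumerated", or "free_text"
    anonymization_level: int  # assumption: levels are comparable integers


# Hypothetical profile: anonymization model name -> data processing type.
MODEL_PROFILE: Dict[str, str] = {
    "statistical_model": "numeric",
    "pseudo_data_generation_model": "enumerated",
    "text_gan_model": "free_text",
}


def needs_reanonymization(column: ColumnInfo,
                          first_model_type: str,
                          query_level: int) -> bool:
    """Return True when neither skip condition holds for the column.

    Assumption: re-anonymization is skipped if the column's data type
    conforms to the first model's data processing type, or if the column's
    anonymization level is lower than the querying party's query level.
    """
    type_conforms = column.data_type == first_model_type
    level_is_lower = column.anonymization_level < query_level
    return not (type_conforms or level_is_lower)


def select_second_model(column: ColumnInfo,
                        profile: Dict[str, str] = MODEL_PROFILE) -> Optional[str]:
    """Pick the first model in the profile whose processing type matches."""
    for model_name, processing_type in profile.items():
        if processing_type == column.data_type:
            return model_name
    return None  # no suitable model listed in the profile


address_column = ColumnInfo(data_type="free_text", anonymization_level=3)
if needs_reanonymization(address_column, first_model_type="numeric", query_level=2):
    chosen_model = select_second_model(address_column)
else:
    chosen_model = None
```

Whether the two skip conditions combine with "or" or "and" is a design choice made for this sketch; a stricter reading would require both to hold before skipping.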