Technical data enrichment through language models
Abstract
This disclosure provides a mechanism for enriching sparse datasets using language models. A language model trained on the distribution of known values in a dataset can predict missing values, and adding those predictions yields a more complete dataset. The same method can augment a dataset by predicting values for new properties that were not previously available. The approach is particularly effective at scale, transforming large sparse datasets into more complete, enhanced datasets. Masked language modeling may be employed to train language models capable of generating representations of technical data. Training data includes corpora of technical data that may be represented as text strings. The pretrained models are then fine-tuned to predict various properties. The resulting models can predict missing values in large technical datasets, providing data to guide scientific research.
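The masked language modeling step mentioned above can be illustrated with a minimal sketch. It is not the disclosed implementation: the character-level tokenization, the `[MASK]` token name, the 15% masking rate, and the helper `mask_tokens` are all assumptions made for illustration. Production MLM setups typically also replace some selected tokens with random or unchanged tokens, which is omitted here.

```python
import random

MASK = "[MASK]"  # hypothetical mask token name (assumption)


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens for masked language modeling.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    each masked position and None everywhere else, so a model would only be
    scored on the positions it has to reconstruct.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token; the 15% rate is an assumed default.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# A protein sequence represented as a text string, split into characters.
# Character-level splitting is an assumption; a trained tokenizer could be
# used instead.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
tokens = list(sequence)
masked, labels = mask_tokens(tokens)
```

During pretraining, a model would receive `masked` as input and be trained to recover the non-`None` entries of `labels`, which requires no property annotations on the corpus.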
Claims
exact text as granted — not AI-modified
The invention claimed is:
1 . A method of data enrichment comprising:
pretraining a transformer-based language model with a corpus of unlabeled technical data comprising biological sequence data or chemical data; fine-tuning the transformer-based language model for a first property with a first property-specific dataset resulting in a first fine-tuned model having a first accuracy rate wherein fine-tuning the transformer-based language model for the first property comprises modifying only a portion of the transformer-based language model; fine-tuning the transformer-based language model for a second property with a second property-specific dataset resulting in a second fine-tuned model having a second accuracy rate; enriching an existing dataset by adding a first value for the first property generated by the first fine-tuned model and a second value for the second property generated by the second fine-tuned model, wherein the first value and the second value are not present in the existing dataset; and generating a data structure for a user interface comprising data from the existing data set, the first value for the first property labeled with the first accuracy rate, and the second value for the second property labeled with the second accuracy rate.
2 . The method of claim 1 , wherein the pretraining comprises masked language modeling (MLM).
3 . The method of claim 1 , wherein the technical data comprises text strings that represent a physical structure using an ordered sequence of text characters.
4 . The method of claim 1 , wherein at least one of the first property or the second property is a discrete variable and the fine-tuning comprises using a classification-based training technique.
5 . The method of claim 1 , wherein at least one of the first property the second property is a continuous variable and the fine-tuning comprises using a regression loss function.
6 . The method of claim 1 , wherein the enriching the existing dataset by adding the first value comprises adding missing values for the first property that exists in the existing dataset.
7 . The method of claim 1 , wherein the enriching the existing dataset by adding the second value comprises adding values for the second property, the second property is a new property that was not previously in the existing dataset, thereby creating a combined dataset combining the existing dataset and the second property-specific dataset.
8 . The method of claim 1 , further comprising training a tokenizer for the technical data.
9 . The method of claim 1 , wherein the portion of the transformer-based language model is a number of layers of the transformer-based language model that are unfrozen.
10 . The method of claim 1 , wherein the portion of the transformer-based language model is a classification layer or a regression layer.
11 . A system comprising:
a processor; a memory coupled to the processor; a transformer-based language model pretrained on a corpus of unlabeled technical data comprising biological sequence data or chemical data; a fine-tuning module configured to:
fine-tune the transformer-based language model for a first property with a first property-specific dataset resulting in a first fine-tuned model having a first accuracy rate, wherein the fine-tuning module is configured to fine-tune the transformer-based model for the first property by modifying only a portion of the transformer-based language model, and
fine-tune the transformer-based language model for a second property with a second property-specific dataset resulting in a second fine-tuned model having a second accuracy rate;
an enrichment module configured to add a first value for the first property generated by the first fine-tuned model and a second value for the second property generated by the second fine-tuned model to an existing dataset, wherein the first value and the second value are not present in the existing dataset; and an output system configured to generate a data structure for a user interface comprising data from the existing data set, the first value for the first property labeled with the first accuracy rate, and the second value for the second property labeled with the second accuracy rate.
12 . The system of claim 11 , wherein the transformer-based language model comprises an embedding layer, multiple transformer layers, and a classification layer.
13 . The system of claim 11 , wherein the property is a discrete variable and the fine-tuning module uses a classification-based training technique configured to fine-tune the transformer-based language model.
14 . The system of claim 11 , wherein the property is a continuous variable and the fine-tuning module uses a regression loss function to fine-tune the transformer-based language model.
15 . The system of claim 11 , further comprising a tokenizer configured to tokenize the technical data.
16 . The system of claim 11 , wherein the portion of the transformer-based language model is a number of layers of the transformer-based language model that are unfrozen.
17 . The system of claim 11 , wherein the portion of the transformer-based language model is a classification layer or a regression layer.
18 . A computing device comprising:
an output device configured to display a user interface comprising: an identifier of a technical object that is a protein, a polynucleotide, or a molecule; an existing value for a first known property of the technical object, the existing value obtained from an existing dataset; a first value for a first property of the technical object, the first value obtained from a first fine-tuned model created by fine-tuning a transformer-based language model with a first property-specific dataset for the first property, wherein fine-tuning the transformer-based model for the first property comprises modifying only a portion of the transformer-based language model, the first value labeled with a first accuracy rate indicating an accuracy of the first fine-tuned model; and a second value for a second property of the technical object, the second value obtained from a second fine-tuned model created by fine-tuning the transformer-based language model with the second property-specific data set for the second property, the second value labeled with a second accuracy rate value indicating an accuracy of the second fine-tuned model.
19 . The computing device of claim 18 , wherein the transformer-based language model is pretrained using a text string that represent a physical structure of the technical object, the text string from the existing dataset from which the known property is obtained.
20 . The computing device of claim 18 , wherein:
(i) the first property is represented by a discrete variable and fine-tuning of the transformer-based language model is performed using a classification-based training; or (ii) the first property is represented by a continuous variable and the fine-tuning of the transformer-based language model is performed using a regression loss function.
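The enrichment and labeling steps recited in the claims (adding predicted values that are absent from the existing dataset and labeling each with the accuracy rate of the model that produced it) can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: `FineTunedModel`, `enrich`, the property names, the toy predictor functions, and the accuracy figures are all hypothetical placeholders standing in for real fine-tuned models and their measured accuracy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FineTunedModel:
    """Stand-in for a fine-tuned model: a predictor plus its accuracy rate."""
    property_name: str
    predict: Callable[[str], object]
    accuracy: float  # placeholder value, not a measured result


def enrich(records, models):
    """Build display rows: existing values are kept, gaps are filled.

    Each predicted value carries the accuracy rate of its source model so a
    user interface can show how much to trust it.
    """
    rows = []
    for record in records:
        row = {"id": record["id"], "sequence": record["sequence"], "properties": {}}
        # Keep values already present in the existing dataset.
        for name, value in record.get("properties", {}).items():
            if value is not None:
                row["properties"][name] = {"value": value, "source": "existing"}
        # Add predictions only where the dataset has no value.
        for model in models:
            if model.property_name not in row["properties"]:
                row["properties"][model.property_name] = {
                    "value": model.predict(record["sequence"]),
                    "source": "predicted",
                    "accuracy": model.accuracy,
                }
        rows.append(row)
    return rows


# Toy data and toy predictors (assumptions, not real models).
records = [
    {"id": "P1", "sequence": "MKTAYIAKQR", "properties": {"soluble": True}},
    {"id": "P2", "sequence": "GSHMLEDPVV", "properties": {"soluble": None}},
]
models = [
    # Discrete property: would come from a classification fine-tune.
    FineTunedModel("soluble", lambda s: s.count("K") >= 2, 0.91),
    # Continuous property new to the dataset: would come from a regression fine-tune.
    FineTunedModel("melting_temp_c", lambda s: 40.0 + len(s), 0.84),
]
rows = enrich(records, models)
```

In this sketch, `soluble` illustrates filling a missing value for a property that already exists in the dataset, while `melting_temp_c` illustrates adding a property the dataset did not previously contain.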