Method for generating text string dictionary, method for searching text string dictionary, and system for processing text string dictionary

Patent number
US10867134B2
Publication date
2020-12-15
Applicant
HITACHI HIGH-TECHNOLOGIES CORPORATION (JP Tokyo)
Inventor
Kouichi Kimura
IPC classification
G06F40/30; G06F16/00; H03M7/30; G16B30/00; G06F40/242
Technical field
string, text, multicore, in, block, process, link, cpu, dictionary, registered
Region: Tokyo

Abstract

A multicore CPU of a text string data analyzing device: loads into a memory a plurality of blocks obtained by dividing a text string dictionary; executes, in parallel on block groups that can be processed independently of each other, an entry registration process that registers, character by character and starting from the last characters, unregistered text strings of text string data as new entries in the blocks; and, once no unregistered text strings remain in the blocks, outputs the text string obtained by concatenating the text strings registered in the entries of the blocks as the BW transformed data of the text string dictionary in which the text string data is registered.
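The registration process summarized above can be illustrated with a small serial Python sketch. It is not the patented parallel procedure: it assumes one block per single-character label (`$`, `A`, `C`, `G`, `T`), processes one read at a time on one core, and chooses each destination position by counting earlier occurrences of the character (the usual last-to-first rank rule), which is an assumption not spelled out in this excerpt. The function name `build_bwt_incrementally` is hypothetical.

```python
def build_bwt_incrementally(reads, alphabet="ACGT"):
    """Serial sketch: register each read character by character, last character first."""
    labels = "$" + alphabet                      # block order: delimiter first (assumption)
    blocks = {label: [] for label in labels}     # one entry list per block label

    for read in reads:
        # The last character is registered in the "$" block, after earlier reads.
        block, pos = "$", len(blocks["$"])
        for ch in reversed(read):                # the rest is the "unregistered text string"
            # Rank of this entry among all entries equal to ch that precede it.
            rank = 0
            for label in labels:
                if label == block:
                    rank += blocks[label][:pos].count(ch)
                    break
                rank += blocks[label].count(ch)
            blocks[block].insert(pos, ch)        # register ch as a new entry
            block, pos = ch, rank                # destination block for the next character
        blocks[block].insert(pos, "$")           # nothing left: close the read with "$"

    # Couple the blocks in label order to obtain the transformed string.
    return "".join("".join(blocks[label]) for label in labels)


if __name__ == "__main__":
    print(build_bwt_incrementally(["ACG", "AG"]))  # prints GG$$ACA
```

The list insertions and `count` calls keep the sketch short but make it quadratic; the per-block grouping that lets several cores work independently is deliberately left out.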

Description

TECHNICAL FIELD

The present invention relates to a method for generating a text string dictionary, a method for searching a text string dictionary, and a system for processing a text string dictionary.

BACKGROUND ART

Due to progress in deoxyribonucleic acid (DNA) sequencing technologies, the amount of DNA sequence data output by DNA sequencers has been increasing rapidly. As a result, the computational cost of data analysis, such as mutational analysis that checks whether a large volume of DNA sequence data contains a deleterious mutant sequence, has also been increasing.

To improve the efficiency of the data analysis, it is effective to sort, in alphabetical order (lexicographic order), the DNA sequence data (text string data), which is output in the order in which it is measured, because sorted data can be searched at high speed. In particular, a method using the Burrows-Wheeler (BW) transform (or FM index) is known as a method suitable for DNA sequence data (Nonpatent Literature 1).

DNA sequence data subjected to the BW transform is expressed as a single string whose elements are DNA bases and a delimiter ($). Each element corresponds to one element of a list in which all suffixes of all sequences included in the original DNA sequence data are sorted in alphabetical order. In addition, an efficient method is known for using the result of the BW transform as a dictionary in which all suffixes are sorted in alphabetical order (Nonpatent Literature 1). The result of the BW transform is also referred to as a text string dictionary.
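To make the correspondence between sorted suffixes and the transformed string concrete, here is a minimal Python sketch. It assumes that each read is terminated by `$`, that `$` sorts before the bases, that identical suffixes are ordered by read index, and that the element emitted for a suffix is the character preceding it (with `$` for a suffix that is a whole read). These conventions and the name `bwt_of_collection` are illustrative assumptions, not taken from the patent text.

```python
def bwt_of_collection(reads):
    """Naive BW transform of a set of reads via explicit suffix sorting."""
    rows = []
    for idx, read in enumerate(reads):
        text = read + "$"                    # terminate every read with the delimiter
        for start in range(len(text)):
            suffix = text[start:]
            preceding = text[start - 1]      # start == 0 wraps around to "$"
            rows.append((suffix, idx, preceding))
    # "$" (ASCII 36) sorts before "A", "C", "G", "T"; ties are broken by read index.
    rows.sort(key=lambda row: (row[0], row[1]))
    return "".join(row[2] for row in rows)


if __name__ == "__main__":
    # Sorted suffixes: $, $, ACG$, AG$, CG$, G$, G$
    print(bwt_of_collection(["ACG", "AG"]))  # prints GG$$ACA
```

Materializing every suffix takes quadratic memory per read, so this only serves to show what the output string means, not how to produce it at scale.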

Claims

The invention claimed is:

1. A method for generating a text string dictionary,
the method being executed by a text string data analyzing device including a multicore CPU having a plurality of CPU cores and a memory,
the text string dictionary loaded in the memory being divided into a plurality of blocks, the blocks being added thereto respective labels different from each other, the label including an alphabet constituting text string data and one or more delimiters,
the method for generating a text string dictionary comprising the steps, performed by the multicore CPU, of:
registering, for each of the inputted text string data, the last character of the received text string data as an entry of the block in the blocks added thereto the labels of the delimiters, and making the last character associate with a remaining text string obtained by excluding the last character from the text string data, as an unregistered text string;
executing an entry registration process in parallel on each of the blocks grouped into appropriate blocks executable independently of each other, the entry registration process comprising the substep of reading registration source blocks in which the unregistered text strings are associated with the entries of the blocks, the substep of registering last characters of the unregistered text strings of the registration source blocks as new entries in registration destination blocks identified from the labels and entries of the registration source blocks, and the substep of associating remaining text strings obtained by excluding the new entries from the unregistered text strings as new unregistered text strings; and
outputting, as Burrows-Wheeler (BW) transformed data of the text string dictionary in which the text string data is already registered, a text string obtained by coupling text strings registered in the entries of the blocks in the order of alphabets indicated by the labels of the blocks and the delimiters in a state in which no unregistered text strings of the blocks exists.

2. The method for generating a text string dictionary according to claim 1,
further comprising the step, performed by the multicore CPU, of calculating, based on the number of cores included in the multicore CPU, lengths of the labels of the blocks that are used to determine the number of blocks to be loaded into the memory.

3. The method for generating a text string dictionary according to claim 1,
further comprising the substeps in the entry registration process, performed by the multicore CPU, of:
grouping the registration source blocks to be sequentially read and the registration destination blocks to be simultaneously written;
executing, in parallel, processes of reading the registration source blocks between the groups of the registration source blocks; and
sequentially executing processes of reading the registration source blocks in each of the groups of the registration source blocks.

4. A method for searching a text string dictionary, the method executed by a searching device including a storage means configured to store the text string dictionary generated by the method for generating a text string dictionary according to claim 1, and a control means,
the method for searching a text string dictionary comprising the steps, performed by the control means, of:
receiving an input query string via an input means;
searching the number of appearances of the query string in the text string data registered in the text string dictionary; and
outputting the searched number of appearances via an output means.

5. The method for searching a text string dictionary according to claim 4,
wherein the text string dictionary is a DNA sequence dictionary in which DNA sequence data that is results of causing a DNA sequencer to analyze each of DNA samples of respective patients is registered as the text string data,
the method for searching a text string dictionary further comprising:
the step, performed by the input means, of receiving, as the query string, mutant DNA sequence data preset as a genetic panel;
the step, performed by the control means, of searching the number of appearances of the query string in the text string data registered in the text string dictionary, thereby analyzing whether or not mutation exists in the DNA sequence data of the patients, and
the step, also performed by the control means, of outputting appearing mutant DNA sequence data and supplementary information associated with the DNA sequence data in the genetic panel via the output means.

6. A system for processing a text string dictionary, comprising:
the text string data analyzing device configured to execute the method for generating a text string dictionary according to claim 1;
a searching device configured to execute a method for searching a text string dictionary, the method executed by a searching device including a storage means configured to store the text string dictionary generated by the method for generating a text string dictionary, and a control means, the method for searching a text string dictionary comprising the steps, performed by the control means, of:
receiving an input query string via an input means;
searching the number of appearances of the query string in the text string data registered in the text string dictionary; and
outputting the searched number of appearances via an output means,
wherein the text string dictionary is a DNA sequence dictionary in which DNA sequence data that is results of causing a DNA sequencer to analyze each of DNA samples of respective patients is registered as the text string data,
the method for searching a text string dictionary further comprising:
the step, performed by the input means, of receiving, as the query string, mutant DNA sequence data preset as a genetic panel;
the step, performed by the control means, of searching the number of appearances of the query string in the text string data registered in the text string dictionary, thereby analyzing whether or not mutation exists in the DNA sequence data of the patients, and
the step, also performed by the control means, of outputting appearing mutant DNA sequence data and supplementary information associated with the DNA sequence data in the genetic panel via the output means; and
the DNA sequencer configured to analyze DNA samples of patients and output results of the analysis as DNA sequence data in the method for searching a text string dictionary.
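The search step of claims 4 to 6, counting the appearances of a query string in the registered data, can be sketched in Python with the standard FM-index backward search over a BW transformed string. The claims do not prescribe this algorithm, so treat it as one plausible realization. The function name `count_occurrences` is hypothetical, queries are assumed not to contain the delimiter `$`, and the sample string `GG$$ACA` is the transform of the two reads `ACG` and `AG` when the suffixes of `ACG$` and `AG$` are sorted with `$` first.

```python
from collections import Counter


def count_occurrences(bwt, query):
    """Count how often `query` appears in the reads encoded by `bwt`."""
    # first_row[c]: number of characters in bwt that are strictly smaller than c.
    counts = Counter(bwt)
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt[:i]; linear scan kept for brevity.
        return bwt[:i].count(ch)

    lo, hi = 0, len(bwt)                 # current range of sorted suffixes
    for ch in reversed(query):           # extend the match one character to the left
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = "GG$$ACA"                      # reads: ACG, AG
    print(count_occurrences(bwt, "G"))   # prints 2
    print(count_occurrences(bwt, "AG"))  # prints 1
    print(count_occurrences(bwt, "GA"))  # prints 0
```

A practical index would replace the linear `occ` scan with precomputed or sampled rank tables so that each query character costs constant time.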