Computing system for extraction of textual elements from a document

Patent number

US11176364B2

Publication date

2021-11-16

Applicant

Hyland Software, Inc. (Westlake, OH, US)

Inventors

Ralph Meier; Thorsten Wanschura; Johannes Hausmann; Harry Urbschat

IPC classification

G06K9/00; G06K9/20; G06T7/70; G06K9/72; G06T7/50; G06K9/62

Technical field

textual, document, text, computer, readable, in, extraction, element, computing, documents

Region: Westlake, OH

Abstract

Described herein are various technologies pertaining to text extraction from a document. A computing device receives the document. The document comprises computer-readable text and a layout, wherein the layout defines positions of the computer-readable text within a two-dimensional area represented by the document. Responsive to receiving the document, the computing device identifies at least one textual element in the computer-readable text based upon spatial factors between portions of the computer-readable text and contextual relationships between the portions of the computer-readable text. The computing device then outputs the at least one textual element.

Description

FIELD

This disclosure relates to computer-implemented text and character recognition systems and methods.

BACKGROUND

A computer-readable document comprises computer-readable text and a layout. The layout defines positions of the computer-readable text within a two-dimensional area represented by the document. Such a document may, for example, be a semi-structured document. The document may thus serve as a digital representation of a physical copy of the document while at the same time retaining certain characteristics (e.g., length, width) of the physical copy.

As documents comprise computer-readable text, a computing device may perform a search over computer-readable text in a document in order to identify and extract relevant textual elements in the text. The computing device may then save the textual elements in a format that is suitable for further data processing (e.g., as part of a data structure, as part of a spreadsheet, as an entry in a database). In one conventional approach for identifying and extracting textual elements from a document, the computing device performs regular expression matching in order to identify and extract the textual elements. In another conventional approach, the computing device utilizes a template in order to identify and extract the textual elements from the document. The template is based upon expected positions of the portions of the computer-readable text within the document.
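The two conventional approaches described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and does not come from the patent: the sample invoice text, the `Invoice No:` pattern, the `Word` type, the millimetre coordinates, and the `INVOICE_NUMBER_BOX` template region.

```python
import re
from dataclasses import dataclass

# Hypothetical computer-readable text of a document (illustration only).
TEXT = "Invoice No: INV-1042\nTotal: 310.50 EUR"


def extract_by_regex(text):
    """Conventional approach 1: regular expression matching on the raw text."""
    match = re.search(r"Invoice No:\s*(\S+)", text)
    return match.group(1) if match else None


@dataclass
class Word:
    text: str
    x: float  # left edge in mm (assumed unit, origin at top-left)
    y: float  # top edge in mm


def extract_by_template(words, box):
    """Conventional approach 2: keep words whose position lies in an expected region."""
    x0, y0, x1, y1 = box
    return [w.text for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]


# Hypothetical layout: each word carries its position on the page.
words = [
    Word("Invoice", 10, 10), Word("No:", 30, 10), Word("INV-1042", 45, 10),
    Word("Total:", 10, 50), Word("310.50", 30, 50),
]

# Hypothetical template: where the invoice number is expected to appear.
INVOICE_NUMBER_BOX = (40, 5, 80, 15)

# The same words shifted 30 mm down the page, as in a differently laid-out invoice.
shifted = [Word(w.text, w.x, w.y + 30) for w in words]

print(extract_by_regex(TEXT))
print(extract_by_template(words, INVOICE_NUMBER_BOX))
print(extract_by_template(shifted, INVOICE_NUMBER_BOX))
```

In this sketch the regular expression depends on the exact label wording, and the template depends on the field staying inside its expected region, so the shifted layout yields no match for the template.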

Claims

What is claimed is:

1. A computing device comprising:

a processor; and

memory storing a textual extraction application, wherein the textual extraction application, when executed by the processor, causes the processor to perform acts comprising:

receiving a document comprising computer-readable text and a layout, wherein the layout defines positions of the computer-readable text within a two-dimensional area represented by the document;

responsive to receiving the document, identifying at least one textual element in the computer-readable text based upon:

spatial factors between portions of the computer-readable text in the document; and

contextual relationships between the portions of the computer-readable text,

wherein the textual extraction application provides the computer-readable text and the positions of the computer-readable text within the document as input to at least one computer-implemented model, wherein the at least one computer-implemented model outputs, based upon the input, a plurality of textual elements from the computer-readable text and scores assigned to the plurality of textual elements, the at least one textual element is included in the plurality of textual elements, wherein the at least one textual element is identified based on a score in the scores, wherein the score is indicative of a likelihood that the at least one textual element represents relevant content in the document based upon defined criteria for a defined type of the document, wherein the textual extraction application calculates the spatial factors between the portions of the computer-readable text based upon the positions of the computer-readable text within the document, wherein the textual extraction application provides the spatial factors as further input to the at least one computer-implemented model, wherein the scores are further based upon the spatial factors calculated by the textual extraction application; and

responsive to identifying the at least one textual element in the computer-readable text, outputting the at least one textual element.

2. The computing device of claim 1, wherein outputting the at least one textual element comprises presenting the at least one textual element on a display.

3. The computing device of claim 1, wherein outputting the at least one textual element comprises storing the at least one textual element in a data structure.

4. The computing device of claim 1, wherein the defined type of the document is one of:

an educational transcript;

an invoice;

a medical record;

a personnel record; or

a taxation form.

5. The computing device of claim 1, the acts further comprising:

prior to receiving the document, generating a document image of a physical copy of the document by scanning the physical copy of the document via a scanner that is in communication with the computing device, wherein the document image fails to include the computer-readable text; and

generating the document by applying an optical character recognition (OCR) process to the document image of the document.

6. The computing device of claim 1, the acts further comprising:

prior to identifying the at least one textual element in the computer-readable text, receiving the defined criteria as input from a user of the computing device.

7. The computing device of claim 1, wherein the computer-readable text comprises a first textual element and a second textual element, wherein the spatial factors include at least one of:

a distance between a location of the first textual element and a location of the second textual element within the document;

an angle between the location of first textual element, the location of the second textual element, and an axis of the document;

an ordering of the first textual element and the second textual element within the document; or

a number of textual elements that occur between the first textual element and the second textual element.

8. The computing device of claim 7, wherein the distance is from 0.01 to 20 mm, wherein the angle is from 0 to 180°.

9. The computing device of claim 1, wherein the defined type is indicative of a purpose of the document, the acts further comprising:

prior to identifying the at least one textual element in the computer-readable text, generating the at least one computer-implemented model based upon a plurality of documents, wherein each document in the plurality of documents is of the defined type, wherein at least some computer-readable text varies between each document in the plurality of documents, wherein at least some positions of the computer-readable text vary between each document in the plurality of documents.

10. The computing device of claim 1, wherein the document further comprises a table, wherein the layout further defines the positions of the computer-readable text within the table.

11. The computing device of claim 1, wherein identifying the at least one textual element in the computer-readable text is further based upon typographical emphasis of the portions of the computer-readable text, wherein the input to the computer-implemented model further includes indications of the typographical emphasis of the portions of the computer-readable text, wherein the plurality of textual elements and the scores assigned to the plurality of textual elements are further based upon the typographical emphasis of the portions of the computer-readable text.

12. A method executed by a processor of a computing device while the processor executes a textual extraction application, the method comprising:

receiving a document comprising computer-readable text and a layout, wherein the layout defines positions of the computer-readable text within a two-dimensional area represented by the document;

identifying at least one textual element in the computer-readable text based upon:

spatial factors between portions of the computer-readable text in the document; and

contextual relationships between the portions of the computer-readable text, wherein the textual extraction application provides the computer-readable text and the positions of the computer-readable text within the document as input to at least one computer-implemented model, wherein the at least one computer-implemented model outputs, based upon the input, a plurality of textual elements within the computer-readable text and scores assigned to the plurality of textual elements, the at least one textual element is included in the plurality of textual elements, wherein the at least one textual element is identified based on a score in the scores, wherein the score is indicative of a likelihood that the at least one textual element represents relevant content in the document based upon defined criteria for a defined type of the document, wherein the textual extraction application calculates the spatial factors between the portions of the computer-readable text based upon the positions of the computer-readable text within the document, wherein the textual extraction application provides the spatial factors as further input to the at least one computer-implemented model, wherein the scores are further based upon the spatial factors calculated by the textual extraction application; and

responsive to identifying the at least one textual element in the computer-readable text, outputting the at least one textual element.

13. The method of claim 12, wherein the document is an updated version of a second document, wherein the second document comprises second computer-readable text and a second layout, the second layout defining second positions of the second computer-readable text within a second two-dimensional area represented by the second document, wherein the second layout varies from the layout of the document, wherein at least a portion of the second computer-readable text varies from the computer-readable text of the document.

14. The method of claim 12, wherein identifying the at least one textual element in the computer-readable text is further based upon font types of the portions of the computer-readable text, wherein the input to the computer-implemented model further includes indications of the font types of the portions of the computer-readable text, wherein the plurality of textual elements and the scores assigned to the plurality of textual elements are further based upon the font types of the portions of the computer-readable text.

15. The method of claim 12, wherein the at least one computer-implemented model is one of:

a weighted n-gram difference model;

a continuous bag of words model; or

a latent semantic analysis (LSA) model.

16. A non-transitory computer-readable storage medium comprising a textual extraction application that, when executed by a processor of a computing device, causes the processor to perform acts comprising:

receiving defined criteria for a defined type of a document, the document comprising computer-readable text and a layout, wherein the layout defines positions of the computer-readable text within a two-dimensional area represented by the document;

receiving the document from a second computing device that is in network communication with the computing device;

identifying at least one textual element in the computer-readable text based upon:

spatial factors between portions of the computer-readable text in the document; and

contextual relationships between the portions of the computer-readable text, wherein the textual extraction application provides the computer-readable text and the positions of the computer-readable text within the document as input to a computer-implemented model, wherein the computer-implemented model outputs, based upon the input, a plurality of textual elements within the computer-readable text and scores assigned to the plurality of textual elements, the at least one textual element is included in the plurality of textual elements, wherein the at least one textual element is identified based on a score in the scores, wherein the score is indicative of a likelihood that the at least one textual element represents relevant content in the document based upon the defined criteria for the defined type of the document, wherein the textual extraction application calculates the spatial factors between the portions of the computer-readable text based upon the positions of the computer-readable text within the document, wherein the textual extraction application provides the spatial factors as further input to the at least one computer-implemented model, wherein the scores are further based upon the spatial factors calculated by the textual extraction application; and

responsive to identifying the at least one textual element in the computer-readable text, outputting the at least one textual element.

17. The non-transitory computer-readable storage medium of claim 16, wherein the at least one textual element comprises a first textual element that is indicative of an identifier for the defined criteria and a second textual element that meets the defined criteria.

18. The non-transitory computer-readable storage medium of claim 16, wherein the at least one textual element comprises a first textual element and a second textual element, wherein the first textual element is a word, wherein the second textual element is a number.

19. The non-transitory computer-readable storage medium of claim 16, wherein outputting the at least one textual element comprises storing the at least one textual element as an entry in a database.

20. The non-transitory computer-readable storage medium of claim 16, wherein the defined type of the document is one of:

an educational transcript;

an invoice;

a medical record;

a personnel record; or

a taxation form.
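The spatial factors listed in claim 7 (distance, angle relative to a document axis, ordering, and number of intervening elements) can be sketched in Python as follows. The claims do not give formulas, so this is only one possible reading: the `Element` type, the millimetre coordinates, the use of a horizontal axis for the angle, and the reading-order `index` field are all assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    text: str
    x: float    # horizontal position in mm (assumed unit, origin at top-left)
    y: float    # vertical position in mm
    index: int  # position in reading order (assumed to be derived from the layout)


def spatial_factors(a, b):
    """Compute illustrative spatial factors between two textual elements."""
    dx, dy = b.x - a.x, b.y - a.y
    return {
        # Euclidean distance between the two element locations.
        "distance_mm": math.hypot(dx, dy),
        # Angle between the line a->b and the horizontal axis, folded into [0, 180].
        "angle_deg": abs(math.degrees(math.atan2(dy, dx))),
        # Ordering: does the first element precede the second in reading order?
        "a_before_b": a.index < b.index,
        # Number of textual elements that occur between the two elements.
        "elements_between": max(abs(b.index - a.index) - 1, 0),
    }


# Hypothetical label/value pair on the same line of an invoice.
label = Element("Total:", 10.0, 50.0, 3)
value = Element("310.50", 22.0, 50.0, 4)

factors = spatial_factors(label, value)
print(factors)
```

Under these assumptions, a value printed directly to the right of its label gives a small distance, an angle of 0°, and no intervening elements. Per claim 1, factors of this kind are supplied to the computer-implemented model as further input; how the model weighs them is not specified in the claims.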
