Quill. Digitise. Extract. Translate.

Quill is a contract and large-document digitisation tool, with lots of other features bundled into one handy package.

It is the product of many years of development and consultancy that we here at Sonix Software have undertaken to answer a very old question in computing: how do you digitise a scan, and then extract key fields from semi-structured data?

Let's begin by making some things clear, starting with what Quill isn't for.

Quill - what it isn't for

I can download any number of apps on my phone that will scan and digitise text. They all work really well and can easily convert a picture to text. They can even read handwriting. A common misconception about Quill is that it is an OCR tool. It is not. It has that capability, but it is designed for repetitive, high-volume work. For example, I would not sit with 10,000 documents of 1,000 pages each and snap every page on my phone to digitise them. I need an application to queue them up so I can put my feet up and let them run.

Out of those documents, I probably only need certain pieces of information. So, for example, if I had an insurance contract, I may only need to get the insured name and premium. Quill will extract that information and put it in a neat report for me.

OK, now that we know what it is designed to do, let's firm up some of our terminology.

Structured, Unstructured and Semi-Structured Data
  1. Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets.

  2. Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organised in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

  3. Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.

We are only interested in Unstructured Data and Semi-Structured Data when using Quill. This is because if the documents are all the same, as they are with structured data, we can easily extract the data we need from them by other means. Quill will deal with structured data, but typically there is no need unless the aim is to save time.

In our Unstructured Documents, we can use Quill to read and extract data and convert it to Structured Data and information. This data can then be put in a spreadsheet to be read, or put in a database to be held for future use by other applications.
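
As a rough illustration of that last step, the sketch below writes a handful of extracted fields out as CSV, ready for a spreadsheet. It uses only the Python standard library. The field names, file names and values are made up for the example, and `to_csv` is a hypothetical helper, not part of Quill.

```python
import csv
import io

# Hypothetical extracted fields, one dictionary per processed document.
extracted = [
    {"document": "contract_001.pdf", "insured_name": "Acme Shipping Ltd", "premium": "12500.00"},
    {"document": "contract_002.pdf", "insured_name": "Northwind Traders", "premium": "8300.50"},
]

def to_csv(rows, columns):
    """Write a list of field dictionaries to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells rather than raising an error.
        writer.writerow({col: row.get(col, "") for col in columns})
    return buffer.getvalue()

report = to_csv(extracted, ["document", "insured_name", "premium"])
print(report)
```

Fixing the column order up front means every document produces a row of the same shape, even when a field could not be found.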

How it works

Quill comes in four different licence arrangements. The first is the desktop version, which uses the Sonix Software estate to process any data. We manage the Quill server in this case, and you do not need to worry about it. We do not, however, keep any data after it is processed.

We can also provide you with a Quill processing server out of the box, which we manage but which is dedicated to your processing.

Or you can buy the Quill server edition, which can be distributed across your own IT estate.

Native and Non-Native Text

Native text is text that is already digital. For example, if I open Notepad and type a sentence in, this is native text. Some documents will already be native text, which means we do not need to run any Optical Character Recognition on them. Non-native text is a picture of text. In other words, it did not originate in a computer to begin with, but has been imported via a camera or scanner.
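
One simple way to picture the distinction is a check on whether a page already carries a usable text layer. The sketch below is an assumption for illustration only: `needs_ocr` and its `min_chars` threshold are hypothetical, and how Quill actually decides is not described here.

```python
def needs_ocr(text_layer, min_chars=20):
    """Return True when a page has too little embedded text to be treated as native."""
    if text_layer is None:
        return True
    # Count only visible characters so whitespace-only layers do not pass.
    visible = sum(1 for ch in text_layer if not ch.isspace())
    return visible < min_chars

# Hypothetical pages: the value is the embedded text layer, if any.
pages = {
    "typed_page": "This agreement is made between the insurer and the insured.",
    "scanned_page": None,      # image only, no text layer
    "blank_layer": "   \n  ",  # text layer present but empty
}

for name, layer in pages.items():
    route = "OCR" if needs_ocr(layer) else "use text directly"
    print(f"{name}: {route}")
```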

OCR - Optical Character Recognition

This is the means of digitising text from a scan or picture. We have come up with a unique way of doing this. Rather than using an out-of-the-box OCR algorithm, we check words, and our application learns over time which words are more likely to be misread or misrepresented.

If the text being read is non-native, we invoke this step first.
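
To give a feel for what "learning which words are misread" can mean, here is a minimal sketch that counts confirmed corrections and starts suggesting one once it has been seen often enough. This is not Quill's algorithm. The `MisreadTracker` class, the `min_count` threshold and the sample tokens are all assumptions made for the example.

```python
from collections import Counter, defaultdict

class MisreadTracker:
    """Counts how often an OCR token was corrected to each confirmed word."""

    def __init__(self):
        self._corrections = defaultdict(Counter)

    def record(self, ocr_token, confirmed_word):
        # Only genuine misreads are worth remembering.
        if ocr_token != confirmed_word:
            self._corrections[ocr_token][confirmed_word] += 1

    def suggest(self, ocr_token, min_count=2):
        """Return the most frequent correction once it has been seen min_count times."""
        seen = self._corrections.get(ocr_token)
        if not seen:
            return ocr_token
        word, count = seen.most_common(1)[0]
        return word if count >= min_count else ocr_token

tracker = MisreadTracker()
for raw, fixed in [("prernium", "premium"), ("prernium", "premium"), ("lnsured", "insured")]:
    tracker.record(raw, fixed)

print(tracker.suggest("prernium"))  # corrected: this misread was confirmed twice
print(tracker.suggest("lnsured"))   # unchanged: only one confirmation so far
```

Requiring more than one confirmation before applying a correction keeps a single odd document from teaching the system a bad habit.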

Natural Language Processing

Post OCR, we run some NLP on the text. This is to identify the fields you wish to extract.

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyse large amounts of natural language data.

Machine Learning

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

Text sent through Quill is used by the Quill server and services to learn, as they go along, about the words and phrases they see.

Machine Translation

You can choose to translate your report or the full document to another language. This is done at runtime via the Quill desktop interface. For example, if I translate my report to Japanese, it will look a bit like this:

[Image: example Quill report translated into Japanese]

Data Mining

This is the final stage of the process. After we have identified in the UI which fields we would like to extract, we can then use Quill to identify the most likely places in the document where that data exists. By adding words and phrases that usually appear around the data, or that may or may not appear around it, we can then identify the field type and send it back inside the report we receive, or send it to a database of your choice. Here is an example of a London Market Underwriting slip we sent through Quill:

[Image: example London Market Underwriting slip processed by Quill]
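
The idea of using surrounding words and phrases to locate a field can be sketched in a few lines. The example below looks for a cue phrase and then searches the text just after it for something shaped like the value we want. It is a simplified illustration under stated assumptions, not Quill's implementation: `find_field`, the cue phrases, the patterns, the `window` size and the sample slip text are all hypothetical.

```python
import re

def find_field(text, cue_phrases, value_pattern, window=60):
    """Return the first value found within `window` characters after a cue phrase."""
    for cue in cue_phrases:
        for match in re.finditer(re.escape(cue), text, flags=re.IGNORECASE):
            # Search only the text immediately following the cue phrase.
            nearby = text[match.end():match.end() + window]
            value = re.search(value_pattern, nearby)
            if value:
                return value.group(0).strip()
    return None

# Made-up slip text for illustration.
slip = """
INSURED: Acme Shipping Ltd
PERIOD: 12 months from 1 January 2020
PREMIUM: GBP 12,500.00 payable annually
"""

fields = {
    "insured_name": find_field(slip, ["Insured:", "Assured:"], r"[A-Z][A-Za-z&. ]+"),
    "premium": find_field(slip, ["Premium:", "Gross premium"], r"[A-Z]{3} ?[\d,]+(?:\.\d{2})?"),
}
print(fields)
```

Listing several cue phrases per field is what lets one configuration cope with documents that word the same thing differently.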

Quill is far easier to use than any other piece of digitisation software. It is also far cheaper. Licensing starts at £10.99 a month, and you can run as much as you like. Many others have tried and failed by adding too much configuration to their software, making it too awkward to run at scale, especially when coupled with RPA software like UiPath or Blue Prism.