This article was originally published as a presentation at the AI Festival in 2020. The talk was delivered by Mayank Gandhi*, Senior Director, Product Management, and Jan Amtrup, Senior Director, Data Science, at Capital One.


We're from Capital One, and we'll talk about some challenges we face regularly from a customer service point of view, and how we're using computer vision to address them. It's less about what we're doing and more about what we've learned from the process: we'll share a framework for enterprise document management and the sorts of challenges we've faced along the way.

Capital One, I'm sure you've heard the name. I'm a Senior Director of Product at Capital One; I lead our product development efforts for our machine learning group, which is part of our credit card division. This is Jan Amtrup, our lead for data science for computer vision. You'll be hearing from him in a bit when we switch.

So this is probably what you've heard about Capital One: "What's in your wallet?" Hopefully, everyone's heard of that.

*Mayank has since become a Director, ML Platforms at Wayfair.

We’re a top 10 US Bank

What you might not know about us is that we're also a top 10 US bank. We're a fairly large bank, best known for our credit cards. We're actually part of the credit card division, but there's a larger bank out there. So if you haven't tried us from that point of view, I'd urge you to go check us out, especially our Capital One Cafés. That's a new concept in banking that we're bringing to the market.

We are unique, young, and founder-led

We're actually super young compared to other banks, or at least other big banks. We're also a founder-led company; Richard Fairbank is our founder and CEO, and we think some other pretty cool companies share that trait, so we wanted to call it out.

Let's talk about documents!

They're super fun…

No, they're not.

Especially when you're a customer trying to get something done. For example, you applied for a credit card, and we want to verify your identity. You submit a document supporting that, and you just want your credit card. Like, "Hey, it's me. I don't know why you flagged me, but it's me."

Essentially, you have to submit a set of documents; there's a process for it, and there are regulatory requirements for us to follow. But that's also a moment of opportunity for us in the machine learning group to improve the customer service experience. You don't really care about computer vision at that point; you just care about the customer service workflow.

As a product leader, my opportunity here is to think about how we’re interacting with our customers, the channels, and the types of customer service workflows, and see where there's an opportunity. Not only from a software engineering perspective but also from a machine learning perspective to actually automate that experience and streamline it as much as possible.

Typical (very high level) steps

When you think about our talk, think about it in the context of a product we're building with a focus on customer service. Any customer service solution that depends on machine learning to do this will essentially have these three steps.

The customer comes to your website or your mobile app, or mails a document to you. That document sits in some document repository. You build an application that acquires the document, runs OCR, and classifies it: is it an ID document, and so on.

Then you extract a piece of information: "Hey, we need the date on the driver's licence, we need the name, we need some income information." And we want to streamline that as part of our customer service workflow. Seems simple enough.
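In code, those steps might look like the minimal sketch below. Every name here (the Document class, the acquire/classify/extract functions, the placeholder outputs) is a hypothetical illustration, not our actual system:

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    image_bytes: bytes          # raw scan, photo, or fax page
    text: str = ""              # filled in by OCR
    doc_type: str = "unknown"   # e.g. "drivers_licence", "utility_bill"


def acquire(doc_id: str, repository: dict) -> Document:
    """Fetch the raw document from whatever repository it landed in."""
    return Document(doc_id=doc_id, image_bytes=repository[doc_id])


def classify(doc: Document) -> Document:
    """Decide what kind of document this is; a trained model in practice."""
    doc.doc_type = "drivers_licence"  # placeholder for a model prediction
    return doc


def extract(doc: Document) -> dict:
    """Pull the fields the workflow needs, e.g. name and expiry date."""
    return {"name": "...", "expiration_date": "..."}  # placeholder output
```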

I'm sure if you guys Google around (and some of you are from pretty cool companies whose open source we leverage heavily), you'll find these things have been solved to a certain degree. And this is where the challenge comes in for us. In the real world, customers interact with you through a variety of different channels.

A lot of our assets are digitally focused, and a lot of companies are focusing on digital experiences, but we have to keep these channels open for customers who prefer, or feel more comfortable, mailing or faxing a document. It's not just "can you upload the document on the webpage?" or "can you take a mobile capture and send it to us?"

As you'd imagine, these varying channels lead to varying image quality, which leads to further challenges. This is an area Jan has spent his career in, and he'll talk a bit about the data science aspect and how he's approaching this problem.

Documents can be a variety of types

About a year ago, I noticed I pay about $100 a month in utility charges, which is pretty high; for Virginia, it's actually very high. What you see in front of you is a standard printed, structured-ish document. I say structured-ish because there is no standard format for a utility bill; every vendor or provider has its own structure.

But there are some commonalities: you can guess where the dates are, what the amounts are, and what category each item falls in. This is on the simpler side of the documents we receive. Then we have handwritten documents; we get letters, including freeform letters, from customers as well. So we want to be able to pull content from those too.

And then we have both: people send us a structured document with text handwritten on it, like "by the way, this is my proof document," or "this is something." Each type of document, and each format of the text on it, comes with different challenges and different pros and cons for what we can do with it.

Here's your standard structured document, like a W-2 or a 1099; I'd call that the simpler version of something we can process. Then you have semi-structured custom templates, which aren't known templates but are still something we can parse.

And then, obviously, you have the non-handwritten (but still freeform) text that can come in. We have a variety of use cases where we have to deal with this. Changing your address, putting in a request for estate planning, or verifying your identity: these customer service requests lead to different types of documents and different needs.

The document processing framework contains several components

We took all this experience and looked at how customers interact with us, the range of requests, and the range of document types. That led us to this high-level framework. You have a repository somewhere to store these documents, some in digital format. You do some sort of pre-processing: you clean the document up, straighten it if it's inverted or skewed, crop it, and adjust the brightness or the aspect ratio if you need to.

Then you try to figure out what type of text and layout is there. You classify the document, and based on the context of the request, you pull out some information. That information has to be made available at just the right point in the customer service workflow, when an agent or an automated system needs it.
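As an illustration of the pre-processing step, here is a minimal sketch using OpenCV, assuming a colour page image with darker ink on a lighter background; production pipelines use far more robust (often learned) methods:

```python
import cv2
import numpy as np


def preprocess(image: np.ndarray) -> np.ndarray:
    """image: a BGR page photo or scan (an assumption for this sketch)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu binarization separates ink from background for the geometry steps.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Crop to the bounding box of the ink, dropping empty borders
    # (assumes the page actually contains some ink).
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(binary))
    cropped = gray[y:y + h, x:x + w]
    # Normalize brightness/contrast before handing off to OCR.
    return cv2.equalizeHist(cropped)
```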

This led us on a journey where we went out to the market and looked at a lot of vendor solutions, and we found three things. One, there are very narrow products out there: products that did really well on ID documents but couldn't do anything else. Two, there were really good platforms that could solve part of this stack, but not an end-to-end solution.

And three, we found we had some unique needs. For example, people who get deployed to active duty get special benefits from the bank. So, to verify that, there's a military document that you'd submit. Getting a system that can process that was a unique use case we had to deal with and build capabilities around.

When we thought about this, we looked at it and said, "Okay, we're not going to say no to products that are out there." But we had to figure out how to use a modular architecture, evolve it over a period of time, build where we needed to build, and learn and adopt where best-in-class already exists.

We continue to believe a competitive advantage is rarely off the shelf. So you kind of have to build it in pieces. Competitive advantage doesn't have to mean building everything, but you can build it in a modular format that, over a period of time, leads to a unique solution for your particular business needs.

Additional must-have core capabilities

As part of this, we also identified three capabilities we wanted. The first was contextual metadata. The idea here was: let's assume a digitally engaged customer is, for some reason, faxing us documents and wants to know what's going on. Are we showing them the right information, the things they should be aware of, in the app or on the website? And what are they faxing to us?

Context matters, not just in the customer service workflow, but also in how customers are interacting with us. The second was a feedback loop: agents are looking at these documents, and they're going to be our primary source of truth for accuracy (did we pull this name correctly, did we get the right date out?), so we have to build that architecture in place. The third was scalable integration.

By itself, this platform and product is not that useful. It needs to connect with enterprise applications and be available wherever customers are interacting with us. It has to be built within the ecosystem of our larger enterprise applications, so other applications can build on the core capabilities we're creating.

Other teams in the company can then enable use cases we can't even imagine today. Taking a platform-first approach, and not just thinking of this as some algorithm that can process documents, was the core idea we started with, and we built on it from that point forward.

So let me take a pause, then I'm going to hand over to Jan who's going to talk about the data science aspect of it and how we went through this journey.

Jan Amtrup, Senior Director, Data Science

I'm basically going to repeat a lot of what Mayank said, with a data science basis and bias. I'm focusing on three issues we encounter when processing documents: image problems and complexities, document problems and complexities, and the embedding into existing systems that we also need to do.

If you look at this picture, not every document that comes into an enterprise is a neat, standard-size format: black on white, clean, straight. No. There's dust, there are artifacts from scanners, or maybe the page was faxed. In this case, you have perspective distortion and some nonlinear warping, so there's a lot going on in images that you need to take care of.


Image source

It matters a great deal where a document comes from. The path it takes from production to reaching our system is very important for making sure we can process it correctly. The easy case, of course, is a document that's sent electronically or scanned; that mostly looks right. But we get faxes a lot.

We get pictures from mobile devices, with perspective distortion and background clutter. And we get screenshots: people take a screenshot of something, paste it into a Word document, print that document out, and then fax it to us. Then the content of the image matters as well, right?


Image content

I think the right side is more important than the left. Mayank might think differently, because it's his lunch order. You have to distinguish between content you want and content you don't want. And even when you have the content you want, things can go wrong, because the document itself might have a background image, or logos, or it might be lined paper with text printed over the lines. So it's never easy.

Delivery format

And even if everything goes right, the format the image arrives in still makes a difference. You have different colour schemes and different compression mechanisms. And even if you get a PDF document with a text layer, what do you do with it? Do you trust it or not? Good question.

In summary, even if you only look at the image side of things, document processing is relatively hard.

Document content varies widely

Now, let's assume all that image stuff went well. We have a very clean image; maybe the text has even been extracted already by OCR; everything is good, right? But there are different types of documents. Mayank mentioned them briefly. The easiest ones, you might think, are fixed forms, stuff like tax forms: 1040s, W-2s, and things like that, because they always look the same, right?

Sometimes. I think the IRS produces at least two variants each year, and they differ only by a number all the way at the bottom, in very subtle ways. You'd think you could simply say, "my adjusted income is always there, my tax rate is always there, and my refund is always there." And for that moment in time, that works.

But if something changes, then it doesn't work anymore. So you say, "Okay, I'll use regular expressions," right? That should be better. If you only process one form, that's fine. But there's a whole lot of variety in forms, and as you scale with use cases, all these rule-driven, manually maintained algorithms simply don't work.
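A toy example of that brittleness, with made-up line numbers and amounts: suppose one year's form prints the adjusted gross income on line 11, and a regex keys on exactly that layout.

```python
import re

line = "11  Adjusted gross income ............ 52,340"
match = re.search(r"^11\s+Adjusted gross income\D+([\d,]+)", line)
if match:
    print(match.group(1))  # "52,340": works for this year's layout

# If next year's revision renumbers the line or rewords the label, the
# pattern silently matches nothing, and every form variant now needs its
# own hand-maintained rule. That is the scaling problem described above.
```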

So, machine learning and computer vision come to the rescue; we think that's the right way to address this kind of problem. Mayank showed invoices, which are less structured, so you need more knowledge to process them. Manual rules work to a certain degree, but if you're a large enterprise and you get invoices from several thousand vendors, good luck keeping that up and maintaining the rule base.

Again, a learn-by-example mechanism works better: you provide documents and just point at the things you want, then let the machine learning algorithm figure out what that means and the best way of getting there. Freeform letters are very hard. There's no good two-dimensional structure you could use, and no labels or fields to hang on to. It's just free-flowing text.

There, you need natural language processing techniques that establish knowledge about what comes next in a sentence and use that to figure out what's going on. Natural language processing is a very old discipline, and in the last few years, neural networks, particularly recurrent ones, have made very, very big strides in understanding what's going on.
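As a small illustration of that NLP angle, here is a sketch that pulls entities out of a freeform letter with an off-the-shelf NER model (spaCy's small English model; in practice you would train on your own labelled letters, and the letter text here is invented):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the downloaded spaCy model

letter = ("Dear Capital One, I recently moved. Please update my address "
          "to 123 Elm Street, Richmond, effective March 1.")

# The statistical model tags spans of interest in the free-flowing text.
for ent in nlp(letter).ents:
    print(ent.text, ent.label_)  # e.g. "Richmond" GPE, "March 1" DATE
```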

Again, manual rule writing versus machine learning: I would always bet on machine learning. Then there are special documents, stuff like checks, ID documents (we get a lot of those), and Social Security cards; in some cases, they are designed to be non-readable. You can write rules that take this California licence and say, "Okay, the first line there is the driver's licence number."

But if you go to Delaware or Maryland, it looks completely different, and there's a lot of variation across the US; there are more than 200 different forms of driver's licence. Rule writing, again, is probably not the right way to do it. Machine learning, in our estimation, is.

A document processing platform needs to scale in several dimensions

We talked about images and document content. The third aspect that I want to talk about is embedding into a system. Say you have the urge, or the need, as a big enterprise to devise a document processing system or platform. How do you go about that? You have to think about scale in multiple ways.

The first obvious issue of scale is the number of documents: it makes a big difference whether you want to process 100 documents per day or 100,000. If you have a good architecture and you live in the cloud, then maybe that problem is not that large, because you can scale to your wallet's content and deal with it.

I'm not going to talk about the intake process; that in itself is a major nightmare.

Use cases

The second scale issue would be use cases. If you only have one use case, let's say you only want to process 1040 documents, things are easy. You just write regular expressions, or you tell the system where the things are that you need, and you're done. In a large enterprise, you have dozens or hundreds of use cases.

And writing and maintaining rules for all of those is very, very hard. So we advocate machine learning throughout: if all you have to provide is labelled examples, you can make onboarding and maintenance easy. Otherwise, you spend all your time onboarding and maintaining but never building anything, which you don't want.

Requirement variety

The third dimension is the different requirements you might have of a system. You can route documents, you can classify them, and you can extract information from them; even validation is an option. We think that calls for a component-oriented, modular architecture with some kind of workflow orchestration behind it, so you can easily combine the capabilities you've built to solve a use case (a minimal sketch follows).
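One way to sketch that component idea: every capability implements the same small interface, and a use case is just an ordered composition of components. The class and component names below are illustrative, not our actual platform:

```python
from typing import Protocol


class Component(Protocol):
    """Anything that takes a document state and returns an updated one."""
    def run(self, doc: dict) -> dict: ...


class Pipeline:
    """A trivial orchestrator: run components in order for one use case."""
    def __init__(self, *components: Component):
        self.components = components

    def run(self, doc: dict) -> dict:
        for component in self.components:
            doc = component.run(doc)
        return doc


# Hypothetical wiring for an identity-verification use case:
# identity_flow = Pipeline(Classifier(), IdExtractor(), Validator(), Router())
# result = identity_flow.run({"image": raw_bytes})
```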

User groups

Finally, you're dealing with a number of clients and customers inside your organization: business groups that want something solved. That comes with two sets of problems. The first is technical: do they all use the same systems? Do they all talk to the same ECM system, the same case management system, and so on? Or do they just have a floppy disk with documents and want to produce a PDF from it?

And the second problem is more on the human side, the prioritisation side. Everybody, of course, thinks their problem is the most important. But you, as the provider of that service, have to establish a valuation schema and see which problem generates the most value for the enterprise.

There's a lot of scaling going on. And a lot of thought has to go into the design, architecture, and implementation of such a system. The choices that you make there can either hinder you or make it much easier to come up with a good solution.

Such solutions should also support a few key requirements

For us, there are a few more problems, as we're in a regulated industry. We have lots of laws and regulations that tell us what we can and cannot do. We have to worry about any data we give to vendors, to prevent leakage of that data. And we worry about two types of measurements with regard to runtime performance.

We have real-time processes where people will take a picture with their mobile device and want to get an answer in a few seconds. And we have processes where we process hundreds of thousands of documents, and we're more interested in throughput than latency.

Computer vision is essential for successful document processing

What does all of that have to do with machine learning, computer science, deep learning, and computer vision? We've mentioned throughout the talk that we advocate machine learning as the solution to many of these problems. I want to highlight three areas where this happens.

One is on the image side, where you need to figure out where the document is in your image, after maybe cleaning it up in some way. And once you know where the document is, you have to figure out what the relevant parts are: paragraphs, text lines, pictures, logos, and so on. For many of those problems, traditional image processing methods are available.

For instance, for skew correction, there's a 30-year-old method that just projects and then looks for peaks or something. But in the last few years, for a number of interesting and hard subproblems of image processing, neural networks have come in and have been very, very successful.
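For the curious, here is a minimal sketch of that classic projection-profile idea: try candidate angles, project ink counts onto the vertical axis, and keep the angle whose profile has the sharpest peaks. It assumes a binarized page (ink as 1, background as 0) and a small skew range:

```python
import numpy as np
from scipy.ndimage import rotate


def estimate_skew(binary: np.ndarray, max_angle: float = 5.0) -> float:
    """Return the estimated skew angle in degrees."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + 0.1, 0.5):
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)    # ink per row
        score = float(np.var(profile))   # peaky profile => aligned text lines
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle  # rotate the page by -best_angle to deskew
```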

Neural networks now help with deblurring, with locating things on a page, and with table identification, which works very well. So the success that computer vision has had on real-world images has a parallel on documents.

The second area is the middle layer: OCR, which has always been very close to machine learning. Typically, you isolate the characters with morphological operations, and then you classify them, potentially with a neural network. But in recent years, two directions have emerged that have made OCR much, much better.

One is the idea of looking not at a single character but at a whole line, and taking into account the context to the left and right, not only in terms of pixels but also in terms of a language model: certain words have a high probability of following certain other words. The use of LSTMs here was a very big breakthrough in OCR technology a few years ago.
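A minimal sketch of a line-level recognizer in that spirit: a small CNN over the line image, a bidirectional LSTM over its columns, and a CTC head so no per-character segmentation is needed. The dimensions are illustrative, and this is nowhere near a production OCR model:

```python
import torch.nn as nn


class CRNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(  # line image input: (batch, 1, 32, width)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * 8, 128, bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, n_classes + 1)  # +1 for the CTC blank

    def forward(self, line_img):
        f = self.conv(line_img)                # (batch, 64, 8, width/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # one feature vector per column
        out, _ = self.lstm(f)                  # left and right context
        # Log-probs per column; transpose to (time, batch, classes)
        # before feeding nn.CTCLoss during training.
        return self.head(out).log_softmax(-1)
```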

The other direction is: what if you don't have a line? Well, you can still use deep learning and treat OCR as a segmentation and classification problem. You look at the whole page, find bounding boxes for characters, words, or parts of words, and then use convolutional networks to classify those into the characters that are there.
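A heuristic sketch of that detection-plus-classification view, using connected components to propose character boxes on a single line of binarized text; char_cnn stands in for a hypothetical trained classifier, and real systems use learned detectors rather than this heuristic:

```python
import cv2
import numpy as np


def ocr_by_segmentation(binary: np.ndarray, char_cnn) -> str:
    """binary: uint8 image of one text line, ink as 255 on black."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    chars = []
    for i in range(1, n):          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < 10:              # drop specks of noise
            continue
        crop = binary[y:y + h, x:x + w]
        chars.append((x, char_cnn(crop)))  # classify, remember x position
    return "".join(c for _, c in sorted(chars))  # left-to-right order
```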

There are a number of interesting papers on applying those methods to documents with no line structure, like receipts. And lastly, once you have all the text, everything is done, right? Still no: there are a lot of classical machine learning techniques that operate on documents once the text is there.

Classification and information extraction, for example, where you treat extraction as a classification problem and use a support vector machine. But the larger view, again, is that you can look at a document of characters as a low-resolution image, say 80 by 50, with a lot of channels, one per character.

Then it's a binary, multi-channel, low-resolution image, and the physical and geometrical structure of the characters tells you something about the document. That allows you to run deep networks and learn what to extract.

So, extraction here, again, becomes segmentation, not on a real image of pixels, but on an image of character "pixels". There are a lot of areas where both computer vision and image processing apply, in addition to the traditional machine learning techniques we always advocate.
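Here is a minimal sketch of that character-grid encoding: rasterize OCR output onto a coarse grid with one binary channel per character class, producing a tensor a segmentation network can consume. The grid size matches the 80-by-50 example above; the vocabulary and the word-placement scheme are simplifying assumptions (compare the published "chargrid" line of work):

```python
import numpy as np

VOCAB = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
H, W = 50, 80  # coarse grid, as in the 80-by-50 example above


def chargrid(ocr_words, page_w, page_h):
    """ocr_words: list of (text, x, y) word origins in page pixels."""
    grid = np.zeros((len(VOCAB), H, W), dtype=np.uint8)
    for text, x, y in ocr_words:
        row = min(int(y / page_h * H), H - 1)
        col = min(int(x / page_w * W), W - 1)
        for k, ch in enumerate(text.lower()):
            if ch in VOCAB and col + k < W:
                grid[VOCAB[ch], row, col + k] = 1  # one channel per character
    return grid  # shape (36, 50, 80): a binary multi-channel "image"
```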

Final thoughts

I talked a lot about the problems that come up in image processing and document analysis. Mayank introduced all of that. There are a lot of things that look easy and are hard, or that look hard and are easy.

But in the end, I think what’s important for us is that anything we do contributes to the success of our end customer, the person that has our credit card, or is our customer at the bank. And by extension, if those customers are happy, that means our direct customers in the enterprise are happy because we deliver value to them that they pass on to those customers.

Mayank says that happy customers are equal to awesome products. So, that’s good. Thank you for your interest.