Deep Learning, ML, NLP/IR, Data Mining and Data Extraction for a Construction Bidding Company

Data Engineering

Customer.

Our customer is North America’s leading provider of analytics and software-based workflow integration solutions, with a large client base in the construction industry.

Challenge.

Our customer runs an online construction bidding platform that helps contractors, manufacturers and distributors succeed by giving them data analytics on the largest, most accurate database of construction projects in the industry. The customer gathers data from diverse and disparate sources such as websites, content aggregators, phone calls, emails, web forms, and web services, both internal (receipt) and external (acquisition). Every data source is different, and the level of structure varies from highly structured, fielded data down to completely unstructured text-based content. Data acquisition also involves many document types, including PDFs with and without text layers, images, DWG, BIM, DOC, XLS, CSV, HTML, XML and plain text.

They looked to partner with an AI solution provider to develop an automated process that extracts text and information from construction-related PDF files, then segregates and categorizes the general information in the source data, such as plan-specific information and data tables. The extracted information had to be indexed and stored so that the documents become searchable.

Solution.

Developers from Intellectyx analyzed the requirements and developed an automated data extraction solution by integrating PDF text extraction engines and OCR (Optical Character Recognition) SDKs, which extract text, table data, raster images and vector images and convert them into XML. The extracted data was indexed and stored as searchable PDFs, so that when an internal user searches the documents through the web interface developed for the system, the NLP rule set created for the project helps deliver more accurate search results.
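As a rough illustration of the extract-to-XML and indexing steps described above, the following minimal Python sketch wraps already-extracted page text in XML and builds a simple inverted index over it. It is not the delivered implementation: the function names (pages_to_xml, build_index, search), the XML layout and the plain keyword matching are all assumptions made for this example, and the actual PDF extraction engines, OCR SDKs and NLP rule set are not shown.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def pages_to_xml(doc_id, pages):
    """Wrap extracted page text in a small XML tree (hypothetical layout)."""
    root = ET.Element("document", id=doc_id)
    for number, text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", number=str(number))
        page.text = text
    return root


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of (document id, page number) pairs containing it."""
    index = defaultdict(set)
    for root in documents:
        doc_id = root.get("id")
        for page in root.iter("page"):
            for token in tokenize(page.text or ""):
                index[token].add((doc_id, int(page.get("number"))))
    return index


def search(index, query):
    """Return the pages that contain every token of the query, sorted."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Example usage with made-up page text standing in for extractor or OCR output.
doc = pages_to_xml(
    "spec-001",
    ["Section 03 30 00 Cast-in-Place Concrete", "Door hardware schedule"],
)
index = build_index([doc])
print(search(index, "door schedule"))
```

A production system would typically replace the in-memory dictionary with a dedicated search engine and apply query rules on top of it; the sketch only shows why indexing by document and page makes the extracted content searchable.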

The automated solution also updates the bookmarks and table of contents of each PDF by analyzing its content. It is equipped with batch processing capabilities, so that a large volume of documents can be fed into the system for processing.
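The table-of-contents and batch processing ideas above can be sketched as follows. This is a hedged illustration only: the heading pattern (lines starting with "SECTION" followed by a number, or numbered headings such as "1.1 Summary"), the function names build_toc and process_batch, and the use of a thread pool are assumptions for the example, not details taken from the delivered solution. Writing the entries back into a PDF as bookmarks is left out.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed heading heuristic: "SECTION <number> ..." or "<n>[.<n>...] Title".
HEADING = re.compile(r"^(SECTION\s+\d.*|\d+(?:\.\d+)*\s+[A-Z].*)$")


def build_toc(pages):
    """Return (heading, page number) pairs found in a document's page text."""
    toc = []
    for number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            line = line.strip()
            if HEADING.match(line):
                toc.append((line, number))
    return toc


def process_batch(batch, workers=4):
    """Build a table of contents for every document in a batch.

    `batch` maps a document id to its list of page texts.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tocs = pool.map(build_toc, batch.values())
        return dict(zip(batch.keys(), tocs))


# Example usage with made-up page text.
batch = {
    "spec-001": [
        "SECTION 03 30 00 CAST-IN-PLACE CONCRETE\nPart 1 General\n1.1 Summary",
        "body text only",
        "SECTION 08 71 00 DOOR HARDWARE",
    ],
    "spec-002": ["no headings here"],
}
for doc_id, toc in process_batch(batch).items():
    print(doc_id, toc)
```

For CPU-heavy work such as OCR, a process pool or a job queue would likely suit better than threads; the thread pool here only demonstrates the batch interface.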

Result.

By combining several technologies aided by Artificial Intelligence, the integrated, automated data extraction solution converts any sort of content into a readable format. The customer is now able to:

Gather data from diverse and disparate sources
Automate processes involving text and information extraction from construction-related PDF files
Extract unstructured data from different document types and index it for search within a few seconds
Generate reports from a visualization dashboard that provides charts for aggregated information