What is data parsing?
Data parsing breaks a file's content into smaller units by following a set of rules, so that a system can translate and manage it more easily.
How does data parsing work?
Data parsing extracts unstructured files from a source location to a destination and converts them into a structured format. It also provides functions such as reading XML files, finding a specific file inside a data source, converting a file from one format to another, and setting a new schema for XML output.
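As a minimal sketch of the idea, the snippet below reads a small XML fragment and turns it into a flat dictionary. The tag names (`article`, `heading`, `author`, `content`) and the sample text are illustrative assumptions, not the schema used in the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the tag names here are assumptions for illustration only.
SAMPLE_XML = """
<article id="a1">
  <heading>City council approves budget</heading>
  <author>Staff Reporter</author>
  <content>The council voted on Monday.</content>
</article>
"""

def parse_article(xml_text):
    """Parse one <article> element into a flat dict of its child tags."""
    root = ET.fromstring(xml_text.strip())
    record = {"id": root.get("id")}
    for child in root:
        # Strip surrounding whitespace; treat empty elements as empty strings.
        record[child.tag] = (child.text or "").strip()
    return record

article = parse_article(SAMPLE_XML)
```

Once the content is in a plain dictionary like this, it can be written out in any target format or checked against a schema.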
What are the problems & challenges?
The existing system received inputs from different platforms and in different file formats. The requirement was to build an intelligent system that reads such inputs and collates them into a standard XML file format with friendlier tags and well-tagged information.
Challenges while creating the data parsing solution:
- Low-quality images in the existing input sources.
- Some article attributes overlap during parsing, which makes it difficult to identify the exact attribute.
- Some news articles' contours overlap each other in the output files.
- Some attributes of an article are missing when the epaper output is displayed.
- Identifying additional ads and merging them into the advertisement section of the epaper.
- Rearranging the page sequence.
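One way to flag the overlapping-contour cases is to compare bounding boxes pairwise. The sketch below assumes each contour has been reduced to an axis-aligned rectangle (left, top, right, bottom). The `Box` type, the article ids and the coordinates are hypothetical, and real contours may be polygons that need a more careful test.

```python
from itertools import combinations
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned bounding box of an article contour (assumed representation)."""
    left: float
    top: float
    right: float
    bottom: float

def overlaps(a, b):
    """Return True when two boxes share a region of positive area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)

def find_overlaps(contours):
    """Return pairs of article ids whose boxes overlap, in id order."""
    pairs = []
    for (id_a, box_a), (id_b, box_b) in combinations(sorted(contours.items()), 2):
        if overlaps(box_a, box_b):
            pairs.append((id_a, id_b))
    return pairs

# Hypothetical page layout: a1 and a2 overlap, a3 only touches a1's edge.
page = {
    "a1": Box(0, 0, 100, 50),
    "a2": Box(80, 40, 200, 120),
    "a3": Box(100, 0, 180, 30),
}
conflicts = find_overlaps(page)
```

Pairs reported this way could be queued for manual correction rather than resolved automatically.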
Existing system scenario:
There are four main input data sources, each containing any number of data items and files. These sources hold the files and data needed to produce the epaper output, and they come in different types: Zip file, InDesign file, SQL DB, and Quark file.
The Zip file contains image, PDF, and media files. All these input sources have different structures and schemas. To bring the data and files into a common structure, we created a customized data parsing adapter and use it for data parsing.
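The adapter idea can be sketched as one small class per input source, each returning records in the same shape. Everything named below (`ZipAdapter`, `SqlAdapter`, `collect`, the `CommonRecord` fields, the `articles` table and the sample data) is a hypothetical illustration, since the actual class design is not described here. The sqlite3 module stands in for the real SQL DB.

```python
import io
import os
import sqlite3
import zipfile
from dataclasses import dataclass

@dataclass(frozen=True)
class CommonRecord:
    """Common shape every adapter returns (field names are assumptions)."""
    source: str  # which input source produced the record
    name: str    # file name or row identifier
    kind: str    # e.g. "pdf", "image", "xml", "row"

KIND_BY_EXTENSION = {".pdf": "pdf", ".jpg": "image", ".png": "image", ".xml": "xml"}

class ZipAdapter:
    """Lists the members of a zip archive as common records."""
    def __init__(self, archive_bytes):
        self.archive_bytes = archive_bytes

    def records(self):
        with zipfile.ZipFile(io.BytesIO(self.archive_bytes)) as archive:
            names = archive.namelist()
        result = []
        for name in names:
            extension = os.path.splitext(name)[1].lower()
            kind = KIND_BY_EXTENSION.get(extension, "other")
            result.append(CommonRecord("zip", name, kind))
        return result

class SqlAdapter:
    """Reads article rows from a database connection."""
    def __init__(self, connection):
        self.connection = connection

    def records(self):
        rows = self.connection.execute("SELECT id FROM articles ORDER BY id")
        return [CommonRecord("sql", str(row[0]), "row") for row in rows]

def collect(adapters):
    """Merge the output of all adapters into one list of common records."""
    return [record for adapter in adapters for record in adapter.records()]

# Build tiny in-memory inputs so the sketch runs without external files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("page1.pdf", b"%PDF-")
    archive.writestr("photo.JPG", b"")
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY)")
connection.execute("INSERT INTO articles (id) VALUES (7)")

records = collect([ZipAdapter(buffer.getvalue()), SqlAdapter(connection)])
```

Adding another source, such as InDesign or Quark files, would then mean adding one more class with a `records()` method, without touching the downstream code.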
Data parsing adapter creation and implementation:
We created a parsing adapter to parse the stored files and data. The adapter fetches all source files and data from a defined source path on the file server.
Source files are stored in an unstructured format inside those folders. The adapter extracts all of these files and reads the XML files. After reading an XML file, the adapter finds the PDF file inside the data source and converts it into an image. Once the image conversion is complete, the adapter generates XML in a predefined schema format and splits it into multiple articles.
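A minimal sketch of the first part of that flow follows: extracting an archive and locating the PDF that a page XML points to. The archive layout, the use of a `PdfUrl` child element as the pointer, and the helper names `extract_archive` and `find_page_pdf` are assumptions for illustration. PDF-to-image conversion is left out because it depends on an external rendering library.

```python
import tempfile
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

def extract_archive(archive_path, target_dir):
    """Extract every member of the zip archive into target_dir."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)

def find_page_pdf(folder):
    """Return the first PDF that an XML file's PdfUrl points to and that exists on disk."""
    for xml_path in sorted(Path(folder).rglob("*.xml")):
        pdf_url = ET.parse(xml_path).getroot().findtext("PdfUrl")
        if pdf_url:
            # Assumption: PdfUrl is a path relative to the XML file's folder.
            candidate = xml_path.parent / pdf_url
            if candidate.is_file():
                return candidate
    return None

# Self-contained demo using a temporary folder and a tiny made-up archive.
with tempfile.TemporaryDirectory() as workdir:
    archive_path = Path(workdir) / "input.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("edition/page1.xml", "<Page><PdfUrl>page1.pdf</PdfUrl></Page>")
        archive.writestr("edition/page1.pdf", b"%PDF-1.4")
    extract_archive(archive_path, Path(workdir) / "extracted")
    found = find_page_pdf(Path(workdir) / "extracted")
    found_name = found.name if found else None
```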
Steps of the implemented solution:
- Step 1: Copy and extract files from the file server to our server.
- Step 2: The parsing adapter reads the existing XML file to identify the PDF file, the article details with their content (e.g. news, author, images, publisher), and the position of each article. It also handles different attributes such as Page, Product Name, Zone, Edition, Datetime, PreviousPage, NextPage, ImageUrl, PdfUrl, ArticleObject, Contour, ArticleType, ImageObjectName, ImageOwnerName, Source, Agency, Biline, Roof, Heading, Subheading, PhotoUniqueId, Version, SeePage, Location, Author, and Content.
- Step 3: The parsing adapter finds a PDF file with the help of the XML file and trims it. It also generates high-resolution PDFs and images.
- Step 4: The adapter generates standard XML for each article and advertisement, with their content and images.
- Step 5: High-resolution images are cropped into smaller units so that each article can be viewed in detail.
- Step 6: Raster images are used so that each article is rendered in a clear, readable format.
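Steps 2, 4 and 5 can be sketched roughly as follows: read a page XML, compute a pixel crop box for each article, and write one standalone XML document per article. The input layout, the comma-separated `Contour` format in PDF points, the 300 DPI value and the `CropBox` output element are all assumptions. Only the names `Heading`, `Author` and `Contour` come from the attribute list in Step 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical page XML; the real input layout is not shown in this article.
PAGE_XML = """
<Page number="1">
  <Article id="a1">
    <Heading>Budget approved</Heading>
    <Author>Staff Reporter</Author>
    <Contour>72,72,288,216</Contour>
  </Article>
  <Article id="a2">
    <Heading>Local team wins</Heading>
    <Author>Sports Desk</Author>
    <Contour>0,216,144,360</Contour>
  </Article>
</Page>
"""

def crop_box_pixels(contour_text, dpi=300):
    """Convert 'left,top,right,bottom' in PDF points (1/72 inch) to pixels."""
    points = [float(value) for value in contour_text.split(",")]
    return tuple(round(value * dpi / 72) for value in points)

def split_articles(page_xml, dpi=300):
    """Return {article id: standalone XML string} for each article on the page."""
    page = ET.fromstring(page_xml.strip())
    outputs = {}
    for article in page.findall("Article"):
        out = ET.Element("Article", id=article.get("id"), page=page.get("number"))
        for tag in ("Heading", "Author"):
            ET.SubElement(out, tag).text = article.findtext(tag, default="")
        left, top, right, bottom = crop_box_pixels(article.findtext("Contour"), dpi)
        ET.SubElement(out, "CropBox", left=str(left), top=str(top),
                      right=str(right), bottom=str(bottom))
        outputs[article.get("id")] = ET.tostring(out, encoding="unicode")
    return outputs

articles = split_articles(PAGE_XML)
```

The pixel box is what an image library would need in order to crop the rendered page. The scaling uses the fact that one PDF point is 1/72 of an inch.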
Output of Solution
- The solution converts PDFs into high-quality, readable images and raster output.
- We wrote a standard schema to validate the unstructured XML and provided an interface to correct articles.
- We created an interface to correct contours and content in the input sources, as well as to change the page sequence.
- The interface also supports page rendering and merging.
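The schema check can be approximated without a full XSD validator. The sketch below only verifies that a set of required child elements is present and non-empty. The required set chosen here and the `validate_article` helper are assumptions, and a production setup would more likely validate against an XSD with a dedicated library.

```python
import xml.etree.ElementTree as ET

# Assumed required fields, using attribute names mentioned in this article.
REQUIRED_FIELDS = ("Heading", "Author", "Content", "Contour")

def validate_article(xml_text):
    """Return a list of problems found; an empty list means the article passed."""
    try:
        article = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"not well-formed: {error}"]
    problems = []
    for field in REQUIRED_FIELDS:
        text = article.findtext(field)
        if text is None:
            problems.append(f"missing {field}")
        elif not text.strip():
            problems.append(f"empty {field}")
    return problems

good = ("<Article><Heading>H</Heading><Author>A</Author>"
        "<Content>C</Content><Contour>0,0,1,1</Contour></Article>")
bad = "<Article><Heading>H</Heading><Author> </Author><Content>C</Content></Article>"

good_report = validate_article(good)
bad_report = validate_article(bad)
```

A report like `bad_report` is the kind of information a correction interface could show next to each article that needs attention.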