Idea Validation: Data Extraction Tool using NLP

by tecoholic

Background

I was approached by a doctor specializing in infectious diseases to extract Covid-19 patient data from local government bulletins. These are daily press releases in PDF format with a bunch of tables and some text (source files). I couldn't find any ready-to-use no-code tools, so I started experimenting with the NLP library spaCy to see whether the extraction could be automated.

After a couple of online tutorials, I was able to automate it to a certain degree. A sample is shown in the image above.

Product Idea

Upload files containing the text
Choose the unit the text should be split into (sentences, paragraphs, pages, etc.)
Tag the data on a sample (say 5 entries)
Run the NLP process in the background and generate a table of the extracted information. It certainly won't be 100% accurate.
The user can then click on any row where the information is wrong, which shows the full text and lets the user tag the right bits
Each correction updates the data table by rerunning the NLP process with the newly tagged input.
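
To make the loop in the steps above concrete, here is a minimal Python sketch. It is not the spaCy prototype described in the background: a simple "word before the value" regex learner stands in for the NLP model so the example runs with the standard library alone. The names split_units, ContextExtractor, build_table and correct are hypothetical, the form feed as page separator is an assumption, and the sample sentences are invented, not taken from the bulletins.

```python
import re


def split_units(text, unit="sentence"):
    """Step 2: split raw text into the unit the user chose."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif unit == "page":
        parts = text.split("\f")  # assumption: pages are separated by form feeds
    else:
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


class ContextExtractor:
    """Stand-in for the NLP model: learns the word preceding each tagged value."""

    def __init__(self):
        self.examples = []  # list of (unit_text, {field: value})
        self.patterns = {}  # field -> list of compiled regexes

    def fit(self, examples):
        """Step 3: rebuild all patterns from the tagged examples."""
        self.examples = list(examples)
        self.patterns = {}
        for text, tags in self.examples:
            for field, value in tags.items():
                found = re.search(r"(?<!\w)" + re.escape(value) + r"(?!\w)", text)
                if not found:
                    continue
                left_words = text[:found.start()].split()
                if not left_words:
                    continue  # no preceding word to use as a cue
                shape = r"\d+" if value.isdigit() else r"[A-Za-z]+"
                regex = re.compile(
                    r"(?<!\w)" + re.escape(left_words[-1]) + r"\s+(" + shape + ")"
                )
                self.patterns.setdefault(field, []).append(regex)

    def extract(self, text):
        """Step 4: produce one table row; None marks a field that was not found."""
        row = {}
        for field, regexes in self.patterns.items():
            row[field] = None
            for regex in regexes:
                found = regex.search(text)
                if found:
                    row[field] = found.group(1)
                    break
        return row


def build_table(extractor, units):
    return [extractor.extract(unit) for unit in units]


def correct(extractor, units, index, tags):
    """Steps 5 and 6: add the user's correction and rerun over every unit."""
    extractor.fit(extractor.examples + [(units[index], tags)])
    return build_table(extractor, units)


# Invented sample text, for illustration only.
text = (
    "A 45 year old male from Chennai tested positive. "
    "A 62 year old female from Madurai tested positive. "
    "Patient aged 30, female, resident of Salem tested positive."
)
units = split_units(text)

extractor = ContextExtractor()
extractor.fit([(units[0], {"age": "45", "city": "Chennai"})])
table = build_table(extractor, units)
print(table[2])  # {'age': None, 'city': None}

table = correct(extractor, units, 2, {"age": "30", "city": "Salem"})
print(table[2])  # {'age': '30', 'city': 'Salem'}
```

The third row comes back empty on the first pass because its wording differs from the tagged sample. Tagging it once fixes that row and leaves the earlier rows unchanged. A real model would replace ContextExtractor, but the fit, extract, correct, refit cycle would presumably stay the same.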

Kindly provide your feedback.