Background
I was approached by a doctor specializing in infectious diseases who wanted to extract Covid-19 patient data from local government bulletins. These are daily press releases in PDF format containing a mix of tables and free text. Source files. I couldn't find any no-code, ready-to-use tools for this, so I started experimenting with the NLP library spaCy to see whether the extraction could be automated.
A couple of online tutorials later, I was able to automate it to a certain degree. A sample is shown in the image above.
Product Idea
- Upload files containing the text
- Choose how the text should be split into units (sentences, paragraphs, pages, etc.)
- Tag the data on a sample (say 5 entries)
- Run the NLP process in the background and generate a table of the extracted information. It won't be 100% accurate, of course.
- The user can then click on any row where the information is wrong, which shows the full text and lets the user tag the right bits
- Each correction updates the data table by rerunning the NLP process with the newly received inputs.
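To make the loop in the steps above concrete, here is a rough, stdlib-only Python sketch. It is not the spaCy pipeline mentioned earlier. It swaps in a deliberately naive heuristic (remember the word that precedes each tagged value, then capture whatever follows that word elsewhere) purely to show the split, tag, extract, correct, rerun cycle. The function names, field names, and sample sentences are all hypothetical.

```python
import re

def split_units(text, unit="sentence"):
    """Split raw text into the units the user chose."""
    if unit == "page":
        parts = text.split("\f")  # assumption: pages are separated by form feeds
    elif unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    else:
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def learn_cues(samples):
    """For each tagged (unit_text, field, value), store the word right before the value."""
    cues = {}
    for unit_text, field, value in samples:
        before = unit_text[:unit_text.index(value)].split()
        if before:
            cues.setdefault(field, set()).add(before[-1].lower())
    return cues

def extract(units, cues):
    """Build one table row per unit by capturing the word after a known cue."""
    rows = []
    for unit_text in units:
        row = {"text": unit_text}
        for field, words in cues.items():
            for word in sorted(words):
                match = re.search(rf"\b{re.escape(word)}\s+(\w+)", unit_text, re.IGNORECASE)
                if match:
                    row[field] = match.group(1)
                    break
        rows.append(row)
    return rows

# Made-up sample text, not taken from any real bulletin.
text = (
    "Patient 101 is a 45 year old male from Chennai. "
    "Patient 102 is a 60 year old female from Madurai. "
    "Case 103 resides in Salem."
)
units = split_units(text, "sentence")

# Step 1: the user tags a small sample.
samples = [
    (units[0], "patient_id", "101"),
    (units[0], "district", "Chennai"),
]
rows = extract(units, learn_cues(samples))

# Step 2: the third row came back empty, so the user tags it and the table is rebuilt.
samples += [
    (units[2], "patient_id", "103"),
    (units[2], "district", "Salem"),
]
rows_after = extract(units, learn_cues(samples))

for row in rows_after:
    print(row.get("patient_id"), row.get("district"))
```

In a real version, the `learn_cues` and `extract` steps would presumably be replaced by an NLP model updated with the user's tags; the surrounding loop would stay the same.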
Kindly provide your feedback.