
Split Documents Into Paragraphs

I have a large stockpile of PDFs of documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because th

Solution 1:

Here is a simpler way to approach the problem: check whether the file contains a double \nl. If it does, split the text into paragraphs on the double \nl; if it does not, split on the single \nl.

Also, I don't think \nl is a special character, since I could not find an ASCII value for it. It is probably the newline character \n, but since you asked about \nl, the example uses that (if it is in fact \n, you only need to change the part that checks for the double \nl). A rough example that detects which paragraph convention a file uses:

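with open('yourfile', 'r') as f:
    a = f.read()

# temp = 1 if paragraphs are separated by a double \nl, otherwise 0
temp = 0
# '\nl\nl' is 4 characters long in Python: newline, 'l', newline, 'l'
for z in range(len(a) - 3):
    if a[z:z+4] == '\nl\nl':
        temp = 1
        break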

After this, you can use simple string operations to find single or double \nl and replace them as needed to distinguish a new line from a new paragraph. (If the file is very large, read it in chunks; otherwise you may run into memory problems or slow code.)
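The chunked reading mentioned above could look roughly like the sketch below. It assumes the separator is really the plain newline \n rather than \nl and that the files are UTF-8 text; the function name uses_double_newline and the chunk_size parameter are made up for illustration. The last character of each chunk is carried over so that a double newline straddling two chunks is not missed.

```python
def uses_double_newline(path, chunk_size=64 * 1024):
    """Return True if the text file at `path` contains two newlines in a row."""
    tail = ""  # last character of the previous chunk
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file reached without seeing a double newline.
                return False
            # Prepend the carried-over character to catch a pair
            # that is split across a chunk boundary.
            if "\n\n" in tail + chunk:
                return True
            tail = chunk[-1]
```

Only one chunk is held in memory at a time, so this works on files that are too large to read in one go.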

Solution 2:

You say

some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs

so I would preprocess all the files to detect which ones use the double newline between paragraphs. In files that use the double \n, strip all single newline characters and reduce every double newline to a single one.

You can then pass all the files to the next stage where you detect paragraphs using a single \n character.
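A minimal sketch of this two-stage idea, using only string methods, might look like the following. The function names normalize_paragraph_breaks and split_paragraphs are hypothetical. One assumption goes beyond the description above: wrapped lines inside a paragraph are joined with a space instead of having the newline simply removed, so that words at line ends do not run together.

```python
def normalize_paragraph_breaks(text):
    """Return text with exactly one newline between paragraphs."""
    if "\n\n" not in text:
        # No double newline: assume one paragraph per line already.
        return text
    paragraphs = []
    for block in text.split("\n\n"):
        # Assumption: join the wrapped lines of a paragraph with spaces.
        lines = [line.strip() for line in block.split("\n")]
        joined = " ".join(line for line in lines if line)
        if joined:
            paragraphs.append(joined)
    return "\n".join(paragraphs)


def split_paragraphs(text):
    """Second stage: split normalized text on single newlines."""
    normalized = normalize_paragraph_breaks(text)
    return [p for p in normalized.split("\n") if p.strip()]
```

For example, split_paragraphs("line one\nline two\n\nsecond para\n") treats the first two lines as one paragraph, while a file with no blank lines is split line by line.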

Solution 3:

from nltk import tokenize

a = 'para here'
sentences = tokenize.sent_tokenize(a)
# output: a list of sentences; that's all you need
