How To Parse Unstructured Table-like Data?
Solution 1:
Say your example is 'sample.txt'.
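import pandas as pd

# A raw string keeps the regex backslashes intact; a regex separator needs the python engine.
df = pd.read_table('sample.txt', skiprows=[0, 1, 2, 3, 5], delimiter=r'\s\s+', engine='python')
print(df)
print(df.shape)

   1    2    3  4          5      6
0  1  Yes   No  6  0001 0002   True
1  2   No  Yes  7  0003 0004  False
2  3  Yes   No  6  0001 0001   True
3  4  Yes   No  6  0001 0004  False
4  4   No   No  4  0004 0004   True
5  5  Yes   No  2  0001 0001   True
6  6  Yes   No  1  0001 0001  False
7  7   No   No  2  0004 0004   True
(8, 6)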
You can of course change the data types afterwards. pd.read_table() takes many parameters, so check its documentation. pandas also has readers for xlsx, csv, html, sql, json, hdf, and even the clipboard.
Welcome to pandas...
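To see why the `\s\s+` delimiter gives six columns rather than seven, here is a minimal standard-library sketch that applies the same regex to a single data row. The values come from the printed output above; the exact spacing of the row in the real file is an assumption.

```python
import re

# One data row. The values match the output above, but the column
# spacing is assumed, since the original file is not shown.
row = "1    Yes    No    6    0001 0002    True"

# Split on runs of two or more whitespace characters, like the delimiter regex.
fields = re.split(r"\s\s+", row.strip())

print(fields)
print(len(fields))
```

Because "0001 0002" is separated by a single space only, it stays together as one field.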
Solution 2:
Try this, it should fully handle multi-row cells:
import re

def splitLine(line, delimiters):
    # Slice the line at each (start, end) pair and strip the padding.
    output = []
    for start, end in delimiters:
        output.append(line[start:end].strip())
    return output

with open("path/to/the/file.ext", "r") as f:
    # Skip the three lines that precede the header.
    _ = f.readline()
    _ = f.readline()
    _ = f.readline()
    # Collect header lines until the dashed separator line.
    headers = [f.readline()]
    nextLine = f.readline()
    while (nextLine[0] != "-"):
        headers.append(nextLine)
        nextLine = f.readline()
    # Column starts are the offsets of each label in the first header line.
    starts = []
    columnNames = set(headers[0].split())
    for each in columnNames:
        starts.extend([m.start() for m in re.finditer(re.escape(each), headers[0])])
    starts.sort()
    # None as the last end lets the final column run to the end of the line.
    delimiters = list(zip(starts, starts[1:] + [None]))
    if (len(columnNames) - 1):
        rowsPerEntry = len(headers)
    else:
        rowsPerEntry = 1
    headers = [splitLine(header, delimiters) for header in headers]
    # A key is a string for plain columns, or a list of sub-keys for multi-row columns.
    keys = []
    for i in range(len(starts)):
        if ("Header" == headers[0][i]):
            keys.append(headers[1][i])
        else:
            keys.append([])
            for header in headers:
                keys[-1].append(header[i])
    entries = []
    rows = []
    for line in f:
        rows.append(splitLine(line, delimiters))
        # Build an entry once all the rows that belong to it have been read.
        if (rowsPerEntry == len(rows)):
            if (1 == rowsPerEntry):
                entries.append(dict(zip(keys, rows[0])))
            else:
                entries.append({})
                for i, key in enumerate(keys):
                    if (str == type(key)):
                        entries[-1][key] = rows[0][i]
                    else:
                        k = "Column " + str(i+1)
                        entries[-1][k] = dict.fromkeys(key)
                        for j, subkey in enumerate(key):
                            entries[-1][k][subkey] = rows[j][i]
            rows = []
Explanation
We use the re module to find where each column label (such as "Header") appears in the first header line, which is the 4th line of the file.
The splitLine(line, delimiters) helper function returns a list with the line split into columns, as defined by the delimiters parameter. This parameter is a list of 2-item tuples, where the first item is the starting position of a column and the second item is its ending position.
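As a small illustration of how position-based delimiters work, here is a sketch on made-up data. The header and row text are hypothetical, `split_line` is a compact stand-in for the helper above, and the column starts are found with a simple `\S+` search instead of the per-label search used in the answer.

```python
import re

def split_line(line, delimiters):
    """Slice a line at each (start, end) pair and strip the padding."""
    return [line[start:end].strip() for start, end in delimiters]

# Hypothetical fixed-width header and data lines, built with ljust so the
# column offsets are unambiguous: 0, 6 and 16.
header = "Id".ljust(6) + "Name".ljust(10) + "Active"
row = "1".ljust(6) + "Alice".ljust(10) + "True"

# Each column starts where a header word begins.
starts = [m.start() for m in re.finditer(r"\S+", header)]

# Pair each start with the next one; None lets the last slice run to the end.
delimiters = list(zip(starts, starts[1:] + [None]))

print(delimiters)
print(split_line(header, delimiters))
print(split_line(row, delimiters))
```

Slicing by position rather than splitting on whitespace is what lets a cell contain spaces, or be empty, without shifting the columns after it.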
Solution 3:
I don't know what you want to do with the title, so I will go ahead and skip the first 6 lines. The spacing is not consistent, so you first need to make it consistent between records; otherwise it will be hard to read the file line by line. You can do something like this:
import re

def read_file():
    with open('unstructured-data.txt', 'r') as f:
        for line in f.readlines()[6:]:
            line = re.sub(" +", " ", line)
            print(line)
            record = line.split(" ")
            print(record)

read_file()
which will give you something like this
1 Yes No 6 0001 0002 True
['1', 'Yes', 'No', '6', '0001', '0002', 'True', '\n']
2 No Yes 7 0003 0004 False
['2', 'No', 'Yes', '7', '0003', '0004', 'False', '\n']
3 Yes No 6 0001 0001 True
['3', 'Yes', 'No', '6', '0001', '0001', 'True', '\n']
4 Yes No 6 0001 0004 False
['4', 'Yes', 'No', '6', '0001', '0004', 'False', '\n']
4 No No 4 0004 0004 True
['4', 'No', 'No', '4', '0004', '0004', 'True', '\n']
5 Yes No 2 0001 0001 True
['5', 'Yes', 'No', '2', '0001', '0001', 'True', '\n']
6 Yes No 1 0001 0001 False
['6', 'Yes', 'No', '1', '0001', '0001', 'False', '\n']
7 No No 2 0004 0004 True
['7', 'No', 'No', '2', '0004', '0004', 'True\n']
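The lists above still carry the line ending, either as a separate '\n' element or attached to the last value. A possible variation, sketched here on a single sample row whose spacing is assumed, is to call split() with no argument, which splits on any run of whitespace and drops the newline.

```python
import re

# A sample row with inconsistent spacing and a trailing newline.
# The spacing is assumed; only the values come from the output above.
line = "1   Yes  No    6  0001 0002   True \n"

# Approach from the answer: collapse runs of spaces, then split on one space.
collapsed = re.sub(" +", " ", line)
with_newline = collapsed.split(" ")
print(with_newline)

# Variation: split() with no argument splits on any whitespace run and
# ignores leading and trailing whitespace, so no newline is left over.
clean = line.split()
print(clean)
```

With this variation the re.sub step is no longer needed, since split() already treats consecutive spaces as one separator.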