
How To Parse Unstructured Table-like Data?

I have a text file that holds some result of an operation. The data is displayed in a human-readable format (like a table). How do I parse this data so that I can form a data struc

Solution 1:

Say your example is 'sample.txt'.

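import pandas as pd

# A regex delimiter needs a raw string and the Python parsing engine.
df = pd.read_table('sample.txt', skiprows=[0, 1, 2, 3, 5], delimiter=r'\s\s+', engine='python')
print(df)
print(df.shape)

Output:

   1    2    3  4          5      6
0  1  Yes   No  6  0001 0002   True
1  2   No  Yes  7  0003 0004  False
2  3  Yes   No  6  0001 0001   True
3  4  Yes   No  6  0001 0004  False
4  4   No   No  4  0004 0004   True
5  5  Yes   No  2  0001 0001   True
6  6  Yes   No  1  0001 0001  False
7  7   No   No  2  0004 0004   True
(8, 6)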

You can change the data types, of course. pd.read_table() takes many parameters, so check its documentation. pandas also has readers for xlsx, csv, html, sql, json, hdf, and even the clipboard.
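As a rough sketch of changing the data types while reading: the snippet below uses a made-up in-memory table (the column names and rows are assumptions, not the contents of 'sample.txt') and the same two-or-more-spaces separator. The dtype argument keeps a column as text, and a Yes/No column is mapped to booleans afterwards.

```python
import io
import pandas as pd

# Hypothetical stand-in for a table-like file; the real 'sample.txt' is not shown.
sample = io.StringIO(
    "Id  Flag A  Flag B  Count  Codes      Ok\n"
    "1   Yes     No      6      0001 0002  True\n"
    "2   No      Yes     7      0003 0004  False\n"
)

df = pd.read_csv(
    sample,
    sep=r"\s\s+",          # two or more whitespace characters separate columns
    engine="python",       # regex separators require the Python parser
    dtype={"Codes": str},  # force the code column to stay text
)

# Convert a Yes/No column to real booleans after loading.
df["Flag A"] = df["Flag A"].map({"Yes": True, "No": False})

print(df.dtypes)
```

Single spaces inside a value (as in "0001 0002") survive because only runs of two or more spaces count as a separator.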

welcome to pandas...

Solution 2:

Try this, it should fully handle multi-row cells:

import re

def splitLine(line, delimiters):
    output = []
    for start, end in delimiters:
        output.append(line[start:end].strip())
    return output

with open("path/to/the/file.ext", "r") as f:
    _ = f.readline()
    _ = f.readline()
    _ = f.readline()

    headers = [f.readline()]
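    nextLine = f.readline()
    # Collect header rows until the dashed separator line (or end of file).
    while nextLine and not nextLine.startswith("-"):
        headers.append(nextLine)
        nextLine = f.readline()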

    starts = []
    columnNames = set(headers[0].split())
    for each in columnNames:
        # Store the integer start offsets (match objects cannot be sorted).
        starts.extend(m.start() for m in re.finditer(re.escape(each), headers[0]))
    starts.sort()
    # Use None as the last end so the final column runs to the end of the line.
    delimiters = list(zip(starts, starts[1:] + [None]))

    if (len(columnNames) - 1):
        rowsPerEntry = len(headers)
    else:
        rowsPerEntry = 1

    headers = [splitLine(header, delimiters) for header in headers]
    keys = []
    for i in range(len(starts)):
        if ("Header" == headers[0][i]):
            keys.append(headers[1][i])
        else:
            keys.append([])
            for header in headers:
                keys[-1].append(header[i])

    entries = []
    rows = []
    for line in f:
        rows.append(splitLine(line, delimiters))
        if (rowsPerEntry == len(rows)):
            if (1 == rowsPerEntry):
                entries.append(dict(zip(keys, rows[0])))
            else:
                entries.append({})
                for i, key in enumerate(keys):
                    if (str == type(key)):
                       entries[-1][key] = rows[0][i]
                    else:
                       k = "Column " + str(i+1)
                       entries[-1][k] = dict.fromkeys(key)
                       for j, subkey in enumerate(key):
                           entries[-1][k][subkey] = rows[j][i]
            rows = []

Explanation

We use the re module to find where each column name (for example "Header") appears in the first header row, which is the 4th line of the file.

The splitLine(line, delimiters) auxiliary function returns a list of the line split into columns as defined by the delimiters parameter. This parameter is a list of 2-item tuples, where the first item is the starting position and the second one is the ending position.
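To make the slicing concrete, here is a minimal sketch of the same idea on a made-up header and data row (both strings, and the name split_line, are illustrative assumptions). The start of each column is taken from where its name begins in the header, and None is used as the final end position.

```python
def split_line(line, delimiters):
    """Slice `line` at each (start, end) pair and strip the padding."""
    return [line[start:end].strip() for start, end in delimiters]

# Hypothetical fixed-width rows; every value starts under its column name.
header = "Id   Name      Status\n"
row = "1    widget    ok\n"

# Column starts come from where each (unique) name begins in the header.
starts = [header.index(name) for name in header.split()]
# Pair each start with the next start; None means "to the end of the line".
delimiters = list(zip(starts, starts[1:] + [None]))

print(delimiters)
print(split_line(row, delimiters))
```

This only works when values never spill to the left of their column name, which is the assumption the header-position approach relies on.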

Solution 3:

I don't know what you want to do with the title, so I will go ahead and skip the first 6 lines. The spacing is not consistent, so you first need to make the spacing between fields consistent; otherwise it will be hard to read the file line by line. You can do something like this:

import re

def read_file():
    with open('unstructured-data.txt', 'r') as f:
        for line in f.readlines()[6:]:
            line = re.sub(" +", " ", line)
            print(line)
            record = line.split(" ")
            print(record)

read_file()

which will give you something like this

1 Yes No 6 0001 0002 True

['1', 'Yes', 'No', '6', '0001', '0002', 'True', '\n']
2 No Yes 7 0003 0004 False

['2', 'No', 'Yes', '7', '0003', '0004', 'False', '\n']
3 Yes No 6 0001 0001 True

['3', 'Yes', 'No', '6', '0001', '0001', 'True', '\n']
4 Yes No 6 0001 0004 False

['4', 'Yes', 'No', '6', '0001', '0004', 'False', '\n']
4 No No 4 0004 0004 True

['4', 'No', 'No', '4', '0004', '0004', 'True', '\n']
5 Yes No 2 0001 0001 True

['5', 'Yes', 'No', '2', '0001', '0001', 'True', '\n']
6 Yes No 1 0001 0001 False

['6', 'Yes', 'No', '1', '0001', '0001', 'False', '\n']
7 No No 2 0004 0004 True

['7', 'No', 'No', '2', '0004', '0004', 'True\n']
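The trailing '\n' entries in the lists above come from splitting on a single space. A possible variation, sketched here on made-up lines (parse_rows and the sample data are assumptions, not part of the original file), is to call str.split() with no argument, which collapses any run of whitespace and drops the newline in one step.

```python
def parse_rows(lines, skip=6):
    """Split each line after the first `skip` lines on runs of whitespace."""
    records = []
    for line in lines[skip:]:
        fields = line.split()  # no argument: any whitespace run, newline dropped
        if fields:             # ignore blank lines
            records.append(fields)
    return records

# Hypothetical input: six title lines followed by unevenly spaced rows.
sample_lines = ["title\n"] * 6 + [
    "1   Yes  No   6 0001 0002   True \n",
    "\n",
    "2 No Yes 7 0003 0004 False\n",
]

for record in parse_rows(sample_lines):
    print(record)
```

With a real file, the list of lines could come from f.readlines() as in the answer above.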
