Best way to parse data from a large, mixed-format text file

I'm considering a new way to parse data from a large, mixed-format text data file. Currently we call a C program to parse the data, use Mex functions to store the data in Matlab-compatible structures, and then save the parsed data in a .m file, which Matlab can then read in to access the desired data. (There is also a definition file, read by the C parser, that allows customizing which data is returned in the structure.)
While this works, it is an old program, and system upgrades often cause library and Mex compatibility problems that must be resolved. I'd like to create a new data parser that returns data in a similar manner to the C parser (in fact, I need to continue supporting the same data output for existing scripts) while allowing some enhancements to the parsing. I'm looking for suggestions on how I might do this.
I'm considering Java (because I'm familiar with Java programming), but looking at the Matlab-Java interface, it seems to offer only basic methods for transferring data from Java to Matlab.
For example, I would want to parse the data file into, say, an array of 2000 structures, and then pass that array of structures into Matlab. Fortunately, each structure is relatively simple: the fields would be either a string, an array of numbers, or an array of strings (although one is an array of arrays of numbers, but that could be converted within Matlab).
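One hedged sketch of the Java route: MATLAB can hold references to Java objects directly, and when you read a public field it auto-converts common types (String to char, double[] to a double vector, String[] to something you can wrap with cell()). The class below is purely illustrative; the field names are invented, not taken from the existing parser.

```java
// Hypothetical container for one parsed record (all names are illustrative).
// An array of these (ParsedRecord[]) can be returned to MATLAB, which will
// auto-convert the field types when they are read.
public class ParsedRecord {
    public String id;          // e.g. a block identifier
    public double[] values;    // numeric fields from the line
    public String[] tags;      // string fields from the line

    public ParsedRecord(String id, double[] values, String[] tags) {
        this.id = id;
        this.values = values;
        this.tags = tags;
    }
}
```

In MATLAB that might look like `recs = MyParser.parseFile('data.txt'); v = recs(1).values;` — again assuming a hypothetical `MyParser` entry point.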

2 Comments

Could you post an example of the mixed-format text data file (here, or on an upload facility with a link) and give a brief description, if feasible, of what you want to import?
The exact format of the file isn't that important. If it helps, imagine rows from several tables of a relational database stored in alternating blocks in a flat text file. I'm more curious about suggestions for a general approach rather than a specific routine to parse our particular data. I'm also curious whether anyone has written some kind of parser in Java and tried to import a large chunk of data into Matlab.


Answers (1)

Have you considered putting the data into a database and then using database calls to get the data out?
It seems from your description that you have already implemented a database of sorts, with the mixed format text file and C program serving as the database.
If whatever was generating the text data had the ability to export to a database, it might be a very effective go-between that takes some of the maintenance and upkeep of the existing code out of your hands.

5 Comments

I don't have any control over the source of the data. There have been talks of outputting the data into a database, but it hasn't happened so far. I do know that someone once took the data file, converted it into a database, and then parsed out the data they wanted with an SQL query or something. I think that requires more overhead than writing a program to parse the data file directly and feed it to Matlab, though. For now I'm still looking for suggestions on a good way to parse the data with something outside of Matlab, and then pass the parsed data into Matlab. I don't think the parser itself will be very difficult, but there needs to be a fast and efficient way to pass the data from the parser into Matlab.
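One hedged pattern for that hand-off (the file layout here is invented for illustration): have the external parser write a simple delimited intermediate file, one record per line, which MATLAB can then slurp back in a single textscan or dlmread call instead of line-by-line.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative writer: one tab-separated line per parsed numeric record,
// so MATLAB can read the whole file back in one pass.
public class IntermediateWriter {
    public static void write(Path out, List<double[]> records) throws IOException {
        try (PrintWriter pw = new PrintWriter(Files.newBufferedWriter(out))) {
            for (double[] rec : records) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < rec.length; i++) {
                    if (i > 0) sb.append('\t');  // tab-delimit the fields
                    sb.append(rec[i]);
                }
                pw.println(sb);
            }
        }
    }
}
```

The same idea extends to string fields with a second file or a quoting convention; the point is that the slow mixed-format parsing happens outside MATLAB, and MATLAB only ever sees a regular format.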
I would try to do everything in Matlab:
- bulk import with textscan, then parse with regexp, or
- parse during import with textscan (depends on how blocks are identified in the flat file and on the size of the file itself)
textscan() presumes a consistent data format. For this data, one line might have 10 data fields, the next 20, and the next 15. Granted, I could pull in each line as one string and have Matlab sort it out after reading in the whole file. One data file might be a few hundred MB, but I might want to support reading files a few GB in size. Considering I want to actually work with only a portion of that data, I'm not sure Matlab would process it as quickly or efficiently as an external program like our current C parser. My past efforts at parsing even moderately sized text files in Matlab (using something like readline()) have proven I can do it faster in something like Perl and then pass the data I want into Matlab. I haven't tried using textscan to read in a large mixed-format data file and then doing all the parsing within Matlab. Have you tried that, and can it process the file relatively quickly?
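For comparison, a minimal Java sketch of the external-parser approach, assuming (purely for illustration) that each line starts with a tag identifying which "table" the row belongs to. Reading line by line keeps memory bounded even for GB-scale files, and lines with varying field counts are handled naturally.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative streaming parser: each line may carry any number of fields,
// and an invented leading tag says which block/table the row belongs to.
public class StreamingParser {
    public static List<double[]> parseNumericRows(Path file, String wantedTag)
            throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length < 2 || !fields[0].equals(wantedTag)) {
                    continue;  // skip blank lines and rows from other blocks
                }
                // field count varies per line, so size the row from the split
                double[] row = new double[fields.length - 1];
                for (int i = 1; i < fields.length; i++) {
                    row[i - 1] = Double.parseDouble(fields[i]);
                }
                rows.add(row);
            }
        }
        return rows;
    }
}
```

Only the requested block is kept in memory, which matches the "work with only a portion of the data" goal; the tag convention would of course be replaced by however blocks are actually marked in the real file.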
We had to import .txt files of up to 400 MB at one point. I don't remember the exact processing time for the whole bunch, but it was around a few minutes for 40 files totaling ~4 GB. If that suits you, then you could post a reduced example of your file.
If you don't want to go the DB route, Perl is indeed another very good option. It also lets you move up a layer from C and will likely insulate you from some of the low-level churn you mentioned, since your code is more likely to be portable between platforms.
MATLAB is also highly portable and available on many platforms. There are definite benefits to keeping an all-MATLAB solution in terms of maintenance and keeping all the requisite parts together.
It's also a question of what your colleagues and organization would like to work with. If you are a bunch of Java, C, and M folks, then adding another language into the mix is likely not a good match, since you're likely to spend more time on code maintenance than you spent designing the thing in the first place.


Asked on 12 May 2011
