MATLAB Answers

Unix code check and REMOVE the datapoints ranging outside 9:00am and 4:15pm for a second by second dataset

Harsh Rob on 20 Aug 2019
Commented: Jan on 21 Aug 2019
I have a list of about 70 million rows. I want to clean the dataset by deleting the following:
  1. Any values which are 0 or in the range of 0.001 or less.
  2. Any values that lie outside the range of 9:00am and 4:15pm
  3. If multiple quotes are present with the same time stamp, then replace that with a single entry of the median price.
I am able to achieve the third point, but not the first and the second. Can someone guide me with this? Thanks


Jan on 20 Aug 2019
You did not mention how the data are represented. The number of 70 million does not matter (the solution will of course work for 173 rows as well), but it is necessary to know what exactly "rows" and "a list" are.
What does "clean the dataset" mean for "Any values which are 0 or in the range of 0.001 or less"? A "range" needs two values to be defined, and 0 is already "less than 0.001", so why mention it explicitly? How is the time represented? As a serial date number or as a string?
You explain the meaning of the first 3 rows. Then you mention that only the first 2 columns matter.
What does "Unix code check" mean? This is a MATLAB forum, and it should not matter whether the data come from a Unix machine.
Harsh Rob on 20 Aug 2019
Apologies for the confusion caused.
This is the description of the RAW dataset I have:
Column 1 contains the timestamp in the unix format (a number). It NEEDS to be a part of the cleaned data.
I want to delete all datapoints that fall on a weekend or outside the range of 9:00 hrs to 16:15 hrs. We can do this either by converting the timestamp into dd/mm/yyyy hh:mm:ss format or, if possible, by filtering directly on the unix format (number).
Column 2 contains the price data. It NEEDS to be a part of the cleaned data.
If the price is 0, delete the entire row.
If the price is less than 0.001, delete the entire row.
If several rows have the same timestamp, replace them with a single row holding the median price for that timestamp. (I have figured this one out using the unique and accumarray functions.)
Column 3 is NOT NEEDED in the cleaned data.
It is not required for my calculations, but it is part of the RAW data. It can be deleted as well.
Does this explanation make sense?
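For illustration, the price rules and the median step described above could look roughly like this. It is only a sketch under assumptions the thread does not confirm: the data is taken to be already imported as a numeric N-by-3 matrix, the variable name `raw` is hypothetical, and the sample values are made up. The threshold is read as "strictly less than 0.001 is removed".

```matlab
% Hypothetical sample: [unixTime, price, unusedColumn]
raw = [1566291600 101.5  7
       1566291600 102.5  7
       1566291600 110.0  7
       1566291601 0      7
       1566291602 0.0005 7
       1566291603 99.0   7];

t = raw(:, 1);
p = raw(:, 2);            % column 3 is dropped simply by not copying it

% Price rules: one comparison covers both, because 0 is below 0.001.
keep = p >= 0.001;
t = t(keep);
p = p(keep);

% Duplicate timestamps: one row per timestamp, holding the median price.
[tUnique, ~, idx] = unique(t);
pMedian = accumarray(idx, p, [], @median);

cleaned = [tUnique, pMedian];
```

Filtering the prices before taking the median means that invalid quotes do not influence the median of a timestamp that also has valid quotes. Whether that order is the intended one is an assumption.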
Jan on 21 Aug 2019
@Harsh Rob: I cannot know what "RAW dataset" means. Is it a binary or a text file? Have you been able to import it already? Converting the time to a datevec or a datetime object allows you to create a matching filter easily.
It is still not clear how your data are represented. A "timestamp in unix format" could be a UINT64, or a string containing the digits of the UINT64, or something else.
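As a rough sketch of such a filter, assume the timestamps are already available as a numeric column vector of seconds since 1970-01-01 00:00:00 UTC. The sample values below are made up. The time zone is an assumption as well: 'UTC' is used here, and the 9:00 to 16:15 window would have to be evaluated in the local time of the market instead. Both ends of the window are treated as inclusive.

```matlab
% Hypothetical Unix timestamps in seconds (sample values, not from the thread):
%   1566291599  Tue 20 Aug 2019 08:59:59 UTC  (before the window)
%   1566291600  Tue 20 Aug 2019 09:00:00 UTC  (first second kept)
%   1566317700  Tue 20 Aug 2019 16:15:00 UTC  (last second kept)
%   1566317701  Tue 20 Aug 2019 16:15:01 UTC  (after the window)
%   1566648000  Sat 24 Aug 2019 12:00:00 UTC  (weekend)
t = [1566291599; 1566291600; 1566317700; 1566317701; 1566648000];

% Variant 1: convert to datetime and filter on time of day and weekday.
dt  = datetime(t, 'ConvertFrom', 'posixtime', 'TimeZone', 'UTC');
tod = timeofday(dt);      % duration since midnight
keepDt = tod >= hours(9) & tod <= hours(16) + minutes(15) & ~isweekend(dt);

% Variant 2: the same filter directly on the numbers (UTC only).
secOfDay  = mod(t, 86400);                  % seconds since midnight
dayOfWeek = mod(floor(t / 86400) + 4, 7);   % 0 = Sunday ... 6 = Saturday
keepNum = secOfDay >= 9*3600 & secOfDay <= 16*3600 + 15*60 ...
          & dayOfWeek ~= 0 & dayOfWeek ~= 6;

tClean = t(keepDt);
```

The offset of 4 in the second variant comes from 1 January 1970 being a Thursday. The purely numeric form avoids creating 70 million datetime values, but it cannot account for daylight saving time, so it only matches the datetime variant for a fixed offset such as UTC.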
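To illustrate why the representation matters, here is a small sketch that brings two of the possible forms to the same numeric type before any filtering. The variable names and values are hypothetical, and which form the real data has is still unknown.

```matlab
% The same two hypothetical timestamps in two possible representations
tInt = uint64([1566291600; 1566317700]);   % stored as UINT64
tStr = {'1566291600'; '1566317700'};       % stored as digit strings

% Convert both to double. Timestamps in seconds are far below 2^53,
% so the conversion is exact.
tFromInt = double(tInt);
tFromStr = str2double(tStr);

sameValues = isequal(tFromInt, tFromStr);
```

If the values turned out to be milliseconds rather than seconds, they would have to be divided by 1000 first. That is another assumption that only an example of the input can settle.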
Please post a small example of the inputs.


Answers (0)
