Monday, November 13, 2006

Regular Expressions and Dates

Regular expressions are great, and I think programmers should always keep in practice with them, because you never know when it'll be useful to parse through text. When I was first learning regular expressions, I practised with dates. Eventually, I wrote myself a short program in Ruby. It takes a string as input, and if the string suggests a date, the program is supposed to figure out which date. Lots of things can be done with that date, such as reporting the day of the week, how many days the date is from today, or when the end of the month is. All the calculations are trivial once a date is given. The tricky part was determining whether a string specified a date and, if so, extracting the date.

As I said, I used regular expressions to figure out what the date meant. I thought of all the formats a user might use and made a case for each of them. This has actually been a useful exercise, and I've used the results of it a lot. The following are all the cases that I could think of, though I very much doubt that I got all of them. Remember, this is for Ruby, so the regexp syntax may be different for you. The i means ignore case, and \A and \Z can be used to specify the beginning and end of the string respectively (strictly, \Z also matches just before a trailing newline; \z is the absolute end). The only way I can think of to improve the recognition for now is to allow for spelling errors. Can you think of any other ways a person might reasonably suggest a date as a string? Here goes.

1. Of the format MM/DD/YY, MM-DD-YY, or MM\DD\YY. YY may be YYYY in any of the cases. This is the trickiest pattern, since the user may mean either Year/Month/Day or Month/Day/Year. Two digits for the year make it difficult to guess what the user meant. I chose to use the Month/Day/Year form unless the user leads with four digits for the year, at which point it is easy to figure out what they meant. Therefore, YYYY/MM/DD is also a valid format. As for the year, if four digits are not specified, then I assume that they mean the current millennium (the 2000s).
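The rules for this case can be sketched as follows. This is only an illustration under stated assumptions, not the original expression: the names SLASHED and parse_slashed are hypothetical, a leading four-digit group is taken to mean YYYY/MM/DD, and a one- or two-digit year is placed in the 2000s.

```ruby
require 'date'

# Hypothetical pattern for case 1: three number groups separated by /, - or \.
# The first and last groups may be a four-digit year or a one/two-digit number.
SLASHED = /\A(\d{4}|\d{1,2})[\/\-\\](\d{1,2})[\/\-\\](\d{4}|\d{1,2})\z/

# Hypothetical helper: returns a Date, or nil if the string is not a real date.
def parse_slashed(str)
  m = SLASHED.match(str.strip)
  return nil unless m
  a, b, c = m[1], m[2], m[3]
  if a.length == 4
    # Four leading digits: read as YYYY/MM/DD.
    year, month, day = a.to_i, b.to_i, c.to_i
  else
    # Otherwise read as MM/DD/YY or MM/DD/YYYY.
    month, day, year = a.to_i, b.to_i, c.to_i
    year += 2000 if c.length <= 2   # assumption: short years are in the 2000s
  end
  Date.valid_date?(year, month, day) ? Date.new(year, month, day) : nil
end
```

With this sketch, parse_slashed("11/13/06") and parse_slashed("2006-11-13") both give November 13, 2006, while an impossible date such as "13/45/06" gives nil.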

2. Of the format "June 27, 1983", "Jun 27, 1983", or "June 27" (in which the current year is implied). Granted, "Junileropwf 27, 1983" would also be valid here, but I didn't think such cases were important enough to detect; catching them would only hurt the readability of the expression. In specifying the month, a minimum of 3 characters is required. The comma is optional, but a space must separate the tokens. If a day and year are not specified, then day one of the specified month in the current year should be used.
/(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)([^\s]* (\d+)(.*))?/i
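To show how the captured groups of that expression might be turned into a date, here is a minimal sketch. The helper name parse_month_name, the MONTHS table, and the optional today argument are assumptions made for illustration; the year is simply taken as the first run of digits after the day.

```ruby
require 'date'

# The expression from case 2, unchanged.
MONTH_NAME = /(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)([^\s]* (\d+)(.*))?/i

# Hypothetical lookup table: index + 1 gives the month number.
MONTHS = %w[jan feb mar apr may jun jul aug sep oct nov dec]

# Hypothetical helper: returns a Date, or nil if nothing usable is found.
def parse_month_name(str, today = Date.today)
  m = MONTH_NAME.match(str)
  return nil unless m
  month = MONTHS.index(m[1].downcase) + 1
  day   = m[3] ? m[3].to_i : 1              # no day given: first of the month
  year_text = m[4] && m[4][/\d+/]           # first digits after the day, if any
  year  = year_text ? year_text.to_i : today.year
  Date.valid_date?(year, month, day) ? Date.new(year, month, day) : nil
end
```

Passing today explicitly makes the implied-year behaviour easy to check: parse_month_name("Jun 27", Date.new(2006, 11, 13)) gives June 27, 2006.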

3. Of the format YYYYMMDD. It has to be in that order; I didn't make allowances in this case.

4. Of the format YYYY. The date is interpreted as Jan 1 of the year YYYY.
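Cases 3 and 4 are both plain runs of digits, so they can be sketched together. The patterns and the parse_digits helper below are hypothetical names, written under the assumption that the whole string must consist of exactly eight or exactly four digits.

```ruby
require 'date'

# Hypothetical patterns for cases 3 and 4.
COMPACT   = /\A(\d{4})(\d{2})(\d{2})\z/   # YYYYMMDD
YEAR_ONLY = /\A(\d{4})\z/                 # YYYY

# Hypothetical helper: returns a Date, or nil if neither pattern applies.
def parse_digits(str)
  s = str.strip
  if (m = COMPACT.match(s))
    y, mo, d = m[1].to_i, m[2].to_i, m[3].to_i
    Date.valid_date?(y, mo, d) ? Date.new(y, mo, d) : nil
  elsif (m = YEAR_ONLY.match(s))
    Date.new(m[1].to_i, 1, 1)   # a bare year means Jan 1 of that year
  end
end
```

Anchoring both patterns keeps an eight-digit string from being mistaken for a year followed by junk.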

5. Last, next, or current day of the week (e.g. last Thu, next Thursday, or just Thursday). I used a minimum of three letters for the weekday name to avoid matching cases I didn't want it to match. Once again, I didn't test that the letters following the first three were correct, because I didn't think it improved the match.
/(last |next |this |)(sun|mon|tue|wed|thu|fri|sat).*/i
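Turning that match into a date takes a little weekday arithmetic. The sketch below is one possible reading, not necessarily what the original program does: it assumes "last" means the most recent such day strictly before today, "next" means the first such day strictly after today, and a bare name or "this" means today if it matches, otherwise the upcoming one. The names parse_weekday and DAYS and the today argument are hypothetical.

```ruby
require 'date'

# The expression from case 5, unchanged.
WEEKDAY = /(last |next |this |)(sun|mon|tue|wed|thu|fri|sat).*/i

# Hypothetical lookup table: the index matches Date#wday (0 = Sunday).
DAYS = %w[sun mon tue wed thu fri sat]

# Hypothetical helper: returns a Date, or nil if no weekday name is found.
def parse_weekday(str, today = Date.today)
  m = WEEKDAY.match(str)
  return nil unless m
  target = DAYS.index(m[2].downcase)
  ahead  = (target - today.wday) % 7        # 0 when today is that weekday
  case m[1].downcase.strip
  when 'last' then today - ((today.wday - target - 1) % 7 + 1)  # 1..7 days back
  when 'next' then today + ((ahead - 1) % 7 + 1)                # 1..7 days ahead
  else             today + ahead                                # 0..6 days ahead
  end
end
```

Taking Monday, November 13, 2006 as today, "last Thu" resolves to November 9 and "next Thursday" to November 16 under these assumptions.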

6. Yesterday, today, tomorrow. Also, my wife and I made up the words Yupterday (the day before yesterday) and Threemorrow (the day after tomorrow). [We wanted to make up for the lack of an English word for these common references. Yes, we're pretty goofy people. Please feel free to adopt the words.] The test for these is really simple, and a regexp isn't even really needed. It's actually less efficient if you want single matches, O(nm) instead of O(n). Here's the case for the heck of it:
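One way this case could be written, as a sketch rather than the original expression: the RELATIVE pattern, the OFFSETS table, and the parse_relative helper are hypothetical names, and the word is assumed to be the entire input.

```ruby
require 'date'

# Hypothetical table mapping each word to an offset in days from today.
OFFSETS = {
  'yupterday'   => -2,
  'yesterday'   => -1,
  'today'       =>  0,
  'tomorrow'    =>  1,
  'threemorrow' =>  2
}

# Hypothetical pattern for case 6: exactly one of the five words.
RELATIVE = /\A(yupterday|yesterday|today|tomorrow|threemorrow)\z/i

# Hypothetical helper: returns a Date, or nil if the word is not recognised.
def parse_relative(str, today = Date.today)
  m = RELATIVE.match(str.strip)
  m && today + OFFSETS[m[1].downcase]
end
```

As the text notes, the regexp is not really needed here: looking up str.strip.downcase directly in OFFSETS would do the same job.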

If you think the program I described might be useful, here's the whole current version as HTML. Do whatever you want with it. I've found it useful and added a bunch of date-converting methods for flavor (like getting the end of the month, returning day statistics, or getting the number of days between two dates), but the heart of the whole thing is still the regular expressions.
