Monday, November 13, 2006

Regular Expressions and Dates

Regular expressions are great, and I think programmers should keep in practice with them because you never know when you'll need to parse some text. When I was first learning regular expressions, I practised with dates. Eventually, I wrote myself a short program in Ruby. It takes a string as input and, if the string suggests a date, works out which date is meant. Lots of things can be done with that date, such as reporting the day of the week, how many days it is from today, or when the end of its month falls. All of those calculations are trivial once a date is given. The tricky part was determining whether a string specified a date at all and, if so, extracting it.
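To show how little work is left once a date is in hand, here is a minimal sketch using Ruby's standard date library. The helper names (weekday_name, days_from, end_of_month) are made up for this illustration and are not taken from the original program.

```ruby
require 'date'

# Hypothetical helpers, named here only for illustration.

# Full English weekday name, e.g. "Monday".
def weekday_name(date)
  date.strftime('%A')
end

# Signed number of days from `today` to `date` (negative if in the past).
def days_from(date, today = Date.today)
  (date - today).to_i
end

# Last day of the month containing `date` (day -1 counts from the end).
def end_of_month(date)
  Date.new(date.year, date.month, -1)
end
```

For example, weekday_name(Date.new(2006, 11, 13)) gives "Monday", the day this post was written.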

As I said, I used regular expressions to figure out what the date meant. I thought of all the formats a user might use and made a case for each of them. This has been a genuinely useful exercise, and I've used the results a lot. The following are all the cases I could think of, though I very much doubt that I got all of them. Remember, this is for Ruby, so the regex syntax may be different for you. The trailing i means ignore case, and \A and \Z can be used to anchor the beginning and end of the string respectively. The only way I can think of to improve the recognition for now is to allow for spelling errors. Can you think of any other ways a person might reasonably suggest a date as a string? Here goes.

1. Of the format MM/DD/YY, MM-DD-YY, or MM\DD\YY. YY may be YYYY in any of these cases. This is the trickiest pattern, since the user may mean either Year/Month/Day or Month/Day/Year, and a two-digit year makes it hard to tell which was intended. I chose to assume Month/Day/Year unless the user gives four digits for the year, at which point it is easy to figure out what they meant. Therefore, YYYY/MM/DD is also a valid format. As for the year, if four digits are not given, I assume the user means the current millennium (the 2000s).
/\A(\d+)\s*(-|\/|\\)\s*(\d+)\s*((-|\/|\\)\s*(\d+))?/
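A rough sketch of how the captures from this pattern might be turned into a date, following the rules described above. The method name parse_numeric_date, the today parameter, and the decision to return nil for an invalid date or for a four-digit year with no day are assumptions of this sketch, not details from the original program.

```ruby
require 'date'

NUMERIC_DATE = /\A(\d+)\s*(-|\/|\\)\s*(\d+)\s*((-|\/|\\)\s*(\d+))?/

# Hypothetical helper: returns a Date, or nil if the string is not a usable date.
def parse_numeric_date(str, today = Date.today)
  m = NUMERIC_DATE.match(str.strip)
  return nil unless m

  first, second, third = m[1], m[3], m[6]
  if first.length == 4
    # Four leading digits: read as YYYY/MM/DD.
    return nil unless third # assumption: a year and month alone are rejected
    year, month, day = first.to_i, second.to_i, third.to_i
  else
    # Otherwise read as MM/DD with an optional year.
    month, day = first.to_i, second.to_i
    year = if third.nil? then today.year        # no year given: current year
           elsif third.length == 4 then third.to_i
           else 2000 + third.to_i               # short year: assume the 2000s
           end
  end
  Date.valid_date?(year, month, day) ? Date.new(year, month, day) : nil
end
```

With this, parse_numeric_date("6/27/1983") and parse_numeric_date("1983-06-27") both give June 27, 1983, while something like "13/45/2006" comes back as nil.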

2. Of the format "June 27, 1983", "Jun 27, 1983", or "June 27" (in which case the current year is implied). Granted, "Junileropwf 27, 1983" would also be valid here, but I didn't think such cases were important enough to detect; handling them would only hurt the readability of the expression. At least the first three characters of the month are required. The comma is optional, but at least one space must separate the tokens. If the day and year are not specified, then day one of the given month in the current year is used.
/(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)([^\s]* (\d+)(.*))?/i
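Here is one way the captures could be interpreted, as a sketch only. The pattern leaves the year inside the trailing (.*) group, so pulling the first run of digits out of that group is an assumption of this example, as are the names parse_month_date and MONTHS.

```ruby
require 'date'

MONTH_DATE = /(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)([^\s]* (\d+)(.*))?/i
MONTHS = %w[jan feb mar apr may jun jul aug sep oct nov dec].freeze

# Hypothetical helper: returns a Date, or nil if no month name is found
# or the resulting date is not valid.
def parse_month_date(str, today = Date.today)
  m = MONTH_DATE.match(str)
  return nil unless m

  month = MONTHS.index(m[1].downcase) + 1
  day = m[3] ? m[3].to_i : 1                    # no day given: first of the month
  year_match = m[4] && m[4].match(/\d+/)        # assumption: first digits after the day are the year
  year = year_match ? year_match[0].to_i : today.year
  Date.valid_date?(year, month, day) ? Date.new(year, month, day) : nil
end
```

Both "June 27, 1983" and "Jun 27 1983" come out as the same date, and a bare "June" falls back to June 1 of the current year.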

3. Of the format YYYYMMDD. It has to be in that order; I didn't make any allowances in this case.
/(\d\d\d\d)(\d\d)(\d\d)/

4. Of the format YYYY. The date is interpreted as Jan 1 of the year YYYY.
/(\d\d\d\d)/
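As written, neither of the two digit-only patterns is anchored, so the four-digit one would also match the start of a YYYYMMDD string. A small sketch of one way to keep them apart follows: it adds \A and \z anchors and tries the longer form first. The anchors, the method name parse_compact, and the nil result for invalid dates are assumptions of this example rather than part of the original program.

```ruby
require 'date'

# Anchored variants of cases 3 and 4 (the anchors are an addition in this sketch).
COMPACT_DATE = /\A(\d\d\d\d)(\d\d)(\d\d)\z/
YEAR_ONLY    = /\A(\d\d\d\d)\z/

# Hypothetical helper: returns a Date, or nil if the string fits neither form.
def parse_compact(str)
  s = str.strip
  if (m = COMPACT_DATE.match(s))
    y, mo, d = m[1].to_i, m[2].to_i, m[3].to_i
    Date.valid_date?(y, mo, d) ? Date.new(y, mo, d) : nil
  elsif (m = YEAR_ONLY.match(s))
    Date.new(m[1].to_i, 1, 1) # a bare year means January 1 of that year
  end
end
```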

5. Last, next, or current day of the week (i.e. "last Thu", "next Thursday", "this Thursday", or just "Thursday"). I required a minimum of three letters for the weekday name to avoid matching cases I didn't want it to match. Once again, I didn't check that the letters following the first three were correct, because I didn't think it improved the match.
/(last |next |this |)(sun|mon|tue|wed|thu|fri|sat).*/i
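The pattern only identifies the modifier and the weekday; what "current" should mean is a separate decision. The sketch below assumes that a bare weekday or "this" means the nearest occurrence on or after today, that "next" means the first occurrence strictly after today, and that "last" means the most recent one strictly before today. Those rules and the name parse_weekday are assumptions of this example.

```ruby
require 'date'

WEEKDAY = /(last |next |this |)(sun|mon|tue|wed|thu|fri|sat).*/i
DAYS = %w[sun mon tue wed thu fri sat].freeze

# Hypothetical helper: returns a Date, or nil if no weekday name is found.
def parse_weekday(str, today = Date.today)
  m = WEEKDAY.match(str)
  return nil unless m

  target = DAYS.index(m[2].downcase)   # 0 = Sunday, matching Date#wday
  ahead = (target - today.wday) % 7    # days until that weekday, 0..6
  case m[1].downcase.strip
  when 'last' then today - (ahead.zero? ? 7 : 7 - ahead)
  when 'next' then today + (ahead.zero? ? 7 : ahead)
  else today + ahead                   # "this" or no modifier
  end
end
```

Taking the date of this post (Monday, November 13, 2006) as today, "last Thu" resolves to November 9 and "next Thursday" to November 16.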

6. Yesterday, today, tomorrow. Also, my wife and I made up the words Yupterday (the day before yesterday) and Threemorrow (the day after tomorrow). [We wanted to make up for the lack of an English word for these common references. Yes, we're pretty goofy people. Please feel free to adopt the words.] The test for these is really simple, and a regex isn't even really needed. It's actually less efficient if you want single matches, O(nm) instead of O(n). Here's the case for the heck of it:
/(today|tomorrow|yesterday|yupterday|threemorrow)/i
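Once a word matches, turning it into a date is just a table lookup. A minimal sketch follows; the OFFSETS table and the name parse_relative_day are made up for this example.

```ruby
require 'date'

RELATIVE_DAY = /(today|tomorrow|yesterday|yupterday|threemorrow)/i

# Day offsets relative to today for each recognised word.
OFFSETS = {
  'yupterday'   => -2,
  'yesterday'   => -1,
  'today'       => 0,
  'tomorrow'    => 1,
  'threemorrow' => 2
}.freeze

# Hypothetical helper: returns a Date, or nil if none of the words appear.
def parse_relative_day(str, today = Date.today)
  m = RELATIVE_DAY.match(str)
  m && today + OFFSETS[m[1].downcase]
end
```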

If you think the program I described might be useful, here's the whole current version as html. Do whatever you want with it. I've found it useful and have added a bunch of date-conversion methods for flavor (such as getting the end of the month, returning day statistics, or getting the number of days between two dates), but the heart of the whole thing is still the regular expressions.
