Make Timesheet Parser with Hoa Compiler
Last modified 2022/09/14 11:14I have been keeping my timelogs in a plain-text timesheet format as follows:
2019-02-11
09:00 [JIRA-1234] Adding some functionality
10:00 [standup]
10:15 [JIRA-1234] Fixing that annoying bug
11:00 [JIRA-2134] Review
12:00 [lunch]
13:00 [JIRA-1234] @pairing
14:00 [confused]
18:00 [finish]
2019-02-12
09:00 ...
Bascially, it’s much quick to log my time realistically. I don’t need to
continually break my concentration and assign time to tickets as I work on
them, or strain to remember (or make up) what I did at the end of the day, or
even at the end of the week. It also means I can record what really
happened, and not just logging random events on the ZZ-22
“catch-all” ticket
where the information is lost to the powers of analysis.
The only problem is that every week I need to translate this into not one but two JIRAs, this is an operation that involves a huge amount of clicking and waiting and confusion and ppaaiinn.
So, pain once a week instead of pain every day. But there is no reason that this situation cannot be ameliorated - we can parse the timesheet. Once we parse the timesheet we can sync it automatically with JIRA and my Monday morning trauma is at an end, and our project managers can be happier as I can accurately curate and log my time every day effortlessly.
And, the miscellaneous tickets can still be assigned to a bucket ticket, but at least the original information is preserved,
Parsing the Timesheet ¶
Why do we want to parse the timesheet? We want to extract the information from it, and eventually produce a data structure like:
[
'2019-01-01' => [
'entries' => [
[ 'time' => '10:00', 'category' => 'AN-1234', 'comment' => 'foobar' ],
[ 'time' => '11:00', 'category' => 'lunch', 'comment' => 'foobar' ],
]
],
'2019-01-02' => [
'entries' => [
[ 'time' => '10:00', 'category' => 'AN-456', 'comment' => 'foobar' ],
]
]
]
Once we have structured data we can do something useful with it.
We could use regular expressions to extract the data, but, well, it might not end well. Instead we are going to use a compiler and we are not going to write any PHP code at all.
The HOA Compiler ¶
The HOA
Compiler. The HOA Compiler is an
amazing library which can take a grammar in the form of a .pp
file (see
here
for good and detailed documentation).
The timesheet document
is composed of one or more date
entries (of the
form YYYY-MM-DD
) and each date entry consequently contains a list of entry
items, each defining the time
, and optionally a category
, comment
and
one or more tags
.
Let’s skip straight to it:
%token newline \n
%token space \s
%token date [0-9]{4}-[0-1][0-9]-[0-3][0-9] -> entry
%token entry:time [0-9]{1,2}:[0-9]{1,2}
%token entry:break \n\n -> default
%token entry:newline \n
%token entry:space \s
%token entry:text [a-zA-Z0-9'"\h.-]+
%token entry:tag @[a-zA-Z0-9-_]+
%token entry:bracket_ \[ -> category
%token category:name [A-Za-z-_0-9]+
%token category:_bracket \] -> entry
#document:
date()*
#date:
<date> <newline>? entry()*
#entry:
<time> <space>? category()? <space>? <text>? tag()* (<newline> | <break> )?
#category:
<bracket_> <name> <_bracket>
#tag:
<space>? <tag>
So first we have the tokens, which are PCRE (regex) patterns. These define
lexemes the which are like the “atoms” of our grammar. We then define the
rules which combine these atoms - when prefixed with #
become nodes in
the AST (more on this later). Note the following:
- The document has zero or many (
*
)date()
rules. - Each
date()
rule is composed of a<date>
followed by zero or one (?
) newlines, followed by one or manyentry()
rules. - Each
entry()
rule must have a valid<time>
token, followed by one or zero spaces, followed by acategory()
rule, followed by… etc.
Did you notice the ->
symbols? These are namespace transitions, they mean
that, when encountring a date
token the lexer should switch to the date
namespace - and it will then only consider tokens in this namespace, this is
necessary to stop rules conflicting (you don’t want to interpret a date
token in a category
for example). The Compiler also allows you to transition
to the previous namespace using __shift__
(see the
docs
for more info).
When there is no namespace, the default
namespace is implicitly used.
When there is a break
token in the date
namespace (two new lines as
defined above) we revert back to the default
namespace and can consider f.e.
the date
token again.
Namespaces are essential, and are what really help make the compiler a much better option than simple regular expressions.
You may notice that we parse the category as a rule, and the tag as a token. There is no particular reason for this other than laziness - we could also have parsed the category as a token, or the tag as a rule, but let’s look at the difference when the AST is rendered:
Tag:
#tag
> token(entry:space, )
> token(entry:tag, @barfoo)
Category:
#category
> token(entry:bracket_, [)
> token(category:name, AA-1234)
> token(category:_bracket, ])
The information we really want from the above two examples is the name -
barfoo
and AA-1234
respectively. With the category we can easily
extract this information from the token in the AST, but with the tag we need
to perform additional processing (e.g. ltrim('@barfoo', '@')
) in order to
obtain the tag name (barfoo
), with the category it is less trivial.
But wait, how did we get here?
Parsing the Timesheet ¶
In order to do anything useful, we want to get our hands on an AST (Abstract
Syntax Tree). This will
be the data structure containing all of our data, more-or-less neatly
organized into a tree structure of nodes (remember these are defined in the
grammar with the #
prefix, e.g. #entry
), each node contains the set of
tokens (and their values) defined in the rule.
We use the HOA Compiler as follows:
use Hoa\Compiler\Llk\Llk;
use Hoa\File\Read;
$compiler = Llk::load(new Read(__DIR__ . '/../../resources/timesheet.pp'));
$ast = $compiler->parse($string);
// profit!
Once we have the AST we can visualize it using the Dumper
class provided by
HOA:
use Hoa\Compiler\Visitor\Dump;
$dumper = new Dump();
echo $dumper->visit($ast);
Producing something like this:
> #document
> > #date
> > > token(date, 2019-01-01)
> > > token(entry:newline,
)
> > > #entry
> > > > token(entry:time, 10:00)
> > > > token(entry:space, )
> > > > token(entry:text, Fo)
> > > > token(entry:newline,
Walking the AST ¶
The AST variable has type TreeNode
and can be traversed easily with helper
methods such as getChildren()
, to do something useful with it, you will
probably want to walk the tree, the basic idea is something like the
following:
class TreeWalker
{
public function walk(TreeNode $node): array
{
$dates = [];
foreach ($node->getChildren() as $childNode) {
if ($childNode->getId() === 'date') {
$dates[] = $this->walkDate($childNode);
}
}
return $dates;
}
private function walkDate(TreeNode $node): array
{
$date = [
'entries' => [],
];
foreach ($node->getChildren() as $childNode) {
if ($childNode->getValueToken() === 'date') {
$date['date'] = new DateTimeImmutable($childNode->getValueValue()));
}
if ($childNode->getId() == 'entry') {
$date['entries'][] = $this->walkEntry($childNode));
}
}
return $date;
}
private function walkEntry(TreeNode $node): array
{
// etc.
}
}
The result would be something like:
[
'date' => '2019-21-13',
'entries' => [
[
// ...
],
[
// ...
]
],
'date' => '2019-21-14',
'entries' => [
[
// ...
],
[
// ...
]
],
]
This is a simplified version, see here for the complete version.
Note that we progressively build our data set and extract information from the tokens in the tree.
Summary ¶
Now that we have walked the AST we have a data structure suited to our needs, and the next step is to build some rudimentary reporting and then integrate with the JIRA API. Along the way the above will probably change significantly - but fortunately it is now easy to change.
The official documentation provides a much greater depth of knowledge than this blog post does, but it is perhaps useful to see it explained from a different perspective.
Here is to great and future hopes of increased productivity powered by HOA.