I am building a spark streaming app that takes in logs coming out of a server. A log line looks something like this.
2015-06-18T13:53:46.606-0400 CustomLog v4 INFO: source="ABCD" type="type1" <xml some xml here attr1='value1' attr2='value2' > </xml> <some more xml></> time ="232"
I am trying to follow the sample app written by Databricks here.
I am kind of stuck at the pattern in ApacheAccessLog.scala. Mine is a custom log, and a typical log line contains these key="value" pairs.
I don't quite understand what the pattern means or how to change it to suit my app. I need to do some aggregation on the times, based on the source and type keys in the log.
Best How To:
The case class expects a variety of fields, such as an IP address, that your log obviously doesn't have. You therefore need to modify the case class definition to include just the fields you want.
Just to illustrate, let's define the case class like so (note that type is a reserved word in Scala, so the field name must be escaped with backticks):
case class ApacheAccessLog(source: String, `type`: String, time: Long)
Then you can replace the regex with one that captures those fields. You can experiment on regex101, where I've prepared something for you to start with, producing a regex like this:
source="(.*?)" type="(.*?)" .* time ="(.*?)"
capturing the three groups of characters into the match object m. Then you can fix the instantiation with these groups:
ApacheAccessLog(m.group(1), m.group(2), m.group(3).toLong)
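Putting the pieces together, here is a minimal self-contained sketch of the parsing step. The object name LogParser and the parse helper are illustrative, not part of the Databricks example; the regex assumes your lines match the sample format above, including the space before the = in time ="...":

```scala
case class ApacheAccessLog(source: String, `type`: String, time: Long)

object LogParser {
  // Lazy (.*?) groups stop at the first closing quote, so surrounding
  // XML and other attributes are skipped over by the greedy .* in between.
  private val LogPattern =
    "source=\"(.*?)\" type=\"(.*?)\" .* time =\"(.*?)\"".r

  // Returns None instead of throwing when a line doesn't match, which is
  // safer inside a streaming map than failing the whole batch.
  def parse(line: String): Option[ApacheAccessLog] =
    LogPattern.findFirstMatchIn(line).map { m =>
      ApacheAccessLog(m.group(1), m.group(2), m.group(3).toLong)
    }
}
```

On the sample line, LogParser.parse returns Some(ApacheAccessLog("ABCD", "type1", 232)); lines that don't match yield None and can be filtered out before aggregating by (source, type).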