Extract trace data from log files

Originators/Authors
Salman Elahi and Mathieu d’Aquin
UCIAD, Open University
Purpose
To extract activity data from log files in a flexible way in order to deal with a multiplicity of log file formats.
Background
To integrate log file data from diverse systems.
Ingredients
  • Log files
  • UCIAD Parderer
Assumptions
  • Log files relate to online resources
  • Parderer is configured with the specific parser class applicable to the log file format
  • A parser class can have specific parameters, notably a regular expression describing log file entries
  • Output from Parderer is in RDF
Warnings
A Parderer configured for a particular log file format needs to be deployed on the server where the log file (or a copy of it) is stored.
Method
  • Edit the configuration file to specify the parser class for Parderer and information about the server, e.g. where the log files are stored and the URI used to describe the server in the RDF
  • Run Parderer as a cron job; it currently always runs daily
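The configuration step can be pictured as loading a set of key/value settings. The following is a minimal sketch, assuming the configuration file uses the `key = value` layout with `#` comment lines shown in Appendix B. The function name `load_config` is hypothetical and is not part of Parderer; escape sequences in values are deliberately left unprocessed.

```python
def load_config(text):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments.

    Hypothetical helper for illustration; values are kept as raw strings.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# A few settings copied from the example configuration in Appendix B.
EXAMPLE = """
# the URI of the server which logs are being considered
serverURI = http://kmi-web13.open.ac.uk
# name of the server as it will appear in the data
serverName = kmi-web13
numberOfFields = 9
"""

config = load_config(EXAMPLE)
print(config["serverName"], config["numberOfFields"])
```

All values come back as strings, so a numeric setting such as `numberOfFields` would need converting with `int()` before use.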
Individual steps
  • By default Parderer runs automatically (thanks to cron), but it can be run manually for specific dates via command line parameters
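To illustrate what running for a specific date involves, the sketch below works out which log file a given date maps to. It assumes the `logFileDirectory` and `logFileNamePattern` settings from Appendix B, and that the name pattern uses strftime-style placeholders (`%Y`, `%m`). The function name `log_file_for` is hypothetical; Parderer's actual command line parameters are not documented here and are not reproduced.

```python
from datetime import date
from pathlib import PurePosixPath

# Values taken from the example configuration in Appendix B.
LOG_FILE_DIRECTORY = "/web/logs/lucero.open.ac.uk"
LOG_FILE_NAME_PATTERN = "access_log_%Y-%m.log"


def log_file_for(day):
    """Return the path of the log file expected to hold entries for `day`.

    Assumes the name pattern uses strftime-style placeholders.
    """
    file_name = day.strftime(LOG_FILE_NAME_PATTERN)
    return str(PurePosixPath(LOG_FILE_DIRECTORY) / file_name)


print(log_file_for(date(2011, 6, 13)))
# prints /web/logs/lucero.open.ac.uk/access_log_2011-06.log
```

With a monthly name pattern like this one, every day in the same month resolves to the same file, so a daily run would still need to select that day's entries within the file.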
Output data
Information will be placed in an RDF output file as specified in the configuration files.
File contents are in RDF/XML format and comply with the ontology available at http://github.com/uciad - see http://uciad.info/ub/2011/03/uciad-ontologies-a-starting-point/ for initial documentation.
Appendix A: Sample output
RDF for the trace
GET request to the URI http://data.open.ac.uk/resource/person/ext-718a372e10788bb58d562a8bf6fb864e
<rdf:RDF>
<rdf:Description rdf:about="http://uciad.info/trace/kmi-web13/ede2ab38da27695eec1e0b375f9b20da">
<rdf:type rdf:resource="http://uciad.info/ontology/trace/Trace"/>
<hasAction rdf:resource="http://uciad.info/action/GET"/>
<hasPageInvolved rdf:resource="http://uciad.info/page/0b9abc62fcf90afc53797b938af435dd"/>
<hasResponse rdf:resource="http://uciad.info/response/ea95add1414aba134ff9e0482b921a33"/>
<hasSetting rdf:resource="http://uciad.info/actorsetting/119696ec92c5acec29397dc7ef98817f"/>
<hasTime rdf:datatype="http://www.w3.org/2001/XMLSchema#string">13/Jun/2011:01:37:23+0100</hasTime>
</rdf:Description>
</rdf:RDF>
<rdf:RDF>
<rdf:Description rdf:about="http://uciad.info/page/0b9abc62fcf90afc53797b938af435dd">
<rdf:type rdf:resource="http://uciad.info/ontology/sitemap/WebPage"/>
<isPartOf rdf:resource="http://uciad.info/ontology/test1/dataopenacuk"/>
<onServer rdf:resource="http://kmi-web13.open.ac.uk"/>
<url rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
/resource/person/ext-718a372e10788bb58d562a8bf6fb864e
</url>
</rdf:Description>
</rdf:RDF>
<rdf:RDF>
<rdf:Description rdf:about="http://uciad.info/ontology/test1/dataopenacuk">
<rdf:type rdf:resource="http://uciad.info/ontology/sitemap/Website"/>
<rdf:type rdf:resource="http://uciad.info/ontology/test1/LinkedDataPlatform"/>
<onServer rdf:resource="http://kmi-web13.open.ac.uk"/>
<urlPattern rdf:datatype="http://www.w3.org/2001/XMLSchema#string">/*</urlPattern>
</rdf:Description>
<rdf:Description rdf:about="http://uciad.info/response/ea95add1414aba134ff9e0482b921a33">
<rdf:type rdf:resource="http://uciad.info/ontology/trace/HTTPResponse"/>
<hasResponseCode rdf:resource="http://uciad.info/ontology/trace/200"/>
<hasSizeInBytes rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1085</hasSizeInBytes>
</rdf:Description>
<rdf:Description rdf:about="http://uciad.info/actorsetting/119696ec92c5acec29397dc7ef98817f">
<rdf:type rdf:resource="http://uciad.info/ontology/actor/ActorSetting"/>
<fromComputer rdf:resource="http://uciad.info/computer/7587772edef21a8461f6af0efaf150fc"/>
<hasAgent rdf:resource="http://uciad.info/actor/ceec6f92bcc8167cbb665f7e51b2a6b3"/>
</rdf:Description>
<rdf:Description rdf:about="http://uciad.info/computer/7587772edef21a8461f6af0efaf150fc">
<rdf:type rdf:resource="http://uciad.info/ontology/actor/Computer"/>
<hasIPAddress rdf:datatype="http://www.w3.org/2001/XMLSchema#string">129.13.186.4</hasIPAddress>
</rdf:Description>
<rdf:Description rdf:about="http://uciad.info/actor/ceec6f92bcc8167cbb665f7e51b2a6b3">
<rdf:type rdf:resource="http://uciad.info/ontology/actor/ActorAgent"/>
<agentId rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
ldspider (BTC 2011 crawl, harth@kit.edu, http://code.google.com/p/ldspider/wiki/Robots)
</agentId>
</rdf:Description>
</rdf:RDF>
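As a rough illustration of how a trace description like the first one in this appendix might be serialised, the sketch below builds an `rdf:Description` with Python's standard `xml.etree.ElementTree`. It is not Parderer's implementation. The sample output omits namespace declarations, so the property namespace `TRACE_NS` and the `tr` prefix are assumptions, as is the function name `trace_to_rdf`; the URIs passed in are copied from the sample.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# ASSUMPTION: properties such as hasAction live in the trace ontology namespace.
# The sample output does not declare namespaces, so this is a guess.
TRACE_NS = "http://uciad.info/ontology/trace/"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("tr", TRACE_NS)  # hypothetical prefix


def qname(namespace, local_name):
    """Build an ElementTree qualified name of the form '{namespace}local'."""
    return "{%s}%s" % (namespace, local_name)


def trace_to_rdf(trace_uri, action_uri, page_uri, time_literal):
    """Serialise one trace as RDF/XML (illustrative subset of properties)."""
    root = ET.Element(qname(RDF_NS, "RDF"))
    desc = ET.SubElement(root, qname(RDF_NS, "Description"),
                         {qname(RDF_NS, "about"): trace_uri})
    ET.SubElement(desc, qname(RDF_NS, "type"),
                  {qname(RDF_NS, "resource"): TRACE_NS + "Trace"})
    ET.SubElement(desc, qname(TRACE_NS, "hasAction"),
                  {qname(RDF_NS, "resource"): action_uri})
    ET.SubElement(desc, qname(TRACE_NS, "hasPageInvolved"),
                  {qname(RDF_NS, "resource"): page_uri})
    time_element = ET.SubElement(desc, qname(TRACE_NS, "hasTime"),
                                 {qname(RDF_NS, "datatype"): XSD_STRING})
    time_element.text = time_literal
    return ET.tostring(root, encoding="unicode")


xml_text = trace_to_rdf(
    "http://uciad.info/trace/kmi-web13/ede2ab38da27695eec1e0b375f9b20da",
    "http://uciad.info/action/GET",
    "http://uciad.info/page/0b9abc62fcf90afc53797b938af435dd",
    "13/Jun/2011:01:37:23+0100",
)
print(xml_text)
```

Unlike the sample above, the serialised output declares its namespaces on the root element, which is what lets an RDF/XML parser read it back.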
Appendix B: Documentation to describe Parderer configuration
Here is an example Parderer configuration, with comments:
# the regular expression for a line of log when parsing apache logs
logPattern = ^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\S+) \"(.*)\" \"([^\"]+)\"
# the URI of the server whose logs are being considered
serverURI = http://kmi-web13.open.ac.uk
# name of the server as it will appear in the data
serverName = kmi-web13
# local path to the log file
logFileDirectory = /web/logs/lucero.open.ac.uk
# pattern of the log file name
logFileNamePattern = access_log_%Y-%m.log
# number of capture groups (fields) in the regular expression
numberOfFields = 9
# URI of the SESAME triple store where the data should be uploaded
repURI = http://kmi-web01.open.ac.uk:8080/openrdf-sesame
# name of the repository in the SESAME triple store
repID = UCIADAll
# directory where the data dumps (in zipped RDF) should be placed
zippedFileOutPutDir = /web/lucero.open.ac.uk/applications/parsedLogs
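To show what the `logPattern` setting captures, the sketch below applies the same expression to one log line in Python. It is an illustration only, not Parderer's parser. The doubled backslashes in the configuration file are assumed to be properties-style escaping, so the pattern is written here with single backslashes. The sample line is hypothetical, assembled from the values in Appendix A, and the field names in `FIELD_NAMES` are labels chosen for illustration (they follow the usual Apache combined log layout).

```python
import re

# logPattern from the configuration above, with properties-style escaping removed.
LOG_PATTERN = re.compile(
    r'^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(.+?)" (\d{3}) (\S+) "(.*)" "([^"]+)"'
)

# Hypothetical labels for the nine capture groups (numberOfFields = 9).
FIELD_NAMES = ("ip", "identity", "user", "time", "request",
               "status", "size", "referrer", "agent")


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


# Hypothetical log line built from the values shown in Appendix A.
SAMPLE_LINE = (
    '129.13.186.4 - - [13/Jun/2011:01:37:23 +0100] '
    '"GET /resource/person/ext-718a372e10788bb58d562a8bf6fb864e HTTP/1.1" '
    '200 1085 "-" '
    '"ldspider (BTC 2011 crawl, harth@kit.edu, '
    'http://code.google.com/p/ldspider/wiki/Robots)"'
)

entry = parse_line(SAMPLE_LINE)
print(entry["ip"], entry["status"], entry["size"])
# prints 129.13.186.4 200 1085
```

The compiled pattern has nine capture groups, matching `numberOfFields = 9`. Note that it requires whitespace between the timestamp and the time zone offset, so a line without that space would not match.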