These days we use Amazon CloudFront for content delivery. Amazon has made it very easy to serve the files in an Amazon Simple Storage Service (S3) bucket through a CloudFront distribution. If you are using CloudFront as a Content Delivery Network (CDN), your next task will be monitoring the usage. For this, Amazon CloudFront can store its access logs in an S3 bucket. My hurdle was processing the log files that CloudFront stores there. For sites hosted with Apache I use AWStats to read the logs, so my vote went to AWStats here as well. Please follow the steps one by one 😉
1. First we need to download the log files stored in the S3 bucket. For this I used a Python script from wpstorm.net, with some modifications so that it worked for me. Please follow that blog post if you need any help setting up the required libraries.
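The only non-standard dependency is the boto library (as the script's docstring notes). Assuming pip or setuptools is available on your machine, installing it is usually a one-liner:

pip install boto
# or, with setuptools:
easy_install boto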
get-aws-logs.py
#! /usr/bin/env python
"""Download and delete log files for AWS S3 / CloudFront

Usage: python get-aws-logs.py [options]

Options:
  -b ..., --bucket=...    AWS Bucket
  -p ..., --prefix=...    AWS Key Prefix
  -a ..., --access=...    AWS Access Key ID
  -s ..., --secret=...    AWS Secret Access Key
  -l ..., --local=...     Local Download Path
  -h, --help              Show this help
  -d                      Show debugging information while parsing

Examples:
  get-aws-logs.py -b eqxlogs
  get-aws-logs.py --bucket=eqxlogs
  get-aws-logs.py -p logs/cdn.example.com/
  get-aws-logs.py --prefix=logs/cdn.example.com/

This program requires the boto module for Python to be installed.
"""

__author__ = "Johan Steen (http://www.artstorm.net/)"
__version__ = "0.5.0"
__date__ = "28 Nov 2010"

import boto
import getopt
import sys, os
from boto.s3.key import Key

_debug = 0


class get_logs:
    """Download log files from the specified bucket and path and then delete
    them from the bucket.

    Uses: http://boto.s3.amazonaws.com/index.html
    """
    # Set default values
    AWS_BUCKET_NAME = '{AWS_BUCKET_NAME}'
    AWS_KEY_PREFIX = ''
    AWS_ACCESS_KEY_ID = '{AWS_ACCESS_KEY_ID}'
    AWS_SECRET_ACCESS_KEY = '{AWS_SECRET_ACCESS_KEY}'
    # Note: the path must end with a trailing slash, as filenames are appended directly
    LOCAL_PATH = '/tmp/'
    # Don't change below here
    s3_conn = None
    bucket = None
    bucket_list = None

    def __init__(self):
        s3_conn = None
        bucket_list = None
        bucket = None

    def start(self):
        """Connect, get file list, copy and delete the logs"""
        self.s3Connect()
        self.getList()
        self.copyFiles()

    def s3Connect(self):
        """Creates a S3 Connection Object"""
        self.s3_conn = boto.connect_s3(self.AWS_ACCESS_KEY_ID, self.AWS_SECRET_ACCESS_KEY)

    def getList(self):
        """Connects to the bucket and then gets a list of all keys available with the chosen prefix"""
        self.bucket = self.s3_conn.get_bucket(self.AWS_BUCKET_NAME)
        self.bucket_list = self.bucket.list(self.AWS_KEY_PREFIX)

    def copyFiles(self):
        """Creates a local folder if it does not already exist, then downloads all keys and deletes them from the bucket"""
        # Using makedirs as it's recursive
        if not os.path.exists(self.LOCAL_PATH):
            os.makedirs(self.LOCAL_PATH)
        for key_list in self.bucket_list:
            key = str(key_list.key)
            # Get the log filename ([-1] can be used to access the last item in a list).
            filename = key.split('/')[-1]
            # check if file exists locally, if not: download it
            if not os.path.exists(self.LOCAL_PATH + filename):
                key_list.get_contents_to_filename(self.LOCAL_PATH + filename)
                print "Downloaded " + filename
            # check that the file is downloaded, if so: archive it in the bucket and delete the original
            if os.path.exists(self.LOCAL_PATH + filename):
                key_list.copy(self.bucket, 'archive/' + key_list.key)
                print "Moved " + filename
                key_list.delete()
                print "Deleted " + filename


def usage():
    print __doc__


def main(argv):
    try:
        opts, args = getopt.getopt(argv, "hb:p:l:a:s:d",
                                   ["help", "bucket=", "prefix=", "local=", "access=", "secret="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    logs = get_logs()
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt == '-d':
            global _debug
            _debug = 1
        elif opt in ("-b", "--bucket"):
            logs.AWS_BUCKET_NAME = arg
        elif opt in ("-p", "--prefix"):
            logs.AWS_KEY_PREFIX = arg
        elif opt in ("-a", "--access"):
            logs.AWS_ACCESS_KEY_ID = arg
        elif opt in ("-s", "--secret"):
            logs.AWS_SECRET_ACCESS_KEY = arg
        elif opt in ("-l", "--local"):
            logs.LOCAL_PATH = arg
    logs.start()


if __name__ == "__main__":
    main(sys.argv[1:])
Note: The above script will download the S3 logs to the specified folder. Please make sure you fill in your Amazon access keys (or pass them on the command line).
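For a quick manual test you can also pass everything as options instead of editing the defaults in the script; the bucket name, local folder and key values below are placeholders you need to replace:

python get-aws-logs.py --bucket=my-log-bucket --prefix=logs/cdn.example.com/ \
    --access=YOUR_ACCESS_KEY_ID --secret=YOUR_SECRET_ACCESS_KEY --local=/tmp/cf-logs/

Keep in mind that after a successful download the script copies each key to an archive/ prefix in the bucket and deletes the original, so point it at test data first.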
2. Next we have a bash script which uses the above Python script to download the log files, combines them all into a single log file, and then has that file analyzed by AWStats.
Warning: Please read through the script files and make the changes needed for your setup.
Note: You should have AWStats installed on your system; the below script uses it.
Note: You can download the script files at the end of this blog post, where an AWStats configuration with a custom setup for the CloudFront log format is also provided.
get-aws-logs.sh
#!/bin/bash
# Initial, cron script to download and merge AWS logs
# 29/11 - 2010, Johan Steen

# 1. Setup variables
date=`date +%Y-%m-%d`
static_folder="/tmp/log_static_$date/"

mkdir -pv $static_folder
python /var/www/scripts/get-aws-logs.py --prefix=logs/www.imthi.com --local=$static_folder
gunzip --quiet ${static_folder}*
/usr/local/awstats/tools/logresolvemerge.pl ${static_folder}* | sed -r -e 's/([0-9]{4}-[0-9]{2}-[0-9]{2})\t([0-9]{2}:[0-9]{2}:[0-9]{2})/\1 \2/g' >> /var/www/logs/www.imthi.com.log
rm -vrf $static_folder
/usr/local/awstats/wwwroot/cgi-bin/awstats.pl -config=imthi -update
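A quick note on what the pipeline above does: logresolvemerge.pl merges the individual log files in chronological order, and the sed expression joins the tab-separated date and time columns into one space-separated value, which is the shape the %time2 tag in the AWStats config expects. A hypothetical log line (fields trimmed for brevity) shows the effect:

printf '2010-11-29\t01:02:03\tSEA4\t1045\n' | \
    sed -r -e 's/([0-9]{4}-[0-9]{2}-[0-9]{2})\t([0-9]{2}:[0-9]{2}:[0-9]{2})/\1 \2/g'
# prints the same line with only the first tab replaced by a space:
# 2010-11-29 01:02:03 SEA4 1045 (date and time are now one field; the other tabs are untouched)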
I would suggest you test run the above scripts in a staging/testing environment before moving to production. Again, please update the scripts with your own domain details and Amazon access keys.
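Since the wrapper is written as a cron script, the usual next step is to schedule it once you are happy with the results. A possible crontab entry, assuming you keep the shell script next to the Python script in /var/www/scripts/ (adjust the paths and log file to your setup):

# Download, merge and analyze the CloudFront logs every night at 02:15
15 2 * * * /bin/bash /var/www/scripts/get-aws-logs.sh >> /var/log/get-aws-logs.cron.log 2>&1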
Download the scripts to fetch and process Amazon CloudFront logs with AWStats.
Have a nice journey exploring the cloud 😉
Hello there and thanks for the information posted on your blog.
It seems we do not have the same cloudfront access logs’ format.
What I have is what is described here: http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?AccessLogs.html
Fields are in that order (in my case and in Amazon’s docs):
#Fields: date time x-edge-location c-ip x-event sc-bytes x-cf-status x-cf-client-id cs-uri-stem cs-uri-query c-referrer x-page-url c-user-agent x-sname x-sname-query x-file-ext x-sid
Whereas you propose to use the following LogFormat:
LogFormat="%time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query"
Can you please check the order of your fields?
Best regards,
Kmon
Dear Kmon
What you are looking at is the log format for a streaming distribution.
The following is an example of a log file for a streaming distribution.
#Version: 1.0
#Fields: date time x-edge-location c-ip x-event sc-bytes x-cf-status x-cf-client-id cs-uri-stem cs-uri-query c-referrer x-page-url c-user-agent x-sname x-sname-query x-file-ext x-sid
The following is an example log file for a download distribution.
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
My script only parses download distribution logs, not streaming ones. Please change the fields in your awstats config to match the streaming distribution format. It would be great if you could share that format once you have it working.
Cheers
Imthiaz
You are absolutely right.
Naively adapting awstats to eat cloudfront’s streaming format did not help much.
In fact, there are many events that can happen while streaming (connect, play, seek, stop, disconnect, etc) and the sc-bytes is unfortunately only the total bytes transferred to the client up to that event (one line of log per event).
The naive solution I applied was to concentrate on “stop” events and count the amount of data transferred from those.
As you can imagine, a client pressing stop (e.g., to pause the video) and then playing again gets counted “twice”: say X KB for the first stop and then Y KB (where Y includes X) for the second stop, resulting in stats that unfortunately did not make any sense.
If anyone has found a simple and neat solution to this problem, please post a comment!
Cheers,
Kmon
This is the code I use for parsing streaming logs, if it helps at all.
First, reading all the logfiles in a directory…
public function process_logs($x)
{
    // First we'll create an array of all viable log files in directory $x
    $log_files = array();
    $handle = opendir($x);
    // for EACH file in this directory
    while ($file = readdir($handle)) {
        // echo "Checking file: $file \n";
        // if the filename has the log stem in it then add it to the list
        if (strstr($file, "E9PBX")) {
            echo "Adding file: $file\n";
            $log_files[] = $file;
        }
    }
    foreach ($log_files as $logfile) {
        // Open the file (readdir() returns bare filenames, so prepend the directory)
        $fh = fopen($x . '/' . $logfile, 'r');
        $i = 0;
        while (!feof($fh)) {
            $i++;
            // READ the LINE
            $content = fgets($fh, 4096);
            // echo "line $i: " . $content;
            // echo "length $i: " . strlen($content);
            // skip the #Version and #Fields header lines
            if ($i > 2) {
                if (strpos($content, "http://")) {
                    $rEvent = new rtmpEvent();
                    $fields = explode("\t", $content);
                    $rEvent->init($fields);
                }
            }
        }
    }
}
And here’s the basic row that you’ll get back in $fields. I don’t use 9, 14, or 16, but I think they are defined in the AWS documentation.
Side note, I am not a real programmer, I just futz around. I’m sure there are better ways to do this.
/* [0] => [date]
[1] => [time]
[2] => [edge location]
[3] => [ip address]
[4] => [event]
[5] => [leading bytes]
[6] => OK
[7] => [unique connection id]
[8] => [streaming server]
[9] =>
[10] => [actual connector]
[11] => [URL that made the call]
[12] => [user’s machine]
[13] => [path to file]
[14] => –
[15] => [filetype]
[16] => 1 */