searchlove london 2016 | dom woodman | how to get insight from your logs
Post on 07-Jan-2017
TRANSCRIPT
2009
God it’s bad.
-$1.5 Billion
Why hasn’t Google seen the changes on my page?
How should I prioritise errors in Search Console?
Are my canonicals being respected?
Does Google think this page is important?
What can you do with logs?
PART 1: THE WHY
Getting logs
Analysing Logs
Processing Logs
PART 2: THE HOW
What is a log?
What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
123.65.150.10: IP address
[23/Aug/2010:03:50:59 +0000]: Timestamp
GET: Request type
/my_homepage: The page requested (here, the homepage)
HTTP/1.1: Protocol
200: Status code
2262: Size of the page (in bytes)
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)": User agent
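Pulled apart programmatically, the same line yields those fields with a short regular expression. A minimal sketch for the combined log format shown above; the group names are my own labels, not a standard:

```python
import re

# Regex for the Apache/NGINX "combined" log format illustrated above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] '
        '"GET /my_homepage HTTP/1.1" 200 2262 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"')

# Each named group maps to one of the annotated fields above.
fields = LOG_PATTERN.match(line).groupdict()
```
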
5 things
1 Diagnose crawling & indexation issues
Chart: the five folders Googlebot crawled the most, by number of requests
Chart: % of organic sessions vs. % of crawl budget
2 Prioritisation
example.com/article
Prioritising its variants:
1. example.com/article/full (Full)
2. example.com/article/print
3. example.com/article/pdf
3 Spot bugs & view site health
Errors in Search Console are delayed, and capped at a limit of 1,000
4 How important does Google think parts of your site are?
My SEO was as bad as my design
But at least my hair was better
teflsearch.com
teflsearch.com/job-results
teflsearch.com/job-results/country/china
teflsearch.com/job-advert3455
Average number of times Googlebot crawled a template
1. teflsearch.com
2. teflsearch.com/job-results
3. teflsearch.com/job-results/country/china
4. teflsearch.com/job-advert3455
Chart: average number of times Googlebot crawled a template; teflsearch.com/job-results: 35%
5 How fresh does it think your content is?
bit.ly/moz-fresh
Average number of times a page template is crawled by Googlebot
● Improve our internal linking
● Build trust with last modified date in sitemap
Talk to a developer and ask for information
Are all the logs in one place?
Hi {x},

I'm {x} from {y} and we've been asked to do some log analysis to better understand how Google is behaving on the website, and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).

What we'd ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they're spending their time, the status code errors they're finding, etc. There are also some things that are really helpful for us to know when getting logs.

Do the logs have any personal information in them?
We're only concerned with the various search crawler bots like Google and Bing; we don't need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.

Do you have any sort of caching which would create separate sets of logs?
Is there anything like Varnish running on the server, or a CDN, which might create logs in a different location to the rest of your server? If so, we will need those logs as well as the ones from the server. (Although we're only concerned about a CDN if it's caching pages or serving from the same hostname; if you're just using Cloudflare, for example, to cache external images, then we don't need it.)

Are there any sub-parts of your site which log to a different place?
Have you got anything like an embedded WordPress blog which logs to a different location? If so, we'll need those logs as well.

Do you log hostname?
It's really useful for us to be able to see the hostname in the logs. By default a lot of common server logging set-ups don't log hostname, so if it's not turned on, it would be very useful to have that turned on now for any future analysis.

Is there anything else we should know?

Best,
{x}
Email for a developer
So we might have something that looks like this
How should we analyse our logs?
BigQuery
Google's online database for data analysis.
What do we want from analysing our logs?
1. Ask powerful questions
2. Repeatable
3. Scalable
4. Combine with crawl data
5. Easy to set up
6. Easy to learn
9,000,000 rows of data for 2 months.
400 - 800 queries
Format the logs so we can import them into BigQuery
Separate the Googlebot logs from all the other logs
Screaming Frog Log Analyser
Code something
bit.ly/logs-code
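Either route does the same two jobs: reformat the raw lines and keep only the Googlebot hits. A rough sketch of the "code something" option; the CSV layout and helper names here are my own illustration, not the exact script behind bit.ly/logs-code:

```python
import csv
import re

# Combined-log-format parser (same shape as the log anatomy earlier).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<user_agent>[^"]*)"'
)

def googlebot_rows(lines):
    """Yield parsed rows for requests whose user agent claims to be Googlebot.

    NB: user agents can be spoofed; for real analysis you would also verify
    the IP with a reverse DNS lookup against googlebot.com / google.com.
    """
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and "Googlebot" in m.group("user_agent"):
            yield [m.group("ip"), m.group("timestamp"), m.group("method"),
                   m.group("path"), m.group("status"), m.group("size")]

def logs_to_csv(in_path, out_path):
    """Write the Googlebot hits to a CSV ready for a BigQuery load job."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["ip", "timestamp", "method", "path", "status", "size"])
        writer.writerows(googlebot_rows(src))
```
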
Our data in BQ
We make sure we got what we wanted
THE QUESTION: What is the total number of requests Googlebot makes each day to our site?
Our first SQL query, built up step by step:

SELECT timestamp
FROM [mydata.log_analysis]

SELECT DATE(timestamp)
FROM [mydata.log_analysis]

SELECT DATE(timestamp) as date
FROM [mydata.log_analysis]

SELECT DATE(timestamp) as date, count(*)
FROM [mydata.log_analysis]

SELECT DATE(timestamp) as date, count(*)
FROM [mydata.log_analysis]
GROUP BY date

SELECT DATE(timestamp) as date, count(*) as number_of_requests
FROM [mydata.log_analysis]
GROUP BY date
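Before trusting the BigQuery numbers, the same daily aggregation can be sanity-checked locally on a sample of the raw logs. A sketch; `requests_per_day` is my own helper, and the timestamp format matches the log example earlier:

```python
from collections import Counter
from datetime import datetime

def requests_per_day(timestamps):
    """Count requests per calendar day, mirroring
    SELECT DATE(timestamp), count(*) ... GROUP BY date."""
    days = (datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").date()
            for ts in timestamps)
    return Counter(days)

# Timestamps as they appear between the square brackets in each log line.
counts = requests_per_day([
    "23/Aug/2010:03:50:59 +0000",
    "23/Aug/2010:14:02:11 +0000",
    "24/Aug/2010:09:30:00 +0000",
])
```
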
Chart: comparing logs to GSC crawl volume, by number of requests
Run queries
Find something weird
Go look at crawl & website
Our data in BQ
1 Diagnose crawling & indexation issues
2 Prioritisation
3 Spot bugs & view site health
4 How important does Google think parts of your site are?
5 How fresh does it think your content is?
1 Diagnose crawling & indexation issues
4 How important does Google think parts of your site are?
What are the top 20 URLs crawled by Google over our logs?
Login is my top crawled page and then search?
What are the top 20 page_path_1 folders crawled by Google over our logs?
Location folders are taking more than 70% of my budget
Getting data by the day
Page     Number of Googlebot Requests
page1    200,000
page2    120,000
Number of Googlebot requests day by day
3 Spot bugs & view site health
How many of each status code does Google find per day over our logs?
Number of Googlebot requests day by day
What are the most requested 404 URLs by Googlebot over the past 30 days?
Boy does it want that ad-tech snippet
5 How fresh does it think your content is?
How many times on average is each page in a page template crawled a day?
Average number of times a page template is crawled by Googlebot
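One way to compute that figure is to bucket URLs into templates and divide total hits by unique pages. A sketch; the template patterns below are illustrative guesses at teflsearch-style URLs, not the site's real rules, and you would divide further by the number of days in the sample for a per-day average:

```python
from collections import defaultdict
import re

# Illustrative template rules; a real site would define its own patterns.
TEMPLATES = [
    ("job_advert", re.compile(r"^/job-advert\d+$")),
    ("country_results", re.compile(r"^/job-results/country/[^/]+$")),
    ("job_results", re.compile(r"^/job-results$")),
    ("homepage", re.compile(r"^/$")),
]

def template_for(path):
    """Return the first template whose pattern matches the URL path."""
    for name, pattern in TEMPLATES:
        if pattern.match(path):
            return name
    return "other"

def avg_crawls_per_template(paths):
    """Average Googlebot hits per unique page, grouped by template."""
    hits = defaultdict(int)    # total requests per template
    pages = defaultdict(set)   # unique URLs per template
    for path in paths:
        t = template_for(path)
        hits[t] += 1
        pages[t].add(path)
    return {t: hits[t] / len(pages[t]) for t in hits}
```
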
How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders crawled by Google over the time period of our logs?
Which pages have requests from Googlebot which don't appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
Which directories have the most 301 & 404 error codes?
Which pages are crawled with parameters and without parameters?
Which pages are only partly downloaded?
How many hits does each section get, when the sections are classified in an external dataset?
What percentage of a directory was crawled over the past 30 days?
What is the total number of requests across two different time periods?
That’s a lot of questions
bit.ly/logs-resource
In Summary
This is the thing you’re probably not doing
bit.ly/logs-resource
@dom_woodman