5 Easy Steps To Fix Secure Page (https) Duplicate Content
Dan Sharp
Posted 23 March, 2011 by Dan Sharp in SEO
I have had a couple of cases (and queries) involving secure pages (https://) and duplicate content recently, so I thought it would be a useful area to discuss.
Hypertext Transfer Protocol Secure (HTTPS) pages are often used for payment transactions, logins and shopping baskets to provide an encrypted and secure connection. Secure pages can of course be crawled and indexed by the search engines like regular pages. Although it might be hard to spot the difference between an http and https version of a page, they are technically different URIs (the difference might only be an ‘s’!) and they will be treated as separate pages by the search engines.
So as an example, the two URIs below would be seen as different pages –
http://www.screamingfrog.co.uk/
https://www.screamingfrog.co.uk/
This is often not a major issue, but we know duplicate content can be a problem, as it causes dilution of link equity (splitting of PageRank between pages rather than combining it into one target) as well as wasting crawl allowance.
So How Do Secure Pages Get In The Index?
Well, like any URI, they are found via either internal or external links. So either you are linking to the secure page from the website, or someone else externally is linking to the page (or another internal page connected to it!), which is why it has been crawled and indexed. You can find secure pages in Google’s index via the site: and inurl:https commands, like this example (we have zero results, wahey!).
However, one of the most common things we find is the use of a single secure page, such as a login or shopping cart / basket, which then contains relative URLs. For example –
“/this-is-a-relative-url/”
As shown above, relative URLs of course don’t contain protocol information (whether they are http or https!). They simply use the same protocol as the parent page (unless stipulated another way, such as via a base tag). Hence, crawled from a secure page, the URL would also be secure (https). Entire websites can often be crawled in secure format via a simple switch like this!
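To illustrate, here’s a quick sketch of how the same relative link resolves under each protocol (the URLs are placeholders, not real pages) –

<!-- On https://www.example.com/basket/ this resolves to https://www.example.com/this-is-a-relative-url/ -->
<!-- On http://www.example.com/ it resolves to http://www.example.com/this-is-a-relative-url/ -->
<a href="/this-is-a-relative-url/">A relative link</a>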
So, What Steps Should You Take To Ensure Your Secure Pages Are Not Indexed?
1) First of all, make sure you use the correct protocol on the correct pages. Only pages that genuinely need to be secure, like shopping basket, login or checkout pages etc, should be. Product pages on the whole shouldn’t be, so make sure users can’t browse (and potentially link to) secure versions of these pages.
2) Use absolute URLs – Absolute URLs define the protocol explicitly and don’t leave it to chance. So if you have a secure page that can be crawled (via internal or external links), make sure its links are absolute, as in the sketch below.
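A quick before-and-after sketch (example.com is just a placeholder) –

<!-- Relative: inherits the protocol of whatever page it sits on -->
<a href="/category/product/">Product</a>

<!-- Absolute: always http, even when crawled from an https page -->
<a href="http://www.example.com/category/product/">Product</a>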
3) You could also block a shopping basket or login page via robots.txt, so the search engines don’t crawl the page. Be careful not to block any other secure pages that you DO want in the index, or any secure pages which might have already accrued some link equity (see point 5!). You can also consider using a ‘nofollow’ link attribute on links to the login/shopping basket page; this is the only page we might recommend using a nofollow on for internal links. Matt Cutts from Google commented on this previously in a Google Webmaster Help video. Please note, you shouldn’t have to take this step if you can follow the other steps in this guide. Ideally, if you don’t want your shopping or login page in the index, use a meta noindex tag, as sketched below.
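As a rough sketch (the paths are hypothetical and will vary by site), the robots.txt block and the meta noindex tag would look something like –

# robots.txt – stop the engines crawling basket/login pages
User-agent: *
Disallow: /basket/
Disallow: /login/

<!-- Or, in the head of the page itself, to keep it out of the index -->
<meta name="robots" content="noindex">

Remember that robots.txt stops crawling, while a meta noindex tag has to be crawled to be seen, so don’t combine the two on the same page.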
What Should I do If I Already Have Duplicate Secure Pages In The Index?
4) Find the reason why you have secure pages in the index, either internal or external links, and follow the steps already outlined above. If you can’t find the link source internally, try the SEO Spider (shameless plug), which will do it for you. If it’s not an internal link, then there could be external links in play.
5) 301 permanently redirect the secure (https) page to the correct http version. This will mean the search engines drop the https version out of the index, rank the correct http version and pass any link equity (or PageRank!) to the correct version of the page. If you can’t use a 301 redirect, then try using the canonical link element instead. Obviously, make sure you haven’t blocked any of the pages you are going to redirect via robots.txt!
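By way of illustration, on an Apache server with mod_rewrite a blanket https-to-http redirect might look like the sketch below (the excluded paths for genuinely secure pages are hypothetical and will differ per site) –

# .htaccess – 301 https URLs to http, leaving secure pages alone
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteCond %{REQUEST_URI} !^/(basket|login|checkout)/ [NC]
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301,L]

And if a redirect isn’t possible, the canonical link element in the head of the https page would point at the http version (placeholder URL) –

<link rel="canonical" href="http://www.example.com/page/" />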
Hopefully this article provides a useful guide to removing any duplicate secure (https) pages.
I always thought it best to use secure.whatever.tld for https all the time, keeping SSL off the main site. I know people feel more secure when they see it, and since it gets indexed separately, there is less indexing trouble. But the biggest reason was robots.txt management.
If you keep SSL on the same domain as non-SSL, as you do here, how do you serve up http://whatever.tld/robots.txt and also https://whatever.tld/robots.txt ? If you did want those to be different, you’d have a conundrum, no?
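(For what it’s worth, one common workaround on a shared domain – sketched here for Apache with mod_rewrite, with robots_ssl.txt as a hypothetical filename – is to rewrite requests for robots.txt to a different file when the connection is secure –

RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

– so each protocol is served its own rules.)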
I believe it would be good if you can also add a canonical link to the head section, to prevent some worst-case scenario we might not think of which leads to duplication.
Add this line to the head section of each of the respective webpages to avoid duplication –
E.g. for this webpage that we are reading now, to avoid duplication it would be good to add the tag below.
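(The tag itself appears to have been stripped by the comment system; a canonical link element takes the general form below, with the href pointing at the preferred version of the page – the URL is a placeholder.)

<link rel="canonical" href="http://www.example.com/preferred-page/" />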
For me this has come in handy in many situations.
I have a question about this – how do you move the home/landing page & marcom pages currently on https to http, while keeping the rest of the web pages on https? Is there a way to do this? Thanks!!
Hi Heddi,
Yes you can and it depends on your server set-up.
I’d advise having a chat with your development team. My preference would certainly be to only use secure pages where they really do need to be secure.
Thanks,
Dan
This is by far the most complete article about removing dupes between the http and https protocols.
Thumbs up for suggesting changing relative paths to absolute ones. However, if that is not possible, a good alternative would be using rel=”canonical” in the https pages so they all point to the http ones.
The 301 redirect solution would be the best but a bit risky if there are https pages that need to be excluded.
Thanks for the explanation of how to avoid indexing the SSL pages.
But why would I want that? I haven’t found any reasons so far.
Google is developing its next-generation HTTP (SPDY), which is ALWAYS encrypted/secure. And the browser showing that the connection is secured creates trust in the shop.
But it has to be secure BEFORE using the login, otherwise how could the user tell?
So why would I want to remove HTTPS URLs from the index, instead of turning it around and ONLY indexing the secure pages?
From a usability point of view, redirecting users from their secure connection to an insecure one sucks! We would just have one more irritating page that makes it hard for users to understand whether a connection is secure or not.
The only argument I have read so far is that people would link to the non-secure page. But when investing in link building, why not promote a secure URL from the start?
Please enlighten me, guys! I’d be glad to come to an understanding here.
Hi Andy,
You seem passionate. But it’s not SSL URLs specifically that are the issue; it’s when they cause duplicate versions of http URLs.
I explained why at the start of the post – “This is often not a major issue, but we know duplicate content can be a problem, as it causes dilution of link equity (splitting of PageRank between pages rather than combining it into one target) as well as wasting crawl allowance.”
So you want one or the other for a URL, not both. Some sites choose to go completely secure, which is cool. I think it’s unnecessary though!
Redirecting from a secure page to a non-secure one makes sense if the page doesn’t need to be secure. If it should be secure, don’t do it. Another way would be to use a canonical, as mentioned.
Thanks for the comments.
Cheers.
Clear explanation! I like the way of finding indexed URLs through the site: and inurl:https commands.
Google has declared HTTPS a ranking signal – tiny, but a signal nonetheless.
As an instant repercussion, many webmasters are looking to capitalize on it and are starting to migrate sites from HTTP to HTTPS.
Now, interestingly, this site is still on HTTP, at least some of its service pages (i.e., landing pages – I guess so!).. any specific reason?
:-)
Hi Partha,
Yeah the Google PR team announced they will start using it as a small signal in scoring.
It’s their way of encouraging a safer web etc, which is cool.
Our site is mostly HTTP, but actually, if you have a look, our purchase page (the licence page) and the login page are both HTTPS. This is because users can submit sensitive data, so these pages should be secure.
Will we move the whole site to secure (even ‘informational’ type pages)?
It’s not necessary, but we may do it in the future for ease. I wouldn’t recommend considering it purely for SEO reasons personally, though.
Cheers.
Dan
Hi Dan
Informative and unique presentation!!! I am a newcomer in this field and I would like to clear up a doubt. I know HTTPS is just for use on secure pages like “Login”, “Purchase” and “Checkout” pages, etc. But I think Partha Sarathi Dutta has a point that we can use HTTPS to capitalize on the advantage, as Google confirmed HTTPS is a ranking signal. Is there any problem with it from an SEO point of view?
Thanks in advance..
Somnath
Hey Somnath,
HTTPS is now a tiny signal in scoring in the search engines.
There are bigger considerations than just SEO though: security, the user, your tech, the effort, latency etc.
But, you could change the whole site to HTTPS (like we have) :-)
Cheers
Dan
My site was also indexed as https (just the main page). A site: search is showing the https result for the main page; the others are OK. SSL is disabled on the server. How can I do a redirect from https to http on nginx? The only thing I can do now is create a canonical address. Any ideas?
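(A minimal nginx sketch for the https-to-http redirect follows – note that nginx still needs a valid certificate on port 443 to serve the redirect at all, and the server names and certificate paths below are hypothetical –

server {
    listen 443 ssl;
    server_name www.example.com;                       # hypothetical
    ssl_certificate /etc/ssl/certs/example.crt;        # hypothetical paths
    ssl_certificate_key /etc/ssl/private/example.key;
    return 301 http://$host$request_uri;               # permanent redirect to http
}

If a certificate can’t be served, the canonical link element is indeed the fallback.)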
Why am I unable to install Screaming Frog on my Asus Win 10 laptop? My laptop is 32-bit.
I have tried to install it but have been unsuccessful every time.
Hi Kumar,
You’re commenting on a blog post that’s 6 years old :-)
We have a dedicated support team you can simply message and receive help from, for future reference – https://www.screamingfrog.co.uk/seo-spider/support/
Your issue is highly likely to be the following – https://www.screamingfrog.co.uk/seo-spider/faq/#why-wont-the-seo-spider-start
Cheers.
Dan
Is this a typo? To me, it looks like the URIs are identical and so would be seen as the same page.
“So as an example, the two URI below would be seen as different pages –
https://www.screamingfrog.co.uk/
https://www.screamingfrog.co.uk/
“
Hi Bob,
Well spotted. We moved to pure HTTPS from HTTP, and converted all HTTP links to HTTPS.
It looks like we also made non-hyperlinks (text outside of A elements) HTTPS.
I’ve updated.
Cheers.
Dan
Hi,
I’m using the WordPress plugin “HTTP / HTTPS Remover: SSL Mixed Content Fix”, but some SEO tools are still showing SSL errors against some image URLs.
Suggestion: if possible, please integrate an image upload section into the comment box. Thanks.
Hey Team! I’ve used a custom SF extraction to find the external-linking URLs on our website. I would then like to take that list and check to see if some of those URLs have been secured. How would I do that?
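(One rough way to check, sketched with curl in a shell loop – urls.txt is a hypothetical file with one http:// URL per line –

# Swap each URL’s protocol for https and report the response status
while read -r url; do
  secure=$(echo "$url" | sed 's|^http://|https://|')
  status=$(curl -s -o /dev/null -w '%{http_code}' -I "$secure")
  echo "$status $secure"
done < urls.txt

A 200 suggests the https version resolves; a connection error or a 4xx/5xx code suggests it doesn’t.)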