I am reading the book "The Ultimate Guide to Web Crawling"
The code used to run the first HTTP get-request is the following:
import requests
url = "https://scrapethissite.com/pages/simple/"
r = requests.get(url)
print("We got a {} response code from {}".format(r.status_code, url))
I got the error message:
HTTPSConnectionPool(host='scrapethissite.com', port=443): Max retries exceeded with url: /pages/simple/ (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))
I understand that my request doesn't go to the right port. Is this linked to the fact that the website uses HTTPS rather than HTTP? I am not sure, but it seems to be part of the problem.
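As a quick check on the port question: the error message reports port=443, which is the default port for the https scheme, so the request does appear to target the expected port. A minimal sketch of that reasoning using only the standard library; the helper name effective_port and the DEFAULT_PORTS table are hypothetical, written for illustration:

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes discussed here (illustrative table).
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the port a client would connect to for this URL (hypothetical helper)."""
    parts = urlsplit(url)
    # An explicit port in the URL wins; otherwise fall back to the scheme default.
    return parts.port or DEFAULT_PORTS[parts.scheme]

url = "https://scrapethissite.com/pages/simple/"
print(effective_port(url))
```

If this prints the same port as the one in the error message, the port is probably not the cause, and the TLS handshake itself is the more likely place to look.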
I am using Python 3.8 on PyCharm. My SSL version is:
OpenSSL 1.1.1g 21 Apr 2020
I am a beginner in web crawling, so I tried alternative code for the HTTP GET request, one that lets me select the port and protocol explicitly (Source: https://pythonprogramming.net/python-sockets/):
import socket
import ssl
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()
server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)
request = "GET / HTTP/1.1\r\nHost: "+server+"\r\n\r\n"
s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)
while len(result) > 0:
    print(result)
    result = s.recv(4096)
I got the HTTP 200 OK status response so it is working well. I get this output in the PyCharm terminal:
b'HTTP/1.1 200 OK Date: Tue, 12 Jan 2021 14:59:35 GMT Content-Type: text/html; charset=utf-8 Transfer-Encoding: chunked Connection: keep-alive Set-Cookie: __cfduid=d205b0b8e8ce061174412767189bf10b41610463575; expires=Thu, 11-Feb-21 14:59:35 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax CF-Cache-Status: DYNAMIC cf-request-id: 0798b515a60000ea04f707d000000001 Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct" Report-To: {"endpoints":[{"url":"https://a.nel.cloudflare.com/report?s=%2FROG7Z2JWZJBMeVNn1IgnJh2TZsqJCi9TJOL3zau98btlLo1nPg4WhGlmOz2SZ6PRep6%2BKZfv0M81fqKOw1l6%2BRbc5M9dErdtyeTsei9Ee%2F2jc0%3D"}],"group":"cf-nel","max_age":604800} NEL: {"report_to":"cf-nel","max_age":604800} Server: cloudflare CF-RAY: 6107be029e27ea04-IAD 1fb5 <!doctype html>
Scrape This Site | A public sandbox for learning web scraping Scrape This SiteSandboxLesson' b'sFAQLoginvar path = document.location.pathname; var tab = undefined; if (path === "/"){ tab = document.querySelector("#nav-homepage"); } else if (path.indexOf("/faq/") === 0){ tab = document.querySelector("#nav-faq"); } else if (path.indexOf("/lessons/") === 0){ tab = document.querySelector("#nav-lessons"); } else if (path.indexOf("/pages/") === 0) { tab = document.querySelector("#nav-sandbox"); } else if (path.indexOf("/login/") === 0) { tab = do' b'cument.querySelector("#nav-login"); } tab.classList.add("active")Scrape This Site
The internet's best resource for learningweb scraping.
Explore Sandbox Begin Lessons → Lessons and Videos © Hartley Bro' b'dy 2018 PNotify.prototype.options.styling = "bootstrap3"; $(function(){ }); $(function () { $('[data-toggle="tooltip"]').tooltip() }) $("video").hover(function() { $(this).prop("controls", true); }, function() { $(this).prop("controls", false); }); $("video").click(function() { if( this.paused){ this.play(); } else { this.pause(); } }); (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-41551755-8', 'auto'); ga('send', 'pageview'); !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod? n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n; n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0; t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window, document,'script','https://connect.facebook.net/en_US/fbevents.js'); fbq('init', '764287443701341'); fbq('track', "PageView"); /* */ window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'AW-950945448'); ' b'0 '
The only problem is that I want to scrape this website:
https://scrapethissite.com/pages/simple/
and not the home page, which is what the code above requests.
When I replace
server = 'scrapethissite.com'
with:
server = 'scrapethissite.com/pages/simple/'
in the previous code, I get this new error message:
socket.gaierror: [Errno 11001] getaddrinfo failed
My understanding is that the problem is linked to the proxy. Knowing that it may involve the port, socket, proxy, etc. is informative, but I am not sure what to fix or how, since the code works for one URL but not the other.
Any help is highly appreciated. Thank you!
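Regarding the getaddrinfo failure above: socket.gethostbyname expects a bare host name, so a value that includes /pages/simple/ cannot be resolved. The path belongs in the request line instead. A minimal sketch of that split, assuming the standard-library urllib.parse.urlsplit; the helper name build_get_request is hypothetical, and the Connection: close header is an added assumption so that the server closes the socket and a recv loop can end:

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Split a URL into (host, raw request bytes). Hypothetical helper for illustration."""
    parts = urlsplit(url)
    host = parts.hostname          # e.g. 'scrapethissite.com' (what DNS can resolve)
    path = parts.path or "/"       # e.g. '/pages/simple/' (goes in the request line)
    request = (
        "GET {} HTTP/1.1\r\n"
        "Host: {}\r\n"
        "Connection: close\r\n"    # assumption: ask the server to close when done
        "\r\n"
    ).format(path, host)
    return host, request.encode()

host, request = build_get_request("https://scrapethissite.com/pages/simple/")
print(host)
print(request)
```

The returned host is what would be passed to s.connect((host, 443)) and to server_hostname, while the returned bytes are what would be passed to s.send.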
Following OneCricketeer's reply, the code is now:
import socket
import ssl
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()
server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)
request = "GET /pages/simple HTTP/1.1\r\nHost: "+server+"\r\n\r\n"
s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)
while len(result) > 0:
    print(result)
    result = s.recv(4096)
I now get an HTTP 301 MOVED PERMANENTLY status response:
b'HTTP/1.1 301 MOVED PERMANENTLY Date: Tue, 12 Jan 2021 15:34:15 GMT Content-Type: text/html; charset=utf-8 Transfer-Encoding: chunked Connection: keep-alive Set-Cookie: __cfduid=d6e32136f617c0b90e7f92a3e391c159f1610465655; expires=Thu, 11-Feb-21 15:34:15 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax Location: https://scrapethissite.com/pages/simple/ CF-Cache-Status: DYNAMIC cf-request-id: 0798d4d0d700002550fc1c3000000001 Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct" Report-To: {"endpoints":[{"url":"https://a.nel.cloudflare.com/report?s=2moOTvTDPvS65D6d0LvsiZTLDqYcv8OFZvtunIQDq6H%2FKLucm1LOOlMABcnCUjUO9fK4bwd%2BVDiescQ0NyHbu3DxhTCkOUHTvMcilkM%2BdcZnz3A%3D"}],"group":"cf-nel","max_age":604800} NEL: {"report_to":"cf-nel","max_age":604800} Server: cloudflare CF-RAY: 6107f0c7bb432550-IAD 11f Redirecting...
Redirecting...
You should be redirected automatically to target URL: https://scrapethissite.com/pages/simple/. If not click the link. ' b'0 '
Is there something I missed?
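One detail visible in the output above: the Location header ends in /pages/simple/, while the request line asked for /pages/simple without the trailing slash, so the redirect may simply be the server adding that slash. A minimal sketch for pulling the status code and Location header out of a raw response; parse_redirect is a hypothetical helper, and the sample bytes are a trimmed copy of the response shown above, not a new capture:

```python
from urllib.parse import urlsplit

def parse_redirect(raw):
    """Return (status_code, location) from the header part of a raw HTTP response."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    # Status line looks like: HTTP/1.1 301 MOVED PERMANENTLY
    status_code = int(lines[0].split(" ", 2)[1])
    location = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
    return status_code, location

# Trimmed sample based on the 301 response in the question.
sample = (
    b"HTTP/1.1 301 MOVED PERMANENTLY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Location: https://scrapethissite.com/pages/simple/\r\n"
    b"\r\n"
)

status, location = parse_redirect(sample)
print(status, location)
print(urlsplit(location).path)  # the path to use in the next request line
```

Under that assumption, sending the request again with the path taken from the Location header would be the next thing to try.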