Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I am reading the book "The Ultimate Guide to Web Crawling"

The code used to run the first HTTP GET request is the following:

import requests 
url = "https://scrapethissite.com/pages/simple/" 
r = requests.get(url) 
print("We got a {} response code from {}".format(r.status_code, url))

I got the error message:

HTTPSConnectionPool(host='scrapethissite.com', port=443): Max retries exceeded with url: /pages/simple/ (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))

I understand that my request doesn't go to the right port. Is it linked to the fact that the website uses HTTPS rather than HTTP? I am not sure, but it seems to be part of the problem.

I am using Python 3.8 on PyCharm. My SSL version is:

OpenSSL 1.1.1g 21 Apr 2020

I am a beginner in web crawling, so I tried alternative code for the HTTP GET request, one that lets me select the port and protocol explicitly (source: https://pythonprogramming.net/python-sockets/):

import socket
import ssl    

context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()

server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)

request = "GET / HTTP/1.1\r\nHost: " + server + "\r\n\r\n"

s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)

while (len(result) > 0):
    print(result)
    result = s.recv(4096)

I got the HTTP 200 OK status response so it is working well. I get this output in the PyCharm terminal:

b'HTTP/1.1 200 OK Date: Tue, 12 Jan 2021 14:59:35 GMT Content-Type: text/html; charset=utf-8 Transfer-Encoding: chunked Connection: keep-alive Set-Cookie: __cfduid=d205b0b8e8ce061174412767189bf10b41610463575; expires=Thu, 11-Feb-21 14:59:35 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax CF-Cache-Status: DYNAMIC cf-request-id: 0798b515a60000ea04f707d000000001 Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct" Report-To: {"endpoints":[{"url":"https://a.nel.cloudflare.com/report?s=%2FROG7Z2JWZJBMeVNn1IgnJh2TZsqJCi9TJOL3zau98btlLo1nPg4WhGlmOz2SZ6PRep6%2BKZfv0M81fqKOw1l6%2BRbc5M9dErdtyeTsei9Ee%2F2jc0%3D"}],"group":"cf-nel","max_age":604800} NEL: {"report_to":"cf-nel","max_age":604800} Server: cloudflare CF-RAY: 6107be029e27ea04-IAD 1fb5 <!doctype html>

Scrape This Site | A public sandbox for learning web scraping Scrape This Site
Sandbox
Lesson' b's
FAQ
Login
var path = document.location.pathname; var tab = undefined; if (path === "/"){ tab = document.querySelector("#nav-homepage"); } else if (path.indexOf("/faq/") === 0){ tab = document.querySelector("#nav-faq"); } else if (path.indexOf("/lessons/") === 0){ tab = document.querySelector("#nav-lessons"); } else if (path.indexOf("/pages/") === 0) { tab = document.querySelector("#nav-sandbox"); } else if (path.indexOf("/login/") === 0) { tab = do' b'cument.querySelector("#nav-login"); } tab.classList.add("active")

Scrape This Site

The internet's best resource for learning
web scraping.


Explore Sandbox Begin Lessons → Lessons and Videos © Hartley Bro' b'dy 2018 PNotify.prototype.options.styling = "bootstrap3"; $(function(){ }); $(function () { $('[data-toggle="tooltip"]').tooltip() }) $("video").hover(function() { $(this).prop("controls", true); }, function() { $(this).prop("controls", false); }); $("video").click(function() { if( this.paused){ this.play(); } else { this.pause(); } }); (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-41551755-8', 'auto'); ga('send', 'pageview'); !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod? n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n; n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0; t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window, document,'script','https://connect.facebook.net/en_US/fbevents.js'); fbq('init', '764287443701341'); fbq('track', "PageView"); /* */ window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'AW-950945448'); ' b'0 '

The only problem is that I want to scrape this website:

https://scrapethissite.com/pages/simple/

and not:

https://scrapethissite.com

When I replace

server = 'scrapethissite.com'

by:

server = 'scrapethissite.com/pages/simple/'

in the previous code, I get this new error message:

socket.gaierror: [Errno 11001] getaddrinfo failed

My understanding is that the problem is linked to the proxy. Knowing that the problem may be related to the port, socket, or proxy is informative, but I am not sure what to fix or how, since the code works for one address but not the other.

Any help is highly appreciated. Thank you!


Following OneCricketeer's reply, the code is now:

context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()

server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)

request = "GET /pages/simple HTTP/1.1\r\nHost: " + server + "\r\n\r\n"

s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)

while (len(result) > 0):
    print(result)
    result = s.recv(4096)

I get an HTTP 301 MOVED PERMANENTLY status response:

b'HTTP/1.1 301 MOVED PERMANENTLY Date: Tue, 12 Jan 2021 15:34:15 GMT Content-Type: text/html; charset=utf-8 Transfer-Encoding: chunked Connection: keep-alive Set-Cookie: __cfduid=d6e32136f617c0b90e7f92a3e391c159f1610465655; expires=Thu, 11-Feb-21 15:34:15 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax Location: https://scrapethissite.com/pages/simple/ CF-Cache-Status: DYNAMIC cf-request-id: 0798d4d0d700002550fc1c3000000001 Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct" Report-To: {"endpoints":[{"url":"https://a.nel.cloudflare.com/report?s=2moOTvTDPvS65D6d0LvsiZTLDqYcv8OFZvtunIQDq6H%2FKLucm1LOOlMABcnCUjUO9fK4bwd%2BVDiescQ0NyHbu3DxhTCkOUHTvMcilkM%2BdcZnz3A%3D"}],"group":"cf-nel","max_age":604800} NEL: {"report_to":"cf-nel","max_age":604800} Server: cloudflare CF-RAY: 6107f0c7bb432550-IAD 11f Redirecting...

Redirecting...

You should be redirected automatically to target URL: https://scrapethissite.com/pages/simple/. If not click the link. ' b'0 '

Is there something I missed?



1 Answer

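"I am using Python 3.8 on PyCharm"

A print call with parentheses and a single argument runs under both Python 2 and Python 3, so it does not show which interpreter is in use. It is worth confirming which interpreter PyCharm is configured to run.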

In any case, this might work for the requests approach. Note that verify=False turns off certificate verification, so use it only for testing:

import requests 
url = "https://scrapethissite.com/pages/simple/" 
r = requests.get(url, verify=False) 
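
verify=False only skips the certificate check, so it may not help with a handshake-level error such as WRONG_VERSION_NUMBER. One possibility, which is an assumption and not something confirmed in this thread, is that a proxy configured on the machine answers the TLS handshake with something that is not TLS. requests picks up the system proxy settings by default, so a quick look at what Python sees can rule this in or out. The helper name describe_proxies below is hypothetical; getproxies is the standard-library call.

```python
from urllib.request import getproxies


def describe_proxies(proxies=None):
    """Return readable lines for the proxy settings Python would pick up.

    `proxies` defaults to the system/environment settings; passing a dict
    explicitly is only there to make the helper easy to try out.
    """
    if proxies is None:
        proxies = getproxies()
    lines = [f"{scheme} -> {proxy_url}" for scheme, proxy_url in sorted(proxies.items())]
    return lines or ["no proxy configured"]


if __name__ == "__main__":
    for line in describe_proxies():
        print(line)
```

If an https entry shows up here, it is worth checking that the proxy address and scheme are what your network actually expects.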

If you want to use the socket method, change GET / to GET /pages/simple/ (with the trailing slash; without it the server replies with a 301 redirect to /pages/simple/) and keep the server as just the domain name.
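
To make the host/path split concrete, here is a minimal sketch that takes the full URL apart with urllib.parse and builds the raw request from the pieces. The function names build_get_request and status_and_location are hypothetical, and the Connection: close header is my addition so that a recv loop ends when the server finishes sending.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Split a URL into (host, port, raw HTTP/1.1 GET request bytes)."""
    parts = urlsplit(url)
    host = parts.hostname
    # Default to 443 for https and 80 otherwise when the URL has no port.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # The path goes on the request line; only the host name is resolved via DNS.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")


def status_and_location(raw_response):
    """Return (status_code, Location header or None) from raw response bytes."""
    head = raw_response.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    status_code = int(lines[0].split(" ", 2)[1])
    location = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
    return status_code, location


if __name__ == "__main__":
    host, port, request = build_get_request("https://scrapethissite.com/pages/simple/")
    print(host, port)
    print(request.decode("ascii"))
```

Passing 'scrapethissite.com/pages/simple/' as the server fails with getaddrinfo because DNS can only resolve a host name; the path belongs in the GET line. status_and_location can be run on the bytes you receive to see whether the server is sending you to another URL.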

"I understand that my request doesn't go to the right port."

443 is the correct HTTPS port. The error says the SSL/TLS version is wrong, not the port.
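
On the socket side, ssl.PROTOCOL_TLSv1 pins the connection to TLS 1.0 only. A hedged alternative is to let Python negotiate the version with ssl.create_default_context(), which also enables certificate and host name checks. The sketch below is one way to do that under my own assumptions: the function names are hypothetical and the TLS 1.2 floor is a choice I made, not something from the question.

```python
import socket
import ssl


def make_client_context():
    """Client context that negotiates the TLS version instead of pinning TLS 1.0."""
    ctx = ssl.create_default_context()  # CERT_REQUIRED and check_hostname=True
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed floor; adjust if needed
    return ctx


def negotiated_tls_version(host, port=443, timeout=10):
    """Connect to host:port and return the TLS version string that was agreed on."""
    ctx = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()


if __name__ == "__main__":
    # Needs network access; prints the protocol version negotiated with the site.
    print(negotiated_tls_version("scrapethissite.com"))
```

If this also fails with WRONG_VERSION_NUMBER, the pinned protocol version in your code is probably not the cause, and something between your machine and the server is worth a look.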

