Accessing data over HTTP becomes more common every day. Whether through APIs or web pages, communication between applications keeps growing, and so does website scraping.
Java has no easy built-in solution for HTTP calls. Many packages offer related functionality, but picking one is hard, especially if you need extra features such as connecting via authenticated proxies.
We'll go from the basic request to advanced features using fluent.Request, part of the Apache HttpComponents project.
Direct Request
The first step is to request the desired page. We will use httpbin for the demo. It shows headers and origin IP, allowing us to check if the request was successful.
We need to import Request, get the target page, and extract the result as a string. The package provides methods for those cases and many more. Lastly, we print the response.
import org.apache.hc.client5.http.fluent.Request;

public class TestRequest {
    public static void main(final String... args) throws Exception {
        String url = "http://httpbin.org/anything";
        String response = Request
                .get(url) // use GET HTTP method
                .execute() // perform the call
                .returnContent() // handle and return response
                .asString(); // convert response to string
        System.out.println(response);
    }
}
We are neither handling the response nor checking for errors; it is a simplified version of a real use case.
Still, the result shows that the request was successful, and our own IP appears as the origin. We'll change that in a moment.
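To check the origin programmatically rather than by eye, the field can be pulled out of the response body. The sketch below uses only the standard library and assumes the body is httpbin-style JSON with a top-level "origin" string. The class name OriginCheck, the extractOrigin helper, and the sample body are made up for illustration, and a real JSON parser would be more robust than a regular expression.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginCheck {
    // Matches the "origin" field of an httpbin-style JSON body
    private static final Pattern ORIGIN =
            Pattern.compile("\"origin\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the value of the "origin" field, if the body contains one. */
    public static Optional<String> extractOrigin(String body) {
        Matcher m = ORIGIN.matcher(body);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String... args) {
        // Hypothetical response body; a real one carries more fields
        String sample = "{\"headers\": {\"Host\": \"httpbin.org\"}, \"origin\": \"203.0.113.7\"}";
        System.out.println(extractOrigin(sample).orElse("no origin found"));
    }
}
```

Running the same check on a direct request and on a proxied one makes it easy to compare which address the target server saw.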
Proxy Request
There are many reasons to add proxies to an HTTP request, such as security or anonymity. In any case, Java libraries (usually) make adding proxies complicated.
In our case, we can use viaProxy with the proxy URL as long as we don't need authentication. More on that later.
For now, we'll use a proxy from a free list. Note that these free proxies might not work for you, since they are short-lived.
import org.apache.hc.client5.http.fluent.Request;

public class TestRequest {
    public static void main(final String... args) throws Exception {
        String url = "http://httpbin.org/anything";
        String proxy = "http://169.57.1.85:8123"; // Free proxy
        String response = Request.get(url)
                .viaProxy(proxy) // will set the passed proxy
                .execute().returnContent().asString();
        System.out.println(response);
    }
}
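Since free proxies come and go, it can save time to test whether one still accepts connections before routing a request through it. The following is a minimal sketch using only the standard library. The class name ProxyProbe, the isReachable helper, and the 3-second timeout are assumptions made for the example. A successful TCP connection only shows that the port is open, not that the proxy will actually forward traffic.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

class ProxyProbe {
    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    public static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or unresolvable
        }
    }

    public static void main(String... args) {
        URI proxy = URI.create("http://169.57.1.85:8123"); // free proxy from the example above
        boolean up = isReachable(proxy.getHost(), proxy.getPort(), 3000);
        System.out.println(proxy + (up ? " accepts connections" : " is not reachable"));
    }
}
```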
Proxy with Authentication
Paid or private proxy providers, such as ZenRows, frequently require authentication on each call. Sometimes it is done via IP allowlists, but other means such as Proxy-Authorization headers are also common.
Calling the proxy without the proper auth method will result in an error: Exception in thread "main" org.apache.hc.client5.http.HttpResponseException: status code: 407, reason phrase: Proxy Authentication Required.
Following the example, we will need two things: auth and passing the proxy as a Host.
Proxy-Authorization contains the user and password, Base64 encoded.
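The header value can be built with the standard library alone. The sketch below is only an illustration: the class name ProxyAuthHeader, the basicAuth helper, and the user:pass credentials are made up for the example. It assumes the credentials sit in the user info part of the proxy URL, joined by a colon.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ProxyAuthHeader {
    /** Builds a "Basic ..." header value from the user info part of a proxy URI. */
    public static String basicAuth(URI proxyUri) {
        String userInfo = proxyUri.getUserInfo(); // "user:password", or null if absent
        if (userInfo == null) {
            throw new IllegalArgumentException("Proxy URI has no user info: " + proxyUri);
        }
        String encoded = Base64.getEncoder()
                .encodeToString(userInfo.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String... args) {
        // Hypothetical credentials, for illustration only
        URI proxy = URI.create("http://user:pass@proxy.example.com:8001");
        System.out.println(basicAuth(proxy)); // prints "Basic dXNlcjpwYXNz"
    }
}
```

Failing early when the URL carries no credentials avoids sending a malformed header and getting a less obvious 407 back from the proxy.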
Then, we need to change how viaProxy gets the proxy since it does not allow URLs with user and password. For that, we will create a new HttpHost passing in the whole URL. It will internally handle the problem and omit the unneeded parts.
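To make it clearer which parts of the proxy URL matter for the host, here is a rough illustration using only java.net.URI. It is not how HttpHost is implemented; it merely shows the scheme, host, and port being kept while the user info is left out. The class name ProxyUriParts and the withoutUserInfo helper are hypothetical, and the sketch assumes the URL carries an explicit port.

```java
import java.net.URI;

class ProxyUriParts {
    /** Rebuilds the proxy address from scheme, host, and port only, dropping any user info. */
    public static String withoutUserInfo(URI proxyUri) {
        return proxyUri.getScheme() + "://" + proxyUri.getHost() + ":" + proxyUri.getPort();
    }

    public static void main(String... args) {
        URI proxy = URI.create("http://YOUR_API_KEY:@proxy.zenrows.com:8001");
        System.out.println(withoutUserInfo(proxy)); // prints "http://proxy.zenrows.com:8001"
        System.out.println(proxy.getUserInfo());    // prints "YOUR_API_KEY:"
    }
}
```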
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import org.apache.hc.client5.http.fluent.Request;
import org.apache.hc.core5.http.HttpHost;

public class TestRequest {
    public static void main(final String... args) throws Exception {
        String url = "http://httpbin.org/anything";
        // Proxy URL as given by the provider
        URI proxyURI = new URI("http://YOUR_API_KEY:@proxy.zenrows.com:8001");
        // get user and password from the proxy URL and Base64-encode them
        String basicAuth = Base64.getEncoder()
                .encodeToString(proxyURI.getUserInfo().getBytes(StandardCharsets.UTF_8));
        String response = Request.get(url)
                // add auth
                .addHeader("Proxy-Authorization", "Basic " + basicAuth)
                // will set the passed proxy as a host
                .viaProxy(HttpHost.create(proxyURI))
                .execute().returnContent().asString();
        System.out.println(response);
    }
}
Ignore SSL Certificates
When adding proxies to SSL (https) connections, libraries tend to raise a warning/error about the certificate. From a security perspective, that is awesome! We avoid being shown or redirected to sites we prefer to avoid.
But what about forcing our connections through our own proxies? As long as we trust the proxy, that is a risk we accept, so we want to ignore those warnings. That is, again, not an easy task in Java.
The error goes something like this: Exception in thread "main" javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.
For this case, we will modify the target URL by switching it to https. We will also call a helper method that we'll create next. Nothing else changes in the main function.
public class TestRequest {
    public static void main(final String... args) throws Exception {
        // new method that will ignore certificate warnings
        ignoreCertWarning();
        String url = "https://httpbin.org/anything"; // switch to https
        // ...
    }
}
Now to the complicated and verbose part. We need to create an SSL context and a trust manager that accepts every certificate. As you can see, the trust manager's methods do nothing: they skip validation entirely and thus avoid the problem. Lastly, we initialize the context with this trust-all manager and set it as the default, which affects every HTTPS connection the JVM opens afterwards. And we are good to go!
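import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.*;

public class TestRequest {
    // ...
    private static void ignoreCertWarning() {
        // trust manager that accepts every certificate without checking it
        TrustManager[] trustAllCerts = new TrustManager[] { new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            public void checkClientTrusted(X509Certificate[] certs, String authType) {}
            public void checkServerTrusted(X509Certificate[] certs, String authType) {}
        } };
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, trustAllCerts, null);
            SSLContext.setDefault(ctx); // applies to the whole JVM
        } catch (GeneralSecurityException e) {
            // fail loudly instead of silently keeping the default context
            throw new IllegalStateException("Could not install the trust-all SSL context", e);
        }
    }
}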
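Replacing the default context changes certificate handling for the whole JVM. As a possible variation, not part of the approach above, the same trust-all context can be built and returned without installing it, so that only a client explicitly configured with it skips validation. The class name TrustAllContext and its two helpers are hypothetical, and wiring the context into a specific HTTP client is left out of this sketch.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

class TrustAllContext {
    /** A trust manager that accepts every certificate chain. Unsafe outside of testing. */
    public static X509TrustManager trustAllManager() {
        return new X509TrustManager() {
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        };
    }

    /** Builds a TLS context using the trust-all manager, without installing it as the default. */
    public static SSLContext create() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, new TrustManager[] { trustAllManager() }, null);
            return ctx;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build trust-all SSL context", e);
        }
    }

    public static void main(String... args) {
        SSLContext ctx = create();
        System.out.println(ctx.getProtocol()); // prints "TLS"
    }
}
```

Keeping the context scoped this way limits the blast radius: other code in the same process keeps validating certificates as usual.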
Conclusion
Accessing data (or scraping) in Java can get complicated and verbose. But with the right tools and libraries, we managed to tame that verbosity, except for the certificate handling.
We might get back to this topic in the future. The HttpComponents library offers attractive functionalities such as async and multi-threaded execution.