How to parse Google Search result in Java?

05 Oct How to parse Google Search result in Java?

Posted at 05:49h in tutorial by shashank 6 Comments

Google is an amazing resource, but there is no straightforward API for parsing Google search results. This is the Java code I wrote to help you parse any Google search results page.

How does Google Search work?

For example, if you are searching for “How to parse Google Search result in Java”, this is the URL you would hit: https://www.google.com/search?q=How+to+parse+Google+Search+result+in+Java&num=10

The “q” parameter carries the search query, and the “num” parameter tells Google how many results to return.
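
As a minimal sketch of how such a URL can be assembled, the helper below encodes a search term and appends the “q” and “num” parameters. The class name SearchUrlSketch and the method buildSearchUrl are illustrative names and are not part of the code in this post.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlSketch {

  // Hypothetical helper: builds the search URL from a raw term and a result count.
  public static String buildSearchUrl(String searchTerm, int resultCount) {
    try {
      // Only the term is encoded; spaces become "+" and reserved characters become %XX.
      String encoded = URLEncoder.encode(searchTerm.trim(), StandardCharsets.UTF_8.name());
      return "https://www.google.com/search?q=" + encoded + "&num=" + resultCount;
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on the JVM, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Prints https://www.google.com/search?q=How+to+parse+Google+Search+result+in+Java&num=10
    System.out.println(buildSearchUrl("How to parse Google Search result in Java", 10));
  }
}
```

Note that only the search term goes through URLEncoder; the scheme, host, and parameter names are left untouched.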

Getting HTML search results from Google

This code searches Google and returns the HTML of the result page.

  /**
   * The method will return the search page result in a {@link String} object
   *
   * @param googleSearchQuery the google search query
   * @return the content as {@link String} object
   * @throws Exception if the request fails
   */
  public static String getSearchContent(String googleSearchQuery) throws Exception {
    //URL encode string in JAVA to use with google search
    System.out.println("Searching for: " + googleSearchQuery);
    googleSearchQuery = googleSearchQuery.trim();
    googleSearchQuery = URLEncoder
        .encode(googleSearchQuery, StandardCharsets.UTF_8.toString());
    String queryUrl = "https://www.google.com/search?q=" + googleSearchQuery + "&num=10";
    System.out.println(queryUrl);
    final String agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
    URL url = new URL(queryUrl);
    final URLConnection connection = url.openConnection();
    /**
     * User-Agent is mandatory otherwise Google will return HTTP response
     * code: 403
     */
    connection.setRequestProperty("User-Agent", agent);
    final InputStream stream = connection.getInputStream();
    return getString(stream);
  }

The above code URL-encodes the given search term, requests the Google search page for it using the URLConnection class, and returns the HTML as a String. You can also change the request headers to make it less likely that Google blocks you.
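
A hedged sketch of what changing the headers could look like follows. The extra header values, the timeouts, and the names HeaderSketch and openWithHeaders are assumptions for illustration; none of them is known to prevent blocking.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

public class HeaderSketch {

  // Hypothetical helper: prepares a connection with a few request headers set.
  // Nothing is sent over the network until the caller reads from the connection.
  public static URLConnection openWithHeaders(String queryUrl) {
    try {
      URLConnection connection = new URL(queryUrl).openConnection();
      // Same User-Agent as in the code above.
      connection.setRequestProperty("User-Agent",
          "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
      // Assumed example values; adjust them to your needs.
      connection.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
      connection.setRequestProperty("Accept", "text/html");
      // Timeouts in milliseconds so a stalled request does not hang forever.
      connection.setConnectTimeout(5000);
      connection.setReadTimeout(5000);
      return connection;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    URLConnection c = openWithHeaders("https://www.google.com/search?q=test&num=10");
    System.out.println(c.getRequestProperty("Accept-Language"));
  }
}
```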

Parsing results from google search result HTML

We only want the real results from the Google search page, and for this we can use this simple Jsoup-based HTML parser in Java:

/**
   * Parse all links
   *
   * @param html the page
   * @return the list with all URLs
   * @throws Exception
   */
  public static List<String> parseLinks(final String html) throws Exception {
    List<String> result = new ArrayList<String>();
    Document doc = Jsoup.parse(html);
    Elements results = doc.select("a > h3");
    for (Element link : results) {
      Elements parent = link.parent().getAllElements();
      String relHref = parent.attr("href");
      if (relHref.startsWith("/url?q=")) {
        relHref = relHref.replace("/url?q=", "");
      }
      String[] splittedString = relHref.split("&sa=");
      if (splittedString.length > 1) {
        relHref = splittedString[0];
      }
      //System.out.println(relHref);
      result.add(relHref);
    }
    return result;
  }

The above code is a bit tricky. It first finds the “h3” elements that sit directly inside an “a” element. Then it looks at the parent of each “h3”, which is the “a” element, and reads the URL from its “href” attribute.

Google search result URLs start with “/url?q=”, so we use String.replace to remove this prefix. The URL is also followed by extra parameters starting with “&sa=”, so we split the string at that marker and keep the first part.
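
To make the two cleanup steps concrete, here is a small worked sketch on a made-up href. The class HrefCleanSketch and the method cleanHref are illustrative names, and the sketch uses substring and indexOf rather than replace and split, which should give the same result for hrefs of this shape.

```java
public class HrefCleanSketch {

  // Hypothetical helper mirroring the two cleanup steps in parseLinks.
  public static String cleanHref(String href) {
    String result = href;
    // Step 1: drop the "/url?q=" prefix if it is present.
    if (result.startsWith("/url?q=")) {
      result = result.substring("/url?q=".length());
    }
    // Step 2: cut everything from "&sa=" onwards if it is present.
    int saIndex = result.indexOf("&sa=");
    if (saIndex >= 0) {
      result = result.substring(0, saIndex);
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up href in the shape described above.
    System.out.println(cleanHref("/url?q=https://example.com/page&sa=U&ved=abc123"));
    // prints https://example.com/page
  }
}
```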

This is the full code for parsing Google search results in Java:


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class GoogleCrawler {
  /**
   * Method to convert the {@link InputStream} to {@link String}
   *
   * @param is the {@link InputStream} object
   * @return the {@link String} object returned
   */
  public static String getString(InputStream is) {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
      String line;
      while ((line = br.readLine()) != null) {
        sb.append(line);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    return sb.toString();
  }
  /**
   * The method will return the search page result in a {@link String} object
   *
   * @param googleSearchQuery the raw (unencoded) google search query
   * @return the content as {@link String} object
   * @throws Exception if the request fails
   */
  public static String getSearchContent(String googleSearchQuery) throws Exception {
    return getSearchContent(googleSearchQuery, 10);
  }
  /**
   * Same as above, but with a configurable number of results.
   *
   * @param googleSearchQuery the raw (unencoded) google search query
   * @param resultCount the value passed to Google as the "num" parameter
   * @return the content as {@link String} object
   * @throws Exception if the request fails
   */
  public static String getSearchContent(String googleSearchQuery, int resultCount)
      throws Exception {
    // URL-encode only the search term. Encoding the whole URL would also encode
    // "https://", and new URL(...) would then fail with
    // "MalformedURLException: no protocol".
    googleSearchQuery = URLEncoder
        .encode(googleSearchQuery.trim(), StandardCharsets.UTF_8.toString());
    String queryUrl = "https://www.google.com/search?q=" + googleSearchQuery
        + "&num=" + resultCount;
    System.out.println(queryUrl);
    final String agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
    URL url = new URL(queryUrl);
    final URLConnection connection = url.openConnection();
    // User-Agent is mandatory, otherwise Google will return HTTP response code 403
    connection.setRequestProperty("User-Agent", agent);
    final InputStream stream = connection.getInputStream();
    return getString(stream);
  }
  /**
   * Parse all links
   *
   * @param html the page
   * @return the list with all URLs
   * @throws Exception
   */
  public static List<String> parseLinks(final String html) throws Exception {
    List<String> result = new ArrayList<String>();
    Document doc = Jsoup.parse(html);
    // Result titles are "h3" elements that sit directly inside an "a" element
    Elements results = doc.select("a > h3");
    for (Element link : results) {
      // The parent of the "h3" is the "a" element that carries the href
      String relHref = link.parent().attr("href");
      // Strip the "/url?q=" prefix that Google puts in front of the target URL
      if (relHref.startsWith("/url?q=")) {
        relHref = relHref.replace("/url?q=", "");
      }
      // Drop the extra parameters that follow the target URL
      String[] splittedString = relHref.split("&sa=");
      if (splittedString.length > 1) {
        relHref = splittedString[0];
      }
      result.add(relHref);
    }
    return result;
  }
  /**
   * Search Google for the given term and return the result links
   *
   * @param searchTerm the raw (unencoded) search term
   * @param resultCount the number of results to request, 10 if null
   * @return the list with all result URLs
   * @throws Exception if the request fails
   */
  public static List<String> searchGoogle(String searchTerm, Integer resultCount) throws Exception {
    if (resultCount == null) {
      resultCount = 10;
    }
    System.out.println("Searching for: " + searchTerm);
    String page = getSearchContent(searchTerm, resultCount);
    return parseLinks(page);
  }
  public static void main(String[] args) throws Exception {
    String searchTerm = "Elon Musk";
    System.out.println("Google Search Parser Tutorial");
    List<String> links = searchGoogle(searchTerm, 10);
    System.out.println();
    System.out.println("Results:");
    for (String link : links) {
      System.out.println(link);
    }
  }
}

The method “searchGoogle” searches Google for the given search term and returns the result links.

Problems:

I got the above Java code to work, but it looks like Google will start blocking your IP address if you make too many calls, since Google does not like automated parsing of its search results. So this is pretty good code, but it won’t keep working at volume unless you have a way to change the caller's IP address every few hundred calls.
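
One thing that may help a little is spacing the calls out. The sketch below is only an assumption about what might reduce the risk of being blocked: it does not replace changing the IP address, the 2 second interval is made up and is not a documented Google limit, and ThrottleSketch and millisToWait are illustrative names.

```java
public class ThrottleSketch {

  // Hypothetical helper: how long to wait before the next call so that
  // two calls are at least minIntervalMillis apart.
  public static long millisToWait(long lastCallMillis, long nowMillis, long minIntervalMillis) {
    long elapsed = nowMillis - lastCallMillis;
    return Math.max(0L, minIntervalMillis - elapsed);
  }

  public static void main(String[] args) throws InterruptedException {
    long minIntervalMillis = 2000L; // assumed spacing, not a documented limit
    // Pretend the last call happened one full interval ago so the first search runs at once.
    long lastCall = System.currentTimeMillis() - minIntervalMillis;
    String[] terms = {"Elon Musk", "Java Jsoup"};
    for (String term : terms) {
      Thread.sleep(millisToWait(lastCall, System.currentTimeMillis(), minIntervalMillis));
      lastCall = System.currentTimeMillis();
      // A real run would call GoogleCrawler.searchGoogle(term, 10) here.
      System.out.println("Would search for: " + term);
    }
  }
}
```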

The code was tested to work on October 4th, 2020. If Google changes something in its search results, this code might stop working, but it should be easy to fix. Let me know in the comments if this code was helpful.

Tags:

code, Examples, Google, JAVA, Search

6 Comments

Roman
Posted at 02:45h, 13 January Reply

Do you know how to parse the next hidden results?
- shashank
  Posted at 11:38h, 22 September Reply
  
  There are multiple ways. You can try increasing the result count or try passing the next page URL.
Leo
Posted at 10:39h, 12 January Reply

This code still works correctly, thank you so much!
Joseph
Posted at 12:05h, 01 February Reply

So each time I run it, it gives me a MalformedURLException : no protocol
- shashank
  Posted at 06:59h, 14 February Reply
  
  Can you please share what URL you are using? Try printing the final URL that you are using and putting it into the browser.
Rasmus
Posted at 15:05h, 10 February Reply

Same issue here (MalformedURLException : no protocol)
The debugger gives me:
googleSearchQuery = “https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3DElon+Musk%26num%3D10”
