Alexander Chepurnoy

The Web of Mind

Google+ Hack - How to Get 440M Profile URLs


Do you have a Google+ profile? I have bad news for you: now I have your profile URL. And I'm going to show you how you can obtain 440M Google+ profile URLs too.

Ok, follow these steps:

  • Go to http://google.com/robots.txt . It is the file containing rules for crawlers. Go to the end of it and find the _Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml_ line.
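
    If you don't want to scroll through the whole file by hand, a quick one-liner (my own sketch, assuming curl and grep are available) pulls out the Sitemap entries:

      # Print only the Sitemap: lines from Google's robots.txt
      curl -sL http://google.com/robots.txt | grep -i "^sitemap:"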

  • profiles-sitemap.xml? Hmm, what could it be? Let's take a closer look: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml. It is an XML file with 50000 URLs pointing to compressed sitemaps: <sitemapindex><sitemap><loc>http://www.gstatic.com/s2/sitemaps/sitemap-00000-of-50000.gz</loc><lastmod>2012-10-15</lastmod></sitemap> ...
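
    To sanity-check the index from the command line, you can count its <loc> entries (a rough sketch, assuming curl and grep; it should print 50000):

      # Count the sitemap entries listed in the index
      curl -s http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml | grep -o "<loc>" | wc -l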

  • Download one file and unpack it: http://www.gstatic.com/s2/sitemaps/sitemap-00000-of-50000.gz. You will see a plain text file with ~9000 profile URLs!
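
    From the command line, the same check looks like this (assuming wget and gzip are installed):

      # Grab one compressed sitemap and print the first few profile URLs
      wget -q http://www.gstatic.com/s2/sitemaps/sitemap-00000-of-50000.gz
      zcat sitemap-00000-of-50000.gz | head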

  • Download the files in bulk with a bash script like this:

      #!/bin/bash
      # Fetch all 50000 compressed sitemaps; extra arguments are passed through to wget
      for i in {0..49999}
      do
          printf "http://www.gstatic.com/s2/sitemaps/sitemap-%05d-of-50000.gz\n" $i | wget -i- "$@"
      done
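
    Fetching 50000 files one by one takes a while. If your xargs supports the -P flag (a GNU extension; this part is my own sketch, not from the script above), you can generate the URL list once and download in parallel:

      # Write all sitemap URLs to a file, then fetch 8 of them at a time
      for i in {0..49999}; do
          printf "http://www.gstatic.com/s2/sitemaps/sitemap-%05d-of-50000.gz\n" $i
      done > urls.txt
      xargs -n 1 -P 8 wget -q < urls.txt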
    
  • Uncompress all the files in bulk with another bash script:

      #!/bin/bash
      # Unpack every sitemap into the unpacked/ directory, dropping the .gz suffix
      mkdir unpacked
      for a in *.gz; do
          gunzip -c $a > unpacked/`echo $a | sed s/.gz//`
      done

  • Go to the unpacked folder. It contains 50000 plain text files with ~9000 profile URLs in each. How many in total? wc -l * gives you the answer: “… 444024650 total”
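
    To make sure the count is not inflated by duplicates, you can also count distinct URLs (a sketch; sorting ~444M lines takes noticeable time and disk space):

      # Count unique profile URLs across all unpacked sitemaps
      cat unpacked/* | sort -u | wc -l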

  • All the URLs were obtained by legal methods only.

  • Now you can parse the profile data with Nutch or a simpler parser (however, I recommend using Hadoop/Akka/RabbitMQ to spread the parsing over many machines). Or you can hire a tech person able to do it if you didn't get what the last sentence is about.
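
    Just to give a flavour of what even a simple parser can look like, here is a minimal bash sketch (my own example, not Nutch or a distributed setup): it fetches a few profile pages from one unpacked sitemap and prints their <title> tags.

      # Fetch the first 10 profile pages and print their page titles
      head -n 10 unpacked/sitemap-00000-of-50000 | while read url; do
          wget -q -O - "$url" | grep -o "<title>[^<]*" | sed 's/<title>//'
      done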
