Alexander Chepurnoy

The Web of Mind

Google+ Hack - How to Get 440M Profile URLs

Do you have a Google+ profile? I have bad news for you: I now have your profile URL. And I'm going to show you how you can obtain 440M Google+ profile URLs too.

Ok, follow these steps:

  • Open the Google+ robots.txt. It's the file containing rules for crawlers. Go to the end of it and find the Sitemap: line.

  • profiles-sitemap.xml? Hmm, what could it be? Let's take a closer look. It is an XML file with 50000 URLs of compressed sitemaps: <sitemapindex><sitemap><loc></loc><lastmod>2012-10-15</lastmod></sitemap> ...

  • Download one file and unpack it. You will see a plain text file with ~9000 profile URLs!

  • Download the files in bulk with a bash script like this:

      for i in {0..49999}; do
          printf "" $i | wget -i- "$@"
      done

  • Uncompress all the files in bulk with another bash script:

      #!/bin/bash
      mkdir -p unpacked
      for a in *.gz; do
          # write each archive's contents to unpacked/<name without .gz>
          gunzip -c "$a" > "unpacked/${a%.gz}"
      done

  • Go to the unpacked folder. It now holds 50000 plain text files with ~9000 profile URLs in each. How many in total? wc -l * gives you the answer: “… 444024650 total”

  • All URLs were obtained by legal methods only.

  • Now you can parse the profile data with Nutch or a simpler parser (however, I recommend using Hadoop/Akka/RabbitMQ to spread the parsing over many machines). Or, if you didn't get what the last sentence is about, you can hire a tech person who is able to do it.
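The lookups in the first steps and the final count can be scripted too. Below is a minimal sketch, assuming you have already saved the crawler rules file and the sitemap index to local files. The helper names list_sitemaps, list_locs and count_urls are hypothetical and only for illustration; nothing here downloads anything.

```shell
# Hypothetical helpers; they only read files that are already on disk.

# Print the URL from every "Sitemap:" line of a saved crawler rules file.
list_sitemaps() {
    # The directive name is matched case-insensitively; strip it and
    # any trailing whitespace (including a stray carriage return).
    grep -i '^sitemap:' "$1" | sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Print the content of every <loc> element of a saved sitemap index.
list_locs() {
    # grep -o prints each match on its own line, so this also works
    # when the whole index is one long line of XML.
    grep -o '<loc>[^<]*</loc>' "$1" | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# Count the lines of all files in a directory. Unlike "wc -l *", this
# does not depend on how many file names fit on one command line.
count_urls() {
    find "$1" -type f -exec cat {} + | wc -l
}
```

For example, list_locs profiles-sitemap.xml > urls.txt would give one sitemap URL per line, which is the format wget -i expects, and count_urls unpacked prints the total number of profile URLs.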