HACKER Q&A
📣 nerdynapster

How do I build a robust social media profile data scraper?


* Any language would do, but Python is preferred.
* Any existing solutions, paid or free, are welcome.


  👤 AshArchangel Accepted Answer ✓
Hello, I do some web scraping at my job using Python, but I've found that scraping social media for non-specific data is often the same amount of work as searching manually or using Google search operators like "site:". With that said, and similar to anigbrowl's comment, without a specific goal in mind you will be hard pressed to solve your problem. Social media scraping varies heavily by platform in terms of what information is available to scrape (without brute force).
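To make the "site:" approach concrete, here is a minimal sketch that builds such a query and the matching search URL. The function names and the sample name and domains are made up for illustration, and grouping several "site:" terms with OR in parentheses is an assumption about how the search engine combines operators, so check the results by hand.

```python
from urllib.parse import quote_plus


def build_site_query(name, domains):
    """Build a query that restricts a quoted name to the given domains."""
    # Join the domains as: site:a.com OR site:b.com
    sites = " OR ".join(f"site:{domain}" for domain in domains)
    return f'"{name}" ({sites})'


def search_url(query):
    """Turn the query into a URL that can be pasted into a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    query = build_site_query("Jane Doe", ["twitter.com", "instagram.com"])
    print(query)
    print(search_url(query))
```

This only builds the URL for manual use in a browser; it does not fetch or parse any results.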

If you want some social media OSINT tools that are already built in Python, BlackArch has a list of open-source tools that you can access and use:

https://blackarch.org/social.html

If you are trying to identify someone's social media accounts, Sherlock and SpiderFoot are commonly cited, but again, I don't think these tools save much time compared with just using Google's search operators efficiently.

https://github.com/sherlock-project/sherlock


👤 anigbrowl
You need to be more specific about what part you're having a problem with and what your goal is: to build a scraper you can sell or give away, to accumulate social media data for commercial purposes, or some research goal?

There's no generic solution, since every platform is different, and there's no one scraping library (or approach) to rule them all. Most efforts I've seen use BeautifulSoup to parse web pages and/or Selenium to automate browser actions, but I'm sure there are better alternatives. It is a frustrating space to work in as many/most tools are limited and the methods jealously guarded, much as most social media companies jealously guard the data they harvest.
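As a rough illustration of the "parse web pages" half, the sketch below pulls a bio-like string out of a page's meta description. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without installing anything; BeautifulSoup would do the same job in fewer lines. The sample HTML and the assumption that the bio lives in a meta description tag are invented for the example, and real platforms will differ.

```python
from html.parser import HTMLParser


class BioExtractor(HTMLParser):
    """Collect the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.bio = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once a bio has been found.
        if tag != "meta" or self.bio is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "description":
            self.bio = attr_map.get("content")


def extract_bio(html):
    """Return the meta description text, or None if the page has none."""
    parser = BioExtractor()
    parser.feed(html)
    return parser.bio


# Hypothetical profile page, not taken from any real site.
SAMPLE_HTML = """
<html><head>
<title>Example User</title>
<meta name="description" content="Coffee, code, and cats.">
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_bio(SAMPLE_HTML))
```

Pages that render the profile with JavaScript will not expose the text this way, which is where browser automation such as Selenium comes in.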

You could probably learn a lot by leveraging existing tools and seeing what you can do on the analysis side. Twitter has a fairly well-specified API and if you are getting frustrated with the limits of that, there's twint. Facebook is the biggest 'pile' of data but they know it, and when you look at the source of a FB page you can see there's a lot of stuff that messes up your ability to parse that data, accidentally or deliberately. You might be better off providing tooling for small but growing social media platforms that are not as big (and so less valuable/profitable to scrape) but also don't have the accumulated digital sediment that makes it difficult to do so.


👤 TechBro8615
It’s not an easy problem, and doing it successfully will be expensive.

First, you need a reliable API, ignoring rate limiting / verification concerns. To get that, you should reverse engineer the mobile apps and replicate their API calls. For many apps, you can find an active GitHub project doing this already. But note that it’s a moving target. As an alternative, you may consider setting up a device farm and automating interactions via something like Cycript.

Next, you need to circumvent anti-abuse measures. For most social networks, this means you need to create fake profiles. This will likely entail phone verification. You will also need multiple residential proxies to route traffic through, but not too many per account.

For phone verification, VOIP numbers will not work. Check blackhat forums for services offered in countries in SE Asia with SIM farms for real numbers. Note that you may not have perpetual access to these numbers, so if you’re prompted to re-verify a number for an account, you might have to just burn the account. You may be able to appear “less suspicious” by setting up TOTP (which can be automated) on your accounts and removing the phone number, if possible.

For IP addresses, you need non-datacenter IPs that other people are not using. Your best bet is luminati.io, which is the business side of the Hola Chrome extension that routes your requests through users’ computers. You can get “sticky” IPs but only insofar as a user continues to be online. The minimum commitment is $500 per month, bandwidth is expensive and all requests are tracked. You will need to pass a Skype interview to sign up.

tl;dr It’s possible, but doing it successfully requires significant investment in infrastructure and time. You will need to partake in “gray market” activities and deal with some shady operations. Depending on your jurisdiction, you will likely violate at least one law.


👤 nerdynapster
Thank you fellas for your comments, they are definitely helpful. For now, I'm looking for a way to extract bios (biographies) from Twitter, Instagram and FB for research purposes; later I plan to scale that up to include other social media platforms (YouTube, LinkedIn, ...).

My plan is to use APIs where they exist, or otherwise build a crawler to scrape the data.
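If the crawler route is taken, one common first step is to check a site's robots.txt before fetching a URL. A minimal sketch with the standard library's urllib.robotparser follows; the rules, the "research-bot" user agent and the example.com URLs are made up so that the snippet runs offline. robots.txt only states the site's crawling preferences and does not cover its terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the
# file from the target site instead of hard-coding it.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def make_checker(robots_txt):
    """Parse robots.txt text and return a ready-to-use parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checker = make_checker(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/users/jane",
                "https://example.com/private/settings"):
        print(url, is_allowed(checker, "research-bot", url))
```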

Is this something that can be done without violating any law?


👤 verdverm
You could pay for LexisNexis if you want the real dirt on people.