Breaking & Scraping Pahe.in

From 2 years ago, i started a hacking project for challenge in WebScraping domain to scrape the whole Pahe.ph website which leaks movies, series, anime & more.

History

At first Pahe website was plain Self-Hosted Offshore WordPress site secured with Surcuri WAF System & links protected using SoraLink WordPress Plugin and there was no other implemented security measures on the website (like Cloudflare, Content Obfuscation & etc).

At this stage the scraper was working fine until they switched to Cloudflare Fronting. After this i needed a way to bypass Cloudflare security & captcha. and after some reconnaissance and probing i figured out a way to bypass Cloudflare in automated manners, and the scraper came back online.

After that they implemented content obfuscation to obfuscate the links section in the website using custom JS algorithm which i reversed and was able break and extract the links.

Results

So at the end i developed a Webscraper capable to scrape the whole website and address these issues

  • Bypass Surcuri WAF
  • Handle SoraLink & Extract Direct Links
  • Bypass Cloudflare Fronting
  • Extract Posts in Full Details
  • Decode Download Section Obfuscation
  • Stateful Scraper which support Resume, Failsafe, Looping operations.

Used Technologies

Public Release

I decided to released the scraper to the internet on GitHub, and the project is in active development right now from my side for maintenance and enhancement for a while, so check the active log on GitHub to know the latest To-Do points.

Build Releases

How To Use ?

  • Simpe, just start the application from “PaheScrapper.exe”
  • You gonna find multiple modes (Continue, Resync, New, Loop, State)
    • Continue: continue previous scraping loaded from manager state file.
    • Resync: force the scraper to rescan all the manager state file and resync with pahe.in website to the last state.
    • New: start new and clean scraping and discard old manager state if exist.
    • Loop: start the scraper in loop mode to scrape and then resync forever.
    • State: get the current scraper state.
  • After the scraper finish the website, you gonna find output file with the website context in the scraper directory or always you can check the manager state file to get the non-finished scraping.

Note: Scraper takes long time for the first run to build the manager state (week or more depend on your configuration) to scrape the whole website and takes small amount of time to resync and i’m working and testing the resync to make sure it fast enough to resync with the website daily cause they keep changing download links frequently.

Info: “PaheScrapper.exe.config” is the configuration for the scraper so leave default if you don’t know what you are doing.

Scraper in Action

13 Comments on “Breaking & Scraping Pahe.in

  1. Amazing 🥰
    Hope u provide either provide the scrap data or tutorial on how to run this locally by ourselves.

    Like

  2. Hey, do you plan to implement auto-start for this? So this scraper will auto-start when windows start and continue the scraping. I want to use this in an RDP
    Also thanks for this, i think it’s a good tool, i’m in process of scraping

    Like

  3. Is there any post processing 9f scrapped data as i have completed the scrapping but nothing in the links section. Used alpha 2.5.

    Like

  4. Error: session not created: This version of ChromeDriver only supports Chrome version 96
    Current browser version is 101.0.4951.41

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: