Breaking & Scraping Pahe.in
At first, the Pahe website was a plain self-hosted offshore WordPress site secured with the Sucuri WAF, with its links protected by the SoraLink WordPress plugin. No other security measures (such as Cloudflare or content obfuscation) were in place.
At this stage the scraper worked fine, until the site switched to Cloudflare fronting. I then needed a way to get past Cloudflare's security checks and CAPTCHA. After some reconnaissance and probing, I figured out how to do this in an automated manner, and the scraper came back online.
After that, they obfuscated the links section of the website with a custom JS algorithm, which I reverse-engineered so the scraper could decode and extract the links again.
In the end, I developed a web scraper capable of scraping the whole website and addressing these issues:
- Bypass Sucuri WAF
- Handle SoraLink & Extract Direct Links
- Bypass Cloudflare Fronting
- Extract Posts in Full Details
- Decode Download Section Obfuscation
- Stateful Scraper that supports Resume, Failsafe & Looping operations
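
To illustrate the last point, here is a minimal sketch of how resume and failsafe behaviour can be built around a small state file. It is not the project's actual code: `ScraperState`, `run`, the `handler` callback and the JSON layout are hypothetical names chosen for this example, and it assumes each post can be identified by a string ID.

```python
import json
import os
import tempfile


class ScraperState:
    """Tracks finished item IDs on disk so an interrupted run can resume."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.done = set(json.load(fh)["done"])

    def mark_done(self, item_id):
        self.done.add(item_id)
        self._save()

    def _save(self):
        # Write to a temp file, then rename it over the state file,
        # so a crash cannot leave a half-written state file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"done": sorted(self.done)}, fh)
        os.replace(tmp, self.path)


def run(item_ids, handler, state):
    """Process pending items; failures are collected instead of aborting."""
    failed = []
    for item_id in item_ids:
        if item_id in state.done:
            continue  # resume: skip work finished in an earlier run
        try:
            handler(item_id)
        except Exception as exc:  # failsafe: one bad item must not stop the run
            failed.append((item_id, str(exc)))
            continue
        state.mark_done(item_id)
    return failed
```

Calling `run` again with the same state file only touches items that are still pending, so looping amounts to repeating the call until the returned list is empty or a retry limit is reached.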
I decided to release the scraper on GitHub. The project is under active development on my side, and I will keep maintaining and enhancing it for a while, so check the activity log on GitHub for the latest to-do points.