Viral_

[Tutorial] PHP Scraping 101

I am going to show you the basics of scraping with PHP. As a demonstration, we will scrape player stats from the RuneScape website.
 
The basics of this all run through cURL. If someone would like a tutorial on cURL I would be happy to give you the rundown, but for this demonstration we are going to use a library: Goutte (the fabpot/goutte Composer package).
 
I have been very fond of this library. I have built my own, but for ease of use I will be using Goutte here.
 
Requirements
  • Webserver with LAMP installed
  • Composer
If you do not have a webserver, you can go to this thread and follow my tutorial on how to set one up.
 
If you need help setting up Composer, go to this link.
 
 
Tutorial
 
In the folder where you are putting your PHP scraping file, run:
 
composer require fabpot/goutte
 
Start your PHP file:
 
<?PHP

?>
 
 
Whether you are pulling in a platform or starting from scratch, you always want some sort of debugging to see if there are any errors.
 
<?PHP

$debug = 1;

if($debug == 1) {
  error_reporting(E_ALL);
  ini_set('display_errors', 1);
}

?>

 

This will now let us see any errors. Now let's add the autoloader to make sure we are pulling in the Goutte library.
 
<?PHP
$debug = 1;

if($debug == 1) {
  error_reporting(E_ALL);
  ini_set('display_errors', 1);
}

require_once('vendor/autoload.php');

?>

 

 
To instantiate the Goutte Client, we first need to import the class with a use statement.
 
<?PHP
$debug = 1;

if($debug == 1) {
  error_reporting(E_ALL);
  ini_set('display_errors', 1);
}

require_once('vendor/autoload.php');

use Goutte\Client;

$username = 'zezima';

$client = new Client();

$post_values = array('user1' => $username, 'submit' => 'Search');

$crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

?>

 

 
Here we create the client, define our POST parameters, and then do our crawl. I found the POST parameters by going to the hiscores lookup and typing in a username with Chrome's inspect element console open, then looking at the request and the parameters it sent. If you need more details on this, a tutorial can be made.
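To make the request body concrete, here is a small sketch of what http_build_query produces for the two form fields. The field names user1 and submit are the ones taken from the inspected request; nothing in this snippet contacts the server.

```php
<?php
// Form fields observed in the browser's network inspector.
$username = 'zezima';
$post_values = array('user1' => $username, 'submit' => 'Search');

// http_build_query URL-encodes the array into a form-encoded body string.
$body = http_build_query($post_values);

echo $body, PHP_EOL; // user1=zezima&submit=Search
```

With the default encoding, a username containing a space would have the space sent as a + sign.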
 
Now let's build an each loop over the elements matching #contentHiscores tr. We want to loop through every tr and pull out its values. The syntax is a lot like jQuery.
 
<?PHP
$debug = 1;

if($debug == 1) {
  error_reporting(E_ALL);
  ini_set('display_errors', 1);
}

require_once('vendor/autoload.php');

use Goutte\Client;

$username = 'zezima';

$client = new Client();

$post_values = array('user1' => $username, 'submit' => 'Search');

$crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

$acc_array = array();
$crawler->filter('#contentHiscores tr')->each(function($node, $k) use (&$acc_array){
    // Skip the first four rows and keep only rows with five cells (the skill rows).
    if(!in_array($k, array(0,1,2,3)) && $node->children()->count() == 5) {
      $cells = $node->children();
      $acc_array['skills'][strtolower(trim($cells->eq(1)->text()))] = array(
        'rank'  => $cells->eq(2)->text(),
        'level' => $cells->eq(3)->text(),
        'xp'    => $cells->eq(4)->text()
      );
    }
});
?>
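To see what the row filter keeps without hitting the live site, the sketch below runs the same logic over a hand-written table. It uses PHP's built-in DOMDocument and DOMXPath rather than Goutte, so it is a stand-in for the each loop, not the same API. The sample markup (four leading rows, then five-cell rows of icon, name, rank, level, XP) is an assumption inferred from the indexes used in the loop, and the numbers are made up.

```php
<?php
// Made-up sample markup; the real hiscores page may be laid out differently.
$html = <<<HTML
<table id="contentHiscores">
  <tr><td>row 0</td></tr>
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
  <tr><td>row 3</td></tr>
  <tr><td></td><td> Attack </td><td>100</td><td>50</td><td>101,333</td></tr>
  <tr><td></td><td> Defence </td><td>200</td><td>40</td><td>37,224</td></tr>
  <tr><td>footer</td></tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser notices out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$acc_array = array();
// XPath counterpart of the CSS selector "#contentHiscores tr".
$rows = $xpath->query('//*[@id="contentHiscores"]//tr');

for ($k = 0; $k < $rows->length; $k++) {
    $cells = $xpath->query('./*', $rows->item($k)); // child elements of the row
    // Skip the first four rows and anything that is not a five-cell row.
    if ($k < 4 || $cells->length != 5) {
        continue;
    }
    $skill = strtolower(trim($cells->item(1)->textContent));
    $acc_array['skills'][$skill] = array(
        'rank'  => $cells->item(2)->textContent,
        'level' => $cells->item(3)->textContent,
        'xp'    => $cells->item(4)->textContent,
    );
}

print_r($acc_array);
```

Only the two five-cell rows end up in the array, keyed by the lowercased, trimmed skill name; the four leading rows and the one-cell footer row are dropped.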

 

Let's now find the username and add it to $acc_array.
 
<?PHP
$debug = 1;

if($debug == 1) {
  error_reporting(E_ALL);
  ini_set('display_errors', 1);
}

require_once('vendor/autoload.php');

use Goutte\Client;

$username = 'zezima';

$client = new Client();

$post_values = array('user1' => $username, 'submit' => 'Search');

$crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

$acc_array = array();
$crawler->filter('#contentHiscores tr')->each(function($node, $k) use (&$acc_array){
    // Skip the first four rows and keep only rows with five cells (the skill rows).
    if(!in_array($k, array(0,1,2,3)) && $node->children()->count() == 5) {
      $cells = $node->children();
      $acc_array['skills'][strtolower(trim($cells->eq(1)->text()))] = array(
        'rank'  => $cells->eq(2)->text(),
        'level' => $cells->eq(3)->text(),
        'xp'    => $cells->eq(4)->text()
      );
    }
});

// Strip the leading "Personal ..." text so only the username remains.
// preg_quote escapes any regex metacharacters in the username.
$acc_array['username'] = preg_replace('/Personal(.*?)('.preg_quote($username, '/').')$/', '$2', $crawler->filter('#contentHiscores tr td')->children()->eq(0)->text());
?>
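The regular expression is easier to follow on a fixed string. The heading text below is a hypothetical example of what that first element might contain; the live page's wording may differ.

```php
<?php
$username = 'zezima';
// Hypothetical heading text, used only for illustration.
$heading = 'Personal scores for zezima';

// Group 2 is the username anchored at the end of the string.
// preg_quote escapes any regex metacharacters in the username.
$pattern = '/Personal(.*?)(' . preg_quote($username, '/') . ')$/';
$result = preg_replace($pattern, '$2', $heading);

echo $result, PHP_EOL; // zezima
```

If the pattern does not match, for example because the page capitalises the name differently, preg_replace returns the input string unchanged, so it is worth checking the result before relying on it.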

 

To finish off and show you the results, let's do a print_r.
 
 
FINAL EXAMPLE:
 
<?PHP
$debug = 1;

if($debug == 1) {
  error_reporting(E_ALL);
  ini_set('display_errors', 1);
}

require_once('vendor/autoload.php');

use Goutte\Client;

$username = 'zezima';

$client = new Client();

$post_values = array('user1' => $username, 'submit' => 'Search');

$crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

$acc_array = array();
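$crawler->filter('#contentHiscores tr')->each(function($node, $k) use (&$acc_array){
    // Skip the first four rows and keep only rows with five cells (the skill rows).
    if(!in_array($k, array(0,1,2,3)) && $node->children()->count() == 5) {
      $cells = $node->children();
      $acc_array['skills'][strtolower(trim($cells->eq(1)->text()))] = array(
        'rank'  => $cells->eq(2)->text(),
        'level' => $cells->eq(3)->text(),
        'xp'    => $cells->eq(4)->text()
      );
    }
});

// Strip the leading "Personal ..." text so only the username remains.
// preg_quote escapes any regex metacharacters in the username.
$acc_array['username'] = preg_replace('/Personal(.*?)('.preg_quote($username, '/').')$/', '$2', $crawler->filter('#contentHiscores tr td')->children()->eq(0)->text());

// Print the collected data in a readable form.
echo '<pre>';
print_r($acc_array);
echo '</pre>';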
?>

 

 
Hope this tutorial was helpful. If you have any questions, feel free to reply or PM me and I will respond promptly.
Edited by Viral_