Goutte is a screen scraping and web crawling library for PHP, built on top of Symfony's DomCrawler and Guzzle and created by Fabien Potencier.
Building crawlers with this library is straightforward: you just need to extract the data with CSS selectors.
Let's see an example that extracts PHP Symfony jobs in Barcelona from Spain's main job portal, Infojobs.
You just need to create an instance of the Goutte Client and make the request; it returns a Crawler (http://symfony.com/doc/current/components/dom_crawler.html).
$client = new \Goutte\Client();
$crawler = $client->request("POST", "http://www.infojobs.net/jobsearch/search-results/list.xhtml", $parameters);
Next we need to find the results. With Chrome's inspector or Firebug they are easy to locate.
$crawler->filter(".result-list")
Infojobs mixes some ads, marked with the .list-logos CSS class, in with the results, but they are easy to exclude with the CSS :not selector.
$crawler->filter(".result-list > li:not(.list-logos)")
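To see what the :not selector does without hitting the real site, here is a minimal sketch that runs both selectors against a small inline document. The HTML is made up for illustration (it only imitates the structure described above), and the script assumes that symfony/dom-crawler and symfony/css-selector have been installed with Composer.

```php
<?php
// Assumption: `composer require symfony/dom-crawler symfony/css-selector` was run here.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup that imitates the result list, with one ad mixed in.
$html = <<<'HTML'
<html><body>
<ul class="result-list">
    <li><h2 class="result-list-title"><a href="/job-1">Job one</a></h2></li>
    <li class="list-logos">Sponsored logos</li>
    <li><h2 class="result-list-title"><a href="/job-2">Job two</a></h2></li>
</ul>
</body></html>
HTML;

$crawler = new Crawler($html);

// Every direct <li> child of the list, ads included.
$all = $crawler->filter('.result-list > li');

// Only the <li> elements that do not carry the list-logos class.
$jobs = $crawler->filter('.result-list > li:not(.list-logos)');

echo $all->count(), " items, ", $jobs->count(), " jobs\n"; // prints "3 items, 2 jobs"
```

Testing selectors against a saved or hand-written snippet like this is a cheap way to check them before pointing the crawler at the live page.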
After that, we just need to extract the data for each job. The crawler's each() method does this: it runs a callback on every matched node and returns the results as an array.
return $crawler->filter(".result-list > li:not(.list-logos)")->each(function (Crawler $node) {
    return array(
        "title"    => trim($node->filter(".result-list-title a")->text()),
        "url"      => $node->filter(".result-list-title a")->attr("href"),
        "employee" => trim($node->filter(".result-list-subtitle")->text())
    );
});
Finally, we put it all together, adding the extra fields we need for this example.
use Symfony\Component\DomCrawler\Crawler;

class Infojobs
{
    public function search(array $parameters)
    {
        $client = new \Goutte\Client();
        $crawler = $client->request("POST", "http://www.infojobs.net/jobsearch/search-results/list.xhtml", $parameters);

        // Skip the ad entries (.list-logos) and build one array per job.
        return $crawler->filter(".result-list > li:not(.list-logos)")->each(function (Crawler $node) {
            // The tag list holds area, date, contract, type and salary, in that order.
            $tags = $node->filter(".tag-group li");

            return array(
                "title"       => trim($node->filter(".result-list-title a")->text()),
                "url"         => $node->filter(".result-list-title a")->attr("href"),
                "employee"    => trim($node->filter(".result-list-subtitle")->text()),
                "area"        => trim($tags->getNode(0)->textContent),
                "created"     => trim($tags->getNode(1)->textContent),
                "contract"    => trim($tags->getNode(2)->textContent),
                "type"        => trim($tags->getNode(3)->textContent),
                "salary"      => trim($tags->getNode(4)->textContent),
                "description" => $node->filter(".result-list-description")->text()
            );
        });
    }
}
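One caveat with this kind of extraction: as far as I know, text() throws an InvalidArgumentException when the selector matches nothing, and getNode() returns null when the index is out of range, so a single job that lacks a field can abort the whole crawl. A possible way around it is sketched below. The helpers textOrDefault and nthTextOrDefault are hypothetical names of mine, not part of Goutte or DomCrawler, and the inline HTML is invented for the demonstration.

```php
<?php
// Assumption: `composer require symfony/dom-crawler symfony/css-selector` was run here.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: trimmed text of the first match, or $default when nothing matches.
 */
function textOrDefault(Crawler $node, string $selector, string $default = ''): string
{
    $match = $node->filter($selector);

    return $match->count() > 0 ? trim($match->text()) : $default;
}

/**
 * Hypothetical helper: trimmed text of the n-th match (zero-based), or $default.
 */
function nthTextOrDefault(Crawler $node, string $selector, int $index, string $default = ''): string
{
    // getNode() returns null when fewer than $index + 1 elements matched.
    $element = $node->filter($selector)->getNode($index);

    return $element !== null ? trim($element->textContent) : $default;
}

// Made-up fragment: a job entry with a single tag and no subtitle.
$node = new Crawler('<html><body><div><ul class="tag-group"><li> Barcelona </li></ul></div></body></html>');

echo textOrDefault($node, '.result-list-subtitle', '---'), "\n"; // prints "---"
echo nthTextOrDefault($node, '.tag-group li', 0), "\n";          // prints "Barcelona"
echo nthTextOrDefault($node, '.tag-group li', 4, 'n/a'), "\n";   // prints "n/a"
```

Inside the each() callback, calls such as trim($node->filter(".result-list-subtitle")->text()) could then be swapped for textOrDefault($node, ".result-list-subtitle"), at the cost of silently returning empty fields when the page layout changes.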
After that we just need to call it with the following parameters:
$infojobs = new Infojobs();
$data = array(
    "palabra"           => "symfony",
    "region"            => "on",
    "of_provincia"      => "9",
    "canal"             => "0",
    "origen_busqueda"   => "0",
    "origen_accion"     => "0",
    "vieneUrlExecutive" => "false"
);
$jobs = $infojobs->search($data);
print_r($jobs);
Here is our data:
Array
(
    [0] => Array
        (
            [title] => Programador PHP / Symfony 2
            [url] => //www.infojobs.net/barcelona/programador-php-symfony-2/of-i08bf7b5ab640c7baaca8d12c80fbf2
            [employee] => ---
            [area] => Barcelona
            [created] => 03 de oct
            [contract] => Contrato indefinido
            [type] => Jornada completa
            [salary] => 12.000€ - 18.000€ Bruto/año
            [description] => BCNTEKART, empresa especializada en servicios internet, web y desarrollo [...]
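Note that the url field comes back protocol-relative (it starts with //), so it needs a scheme before it can be requested or stored as a full link. A minimal sketch of one way to handle that follows. The absoluteUrl function is a hypothetical helper of mine, and defaulting to http is an assumption that simply mirrors the scheme of the search URL used above.

```php
<?php
/**
 * Hypothetical helper: turn a protocol-relative URL ("//host/path") into an absolute one.
 * URLs that already carry a scheme are returned unchanged.
 */
function absoluteUrl(string $url, string $scheme = 'http'): string
{
    // A protocol-relative URL starts with exactly this two-slash prefix.
    if (strncmp($url, '//', 2) === 0) {
        return $scheme . ':' . $url;
    }

    return $url;
}

echo absoluteUrl('//www.infojobs.net/barcelona/programador-php-symfony-2/of-i08bf7b5ab640c7baaca8d12c80fbf2'), "\n";
// prints "http://www.infojobs.net/barcelona/programador-php-symfony-2/of-i08bf7b5ab640c7baaca8d12c80fbf2"
```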
This is a really simple example to test this awesome tool.
Comments, doubts and questions are welcome.