Using PHP cURL to read RSS feed XML

I very rarely use feeds; I prefer to go to my favourite websites and see what is going on instead of downloading their content to my computer with some RSS reader. However, the fact is that feeds are getting popular, and people are searching for ways to fetch and automatically parse them with PHP. Because RSS and ATOM are nothing more than XML documents, I have a really simple way of handling those feeds which I want to share with you (through my blog as well as through my blog feeds :)).

Obviously, to parse feeds we have to get them first. A very popular tool for handling HTTP connections (as well as other types of connections) is the cURL library. libcurl is designed for communication between servers over different protocols: it not only supports HTTP, HTTPS, gopher and telnet connections, but also lets you send data through POST and GET and even manage cookies sent by the server. So basically, using this library you can get data from literally any page on the Internet, no matter whether it is password protected or requires you to POST some data.

Using PHP cURL

To use cURL on Windows you only need to uncomment the extension in the php.ini file; on Linux (like always) you need to compile PHP with --with-curl, or install your distribution's PHP cURL package if it provides one. To connect to an RSS feed with cURL, you first need to initialize a cURL handle, which is done with:

$ch = curl_init("http://localhost/curl/rss.xml");

where, obviously, the first parameter is the URL of the website (the feed, in our case) you want to connect to. Next we need to set a few connection options using curl_setopt(), which takes three parameters: the cURL handle we created earlier, the option key, and the option value. For our simple connection we need only two options.

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);

The most important one here is CURLOPT_RETURNTRANSFER. By default curl_exec() prints the response straight to the output and returns only true or false, which is sometimes useful when you only want to send a request and do not care about the response body. In our case we want to parse the XML feed, so we need curl_exec() to return the response as a string instead. Setting CURLOPT_HEADER to 0 keeps the HTTP headers out of that string, so it contains only the XML.

OK, the next step is to execute the request, wait for a response, and close the connection. It sounds like a lot of work but it is not; it is done with only two lines of code:

$data = curl_exec($ch);
curl_close($ch);

Half of the work is done. If everything went right and the page we connected to contained an RSS feed or some other kind of XML data, then the $data variable contains a string which can now be parsed (if the request failed, curl_exec() returns false instead). A lot of newbies try to parse such a string with regular expressions, or explode it and pseudo-parse it line by line, removing tags with str_replace(). This is something I do not encourage, not only because it is very unprofessional, but also because since PHP 5 we have built-in tools for parsing XML, which are not only more pro but also a lot easier to use than pseudo-parsing line by line.
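The "if everything went right" part is worth checking in code. Below is a minimal sketch of how the steps so far could be wrapped into one function. The name fetchFeed(), the timeout value and the decision to throw an exception are my assumptions for illustration; only the curl_* calls themselves come from the cURL extension.

```php
<?php
// Hypothetical helper (not a built-in): fetch a URL and return the
// response body as a string, or throw if the transfer failed.
function fetchFeed($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, 0);                // keep HTTP headers out of the body
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds); // give up after this many seconds

    $data = curl_exec($ch);
    if ($data === false) {
        // curl_error() describes what went wrong (DNS, refused connection, timeout...)
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('cURL request failed: ' . $error);
    }

    curl_close($ch);
    return $data;
}
```

With something like this in place, the rest of the code only ever sees a string, and a failed download shows up as an exception instead of a confusing XML parse error later on.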

Working with SimpleXML

The PHP manual currently describes 13 libraries for handling XML data. That is quite a lot, but most of them are designed to help build XML files rather than parse them, so in our case the best bet is SimpleXML, which is not only well suited to converting a string into an XML object but is also enabled by default in PHP 5, so usually there is no need to install it. We have our XML string in the $data variable, so everything we need to do to parse it is this:

$doc = new SimpleXmlElement($data, LIBXML_NOCDATA);

Note that we could also pass a third (boolean) parameter to the constructor, which defaults to false. If we set it to true, the first argument should be a URL pointing to the XML document instead of the XML data itself, so actually we do not have to use cURL at all.
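One thing to keep in mind is that feeds found in the wild are not always well-formed XML, and the SimpleXmlElement constructor throws an exception when it cannot parse its input. Here is a rough sketch of a more forgiving variant built on simplexml_load_string(), which returns false instead. The function name parseXmlString() and the choice to return null on failure are assumptions of mine, not anything the library prescribes.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null when $data
// is not well-formed XML.
function parseXmlString($data)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = simplexml_load_string($data, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($doc === false) {
        foreach (libxml_get_errors() as $error) {
            error_log('XML error: ' . trim($error->message));
        }
    }

    // Restore the previous error handling mode before returning.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc === false ? null : $doc;
}
```

The caller can then simply test for null before trying to read any nodes.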

That’s it: now $doc is an instance of SimpleXmlElement, which basically consists only of fields and arrays. If a node occurs only once in the XML document then it is a field; if it occurs many times then it is an array … well, usually. If you want to know what is inside this object, use the old-fashioned:

print_r($doc);
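One thing print_r() does not make obvious is that the "fields" are themselves SimpleXmlElement objects rather than plain strings, so it is safer to cast them when you need text. A tiny sketch with a made-up inline document (the XML below is invented purely for illustration):

```php
<?php
// Made-up sample document, just to show node access.
$doc = new SimpleXMLElement(
    '<rss><channel><title>My feed</title>'
    . '<item><title>One</title></item>'
    . '<item><title>Two</title></item>'
    . '</channel></rss>'
);

$title  = (string) $doc->channel->title;          // single node, cast to a plain string
$count  = count($doc->channel->item);             // how many <item> nodes there are
$second = (string) $doc->channel->item[1]->title; // repeated node, accessed by index
```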

So far so good, but now comes the hard part. As you know, there are two popular feed standards on the web, RSS and ATOM. Each of them has a different structure and, what is worse, different node names. Fortunately, with SimpleXML it is easy to check which one we got.

if(isset($doc->channel))
{
    parseRSS($doc);
}
if(isset($doc->entry))
{
    parseAtom($doc);
}

All RSS documents have a <channel> node, so if our document contains this node then there is a chance that it is an RSS document; on the other hand, if it contains an <entry> node there is a chance that it is an ATOM document. I used if { … } if { … } here instead of if { … } else { … } because there is a chance that the document we parsed is neither RSS nor ATOM. In the code you can also see two functions, parseRSS() and parseAtom(). These are the functions we will use to get data out of SimpleXmlElement objects, and we are going to write them right now.
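If "there is a chance" feels too loose, the check can be tightened a little by also looking at the root element: RSS 2.0 documents start with <rss> and ATOM documents start with <feed>. The sketch below is just one possible way to do it; detectFeedType() and its return values are names I made up.

```php
<?php
// Hypothetical helper: returns 'rss', 'atom', or null for anything else.
function detectFeedType(SimpleXMLElement $doc)
{
    $root = $doc->getName(); // name of the root element, without any prefix

    if ($root === 'rss' && isset($doc->channel)) {
        return 'rss';
    }
    if ($root === 'feed' && isset($doc->entry)) {
        return 'atom';
    }
    return null; // neither RSS nor ATOM (or an ATOM feed with no entries)
}
```

The result can then drive a switch that calls the right parsing function, or reports an unsupported document when it gets null.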

function parseRSS($xml)
{
    echo "<strong>".$xml->channel->title."</strong>";
    $cnt = count($xml->channel->item);
    for($i=0; $i<$cnt; $i++)
    {
        $url   = $xml->channel->item[$i]->link;
        $title = $xml->channel->item[$i]->title;
        $desc  = $xml->channel->item[$i]->description;

        echo '<a href="'.$url.'">'.$title.'</a>'.$desc.'';
    }
}

RSS is much easier to handle than ATOM because it does not keep important data in attributes. Well, there is not really much to talk about here: you have access to any node with the simple syntax $xml->node->childNode, and if a node is an array you slightly change the code to $xml->node[$i]->childNode->childChildNode.
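Echoing feed content straight into a page is fine for a quick test, but titles and links come from someone else's server, so you may want to escape them. Here is a hedged sketch that separates reading from printing; rssItems() and renderItem() are hypothetical names of mine, and it assumes the document really is RSS with a <channel> node.

```php
<?php
// Hypothetical helper: collect RSS items into plain PHP arrays.
function rssItems(SimpleXMLElement $xml)
{
    $items = array();
    foreach ($xml->channel->item as $item) {
        $items[] = array(
            'url'   => (string) $item->link,
            'title' => (string) $item->title,
            'desc'  => (string) $item->description,
        );
    }
    return $items;
}

// Hypothetical helper: build a link with the URL and title escaped for HTML.
function renderItem(array $item)
{
    return '<a href="' . htmlspecialchars($item['url'], ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($item['title'], ENT_QUOTES, 'UTF-8') . '</a>';
}
```

The description is left out of renderItem() on purpose: it often contains HTML of its own, so how to treat it (strip tags, escape, or trust it) is a decision for your application.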

The following example will be a bit more complicated, because in order to get the entry URL we need to read an attribute of the <link> node:

function parseAtom($xml)
{
    echo "<strong>".$xml->author->name."</strong>";
    $cnt = count($xml->entry);
    for($i=0; $i<$cnt; $i++)
    {
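        // Pick the current <entry> with $i, then read the attributes of its first <link>.
        $urlAtt = $xml->entry[$i]->link[0]->attributes();
        $url    = $urlAtt['href'];
        $title  = $xml->entry[$i]->title;
        $desc   = strip_tags((string) $xml->entry[$i]->content);

        echo '<a href="'.$url.'">'.$title.'</a>'.$desc.'';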
    }
}

Note how we get to the node attributes: $urlAtt = $xml->entry[$i]->link[0]->attributes(). Now $urlAtt can be read like an associative array, where the attribute name is the key and the attribute value is the value for this key.
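ATOM entries can carry more than one <link> node (for example one with rel="self" and one with rel="alternate"), so taking the first one does not always give the page URL. Below is a small sketch of how you might pick the right one. The function name atomEntryUrl() is my own invention; the rule that a missing rel attribute means "alternate" comes from the ATOM specification.

```php
<?php
// Hypothetical helper: return the "alternate" link of one <entry>,
// falling back to the first link found, or null if there is none.
function atomEntryUrl(SimpleXMLElement $entry)
{
    $fallback = null;
    foreach ($entry->link as $link) {
        // Attributes can be read directly with array syntax on the node.
        $rel = isset($link['rel']) ? (string) $link['rel'] : 'alternate';
        if ($rel === 'alternate') {
            return (string) $link['href'];
        }
        if ($fallback === null) {
            $fallback = (string) $link['href'];
        }
    }
    return $fallback;
}
```

Inside the loop of parseAtom() this could be called as atomEntryUrl($xml->entry[$i]).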

Well, I do not know what more to write here; this is all really simple, which is probably why they called this library SimpleXML. If you wrote your code and want to test it, then for RSS use some WordPress blog feed, and for ATOM use some Blogger blog feed.
