Stopping PHP From Stripping out Hyperlinks From a NITF XML Response While Parsing the XML

This is another of those niche posts that might help one or two people out. If I can save you some time working with the News Industry Text Format (NITF) in PHP, I’ll be glad that you didn’t experience my frustration.

While working with the Associated Press API, I recently ran into a situation where links in the content I was ingesting from the NITF XML they supply were being stripped out in PHP.

The code in question looked like this:

    function download_story_nitf($nitf_href) {
        // Fetch the NITF XML for the story from the AP API
        $nitf_file = file_get_contents($nitf_href . "&include=view_default&apikey=" . AP_API_KEY);
        // Strip newlines, carriage returns and tabs before parsing
        $nitf_file = str_replace(array("\n", "\r", "\t"), '', $nitf_file);

        // Parse the XML, then round-trip through JSON to get a plain object
        $nitf_xml = simplexml_load_string($nitf_file);
        $nitf_json = json_encode($nitf_xml);

        return json_decode($nitf_json);
    }

The Associated Press API returns XML, with the story content contained within an element inside the response. Inspecting the response in Postman and in the browser showed nothing wrong; however, the ingested content was missing its links, which broke the content.

I knew the API response was OK, so I set out to debug and realised that the links were being stripped out when the content went through simplexml_load_string and json_encode.
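If you want to reproduce this without calling the API, here is a minimal sketch. The sample XML is made up for illustration (it is not real AP output, and the `block` and `p` element names are assumptions); it only needs a paragraph that mixes text with an `<a>` element.

```php
<?php
// Hypothetical sample: a paragraph mixing plain text with an <a> child element.
$xml = '<block><p>Read the <a href="https://example.com/story">full story</a> here</p></block>';

$block = simplexml_load_string($xml);

// Navigating the parsed tree directly still finds the link.
$link_href = (string) $block->p->a['href'];

// Round-tripping through JSON turns the paragraph into a plain string
// built from its text nodes, so the <a> element is no longer present.
$decoded = json_decode(json_encode($block));
$paragraph = $decoded->p;

var_dump($link_href, $paragraph);
```

In other words, the link appears to survive parsing and only goes missing once the SimpleXML object is converted to JSON.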

This is a snippet of what the content looked like (shown here with the links rendered):

KEY DEVELOPMENTS IN THE RUSSIA-UKRAINE WAR:

— [Scars of war](https://apnews.com/article/russia-ukraine-kyiv-moscow-d01152d589a482b52f1072ce9886fbe1) seem to be everywhere in Ukraine after 3 months

— [Saving the children](https://apnews.com/article/russia-ukraine-government-and-politics-de1d3ccf3ef990a046cafd7209d4653d): War closes in on eastern Ukrainian town

— Sweden, Finland delegations go to Turkey for [NATO talks](https://apnews.com/article/russia-ukraine-middle-east-turkey-98d9b2bf7de63b3044d118e833626b13)

— US to end [Russia's ability to pay](https://apnews.com/article/russia-ukraine-janet-yellen-government-and-politics-20dbb506790dddc6f019fa7fdf265514) international investors

— UK [approves sale of Chelsea](https://apnews.com/article/russia-ukraine-putin-roman-abramovich-mlb-politics-710a500504e940db9d60ce3e674da346) soccer club by sanctioned Abramovich

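Running the response through that PHP code removed those links. As far as I can tell, the cause is mixed content: each link is a child element sitting inside the paragraph text, and when the SimpleXML object is converted with json_encode, a paragraph that starts with text is reduced to its text nodes and its child elements are dropped. It was a rather frustrating issue, and despite extensive Googling, I found no easy solution. For similar cases, many people suggested wrapping the links in CDATA markers.

In the end, that is what I did. Using a regular expression, I wrapped every link in the response in CDATA markers so that it is treated as plain text rather than as a child element.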

    function download_story_nitf($nitf_href) {
        $nitf_file = file_get_contents($nitf_href . "&include=view_default&apikey=" . AP_API_KEY);
        $nitf_file = str_replace(array("\n", "\r", "\t"), '', $nitf_file);

        // Wrap each <a ...>...</a> link in a CDATA section so it is kept as text
        // ($1 is the attribute string of the link, $2 is the link text)
        $pattern = "/<a (.*?)>(.*?)<\/a>/i";
        $nitf_file = preg_replace($pattern, "<![CDATA[<a $1>$2</a>]]>", $nitf_file);

        $nitf_xml = simplexml_load_string($nitf_file);
        $nitf_json = json_encode($nitf_xml);

        return json_decode($nitf_json);
    }
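To see the effect of the CDATA wrapping in isolation, here is a small sketch that runs the same kind of replacement on a made-up sample instead of a live API response. The helper name `wrap_links_in_cdata` and the sample XML are hypothetical, and the pattern assumes simple, non-nested `<a ...>...</a> ` links.

```php
<?php
// Hypothetical helper: wraps every <a ...>...</a> element in a CDATA section.
function wrap_links_in_cdata(string $xml): string
{
    return preg_replace('/<a (.*?)>(.*?)<\/a>/i', '<![CDATA[<a $1>$2</a>]]>', $xml);
}

// Hypothetical sample, not real AP output.
$xml = '<block><p>Read the <a href="https://example.com/story">full story</a> here</p></block>';

$wrapped = wrap_links_in_cdata($xml);

// Same parse and JSON round trip as in the function above.
$decoded = json_decode(json_encode(simplexml_load_string($wrapped)));
$paragraph = $decoded->p;

var_dump($paragraph);
```

One thing worth checking on your own data: a paragraph that begins directly with a link (with no text before it) may still come out empty, in which case passing `LIBXML_NOCDATA` as the third argument to `simplexml_load_string` is the usual suggestion, since it merges CDATA sections into ordinary text.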

I am not the world’s best coder, but this did the trick. Nothing else I tried worked. While this was for working with NITF XML, I assume this issue might crop up in other scenarios. So, this fix might work in your case too.
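For those other scenarios, one alternative that avoids rewriting the XML is to skip the JSON round trip for the content and serialise each paragraph with `SimpleXMLElement::asXML()`, which keeps inline markup. This is a different technique from the CDATA fix above, sketched on a made-up sample; the `block` and `p` element names are assumptions.

```php
<?php
// Hypothetical sample, not real AP output.
$xml = '<block><p>Read the <a href="https://example.com/story">full story</a> here</p></block>';

$block = simplexml_load_string($xml);

$paragraphs = [];
foreach ($block->p as $p) {
    // asXML() on a child element returns that element's markup,
    // including any nested elements such as <a>.
    $paragraphs[] = $p->asXML();
}

var_dump($paragraphs);
```

The trade-off is that you get strings of markup back rather than a plain decoded object, so whether it fits depends on what the rest of your ingest code expects.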