This is another of those niche posts that might help one or two people out. If I can save you some time working with the News Industry Text Format (NITF) in PHP, I’ll be glad you were spared my frustration.
While working with the Associated Press API, I recently ran into a situation where parts of the content I was ingesting from the NITF they supply were being stripped out in PHP.
The code in question looked like this:
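function download_story_nitf($nitf_href) {
    // Fetch the NITF document for the story from the AP API.
    $nitf_file = file_get_contents($nitf_href . "&include=view_default&apikey=" . AP_API_KEY);
    // Remove newlines and tabs before parsing.
    $nitf_file = str_replace(array("\n", "\r", "\t"), '', $nitf_file);
    // Parse the XML, then round-trip it through JSON to get plain objects.
    $nitf_xml = simplexml_load_string($nitf_file);
    $nitf_json = json_encode($nitf_xml);
    return json_decode($nitf_json);
}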
The Associated Press API returns XML in which the story content sits inside the <block> element. The response looked fine in Postman and in the browser; however, the ingested content was missing its links, which broke the content.
I knew the API response was OK, so I set out to debug and realised that the links inside the content were being lost in the simplexml_load_string and json_encode conversion.
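If you want to see where the links go missing in your own setup, a small check along these lines can help. It is only a sketch: the sample XML and the example.com URL are made up for illustration and are not AP data. It compares what the parsed tree holds (via asXML()) with what comes out of json_encode.

```php
<?php
// Hypothetical sample: one paragraph with text around a link (mixed content).
$sample = '<block><p>Read <a href="https://example.com/story">this story</a> today</p></block>';

$xml = simplexml_load_string($sample);

// Stage 1: what the parsed tree holds. asXML() serialises the node with its children.
$parsed = $xml->p->asXML();

// Stage 2: what comes out of the JSON conversion used in the function above.
$json = json_encode($xml);

echo "Parsed: ", $parsed, PHP_EOL;
echo "JSON:   ", $json, PHP_EOL;

// Check each stage for the link's href attribute.
$inParsed = strpos($parsed, 'href="https://example.com/story"') !== false;
$inJson   = strpos($json, 'href') !== false;
echo "href in parsed XML: ", var_export($inParsed, true), PHP_EOL;
echo "href in JSON:       ", var_export($inJson, true), PHP_EOL;
```

In a reduced case like this, the link is still there in the parsed tree and is gone by the time the tree has been through json_encode, which suggests the mixed text-and-element content is what the conversion struggles with.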
This is a snippet of what the XML looked like:
<block>
    <p>KEY DEVELOPMENTS IN THE RUSSIA-UKRAINE WAR:</p>
    <p>— <a href="https://apnews.com/article/russia-ukraine-kyiv-moscow-d01152d589a482b52f1072ce9886fbe1">Scars of war</a> seem to be everywhere in Ukraine after 3 months</p>
    <p>— <a href="https://apnews.com/article/russia-ukraine-government-and-politics-de1d3ccf3ef990a046cafd7209d4653d">Saving the children</a>: War closes in on eastern Ukrainian town</p>
    <p>— Sweden, Finland delegations go to Turkey for <a href="https://apnews.com/article/russia-ukraine-middle-east-turkey-98d9b2bf7de63b3044d118e833626b13">NATO talks</a></p>
    <p>— US to end <a href="https://apnews.com/article/russia-ukraine-janet-yellen-government-and-politics-20dbb506790dddc6f019fa7fdf265514">Russia's ability to pay</a> international investors</p>
    <p>— UK <a href="https://apnews.com/article/russia-ukraine-putin-roman-abramovich-mlb-politics-710a500504e940db9d60ce3e674da346">approves sale of Chelsea</a> soccer club by sanctioned Abramovich</p>
</block>
The XML conversion in PHP would remove those links. Either the links were being treated as invalid, or the conversion wasn’t accounting for child nodes mixed in with the text. It was a rather frustrating issue, and despite extensive Googling, I found no easy solution. For other use cases, many people suggested wrapping the links in CDATA markers.
In the end, that is what I did. Using a regular expression, I wrapped every link in the response in CDATA markers before parsing it.
function download_story_nitf($nitf_href) {
    $nitf_file = file_get_contents($nitf_href . "&include=view_default&apikey=" . AP_API_KEY);
    $nitf_file = str_replace(array("\n", "\r", "\t"), '', $nitf_file);
    // Wrap every <a ...>...</a> in CDATA so the parser keeps it as plain text.
    $pattern = "/<a (.*?)>(.*?)<\/a>/i";
    $nitf_file = preg_replace($pattern, "<![CDATA[<a $1>$2</a>]]>", $nitf_file);
    $nitf_xml = simplexml_load_string($nitf_file);
    $nitf_json = json_encode($nitf_xml);
    return json_decode($nitf_json);
}
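To try the wrapping step without calling the API, here is a rough, self-contained sketch. The helper name wrap_links_in_cdata and the sample XML are hypothetical and only for illustration. The pattern is the same one used in the function above.

```php
<?php
// Hypothetical helper: the same regular expression as above, minus the network call.
function wrap_links_in_cdata(string $xml): string
{
    $pattern = '/<a (.*?)>(.*?)<\/a>/i';
    // preg_replace() returns null on error; fall back to the untouched input.
    return preg_replace($pattern, '<![CDATA[<a $1>$2</a>]]>', $xml) ?? $xml;
}

// Made-up sample, not AP data.
$sample = '<block><p>Read <a href="https://example.com/story">this story</a> today</p></block>';

$wrapped = wrap_links_in_cdata($sample);

// Same parse and JSON round trip as in download_story_nitf().
$decoded = json_decode(json_encode(simplexml_load_string($wrapped)));

echo $wrapped, PHP_EOL;   // the XML with the link wrapped in CDATA
echo $decoded->p, PHP_EOL; // the paragraph as it arrives after decoding
```

With the wrapping in place, the link arrives as literal markup inside the paragraph string, so whatever consumes the result needs to treat that string as HTML. The pattern also assumes links are not nested and that the link text never contains `]]>`, which would end the CDATA section early.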
I am not the world’s best coder, but this did the trick. Nothing else I tried worked. While this was for working with NITF XML, I assume this issue might crop up in other scenarios. So, this fix might work in your case too.