http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html
/**
* This method ensures that the output String has only
* valid XML unicode characters as specified by the
* XML 1.0 standard. For reference, please see the Char
* production in that standard. This method will return an empty
* String if the input is null or empty.
*
* @param in The String whose non-valid characters we want to remove.
* @return The in String, stripped of non-valid characters.
*/
public String stripNonValidXMLCharacters(String in) {
    if (in == null || ("".equals(in))) return ""; // vacancy test.
    StringBuilder out = new StringBuilder(); // Used to hold the output.
    int current; // Used to reference the current code point.
    for (int i = 0; i < in.length(); ) {
        // Read whole code points: a 16-bit char can never reach 0x10000,
        // so surrogate pairs have to be examined as a unit.
        current = in.codePointAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        i += Character.charCount(current);
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
    }
    return out.toString();
}
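The same range check can also be written as a small reusable predicate over code points. The following is only a sketch, assuming Java 8 or later for `String.codePoints()`; the class name `XmlCharFilter` and the method names `isValidXmlChar` and `strip` are illustrative and not part of the original post.

```java
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that is not a valid XML 1.0 character. */
    static String strip(String in) {
        if (in == null || in.isEmpty()) return "";
        return in.codePoints()                      // whole code points, not UTF-16 units
                 .filter(XmlCharFilter::isValidXmlChar)
                 .collect(StringBuilder::new,
                          StringBuilder::appendCodePoint,
                          StringBuilder::append)
                 .toString();
    }
}
```

Keeping the predicate separate makes it possible to reuse it for validation (reject the input) as well as for stripping.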
/**
* Strip bad xml chars
* @author james ransom 2008
* yuku.com
*/
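function strip_invalid_xml_chars2( $in )
{
    $out = "";
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++ )
    {
        // ord() returns a single byte value (0-255), so this check is byte-wise
        $current = ord($in[$i]);
        if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $out .= chr($current);
        }
        else
        {
            $out .= " ";
        }
    }
    return $out;
}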
I assume that comment was a PHP version, so you want something like this:
<?php
/**
* Strip bad xml chars
* @author phpzone
* @param string $in String passed by reference
* phpzone.co.uk
*/
function strip_invalid_xml_chars2( &$in )
{
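$out = "";
$length = strlen($in);
for ( $i = 0; $i < $length; $i++ )
{
$current = ord($in[$i]);
switch (true)
{
case ($current == 0x9 || $current == 0xA || $current == 0xD || $current >= 0x20 && $current <= 0xD7FF || $current >= 0xE000 && $current <= 0xFFFD || $current >= 0x10000 && $current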
I’ll try one more time; my post got cut short last time. If it doesn’t work this time… oh well.
function strip_invalid_xml_chars2( &$in )
{
    $out = "";
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++ )
    {
        $current = ord($in[$i]);
        switch (true)
        {
            case ($current == 0x9 || $current == 0xA || $current == 0xD ||
                  $current >= 0x20 && $current <= 0xD7FF ||
                  $current >= 0xE000 && $current <= 0xFFFD ||
                  $current >= 0x10000 && $current <= 0x10FFFF):
                // valid so leave "as is"
                break;
            default:
                // invalid, so we set this char to a space
                $in[$i] = " ";
                break;
        }
    }
    // NOTE: doesn't return a value as we worked purely on
    // the string passed by reference, so don't try catching a return value
}
There is something broken about this comment posting script; that code is _not_ what I posted.
The original is Java code. Thank you for the code you provided!
String str = "contains invalid chars";
String stripped = str.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\ud7ff\\ue000-\\ufffd\\x{10000}-\\x{10ffff}]", "");
Another option in php is to use strtr to replace bad characters with equivalents.
e.g. strtr($xml,array('& '=>'& '));
I am sure other languages have tr function equivalents.
Another victim of the comments system. Let's try this.
e.g. strtr($xml,array('& '=>'& '));
Oh well. The second string should be "&amp; " (that is, "& amp;" without the space).
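The strtr suggestion escapes markup characters instead of removing invalid ones. A rough Java counterpart might look like the sketch below; `XmlEscaper` and `escape` are illustrative names, not taken from any library. Escaping does not replace the stripping step: characters such as U+0000 are not allowed in XML 1.0 even as character references, so both steps are usually needed.

```java
final class XmlEscaper {
    private XmlEscaper() {}

    /** Replaces the five XML markup characters with their predefined entities. */
    static String escape(String in) {
        if (in == null) return "";
        StringBuilder out = new StringBuilder(in.length() + 16);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);        // everything else is copied unchanged
            }
        }
        return out.toString();
    }
}
```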
Thanks! Same thing, using replaceAll instead of char walking:
String invalidXmlPattern = "[^"
    + "\\u0009\\u000A\\u000D"
    + "\\u0020-\\uD7FF"
    + "\\uE000-\\uFFFD"
    + "\\x{10000}-\\x{10FFFF}"
    + "]+";
xml = xml.replaceAll(invalidXmlPattern, " ");
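If the replacement runs often, the pattern can be compiled once instead of on every `replaceAll` call. This is a minimal sketch assuming Java 7 or later, where the regex escape `\x{...}` accepts code points above U+FFFF (a `\u` escape takes exactly four hex digits, so it cannot express that range). The class name `XmlRegexStripper` is made up for illustration.

```java
import java.util.regex.Pattern;

final class XmlRegexStripper {
    // Compiled once and reused; String.replaceAll compiles its regex on each call.
    private static final Pattern INVALID_XML = Pattern.compile(
        "[^"
        + "\\u0009\\u000A\\u000D"
        + "\\u0020-\\uD7FF"
        + "\\uE000-\\uFFFD"
        + "\\x{10000}-\\x{10FFFF}"
        + "]+");

    private XmlRegexStripper() {}

    /** Replaces each run of invalid XML 1.0 characters with a single space. */
    static String strip(String xml) {
        if (xml == null) return "";
        return INVALID_XML.matcher(xml).replaceAll(" ");
    }
}
```

Because of the `+` quantifier, a run of several invalid characters collapses into one space rather than one space per character.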
I needed php code to validate xml code. Thanks for the post!