Ben J. Christensen

How to strip invalid XML characters

http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html   

    /**
     * This method ensures that the output String has only
     * valid XML unicode characters as specified by the
     * XML 1.0 standard (see the "Char" production in the
     * W3C XML 1.0 specification). This method will return
     * an empty String if the input is null or empty.
     *
     * @param in The String whose non-valid characters we want to remove.
     * @return The in String, stripped of non-valid characters.
     */
    public String stripNonValidXMLCharacters(String in) {
        if (in == null || in.isEmpty()) return ""; // vacancy test.

        StringBuilder out = new StringBuilder(in.length()); // Used to hold the output.
        // Walk by code point rather than by char: characters above U+FFFF are
        // stored as surrogate pairs, so a char-based loop could never match the
        // 0x10000-0x10FFFF range and would drop both halves of each pair.
        for (int i = 0; i < in.length(); ) {
            int current = in.codePointAt(i); // The current code point.
            i += Character.charCount(current); // Advance by one or two chars.
            if ((current == 0x9) ||
                (current == 0xA) ||
                (current == 0xD) ||
                ((current >= 0x20) && (current <= 0xD7FF)) ||
                ((current >= 0xE000) && (current <= 0xFFFD)) ||
                ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.appendCodePoint(current);
        }
        return out.toString();
    }
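
The same filter can also be written as a predicate over code points, which keeps the XML 1.0 ranges in one place and handles characters above U+FFFF (stored in Java as surrogate pairs). This is only a minimal sketch, assuming Java 8 or later; the class name XmlCharFilter and the method names isValidXmlChar and strip are illustrative and not part of the original code.

```java
public class XmlCharFilter {

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with invalid code points removed; null becomes "". */
    public static String strip(String in) {
        if (in == null || in.isEmpty()) return "";
        StringBuilder out = new StringBuilder(in.length());
        // codePoints() joins valid surrogate pairs into one code point and
        // reports unpaired surrogates as themselves, which the filter rejects.
        in.codePoints()
          .filter(XmlCharFilter::isValidXmlChar)
          .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL and backspace are invalid; the emoji (a surrogate pair) is valid.
        String dirty = "a\u0000b\u0008c\uD83D\uDE00d";
        System.out.println(strip(dirty).equals("abc\uD83D\uDE00d"));
    }
}
```

Keeping the range check in its own method makes it easy to reuse, for example to replace invalid characters with a space instead of dropping them.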

Filed under: Code

11 Responses

  1. ransom says:

    /**
    * Strip bad xml chars
    * @author james ransom 2008
    * yuku.com
    */
    function strip_invalid_xml_chars2( $in )
    {

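    $out = "";

    $length = strlen($in);

    for ( $i = 0; $i < $length; $i++ )
    {

    // ord() yields a single byte (0-255), so the string is checked byte by byte
    $current = ord($in[$i]);

    if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD) ||
    (($current >= 0x20) && ($current <= 0xD7FF)) ||
    (($current >= 0xE000) && ($current <= 0xFFFD)) ||
    (($current >= 0x10000) && ($current <= 0x10FFFF)))
    {
    $out .= chr($current);
    }
    else
    {
    $out .= " ";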
    }

    }

    return $out;

    }

  2. anthony says:

    I assume that comment was a PHP version, so you want something like this:
    <?php
    /**
    * Strip bad xml chars
    * @author phpzone
    * @param string $in String passed by reference
    * phpzone.co.uk
    */
    function strip_invalid_xml_chars2( &$in )
    {
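    $out = "";
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++ )
    {
    $current = ord($in[$i]);
    switch ( true )
    {
    case $current == 0x9 || $current == 0xA || $current == 0xD ||
    ($current >= 0x20 && $current <= 0xD7FF) ||
    ($current >= 0xE000 && $current <= 0xFFFD) ||
    ($current >= 0x10000 && $current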

  3. anthony says:

    I’ll try one more time, cut my post short last time, if it doesn’t work this time… oh well.

    function strip_invalid_xml_chars2( &$in )
    {
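    $out = "";
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++ )
    {
    $current = ord($in[$i]);
    switch ( true )
    {
    case $current == 0x9 || $current == 0xA || $current == 0xD ||
    ($current >= 0x20 && $current <= 0xD7FF) ||
    ($current >= 0xE000 && $current <= 0xFFFD) ||
    ($current >= 0x10000 && $current <= 0x10FFFF):
    // valid so leave "as is"
    break;
    default:
    // invalid, so we set this char to a space
    $in[$i] = " ";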
    break;
    }
    }
    // NOTE: doesn’t return a value as we worked purely on
    // the string passed by reference, so don’t try catching a return value
    }

  4. anthony says:

    there is something broken about this comment posting script, that code, is _not_ what I posted ;)

  5. Ben Christensen says:

    The original is Java code. Thank you for the code you provided!

  6. String str = "contains invalid chars";
    String stripped = str.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\ud7ff\\ue000-\\ufffd]", "");

  7. Tyler Sullivan says:

    Another option in PHP is to use strtr to replace bad characters with equivalents.

    e.g. strtr($xml, array('& ' => '&amp; '));

    I am sure other languages have tr function equivalents.

  8. Diego says:

    Thanks! Same thing, using replaceAll instead of char walking:
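    String invalidXmlPattern = "[^"
    + "\\u0009\\u000A\\u000D"
    + "\\u0020-\\uD7FF"
    + "\\uE000-\\uFFFD"
    // A u-escape takes exactly four hex digits, so the supplementary
    // range is written with x{...} (supported since Java 7).
    + "\\x{10000}-\\x{10FFFF}"
    + "]+";
    xml = xml.replaceAll(invalidXmlPattern, " ");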

  9. psd2magento says:

    I needed php code to validate xml code. Thanks for the post!
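
The replaceAll approach suggested in the comments compiles its regular expression on every call. If the filter runs often, the pattern can be compiled once and reused. The following is a rough sketch, assuming Java 7 or later for the x{...} code point syntax; the class name XmlRegexFilter and its strip method are made up for illustration.

```java
import java.util.regex.Pattern;

public class XmlRegexFilter {

    // Matches any single code point outside the XML 1.0 Char ranges.
    // The range above 0xFFFF uses x{...} because a four-digit escape
    // cannot express it.
    private static final Pattern INVALID_XML = Pattern.compile(
            "[^\\u0009\\u000A\\u000D"
            + "\\u0020-\\uD7FF"
            + "\\uE000-\\uFFFD"
            + "\\x{10000}-\\x{10FFFF}]");

    /** Returns the input with invalid characters removed; null becomes "". */
    public static String strip(String in) {
        if (in == null || in.isEmpty()) return "";
        return INVALID_XML.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        // The form feed is removed; the emoji (a surrogate pair) is kept.
        String dirty = "page\u000Cbreak \uD83D\uDE00";
        System.out.println(strip(dirty).equals("pagebreak \uD83D\uDE00"));
    }
}
```

Passing " " instead of "" to replaceAll gives the behaviour of the PHP versions above, which substitute a space for each invalid character.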
