Class Parser


  • public class Parser
    extends java.lang.Object
    Parses HTML into a Document. Generally best to use one of the more convenient parse methods in Jsoup.
    • Constructor Summary

      Constructors 
      Constructor Description
      Parser​(org.jsoup.parser.TreeBuilder treeBuilder)
      Create a new Parser, using the specified TreeBuilder.
    • Method Summary

      Modifier and Type Method Description
      ParseErrorList getErrors()
      Retrieve the parse errors, if any, from the last parse.
      org.jsoup.parser.TreeBuilder getTreeBuilder()
      Get the TreeBuilder currently in use.
      static Parser htmlParser()
      Create a new HTML parser.
      boolean isContentForTagData​(java.lang.String normalName)
      An internal method, visible for Element.
      boolean isTrackErrors()
      Check if parse error tracking is enabled.
      boolean isTrackPosition()
      Test if position tracking is enabled.
      Parser newInstance()
      Creates a new Parser as a deep copy of this one, including initializing a new TreeBuilder.
      static Document parse​(java.lang.String html, java.lang.String baseUri)
      Parse HTML into a Document.
      static Document parseBodyFragment​(java.lang.String bodyHtml, java.lang.String baseUri)
      Parse a fragment of HTML into the body of a Document.
      static java.util.List<Node> parseFragment​(java.lang.String fragmentHtml, Element context, java.lang.String baseUri)
      Parse a fragment of HTML into a list of nodes.
      static java.util.List<Node> parseFragment​(java.lang.String fragmentHtml, Element context, java.lang.String baseUri, ParseErrorList errorList)
      Parse a fragment of HTML into a list of nodes.
      java.util.List<Node> parseFragmentInput​(java.lang.String fragment, Element context, java.lang.String baseUri)  
      Document parseInput​(java.io.Reader inputHtml, java.lang.String baseUri)  
      Document parseInput​(java.lang.String html, java.lang.String baseUri)  
      static java.util.List<Node> parseXmlFragment​(java.lang.String fragmentXml, java.lang.String baseUri)
      Parse a fragment of XML into a list of nodes.
      ParseSettings settings()
      Gets the current ParseSettings for this Parser.
      Parser settings​(ParseSettings settings)
      Update the ParseSettings of this Parser, to control the case sensitivity of tags and attributes.
      Parser setTrackErrors​(int maxErrors)
      Enable or disable parse error tracking for the next parse.
      Parser setTrackPosition​(boolean trackPosition)
      Enable or disable source position tracking.
      Parser setTreeBuilder​(org.jsoup.parser.TreeBuilder treeBuilder)
      Update the TreeBuilder used when parsing content.
      static java.lang.String unescapeEntities​(java.lang.String string, boolean inAttribute)
      Utility method to unescape HTML entities from a string.
      static Parser xmlParser()
      Create a new XML parser.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • Parser

        public Parser​(org.jsoup.parser.TreeBuilder treeBuilder)
        Create a new Parser, using the specified TreeBuilder.
        Parameters:
        treeBuilder - TreeBuilder to use to parse input into Documents.
    • Method Detail

      • newInstance

        public Parser newInstance()
        Creates a new Parser as a deep copy of this one, including initializing a new TreeBuilder. Because the copy shares no parse state with the original, each thread can use its own copy independently.
        Returns:
        a copied parser
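
        A minimal sketch of the multi-threaded use described above, assuming one configured template Parser that is never used for parsing itself. The class and method names (NewInstanceSketch, titles) are illustrative and not part of jsoup.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.util.List;
import java.util.stream.Collectors;

final class NewInstanceSketch {
    // Configured once and only ever copied; it is not used to parse directly.
    private static final Parser TEMPLATE = Parser.htmlParser().setTrackErrors(10);

    /** Parses the inputs in parallel, giving every task its own Parser copy. */
    static List<String> titles(List<String> htmlInputs) {
        return htmlInputs.parallelStream()
                .map(html -> TEMPLATE.newInstance().parseInput(html, ""))
                .map(Document::title)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(titles(List.of(
                "<title>One</title><p>a</p>",
                "<title>Two</title><p>b</p>")));
    }
}
```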
      • parseInput

        public Document parseInput​(java.lang.String html,
                                   java.lang.String baseUri)
      • parseInput

        public Document parseInput​(java.io.Reader inputHtml,
                                   java.lang.String baseUri)
      • parseFragmentInput

        public java.util.List<Node> parseFragmentInput​(java.lang.String fragment,
                                                       Element context,
                                                       java.lang.String baseUri)
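
        The three instance methods above carry no description in this page. The sketch below assumes they parse with this Parser's own TreeBuilder and settings, mirroring the static parse and parseFragment methods. ParseInputSketch and its method names are hypothetical.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.io.Reader;
import java.io.StringReader;
import java.util.List;

final class ParseInputSketch {
    /** Parses a whole document from a Reader with an HTML parser instance. */
    static Document fromReader(Reader reader) {
        return Parser.htmlParser().parseInput(reader, "https://example.com/");
    }

    /** Parses a fragment as if it were the inner HTML of a div element. */
    static List<Node> fragment(String html) {
        Element context = new Element("div");
        return Parser.htmlParser().parseFragmentInput(html, context, "");
    }

    public static void main(String[] args) {
        Document doc = fromReader(new StringReader("<p>From a reader</p>"));
        System.out.println(doc.body().text());
        System.out.println(fragment("<b>bold</b> text"));
    }
}
```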
      • getTreeBuilder

        public org.jsoup.parser.TreeBuilder getTreeBuilder()
        Get the TreeBuilder currently in use.
        Returns:
        current TreeBuilder.
      • setTreeBuilder

        public Parser setTreeBuilder​(org.jsoup.parser.TreeBuilder treeBuilder)
        Update the TreeBuilder used when parsing content.
        Parameters:
        treeBuilder - new TreeBuilder
        Returns:
        this, for chaining
      • isTrackErrors

        public boolean isTrackErrors()
        Check if parse error tracking is enabled.
        Returns:
        current track error state.
      • setTrackErrors

        public Parser setTrackErrors​(int maxErrors)
        Enable or disable parse error tracking for the next parse.
        Parameters:
        maxErrors - the maximum number of errors to track. Set to 0 to disable.
        Returns:
        this, for chaining
      • getErrors

        public ParseErrorList getErrors()
        Retrieve the parse errors, if any, from the last parse.
        Returns:
        list of parse errors, up to the size of the maximum errors tracked.
        See Also:
        setTrackErrors(int)
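
        A short sketch of the tracking cycle: enable tracking with a limit, parse, then read the errors back from the same Parser. The malformed input and the names ErrorTrackingSketch and errorsFor are illustrative assumptions.

```java
import org.jsoup.parser.ParseError;
import org.jsoup.parser.ParseErrorList;
import org.jsoup.parser.Parser;

final class ErrorTrackingSketch {
    /** Parses html and returns the errors recorded, capped at maxErrors. */
    static ParseErrorList errorsFor(String html, int maxErrors) {
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        parser.parseInput(html, "");
        return parser.getErrors();
    }

    public static void main(String[] args) {
        String malformed = "<p>One</p href='no'><!DOCTYPE html>&arrgh;<font /><br /><foo";
        for (ParseError error : errorsFor(malformed, 3)) {
            System.out.println(error);
        }
        // A limit of 0 disables tracking, so nothing is recorded.
        System.out.println(errorsFor(malformed, 0).size());
    }
}
```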
      • isTrackPosition

        public boolean isTrackPosition()
        Test if position tracking is enabled. If it is, Nodes will have a Position that records where in the original input source they were created. By default, tracking is not enabled.
        Returns:
        current track position setting
      • setTrackPosition

        public Parser setTrackPosition​(boolean trackPosition)
        Enable or disable source position tracking. If enabled, Nodes will have a Position that records where in the original input source they were created.
        Parameters:
        trackPosition - position tracking setting; true to enable
        Returns:
        this Parser, for chaining
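
        A sketch of reading a tracked position back, assuming a jsoup version that exposes Node.sourceRange() and org.jsoup.nodes.Range (neither is listed on this page). PositionTrackingSketch and rangeOfFirstParagraph are hypothetical names.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Range;
import org.jsoup.parser.Parser;

final class PositionTrackingSketch {
    /** Returns the source range of the first p element in html. */
    static Range rangeOfFirstParagraph(String html, boolean track) {
        Parser parser = Parser.htmlParser().setTrackPosition(track);
        Document doc = parser.parseInput(html, "");
        Element p = doc.selectFirst("p");
        return p.sourceRange();
    }

    public static void main(String[] args) {
        Range range = rangeOfFirstParagraph("<div>\n<p>Hi</p></div>", true);
        if (range.isTracked()) {
            System.out.println("line " + range.start().lineNumber()
                    + ", column " + range.start().columnNumber());
        }
    }
}
```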
      • settings

        public Parser settings​(ParseSettings settings)
        Update the ParseSettings of this Parser, to control the case sensitivity of tags and attributes.
        Parameters:
        settings - the new settings
        Returns:
        this Parser
      • settings

        public ParseSettings settings()
        Gets the current ParseSettings for this Parser.
        Returns:
        current ParseSettings
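
        To illustrate the case-sensitivity control, this sketch parses the same input with two settings. It assumes the ParseSettings.preserveCase and ParseSettings.htmlDefault constants, which are not listed on this page. SettingsSketch and firstTagName are illustrative names.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

final class SettingsSketch {
    /** Parses a fixed upper-case snippet and reports the first body tag name. */
    static String firstTagName(ParseSettings settings) {
        Parser parser = Parser.htmlParser().settings(settings);
        Document doc = parser.parseInput("<SPAN ID=a>text</SPAN>", "");
        return doc.body().child(0).tagName();
    }

    public static void main(String[] args) {
        System.out.println(firstTagName(ParseSettings.htmlDefault));
        System.out.println(firstTagName(ParseSettings.preserveCase));
    }
}
```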
      • isContentForTagData

        public boolean isContentForTagData​(java.lang.String normalName)
        An internal method, visible for Element. For HTML parsing, signals that script and style text should be treated as Data Nodes.
      • parse

        public static Document parse​(java.lang.String html,
                                     java.lang.String baseUri)
        Parse HTML into a Document.
        Parameters:
        html - HTML to parse
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        parsed Document
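
        A small sketch of what baseUri is for: relative links in the parsed Document can be resolved against it. The URL and the names BaseUriSketch and resolve are made up for illustration.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

final class BaseUriSketch {
    /** Parses html and returns the absolute URL of its first link. */
    static String resolve(String html, String baseUri) {
        Document doc = Parser.parse(html, baseUri);
        return doc.selectFirst("a").absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(resolve("<a href='/docs'>Docs</a>", "https://example.com/guide/"));
    }
}
```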
      • parseFragment

        public static java.util.List<Node> parseFragment​(java.lang.String fragmentHtml,
                                                         Element context,
                                                         java.lang.String baseUri)
        Parse a fragment of HTML into a list of nodes. If supplied, the context element provides the parsing context.
        Parameters:
        fragmentHtml - the fragment of HTML to parse
        context - (optional) the element that this HTML fragment is being parsed for (e.g. for inner HTML). This provides stack context (for implicit element creation).
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        list of nodes parsed from the input HTML. Note that the context element, if supplied, is not modified.
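
        A sketch of why the context element matters: table cells are only meaningful inside a row, so the fragment is parsed here with a tr element as context. FragmentContextSketch and cells are hypothetical names.

```java
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.util.List;

final class FragmentContextSketch {
    /** Parses fragmentHtml as if it were the inner HTML of a table row. */
    static List<Node> cells(String fragmentHtml) {
        Element row = new Element("tr"); // context only; it is not modified
        return Parser.parseFragment(fragmentHtml, row, "");
    }

    public static void main(String[] args) {
        for (Node node : cells("<td>One</td><td>Two</td>")) {
            System.out.println(node.nodeName() + ": " + node.outerHtml());
        }
    }
}
```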
      • parseFragment

        public static java.util.List<Node> parseFragment​(java.lang.String fragmentHtml,
                                                         Element context,
                                                         java.lang.String baseUri,
                                                         ParseErrorList errorList)
        Parse a fragment of HTML into a list of nodes. If supplied, the context element provides the parsing context.
        Parameters:
        fragmentHtml - the fragment of HTML to parse
        context - (optional) the element that this HTML fragment is being parsed for (e.g. for inner HTML). This provides stack context (for implicit element creation).
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        errorList - list to add errors to
        Returns:
        list of nodes parsed from the input HTML. Note that the context element, if supplied, is not modified.
      • parseXmlFragment

        public static java.util.List<Node> parseXmlFragment​(java.lang.String fragmentXml,
                                                            java.lang.String baseUri)
        Parse a fragment of XML into a list of nodes.
        Parameters:
        fragmentXml - the fragment of XML to parse
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        list of nodes parsed from the input XML.
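
        A minimal sketch of XML fragment parsing. The item elements are invented sample data, and XmlFragmentSketch is an illustrative name.

```java
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.util.List;

final class XmlFragmentSketch {
    /** Parses an XML fragment with no base URI. */
    static List<Node> parse(String fragmentXml) {
        return Parser.parseXmlFragment(fragmentXml, "");
    }

    public static void main(String[] args) {
        for (Node node : parse("<item id=\"1\">A</item><item id=\"2\">B</item>")) {
            System.out.println(node.nodeName() + " id=" + node.attr("id"));
        }
    }
}
```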
      • parseBodyFragment

        public static Document parseBodyFragment​(java.lang.String bodyHtml,
                                                 java.lang.String baseUri)
        Parse a fragment of HTML into the body of a Document.
        Parameters:
        bodyHtml - fragment of HTML
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        Document, with empty head, and HTML parsed into body
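
        A brief sketch showing the shape of the returned Document: the fragment lands in the body and the head stays empty. BodyFragmentSketch and parse are illustrative names.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

final class BodyFragmentSketch {
    /** Wraps a body fragment in a full Document shell. */
    static Document parse(String bodyHtml) {
        return Parser.parseBodyFragment(bodyHtml, "");
    }

    public static void main(String[] args) {
        Document doc = parse("<p>Hello</p>");
        System.out.println("head children: " + doc.head().childNodeSize());
        System.out.println("body text: " + doc.body().text());
    }
}
```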
      • unescapeEntities

        public static java.lang.String unescapeEntities​(java.lang.String string,
                                                        boolean inAttribute)
        Utility method to unescape HTML entities from a string.
        Parameters:
        string - HTML escaped string
        inAttribute - if the string is to be unescaped in strict mode (as attributes are)
        Returns:
        an unescaped string
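
        A sketch contrasting the two modes on one input. In strict (attribute) mode, an entity that lacks its closing semicolon is expected to be left as written, while the lenient mode decodes it. UnescapeSketch and its method names are hypothetical.

```java
import org.jsoup.parser.Parser;

final class UnescapeSketch {
    /** Unescapes as for text content. */
    static String lenient(String escaped) {
        return Parser.unescapeEntities(escaped, false);
    }

    /** Unescapes as for attribute values (strict mode). */
    static String strict(String escaped) {
        return Parser.unescapeEntities(escaped, true);
    }

    public static void main(String[] args) {
        String input = "Hello &amp= &amp;";
        System.out.println(lenient(input));
        System.out.println(strict(input));
    }
}
```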
      • htmlParser

        public static Parser htmlParser()
        Create a new HTML parser. This parser treats input as HTML5 and enforces the creation of a normalised document, based on knowledge of the semantics of the incoming tags.
        Returns:
        a new HTML parser.
      • xmlParser

        public static Parser xmlParser()
        Create a new XML parser. This parser assumes no knowledge of the incoming tags and does not treat the input as HTML. Instead, it creates a simple tree directly from the input.
        Returns:
        a new simple XML parser.
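
        To make the difference between the two factory methods concrete, this sketch parses the same input with each and checks whether a body element was created. HtmlVsXmlSketch and hasBody are illustrative names.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

final class HtmlVsXmlSketch {
    /** Reports whether parsing input with the given parser produced a body element. */
    static boolean hasBody(Parser parser, String input) {
        Document doc = parser.parseInput(input, "");
        return !doc.select("body").isEmpty();
    }

    public static void main(String[] args) {
        String input = "<item>Hi</item>";
        // The HTML parser normalises the input into an html/head/body document.
        System.out.println("html parser: " + hasBody(Parser.htmlParser(), input));
        // The XML parser builds the tree directly from the input.
        System.out.println("xml parser: " + hasBody(Parser.xmlParser(), input));
    }
}
```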