diff --git a/README.md b/README.md
index 5005b45..285881e 100644
--- a/README.md
+++ b/README.md
@@ -5,58 +5,33 @@ See LICENSE.txt for licensing information.
 ***
 txtmark is yet another markdown processor for the JVM.
-... and is *damn* fast^^
-Again this is a WIP release.
+* It is easy to use:
-TODO:
+        String result = txtmark.Processor.process("This is ***TXTMARK***");
+
+* It is fast (see below)
+  ... well, it is the fastest markdown processor on the JVM right now.
-- block-level HTML element processing
-- code clean-ups
-- see below (markdown test suite)
+This is a RC version, tagged v0.5
-### MarkdownTest results so far
+For an in-depth explanation of the markdown syntax have a look at [daringfireball.net](http://daringfireball.net/projects/markdown/syntax).
+
+
+### Markdown conformity
 ***
-Based on [MarkdownTest\_1.0\_2007-05-09](http://daringfireball.net/projects/downloads/MarkdownTest_1.0_2007-05-09.tgz)
+Txtmark passes all tests inside [MarkdownTest\_1.0\_2007-05-09](http://daringfireball.net/projects/downloads/MarkdownTest_1.0_2007-05-09.tgz)
+except of two:
-* Amps and angle encoding ... OK
-* Auto links ... OK
-* Backslash escapes ... OK
-* Blockquotes with code blocks ... OK
-* Code Blocks ... OK
-* Code Spans ... OK
-* Hard-wrapped paragraphs with list-like lines ... OK
-* Horizontal rules ... OK
-* Images ... FAILED (see [Note 1](#note0))
-* Inline HTML (Advanced) ... FAILED (see [Note 2](#note1))
-* Inline HTML (Simple) ... FAILED (see [Note 2](#note1))
-* Inline HTML comments ... FAILED (see [Note 2](#note1))
-* Links, inline style ... OK
-* Links, reference style ... OK
-* Links, shortcut references ... OK
-* Literal quotes in titles ... FAILED (see [Note 3](#note2))
-* Markdown Documentation - Basics ... OK
-* Markdown Documentation - Syntax ... FAILED (see [Note 2](#note1))
-* Nested blockquotes ... OK
-* Ordered and unordered lists ... OK
-* Strong and em together ... OK
-* Tabs ... OK
-* Tidyness ... OK
+1. **Images.text**
-17 passed; 6 failed.
-
-***
-
-1. Note:
     Fails because Txtmark doesn't produce empty 'title' image attributes. (IMHO: Images ... OK)
-2. Note:
-    Fails because of currently missing block-level HTML identification.
+2. **Literal quotes in titles.text**
-3. Note:
     What the frell ... this test will continue to FAIL. Sorry, but using unescaped `"` in a title which should be surrounded by `"` is unacceptable for me ;)
@@ -74,142 +49,55 @@ Based on [MarkdownTest\_1.0\_2007-05-09](http://daringfireball.net/projects/down
     and Txtmark will produce the correct result. (IMHO: Literal quotes in titles ... OK)
+
 ### Performance comparison of markdown processors for the JVM
-***
+---
-Based on [this](http://henkelmann.eu/2011/01/10/performance_comparison_of_markdown_processor_for_the_jvm).
-Txtmark's results should not be considered final, they may change in either direction
-during the upcoming releases.
-But I think you get the point.
+Based on [this benchmark suite](http://henkelmann.eu/2011/01/10/performance_comparison_of_markdown_processor_for_the_jvm).
Test | Actuarius | PegDown | Knockoff | Txtmark
| 1st Run (ms) | 2nd Run (ms) | 1st Run (ms) | 2nd Run (ms) | 1st Run (ms) | 2nd Run (ms) | 1st Run (ms) | 2nd Run (ms)
Plain Paragraphs969300146895656436211445
Every Word Emphasized14098841435141713161129215244
Every Word Strong108797811251100971795864046
Every Word Inline Code35127810471037949992454535
Every Word a Fast Link21231580523512408634707850
Every Word Consisting of Special XML Chars398139733341305537231918421841
Every Word wrapped in manual HTML tags3073290790188838263529492453
Every Line with a manual line break4375831370136313529574244
Every word with a full link39826610571014175516898847
Every word with a full image22813911101101191717733733
Every word with a reference link97269146190192004411763211830614311240
Every block a quote431205136613284744643536
Every block a codeblock68843873771611696119
Every block a list863912173517626026864636
All tests together3319295952455305102529751222173
Test | Actuarius | PegDown | Knockoff | Txtmark
| 1st Run (ms) | 2nd Run (ms) | 1st Run (ms) | 2nd Run (ms) | 1st Run (ms) | 2nd Run (ms) | 1st Run (ms) | 2nd Run (ms)
Plain Paragraphs887461245522367645688947
Every Word Emphasized222020773411340630503305147266
Every Word Strong238422702456246623639235776257
Every Word Inline Code8248042337223723506236225455
Every Word a Fast Link3942373811641159862185958968
Every Word Consisting of Special XML Chars939393127544731480160835873614
Every Word wrapped in manual HTML tags68436828185018598699869211691154
Every Line with a manual line break85972429682946217119905856
Every word with a full link52850122522280351335126660
Every word with a full image39537424632569375737265655
Every word with a reference link1920819035391833871024345024494318261798
Every block a quote465449268726849789774848
Every block a codeblock1511345976012702623627
Every block a list1209110634483432141113685260
All tests together6062604211556115891982719637452448
+* Q: Why is Txtmark so slow when it comes to XML entities?
+* A: Because Txtmark does some sanity checks on XML entities to make sure
+  it outputs valid XML. For example:
+
+        &cutie;
+
+  will produce (when processed with Markdown and most other markdown processors):
+
+        &cutie;
+
+  and
+
+        &amp;cutie;
+
+  when processed with Txtmark.
+
+Tested versions:
 [Actuarius] version: 0.2
 [PegDown] version: 0.8.5.4
 [Knockoff] version: 0.7.3-15
-***
+---
 [Markdown] is copyright (c) 2004 by John Gruber
 [Markdown]: http://daringfireball.net/projects/markdown/
diff --git a/build.xml b/build.xml
index 5c0dc5c..9f21c33 100644
--- a/build.xml
+++ b/build.xml
@@ -19,8 +19,26 @@
- + + + + + + + + + + + +
 linkRefs = new HashMap();
     /** The Decorator. */
-    private final Decorator decorator = new DefaultDecorator();
+    private Decorator decorator;
     /** Constructor. */
-    public Emitter()
+    public Emitter(final Decorator decorator)
     {
-        //
+        this.decorator = decorator;
     }
     /**
@@ -360,62 +360,7 @@ class Emitter
             if(start + 2 < in.length())
             {
                 temp.setLength(0);
-                temp.append('<');
-                pos = start + 1;
-                if(in.charAt(pos) == '/')
-                {
-                    temp.append('/');
-                    pos++;
-                }
-                if(pos < in.length() && Character.isLetter(in.charAt(pos)))
-                {
-                    pos = Utils.readUntil(temp, in, pos, ' ', '/', '>');
-                    if(pos > 0)
-                    {
-                        while(pos < in.length() && in.charAt(pos) == ' ')
-                        {
-                            pos = Utils.skipSpaces(in, pos);
-                            if(pos == -1)
-                                break;
-                            if(in.charAt(pos) == '/')
-                            {
-                                temp.append(" /");
-                                pos++;
-                                break;
-                            }
-                            if(in.charAt(pos) == '>')
-                            {
-                                break;
-                            }
-                            temp.append(' ');
-                            if(!Character.isLetter(in.charAt(pos)))
-                            {
-                                pos = -1;
-                                break;
-                            }
-                            pos = Utils.readUntil(temp, in, pos, '=');
-                            if(pos == -1)
-                                break;
-                            pos = Utils.readUntil(temp, in, pos, '\'', '"');
-                            if(pos == -1)
-                                break;
-                            final char lim = in.charAt(pos);
-                            temp.append(lim);
-                            pos++;
-                            pos = Utils.readRawUntil(temp, in, pos, lim);
-                            if(pos == -1)
-                                break;
-                            temp.append(lim);
-                            pos++;
-                        }
-                        if(pos > 0 && pos < in.length() && in.charAt(pos) == '>')
-                        {
-                            temp.append('>');
-                            out.append(temp);
-                            return pos;
-                        }
-                    }
-                }
+                return Utils.readXML(out, in, start);
             }
             return -1;
@@ -712,8 +657,7 @@ class Emitter
             {
                 out.append(line.value);
             }
-            if(line.next != null)
-                out.append('\n');
+            out.append('\n');
             line = line.next;
         }
     }
diff --git a/src/java/txtmark/HTMLElement.java b/src/java/txtmark/HTMLElement.java
index 6980a90..1513a9a 100644
--- a/src/java/txtmark/HTMLElement.java
+++ b/src/java/txtmark/HTMLElement.java
@@ -11,6 +11,7 @@ package txtmark;
  */
 enum HTMLElement
 {
+    NONE,
     a, abbr, acronym, address, applet, area,
     b, base, basefont, bdo, big, blockquote, body, br, button,
     caption, cite, code, col, colgroup,
diff --git a/src/java/txtmark/Line.java b/src/java/txtmark/Line.java
index d6890c1..b3bcded 100644
--- a/src/java/txtmark/Line.java
+++ b/src/java/txtmark/Line.java
@@ -4,6 +4,8 @@
  */
 package txtmark;
+import java.util.LinkedList;
+
 /**
  * This class represents a text line.
  *
@@ -26,7 +28,8 @@ class Line
     public Line previous = null, next = null;
     /** Is previous/next line empty? */
     public boolean prevEmpty, nextEmpty;
-
+    /** Final line of a XML block. */
+    public Line xmlEndLine;
     /** Constructor. */
     public Line()
     {
@@ -243,6 +246,12 @@ class Line
             return LineType.OLIST;
         }
+        if(this.value.charAt(this.leading) == '<')
+        {
+            if(this.checkHTML())
+                return LineType.XML;
+        }
+
         if(this.next != null && !this.next.isEmpty)
         {
             if((this.next.value.charAt(0) == '-') && (this.next.countChars('-') > 0))
@@ -253,4 +262,133 @@ class Line
         return LineType.OTHER;
     }
+
+    /**
+     * Reads an XML comment. Sets xmlEndLine.
+     *
+     * @param firstLine The Line to start reading from.
+     * @param start The starting position.
+     * @return The new position or -1 if it is no valid comment.
+     */
+    private int readXMLComment(final Line firstLine, final int start)
+    {
+        Line line = firstLine;
+        if(start + 3 < line.value.length())
+        {
+            if(line.value.charAt(2) == '-' && line.value.charAt(3) == '-')
+            {
+                int pos = start + 4;
+                while(line != null)
+                {
+                    while(pos < line.value.length() && line.value.charAt(pos) != '-')
+                    {
+                        pos++;
+                    }
+                    if(pos == line.value.length())
+                    {
+                        line = line.next;
+                        pos = 0;
+                    }
+                    else
+                    {
+                        if(pos + 2 < line.value.length())
+                        {
+                            if(line.value.charAt(pos + 1) == '-' && line.value.charAt(pos + 2) == '>')
+                            {
+                                this.xmlEndLine = line;
+                                return pos + 3;
+                            }
+                        }
+                        pos++;
+                    }
+                }
+            }
+        }
+        return -1;
+    }
+
+    /**
+     * Checks for a valid HTML block. Sets xmlEndLine.
+     *
+     * @return true if it is a valid block.
+     */
+    private boolean checkHTML()
+    {
+        final LinkedList tags = new LinkedList();
+        final StringBuilder temp = new StringBuilder();
+        int pos = this.leading;
+        if(this.value.charAt(this.leading + 1) == '!')
+        {
+            if(this.readXMLComment(this, this.leading) > 0)
+                return true;
+        }
+        pos = Utils.readXML(temp, this.value, this.leading);
+        String element, tag;
+        if(pos > -1)
+        {
+            element = temp.toString();
+            temp.setLength(0);
+            Utils.getXMLTag(temp, element);
+            tag = temp.toString().toLowerCase();
+            if(!HTML.isHtmlBlockElement(tag))
+                return false;
+            if(tag.equals("hr"))
+            {
+                this.xmlEndLine = this;
+                return true;
+            }
+            tags.add(tag);
+
+            Line line = this;
+            while(line != null)
+            {
+                while(pos < line.value.length() && line.value.charAt(pos) != '<')
+                {
+                    pos++;
+                }
+                if(pos >= line.value.length())
+                {
+                    line = line.next;
+                    pos = 0;
+                }
+                else
+                {
+                    temp.setLength(0);
+                    final int newPos = Utils.readXML(temp, line.value, pos);
+                    if(newPos > 0)
+                    {
+                        element = temp.toString();
+                        temp.setLength(0);
+                        Utils.getXMLTag(temp, element);
+                        tag = temp.toString().toLowerCase();
+                        if(HTML.isHtmlBlockElement(tag) && !tag.equals("hr"))
+                        {
+                            if(element.charAt(1) == '/')
+                            {
+                                if(!tags.getLast().equals(tag))
+                                    return false;
+                                tags.removeLast();
+                            }
+                            else
+                            {
+                                tags.addLast(tag);
+                            }
+                        }
+                        if(tags.size() == 0)
+                        {
+                            this.xmlEndLine = line;
+                            break;
+                        }
+                        pos = newPos;
+                    }
+                    else
+                    {
+                        pos++;
+                    }
+                }
+            }
+            return tags.size() == 0;
+        }
+        return false;
+    }
 }
diff --git a/src/java/txtmark/LineType.java b/src/java/txtmark/LineType.java
index 7020d45..807d861 100644
--- a/src/java/txtmark/LineType.java
+++ b/src/java/txtmark/LineType.java
@@ -24,5 +24,7 @@ enum LineType
     /** A block quote. */
     BQUOTE,
     /** A horizontal ruler. */
-    HR
+    HR,
+    /** Start of a XML block. */
+    XML
 }
diff --git a/src/java/txtmark/Processor.java b/src/java/txtmark/Processor.java
index 6ed2879..60197c2 100644
--- a/src/java/txtmark/Processor.java
+++ b/src/java/txtmark/Processor.java
@@ -23,20 +23,21 @@ public class Processor
     /** The reader. */
     private final Reader reader;
     /** The emitter. */
-    private Emitter emitter = new Emitter();
+    private final Emitter emitter;
     /**
      * Constructor.
      *
      * @param reader The input reader.
      */
-    private Processor(Reader reader)
+    private Processor(Reader reader, Decorator decorator)
     {
         this.reader = reader;
+        this.emitter = new Emitter(decorator);
     }
     /**
-     * Transforms an input String into XHTML.
+     * Transforms an input String into XHTML using the default Decorator.
      *
      * @param input The String to process.
      * @return The processed String.
@@ -48,7 +49,19 @@ public class Processor
     }
     /**
-     * Transforms an input file into XHTML using UTF-8 encoding.
+     * Transforms an input String into XHTML.
+     *
+     * @param input The String to process.
+     * @return The processed String.
+     * @throws IOException if an IO error occurs
+     */
+    public static String process(final String input, final Decorator decorator) throws IOException
+    {
+        return process(new StringReader(input), decorator);
+    }
+
+    /**
+     * Transforms an input file into XHTML using UTF-8 encoding and the default Decorator.
      *
      * @param file The File to process.
      * @return The processed String.
@@ -60,7 +73,19 @@ public class Processor
     }
     /**
-     * Transforms an input file into XHTML.
+     * Transforms an input file into XHTML using UTF-8 encoding.
+     *
+     * @param file The File to process.
+     * @return The processed String.
+     * @throws IOException if an IO error occurs
+     */
+    public static String process(final File file, final Decorator decorator) throws IOException
+    {
+        return process(file, "UTF-8", decorator);
+    }
+
+    /**
+     * Transforms an input file into XHTML using the default Decorator.
      *
      * @param file The File to process.
      * @param encoding The encoding to use.
@@ -69,13 +94,37 @@ public class Processor
      */
     public static String process(final File file, final String encoding) throws IOException
     {
-        final Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(file), encoding));
-        final Processor p = new Processor(r);
-        final String ret = p.process();
-        r.close();
+        return process(file, encoding, new DefaultDecorator());
+    }
+
+    /**
+     * Transforms an input file into XHTML.
+     *
+     * @param file The File to process.
+     * @param encoding The encoding to use.
+     * @return The processed String.
+     * @throws IOException if an IO error occurs
+     */
+    public static String process(final File file, final String encoding, final Decorator decorator) throws IOException
+    {
+        final FileInputStream input = new FileInputStream(file);
+        final String ret = process(input, encoding, decorator);
+        input.close();
         return ret;
     }
+    /**
+     * Transforms an input stream into XHTML using UTF-8 encoding using the default Decorator.
+     *
+     * @param input The InputStream to process.
+     * @return The processed String.
+     * @throws IOException if an IO error occurs
+     */
+    public static String process(final InputStream input) throws IOException
+    {
+        return process(input, "UTF-8", new DefaultDecorator());
+    }
+
     /**
      * Transforms an input stream into XHTML using UTF-8 encoding.
      *
@@ -83,9 +132,23 @@ public class Processor
      * @return The processed String.
      * @throws IOException if an IO error occurs
      */
-    public static String process(final InputStream input) throws IOException
+    public static String process(final InputStream input, final Decorator decorator) throws IOException
     {
-        return process(input, "UTF-8");
+        return process(input, "UTF-8", decorator);
+    }
+
+    /**
+     * Transforms an input stream into XHTML using the default Decorator.
+     *
+     * @param input The InputStream to process.
+     * @param encoding The encoding to use.
+     * @return The processed String.
+     * @throws IOException if an IO error occurs
+     */
+    public static String process(final InputStream input, final String encoding) throws IOException
+    {
+        final Processor p = new Processor(new BufferedReader(new InputStreamReader(input, encoding)), new DefaultDecorator());
+        return p.process();
     }
     /**
@@ -96,9 +159,24 @@ public class Processor
      * @return The processed String.
      * @throws IOException if an IO error occurs
      */
-    public static String process(final InputStream input, final String encoding) throws IOException
+    public static String process(final InputStream input, final String encoding, final Decorator decorator) throws IOException
     {
-        final Processor p = new Processor(new BufferedReader(new InputStreamReader(input, encoding)));
+        final Processor p = new Processor(new BufferedReader(new InputStreamReader(input, encoding)), decorator);
+        return p.process();
+    }
+
+    /**
+     * Transforms an input stream into XHTML using the default Decorator.
+     *
+     * @param reader The Reader to process.
+     * @return The processed String.
+     * @throws IOException if an IO error occurs
+     */
+    public static String process(final Reader reader) throws IOException
+    {
+        final Processor p = new Processor(
+                !(reader instanceof BufferedReader) ? new BufferedReader(reader) : reader,
+                new DefaultDecorator());
         return p.process();
     }
@@ -109,9 +187,11 @@ public class Processor
      * @return The processed String.
      * @throws IOException if an IO error occurs
      */
-    public static String process(final Reader reader) throws IOException
+    public static String process(final Reader reader, final Decorator decorator) throws IOException
     {
-        final Processor p = new Processor(!(reader instanceof BufferedReader) ? new BufferedReader(reader) : reader);
+        final Processor p = new Processor(
+                !(reader instanceof BufferedReader) ? new BufferedReader(reader) : reader,
+                decorator);
         return p.process();
     }
@@ -319,7 +399,9 @@ public class Processor
             final LineType t = line.getLineType();
             if(listMode && (t == LineType.OLIST || t == LineType.ULIST))
                 break;
-            if(t == LineType.HEADLINE || t == LineType.HEADLINE1 || t == LineType.HEADLINE2 || t == LineType.HR || t == LineType.BQUOTE)
+            if(t == LineType.HEADLINE || t == LineType.HEADLINE1 || t == LineType.HEADLINE2
+                    || t == LineType.HR || t == LineType.BQUOTE
+                    || t == LineType.XML)
                 break;
             line = line.next;
         }
@@ -349,6 +431,16 @@ public class Processor
                 block.type = BlockType.CODE;
                 block.removeSurroundingEmptyLines();
                 break;
+            case XML:
+                if(line.previous != null)
+                {
+                    // FIXME ... this looks wrong
+                    root.split(line.previous);
+                }
+                root.split(line.xmlEndLine).type = BlockType.XML;
+                root.removeLeadingEmptyLines();
+                line = root.lines;
+                break;
             case BQUOTE:
                 while(line != null)
                 {
@@ -366,6 +458,7 @@ public class Processor
             case HR:
                 if(line.previous != null)
                 {
+                    // FIXME ... this looks wrong
                     root.split(line.previous);
                 }
                 root.split(line).type = BlockType.RULER;
@@ -442,8 +535,6 @@ public class Processor
     {
         final StringBuilder out = new StringBuilder();
-//        long t0 = System.nanoTime();
-
         final Block parent = this.readLines();
         parent.removeSurroundingEmptyLines();
@@ -455,9 +546,6 @@ public class Processor
             block = block.next;
         }
-//        t0 = System.nanoTime() - t0;
-//        out.append(String.format("\n\n", (int)(t0 * 1e-6)));
-
         return out.toString();
     }
 }
diff --git a/src/java/txtmark/Utils.java b/src/java/txtmark/Utils.java
index 4144fc4..62ac1de 100644
--- a/src/java/txtmark/Utils.java
+++ b/src/java/txtmark/Utils.java
@@ -428,4 +428,98 @@ class Utils
         }
     }
 }
+
+    /**
+     * Extracts the tag from an XML element.
+     *
+     * @param out The StringBuilder to write to.
+     * @param in Input StringBuilder.
+     */
+    public static void getXMLTag(final StringBuilder out, final StringBuilder in)
+    {
+        int pos = 1;
+        if(in.charAt(1) == '/')
+            pos++;
+        while(Character.isLetterOrDigit(in.charAt(pos)))
+        {
+            out.append(in.charAt(pos++));
+        }
+    }
+
+    /**
+     * Extracts the tag from an XML element.
+     *
+     * @param out The StringBuilder to write to.
+     * @param in Input String.
+     */
+    public static void getXMLTag(final StringBuilder out, final String in)
+    {
+        int pos = 1;
+        if(in.charAt(1) == '/')
+            pos++;
+        while(Character.isLetterOrDigit(in.charAt(pos)))
+        {
+            out.append(in.charAt(pos++));
+        }
+    }
+
+    /**
+     * Reads an XML element.
+     *
+     * @param out The StringBuilder to write to.
+     * @param in Input String.
+     * @param start Starting position.
+     * @return The new position or -1 if this is no valid XML element.
+     */
+    public static int readXML(final StringBuilder out, final String in, final int start)
+    {
+        int pos;
+        if(in.charAt(start + 1) == '/')
+        {
+            out.append("'); if(pos == -1) return -1;
+        pos = skipSpaces(in, pos);
+        if(Character.isLetter(in.charAt(pos)))
+        {
+            while(in.charAt(pos) != '/' && in.charAt(pos) != '>')
+            {
+                out.append(' ');
+                pos = readRawUntil(out, in, pos, ' ', '=');
+                if(pos == -1) return -1;
+                pos = skipSpaces(in, pos);
+                if(pos == -1) return -1;
+                out.append('=');
+                pos = skipSpaces(in, pos + 1);
+                if(pos == -1) return -1;
+                final char lim = in.charAt(pos);
+                if(lim != '\'' && lim != '"') return -1;
+                out.append(lim);
+                pos = readRawUntil(out, in, pos + 1, lim);
+                if(pos == -1) return -1;
+                out.append(lim);
+                pos = skipSpaces(in, pos + 1);
+                if(pos == -1) return -1;
+            }
+
+        }
+        if(in.charAt(pos) == '/')
+        {
+            out.append('/');
+            pos++;
+        }
+        if(in.charAt(pos) == '>')
+        {
+            out.append('>');
+            return pos;
+        }
+        return -1;
+    }
 }
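
Review note on the block-level HTML detection: `Line.checkHTML()` pushes each opening block tag onto a list, pops on a matching closing tag, and treats the block as finished once the list is empty. The sketch below illustrates that idea in isolation and is not the code from this commit. `TagBalanceSketch` and `findBlockEnd` are hypothetical names. It works on a single string instead of linked `Line` objects, tracks every tag rather than only block-level ones, and does not handle comments or self-closing tags. The tag-name loop is bounded by the closing `>`, unlike `getXMLTag`, which reads until the first non-alphanumeric character.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified stand-in for the tag-stack check in Line.checkHTML(). */
final class TagBalanceSketch
{
    /**
     * Returns the index just past the tag that closes the first opened element,
     * or -1 if a closing tag does not match or the element is never closed.
     */
    static int findBlockEnd(final String text)
    {
        final Deque<String> open = new ArrayDeque<>();
        int pos = 0;
        while(pos < text.length())
        {
            final int lt = text.indexOf('<', pos);
            if(lt < 0)
                break;
            final int gt = text.indexOf('>', lt);
            if(gt < 0)
                break;
            final boolean closing = lt + 1 < gt && text.charAt(lt + 1) == '/';
            // Read the tag name, never running past the closing '>'.
            final int nameStart = closing ? lt + 2 : lt + 1;
            int nameEnd = nameStart;
            while(nameEnd < gt && Character.isLetterOrDigit(text.charAt(nameEnd)))
                nameEnd++;
            final String name = text.substring(nameStart, nameEnd).toLowerCase();
            if(name.isEmpty())
            {
                // Not a tag this sketch understands (e.g. a comment); skip the '<'.
                pos = lt + 1;
                continue;
            }
            if(closing)
            {
                if(open.isEmpty() || !open.peekLast().equals(name))
                    return -1;
                open.removeLast();
                if(open.isEmpty())
                    return gt + 1;
            }
            else
            {
                open.addLast(name);
            }
            pos = gt + 1;
        }
        return -1;
    }

    public static void main(final String[] args)
    {
        System.out.println(findBlockEnd("<div><p>x</p></div> tail"));
        System.out.println(findBlockEnd("<div><p>x</div>"));
    }
}
```

The early `return -1` on a mismatched closing tag mirrors the `return false` in `checkHTML()`, so a malformed block falls back to ordinary Markdown processing instead of being passed through as raw HTML.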