Sunday, February 22, 2015

Extract a LaTeX equation block from an HTML page with querySelectorAll


I am trying to extract LaTeX equations from an HTML page (generated with latex2html) in order to replace the equation images with MathJax formulas.


My first idea was the following. Here is an example:


Input:



<div align="CENTER" class="mathdisplay"><a name="eq402"></a><!-- MATH
\begin{equation}
\text{d}\,v_{k}=\partial_{j}\,v_{k}\,\dfrac{\text{d}\,y^{j}}{\text{d}\,s}\,\text{d}\,s
\end{equation}
-->
<table class="equation" cellpadding="0" width="100%" align="CENTER">
<tr valign="MIDDLE">
<td nowrap align="CENTER"><span class="MATH">d<img width="150" height="65" align="MIDDLE" border="0" src="img1919.gif" alt="$\displaystyle \,v_{k}=\partial_{j}\,v_{k}\,\dfrac{\text{d}\,y^{j}}{\text{d}\,s}\,\text{d}\,s$"></span></td>
<td nowrap class="eqno" width="10" align="RIGHT">
(<span class="arabic">5</span>.<span class="arabic">65</span>)</td></tr>
</table></div>


By inserting the following JavaScript code at the bottom of the HTML page:



<script type="text/javascript">
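// Replace each equation image with the LaTeX source stored in its alt attribute.
function transform() {
  [].forEach.call(document.querySelectorAll('table tr img'), function (img) {
    var puretext = img.getAttribute('alt');
    // Skip images without alt text and the latex2html navigation icons.
    var navigation = ['up', 'previous', 'next', 'contents'];
    if (!puretext || navigation.indexOf(puretext) !== -1) return;
    // Reduce the leading "$\displaystyle " to a bare "$" delimiter.
    puretext = puretext.replace(/\$\\displaystyle /g, "$");
    // Put the LaTeX source in front of the image, then hide the image.
    var text = document.createTextNode(puretext);
    img.parentNode.insertBefore(text, img);
    img.style.display = 'none';
  });
}
transform();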
</script>


I get the following on my HTML page, i.e. the MathJax formula:



$\,v_{k}=\partial_{j}\,v_{k}\,\dfrac{\text{d}\,y^{j}}{\text{d}\,s}\,\text{d}\,s$


That could be enough, but I noticed that in the HTML page the "alt" attribute sometimes contains an incomplete formula. Here is an example:



<div align="CENTER" class="mathdisplay"><a name="eq407"></a><!-- MATH
\begin{equation}
\text{d}\,(\mathbf{V}\,\cdot\,\mathbf{n})=\mathbf{V_{M}}(M')\,\cdot\,\mathbf{n}-\mathbf{V}(M)\,\cdot\,\mathbf{n}=[\mathbf{V_{M}}(M')-\mathbf{V}(M)]\,\cdot\,\mathbf{n}=\text{d}\,\mathbf{V}\,\cdot\,\mathbf{n}
\end{equation}
-->
<table class="equation" cellpadding="0" width="100%" align="CENTER">
<tr valign="MIDDLE">
<td nowrap align="CENTER"><span class="MATH">d<img width="538" height="38" align="MIDDLE" border="0" src="img1929.gif" alt="$\displaystyle \,(\mathbf{V}\,\cdot\,\mathbf{n})=\mathbf{V_{M}}(M')\,\cdot\,\mat...
...V}(M)\,\cdot\,\mathbf{n}=[\mathbf{V_{M}}(M')-\mathbf{V}(M)]\,\cdot\,\mathbf{n}=$">d<img width="56" height="34" align="MIDDLE" border="0" src="img1930.gif" alt="$\displaystyle \,\mathbf{V}\,\cdot\,\mathbf{n}$"></span></td>
<td nowrap class="eqno" width="10" align="RIGHT">
(<span class="arabic">5</span>.<span class="arabic">70</span>)</td></tr>
</table></div>


As you can see, the "alt" attribute of the first <img> is:


$\displaystyle \,(\mathbf{V}\,\cdot\,\mathbf{n})=\mathbf{V_{M}}(M')\,\cdot\,\mat... ...V}(M)\,\cdot\,\mathbf{n}=[\mathbf{V_{M}}(M')-\mathbf{V}(M)]\,\cdot\,\mathbf{n}=$


latex2html has not written the entire LaTeX equation into the attribute (see the "..." characters).


So I can't always rely on the img alt attribute, and I would like to use the \begin{equation} ... \end{equation} block that sits inside the HTML comment ( <!-- ... --> ) instead.
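To tell the two cases apart, the truncation could be detected from the alt text itself. This is only a sketch: it assumes latex2html always marks the cut the way it does in the example above, with "..." on both sides of it, and isTruncatedAlt is a hypothetical helper name, not something latex2html or MathJax provides.

```javascript
// Hypothetical helper: true when the alt text looks elided by latex2html.
// Assumption: the cut is always written as "..." + whitespace + "...",
// as in the single example shown in this post.
function isTruncatedAlt(alt) {
  return typeof alt === 'string' && /\.{3}\s*\.{3}/.test(alt);
}
```

Images for which this returns false could keep going through the alt-based replacement, and only the others would need the comment block.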


How can I get this comment block with querySelectorAll? Is there, for example, something like document.querySelectorAll('div.mathdisplay a comments'), function(comments) { ... } that would let me extract this comment block?


If I could get this text block, I would save it in a variable and insert it before the img tag, as I did in my first attempt, like this:



var text = document.createTextNode(puretext);
img.parentNode.insertBefore(text, img);
img.style.display = 'none';
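For what it is worth, CSS selectors only match elements, so querySelectorAll cannot return comment nodes. One possible direction is to walk the child nodes of each div.mathdisplay and keep those whose nodeType is 8 (Node.COMMENT_NODE). The sketch below assumes the comment is a direct child of the div and always starts with the word MATH, as in the two examples above; extractMathSource and findMathSources are hypothetical helper names.

```javascript
// Hypothetical helper: return the LaTeX source held in a latex2html
// "<!-- MATH ... -->" comment, or null if the text is some other comment.
function extractMathSource(commentText) {
  var match = /^\s*MATH\s+([\s\S]*?)\s*$/.exec(commentText);
  return match ? match[1] : null;
}

// Hypothetical helper: collect the LaTeX source of every MATH comment that
// is a direct child of `container` (e.g. one div.mathdisplay element).
function findMathSources(container) {
  var sources = [];
  for (var node = container.firstChild; node; node = node.nextSibling) {
    if (node.nodeType !== 8) continue; // 8 is Node.COMMENT_NODE
    var source = extractMathSource(node.nodeValue);
    if (source !== null) sources.push(source);
  }
  return sources;
}
```

In the page, findMathSources(div) could be called for each div returned by document.querySelectorAll('div.mathdisplay'), and the returned string could then go through the same createTextNode / insertBefore steps as above. Since the comment holds the whole equation while the span may hold several images, all images of that equation would need to be hidden, not just the first one.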


Any help would be appreciated.




