diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000000000000000000000000000000000000..f46f00d6ee530dac728923ce79d3f542330db189
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1,18 @@
+*.png filter=lfs diff=lfs merge=lfs -text
+*.jpg filter=lfs diff=lfs merge=lfs -text
+*.jpeg filter=lfs diff=lfs merge=lfs -text
+*.gif filter=lfs diff=lfs merge=lfs -text
+*.mp4 filter=lfs diff=lfs merge=lfs -text
+*.mov filter=lfs diff=lfs merge=lfs -text
+*.avi filter=lfs diff=lfs merge=lfs -text
+*.csv filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.mp3 filter=lfs diff=lfs merge=lfs -text
+# the package and package lock should not be tracked by LFS
+package.json -filter -diff -merge text
+package-lock.json -filter -diff -merge text
+# Notion imported images should NOT be in LFS (needed for Docker build)
+app/src/content/assets/image/image_27877f1c*.png -filter -diff -merge text
+app/scripts/notion-importer/output/** -filter -diff -merge text
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..3344bbd884b24bdc19198c0b2725a89b7593f84e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,41 @@
+# Python
+__pycache__
+*.py[cod]
+*.so
+.Python
+env/
+venv/
+*.egg-info/
+dist/
+build/
+*.egg
+.idea/
+.vscode/
+.astro/
+.claude/
+*.swp
+.DS_Store
+# Node
+node_modules/
+*.log
+*.env
+*.cache
+.notion-to-md
+
+app/scripts/latex-to-mdx/output/
+app/scripts/notion-importer/output/**/*
+app/src/content/embeds/typography/generated
+
+# PDF export
+app/public/*.pdf
+app/public/*.png
+app/public/*.jpg
+app/public/data/**/*
+
+.astro/
+
+# Template sync temporary directories
+.template-sync/
+.temp-*/
+.backup-*/
+
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000000000000000000000000000000000000..5837b2b57b8d319f7a12c1b0ff413044b7792f33
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,118 @@
+# 
Changelog + +All notable changes to the Research Article Template will be documented in this file. + +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), +and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). + +## [Unreleased] + +### Added +- Initial open source release +- Comprehensive documentation +- Contributing guidelines +- License file + +## [1.0.0] - 2024-12-19 + +### Added +- **Core Features**: + - Markdown/MDX-based writing system + - KaTeX mathematical notation support + - Syntax highlighting for code blocks + - Academic citations with BibTeX integration + - Footnotes and sidenotes system + - Auto-generated table of contents + - Interactive Mermaid diagrams + - Plotly.js and D3.js integration + - HTML embed support + - Gradio app embedding + - Dataviz color palettes + - Image optimization + - SEO-friendly structure + - Automatic PDF export + - Dark/light theme toggle + - Mobile-responsive design + - LaTeX import functionality + - Template synchronization system + +- **Components**: + - Figure component with captions + - MultiFigure for image galleries + - Note component with variants + - Quote component + - Accordion for collapsible content + - Sidenote component + - Table of Contents + - Theme Toggle + - HTML Embed + - Raw HTML support + - SEO component + - Hero section + - Footer + - Full-width and wide layouts + +- **Build System**: + - Astro 4.10.0 integration + - PostCSS with custom media queries + - Automatic compression + - Docker support + - Nginx configuration + - Git LFS support + +- **Scripts**: + - PDF export functionality + - LaTeX to MDX conversion + - Template synchronization + - Font SVG generation + - TrackIO data generation + +- **Documentation**: + - Getting started guide + - Writing best practices + - Component reference + - LaTeX conversion guide + - Interactive examples + +### Technical Details +- **Framework**: Astro 4.10.0 +- **Styling**: PostCSS with custom 
properties +- **Math**: KaTeX 0.16.22 +- **Charts**: Plotly.js 3.1.0, D3.js 7.9.0 +- **Diagrams**: Mermaid 11.10.1 +- **Node.js**: >=20.0.0 +- **License**: CC-BY-4.0 + +### Browser Support +- Chrome (latest) +- Firefox (latest) +- Safari (latest) +- Edge (latest) + +--- + +## Version History + +- **1.0.0**: Initial stable release with full feature set +- **0.0.1**: Development version (pre-release) + +## Migration Guide + +### From 0.0.1 to 1.0.0 + +This is the first stable release. No breaking changes from the development version. + +### Updating Your Project + +Use the template synchronization system to update: + +```bash +npm run sync:template -- --dry-run # Preview changes +npm run sync:template # Apply updates +``` + +## Support + +- **Documentation**: [Hugging Face Space](https://huggingface.co/spaces/tfrere/research-article-template) +- **Issues**: [Community Discussions](https://huggingface.co/spaces/tfrere/research-article-template/discussions) +- **Contact**: [@tfrere](https://huggingface.co/tfrere) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000000000000000000000000000000000000..a4573b5d9abcd9e9ba35095677d0443b157298ec --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,196 @@ +# Contributing to Research Article Template + +Thank you for your interest in contributing to the Research Article Template! This document provides guidelines and information for contributors. + +## 🤝 How to Contribute + +### Reporting Issues + +Before creating an issue, please: +1. **Search existing issues** to avoid duplicates +2. **Use the issue template** when available +3. **Provide detailed information**: + - Clear description of the problem + - Steps to reproduce + - Expected vs actual behavior + - Environment details (OS, Node.js version, browser) + - Screenshots if applicable + +### Suggesting Features + +We welcome feature suggestions! Please: +1. **Check existing discussions** first +2. **Describe the use case** clearly +3. 
**Explain the benefits** for the community +4. **Consider implementation complexity** + +### Code Contributions + +#### Getting Started + +1. **Fork the repository** on Hugging Face +2. **Clone your fork**: + ```bash + git clone git@hf.co:spaces//research-article-template + cd research-article-template + ``` +3. **Install dependencies**: + ```bash + cd app + npm install + ``` +4. **Create a feature branch**: + ```bash + git checkout -b feature/your-feature-name + ``` + +#### Development Workflow + +1. **Make your changes** following our coding standards +2. **Test thoroughly**: + ```bash + npm run dev # Test locally + npm run build # Ensure build works + ``` +3. **Update documentation** if needed +4. **Commit with clear messages**: + ```bash + git commit -m "feat: add new component for interactive charts" + ``` + +#### Pull Request Process + +1. **Push your branch**: + ```bash + git push origin feature/your-feature-name + ``` +2. **Create a Pull Request** with: + - Clear title and description + - Reference related issues + - Screenshots for UI changes + - Testing instructions + +## 📋 Coding Standards + +### Code Style + +- **Use Prettier** for consistent formatting +- **Follow existing patterns** in the codebase +- **Write clear, self-documenting code** +- **Add comments** for complex logic +- **Use meaningful variable names** + +### File Organization + +- **Components**: Place in `src/components/` +- **Styles**: Use CSS modules or component-scoped styles +- **Assets**: Organize in `src/content/assets/` +- **Documentation**: Update relevant `.mdx` files + +### Commit Message Format + +We follow [Conventional Commits](https://www.conventionalcommits.org/): + +``` +type(scope): description + +feat: add new interactive chart component +fix: resolve mobile layout issues +docs: update installation instructions +style: improve button hover states +refactor: simplify component structure +test: add unit tests for utility functions +``` + +**Types**: `feat`, `fix`, `docs`, 
`style`, `refactor`, `test`, `chore` + +## 🧪 Testing + +### Manual Testing + +Before submitting: +- [ ] Test on different screen sizes +- [ ] Verify dark/light theme compatibility +- [ ] Check browser compatibility (Chrome, Firefox, Safari) +- [ ] Test with different content types +- [ ] Ensure accessibility standards + +### Automated Testing + +```bash +# Run build to catch errors +npm run build + +# Test PDF export +npm run export:pdf + +# Test LaTeX conversion +npm run latex:convert +``` + +## 📚 Documentation + +### Writing Guidelines + +- **Use clear, concise language** +- **Provide examples** for complex features +- **Include screenshots** for UI changes +- **Update both English content and code comments** + +### Documentation Structure + +- **README.md**: Project overview and quick start +- **CONTRIBUTING.md**: This file +- **Content files**: In `src/content/chapters/demo/` +- **Component docs**: Inline comments and examples + +## 🎯 Areas for Contribution + +### High Priority + +- **Bug fixes** and stability improvements +- **Accessibility enhancements** +- **Mobile responsiveness** +- **Performance optimizations** +- **Documentation improvements** + +### Feature Ideas + +- **New interactive components** +- **Additional export formats** +- **Enhanced LaTeX import** +- **Theme customization** +- **Plugin system** + +### Community + +- **Answer questions** in discussions +- **Share examples** of your work +- **Write tutorials** and guides +- **Help with translations** + +## 🚫 What Not to Contribute + +- **Breaking changes** without discussion +- **Major architectural changes** without approval +- **Dependencies** that significantly increase bundle size +- **Features** that don't align with the project's goals + +## 📞 Getting Help + +- **Discussions**: [Community tab](https://huggingface.co/spaces/tfrere/research-article-template/discussions) +- **Issues**: [Report 
bugs](https://huggingface.co/spaces/tfrere/research-article-template/discussions?status=open&type=issue) +- **Contact**: [@tfrere](https://huggingface.co/tfrere) on Hugging Face + +## 📄 License + +By contributing, you agree that your contributions will be licensed under the same [CC-BY-4.0 license](LICENSE) that covers the project. + +## 🙏 Recognition + +Contributors will be: +- **Listed in acknowledgments** (if desired) +- **Mentioned in release notes** for significant contributions +- **Credited** in relevant documentation + +Thank you for helping make scientific writing more accessible and interactive! 🎉 diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 0000000000000000000000000000000000000000..8073f800d5831c53c02b9758b1282cbc6f7ef718 --- /dev/null +++ b/Dockerfile @@ -0,0 +1,77 @@ +# Use an official Node runtime as the base image for building the application +# Build with Playwright (browsers and deps ready) +FROM mcr.microsoft.com/playwright:v1.55.0-jammy AS build + +# Install git, git-lfs, and dependencies for Pandoc (only if ENABLE_LATEX_CONVERSION=true) +RUN apt-get update && apt-get install -y git git-lfs wget && apt-get clean + +# Install latest Pandoc from GitHub releases (only installed if needed later) +RUN wget -qO- https://github.com/jgm/pandoc/releases/download/3.8/pandoc-3.8-linux-amd64.tar.gz | tar xzf - -C /tmp && \ + cp /tmp/pandoc-3.8/bin/pandoc /usr/local/bin/ && \ + cp /tmp/pandoc-3.8/bin/pandoc-lua /usr/local/bin/ && \ + rm -rf /tmp/pandoc-3.8 + +# Set the working directory in the container +WORKDIR /app + +# Copy package.json and package-lock.json +COPY app/package*.json ./ + +# Install dependencies +RUN npm install + +# Copy the rest of the application code +COPY app/ . 
+ +# Conditionally convert LaTeX to MDX if ENABLE_LATEX_CONVERSION=true +ARG ENABLE_LATEX_CONVERSION=false +RUN if [ "$ENABLE_LATEX_CONVERSION" = "true" ]; then \ + echo "🔄 LaTeX importer enabled - running latex:convert..."; \ + npm run latex:convert; \ + else \ + echo "⏭️ LaTeX importer disabled - skipping..."; \ + fi + +# Pre-install notion-importer dependencies (for runtime import) +# Note: Notion import is done at RUNTIME (not build time) to access secrets +RUN cd scripts/notion-importer && npm install && cd ../.. + +# Ensure `public/data` is a real directory with real files (not a symlink) +# This handles the case where `public/data` is a symlink in the repo, which +# would be broken inside the container after COPY. +RUN set -e; \ + if [ -e public ] && [ ! -d public ]; then rm -f public; fi; \ + mkdir -p public; \ + if [ -L public/data ] || { [ -e public/data ] && [ ! -d public/data ]; }; then rm -f public/data; fi; \ + mkdir -p public/data; \ + cp -a src/content/assets/data/. public/data/ + +# Build the application (with minimal placeholder content) +RUN npm run build + +# Generate the PDF (light theme, full wait) +RUN npm run export:pdf -- --theme=light --wait=full + +# Generate LaTeX export +RUN npm run export:latex + +# Install nginx in the build stage (we'll use this image as final to keep Node.js) +RUN apt-get update && apt-get install -y nginx && apt-get clean && rm -rf /var/lib/apt/lists/* + +# Copy nginx configuration +COPY nginx.conf /etc/nginx/nginx.conf + +# Copy entrypoint script +COPY entrypoint.sh /entrypoint.sh +RUN chmod +x /entrypoint.sh + +# Create necessary directories and set permissions +RUN mkdir -p /var/cache/nginx /var/run /var/log/nginx /var/lib/nginx/body && \ + chmod -R 777 /var/cache/nginx /var/run /var/log/nginx /var/lib/nginx /etc/nginx/nginx.conf && \ + chmod -R 777 /app + +# Expose port 8080 +EXPOSE 8080 + +# Use entrypoint script that handles Notion import if enabled +ENTRYPOINT ["/entrypoint.sh"] diff --git a/LICENSE 
b/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..b267a53137822114e4c0bcef2e6383aaf52a70f1 --- /dev/null +++ b/LICENSE @@ -0,0 +1,33 @@ +Creative Commons Attribution 4.0 International License + +Copyright (c) 2024 Thibaud Frere + +This work is licensed under the Creative Commons Attribution 4.0 International License. +To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ +or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. + +You are free to: + + Share — copy and redistribute the material in any medium or format + Adapt — remix, transform, and build upon the material for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the license terms. + +Under the following terms: + + Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. + + No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits. + +Notices: + + You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation. + + No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material. 
+ +--- + +For the source code and technical implementation: +- The source code is available at: https://huggingface.co/spaces/tfrere/research-article-template +- Third-party figures and assets are excluded from this license and marked in their captions +- Dependencies and third-party libraries maintain their respective licenses diff --git a/README.md b/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e6088b71bbdefd489e78b8e13ec1c4e28e38bafb --- /dev/null +++ b/README.md @@ -0,0 +1,122 @@ +--- +title: 'The Smol Training Playbook: The Secrets to Building World-Class LLMs' +short_desc: 'A practical journey behind training SOTA LLMs' +emoji: 📝 +colorFrom: blue +colorTo: indigo +sdk: docker +pinned: false +header: mini +app_port: 8080 +tags: + - research-article-template + - research paper + - scientific paper + - data visualization +thumbnail: https://HuggingFaceTB-smol-training-playbook.hf.space/thumb.png +--- +
+ +# Research Article Template + +[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) +[![Node.js Version](https://img.shields.io/badge/node-%3E%3D20.0.0-brightgreen.svg)](https://nodejs.org/) +[![Astro](https://img.shields.io/badge/Astro-4.10.0-orange.svg)](https://astro.build/) +[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/tfrere/research-article-template) + + +**A modern, interactive template for scientific writing** that brings papers to life with web-native features. The web offers what static PDFs can't: **interactive diagrams**, **progressive notation**, and **exploratory views** that show how ideas behave. This template treats interactive artifacts—figures, math, code, and inspectable experiments—as **first-class** alongside prose, helping readers **build intuition** instead of skimming results—all with **minimal setup** and no web knowledge required. + +**[Try the live demo & documentation →](https://huggingface.co/spaces/tfrere/research-article-template)** + +
+ +## 🚀 Quick Start + +### Option 1: Duplicate on Hugging Face (Recommended) + +1. Visit **[🤗 Research Article Template](https://huggingface.co/spaces/tfrere/research-article-template)** +2. Click **"Duplicate this Space"** +3. Clone your new repository: + ```bash + git clone git@hf.co:spaces// + cd + ``` + +### Option 2: Clone Directly + +```bash +git clone https://github.com/tfrere/research-article-template.git +cd research-article-template +``` + +### Installation + +```bash +# Install Node.js 20+ (use nvm for version management) +nvm install 20 +nvm use 20 + +# Install Git LFS and pull assets +git lfs install +git lfs pull + +# Install dependencies +cd app +npm install + +# Start development server +npm run dev +``` + +Visit `http://localhost:4321` to see your site! + +## 🎯 Who This Is For + +- **Scientists** writing modern, web-native research papers +- **Educators** creating interactive, explorable lessons +- **Researchers** who want to focus on ideas, not infrastructure +- **Anyone** who values clear, engaging technical communication + +## 🌟 Inspired by Distill + +This template carries forward the spirit of [Distill](https://distill.pub/) (2016–2021), pushing interactive scientific writing even further with: +- Accessible, high-quality explanations +- Reproducible, production-ready demos +- Modern web technologies and best practices + +## 🤝 Contributing + +We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details. + +### Ways to Contribute + +- **Report bugs** - Open an issue with detailed information +- **Suggest features** - Share ideas for improvements +- **Improve documentation** - Help others get started +- **Submit code** - Fix bugs or add features +- **Join discussions** - Share feedback and ideas + +## 📄 License + +This project is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/). 
+ +- **Diagrams and text**: CC-BY 4.0 +- **Source code**: Available on [Hugging Face](https://huggingface.co/spaces/tfrere/research-article-template) +- **Third-party figures**: Excluded and marked in captions + +## 🙏 Acknowledgments + +- Inspired by [Distill](https://distill.pub/) and the interactive scientific writing movement +- Built with [Astro](https://astro.build/), [MDX](https://mdxjs.com/), and modern web technologies +- Community feedback and contributions from researchers worldwide + +## 📞 Support + +- **[Community Discussions](https://huggingface.co/spaces/tfrere/research-article-template/discussions)** - Ask questions and share ideas +- **[Report Issues](https://huggingface.co/spaces/tfrere/research-article-template/discussions?status=open&type=issue)** - Bug reports and feature requests +- **Contact**: [@tfrere](https://huggingface.co/tfrere) on Hugging Face + +--- + +**Made with ❤️ for the scientific community** \ No newline at end of file diff --git a/app/astro.config.mjs b/app/astro.config.mjs new file mode 100644 index 0000000000000000000000000000000000000000..9fb9cff3932e93e3d7e31db4cf75df045cbf821a --- /dev/null +++ b/app/astro.config.mjs @@ -0,0 +1,80 @@ +import { defineConfig } from 'astro/config'; +import mdx from '@astrojs/mdx'; +import svelte from '@astrojs/svelte'; +import mermaid from 'astro-mermaid'; +import compressor from 'astro-compressor'; +import remarkMath from 'remark-math'; +import rehypeKatex from 'rehype-katex'; +import remarkFootnotes from 'remark-footnotes'; +import rehypeSlug from 'rehype-slug'; +import rehypeAutolinkHeadings from 'rehype-autolink-headings'; +import rehypeCitation from 'rehype-citation'; +import rehypeCodeCopy from './plugins/rehype/code-copy.mjs'; +import rehypeReferencesAndFootnotes from './plugins/rehype/post-citation.mjs'; +import remarkIgnoreCitationsInCode from './plugins/remark/ignore-citations-in-code.mjs'; +import remarkUnwrapCitationLinks from './plugins/remark/unwrap-citation-links.mjs'; +import 
remarkDirective from 'remark-directive'; +import remarkOutputContainer from './plugins/remark/output-container.mjs'; +import rehypeRestoreAtInCode from './plugins/rehype/restore-at-in-code.mjs'; +import rehypeWrapTables from './plugins/rehype/wrap-tables.mjs'; +import rehypeWrapOutput from './plugins/rehype/wrap-outputs.mjs'; +// Built-in Shiki (dual themes) — no rehype-pretty-code + +// Plugins moved to app/plugins/* + +export default defineConfig({ + output: 'static', + integrations: [ + mermaid({ theme: 'neutral', autoTheme: true }), + mdx(), + svelte(), + // Precompress output with Gzip only (Brotli disabled due to server module mismatch) + compressor({ brotli: false, gzip: true }) + ], + devToolbar: { + enabled: false + }, + markdown: { + shikiConfig: { + themes: { + light: 'github-light', + dark: 'github-dark' + }, + defaultColor: false, + wrap: false, + langAlias: { + // Map MDX fences to TSX for better JSX tokenization + mdx: 'tsx' + } + }, + remarkPlugins: [ + remarkUnwrapCitationLinks, + remarkIgnoreCitationsInCode, + remarkMath, + [remarkFootnotes, { inlineNotes: true }], + remarkDirective, + remarkOutputContainer + ], + rehypePlugins: [ + rehypeSlug, + [rehypeAutolinkHeadings, { behavior: 'wrap' }], + [rehypeKatex, { + trust: true, + }], + [rehypeCitation, { + bibliography: 'src/content/bibliography.bib', + linkCitations: true, + csl: "apa", + noCite: false, + suppressBibliography: false, + }], + rehypeReferencesAndFootnotes, + rehypeRestoreAtInCode, + rehypeCodeCopy, + rehypeWrapOutput, + rehypeWrapTables + ] + } +}); + + diff --git a/app/package-lock.json b/app/package-lock.json new file mode 100644 index 0000000000000000000000000000000000000000..53ec7ac7a928402ea43e9b4e308dd16483ac53c2 Binary files /dev/null and b/app/package-lock.json differ diff --git a/app/package.json b/app/package.json new file mode 100644 index 0000000000000000000000000000000000000000..473e15216e9e41f2bd6881b6cc5f2470acee71fe Binary files /dev/null and b/app/package.json differ 
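All of the custom modules registered in `markdown.remarkPlugins` and `rehypePlugins` above (such as `rehypeCodeCopy` and `rehypeWrapTables`) share the standard unified plugin contract: a factory function that returns a transformer, which receives the document tree and mutates it in place. A minimal self-contained sketch of that contract — the plugin name and fixture tree here are illustrative, not code from this repo:

```javascript
// Minimal sketch of the unified/rehype plugin contract: a factory
// returns a transformer that walks and mutates an HAST-shaped tree.
// Node shape mirrors HAST: { type, tagName, properties, children }.
function rehypeMarkExternalLinks() {
  return (tree) => {
    const visit = (node) => {
      if (!node || typeof node !== 'object') return;
      if (node.type === 'element' && node.tagName === 'a') {
        const href = String(node.properties?.href || '');
        // Only external links get the extra attribute
        if (/^https?:\/\//.test(href)) {
          node.properties = { ...node.properties, rel: 'noopener' };
        }
      }
      (node.children || []).forEach(visit);
    };
    visit(tree);
  };
}

// Tiny HAST-like fixture to exercise the transformer directly
const tree = {
  type: 'root',
  children: [
    { type: 'element', tagName: 'a', properties: { href: 'https://example.com' }, children: [] },
    { type: 'element', tagName: 'a', properties: { href: '#local' }, children: [] },
  ],
};
rehypeMarkExternalLinks()(tree);
console.log(tree.children[0].properties.rel); // 'noopener'
console.log(tree.children[1].properties.rel); // undefined
```

The real plugins in `app/plugins/` follow the same shape, but receive the full HAST produced by Astro's markdown pipeline rather than a hand-built fixture.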
diff --git a/app/plugins/rehype/code-copy.mjs b/app/plugins/rehype/code-copy.mjs
new file mode 100644
index 0000000000000000000000000000000000000000..29b135ee039c2af2f468bc836874f55a0a78ca17
--- /dev/null
+++ b/app/plugins/rehype/code-copy.mjs
@@ -0,0 +1,94 @@
+// Minimal rehype plugin to wrap code blocks with a copy button
+// Exported as a standalone module to keep astro.config.mjs lean
+export default function rehypeCodeCopy() {
+  return (tree) => {
+    // Walk the tree; lightweight visitor to find <pre> blocks with a <code> child

+    const visit = (node, parent) => {
+      if (!node || typeof node !== 'object') return;
+      const children = Array.isArray(node.children) ? node.children : [];
+      if (node.tagName === 'pre' && children.some(c => c.tagName === 'code')) {
+        // Find code child
+        const code = children.find(c => c.tagName === 'code');
+        // Determine if single-line block: prefer Shiki lines, then text content
+        const countLinesFromShiki = () => {
+          const isLineEl = (el) => el && el.type === 'element' && el.tagName === 'span' && Array.isArray(el.properties?.className) && el.properties.className.includes('line');
+          const hasNonWhitespaceText = (node) => {
+            if (!node) return false;
+            if (node.type === 'text') return /\S/.test(String(node.value || ''));
+            const kids = Array.isArray(node.children) ? node.children : [];
+            return kids.some(hasNonWhitespaceText);
+          };
+          const collectLines = (node, acc) => {
+            if (!node || typeof node !== 'object') return;
+            if (isLineEl(node)) acc.push(node);
+            const kids = Array.isArray(node.children) ? node.children : [];
+            kids.forEach((k) => collectLines(k, acc));
+          };
+          const lines = [];
+          collectLines(code, lines);
+          const nonEmpty = lines.filter((ln) => hasNonWhitespaceText(ln)).length;
+          return nonEmpty || 0;
+        };
+        const countLinesFromText = () => {
+          // Parse raw text content of the <code> node including nested spans
+          const extractText = (node) => {
+            if (!node) return '';
+            if (node.type === 'text') return String(node.value || '');
+            const kids = Array.isArray(node.children) ? node.children : [];
+            return kids.map(extractText).join('');
+          };
+          const raw = extractText(code);
+          if (!raw || !/\S/.test(raw)) return 0;
+          return raw.split('\n').filter(line => /\S/.test(line)).length;
+        };
+        const lines = countLinesFromShiki() || countLinesFromText();
+        const isSingleLine = lines <= 1;
+        // Also treat code blocks shorter than a threshold as single-line (defensive)
+        if (!isSingleLine) {
+          const approxChars = (() => {
+            const extract = (n) => Array.isArray(n?.children) ? n.children.map(extract).join('') : (n?.type === 'text' ? String(n.value||'') : '');
+            return extract(code).length;
+          })();
+          if (approxChars < 6) {
+            node.__forceSingle = true;
+          }
+        }
+        // Replace <pre> with wrapper div.code-card containing button + pre
+        const wrapper = {
+          type: 'element',
+          tagName: 'div',
+          properties: { className: ['code-card'].concat((isSingleLine || node.__forceSingle) ? ['no-copy'] : []) },
+          children: (isSingleLine || node.__forceSingle) ? [ node ] : [
+            {
+              type: 'element',
+              tagName: 'button',
+              properties: { className: ['code-copy', 'button--ghost'], type: 'button', 'aria-label': 'Copy code' },
+              children: [
+                {
+                  type: 'element',
+                  tagName: 'svg',
+                  properties: { viewBox: '0 0 24 24', 'aria-hidden': 'true', focusable: 'false' },
+                  children: [
+                    { type: 'element', tagName: 'path', properties: { d: 'M16 1H4c-1.1 0-2 .9-2 2v12h2V3h12V1zm3 4H8c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h11c1.1 0 2-.9 2-2V7c0-1.1-.9-2-2-2zm0 16H8V7h11v14z' }, children: [] }
+                  ]
+                }
+              ]
+            },
+            node
+          ]
+        };
+        if (parent && Array.isArray(parent.children)) {
+          const idx = parent.children.indexOf(node);
+          if (idx !== -1) parent.children[idx] = wrapper;
+        }
+        return; // don't visit nested
+      }
+      children.forEach((c) => visit(c, node));
+    };
+    visit(tree, null);
+  };
+}
+
+
+
+
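The text-based fallback above (`countLinesFromText`) reduces to two small pure functions: flatten every text descendant of the `<code>` node, then count the lines that contain non-whitespace. Extracted as a standalone sketch — the fixture node below is hypothetical, shaped like typical Shiki output:

```javascript
// Standalone version of the text-based line-count fallback used by
// rehype-code-copy: flatten all text descendants, count non-blank lines.
const extractText = (node) => {
  if (!node) return '';
  if (node.type === 'text') return String(node.value || '');
  const kids = Array.isArray(node.children) ? node.children : [];
  return kids.map(extractText).join('');
};

const countNonEmptyLines = (codeNode) => {
  const raw = extractText(codeNode);
  if (!raw || !/\S/.test(raw)) return 0;
  return raw.split('\n').filter((line) => /\S/.test(line)).length;
};

// A <code> node roughly as Shiki might emit it: spans wrapping each line
const code = {
  type: 'element',
  tagName: 'code',
  children: [
    { type: 'element', tagName: 'span', children: [{ type: 'text', value: 'const a = 1;\n' }] },
    { type: 'element', tagName: 'span', children: [{ type: 'text', value: '\n' }] }, // blank line
    { type: 'element', tagName: 'span', children: [{ type: 'text', value: 'console.log(a);' }] },
  ],
};
console.log(countNonEmptyLines(code)); // 2
```

Only blocks that count as more than one non-empty line receive the copy button; single-line blocks are wrapped with the `no-copy` class instead.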
diff --git a/app/plugins/rehype/post-citation.mjs b/app/plugins/rehype/post-citation.mjs
new file mode 100644
index 0000000000000000000000000000000000000000..b91ed218aab6ab5f7244d8c74f25b49378e219b6
--- /dev/null
+++ b/app/plugins/rehype/post-citation.mjs
@@ -0,0 +1,493 @@
+// rehype plugin to post-process citations and footnotes at build-time
+// - Normalizes the bibliography into <ol> with <li> items
+// - Linkifies DOI/URL occurrences inside references
+// - Appends back-reference links (↩ back: 1, 2, ...) from each reference to in-text citation anchors
+// - Cleans up footnotes block (.footnotes)
+
+export default function rehypeReferencesAndFootnotes() {
+  return (tree) => {
+    const isElement = (n) => n && typeof n === 'object' && n.type === 'element';
+    const getChildren = (n) => (Array.isArray(n?.children) ? n.children : []);
+
+    const walk = (node, parent, fn) => {
+      if (!node || typeof node !== 'object') return;
+      fn && fn(node, parent);
+      const kids = getChildren(node);
+      for (const child of kids) walk(child, node, fn);
+    };
+
+    const ensureArray = (v) => (Array.isArray(v) ? v : v != null ? [v] : []);
+
+    const hasClass = (el, name) => {
+      const cn = ensureArray(el?.properties?.className).map(String);
+      return cn.includes(name);
+    };
+
+    const setAttr = (el, key, val) => {
+      el.properties = el.properties || {};
+      if (val == null) delete el.properties[key];
+      else el.properties[key] = val;
+    };
+
+    const getAttr = (el, key) => (el?.properties ?
el.properties[key] : undefined); + + // Shared helpers for backlinks + backrefs block + const collectBacklinksForIdSet = (idSet, anchorPrefix) => { + const idToBacklinks = new Map(); + const idToAnchorNodes = new Map(); + if (!idSet || idSet.size === 0) return { idToBacklinks, idToAnchorNodes }; + walk(tree, null, (node) => { + if (!isElement(node) || node.tagName !== 'a') return; + const href = String(getAttr(node, 'href') || ''); + if (!href.startsWith('#')) return; + const id = href.slice(1); + if (!idSet.has(id)) return; + // Ensure a stable id + let anchorId = String(getAttr(node, 'id') || ''); + if (!anchorId) { + const list = idToBacklinks.get(id) || []; + anchorId = `${anchorPrefix}-${id}-${list.length + 1}`; + setAttr(node, 'id', anchorId); + } + const list = idToBacklinks.get(id) || []; + list.push(anchorId); + idToBacklinks.set(id, list); + const nodes = idToAnchorNodes.get(id) || []; + nodes.push(node); + idToAnchorNodes.set(id, nodes); + }); + return { idToBacklinks, idToAnchorNodes }; + }; + + const createBackIcon = () => ({ + type: 'element', + tagName: 'svg', + properties: { + className: ['back-icon'], + width: 12, + height: 12, + viewBox: '0 0 24 24', + fill: 'none', + stroke: 'currentColor', + 'stroke-width': 2, + 'stroke-linecap': 'round', + 'stroke-linejoin': 'round', + 'aria-hidden': 'true', + focusable: 'false' + }, + children: [ + { type: 'element', tagName: 'line', properties: { x1: 12, y1: 19, x2: 12, y2: 5 }, children: [] }, + { type: 'element', tagName: 'polyline', properties: { points: '5 12 12 5 19 12' }, children: [] } + ] + }); + + const appendBackrefsBlock = (listElement, idToBacklinks, ariaLabel) => { + if (!listElement || !idToBacklinks || idToBacklinks.size === 0) return; + for (const li of getChildren(listElement)) { + if (!isElement(li) || li.tagName !== 'li') continue; + const id = String(getAttr(li, 'id') || ''); + if (!id) continue; + const keys = idToBacklinks.get(id); + if (!keys || !keys.length) continue; + // Remove 
pre-existing .backrefs in this li to avoid duplicates + li.children = getChildren(li).filter((n) => !(isElement(n) && n.tagName === 'small' && hasClass(n, 'backrefs'))); + const small = { + type: 'element', + tagName: 'small', + properties: { className: ['backrefs'] }, + children: [] + }; + if (keys.length === 1) { + // Single backlink: just the icon wrapped in the anchor + const a = { + type: 'element', + tagName: 'a', + properties: { href: `#${keys[0]}`, 'aria-label': ariaLabel }, + children: [createBackIcon()] + }; + small.children.push(a); + } else { + // Multiple backlinks: icon + label + numbered links + small.children.push(createBackIcon()); + small.children.push({ type: 'text', value: ' back: ' }); + keys.forEach((backId, idx) => { + small.children.push({ + type: 'element', + tagName: 'a', + properties: { href: `#${backId}`, 'aria-label': ariaLabel }, + children: [{ type: 'text', value: String(idx + 1) }] + }); + if (idx < keys.length - 1) small.children.push({ type: 'text', value: ', ' }); + }); + } + li.children.push(small); + } + }; + // Remove default back-reference anchors generated by remark-footnotes inside a footnote item + const getTextContent = (el) => { + if (!el) return ''; + const stack = [el]; + let out = ''; + while (stack.length) { + const cur = stack.pop(); + if (!cur) continue; + if (cur.type === 'text') out += String(cur.value || ''); + const kids = getChildren(cur); + for (let i = kids.length - 1; i >= 0; i--) stack.push(kids[i]); + } + return out; + }; + + // Check if an element is part of KaTeX structure + const isKaTeXElement = (el) => { + if (!isElement(el)) return false; + const className = ensureArray(getAttr(el, 'className') || []).map(String); + // Check for KaTeX classes + if (className.some(c => c.includes('katex') || c.includes('math'))) return true; + // Check parent chain for KaTeX + let current = el; + for (let depth = 0; depth < 10; depth++) { + // We need to walk up, but we don't have parent references in rehype AST + // 
So check by tagName and common KaTeX patterns + const tag = String(current.tagName || '').toLowerCase(); + if (tag === 'math' || className.some(c => c.includes('katex'))) return true; + break; // Can't walk up in AST, just check current element + } + return false; + }; + + const removeFootnoteBackrefAnchors = (el) => { + if (!isElement(el)) return; + // Never modify KaTeX elements or their contents + if (isKaTeXElement(el)) return; + + const kids = getChildren(el); + for (let i = kids.length - 1; i >= 0; i--) { + const child = kids[i]; + if (isElement(child)) { + // Never touch KaTeX elements + if (isKaTeXElement(child)) continue; + + if ( + child.tagName === 'a' && ( + getAttr(child, 'data-footnote-backref') != null || + hasClass(child, 'footnote-backref') || + String(getAttr(child, 'role') || '').toLowerCase() === 'doc-backlink' || + String(getAttr(child, 'aria-label') || '').toLowerCase().includes('back to content') || + String(getAttr(child, 'href') || '').startsWith('#fnref') || + // Fallback: text-based detection like "↩" or "↩2" + /^\s*↩\s*\d*\s*$/u.test(getTextContent(child)) + ) + ) { + // Remove the anchor + el.children.splice(i, 1); + continue; + } + // Recurse into element (but not if it's KaTeX) + removeFootnoteBackrefAnchors(child); + // If a wrapper like or became empty, remove it + // BUT only if it's not part of KaTeX + const becameKids = getChildren(child); + if ((child.tagName === 'sup' || child.tagName === 'span') && + (!becameKids || becameKids.length === 0) && + !isKaTeXElement(child)) { + el.children.splice(i, 1); + } + } + } + }; + + + const normDoiHref = (href) => { + if (!href) return href; + const DUP = /https?:\/\/(?:dx\.)?doi\.org\/(?:https?:\/\/(?:dx\.)?doi\.org\/)+/gi; + const ONE = /https?:\/\/(?:dx\.)?doi\.org\/(10\.[^\s<>"']+)/i; + href = String(href).replace(DUP, 'https://doi.org/'); + const m = href.match(ONE); + return m ? 
`https://doi.org/${m[1]}` : href; + }; + + const DOI_BARE = /\b10\.[0-9]{4,9}\/[\-._;()\/:A-Z0-9]+\b/gi; + const URL_GEN = /\bhttps?:\/\/[^\s<>()"']+/gi; + + const linkifyTextNode = (textNode) => { + const text = String(textNode.value || ''); + let last = 0; + const parts = []; + const pushText = (s) => { if (s) parts.push({ type: 'text', value: s }); }; + + const matches = []; + // Collect URL matches + let m; + URL_GEN.lastIndex = 0; + while ((m = URL_GEN.exec(text)) !== null) { + matches.push({ type: 'url', start: m.index, end: URL_GEN.lastIndex, raw: m[0] }); + } + // Collect DOI matches + DOI_BARE.lastIndex = 0; + while ((m = DOI_BARE.exec(text)) !== null) { + matches.push({ type: 'doi', start: m.index, end: DOI_BARE.lastIndex, raw: m[0] }); + } + matches.sort((a, b) => a.start - b.start); + + for (const match of matches) { + if (match.start < last) continue; // overlapping + pushText(text.slice(last, match.start)); + if (match.type === 'url') { + const href = normDoiHref(match.raw); + const doiOne = href.match(/https?:\/\/(?:dx\.)?doi\.org\/(10\.[^\s<>"']+)/i); + const a = { + type: 'element', + tagName: 'a', + properties: { href, target: '_blank', rel: 'noopener noreferrer' }, + children: [{ type: 'text', value: doiOne ? 
doiOne[1] : href }] + }; + parts.push(a); + } else { + const href = `https://doi.org/${match.raw}`; + const a = { + type: 'element', + tagName: 'a', + properties: { href, target: '_blank', rel: 'noopener noreferrer' }, + children: [{ type: 'text', value: match.raw }] + }; + parts.push(a); + } + last = match.end; + } + + pushText(text.slice(last)); + return parts; + }; + + const linkifyInElement = (el) => { + const kids = getChildren(el); + for (let i = 0; i < kids.length; i++) { + const child = kids[i]; + if (!child) continue; + if (child.type === 'text') { + const replacement = linkifyTextNode(child); + if (replacement.length === 1 && replacement[0].type === 'text') continue; + // Replace the single text node with multiple nodes + el.children.splice(i, 1, ...replacement); + i += replacement.length - 1; + } else if (isElement(child)) { + if (child.tagName === 'a') { + const href = normDoiHref(getAttr(child, 'href')); + setAttr(child, 'href', href); + const m = String(href || '').match(/https?:\/\/(?:dx\.)?doi\.org\/(10\.[^\s<>"']+)/i); + if (m && (!child.children || child.children.length === 0)) { + child.children = [{ type: 'text', value: m[1] }]; + } + continue; + } + linkifyInElement(child); + } + } + // Deduplicate adjacent identical anchors + for (let i = 1; i < el.children.length; i++) { + const prev = el.children[i - 1]; + const curr = el.children[i]; + if (isElement(prev) && isElement(curr) && prev.tagName === 'a' && curr.tagName === 'a') { + const key = `${getAttr(prev, 'href') || ''}|${(prev.children?.[0]?.value) || ''}`; + const key2 = `${getAttr(curr, 'href') || ''}|${(curr.children?.[0]?.value) || ''}`; + if (key === key2) { + el.children.splice(i, 1); + i--; + } + } + } + }; + + // Find references container and normalize its list + const findReferencesRoot = () => { + let found = null; + walk(tree, null, (node) => { + if (found) return; + if (!isElement(node)) return; + const id = getAttr(node, 'id'); + if (id === 'references' || hasClass(node, 
'references') || hasClass(node, 'bibliography')) { + found = node; + } + }); + return found; + }; + + const toOrderedList = (container) => { + // If there is already an
      , use it; otherwise convert common structures + let ol = getChildren(container).find((c) => isElement(c) && c.tagName === 'ol'); + if (!ol) { + ol = { type: 'element', tagName: 'ol', properties: { className: ['references'] }, children: [] }; + const candidates = getChildren(container).filter((n) => isElement(n)); + if (candidates.length) { + for (const node of candidates) { + if (hasClass(node, 'csl-entry') || node.tagName === 'li' || node.tagName === 'p' || node.tagName === 'div') { + const li = { type: 'element', tagName: 'li', properties: {}, children: getChildren(node) }; + if (getAttr(node, 'id')) setAttr(li, 'id', getAttr(node, 'id')); + ol.children.push(li); + } + } + } + // Replace container children by the new ol + container.children = [ol]; + } + if (!hasClass(ol, 'references')) { + const cls = ensureArray(ol.properties?.className).map(String); + if (!cls.includes('references')) cls.push('references'); + ol.properties = ol.properties || {}; + ol.properties.className = cls; + } + return ol; + }; + + const refsRoot = findReferencesRoot(); + let refsOl = null; + const refIdSet = new Set(); + const refIdToExternalHref = new Map(); + + if (refsRoot) { + refsOl = toOrderedList(refsRoot); + // Collect item ids and linkify their content + for (const li of getChildren(refsOl)) { + if (!isElement(li) || li.tagName !== 'li') continue; + if (!getAttr(li, 'id')) { + // Try to find a nested element with id to promote + const nestedWithId = getChildren(li).find((n) => isElement(n) && getAttr(n, 'id')); + if (nestedWithId) setAttr(li, 'id', getAttr(nestedWithId, 'id')); + } + const id = getAttr(li, 'id'); + if (id) refIdSet.add(String(id)); + linkifyInElement(li); + // Record first external link href (e.g., DOI/URL) if present + if (id) { + let externalHref = null; + const stack = [li]; + while (stack.length) { + const cur = stack.pop(); + const kids = getChildren(cur); + for (const k of kids) { + if (isElement(k) && k.tagName === 'a') { + const href = 
String(getAttr(k, 'href') || ''); + if (/^https?:\/\//i.test(href)) { + externalHref = href; + break; + } + } + if (isElement(k)) stack.push(k); + } + if (externalHref) break; + } + if (externalHref) refIdToExternalHref.set(String(id), externalHref); + } + } + setAttr(refsRoot, 'data-built-refs', '1'); + } + + // Collect in-text anchors that point to references ids + const { idToBacklinks: refIdToBacklinks, idToAnchorNodes: refIdToCitationAnchors } = collectBacklinksForIdSet(refIdSet, 'refctx'); + + // Append backlinks into references list items + appendBackrefsBlock(refsOl, refIdToBacklinks, 'Back to citation'); + + // Rewrite in-text citation anchors to external link when available + if (refIdToCitationAnchors.size > 0) { + for (const [id, anchors] of refIdToCitationAnchors.entries()) { + const ext = refIdToExternalHref.get(id); + if (!ext) continue; + for (const a of anchors) { + setAttr(a, 'data-ref-id', id); + setAttr(a, 'href', ext); + const existingTarget = getAttr(a, 'target'); + if (!existingTarget) setAttr(a, 'target', '_blank'); + const rel = String(getAttr(a, 'rel') || ''); + const relSet = new Set(rel ? rel.split(/\s+/) : []); + relSet.add('noopener'); + relSet.add('noreferrer'); + setAttr(a, 'rel', Array.from(relSet).join(' ')); + } + } + } + + // Deep clone a node and all its children (preserve KaTeX structure) + const deepCloneNode = (node) => { + if (!node || typeof node !== 'object') return node; + if (node.type === 'text') { + return { type: 'text', value: node.value }; + } + if (node.type === 'element') { + const cloned = { + type: 'element', + tagName: node.tagName, + properties: node.properties ? 
JSON.parse(JSON.stringify(node.properties)) : {}, + children: [] + }; + const kids = getChildren(node); + for (const child of kids) { + cloned.children.push(deepCloneNode(child)); + } + return cloned; + } + return node; + }; + + // Footnotes cleanup + backrefs harmonized with references + const cleanupFootnotes = () => { + let root = null; + walk(tree, null, (node) => { + if (!isElement(node)) return; + if (hasClass(node, 'footnotes')) root = node; + }); + if (!root) return { root: null, ol: null, idSet: new Set() }; + // Remove
<hr> direct children
+    root.children = getChildren(root).filter((n) => !(isElement(n) && n.tagName === 'hr'));
+    // Ensure an <ol>
        + let ol = getChildren(root).find((c) => isElement(c) && c.tagName === 'ol'); + if (!ol) { + ol = { type: 'element', tagName: 'ol', properties: {}, children: [] }; + const items = getChildren(root).filter((n) => isElement(n) && (n.tagName === 'li' || hasClass(n, 'footnote') || n.tagName === 'p' || n.tagName === 'div')); + if (items.length) { + for (const it of items) { + // Deep clone to preserve all properties including KaTeX structure + const clonedChildren = getChildren(it).map(deepCloneNode); + const li = { type: 'element', tagName: 'li', properties: {}, children: clonedChildren }; + // Promote nested id if present (e.g.,

        ) + const nestedWithId = getChildren(it).find((n) => isElement(n) && getAttr(n, 'id')); + if (nestedWithId) setAttr(li, 'id', getAttr(nestedWithId, 'id')); + ol.children.push(li); + } + } + root.children = [ol]; + } + // For existing structures, try to promote ids from children when missing + for (const li of getChildren(ol)) { + if (!isElement(li) || li.tagName !== 'li') continue; + if (!getAttr(li, 'id')) { + const nestedWithId = getChildren(li).find((n) => isElement(n) && getAttr(n, 'id')); + if (nestedWithId) setAttr(li, 'id', getAttr(nestedWithId, 'id')); + } + // Remove default footnote backrefs anywhere inside (to avoid duplication) + // But preserve KaTeX elements + removeFootnoteBackrefAnchors(li); + } + setAttr(root, 'data-built-footnotes', '1'); + // Collect id set + const idSet = new Set(); + for (const li of getChildren(ol)) { + if (!isElement(li) || li.tagName !== 'li') continue; + const id = getAttr(li, 'id'); + if (id) idSet.add(String(id)); + } + return { root, ol, idSet }; + }; + + const { root: footRoot, ol: footOl, idSet: footIdSet } = cleanupFootnotes(); + + // Collect in-text anchors pointing to footnotes + const { idToBacklinks: footIdToBacklinks } = collectBacklinksForIdSet(footIdSet, 'footctx'); + + // Append backlinks into footnote list items (identical pattern to references) + appendBackrefsBlock(footOl, footIdToBacklinks, 'Back to footnote call'); + }; +} + + diff --git a/app/plugins/rehype/restore-at-in-code.mjs b/app/plugins/rehype/restore-at-in-code.mjs new file mode 100644 index 0000000000000000000000000000000000000000..09db2b1fb8720cefeb7a7d94ea85ba4db47b1612 --- /dev/null +++ b/app/plugins/rehype/restore-at-in-code.mjs @@ -0,0 +1,22 @@ +// Rehype plugin to restore '@' inside code nodes after rehype-citation ran +export default function rehypeRestoreAtInCode() { + return (tree) => { + const restoreInNode = (node) => { + if (!node || typeof node !== 'object') return; + const isText = node.type === 'text'; + if (isText && 
typeof node.value === 'string' && node.value.includes('__AT_SENTINEL__')) { + node.value = node.value.replace(/__AT_SENTINEL__/g, '@'); + } + const isCodeEl = node.type === 'element' && node.tagName === 'code'; + const children = Array.isArray(node.children) ? node.children : []; + if (isCodeEl && children.length) { + children.forEach(restoreInNode); + return; + } + children.forEach(restoreInNode); + }; + restoreInNode(tree); + }; +} + + diff --git a/app/plugins/rehype/wrap-outputs.mjs b/app/plugins/rehype/wrap-outputs.mjs new file mode 100644 index 0000000000000000000000000000000000000000..307047febe085ffa78f2468978e588bc3749b148 --- /dev/null +++ b/app/plugins/rehype/wrap-outputs.mjs @@ -0,0 +1,38 @@ +// Wrap plain-text content inside

<section class="code-output"> into a <pre>
        +export default function rehypeWrapOutput() {
        +  return (tree) => {
        +    const isWhitespace = (value) => typeof value === 'string' && !/\S/.test(value);
        +    const extractText = (node) => {
        +      if (!node) return '';
        +      if (node.type === 'text') return String(node.value || '');
        +      const kids = Array.isArray(node.children) ? node.children : [];
        +      return kids.map(extractText).join('');
        +    };
        +    const visit = (node) => {
        +      if (!node || typeof node !== 'object') return;
        +      const children = Array.isArray(node.children) ? node.children : [];
        +      if (node.type === 'element' && node.tagName === 'section') {
        +        const className = node.properties?.className || [];
        +        const classes = Array.isArray(className) ? className : [className].filter(Boolean);
        +        if (classes.includes('code-output')) {
        +          const meaningful = children.filter((c) => !(c.type === 'text' && isWhitespace(c.value)));
        +          if (meaningful.length === 1) {
        +            const only = meaningful[0];
        +            const isPlainParagraph = only.type === 'element' && only.tagName === 'p' && (only.children || []).every((c) => c.type === 'text');
        +            const isPlainText = only.type === 'text';
        +            if (isPlainParagraph || isPlainText) {
        +              const text = isPlainText ? String(only.value || '') : extractText(only);
        +              node.children = [
        +                { type: 'element', tagName: 'pre', properties: {}, children: [ { type: 'text', value: text } ] }
        +              ];
        +            }
        +          }
        +        }
        +      }
        +      children.forEach(visit);
        +    };
        +    visit(tree);
        +  };
        +}
        +
        +
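For illustration, here is a minimal sketch of the rewrite `rehypeWrapOutput` performs, applied by hand to a small hypothetical HAST fragment (this re-implements the plugin's core rule inline rather than importing the plugin): a `section.code-output` whose only meaningful child is a plain paragraph ends up containing a single `<pre>` text node.

```javascript
// Hypothetical HAST fragment mirroring what the plugin receives.
const section = {
  type: 'element',
  tagName: 'section',
  properties: { className: ['code-output'] },
  children: [
    { type: 'text', value: '\n  ' }, // whitespace-only text is ignored
    {
      type: 'element', tagName: 'p', properties: {},
      children: [{ type: 'text', value: 'hello world' }]
    }
  ]
};

// Same core rule as the plugin: keep only non-whitespace children, and when
// the single survivor is a plain paragraph, replace it with a <pre> text node.
const meaningful = section.children.filter(
  (c) => !(c.type === 'text' && !/\S/.test(c.value))
);
if (meaningful.length === 1 && meaningful[0].tagName === 'p') {
  const text = meaningful[0].children.map((c) => c.value).join('');
  section.children = [
    { type: 'element', tagName: 'pre', properties: {}, children: [{ type: 'text', value: text }] }
  ];
}

console.log(section.children[0].tagName); // 'pre'
console.log(section.children[0].children[0].value); // 'hello world'
```

A section with multiple meaningful children, or with non-text markup inside the paragraph, is left untouched by this rule.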
        diff --git a/app/plugins/rehype/wrap-tables.mjs b/app/plugins/rehype/wrap-tables.mjs
        new file mode 100644
        index 0000000000000000000000000000000000000000..fc7944cb737ba8cfd2cbed28b66e2527c0234f89
        --- /dev/null
        +++ b/app/plugins/rehype/wrap-tables.mjs
        @@ -0,0 +1,43 @@
+// rehype plugin: wrap bare <table> elements in a <div class="table-scroll">
        container +// so that tables stay width:100% while enabling horizontal scroll when content overflows + +export default function rehypeWrapTables() { + return (tree) => { + const isElement = (n) => n && typeof n === 'object' && n.type === 'element'; + const getChildren = (n) => (Array.isArray(n?.children) ? n.children : []); + + const walk = (node, parent, fn) => { + if (!node || typeof node !== 'object') return; + fn && fn(node, parent); + const kids = getChildren(node); + for (const child of kids) walk(child, node, fn); + }; + + const ensureArray = (v) => (Array.isArray(v) ? v : v != null ? [v] : []); + const hasClass = (el, name) => ensureArray(el?.properties?.className).map(String).includes(name); + + const wrapTable = (tableNode, parent) => { + if (!parent || !Array.isArray(parent.children)) return; + // Don't double-wrap if already inside .table-scroll + if (parent.tagName === 'div' && hasClass(parent, 'table-scroll')) return; + + const wrapper = { + type: 'element', + tagName: 'div', + properties: { className: ['table-scroll'] }, + children: [tableNode] + }; + + const idx = parent.children.indexOf(tableNode); + if (idx >= 0) parent.children.splice(idx, 1, wrapper); + }; + + walk(tree, null, (node, parent) => { + if (!isElement(node)) return; + if (node.tagName !== 'table') return; + wrapTable(node, parent); + }); + }; +} + + diff --git a/app/plugins/remark/ignore-citations-in-code.mjs b/app/plugins/remark/ignore-citations-in-code.mjs new file mode 100644 index 0000000000000000000000000000000000000000..b5c3e279088bcbd325bdb2d031de77ed48fa5591 --- /dev/null +++ b/app/plugins/remark/ignore-citations-in-code.mjs @@ -0,0 +1,21 @@ +// Remark plugin to ignore citations inside code (block and inline) +export default function remarkIgnoreCitationsInCode() { + return (tree) => { + const visit = (node) => { + if (!node || typeof node !== 'object') return; + const type = node.type; + if (type === 'code' || type === 'inlineCode') { + if (typeof node.value === 
'string' && node.value.includes('@')) { + // Use a sentinel to avoid rehype-citation, will be restored later in rehype + node.value = node.value.replace(/@/g, '__AT_SENTINEL__'); + } + return; // do not traverse into code + } + const children = Array.isArray(node.children) ? node.children : []; + children.forEach(visit); + }; + visit(tree); + }; +} + + diff --git a/app/plugins/remark/output-container.mjs b/app/plugins/remark/output-container.mjs new file mode 100644 index 0000000000000000000000000000000000000000..bb25220416a44e22007345265acb8d2eb803e93b --- /dev/null +++ b/app/plugins/remark/output-container.mjs @@ -0,0 +1,23 @@ +// Transform `:::output ... :::` into a
<section class="code-output"> wrapper
+// Requires remark-directive to be applied before this plugin
+
+export default function remarkOutputContainer() {
+  return (tree) => {
+    const visit = (node) => {
+      if (!node || typeof node !== 'object') return;
+
+      if (node.type === 'containerDirective' && node.name === 'output') {
+        node.data = node.data || {};
+        node.data.hName = 'section';
+        node.data.hProperties = { className: ['code-output'] };
+      }
+
+      const children = Array.isArray(node.children) ? node.children : [];
+      for (const child of children) visit(child);
+    };
+
+    visit(tree);
+  };
+}
+
+
diff --git a/app/plugins/remark/outputs-container.mjs b/app/plugins/remark/outputs-container.mjs
new file mode 100644
index 0000000000000000000000000000000000000000..5602aca8e635e00de98f49704be7e51e4f3e87b0
--- /dev/null
+++ b/app/plugins/remark/outputs-container.mjs
@@ -0,0 +1,23 @@
+// Transform `:::outputs ... :::` into a
<section class="code-outputs"> wrapper
+// Requires remark-directive to be applied before this plugin
+
+export default function remarkOutputsContainer() {
+  return (tree) => {
+    const visit = (node) => {
+      if (!node || typeof node !== 'object') return;
+
+      if (node.type === 'containerDirective' && node.name === 'outputs') {
+        node.data = node.data || {};
+        node.data.hName = 'section';
+        node.data.hProperties = { className: ['code-outputs'] };
+      }
+
+      const children = Array.isArray(node.children) ? node.children : [];
+      for (const child of children) visit(child);
+    };
+
+    visit(tree);
+  };
+}
+
+
diff --git a/app/plugins/remark/unwrap-citation-links.mjs b/app/plugins/remark/unwrap-citation-links.mjs
new file mode 100644
index 0000000000000000000000000000000000000000..89afd8d9b63d311aa6642a231741e8b219a6a962
--- /dev/null
+++ b/app/plugins/remark/unwrap-citation-links.mjs
@@ -0,0 +1,49 @@
+// Remark plugin that turns markdown links containing citations into plain citations
+// Transforms [@reference](url) into [@reference]
+export default function remarkUnwrapCitationLinks() {
+  return (tree) => {
+    // Helper to extract the text content of a node
+    const getTextContent = (node) => {
+      if (!node) return '';
+      if (node.type === 'text') return node.value || '';
+      if (Array.isArray(node.children)) {
+        return node.children.map(getTextContent).join('');
+      }
+      return '';
+    };
+
+    const visit = (node, parent) => {
+      if (!node || typeof node !== 'object') return;
+
+      // Visit children first (post-order traversal)
+      const children = Array.isArray(node.children) ? node.children : [];
+      for (let i = 0; i < children.length; i++) {
+        const child = children[i];
+        visit(child, node);
+      }
+
+      // If this is a 'link' node, inspect its content
+      if (node.type === 'link' && parent && Array.isArray(parent.children)) {
+        // Extract the link's text content
+        const textContent = getTextContent(node);
+
+        // Check whether it is a citation (starts with @)
+        if (textContent && /^@\w+/.test(textContent.trim())) {
+          // Find the node's index in its parent
+          const index = parent.children.indexOf(node);
+
+          if (index !== -1) {
+            // Replace the link node with a plain text node
+            parent.children[index] = {
+              type: 'text',
+              value: textContent.trim()
+            };
+          }
+        }
+      }
+    };
+
+    visit(tree, null);
+  };
+}
+
diff --git a/app/postcss.config.mjs b/app/postcss.config.mjs
new file mode 100644
index 0000000000000000000000000000000000000000..65fe6e9fd4437c66b3b2e303bd091a66cff025e5
--- /dev/null
+++ b/app/postcss.config.mjs
@@ -0,0 +1,14 @@
+// PostCSS config enabling Custom Media Queries
+// Allows usage of: @media (--bp-content-collapse) { ... 
} + +import postcssCustomMedia from 'postcss-custom-media'; +import postcssPresetEnv from 'postcss-preset-env'; + +export default { + plugins: [ + postcssCustomMedia(), + postcssPresetEnv({ + stage: 0 + }) + ] +}; diff --git a/app/public/data b/app/public/data new file mode 120000 index 0000000000000000000000000000000000000000..7af5c0541877d3e5fd06c4a0bf6f8ffa18d2739a --- /dev/null +++ b/app/public/data @@ -0,0 +1 @@ +../src/content/assets/data \ No newline at end of file diff --git a/app/public/hf-space-parent-listener.js b/app/public/hf-space-parent-listener.js new file mode 100644 index 0000000000000000000000000000000000000000..d114abdeef1c38e61884fd16d09e9c757f454461 --- /dev/null +++ b/app/public/hf-space-parent-listener.js @@ -0,0 +1,55 @@ +/** + * Script for Hugging Face Spaces parent window + * This script listens to iframe messages and updates the parent window URL + * + * Usage instructions: + * 1. Add this script to your Hugging Face Space in app.py or in a Gradio component + * 2. 
Or use it in an HTML page that contains your iframe + */ + +(function () { + 'use strict'; + + // Listen to iframe messages + window.addEventListener('message', function (event) { + + // Check message type + if (event.data && event.data.type) { + switch (event.data.type) { + case 'urlChange': + case 'anchorChange': + case 'HF_SPACE_URL_UPDATE': + handleUrlChange(event.data); + break; + default: + // Unknown message type, ignore + } + } + }); + + function handleUrlChange(data) { + try { + const hash = data.hash || data.anchorId; + + if (hash) { + // Update URL with new anchor + const newUrl = new URL(window.location); + newUrl.hash = hash; + + // Use replaceState to avoid adding an entry to history + window.history.replaceState(null, '', newUrl.toString()); + } + } catch (error) { + // Silent error when updating URL + } + } + + // Utility function to test communication + window.testIframeCommunication = function () { + const iframe = document.querySelector('iframe'); + if (iframe) { + iframe.contentWindow.postMessage({ type: 'test' }, '*'); + } + }; + +})(); diff --git a/app/public/scripts/color-palettes.js b/app/public/scripts/color-palettes.js new file mode 100644 index 0000000000000000000000000000000000000000..370b1f464142e0d9280855b18f8f636db810ea6e --- /dev/null +++ b/app/public/scripts/color-palettes.js @@ -0,0 +1,274 @@ +// Global color palettes generator and watcher +// - Observes CSS variable --primary-color and theme changes +// - Generates categorical, sequential, and diverging palettes (OKLCH/OKLab) +// - Exposes results as CSS variables on :root +// - Supports variable color counts per palette via CSS vars +// - Dispatches a 'palettes:updated' CustomEvent after each update + +(() => { + const MODE = { cssRoot: document.documentElement }; + + const getCssVar = (name) => { + try { return getComputedStyle(MODE.cssRoot).getPropertyValue(name).trim(); } catch { return ''; } + }; + const getIntFromCssVar = (name, fallback) => { + const raw = getCssVar(name); 
+ if (!raw) return fallback; + const v = parseInt(String(raw), 10); + if (Number.isNaN(v)) return fallback; + return v; + }; + const clamp = (n, min, max) => Math.max(min, Math.min(max, n)); + + // Color math (OKLab/OKLCH) + const srgbToLinear = (u) => (u <= 0.04045 ? u / 12.92 : Math.pow((u + 0.055) / 1.055, 2.4)); + const linearToSrgb = (u) => (u <= 0.0031308 ? 12.92 * u : 1.055 * Math.pow(Math.max(0, u), 1 / 2.4) - 0.055); + const rgbToOklab = (r, g, b) => { + const rl = srgbToLinear(r), gl = srgbToLinear(g), bl = srgbToLinear(b); + const l = Math.cbrt(0.4122214708 * rl + 0.5363325363 * gl + 0.0514459929 * bl); + const m = Math.cbrt(0.2119034982 * rl + 0.6806995451 * gl + 0.1073969566 * bl); + const s = Math.cbrt(0.0883024619 * rl + 0.2817188376 * gl + 0.6299787005 * bl); + const L = 0.2104542553 * l + 0.7936177850 * m - 0.0040720468 * s; + const a = 1.9779984951 * l - 2.4285922050 * m + 0.4505937099 * s; + const b2 = 0.0259040371 * l + 0.7827717662 * m - 0.8086757660 * s; + return { L, a, b: b2 }; + }; + const oklabToRgb = (L, a, b) => { + const l_ = L + 0.3963377774 * a + 0.2158037573 * b; + const m_ = L - 0.1055613458 * a - 0.0638541728 * b; + const s_ = L - 0.0894841775 * a - 1.2914855480 * b; + const l = l_ * l_ * l_; + const m = m_ * m_ * m_; + const s = s_ * s_ * s_; + const r = linearToSrgb(+4.0767416621 * l - 3.3077115913 * m + 0.2309699292 * s); + const g = linearToSrgb(-1.2684380046 * l + 2.6097574011 * m - 0.3413193965 * s); + const b3 = linearToSrgb(-0.0041960863 * l - 0.7034186147 * m + 1.7076147010 * s); + return { r, g, b: b3 }; + }; + const oklchToOklab = (L, C, hDeg) => { const h = (hDeg * Math.PI) / 180; return { L, a: C * Math.cos(h), b: C * Math.sin(h) }; }; + const oklabToOklch = (L, a, b) => { const C = Math.sqrt(a * a + b * b); let h = Math.atan2(b, a) * 180 / Math.PI; if (h < 0) h += 360; return { L, C, h }; }; + const clamp01 = (x) => Math.min(1, Math.max(0, x)); + const isInGamut = ({ r, g, b }) => r >= 0 && r <= 1 && g >= 0 && g <= 1 
&& b >= 0 && b <= 1; + const toHex = ({ r, g, b }) => { + const R = Math.round(clamp01(r) * 255), G = Math.round(clamp01(g) * 255), B = Math.round(clamp01(b) * 255); + const h = (n) => n.toString(16).padStart(2, '0'); + return `#${h(R)}${h(G)}${h(B)}`.toUpperCase(); + }; + const oklchToHexSafe = (L, C, h) => { let c = C; for (let i = 0; i < 12; i++) { const { a, b } = oklchToOklab(L, c, h); const rgb = oklabToRgb(L, a, b); if (isInGamut(rgb)) return toHex(rgb); c = Math.max(0, c - 0.02); } return toHex(oklabToRgb(L, 0, 0)); }; + const parseCssColorToRgb = (css) => { try { const el = document.createElement('span'); el.style.color = css; document.body.appendChild(el); const cs = getComputedStyle(el).color; document.body.removeChild(el); const m = cs.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/i); if (!m) return null; return { r: Number(m[1]) / 255, g: Number(m[2]) / 255, b: Number(m[3]) / 255 }; } catch { return null; } }; + + // Get primary color in OKLCH format to preserve precision + const getPrimaryOKLCH = () => { + const css = getCssVar('--primary-color'); + if (!css) return null; + + // For OKLCH colors, return the exact values without conversion + if (css.includes('oklch')) { + const oklchMatch = css.match(/oklch\(([^)]+)\)/); + if (oklchMatch) { + const values = oklchMatch[1].split(/\s+/).map(v => parseFloat(v.trim())); + if (values.length >= 3) { + const [L, C, h] = values; + return { L, C, h }; + } + } + } + + // For non-OKLCH colors, convert to OKLCH for consistency + const rgb = parseCssColorToRgb(css); + if (rgb) { + const { L, a, b } = rgbToOklab(rgb.r, rgb.g, rgb.b); + const { C, h } = oklabToOklch(L, a, b); + return { L, C, h }; + } + return null; + }; + + // Keep getPrimaryHex for backward compatibility, but now it converts from OKLCH + const getPrimaryHex = () => { + const oklch = getPrimaryOKLCH(); + if (!oklch) return null; + + const { a, b } = oklchToOklab(oklch.L, oklch.C, oklch.h); + const rgb = oklabToRgb(oklch.L, a, b); + return toHex(rgb); + }; + 
// No count management via CSS anymore; counts are passed directly to the API + + const generators = { + categorical: (baseOKLCH, count) => { + const { L, C, h } = baseOKLCH; + const L0 = Math.min(0.85, Math.max(0.4, L)); + const C0 = Math.min(0.35, Math.max(0.1, C || 0.2)); + const total = Math.max(1, Math.min(12, count || 8)); + const hueStep = 360 / total; + const results = []; + for (let i = 0; i < total; i++) { + const hDeg = (h + i * hueStep) % 360; + const lVar = ((i % 3) - 1) * 0.04; + results.push(oklchToHexSafe(Math.max(0.4, Math.min(0.85, L0 + lVar)), C0, hDeg)); + } + return results; + }, + sequential: (baseOKLCH, count) => { + const { L, C, h } = baseOKLCH; + const total = Math.max(1, Math.min(12, count || 8)); + const startL = Math.max(0.25, L - 0.18); + const endL = Math.min(0.92, L + 0.18); + const cBase = Math.min(0.33, Math.max(0.08, C * 0.9 + 0.06)); + const out = []; + for (let i = 0; i < total; i++) { + const t = total === 1 ? 0 : i / (total - 1); + const lNow = startL * (1 - t) + endL * t; + const cNow = cBase * (0.85 + 0.15 * (1 - Math.abs(0.5 - t) * 2)); + out.push(oklchToHexSafe(lNow, cNow, h)); + } + return out; + }, + diverging: (baseOKLCH, count) => { + const { L, C, h } = baseOKLCH; + const total = Math.max(1, Math.min(12, count || 8)); + + // Left endpoint: EXACT primary color (no darkening) + const leftLab = oklchToOklab(L, C, h); + // Right endpoint: complement with same L and similar C (clamped safe) + const compH = (h + 180) % 360; + const cSafe = Math.min(0.35, Math.max(0.08, C)); + const rightLab = oklchToOklab(L, cSafe, compH); + const whiteLab = { L: 0.98, a: 0, b: 0 }; // center near‑white + + const hexFromOKLab = (L, a, b) => toHex(oklabToRgb(L, a, b)); + const lerp = (a, b, t) => a + (b - a) * t; + const lerpOKLabHex = (A, B, t) => hexFromOKLab(lerp(A.L, B.L, t), lerp(A.a, B.a, t), lerp(A.b, B.b, t)); + + const out = []; + if (total % 2 === 1) { + const nSide = (total - 1) >> 1; // items on each side + // Left side: include 
left endpoint exactly at index 0 + for (let i = 0; i < nSide; i++) { + const t = nSide <= 1 ? 0 : (i / (nSide - 1)); // 0 .. 1 + // Move from leftLab to a value close (but not equal) to white; ensure last before center is lighter + const tt = t * 0.9; // keep some distance from pure white before center + out.push(lerpOKLabHex(leftLab, whiteLab, tt)); + } + // Center + out.push(hexFromOKLab(whiteLab.L, whiteLab.a, whiteLab.b)); + // Right side: start near white and end EXACTLY at rightLab + for (let i = 0; i < nSide; i++) { + const t = nSide <= 1 ? 1 : ((i + 1) / nSide); // (1/n)..1 + const tt = Math.max(0.1, t); // avoid starting at pure white + out.push(lerpOKLabHex(whiteLab, rightLab, tt)); + } + // Ensure first and last are exact endpoints + if (out.length) { out[0] = hexFromOKLab(leftLab.L, leftLab.a, leftLab.b); out[out.length - 1] = hexFromOKLab(rightLab.L, rightLab.a, rightLab.b); } + } else { + const nSide = total >> 1; + // Left half including left endpoint, approaching white but not reaching it + for (let i = 0; i < nSide; i++) { + const t = nSide <= 1 ? 0 : (i / (nSide - 1)); // 0 .. 1 + const tt = t * 0.9; + out.push(lerpOKLabHex(leftLab, whiteLab, tt)); + } + // Right half: mirror from near white to exact right endpoint + for (let i = 0; i < nSide; i++) { + const t = nSide <= 1 ? 
1 : ((i + 1) / nSide); // (1/n)..1 + const tt = Math.max(0.1, t); + out.push(lerpOKLabHex(whiteLab, rightLab, tt)); + } + if (out.length) { out[0] = hexFromOKLab(leftLab.L, leftLab.a, leftLab.b); out[out.length - 1] = hexFromOKLab(rightLab.L, rightLab.a, rightLab.b); } + } + return out; + } + }; + + let lastSignature = ''; + + const updatePalettes = () => { + const primaryOKLCH = getPrimaryOKLCH(); + const primaryHex = getPrimaryHex(); + const signature = `${primaryOKLCH?.L},${primaryOKLCH?.C},${primaryOKLCH?.h}`; + if (signature === lastSignature) return; + lastSignature = signature; + try { document.dispatchEvent(new CustomEvent('palettes:updated', { detail: { primary: primaryHex, primaryOKLCH } })); } catch { } + }; + + const bootstrap = () => { + // Initial setup - only run once on page load + updatePalettes(); + + // Observer will handle all subsequent changes + const mo = new MutationObserver(() => updatePalettes()); + mo.observe(MODE.cssRoot, { attributes: true, attributeFilter: ['style', 'data-theme'] }); + + // Utility: choose high-contrast (or softened) text style against an arbitrary background color + const pickTextStyleForBackground = (bgCss, opts = {}) => { + const cssRoot = document.documentElement; + const getCssVar = (name) => { + try { return getComputedStyle(cssRoot).getPropertyValue(name).trim(); } catch { return ''; } + }; + const resolveCssToRgb01 = (css) => { + const rgb = parseCssColorToRgb(css); + if (!rgb) return null; + return rgb; // already 0..1 + }; + const mixRgb01 = (a, b, t) => ({ r: a.r * (1 - t) + b.r * t, g: a.g * (1 - t) + b.g * t, b: a.b * (1 - t) + b.b * t }); + const relLum = (rgb) => { + const f = (u) => srgbToLinear(u); + return 0.2126 * f(rgb.r) + 0.7152 * f(rgb.g) + 0.0722 * f(rgb.b); + }; + const contrast = (fg, bg) => { + const L1 = relLum(fg), L2 = relLum(bg); const a = Math.max(L1, L2), b = Math.min(L1, L2); + return (a + 0.05) / (b + 0.05); + }; + try { + const bg = resolveCssToRgb01(bgCss); + if (!bg) return { fill: 
getCssVar('--text-color') || '#000', stroke: 'var(--transparent-page-contrast)', strokeWidth: 1 }; + const candidatesCss = [getCssVar('--text-color') || '#111', getCssVar('--on-primary') || '#0f1115', '#000', '#fff']; + const candidates = candidatesCss + .map(css => ({ css, rgb: resolveCssToRgb01(css) })) + .filter(x => !!x.rgb); + // Pick the max contrast + let best = candidates[0]; let bestCR = contrast(best.rgb, bg); + for (let i = 1; i < candidates.length; i++) { + const cr = contrast(candidates[i].rgb, bg); + if (cr > bestCR) { best = candidates[i]; bestCR = cr; } + } + // Optional softening via blend factor (0..1), blending towards muted color + const blend = Math.min(1, Math.max(0, Number(opts.blend || 0))); + let finalRgb = best.rgb; + if (blend > 0) { + const mutedCss = getCssVar('--muted-color') || (getCssVar('--text-color') || '#111'); + const mutedRgb = resolveCssToRgb01(mutedCss) || best.rgb; + finalRgb = mixRgb01(best.rgb, mutedRgb, blend); + } + const haloStrength = Math.min(1, Math.max(0, Number(opts.haloStrength == null ? 0.5 : opts.haloStrength))); + const stroke = (best.css === '#000' || best.css.toLowerCase() === 'black') ? `rgba(255,255,255,${0.30 + 0.40 * haloStrength})` : `rgba(0,0,0,${0.30 + 0.30 * haloStrength})`; + return { fill: toHex(finalRgb), stroke, strokeWidth: (opts.haloWidth == null ? 
1 : Number(opts.haloWidth)) };
+      } catch {
+        return { fill: getCssVar('--text-color') || '#000', stroke: 'var(--transparent-page-contrast)', strokeWidth: 1 };
+      }
+    };
+    window.ColorPalettes = {
+      refresh: updatePalettes,
+      notify: () => { try { const primaryOKLCH = getPrimaryOKLCH(); const primaryHex = getPrimaryHex(); document.dispatchEvent(new CustomEvent('palettes:updated', { detail: { primary: primaryHex, primaryOKLCH } })); } catch { } },
+      getPrimary: () => getPrimaryHex(),
+      getPrimaryOKLCH: () => getPrimaryOKLCH(),
+      getColors: (key, count = 6) => {
+        const primaryOKLCH = getPrimaryOKLCH();
+        if (!primaryOKLCH) return [];
+        const total = Math.max(1, Math.min(12, Number(count) || 6));
+        if (key === 'categorical') return generators.categorical(primaryOKLCH, total);
+        if (key === 'sequential') return generators.sequential(primaryOKLCH, total);
+        if (key === 'diverging') return generators.diverging(primaryOKLCH, total);
+        return [];
+      },
+      getTextStyleForBackground: (bgCss, opts) => pickTextStyleForBackground(bgCss, opts || {}),
+      chooseReadableText: (bgCss, opts) => pickTextStyleForBackground(bgCss, opts || {})
+    };
+  };
+
+  if (document.readyState === 'loading') document.addEventListener('DOMContentLoaded', bootstrap, { once: true });
+  else bootstrap();
+})();
+
+
diff --git a/app/public/thumb.png b/app/public/thumb.png
new file mode 100644
index 0000000000000000000000000000000000000000..ef3ed3dd0ba629c350151af4dfde3ce8c9222ca6
--- /dev/null
+++ b/app/public/thumb.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ae7bfe85551fa5f70df5341e6c3a5d5d5f0d68553d9a137725fda61d55627ded
+size 279121
diff --git a/app/scripts/EXPORT-PDF-BOOK.md b/app/scripts/EXPORT-PDF-BOOK.md
new file mode 100644
index 0000000000000000000000000000000000000000..25ed30daffd561c93dff4f8baf570f0e97eeb661
--- /dev/null
+++ b/app/scripts/EXPORT-PDF-BOOK.md
@@ -0,0 +1,311 @@
+# 📚 PDF Book Export with Paged.js
+
+Professional PDF generation system with a book-style page layout, powered by **Paged.js**.
+
+## ✨ Features
+
+### Professional page layout
+- ✅ **Automatic pagination** with Paged.js
+- ✅ **Running headers**: chapter titles at the top of each page
+- ✅ **Page numbering**: alternating left/right
+- ✅ **Asymmetric margins**: optimized for binding (recto/verso)
+- ✅ **Widow and orphan control**: avoids stranded lines
+- ✅ **Professional typography**: justification, automatic hyphenation
+
+### Book elements
+- 📖 Automatic counters: chapters, figures, tables
+- 📑 Footnotes (if implemented)
+- 🔢 Hierarchical numbering (1.2.3, etc.)
+- 📊 Full support for D3/Plotly visualizations
+- 🖼️ Figures with numbered captions
+- 📝 Citations and references
+
+## 🚀 Usage
+
+### Basic command
+
+```bash
+npm run export:pdf:book
+```
+
+This command will:
+1. Build the Astro site (if needed)
+2. Start a preview server
+3. Load the page and inject Paged.js
+4. Paginate the content automatically
+5. 
Generate the PDF into `dist/article-book.pdf`
+
+### Available options
+
+```bash
+# Dark theme
+npm run export:pdf:book -- --theme=dark
+
+# Custom page format
+npm run export:pdf:book -- --format=Letter
+
+# Custom output filename
+npm run export:pdf:book -- --filename=mon-livre
+
+# Combined options
+npm run export:pdf:book -- --theme=light --format=A4 --filename=thesis
+```
+
+#### Option details
+
+| Option | Values | Default | Description |
+|--------|--------|---------|-------------|
+| `--theme` | `light`, `dark` | `light` | Color theme |
+| `--format` | `A4`, `Letter`, `Legal`, `A3`, `Tabloid` | `A4` | Page format |
+| `--filename` | `string` | `article-book` | Output file name |
+| `--wait` | `full`, `images`, `plotly`, `d3` | `full` | Wait strategy |
+
+## 📐 Page format
+
+The system uses margins optimized for book printing:
+
+### Right-hand pages (recto)
+- Left margin: **25mm** (binding)
+- Right margin: **20mm**
+- Right header: section title
+- Right footer: page number
+
+### Left-hand pages (verso)
+- Left margin: **20mm**
+- Right margin: **25mm** (binding)
+- Left header: chapter title
+- Left footer: page number
+
+### First page
+- Larger margins (40mm top/bottom)
+- No headers/footers
+- Centered
+
+## 🎨 CSS customization
+
+The book style is defined in:
+```
+app/src/styles/_print-book.css
+```
+
+### Changing the margins
+
+```css
+@page {
+  margin-top: 20mm;
+  margin-bottom: 25mm;
+  /* ... */
+}
+
+@page :left {
+  margin-left: 20mm;
+  margin-right: 25mm;
+}
+
+@page :right {
+  margin-left: 25mm;
+  margin-right: 20mm;
+}
+```
+
+### Changing the typography
+
+```css
+body {
+  font-family: "Georgia", "Palatino", "Times New Roman", serif;
+  font-size: 11pt;
+  line-height: 1.6;
+}
+
+h2 {
+  font-size: 18pt;
+  /* ... 
*/
+}
+```
+
+### Customizing the running headers
+
+```css
+@page :left {
+  @top-left {
+    content: string(chapter-title);
+    font-size: 9pt;
+    font-style: italic;
+    /* ... */
+  }
+}
+
+@page :right {
+  @top-right {
+    content: string(section-title);
+    /* ... */
+  }
+}
+```
+
+### Adding a logo/watermark
+
+```css
+@page {
+  background-image: url('/logo.png');
+  background-position: bottom center;
+  background-size: 20mm;
+  background-repeat: no-repeat;
+}
+```
+
+## 🔧 Advanced Paged.js configuration
+
+### Custom JavaScript hooks
+
+You can add Paged.js hooks in the `export-pdf-book.mjs` script:
+
+```javascript
+// After injecting Paged.js
+await page.evaluate(() => {
+  class BookHooks extends window.Paged.Handler {
+    beforeParsed(content) {
+      // Modify the content before pagination
+    }
+
+    afterParsed(parsed) {
+      // After parsing
+    }
+
+    afterRendered(pages) {
+      // After all pages have been rendered
+      console.log(`Rendered ${pages.length} pages`);
+    }
+  }
+
+  window.Paged.registerHandlers(BookHooks);
+});
+```
+
+### Forcing page breaks
+
+In your MDX:
+
+```mdx
+## Chapter 1
+
+Content...
+
+<div style="break-after: page;"></div>
+
+## Chapter 2 (starts on a new page)
+```
+
+Or with a CSS class:
+
+```css
+.chapter-break {
+  break-after: page;
+}
+```
+
+## 📊 Visualizations
+
+D3 and Plotly charts are automatically:
+- ✅ Resized for the book format
+- ✅ Rendered in high quality
+- ✅ Kept clear of page breaks
+- ✅ Left interactive in the HTML source
+
+## 🐛 Troubleshooting
+
+### The PDF is empty or incomplete
+
+```bash
+# Increase the wait time
+npm run export:pdf:book -- --wait=full
+```
+
+### Images don't show up
+
+Check that image paths are **absolute** in the HTML:
+```html
+<!-- ✅ absolute path -->
+<img src="/images/figure.png" />
+
+<!-- ❌ relative path -->
+<img src="./images/figure.png" />
+```
+
+### Charts get cut off
+
+Add to `_print-book.css`:
+```css
+.your-chart-class {
+  max-height: 180mm !important;
+  break-inside: avoid;
+}
+```
+
+### "Paged.js not found" error
+
+```bash
+# Reinstall Paged.js
+cd app
+npm install pagedjs
+```
+
+### The server won't start
+
+```bash
+# Port already in use? Change the port
+PREVIEW_PORT=8081 npm run export:pdf:book
+```
+
+## 📚 Paged.js resources
+
+- **Official documentation**: https://pagedjs.org/documentation/
+- **CSS Paged Media specification**: https://www.w3.org/TR/css-page-3/
+- **Examples**: https://pagedjs.org/examples/
+
+## 🆚 Differences from the standard export:pdf
+
+| Feature | `export:pdf` | `export:pdf:book` |
+|---------|--------------|-------------------|
+| Pagination | Standard browser | Professional Paged.js |
+| Running headers | ❌ | ✅ |
+| Binding margins | ❌ | ✅ |
+| Advanced numbering | ❌ | ✅ |
+| Automatic counters | ❌ | ✅ |
+| Widow/orphan control | Basic | Advanced |
+| Footnotes | ❌ | ✅ (if enabled) |
+| Typographic control | Standard | Professional |
+| Table of contents | Manual | Automatic (with CSS) |
+
+## 💡 Tips for best results
+
+1. **Structure your content** with `<h2>` headings for chapters
+2. **Use `<h3>` headings for sections** (they appear in the running headers)
+3. **Add IDs** to headings for cross-references
+4. **Optimize images**: 300 DPI resolution for print
+5. **Check the rendering** before the final print run
+6. **Avoid bright colors** in print mode (prefer grayscale)
+
+## 🎯 Use cases
+
+This system is a good fit for:
+- 📘 **Theses and dissertations**
+- 📗 **Technical books**
+- 📕 **Academic reports**
+- 📙 **Long-form documentation**
+- 📓 **Premium e-books**
+- 📔 **Scientific journals**
+
+## 🔮 Future improvements
+
+- [ ] Automatic table of contents generation
+- [ ] Index support
+- [ ] Automatic cross-references
+- [ ] EPUB export
+- [ ] Preconfigured book templates
+- [ ] "Two-up" mode for double-page preview
+
+---
+
+**Created with ❤️ by your template team**
+
diff --git a/app/scripts/README-PDF-BOOK.md b/app/scripts/README-PDF-BOOK.md
new file mode 100644
index 0000000000000000000000000000000000000000..5f106d9c91a2a8df6b26deb1aa3ca92ad8af25b9
--- /dev/null
+++ b/app/scripts/README-PDF-BOOK.md
@@ -0,0 +1,309 @@
+# 📚 PDF Book Export - Complete Guide
+
+Professional PDF generation system with a book-style layout for your research article template.
+
+## 🎯 Goal
+
+Produce professional-quality PDFs with:
+- Careful typography (Georgia, justification, hyphenation)
+- Asymmetric margins for binding
+- Running headers with chapter titles
+- Left/right page numbering
+- Widow and orphan control
+- Academic/editorial book styling
+
+## 📦 What was created
+
+### Files created
+
+```
+app/
+├── scripts/
+│   ├── export-pdf-book.mjs        ← Paged.js version (advanced, in progress)
+│   ├── export-pdf-book-simple.mjs ← Simple version (RECOMMENDED ✅)
+│   └── EXPORT-PDF-BOOK.md         ← Detailed documentation
+└── src/
+    └── styles/
+        └── _print-book.css        ← CSS Paged Media styles
+```
+
+### npm commands added
+
+```json
+{
+  "export:pdf:book": "Paged.js version (experimental)",
+  "export:pdf:book:simple": "Simple version (stable ✅)"
+}
+```
+
+## 🚀 Usage
+
+### Recommended command
+
+```bash
+npm run export:pdf:book:simple
+```
+
+The PDF is generated in:
+- `dist/article-book.pdf`
+- `public/article-book.pdf` (automatic copy)
+
+### Available options
+
+```bash
+# Dark theme
+npm run export:pdf:book:simple -- --theme=dark
+
+# Letter format
+npm run export:pdf:book:simple -- --format=Letter
+
+# Custom name
+npm run export:pdf:book:simple -- --filename=ma-these
+
+# Combined
+npm run export:pdf:book:simple -- --theme=light --format=A4 --filename=livre
+```
+
+## 🎨 Book style characteristics
+
+### Margins
+
+```
+ Right-hand pages (recto)     │     Left-hand pages (verso)
+                              │
+    25mm ──┐       ┌── 20mm   │    20mm ──┐       ┌── 25mm
+ (binding) │       │          │           │       │ (binding)
+     ┌─────┴───────┴─────┐    │     ┌─────┴───────┴─────┐
+     │      CONTENT      │    │     │      CONTENT      │
+     └───────────────────┘    │     └───────────────────┘
+```
+
+### Typography
+
+- **Typeface**: Georgia, Palatino (serif)
+- **Size**: 11pt
+- **Leading**: 1.6
+- **Alignment**: justified, with automatic hyphenation
+- **Indent**: 5mm for subsequent paragraphs
+
+### Headings
+
+```css
+H2 (chapters)      → 18pt, numbered (1. 2. 3.)
+H3 (sections)      → 14pt, numbered (1.1, 1.2)
+H4 (sub-sections)  → 12pt
+```
+
+### Automatic counters
+
+- Chapters: 1, 2, 3...
+- Sections: 1.1, 1.2, 2.1...
+- Figures: Figure 1.1, Figure 1.2...
+- Tables: likewise
+
+## 📐 CSS configuration
+
+The `_print-book.css` file contains all the styles. You can customize:
+
+### Changing the fonts
+
+```css
+body {
+  font-family: "Baskerville", "Georgia", serif;
+  font-size: 12pt;
+}
+```
+
+### Adjusting the margins
+
+```css
+@page {
+  margin-top: 25mm;
+  margin-bottom: 30mm;
+}
+
+@page :left {
+  margin-left: 18mm;
+  margin-right: 30mm;
+}
+```
+
+### Customizing the headers
+
+```css
+@page :left {
+  @top-left {
+    content: string(chapter-title);
+    font-size: 10pt;
+    color: #333;
+  }
+}
+```
+
+### Forcing a page break
+
+In your MDX:
+```mdx
+## Chapter 1
+
+Content...
+
+---
+
+## Chapter 2 (new page)
+```
+
+Or with CSS:
+```css
+.new-chapter {
+  break-before: page;
+}
+```
+
+## 🆚 Version comparison
+
+| Feature | Simple | Paged.js |
+|---------|--------|----------|
+| **Stability** | ✅ Excellent | ⚠️ In progress |
+| **Speed** | ✅ Fast | ⏱️ Slower |
+| **Setup** | ✅ None | 📦 Paged.js required |
+| **Binding margins** | ✅ | ✅ |
+| **Running headers** | ⚠️ Limited | ✅ Advanced |
+| **Footnotes** | ❌ | ✅ |
+| **Automatic TOC** | ❌ | ✅ |
+| **Typographic quality** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+
+### When should you use which version?
+
+**Simple version** (recommended):
+- ✅ For most use cases
+- ✅ When stability comes first
+- ✅ Fast generation
+- ✅ Predictable results
+
+**Paged.js version** (experimental):
+- 🔬 To try the advanced features
+- 📚 If you need footnotes
+- 📖 For automatically generated tables of contents
+- ⚠️ Needs more testing
+
+## 🐛 Troubleshooting
+
+### The PDF is empty
+
+```bash
+# Rebuild first
+npm run build
+npm run export:pdf:book:simple
+```
+
+### Images are missing
+
+Check that paths are absolute:
+```html
+<!-- ✅ absolute path -->
+<img src="/images/figure.png" />
+
+<!-- ❌ relative path -->
+<img src="./images/figure.png" />
+```
+
+### Charts get cut off
+
+In `_print-book.css`, add:
+```css
+.your-chart {
+  max-height: 200mm;
+  break-inside: avoid;
+}
+```
+
+### Port 8080 already in use
+
+```bash
+PREVIEW_PORT=8081 npm run export:pdf:book:simple
+```
+
+## 🎓 Next steps
+
+### Possible improvements
+
+1. **Finish the Paged.js version** for the advanced features
+2. **Automatic table of contents** with page numbers
+3. **Automatically generated index**
+4. **Cross-references** (See Figure 2.3, etc.)
+5. **Predefined templates**:
+   - Academic thesis
+   - Technical report
+   - Scientific book
+   - Documentation
+
+### Contributing
+
+The styles live in `_print-book.css`. To propose improvements:
+
+1. Test with your own content
+2. Edit the CSS
+3. Generate the PDF
+4. Share your changes!
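A related detail for anyone scripting around these exports: when no `--filename` is passed, `export-latex.mjs` derives the output name from the article title with a `slugify` helper (the same helper is embedded in `export-pdf-book-simple.mjs`). Reproduced here for reference:

```javascript
// slugify, as used by the export scripts to derive default output filenames:
// strip accents, lowercase, collapse everything non-alphanumeric to hyphens.
function slugify(text) {
  return String(text || '')
    .normalize('NFKD')                 // split accented chars into base + combining mark
    .replace(/\p{Diacritic}+/gu, '')   // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
    .slice(0, 120) || 'article';       // fall back to 'article' for empty titles
}

console.log(slugify('Étude de cas : Typographie & Mise en page'));
// -> etude-de-cas-typographie-mise-en-page
```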
+
+## 📚 Resources
+
+### CSS Paged Media
+
+- [W3C Spec](https://www.w3.org/TR/css-page-3/)
+- [CSS Tricks Guide](https://css-tricks.com/css-paged-media-guide/)
+- [Print CSS Documentation](https://www.smashingmagazine.com/2015/01/designing-for-print-with-css/)
+
+### Paged.js
+
+- [Documentation](https://pagedjs.org/documentation/)
+- [Examples](https://pagedjs.org/examples/)
+- [W3C Paged Media](https://www.w3.org/TR/css-page-3/)
+
+### Book typography
+
+- [Butterick's Practical Typography](https://practicaltypography.com/)
+- [The Elements of Typographic Style](http://webtypography.net/)
+
+## 💡 Use cases
+
+This system is a good fit for:
+
+- 📘 **PhD theses**
+- 📗 **Master's theses**
+- 📕 **Research reports**
+- 📙 **Technical documentation**
+- 📓 **White papers**
+- 📔 **Self-published books**
+- 📚 **Article collections**
+
+## 🎉 Result
+
+With this system you get:
+
+✅ **Print-ready PDF**
+- Correct binding margins
+- Professional typography
+- Consistent layout
+
+✅ **Editorial quality**
+- Automatic numbering
+- Widow/orphan control
+- Clean hyphenation
+
+✅ **Modern workflow**
+- Write in MDX
+- Automated build
+- A single source file
+
+---
+
+**Created with ❤️ for the Research Article Template**
+
+*Enjoy your new book PDF export system!* 📚✨
+
diff --git a/app/scripts/export-latex.mjs b/app/scripts/export-latex.mjs
new file mode 100755
index 0000000000000000000000000000000000000000..d4efc9d206db22dcb9486afc2a50a9edd0bc4d37
--- /dev/null
+++ b/app/scripts/export-latex.mjs
@@ -0,0 +1,358 @@
+#!/usr/bin/env node
+import { spawn } from 'node:child_process';
+import { promises as fs } from 'node:fs';
+import { resolve, dirname, basename, extname } from 'node:path';
+import process from 'node:process';
+
+async function run(command, args = [], options = {}) {
+  return new Promise((resolvePromise, reject) => {
+    const child = spawn(command, args, { stdio: 'inherit', shell: 
false, ...options }); + child.on('error', reject); + child.on('exit', (code) => { + if (code === 0) resolvePromise(undefined); + else reject(new Error(`${command} ${args.join(' ')} exited with code ${code}`)); + }); + }); +} + +function parseArgs(argv) { + const out = {}; + for (const arg of argv.slice(2)) { + if (!arg.startsWith('--')) continue; + const [k, v] = arg.replace(/^--/, '').split('='); + out[k] = v === undefined ? true : v; + } + return out; +} + +function slugify(text) { + return String(text || '') + .normalize('NFKD') + .replace(/\p{Diacritic}+/gu, '') + .toLowerCase() + .replace(/[^a-z0-9]+/g, '-') + .replace(/^-+|-+$/g, '') + .slice(0, 120) || 'article'; +} + +async function checkPandocInstalled() { + try { + await run('pandoc', ['--version'], { stdio: 'pipe' }); + return true; + } catch { + return false; + } +} + +async function readMdxFile(filePath) { + try { + const content = await fs.readFile(filePath, 'utf-8'); + return content; + } catch (error) { + console.warn(`Warning: Could not read ${filePath}:`, error.message); + return ''; + } +} + +function extractFrontmatter(content) { + const frontmatterMatch = content.match(/^---\n([\s\S]*?)\n---\n/); + if (!frontmatterMatch) return { frontmatter: {}, content }; + + const frontmatterText = frontmatterMatch[1]; + const contentWithoutFrontmatter = content.replace(frontmatterMatch[0], ''); + + // More robust YAML parsing that handles complex structures + const frontmatter = {}; + const lines = frontmatterText.split('\n'); + let currentKey = null; + let currentValue = ''; + let inMultiLineValue = false; + let multiLineOperator = null; // '>' or '|' + + for (const line of lines) { + // Check if this is a new key + if (line.match(/^[a-zA-Z_][a-zA-Z0-9_]*\s*:/) && !inMultiLineValue) { + // Save previous key if exists + if (currentKey) { + frontmatter[currentKey] = currentValue.trim(); + } + + const [key, ...valueParts] = line.split(':'); + currentKey = key.trim(); + currentValue = 
valueParts.join(':').trim();
+
+      // Check for multi-line operators
+      if (currentValue.endsWith('>') || currentValue.endsWith('|')) {
+        multiLineOperator = currentValue.slice(-1);
+        currentValue = currentValue.slice(0, -1).trim();
+        inMultiLineValue = true;
+      } else if (currentValue) {
+        inMultiLineValue = false;
+      } else {
+        inMultiLineValue = true;
+      }
+    } else if (currentKey && (inMultiLineValue || line.match(/^\s/))) {
+      // Continuation line or nested content
+      if (inMultiLineValue) {
+        if (line.trim() === '' && multiLineOperator === '>') {
+          // Empty line in folded style should become space
+          currentValue += ' ';
+        } else {
+          const lineContent = line.startsWith(' ') ? line : ' ' + line;
+          currentValue += lineContent;
+        }
+      } else {
+        currentValue += '\n' + line;
+      }
+    }
+  }
+
+  // Save the last key
+  if (currentKey) {
+    frontmatter[currentKey] = currentValue.trim();
+  }
+
+  return { frontmatter, content: contentWithoutFrontmatter };
+}
+
+function cleanMdxToMarkdown(content) {
+  // Remove import statements
+  content = content.replace(/^import .+?;?\s*$/gm, '');
+
+  // Remove self-closing JSX component calls like <SomeComponent />
+  content = content.replace(/<[A-Z][a-zA-Z0-9]*\s*\/>/g, '');
+
+  // Convert JSX components to simpler markdown
+  // Handle Sidenote components specially
+  content = content.replace(/<Sidenote[^>]*>([\s\S]*?)<\/Sidenote>/g, (match, innerContent) => {
+    // Extract main content and aside content
+    const asideMatch = innerContent.match(/<Fragment[^>]*slot="aside"[^>]*>([\s\S]*?)<\/Fragment>/);
+    const mainContent = innerContent.replace(/<Fragment[^>]*>[\s\S]*?<\/Fragment>/, '').trim();
+    const asideContent = asideMatch ? 
asideMatch[1].trim() : '';
+
+    let result = mainContent;
+    if (asideContent) {
+      result += `\n\n> **Note:** ${asideContent}`;
+    }
+    return result;
+  });
+
+  // Handle Note components
+  content = content.replace(/<Note[^>]*>([\s\S]*?)<\/Note>/g, (match, innerContent) => {
+    return `\n> **Note:** ${innerContent.trim()}\n`;
+  });
+
+  // Handle Wide and FullWidth components
+  content = content.replace(/<(Wide|FullWidth)>([\s\S]*?)<\/\1>/g, '$2');
+
+  // Handle HtmlEmbed components (convert to simple text)
+  content = content.replace(/<HtmlEmbed[^>]*\/>/g, '*[Interactive content not available in LaTeX]*');
+
+  // Remove remaining JSX fragments
+  content = content.replace(/<Fragment[^>]*>([\s\S]*?)<\/Fragment>/g, '$1');
+  content = content.replace(/<[A-Z][a-zA-Z0-9]*[^>]*>([\s\S]*?)<\/[A-Z][a-zA-Z0-9]*>/g, '$1');
+
+  // Clean up className attributes
+  content = content.replace(/className="[^"]*"/g, '');
+
+  // Clean up extra whitespace
+  content = content.replace(/\n{3,}/g, '\n\n');
+
+  // Clean up characters that might cause YAML parsing issues
+  // Remove any potential YAML-style markers that might interfere
+  content = content.replace(/^---$/gm, '');
+  content = content.replace(/^\s*&\s+/gm, ''); // Remove YAML aliases
+
+  return content.trim();
+}
+
+async function processChapterImports(content, contentDir) {
+  let processedContent = content;
+
+  // First, extract all import statements and their corresponding component calls
+  const importPattern = /import\s+(\w+)\s+from\s+["']\.\/chapters\/([^"']+)["'];?/g;
+  const imports = new Map();
+  let match;
+
+  // Collect all imports
+  while ((match = importPattern.exec(content)) !== null) {
+    const [fullImport, componentName, chapterPath] = match;
+    imports.set(componentName, { path: chapterPath, importStatement: fullImport });
+  }
+
+  // Remove all import statements
+  processedContent = processedContent.replace(importPattern, '');
+
+  // Process each component call
+  for (const [componentName, { path: chapterPath }] of imports) {
+    const 
componentCallPattern = new RegExp(`<${componentName}\\s*\\/>`, 'g'); + + try { + const chapterFile = resolve(contentDir, 'chapters', chapterPath); + const chapterContent = await readMdxFile(chapterFile); + const { content: chapterMarkdown } = extractFrontmatter(chapterContent); + const cleanChapter = cleanMdxToMarkdown(chapterMarkdown); + + processedContent = processedContent.replace(componentCallPattern, cleanChapter); + console.log(`✅ Processed chapter: ${chapterPath}`); + } catch (error) { + console.warn(`Warning: Could not process chapter ${chapterPath}:`, error.message); + processedContent = processedContent.replace(componentCallPattern, `\n*[Chapter ${chapterPath} could not be loaded]*\n`); + } + } + + return processedContent; +} + +function createLatexPreamble(frontmatter) { + const title = frontmatter.title ? frontmatter.title.replace(/\n/g, ' ') : 'Untitled Article'; + const subtitle = frontmatter.subtitle || ''; + const authors = frontmatter.authors || ''; + const date = frontmatter.published || ''; + + return `\\documentclass[11pt,a4paper]{article} +\\usepackage[utf8]{inputenc} +\\usepackage[T1]{fontenc} +\\usepackage{amsmath,amsfonts,amssymb} +\\usepackage{graphicx} +\\usepackage{hyperref} +\\usepackage{booktabs} +\\usepackage{longtable} +\\usepackage{array} +\\usepackage{multirow} +\\usepackage{wrapfig} +\\usepackage{float} +\\usepackage{colortbl} +\\usepackage{pdflscape} +\\usepackage{tabu} +\\usepackage{threeparttable} +\\usepackage{threeparttablex} +\\usepackage{ulem} +\\usepackage{makecell} +\\usepackage{xcolor} +\\usepackage{listings} +\\usepackage{fancyvrb} +\\usepackage{geometry} +\\geometry{margin=1in} + +\\title{${title}${subtitle ? `\\\\\\large ${subtitle}` : ''}} +${authors ? `\\author{${authors}}` : ''} +${date ? 
`\\date{${date}}` : ''} + +\\begin{document} +\\maketitle +\\tableofcontents +\\newpage + +`; +} + +async function main() { + const cwd = process.cwd(); + const args = parseArgs(process.argv); + + // Check if pandoc is installed + const hasPandoc = await checkPandocInstalled(); + if (!hasPandoc) { + console.error('❌ Pandoc is not installed. Please install it first:'); + console.error(' macOS: brew install pandoc'); + console.error(' Ubuntu: apt-get install pandoc'); + console.error(' Windows: choco install pandoc'); + process.exit(1); + } + + const contentDir = resolve(cwd, 'src/content'); + const articleFile = resolve(contentDir, 'article.mdx'); + + // Check if article.mdx exists + try { + await fs.access(articleFile); + } catch { + console.error(`❌ Could not find article.mdx at ${articleFile}`); + process.exit(1); + } + + console.log('> Reading article content...'); + const articleContent = await readMdxFile(articleFile); + const { frontmatter, content } = extractFrontmatter(articleContent); + + console.log('> Processing chapters...'); + const processedContent = await processChapterImports(content, contentDir); + + console.log('> Converting MDX to Markdown...'); + const markdownContent = cleanMdxToMarkdown(processedContent); + + // Generate output filename + const title = frontmatter.title ? frontmatter.title.replace(/\n/g, ' ') : 'article'; + const outFileBase = args.filename ? 
String(args.filename).replace(/\.(tex|pdf)$/i, '') : slugify(title); + + // Create temporary markdown file (ensure it's pure markdown without YAML frontmatter) + const tempMdFile = resolve(cwd, 'temp-article.md'); + + // Clean the markdown content to ensure no YAML frontmatter remains + let cleanMarkdown = markdownContent; + // Remove any potential YAML frontmatter that might have leaked through + cleanMarkdown = cleanMarkdown.replace(/^---\n[\s\S]*?\n---\n/, ''); + // Remove any standalone YAML blocks that might cause issues + cleanMarkdown = cleanMarkdown.replace(/^---\n([\s\S]*?)\n---$/gm, ''); + + await fs.writeFile(tempMdFile, cleanMarkdown); + + + console.log('> Converting to LaTeX with Pandoc...'); + const outputLatex = resolve(cwd, 'dist', `${outFileBase}.tex`); + + // Ensure dist directory exists + await fs.mkdir(resolve(cwd, 'dist'), { recursive: true }); + + // Pandoc conversion arguments + const pandocArgs = [ + tempMdFile, + '-o', outputLatex, + '--from=markdown-yaml_metadata_block', // Explicitly exclude YAML metadata parsing + '--to=latex', + '--standalone', + '--toc', + '--number-sections', + '--highlight-style=tango', + '--listings' + ]; + + // Add bibliography if it exists + const bibFile = resolve(contentDir, 'bibliography.bib'); + try { + await fs.access(bibFile); + pandocArgs.push('--bibliography', bibFile); + pandocArgs.push('--citeproc'); + console.log('✅ Found bibliography file, including citations'); + } catch { + console.log('ℹ️ No bibliography file found'); + } + + try { + await run('pandoc', pandocArgs); + console.log(`✅ LaTeX generated: ${outputLatex}`); + + // Optionally compile to PDF if requested + if (args.pdf) { + console.log('> Compiling LaTeX to PDF...'); + const outputPdf = resolve(cwd, 'dist', `${outFileBase}.pdf`); + await run('pdflatex', ['-output-directory', resolve(cwd, 'dist'), outputLatex]); + console.log(`✅ PDF generated: ${outputPdf}`); + } + + } catch (error) { + console.error('❌ Pandoc conversion failed:', 
error.message);
+    process.exit(1);
+  } finally {
+    // Clean up temporary file
+    try {
+      await fs.unlink(tempMdFile);
+    } catch { }
+  }
+}
+
+main().catch((err) => {
+  console.error(err);
+  process.exit(1);
+});
diff --git a/app/scripts/export-pdf-book-simple.mjs b/app/scripts/export-pdf-book-simple.mjs
new file mode 100755
index 0000000000000000000000000000000000000000..fe52da66a94ddb09b93cb18ef32cce4003b78576
--- /dev/null
+++ b/app/scripts/export-pdf-book-simple.mjs
@@ -0,0 +1,416 @@
+#!/usr/bin/env node
+/**
+ * Export PDF Book - Simplified version
+ *
+ * Generates a professional-quality PDF with a book-style layout,
+ * driven directly by Playwright + CSS Paged Media (no Paged.js, for better stability)
+ *
+ * Usage:
+ *   npm run export:pdf:book:simple
+ *   npm run export:pdf:book:simple -- --theme=dark --format=A4
+ */
+
+import { spawn } from 'node:child_process';
+import { setTimeout as delay } from 'node:timers/promises';
+import { chromium } from 'playwright';
+import { resolve, dirname } from 'node:path';
+import { promises as fs } from 'node:fs';
+import { fileURLToPath } from 'node:url';
+import process from 'node:process';
+
+const __dirname = dirname(fileURLToPath(import.meta.url));
+
+// ============================================================================
+// Utilities (reused from the original export script)
+// ============================================================================
+
+async function run(command, args = [], options = {}) {
+  return new Promise((resolvePromise, reject) => {
+    const child = spawn(command, args, { stdio: 'inherit', shell: false, ...options });
+    child.on('error', reject);
+    child.on('exit', (code) => {
+      if (code === 0) resolvePromise(undefined);
+      else reject(new Error(`${command} ${args.join(' ')} exited with code ${code}`));
+    });
+  });
+}
+
+async function waitForServer(url, timeoutMs = 60000) {
+  const start = Date.now();
+  while (Date.now() - start < timeoutMs) {
+    try {
+      const res = await 
fetch(url); + if (res.ok) return; + } catch { } + await delay(500); + } + throw new Error(`Server did not start in time: ${url}`); +} + +function parseArgs(argv) { + const out = {}; + for (const arg of argv.slice(2)) { + if (!arg.startsWith('--')) continue; + const [k, v] = arg.replace(/^--/, '').split('='); + out[k] = v === undefined ? true : v; + } + return out; +} + +function slugify(text) { + return String(text || '') + .normalize('NFKD') + .replace(/\p{Diacritic}+/gu, '') + .toLowerCase() + .replace(/[^a-z0-9]+/g, '-') + .replace(/^-+|-+$/g, '') + .slice(0, 120) || 'article'; +} + +async function waitForImages(page, timeoutMs = 15000) { + await page.evaluate(async (timeout) => { + const deadline = Date.now() + timeout; + const imgs = Array.from(document.images || []); + const unloaded = imgs.filter(img => !img.complete || (img.naturalWidth === 0)); + await Promise.race([ + Promise.all(unloaded.map(img => new Promise(res => { + if (img.complete && img.naturalWidth !== 0) return res(undefined); + img.addEventListener('load', () => res(undefined), { once: true }); + img.addEventListener('error', () => res(undefined), { once: true }); + }))), + new Promise(res => setTimeout(res, Math.max(0, deadline - Date.now()))) + ]); + }, timeoutMs); +} + +async function waitForPlotly(page, timeoutMs = 20000) { + await page.evaluate(async (timeout) => { + const start = Date.now(); + const hasPlots = () => Array.from(document.querySelectorAll('.js-plotly-plot')).length > 0; + while (!hasPlots() && (Date.now() - start) < timeout) { + await new Promise(r => setTimeout(r, 200)); + } + const deadline = start + timeout; + const allReady = () => Array.from(document.querySelectorAll('.js-plotly-plot')).every(el => el.querySelector('svg.main-svg')); + while (!allReady() && Date.now() < deadline) { + await new Promise(r => setTimeout(r, 200)); + } + }, timeoutMs); +} + +async function waitForD3(page, timeoutMs = 20000) { + await page.evaluate(async (timeout) => { + const start = 
Date.now(); + const isReady = () => { + const hero = document.querySelector('.hero-banner'); + if (hero) { + return !!hero.querySelector('svg circle, svg path, svg rect, svg g'); + } + const containers = [ + ...Array.from(document.querySelectorAll('.d3-line')), + ...Array.from(document.querySelectorAll('.d3-bar')) + ]; + if (!containers.length) return true; + return containers.every(c => c.querySelector('svg circle, svg path, svg rect, svg g')); + }; + while (!isReady() && (Date.now() - start) < timeout) { + await new Promise(r => setTimeout(r, 200)); + } + }, timeoutMs); +} + +async function waitForStableLayout(page, timeoutMs = 5000) { + const start = Date.now(); + let last = await page.evaluate(() => document.scrollingElement ? document.scrollingElement.scrollHeight : document.body.scrollHeight); + let stableCount = 0; + while ((Date.now() - start) < timeoutMs && stableCount < 3) { + await page.waitForTimeout(250); + const now = await page.evaluate(() => document.scrollingElement ? document.scrollingElement.scrollHeight : document.body.scrollHeight); + if (now === last) stableCount += 1; else { stableCount = 0; last = now; } + } +} + +async function openAllAccordions(page) { + console.log('📂 Opening all accordions…'); + await page.evaluate(() => { + // Find all accordions (details.accordion) + const accordions = document.querySelectorAll('details.accordion, details'); + let openedCount = 0; + + accordions.forEach((accordion) => { + if (!accordion.hasAttribute('open')) { + // Open the accordion by adding the open attribute + accordion.setAttribute('open', ''); + + // Force the content to display for the PDF + const wrapper = accordion.querySelector('.accordion__content-wrapper'); + if (wrapper) { + wrapper.style.height = 'auto'; + wrapper.style.overflow = 'visible'; + } + + openedCount++; + } + }); + + console.log(`Opened ${openedCount} accordion(s)`); + return openedCount; + }); + + // Short delay to let the accordions settle + await
page.waitForTimeout(500); +} + +async function waitForHtmlEmbeds(page, timeoutMs = 15000) { + console.log('⏳ Waiting for HTML embeds to render…'); + await page.evaluate(async (timeout) => { + const start = Date.now(); + + const isEmbedReady = (embed) => { + try { + // Check whether the embed has content + const hasContent = embed.querySelector('svg, canvas, div[id^="frag-"]'); + if (!hasContent) return false; + + // Check whether the SVGs contain elements + const svgs = embed.querySelectorAll('svg'); + for (const svg of svgs) { + const hasShapes = svg.querySelector('path, circle, rect, line, polygon, g'); + if (!hasShapes) return false; + } + + // Check whether the canvases have been drawn + const canvases = embed.querySelectorAll('canvas'); + for (const canvas of canvases) { + try { + const ctx = canvas.getContext('2d'); + const imageData = ctx.getImageData(0, 0, Math.min(10, canvas.width), Math.min(10, canvas.height)); + // Check whether at least one pixel is non-transparent + const hasPixels = Array.from(imageData.data).some((v, i) => i % 4 === 3 && v > 0); + if (!hasPixels) return false; + } catch (e) { + // Cross-origin or other error; assume it's OK + } + } + + return true; + } catch (e) { + return false; + } + }; + + while (Date.now() - start < timeout) { + const embeds = Array.from(document.querySelectorAll('.html-embed__card')); + if (embeds.length === 0) break; // No embeds on the page + + const allReady = embeds.every(isEmbedReady); + if (allReady) { + console.log(`All ${embeds.length} HTML embeds ready`); + break; + } + + await new Promise(r => setTimeout(r, 300)); + } + }, timeoutMs); +} + +// ============================================================================ +// Main script +// ============================================================================ + +async function main() { + const cwd = process.cwd(); + const port = Number(process.env.PREVIEW_PORT || 8080); + const baseUrl = `http://127.0.0.1:${port}/`; + const args =
parseArgs(process.argv); + + const theme = (args.theme === 'dark' || args.theme === 'light') ? args.theme : 'light'; + const format = args.format || 'A4'; + const wait = args.wait || 'full'; + + let outFileBase = (args.filename && String(args.filename).replace(/\.pdf$/i, '')) || 'article-book'; + + // Build if needed + const distDir = resolve(cwd, 'dist'); + let hasDist = false; + try { + const st = await fs.stat(distDir); + hasDist = st && st.isDirectory(); + } catch { } + + if (!hasDist) { + console.log('📦 Building Astro site…'); + await run('npm', ['run', 'build']); + } else { + console.log('✓ Using existing dist/ build'); + } + + console.log('🚀 Starting Astro preview server…'); + const preview = spawn('npm', ['run', 'preview'], { cwd, stdio: 'inherit', detached: true }); + const previewExit = new Promise((resolvePreview) => { + preview.on('close', (code, signal) => resolvePreview({ code, signal })); + }); + + try { + await waitForServer(baseUrl, 60000); + console.log('✓ Server ready'); + + console.log('📖 Launching browser…'); + const browser = await chromium.launch({ headless: true }); + + try { + const context = await browser.newContext(); + + // Apply the theme + await context.addInitScript((desired) => { + try { + localStorage.setItem('theme', desired); + if (document && document.documentElement) { + document.documentElement.dataset.theme = desired; + } + } catch { } + }, theme); + + const page = await context.newPage(); + + // Viewport for the content + await page.setViewportSize({ width: 1200, height: 1600 }); + + console.log('📄 Loading page…'); + await page.goto(baseUrl, { waitUntil: 'load', timeout: 60000 }); + + // Wait for the libraries + try { await page.waitForFunction(() => !!window.Plotly, { timeout: 8000 }); } catch { } + try { await page.waitForFunction(() => !!window.d3, { timeout: 8000 }); } catch { } + + // Resolve the output filename + if (!args.filename) { + const fromBtn = await page.evaluate(() => { + const btn =
document.getElementById('download-pdf-btn'); + const f = btn ? btn.getAttribute('data-pdf-filename') : null; + return f || ''; + }); + if (fromBtn) { + outFileBase = String(fromBtn).replace(/\.pdf$/i, '') + '-book'; + } else { + const title = await page.evaluate(() => { + const h1 = document.querySelector('h1.hero-title'); + const t = h1 ? h1.textContent : document.title; + return (t || '').replace(/\s+/g, ' ').trim(); + }); + outFileBase = slugify(title) + '-book'; + } + } + + // Wait for the content to render + if (wait === 'images' || wait === 'full') { + console.log('⏳ Waiting for images…'); + await waitForImages(page); + } + if (wait === 'd3' || wait === 'full') { + console.log('⏳ Waiting for D3…'); + await waitForD3(page); + } + if (wait === 'plotly' || wait === 'full') { + console.log('⏳ Waiting for Plotly…'); + await waitForPlotly(page); + } + if (wait === 'full') { + await waitForHtmlEmbeds(page); + await waitForStableLayout(page); + } + + // Open all accordions so they are visible in the PDF + await openAllAccordions(page); + await waitForStableLayout(page, 2000); + + // Enable print mode + await page.emulateMedia({ media: 'print' }); + + console.log('📚 Applying book styles…'); + + // Inject the book CSS + const bookCssPath = resolve(__dirname, '..', 'src', 'styles', '_print-book.css'); + const bookCss = await fs.readFile(bookCssPath, 'utf-8'); + await page.addStyleTag({ content: bookCss }); + + // Wait for the style to be applied + await page.waitForTimeout(1000); + + // Generate the PDF with the appropriate options + const outPath = resolve(cwd, 'dist', `${outFileBase}.pdf`); + + console.log('🖨️ Generating PDF…'); + + await page.pdf({ + path: outPath, + format, + printBackground: true, + displayHeaderFooter: false, // CSS @page handles headers/footers instead + preferCSSPageSize: false, + margin: { + top: '20mm', + right: '20mm', + bottom: '25mm', + left: '25mm' + } + }); + + // Check the PDF size + const stats = await
fs.stat(outPath); + const sizeKB = Math.round(stats.size / 1024); + + console.log(`✅ PDF generated: ${outPath} (${sizeKB} KB)`); + + if (sizeKB < 10) { + console.warn('⚠️ Warning: PDF is very small, content might be missing'); + } + + // Copy into public/ + const publicPath = resolve(cwd, 'public', `${outFileBase}.pdf`); + try { + await fs.mkdir(resolve(cwd, 'public'), { recursive: true }); + await fs.copyFile(outPath, publicPath); + console.log(`✅ PDF copied to: ${publicPath}`); + } catch (e) { + console.warn('⚠️ Unable to copy PDF to public/:', e?.message || e); + } + + } finally { + await browser.close(); + } + + } finally { + // Stop the preview server + console.log('🛑 Stopping preview server…'); + try { + if (process.platform !== 'win32') { + try { process.kill(-preview.pid, 'SIGINT'); } catch { } + } + try { preview.kill('SIGINT'); } catch { } + await Promise.race([previewExit, delay(3000)]); + + if (!preview.killed) { + try { + if (process.platform !== 'win32') { + try { process.kill(-preview.pid, 'SIGKILL'); } catch { } + } + try { preview.kill('SIGKILL'); } catch { } + } catch { } + await Promise.race([previewExit, delay(1000)]); + } + } catch { } + } + + console.log(''); + console.log('╔═══════════════════════════════════════════════════════════════╗'); + console.log('║ 📚 PDF BOOK (SIMPLE) GENERATED! 📚 ║'); + console.log('╚═══════════════════════════════════════════════════════════════╝'); + console.log(''); +} + +main().catch((err) => { + console.error('❌ Error:', err); + process.exit(1); +}); + diff --git a/app/scripts/export-pdf-book.mjs b/app/scripts/export-pdf-book.mjs new file mode 100755 index 0000000000000000000000000000000000000000..888d3d07c0d2c3576681e54e4b65cc08576db375 --- /dev/null +++ b/app/scripts/export-pdf-book.mjs @@ -0,0 +1,360 @@ +#!/usr/bin/env node +/** + * Export PDF Book with Paged.js + * + * Generates a professional-quality PDF with a book-style layout + * from the HTML content compiled by Astro.
+ * + * Fonctionnalités : + * - Pagination automatique avec Paged.js + * - Running headers (titres chapitres en haut de page) + * - Numérotation des pages + * - Marges différentes gauche/droite (reliure) + * - Gestion veuves/orphelines + * - Typographie professionnelle + * + * Usage : + * npm run export:pdf:book + * npm run export:pdf:book -- --theme=dark --format=A4 + * + * Options : + * --theme=light|dark Thème (défaut: light) + * --format=A4|Letter Format de page (défaut: A4) + * --filename=xxx Nom du fichier de sortie + * --wait=full Mode d'attente (défaut: full) + */ + +import { spawn } from 'node:child_process'; +import { setTimeout as delay } from 'node:timers/promises'; +import { chromium } from 'playwright'; +import { resolve, dirname } from 'node:path'; +import { promises as fs } from 'node:fs'; +import { fileURLToPath } from 'node:url'; +import process from 'node:process'; + +const __dirname = dirname(fileURLToPath(import.meta.url)); + +// ============================================================================ +// Utilitaires +// ============================================================================ + +async function run(command, args = [], options = {}) { + return new Promise((resolvePromise, reject) => { + const child = spawn(command, args, { stdio: 'inherit', shell: false, ...options }); + child.on('error', reject); + child.on('exit', (code) => { + if (code === 0) resolvePromise(undefined); + else reject(new Error(`${command} ${args.join(' ')} exited with code ${code}`)); + }); + }); +} + +async function waitForServer(url, timeoutMs = 60000) { + const start = Date.now(); + while (Date.now() - start < timeoutMs) { + try { + const res = await fetch(url); + if (res.ok) return; + } catch { } + await delay(500); + } + throw new Error(`Server did not start in time: ${url}`); +} + +function parseArgs(argv) { + const out = {}; + for (const arg of argv.slice(2)) { + if (!arg.startsWith('--')) continue; + const [k, v] = arg.replace(/^--/, 
'').split('='); + out[k] = v === undefined ? true : v; + } + return out; +} + +function slugify(text) { + return String(text || '') + .normalize('NFKD') + .replace(/\p{Diacritic}+/gu, '') + .toLowerCase() + .replace(/[^a-z0-9]+/g, '-') + .replace(/^-+|-+$/g, '') + .slice(0, 120) || 'article'; +} + +async function waitForImages(page, timeoutMs = 15000) { + await page.evaluate(async (timeout) => { + const deadline = Date.now() + timeout; + const imgs = Array.from(document.images || []); + const unloaded = imgs.filter(img => !img.complete || (img.naturalWidth === 0)); + await Promise.race([ + Promise.all(unloaded.map(img => new Promise(res => { + if (img.complete && img.naturalWidth !== 0) return res(undefined); + img.addEventListener('load', () => res(undefined), { once: true }); + img.addEventListener('error', () => res(undefined), { once: true }); + }))), + new Promise(res => setTimeout(res, Math.max(0, deadline - Date.now()))) + ]); + }, timeoutMs); +} + +async function waitForPlotly(page, timeoutMs = 20000) { + await page.evaluate(async (timeout) => { + const start = Date.now(); + const hasPlots = () => Array.from(document.querySelectorAll('.js-plotly-plot')).length > 0; + while (!hasPlots() && (Date.now() - start) < timeout) { + await new Promise(r => setTimeout(r, 200)); + } + const deadline = start + timeout; + const allReady = () => Array.from(document.querySelectorAll('.js-plotly-plot')).every(el => el.querySelector('svg.main-svg')); + while (!allReady() && Date.now() < deadline) { + await new Promise(r => setTimeout(r, 200)); + } + }, timeoutMs); +} + +async function waitForD3(page, timeoutMs = 20000) { + await page.evaluate(async (timeout) => { + const start = Date.now(); + const isReady = () => { + const hero = document.querySelector('.hero-banner'); + if (hero) { + return !!hero.querySelector('svg circle, svg path, svg rect, svg g'); + } + const containers = [ + ...Array.from(document.querySelectorAll('.d3-line')), + 
...Array.from(document.querySelectorAll('.d3-bar')) + ]; + if (!containers.length) return true; + return containers.every(c => c.querySelector('svg circle, svg path, svg rect, svg g')); + }; + while (!isReady() && (Date.now() - start) < timeout) { + await new Promise(r => setTimeout(r, 200)); + } + }, timeoutMs); +} + +async function waitForStableLayout(page, timeoutMs = 5000) { + const start = Date.now(); + let last = await page.evaluate(() => document.scrollingElement ? document.scrollingElement.scrollHeight : document.body.scrollHeight); + let stableCount = 0; + while ((Date.now() - start) < timeoutMs && stableCount < 3) { + await page.waitForTimeout(250); + const now = await page.evaluate(() => document.scrollingElement ? document.scrollingElement.scrollHeight : document.body.scrollHeight); + if (now === last) stableCount += 1; else { stableCount = 0; last = now; } + } +} + +// ============================================================================ +// Script principal +// ============================================================================ + +async function main() { + const cwd = process.cwd(); + const port = Number(process.env.PREVIEW_PORT || 8080); + const baseUrl = `http://127.0.0.1:${port}/`; + const args = parseArgs(process.argv); + + const theme = (args.theme === 'dark' || args.theme === 'light') ? 
args.theme : 'light'; + const format = args.format || 'A4'; + const wait = args.wait || 'full'; + + let outFileBase = (args.filename && String(args.filename).replace(/\.pdf$/i, '')) || 'article-book'; + + // Build if needed + const distDir = resolve(cwd, 'dist'); + let hasDist = false; + try { + const st = await fs.stat(distDir); + hasDist = st && st.isDirectory(); + } catch { } + + if (!hasDist) { + console.log('📦 Building Astro site…'); + await run('npm', ['run', 'build']); + } else { + console.log('✓ Using existing dist/ build'); + } + + console.log('🚀 Starting Astro preview server…'); + const preview = spawn('npm', ['run', 'preview'], { cwd, stdio: 'inherit', detached: true }); + const previewExit = new Promise((resolvePreview) => { + preview.on('close', (code, signal) => resolvePreview({ code, signal })); + }); + + try { + await waitForServer(baseUrl, 60000); + console.log('✓ Server ready'); + + console.log('📖 Launching browser with Paged.js…'); + const browser = await chromium.launch({ headless: true }); + + try { + const context = await browser.newContext(); + + // Apply the theme + await context.addInitScript((desired) => { + try { + localStorage.setItem('theme', desired); + if (document && document.documentElement) { + document.documentElement.dataset.theme = desired; + } + } catch { } + }, theme); + + const page = await context.newPage(); + + // Wide viewport for the content + await page.setViewportSize({ width: 1200, height: 1600 }); + + console.log('📄 Loading page…'); + await page.goto(baseUrl, { waitUntil: 'load', timeout: 60000 }); + + // Wait for the libraries + try { await page.waitForFunction(() => !!window.Plotly, { timeout: 8000 }); } catch { } + try { await page.waitForFunction(() => !!window.d3, { timeout: 8000 }); } catch { } + + // Resolve the output filename + if (!args.filename) { + const fromBtn = await page.evaluate(() => { + const btn = document.getElementById('download-pdf-btn'); + const f = btn ?
btn.getAttribute('data-pdf-filename') : null; + return f || ''; + }); + if (fromBtn) { + outFileBase = String(fromBtn).replace(/\.pdf$/i, '') + '-book'; + } else { + const title = await page.evaluate(() => { + const h1 = document.querySelector('h1.hero-title'); + const t = h1 ? h1.textContent : document.title; + return (t || '').replace(/\s+/g, ' ').trim(); + }); + outFileBase = slugify(title) + '-book'; + } + } + + // Wait for the content to render + if (wait === 'images' || wait === 'full') { + console.log('⏳ Waiting for images…'); + await waitForImages(page); + } + if (wait === 'd3' || wait === 'full') { + console.log('⏳ Waiting for D3…'); + await waitForD3(page); + } + if (wait === 'plotly' || wait === 'full') { + console.log('⏳ Waiting for Plotly…'); + await waitForPlotly(page); + } + if (wait === 'full') { + await waitForStableLayout(page); + } + + // Enable print mode BEFORE injecting Paged.js + await page.emulateMedia({ media: 'print' }); + + console.log('📚 Injecting Paged.js…'); + + // Inject the book CSS + const bookCssPath = resolve(__dirname, '..', 'src', 'styles', '_print-book.css'); + const bookCss = await fs.readFile(bookCssPath, 'utf-8'); + await page.addStyleTag({ content: bookCss }); + + // Inject Paged.js from node_modules + const pagedJsPath = resolve(cwd, 'node_modules', 'pagedjs', 'dist', 'paged.polyfill.js'); + await page.addScriptTag({ path: pagedJsPath }); + + console.log('⏳ Running Paged.js pagination…'); + + // Run the pagination with Paged.Previewer + await page.evaluate(async () => { + if (window.Paged && window.Paged.Previewer) { + const previewer = new window.Paged.Previewer(); + await previewer.preview(); + } + }); + + // Wait for the pages to be created + await page.waitForFunction(() => { + const pages = document.querySelectorAll('.pagedjs_page'); + return pages && pages.length > 0; + }, { timeout: 60000 }); + + // Short delay to make sure everything has settled + await page.waitForTimeout(2000); + + console.log('✓
Pagination complete'); + + // Pagination details + const pageInfo = await page.evaluate(() => { + const pages = document.querySelectorAll('.pagedjs_page'); + return { + totalPages: pages.length, + hasContent: pages.length > 0 + }; + }); + + console.log(`📄 Generated ${pageInfo.totalPages} pages`); + + // Generate the PDF + const outPath = resolve(cwd, 'dist', `${outFileBase}.pdf`); + + console.log('🖨️ Generating PDF…'); + + await page.pdf({ + path: outPath, + format, + printBackground: true, + preferCSSPageSize: true, // Important: honors the CSS @page size + margin: { top: 0, right: 0, bottom: 0, left: 0 } // Margins are handled by CSS + }); + + console.log(`✅ PDF generated: ${outPath}`); + + // Copy into public/ + const publicPath = resolve(cwd, 'public', `${outFileBase}.pdf`); + try { + await fs.mkdir(resolve(cwd, 'public'), { recursive: true }); + await fs.copyFile(outPath, publicPath); + console.log(`✅ PDF copied to: ${publicPath}`); + } catch (e) { + console.warn('⚠️ Unable to copy PDF to public/:', e?.message || e); + } + + } finally { + await browser.close(); + } + + } finally { + // Stop the preview server + console.log('🛑 Stopping preview server…'); + try { + if (process.platform !== 'win32') { + try { process.kill(-preview.pid, 'SIGINT'); } catch { } + } + try { preview.kill('SIGINT'); } catch { } + await Promise.race([previewExit, delay(3000)]); + + if (!preview.killed) { + try { + if (process.platform !== 'win32') { + try { process.kill(-preview.pid, 'SIGKILL'); } catch { } + } + try { preview.kill('SIGKILL'); } catch { } + } catch { } + await Promise.race([previewExit, delay(1000)]); + } + } catch { } + } + + console.log(''); + console.log('╔═══════════════════════════════════════════════════════════════╗'); + console.log('║ 📚 PDF BOOK GENERATED!
📚 ║'); + console.log('╚═══════════════════════════════════════════════════════════════╝'); + console.log(''); +} + +main().catch((err) => { + console.error('❌ Error:', err); + process.exit(1); +}); + diff --git a/app/scripts/export-pdf.mjs b/app/scripts/export-pdf.mjs new file mode 100644 index 0000000000000000000000000000000000000000..4c36e5ba264e88dfdf2cd35174394ebe5d6114c6 --- /dev/null +++ b/app/scripts/export-pdf.mjs @@ -0,0 +1,554 @@ +#!/usr/bin/env node +import { spawn } from 'node:child_process'; +import { setTimeout as delay } from 'node:timers/promises'; +import { chromium } from 'playwright'; +import { resolve } from 'node:path'; +import { promises as fs } from 'node:fs'; +import process from 'node:process'; + +async function run(command, args = [], options = {}) { + return new Promise((resolvePromise, reject) => { + const child = spawn(command, args, { stdio: 'inherit', shell: false, ...options }); + child.on('error', reject); + child.on('exit', (code) => { + if (code === 0) resolvePromise(undefined); + else reject(new Error(`${command} ${args.join(' ')} exited with code ${code}`)); + }); + }); +} + +async function waitForServer(url, timeoutMs = 60000) { + const start = Date.now(); + while (Date.now() - start < timeoutMs) { + try { + const res = await fetch(url); + if (res.ok) return; + } catch { } + await delay(500); + } + throw new Error(`Server did not start in time: ${url}`); +} + +function parseArgs(argv) { + const out = {}; + for (const arg of argv.slice(2)) { + if (!arg.startsWith('--')) continue; + const [k, v] = arg.replace(/^--/, '').split('='); + out[k] = v === undefined ? 
true : v; + } + return out; +} + +function slugify(text) { + return String(text || '') + .normalize('NFKD') + .replace(/\p{Diacritic}+/gu, '') + .toLowerCase() + .replace(/[^a-z0-9]+/g, '-') + .replace(/^-+|-+$/g, '') + .slice(0, 120) || 'article'; +} + +function parseMargin(margin) { + if (!margin) return { top: '12mm', right: '12mm', bottom: '16mm', left: '12mm' }; + const parts = String(margin).split(',').map(s => s.trim()).filter(Boolean); + if (parts.length === 1) { + return { top: parts[0], right: parts[0], bottom: parts[0], left: parts[0] }; + } + if (parts.length === 2) { + return { top: parts[0], right: parts[1], bottom: parts[0], left: parts[1] }; + } + if (parts.length === 3) { + return { top: parts[0], right: parts[1], bottom: parts[2], left: parts[1] }; + } + return { top: parts[0] || '12mm', right: parts[1] || '12mm', bottom: parts[2] || '16mm', left: parts[3] || '12mm' }; +} + +function cssLengthToMm(val) { + if (!val) return 0; + const s = String(val).trim(); + if (/mm$/i.test(s)) return parseFloat(s); + if (/cm$/i.test(s)) return parseFloat(s) * 10; + if (/in$/i.test(s)) return parseFloat(s) * 25.4; + if (/px$/i.test(s)) return (parseFloat(s) / 96) * 25.4; // 96 CSS px per inch + const num = parseFloat(s); + return Number.isFinite(num) ? 
num : 0; // assume mm if unitless +} + +function getFormatSizeMm(format) { + const f = String(format || 'A4').toLowerCase(); + switch (f) { + case 'letter': return { w: 215.9, h: 279.4 }; + case 'legal': return { w: 215.9, h: 355.6 }; + case 'a3': return { w: 297, h: 420 }; + case 'tabloid': return { w: 279.4, h: 431.8 }; + case 'a4': + default: return { w: 210, h: 297 }; + } +} + +async function waitForImages(page, timeoutMs = 15000) { + await page.evaluate(async (timeout) => { + const deadline = Date.now() + timeout; + const imgs = Array.from(document.images || []); + const unloaded = imgs.filter(img => !img.complete || (img.naturalWidth === 0)); + await Promise.race([ + Promise.all(unloaded.map(img => new Promise(res => { + if (img.complete && img.naturalWidth !== 0) return res(undefined); + img.addEventListener('load', () => res(undefined), { once: true }); + img.addEventListener('error', () => res(undefined), { once: true }); + }))), + new Promise(res => setTimeout(res, Math.max(0, deadline - Date.now()))) + ]); + }, timeoutMs); +} + +async function waitForPlotly(page, timeoutMs = 20000) { + try { + await page.evaluate(async (timeout) => { + const start = Date.now(); + const hasPlots = () => Array.from(document.querySelectorAll('.js-plotly-plot')).length > 0; + // Wait until plots exist or timeout + while (!hasPlots() && (Date.now() - start) < timeout) { + await new Promise(r => setTimeout(r, 200)); + } + const deadline = start + timeout; + // Then wait until each plot contains the main svg + const allReady = () => Array.from(document.querySelectorAll('.js-plotly-plot')).every(el => el.querySelector('svg.main-svg')); + while (!allReady() && Date.now() < deadline) { + await new Promise(r => setTimeout(r, 200)); + } + console.log('Plotly ready or timeout'); + }, timeoutMs); + } catch (e) { + console.warn('waitForPlotly timeout or error:', e.message); + } +} + +async function waitForD3(page, timeoutMs = 20000) { + try { + await page.evaluate(async (timeout) => { 
+ const start = Date.now(); + const isReady = () => { + // Prioritize hero banner if present (generic container) + const hero = document.querySelector('.hero-banner'); + if (hero) { + return !!hero.querySelector('svg circle, svg path, svg rect, svg g'); + } + // Else require all D3 containers on page to have shapes + const containers = [ + ...Array.from(document.querySelectorAll('.d3-line')), + ...Array.from(document.querySelectorAll('.d3-bar')) + ]; + if (!containers.length) return true; + return containers.every(c => c.querySelector('svg circle, svg path, svg rect, svg g')); + }; + while (!isReady() && (Date.now() - start) < timeout) { + await new Promise(r => setTimeout(r, 200)); + } + console.log('D3 ready or timeout'); + }, timeoutMs); + } catch (e) { + console.warn('waitForD3 timeout or error:', e.message); + } +} + +async function waitForStableLayout(page, timeoutMs = 5000) { + const start = Date.now(); + let last = await page.evaluate(() => document.scrollingElement ? document.scrollingElement.scrollHeight : document.body.scrollHeight); + let stableCount = 0; + while ((Date.now() - start) < timeoutMs && stableCount < 3) { + await page.waitForTimeout(250); + const now = await page.evaluate(() => document.scrollingElement ? document.scrollingElement.scrollHeight : document.body.scrollHeight); + if (now === last) stableCount += 1; else { stableCount = 0; last = now; } + } +} + +async function main() { + const cwd = process.cwd(); + const port = Number(process.env.PREVIEW_PORT || 8080); + const baseUrl = `http://127.0.0.1:${port}/`; + const args = parseArgs(process.argv); + // Default: light (do not rely on env vars implicitly) + const theme = (args.theme === 'dark' || args.theme === 'light') ? 
args.theme : 'light'; + const format = args.format || 'A4'; + const margin = parseMargin(args.margin); + const wait = (args.wait || 'full'); // 'networkidle' | 'images' | 'plotly' | 'full' + const bookMode = !!args.book; // Enable book mode with --book + + // filename can be provided, else computed from DOM (button) or page title later + let outFileBase = (args.filename && String(args.filename).replace(/\.pdf$/i, '')) || 'article'; + + // Build only if dist/ does not exist + const distDir = resolve(cwd, 'dist'); + let hasDist = false; + try { + const st = await fs.stat(distDir); + hasDist = st && st.isDirectory(); + } catch { } + if (!hasDist) { + console.log('> Building Astro site…'); + await run('npm', ['run', 'build']); + } else { + console.log('> Skipping build (dist/ exists)…'); + } + + console.log('> Starting Astro preview…'); + // Start preview in its own process group so we can terminate all children reliably + const preview = spawn('npm', ['run', 'preview'], { cwd, stdio: 'inherit', detached: true }); + const previewExit = new Promise((resolvePreview) => { + preview.on('close', (code, signal) => resolvePreview({ code, signal })); + }); + + try { + await waitForServer(baseUrl, 60000); + console.log('> Server ready, generating PDF…'); + + const browser = await chromium.launch({ headless: true }); + try { + const context = await browser.newContext(); + await context.addInitScript((desired) => { + try { + localStorage.setItem('theme', desired); + // Apply theme immediately to avoid flashes + if (document && document.documentElement) { + document.documentElement.dataset.theme = desired; + } + } catch { } + }, theme); + const page = await context.newPage(); + // Pre-fit viewport width to printable width so charts size correctly + const fmt = getFormatSizeMm(format); + const mw = fmt.w - cssLengthToMm(margin.left) - cssLengthToMm(margin.right); + const printableWidthPx = Math.max(320, Math.round((mw / 25.4) * 96)); + await page.setViewportSize({ width:
printableWidthPx, height: 1200 }); + await page.goto(baseUrl, { waitUntil: 'load', timeout: 60000 }); + // Give time for CDN scripts (Plotly/D3) to attach and for our fragment hooks to run + try { await page.waitForFunction(() => !!window.Plotly, { timeout: 8000 }); } catch { } + try { await page.waitForFunction(() => !!window.d3, { timeout: 8000 }); } catch { } + // Prefer explicit filename from the download button if present + if (!args.filename) { + const fromBtn = await page.evaluate(() => { + const btn = document.getElementById('download-pdf-btn'); + const f = btn ? btn.getAttribute('data-pdf-filename') : null; + return f || ''; + }); + if (fromBtn) { + outFileBase = String(fromBtn).replace(/\.pdf$/i, ''); + } else { + // Fallback: compute slug from hero title or document.title + const title = await page.evaluate(() => { + const h1 = document.querySelector('h1.hero-title'); + const t = h1 ? h1.textContent : document.title; + return (t || '').replace(/\s+/g, ' ').trim(); + }); + outFileBase = slugify(title); + } + // Append the -book suffix in book mode + if (bookMode) { + outFileBase += '-book'; + } + } + + // Wait for render readiness + if (wait === 'images' || wait === 'full') { + console.log('⏳ Waiting for images…'); + await waitForImages(page); + } + if (wait === 'd3' || wait === 'full') { + console.log('⏳ Waiting for D3…'); + await waitForD3(page); + } + if (wait === 'plotly' || wait === 'full') { + console.log('⏳ Waiting for Plotly…'); + await waitForPlotly(page); + } + if (wait === 'full') { + console.log('⏳ Waiting for stable layout…'); + await waitForStableLayout(page); + } + + // Book mode: open all accordions + if (bookMode) { + console.log('📂 Opening all accordions for book mode…'); + await page.evaluate(() => { + const accordions = document.querySelectorAll('details.accordion, details'); + accordions.forEach((accordion) => { + if (!accordion.hasAttribute('open')) { + accordion.setAttribute('open', ''); + const wrapper =
accordion.querySelector('.accordion__content-wrapper'); + if (wrapper) { + wrapper.style.height = 'auto'; + wrapper.style.overflow = 'visible'; + } + } + }); + }); + await waitForStableLayout(page, 2000); + } + + await page.emulateMedia({ media: 'print' }); + + // Enforce responsive sizing for SVG/iframes by removing hard attrs and injecting CSS (top-level and inside same-origin iframes) + try { + await page.evaluate(() => { + function isSmallSvg(svg) { + try { + const vb = svg && svg.viewBox && svg.viewBox.baseVal ? svg.viewBox.baseVal : null; + if (vb && vb.width && vb.height && vb.width <= 50 && vb.height <= 50) return true; + const r = svg.getBoundingClientRect && svg.getBoundingClientRect(); + if (r && r.width && r.height && r.width <= 50 && r.height <= 50) return true; + } catch { } + return false; + } + function lockSmallSvgSize(svg) { + try { + const r = svg.getBoundingClientRect ? svg.getBoundingClientRect() : null; + const w = (r && r.width) ? Math.round(r.width) : null; + const h = (r && r.height) ? 
Math.round(r.height) : null; + if (w) svg.style.setProperty('width', w + 'px', 'important'); + if (h) svg.style.setProperty('height', h + 'px', 'important'); + svg.style.setProperty('max-width', 'none', 'important'); + } catch { } + } + function fixSvg(svg) { + if (!svg) return; + // Do not alter hero banner SVG sizing; it may rely on explicit width/height + try { if (svg.closest && svg.closest('.hero-banner')) return; } catch { } + if (isSmallSvg(svg)) { lockSmallSvgSize(svg); return; } + try { svg.removeAttribute('width'); } catch { } + try { svg.removeAttribute('height'); } catch { } + svg.style.maxWidth = '100%'; + svg.style.width = '100%'; + svg.style.height = 'auto'; + if (!svg.getAttribute('preserveAspectRatio')) svg.setAttribute('preserveAspectRatio', 'xMidYMid meet'); + } + document.querySelectorAll('svg').forEach(fixSvg); + document.querySelectorAll('.mermaid, .mermaid svg').forEach((el) => { + if (el.tagName && el.tagName.toLowerCase() === 'svg') fixSvg(el); + else { el.style.display = 'block'; el.style.width = '100%'; el.style.maxWidth = '100%'; } + }); + document.querySelectorAll('iframe, embed, object').forEach((el) => { + el.style.width = '100%'; + el.style.maxWidth = '100%'; + try { el.removeAttribute('width'); } catch { } + // Best-effort inject into same-origin frames + try { + const doc = el.contentDocument; // available for iframes and objects alike + if (doc && doc.head) { + const s = doc.createElement('style'); + s.textContent = 'html,body{overflow-x:hidden;} svg,canvas,img,video{max-width:100%!important;height:auto!important;} svg[width]{width:100%!important}'; + doc.head.appendChild(s); + doc.querySelectorAll('svg').forEach((svg) => { if (isSmallSvg(svg)) lockSmallSvgSize(svg); else fixSvg(svg); }); + } + } catch (_) { /* cross-origin; ignore */ } + }); + }); + } catch { } + + // Generate OG thumbnail (1200x630) + try { + const ogW = 1200, ogH = 630; + await page.setViewportSize({ width: ogW, height: ogH }); + // Give layout a tick to adjust + await page.waitForTimeout(200); + // Ensure layout & D3 re-rendered after viewport change + await page.evaluate(() => { window.scrollTo(0, 0); window.dispatchEvent(new Event('resize')); }); + try { await waitForD3(page, 8000); } catch { } + + // Temporarily improve visibility for light theme thumbnails + // by forcing normal blend mode for the hero points + const cssHandle = await page.addStyleTag({ + content: ` + .hero .points { mix-blend-mode: normal !important; } + ` }); + const thumbPath = resolve(cwd, 'dist', 'thumb.auto.jpg'); + await page.screenshot({ path: thumbPath, type: 'jpeg', quality: 85, fullPage: false }); + // Also emit PNG for compatibility if needed + const thumbPngPath = resolve(cwd, 'dist', 'thumb.auto.png'); + await page.screenshot({ path: thumbPngPath, type: 'png', fullPage: false }); + const publicThumb = resolve(cwd, 'public', 'thumb.auto.jpg'); + const publicThumbPng = resolve(cwd, 'public', 'thumb.auto.png'); + try { await fs.copyFile(thumbPath, publicThumb); } catch { } + try { await fs.copyFile(thumbPngPath, publicThumbPng); } catch { } + // Remove temporary style so PDF is unaffected + try { await cssHandle.evaluate((el) => el.remove()); } catch { } + console.log(`✅ OG thumbnail generated: ${thumbPath}`); + } catch (e) { + console.warn('Unable to 
generate OG thumbnail:', e?.message || e); + } + const outPath = resolve(cwd, 'dist', `${outFileBase}.pdf`); + // Restore viewport to printable width before PDF (thumbnail changed it) + try { + const fmt2 = getFormatSizeMm(format); + const mw2 = fmt2.w - cssLengthToMm(margin.left) - cssLengthToMm(margin.right); + const printableWidthPx2 = Math.max(320, Math.round((mw2 / 25.4) * 96)); + await page.setViewportSize({ width: printableWidthPx2, height: 1400 }); + await page.evaluate(() => { window.scrollTo(0, 0); window.dispatchEvent(new Event('resize')); }); + try { await waitForD3(page, 8000); } catch { } + await waitForStableLayout(page); + // Re-apply responsive fixes after viewport change + try { + await page.evaluate(() => { + function isSmallSvg(svg) { + try { + const vb = svg && svg.viewBox && svg.viewBox.baseVal ? svg.viewBox.baseVal : null; + if (vb && vb.width && vb.height && vb.width <= 50 && vb.height <= 50) return true; + const r = svg.getBoundingClientRect && svg.getBoundingClientRect(); + if (r && r.width && r.height && r.width <= 50 && r.height <= 50) return true; + } catch { } + return false; + } + function lockSmallSvgSize(svg) { + try { + const r = svg.getBoundingClientRect ? svg.getBoundingClientRect() : null; + const w = (r && r.width) ? Math.round(r.width) : null; + const h = (r && r.height) ? 
Math.round(r.height) : null; + if (w) svg.style.setProperty('width', w + 'px', 'important'); + if (h) svg.style.setProperty('height', h + 'px', 'important'); + svg.style.setProperty('max-width', 'none', 'important'); + } catch { } + } + function fixSvg(svg) { + if (!svg) return; + // Do not alter hero banner SVG sizing; it may rely on explicit width/height + try { if (svg.closest && svg.closest('.hero-banner')) return; } catch { } + if (isSmallSvg(svg)) { lockSmallSvgSize(svg); return; } + try { svg.removeAttribute('width'); } catch { } + try { svg.removeAttribute('height'); } catch { } + svg.style.maxWidth = '100%'; + svg.style.width = '100%'; + svg.style.height = 'auto'; + if (!svg.getAttribute('preserveAspectRatio')) svg.setAttribute('preserveAspectRatio', 'xMidYMid meet'); + } + document.querySelectorAll('svg').forEach((svg) => { if (isSmallSvg(svg)) lockSmallSvgSize(svg); else fixSvg(svg); }); + document.querySelectorAll('.mermaid, .mermaid svg').forEach((el) => { + if (el.tagName && el.tagName.toLowerCase() === 'svg') fixSvg(el); + else { el.style.display = 'block'; el.style.width = '100%'; el.style.maxWidth = '100%'; } + }); + document.querySelectorAll('iframe, embed, object').forEach((el) => { + el.style.width = '100%'; + el.style.maxWidth = '100%'; + try { el.removeAttribute('width'); } catch { } + try { + const doc = el.contentDocument; // available for iframes and objects alike + if (doc && doc.head) { + const s = doc.createElement('style'); + s.textContent = 'html,body{overflow-x:hidden;} svg,canvas,img,video{max-width:100%!important;height:auto!important;} svg[width]{width:100%!important}'; + doc.head.appendChild(s); + doc.querySelectorAll('svg').forEach((svg) => { if (isSmallSvg(svg)) lockSmallSvgSize(svg); else fixSvg(svg); }); + } + } catch (_) { } + }); + }); + } catch { } + } catch { } + + // Inject styles for PDF + let pdfCssHandle = null; + try { + if (bookMode) { + // Book mode: inject the full book CSS + console.log('📚 Applying book styles…'); + const bookCssPath = resolve(cwd, 'src', 'styles', '_print-book.css'); + const bookCss = await fs.readFile(bookCssPath, 'utf-8'); + pdfCssHandle = await page.addStyleTag({ content: bookCss }); + await page.waitForTimeout(500); + } else { + // Normal mode: baseline responsive styles + pdfCssHandle = await page.addStyleTag({ + content: ` + /* General container safety */ + html, body { overflow-x: hidden !important; } + + /* Make all vector/bitmap media responsive for print */ + svg, canvas, img, video { max-width: 100% !important; height: auto !important; } + /* Mermaid diagrams */ + .mermaid, .mermaid svg { display: block; width: 100% !important; max-width: 100% !important; height: auto !important; } + /* Any explicit width attributes */ + svg[width] { width: 100% !important; } + /* Iframes and similar embeds */ + iframe, embed, object { width: 100% !important; max-width: 100% !important; height: auto; } + + /* HtmlEmbed wrappers (defensive) */ + .html-embed, .html-embed__card { max-width: 100% !important; width: 100% !important; } + .html-embed__card > div[id^="frag-"] { width: 100% !important; max-width: 100% !important; } + + /* Banner centering & visibility */ + .hero .points { mix-blend-mode: normal !important; } + /* Do NOT force a fixed height to avoid clipping in PDF */ + .hero-banner { width: 100% !important; max-width: 980px 
!important; margin-left: auto !important; margin-right: auto !important; } + /* Generalize banner styles for all banner types */ + .hero-banner svg, + .hero-banner canvas, + .hero-banner .d3-galaxy, + .hero-banner .threejs-galaxy, + .hero-banner .d3-latent-space, + .hero-banner .neural-flow, + .hero-banner .molecular-space, + .hero-banner [class*="banner"] { + width: 100% !important; + height: auto !important; + max-width: 980px !important; + } + ` }); + } + } catch { } + await page.pdf({ + path: outPath, + format, + printBackground: true, + displayHeaderFooter: false, + preferCSSPageSize: false, + margin: bookMode ? { + top: '20mm', + right: '20mm', + bottom: '25mm', + left: '25mm' + } : margin + }); + try { if (pdfCssHandle) await pdfCssHandle.evaluate((el) => el.remove()); } catch { } + console.log(`✅ PDF generated: ${outPath}`); + + // Copy into public only under the slugified name + const publicSlugPath = resolve(cwd, 'public', `${outFileBase}.pdf`); + try { + await fs.mkdir(resolve(cwd, 'public'), { recursive: true }); + await fs.copyFile(outPath, publicSlugPath); + console.log(`✅ PDF copied to: ${publicSlugPath}`); + } catch (e) { + console.warn('Unable to copy PDF to public/:', e?.message || e); + } + } finally { + await browser.close(); + } + } finally { + // Try a clean shutdown of preview (entire process group first) + try { + if (process.platform !== 'win32') { + try { process.kill(-preview.pid, 'SIGINT'); } catch { } + } + try { preview.kill('SIGINT'); } catch { } + await Promise.race([previewExit, delay(3000)]); + // Force kill if still alive + if (!preview.killed) { + try { + if (process.platform !== 'win32') { + try { process.kill(-preview.pid, 'SIGKILL'); } catch { } + } + try { preview.kill('SIGKILL'); } catch { } + } catch { } + await Promise.race([previewExit, delay(1000)]); + } + } catch { } + } +} + +main().catch((err) => { + console.error(err); + process.exit(1); +}); + + diff --git 
a/app/scripts/generate-trackio-data.mjs b/app/scripts/generate-trackio-data.mjs new file mode 100644 index 0000000000000000000000000000000000000000..cbac5cb711cd5765e00c985d5c92d8eb1251631c --- /dev/null +++ b/app/scripts/generate-trackio-data.mjs @@ -0,0 +1,196 @@ +#!/usr/bin/env node + +// Generate synthetic Trackio-like CSV data with realistic ML curves. +// - Steps are simple integers (e.g., 1..N) +// - Metrics: epoch, train_accuracy, val_accuracy, train_loss, val_loss +// - W&B-like run names (e.g., pleasant-flower-1) +// - Deterministic with --seed +// +// Usage: +// node app/scripts/generate-trackio-data.mjs \ +// --runs 3 \ +// --steps 10 \ +// --out app/src/content/assets/data/trackio_wandb_synth.csv \ +// [--seed 42] [--epoch-max 3.0] [--amount 1.0] [--start 1] +// +// To overwrite the demo file used by the embed: +// node app/scripts/generate-trackio-data.mjs --runs 3 --steps 10 --out app/src/content/assets/data/trackio_wandb_demo.csv --seed 1337 + +import fs from 'node:fs/promises'; +import path from 'node:path'; + +function parseArgs(argv){ + const args = { runs: 3, steps: 10, out: '', seed: undefined, epochMax: 3.0, amount: 1, start: 1 }; + for (let i = 2; i < argv.length; i++){ + const a = argv[i]; + if (a === '--runs' && argv[i+1]) { args.runs = Math.max(1, parseInt(argv[++i], 10) || 3); continue; } + if (a === '--steps' && argv[i+1]) { args.steps = Math.max(2, parseInt(argv[++i], 10) || 10); continue; } + if (a === '--out' && argv[i+1]) { args.out = argv[++i]; continue; } + if (a === '--seed' && argv[i+1]) { args.seed = Number(argv[++i]); continue; } + if (a === '--epoch-max' && argv[i+1]) { args.epochMax = Number(argv[++i]) || 3.0; continue; } + if (a === '--amount' && argv[i+1]) { args.amount = Number(argv[++i]) || 1.0; continue; } + if (a === '--start' && argv[i+1]) { args.start = parseInt(argv[++i], 10) || 1; continue; } + } + if (!args.out) { + args.out = path.join('app', 'src', 'content', 'assets', 'data', 'trackio_wandb_synth.csv'); + } + 
return args; +} + +function mulberry32(seed){ + let t = seed >>> 0; + return function(){ + t += 0x6D2B79F5; + let r = Math.imul(t ^ (t >>> 15), 1 | t); + r ^= r + Math.imul(r ^ (r >>> 7), 61 | r); + return ((r ^ (r >>> 14)) >>> 0) / 4294967296; + }; +} + +function makeRng(seed){ + if (Number.isFinite(seed)) return mulberry32(seed); + return Math.random; +} + +function randn(rng){ + // Box-Muller transform + let u = 0, v = 0; + while (u === 0) u = rng(); + while (v === 0) v = rng(); + return Math.sqrt(-2.0 * Math.log(u)) * Math.cos(2.0 * Math.PI * v); +} + +function clamp(x, lo, hi){ + return Math.max(lo, Math.min(hi, x)); +} + +function logistic(t, k=6, x0=0.5){ + // 1 / (1 + e^{-k (t - x0)}) in [0,1] + return 1 / (1 + Math.exp(-k * (t - x0))); +} + +function expDecay(t, k=3){ + // (1 - e^{-k t}) in [0,1] + return 1 - Math.exp(-k * t); +} + +function pick(array, rng){ + return array[Math.floor(rng() * array.length) % array.length]; +} + +function buildRunNames(count, rng){ + const adjectives = [ + 'pleasant','brisk','silent','ancient','bold','gentle','rapid','shy','curious','lively', + 'fearless','soothing','glossy','hidden','misty','bright','calm','keen','noble','swift' + ]; + const nouns = [ + 'flower','glade','sky','river','forest','ember','comet','meadow','harbor','dawn', + 'mountain','prairie','breeze','valley','lagoon','desert','monsoon','reef','thunder','willow' + ]; + const names = new Set(); + let attempts = 0; + while (names.size < count && attempts < count * 20){ + attempts++; + const left = pick(adjectives, rng); + const right = pick(nouns, rng); + const idx = 1 + Math.floor(rng() * 9); + names.add(`${left}-${right}-${idx}`); + } + return Array.from(names); +} + +function formatLike(value, decimals){ + return Number.isFinite(decimals) && decimals >= 0 ? value.toFixed(decimals) : String(value); +} + +async function main(){ + const args = parseArgs(process.argv); + const rng = makeRng(args.seed); + + // Steps: integers from start .. 
start+steps-1 + const steps = Array.from({ length: args.steps }, (_, i) => args.start + i); + const stepNorm = (i) => (i - steps[0]) / (steps[steps.length-1] - steps[0]); + + const runs = buildRunNames(args.runs, rng); + + // Per-run slight variations + const runParams = runs.map((_r, idx) => { + const r = rng(); + // Final accuracies + const trainAccFinal = clamp(0.86 + (r - 0.5) * 0.12 * args.amount, 0.78, 0.97); + const valAccFinal = clamp(trainAccFinal - (0.02 + rng() * 0.05), 0.70, 0.95); + // Loss plateau + const lossStart = 7.0 + (rng() - 0.5) * 0.10 * args.amount; // ~7.0 ±0.05 + const lossPlateau = 6.78 + (rng() - 0.5) * 0.04 * args.amount; // ~6.78 ±0.02 + const lossK = 2.0 + rng() * 1.5; // decay speed + // Acc growth steepness and midpoint + const kAcc = 4.5 + rng() * 3.0; + const x0Acc = 0.35 + rng() * 0.25; + return { trainAccFinal, valAccFinal, lossStart, lossPlateau, lossK, kAcc, x0Acc }; + }); + + const lines = []; + lines.push('run,step,metric,value,stderr'); + + // EPOCH: linear 0..epochMax across steps + for (let r = 0; r < runs.length; r++){ + const run = runs[r]; + for (let i = 0; i < steps.length; i++){ + const t = stepNorm(steps[i]); + const epoch = args.epochMax * t; + lines.push(`${run},${steps[i]},epoch,${formatLike(epoch, 2)},`); + } + } + + // TRAIN LOSS & VAL LOSS + for (let r = 0; r < runs.length; r++){ + const run = runs[r]; + const p = runParams[r]; + let prevTrain = null; + let prevVal = null; + for (let i = 0; i < steps.length; i++){ + const t = stepNorm(steps[i]); + const d = expDecay(t, p.lossK); // 0..1 + let trainLoss = p.lossStart - (p.lossStart - p.lossPlateau) * d; + let valLoss = trainLoss + 0.02 + (rng() * 0.03); + // Add mild noise + trainLoss += randn(rng) * 0.01 * args.amount; + valLoss += randn(rng) * 0.012 * args.amount; + // Keep reasonable and mostly monotonic (small upward blips allowed) + if (prevTrain != null) trainLoss = Math.min(prevTrain + 0.01, trainLoss); + if (prevVal != null) valLoss = Math.min(prevVal + 
0.012, valLoss); + prevTrain = trainLoss; prevVal = valLoss; + const stderrTrain = clamp(0.03 - 0.02 * t + Math.abs(randn(rng)) * 0.003, 0.006, 0.04); + const stderrVal = clamp(0.035 - 0.022 * t + Math.abs(randn(rng)) * 0.003, 0.008, 0.045); + lines.push(`${run},${steps[i]},train_loss,${formatLike(trainLoss, 3)},${formatLike(stderrTrain, 3)}`); + lines.push(`${run},${steps[i]},val_loss,${formatLike(valLoss, 3)},${formatLike(stderrVal, 3)}`); + } + } + + // TRAIN ACCURACY & VAL ACCURACY (logistic) + for (let r = 0; r < runs.length; r++){ + const run = runs[r]; + const p = runParams[r]; + for (let i = 0; i < steps.length; i++){ + const t = stepNorm(steps[i]); + const accBase = logistic(t, p.kAcc, p.x0Acc); + let trainAcc = clamp(0.55 + accBase * (p.trainAccFinal - 0.55), 0, 1); + let valAcc = clamp(0.52 + accBase * (p.valAccFinal - 0.52), 0, 1); + // Gentle noise + trainAcc = clamp(trainAcc + randn(rng) * 0.005 * args.amount, 0, 1); + valAcc = clamp(valAcc + randn(rng) * 0.006 * args.amount, 0, 1); + const stderrTrain = clamp(0.02 - 0.011 * t + Math.abs(randn(rng)) * 0.002, 0.006, 0.03); + const stderrVal = clamp(0.022 - 0.012 * t + Math.abs(randn(rng)) * 0.002, 0.007, 0.032); + lines.push(`${run},${steps[i]},train_accuracy,${formatLike(trainAcc, 4)},${formatLike(stderrTrain, 3)}`); + lines.push(`${run},${steps[i]},val_accuracy,${formatLike(valAcc, 4)},${formatLike(stderrVal, 3)}`); + } + } + + // Ensure directory exists + await fs.mkdir(path.dirname(args.out), { recursive: true }); + await fs.writeFile(args.out, lines.join('\n') + '\n', 'utf8'); + const relOut = path.relative(process.cwd(), args.out); + console.log(`Synthetic CSV generated: ${relOut}`); +} + +main().catch(err => { console.error(err?.stack || String(err)); process.exit(1); }); diff --git a/app/scripts/jitter-trackio-data.mjs b/app/scripts/jitter-trackio-data.mjs new file mode 100644 index 0000000000000000000000000000000000000000..ed09c7f702f5a0f4ada98a90313a449e04debee8 --- /dev/null +++ 
b/app/scripts/jitter-trackio-data.mjs @@ -0,0 +1,129 @@ +#!/usr/bin/env node + +// Jitter Trackio CSV data with small, controlled noise. +// - Preserves comments (# ...) and blank lines +// - Leaves 'epoch' values unchanged +// - Adds mild noise to train/val accuracy (clamped to [0,1]) +// - Adds mild noise to train/val loss (kept >= 0) +// - Keeps steps untouched +// Usage: +// node app/scripts/jitter-trackio-data.mjs \ +// --in app/src/content/assets/data/trackio_wandb_demo.csv \ +// --out app/src/content/assets/data/trackio_wandb_demo.jitter.csv \ +// [--seed 42] [--amount 1.0] [--in-place] + +import fs from 'node:fs/promises'; +import path from 'node:path'; + +function parseArgs(argv){ + const args = { in: '', out: '', seed: undefined, amount: 1, inPlace: false }; + for (let i = 2; i < argv.length; i++){ + const a = argv[i]; + if (a === '--in' && argv[i+1]) { args.in = argv[++i]; continue; } + if (a === '--out' && argv[i+1]) { args.out = argv[++i]; continue; } + if (a === '--seed' && argv[i+1]) { args.seed = Number(argv[++i]); continue; } + if (a === '--amount' && argv[i+1]) { args.amount = Number(argv[++i]) || 1; continue; } + if (a === '--in-place') { args.inPlace = true; continue; } + } + if (!args.in) throw new Error('--in is required'); + if (args.inPlace) args.out = args.in; + if (!args.out) { + const { dir, name, ext } = path.parse(args.in); + args.out = path.join(dir, `${name}.jitter${ext || '.csv'}`); + } + return args; +} + +function mulberry32(seed){ + let t = seed >>> 0; + return function(){ + t += 0x6D2B79F5; + let r = Math.imul(t ^ (t >>> 15), 1 | t); + r ^= r + Math.imul(r ^ (r >>> 7), 61 | r); + return ((r ^ (r >>> 14)) >>> 0) / 4294967296; + }; +} + +function makeRng(seed){ + if (Number.isFinite(seed)) return mulberry32(seed); + return Math.random; +} + +function randn(rng){ + // Box-Muller transform + let u = 0, v = 0; + while (u === 0) u = rng(); + while (v === 0) v = rng(); + return Math.sqrt(-2.0 * Math.log(u)) * Math.cos(2.0 * Math.PI * 
v); +} + +function jitterValue(metric, value, amount, rng){ + const m = metric.toLowerCase(); + if (m === 'epoch') return value; // keep as-is + if (m.includes('accuracy')){ + const n = Math.max(-0.02 * amount, Math.min(0.02 * amount, randn(rng) * 0.01 * amount)); + return Math.max(0, Math.min(1, value + n)); + } + if (m.includes('loss')){ + const n = Math.max(-0.03 * amount, Math.min(0.03 * amount, randn(rng) * 0.01 * amount)); + return Math.max(0, value + n); + } + // default: tiny noise + const n = Math.max(-0.01 * amount, Math.min(0.01 * amount, randn(rng) * 0.005 * amount)); + return value + n; +} + +function formatNumberLike(original, value){ + const s = String(original); + const dot = s.indexOf('.') + const decimals = dot >= 0 ? (s.length - dot - 1) : 0; + if (!Number.isFinite(value)) return s; + if (decimals <= 0) return String(Math.round(value)); + return value.toFixed(decimals); +} + +async function main(){ + const args = parseArgs(process.argv); + const rng = makeRng(args.seed); + const raw = await fs.readFile(args.in, 'utf8'); + const lines = raw.split(/\r?\n/); + const out = new Array(lines.length); + + for (let i = 0; i < lines.length; i++){ + const line = lines[i]; + if (!line || line.trim().length === 0) { out[i] = line; continue; } + if (/^\s*#/.test(line)) { out[i] = line; continue; } + + // Preserve header line unmodified + if (i === 0 && /^\s*run\s*,\s*step\s*,\s*metric\s*,\s*value\s*,\s*stderr\s*$/i.test(line)) { + out[i] = line; continue; + } + + const cols = line.split(','); + if (cols.length < 4) { out[i] = line; continue; } + + const [run, stepStr, metric, valueStr, stderrStr = ''] = cols; + const trimmedMetric = (metric || '').trim(); + const valueNum = Number((valueStr || '').trim()); + + if (!Number.isFinite(valueNum)) { out[i] = line; continue; } + + const jittered = jitterValue(trimmedMetric, valueNum, args.amount, rng); + const valueOut = formatNumberLike(valueStr, jittered); + + // Reassemble with original column count and positions 
+ const result = [run, stepStr, metric, valueOut, stderrStr].join(','); + out[i] = result; + } + + const finalText = out.join('\n'); + await fs.writeFile(args.out, finalText, 'utf8'); + const relIn = path.relative(process.cwd(), args.in); + const relOut = path.relative(process.cwd(), args.out); + console.log(`Jittered data written: ${relOut} (from ${relIn})`); +} + +main().catch(err => { + console.error(err?.stack || String(err)); + process.exit(1); +}); diff --git a/app/scripts/latex-importer/README.md b/app/scripts/latex-importer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4c8a36f3739569c7d033658e82937ad2da5422e6 --- /dev/null +++ b/app/scripts/latex-importer/README.md @@ -0,0 +1,169 @@ +# LaTeX Importer + +Complete LaTeX to MDX (Markdown + JSX) importer optimized for Astro with advanced support for references, interactive equations, and components. + +## 🚀 Quick Start + +```bash +# Complete LaTeX → MDX conversion with all features +node index.mjs + +# For step-by-step debugging +node latex-converter.mjs # LaTeX → Markdown +node mdx-converter.mjs # Markdown → MDX +``` + +## 📁 Structure + +``` +latex-importer/ +├── index.mjs # Complete LaTeX → MDX pipeline +├── latex-converter.mjs # LaTeX → Markdown with Pandoc +├── mdx-converter.mjs # Markdown → MDX with Astro components +├── reference-preprocessor.mjs # LaTeX references cleanup +├── post-processor.mjs # Markdown post-processing +├── bib-cleaner.mjs # Bibliography cleaner +├── filters/ +│ └── equation-ids.lua # Pandoc filter for KaTeX equations +├── input/ # LaTeX sources +│ ├── main.tex +│ ├── main.bib +│ └── sections/ +└── output/ # Results + ├── main.md # Intermediate Markdown + └── main.mdx # Final MDX for Astro +``` + +## ✨ Key Features + +### 🎯 **Smart References** +- **Invisible anchors**: Automatic conversion of `\label{}` commands into invisible HTML anchors +- **Clean links**: Identifier cleanup (`:` → `-`, removing prefixes `sec:`, `fig:`, `eq:`) +- **Cross-references**: Full support for `\ref{}` 
with functional links + +### 🧮 **Interactive Equations** +- **KaTeX IDs**: Conversion of `\label{eq:...}` to `\htmlId{id}{equation}` +- **Equation references**: Clickable links to mathematical equations +- **Advanced KaTeX support**: `trust: true` configuration for `\htmlId{}` + +### 🎨 **Automatic Styling** +- **Highlights**: `\highlight{text}` → highlight spans with a CSS class +- **Auto cleanup**: Removal of numbering `(1)`, `(2)`, etc. +- **Astro components**: Images → `Figure` with automatic imports + +### 🔧 **Robust Pipeline** +- **LaTeX preprocessor**: Reference cleanup before Pandoc +- **Lua filter**: Equation processing in Pandoc AST +- **Post-processor**: Markdown cleanup and optimization +- **MDX converter**: Final transformation with Astro components + +## 📊 Example Workflow + +```bash +# 1. Prepare LaTeX sources +cp my-paper/* input/ + +# 2. Complete automatic conversion +node index.mjs + +# 3. Generated results +ls output/ +# → main.md (Intermediate Markdown) +# → main.mdx (Final MDX for Astro) +# → assets/image/ (extracted images) +``` + +### 📋 Conversion Result + +The pipeline generates an MDX file optimized for Astro with: + +```mdx +--- +title: "Your Article Title" +description: "Generated from LaTeX" +--- + +import Figure from '../components/Figure.astro'; +import figure1 from '../assets/image/figure1.png'; + +## Section with invisible anchor + + +Here is some text with highlighted words. + +Reference to an interactive [equation](#equation-name). + +Equation with KaTeX ID: +$$\htmlId{equation-name}{E = mc^2}$$ + +
+<Figure src={figure1} alt="Converted figure" /> +``` + +(The `<Figure>` line is a representative usage of the imported image; raw HTML anchors emitted by the converter are omitted here.) + +## ⚙️ Required Astro Configuration + +To use equations with IDs, add to `astro.config.mjs`: + +```javascript +import { defineConfig } from 'astro/config'; +import rehypeKatex from 'rehype-katex'; + +export default defineConfig({ + markdown: { + rehypePlugins: [ + [rehypeKatex, { trust: true }], // ← Important for \htmlId{} + ], + }, +}); +``` + +## 🛠️ Prerequisites + +- **Node.js** with ESM support +- **Pandoc** (`brew install pandoc`) +- **Astro** to use the generated MDX + +## 🎯 Technical Architecture + +### 4-Stage Pipeline + +1. **LaTeX Preprocessing** (`reference-preprocessor.mjs`) + - Cleanup of `\label{}` and `\ref{}` + - Conversion `\highlight{}` → CSS spans + - Removal of prefixes and problematic characters + +2. **Pandoc + Lua Filter** (`equation-ids.lua`) + - LaTeX → Markdown conversion with `gfm+tex_math_dollars+raw_html` + - Equation processing: `\label{eq:name}` → `\htmlId{name}{equation}` + - Automatic image extraction + +3. **Markdown Post-processing** (`post-processor.mjs`) + - KaTeX, Unicode, grouping commands cleanup + - Correction of attributes containing `:` + - Code snippet injection + +4. 
**MDX Conversion** (`mdx-converter.mjs`) + - Image transformation → `Figure` components + - HTML span escaping correction + - Automatic import generation + - MDX frontmatter + +## 📊 Conversion Statistics + +For a typical scientific document: +- **87 labels** detected and processed +- **48 invisible anchors** created +- **13 highlight spans** with CSS class +- **4 equations** with KaTeX `\htmlId{}` +- **40 images** converted to components + +## ✅ Project Status + +### 🎉 **Complete Features** +- ✅ **LaTeX → MDX Pipeline**: Full end-to-end functional conversion +- ✅ **Cross-document references**: Perfectly functional internal links +- ✅ **Interactive equations**: KaTeX support with clickable IDs +- ✅ **Automatic styling**: Highlights and Astro components +- ✅ **Robustness**: Automatic cleanup of all escaping +- ✅ **Optimization**: Clean code without unnecessary elements + +### 🚀 **Production Ready** +The toolkit is now **100% operational** for converting complex scientific LaTeX documents to MDX/Astro with all advanced features (references, interactive equations, styling). 
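The identifier cleanup described in the README (strip `sec:`/`fig:`/`eq:` prefixes, map `:` to `-`, drop other problematic characters) can be sketched in a few lines of plain JavaScript. This is an illustrative standalone function — the actual logic lives in `reference-preprocessor.mjs` and `filters/equation-ids.lua`, and the function name here is ours:

```javascript
// Illustrative sketch of the reference-identifier cleanup rules documented
// above; not the pipeline's actual implementation.
function cleanIdentifier(id) {
  const clean = id
    .replace(/^(sec|fig|eq|equation):/, '') // remove common prefixes
    .replace(/:/g, '-')                     // replace colons with dashes
    .replace(/[^a-zA-Z0-9_-]/g, '-')        // replace other problematic chars
    .replace(/-+/g, '-')                    // collapse repeated dashes
    .replace(/^-|-$/g, '');                 // trim leading/trailing dashes
  return clean || id.replace(/:/g, '-');    // never return an empty id
}

console.log(cleanIdentifier('eq:mass_energy'));  // → mass_energy
console.log(cleanIdentifier('sec:intro:setup')); // → intro-setup
```

These cleaned identifiers are what the generated `[equation](#equation-name)` links and `\htmlId{}` IDs resolve against.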
diff --git a/app/scripts/latex-importer/bib-cleaner.mjs b/app/scripts/latex-importer/bib-cleaner.mjs new file mode 100644 index 0000000000000000000000000000000000000000..4fb409a3838a1274770f41fc8b2a1457fa7de45d --- /dev/null +++ b/app/scripts/latex-importer/bib-cleaner.mjs @@ -0,0 +1,104 @@ +#!/usr/bin/env node + +import { readFileSync, writeFileSync, existsSync } from 'fs'; +import { join, dirname, basename } from 'path'; + +/** + * Clean a BibTeX file by removing local file references and paths + * @param {string} inputBibFile - Path to the input .bib file + * @param {string} outputBibFile - Path to the output cleaned .bib file + * @returns {boolean} - Success status + */ +export function cleanBibliography(inputBibFile, outputBibFile) { + if (!existsSync(inputBibFile)) { + console.log(' ⚠️ No bibliography file found:', inputBibFile); + return false; + } + + console.log('📚 Cleaning bibliography...'); + let bibContent = readFileSync(inputBibFile, 'utf8'); + + // Remove file paths and local references + bibContent = bibContent.replace(/file = \{[^}]+\}/g, ''); + + // Remove empty lines created by file removal + bibContent = bibContent.replace(/,\s*\n\s*\n/g, '\n\n'); + bibContent = bibContent.replace(/,\s*\}/g, '\n}'); + + // Clean up double commas + bibContent = bibContent.replace(/,,/g, ','); + + // Remove trailing commas before closing braces + bibContent = bibContent.replace(/,(\s*\n\s*)\}/g, '$1}'); + + writeFileSync(outputBibFile, bibContent); + console.log(` 📄 Clean bibliography saved: ${outputBibFile}`); + + return true; +} + +/** + * CLI for bibliography cleaning + */ +function main() { + const args = process.argv.slice(2); + + if (args.includes('--help') || args.includes('-h')) { + console.log(` +📚 BibTeX Bibliography Cleaner + +Usage: + node bib-cleaner.mjs [input.bib] [output.bib] + node bib-cleaner.mjs --input=input.bib --output=output.bib + +Options: + --input=FILE Input .bib file + --output=FILE Output cleaned .bib file + --help, -h Show this help + 
+Examples: + # Clean main.bib to clean.bib + node bib-cleaner.mjs main.bib clean.bib + + # Using flags + node bib-cleaner.mjs --input=references.bib --output=clean-refs.bib +`); + process.exit(0); + } + + let inputFile, outputFile; + + // Parse command line arguments + if (args.length >= 2 && !args[0].startsWith('--')) { + // Positional arguments + inputFile = args[0]; + outputFile = args[1]; + } else { + // Named arguments + for (const arg of args) { + if (arg.startsWith('--input=')) { + inputFile = arg.split('=')[1]; + } else if (arg.startsWith('--output=')) { + outputFile = arg.split('=')[1]; + } + } + } + + if (!inputFile || !outputFile) { + console.error('❌ Both input and output files are required'); + console.log('Use --help for usage information'); + process.exit(1); + } + + const success = cleanBibliography(inputFile, outputFile); + if (success) { + console.log('🎉 Bibliography cleaning completed!'); + } else { + process.exit(1); + } +} + +// Run CLI if called directly +if (import.meta.url === `file://${process.argv[1]}`) { + main(); +} diff --git a/app/scripts/latex-importer/filters/equation-ids.lua b/app/scripts/latex-importer/filters/equation-ids.lua new file mode 100644 index 0000000000000000000000000000000000000000..c07e21b001b4686324a974dae06c9f3093a540e9 --- /dev/null +++ b/app/scripts/latex-importer/filters/equation-ids.lua @@ -0,0 +1,134 @@ +--[[ +Pandoc Lua filter to add IDs to equations using KaTeX \htmlId syntax + +This filter processes display math equations and inline math that contain +\label{} commands, and wraps them with \htmlId{clean-id}{content} for KaTeX. 
+
+Requirements:
+- KaTeX renderer with trust: true option
+- Equations with \label{} commands in LaTeX
+--]]
+
+-- Function to clean identifier strings (remove prefixes and colons)
+function clean_identifier(id_str)
+  if id_str and type(id_str) == "string" then
+    -- Remove common prefixes and replace colons with dashes
+    -- (Lua patterns have no "|" alternation, so strip each prefix separately)
+    local clean = id_str
+      :gsub("^eq:", "")             -- Remove eq: prefix
+      :gsub("^equation:", "")       -- Remove equation: prefix
+      :gsub(":", "-")               -- Replace colons with dashes
+      :gsub("[^a-zA-Z0-9_-]", "-")  -- Replace other problematic chars
+      :gsub("-+", "-")              -- Collapse multiple dashes
+      :gsub("^-", "")               -- Remove leading dash
+      :gsub("-$", "")               -- Remove trailing dash
+
+    -- Ensure we don't have empty identifiers
+    if clean == "" then
+      clean = id_str:gsub(":", "-")
+    end
+
+    return clean
+  end
+  return id_str
+end
+
+-- Process Math elements (both inline and display)
+function Math(el)
+  local math_content = el.text
+
+  -- Look for \label{...} commands in the math content
+  local label_match = math_content:match("\\label%{([^}]+)%}")
+
+  if label_match then
+    -- Clean the identifier
+    local clean_id = clean_identifier(label_match)
+
+    -- Remove the \label{} command from the math content
+    local clean_math = math_content:gsub("\\label%{[^}]+%}", "")
+
+    -- Clean up any extra whitespace or line breaks that might remain
+    clean_math = clean_math:gsub("%s*$", ""):gsub("^%s*", "")
+
+    -- Handle different equation environments appropriately
+    -- For align environments, preserve them as they work with KaTeX
+    local has_align = clean_math:match("\\begin%{align%}")
+
+    if has_align then
+      -- For align environments, we keep the structure and add ID as an attribute
+      -- KaTeX supports align environments natively
+      clean_math = clean_math:gsub("\\begin%{align%}", "\\begin{align}")
+      clean_math = clean_math:gsub("\\end%{align%}", "\\end{align}")
+    else
+      -- Remove other equation environments that don't work well with \htmlId
+      clean_math = clean_math:gsub("\\begin%{equation%}", ""):gsub("\\end%{equation%}", "")
+ clean_math = clean_math:gsub("\\begin%{equation%*%}", ""):gsub("\\end%{equation%*%}", "") + clean_math = clean_math:gsub("\\begin%{align%*%}", ""):gsub("\\end%{align%*%}", "") + end + + -- Clean up any remaining whitespace + clean_math = clean_math:gsub("%s*$", ""):gsub("^%s*", "") + + local new_math + if has_align then + -- For align environments, KaTeX doesn't support \htmlId with align + -- Instead, we add a special marker that the post-processor will convert to a span + -- This span will serve as an anchor for references + new_math = "%%ALIGN_ANCHOR_ID{" .. clean_id .. "}%%\n" .. clean_math + else + -- For other math, wrap with \htmlId{} + new_math = "\\htmlId{" .. clean_id .. "}{" .. clean_math .. "}" + end + + -- Return new Math element with the updated content + return pandoc.Math(el.mathtype, new_math) + end + + -- Return unchanged if no label found + return el +end + +-- Optional: Process RawInline elements that might contain LaTeX math +function RawInline(el) + if el.format == "latex" or el.format == "tex" then + local content = el.text + + -- Look for equation environments with labels + local label_match = content:match("\\label%{([^}]+)%}") + + if label_match then + local clean_id = clean_identifier(label_match) + + -- For raw LaTeX, we might need different handling + -- This is a simplified approach - adjust based on your needs + local clean_content = content:gsub("\\label%{[^}]+%}", "") + + if clean_content:match("\\begin%{equation") or clean_content:match("\\begin%{align") then + -- For equation environments, we might need to wrap differently + -- This depends on how your KaTeX setup handles equation environments + return pandoc.RawInline(el.format, clean_content) + end + end + end + + return el +end + +-- Optional: Process RawBlock elements for display equations +function RawBlock(el) + if el.format == "latex" or el.format == "tex" then + local content = el.text + + -- Look for equation environments with labels + local label_match = 
content:match("\\label%{([^}]+)%}") + + if label_match then + local clean_id = clean_identifier(label_match) + local clean_content = content:gsub("\\label%{[^}]+%}", "") + + -- For block equations, we might want to preserve the structure + -- but add the htmlId functionality + return pandoc.RawBlock(el.format, clean_content) + end + end + + return el +end diff --git a/app/scripts/latex-importer/index.mjs b/app/scripts/latex-importer/index.mjs new file mode 100644 index 0000000000000000000000000000000000000000..9cdb8e0ba583b8fe4ac1e8ad9f6a187be69884fb --- /dev/null +++ b/app/scripts/latex-importer/index.mjs @@ -0,0 +1,138 @@ +#!/usr/bin/env node + +import { join, dirname } from 'path'; +import { fileURLToPath } from 'url'; +import { copyFileSync } from 'fs'; +import { convertLatexToMarkdown } from './latex-converter.mjs'; +import { convertToMdx } from './mdx-converter.mjs'; +import { cleanBibliography } from './bib-cleaner.mjs'; + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); + +// Default configuration +const DEFAULT_INPUT = join(__dirname, 'input', 'main.tex'); +const DEFAULT_OUTPUT = join(__dirname, 'output'); +const ASTRO_CONTENT_PATH = join(__dirname, '..', '..', 'src', 'content', 'article.mdx'); + +function parseArgs() { + const args = process.argv.slice(2); + const config = { + input: DEFAULT_INPUT, + output: DEFAULT_OUTPUT, + clean: false, + bibOnly: false, + convertOnly: false, + mdx: false, + }; + + for (const arg of args) { + if (arg.startsWith('--input=')) { + config.input = arg.split('=')[1]; + } else if (arg.startsWith('--output=')) { + config.output = arg.split('=')[1]; + } else if (arg === '--clean') { + config.clean = true; + } else if (arg === '--bib-only') { + config.bibOnly = true; + } else if (arg === '--convert-only') { + config.convertOnly = true; + } + } + + return config; +} + +function showHelp() { + console.log(` +🚀 LaTeX to Markdown Toolkit + +Usage: + node index.mjs [options] + +Options: + 
--input=PATH Input LaTeX file (default: input/main.tex) + --output=PATH Output directory (default: output/) + --clean Clean output directory before processing + --bib-only Only clean bibliography file + --convert-only Only convert LaTeX to Markdown (skip bib cleaning) + --help, -h Show this help + +Examples: + # Full conversion with bibliography cleaning + node index.mjs --clean + + # Only clean bibliography + node index.mjs --bib-only --input=paper.tex --output=clean/ + + # Only convert LaTeX (use existing clean bibliography) + node index.mjs --convert-only + + # Custom paths + node index.mjs --input=../paper/main.tex --output=../results/ --clean +`); +} + +function main() { + const args = process.argv.slice(2); + + if (args.includes('--help') || args.includes('-h')) { + showHelp(); + process.exit(0); + } + + const config = parseArgs(); + + console.log('🚀 LaTeX to Markdown Toolkit'); + console.log('=============================='); + + try { + if (config.bibOnly) { + // Only clean bibliography + console.log('📚 Bibliography cleaning mode'); + const bibInput = config.input.replace('.tex', '.bib'); + const bibOutput = join(config.output, 'main.bib'); + + cleanBibliography(bibInput, bibOutput); + console.log('🎉 Bibliography cleaning completed!'); + + } else if (config.convertOnly) { + // Only convert LaTeX + console.log('📄 Conversion only mode'); + convertLatexToMarkdown(config.input, config.output); + + } else { + // Full workflow + console.log('🔄 Full conversion workflow'); + convertLatexToMarkdown(config.input, config.output); + + // Convert to MDX if requested + const markdownFile = join(config.output, 'main.md'); + const mdxFile = join(config.output, 'main.mdx'); + + console.log('📝 Converting Markdown to MDX...'); + convertToMdx(markdownFile, mdxFile); + + // Copy MDX to Astro content directory + console.log('📋 Copying MDX to Astro content directory...'); + try { + copyFileSync(mdxFile, ASTRO_CONTENT_PATH); + console.log(` ✅ Copied to ${ASTRO_CONTENT_PATH}`); + } 
catch (error) { + console.warn(` ⚠️ Failed to copy MDX to Astro: ${error.message}`); + } + } + + } catch (error) { + console.error('❌ Error:', error.message); + process.exit(1); + } +} + +// Export functions for use as module +export { convertLatexToMarkdown, cleanBibliography }; + +// Run CLI if called directly +if (import.meta.url === `file://${process.argv[1]}`) { + main(); +} diff --git a/app/scripts/latex-importer/latex-converter.mjs b/app/scripts/latex-importer/latex-converter.mjs new file mode 100644 index 0000000000000000000000000000000000000000..7079e2e43b85e9947a771a33a9ef22adb329f35a --- /dev/null +++ b/app/scripts/latex-importer/latex-converter.mjs @@ -0,0 +1,330 @@ +#!/usr/bin/env node + +import { execSync } from 'child_process'; +import { readFileSync, writeFileSync, existsSync, mkdirSync } from 'fs'; +import { join, dirname, basename } from 'path'; +import { fileURLToPath } from 'url'; +import { cleanBibliography } from './bib-cleaner.mjs'; +import { postProcessMarkdown } from './post-processor.mjs'; +import { preprocessLatexReferences } from './reference-preprocessor.mjs'; + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); + +// Configuration +const DEFAULT_INPUT = join(__dirname, 'input', 'main.tex'); +const DEFAULT_OUTPUT = join(__dirname, 'output'); + +function parseArgs() { + const args = process.argv.slice(2); + const config = { + input: DEFAULT_INPUT, + output: DEFAULT_OUTPUT, + clean: false + }; + + for (const arg of args) { + if (arg.startsWith('--input=')) { + config.input = arg.split('=')[1]; + } else if (arg.startsWith('--output=')) { + config.output = arg.split('=')[1]; + } else if (arg === '--clean') { + config.clean = true; + } + } + + return config; +} + +function ensureDirectory(dir) { + if (!existsSync(dir)) { + mkdirSync(dir, { recursive: true }); + } +} + +function cleanDirectory(dir) { + if (existsSync(dir)) { + execSync(`rm -rf "${dir}"/*`, { stdio: 'inherit' }); + } +} + +function 
preprocessLatexFile(inputFile, outputDir) { + const inputDir = dirname(inputFile); + const tempFile = join(outputDir, 'temp_main.tex'); + + console.log('🔄 Preprocessing LaTeX file to resolve \\input commands...'); + + let content = readFileSync(inputFile, 'utf8'); + + // Remove problematic commands that break pandoc + console.log('🧹 Cleaning problematic LaTeX constructs...'); + + // Fix citation issues - but not in citation keys + content = content.replace(/\$p_0\$(?![A-Za-z])/g, 'p0'); + + // Convert complex math environments to simple delimiters + content = content.replace(/\$\$\\begin\{equation\*\}/g, '$$'); + content = content.replace(/\\end\{equation\*\}\$\$/g, '$$'); + content = content.replace(/\\begin\{equation\*\}/g, '$$'); + content = content.replace(/\\end\{equation\*\}/g, '$$'); + // Keep align environments intact for KaTeX support + // Protect align environments by temporarily replacing them before cleaning & operators + const alignBlocks = []; + content = content.replace(/\\begin\{align\}([\s\S]*?)\\end\{align\}/g, (match, alignContent) => { + alignBlocks.push(match); + return `__ALIGN_BLOCK_${alignBlocks.length - 1}__`; + }); + + // Now remove & operators from non-align content (outside align environments) + content = content.replace(/&=/g, '='); + content = content.replace(/&/g, ''); + + // Restore align blocks with their & operators intact + alignBlocks.forEach((block, index) => { + content = content.replace(`__ALIGN_BLOCK_${index}__`, block); + }); + + // Convert LaTeX citations to Pandoc format + content = content.replace(/\\cite[tp]?\{([^}]+)\}/g, (match, citations) => { + // Handle multiple citations separated by commas - all become simple @citations + return citations.split(',').map(cite => `@${cite.trim()}`).join(', '); + }); + + // Handle complex \textsc with nested math - extract and simplify (but not in command definitions) + content = content.replace(/\\textsc\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}/g, (match, content_inside, offset) => { + // 
Skip if this is inside a \newcommand or similar definition + const before = content.substring(Math.max(0, offset - 50), offset); + if (before.includes('\\newcommand') || before.includes('\\renewcommand') || before.includes('\\def')) { + return match; // Keep original + } + + // Remove math delimiters inside textsc for simplification + const simplified = content_inside.replace(/\\\([^)]+\\\)/g, 'MATHEXPR'); + return `\\text{${simplified}}`; + }); + + // Remove complex custom commands that pandoc can't handle + content = content.replace(/\\input\{snippets\/[^}]+\}/g, '% Code snippet removed'); + + // Find all \input{} commands (but skip commented ones) + const inputRegex = /^([^%]*?)\\input\{([^}]+)\}/gm; + let match; + + while ((match = inputRegex.exec(content)) !== null) { + const beforeInput = match[1]; + const inputPath = match[2]; + + // Skip if the \input is commented (% appears before \input on the line) + if (beforeInput.includes('%')) { + continue; + } + let fullPath; + + // Skip only problematic files, let Pandoc handle macros + if (inputPath.includes('snippets/')) { + console.log(` Skipping: ${inputPath}`); + content = content.replace(`\\input{${inputPath}}`, `% Skipped: ${inputPath}`); + continue; + } + + // Handle paths with or without .tex extension + if (inputPath.endsWith('.tex')) { + fullPath = join(inputDir, inputPath); + } else { + fullPath = join(inputDir, inputPath + '.tex'); + } + + if (existsSync(fullPath)) { + console.log(` Including: ${inputPath}`); + let includedContent = readFileSync(fullPath, 'utf8'); + + // Clean included content too + includedContent = includedContent.replace(/\$p_0\$/g, 'p0'); + includedContent = includedContent.replace(/\\input\{snippets\/[^}]+\}/g, '% Code snippet removed'); + + // Handle complex \textsc in included content + includedContent = includedContent.replace(/\\textsc\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}/g, (match, content_inside, offset) => { + // Skip if this is inside a \newcommand or similar definition + 
const before = includedContent.substring(Math.max(0, offset - 50), offset); + if (before.includes('\\newcommand') || before.includes('\\renewcommand') || before.includes('\\def')) { + return match; // Keep original + } + + const simplified = content_inside.replace(/\\\([^)]+\\\)/g, 'MATHEXPR'); + return `\\text{${simplified}}`; + }); + + // Apply same align-preserving logic to included content + const alignBlocksIncluded = []; + includedContent = includedContent.replace(/\\begin\{align\}([\s\S]*?)\\end\{align\}/g, (match, alignContent) => { + alignBlocksIncluded.push(match); + return `__ALIGN_BLOCK_${alignBlocksIncluded.length - 1}__`; + }); + + // Remove alignment operators from non-align content in included files + includedContent = includedContent.replace(/&=/g, '='); + includedContent = includedContent.replace(/&/g, ''); + + // Restore align blocks with their & operators intact + alignBlocksIncluded.forEach((block, index) => { + includedContent = includedContent.replace(`__ALIGN_BLOCK_${index}__`, block); + }); + + // Convert math environments in included content + includedContent = includedContent.replace(/\$\$\\begin\{equation\*\}/g, '$$'); + includedContent = includedContent.replace(/\\end\{equation\*\}\$\$/g, '$$'); + includedContent = includedContent.replace(/\\begin\{equation\*\}/g, '$$'); + includedContent = includedContent.replace(/\\end\{equation\*\}/g, '$$'); + + // Convert citations in included content + includedContent = includedContent.replace(/\\cite[tp]?\{([^}]+)\}/g, (match, citations) => { + return citations.split(',').map(cite => `@${cite.trim()}`).join(', '); + }); + + content = content.replace(`\\input{${inputPath}}`, includedContent); + } else { + console.log(` ⚠️ File not found: ${fullPath} (skipping)`); + content = content.replace(`\\input{${inputPath}}`, `% File not found: ${inputPath}`); + } + } + + // Apply reference preprocessing AFTER input inclusion to ensure all references are captured + console.log('🔧 Preprocessing LaTeX 
references for MDX compatibility...'); + const referenceResult = preprocessLatexReferences(content); + content = referenceResult.content; + + // Write the preprocessed file + writeFileSync(tempFile, content); + return tempFile; +} + +function processBibliography(inputFile, outputDir) { + const bibFile = join(dirname(inputFile), 'main.bib'); + const outputBibFile = join(outputDir, 'main.bib'); + + if (!existsSync(bibFile)) { + console.log(' ⚠️ No bibliography file found'); + return null; + } + + const success = cleanBibliography(bibFile, outputBibFile); + return success ? outputBibFile : null; +} + +export function convertLatexToMarkdown(inputFile, outputDir) { + console.log('🚀 Simple LaTeX to Markdown Converter'); + console.log(`📁 Input: ${inputFile}`); + console.log(`📁 Output: ${outputDir}`); + + // Check if input file exists + if (!existsSync(inputFile)) { + console.error(`❌ Input file not found: ${inputFile}`); + process.exit(1); + } + + // Ensure output directory exists + ensureDirectory(outputDir); + + try { + // Check if pandoc is available + execSync('pandoc --version', { stdio: 'pipe' }); + } catch (error) { + console.error('❌ Pandoc not found. Please install it: brew install pandoc'); + process.exit(1); + } + + // Clean and copy bibliography + const cleanBibFile = processBibliography(inputFile, outputDir); + + // Preprocess the LaTeX file to resolve \input commands + const preprocessedFile = preprocessLatexFile(inputFile, outputDir); + + const inputFileName = basename(inputFile, '.tex'); + const outputFile = join(outputDir, `${inputFileName}.md`); + + try { + console.log('📄 Converting with Pandoc...'); + + // Enhanced pandoc conversion - use tex_math_dollars for KaTeX compatibility + const bibOption = cleanBibFile ? 
`--bibliography="${cleanBibFile}"` : ''; + + // Use gfm+tex_math_dollars for simple $ delimiters compatible with KaTeX + const mediaDir = join(outputDir, 'assets', 'image'); + ensureDirectory(mediaDir); + const inputDir = dirname(inputFile); + const equationFilterPath = join(__dirname, 'filters', 'equation-ids.lua'); + const pandocCommand = `pandoc "${preprocessedFile}" -f latex+latex_macros -t gfm+tex_math_dollars+raw_html --shift-heading-level-by=1 --wrap=none ${bibOption} --extract-media="${mediaDir}" --resource-path="${inputDir}" --lua-filter="${equationFilterPath}" -o "${outputFile}"`; + + console.log(` Running: ${pandocCommand}`); + execSync(pandocCommand, { stdio: 'pipe' }); + + // Clean up temp file + execSync(`rm "${preprocessedFile}"`, { stdio: 'pipe' }); + + // Post-processing to fix KaTeX incompatible constructions + let markdownContent = readFileSync(outputFile, 'utf8'); + + // Use modular post-processor with code injection + markdownContent = postProcessMarkdown(markdownContent, inputDir); + + writeFileSync(outputFile, markdownContent); + + console.log(`✅ Conversion completed: ${outputFile}`); + + // Show file size + const stats = execSync(`wc -l "${outputFile}"`, { encoding: 'utf8' }); + const lines = stats.trim().split(' ')[0]; + console.log(`📊 Result: ${lines} lines written`); + + } catch (error) { + console.error('❌ Pandoc conversion failed:'); + console.error(error.message); + // Clean up temp file on error + try { + execSync(`rm "${preprocessedFile}"`, { stdio: 'pipe' }); + } catch { } + process.exit(1); + } +} + +function main() { + const config = parseArgs(); + + if (config.clean) { + console.log('🧹 Cleaning output directory...'); + cleanDirectory(config.output); + } + + convertLatexToMarkdown(config.input, config.output); + + console.log('🎉 Simple conversion completed!'); +} + +// Show help if requested +if (process.argv.includes('--help') || process.argv.includes('-h')) { + console.log(` +🚀 Simple LaTeX to Markdown Converter + +Usage: + node 
latex-converter.mjs [options]
+
+Options:
+  --input=PATH     Input LaTeX file (default: input/main.tex)
+  --output=PATH    Output directory (default: output/)
+  --clean          Clean output directory before conversion
+  --help, -h       Show this help
+
+Examples:
+  # Basic conversion
+  node latex-converter.mjs
+
+  # Custom paths
+  node latex-converter.mjs --input=my-paper.tex --output=converted/
+
+  # Clean output first
+  node latex-converter.mjs --clean
+`);
+  process.exit(0);
+}
+
+// Run CLI only if called directly (avoid running on import from index.mjs)
+if (import.meta.url === `file://${process.argv[1]}`) {
+  main();
+}
diff --git a/app/scripts/latex-importer/mdx-converter.mjs b/app/scripts/latex-importer/mdx-converter.mjs new file mode 100644 index 0000000000000000000000000000000000000000..5a0deaf79026bc7e6cdf6af59c7c1b61cbea03fb --- /dev/null +++ b/app/scripts/latex-importer/mdx-converter.mjs @@ -0,0 +1,896 @@ +#!/usr/bin/env node
+
+import { readFileSync, writeFileSync, existsSync } from 'fs';
+import { join, dirname, basename, extname } from 'path';
+import { fileURLToPath } from 'url';
+import { extractAndGenerateFrontmatter } from './metadata-extractor.mjs';
+
+const __filename = fileURLToPath(import.meta.url);
+const __dirname = dirname(__filename);
+
+// Configuration
+const DEFAULT_INPUT = join(__dirname, 'output', 'main.md');
+const DEFAULT_OUTPUT = join(__dirname, 'output', 'main.mdx');
+
+function parseArgs() {
+  const args = process.argv.slice(2);
+  const config = {
+    input: DEFAULT_INPUT,
+    output: DEFAULT_OUTPUT,
+  };
+
+  for (const arg of args) {
+    if (arg.startsWith('--input=')) {
+      config.input = arg.substring('--input='.length);
+    } else if (arg.startsWith('--output=')) {
+      config.output = arg.substring('--output='.length);
+    } else if (arg === '--help' || arg === '-h') {
+      console.log(`
+📝 Markdown to MDX Converter
+
+Usage:
+  node mdx-converter.mjs [options]
+
+Options:
+  --input=PATH     Input Markdown file (default: ${DEFAULT_INPUT})
+  --output=PATH    Output MDX file (default: ${DEFAULT_OUTPUT})
+  --help,
-h       Show this help
+
+Examples:
+  # Basic conversion
+  node mdx-converter.mjs
+
+  # Custom paths
+  node mdx-converter.mjs --input=article.md --output=article.mdx
+      `);
+      process.exit(0);
+    } else if (!arg.startsWith('--')) {
+      // Positional arguments: first non-flag argument is input, second is output
+      // (config.input/output are pre-filled with defaults, so test against those)
+      if (config.input === DEFAULT_INPUT) {
+        config.input = arg;
+      } else if (config.output === DEFAULT_OUTPUT) {
+        config.output = arg;
+      }
+    }
+  }
+  return config;
+}
+
+/**
+ * Modular MDX post-processing functions for Astro compatibility
+ * Each function handles a specific type of transformation
+ */
+
+/**
+ * Track which Astro components are used during transformations
+ */
+const usedComponents = new Set();
+
+/**
+ * Track individual image imports needed
+ */
+const imageImports = new Map(); // src -> varName
+
+/**
+ * Generate a variable name from image path
+ * @param {string} src - Image source path
+ * @returns {string} - Valid variable name
+ */
+function generateImageVarName(src) {
+  // Extract filename without extension and make it a valid JS variable
+  const filename = src.split('/').pop().replace(/\.[^.]+$/, '');
+  return filename.replace(/[^a-zA-Z0-9]/g, '_').replace(/^[0-9]/, 'img_$&');
+}
+
+/**
+ * Add required component imports to the frontmatter
+ * @param {string} content - MDX content
+ * @returns {string} - Content with component imports
+ */
+function addComponentImports(content) {
+  console.log(' 📦 Adding component and image imports...');
+
+  let imports = [];
+
+  // Add component imports
+  if (usedComponents.size > 0) {
+    const componentImports = Array.from(usedComponents)
+      .map(component => `import ${component} from '../components/${component}.astro';`);
+    imports.push(...componentImports);
+    console.log(`   ✅ Importing components: ${Array.from(usedComponents).join(', ')}`);
+  }
+
+  // Add image imports
+  if (imageImports.size > 0) {
+    const imageImportStatements = Array.from(imageImports.entries())
+      .map(([src, varName]) => `import ${varName} from '${src}';`);
+    imports.push(...imageImportStatements);
+    console.log(`   ✅ Importing ${imageImports.size} image(s)`);
+  }
+
+  if
(imports.length === 0) { + console.log(' ℹ️ No imports needed'); + return content; + } + + const importBlock = imports.join('\n'); + + // Insert imports after frontmatter + const frontmatterEnd = content.indexOf('---', 3) + 3; + if (frontmatterEnd > 2) { + return content.slice(0, frontmatterEnd) + '\n\n' + importBlock + '\n' + content.slice(frontmatterEnd); + } else { + // No frontmatter, add at beginning + return importBlock + '\n\n' + content; + } +} + + +/** + * Convert grouped figures (subfigures) to MultiFigure components + * @param {string} content - MDX content + * @returns {string} - Content with MultiFigure components for grouped figures + */ +function convertSubfiguresToMultiFigure(content) { + console.log(' 🖼️✨ Converting subfigures to MultiFigure components...'); + + let convertedCount = 0; + + // Pattern to match:
<figure> containing multiple <figure> elements with a global caption
+  // This matches the LaTeX subfigure pattern that gets converted by Pandoc
+  const subfigureGroupPattern = /<figure>\s*((?:<figure>[\s\S]*?<\/figure>\s*){2,})<figcaption>([\s\S]*?)<\/figcaption>\s*<\/figure>/g;
+
+  const convertedContent = content.replace(subfigureGroupPattern, (match, figuresMatch, globalCaption) => {
+    convertedCount++;
+
+    // Extract individual figures within the group
+    // This pattern is more flexible to handle variations in HTML structure
+    const individualFigurePattern = /<figure>\s*<img src="([^"]*)"[^>]*\/>\s*<p><span id="([^"]*)"[^&]*><\/span><\/p>\s*<figcaption>([\s\S]*?)<\/figcaption>\s*<\/figure>/g;
+
+    const images = [];
+    let figureMatch;
+
+    while ((figureMatch = individualFigurePattern.exec(figuresMatch)) !== null) {
+      const [, src, id, caption] = figureMatch;
+
+      // Clean the source path (similar to existing transformImages function)
+      const cleanSrc = src.replace(/.*\/output\/assets\//, './assets/')
+        .replace(/\/Users\/[^\/]+\/[^\/]+\/[^\/]+\/[^\/]+\/[^\/]+\/app\/scripts\/latex-to-markdown\/output\/assets\//, './assets/');
+
+      // Clean caption text (remove HTML, normalize whitespace)
+      const cleanCaption = caption
+        .replace(/<[^>]*>/g, '')
+        .replace(/\n/g, ' ')
+        .replace(/\s+/g, ' ')
+        .replace(/'/g, "\\'")
+        .trim();
+
+      // Generate alt text from caption
+      const altText = cleanCaption.length > 100
+        ? cleanCaption.substring(0, 100) + '...'
+        : cleanCaption;
+
+      // Generate variable name for import
+      const varName = generateImageVarName(cleanSrc);
+      imageImports.set(cleanSrc, varName);
+
+      images.push({
+        src: varName,
+        alt: altText,
+        caption: cleanCaption,
+        id: id
+      });
+    }
+
+    // Clean global caption
+    const cleanGlobalCaption = globalCaption
+      .replace(/<[^>]*>/g, '')
+      .replace(/\n/g, ' ')
+      .replace(/\s+/g, ' ')
+      .replace(/'/g, "\\'")
+      .trim();
+
+    // Mark MultiFigure component as used
+    usedComponents.add('MultiFigure');
+
+    // Determine layout based on number of images
+    let layout = 'auto';
+    if (images.length === 2) layout = '2-column';
+    else if (images.length === 3) layout = '3-column';
+    else if (images.length === 4) layout = '4-column';
+
+    // Generate MultiFigure component
+    const imagesJson = images.map(img =>
+      `  {\n    src: ${img.src},\n    alt: "${img.alt}",\n    caption: "${img.caption}",\n    id: "${img.id}"\n  }`
+    ).join(',\n');
+
+    return `<MultiFigure layout="${layout}" caption={'${cleanGlobalCaption}'} images={[\n${imagesJson}\n]} />`;
+  });
+
+  if (convertedCount > 0) {
+    console.log(`   ✅ Converted ${convertedCount} subfigure group(s) to MultiFigure component(s)`);
+  } else {
+    console.log('   ℹ️ No subfigure groups found');
+  }
+
+  return convertedContent;
+}
+
+/**
+ * Transform
HTML <img> images to Figure components
+ * @param {string} content - MDX content
+ * @returns {string} - Content with Figure components
+ */
+/**
+ * Create Figure component with import
+ * @param {string} src - Clean image source
+ * @param {string} alt - Alt text
+ * @param {string} id - Element ID
+ * @param {string} caption - Figure caption
+ * @param {string} width - Optional width
+ * @returns {string} - Figure component markup
+ */
+function createFigureComponent(src, alt = '', id = '', caption = '', width = '') {
+  const varName = generateImageVarName(src);
+  imageImports.set(src, varName);
+  usedComponents.add('Figure');
+
+  const props = [];
+  props.push(`src={${varName}}`);
+  props.push('zoomable');
+  props.push('downloadable');
+  if (id) props.push(`id="${id}"`);
+  props.push('layout="fixed"');
+  if (alt) props.push(`alt="${alt}"`);
+  if (caption) props.push(`caption={'${caption}'}`);
+
+  return `<Figure ${props.join(' ')} />`;
+}
+
+function transformImages(content) {
+  console.log(' 🖼️ Transforming images to Figure components with imports...');
+
+  let hasImages = false;
+
+  // Helper function to clean source paths
+  const cleanSrcPath = (src) => {
+    return src.replace(/.*\/output\/assets\//, './assets/')
+      .replace(/\/Users\/[^\/]+\/[^\/]+\/[^\/]+\/[^\/]+\/[^\/]+\/app\/scripts\/latex-to-markdown\/output\/assets\//, './assets/');
+  };
+
+  // Helper to clean caption text
+  const cleanCaption = (caption) => {
+    return caption
+      .replace(/<[^>]*>/g, '')  // Remove HTML tags
+      .replace(/\n/g, ' ')      // Replace newlines with spaces
+      .replace(/\r/g, ' ')      // Replace carriage returns with spaces
+      .replace(/\s+/g, ' ')     // Replace multiple spaces with single space
+      .replace(/'/g, "\\'")     // Escape quotes
+      .trim();                  // Trim whitespace
+  };
+
+  // Helper to clean alt text
+  const cleanAltText = (alt, maxLength = 100) => {
+    const cleaned = alt
+      .replace(/<[^>]*>/g, '')  // Remove HTML tags
+      .replace(/\n/g, ' ')      // Replace newlines with spaces
+      .replace(/\r/g, ' ')      // Replace carriage returns with
spaces + .replace(/\s+/g, ' ') // Replace multiple spaces with single space + .trim(); // Trim whitespace + + return cleaned.length > maxLength + ? cleaned.substring(0, maxLength) + '...' + : cleaned; + }; + + // 1. Transform complex HTML figures with style attributes + content = content.replace( + /
<figure id="([^"]*)">\s*<img src="([^"]*)" style="([^"]*)"[^>]*\/>\s*<figcaption>\s*(.*?)\s*<\/figcaption>\s*<\/figure>/gs,
+    (match, id, src, style, caption) => {
+      const cleanSrc = cleanSrcPath(src);
+      const cleanCap = cleanCaption(caption);
+      const altText = cleanAltText(cleanCap);
+      hasImages = true;
+
+      return createFigureComponent(cleanSrc, altText, id, cleanCap);
+    }
+  );
+
+  // 2. Transform standalone img tags with style
+  content = content.replace(
+    /<img src="([^"]*)" style="([^"]*)"(?: alt="([^"]*)")?[^>]*\/?>/g,
+    (match, src, style, alt) => {
+      const cleanSrc = cleanSrcPath(src);
+      const cleanAlt = cleanAltText(alt || 'Figure');
+      hasImages = true;
+
+      return createFigureComponent(cleanSrc, cleanAlt);
+    }
+  );
+
+  // 3. Transform images within wrapfigure divs
+  content = content.replace(
+    /<div class="wrapfigure">\s*r[\d.]+\s*<img src="([^"]*)"[^>]*\/>\s*<\/div>/gs,
+    (match, src) => {
+      const cleanSrc = cleanSrcPath(src);
+      hasImages = true;
+
+      return createFigureComponent(cleanSrc, 'Figure');
+    }
+  );
+
+  // 4. Transform simple HTML figure/img without style
+  content = content.replace(
+    /<figure id="([^"]*)">\s*<img src="([^"]*)"[^>]*\/>\s*<figcaption>\s*(.*?)\s*<\/figcaption>\s*<\/figure>/gs,
+    (match, id, src, caption) => {
+      const cleanSrc = cleanSrcPath(src);
+      const cleanCap = cleanCaption(caption);
+      const altText = cleanAltText(cleanCap);
+      hasImages = true;
+
+      return createFigureComponent(cleanSrc, altText, id, cleanCap);
+    }
+  );
+
+  // 5. Clean up figures with minipage divs
+  content = content.replace(
+    /<figure id="([^"]*)">\s*<div class="minipage">\s*<img src="([^"]*)"[^>]*\/>\s*<\/div>\s*<figcaption[^>]*>(.*?)<\/figcaption>\s*<\/figure>/gs,
+    (match, id, src, caption) => {
+      const cleanSrc = cleanSrcPath(src);
+      const cleanCap = cleanCaption(caption);
+      const altText = cleanAltText(cleanCap);
+      hasImages = true;
+
+      return createFigureComponent(cleanSrc, altText, id, cleanCap);
+    }
+  );
+
+  // 6. Transform Pandoc-style images: ![alt](src){#id attr="value"}
+  content = content.replace(
+    /!\[([^\]]*)\]\(([^)]+)\)(?:\{([^}]+)\})?/g,
+    (match, alt, src, attributes) => {
+      const cleanSrc = cleanSrcPath(src);
+      const cleanAlt = cleanAltText(alt || 'Figure');
+      hasImages = true;
+
+      let id = '';
+      if (attributes) {
+        const idMatch = attributes.match(/#([\w-]+)/);
+        if (idMatch) id = idMatch[1];
+      }
+
+      return createFigureComponent(cleanSrc, cleanAlt, id);
+    }
+  );
+
+  if (hasImages) {
+    console.log('   ✅ Figure components with imports will be created');
+  }
+
+  return content;
+}
+
+/**
+ * Transform HTML spans with style attributes to appropriate components
+ * @param {string} content - MDX content
+ * @returns {string} - Content with transformed spans
+ */
+function transformStyledSpans(content) {
+  console.log(' 🎨 Transforming styled spans...');
+
+  // Transform HTML spans with style attributes
+  content = content.replace(
+    /<span style="color: ([^"]+)">(.*?)<\/span>/g,
+    (match, color, text) => {
+      // Map colors to semantic classes or components
+      const colorMap = {
+        'hf2': 'text-hf-secondary',
+        'hf1': 'text-hf-primary'
+      };
+
+      const className = colorMap[color] || `text-${color}`;
+      return `<span class="${className}">${text}</span>`;
+    }
+  );
+
+  // Transform markdown spans with style attributes: [text]{style="color: color"}
+  content = content.replace(
+    /\[([^\]]+)\]\{style="color: ([^"]+)"\}/g,
+    (match, text, color) => {
+      // Map colors to semantic classes or components
+      const colorMap = {
+        'hf2': 'text-hf-secondary',
+        'hf1': 'text-hf-primary'
+      };
+
+      const className = colorMap[color] || `text-${color}`;
+      return `<span class="${className}">${text}</span>`;
+    }
+  );
+
+  return content;
+}
+
+/**
+ * Transform reference
links to proper Astro internal links + * @param {string} content - MDX content + * @returns {string} - Content with transformed links + */ +function fixHtmlEscaping(content) { + console.log(' 🔧 Fixing HTML escaping in spans...'); + + let fixedCount = 0; + + // Pattern 1: \\ + content = content.replace(/\\\\<\/span\\>/g, (match, id, style) => { + fixedCount++; + // Fix common style issues like "position- absolute;" -> "position: absolute;" + const cleanStyle = style.replace('position- absolute;', 'position: absolute;'); + return ``; + }); + + // Pattern 2: \...\ + content = content.replace(/\\([^\\]+)\\<\/span\\>/g, (match, className, text) => { + fixedCount++; + // Remove numbering like (1), (2), (3) from highlight spans + let cleanText = text; + if (className === 'highlight') { + cleanText = text.replace(/^\(\d+\)\s*/, ''); + } + return `${cleanText}`; + }); + + // Pattern 3: HTML-encoded spans in paragraph tags + //

        <span id="..." style="..."></span>

        + content = content.replace(/

        <span id="([^"]*)" style="([^"]*)"><\/span><\/p>/g, (match, id, style) => { + fixedCount++; + // Fix common style issues like "position- absolute;" -> "position: absolute;" + const cleanStyle = style.replace('position- absolute;', 'position: absolute;'); + return ``; + }); + + // Pattern 4: HTML-encoded spans with class in paragraph tags + //

        <span class="...">...</span>

        + content = content.replace(/

        <span class="([^"]*)">([^&]*)<\/span><\/p>/g, (match, className, text) => { + fixedCount++; + // Remove numbering like (1), (2), (3) from highlight spans + let cleanText = text; + if (className === 'highlight') { + cleanText = text.replace(/^\(\d+\)\s*/, ''); + } + return `${cleanText}`; + }); + + if (fixedCount > 0) { + console.log(` ✅ Fixed ${fixedCount} escaped span(s)`); + } + + return content; +} + +function cleanHighlightNumbering(content) { + console.log(' 🔢 Removing numbering from highlight spans...'); + + let cleanedCount = 0; + // Clean numbering from non-escaped highlight spans too + content = content.replace(/(\(\d+\)\s*)([^<]+)<\/span>/g, (match, numbering, text) => { + cleanedCount++; + return `${text}`; + }); + + if (cleanedCount > 0) { + console.log(` ✅ Removed numbering from ${cleanedCount} highlight span(s)`); + } + + return content; +} + +function transformReferenceLinks(content) { + console.log(' 🔗 Transforming reference links...'); + + // Transform Pandoc reference links: [text](#ref){reference-type="ref" reference="ref"} + return content.replace( + /\[([^\]]+)\]\((#[^)]+)\)\{[^}]*reference[^}]*\}/g, + (match, text, href) => { + return `[${text}](${href})`; + } + ); +} + + +/** + * Fix frontmatter and ensure proper MDX format + * @param {string} content - MDX content + * @param {string} latexContent - Original LaTeX content for metadata extraction + * @returns {string} - Content with proper frontmatter + */ +function ensureFrontmatter(content, latexContent = '') { + console.log(' 📄 Ensuring proper frontmatter...'); + + if (!content.startsWith('---')) { + let frontmatter; + + if (latexContent) { + // Extract metadata from LaTeX using dedicated module + frontmatter = extractAndGenerateFrontmatter(latexContent); + console.log(' ✅ Generated frontmatter from LaTeX metadata'); + } else { + // Fallback frontmatter + const currentDate = new Date().toLocaleDateString('en-US', { + year: 'numeric', + month: 'short', + day: '2-digit' + }); + 
frontmatter = `--- +title: "Research Article" +published: "${currentDate}" +tableOfContentsAutoCollapse: true +--- + +`; + console.log(' ✅ Generated basic frontmatter'); + } + + return frontmatter + content; + } + + return content; +} + +/** + * Fix mixed math delimiters like $`...`$ or `...`$ + * @param {string} content - MDX content + * @returns {string} - Content with fixed math delimiters + */ +function fixMixedMathDelimiters(content) { + console.log(' 🔧 Fixing mixed math delimiters...'); + + let fixedCount = 0; + + // Fix patterns like $`...`$ (mixed delimiters) + content = content.replace(/\$`([^`]*)`\$/g, (match, mathContent) => { + fixedCount++; + return `$${mathContent}$`; + }); + + // Fix patterns like `...`$ (backtick start, dollar end) + content = content.replace(/`([^`]*)`\$/g, (match, mathContent) => { + fixedCount++; + return `$${mathContent}$`; + }); + + // Fix patterns like $`...` (dollar start, backtick end - less common) + content = content.replace(/\$`([^`]*)`(?!\$)/g, (match, mathContent) => { + fixedCount++; + return `$${mathContent}$`; + }); + + if (fixedCount > 0) { + console.log(` ✅ Fixed ${fixedCount} mixed math delimiter(s)`); + } + + return content; +} + +/** + * Clean up orphaned math delimiters and fix mixed content + * @param {string} content - MDX content + * @returns {string} - Content with cleaned math blocks + */ +function cleanOrphanedMathDelimiters(content) { + console.log(' 🧹 Cleaning orphaned math delimiters...'); + console.log(' 🔍 Content length:', content.length, 'chars'); + + let fixedCount = 0; + + // Fix orphaned $$ that are alone on lines (but not part of display math blocks) + // Only remove $$ that appear alone without corresponding closing $$ + content = content.replace(/^\$\$\s*$(?!\s*[\s\S]*?\$\$)/gm, () => { + fixedCount++; + return ''; + }); + + // Fix backticks inside $$....$$ blocks (Pandoc artifact) + const mathMatches = content.match(/\$\$([\s\S]*?)\$\$/g); + console.log(` 🔍 Found ${mathMatches ? 
mathMatches.length : 0} math blocks`); + + content = content.replace(/\$\$([\s\S]*?)\$\$/g, (match, mathContent) => { + // More aggressive: remove ALL single backticks in math blocks (they shouldn't be there) + let cleanedMath = mathContent; + + // Count backticks before + const backticksBefore = (mathContent.match(/`/g) || []).length; + + if (backticksBefore > 0) { + console.log(` 🔧 Found math block with ${backticksBefore} backtick(s)`); + } + + // Remove all backticks (Pandoc artifacts; backticks are never valid inside math) + cleanedMath = cleanedMath.replace(/`/g, ''); + + if (backticksBefore > 0) { + fixedCount++; + console.log(` 🔧 Removed ${backticksBefore} backtick(s) from math block`); + return `$$${cleanedMath}$$`; + } + return match; + }); + + // Fix escaped align in math blocks: \begin\{align\} -> \begin{align} + content = content.replace(/\\begin\\\{align\\\}/g, (match) => { + fixedCount++; + return '\\begin{align}'; + }); + + content = content.replace(/\\end\\\{align\\\}/g, (match) => { + fixedCount++; + return '\\end{align}'; + }); + + // Fix cases where text gets mixed with math blocks + // Pattern: ``` math ... 
``` text ``` math + content = content.replace(/``` math\s*\n([\s\S]*?)\n```\s*([^`\n]*?)\s*``` math/g, (match, math1, text, math2) => { + if (text.trim().length > 0 && !text.includes('```')) { + fixedCount++; + return '```' + ' math\n' + math1 + '\n```\n\n' + text.trim() + '\n\n```' + ' math'; + } + return match; + }); + + if (fixedCount > 0) { + console.log(` ✅ Fixed ${fixedCount} orphaned math delimiter(s)`); + } + + return content; +} + +/** + * Clean newlines from single-dollar math blocks ($...$) ONLY + * @param {string} content - MDX content + * @returns {string} - Content with cleaned math blocks + */ +function cleanSingleLineMathNewlines(content) { + console.log(' 🔢 Cleaning newlines in single-dollar math blocks ($...$)...'); + + let cleanedCount = 0; + + // ULTRA STRICT: Only target single dollar blocks ($...$) that contain newlines + // Use dotall flag (s) to match newlines with .*, and ensure we don't match $$ + const cleanedContent = content.replace(/\$(?!\$)([\s\S]*?)\$(?!\$)/g, (match, mathContent) => { + // Only process if the content contains newlines + if (mathContent.includes('\n')) { + cleanedCount++; + + // Remove ALL newlines and carriage returns, normalize whitespace + const cleanedMath = mathContent + .replace(/\n+/g, ' ') // Replace all newlines with spaces + .replace(/\r+/g, ' ') // Replace carriage returns with spaces + .replace(/\s+/g, ' ') // Normalize multiple spaces to single + .trim(); // Remove leading/trailing spaces + + return `$${cleanedMath}$`; + } + return match; // Keep original if no newlines + }); + + if (cleanedCount > 0) { + console.log(` ✅ Cleaned ${cleanedCount} single-dollar math block(s) with newlines`); + } + + return cleanedContent; +} + +/** + * Add proper line breaks around display math blocks ($$...$$) + * @param {string} content - MDX content + * @returns {string} - Content with properly spaced display math + */ +function formatDisplayMathBlocks(content) { + console.log(' 📐 Formatting display math blocks with 
proper spacing...'); + + let formattedCount = 0; + + // Find all $$...$$ blocks (display math) and ensure proper line breaks + // Very strict: only matches exactly $$ followed by content followed by $$ + const formattedContent = content.replace(/\$\$([\s\S]*?)\$\$/g, (match, mathContent) => { + formattedCount++; + + // Clean up the math content - trim whitespace but preserve structure + const cleanedMath = mathContent.trim(); + + // Return with proper line breaks before and after + return `\n$$\n${cleanedMath}\n$$\n`; + }); + + if (formattedCount > 0) { + console.log(` ✅ Formatted ${formattedCount} display math block(s) with proper spacing`); + } + + return formattedContent; +} + +/** + * Clean newlines from figcaption content + * @param {string} content - MDX content + * @returns {string} - Content with cleaned figcaptions + */ +function cleanFigcaptionNewlines(content) { + console.log(' 📝 Cleaning newlines in figcaption elements...'); + + let cleanedCount = 0; + + // Find all

        <figcaption>...</figcaption> blocks and remove internal newlines + const cleanedContent = content.replace(/<figcaption([^>]*)>([\s\S]*?)<\/figcaption>/g, (match, attributes, captionContent) => { + // Only process if the content contains newlines + if (captionContent.includes('\n')) { + cleanedCount++; + + // Remove newlines and normalize whitespace + const cleanedCaption = captionContent + .replace(/\n+/g, ' ') // Replace newlines with spaces + .replace(/\s+/g, ' ') // Normalize multiple spaces + .trim(); // Trim whitespace + + return `<figcaption${attributes}>${cleanedCaption}</figcaption>
        `; + } + + return match; // Return unchanged if no newlines + }); + + if (cleanedCount > 0) { + console.log(` ✅ Cleaned ${cleanedCount} figcaption element(s)`); + } else { + console.log(` ℹ️ No figcaption elements with newlines found`); + } + + return cleanedContent; +} + +/** + * Remove HTML comments from MDX content + * @param {string} content - MDX content + * @returns {string} - Content without HTML comments + */ +function removeHtmlComments(content) { + console.log(' 🗑️ Removing HTML comments...'); + + let removedCount = 0; + + // Remove all HTML comments + const cleanedContent = content.replace(/<!--[\s\S]*?-->/g, () => { + removedCount++; + return ''; + }); + + if (removedCount > 0) { + console.log(` ✅ Removed ${removedCount} HTML comment(s)`); + } + + return cleanedContent; +} + +/** + * Clean up MDX-incompatible syntax + * @param {string} content - MDX content + * @returns {string} - Cleaned content + */ +function cleanMdxSyntax(content) { + console.log(' 🧹 Cleaning MDX syntax...'); + + return content + // NOTE: Math delimiter fixing is now handled by fixMixedMathDelimiters() + // Ensure proper spacing around JSX-like constructs + .replace(/>\s*\n\s*</g, '>\n<') + // Remove problematic heading attributes - be more specific to avoid matching \begin{align} + .replace(/^(#{1,6}\s+[^{#\n]+)\{[^}]+\}$/gm, '$1') + // Fix escaped quotes in text + .replace(/\\("|')/g, '$1'); +} + +/** + * Main MDX processing function that applies all transformations + * @param {string} content - Raw Markdown content + * @param {string} latexContent - Original LaTeX content for metadata extraction + * @returns {string} - Processed MDX content compatible with Astro + */ +function processMdxContent(content, latexContent = '') { + console.log('🔧 Processing for Astro MDX compatibility...'); + + // Clear previous tracking + usedComponents.clear(); + imageImports.clear(); + + let processedContent = content; + + // Apply each transformation step sequentially + processedContent = 
ensureFrontmatter(processedContent, latexContent); + processedContent = fixMixedMathDelimiters(processedContent); + + // Debug: check for $$ blocks after fixMixedMathDelimiters + const mathBlocksAfterMixed = (processedContent.match(/\$\$([\s\S]*?)\$\$/g) || []).length; + console.log(` 📊 Math blocks after mixed delimiters fix: ${mathBlocksAfterMixed}`); + + processedContent = cleanOrphanedMathDelimiters(processedContent); + processedContent = cleanSingleLineMathNewlines(processedContent); + processedContent = formatDisplayMathBlocks(processedContent); + processedContent = removeHtmlComments(processedContent); + processedContent = cleanMdxSyntax(processedContent); + processedContent = convertSubfiguresToMultiFigure(processedContent); + processedContent = transformImages(processedContent); + processedContent = transformStyledSpans(processedContent); + processedContent = transformReferenceLinks(processedContent); + processedContent = fixHtmlEscaping(processedContent); + processedContent = cleanHighlightNumbering(processedContent); + processedContent = cleanFigcaptionNewlines(processedContent); + + // Add component imports at the end + processedContent = addComponentImports(processedContent); + + return processedContent; +} + +function convertToMdx(inputFile, outputFile) { + console.log('📝 Modular Markdown to Astro MDX Converter'); + console.log(`📁 Input: ${inputFile}`); + console.log(`📁 Output: ${outputFile}`); + + // Check if input file exists + if (!existsSync(inputFile)) { + console.error(`❌ Input file not found: ${inputFile}`); + process.exit(1); + } + + try { + console.log('🔄 Reading Markdown file...'); + const markdownContent = readFileSync(inputFile, 'utf8'); + + // Try to read original LaTeX file for metadata extraction + let latexContent = ''; + try { + const inputDir = dirname(inputFile); + const latexFile = join(inputDir, '..', 'input', 'main.tex'); + if (existsSync(latexFile)) { + latexContent = readFileSync(latexFile, 'utf8'); + } + } catch (error) { + // 
Ignore LaTeX reading errors - we'll use fallback frontmatter + } + + // Apply modular MDX processing + const mdxContent = processMdxContent(markdownContent, latexContent); + + console.log('💾 Writing MDX file...'); + writeFileSync(outputFile, mdxContent); + + console.log(`✅ Conversion completed: ${outputFile}`); + + // Show file size + const inputSize = Math.round(markdownContent.length / 1024); + const outputSize = Math.round(mdxContent.length / 1024); + console.log(`📊 Input: ${inputSize}KB → Output: ${outputSize}KB`); + + } catch (error) { + console.error('❌ Conversion failed:'); + console.error(error.message); + process.exit(1); + } +} + +export { convertToMdx }; + +function main() { + const config = parseArgs(); + convertToMdx(config.input, config.output); + console.log('🎉 MDX conversion completed!'); +} + +if (import.meta.url === `file://${process.argv[1]}`) { + main(); +} diff --git a/app/scripts/latex-importer/metadata-extractor.mjs b/app/scripts/latex-importer/metadata-extractor.mjs new file mode 100644 index 0000000000000000000000000000000000000000..14943e71fe86b5b6b60543c05176bb82e9ab617c --- /dev/null +++ b/app/scripts/latex-importer/metadata-extractor.mjs @@ -0,0 +1,170 @@ +/** + * LaTeX Metadata Extractor + * Extracts document metadata from LaTeX files for frontmatter generation + */ + +/** + * Extract metadata from LaTeX content + * @param {string} latexContent - Raw LaTeX content + * @returns {object} - Extracted metadata object + */ +export function extractLatexMetadata(latexContent) { + const metadata = {}; + + // Extract title + const titleMatch = latexContent.match(/\\title\s*\{\s*([^}]+)\s*\}/s); + if (titleMatch) { + metadata.title = titleMatch[1] + .replace(/\n/g, ' ') + .trim(); + } + + // Extract authors with their specific affiliations + const authors = []; + const authorMatches = latexContent.matchAll(/\\authorOne\[[^\]]*\]\{([^}]+)\}/g); + + for (const match of authorMatches) { + const fullAuthorInfo = match[1]; + + // Determine 
affiliations based on macros present + const affiliations = []; + if (fullAuthorInfo.includes('\\ensps')) { + affiliations.push(1); // École Normale Supérieure + } + if (fullAuthorInfo.includes('\\hf')) { + affiliations.push(2); // Hugging Face + } + + // Clean author name by removing macros + let authorName = fullAuthorInfo + .replace(/\\ensps/g, '') // Remove École macro + .replace(/\\hf/g, '') // Remove Hugging Face macro + .replace(/\s+/g, ' ') // Normalize whitespace + .trim(); + + // Skip empty authors or placeholder entries + if (authorName && authorName !== '...') { + authors.push({ + name: authorName, + affiliations: affiliations.length > 0 ? affiliations : [2] // Default to HF if no macro + }); + } + } + + if (authors.length > 0) { + metadata.authors = authors; + } + + // Extract affiliations - create the two distinct affiliations + metadata.affiliations = [ + { + name: "École Normale Supérieure Paris-Saclay" + }, + { + name: "Hugging Face" + } + ]; + + // Extract date if available (common LaTeX patterns) + const datePatterns = [ + /\\date\s*\{([^}]+)\}/, + /\\newcommand\s*\{\\date\}\s*\{([^}]+)\}/, + ]; + + for (const pattern of datePatterns) { + const dateMatch = latexContent.match(pattern); + if (dateMatch) { + metadata.published = dateMatch[1].trim(); + break; + } + } + + // Fallback to current date if no date found + if (!metadata.published) { + metadata.published = new Date().toLocaleDateString('en-US', { + year: 'numeric', + month: 'short', + day: '2-digit' + }); + } + + return metadata; +} + +/** + * Generate YAML frontmatter from metadata object + * @param {object} metadata - Metadata object + * @returns {string} - YAML frontmatter string + */ +export function generateFrontmatter(metadata) { + let frontmatter = '---\n'; + + // Title + if (metadata.title) { + frontmatter += `title: "${metadata.title}"\n`; + } + + // Authors + if (metadata.authors && metadata.authors.length > 0) { + frontmatter += 'authors:\n'; + metadata.authors.forEach(author => 
{ + frontmatter += ` - name: "${author.name}"\n`; + if (author.url) { + frontmatter += ` url: "${author.url}"\n`; + } + frontmatter += ` affiliations: [${author.affiliations.join(', ')}]\n`; + }); + } + + // Affiliations + if (metadata.affiliations && metadata.affiliations.length > 0) { + frontmatter += 'affiliations:\n'; + metadata.affiliations.forEach((affiliation, index) => { + frontmatter += ` - name: "${affiliation.name}"\n`; + if (affiliation.url) { + frontmatter += ` url: "${affiliation.url}"\n`; + } + }); + } + + // Publication date + if (metadata.published) { + frontmatter += `published: "${metadata.published}"\n`; + } + + // Additional metadata + if (metadata.doi) { + frontmatter += `doi: "${metadata.doi}"\n`; + } + + if (metadata.description) { + frontmatter += `description: "${metadata.description}"\n`; + } + + if (metadata.licence) { + frontmatter += `licence: >\n ${metadata.licence}\n`; + } + + if (metadata.tags && metadata.tags.length > 0) { + frontmatter += 'tags:\n'; + metadata.tags.forEach(tag => { + frontmatter += ` - ${tag}\n`; + }); + } + + // Default Astro configuration + frontmatter += 'tableOfContentsAutoCollapse: true\n'; + frontmatter += '---\n\n'; + + return frontmatter; +} + +/** + * Extract and generate frontmatter from LaTeX content + * @param {string} latexContent - Raw LaTeX content + * @returns {string} - Complete YAML frontmatter + */ +export function extractAndGenerateFrontmatter(latexContent) { + const metadata = extractLatexMetadata(latexContent); + return generateFrontmatter(metadata); +} diff --git a/app/scripts/latex-importer/package-lock.json b/app/scripts/latex-importer/package-lock.json new file mode 100644 index 0000000000000000000000000000000000000000..86ab5949a70a037c497af3022c6e68c7fe8c0e83 Binary files /dev/null and b/app/scripts/latex-importer/package-lock.json differ diff --git a/app/scripts/latex-importer/package.json b/app/scripts/latex-importer/package.json new file mode 100644 index 
0000000000000000000000000000000000000000..16850d0a30f47e351a3303970a7ce3a2a881abb5 Binary files /dev/null and b/app/scripts/latex-importer/package.json differ diff --git a/app/scripts/latex-importer/post-processor.mjs b/app/scripts/latex-importer/post-processor.mjs new file mode 100644 index 0000000000000000000000000000000000000000..c108c173957c93412672add2199f978ad73ab73f --- /dev/null +++ b/app/scripts/latex-importer/post-processor.mjs @@ -0,0 +1,439 @@ +#!/usr/bin/env node + +import { readFileSync, writeFileSync, existsSync, readdirSync } from 'fs'; +import { join, dirname } from 'path'; +import { fileURLToPath } from 'url'; + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); + +/** + * Post-processor for cleaning Markdown content from LaTeX conversion + * Each function handles a specific type of cleanup for maintainability + */ + +/** + * Remove TeX low-level grouping commands that break KaTeX + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function removeTexGroupingCommands(content) { + console.log(' 🧹 Removing TeX grouping commands...'); + + return content + .replace(/\\mathopen\{\}\\mathclose\\bgroup/g, '') + .replace(/\\aftergroup\\egroup/g, '') + .replace(/\\bgroup/g, '') + .replace(/\\egroup/g, ''); +} + +/** + * Simplify LaTeX delimiter constructions + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function simplifyLatexDelimiters(content) { + console.log(' 🔧 Simplifying LaTeX delimiters...'); + + return content + .replace(/\\left\[\s*/g, '[') + .replace(/\s*\\right\]/g, ']'); +} + +/** + * Remove orphaned LaTeX labels + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function removeOrphanedLabels(content) { + console.log(' 🏷️ Removing orphaned labels...'); + + return content + .replace(/^\s*\\label\{[^}]+\}\s*$/gm, '') + .replace(/\\label\{[^}]+\}/g, ''); +} + +/** + * Fix 
KaTeX-incompatible math commands + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function fixMathCommands(content) { + console.log(' 📐 Fixing KaTeX-incompatible math commands...'); + + return content + // Replace \hdots with \ldots (KaTeX compatible) + .replace(/\\hdots/g, '\\ldots') + // Add more math command fixes here as needed + .replace(/\\vdots/g, '\\vdots'); // This one should be fine, but kept for consistency +} + +/** + * Convert LaTeX matrix commands to KaTeX-compatible environments + * @param {string} content - Markdown content + * @returns {string} - Content with fixed matrix commands + */ +function fixMatrixCommands(content) { + console.log(' 🔢 Converting matrix commands to KaTeX format...'); + + let fixedCount = 0; + + // Convert \pmatrix{...} to \begin{pmatrix}...\end{pmatrix} + content = content.replace(/\\pmatrix\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}/g, (match, matrixContent) => { + fixedCount++; + // Split by \\ for rows, handle nested braces + const rows = matrixContent.split('\\\\').map(row => row.trim()).filter(row => row); + return `\\begin{pmatrix}\n${rows.join(' \\\\\n')}\n\\end{pmatrix}`; + }); + + // Convert \bmatrix{...} to \begin{bmatrix}...\end{bmatrix} + content = content.replace(/\\bmatrix\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}/g, (match, matrixContent) => { + fixedCount++; + const rows = matrixContent.split('\\\\').map(row => row.trim()).filter(row => row); + return `\\begin{bmatrix}\n${rows.join(' \\\\\n')}\n\\end{bmatrix}`; + }); + + // Convert \vmatrix{...} to \begin{vmatrix}...\end{vmatrix} + content = content.replace(/\\vmatrix\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}/g, (match, matrixContent) => { + fixedCount++; + const rows = matrixContent.split('\\\\').map(row => row.trim()).filter(row => row); + return `\\begin{vmatrix}\n${rows.join(' \\\\\n')}\n\\end{vmatrix}`; + }); + + if (fixedCount > 0) { + console.log(` ✅ Fixed ${fixedCount} matrix command(s)`); + } + + return content; +} + +/** + * Fix 
Unicode characters that break MDX/JSX parsing + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function fixUnicodeIssues(content) { + console.log(' 🌐 Fixing Unicode characters for MDX compatibility...'); + + return content + // Replace Unicode middle dot (·) with \cdot in math expressions + .replace(/\$([^$]*?)·([^$]*?)\$/g, (match, before, after) => { + return `$${before}\\cdot${after}$`; + }) + // Replace Unicode middle dot in display math + .replace(/\$\$([^$]*?)·([^$]*?)\$\$/g, (match, before, after) => { + return `$$${before}\\cdot${after}$$`; + }) + // Replace other problematic Unicode characters + .replace(/[""]/g, '"') // Smart quotes to regular quotes + .replace(/['']/g, "'") // Smart apostrophes to regular apostrophes + .replace(/…/g, '...') // Ellipsis to three dots + .replace(/–/g, '-') // En dash to hyphen + .replace(/—/g, '--'); // Em dash to double hyphen +} + +/** + * Fix multiline math expressions for MDX compatibility + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function fixMultilineMath(content) { + console.log(' 📏 Fixing multiline math expressions for MDX...'); + + return content + // Convert multiline inline math to display math blocks (more precise regex) + // Only match if the content is a self-contained math expression within a single line + .replace(/\$([^$\n]*\\\\[^$\n]*)\$/g, (match, mathContent) => { + // Only convert if it contains actual math operators and line breaks + if (mathContent.includes('\\\\') && /[=+\-*/^_{}]/.test(mathContent)) { + // Remove leading/trailing whitespace and normalize newlines + const cleanedMath = mathContent + .replace(/^\s+|\s+$/g, '') + .replace(/\s*\\\\\s*/g, '\\\\\n '); + return `$$\n${cleanedMath}\n$$`; + } + return match; // Keep original if it doesn't look like multiline math + }) + // Ensure display math blocks are properly separated + .replace(/\$\$\s*\n\s*([^$]+?)\s*\n\s*\$\$/g, (match, mathContent) => { + 
return `\n$$\n${mathContent.trim()}\n$$\n`; + }); +} + +/** + * Inject code snippets into empty code blocks + * @param {string} content - Markdown content + * @param {string} inputDir - Directory containing the LaTeX source and snippets + * @returns {string} - Content with injected code snippets + */ +function injectCodeSnippets(content, inputDir = null) { + console.log(' 💻 Injecting code snippets...'); + + if (!inputDir) { + console.log(' ⚠️ No input directory provided, skipping code injection'); + return content; + } + + const snippetsDir = join(inputDir, 'snippets'); + + if (!existsSync(snippetsDir)) { + console.log(' ⚠️ Snippets directory not found, skipping code injection'); + return content; + } + + // Get all available snippet files + let availableSnippets = []; + try { + availableSnippets = readdirSync(snippetsDir); + console.log(` 📁 Found ${availableSnippets.length} snippet file(s): ${availableSnippets.join(', ')}`); + } catch (error) { + console.log(` ❌ Error reading snippets directory: ${error.message}`); + return content; + } + + // Find all empty code blocks + const emptyCodeBlockPattern = /```\s*(\w+)\s*\n\s*```/g; + + let processedContent = content; + let injectionCount = 0; + + processedContent = processedContent.replace(emptyCodeBlockPattern, (match, language) => { + // Map language names to file extensions + const extensionMap = { + 'python': 'py', + 'javascript': 'js', + 'typescript': 'ts', + 'bash': 'sh', + 'shell': 'sh' + }; + + const fileExtension = extensionMap[language] || language; + + // Try to find a matching snippet file for this language + const matchingFiles = availableSnippets.filter(file => + file.endsWith(`.${fileExtension}`) + ); + + if (matchingFiles.length === 0) { + console.log(` ⚠️ No ${language} snippet found (looking for .${fileExtension})`); + return match; + } + + // Use the first matching file (could be made smarter with context analysis) + const selectedFile = matchingFiles[0]; + const snippetPath = join(snippetsDir, 
selectedFile); + + try { + const snippetContent = readFileSync(snippetPath, 'utf8'); + injectionCount++; + console.log(` ✅ Injected: ${selectedFile}`); + return `\`\`\`${language}\n${snippetContent.trim()}\n\`\`\``; + } catch (error) { + console.log(` ❌ Error reading ${selectedFile}: ${error.message}`); + return match; + } + }); + + if (injectionCount > 0) { + console.log(` 📊 Injected ${injectionCount} code snippet(s)`); + } + + return processedContent; +} + +/** + * Fix all attributes that still contain colons (href, data-reference, id) + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function fixAllAttributes(content) { + console.log(' 🔗 Fixing all attributes with colons...'); + + let fixedCount = 0; + + // Fix href attributes containing colons + content = content.replace(/href="([^"]*):([^"]*)"/g, (match, before, after) => { + fixedCount++; + return `href="${before}-${after}"`; + }); + + // Fix data-reference attributes containing colons + content = content.replace(/data-reference="([^"]*):([^"]*)"/g, (match, before, after) => { + fixedCount++; + return `data-reference="${before}-${after}"`; + }); + + // Fix id attributes containing colons (like in Figure components) + content = content.replace(/id="([^"]*):([^"]*)"/g, (match, before, after) => { + fixedCount++; + return `id="${before}-${after}"`; + }); + + if (fixedCount > 0) { + console.log(` ✅ Fixed ${fixedCount} attribute(s) with colons`); + } + + return content; +} + +/** + * Fix link text content that still contains colons + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function fixLinkTextContent(content) { + console.log(' 📝 Fixing link text content with colons...'); + + let fixedCount = 0; + + // Fix text content within links that contain references with colons + // Pattern: <a ...>[text:content]</a> + const cleanedContent = content.replace(/<a([^>]*)>\[([^:]*):([^\]]*)\]<\/a>/g, (match, attributes, before, after) => { + 
fixedCount++; + return `<a${attributes}>[${before}-${after}]</a>`; + }); + + if (fixedCount > 0) { + console.log(` ✅ Fixed ${fixedCount} link text(s) with colons`); + } + + return cleanedContent; +} + +/** + * Convert align anchor markers to proper HTML spans outside math blocks + * @param {string} content - Markdown content + * @returns {string} - Content with converted anchor spans + */ +function convertAlignAnchors(content) { + console.log(' 🏷️ Converting align anchor markers to HTML spans...'); + + let convertedCount = 0; + + // Find and replace align anchor markers with proper spans outside math blocks + content = content.replace(/``` math\n%%ALIGN_ANCHOR_ID\{([^}]+)\}%%\n([\s\S]*?)\n```/g, (match, anchorId, mathContent) => { + convertedCount++; + return `<span id="${anchorId}"></span>\n\n\`\`\` math\n${mathContent}\n\`\`\``; + }); + + if (convertedCount > 0) { + console.log(` ✅ Converted ${convertedCount} align anchor marker(s) to spans`); + } + + return content; +} + +/** + * Main post-processing function that applies all cleanup steps + * @param {string} content - Raw Markdown content from Pandoc + * @param {string} inputDir - Optional: Directory containing LaTeX source for code injection + * @returns {string} - Cleaned Markdown content + */ +export function postProcessMarkdown(content, inputDir = null) { + console.log('🔧 Post-processing for KaTeX compatibility...'); + + let processedContent = content; + + // Apply each cleanup step sequentially + processedContent = removeTexGroupingCommands(processedContent); + processedContent = simplifyLatexDelimiters(processedContent); + processedContent = removeOrphanedLabels(processedContent); + processedContent = convertAlignAnchors(processedContent); + processedContent = fixMathCommands(processedContent); + processedContent = fixMatrixCommands(processedContent); + processedContent = fixUnicodeIssues(processedContent); + processedContent = fixMultilineMath(processedContent); + processedContent = fixAllAttributes(processedContent); + processedContent = 
fixLinkTextContent(processedContent); + + // Inject code snippets if input directory is provided + if (inputDir) { + processedContent = injectCodeSnippets(processedContent, inputDir); + } + + return processedContent; +} + +/** + * CLI interface for standalone usage + */ +function parseArgs() { + const args = process.argv.slice(2); + const config = { + input: join(__dirname, 'output', 'main.md'), + output: null, // Will default to input if not specified + verbose: false, + }; + + for (const arg of args) { + if (arg.startsWith('--input=')) { + config.input = arg.substring('--input='.length); + } else if (arg.startsWith('--output=')) { + config.output = arg.substring('--output='.length); + } else if (arg === '--verbose') { + config.verbose = true; + } else if (arg === '--help' || arg === '-h') { + console.log(` +🔧 Markdown Post-Processor + +Usage: + node post-processor.mjs [options] + +Options: + --input=PATH Input Markdown file (default: output/main.md) + --output=PATH Output file (default: overwrites input) + --verbose Verbose output + --help, -h Show this help + +Examples: + # Process main.md in-place + node post-processor.mjs + + # Process with custom paths + node post-processor.mjs --input=raw.md --output=clean.md + `); + process.exit(0); + } + } + + // Default output to input if not specified + if (!config.output) { + config.output = config.input; + } + + return config; +} + +function main() { + const config = parseArgs(); + + console.log('🔧 Markdown Post-Processor'); + console.log(`📁 Input: ${config.input}`); + console.log(`📁 Output: ${config.output}`); + + try { + const content = readFileSync(config.input, 'utf8'); + const processedContent = postProcessMarkdown(content); + + writeFileSync(config.output, processedContent); + + console.log(`✅ Post-processing completed: ${config.output}`); + + // Show stats if verbose + if (config.verbose) { + const originalLines = content.split('\n').length; + const processedLines = processedContent.split('\n').length; + 
console.log(`📊 Lines: ${originalLines} → ${processedLines}`); + } + + } catch (error) { + console.error('❌ Post-processing failed:'); + console.error(error.message); + process.exit(1); + } +} + +// Run CLI if called directly +if (import.meta.url === `file://${process.argv[1]}`) { + main(); +} diff --git a/app/scripts/latex-importer/reference-preprocessor.mjs b/app/scripts/latex-importer/reference-preprocessor.mjs new file mode 100644 index 0000000000000000000000000000000000000000..a3ae6ec933af1a90778a536a95d6675b7cfb5965 --- /dev/null +++ b/app/scripts/latex-importer/reference-preprocessor.mjs @@ -0,0 +1,239 @@ +#!/usr/bin/env node + +/** + * LaTeX Reference Preprocessor + * + * This module cleans up LaTeX references BEFORE Pandoc conversion to ensure + * consistent, MDX-compatible identifiers throughout the document. + * + * What it does: + * - Removes prefixes from labels: \label{sec:intro} → \label{sec-intro} + * - Updates corresponding refs: \ref{sec:intro} → \ref{sec-intro} + * - Handles all reference types: sec:, fig:, eq:, table:, etc. 
+ * - Maintains consistency between labels and references + */ + +/** + * Extract all references from LaTeX content + * @param {string} content - LaTeX content + * @returns {Object} - Object with labels and refs arrays + */ +function extractReferences(content) { + const references = { + labels: new Set(), + refs: new Set(), + cites: new Set() + }; + + // Find all \label{...} commands + const labelMatches = content.matchAll(/\\label\{([^}]+)\}/g); + for (const match of labelMatches) { + references.labels.add(match[1]); + } + + // Find all \ref{...} commands + const refMatches = content.matchAll(/\\ref\{([^}]+)\}/g); + for (const match of refMatches) { + references.refs.add(match[1]); + } + + // Find all \cite{...} commands (already handled in existing code but included for completeness) + const citeMatches = content.matchAll(/\\cite[tp]?\{([^}]+)\}/g); + for (const match of citeMatches) { + // Handle multiple citations: \cite{ref1,ref2,ref3} + const citations = match[1].split(',').map(cite => cite.trim()); + citations.forEach(cite => references.cites.add(cite)); + } + + return references; +} + +/** + * Create clean identifier mapping + * @param {Object} references - References object from extractReferences + * @returns {Map} - Mapping from original to clean identifiers + */ +function createCleanMapping(references) { + const mapping = new Map(); + + // Create mapping for all unique identifiers + const allIdentifiers = new Set([ + ...references.labels, + ...references.refs + ]); + + for (const id of allIdentifiers) { + // Remove common prefixes and replace colons with dashes + let cleanId = id + .replace(/^(sec|section|ch|chapter|fig|figure|eq|equation|tab|table|lst|listing|app|appendix):/gi, '') + .replace(/:/g, '-') + .replace(/[^a-zA-Z0-9_-]/g, '-') // Replace any other problematic characters + .replace(/-+/g, '-') // Collapse multiple dashes + .replace(/^-|-$/g, ''); // Remove leading/trailing dashes + + // Ensure we don't have empty identifiers + if (!cleanId) { 
+ cleanId = id.replace(/:/g, '-'); + } + + mapping.set(id, cleanId); + } + + return mapping; +} + +/** + * Convert labels to HTML anchor spans for better MDX compatibility + * @param {string} content - LaTeX content + * @param {Map} mapping - Identifier mapping (original -> clean) + * @returns {Object} - Result with content and count of conversions + */ +function convertLabelsToAnchors(content, mapping) { + let processedContent = content; + let anchorsCreated = 0; + + // Replace \label{...} with HTML anchor spans, but SKIP labels inside math environments + for (const [original, clean] of mapping) { + // Skip equation labels (they will be handled by the Lua filter) + if (original.startsWith('eq:')) { + continue; + } + + const labelRegex = new RegExp(`\\\\label\\{${escapeRegex(original)}\\}`, 'g'); + const labelMatches = processedContent.match(labelRegex); + + if (labelMatches) { + // Replace \label{original} with HTML span anchor (invisible but accessible) + processedContent = processedContent.replace(labelRegex, `\n\n<span id="${clean}"></span>\n\n`); + anchorsCreated += labelMatches.length; + } + } + + return { content: processedContent, anchorsCreated }; +} + +/** + * Convert \highlight{...} commands to HTML spans with CSS class + * @param {string} content - LaTeX content + * @returns {Object} - Result with content and count of conversions + */ +function convertHighlightCommands(content) { + let processedContent = content; + let highlightsConverted = 0; + + // Replace \highlight{...} with <span class="highlight">...</span> 
+ processedContent = processedContent.replace(/\\highlight\{([^}]+)\}/g, (match, text) => { + highlightsConverted++; + return `<span class="highlight">${text}</span>`; + }); + + return { content: processedContent, highlightsConverted }; +} + +/** + * Apply mapping to LaTeX content + * @param {string} content - Original LaTeX content + * @param {Map} mapping - Identifier mapping + * @returns {string} - Cleaned LaTeX content + */ +function applyMapping(content, mapping) { + let cleanedContent = content; + let changesCount = 0; + + // First, convert labels to anchor spans + const anchorResult = convertLabelsToAnchors(cleanedContent, mapping); + cleanedContent = anchorResult.content; + const anchorsCreated = anchorResult.anchorsCreated; + + // Convert \highlight{} commands to spans + const highlightResult = convertHighlightCommands(cleanedContent); + cleanedContent = highlightResult.content; + const highlightsConverted = highlightResult.highlightsConverted; + + // Then apply mapping to remaining references and equation labels + for (const [original, clean] of mapping) { + if (original !== clean) { + // Replace \ref{original} with \ref{clean} + const refRegex = new RegExp(`\\\\ref\\{${escapeRegex(original)}\\}`, 'g'); + const refMatches = cleanedContent.match(refRegex); + if (refMatches) { + cleanedContent = cleanedContent.replace(refRegex, `\\ref{${clean}}`); + changesCount += refMatches.length; + } + + // For equation labels, still clean the labels themselves (for the Lua filter) + if (original.startsWith('eq:')) { + const labelRegex = new RegExp(`\\\\label\\{${escapeRegex(original)}\\}`, 'g'); + const labelMatches = cleanedContent.match(labelRegex); + if (labelMatches) { + cleanedContent = cleanedContent.replace(labelRegex, `\\label{${clean}}`); + changesCount += labelMatches.length; + } + } + } + } + + return { + content: cleanedContent, + changesCount: changesCount + anchorsCreated, + highlightsConverted: highlightsConverted + }; +} + +/** + * Escape special regex characters + * @param {string} 
string - String to escape + * @returns {string} - Escaped string + */ +function escapeRegex(string) { + return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); +} + +/** + * Main preprocessing function + * @param {string} latexContent - Original LaTeX content + * @returns {Object} - Result with cleaned content and statistics + */ +export function preprocessLatexReferences(latexContent) { + console.log('🔧 Preprocessing LaTeX references for MDX compatibility...'); + + // 1. Extract all references + const references = extractReferences(latexContent); + + console.log(` 📊 Found: ${references.labels.size} labels, ${references.refs.size} refs`); + + // 2. Create clean mapping + const mapping = createCleanMapping(references); + + // 3. Apply mapping + const result = applyMapping(latexContent, mapping); + + if (result.changesCount > 0) { + console.log(` ✅ Processed ${result.changesCount} reference(s) and created anchor spans`); + + // Show some examples of changes + let exampleCount = 0; + for (const [original, clean] of mapping) { + if (original !== clean && exampleCount < 3) { + console.log(` ${original} → ${clean} (span + refs)`); + exampleCount++; + } + } + if (mapping.size > 3) { + console.log(` ... 
and ${mapping.size - 3} more anchor spans created`); + } + } else { + console.log(' ℹ️ No reference cleanup needed'); + } + + if (result.highlightsConverted > 0) { + console.log(` ✨ Converted ${result.highlightsConverted} \\highlight{} command(s) to <span class="highlight">`); + } + + return { + content: result.content, + changesCount: result.changesCount, + mapping: mapping, + references: references + }; +} diff --git a/app/scripts/notion-importer/.cursorignore b/app/scripts/notion-importer/.cursorignore new file mode 100644 index 0000000000000000000000000000000000000000..2eea525d885d5148108f6f3a9a8613863f783d36 --- /dev/null +++ b/app/scripts/notion-importer/.cursorignore @@ -0,0 +1 @@ +.env \ No newline at end of file diff --git a/app/scripts/notion-importer/README.md b/app/scripts/notion-importer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..998806457a853a6dc93a8bd393df921c0aea5eb4 --- /dev/null +++ b/app/scripts/notion-importer/README.md @@ -0,0 +1,334 @@ +# Notion Importer + +Complete Notion to MDX (Markdown + JSX) importer optimized for Astro with advanced media handling, interactive components, and seamless integration.
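When a `NOTION_PAGE_ID` is supplied, the importer fetches the page title from the Notion API and derives the output slug from it (see `createPagesConfigFromEnv` in `index.mjs`). That normalization can be sketched standalone — the `slugify` name here is illustrative, not part of this codebase:

```javascript
// Illustrative sketch of the title → slug normalization used by the importer.
// Mirrors the chained replace() calls in index.mjs; `slugify` is a made-up name.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^\w\s-]/g, '') // strip punctuation, keep word chars/spaces/dashes
    .replace(/\s+/g, '-')     // collapse whitespace runs into single dashes
    .replace(/-+/g, '-')      // collapse repeated dashes
    .trim();
}

console.log(slugify('Getting Started with Notion!')); // "getting-started-with-notion"
```

Note that `.trim()` runs after whitespace has already been replaced with dashes, so leading/trailing dashes would survive — this matches the order of operations in `index.mjs`.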
+ +## 🚀 Quick Start + +### Method 1: Using NOTION_PAGE_ID (Recommended) + +```bash +# Install dependencies +npm install + +# Setup environment variables +cp env.example .env +# Edit .env with your Notion token and page ID + +# Complete Notion → MDX conversion (fetches title/slug automatically) +NOTION_TOKEN=secret_xxx NOTION_PAGE_ID=abc123 node index.mjs + +# Or use .env file +node index.mjs +``` + +### Method 2: Using pages.json (Legacy) + +```bash +# Install dependencies +npm install + +# Setup environment variables +cp env.example .env +# Edit .env with your Notion token + +# Configure pages in input/pages.json +# { +# "pages": [ +# { +# "id": "your-page-id", +# "title": "Title", +# "slug": "slug" +# } +# ] +# } + +# Complete Notion → MDX conversion +node index.mjs + +# For step-by-step debugging +node notion-converter.mjs # Notion → Markdown +node mdx-converter.mjs # Markdown → MDX +``` + +## 📁 Structure + +``` +notion-importer/ +├── index.mjs # Complete Notion → MDX pipeline +├── notion-converter.mjs # Notion → Markdown with notion-to-md v4 +├── mdx-converter.mjs # Markdown → MDX with Astro components +├── post-processor.mjs # Markdown post-processing +├── package.json # Dependencies and scripts +├── env.example # Environment variables template +├── static/ # Static files injected at build time +│ ├── frontmatter.mdx # Static frontmatter (overrides all others) +│ └── bibliography.bib # Static bibliography +├── input/ # Configuration +│ └── pages.json # Notion pages to convert +└── output/ # Results + ├── *.md # Intermediate Markdown + ├── *.mdx # Final MDX for Astro + └── media/ # Downloaded media files +``` + +## ✨ Key Features + +### 🎯 **Advanced Media Handling** +- **Local download**: Automatic download of all Notion media (images, files, PDFs) +- **Path transformation**: Smart path conversion for web accessibility +- **Image components**: Automatic conversion to Astro `Image` components with zoom/download +- **Media organization**: Structured media storage 
by page ID + +### 🧮 **Interactive Components** +- **Callouts → Notes**: Notion callouts converted to Astro `Note` components +- **Enhanced tables**: Tables wrapped in styled containers +- **Code blocks**: Enhanced with copy functionality +- **Automatic imports**: Smart component and image import generation + +### 🎨 **Smart Formatting** +- **Link fixing**: Notion internal links converted to relative links +- **Artifact cleanup**: Removal of Notion-specific formatting artifacts +- **Static frontmatter**: Priority injection of custom frontmatter from `static/frontmatter.mdx` +- **Static bibliography**: Automatic copying of `static/bibliography.bib` +- **Astro compatibility**: Full compatibility with Astro MDX processing + +### 🔧 **Robust Pipeline** +- **Notion preprocessing**: Advanced page configuration and media strategy +- **Post-processing**: Markdown cleanup and optimization +- **MDX conversion**: Final transformation with Astro components +- **Auto-copy**: Automatic copying to Astro content directory + +## 📄 Static Files Configuration + +The importer supports static files for consistent metadata and bibliography: + +### Frontmatter (`static/frontmatter.mdx`) +Create this file to override frontmatter across all conversions: + +```yaml +--- +title: "My Article Title" +subtitle: "Optional subtitle" +description: "Article description for SEO" +authors: + - name: "Jane Doe" + url: "https://example.com" + affiliations: + - "Hugging Face" +tags: + - AI + - Research +doi: "10.1000/182" +tableOfContentsAutoCollapse: true +--- +``` + +This static frontmatter takes **highest priority** over any Notion metadata or existing frontmatter. + +### Bibliography (`static/bibliography.bib`) +Add your BibTeX entries to be copied to `src/content/bibliography.bib`: + +```bibtex +@article{example2024, + title={Example Article}, + author={Doe, Jane and Smith, John}, + journal={Example Journal}, + year={2024} +} +``` + +## 📊 Example Workflow + +```bash +# 1. 
Configure your Notion pages +# Edit input/pages.json with your page IDs + +# 2. Complete automatic conversion +NOTION_TOKEN=your_token node index.mjs --clean + +# 3. Generated results +ls output/ +# → getting-started.md (Intermediate Markdown) +# → getting-started.mdx (Final MDX for Astro) +# → media/ (downloaded images and files) +``` + +### 📋 Conversion Result + +The pipeline generates MDX files optimized for Astro with: + +```mdx +--- +title: "Getting Started with Notion" +published: "2024-01-15" +tableOfContentsAutoCollapse: true +--- + +import Image from '../components/Image.astro'; +import Note from '../components/Note.astro'; +import gettingStartedImage from './media/getting-started/image1.png'; + +## Introduction + +Here is some content with a callout: + +<Note> +This is a converted Notion callout. +</Note> + +And an image: + +<Image src={gettingStartedImage} alt="Example image" /> +``` + +## ⚙️ Required Astro Configuration + +To use the generated MDX files, ensure your Astro project has the required components: + +```astro +// src/components/Figure.astro +--- +export interface Props { + src: any; + alt?: string; + caption?: string; + zoomable?: boolean; + downloadable?: boolean; + layout?: string; + id?: string; +} + +const { src, alt, caption, zoomable, downloadable, layout, id } = Astro.props; +--- + +<figure id={id}> + <img src={src} alt={alt} /> + {caption && <figcaption>{caption}</figcaption>} +</figure>
        +``` + +## 🛠️ Prerequisites + +- **Node.js** with ESM support +- **Notion Integration**: Set up an integration in your Notion workspace +- **Notion Token**: Copy the "Internal Integration Token" +- **Shared Pages**: Share the specific Notion page(s) with your integration +- **Astro** to use the generated MDX + +## 🎯 Technical Architecture + +### 4-Stage Pipeline + +1. **Notion Preprocessing** (`notion-converter.mjs`) + - Configuration loading from `pages.json` + - Notion API client initialization + - Media download strategy configuration + +2. **Notion-to-Markdown** (notion-to-md v4) + - Page conversion with `NotionConverter` + - Media downloading with `downloadMediaTo()` + - File export with `DefaultExporter` + +3. **Markdown Post-processing** (`post-processor.mjs`) + - Notion artifact cleanup + - Link fixing and optimization + - Table and code block enhancement + +4. **MDX Conversion** (`mdx-converter.mjs`) + - Component transformation (Figure, Note) + - Automatic import generation + - Frontmatter enhancement + - Astro compatibility optimization + +## 📊 Configuration Options + +### Pages Configuration (`input/pages.json`) + +```json +{ + "pages": [ + { + "id": "your-notion-page-id", + "title": "Page Title", + "slug": "page-slug" + } + ] +} +``` + +### Environment Variables + +Copy `env.example` to `.env` and configure: + +```bash +cp env.example .env +# Edit .env with your actual Notion token +``` + +Required variables: +```bash +NOTION_TOKEN=secret_your_notion_integration_token_here +``` + +### Command Line Options + +```bash +# Full workflow +node index.mjs --clean --token=your_token + +# Notion to Markdown only +node index.mjs --notion-only + +# Markdown to MDX only +node index.mjs --mdx-only + +# Custom paths +node index.mjs --input=my-pages.json --output=converted/ +``` + +## 📊 Conversion Statistics + +For a typical Notion page: +- **Media files** automatically downloaded and organized +- **Callouts** converted to interactive Note components +- 
**Images** transformed to Figure components with zoom/download +- **Tables** enhanced with proper styling containers +- **Code blocks** enhanced with copy functionality +- **Links** fixed for proper internal navigation + +## ✅ Project Status + +### 🎉 **Complete Features** +- ✅ **Notion → MDX Pipeline**: Full end-to-end functional conversion +- ✅ **Media Management**: Automatic download and path transformation +- ✅ **Component Integration**: Seamless Astro component integration +- ✅ **Smart Formatting**: Intelligent cleanup and optimization +- ✅ **Robustness**: Error handling and graceful degradation +- ✅ **Flexibility**: Modular pipeline with step-by-step options + +### 🚀 **Production Ready** +The toolkit is now **100% operational** for converting Notion pages to MDX/Astro with all advanced features (media handling, component integration, smart formatting). + +## 🔗 Integration with notion-to-md v4 + +This toolkit leverages the powerful [notion-to-md v4](https://notionconvert.com/docs/v4/guides/) library with: + +- **Advanced Media Strategies**: Download, upload, and direct media handling +- **Custom Renderers**: Block transformers and annotation transformers +- **Exporter Plugins**: File, buffer, and stdout output options +- **Database Support**: Full database property and frontmatter transformation +- **Page References**: Smart internal link handling + +## 📚 Additional Resources + +- [notion-to-md v4 Documentation](https://notionconvert.com/docs/v4/guides/) +- [Notion API Documentation](https://developers.notion.com/) +- [Astro MDX Documentation](https://docs.astro.build/en/guides/integrations-guide/mdx/) +- [Media Handling Strategies](https://notionconvert.com/blog/mastering-media-handling-in-notion-to-md-v4-download-upload-and-direct-strategies/) +- [Frontmatter Transformation](https://notionconvert.com/blog/how-to-convert-notion-properties-to-frontmatter-with-notion-to-md-v4/) diff --git a/app/scripts/notion-importer/env.example 
b/app/scripts/notion-importer/env.example new file mode 100644 index 0000000000000000000000000000000000000000..7b89b420f3d18d11035486c98019d406ab813599 --- /dev/null +++ b/app/scripts/notion-importer/env.example @@ -0,0 +1,2 @@ +NOTION_TOKEN=ntn_xxx +NOTION_PAGE_ID=xxx diff --git a/app/scripts/notion-importer/index.mjs b/app/scripts/notion-importer/index.mjs new file mode 100644 index 0000000000000000000000000000000000000000..a09e81236f88cc9408453e4b394479cc9bd70769 --- /dev/null +++ b/app/scripts/notion-importer/index.mjs @@ -0,0 +1,494 @@ +#!/usr/bin/env node + +import { config } from 'dotenv'; +import { join, dirname, basename } from 'path'; +import { fileURLToPath } from 'url'; +import { copyFileSync, existsSync, mkdirSync, readFileSync, writeFileSync, readdirSync, statSync, unlinkSync } from 'fs'; +import { convertNotionToMarkdown } from './notion-converter.mjs'; +import { convertToMdx } from './mdx-converter.mjs'; +import { Client } from '@notionhq/client'; + +// Load environment variables from .env file (but don't override existing ones) +config({ override: false }); + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); + +// Default configuration +const DEFAULT_INPUT = join(__dirname, 'input', 'pages.json'); +const DEFAULT_OUTPUT = join(__dirname, 'output'); +const ASTRO_CONTENT_PATH = join(__dirname, '..', '..', 'src', 'content', 'article.mdx'); +const ASTRO_ASSETS_PATH = join(__dirname, '..', '..', 'src', 'content', 'assets', 'image'); +const ASTRO_BIB_PATH = join(__dirname, '..', '..', 'src', 'content', 'bibliography.bib'); +const STATIC_BIB_PATH = join(__dirname, 'static', 'bibliography.bib'); + +function parseArgs() { + const args = process.argv.slice(2); + const config = { + input: DEFAULT_INPUT, + output: DEFAULT_OUTPUT, + clean: false, + notionOnly: false, + mdxOnly: false, + token: process.env.NOTION_TOKEN, + pageId: process.env.NOTION_PAGE_ID + }; + + for (const arg of args) { + if 
(arg.startsWith('--input=')) { + config.input = arg.split('=')[1]; + } else if (arg.startsWith('--output=')) { + config.output = arg.split('=')[1]; + } else if (arg.startsWith('--token=')) { + config.token = arg.split('=')[1]; + } else if (arg.startsWith('--page-id=')) { + config.pageId = arg.split('=')[1]; + } else if (arg === '--clean') { + config.clean = true; + } else if (arg === '--notion-only') { + config.notionOnly = true; + } else if (arg === '--mdx-only') { + config.mdxOnly = true; + } + } + + return config; +} + +function showHelp() { + console.log(` +🚀 Notion to MDX Toolkit + +Usage: + node index.mjs [options] + +Options: + --input=PATH Input pages configuration file (default: input/pages.json) + --output=PATH Output directory (default: output/) + --token=TOKEN Notion API token (or set NOTION_TOKEN env var) + --page-id=ID Notion page ID (or set NOTION_PAGE_ID env var) + --clean Clean output directory before processing + --notion-only Only convert Notion to Markdown (skip MDX conversion) + --mdx-only Only convert existing Markdown to MDX + --help, -h Show this help + +Environment Variables: + NOTION_TOKEN Your Notion integration token + NOTION_PAGE_ID Notion page to import (alternative to input/pages.json) + +Examples: + # Full conversion workflow + NOTION_TOKEN=your_token node index.mjs --clean + + # Only convert Notion pages to Markdown + node index.mjs --notion-only --token=your_token + + # Only convert existing Markdown to MDX + node index.mjs --mdx-only + + # Custom paths + node index.mjs --input=my-pages.json --output=converted/ --token=your_token + +Configuration File Format (pages.json): +{ + "pages": [ + { + "id": "your-notion-page-id", + "title": "Page Title", + "slug": "page-slug" + } + ] +} + +Workflow: + 1. Notion → Markdown (with media download) + 2. Markdown → MDX (with Astro components) + 3. 
Copy to Astro content directory +`); +} + +function ensureDirectory(dir) { + if (!existsSync(dir)) { + mkdirSync(dir, { recursive: true }); + } +} + +async function cleanDirectory(dir) { + if (existsSync(dir)) { + const { execSync } = await import('child_process'); + execSync(`rm -rf "${dir}"/*`, { stdio: 'inherit' }); + } +} + +function readPagesConfig(inputFile) { + try { + const content = readFileSync(inputFile, 'utf8'); + return JSON.parse(content); + } catch (error) { + console.error(`❌ Error reading pages config: ${error.message}`); + return { pages: [] }; + } +} + +/** + * Create a temporary pages.json from NOTION_PAGE_ID environment variable + * Extracts title and generates slug from the Notion page + */ +async function createPagesConfigFromEnv(pageId, token, outputPath) { + try { + console.log('🔍 Fetching page info from Notion API...'); + const notion = new Client({ auth: token }); + const page = await notion.pages.retrieve({ page_id: pageId }); + + // Extract title + let title = 'Article'; + if (page.properties.title && page.properties.title.title && page.properties.title.title.length > 0) { + title = page.properties.title.title[0].plain_text; + } else if (page.properties.Name && page.properties.Name.title && page.properties.Name.title.length > 0) { + title = page.properties.Name.title[0].plain_text; + } + + // Generate slug from title + const slug = title + .toLowerCase() + .replace(/[^\w\s-]/g, '') + .replace(/\s+/g, '-') + .replace(/-+/g, '-') + .trim(); + + console.log(` ✅ Found page: "${title}" (slug: ${slug})`); + + // Create pages config + const pagesConfig = { + pages: [{ + id: pageId, + title: title, + slug: slug + }] + }; + + // Write to temporary file + writeFileSync(outputPath, JSON.stringify(pagesConfig, null, 4)); + console.log(` ✅ Created temporary pages config`); + + return pagesConfig; + } catch (error) { + console.error(`❌ Error fetching page from Notion: ${error.message}`); + throw error; + } +} + +/** + * Final cleanup function to 
remove exclude tags and unused imports + * @param {string} content - MDX content + * @returns {string} - Cleaned content + */ +function cleanupExcludeTagsAndImports(content) { + let cleanedContent = content; + let removedCount = 0; + const removedImageVariables = new Set(); + + // First, extract image variable names from exclude blocks before removing them + const excludeBlocks = cleanedContent.match(/<exclude>[\s\S]*?<\/exclude>/g) || []; + excludeBlocks.forEach(match => { + const imageMatches = match.match(/src=\{([^}]+)\}/g); + if (imageMatches) { + imageMatches.forEach(imgMatch => { + const varName = imgMatch.match(/src=\{([^}]+)\}/)?.[1]; + if (varName) { + removedImageVariables.add(varName); + } + }); + } + }); + + // Remove <exclude> tags and everything between them (including multiline) + cleanedContent = cleanedContent.replace(/<exclude>[\s\S]*?<\/exclude>/g, (match) => { + removedCount++; + return ''; + }); + + // Remove unused image imports that were only used in exclude blocks + if (removedImageVariables.size > 0) { + removedImageVariables.forEach(varName => { + // Check if the variable is still used elsewhere in the content after removing exclude blocks + const remainingUsage = cleanedContent.includes(`{${varName}}`) || cleanedContent.includes(`src={${varName}}`); + + if (!remainingUsage) { + // Remove import lines for unused image variables + // Pattern: import VarName from './assets/image/filename'; + const importPattern = new RegExp(`import\\s+${varName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\s+from\\s+['"][^'"]+['"];?\\s*`, 'g'); + cleanedContent = cleanedContent.replace(importPattern, ''); + console.log(` 🗑️ Removed unused import: ${varName}`); + } + }); + } + + if (removedCount > 0) { + console.log(` 🧹 Final cleanup: removed ${removedCount} exclude block(s) and ${removedImageVariables.size} unused import(s)`); + } + + // Ensure there's always a blank line after imports before content starts + // Find the last import line and ensure there's a blank line before the next 
non-empty line + const lines = cleanedContent.split('\n'); + let lastImportIndex = -1; + + // Find the last import line + for (let i = 0; i < lines.length; i++) { + if (lines[i].trim().startsWith('import ') && lines[i].trim().endsWith(';')) { + lastImportIndex = i; + } + } + + // If we found imports, ensure there's a blank line after the last one + if (lastImportIndex >= 0) { + // Find the next non-empty line after the last import + let nextNonEmptyIndex = lastImportIndex + 1; + while (nextNonEmptyIndex < lines.length && lines[nextNonEmptyIndex].trim() === '') { + nextNonEmptyIndex++; + } + + // If there's no blank line between the last import and next content, add one + if (nextNonEmptyIndex > lastImportIndex + 1) { + // There are already blank lines, this is fine + } else { + // No blank line, add one + lines.splice(nextNonEmptyIndex, 0, ''); + } + + cleanedContent = lines.join('\n'); + } + + return cleanedContent; +} + +function copyToAstroContent(outputDir) { + console.log('📋 Copying MDX files to Astro content directory...'); + + try { + // Ensure Astro directories exist + mkdirSync(dirname(ASTRO_CONTENT_PATH), { recursive: true }); + mkdirSync(ASTRO_ASSETS_PATH, { recursive: true }); + + // Copy MDX file + const files = readdirSync(outputDir); + const mdxFiles = files.filter(file => file.endsWith('.mdx')); + if (mdxFiles.length > 0) { + const mdxFile = join(outputDir, mdxFiles[0]); // Take the first MDX file + // Read and write instead of copy to avoid EPERM issues + let mdxContent = readFileSync(mdxFile, 'utf8'); + + // Apply final cleanup to ensure no exclude tags or unused imports remain + mdxContent = cleanupExcludeTagsAndImports(mdxContent); + + writeFileSync(ASTRO_CONTENT_PATH, mdxContent); + console.log(` ✅ Copied and cleaned MDX to ${ASTRO_CONTENT_PATH}`); + } + + // Copy images from both media and external-images directories + const imageExtensions = ['.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp', '.bmp', '.tiff', '.html']; + let totalImageCount = 
0; + + function copyImagesRecursively(dir, sourceName) { + if (!existsSync(dir)) return; + + const files = readdirSync(dir); + for (const file of files) { + const filePath = join(dir, file); + const stat = statSync(filePath); + + if (stat.isDirectory()) { + copyImagesRecursively(filePath, sourceName); + } else if (imageExtensions.some(ext => file.toLowerCase().endsWith(ext))) { + const filename = basename(filePath); + const destPath = join(ASTRO_ASSETS_PATH, filename); + + try { + // Validate image by checking file size and basic structure + const stats = statSync(filePath); + if (stats.size === 0) { + console.log(` ⚠️ Skipping empty image: ${filename}`); + continue; // skip this file, keep processing the rest + } + + // Try to copy and validate the result + copyFileSync(filePath, destPath); + + // Additional validation - check if the copied file has reasonable size + const destStats = statSync(destPath); + if (destStats.size === 0) { + console.log(` ❌ Failed to copy corrupted image: ${filename}`); + // Remove the empty file + try { + unlinkSync(destPath); + } catch (e) { } + continue; // skip this file, keep processing the rest + } + + console.log(` ✅ Copied ${sourceName}: ${filename} (${destStats.size} bytes)`); + totalImageCount++; + } catch (error) { + console.log(` ❌ Failed to copy ${filename}: ${error.message}`); + } + } + } + } + + // Copy images from media directory (Notion images) + const mediaDir = join(outputDir, 'media'); + copyImagesRecursively(mediaDir, 'Notion image'); + + // Copy images from external-images directory (downloaded external images) + const externalImagesDir = join(outputDir, 'external-images'); + copyImagesRecursively(externalImagesDir, 'external image'); + + if (totalImageCount > 0) { + console.log(` ✅ Copied ${totalImageCount} total image(s) to ${ASTRO_ASSETS_PATH}`); + } + + // Always update image paths and filter problematic references in MDX file + if (existsSync(ASTRO_CONTENT_PATH)) { + const mdxContent = readFileSync(ASTRO_CONTENT_PATH, 'utf8'); + let updatedContent = mdxContent.replace(/\.\/media\//g, 
'./assets/image/'); + // Remove the subdirectory from image paths since we copy images directly to assets/image/ + updatedContent = updatedContent.replace(/\.\/assets\/image\/[^\/]+\//g, './assets/image/'); + + // Check which images actually exist and remove references to missing/corrupted ones + const imageReferences = updatedContent.match(/\.\/assets\/image\/[^\s\)]+/g) || []; + const existingImages = existsSync(ASTRO_ASSETS_PATH) ? readdirSync(ASTRO_ASSETS_PATH).filter(f => + ['.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp', '.bmp', '.tiff'].some(ext => f.toLowerCase().endsWith(ext)) + ) : []; + + for (const imgRef of imageReferences) { + const filename = basename(imgRef); + if (!existingImages.includes(filename)) { + console.log(` ⚠️ Removing reference to missing/corrupted image: ${filename}`); + // Remove the entire image reference (both Image component and markdown syntax) + updatedContent = updatedContent.replace( + new RegExp(`<Image[^>]*src=["']${imgRef.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}["'][^>]*\/?>`, 'g'), + '' + ); + updatedContent = updatedContent.replace( + new RegExp(`!\\[.*?\\]\\(${imgRef.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\)`, 'g'), + '' + ); + } + } + + writeFileSync(ASTRO_CONTENT_PATH, updatedContent); + console.log(` ✅ Updated image paths and filtered problematic references in MDX file`); + } + + // Copy static bibliography.bib if it exists, otherwise create empty + if (existsSync(STATIC_BIB_PATH)) { + const bibContent = readFileSync(STATIC_BIB_PATH, 'utf8'); + writeFileSync(ASTRO_BIB_PATH, bibContent); + console.log(` ✅ Copied static bibliography from ${STATIC_BIB_PATH}`); + } else { + writeFileSync(ASTRO_BIB_PATH, ''); + console.log(` ✅ Created empty bibliography (no static file found)`); + } + + } catch (error) { + console.warn(` ⚠️ Failed to copy to Astro: ${error.message}`); + } +} + + +async function main() { + const args = process.argv.slice(2); + + if (args.includes('--help') || args.includes('-h')) { + showHelp(); + process.exit(0); + 
} + + const config = parseArgs(); + + console.log('🚀 Notion to MDX Toolkit'); + console.log('========================'); + + try { + // Prepare input config file + let inputConfigFile = config.input; + let pageIdFromEnv = null; + + // If NOTION_PAGE_ID is provided via env var, create temporary pages.json + if (config.pageId && config.token) { + console.log('✨ Using NOTION_PAGE_ID from environment variable'); + const tempConfigPath = join(config.output, '.temp-pages.json'); + ensureDirectory(config.output); + await createPagesConfigFromEnv(config.pageId, config.token, tempConfigPath); + inputConfigFile = tempConfigPath; + pageIdFromEnv = config.pageId; + } else if (!existsSync(config.input)) { + console.error(`❌ No NOTION_PAGE_ID environment variable and no pages.json found at: ${config.input}`); + console.log('💡 Either set NOTION_PAGE_ID env var or create input/pages.json'); + process.exit(1); + } + + // Always clean output directory to avoid conflicts with previous imports + console.log('🧹 Cleaning output directory to avoid conflicts...'); + await cleanDirectory(config.output); + + // Clean assets/image directory and ensure proper permissions + console.log('🧹 Cleaning assets/image directory and setting permissions...'); + if (existsSync(ASTRO_ASSETS_PATH)) { + await cleanDirectory(ASTRO_ASSETS_PATH); + } else { + ensureDirectory(ASTRO_ASSETS_PATH); + } + + // Ensure proper permissions for assets directory + const { execSync } = await import('child_process'); + try { + execSync(`chmod -R 755 "${ASTRO_ASSETS_PATH}"`, { stdio: 'inherit' }); + console.log(' ✅ Set permissions for assets/image directory'); + } catch (error) { + console.log(' ⚠️ Could not set permissions (non-critical):', error.message); + } + + if (config.mdxOnly) { + // Only convert existing Markdown to MDX + console.log('📝 MDX conversion only mode'); + await convertToMdx(config.output, config.output); + copyToAstroContent(config.output); + + } else if (config.notionOnly) { + // Only convert Notion to 
Markdown + console.log('📄 Notion conversion only mode'); + await convertNotionToMarkdown(inputConfigFile, config.output, config.token); + + } else { + // Full workflow + console.log('🔄 Full conversion workflow'); + + // Step 1: Convert Notion to Markdown + console.log('\n📄 Step 1: Converting Notion pages to Markdown...'); + await convertNotionToMarkdown(inputConfigFile, config.output, config.token); + + // Step 2: Convert Markdown to MDX with Notion metadata + console.log('\n📝 Step 2: Converting Markdown to MDX...'); + const pagesConfig = readPagesConfig(inputConfigFile); + const firstPage = pagesConfig.pages && pagesConfig.pages.length > 0 ? pagesConfig.pages[0] : null; + const pageId = pageIdFromEnv || (firstPage ? firstPage.id : null); + await convertToMdx(config.output, config.output, pageId, config.token); + + // Step 3: Copy to Astro content directory + console.log('\n📋 Step 3: Copying to Astro content directory...'); + copyToAstroContent(config.output); + } + + console.log('\n🎉 Conversion completed successfully!'); + + } catch (error) { + console.error('❌ Error:', error.message); + process.exit(1); + } +} + +// Export functions for use as module +export { convertNotionToMarkdown, convertToMdx }; + +// Run CLI if called directly +if (import.meta.url === `file://${process.argv[1]}`) { + main(); +} diff --git a/app/scripts/notion-importer/input/pages.json b/app/scripts/notion-importer/input/pages.json new file mode 100644 index 0000000000000000000000000000000000000000..d043e3a1081cb57d6605c813415ae02f847db229 --- /dev/null +++ b/app/scripts/notion-importer/input/pages.json @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2d51fba4ce9b05562f5df611a150e3cd702b487d2e608441318336556e0f248a +size 188 diff --git a/app/scripts/notion-importer/mdx-converter.mjs b/app/scripts/notion-importer/mdx-converter.mjs new file mode 100644 index 0000000000000000000000000000000000000000..8d6a4e206dfe4bae21217d8c9cdd3c8d91a25583 --- /dev/null +++ 
b/app/scripts/notion-importer/mdx-converter.mjs @@ -0,0 +1,863 @@ +#!/usr/bin/env node + +import { readFileSync, writeFileSync, existsSync, mkdirSync, readdirSync, statSync } from 'fs'; +import { join, dirname, basename, extname } from 'path'; +import { fileURLToPath } from 'url'; +import matter from 'gray-matter'; +import fetch from 'node-fetch'; + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); + +// Configuration +const DEFAULT_INPUT = join(__dirname, 'output'); +const DEFAULT_OUTPUT = join(__dirname, 'output'); +const STATIC_FRONTMATTER_PATH = join(__dirname, 'static', 'frontmatter.mdx'); + +function parseArgs() { + const args = process.argv.slice(2); + const config = { + input: DEFAULT_INPUT, + output: DEFAULT_OUTPUT, + }; + const positionals = []; + + for (const arg of args) { + if (arg.startsWith('--input=')) { + config.input = arg.substring('--input='.length); + } else if (arg.startsWith('--output=')) { + config.output = arg.substring('--output='.length); + } else if (arg === '--help' || arg === '-h') { + console.log(` +📝 Notion Markdown to MDX Converter + +Usage: + node mdx-converter.mjs [options] + +Options: + --input=PATH Input directory or file (default: ${DEFAULT_INPUT}) + --output=PATH Output directory (default: ${DEFAULT_OUTPUT}) + --help, -h Show this help + +Examples: + # Convert all markdown files in output directory + node mdx-converter.mjs + + # Convert specific file + node mdx-converter.mjs --input=article.md --output=converted/ + + # Convert directory + node mdx-converter.mjs --input=markdown-files/ --output=mdx-files/ + `); + process.exit(0); + } else if (!arg.startsWith('--')) { + // Bare arguments are positional: first overrides input, second overrides output + positionals.push(arg); + } + } + if (positionals[0]) config.input = positionals[0]; + if (positionals[1]) config.output = positionals[1]; + return config; +} + +/** + * Track which Astro components are used during transformations + */ +const usedComponents = new Set(); + +/** + * Track individual image imports needed + */ +const imageImports = new Map(); // src -> varName + +/** + * Track external images that 
need to be downloaded + */ +const externalImagesToDownload = new Map(); // url -> localPath + +/** + * Generate a variable name from image path + * @param {string} src - Image source path + * @returns {string} - Valid variable name + */ +function generateImageVarName(src) { + // Extract filename without extension and make it a valid JS variable + const filename = src.split('/').pop().replace(/\.[^.]+$/, ''); + return filename.replace(/[^a-zA-Z0-9]/g, '_').replace(/^[0-9]/, 'img_$&'); +} + +/** + * Check if a URL is an external URL (HTTP/HTTPS) + * @param {string} url - URL to check + * @returns {boolean} - True if it's an external URL + */ +function isExternalImageUrl(url) { + try { + const urlObj = new URL(url); + // Just check if it's HTTP/HTTPS - we'll try to download everything + return urlObj.protocol === 'http:' || urlObj.protocol === 'https:'; + } catch { + return false; + } +} + +/** + * Extract image URL from Twitter/X page + * @param {string} tweetUrl - URL of the tweet + * @returns {Promise<string|null>} - URL of the image or null if not found + */ +async function extractTwitterImageUrl(tweetUrl) { + try { + const response = await fetch(tweetUrl, { + headers: { + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36' + } + }); + + if (!response.ok) { + return null; + } + + const html = await response.text(); + + // Try to find image URLs in meta tags (Twitter Card) + const metaImageMatch = html.match(/<meta\s+(?:property|name)=["'](?:og:image|twitter:image)["']\s+content=["']([^"']+)["']/i); + if (metaImageMatch && metaImageMatch[1]) { + return metaImageMatch[1]; + } + + return null; + } catch { + return null; + } +} + +/** + * Download an external image to a local file + * @param {string} imageUrl - URL of the image (or page) to download + * @param {string} outputDir - Directory to save the downloaded file + * @returns {Promise<string>} - Local path to the downloaded file + */ +async function downloadExternalImage(imageUrl, outputDir) { + try { + console.log(` 🌐 Downloading external URL: ${imageUrl}`); + + // Create output directory if it doesn't exist + if (!existsSync(outputDir)) { + mkdirSync(outputDir, { recursive: true }); + } + + let actualImageUrl = imageUrl; + + // Check if it's a Twitter/X URL + if (imageUrl.includes('twitter.com/') || imageUrl.includes('x.com/')) { + console.log(` 🐦 Detected Twitter/X URL, 
attempting to extract image...`); + const extractedUrl = await extractTwitterImageUrl(imageUrl); + if (extractedUrl) { + actualImageUrl = extractedUrl; + console.log(` ✅ Extracted image URL: ${extractedUrl}`); + } else { + console.log(` ⚠️ Could not automatically extract image from Twitter/X`); + console.log(` 💡 Manual download required:`); + console.log(` 1. Open ${imageUrl} in your browser`); + console.log(` 2. Right-click on the image and "Save image as..."`); + console.log(` 3. Save it to: app/src/content/assets/image/`); + throw new Error('Twitter/X images require manual download'); + } + } + + // Generate filename from URL + const urlObj = new URL(actualImageUrl); + const pathname = urlObj.pathname; + + // Determine file extension - try to get it from URL, default to jpg + let extension = 'jpg'; + if (pathname.includes('.')) { + const urlExtension = pathname.split('.').pop().toLowerCase(); + if (['jpg', 'jpeg', 'png', 'gif', 'svg', 'webp', 'bmp', 'tiff'].includes(urlExtension)) { + extension = urlExtension; + } + } + + // Generate unique filename + const filename = `external_${Date.now()}_${Math.random().toString(36).substr(2, 9)}.${extension}`; + const localPath = join(outputDir, filename); + + // Try to download the URL + const response = await fetch(actualImageUrl, { + headers: { + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36' + } + }); + + if (!response.ok) { + throw new Error(`HTTP ${response.status}: ${response.statusText}`); + } + + const buffer = await response.buffer(); + + // Validate that we actually got data + if (buffer.length === 0) { + throw new Error('Empty response'); + } + + // Validate that it's actually an image, not HTML + const contentType = response.headers.get('content-type'); + if (contentType && contentType.includes('text/html')) { + throw new Error('Downloaded content is HTML, not an image'); + } + + // Save to local file + writeFileSync(localPath, 
buffer); + + console.log(` ✅ Downloaded: ${localPath} (${buffer.length} bytes)`); + return localPath; + + } catch (error) { + console.log(` ❌ Failed to download ${imageUrl}: ${error.message}`); + throw error; + } +} + +/** + * Process external images in content and download them + * @param {string} content - Markdown content + * @param {string} outputDir - Directory to save downloaded images + * @returns {Promise<string>} - Content with external images replaced by local paths + */ +async function processExternalImages(content, outputDir) { + console.log(' 🌐 Processing external images...'); + + let processedCount = 0; + let downloadedCount = 0; + + // Find all external image URLs in markdown format: ![alt](url) + const externalImageRegex = /!\[([^\]]*)\]\(([^)]+)\)/g; + let match; + const externalImages = new Map(); // url -> alt text + + // First pass: collect all external image URLs + while ((match = externalImageRegex.exec(content)) !== null) { + const alt = match[1]; + const url = match[2]; + + if (isExternalImageUrl(url)) { + externalImages.set(url, alt); + console.log(` 🔍 Found external image: ${url}`); + } + } + + if (externalImages.size === 0) { + console.log(' ℹ️ No external images found'); + return content; + } + + // Second pass: download images and replace URLs + let processedContent = content; + + for (const [url, alt] of externalImages) { + try { + // Download the image + const localPath = await downloadExternalImage(url, outputDir); + const relativePath = `./assets/image/${basename(localPath)}`; + + // Replace the URL in content + processedContent = processedContent.replace( + new RegExp(`!\\[${alt.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\]\\(${url.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\)`, 'g'), + `![${alt}](${relativePath})` + ); + + downloadedCount++; + processedCount++; + + } catch (error) { + console.log(` ⚠️ Skipping external image due to download failure: ${url}`); + } + } + + if (downloadedCount > 0) { + console.log(` ✅ Downloaded ${downloadedCount} 
external image(s)`); + } + + return processedContent; +} + +/** + * Detect and track Astro components used in the content + * @param {string} content - MDX content + */ +function detectAstroComponents(content) { + console.log(' 🔍 Detecting Astro components in content...'); + + let detectedCount = 0; + + // Known Astro components that should be auto-imported + const knownComponents = [ + 'HtmlEmbed', 'Image', 'Note', 'Sidenote', 'Wide', 'FullWidth', + 'Accordion', 'Quote', 'Reference', 'Glossary', 'Stack', 'ThemeToggle', + 'RawHtml', 'HfUser' + ]; + + // Find all JSX elements that look like Astro components + // Pattern: <ComponentName ...> or <ComponentName ... /> + const componentMatches = content.match(/<([A-Z][a-zA-Z0-9]*)\s*[^>]*\/?>/g); + + if (componentMatches) { + for (const match of componentMatches) { + // Extract component name from the JSX element + const componentMatch = match.match(/<([A-Z][a-zA-Z0-9]*)/); + if (componentMatch) { + const componentName = componentMatch[1]; + + // Only track known Astro components (skip HTML elements) + if (knownComponents.includes(componentName) && !usedComponents.has(componentName)) { + usedComponents.add(componentName); + detectedCount++; + console.log(` 📦 Found component: ${componentName}`); + } + } + } + } + + if (detectedCount > 0) { + console.log(` ✅ Detected ${detectedCount} new Astro component(s)`); + } else { + console.log(` ℹ️ No new Astro components detected`); + } +} + +/** + * Add required component imports to the frontmatter + * @param {string} content - MDX content + * @returns {string} - Content with component imports + */ +function addComponentImports(content) { + console.log(' 📦 Adding component and image imports...'); + + let imports = []; + + // Add component imports + if (usedComponents.size > 0) { + const componentImports = Array.from(usedComponents) + .map(component => `import ${component} from '../components/${component}.astro';`); + imports.push(...componentImports); + console.log(` ✅ Importing components: 
${Array.from(usedComponents).join(', ')}`); + } + + // Add image imports + if (imageImports.size > 0) { + const imageImportStatements = Array.from(imageImports.entries()) + .map(([src, varName]) => `import ${varName} from '${src}';`); + imports.push(...imageImportStatements); + console.log(` ✅ Importing ${imageImports.size} image(s)`); + } + + if (imports.length === 0) { + console.log(' ℹ️ No imports needed'); + return content; + } + + const importBlock = imports.join('\n'); + + // Insert imports after frontmatter + const frontmatterEnd = content.indexOf('---', 3) + 3; + if (frontmatterEnd > 2) { + return content.slice(0, frontmatterEnd) + '\n\n' + importBlock + '\n\n' + content.slice(frontmatterEnd); + } else { + // No frontmatter, add at beginning + return importBlock + '\n\n' + content; + } +} + + +/** + * Load static frontmatter from file + * @returns {object} - Static frontmatter data + */ +function loadStaticFrontmatter() { + try { + if (existsSync(STATIC_FRONTMATTER_PATH)) { + const staticContent = readFileSync(STATIC_FRONTMATTER_PATH, 'utf8'); + const { data } = matter(staticContent); + console.log(' ✅ Loaded static frontmatter from file'); + return data; + } + console.log(' ℹ️ No static frontmatter file found'); + return {}; + } catch (error) { + console.log(` ⚠️ Failed to load static frontmatter: ${error.message}`); + return {}; + } +} + +/** + * Ensure proper frontmatter for MDX using static file first, then existing data + * @param {string} content - MDX content + * @param {string} pageId - Notion page ID (optional, kept for compatibility but ignored) + * @param {string} notionToken - Notion API token (optional, kept for compatibility but ignored) + * @returns {string} - Content with proper frontmatter + */ +async function ensureFrontmatter(content, pageId = null, notionToken = null) { + console.log(' 📄 Ensuring proper frontmatter...'); + + // Load static frontmatter first (highest priority) + const staticData = loadStaticFrontmatter(); + + if 
(!content.startsWith('---')) { + // No frontmatter in content, use static + basic defaults + let baseData = { ...staticData }; + + // Add basic defaults for required fields if not in static + if (!baseData.title) baseData.title = 'Article'; + if (!baseData.published) { + baseData.published = new Date().toLocaleDateString('en-US', { + year: 'numeric', + month: 'short', + day: '2-digit' + }); + } + if (baseData.tableOfContentsAutoCollapse === undefined) { + baseData.tableOfContentsAutoCollapse = true; + } + + const frontmatter = matter.stringify('', baseData); + console.log(' ✅ Applied static frontmatter to content without frontmatter'); + return frontmatter + content; + } + + // Parse existing frontmatter and merge with static (static takes priority) + try { + const { data: existingData, content: body } = matter(content); + + // Merge: existing data first, then static data overrides + const mergedData = { ...existingData, ...staticData }; + + // Ensure required fields if still missing after merge + if (!mergedData.title) mergedData.title = 'Article'; + if (!mergedData.published) { + mergedData.published = new Date().toLocaleDateString('en-US', { + year: 'numeric', + month: 'short', + day: '2-digit' + }); + } + if (mergedData.tableOfContentsAutoCollapse === undefined) { + mergedData.tableOfContentsAutoCollapse = true; + } + + const enhancedContent = matter.stringify(body, mergedData); + console.log(' ✅ Merged static and existing frontmatter'); + return enhancedContent; + } catch (error) { + console.log(' ⚠️ Could not parse frontmatter, keeping as is'); + return content; + } +} + +/** + * Generate basic frontmatter + * @returns {string} - Basic frontmatter + */ +function generateBasicFrontmatter() { + const currentDate = new Date().toLocaleDateString('en-US', { + year: 'numeric', + month: 'short', + day: '2-digit' + }); + return `--- +title: "Notion Article" +published: "${currentDate}" +tableOfContentsAutoCollapse: true +--- + +`; +} + + +/** + * Check if a line is a 
table line + * @param {string} line - Line to check + * @returns {boolean} - True if it's a table line + */ +function isTableLine(line) { + const trimmed = line.trim(); + return trimmed.startsWith('|') && trimmed.endsWith('|'); +} + +/** + * Check if a line is a list item + * @param {string} line - Line to check + * @returns {boolean} - True if it's a list item + */ +function isListItem(line) { + const trimmed = line.trim(); + // Match: * -, + (bullet points) or 1. 2. 3. (numbered lists) + return /^\s*[\*\-\+]\s/.test(trimmed) || /^\s*\d+\.\s/.test(trimmed); +} + +/** + * Add a blank line after each markdown table and list + * @param {string} content - MDX content + * @returns {string} - Content with blank lines after tables and lists + */ +function addBlankLineAfterTablesAndLists(content) { + console.log(' 📋 Adding blank lines after tables and lists...'); + + let addedTableCount = 0; + let addedListCount = 0; + const lines = content.split('\n'); + const result = []; + + for (let i = 0; i < lines.length; i++) { + result.push(lines[i]); + + // Check if current line is the end of a table + if (isTableLine(lines[i])) { + // Look ahead to see if this is the last line of a table + let isLastTableLine = false; + + // Check if next line is empty or doesn't start with | + if (i + 1 >= lines.length || + lines[i + 1].trim() === '' || + !isTableLine(lines[i + 1])) { + + // Look back to find if we're actually inside a table + let tableLineCount = 0; + for (let j = i; j >= 0 && isTableLine(lines[j]); j--) { + tableLineCount++; + } + + // Only add blank line if we found at least 2 table lines (making it a real table) + if (tableLineCount >= 2) { + isLastTableLine = true; + } + } + + if (isLastTableLine) { + addedTableCount++; + result.push(''); // Add blank line + } + } + // Check if current line is the end of a list + else if (isListItem(lines[i])) { + // Look ahead to see if this is the last line of a list + let isLastListItem = false; + + // Check if next line is empty or 
doesn't start with list marker + if (i + 1 >= lines.length || + lines[i + 1].trim() === '' || + !isListItem(lines[i + 1])) { + isLastListItem = true; + } + + if (isLastListItem) { + addedListCount++; + result.push(''); // Add blank line + } + } + } + + if (addedTableCount > 0 || addedListCount > 0) { + console.log(` ✅ Added blank line after ${addedTableCount} table(s) and ${addedListCount} list(s)`); + } else { + console.log(' ℹ️ No tables or lists found to process'); + } + + return result.join('\n'); +} + +/** + * Transform markdown images to Image components + * @param {string} content - Markdown content + * @returns {string} - Content with Image components + */ +function transformMarkdownImages(content) { + console.log(' 🖼️ Transforming markdown images to Image components...'); + + let transformedCount = 0; + + // Transform markdown images: ![alt](src) -> <Image src={...} alt="..." /> + content = content.replace(/!\[([^\]]*)\]\(([^)]+)\)/g, (match, alt, src) => { + transformedCount++; + + // Clean up the src path - remove /media/ prefix and use relative path + let cleanSrc = src; + if (src.startsWith('/media/')) { + cleanSrc = src.replace('/media/', './assets/image/'); + } + + // Generate variable name for the image import + const varName = generateImageVarName(cleanSrc); + + // Add to imageImports if not already present + if (!imageImports.has(cleanSrc)) { + imageImports.set(cleanSrc, varName); + } + + // Extract filename for alt text if none provided + const finalAlt = alt || src.split('/').pop().split('.')[0]; + + return `<Image src={${varName}} alt="${finalAlt}" />`; + }); + + if (transformedCount > 0) { + console.log(` ✅ Transformed ${transformedCount} markdown image(s) to Image components with imports`); + } else { + console.log(' ℹ️ No markdown images found to transform'); + } + + return content; +} + +/** + * Add proper spacing around Astro components + * @param {string} content - MDX content + * @returns {string} - Content with proper spacing around components + */ +function addSpacingAroundComponents(content) { + 
console.log(' 📏 Adding spacing around Astro components...'); + + let processedContent = content; + let spacingCount = 0; + + // Known Astro components that should have spacing + const knownComponents = [ + 'HtmlEmbed', 'Image', 'Note', 'Sidenote', 'Wide', 'FullWidth', + 'Accordion', 'Quote', 'Reference', 'Glossary', 'Stack', 'ThemeToggle', + 'RawHtml', 'HfUser', 'Figure' + ]; + + // Process each component type + for (const component of knownComponents) { + // Pattern for components with content: ... + // Process this first to handle the complete component structure + const withContentPattern = new RegExp(`(<${component}[^>]*>)([\\s\\S]*?)(<\\/${component}>)`, 'g'); + processedContent = processedContent.replace(withContentPattern, (match, openTag, content, closeTag) => { + spacingCount++; + // Ensure blank line before opening tag and after closing tag + // Also ensure closing tag is on its own line + const trimmedContent = content.trim(); + return `\n\n${openTag}\n${trimmedContent}\n${closeTag}\n\n`; + }); + + // Pattern for self-closing components: + const selfClosingPattern = new RegExp(`(<${component}[^>]*\\/?>)`, 'g'); + processedContent = processedContent.replace(selfClosingPattern, (match) => { + spacingCount++; + return `\n\n${match}\n\n`; + }); + } + + // Clean up excessive newlines (more than 2 consecutive) + processedContent = processedContent.replace(/\n{3,}/g, '\n\n'); + + if (spacingCount > 0) { + console.log(` ✅ Added spacing around ${spacingCount} component(s)`); + } else { + console.log(' ℹ️ No components found to add spacing around'); + } + + return processedContent; +} + +/** + * Fix smart quotes (curly quotes) and replace them with straight quotes + * @param {string} content - Markdown content + * @returns {string} - Content with fixed quotes + */ +function fixSmartQuotes(content) { + console.log(' ✏️ Fixing smart quotes (curly quotes)...'); + + let fixedCount = 0; + const originalContent = content; + + // Replace opening smart double quotes 
(\u201C) with straight quotes (") + content = content.replace(/\u201C/g, '"'); + + // Replace closing smart double quotes (\u201D) with straight quotes (") + content = content.replace(/\u201D/g, '"'); + + // Replace opening smart single quotes (\u2018) with straight quotes (') + content = content.replace(/\u2018/g, "'"); + + // Replace closing smart single quotes (\u2019) with straight quotes (') + content = content.replace(/\u2019/g, "'"); + + // Count the number of replacements made + fixedCount = 0; + for (let i = 0; i < originalContent.length; i++) { + const char = originalContent[i]; + if (char === '\u201C' || char === '\u201D' || char === '\u2018' || char === '\u2019') { + fixedCount++; + } + } + + if (fixedCount > 0) { + console.log(` ✅ Fixed ${fixedCount} smart quote(s)`); + } else { + console.log(' ℹ️ No smart quotes found'); + } + + return content; +} + +/** + * Main MDX processing function that applies all transformations + * @param {string} content - Raw Markdown content + * @param {string} pageId - Notion page ID (optional) + * @param {string} notionToken - Notion API token (optional) + * @param {string} outputDir - Output directory for downloaded images (optional) + * @returns {string} - Processed MDX content compatible with Astro + */ +async function processMdxContent(content, pageId = null, notionToken = null, outputDir = null) { + console.log('🔧 Processing for Astro MDX compatibility...'); + + // Clear previous tracking + usedComponents.clear(); + imageImports.clear(); + externalImagesToDownload.clear(); + + let processedContent = content; + + // Fix smart quotes first + processedContent = fixSmartQuotes(processedContent); + + // Process external images first (before other transformations) + if (outputDir) { + // Create a temporary external images directory in the output folder + const externalImagesDir = join(outputDir, 'external-images'); + processedContent = await processExternalImages(processedContent, externalImagesDir); + } + + // Apply 
essential steps only + processedContent = await ensureFrontmatter(processedContent, pageId, notionToken); + + // Add blank lines after tables and lists + processedContent = addBlankLineAfterTablesAndLists(processedContent); + + // Transform markdown images to Image components + processedContent = transformMarkdownImages(processedContent); + + // Add spacing around Astro components + processedContent = addSpacingAroundComponents(processedContent); + + // Detect Astro components used in the content before adding imports + detectAstroComponents(processedContent); + + // Add component imports at the end + processedContent = addComponentImports(processedContent); + + return processedContent; +} + +/** + * Convert a single markdown file to MDX + * @param {string} inputFile - Input markdown file + * @param {string} outputDir - Output directory + * @param {string} pageId - Notion page ID (optional) + * @param {string} notionToken - Notion API token (optional) + */ +async function convertFileToMdx(inputFile, outputDir, pageId = null, notionToken = null) { + const filename = basename(inputFile, '.md'); + const outputFile = join(outputDir, `${filename}.mdx`); + + console.log(`📝 Converting: ${basename(inputFile)} → ${basename(outputFile)}`); + + try { + const markdownContent = readFileSync(inputFile, 'utf8'); + const mdxContent = await processMdxContent(markdownContent, pageId, notionToken, outputDir); + writeFileSync(outputFile, mdxContent); + + console.log(` ✅ Converted: ${outputFile}`); + + // Show file size + const inputSize = Math.round(markdownContent.length / 1024); + const outputSize = Math.round(mdxContent.length / 1024); + console.log(` 📊 Input: ${inputSize}KB → Output: ${outputSize}KB`); + + } catch (error) { + console.error(` ❌ Failed to convert ${inputFile}: ${error.message}`); + } +} + +/** + * Convert all markdown files in a directory to MDX + * @param {string} inputPath - Input path (file or directory) + * @param {string} outputDir - Output directory + * @param 
{string} pageId - Notion page ID (optional) + * @param {string} notionToken - Notion API token (optional) + */ +async function convertToMdx(inputPath, outputDir, pageId = null, notionToken = null) { + console.log('📝 Notion Markdown to Astro MDX Converter'); + console.log(`📁 Input: ${inputPath}`); + console.log(`📁 Output: ${outputDir}`); + + // Check if input exists + if (!existsSync(inputPath)) { + console.error(`❌ Input not found: ${inputPath}`); + process.exit(1); + } + + try { + // Ensure output directory exists + if (!existsSync(outputDir)) { + mkdirSync(outputDir, { recursive: true }); + } + + let filesToConvert = []; + + if (statSync(inputPath).isDirectory()) { + // Convert all .md files in directory + const files = readdirSync(inputPath); + filesToConvert = files + .filter(file => file.endsWith('.md') && !file.includes('.raw.md')) + .map(file => join(inputPath, file)); + } else if (inputPath.endsWith('.md')) { + // Convert single file + filesToConvert = [inputPath]; + } else { + console.error('❌ Input must be a .md file or directory containing .md files'); + process.exit(1); + } + + if (filesToConvert.length === 0) { + console.log('ℹ️ No .md files found to convert'); + return; + } + + console.log(`🔄 Found ${filesToConvert.length} file(s) to convert`); + + // Convert each file + for (const file of filesToConvert) { + await convertFileToMdx(file, outputDir, pageId, notionToken); + } + + console.log(`✅ Conversion completed! 
${filesToConvert.length} file(s) processed`); + + } catch (error) { + console.error('❌ Conversion failed:', error.message); + process.exit(1); + } +} + +export { convertToMdx }; + +async function main() { + const config = parseArgs(); + // Await the async conversion so the completion message is accurate + await convertToMdx(config.input, config.output); + console.log('🎉 MDX conversion completed!'); +} + +if (import.meta.url === `file://${process.argv[1]}`) { + main(); +} diff --git a/app/scripts/notion-importer/notion-converter.mjs b/app/scripts/notion-importer/notion-converter.mjs new file mode 100644 index 0000000000000000000000000000000000000000..a8324152b9cfe825c8a14f811af3c958643b5e36 --- /dev/null +++ b/app/scripts/notion-importer/notion-converter.mjs @@ -0,0 +1,266 @@ +#!/usr/bin/env node + +import { config } from 'dotenv'; +import { Client } from '@notionhq/client'; +import { NotionConverter } from 'notion-to-md'; +import { DefaultExporter } from 'notion-to-md/plugins/exporter'; +import { readFileSync, writeFileSync, existsSync, mkdirSync } from 'fs'; +import { join, dirname, basename } from 'path'; +import { fileURLToPath } from 'url'; +import { postProcessMarkdown } from './post-processor.mjs'; + +// Load environment variables from .env file (but don't override existing ones) +config({ override: false }); + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); + +// Configuration +const DEFAULT_INPUT = join(__dirname, 'input', 'pages.json'); +const DEFAULT_OUTPUT = join(__dirname, 'output'); + +function parseArgs() { + const args = process.argv.slice(2); + const config = { + input: DEFAULT_INPUT, + output: DEFAULT_OUTPUT, + clean: false, + token: process.env.NOTION_TOKEN + }; + + for (const arg of args) { + if (arg.startsWith('--input=')) { + config.input = arg.split('=')[1]; + } else if (arg.startsWith('--output=')) { + config.output = arg.split('=')[1]; + } else if (arg.startsWith('--token=')) { + config.token = arg.split('=')[1]; + } else if (arg === '--clean') { + config.clean = true; + } + } + 
+ return config; +} + +function ensureDirectory(dir) { + if (!existsSync(dir)) { + mkdirSync(dir, { recursive: true }); + } +} + +function loadPagesConfig(configFile) { + if (!existsSync(configFile)) { + console.error(`❌ Configuration file not found: ${configFile}`); + console.log('📝 Create a pages.json file with your Notion page IDs:'); + console.log(` +{ + "pages": [ + { + "id": "your-notion-page-id-1", + "title": "Page Title 1", + "slug": "page-1" + }, + { + "id": "your-notion-page-id-2", + "title": "Page Title 2", + "slug": "page-2" + } + ] +} + `); + process.exit(1); + } + + try { + const config = JSON.parse(readFileSync(configFile, 'utf8')); + return config.pages || []; + } catch (error) { + console.error(`❌ Error reading configuration: ${error.message}`); + process.exit(1); + } +} + +/** + * Convert a single Notion page to Markdown with advanced media handling + * @param {Object} notion - Notion client + * @param {string} pageId - Notion page ID + * @param {string} outputDir - Output directory + * @param {string} pageTitle - Page title for file naming + * @returns {Promise<string>} - Path to generated markdown file + */ +async function convertNotionPage(notion, pageId, outputDir, pageTitle) { + console.log(`📄 Converting Notion page: ${pageTitle} (${pageId})`); + + try { + // Create media directory for this page + const mediaDir = join(outputDir, 'media', pageId); + ensureDirectory(mediaDir); + + // Configure the DefaultExporter to save to a file + const outputFile = join(outputDir, `${pageTitle}.md`); + const exporter = new DefaultExporter({ + outputType: 'file', + outputPath: outputFile, + }); + + // Create the converter with media downloading strategy + const n2m = new NotionConverter(notion) + .withExporter(exporter) + // Download media to local directory with path transformation + .downloadMediaTo({ + outputDir: mediaDir, + // Transform paths to be web-accessible + transformPath: (localPath) => `/media/${pageId}/${basename(localPath)}`, + }); + + // Convert the 
page + const result = await n2m.convert(pageId); + + console.log(` ✅ Converted to: ${outputFile}`); + console.log(` 📊 Content length: ${result.content.length} characters`); + console.log(` 🖼️ Media saved to: ${mediaDir}`); + + return outputFile; + + } catch (error) { + console.error(` ❌ Failed to convert page ${pageId}: ${error.message}`); + throw error; + } +} + +/** + * Process Notion pages with advanced configuration + * @param {string} inputFile - Path to pages configuration + * @param {string} outputDir - Output directory + * @param {string} notionToken - Notion API token + */ +export async function convertNotionToMarkdown(inputFile, outputDir, notionToken) { + console.log('🚀 Notion to Markdown Converter'); + console.log(`📁 Input: ${inputFile}`); + console.log(`📁 Output: ${outputDir}`); + + // Validate Notion token + if (!notionToken) { + console.error('❌ NOTION_TOKEN not found. Please set it as environment variable or use --token=YOUR_TOKEN'); + process.exit(1); + } + + // Ensure output directory exists + ensureDirectory(outputDir); + + try { + // Initialize Notion client + const notion = new Client({ + auth: notionToken, + }); + + // Load pages configuration + const pages = loadPagesConfig(inputFile); + console.log(`📋 Found ${pages.length} page(s) to convert`); + + const convertedFiles = []; + + // Convert each page + for (const page of pages) { + try { + const outputFile = await convertNotionPage( + notion, + page.id, + outputDir, + page.slug || page.title?.toLowerCase().replace(/\s+/g, '-') || page.id + ); + convertedFiles.push(outputFile); + } catch (error) { + console.error(`❌ Failed to convert page ${page.id}: ${error.message}`); + // Continue with other pages + } + } + + // Post-process all converted files and create one intermediate file + console.log('🔧 Post-processing converted files...'); + for (const file of convertedFiles) { + try { + // Read the raw markdown from notion-to-md + let rawContent = readFileSync(file, 'utf8'); + + // Create 
intermediate file: raw markdown (from notion-to-md)
+        const rawFile = file.replace('.md', '.raw.md');
+        writeFileSync(rawFile, rawContent);
+        console.log(`   📄 Created raw markdown: ${basename(rawFile)}`);
+
+        // Apply post-processing with Notion client for page inclusion
+        let processedContent = await postProcessMarkdown(rawContent, notion, notionToken);
+        writeFileSync(file, processedContent);
+        console.log(`   ✅ Post-processed: ${basename(file)}`);
+      } catch (error) {
+        console.error(`   ❌ Failed to post-process ${file}: ${error.message}`);
+      }
+    }
+
+    console.log(`✅ Conversion completed! ${convertedFiles.length} file(s) generated`);
+
+  } catch (error) {
+    console.error('❌ Conversion failed:', error.message);
+    process.exit(1);
+  }
+}
+
+async function main() {
+  const config = parseArgs();
+
+  if (config.clean) {
+    console.log('🧹 Cleaning output directory...');
+    // Clean output directory logic would go here
+  }
+
+  // convertNotionToMarkdown is async: await it so the completion message
+  // only prints once the conversion has actually finished
+  await convertNotionToMarkdown(config.input, config.output, config.token);
+  console.log('🎉 Notion conversion completed!');
+}
+
+// Show help if requested
+if (process.argv.includes('--help') || process.argv.includes('-h')) {
+  console.log(`
+🚀 Notion to Markdown Converter
+
+Usage:
+  node notion-converter.mjs [options]
+
+Options:
+  --input=PATH     Input pages configuration file (default: input/pages.json)
+  --output=PATH    Output directory (default: output/)
+  --token=TOKEN    Notion API token (or set NOTION_TOKEN env var)
+  --clean          Clean output directory before conversion
+  --help, -h       Show this help
+
+Environment Variables:
+  NOTION_TOKEN     Your Notion integration token
+
+Examples:
+  # Basic conversion with environment token
+  NOTION_TOKEN=your_token node notion-converter.mjs
+
+  # Custom paths and token
+  node notion-converter.mjs --input=my-pages.json --output=converted/ --token=your_token
+
+  # Clean output first
+  node notion-converter.mjs --clean
+
+Configuration File Format (pages.json):
+{
+  "pages": [
+    {
+      "id": "your-notion-page-id",
+      "title": "Page Title",
"slug": "page-slug" + } + ] +} +`); + process.exit(0); +} + +// Run CLI if called directly +if (import.meta.url === `file://${process.argv[1]}`) { + main(); +} diff --git a/app/scripts/notion-importer/package-lock.json b/app/scripts/notion-importer/package-lock.json new file mode 100644 index 0000000000000000000000000000000000000000..690fd5728fff128e19c8881d41e2160ad6ab6efb Binary files /dev/null and b/app/scripts/notion-importer/package-lock.json differ diff --git a/app/scripts/notion-importer/package.json b/app/scripts/notion-importer/package.json new file mode 100644 index 0000000000000000000000000000000000000000..967cf990839e7eee04d275e5a79963e2582678aa Binary files /dev/null and b/app/scripts/notion-importer/package.json differ diff --git a/app/scripts/notion-importer/post-processor.mjs b/app/scripts/notion-importer/post-processor.mjs new file mode 100644 index 0000000000000000000000000000000000000000..e2810a1ef85cf065382c4be97f27de03952ab074 --- /dev/null +++ b/app/scripts/notion-importer/post-processor.mjs @@ -0,0 +1,837 @@ +#!/usr/bin/env node + +import { readFileSync, writeFileSync, existsSync, mkdirSync, unlinkSync } from 'fs'; +import { join, dirname, basename } from 'path'; +import { fileURLToPath } from 'url'; +import { Client } from '@notionhq/client'; +import { NotionConverter } from 'notion-to-md'; +import { DefaultExporter } from 'notion-to-md/plugins/exporter'; + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); + +/** + * Ensure directory exists + */ +function ensureDirectory(dir) { + if (!existsSync(dir)) { + mkdirSync(dir, { recursive: true }); + } +} + +/** + * Post-process Notion-generated Markdown for better MDX compatibility + * @param {string} content - Raw markdown content from Notion + * @param {Client} notionClient - Notion API client (optional) + * @param {string} notionToken - Notion API token (optional) + * @returns {Promise} - Processed markdown content + */ +export async function 
postProcessMarkdown(content, notionClient = null, notionToken = null) {
+  console.log('🔧 Post-processing Notion Markdown for MDX compatibility...');
+
+  let processedContent = content;
+
+  // Apply each transformation step
+  processedContent = removeExcludeTags(processedContent);
+  processedContent = await includeNotionPages(processedContent, notionClient, notionToken);
+  processedContent = cleanNotionArtifacts(processedContent);
+  processedContent = fixImageAltTextWithLinks(processedContent);
+  processedContent = fixNotionLinks(processedContent);
+  processedContent = fixJsxAttributes(processedContent);
+  processedContent = optimizeImages(processedContent);
+  processedContent = shiftHeadingLevels(processedContent);
+  processedContent = cleanEmptyLines(processedContent);
+  processedContent = fixCodeBlocks(processedContent);
+  processedContent = fixCodeBlockEndings(processedContent);
+  processedContent = unwrapHtmlCodeBlocks(processedContent);
+  processedContent = fixPlainTextCodeBlocks(processedContent);
+  processedContent = optimizeTables(processedContent);
+
+  return processedContent;
+}
+
+/**
+ * Remove <exclude> tags and their content, plus associated media files
+ * @param {string} content - Markdown content
+ * @returns {string} - Content with exclude tags removed and unused imports cleaned
+ */
+function removeExcludeTags(content) {
+  console.log('  🗑️ Removing <exclude> tags and associated media...');
+
+  let removedCount = 0;
+  const removedImageVariables = new Set();
+  const mediaFilesToDelete = new Set();
+
+  // First, extract image variable names and media files from exclude blocks before removing them
+  const excludeBlocks = content.match(/<exclude>[\s\S]*?<\/exclude>/g) || [];
+  excludeBlocks.forEach(match => {
+    // Extract image variables from JSX components
+    const imageMatches = match.match(/src=\{([^}]+)\}/g);
+    if (imageMatches) {
+      imageMatches.forEach(imgMatch => {
+        const varName = imgMatch.match(/src=\{([^}]+)\}/)?.[1];
+        if (varName) {
removedImageVariables.add(varName);
+        }
+      });
+    }
+
+    // Extract media file paths from markdown images
+    const markdownImages = match.match(/!\[[^\]]*\]\(([^)]+)\)/g);
+    if (markdownImages) {
+      markdownImages.forEach(imgMatch => {
+        const src = imgMatch.match(/!\[[^\]]*\]\(([^)]+)\)/)?.[1];
+        if (src) {
+          // Extract filename from path like /media/pageId/filename.png
+          const filename = basename(src);
+          if (filename) {
+            mediaFilesToDelete.add(filename);
+          }
+        }
+      });
+    }
+  });
+
+  // Remove <exclude> tags and everything between them (including multiline)
+  content = content.replace(/<exclude>[\s\S]*?<\/exclude>/g, (match) => {
+    removedCount++;
+    return '';
+  });
+
+  // Delete associated media files
+  if (mediaFilesToDelete.size > 0) {
+    console.log(`  🗑️ Found ${mediaFilesToDelete.size} media file(s) to delete from exclude blocks`);
+
+    // Try to find and delete media files in common locations
+    const possibleMediaDirs = [
+      join(__dirname, 'output', 'media'),
+      join(__dirname, '..', '..', 'src', 'content', 'assets', 'image')
+    ];
+
+    mediaFilesToDelete.forEach(filename => {
+      let deleted = false;
+      for (const mediaDir of possibleMediaDirs) {
+        if (existsSync(mediaDir)) {
+          const filePath = join(mediaDir, filename);
+          if (existsSync(filePath)) {
+            try {
+              unlinkSync(filePath);
+              console.log(`  🗑️ Deleted media file: ${filename}`);
+              deleted = true;
+              break;
+            } catch (error) {
+              console.log(`  ⚠️ Failed to delete ${filename}: ${error.message}`);
+            }
+          }
+        }
+      }
+      if (!deleted) {
+        console.log(`  ℹ️ Media file not found: ${filename}`);
+      }
+    });
+  }
+
+  // Remove unused image imports that were only used in exclude blocks
+  if (removedImageVariables.size > 0) {
+    console.log(`  🖼️ Found ${removedImageVariables.size} unused image import(s) in exclude blocks`);
+
+    removedImageVariables.forEach(varName => {
+      // Check if the variable is still used elsewhere in the content after removing exclude blocks
+      const remainingUsage = content.includes(`{${varName}}`) ||
content.includes(`src={${varName}}`);
+
+      if (!remainingUsage) {
+        // Remove import lines for unused image variables
+        // Pattern: import VarName from './assets/image/filename';
+        const importPattern = new RegExp(`import\\s+${varName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\s+from\\s+['"][^'"]+['"];?\\s*`, 'g');
+        content = content.replace(importPattern, '');
+        console.log(`  🗑️ Removed unused import: ${varName}`);
+      }
+    });
+
+    console.log(`  🧹 Cleaned up unused image imports`);
+  }
+
+  if (removedCount > 0) {
+    console.log(`  ✅ Removed ${removedCount} <exclude> tag(s) and their content`);
+  } else {
+    console.log('  ℹ️ No <exclude> tags found');
+  }
+
+  return content;
+}
+
+/**
+ * Replace Notion page links with their actual content
+ * @param {string} content - Markdown content
+ * @param {Client} notionClient - Notion API client
+ * @param {string} notionToken - Notion API token
+ * @returns {Promise<string>} - Content with page links replaced
+ */
+async function includeNotionPages(content, notionClient, notionToken) {
+  console.log('  📄 Including linked Notion pages...');
+
+  if (!notionClient || !notionToken) {
+    console.log('  ℹ️ Skipping page inclusion (no Notion client/token provided)');
+    return content;
+  }
+
+  let includedCount = 0;
+  let skippedCount = 0;
+
+  // First, identify all exclude blocks to avoid processing links within them
+  const excludeBlocks = [];
+  const excludeRegex = /<exclude>[\s\S]*?<\/exclude>/g;
+  let excludeMatch;
+
+  while ((excludeMatch = excludeRegex.exec(content)) !== null) {
+    excludeBlocks.push({
+      start: excludeMatch.index,
+      end: excludeMatch.index + excludeMatch[0].length
+    });
+  }
+
+  // Helper function to check if a position is within an exclude block
+  const isWithinExcludeBlock = (position) => {
+    return excludeBlocks.some(block => position >= block.start && position <= block.end);
+  };
+
+  // Regex to match links to Notion pages with UUID format
+  // Pattern: [text](uuid-with-dashes)
+  const notionPageLinkRegex =
/\[([^\]]+)\]\(([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\)/g; + + let processedContent = content; + let match; + + // Find all matches + const matches = []; + while ((match = notionPageLinkRegex.exec(content)) !== null) { + const linkStartPos = match.index; + + // Skip if this link is within an exclude block + if (isWithinExcludeBlock(linkStartPos)) { + console.log(` ⏭️ Skipping page link in exclude block: ${match[1]} (${match[2]})`); + skippedCount++; + continue; + } + + matches.push({ + fullMatch: match[0], + linkText: match[1], + pageId: match[2], + startPos: match.index, + endPos: match.index + match[0].length + }); + } + + // Process matches in reverse order to maintain correct indices + for (let i = matches.length - 1; i >= 0; i--) { + const link = matches[i]; + + try { + console.log(` 🔗 Fetching content for page: ${link.pageId}`); + + // Create media directory for this sub-page + const outputDir = join(__dirname, 'output'); + const mediaDir = join(outputDir, 'media', link.pageId); + ensureDirectory(mediaDir); + + // Configure the DefaultExporter to get content as string + const exporter = new DefaultExporter({ + outputType: 'string', + }); + + // Create the converter with media downloading strategy (same as convertNotionPage) + const converter = new NotionConverter(notionClient) + .withExporter(exporter) + // Download media to local directory with path transformation + .downloadMediaTo({ + outputDir: mediaDir, + // Transform paths to be web-accessible + transformPath: (localPath) => `/media/${link.pageId}/${basename(localPath)}`, + }); + + // Convert the page + const result = await converter.convert(link.pageId); + + console.log(` 🖼️ Media saved to: ${mediaDir}`); + + if (result && result.content) { + // Save raw content as .raw.md file + const rawFileName = `${link.linkText.toLowerCase().replace(/[^a-z0-9]+/g, '-')}-${link.pageId}`; + const rawFilePath = join(outputDir, `${rawFileName}.raw.md`); + + try { + writeFileSync(rawFilePath, 
result.content); + console.log(` 📄 Saved raw markdown: ${rawFileName}.raw.md`); + } catch (error) { + console.log(` ⚠️ Failed to save raw file: ${error.message}`); + } + + // Clean the content (remove frontmatter, etc.) + let pageContent = result.content; + + // Remove YAML frontmatter if present + pageContent = pageContent.replace(/^---[\s\S]*?---\s*\n/, ''); + + // Remove the first markdown heading (H1, H2, H3, etc.) from the included page + pageContent = pageContent.replace(/^#+ .+\n\n?/, ''); + + // Keep the page content without title + const finalContent = '\n\n' + pageContent.trim() + '\n\n'; + + // Replace the link with the content + processedContent = processedContent.substring(0, link.startPos) + + finalContent + + processedContent.substring(link.endPos); + + includedCount++; + console.log(` ✅ Included page content: ${link.linkText}`); + } else { + console.log(` ⚠️ No content found for page: ${link.pageId}`); + } + } catch (error) { + console.log(` ❌ Failed to fetch page ${link.pageId}: ${error.message}`); + // Keep the original link if we can't fetch the content + } + } + + if (includedCount > 0) { + console.log(` ✅ Included ${includedCount} Notion page(s)`); + } else { + console.log(' ℹ️ No Notion page links found to include'); + } + + if (skippedCount > 0) { + console.log(` ⏭️ Skipped ${skippedCount} page link(s) in exclude blocks`); + } + + return processedContent; +} + +/** + * Clean Notion-specific artifacts and formatting + * @param {string} content - Markdown content + * @returns {string} - Cleaned content + */ +function cleanNotionArtifacts(content) { + console.log(' 🧹 Cleaning Notion artifacts...'); + + let cleanedCount = 0; + + // Remove Notion's internal page references that don't convert well + content = content.replace(/\[([^\]]+)\]\(https:\/\/www\.notion\.so\/[^)]+\)/g, (match, text) => { + cleanedCount++; + return text; // Keep just the text, remove the broken link + }); + + // Clean up Notion's callout blocks that might not render properly 
+ content = content.replace(/^> \*\*([^*]+)\*\*\s*\n/gm, '> **$1**\n\n'); + + // Remove Notion's page dividers that don't have markdown equivalents + content = content.replace(/^---+\s*$/gm, ''); + + // Clean up empty blockquotes + content = content.replace(/^>\s*$/gm, ''); + + // Fix corrupted bold/italic formatting from notion-to-md conversion + // Pattern: ***text*** **** -> ***text*** + content = content.replace(/\*\*\*([^*]+)\*\*\*\s+\*\*\*\*/g, (match, text) => { + cleanedCount++; + return `***${text.trim()}***`; + }); + + // Fix other corrupted asterisk patterns + // Pattern: **text** ** -> **text** + content = content.replace(/\*\*([^*]+)\*\*\s+\*\*/g, (match, text) => { + cleanedCount++; + return `**${text.trim()}**`; + }); + + if (cleanedCount > 0) { + console.log(` ✅ Cleaned ${cleanedCount} Notion artifact(s)`); + } + + return content; +} + +/** + * Fix image alt text that contains markdown links + * notion-to-md v4 sometimes generates: ![alt with [link](url)](image_path) + * This breaks MDX parsing. Clean it to: ![alt with @mention](image_path) + * @param {string} content - Markdown content + * @returns {string} - Content with fixed image alt text + */ +function fixImageAltTextWithLinks(content) { + console.log(' 🖼️ Fixing image alt text with embedded links...'); + + let fixedCount = 0; + + // Pattern: ![text [link](url) more_text](image_path) + // This regex finds images where the alt text contains markdown links + const imageWithLinksPattern = /!\[([^\]]*\[[^\]]+\]\([^)]+\)[^\]]*)\]\(([^)]+)\)/g; + + content = content.replace(imageWithLinksPattern, (match, altText, imagePath) => { + fixedCount++; + + // Remove all markdown links from alt text: [text](url) -> text + const cleanedAlt = altText.replace(/\[([^\]]+)\]\([^)]+\)/g, '$1'); + + // Also clean up any remaining brackets + const finalAlt = cleanedAlt.replace(/[\[\]]/g, ''); + + console.log(` 🔧 Fixed: "${altText.substring(0, 50)}..." 
-> "${finalAlt.substring(0, 50)}..."`);
+
+    return `![${finalAlt}](${imagePath})`;
+  });
+
+  if (fixedCount > 0) {
+    console.log(`  ✅ Fixed ${fixedCount} image(s) with embedded links in alt text`);
+  } else {
+    console.log('  ℹ️ No images with embedded links found');
+  }
+
+  return content;
+}
+
+/**
+ * Fix Notion internal links to be more MDX-friendly
+ * @param {string} content - Markdown content
+ * @returns {string} - Content with fixed links
+ */
+function fixNotionLinks(content) {
+  console.log('  🔗 Fixing Notion internal links...');
+
+  let fixedCount = 0;
+
+  // Convert Notion page links to relative links (assuming they'll be converted to MDX)
+  content = content.replace(/\[([^\]]+)\]\(https:\/\/www\.notion\.so\/[^/]+\/([^?#)]+)\)/g, (match, text, pageId) => {
+    fixedCount++;
+    // Convert to relative link - this will need to be updated based on your routing
+    return `[${text}](#${pageId})`;
+  });
+
+  // Fix broken notion.so links that might be malformed
+  content = content.replace(/\[([^\]]+)\]\(https:\/\/www\.notion\.so\/[^)]*\)/g, (match, text) => {
+    fixedCount++;
+    return text; // Remove broken links, keep text
+  });
+
+  if (fixedCount > 0) {
+    console.log(`  ✅ Fixed ${fixedCount} Notion link(s)`);
+  }
+
+  return content;
+}
+
+/**
+ * Fix JSX attributes that were corrupted during Notion conversion
+ * @param {string} content - Markdown content
+ * @returns {string} - Content with fixed JSX attributes
+ */
+function fixJsxAttributes(content) {
+  console.log('  🔧 Fixing JSX attributes corrupted by Notion conversion...');
+
+  let fixedCount = 0;
+
+  // Fix the specific issue: asterisks injected around JSX attribute names
+  // Pattern: <Tag * attr * ="value" /> -> <Tag attr="value" />
+  content = content.replace(/<(\w+)\s+\*\s*([^*\s]+)\s*\*\s*=\s*"([^"]*)"\s*\/?>/g, (match, tagName, attribute, value) => {
+    fixedCount++;
+    return `<${tagName} ${attribute}="${value}" />`;
+  });
+
+  // Pattern: <Tag * attr * =value /> -> <Tag attr=value />
+  content = content.replace(/<(\w+)\s+\*\s*([^*\s]+)\s*\*\s*=\s*([^>\s\/]+)\s*\/?>/g, (match, tagName, attribute, value) => {
+    fixedCount++;
+    return `<${tagName} 
${attribute}=${value} />`;
+  });
+
+  // Handle cases with **double asterisks** around attribute names
+  content = content.replace(/<(\w+)\s+\*\*\s*([^*\s]+)\s*\*\*\s*=\s*"([^"]*)"\s*\/?>/g, (match, tagName, attribute, value) => {
+    fixedCount++;
+    return `<${tagName} ${attribute}="${value}" />`;
+  });
+
+  content = content.replace(/<(\w+)\s+\*\*\s*([^*\s]+)\s*\*\*\s*=\s*([^>\s\/]+)\s*\/?>/g, (match, tagName, attribute, value) => {
+    fixedCount++;
+    return `<${tagName} ${attribute}=${value} />`;
+  });
+
+  // Fix HTML tags (like iframe, video, etc.) where URLs were corrupted by markdown conversion
+  // Pattern: src="[url](url)" -> src="url"
+  // Handle both regular quotes and various smart quote characters
+  // Handle attributes before and after src
+
+  // Handle iframe tags with separate opening and closing tags FIRST:
+  content = content.replace(/<iframe([^>]*?)\ssrc=["'\u201C\u201D\u2018\u2019]\[([^\]]+)\]\([^)]+\)["'\u201C\u201D\u2018\u2019]([^>]*?)>\s*<\/iframe>/gi, (match, before, urlText, after) => {
+    fixedCount++;
+    return `<iframe${before} src="${urlText}"${after}></iframe>`;
+  });
+
+  // Handle self-closing iframe tags SECOND:
+
+```mdx
+
+
+```
+
+### HtmlEmbed
+
+The main purpose of the `HtmlEmbed` component is to **embed** a **Plotly** or **D3.js** chart in your article. **Libraries** are already imported in the template.
+
+Embed files live in the `app/src/content/embeds` folder.
+
+For researchers who want to stay in **Python** while targeting **D3**, the [d3blocks](https://github.com/d3blocks/d3blocks) library lets you create interactive D3 charts with only a few lines of code. In **2025**, **D3** often provides more flexibility and a more web‑native rendering than **Plotly** for custom visualizations.
+
+| Prop        | Required | Description
+|-------------|----------|----------------------------------------------------------------------------------
+| `src`       | Yes      | Path to the embed file in the `embeds` folder.
+| `title`     | No       | Short title displayed above the card.
+| `desc`      | No       | Short description displayed below the card. Supports inline HTML (e.g., links).
+| `frameless` | No       | Removes the card background and border for seamless embeds.
+| `align`     | No       | Aligns the title/description text. One of `left` (default), `center`, `right`.
+| `id`        | No       | Adds an `id` to the outer figure for deep-linking and cross-references.
+| `data`      | No       | Path (string) or array of paths (string[]) to data file(s) consumed by the embed.
+| `config`    | No       | Optional object for embed options (e.g., `{ defaultMetric: 'average_rank' }`).
+
+```mdx
+import HtmlEmbed from '../../../components/HtmlEmbed.astro'
+
+<HtmlEmbed src="your-embed.html" title="A short title" desc="A short description." />
+```
+
+#### Data
+
+If you need to link your **HTML embeds** to **data files**, there is an **`assets/data`** folder for this.
+As long as your files are there, they will be served from the **`public/data`** folder.
+You can fetch them at **`[domain]/data/your-data.ext`**.
+
+Be careful: unlike images, data files are not optimized by Astro, so you need to optimize them manually.
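+If you want to consume such a file from inside an embed, here is a minimal sketch. The `dataUrl` helper and the `results.csv` filename are hypothetical; only the `/data/...` mapping comes from the template.
+
```javascript
// Files placed in assets/data are served from public/data, i.e. at
// [domain]/data/<filename>. This small hypothetical helper builds that URL
// so an embed can fetch its data file by bare filename.
function dataUrl(filename) {
  return `/data/${filename}`;
}

// Inside an embed you could then load the file yourself, for example:
//   const text = await (await fetch(dataUrl('results.csv'))).text();
// or, since D3 is already bundled by the template:
//   const rows = await d3.csv(dataUrl('results.csv'));

console.log(dataUrl('results.csv')); // → /data/results.csv
```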
diff --git a/app/src/content/chapters/demo/debug-components.mdx b/app/src/content/chapters/demo/debug-components.mdx new file mode 100644 index 0000000000000000000000000000000000000000..af27728101e6ddba3760428ceaeb7e925bb40f42 --- /dev/null +++ b/app/src/content/chapters/demo/debug-components.mdx @@ -0,0 +1,35 @@ +import Accordion from '../../../components/Accordion.astro'; +import HtmlEmbed from '../../../components/HtmlEmbed.astro'; +import Image from '../../../components/Image.astro'; +import Wide from '../../../components/Wide.astro'; +import FullWidth from '../../../components/FullWidth.astro'; +import Note from '../../../components/Note.astro'; + +| Prop | Required | +|------------------------|----------| +| `zoomable` | No | +| `downloadable` | No | +| `loading="lazy"` | No | +| `caption` | No | + + + | Prop | Required | Description +|-------------|----------|---------------------------------------------------------------------------------- +| `src` | Yes | Path to the embed file in the `embeds` folder. +| `title` | No | Short title displayed above the card. +| `desc` | No | Short description displayed below the card. Supports inline HTML (e.g., links). +| `frameless` | No | Removes the card background and border for seamless embeds. +| `align` | No | Aligns the title/description text. One of `left` (default), `center`, `right`. + + + + +

        Simple example

        +
        + + + ```mdx + import HtmlEmbed from '../../../components/HtmlEmbed.astro' + + ``` + diff --git a/app/src/content/chapters/demo/getting-started.mdx b/app/src/content/chapters/demo/getting-started.mdx new file mode 100644 index 0000000000000000000000000000000000000000..985e578800eee4e704676852b7fa9954d8dabc9a --- /dev/null +++ b/app/src/content/chapters/demo/getting-started.mdx @@ -0,0 +1,104 @@ +import Sidenote from '../../../components/Sidenote.astro'; +import Note from '../../../components/Note.astro'; + +## Getting Started + +### Installation + +The recommended way is to **duplicate this Space** on **Hugging Face** rather than cloning it directly: + +1. Open the Space: **[🤗 science-blog-template](https://huggingface.co/spaces/tfrere/science-blog-template)**
        and click `Duplicate this Space`. +2. Give it a **name**, choose **visibility**, and keep the **free CPU instance**. +3. **Clone** your new Space repository. +```bash +git clone git@hf.co:spaces// +cd +``` +
        +4. Use **Node.js 20 or newer**.
To manage versions, consider using **nvm**:
+   - macOS/Linux: see [nvm-sh](https://github.com/nvm-sh/nvm)
+   - Windows: see [nvm-windows](https://github.com/coreybutler/nvm-windows)
+
+```bash
+nvm install 20
+nvm use 20
+node -v
+```
+
+5. Install Git LFS and pull the binary files from the repository.
+```bash
+git lfs install
+git lfs pull
+```
+If you attempt to push binary files without Git LFS installed, you will encounter an error.
+
+
+6. Install dependencies.
+
+```bash
+cd app
+npm install
+```
+
+
+  Alternatively, you can use **Yarn** as your package manager.
+
        And that's it! + +**You're ready to go!** 🎉 + +### Development + +```bash +npm run dev +``` + +Once started, the dev server is available at `http://localhost:4321`. + +### Build + +```bash +npm run build +``` + + +### Deploy + +**Every push** automatically triggers a **build** and **deploy** on Spaces. +```bash +# Make edits locally, then: +git add . +git commit -m "Update content" +git push +``` + + +Serving the `dist/` directory on any static host is enough to deliver the site. + + +A [slugified-title].pdf and thumb.jpg are also generated at build time.
You can find them in the `public` folder; since the `public/` directory is served at the site root, link to them at `[domain]/thumb.jpg`.
+
+### Template Synchronization
+
+Keep your project up-to-date with the latest template improvements. The sync system fetches the most recent changes from the official template repository at `https://huggingface.co/spaces/tfrere/research-article-template` and copies them to your project.
+
+```bash
+# Preview what would be updated
+npm run sync:template -- --dry-run
+
+# Update template files (preserves your content)
+npm run sync:template
+```
+
+**What gets preserved:**
+- Your content in `/src/content/`
+
+**What gets updated:**
+- All template files (components, styles, configuration)
+- Dockerfile and deployment configuration
+- Dependencies and build system
+
+
diff --git a/app/src/content/chapters/demo/greetings.mdx b/app/src/content/chapters/demo/greetings.mdx
new file mode 100644
index 0000000000000000000000000000000000000000..5e54e70e5f6d8ce9008d2f84d1c969fde5c1606c
--- /dev/null
+++ b/app/src/content/chapters/demo/greetings.mdx
@@ -0,0 +1,16 @@
+## Greetings
+
+Huge thanks to the following people for their **valuable feedback**!
+
+import HfUser from '../../../components/HfUser.astro';
+
        + + + + + + + + +
diff --git a/app/src/content/chapters/demo/import-content.mdx b/app/src/content/chapters/demo/import-content.mdx
new file mode 100644
index 0000000000000000000000000000000000000000..b27f502c5ab40f5e672aa1240fbd82cba2ddbf37
--- /dev/null
+++ b/app/src/content/chapters/demo/import-content.mdx
@@ -0,0 +1,74 @@
+import Note from '../../../components/Note.astro';
+
+## Import from LaTeX
+
+
+⚠️ **Experimental** — May not work with all LaTeX documents.
+
+
+Transform LaTeX papers into interactive web articles.
+
+### Quick Start
+
+```bash
+cd app/scripts/latex-importer/
+cp your-paper.tex input/main.tex
+cp your-paper.bib input/main.bib
+node index.mjs
+```
+
+### What Gets Converted
+
+- `\label{eq:name}` → Interactive equations
+- `\ref{eq:name}` → Clickable links
+- `\includegraphics{}` → `<Image />` components
+- Bibliography integration
+
+### Prerequisites
+
+- **Pandoc** (`brew install pandoc`)
+- LaTeX source files and figures
+
+### Docker Deployment
+
+Set `ENABLE_LATEX_CONVERSION=true` in your Hugging Face Space to enable automatic conversion during build.
+
+## Import from Notion
+
+
+⚠️ **Experimental** — May not work with all Notion pages.
+
+
+Convert Notion pages into interactive web articles.
+
+### Quick Start
+
+```bash
+cd app/scripts/notion-importer/
+npm install
+cp env.example .env
+# Edit .env with your Notion token
+# Edit input/pages.json with your page IDs
+node index.mjs
+```
+
+### What Gets Converted
+
+- Images
+- Callouts → `<Note />` components
+- Enhanced tables and code blocks
+- Smart link conversion
+
+### Prerequisites
+
+- **Node.js** with ESM support
+- **Notion Integration** with token
+- **Shared Pages** with your integration
+
+
+💡 **Hugging Face Spaces** — Add your `NOTION_TOKEN` to Space secrets for secure access.
+
+
+### Docker Deployment
+
+Set `ENABLE_NOTION_CONVERSION=true` in your Hugging Face Space to enable automatic conversion during build.
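+The Quick Start above asks you to edit `input/pages.json`. Based on the format the converter expects, a minimal configuration looks like this (the IDs and slugs below are placeholders):
+
```json
{
  "pages": [
    {
      "id": "your-notion-page-id",
      "title": "Page Title",
      "slug": "page-slug"
    }
  ]
}
```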
\ No newline at end of file
diff --git a/app/src/content/chapters/demo/introduction.mdx b/app/src/content/chapters/demo/introduction.mdx
new file mode 100644
index 0000000000000000000000000000000000000000..4fe0531d5ccdc1d1493b9a23b416af5a999cabd4
--- /dev/null
+++ b/app/src/content/chapters/demo/introduction.mdx
@@ -0,0 +1,71 @@
+import Sidenote from "../../../components/Sidenote.astro";
+import HtmlEmbed from "../../../components/HtmlEmbed.astro";
+
+Welcome to this open source **research article template**. It helps you publish **clear**, **modern**, and **interactive technical writing** with **minimal setup**.
+
+
+  Read time: 20–25 minutes.
+
+
+Grounded in up-to-date web development practices, it favors **interactive explanations**, **clear notation**, and **inspectable examples** over static snapshots.
+
+Available on [GitHub](https://github.com/tfrere/research-article-template) and deployable on [Hugging Face Spaces](https://huggingface.co/spaces/tfrere/research-article-template).
+
+#### Features
+
        + Markdown-based + KaTeX math + Syntax highlighting + Academic citations + Footnotes + Table of contents + Mermaid diagrams + Plotly ready + D3.js ready + HTML embeds + Gradio app embeds + Dataviz color palettes + Optimized images + Lightweight bundle + SEO friendly + Automatic build + Automatic PDF export + Dark theme + Mobile friendly + Latex import + Template update system +
        + + + If you have questions, remarks or suggestions, open a discussion on the Community tab! + + +## Introduction +The web offers what static PDFs can’t: **interactive diagrams**, **progressive notation**, and **exploratory views** that show how ideas behave. This template treats **interactive artifacts**—figures, math, code, and inspectable experiments—as **first‑class** alongside prose, helping readers **build intuition** instead of skimming results. + +### Who is this for + +Ideal for anyone creating **web‑native** and **interactive** content with **minimal setup**: + +- For **scientists** writing modern web‑native papers +- For **educators** building explorable lessons. + +**No web knowledge required**—just write in **Markdown**. + +This is not a CMS or a multi‑page blog—it's a **focused**, **single‑page**, **MDX‑first** workflow. + +### Inspired by Distill + +This project stands in the direct continuity of [Distill](https://distill.pub/) (2016–2021). Our goal is to carry that spirit forward and push it even further: **accessible scientific writing**, **high‑quality interactive explanations**, and **reproducible**, production‑ready demos. + +{/* To give you a sense of what inspired this template, here is a short, curated list of **well‑designed** and often **interactive** works from Distill: + +- [Growing Neural Cellular Automata](https://distill.pub/2020/growing-ca/) +- [Activation Atlas](https://distill.pub/2019/activation-atlas/) +- [Handwriting with a Neural Network](https://distill.pub/2016/handwriting/) +- [The Building Blocks of Interpretability](https://distill.pub/2018/building-blocks/) */} + +{/* + I'm always excited to discover more great examples—please share your favorites in the Community tab! 
+ */} \ No newline at end of file diff --git a/app/src/content/chapters/demo/markdown.mdx b/app/src/content/chapters/demo/markdown.mdx new file mode 100644 index 0000000000000000000000000000000000000000..d090b3f0f5d9883742b4235e81ddbe616c4571e1 --- /dev/null +++ b/app/src/content/chapters/demo/markdown.mdx @@ -0,0 +1,461 @@ +import placeholder from '../../assets/image/placeholder.png'; +import audioDemo from '../../assets/audio/audio-example.wav'; +import HtmlEmbed from '../../../components/HtmlEmbed.astro'; +import Sidenote from '../../../components/Sidenote.astro'; +import Wide from '../../../components/Wide.astro'; +import Note from '../../../components/Note.astro'; +import FullWidth from '../../../components/FullWidth.astro'; +import Accordion from '../../../components/Accordion.astro'; +import Image from '../../../components/Image.astro'; + +## Markdown + +All the following **markdown features** are available **natively** in the `article.mdx` file. No imports needed, just write markdown directly: + +**Text formatting** — `**Bold**` → **Bold**, `*italic*` → *italic*, `~~strikethrough~~` → ~~strikethrough~~ + +**Code** — `` `inline code` `` → `inline code`, triple backticks for code blocks + +**Lists** — `- Item` for bullets, `1. Item` for numbered lists with nesting support + +**Links** — `[text](url)` → [External links](https://example.com) and internal navigation + +**Highlight** — `text` → Highlighted text for emphasis + +See also the complete [**Markdown documentation**](https://www.markdownguide.org/basic-syntax/). 
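Put together, the basic syntax above can be combined directly in `article.mdx`. A minimal snippet using only the features listed here:

```mdx
**Bold** and *italic* text, ~~strikethrough~~, and `inline code`.

- A bullet item
- Another item
  1. A nested numbered item

[External links](https://example.com)
```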
+ +**Advanced features** — Explore specialized content types: + + + +### Math + +**KaTeX** provides full LaTeX math support with two simple syntaxes: + +**Inline math** — Use `$...$` for equations within text: $x^2 + y^2 = z^2$ + +**Block math** — Use `$$...$$` for centered equations: + +$$ +\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V +$$ + +**Advanced features** — Aligned equations with IDs for cross-referencing: + +```math +\htmlId{trajectory_definition}{\begin{align} + \log p_\theta(\mathcal D) &= \log \sum_{i=0}^N p_\theta ((o,a)_i) \\ + &= \log \sum_{i=0}^N \int_{\text{supp}({Z})} p_\theta((o,a)_i \vert z) p(z) \\ + &= \log \sum_{i=0}^N \int_{\text{supp}({Z})} \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z) p(z) \\ + &= \log \sum_{i=0}^N \mathbb E_{z \sim p_\theta(\bullet \vert (o,a)_i)} [\frac{p(z)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z)], +\end{align}} +``` + +You can reference equations with links like [this equation](#trajectory_definition). + + +```mdx +$x^2 + y^2 = z^2$ + +$$ +\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V +$$ + +$$ +\htmlId{trajectory_definition}{\begin{align} + \log p_\theta(\mathcal D) &= \log \sum_{i=0}^N p_\theta ((o,a)_i) \\ + &= \log \sum_{i=0}^N \int_{\text{supp}({Z})} p_\theta((o,a)_i \vert z) p(z) \\ + &= \log \sum_{i=0}^N \int_{\text{supp}({Z})} \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z) p(z) \\ + &= \log \sum_{i=0}^N \mathbb E_{z \sim p_\theta(\bullet \vert (o,a)_i)} [\frac{p(z)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z)], +\end{align}} +$$ + +``` + + +### Code + +Use inline code with backticks \`...\` or \`\`\` fenced code blocks \`\`\` with a language for syntax highlighting (e.g., \`python\`). + +As an example, here is inline code: `greet("Astro")` and below is a block. 
+

```python
def greet(name: str) -> None:
    print(f"Hello, {name}!")
```


````mdx
`greet("Astro")`

```python
def greet(name: str) -> None:
    print(f"Hello, {name}!")
```
````



### Code output

If you want to display the output of a code block, you can use the `:::output` directive. If it's directly below the code block, it will adapt to the code block's styling.

```python
def greet(name: str) -> None:
    print(f"Hello, {name}!")

greet("Astro")
```
:::output
Hello, Astro!
:::

Or it can also be used as a standalone block.

:::output
Hello, I'm a standalone output block.
:::


```python
print("This script prints a very very long line to check overflow behavior.")
```
:::output
This script prints a very very long line to check overflow behavior.
:::




````mdx
```python
def greet(name: str) -> None:
    print(f"Hello, {name}!")

greet("Astro")
```
:::output
Hello, Astro!
:::

Or you can also use it as a standalone block.

:::output
Hello, I'm a standalone output block.
:::
````


### Citation

The **citation keys** come from `app/src/content/bibliography.bib`.

**Citations** use the `@` syntax (e.g., `[@vaswani2017attention]` or `@vaswani2017attention` in narrative form) and are **automatically** collected to render the **bibliography** at the end of the article.

1) In-text citation with brackets: [@vaswani2017attention].

2) Narrative citation: As shown by @kingma2015adam, stochastic optimization is widely used.

3) Multiple citations and a footnote together: see [@mckinney2017python; @he2016resnet] for related work.

4) All citations in one group: [@vaswani2017attention; @mckinney2017python; @he2016resnet; @silver2017mastering; @openai2023gpt4; @doe2020thesis; @cover2006entropy; @zenodo2021dataset; @sklearn2024; @smith2024privacy; @kingma2015adam; @raffel2020t5].


```mdx
1) In-text citation with brackets: [@vaswani2017attention].
+

2) Narrative citation: As shown by @kingma2015adam, stochastic optimization is widely used.

3) Multiple citations and a footnote together: see [@mckinney2017python; @he2016resnet] for related work.

4) All citations in one group: [@vaswani2017attention; @mckinney2017python; @he2016resnet; @silver2017mastering; @openai2023gpt4; @doe2020thesis; @cover2006entropy; @zenodo2021dataset; @sklearn2024; @smith2024privacy; @kingma2015adam; @raffel2020t5].
```


You can change the citation style in the `astro.config.mjs` file. There are several styles available: `apa`, `vancouver`, `harvard1`, `chicago`, `mla`. Default is `apa`.

### Footnote

**Footnotes** use an identifier like `[^f1]` and a definition anywhere in the document, e.g., `[^f1]: Your explanation`. They are **numbered** and **listed automatically** at the end of the article.

1) Footnote attached to the sentence above[^f1].

[^f1]: Footnote attached to the sentence above.

2) Multi-paragraph footnote example[^f2].

[^f2]: Multi-paragraph footnote. First paragraph.

    Second paragraph with a link to [Astro](https://astro.build).

3) Footnote containing a list[^f3].

[^f3]: Footnote with a list:

    - First item
    - Second item

4) Footnote with an inline code and an indented code block[^f4].

[^f4]: Footnote with code snippet:

    ```ts
    function add(a: number, b: number) {
      return a + b;
    }
    ```
    Result: `add(2, 3) === 5`.

5) Footnote that includes citation inside[^f5] and another footnote[^f1].

[^f5]: Footnote containing citation [@vaswani2017attention] and [@kingma2015adam].

6) Footnote with mathematical expressions[^f6].

[^f6]: This footnote includes inline math $E = mc^2$ and a display equation:

    $$
    \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}
    $$

    This is the Gaussian integral, a fundamental result in probability theory.


```mdx
1) Footnote attached to the sentence above[^f1].

2) Multi-paragraph footnote example[^f2].
+

3) Footnote containing a list[^f3].

4) Footnote with an inline code and an indented code block[^f4].

5) Footnote that includes citation inside[^f5].

6) Footnote with mathematical expressions[^f6].

[^f1]: Footnote attached to the sentence above.

[^f2]: Multi-paragraph footnote. First paragraph.

    Second paragraph with a link to [Astro](https://astro.build).

[^f3]: Footnote with a list:

    - First item
    - Second item

[^f4]: Footnote with code snippet:

        function add(a: number, b: number) {
          return a + b;
        }

    Result: `add(2, 3) === 5`.

[^f5]: Footnote containing citation [@vaswani2017attention] and [@kingma2015adam].

[^f6]: This footnote includes inline math $E = mc^2$ and a display equation:

    $$
    \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}
    $$

    This is the Gaussian integral, a fundamental result in probability theory.
```


### Referencing

In research articles, you often need to reference other parts of the document. References are essentially HTML anchors: they can be used internally within the article or externally from other articles.


1. **Title**<br>
        + Each title automatically gets an id generated from a slugged version of the title text. For example, the id `#mermaid-diagram` is generated from the `Mermaid diagram` title. +

        **Example** [Mermaid diagrams](#mermaid-diagram) + +2. **Image and chart**
        + You can make a link to an image or a chart by adding an ID to it.
        `` then you can link to it with a link like `[Fig 1](#placeholder-image)`. +

        **Example** [Chart 1](#neural-network-mnist-like) or [Fig 1](#placeholder-image) + + + **Available with:** `Reference`, `Image`, and `HtmlEmbed` components all support the `id` prop for creating referenceable anchors. + + + +```mdx + #### Mermaid diagrams + [Mermaid diagrams](#mermaid-diagrams) + + + [Chart 1](#neural-network-mnist-like) + + + [Fig 1](#placeholder-image) +``` + + + +### Mermaid diagram + +Native mermaid diagrams are supported (use a \`\`\`mermaid\`\`\` code fence). You can use the live editor to create your diagram and copy the code to your article. + +```mermaid +erDiagram + DATASET ||--o{ SAMPLE : contains + RUN }o--o{ SAMPLE : uses + RUN ||--|| MODEL : trains + RUN ||--o{ METRIC : logs + + DATASET { + string id + string name + } + + SAMPLE { + string id + string uri + } + + MODEL { + string id + string framework + } + + RUN { + string id + date startedAt + } + + METRIC { + string name + float value + } +``` + + +````mdx +```mermaid +erDiagram + DATASET ||--o{ SAMPLE : contains + RUN }o--o{ SAMPLE : uses + RUN ||--|| MODEL : trains + RUN ||--o{ METRIC : logs + + DATASET { + string id + string name + } + + SAMPLE { + string id + string uri + } + + MODEL { + string id + string framework + } + + RUN { + string id + date startedAt + } + + METRIC { + string name + float value + } +``` +```` + + + +### Separator + +Use `---` on its own line to insert a horizontal separator between sections. This is a standard Markdown “thematic break”. Don’t confuse it with the `---` used at the very top of the file to delimit the frontmatter. + +--- + + +```mdx +Intro paragraph. + +--- + +Next section begins here. +``` + + +### Table + +Use pipe tables like `| Column |` with header separator `| --- |`. You can control alignment with `:---` (left), `:---:` (center), and `---:` (right). 
+ +| Model | Accuracy | F1-Score | Training Time | Status | +|:---|:---:|:---:|---:|:---:| +| **BERT-base** | 0.89 | 0.89 | 2.5h | ✅ | +| **RoBERTa-large** | 0.92 | 0.92 | 4.2h | ✅ | +| **DeBERTa-v3** | 0.94 | 0.94 | 5.8h | ✅ | +| **GPT-3.5-turbo** | 0.91 | 0.91 | 0.1h | ✅ | + + +```mdx +| Model | Accuracy | F1-Score | Training Time | Status | +|:---|:---:|:---:|---:|:---:| +| **BERT-base** | 0.89 | 0.89 | 2.5h | ✅ | +| **RoBERTa-large** | 0.92 | 0.92 | 4.2h | ✅ | +| **DeBERTa-v3** | 0.94 | 0.94 | 5.8h | ✅ | +| **GPT-3.5-turbo** | 0.91 | 0.91 | 0.1h | ✅ | +``` + + +### Audio + +Embed audio using `