
Frog - A Tagger-Lemmatizer-Morphological-Analyzer-Dependency-Parser for Dutch

Copyright 2006-2020
Ko van der Sloot, Maarten van Gompel, Antal van den Bosch, Bertjan Busser

Centre for Language and Speech Technology, Radboud University Nijmegen
Induction of Linguistic Knowledge Research Group, Tilburg University

Website: https://languagemachines.github.io/frog

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Most modules were created in the 1990s at the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium). Over the years they have been integrated into a single text processing tool, which is currently maintained and developed by the Language Machines Research Group and the Centre for Language and Speech Technology at Radboud University Nijmegen. A dependency parser, a base phrase chunker, and a named-entity recognizer module were added more recently. Where possible, Frog makes use of multi-processor support to run subtasks in parallel.

Various (re)programming rounds have been made possible through funding by NWO, the Netherlands Organisation for Scientific Research, particularly under the CGN project, the IMIX programme, the Implicit Linguistics project, the CLARIN-NL programme and the CLARIAH programme.

License

Frog is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version (see the file COPYING).

Frog is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Comments and bug reports are welcome via our issue tracker at https://github.com/LanguageMachines/frog/issues or by mailing lamasoftware (at) science.ru.nl. Updates and more information can be found at https://languagemachines.github.io/frog .

Installation

To install Frog, first check whether your distribution's package manager offers an up-to-date package. If not, Frog and its many dependencies can be installed easily as part of our software distribution LaMachine: https://proycon.github.io/LaMachine .

To be able to successfully build Frog from source instead, you need its dependencies installed (among them ticcutils, libfolia, ucto, timbl and mbt).

The data for Frog is packaged separately (the frogdata package) and needs to be installed prior to installing Frog itself.

To compile and install manually from source instead, provided you have all the dependencies installed:

$ bash bootstrap.sh
$ ./configure
$ make
$ make install

and optionally:

$ make check
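
Once Frog and its data files are installed, a quick sanity check is to pipe a sentence through it (a minimal example, assuming frog is on your PATH):

$ echo "Dit is een test." | frog

This should print the analysis in Frog's tab-separated column format, one token per line.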

This software has been tested on:

  • Intel platforms running several versions of Linux, including Ubuntu, Debian, Arch Linux, Fedora (both 32 and 64 bits)
  • Apple platform running macOS

Contents of this distribution:

  • Sources
  • Licensing information ( COPYING )
  • Installation instructions ( INSTALL )
  • Build system based on GNU Autotools
  • Example data files ( in the demos directory )
  • Documentation ( in the docs directory and on https://frognlp.readthedocs.io )

Documentation

The Frog documentation can be found on https://frognlp.readthedocs.io

Credits

Many thanks go out to the people who made the development of the Frog components possible: Walter Daelemans, Jakub Zavrel, Ko van der Sloot, Sabine Buchholz, Sander Canisius, Gert Durieux, Peter Berck and Maarten van Gompel.

Thanks to Erik Tjong Kim Sang and Lieve Macken for stress-testing the first versions of Tadpole, the predecessor of Frog.

Owner

Language Machines, NLP Research group at the Centre for Language Studies, Radboud University Nijmegen

Comments
  • Endless loop of parsing empty sentences when frog server connection is closed

    When I send something to a frog server (frog -S ), I get back results. But when the connection is broken it ends up in an endless loop. See the attached screenshot. I'm using the latest LaMachine distribution, but I've had this issue forever and therefore never used the server parameter.

    (screenshot attached to the original issue)
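
    For reference, a minimal way to try to reproduce this (a sketch; the port number and the use of netcat as an ad-hoc client are arbitrary choices for illustration):

    $ frog -S 12345                                  # start Frog in server mode on an arbitrary port
    $ echo 'Dit is een test.' | nc localhost 12345   # send a sentence; nc closes the connection afterwards

    Once the client connection has been closed, the server should end up in the loop described above.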

  • Frog (through python-frog) accumulates a huge number of temporary files

    User @stergiosmorakis ran a script processing tweets (via python-frog) that ran for a while and produced a lot of /tmp/frog* files containing short input sequences; at some point a million files had accumulated. I should investigate when these are created (an initial investigation suggested they were only used in server mode, which is not the case here, so I probably missed something). They should then be cleaned up at an earlier stage.
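
    Until this is fixed, a possible stop-gap on the user side (a sketch; it assumes the leaked files all match /tmp/frog* and that nothing else by that name is still in use) is to clean them up periodically:

    $ ls /tmp/frog* 2>/dev/null | wc -l                       # how many temporary files have accumulated
    $ find /tmp -maxdepth 1 -name 'frog*' -mmin +60 -delete   # remove those older than an hour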

  • Frog breaks while processing large amount of txt data

    Frog is used to analyze 64 different txt files on 64 cores. It is started in LaMachine with frog.nf --inputdir chunks --outputdir chunks --inputformat text --sentenceperline --workers 64. I have started the process several times; once it ran for a whole day, but another time it broke after only a few hours. According to my calculations, the total runtime on the data should be around 20 days. Here is an excerpt of the error message.

    executor >  local (64)
    [98/46c588] process > frog_text2folia (31) [100%] 64 of 64, failed: 64
    WARN: Killing pending tasks (63)
    Error executing process > 'frog_text2folia (37)'
    
    Caused by:
      Process `frog_text2folia (37)` terminated with an error exit status (1)
    
    Command executed:
    
      set +u
            if [ ! -z "/vol/customopt/lamachine.stable" ]; then
                source /vol/customopt/lamachine.stable/bin/activate
            fi
            set -u
      
            opts=""
            if [[ "true" == "true" ]]; then
                opts="$opts -n"
            fi
            if [ ! -z "" ]; then
      frog-mopts="$opts --skip="
      fi
      
            #move input files to separate staging directory
            mkdir input
            mv *.txt input/
      
            #output will be in cwd
            mkdir output
            frog $opts --outputclass "current" --xmldir "output" --nostdout --testdir input/
            cd output
            for f in *.xml; do
                if [[ ${f%.folia.xml} == $f ]]; then
                    newf="${f%.xml}.frogged.folia.xml"
                else
                    newf="${f%.folia.xml}.frogged.folia.xml"
                fi
                mv $f ../$newf
            done
            cd ..
    
    Command exit status:
      1

    Command output:
      Now using node v13.3.0 (npm v6.13.4)

    Command error:
      frog-mbma-:	o - 0 
      frog-mbma-:	r - 0 
      frog-mbma-:	t - 0 
      frog-mbma-:	m - N  morpheme ='ma'
      frog-mbma-:	a - 0 
      frog-mbma-:	 - /  INFLECTION: de delete='a' morpheme ='t'
      frog-mbma-:	t - 0 
      frog-mbma-:	 - V  delete='jege'
      frog-mbma-:	 - 0 
      frog-mbma-:	 - /  INFLECTION: pv delete='ge'
      frog-mbma-:	 - 0 
      frog-mbma-:	z - 0 
      frog-mbma-:	o - 0 
      frog-mbma-:	c - 0  insert='ek' delete='ch'
      frog-mbma-:	h - 0 
      frog-mbma-:	t - /  INFLECTION: pv
      frog-mbma-:tag: / infl: morhemes: [sport,ma,t] description:  confidence: 0
      frog-mbma-:
      frog-mbma-:Hmm: deleting ' is impossible. (a != ').
      frog-mbma-:Reject rule: MBMA rule (qatar):
      frog-mbma-:	q - N  morpheme ='q'
      frog-mbma-:	a - /  INFLECTION: de delete='''
      frog-mbma-:	t - 0 
      frog-mbma-:	a - 0 
      frog-mbma-:	r - 0  INFLECTION: e
      frog-mbma-:tag: / infl: morhemes: [q] description:  confidence: 0
      frog-mbma-:
      frog-mbma-:Hmm: deleting 's is impossible. (t != ').
      frog-mbma-:Reject rule: MBMA rule (ruytse):
      frog-mbma-:	r - N  morpheme ='ruy'
      frog-mbma-:	u - 0 
      frog-mbma-:	y - 0 
      frog-mbma-:	t - /  INFLECTION: m delete=''s'
      frog-mbma-:	s - 0 
      frog-mbma-:	e - /  INFLECTION: E/P
      frog-mbma-:tag: / infl: morhemes: [ruy] description:  confidence: 0
      frog-mbma-:
      frog-mbma-:Hmm: deleting 's is impossible. (t != ').
      frog-mbma-:Reject rule: MBMA rule (duyts):
      frog-mbma-:	d - N  morpheme ='d'
      frog-mbma-:	u - N  morpheme ='uy'
      frog-mbma-:	y - 0 
      frog-mbma-:	t - /  INFLECTION: m delete=''s'
      frog-mbma-:	s - 0  INFLECTION: e
      frog-mbma-:tag: / infl: morhemes: [d,uy] description:  confidence: 0
      frog-mbma-:
      frog-:problem frogging: nlcow14ax_all_clean_martijn_36.txt
      frog-:std::bad_alloc
      frog-:Wed Jan 15 17:16:55 2020 Frog finished
      mv: cannot stat '*.xml': No such file or directory
    
    Work dir:
      /vol/tensusers2/hmueller/LAMACHINE/wd3/work/85/5e0fda647124c40fd8fd4d2846df61
    
    Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
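
    The std::bad_alloc at the end suggests a worker ran out of memory on one of the larger input files. A possible workaround (a sketch outside of frog.nf; the chunk size and directory names are arbitrary assumptions) is to split the offending file into smaller pieces and frog those directly:

    $ mkdir -p split_input split_output
    $ split -l 100000 -d --additional-suffix=.txt nlcow14ax_all_clean_martijn_36.txt split_input/part_
    $ frog --outputclass "current" --xmldir split_output --nostdout --testdir split_input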
    
  • Frog gets progressively slower when running for hours, days

    When running frog for a long time, performance decreases significantly.

    For instance, I'm processing a 2.8GB file. These are some pv outputs at several moments:

    frog: 15.8MiB  0:05:10 [39.2KiB/s] [>                                 ]  0% ETA  4:20:44:08
    frog:  545MiB  3:29:02 [62.2KiB/s] [>                                 ]  2% ETA  5:13:26:11     
    frog:  858MiB  5:23:19 [30.6KiB/s] [=>                                ]  4% ETA  5:09:10:47
    frog: 3.05GiB 73:28:54 [    0 B/s] [====>                             ] 14% ETA 17:22:57:01
    

    At first, the expected time is 4 days and 20 hours. After having run for 3 days, the expected time has gone up to almost 18 days.

    This is the script used:

    #!/usr/bin/env bash
    FILE=$1
    FILE_SIZE=`wc -c < "$FILE" | cut -f1`
    let "EXPECTED_FROG_SIZE = 8 * FILE_SIZE"
    BODY="${FILE%.*}"
    
    echo "Processing: $FILE"
    echo "Size: $FILE_SIZE"
    echo "Writing output to: ${BODY}_frog.txt"
    
    pv -s ${FILE_SIZE} -cN in ${FILE} | frog --skip=acmnp 2> ${BODY}_frog.log | pv -s ${EXPECTED_FROG_SIZE} -cN frog > ${BODY}_frog.txt
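
    A possible way to sidestep the slowdown while the underlying issue is open (a sketch; the chunk size, directory and file naming are illustrative assumptions) is to cut the input into pieces and restart frog per piece, so that no single process runs for days:

    mkdir -p chunks
    split -l 500000 -d --additional-suffix=.txt "$FILE" chunks/part_
    for c in chunks/part_*.txt; do
        frog --skip=acmnp < "$c" > "${c%.txt}_frog.txt" 2> "${c%.txt}_frog.log"
    done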
    
  • Frog can't deal with tokens that contain spaces

    In historical Dutch, certain words may be written apart even though they can be considered one token, e.g. "vol daen" (voldaan), and are represented as a single <w> in FoLiA. Would the various Frog modules (mblem, mbpos, etc.) be able to deal with spaces in tokens?

  • Error processing pre-tokenised FoLiA with untokenised parts

    I'm running into an error when processing pre-tokenised FoLiA:

    mlp09$ frog --language=nld --skip=tmcpa /scratch/proycon/HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.folia.xml -X

    Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation

    The error seems to occur on a paragraph which contains text but no sentences/words (so it is untokenised, unlike the others); when that paragraph is removed, everything processes fine. It might be indicative of a more structural problem though, as the problem also occurs when I do not skip the tokeniser.

  • question on accuracy

    Hello, this is not an issue, just a question about the accuracy of the different NLP tasks. I'm interested in comparing different types of NLP annotators and their accuracy. How well does Frog do on tokenisation, part-of-speech tagging, lemmatisation, morphological feature annotation and dependency parsing? Are there numbers available that are comparable to the CoNLL 2017 shared task, for example obtained by training Frog on Dutch data from Universal Dependencies and evaluating the output with the shared task's evaluation script (https://github.com/ufal/conll2017/blob/master/evaluation_script/conll17_ud_eval.py)?

  • MWU output when no Parser is selected

    @Irishx suggested:

    By default Frog places an MWU on one line as one token, even if you use the skip option to exclude parsing, while this is actually only needed for the parser. Perhaps we should change this default setting?

    @Irishx do you intend to disable MWU detection too, when the Parser is skipped?

    This is easy to implement, but might change outcomes of older scripts. I am not sure if that would be a problem.

  • Frog can't find ucto's configuration file for non-standard rules?

    There's something wrong with the installation of the historical models still. Frog can't seem to find the tokeniser settings:

    $ frog --language dum    
    frog 0.19 (c) CLTS, ILK 1998 - 2019
    CLST  - Centre for Language and Speech Technology,Radboud University
    ILK   - Induction of Linguistic Knowledge Research Group,Tilburg University
    based on [ucto 0.19, libfolia 2.4, timbl 6.4.14, ticcutils 0.23, mbt 3.5]
    removing old debug files using: 'find frog.*.debug -mtime +1 -exec rm {} \;'
    frog-:config read from: /data2/dev/share/frog/dum/frog.cfg
    frog-:Missing [[mbma]] section in config file.
    frog-:Disabled the Morhological analyzer.
    frog-:Missing [[IOB]] section in config file.
    frog-:Disabled the IOB Chunker.
    frog-:Missing [[NER]] section in config file.
    frog-:Disabled the NER.
    frog-:Missing [[mwu]] section in config file.
    frog-:Disabled the Multi Word Unit.
    frog-:Also disabled the parser.
    frog-mblem-frog-mblem-:Initiating lemmatizer...
    ucto: textcat configured from: /data2/dev/share/ucto/textcat.cfg
    frog-tok-:Language List =[dum]
    ucto: No useful settingsfile(s) could be found.
    frog-tagger-tagger-:reading subsets from /data2/dev/share/frog/dum//crmsub.cgn
    frog-tagger-tagger-:reading constraints from /data2/dev/share/frog/dum//crmconstraints.cgn
    frog-:Initialization failed for: [tokenizer] 
    frog-:fatal error: Frog init failed
    
    $ cat /data2/dev/share/frog/dum/frog.cfg | grep tok
    [[tokenizer]]
    rulesFile=tokconfig-nld-historical
    
    $ ls /data2/dev/share/ucto/*hist*
    /data2/dev/share/ucto/tokconfig-nld-historical
    
  • Redesign Frog

    @antalvdb and @proycon

    I think it is time for a great overhaul of Frog, to

    • reduce complexity
    • increase maintainability
    • be more flexible
    • speed things up

    The main aspect to consider: at the moment Frog uses FoLiA as its internal data structure. That seemed a good plan once, but with growing data sets it has become a memory hog. It also has some nasty multithreading issues (like different, yet all valid, FoLiA files after every run). It also makes processing line-by-line almost impossible, as the whole input file is stuffed into one FoLiA document. I think that processing smaller chunks would speed up the process and deliver output at a more constant pace (not just one burst of FoLiA at the end). This would mean that we cannot directly use the Ucto 'tokenize-to-FoLiA' facility anymore, but that is a small burden, IMHO. There also remains the problem of FoLiA output at the end of Frog: on large input that would still mean a very large file. But it IS possible to create a 'main' file with a bunch of sub-files, using the FoLiA 'external' mechanism. In the current situation this is almost impossible.

    I am also working on a FoLiA Builder class, which creates a FoLiA file on disk incrementally, without first creating it completely in memory. This reduces the footprint of Frog but may still produce insanely large files (>10GB or so).

    Caveat: this solution does not work for FoLiA input into Frog. If you read a very large file, it stays large :) What MIGHT be possible is to make a FoLiA Reader class, which reads chunks from a FoLiA file, has them processed by Frog and assembles the results back into another FoLiA file (using the Builder, for instance).

    I think other improvements might also be welcome, but this is the most intrusive one, I guess. In the end, the Frog results will not change, they will just be produced faster.

    I think I would need a few months to accomplish this. It would be a nice task, but it is a bit too large to do 'on the side'; maybe a CLARIAH+ or CLARIN-NL project?

    Comments and additions welcome!

  • "terminate called without an active exception"

    Forwarding bug report mailed by Alex Bransen:

    I just cannot get frog to work with folia as input, but I want to check
    whether that is due to my installation or whether you have it too. If you
    have a moment, could you run the attached folia file through frog to see
    whether you also get an error? I get:

    "terminate called without an active exception" (a really useful error, too)

    cmd is: frog -x BAAC_A-11-0119.xml -X frogged-BAAC.xml

    Oddly enough, when I change the -x flag to -t (so it treats the xml as plain
    text), it does work..

    According to foliavalidator it is valid XML, by the way.

    Input file is https://download.anaproy.nl/BAAC_A-11-0119.xml ; I can reproduce the bug locally on the latest development version.

  • Segfault on FoLiA in to FoLiA out (speech data with events and utterances)

    Frog (libfolia) segfaults on the attached FoLiA input upon FoLiA serialisation.

    <?xml version="1.0" encoding="utf-8"?>
    <FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.5" xml:id="example">
      <metadata>
          <annotations>
              <text-annotation>
                             <annotator processor="p1" />
              </text-annotation>
              <utterance-annotation>
                             <annotator processor="p1" />
              </utterance-annotation>
              <event-annotation set="speech">
                             <annotator processor="p1" />
              </event-annotation>
          </annotations>
          <provenance>
             <processor xml:id="p1" name="proycon" type="manual" />
          </provenance>
      </metadata>
      <text xml:id="example.speech">
          <event xml:id="turn.1" class="turn" src="piet.wav" begintime="00:00:00.720" endtime="00:00:53.230">
            <utt xml:id="example.utt.1" speaker="Piet">
                <t>Het is vandaag 1 januari 2019. Mijn naam is Piet voor het project Diplomatieke Getuigenissen heb ik vandaag een gesprek met Piet. Ook met ons in de kamer is Piet die voor ons het geluid en de video verzorgt. Meneer Piet misschien dat we gewoon kunnen beginnen met dat u iets over uw opleiding vertelt en hoe u bij Buitenlandse Zaken bent komen te werken?</t>
            </utt>
            <utt xml:id="example.utt.2" speaker="Piet">
                <t>Ja ik ben geboren in 1936. Volgens de boeken het heilige jaar voor de Chinezen. 1936. In 2036 is er weer zo'n heilig jaar. Ik ben ... </t>
            </utt>
          </event>
      </text>
    </FoLiA>
    

    Call: frog --skip=pac -x anon_1.folia.xml -X anon_1.out.folia.xml

    All actual processing goes fine; it is the FoLiA serialisation at the end that fails.

    gdb backtrace:

    Thread 1 "frog" received signal SIGSEGV, Segmentation fault.
    0x0000000000000000 in ?? ()
    (gdb) bt
    #0  0x0000000000000000 in ?? ()
    #1  0x00007fa4eae08999 in folia::AbstractElement::append (this=<optimized out>, this@entry=0x7fa4e700a580, child=<optimized out>, child@entry=0x7fa4e659a7f0) at folia_impl.cxx:3129
    #2  0x00007fa4eae98ee2 in folia::AbstractStructureElement::append (this=0x7fa4e700a580, child=0x7fa4e659a7f0) at folia_subclasses.cxx:784
    #3  0x00007fa4eae306fc in folia::AbstractElement::AbstractElement (this@entry=0x7fa4e659a7f0, __vtt_parm@entry=0x7fa4eb5abfc0 <VTT for folia::Paragraph+16>, p=..., p@entry=0x7fa4e700a580, __in_chrg=<optimized out>) at folia_impl.cxx:293
    #4  0x00007fa4eb4cd949 in folia::AbstractStructureElement::AbstractStructureElement (p=0x7fa4e700a580, props=..., __vtt_parm=0x7fa4eb5abfb8 <VTT for folia::Paragraph+8>, this=0x7fa4e659a7f0, __in_chrg=<optimized out>)
        at /usr/local/include/libfolia/folia_subclasses.h:59
    #5  folia::Paragraph::Paragraph (p=0x7fa4e700a580, a=..., this=0x7fa4e659a7f0, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/local/include/libfolia/folia_subclasses.h:626
    #6  folia::FoliaElement::add_child<folia::Paragraph> (args=..., this=0x7fa4e700a580) at /usr/local/include/libfolia/folia_impl.h:125
    #7  FrogAPI::handle_one_text_parent (this=0x7ffc1bc9e600, os=..., e=0x7fa4e700a580, sentence_done=<optimized out>) at FrogAPI.cxx:2567
    #8  0x00007fa4eb4ce462 in FrogAPI::run_folia_engine (this=0x7ffc1bc9e600, infilename=..., output_stream=...) at FrogAPI.cxx:2661
    #9  0x00007fa4eb4d0bf1 in FrogAPI::FrogFile (this=0x7ffc1bc9e600, infilename=...) at FrogAPI.cxx:2743
    #10 0x00007fa4eb4d3cbd in FrogAPI::run_on_files (this=0x7ffc1bc9e600) at FrogAPI.cxx:1175
    #11 0x000055c8b0feafd2 in main (argc=<optimized out>, argv=<optimized out>) at Frog.cxx:229
    frog_segfault (END)
    
  • Practical questions about large datasets

    I have a corpus of 25 billion words that I want to 'frog'; for that I have a 32-core/128GB RAM machine. My plan is to run 16 separate instances of frog, with at most 500 words per sentence. It looks like I can then 'frog' about 10k words per second.

    But that naturally raises a few questions.

    1. How likely is it that 128GB turns out to be too little somewhere halfway through the process? Is memory usage fairly constant? Is it sensible to split my data into small chunks?

    2. At 10k words/s my whole corpus would require about a month of computation, which is OK, but also long enough that I would like to invest some time in improving performance (with the idea that this would be useful for others as well). Is there any low-hanging fruit performance-wise? My first instinct is that there is a lot of 'communication' and that the internal representation of tokens and their metadata is too complex. But that is not easy to rework.

    3. FoLiA is very flexible but simply too much data; even compressed, the corpus would be many terabytes, which is just not practical if it does not fit on a normal SSD. I have drafted a file format that uses a fixed 8 bytes per token. That obviously means giving something up (only 2⁸ PoS tags, while theoretically I think ~320 can occur, and only 2²⁵ distinct token types (excluding MWUs)), but it does allow fairly fast complex queries (I am aiming at roughly 100 million tokens/s). And with basic compression of the most frequent words I can keep my whole corpus in memory. Has this perhaps been thought about before? I would not want to half-heartedly reimplement something.

    4. Where can I read about how Frog was trained and on which data? I would like to estimate what accuracy I can expect on my dataset, and where possible retrain something to match it better.
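
    On the plan of running 16 separate instances: a minimal way to drive that (a sketch; the chunks/ directory, GNU parallel and the -j 16 setting are illustrative assumptions, not an official recommendation):

    # split the corpus into plain-text chunks (one sentence per line) in chunks/ first,
    # then keep 16 frog processes busy, one chunk per process
    $ ls chunks/*.txt | parallel -j 16 "frog < {} > {.}.frog.tsv 2> {.}.frog.log"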

  • Simplify option and configuration handling

    In Frog there is quite a messy way of handling options and configuration details: we have both FrogOptions and a TiCC::Configuration object to store information.

    This could be simplified a lot by making the Configuration internal to the configuration file parsing and storing all necessary information in the Options.

  • Keep the deep_morph structure intact when resolving MWU's

    When resolving MWUs (in frog_data::resolve_mwus()) the deep_morphs structure is lost; only the deep_morph_string member is resolved. This is disadvantageous, as it is impossible to retrieve the separate deep_morphs and inflections from MWUs (without clumsy split() actions). This is especially a problem when creating JSON output. TODO: rework MWU resolving so that the parts remain available.

  • Add JSON output as an alternative to 'tabbed' format

    The 'tabbed' format is quite rigid and sometimes difficult to read (especially when some modules are skipped). It might be handy to create JSON output as an alternative. This could be really useful for the SERVER mode.

    NOTE: consider JSON input for the server too, then.

  • Use MBMA to split compounds

    In --deep_morph mode, MBMA can detect all kinds of compounds and even outputs them. It would be very useful if we could add some code to produce the logical splitting of the detected compounds.

    E.g., for 'appeltaart' Frog now gives [[appel]noun[taart]noun]noun/singular NN-compound; it seems doable to also give 'appel-taart'.

    In practice this can become very complicated: 'appelgebak' gives [[appel]noun[[ge][bak]noun]noun/singular]noun NN-compound

    You would like to get 'appel-gebak', NOT 'appel-ge-bak' or 'appelge-bak'. For longer compounds it gets even more difficult.

    verkeersagent [[verkeer]noun[s][[ageer]verb[ent]]noun]noun/singular NN-compound

    But still it seems worth investigating.
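
    As a rough illustration of how such a split could be derived from the bracket string in the simple, non-nested case (a throwaway shell sketch, not Frog code; it only handles flat analyses like the 'appeltaart' example and fails on nested ones like 'appelgebak'):

    $ echo '[[appel]noun[taart]noun]noun' | grep -o '\[[a-z]*\]' | tr -d '[]' | paste -s -d '-' -
    appel-taart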
