History of Cologne Digital Lexicons (presentation)

July 31, 2022




Slide 3

Digital Lexicons
1988-1994
1994-2005
pre-2014
2014-2019

Slide 4

Austin 1988
“Many Sanskritists are highly computer literate”
“Bright hopes” by D. Wujastyk
Undoing sandhi, conjunct characters
Sanskrit text archive, a remake of Thesaurus Linguae Graecae, est. 1972
Full textual reference (Panini)

https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html

Slide 5

Post-Austin 1988 (Kharagpur 2019)
Undoing sandhi: solved, open source
1992-2000, Peter Scharf (Pascal)
2009 Jim Funderburk (Perl, Java)
2015 Jim Funderburk (Python 2.7)
Conjunct characters are not an issue in Unicode. Unicode is not widely used in India, however, and that does become an issue (e.g., the Pune intranet). For OCR, the problem was solved in 2016.

https://github.com/funderburkjim/ScharfSandhi

Slide 6

Post-Austin 1988 (Kharagpur 2019)
Sanskrit text archive (GRETIL), 2001
“simply rapid access library”
no “grammatical and lexical systems”
Digital Corpus of Sanskrit (DCS), 2010
560 000 lemmatized sentences (linguistic database, Sanskrit expert system)
Parallel Sanskrit-Russian Corpora, 2013
Rigveda, Atharvaveda, Mahabharata, Ramayana


Slide 7

Post-Austin 1988 (Kharagpur 2019)
Full contextual reference (Panini)
GRA links to RV, not yet Panini
2018 Jim Funderburk


Slide 8

Cologne 1997 Edition
Coding yet to be done
supplement
transliteration of Greek
botanical terms
verbal forms
literary sources

https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html

Slide 9

MW 2019: Supplement
MW supplement (additions and corrections)
fully integrated AFAIK 2018? Jim Funderburk

https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html

Slide 10

MW 2019: Transliterate Greek
transliteration of Greek (16 out of 34 dictionaries)
2007, 2010 Beta Code to Unicode Jim Funderburk, Peter Scharf
2010? Interlinking with Perseus Jim Funderburk
2015-2019 Proofreading Old Greek Jim Funderburk, Jonathan Migliori

https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html
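As a rough illustration of the Beta Code to Unicode step listed above, here is a minimal Python sketch. It is not the Cologne conversion code: the function name `beta_to_unicode` and both tables are assumptions made for this example, and it covers only lowercase letters and the common diacritics (capitals marked with `*` are left untouched).

```python
import unicodedata

# Beta Code base letters, listed in Greek alphabetical order.
BETA_LETTERS = "abgdezhqiklmncoprstufxyw"
# Greek lowercase alpha..omega, skipping final sigma (U+03C2).
GREEK_LETTERS = [chr(c) for c in range(0x3B1, 0x3CA) if c != 0x3C2]
LETTERS = dict(zip(BETA_LETTERS, GREEK_LETTERS))
FINAL_SIGMA = "\u03c2"

# In Beta Code a diacritic follows its letter; map each to a combining mark.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}

def beta_to_unicode(text):
    """Convert lowercase Beta Code Greek to precomposed Unicode (sketch)."""
    text = text.lower()
    out = []
    for i, ch in enumerate(text):
        if ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        elif ch == "s":
            # Use final sigma when no Greek letter follows.
            following = text[i + 1:i + 2]
            out.append(LETTERS["s"] if following in LETTERS else FINAL_SIGMA)
        else:
            # Unknown characters (spaces, punctuation, "*") pass through.
            out.append(LETTERS.get(ch, ch))
    # Compose base letters and combining marks into single code points.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, `beta_to_unicode("lo/gos")` yields λόγος. Building the string from combining marks and then normalizing to NFC keeps the tables small, at the cost of relying on Unicode composition instead of an explicit lookup for every accented letter.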

Slide 11

MW 2017: Botanical Terms
to recognise and to renew plant names, as Linnaean taxonomy has changed over time (15826 cases in 8408 entries in MW)
Hedysarum_Gangeticum
sesamum_grain
the_flower_of_HibHibiscus_MutMutabilis

https://github.com/sanskrit-lexicon/MWS/issues/51

Slide 12

MW 2017: Botanical Terms
Mis-markup (surnames coded as plants)
Roxb., Hex., Gaertn., Nees., Schott., Bl., Wall., Benth., Spreng., Willd., Schott.
Erycibe_Paniculata_Roxb. ---> Erycibe_PaniculataRoxb.
L. after a botanical name is not L[exicographer], but Carl Linnaeus.
Corrections can generate false positives; work with allbot1a.txt had only just begun, but stopped quickly.

https://github.com/sanskrit-lexicon/MWS/issues/51

Slide 13

MW 2017: Verbal Forms
Compare verbal forms databases
Gérard Huet (gitlab INRIA)
Amba Kulkarni (Uni of Hyderabad)
Dhaval Patel (SanskritVerb)
Jim Funderburk
? Oliver Hellwig

https://github.com/sanskrit-lexicon/MWS/issues/51

Slide 14

MW 2019: Literary Sources
Interlinking with Pāṇini was the initial intention
Cologne interlinking exists only for GRA to RV
It turned out that we still do not know how to resolve all abbreviations of literary sources
Punctuation between references: unsolved
Review of abbreviations (mwabbreviations)

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/ejf/hwnorm1c/hwnorm1c.txt

Slide 15

Cologne 2019: Useful Byproducts
List of all Sanskrit headwords from dictionaries: sanhw1.txt & sanhw2.txt
dīpita:dīpita:AP,AP90,MW,MW72,SHS,STC,WIL,YAT
dīpitar:dīpitar:PW,PWG
dīpitā:dīpitā:SKD
dīpitṛ:dīpitṛ:AP,BUR,MW,MW72,SHS,WIL,YAT
dīptaka:dīptaka:MW,MW72,PW,PWG,SHS,WIL,YAT;dīptakaṃ:SKD;dīptakaḥ:AP,AP90
MW normalized grammatical information
Spellchecking & hyphenation (possible patterns)

https://raw.githubusercontent.com/sanskrit-lexicon/CORRECTIONS/master/sanhw2/sanhw2.txt
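The sample lines above suggest a simple layout: a normalized key, then one or more `spelling:DICT,DICT` groups separated by semicolons. A minimal Python sketch for reading one such line follows. The layout is inferred only from the five sample lines on this slide, and the function name `parse_sanhw2_line` is an assumption for illustration, not part of the Cologne tooling.

```python
def parse_sanhw2_line(line):
    """Split one headword line into (key, {spelling: [dictionary codes]}).

    Assumed layout, inferred from the sample lines:
        key:spelling:D1,D2;spelling2:D3
    """
    key, rest = line.rstrip("\n").split(":", 1)
    spellings = {}
    for group in rest.split(";"):
        spelling, codes = group.split(":", 1)
        spellings[spelling] = codes.split(",")
    return key, spellings


def dictionaries_for(line):
    """Return the sorted set of dictionary codes that list this headword."""
    _, spellings = parse_sanhw2_line(line)
    return sorted({code for codes in spellings.values() for code in codes})
```

Applied to the `dīptaka` line, the parser returns the key `dīptaka` with three spellings (`dīptaka`, `dīptakaṃ`, `dīptakaḥ`), each mapped to the dictionaries that use it.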

Slide 16

MW 2017: Misc User Interface
Replica of Printed Fonts for Web Display
https://github.com/sanskrit-lexicon/MWS/issues/51

Slide 17

PW 2017: Code Reorganization Sample
meta-line format;
addition of div markup (breaking huge blobs of text into much more manageable pieces);
addition of abbreviation markup;
conversion to modern IAST;
improvements to spelling of the list of works and authors;
xml markup in place of most esoteric markup using special symbols.

https://github.com/sanskrit-lexicon/Cologne/issues/183#issuecomment-336759401

Slide 18

Simple Search

Slide 19

Cologne 2020: Simple Search
How `simple` at Cologne works (#3)
Searching for khan: kāma kaṇa khan kam kāṇa khāna kan khana kaṇ khaṇa kām kham kāna kana (14 results).
“Sanskrit made easy” in Prof. Huet's wording (#2)
Implemented at SpokenSanskrit.org (#1)
To do in 2020
Cut off verbal endings (enter an inflected form and get underlying MW dictionary words)
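One way to picture the khan example on this slide is a folding key: every headword and the query are reduced to a rough skeleton, and words with the same skeleton match. The Python sketch below is only a guess at such a scheme, not the actual Cologne `simple` implementation. The folding rules (drop diacritics, drop aspiration, merge m with n, drop one final a) are assumptions chosen because they group the 14 results listed above, and the names `fold` and `simple_search` are hypothetical.

```python
import re
import unicodedata

def fold(word):
    """Reduce an IAST word to a rough ASCII skeleton (hypothetical rules)."""
    # 1. Strip diacritics: ā -> a, ṇ -> n, ṃ -> m, and so on.
    decomposed = unicodedata.normalize("NFD", word.lower())
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # 2. Drop aspiration after a stop consonant: kh -> k, bh -> b.
    plain = re.sub(r"([kgcjtdpb])h", r"\1", plain)
    # 3. Treat the nasals m and n as one class.
    plain = plain.replace("m", "n")
    # 4. Ignore a single final a.
    if len(plain) > 1 and plain.endswith("a"):
        plain = plain[:-1]
    return plain

def simple_search(query, headwords):
    """Return the headwords whose skeleton equals that of the query."""
    target = fold(query)
    return [w for w in headwords if fold(w) == target]
```

With these rules, all 14 forms from the slide fold to the same skeleton as `khan`, while a word such as `gaṇa` does not. A production search would more likely precompute the folded keys for the whole headword list and look the query up in that index.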


Slide 20

Sanskrit Dataset Crowdsourcing
Carthago delenda est
When we say DCS is the source, we are not actually giving a real source. It is itself based on GRETIL (108 MB of HTML files, 1600 texts), which is nothing but an aggregator.
https://github.com/sanskrit-lexicon/


Slide 21

Sanskrit Dataset Crowdsourcing
Carthago delenda est
At the level of Cologne I’ve seen what 2.5 people can do in 5 years. What if we could unite 25 Sanskrit enthusiasts to manually check the suspicious words flagged by a fuzzy (Levenshtein) algorithm?
https://github.com/sanskrit-lexicon/
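To make the fuzzy (Levenshtein) idea concrete, here is a small Python sketch. The edit-distance function is the standard dynamic-programming algorithm. The helper `flag_suspicious` and its `max_distance` threshold are assumptions for illustration and do not describe an existing Cologne workflow.

```python
import unicodedata

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    # Compare precomposed characters so that ā counts as one symbol.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def flag_suspicious(words, known, max_distance=1):
    """Map each unknown word to the known headwords it nearly matches."""
    flagged = {}
    for word in words:
        if word in known:
            continue
        near = [k for k in sorted(known)
                if levenshtein(word, k) <= max_distance]
        if near:
            flagged[word] = near
    return flagged
```

A word that is absent from the headword list but one edit away from an attested headword, such as `dipita` next to `dīpita`, is a candidate for manual review. The final decision would still rest with a human checker, since a near match can also be a legitimate separate word.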

