Aug 08, 2020
- Started with the huggingface/transformers default summarization pipeline model
- Really well trained, but it seemed to be extractive only
- We needed abstractive summaries, plus a way to train a model on data = (man page content, tldr summary); a minimal usage sketch is below
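A minimal sketch of that first experiment, assuming the stock `pipeline("summarization")` entry point from transformers; the input file name and length limits here are placeholders:

```python
from transformers import pipeline

# Default summarization pipeline; the library picks a pretrained seq2seq model for us.
summarizer = pipeline("summarization")

# Placeholder input: a man page already rendered to plain text (see groff below).
man_page_text = open("cp.txt").read()

result = summarizer(man_page_text[:1024], max_length=120, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```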
- Found a way to render a command's man page as plain text using groff:
$ groff -man -T utf8 $(man -w cp)
$ tldr groff
groff
Typesetting program that reads plain text mixed with formatting commands and produces formatted output.
It is the GNU replacement for the `troff` and `nroff` Unix commands for text formatting.
More information: <https://www.gnu.org/software/groff>.
- Render a man page as plain text, and display the result:
groff -man -T utf8 manpage.1
- Render a man page using the ASCII output device, and display it using a pager:
groff -man -T ascii manpage.1 | less
- Render a man page into an HTML file:
groff -man -T html manpage.1 > page.html
- Process a roff file using the `tbl` and `pic` preprocessors, and the `me` macro set:
groff -t -p -me -T utf8 foo.me
- Run a `groff` command with preprocessor and macro options guessed by the `grog` utility:
eval "$(grog -T utf8 foo.me)"
Results of exploration:
- Good intro and exploration of abstractive vs extractive summarization
- Extractive:
- Gensim TextRank –> Builds a sentence-sentence graph with edges weighted by similarity, then uses PageRank to decide which sentences are selected for the summary (see the sketches after this list)
- PyTextRank –> Similar approach
- LexRank –> Similar, but uses IDF-modified cosine similarity as the edge weight
- Latent Semantic Analysis –> Read More
- Singular value decomposition of the term-sentence matrix into a lower-dimensional space
- Each singular vector can capture a word combination pattern
- The magnitude of a singular value represents how strongly that pattern is present across the sentences
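A quick sketch of the graph-based route using gensim's TextRank summarizer (the `gensim.summarization` module exists in gensim 3.x and was removed in 4.0); the input file is a placeholder:

```python
# gensim 3.x only; the summarization module was dropped in gensim 4.0
from gensim.summarization import summarize

document = open("cp.txt").read()       # placeholder: a rendered man page
print(summarize(document, ratio=0.2))  # keep roughly 20% of the sentences
```

And a minimal LSA sketch along the lines described above, assuming numpy and scikit-learn; the sentences and `k` are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "cp copies files and directories.",
    "Use the -r flag to copy directories recursively.",
    "The -i flag prompts before overwriting existing files.",
]

# Term-sentence matrix A (rows = terms, columns = sentences).
A = CountVectorizer().fit_transform(sentences).T.toarray()

# SVD: each right singular vector is a word-combination pattern over the sentences;
# the corresponding singular value says how strongly that pattern is present.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# For each of the top-k patterns, keep the sentence that loads on it most strongly.
k = 2
picked = sorted({int(np.argmax(np.abs(Vt[i]))) for i in range(min(k, len(s)))})
summary = [sentences[i] for i in picked]
print(summary)
```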
- Abstractive:
- Measuring quality of generated summary:
- Recall Based:
- ROUGE-N (n-gram based; N is the n-gram length)
- It is the ratio of the count of N-grams that occur in both the model summary and the gold summary to the count of all N-grams present in the gold summary
- Precision Based:
- BLEU metric
- It is the ratio of the count of N-grams that occur in both the model summary and the gold summary to the count of all N-grams present in the model summary
- BLEU with modified N-gram
- Weighted n-gram score; for example, unigrams, bigrams, and trigrams weighted 0.4, 0.3, and 0.2 respectively (both ratios are sketched in code below)
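A small sketch of both ratios as described above, with whitespace tokenization and the example weights as simplifying assumptions (real BLEU also applies a brevity penalty, which is skipped here):

```python
from collections import Counter

def ngram_counts(text, n):
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(model_summary, gold_summary, n=2):
    """Recall: overlapping n-grams / all n-grams in the gold summary."""
    model, gold = ngram_counts(model_summary, n), ngram_counts(gold_summary, n)
    overlap = sum((model & gold).values())  # clipped overlap counts
    return overlap / max(sum(gold.values()), 1)

def bleu_n(model_summary, gold_summary, n=2):
    """Precision: overlapping n-grams / all n-grams in the model summary."""
    model, gold = ngram_counts(model_summary, n), ngram_counts(gold_summary, n)
    overlap = sum((model & gold).values())
    return overlap / max(sum(model.values()), 1)

def weighted_bleu(model_summary, gold_summary, weights=(0.4, 0.3, 0.2)):
    """Weighted combination over n-gram orders, using the example weights above."""
    return sum(w * bleu_n(model_summary, gold_summary, n)
               for n, w in enumerate(weights, start=1))

gold = "copy files and directories"
model = "copies files and directories recursively"
print(rouge_n(model, gold, 1), bleu_n(model, gold, 1), weighted_bleu(model, gold))
```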