TwoKinds [of] data

The comic stuff here.

Moderator: Moderators

Technic[Bot]
Grand Templar
Posts: 1246
Joined: Sat Jan 27, 2018 9:48 pm
Location: México
Fav. Twokinds Character: Raine!
Contact:

TwoKinds [of] data

#1 Post by Technic[Bot] »

You probably just clicked on this thread thinking: "What is... even...?"
Well, let me explain. In my line of work, one of the things I do is analyze data, mostly simple measurements like distances and whatnot. But I also enjoy working with any kind of data, for fun and sometimes profit.
So after being here for a bit over 5 months, I decided to do some simple analysis of any TwoKinds-related data that I could find. At first I just wanted to satisfy my own curiosity and, as I said, I find this fun. But after a few days of work I decided to share some of what I found with everyone in the forum! Hopefully someone will find it interesting and we'll all learn something new.

A fair warning though: if it is not obvious by now, this post will be number heavy, so if that makes you uncomfortable, well, I did warn you. In any case, I plan to keep all my math-heavy stuff in this thread so you can easily ignore it, if that makes you happy.
I will organize this in spoiler tags, so as not to make a massive wall of text and to give some sort of order to this little project of mine.

Also, feedback is welcome: if any graph or part of this is not clear, or if you know some part is wrong, don't hesitate to tell me. If you have any questions, I am all ears. Finally, if for some reason this is not considered kosher for the forum, just shoot me a message and I will remove it.

And without further ado, let's cut to the meat of the business here. I hope you find some of this insightful:

Comic Schedule
Spoiler!
One of the main questions we get in the forum, and probably in any other TK-related place, is when the next page is going to be uploaded. Predicting the future is way above my pay grade, though sometimes I try.
In any case, we fortunately have the upload date for every page since the prologue, so we know how long Tom took to make every page.
Image
This graph shows how many days Tom took to make and upload each consecutive page: 2 days between pages 7 and 8, 11 days between pages 1031 and 1032, and so on. I am surprised that the longest the comic has gone without updating is a little over a month. Tom is dedicated, that much is obvious.
As for hard numbers, the average time between pages is 5.2 days, and if you look closely you can see two modes: in the early days he posted every two days, but after the spike in the middle of the graph he moved to a 7-day schedule.
However, the variance is relatively high, meaning he sometimes takes more or less time than that to publish.
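For anyone who wants to reproduce this kind of number, here is a minimal sketch of how the gaps, their average and their spread could be computed. The dates in `upload_dates` are made up for illustration; they are not the real upload dates, and the variable names are just assumptions.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical upload dates, NOT the real ones from the comic archive.
upload_dates = [
    date(2018, 6, 1),
    date(2018, 6, 3),
    date(2018, 6, 10),
    date(2018, 6, 17),
    date(2018, 6, 28),
]

# Days elapsed between each page and the one published before it.
gaps = [
    (later - earlier).days
    for earlier, later in zip(upload_dates, upload_dates[1:])
]

average_gap = mean(gaps)   # average days between pages
spread = pstdev(gaps)      # how far the gaps stray from that average

print(gaps)
print(average_gap)
print(round(spread, 2))
```

With the real dates in place of the toy list, `average_gap` would be the figure quoted above.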
This graph is fine and dandy, but it is a bit cluttered. Sorry about that; it is a lot of information in a small form factor. However, we can use the power of a different visualization, a histogram:
Image
Here we can see how many pages were published n days apart. For example, over 200 pages were posted just 2 days after the previous one, and only once did he take more than a month.
As we can see, Tom is quite consistent, publishing mostly every 7 days as he says he will. But it is interesting that he had a much more intense schedule way back before page 400. Honestly, his tenacity is admirable.
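The histogram idea boils down to counting how often each gap length occurs. A minimal sketch, assuming the gaps are already available as a list of integers (the values below are invented, not the comic's real gaps):

```python
from collections import Counter

# Made-up gaps in days between consecutive pages.
gaps = [2, 2, 7, 7, 7, 3, 2, 11, 7, 35]

# Count how many pages were published n days after the previous one.
histogram = Counter(gaps)

# Poor man's bar chart, one row per gap length.
for days in sorted(histogram):
    print(f"{days:>2} days: {'#' * histogram[days]}")

# The tallest bar is the most common schedule.
most_common_gap, pages = histogram.most_common(1)[0]
print(most_common_gap, pages)
```

A plotting library would draw the same counts as proper bars; the counting step is the part that matters.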
Twokinds popularity
Spoiler!
One of my favorite sites on all of the Internet is Google Trends. It lets you track the volume of Google searches for a term over the past 10 years or so. You can compare terms, filter by region, dates, etcetera.
I encourage you to go there, type in your favorite movie/book/comic and see how it has fared over time.
In the case of Twokinds we see this:
Image
A warning though: Google does not tell you the number of searches explicitly. Instead:
Google wrote: Numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term.
Meaning that the numbers are normalized with respect to the highest value in the graph.
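To make that normalization concrete, here is a small sketch. The raw counts are invented (Google never publishes them), and `normalize_to_peak` is a hypothetical helper name, not anything from Google's tooling.

```python
def normalize_to_peak(counts):
    """Rescale counts so that the largest value becomes 100."""
    peak = max(counts)
    if peak == 0:
        # No searches at all: everything stays at 0.
        return [0 for _ in counts]
    return [round(100 * count / peak) for count in counts]

# Invented monthly search counts, for illustration only.
monthly_searches = [120, 300, 600, 150]
scores = normalize_to_peak(monthly_searches)
print(scores)
```

The month with 600 searches becomes 100, and a month with half as many becomes 50, which matches the description in the quote.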
This is all well and good, but unless you are a real die-hard fan who has memorized when Tom posted each page, it does not tell you much. So this next graph swaps the x-axis for something more informative: the page number.
Image
If you pay attention you will see this graph starts at page 28: Google's records only go back to 2004 and the comic started in September 2003, so I have missing data there. Also, Google does not compile a month's data until the end of that month, so July is missing.
As we can see, the peak in Google searches is around June 2011, or the pages in the 620s. That is about the time when the group left the Basitin Isles. Interest then took a sharp drop near page 870, a little before the Edinmire incident.
Twokinds fandom
Spoiler!
Precise information about who visits the comic and when is something only Tom would know; he does have Google Analytics turned on on the landing site, after all. But Google Trends does give us general geographic information about anyone who searches for TwoKinds on its platform. It likes to display that as a (heat)map, which is nice and all but hard to read, so I compiled it into a nice bar chart:
Image
However, I think it is a bit skewed. According to this, the Philippines is where most searches originate. Since we have seen a few attempted scammers on the forum recently, I assume this is just bots looking for easy prey on a relatively small anthro comic.
So I dropped the Philippines from the list and renormalized the data:
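Dropping one region and renormalizing is just the same peak scaling applied to what is left. A minimal sketch with invented scores (the real Google Trends values are not reproduced here, and `drop_and_renormalize` is a hypothetical helper):

```python
def drop_and_renormalize(scores, excluded):
    """Remove the excluded regions and rescale so the new maximum is 100."""
    kept = {region: value for region, value in scores.items()
            if region not in excluded}
    peak = max(kept.values())
    return {region: round(100 * value / peak) for region, value in kept.items()}

# Invented regional interest scores, for illustration only.
interest = {
    "Philippines": 100,
    "Finland": 40,
    "Canada": 30,
    "United States": 28,
    "Mexico": 4,
    "Chile": 4,
}

renormalized = drop_and_renormalize(interest, excluded={"Philippines"})
print(renormalized)
```

The ranking of the remaining regions does not change; only the scale does, which makes the smaller bars readable again.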
Image
Here we can see Tom gets a lot of searches from Canada and the US, nothing unexpected, but also that most searches come from Finland and northern Europe.
On a semi-related note, ordering Tom's merch from anywhere outside the US and Canada costs a whole lot of money. Considering there are a lot of European fans, maybe next time Tom should look for a shipping company that also works outside North America.
As for the view from south of the Rio Bravo, something I am an expert on: Mexico is tied with Chile. Chile is the highest consumer of any sort of comic in South America, whereas my motherland is the closest to the US. So the comic does not seem to appeal much in these latitudes.
FAQ
Spoiler!
Some questions that you might be typing right now:
  • Just what... how... why?
    Because I find this FUN!
  • You must be fun at parties
    Actually I am :mrgrin: !
  • Where did you get this information?
    As I said, Google Trends. It is public; you should check it out.
    If you are asking about the dates, I recently re-read the comic and simply wrote down the relevant information, which is publicly available on each page.
  • How accurate is this?
    I get my dates from the TwoKinds page itself. As for the Google information, I assume it is pretty good. They manage to cross-reference your search with what you actually click afterwards, so they have one category for the TwoKinds comic and a different one for everyone who misspells Two Kinds by Amy Tan.
There are three things that motivate people: Money, fear and love.
Links to my ramblings:
Twokinds [of] data
PhpBB in the age of facebook
If you are new to this phpBB thing:
BBCode guide

NuclearBird
Master
Posts: 296
Joined: Tue Aug 08, 2017 9:56 am

Re: TwoKinds [of] data

#2 Post by NuclearBird »

Hmmmm, statistics. Always so nice.
Tbh, the searches coming from Hungary surprise me, all things considered.
If the universe is infinite, does that mean that there is a version of me out there who's thinking the exact same thing?

While we're on the topic of alternate universes, is there one where I'm a lawyer? If yes, then I may be more evil than I thought.

Vintage
Certified Fool
Posts: 1213
Joined: Mon May 26, 2014 3:32 pm
Location: Planet Zambodia
Fav. Twokinds Character: Natani

Re: TwoKinds [of] data

#3 Post by Vintage »

This kind of data is stuff I absolutely love to see!

Interesting to see that the end of the Basitin Isles arc appears to have incredibly high interest relatively speaking.

I also think it's pretty cool to know that on average, new pages are published around 4 days after the last one. Always feels like a looooot more :grin:
Image Image
*pssst* Want'a see what happens when I attempt art? (Avatar made by WoofSenpai & NowandLater)

Ddraig
Templar Master
Posts: 443
Joined: Sat Dec 26, 2015 11:06 pm

Re: TwoKinds [of] data

#4 Post by Ddraig »

Vintage wrote: Mon Jul 09, 2018 5:31 am This kind of data is stuff I absolutely love to see!

Interesting to see that the end of the Basitin Isles arc appears to have incredibly high interest relatively speaking.

I also think it's pretty cool to know that on average, new pages are published around 4 days after the last one. Always feels like a looooot more :grin:
There does seem to be a bit of an explosion of interest there, doesn't there?
"Light thinks it travels faster than anything, but it's wrong. No matter how fast light travels, it always finds that darkness has gotten there first, and is waiting for it."

Vintage
Certified Fool
Posts: 1213
Joined: Mon May 26, 2014 3:32 pm
Location: Planet Zambodia
Fav. Twokinds Character: Natani

Re: TwoKinds [of] data

#5 Post by Vintage »

Ddraig wrote: Mon Jul 09, 2018 9:28 pmThere does seem to be a bit of an explosion of interest there, doesn't there?
Makes me wonder if this data can be cross-referenced with the data that Google's indexed for the forum here. We still have the mystery of why we had our peak activity in August 2013.

Technic[Bot]
Grand Templar
Posts: 1246
Joined: Sat Jan 27, 2018 9:48 pm
Location: México
Fav. Twokinds Character: Raine!
Contact:

Re: TwoKinds [of] data

#6 Post by Technic[Bot] »

Vintage wrote: Mon Jul 09, 2018 5:31 am This kind of data is stuff I absolutely love to see!

Interesting to see that the end of the Basitin Isles arc appears to have incredibly high interest relatively speaking.

I also think it's pretty cool to know that on average, new pages are published around 4 days after the last one. Always feels like a looooot more :grin:
Thanks! It was also fun to put together.
The mean is actually 5.2 days; it is the green dashed line on top of that mess of a graph.
Vintage wrote: Tue Jul 10, 2018 12:28 am
Ddraig wrote: Mon Jul 09, 2018 9:28 pmThere does seem to be a bit of an explosion of interest there, doesn't there?
Makes me wonder if this data can be cross-referenced with the data that Google's indexed for the forum here. We still have the mystery of why we had our peak activity in August 2013.
As useful and interesting as it is, Google Trends data is a bit superficial in the end. Really nuanced data, like who queried a term, when and why, is sold indirectly by Google to third parties via ads. What I am trying to say is that information regarding the specifics of this forum is unlikely to be public.
That being said, if I could get some info about the post frequency on this site, cross-referencing the information would not be that hard. I could, theoretically, scrape all the necessary information from the forum (doing it by hand would be impossible). But that would look painfully close to a DDoS, so I would rather not do it without asking the administration for clearance.

In any case, my opinion about why the Basitin Isles arc was so popular:
Spoiler!
Most of the time when you write something you have to choose between advancing the plot and working on character progression. As a general rule of thumb, Tom is not very good at moving the plot forward, but he is pretty good when he decides to develop his characters: their wishes, desires, dreams and relationships.
The Basitin arc was unusual in that he managed to do both things at once and got a pretty good end product. The subplot moved forward, with high stakes, action and comedy. But we also got to see more into the minds of fan favorites Natani and Keith (OTP), and got character development for pretty much everyone.
Well, kind of. The Basitin arc is like that side mission in games that you are expected to do halfway through the story but that does not really have much impact on the main plot.
In any case, it is not that other parts were not as good, but in my opinion we have not seen such a good blend of plot and character development in other parts of the comic.

Technic[Bot]
Grand Templar
Posts: 1246
Joined: Sat Jan 27, 2018 9:48 pm
Location: México
Fav. Twokinds Character: Raine!
Contact:

Re: TwoKinds [of] data

#7 Post by Technic[Bot] »

When you look at this comic you can see at least two different kinds of data (pun intended). First we have the metadata: publishing dates, interest curves, geolocation, etcetera. Most of the previous post was based on that metadata. But there is more information in the comic, namely the dialogue and the pages themselves. The pages, or images, are high-dimensional data, so it is pretty hard to get something out of them. But thankfully we have the full transcript of the comic in plain text (sort of; more on that later), so we can crunch some data about the transcript, or as I like to call it, the "Twokinds dialogue/play", simply because that is how I stored it on my computer: as a theater play.
So this post is going to deal mainly with the Twokinds transcript: how many lines and words, how many characters and all that good stuff. Again spoilered so the post looks nice and clean:

Something Meta first
Spoiler!
I know I just said that this post is all about the Twokinds dialogue, but I actually forgot to talk about why I decided to start this. Besides, you know, being pretty damn fun in itself.
Originally, just after finishing reading the whole archive, I, like many before me, was left aching for more and waiting for update day every week, and as we have seen in previous graphs, Tom sometimes takes a bit longer than usual. So I wondered:
What is the probability that Tom publishes a new page today? Or, more generally:
What is the probability that Tom publishes a new page on any given weekday?
Since, again, the publish date is freely available on the comic site, this is not particularly complicated to work out. So after some processing we got this:
Image
This graph shows how many times a comic page was published on each weekday. As we can see, Tom is most likely to publish on Wednesday, but Fridays and Mondays are still good days to hope for a new page!
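The "processing" here can be as simple as counting weekdays and dividing by the total. A minimal sketch, using a handful of invented dates rather than the real publishing history:

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical upload dates, NOT the comic's real ones.
upload_dates = [
    date(2018, 7, 4),
    date(2018, 7, 11),
    date(2018, 7, 13),
    date(2018, 7, 16),
    date(2018, 7, 18),
]

# date.weekday() returns 0 for Monday through 6 for Sunday.
counts = Counter(WEEKDAYS[d.weekday()] for d in upload_dates)

# Empirical probability that a page lands on each weekday.
total = sum(counts.values())
probabilities = {day: counts[day] / total for day in WEEKDAYS}

for day in WEEKDAYS:
    print(f"{day:<9} {counts[day]:>2} pages  p = {probabilities[day]:.2f}")
```

These are only observed frequencies from past uploads, so they describe habit rather than guarantee anything about the next page.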
Now to the topic at hand
Spoiler!
OK, back to today's topic: the dialogue. First, to get them out of the way, some numbers!
  • The total number of lines of dialogue in the comic is 10,381 as of page 1033.
  • The total number of speaking characters is 141, not including Mrs. Nibbly.
    Funnily enough, there is a butterfly with one single line of dialogue: "!"
  • The comic is composed of 104,688 words. For comparison, Harry Potter and the Prisoner of Azkaban is 107,253 words long. That seems like a lot, although it is by no means large; really big books start at 500K words.
  • The comic is written in 567,426 characters! More on this later.
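Counts like the ones above can be gathered in a single pass over the transcript. The sketch below assumes a play-style layout of one "Speaker: text" entry per line, which is only a guess at how the file is stored; the sample lines are invented, and whether speaker names count towards the character total is also an assumption (here they do not).

```python
# Invented sample in an assumed "Speaker: text" layout.
transcript = """Trace: Where am I?
Flora: You're awake!
Trace: Who are you?
Butterfly: !"""

dialogue_lines = [line for line in transcript.splitlines() if ":" in line]

speakers = set()
word_count = 0
character_count = 0

for line in dialogue_lines:
    # Split only on the first colon so colons inside the speech survive.
    speaker, speech = line.split(":", 1)
    speech = speech.strip()
    speakers.add(speaker.strip())
    word_count += len(speech.split())
    character_count += len(speech)

print(len(dialogue_lines), len(speakers), word_count, character_count)
```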
Who speaks the most?
Spoiler!
"That was boring," I can almost hear you say. And well, it is a bit, but that is surface-level information. Something I find more interesting is: which characters do most of the talking? Tom has over 15 "main" characters and a lot more "secondary" and background characters, so who is doing all the talking?
Spoilers: not many. Considering only the characters listed on the Characters page, we can see how many "speech bubbles" or lines each one has had:
Image
But let's be honest here: there are at least twice as many characters that are of some significance to the story, so how about them? Well, considering characters who have had more than 75 lines of dialogue in the comic, we get this nice graph.
Image
As we can see, the characters that speak the most are Trace, Flora, Keith and Natani, by a really large margin. Surprisingly enough, in 5th place comes our "favorite pervert" Eric with a bit less than 400 lines, followed closely by everyone's favorite shape-shifter, Raine.
A better visualization
Spoiler!
"So in proportion to the whole corpus of the comic, how does this stack up?" I imagine you say. Well, there is a different graph style for that: the pie chart!
In this chart, all characters with under 1% of the comic's lines are grouped into the Others category:
Image
But just how much of the plot is driven by our 4 main heroes? About half of it:
Image
In this chart, any character with less than 5% of the total number of lines is grouped into the Others category. Numerically, 50.11% of all lines belong to either Flora, Natani, Trace or Keith.
Honestly, I am surprised by Natani: he has been in the comic for significantly less time than the other 3, and not only that, but in the story canon she can only speak Keidran. Despite this, a little more than 10% of the comic is Natani. This dude has a lot to say!
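The "Others" grouping used in both pie charts can be sketched as a simple threshold on each character's share of the lines. The line counts below are invented and `group_small_slices` is a hypothetical helper, so treat this as an illustration of the idea only.

```python
def group_small_slices(line_counts, threshold):
    """Merge every speaker whose share of lines is below threshold into 'Others'."""
    total = sum(line_counts.values())
    grouped = {}
    others = 0
    for name, count in line_counts.items():
        if count / total < threshold:
            others += count
        else:
            grouped[name] = count
    if others:
        grouped["Others"] = others
    return grouped

# Invented line counts, chosen to add up to 100 for easy reading.
line_counts = {"Trace": 40, "Flora": 30, "Keith": 20, "Eric": 6, "Butterfly": 4}

print(group_small_slices(line_counts, threshold=0.05))
print(group_small_slices(line_counts, threshold=0.10))
```

Raising the threshold from 5% to 10% pushes more minor speakers into the Others slice, which is exactly the difference between the two charts' cut-offs.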
Random Curio
Spoiler!
Just some trivia I found while doing this. Minor fun details about the Twokinds transcript:
1.- It is not quite plain text. You see, the ellipses (...) and apostrophes (') were not plain ASCII characters but rather UTF-8 encoded Unicode characters, or emojis if you will, which makes processing the text a bit more complicated.
2.- Some names are changed: despite his first name rarely being read in the comic, as he is mostly referred to by his last name, Alaric's dialogue is always marked as Nikolai.
2.1.- In a similar fashion, Kathrin's name was written as Kat before the Basitin Isles arc, changing to Kathrin afterwards.
2.2.- Young Natani has been referred to both as "Young Natani" and as "Youngtani", the popular fanon abbreviation for her.
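Both quirks can be smoothed out before counting anything. The sketch below assumes the typographic characters are the Unicode ellipsis and curly apostrophes (a guess, since the exact code points are not listed above), and folds the two speaker aliases mentioned in the list into one name each. `clean_text` and `canonical_speaker` are hypothetical helper names.

```python
# Assumed typographic characters, mapped to plain ASCII stand-ins.
PUNCTUATION = str.maketrans({
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
})

# Speaker aliases taken from the list above; extend as more turn up.
ALIASES = {"Kat": "Kathrin", "Youngtani": "Young Natani"}

def clean_text(text):
    """Replace typographic punctuation with ASCII equivalents."""
    return text.translate(PUNCTUATION)

def canonical_speaker(name):
    """Return one consistent name per character."""
    name = name.strip()
    return ALIASES.get(name, name)

print(clean_text("Don\u2019t\u2026"))
print(canonical_speaker("Youngtani"))
```

Without this step the same character would be counted under two names, and words like "don't" would be split or missed depending on which apostrophe was typed.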
And that is all for the moment; these things take more time to make than I expected. There are a few more things I was planning to do with the data I have, which you can expect sometime in the future.
Also, if you have any query about the data or something you want to know about the comic, feel free to leave a suggestion. One problem with datasets is that sometimes you have no idea what to do with them.

Also, sorry for the double post, but enough time has passed, right?...

Dadrobit
Grand Templar
Posts: 1216
Joined: Mon Aug 22, 2011 5:46 am
Location: Sunny Arizona

Re: TwoKinds [of] data

#8 Post by Dadrobit »

Keep it up, this stuff is fun! Did a bit of posting history sleuthing once or twice myself, so it's cool to see all of this laid out like this. :mrgrin:
Image

Technic[Bot]
Grand Templar
Posts: 1246
Joined: Sat Jan 27, 2018 9:48 pm
Location: México
Fav. Twokinds Character: Raine!
Contact:

Re: TwoKinds [of] data

#9 Post by Technic[Bot] »

Now for something a bit different. This will be a shorter post; more on that at the end.
In the last post we dealt with the Twokinds transcript, but we looked mostly at the question: who said what? Or, as it turned out, how much did each character speak? That dealt mostly with information we can take from the structure of the text, in this case a theater play.
This is interesting and gave us a lot of insights, but it does not deal directly with the content of the dialogue, which I find a bit more interesting. So what can we do with that?
A kind cloud
Spoiler!
A word cloud:
In this case the word size is directly proportional to the frequency of the term in the corpus. That is, the more times a word appears in the transcript, the larger it appears in the word cloud.
This is not a very clean or "formal" way to present data, as all the information is implicit in the word size, but oh boy is it information dense and ridiculously intuitive.
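The numbers behind a word cloud are just word frequencies scaled to font sizes. A minimal standard-library sketch follows; the sample sentence, the tiny stopword list and the 48-point maximum are all assumptions for illustration, and the actual image was presumably produced with a dedicated tool.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "i", "you", "is", "to", "and"}

def word_frequencies(text):
    """Count lowercase words, skipping common filler words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOPWORDS)

# Invented sample text, not a quote from the comic.
sample = "I know you think I know. Keith, you know Natani!"
frequencies = word_frequencies(sample)

# Font size directly proportional to frequency; the top word gets 48 pt.
max_count = frequencies.most_common(1)[0][1]
font_sizes = {word: 48 * count / max_count
              for word, count in frequencies.items()}

print(frequencies.most_common(3))
print(font_sizes)
```

Dropping the filler words matters: without that step the biggest words in the cloud would be "the" and "you" rather than names and verbs.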
Image
Here we can see the most important words in the whole comic:
  • Characters: Keith, Natani and of course Trace and Flora
  • The races: Keidran, Basitins and humans
  • Important terms: know, think and well, the last one being an interjection
It is interesting: this is a comic heavily focused on characters and their relationships, so the appearance of names is not unexpected. I was kind of hoping for the word love to appear, though.
I think terms like "know" and "think" relate to the amnesiac protagonist and to how scarce information about basic world-building is in the comic.
But by all means look at the image and draw your own conclusions.
That is all she wrote
Spoiler!
Well, at least for now. This is all I had planned to do with the data I have available. It is not that I do not have enough data, but I think this is enough for the moment. Call it a season finale or something. Hopefully I will get to make a "season two" some time in the future, once I get more ideas about what to do with all the data, or where I can get more tidbits of information.
Again, any questions, comments or anything else are welcome. :mrgreen:
Credits
Spoiler!
A few shout-outs to all the people who made this little project of mine possible:
  • All those who spent their time transcribing the comic
  • Mr. Tom, for keeping all this data available and free, and for writing the comic of course
  • The teams that develop and maintain Python, NumPy, pandas, Matplotlib and NLTK

Neptune
Master
Posts: 255
Joined: Fri Dec 22, 2017 12:01 am
Location: The Wall.
Fav. Twokinds Character: the bear in my signature

Re: TwoKinds [of] data

#10 Post by Neptune »

Trivia: If all of the lines of dialogue were pasted on A4 paper with 600 characters per page, the corpus would consist of an impressive 946 pages.
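The arithmetic behind that page count is a single division rounded up, using the 567,426-character total quoted earlier in the thread and the 600 characters per page assumed above:

```python
import math

total_characters = 567_426   # character count quoted earlier in the thread
characters_per_page = 600    # page capacity assumed in this post

# A partly filled last page still counts as a page, so round up.
pages = math.ceil(total_characters / characters_per_page)
print(pages)
```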

Now, imagine if TwoKinds was, instead, a novel (or epic which means a novel over 100,000 words), I'd imagine that if we were 30-50% through, that would be an amazing ~1850-3200 pages! Well, accounting for literary devices besides dialogue, and it all being compressed into one volume instead of several (considering the 300-page splitting rule, this would be around 6 to 11 volumes).

If TwoKinds was actually a book, then Tom would actually be done with it lol
Image Haha, he's so tiny! Where is he going?

Technic[Bot]
Grand Templar
Posts: 1246
Joined: Sat Jan 27, 2018 9:48 pm
Location: México
Fav. Twokinds Character: Raine!
Contact:

Re: TwoKinds [of] data

#11 Post by Technic[Bot] »

Deep Neural Keidran: Part 1
OK, this is something a bit different from what I did before.
In my line of work you get to play with all sorts of programs and machines that do all sorts of crazy and interesting stuff, both really old (~1960s) and really new. This decade, deep learning and neural networks are all the rage.
And since most of this stuff can run, somewhat, on consumer-grade computers, I could not resist grabbing some of the popular architectures and running them on the Twokinds dataset, or as we know it, the comic!
Fair warning though: this will be mostly me fawning over machine learning and whatnot, and it might get a bit math heavy. So feel free to skip it if this is not your cup of tea, but I do promise nice pictures and some insight into modern artificial intelligence. And hopefully you will all have at least half the fun I had making this!
First some legal information, so Tom won't sue me to high heaven and back:
Spoiler!
The following work is based on Tom Fischbach's comic Twokinds. All characters are his property.
I am legally required to offer a copy of the CC BY-NC-SA 3.0 US license, which the licensor uses for his work.
The following derivative work is also released under the same license, as required by the licensor.
This is a completely NON-commercial work and I have not received, nor will I receive, any compensation in any shape or form from anyone.
Please don't sue me Tom...
With all the formalities covered, let's start!

YOLO
Spoiler!
YOLO is an acronym for a horrible quip that hopefully nobody uses anymore. It is also the name of a neural network architecture designed by Joseph Redmon. YOLO runs on Darknet, a framework written by Mr. Redmon himself.
It is a state-of-the-art object detection algorithm that runs in real time. You can check the paper online. Warning: very math heavy.
In short, this algorithm takes an image and detects all known objects in it:
Image
It does so by drawing a colorful box around each detection. It is trained on the COCO dataset, a bunch of images with tags. This is important, as it can only recognize things that are contained in this big image dump (300k images!).
So how does it work on the comic characters? See for yourself:
Note: none of the boxes were drawn by a human; the software itself decides where and how to draw them.
Image
As you can see, it classifies most characters as persons, the closest category in the dataset.
This looks all fine and dandy until I tell you I had to tweak some parameters to get this result. YOLO works by estimating the probability that some area in an image belongs to a category; by default, if the probability is over 25% it declares that it is indeed that and draws a box around it. However, unless I drop the threshold to 5% I can't get these kinds of results...
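The effect of that threshold can be illustrated with a toy filter. This is not Darknet's code: `Detection` and `keep_confident` are hypothetical names, and the confidence values are invented to show why lowering the cut-off from 25% to 5% makes more boxes appear.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    label: str
    confidence: float  # estimated probability, between 0 and 1

def keep_confident(detections, threshold=0.25):
    """Keep only the detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.confidence > threshold]

# Invented detections for a single comic panel.
detections = [
    Detection("person", 0.62),
    Detection("person", 0.11),
    Detection("cat", 0.07),
    Detection("clock", 0.03),
]

default_boxes = keep_confident(detections)                  # 25% cut-off
relaxed_boxes = keep_confident(detections, threshold=0.05)  # 5% cut-off

print([d.label for d in default_boxes])
print([d.label for d in relaxed_boxes])
```

The trade-off is that a lower threshold also lets through weaker guesses, which is where oddities like a face labeled as a clock tend to come from.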
For example, in this lovely picture the detector is unable to see anything but the glass Maeve is carrying on the platter:
Spoiler!
Image
Well, it might simply be because Maeve is a cat/snow leopard thing and Maddie some cat-cougar hybrid, right? Not necessarily:
Spoiler!
Image
As you can see, it has no problem seeing older Maeve, and it even correctly classifies baby Maeve as a cat. However, if you look closely, it also misclassifies her face as a clock...
Why is this? Well, I am actually stress testing this thing. It has never seen a cartoon in its "life", much less an anthropomorphic cat, so it is still surprising, at least to me, that it can pull off this kind of feat. What I am trying to say here is that there is no good reason why this thing performs as well as it does.

Since this post will get exceedingly long if I keep posting images directly here, I am going to link you to my imgur account (TechnicBot's account), where you can see more of my experiments under the yolo tag. I uploaded only the ones that struck me as the most significant and interesting.
Also, if for whatever reason you want to give it a whirl, you can download it from its GitHub link, though it is a bit hard to set up. If you want to run it on Windows you should use this fork. A warning though: you will need to compile it from source. Also, it is completely free!

So, in any case, some insights from the results:
  • It manages to detect most humans, although for some reason Trace is almost invisible to the software.
  • If you are nice to YOLO and tell it to relax its predictions, it can get Keith most of the time. As I said, most of the time: occasionally it confuses him for a baseball glove or a ball.
  • And since we are dealing with our favorite rabbit humanoid, on to his SO: YOLO is quite fond of Natani.
  • But it seems to highly dislike Flora. Seriously, it was pretty damn hard to get her properly labeled.
If you want me to check how any comic page or sketch is processed by YOLO, feel free to ask, and I will upload it to the imgur repo I mentioned. Also, any doubts, questions, comments, death threats, etc.: feel free to post them here :grin:
Now, at this point I imagine most of you are begging for a mod to ban me so you don't have to read me anymore, so I will leave this here for now.
Well, that was something...
Oh, before I forget, there is a part 2! This thing turned out to be longer and harder than expected, so I decided to split it in two. Next time we will be seeing a different architecture and a different problem. So stay tuned for more machine learned Keidran!!!

Arcus_Deer
Traveler
Posts: 25
Joined: Sun Sep 02, 2018 12:56 pm
Location: Midwest
Fav. Twokinds Character: n

Re: TwoKinds [of] data

#12 Post by Arcus_Deer »

This is incredibly interesting! Though I do not have any experience with this sort of thing, would it be possible to theoretically (I understand it would be a [censored] load of work to actually do) replace the COCO Dataset with a custom one made up of Twokinds characters? Would the program be smart enough to, say, identify the characters in Tom's colored Patreon art if the database was built from all of the comic work?

Also very interesting that it for some reason cannot identify Trace, can it really identify Karen better? I mean she has unusual hair and ears, which I assumed would be less human looking than Trace!

I can't wait for the next bit, thank you very much for sharing!! If I find any pictures I think would be interested to test, I'll dm you. Thanks again!

aitaituo
Templar GrandMaster
Posts: 683
Joined: Wed Nov 24, 2010 10:02 pm

Re: TwoKinds [of] data

#13 Post by aitaituo »

Maeve is a clock. It is known.

Technic[Bot]
Grand Templar
Posts: 1246
Joined: Sat Jan 27, 2018 9:48 pm
Location: México
Fav. Twokinds Character: Raine!
Contact:

Re: TwoKinds [of] data

#14 Post by Technic[Bot] »

You ask and i answer
Spoiler!
Arcus_Deer wrote: Thu Sep 06, 2018 8:09 pm This is incredibly interesting! Though I do not have any experience with this sort of thing, would it be possible to theoretically (I understand it would be a [censored] load of work to actually do) replace the COCO Dataset with a custom one made up of Twokinds characters? Would the program be smart enough to, say, identify the characters in Tom's colored Patreon art if the database was built from all of the comic work?

Also very interesting that it for some reason cannot identify Trace, can it really identify Karen better? I mean she has unusual hair and ears, which I assumed would be less human looking than Trace!

I can't wait for the next bit, thank you very much for sharing!! If I find any pictures I think would be interested to test, I'll dm you. Thanks again!
It is indeed completely plausible. A lot of people need object detection for specific tasks, so COCO is not particularly useful for them and they use their own dataset.
But it is a lot of work. You need to manually draw a box around every object you want to detect in your image set, and then set up some configuration files. You do that for every image, or until you go insane.
Personally I have done it once, for cutlery detection, and it took me around 3 hours to label 500 images. The recommended number of examples per class is around a thousand, so you can imagine this taking an incredibly large amount of time.
But if you manage that, I suppose it could recognize almost every character with over 80% accuracy. The problem is that the comic is fairly small, only a thousand images or so. It might not be enough to train the system.
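To give an idea of what one hand-drawn box turns into, here is a sketch that converts a pixel box into the plain-text label line commonly used for training with Darknet: a class id followed by the box center and size, all as fractions of the image dimensions. The class id, the box and the image size are invented, `to_yolo_label` is a hypothetical helper, and the exact setup used for the cutlery experiment is not described above, so take the format as an assumption.

```python
def to_yolo_label(class_id, box, image_width, image_height):
    """Turn a pixel box (left, top, right, bottom) into a normalized label line."""
    left, top, right, bottom = box
    x_center = (left + right) / 2 / image_width
    y_center = (top + bottom) / 2 / image_height
    width = (right - left) / image_width
    height = (bottom - top) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# One invented box on an invented 800x1000 pixel page; class 0 is hypothetical.
label = to_yolo_label(0, (100, 50, 300, 450), 800, 1000)
print(label)

# Labeling pace from the anecdote above: 3 hours for 500 images.
seconds_per_image = 3 * 3600 / 500
print(seconds_per_image)
```

Every character on every page needs one such line, which is why the labeling, not the training, is the slow part.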

On the other hand, the problem with Trace is his hair and long robe. As the system is trained only on real images of people, it has a hard time recognizing his head as such; normal people don't have blue hair! And since he wears robes most of the time, his hands and legs are obscured, and that information would help the system.
Anyhow, now without further ado:

Machine Learned Keidran: Part 2
As I said, this was supposed to be one single post, but it was a bit more complicated than I expected, so I ended up splitting it in two. So:
First, some more legal disclaimers!!!!!:
For Tom:
Spoiler!
The following work is based on Tom Fischbach's comic Twokinds. All characters are his property.
I am legally required to offer a copy of the CC BY-NC-SA 3.0 US license, which the licensor uses for his work.
The following derivative work is also released under the same license, as required by the licensor.
This is a completely NON-commercial work and I have not received, nor will I receive, any compensation in any shape or form from anyone.
Please don't sue me Tom...
For CMU:
Spoiler!
The following work is made utilizing software made and designed by the Perceptual Computing Lab at Carnegie Mellon University.
This work has been done solely for research purposes and is in no way or form for commercial use.
I am also required to link the aforementioned license.
And a link to the original work: [url=https://github.com/CMU-Perceptual-Compu ... b/openpose]Openpose[/url]
Hopefully that will get any lawyer off my back!

Openpose
Spoiler!
As exciting as I find it, a lot of people might find YOLO to be a bit boring: "Yeah whatevs, it draws boxes around people, suuuper exciting". And true, it is a bit basic in functionality, but until recently having a computer detect objects in a scene with that performance was impossible.
Anyhow, there are many things you can do with deep learning. A similar problem is pose detection: given an image, find and label the joints of every person in it. This is clearer if I show a photo:
Image
So: figuring out, automatically, the pose of every person in the scene. That is relatively simple for people but a giant problem for a computer.
Enter Openpose, Carnegie Mellon's system for real-time human pose estimation. It is trained on a bunch of photos of people where every joint is tagged. I will spare you the details, but if you are interested you can check their 2017 paper. Yeah, it is quite new!
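To make "every joint is tagged" a bit more concrete: the output of a pose estimator for one person is essentially a list of joints, each with a position and a confidence score. Below is a minimal sketch, assuming a flat x, y, confidence list per person under a `pose_keypoints_2d` key, similar to the JSON files Openpose can write. The key name is an assumption that may differ between versions, and the joint names, coordinates and the 0.3 threshold are made up for illustration.

```python
import json

# Hypothetical output for one detected person: x, y, confidence per joint.
RAW = '{"people": [{"pose_keypoints_2d": [120, 80, 0.9, 118, 140, 0.8, 90, 150, 0.1]}]}'
JOINT_NAMES = ["nose", "neck", "right_shoulder"]  # illustrative subset only


def visible_joints(person, threshold=0.3):
    """Return {joint_name: (x, y)} for joints at or above the confidence threshold."""
    flat = person["pose_keypoints_2d"]
    # Regroup the flat list into (x, y, confidence) triples, one per joint.
    triples = [flat[i:i + 3] for i in range(0, len(flat), 3)]
    return {
        name: (x, y)
        for name, (x, y, conf) in zip(JOINT_NAMES, triples)
        if conf >= threshold
    }


for person in json.loads(RAW)["people"]:
    print(visible_joints(person))  # {'nose': (120, 80), 'neck': (118, 140)}
```

A joint the system is unsure about (here the shoulder, at 0.1) simply drops out, which is what happens to limbs hidden under clothing or drawn in an unfamiliar shape.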
So how does this system, trained only on real people, fare against the comic?
Spoiler!
Image
Honestly better than expected:
Spoiler!
Image
In lieu of spamming all my experiments and getting banned for flooding, I will link to my imgur account again, this time to the Openpose folder.
Some insights on the matter:
  • When you get a full body shot of the characters it performs quite well, no threshold tampering this time!
  • Alas, Keidran give it more trouble: it can't figure out how to parse their heads properly. Not unexpected though; the system expects to find a nose, two eyes and two ears, but they are all in the wrong place!
  • It has similar trouble with Keidran feet. Again, with them being digitigrade and Openpose being trained on normal plantigrade people, it can't infer proper foot joints.
  • It is again quite fond of Natani
  • It works remarkably better on dressed characters. The less modestly you are dressed, the more trouble it has seeing you. Again, I doubt they trained the system on porn, so it is biased in favor of clothes.
  • But if you are rocking a long robe/dress it won't be able to see your hands/feet/paws(?) and won't find you.
What i can't show you
Spoiler!
Despite the ominous title, no, it is not some secret defense project, just stuff I wish I could do but that is out of my league:
First is text generation, meaning training a system known as a Long Short-Term Memory (LSTM) network on the comic dialogue. With enough effort the machine learns how to write with more or less the same structure as the text. And in theory it can be used to generate new snippets of dialogue!
Alas, as far as I am aware, they still produce janky results. If you are interested, this little blog article is quite comprehensive.
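For a feel of the "train on text, then sample new text" loop, here is a toy sketch. To be clear, this is not an LSTM: it is a much simpler character-level Markov chain swapped in because it fits in a few lines of standard-library Python. The corpus is placeholder text rather than comic dialogue, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def train(text, order=3):
    """Map every `order`-character context to the characters seen after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model


def generate(model, seed, length=60, rng=None):
    """Extend `seed` one character at a time by sampling from the model."""
    rng = rng or random.Random(0)
    order = len(seed)  # the seed must be as long as the training context
    out = seed
    for _ in range(length):
        choices = model.get(out[-order:])
        if not choices:  # context never seen during training: stop early
            break
        out += rng.choice(choices)
    return out


# Placeholder text; real comic dialogue would go here.
corpus = "the quick brown fox jumps over the lazy dog. the quick red fox naps. "
model = train(corpus, order=3)
print(generate(model, seed="the"))
```

An LSTM plays the same game but learns a far longer and fuzzier notion of context than the last three characters, which is why it needs much more text and training effort.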

Another cool thing you can do is style transfer. In theory this would let you take any image you want and transform it to Tom's style. That is all fine and dandy, but it is based on what are known as adversarial networks and they are a pain to train properly. Way out of my league...

But hey, one can dream, right!
So that would be the end of the machine learned Keidran arc of my little thread. Hopefully you all had some fun.

So what did we learn today:
If the machine apocalypse ever happens, going around naked or dressing yourself as a giant cat might increase your chances of survival.
See you around!

Ddraig
Templar Master
Posts: 443
Joined: Sat Dec 26, 2015 11:06 pm

Re: TwoKinds [of] data

#15 Post by Ddraig »

Technic[Bot] wrote: Fri Sep 07, 2018 5:15 am You ask and i answer
Spoiler!
As for Trace, the problem is his hair and long robe. As the system is trained only on real images of people, it has a hard time recognizing his head as such; normal people don't have blue hair! And since he wears robes most of the time, his hands and legs are obscured, and that information would help the system.
I wonder if the blue triangle has anything to do with it
