Sliding Gaussian
Blurring for Beginners

Introduction

This is a short tutorial on blurring techniques for beginners. When I was learning this stuff, there was very little material around that was actually useful. That's not true, of course - there were masses of material, but half of it was far too simple and the other half began "Let T be a vector function evaluated over the half-open interval..." and was full of very scary multi-line equations with those big sigma symbols and so on. This article is meant to fix that. I'll talk about the various sorts of blur and the effects you can use them for, with source code in Java.

A Disclaimer

Whenever blurring comes up, there's always somebody who says "Hey! That's not a real motion blur!", or writes angry letters in green ink complaining that the mathematics is dubious, or that there's a much faster way to do this using the Sponglerizer registers on the HAL-9000. Ignore these people. This is a big subject, and this article is only for beginners (of which I can proudly say I am one). What matters is whether you get the results you're aiming for, and if the results you're aiming for require dubious mathematics, then so be it. If the results you're aiming for look horrible to me, that's fine, as long as they look good to you.

Another Disclaimer

There's source code in Java for pretty well everything I talk about here. I make no claims that the code is optimised in any way - I've chosen simplicity over speed everywhere, and you'll probably be able to make most of it faster with a bit of effort. You can use the source code for anything you want, including commercial purposes, but there's no liability. If your nuclear power plant or missile system fails because of an improper blur, it's not my fault.

What is Blurring?

We all know what blurring is, don't we? It's that thing that happens when your camera is out of focus or the dog steals your glasses. What happens is that what should be seen as a sharp point gets smeared out, usually into a disc shape. In image terms this means that each pixel in the source image gets spread over, and mixed into, surrounding pixels. Another way to look at it is that each pixel in the destination image is made up of a mixture of surrounding pixels from the source image.

The operation we need for this is called convolution. That sounds complicated, but only because mathematicians like to make things sound complicated to preserve that air of magic and keep the funding rolling in. Well, I'm onto them, and I can reveal that convolution is not that complicated (at my level, anyway). The way it works is this: we imagine sliding a rectangular array of numbers over our image. This array is called the convolution kernel. For every pixel in the image, we take the corresponding numbers from the kernel and the pixels they sit over, multiply them together, and add all the results up to make the new pixel. For example, imagine we want to do a really simple blur where we just average together each pixel and its eight immediate neighbours.
The kernel we need is a 3x3 array in which every element is 1/9:

1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9

Notice that these all add up to 1, which means that our resulting image will be just as bright as the original.

Without further ado, let's blur an image in Java. All that convolution stuff sounds tricky to implement, but luckily Java comes with a built-in, ready-to-use operator that does exactly this. I'm talking about ConvolveOp. Here's the code (the original listing isn't reproduced in this copy, but a sketch of it follows below). Fantastic! A blurred image! It's not very blurred, though. Let's do a really big blur instead. Hmmmm. Well, that's not so good. Not only did it take a very long time, but the result is slightly odd - everything looks, well, square, and what on earth has happened around the edges?

First the edges: ConvolveOp is a timid, namby-pamby thing that is scared of falling off the edge of the image. If the kernel would overlap the edge of the image, it just gives up and leaves the pixel unchanged. You can change this by passing EDGE_ZERO_FILL instead of EDGE_NO_OP, but that's even worse - the pixels around the edge just get set to zero and effectively disappear. What shall we do? Well, we could pad the image out around the edges before blurring and crop the result, but that's just giving in, and besides, we wouldn't learn anything. Instead, we'll write a proper, fearless, no-nonsense operator that isn't scared of edges. We'll call it ConvolveFilter to distinguish it from ConvolveOp. I won't go into the details of the source in this article - there's not enough time or space, and we have many more filters still to write - but you can download or view the source, and it should be fairly self-explanatory.

Now the squareness problem: the reason everything looks square is that what we've done here is called a box blur - our kernel is shaped like a square, as if we were using a camera with a square aperture. By the way, don't let anyone tell you that box blurs are useless - in fact, if you're simulating the shadow cast by a square light source, it's exactly what you want. They'll also come in useful further on. One more thing: don't get confused - I'm using the term "box blur" to refer to the shape of the kernel, not its profile, which I'm going to call a box filter. More on this later.

To get a more realistic blur, what we should do is use a circular kernel. That simulates much better what a real camera does. That's much better. We'll come back to this later on, but first, a diversion back to the box blur. We've solved the edge-pixel problem, but our blur is still running very slowly, and things are only going to get worse. The problem is that the number of multiplications in the convolution goes up as the square of the kernel radius. With a 100x100 kernel, we'd be doing 10000 multiplies and adds per pixel (approximately). How can we get around this? It turns out that there are more ways to do so than I have time to write about, or even to look into. One way I'll mention quickly before sweeping it under the carpet is this: you can do a box blur by shrinking your image down, blurring it, and scaling it back up again. This may be fine for your purposes, and you should keep it in mind.
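Since the original ConvolveOp listing isn't reproduced above, here's a minimal sketch of what it might look like; the image handling around it is my assumption, but the Kernel and ConvolveOp calls are the standard java.awt.image API.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;

public class BlurDemo {
    public static BufferedImage blur(BufferedImage src) {
        // The 3x3 kernel from above: every element is 1/9, so the weights sum to 1.
        float ninth = 1.0f / 9.0f;
        float[] kernelData = {
            ninth, ninth, ninth,
            ninth, ninth, ninth,
            ninth, ninth, ninth
        };
        Kernel kernel = new Kernel(3, 3, kernelData);
        // EDGE_NO_OP leaves the edge pixels unchanged - see the discussion above.
        ConvolveOp op = new ConvolveOp(kernel, ConvolveOp.EDGE_NO_OP, null);
        return op.filter(src, null);
    }
}
```

A bigger blur just means a bigger kernel built the same way, which is exactly where the cost problem described above starts to bite.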
One problem with that shrink-and-scale approach is that it doesn't animate very well, though that may not be a concern for you.

Let's look at the box blur again: it turns out there are a couple of really easy ways to speed it up. First, the box blur is separable. This means that we can do a 2D blur by doing two 1D blurs, once in the horizontal direction and once in the vertical direction. That's much faster than doing the 2D blur, because the time goes up in proportion to the kernel size, not as its square. Second, think about the window we're sliding across the image. As we move it from left to right, pixels arrive at its right edge and are added to the total, and at the same time pixels leave its left edge and are subtracted from the total. All we need to do is the add and subtract for the entering and leaving pixels at each step, instead of adding up all the pixels in the window, and we only need to store a set of running totals the width or height of the kernel. This gives a massive speed improvement at the cost of having to write some code. Luckily, I've written the code for you, so you win all round. We need two passes to blur, once horizontally and once vertically. The code for these is, of course, quite different. But wait! There's a trick we can do that lets us write the code only once. If we write a blurring function that does the horizontal blur but writes its output transposed, then we can simply call it twice. The first pass blurs horizontally and transposes; the second pass does the same, but as the image is now transposed, it's really doing a vertical blur. The second transposition turns the image the right way up again and voila! - a very fast box blur. Try it out in this applet. And here's the source code (a simplified sketch of the idea appears a little further down).

You may have noticed that we've only used an integer radius so far, which makes it easy to work out the array indices for the blur. We can extend the technique to do sub-pixel blurring (i.e. a non-integral radius) simply by linear interpolation between the array values. My source code doesn't do this, but it's easy to add.

Gaussian Blur

Now it's time to address the speed and the square-looking blur problems at the same time. To get rid of the square look, we need a circular kernel. Unfortunately, the trick we used for box blurs doesn't work with a circle, but there's a loophole: if the kernel has the right profile - the Gaussian profile - then we can do a 2D blur by doing two 1D blurs, just as we did with the box blur. It's not as fast, because the sliding-window trick doesn't work, but it's still a lot faster than doing the full 2D convolution. The profile we need is the familiar bell-shaped, or Gaussian, curve you've heard of. Here's some code to create a 1D Gaussian kernel for a given radius (again, a sketch follows below). All we need to do is apply it twice, once horizontally and once vertically. As a bonus, I've wrapped it up in a GaussianFilter to make it easy to use.
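Backing up to the box blur for a moment: here's a simplified sketch of the transposed sliding-window pass described above. It handles just one channel of int values with clamped edges, so treat it as an illustration of the running-total idea rather than the actual ConvolveFilter source.

```java
public class BoxBlur {
    /**
     * One horizontal box-blur pass over a single channel, with the output
     * written transposed. Call it twice, swapping width and height on the
     * second call, to get a full 2D box blur.
     */
    public static void blurTransposed(int[] in, int[] out, int width, int height, int radius) {
        int windowSize = 2 * radius + 1;
        for (int y = 0; y < height; y++) {
            int row = y * width;
            // Prime the running total for x = 0, clamping indices at the edges.
            int total = 0;
            for (int i = -radius; i <= radius; i++) {
                total += in[row + clamp(i, width)];
            }
            for (int x = 0; x < width; x++) {
                out[x * height + y] = total / windowSize; // write transposed
                // Slide the window: add the entering pixel, subtract the leaving one.
                total += in[row + clamp(x + radius + 1, width)];
                total -= in[row + clamp(x - radius, width)];
            }
        }
    }

    private static int clamp(int x, int width) {
        return Math.max(0, Math.min(width - 1, x));
    }
}
```

A full blur is then blurTransposed(src, tmp, width, height, r) followed by blurTransposed(tmp, dst, height, width, r).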
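And here's a sketch of the sort of 1D Gaussian kernel generation referred to just above. The radius-to-sigma mapping (sigma = radius / 3) is a common rule of thumb and an assumption of mine, not necessarily what the real GaussianFilter uses.

```java
public class GaussianKernel {
    /** Build a normalised 1D Gaussian kernel for a given radius (radius >= 1). */
    public static float[] make(int radius) {
        int size = 2 * radius + 1;
        float[] kernel = new float[size];
        float sigma = radius / 3.0f; // assumption: the tails reach ~zero at the kernel ends
        float twoSigmaSq = 2.0f * sigma * sigma;
        float total = 0;
        for (int i = -radius; i <= radius; i++) {
            float value = (float) Math.exp(-(i * i) / twoSigmaSq);
            kernel[i + radius] = value;
            total += value;
        }
        // Normalise so the weights sum to 1 and image brightness is preserved.
        for (int i = 0; i < size; i++) {
            kernel[i] /= total;
        }
        return kernel;
    }
}
```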
This is why the Gaussian blur is found in every graphics package - it's much faster than other types of blur. The only problem is that it's not very realistic when it comes to simulating camera lenses, but more on that later. If you just want to do things like simulating shadows, then the Gaussian blur, or even the box blur, may be perfectly fine. There's a place for all these effects - just because they aren't realistic doesn't mean they aren't useful.

The Gaussian blur is much faster, but it's nowhere near as fast as the box blur we did earlier. If only there were some way to combine the two. I imagine you've guessed by now that there might be one, so I'll keep you in suspense no longer: if you do a box blur repeatedly, the result looks more and more like a Gaussian blur. In fact, you can prove it mathematically if you've got a spare moment (but don't tell me how - I'm not interested). In practice, 3 to 5 box blurs look pretty good. Don't just take my word for it: the box blur applet above has an Iterations slider, so you can try it out for yourself.

Alpha Channels

A quick diversion here to discuss a problem that often crops up: imagine you want to blur a shape that sits on a transparent background. You have an empty image, you draw a shape on it, then you blur the image. Hang on - why does the blurred part look too dark? The reason is that we've blurred each channel separately, but where the alpha channel is zero (the transparent bits), the red, green, and blue channels are zero, or black. When you do the blur, that black gets mixed in with the opaque bits and you get a dark fringe. The solution is to premultiply the image alpha before blurring and unpremultiply it afterwards (a sketch of this step appears a little further on). Of course, if your images are already premultiplied, you're all set.

Motion Blur

Time for a change of direction. So far we've only talked about uniform blurs, but there are other types. Motion blur is the blur you get when an object (or the camera) moves during the exposure. The image gets blurred along the apparent path of the object. Here we're only going to talk about simulating motion blur on an existing still image - motion blur in animation is a whole different area. We're also only going to blur the whole image - we won't try to blur just one object in the image.

The good news is that we've already done a simple motion blur. Go back to the box blur applet above and set the horizontal radius to, say, 10, and the vertical radius to zero. That gives you a nice horizontal motion blur. For some purposes, this may be all you need. For example, one way to produce a brushed-metal texture is to take an image of random noise and apply a motion blur to it.

If we want to blur in a direction other than horizontal or vertical, things get more complicated. One technique would be to rotate the image, blur it, and then rotate it back. What we'll do here, though, is do it the hard and slow way.
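First, though, here's the premultiply/unpremultiply step from the alpha-channels section above, as a minimal per-pixel sketch; in practice you'd run it over the whole pixel array before and after the blur.

```java
public class AlphaFix {
    /** Premultiply: scale the colour channels by alpha before blurring. */
    public static int premultiply(int argb) {
        int a = argb >>> 24;
        int r = (argb >> 16) & 0xff;
        int g = (argb >> 8) & 0xff;
        int b = argb & 0xff;
        return (a << 24) | ((r * a / 255) << 16) | ((g * a / 255) << 8) | (b * a / 255);
    }

    /** Unpremultiply afterwards, dividing the darkening back out. */
    public static int unpremultiply(int argb) {
        int a = argb >>> 24;
        if (a == 0) return 0; // fully transparent: nothing to recover
        int r = Math.min(255, ((argb >> 16) & 0xff) * 255 / a);
        int g = Math.min(255, ((argb >> 8) & 0xff) * 255 / a);
        int b = Math.min(255, (argb & 0xff) * 255 / a);
        return (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

Now, back to doing motion blur the hard and slow way.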
To do it the hard way, we loop over the image and, for every pixel, add up all the pixels along the motion path. For a straight-line motion blur this just means following a straight line out from the pixel, but you could follow a wiggly path if you wanted to simulate, say, long-exposure camera shake.

Spin and Zoom Blur

Once we've got the code for motion blur in place, it's a simple matter to modify it to do zoom and spin blurs, or even a combination of all three. It's just a question of following the right path for each pixel. For a radial (zoom) blur, follow a path leading out from the blur centre. For a spin blur, follow a tangential path. Try it out in this applet. Here's the source code for doing these three kinds of motion blur.

Faster Motion Blur

You may have noticed that motion blur is a pretty slow business - all those sines and cosines really slow things down. If we're not too worried about quality, though, we can speed it up. All we have to do is add together a lot of transformed versions of the image in a clever way. The clever part is that we can do a 1-pixel motion blur by averaging the image with the same image translated by one pixel. We can do a 2-pixel blur by repeating this with the 1-pixel-blurred images. By repeating this, we can do an N-pixel blur in log2(N) operations, which is much better than doing it the hard and slow way. Zoom and spin blurs can be done with scaling and rotation instead of translation, and one filter will do all three using an AffineTransform (a sketch of the translation version appears a little further down). Try it out in this applet.

Domain Shifting

There's yet another way to do these motion blurs. Remember I said that you can do a linear motion blur by rotating the image, doing a horizontal box blur, and rotating back? Well, the same is true of the zoom and spin blurs, except that you need something more complicated than a rotation. What you need is the polar transform. Once you've transformed your image, a horizontal box blur becomes a spin when you transform back, and a vertical box blur gives you a zoom blur. One detail is that you need a special horizontal box blur that wraps around at the edges, otherwise you'll get a sharp vertical line in your blurred image where the spin angle ought to wrap around.

Blurring by Fourier Transform

The Gaussian blur is very fine if you want that Gaussian blur effect, but what if you want a proper lens blur that simulates a real camera aperture? Watch any film or TV programme for a while, especially something shot at night with lights in the background, and you'll see that things which are out of focus form disc shapes, or perhaps pentagons. There's also a phenomenon called blooming, where bright parts of the image wash out, becoming even brighter compared with the rest. These shapes are called bokeh. Some people love it and some people hate it. We don't care whether people love it or hate it - we just want to reproduce it. You don't get those disc shapes with a Gaussian blur - it's simply too fuzzy around the edges.
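Before leaving motion blur behind, here's a rough sketch of the log2(N) doubling trick for the linear case. The averaging is done by compositing a shifted copy at 50% opacity; for zoom and spin blurs you'd swap the translation for an AffineTransform scale or rotation about the blur centre.

```java
import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class FastMotionBlur {
    /** Approximate an N-pixel horizontal motion blur in about log2(N) passes. */
    public static BufferedImage blur(BufferedImage src, int distance) {
        BufferedImage a = copy(src);
        BufferedImage b = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_ARGB);
        for (int offset = 1; offset < distance; offset *= 2) {
            Graphics2D g = b.createGraphics();
            g.setComposite(AlphaComposite.Src);
            g.drawImage(a, 0, 0, null); // base copy, replacing b's old contents
            // Drawing the shifted copy at 50% opacity averages the two images.
            g.setComposite(AlphaComposite.getInstance(AlphaComposite.SRC_OVER, 0.5f));
            g.drawImage(a, offset, 0, null);
            g.dispose();
            BufferedImage tmp = a; a = b; b = tmp; // ping-pong the two buffers
        }
        return a;
    }

    private static BufferedImage copy(BufferedImage src) {
        BufferedImage c = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_ARGB);
        c.createGraphics().drawImage(src, 0, 0, null);
        return c;
    }
}
```

Now, back to simulating that camera aperture.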
To get proper bokeh, what you need is a nice sharp-edged convolution kernel in the shape of your camera aperture. The problem you run into here is that all those tricks involving separable kernels, iterated box blurs and the like won't work - there are no separable kernels that will give you a pentagon (well, presumably - I'm no mathematician) - so we're back to the old problem of the blur time going up as the square of the blur radius. Fear not, we can bring some heavy mathematical guns to bear on the problem. I don't know how the heavy guns work, but I can aim them. The heavy guns are Fourier transforms. I don't know how they work because I didn't listen in my university lectures, but there's a vast amount on the subject to be found on the Internet, although practically nothing practical (i.e. with source code) on the subject of blurring. With Fourier transforms you can do a blur that takes a time unaffected by the blur radius (in practice, dealing with the image edges means this isn't quite true). Unfortunately, this means that for a small radius it's slow, but you really win with a large radius. One way to deal with this is to use plain convolution for small radii and switch over to Fourier transforms when you reach the crossover point in running time, assuming you've done the experiments to determine where that is. Beware, though: if you're animating a blur, you have to make sure you don't get any visible artifacts at the point where you switch algorithms - the eye is really good at spotting them. For that reason, you may prefer to stick with one algorithm for a whole animation. For still images, nobody will notice. Really. Does it actually look different? Surely we can get away with a Gaussian blur? Well, here's an example to help you make up your mind.

The principle behind the blur isn't too hard, although it may seem like magic. What we do is take the image and the kernel and perform a Fourier transform on both of them. We then multiply the two together and inverse-transform the result back. This is exactly the same as doing the long convolution above (apart from rounding errors). You don't actually need to know what a Fourier transform does in order to implement this, but anyway, what it does is convert your image into frequency space - the resulting image is a strange-looking representation of the spatial frequencies in the image. The inverse, naturally, transforms back into space. Er, spatial space. Think of it as a graphic equalizer for images. You can think of blurring an image as removing high frequencies from it, so that's how Fourier transforms come into the picture.

Implementing this is actually fairly straightforward, but there are lots of nasty details to worry about. First of all, we need some functions to do the transform and its inverse. These can be found in the class FFT. This is by no means a super-optimised implementation - you can find plenty of better ones elsewhere on the Internet.
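The step between the forward and inverse transforms is a pointwise complex multiplication, sketched below. The layout - separate float arrays for the real and imaginary parts - is an assumption for the sake of the example; the whole pipeline is: forward-transform the image and the kernel, run this over the two transformed arrays, then inverse-transform the result.

```java
public class ComplexMultiply {
    /**
     * Pointwise complex multiplication of two transformed arrays:
     * (a+bi)(c+di) = (ac-bd) + (ad+bc)i. The product overwrites the
     * first pair of arrays, ready for the inverse transform.
     */
    public static void multiply(float[] re1, float[] im1, float[] re2, float[] im2) {
        for (int i = 0; i < re1.length; i++) {
            float a = re1[i], b = im1[i];
            float c = re2[i], d = im2[i];
            re1[i] = a * c - b * d;
            im1[i] = a * d + b * c;
        }
    }
}
```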
Next, we need to turn the kernel into an image the same size as the image we're blurring (I'm sure there are ways to avoid this, but I don't know enough maths - if only I'd listened in those lectures). We also need to pad out our source image by the blur radius, duplicating the edge pixels, since it's hard to get the FFT to deal with edges like these. Now, the FFT works on complex numbers, so we have to copy the image and the kernel into float arrays. We can pull a trick here: our images have four channels (alpha, red, green, and blue), so we'd need to do four transforms plus one for the kernel, making five, but since we're working with complex numbers we can do two transforms at once by putting one channel in the real part of the array and another channel in the imaginary part. Now things get easy: simply transform the image and kernel, complex-multiply them together, inverse-transform, and we have our image back, convolved with the kernel. One last little detail is that the transformation process swaps over the quadrants of the image, so we have to unswap them.

Just one small detail remains: the FFT only works on images that are a power of 2 in each direction. What we have to do is add twice the blur radius to the width and height, find the next highest power of 2, and make our arrays that size. For big images this causes a couple of problems. One is that we're using up a lot of memory. Remember, we have our images in float arrays, and we need six of those arrays, each of which is four times the size of the image once it has been expanded to a power of two. Your Java virtual machine may well complain at you if you try this on a big image (I know - I've tried). The second problem is related: things simply get slower with big images because of memory-caching problems. The answer is to split the image into tiles and blur each tile separately. Choosing a good tile size is an open research problem (i.e. I haven't bothered to experiment much), but it's tricky - we have to overlap the tiles by the blur radius, so if we chose a tile size of 256 with a blur radius of 127, we'd only be blurring 4 pixels with each tile. Try it out in this applet.

Threshold Blurs

Something that's often wanted is a blur that blurs parts of the image which are very similar, but preserves sharp edges. This is the digital wrinkle cream, and you can see it in every movie poster ever printed - the stars' faces have all those nasty blemishes ironed out without the image looking blurred. Often it's so overdone that the actors look like waxworks or computer-generated figures. The way we do this is to perform an ordinary convolution, but only count in the surrounding pixels that are similar to the target pixel. Specifically, we have a threshold, and we only include a pixel in the convolution if it differs from the centre pixel by less than that threshold. Unfortunately, the shortcuts we took above won't work here, since we need to include a different set of surrounding pixels for each target pixel, so we're back to doing the full convolution again.
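Here's a minimal sketch of one 1D threshold-blur pass over a single channel, using a Gaussian kernel like the one built earlier; only neighbours within the threshold of the centre pixel contribute, which is what preserves the edges.

```java
public class ThresholdBlur {
    /**
     * One 1D pass over a single channel: a neighbour only contributes if it
     * differs from the centre pixel by less than the threshold, and the
     * weights actually used are renormalised so brightness is preserved.
     */
    public static void blurRow(int[] in, int[] out, float[] kernel, int radius, int threshold) {
        int width = in.length;
        for (int x = 0; x < width; x++) {
            int centre = in[x];
            float total = 0, weightSum = 0;
            for (int i = -radius; i <= radius; i++) {
                int xi = Math.max(0, Math.min(width - 1, x + i));
                int sample = in[xi];
                if (Math.abs(sample - centre) < threshold) {
                    float w = kernel[i + radius];
                    total += w * sample;
                    weightSum += w;
                }
            }
            out[x] = weightSum > 0 ? Math.round(total / weightSum) : centre;
        }
    }
}
```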
Now, although it's extremely dubious, it actually works quite well to still do just the two 1D convolutions of a Gaussian blur (as the sketch above does for one pass), which is much faster than the full 2D convolution, so that's what I've done here. Feel free to modify the source to do the full thing. Try it out in this applet.

Variable Blurs

So far we've only talked about uniform blurs - where the blur radius is the same at every point. For some purposes it's nice to have blurs with a different radius at each point in the image. One example is simulating depth of field: you could take an image that's entirely in focus and apply a variable blur to make parts of it look out of focus. Real depth of field is more complicated than this, because an object that's behind another object shouldn't receive any blur from the object in front, but we'll ignore that and leave it to the professionals.

Now, our fancy tricks above aren't going to help us much here, since everything involves precalculating kernels or relies on the blur radius being the same across the whole image, and at first sight it looks as if we've got no option but to fall back on doing the full convolution at every pixel - only this time it's much worse, since the kernel may have changed since the previous pixel. However, all is not lost. Remember that trick with box blurs, where we just added pixels in as they entered the kernel and subtracted them as they left? It seems as though this can't work in the variable-radius case, because we'd have to keep totals for every possible radius, but there's a modification we can make that lets us magically pull out the total for any radius with just one subtraction. What we do is preprocess the image, replacing every pixel by the sum of all the pixels to its left. That way, when we want the sum of all the pixels between two points on a scanline, we just subtract the first from the second (a sketch of this appears below). This lets us do a fast variable blur using a modified version of the box blur code above. Dealing with the edges is slightly more complicated, since simply subtracting the sums doesn't work for pixels off the edge, but that's a minor detail. We also need a bit more storage, because the sums will exceed the maximum value of a pixel - we'll need to use an int per channel instead of storing four channels in one int.

Well, OK, but this is a Gaussian(ish) blur, isn't it? What about that lens blur thing with a variable radius? Unfortunately, you're out of luck there. I'm not saying there isn't a super-fast way of doing it, but as far as I know, you're going to have to do the full convolution. Try it out in this applet, which blurs more as you drag towards the right.

Sharpening by Blurring

You can use a blur to sharpen an image as well as to blur it, using a technique called unsharp masking. What you do is take the image and subtract a blurred version of it, making sure you compensate for the loss of brightness. This sounds like magic, but it really works: compare this image with the original.
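Here's the summed-scanline trick from the variable blur section above, sketched for a single channel (note the int sums, for exactly the headroom reason just mentioned); the edge handling here is the simple clamped kind.

```java
public class VariableBlur {
    /** sums[x] = sum of row[0..x]; one subtraction then yields any window sum. */
    public static int[] prefixSums(int[] row) {
        int[] sums = new int[row.length];
        int running = 0;
        for (int x = 0; x < row.length; x++) {
            running += row[x];
            sums[x] = running;
        }
        return sums;
    }

    /** Average over the window [x - radius, x + radius], for any per-pixel radius. */
    public static int windowAverage(int[] sums, int x, int radius) {
        int x0 = Math.max(0, x - radius);
        int x1 = Math.min(sums.length - 1, x + radius);
        int sum = sums[x1] - (x0 > 0 ? sums[x0 - 1] : 0);
        return sum / (x1 - x0 + 1);
    }
}
```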
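And a sketch of unsharp masking as just described: sharpened = original + amount * (original - blurred), clamped per channel. The blurred image is assumed to come from any of the Gaussian blurs above; an amount around 0.5 to 1 is a reasonable starting point.

```java
import java.awt.image.BufferedImage;

public class UnsharpMask {
    /** sharpened = original + amount * (original - blurred), per channel, clamped to 0..255. */
    public static BufferedImage sharpen(BufferedImage src, BufferedImage blurred, float amount) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int s = src.getRGB(x, y);
                int b = blurred.getRGB(x, y);
                int rgb = s & 0xff000000; // keep the original alpha
                for (int shift = 16; shift >= 0; shift -= 8) {
                    int sc = (s >> shift) & 0xff;
                    int bc = (b >> shift) & 0xff;
                    int v = Math.round(sc + amount * (sc - bc));
                    rgb |= Math.max(0, Math.min(255, v)) << shift;
                }
                out.setRGB(x, y, rgb);
            }
        }
        return out;
    }
}
```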
Try it out in this applet.

If subtracting a blurred version of an image from itself sharpens it, what does adding it do? As always, there's no need to guess - I'm here to inform you. What you get is a sort of glowing effect which can look quite nice, or quite cheesy, depending on your point of view. Varying the amount of blur added in varies the glow. You can see this effect used a lot on television for dreamy-looking transitions. Try it out in this applet.

Making Shadows

Making a shadow is just a matter of creating an image that looks like the silhouette of the shadowing object, blurring it, possibly distorting or offsetting it, and then pasting the original image over the top. Since this is a really common thing to do, there ought to be a filter for it - and here it is. It's actually a very simplistic implementation - it just blurs the shadow and draws the original image over the top. In practice, it's better not to blur the pixels that are completely hidden by the object.

Casting Rays

We can use the same trick to make rays of light appear to come out of an object, only this time making the shadow colour white, using a zoom blur instead of the ordinary blur, and then adding the result on top of the original. Rays often look better if you cast them only from the bright parts of the image, so the filter has a threshold you can set to restrict the rays to bright areas. This is a good effect to animate: make the centre of the rays move across the image and you get the effect of a moving light source behind the image.

Conclusion

Well, that's it, and I haven't even mentioned other blurring methods such as IIR filters, recursive filters, and all those other nasty things. I hope you've come away with something useful, even if it's only a burning desire to buy some green ink and write me a letter. Finally, you may have noticed that the source code above relies on some other classes. Don't worry, here they are:
eine Punktzahl von 100 oder 0, würden als Long-Tail-Datenpunkte betrachtet und liegen außerhalb des drei Standardabweichungsbereichs. Mit Datenverteilungen in der Finanzierung Finanzanalysten und Investoren verwenden oft eine normale Wahrscheinlichkeitsverteilung bei der Analyse der Erträge eines Wertpapiers oder der Gesamtmarktempfindlichkeit. Standardabweichungen, die die Rückkehr eines Wertpapiers darstellen, sind in der Finanzwelt als Volatilität bekannt. Zum Beispiel sind Aktien, die eine Glockenkurve anzeigen, normalerweise blaue Chipbestände und haben eine niedrigere und vorhersagbare Volatilität. Investoren nutzen die normale Wahrscheinlichkeitsverteilung eines Aktienrückkaufs, um Annahmen über die erwarteten zukünftigen Renditen zu treffen. Allerdings zeigen Aktien und andere Wertpapiere manchmal nicht-normale Ausschüttungen an, was bedeutet, dass sie nicht wie eine Glockenkurve aussehen. Nicht-normale Verteilungen haben fetter Schwänze als eine normale Wahrscheinlichkeitsverteilung. Wenn der fettere Schwanz negativ ist, ist ein Signal für die Anleger, dass es eine größere Wahrscheinlichkeit von negativen Renditen und umgekehrt gibt. Positiv schiefe fette Schwänze können ein Zeichen für abnorme Zukunft returns. Crowdsourcing ist ein sehr beliebtes Mittel, um die großen Mengen von markierten Daten, die moderne Maschine Lernmethoden erfordern. Obwohl billig und schnell zu erhalten, leiden Crowdsourced Etiketten unter erheblichen Mengen an Fehler, wodurch die Leistung der nachgelagerten Maschinen Lernaufgaben verschlechtert. Mit dem Ziel, die Qualität der markierten Daten zu verbessern, versuchen wir, die vielen Fehler, die durch alberne Fehler oder unbeabsichtigte Fehler durch Crowdsourcing-Arbeiter auftreten, zu mildern. Wir schlagen eine zweistufige Einstellung für das Crowdsourcing vor, bei der der Arbeiter zuerst die Fragen beantwortet und dann erlaubt ist, ihre Antworten zu ändern, nachdem er eine (laute) Referenzantwort gesehen hat. Wir formulieren diesen Prozess mathematisch und entwickeln Mechanismen, um die Arbeiter dazu zu bewegen, angemessen zu handeln. Unsere mathematischen Garantien zeigen, dass unser Mechanismus die Arbeiter dazu anregt, ehrlich in beiden Stufen zu antworten und sich in der ersten Phase zufällig zu befreien oder einfach in die zweite zu kopieren. Numerische Experimente zeigen einen signifikanten Leistungsanstieg, den diese 8220self-Korrektur8221 bei der Verwendung von Crowdsourcing zur Verfügung stellen kann, um maschinelle Lernalgorithmen zu trainieren. Es gibt verschiedene parametrische Modelle für die Analyse von paarweisen Vergleichsdaten, einschließlich der Bradley-Terry-Luce (BTL) und Thurstone-Modelle, aber ihre Abhängigkeit von starken parametrischen Annahmen ist begrenzt. In dieser Arbeit untersuchen wir ein flexibles Modell für paarweise Vergleiche, unter denen die Wahrscheinlichkeiten der Ergebnisse nur erforderlich sind, um eine natürliche Form der stochastischen Transitivität zu erfüllen. Diese Klasse umfasst parametrische Modelle einschließlich der BTL - und Thurstone-Modelle als Sonderfälle, ist aber wesentlich allgemeiner. Wir bieten verschiedene Beispiele für Modelle in dieser breiteren stochastisch transitiven Klasse, für die klassische parametrische Modelle schlechte Anpassungen bieten. Trotz dieser größeren Flexibilität zeigen wir, dass die Matrix der Wahrscheinlichkeiten mit der gleichen Geschwindigkeit wie bei den parametrischen Standardmodellen geschätzt werden kann. 
Auf der anderen Seite, anders als bei den BTL - und Thurstone-Modellen, ist die Berechnung des minimax-optimalen Schätzers im stochastisch-transitiven Modell nicht trivial, und wir erforschen verschiedene rechenfähige Alternativen. Wir zeigen, dass ein einfacher singulärer Wert-Schwellenwert-Algorithmus statistisch konsistent ist, aber nicht die Minimax-Rate erreicht. Wir schlagen und studieren Algorithmen, die die Minimax-Rate über interessante Subklassen der vollen stochastisch transitiven Klasse erreichen. Wir ergänzen unsere theoretischen Ergebnisse mit gründlichen numerischen Simulationen. Wir zeigen, wie jedes binäre Paarungsmodell zu einem vollsymmetrischen Modell entwurzelt werden kann, wobei die ursprünglichen Singletonpotentiale in Potenziale an Kanten zu einer hinzugefügten Variablen umgewandelt und dann zu einem neuen Modell auf die ursprüngliche Anzahl von Variablen umgeleitet werden. The new model is essentially equivalent to the original model, with the same partition function and allowing recovery of the original marginals or a MAP conguration, yet may have very different computational properties that allow much more efficient inference. This meta-approach deepens our understanding, may be applied to any existing algorithm to yield improved methods in practice, generalizes earlier theoretical results, and reveals a remarkable interpretation of the triplet-consistent polytope. We show how deep learning methods can be applied in the context of crowdsourcing and unsupervised ensemble learning. First, we prove that the popular model of Dawid and Skene, which assumes that all classifiers are conditionally independent, is to a Restricted Boltzmann Machine (RBM) with a single hidden node. Hence, under this model, the posterior probabilities of the true labels can be instead estimated via a trained RBM. Next, to address the more general case, where classifiers may strongly violate the conditional independence assumption, we propose to apply RBM-based Deep Neural Net (DNN). Experimental results on various simulated and real-world datasets demonstrate that our proposed DNN approach outperforms other state-of-the-art methods, in particular when the data violates the conditional independence assumption. Revisiting Semi-Supervised Learning with Graph Embeddings Zhilin Yang Carnegie Mellon University . William Cohen CMU . Ruslan Salakhudinov U. of Toronto Paper AbstractWe present a semi-supervised learning framework based on graph embeddings. Given a graph between instances, we train an embedding for each instance to jointly predict the class label and the neighborhood context in the graph. We develop both transductive and inductive variants of our method. In the transductive variant of our method, the class labels are determined by both the learned embeddings and input feature vectors, while in the inductive variant, the embeddings are defined as a parametric function of the feature vectors, so predictions can be made on instances not seen during training. On a large and diverse set of benchmark tasks, including text classification, distantly supervised entity extraction, and entity classification, we show improved performance over many of the existing models. Reinforcement learning can acquire complex behaviors from high-level specifications. However, defining a cost function that can be optimized effectively and encodes the correct task is challenging in practice. 
We explore how inverse optimal control (IOC) can be used to learn behaviors from demonstrations, with applications to torque control of high-dimensional robotic systems. Our method addresses two key challenges in inverse optimal control: first, the need for informative features and effective regularization to impose structure on the cost, and second, the difficulty of learning the cost function under unknown dynamics for high-dimensional continuous systems. To address the former challenge, we present an algorithm capable of learning arbitrary nonlinear cost functions, such as neural networks, without meticulous feature engineering. To address the latter challenge, we formulate an efficient sample-based approximation for MaxEnt IOC. We evaluate our method on a series of simulated tasks and real-world robotic manipulation problems, demonstrating substantial improvement over prior methods both in terms of task complexity and sample efficiency. In learning latent variable models (LVMs), it is important to effectively capture infrequent patterns and shrink model size without sacrificing modeling power. Various studies have been done to 8220diversify8221 a LVM, which aim to learn a diverse set of latent components in LVMs. Most existing studies fall into a frequentist-style regularization framework, where the components are learned via point estimation. In this paper, we investigate how to 8220diversify8221 LVMs in the paradigm of Bayesian learning, which has advantages complementary to point estimation, such as alleviating overfitting via model averaging and quantifying uncertainty. We propose two approaches that have complementary advantages. One is to define diversity-promoting mutual angular priors which assign larger density to components with larger mutual angles based on Bayesian network and von Mises-Fisher distribution and use these priors to affect the posterior via Bayes rule. We develop two efficient approximate posterior inference algorithms based on variational inference and Markov chain Monte Carlo sampling. The other approach is to impose diversity-promoting regularization directly over the post-data distribution of components. These two methods are applied to the Bayesian mixture of experts model to encourage the 8220experts8221 to be diverse and experimental results demonstrate the effectiveness and efficiency of our methods. High dimensional nonparametric regression is an inherently difficult problem with known lower bounds depending exponentially in dimension. A popular strategy to alleviate this curse of dimensionality has been to use additive models of emph , which model the regression function as a sum of independent functions on each dimension. Though useful in controlling the variance of the estimate, such models are often too restrictive in practical settings. Between non-additive models which often have large variance and first order additive models which have large bias, there has been little work to exploit the trade-off in the middle via additive models of intermediate order. In this work, we propose salsa, which bridges this gap by allowing interactions between variables, but controls model capacity by limiting the order of interactions. salsas minimises the residual sum of squares with squared RKHS norm penalties. Algorithmically, it can be viewed as Kernel Ridge Regression with an additive kernel. When the regression function is additive, the excess risk is only polynomial in dimension. 
Using the Girard-Newton formulae, we efficiently sum over a combinatorial number of terms in the additive expansion. Via a comparison on 15 real datasets, we show that our method is competitive against 21 other alternatives. We propose an extension to Hawkes processes by treating the levels of self-excitation as a stochastic differential equation. Our new point process allows better approximation in application domains where events and intensities accelerate each other with correlated levels of contagion. We generalize a recent algorithm for simulating draws from Hawkes processes whose levels of excitation are stochastic processes, and propose a hybrid Markov chain Monte Carlo approach for model fitting. Our sampling procedure scales linearly with the number of required events and does not require stationarity of the point process. A modular inference procedure consisting of a combination between Gibbs and Metropolis Hastings steps is put forward. We recover expectation maximization as a special case. Our general approach is illustrated for contagion following geometric Brownian motion and exponential Langevin dynamics. Rank aggregation systems collect ordinal preferences from individuals to produce a global ranking that represents the social preference. To reduce the computational complexity of learning the global ranking, a common practice is to use rank-breaking. Individuals preferences are broken into pairwise comparisons and then applied to efficient algorithms tailored for independent pairwise comparisons. However, due to the ignored dependencies, naive rank-breaking approaches can result in inconsistent estimates. The key idea to produce unbiased and accurate estimates is to treat the paired comparisons outcomes unequally, depending on the topology of the collected data. In this paper, we provide the optimal rank-breaking estimator, which not only achieves consistency but also achieves the best error bound. This allows us to characterize the fundamental tradeoff between accuracy and complexity in some canonical scenarios. Further, we identify how the accuracy depends on the spectral gap of a corresponding comparison graph. Dropout distillation Samuel Rota Bul FBK . Lorenzo Porzi FBK . Peter Kontschieder Microsoft Research Cambridge Paper AbstractDropout is a popular stochastic regularization technique for deep neural networks that works by randomly dropping (i. e. zeroing) units from the network during training. This randomization process allows to implicitly train an ensemble of exponentially many networks sharing the same parametrization, which should be averaged at test time to deliver the final prediction. A typical workaround for this intractable averaging operation consists in scaling the layers undergoing dropout randomization. This simple rule called 8216standard dropout8217 is efficient, but might degrade the accuracy of the prediction. In this work we introduce a novel approach, coined 8216dropout distillation8217, that allows us to train a predictor in a way to better approximate the intractable, but preferable, averaging process, while keeping under control its computational efficiency. We are thus able to construct models that are as efficient as standard dropout, or even more efficient, while being more accurate. Experiments on standard benchmark datasets demonstrate the validity of our method, yielding consistent improvements over conventional dropout. Metadata-conscious anonymous messaging Giulia Fanti UIUC . Peter Kairouz UIUC . Sewoong Oh UIUC . 
Kannan Ramchandran UC Berkeley . Pramod Viswanath UIUC Paper AbstractAnonymous messaging platforms like Whisper and Yik Yak allow users to spread messages over a network (e. g. a social network) without revealing message authorship to other users. The spread of messages on these platforms can be modeled by a diffusion process over a graph. Recent advances in network analysis have revealed that such diffusion processes are vulnerable to author deanonymization by adversaries with access to metadata, such as timing information. In this work, we ask the fundamental question of how to propagate anonymous messages over a graph to make it difficult for adversaries to infer the source. In particular, we study the performance of a message propagation protocol called adaptive diffusion introduced in (Fanti et al. 2015). We prove that when the adversary has access to metadata at a fraction of corrupted graph nodes, adaptive diffusion achieves asymptotically optimal source-hiding and significantly outperforms standard diffusion. We further demonstrate empirically that adaptive diffusion hides the source effectively on real social networks. The Teaching Dimension of Linear Learners Ji Liu University of Rochester . Xiaojin Zhu University of Wisconsin . Hrag Ohannessian University of Wisconsin-Madison Paper AbstractTeaching dimension is a learning theoretic quantity that specifies the minimum training set size to teach a target model to a learner. Previous studies on teaching dimension focused on version-space learners which maintain all hypotheses consistent with the training data, and cannot be applied to modern machine learners which select a specific hypothesis via optimization. This paper presents the first known teaching dimension for ridge regression, support vector machines, and logistic regression. We also exhibit optimal training sets that match these teaching dimensions. Our approach generalizes to other linear learners. Truthful Univariate Estimators Ioannis Caragiannis University of Patras . Ariel Procaccia Carnegie Mellon University . Nisarg Shah Carnegie Mellon University Paper AbstractWe revisit the classic problem of estimating the population mean of an unknown single-dimensional distribution from samples, taking a game-theoretic viewpoint. In our setting, samples are supplied by strategic agents, who wish to pull the estimate as close as possible to their own value. In this setting, the sample mean gives rise to manipulation opportunities, whereas the sample median does not. Our key question is whether the sample median is the best (in terms of mean squared error) truthful estimator of the population mean. We show that when the underlying distribution is symmetric, there are truthful estimators that dominate the median. Our main result is a characterization of worst-case optimal truthful estimators, which provably outperform the median, for possibly asymmetric distributions with bounded support. Why Regularized Auto-Encoders learn Sparse Representation Devansh Arpit SUNY Buffalo . Yingbo Zhou SUNY Buffalo . Hung Ngo SUNY Buffalo . Venu Govindaraju SUNY Buffalo Paper AbstractSparse distributed representation is the key to learning useful features in deep learning algorithms, because not only it is an efficient mode of data representation, but also 8212 more importantly 8212 it captures the generation process of most real world data. 
While a number of regularized auto-encoders (AE) enforce sparsity explicitly in their learned representation and others don8217t, there has been little formal analysis on what encourages sparsity in these models in general. Our objective is to formally study this general problem for regularized auto-encoders. We provide sufficient conditions on both regularization and activation functions that encourage sparsity. We show that multiple popular models (de-noising and contractive auto encoders, e. g.) and activations (rectified linear and sigmoid, e. g.) satisfy these conditions thus, our conditions help explain sparsity in their learned representation. Thus our theoretical and empirical analysis together shed light on the properties of regularizationactivation that are conductive to sparsity and unify a number of existing auto-encoder models and activation functions under the same analytical framework. k-variates: more pluses in the k-means Richard Nock Nicta 038 ANU . Raphael Canyasse Ecole Polytechnique and The Technion . Roksana Boreli Data61 . Frank Nielsen Ecole Polytechnique and Sony CS Labs Inc. Paper Abstractk-means seeding has become a de facto standard for hard clustering algorithms. In this paper, our first contribution is a two-way generalisation of this seeding, k-variates, that includes the sampling of general densities rather than just a discrete set of Dirac densities anchored at the point locations, textit a generalisation of the well known Arthur-Vassilvitskii (AV) approximation guarantee, in the form of a textit approximation bound of the textit optimum. This approximation exhibits a reduced dependency on the 8220noise8221 component with respect to the optimal potential 8212 actually approaching the statistical lower bound. We show that k-variates textit to efficient (biased seeding) clustering algorithms tailored to specific frameworks these include distributed, streaming and on-line clustering, with textit approximation results for these algorithms. Finally, we present a novel application of k-variates to differential privacy. For either the specific frameworks considered here, or for the differential privacy setting, there is little to no prior results on the direct application of k-means and its approximation bounds 8212 state of the art contenders appear to be significantly more complex and or display less favorable (approximation) properties. We stress that our algorithms can still be run in cases where there is textit closed form solution for the population minimizer. We demonstrate the applicability of our analysis via experimental evaluation on several domains and settings, displaying competitive performances vs state of the art. Multi-Player Bandits 8212 a Musical Chairs Approach Jonathan Rosenski Weizmann Institute of Science . Ohad Shamir Weizmann Institute of Science . Liran Szlak Weizmann Institute of Science Paper AbstractWe consider a variant of the stochastic multi-armed bandit problem, where multiple players simultaneously choose from the same set of arms and may collide, receiving no reward. This setting has been motivated by problems arising in cognitive radio networks, and is especially challenging under the realistic assumption that communication between players is limited. 
We provide a communication-free algorithm (Musical Chairs) which attains constant regret with high probability, as well as a sublinear-regret, communication-free algorithm (Dynamic Musical Chairs) for the more difficult setting of players dynamically entering and leaving throughout the game. Moreover, both algorithms do not require prior knowledge of the number of players. To the best of our knowledge, these are the first communication-free algorithms with these types of formal guarantees. The Information Sieve Greg Ver Steeg Information Sciences Institute . Aram Galstyan Information Sciences Institute Paper AbstractWe introduce a new framework for unsupervised learning of representations based on a novel hierarchical decomposition of information. Intuitively, data is passed through a series of progressively fine-grained sieves. Each layer of the sieve recovers a single latent factor that is maximally informative about multivariate dependence in the data. The data is transformed after each pass so that the remaining unexplained information trickles down to the next layer. Ultimately, we are left with a set of latent factors explaining all the dependence in the original data and remainder information consisting of independent noise. We present a practical implementation of this framework for discrete variables and apply it to a variety of fundamental tasks in unsupervised learning including independent component analysis, lossy and lossless compression, and predicting missing values in data. Deep Speech 2. End-to-End Speech Recognition in English and Mandarin Dario Amodei . Rishita Anubhai . Eric Battenberg . Carl Case . Jared Casper . Bryan Catanzaro . JingDong Chen . Mike Chrzanowski Baidu USA, Inc. . Adam Coates . Greg Diamos Baidu USA, Inc. . Erich Elsen Baidu USA, Inc. . Jesse Engel . Linxi Fan . Christopher Fougner . Awni Hannun Baidu USA, Inc. . Billy Jun . Tony Han . Patrick LeGresley . Xiangang Li Baidu . Libby Lin . Sharan Narang . Andrew Ng . Sherjil Ozair . Ryan Prenger . Sheng Qian Baidu . Jonathan Raiman . Sanjeev Satheesh Baidu SVAIL . David Seetapun . Shubho Sengupta . Chong Wang . Yi Wang . Zhiqian Wang . Bo Xiao . Yan Xie Baidu . Dani Yogatama . Jun Zhan . zhenyao Zhu Paper AbstractWe show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speechtwo vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale. An important question in feature selection is whether a selection strategy recovers the 8220true8221 set of features, given enough data. We study this question in the context of the popular Least Absolute Shrinkage and Selection Operator (Lasso) feature selection strategy. 
In particular, we consider the scenario when the model is misspecified so that the learned model is linear while the underlying real target is nonlinear. Surprisingly, we prove that under certain conditions, Lasso is still able to recover the correct features in this case. We also carry out numerical studies to empirically verify the theoretical results and explore the necessity of the conditions under which the proof holds. We propose minimum regret search (MRS), a novel acquisition function for Bayesian optimization. MRS bears similarities with information-theoretic approaches such as entropy search (ES). However, while ES aims in each query at maximizing the information gain with respect to the global maximum, MRS aims at minimizing the expected simple regret of its ultimate recommendation for the optimum. While empirically ES and MRS perform similar in most of the cases, MRS produces fewer outliers with high simple regret than ES. We provide empirical results both for a synthetic single-task optimization problem as well as for a simulated multi-task robotic control problem. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy Ran Gilad-Bachrach Microsoft Research . Nathan Dowlin Princeton . Kim Laine Microsoft Research . Kristin Lauter Microsoft Research . Michael Naehrig Microsoft Research . John Wernsing Microsoft Research Paper AbstractApplying machine learning to a problem which involves medical, financial, or other types of sensitive data, not only requires accurate predictions but also careful attention to maintaining data privacy and security. Legal and ethical requirements may prevent the use of cloud-based machine learning solutions for such tasks. In this work, we will present a method to convert learned neural networks to CryptoNets, neural networks that can be applied to encrypted data. This allows a data owner to send their data in an encrypted form to a cloud service that hosts the network. The encryption ensures that the data remains confidential since the cloud does not have access to the keys needed to decrypt it. Nevertheless, we will show that the cloud service is capable of applying the neural network to the encrypted data to make encrypted predictions, and also return them in encrypted form. These encrypted predictions can be sent back to the owner of the secret key who can decrypt them. Therefore, the cloud service does not gain any information about the raw data nor about the prediction it made. We demonstrate CryptoNets on the MNIST optical character recognition tasks. CryptoNets achieve 99 accuracy and can make around 59000 predictions per hour on a single PC. Therefore, they allow high throughput, accurate, and private predictions. Spectral methods for dimensionality reduction and clustering require solving an eigenproblem defined by a sparse affinity matrix. When this matrix is large, one seeks an approximate solution. The standard way to do this is the Nystrom method, which first solves a small eigenproblem considering only a subset of landmark points, and then applies an out-of-sample formula to extrapolate the solution to the entire dataset. We show that by constraining the original problem to satisfy the Nystrom formula, we obtain an approximation that is computationally simple and efficient, but achieves a lower approximation error using fewer landmarks and less runtime. We also study the role of normalization in the computational cost and quality of the resulting solution. 
As a widely used non-linear activation, the Rectified Linear Unit (ReLU) separates noise and signal in a feature map by learning a threshold or bias. However, we argue that the classification of noise and signal depends not only on the magnitude of the responses, but also on the context of how the feature responses would be used to detect more abstract patterns in higher layers. In order to output multiple response maps with magnitudes in different ranges for a particular visual pattern, existing networks employing ReLU and its variants have to learn a large number of redundant filters. In this paper, we propose a multi-bias non-linear activation (MBA) layer to explore the information hidden in the magnitudes of responses. It is placed after the convolution layer to decouple the responses to a convolution kernel into multiple maps by multi-thresholding magnitudes, thus generating more patterns in the feature space at a low computational cost. It provides great flexibility of selecting responses to different visual patterns in different magnitude ranges to form rich representations in higher layers. Such a simple yet effective scheme achieves state-of-the-art performance on several benchmarks.

We propose a novel multi-task learning method that can minimize the effect of negative transfer by allowing asymmetric transfer between the tasks based on task relatedness as well as the amount of individual task losses, which we refer to as Asymmetric Multi-task Learning (AMTL). To tackle this problem, we couple multiple tasks via a sparse, directed regularization graph that enforces each task parameter to be reconstructed as a sparse combination of other tasks, which are selected based on the task-wise loss. We present two different algorithms to solve this joint learning of the task predictors and the regularization graph. The first algorithm solves for the original learning objective using alternating optimization, and the second algorithm solves an approximation of it using a curriculum learning strategy that learns one task at a time. We perform experiments on multiple datasets for classification and regression, on which we obtain significant improvements in performance over the single-task learning and symmetric multi-task learning baselines.

This paper illustrates a novel approach to the estimation of the generalization error of decision tree classifiers. We set out the study of decision tree errors in the context of consistency analysis theory, which proved that the Bayes error can be achieved only if the number of data samples assigned to each leaf node goes to infinity. For the more challenging and practical case where the sample size is finite or small, a novel sampling error term is introduced in this paper to cope with the small-sample problem effectively and efficiently. Extensive experimental results show that the proposed error estimate is superior to the well-known K-fold cross-validation methods in terms of robustness and accuracy. Moreover, it is orders of magnitude more efficient than cross-validation methods.

We study the convergence properties of the recently proposed VR-PCA algorithm for fast computation of leading singular vectors. We prove several new results, including a formal analysis of a block version of the algorithm, and convergence from random initialization.
We also make a few observations of independent interest, such as how pre-initializing with just a single exact power iteration can significantly improve the analysis, and what the convexity and non-convexity properties of the underlying optimization problem are.

We consider the problem of principal component analysis (PCA) in a streaming stochastic setting, where our goal is to find a direction of approximate maximal variance, based on a stream of i.i.d. data points in R^d. A simple and computationally cheap algorithm for this is stochastic gradient descent (SGD), which incrementally updates its estimate based on each new data point. However, due to the non-convex nature of the problem, analyzing its performance has been a challenge. In particular, existing guarantees rely on a non-trivial eigengap assumption on the covariance matrix, which is intuitively unnecessary. In this paper, we provide (to the best of our knowledge) the first eigengap-free convergence guarantees for SGD in the context of PCA. This also partially resolves an open problem posed in prior work. Moreover, under an eigengap assumption, we show that the same techniques lead to new SGD convergence guarantees with better dependence on the eigengap.

Dealbreaker: A Nonlinear Latent Variable Model for Educational Data Andrew Lan Rice University . Tom Goldstein University of Maryland . Richard Baraniuk Rice University . Christoph Studer Cornell University Paper Abstract: Statistical models of student responses on assessment questions, such as those in homeworks and exams, enable educators and computer-based personalized learning systems to gain insights into students' knowledge using machine learning. Popular student-response models, including the Rasch model and item response theory models, represent the probability of a student answering a question correctly using an affine function of latent factors. While such models can accurately predict student responses, their ability to interpret the underlying knowledge structure (which is certainly nonlinear) is limited. In response, we develop a new, nonlinear latent variable model that we call the dealbreaker model, in which a student's success probability is determined by their weakest concept mastery. We develop efficient parameter inference algorithms for this model using novel methods for nonconvex optimization. We show that the dealbreaker model achieves comparable or better prediction performance than affine models on real-world educational datasets. We further demonstrate that the parameters learned by the dealbreaker model are interpretable: they provide key insights into which concepts are critical (i.e., the dealbreaker) to answering a question correctly. We conclude by reporting preliminary results for a movie-rating dataset, which illustrate the broader applicability of the dealbreaker model.

We derive a new discrepancy statistic for measuring differences between two probability distributions, based on combining Stein's identity and reproducing kernel Hilbert space theory. We apply our result to test how well a probabilistic model fits a set of observations, and derive a new class of powerful goodness-of-fit tests that are widely applicable for complex and high-dimensional distributions, even for those with computationally intractable normalization constants. Both theoretical and empirical properties of our methods are studied thoroughly.

Variable Elimination in the Fourier Domain Yexiang Xue Cornell University . Stefano Ermon .
Ronan Le Bras Cornell University . Carla Gomes . Bart Selman Paper Abstract: The ability to represent complex high-dimensional probability distributions in a compact form is one of the key insights in the field of graphical models. Factored representations are ubiquitous in machine learning and lead to major computational advantages. We explore a different type of compact representation based on discrete Fourier representations, complementing the classical approach based on conditional independencies. We show that a large class of probabilistic graphical models have a compact Fourier representation. This theoretical result opens up an entirely new way of approximating a probability distribution. We demonstrate the significance of this approach by applying it to the variable elimination algorithm. Compared with the traditional bucket representation and other approximate inference algorithms, we obtain significant improvements.

Low-rank matrix approximation has been widely adopted in machine learning applications with sparse data, such as recommender systems. However, the sparsity of the data, which is incomplete and noisy, introduces challenges to algorithm stability: small changes in the training data may significantly change the models. As a result, existing low-rank matrix approximation solutions yield low generalization performance, exhibiting high error variance on the training dataset, and minimizing the training error may not guarantee error reduction on the testing dataset. In this paper, we investigate the algorithm stability problem of low-rank matrix approximations. We present a new algorithm design framework, which (1) introduces new optimization objectives to guide stable matrix approximation algorithm design, and (2) solves the optimization problem to obtain stable low-rank approximation solutions with good generalization performance. Experimental results on real-world datasets demonstrate that the proposed work can achieve better prediction accuracy compared with both state-of-the-art low-rank matrix approximation methods and ensemble methods in recommendation tasks.

Given samples from two densities p and q, density ratio estimation (DRE) is the problem of estimating the ratio p/q. Two popular discriminative approaches to DRE are KL importance estimation (KLIEP) and least squares importance fitting (LSIF). In this paper, we show that KLIEP and LSIF both employ class-probability estimation (CPE) losses. Motivated by this, we formally relate DRE and CPE, and demonstrate the viability of using existing losses from one problem for the other. For the DRE problem, we show that essentially any CPE loss (e.g., logistic, exponential) can be used, as this equivalently minimises a Bregman divergence to the true density ratio. We show how different losses focus on accurately modelling different ranges of the density ratio, and use this to design new CPE losses for DRE. For the CPE problem, we argue that the LSIF loss is useful in the regime where one wishes to rank instances with maximal accuracy at the head of the ranking. In the course of our analysis, we establish a Bregman divergence identity that may be of independent interest.

We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD), but their theoretical analysis almost exclusively assumes convexity.
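For readers who have not seen it, the SVRG update that this analysis targets can be sketched in a few lines. This is the standard convex-setting loop, not anything specific to the nonconvex analysis, and the toy least-squares objective and all names below are illustrative.

import numpy as np

def svrg(grads, x0, eta=0.05, epochs=10, m=100, seed=0):
    # Each epoch: compute a full gradient at a snapshot, then take m cheap
    # steps with the variance-reduced estimator g_i(x) - g_i(snap) + full.
    rng = np.random.default_rng(seed)
    n, x = len(grads), np.array(x0, dtype=float)
    for _ in range(epochs):
        snap = x.copy()
        full = sum(g(snap) for g in grads) / n
        for _ in range(m):
            i = rng.integers(n)
            x -= eta * (grads[i](x) - grads[i](snap) + full)
    return x

# toy finite sum: f_i(x) = 0.5 * (a_i . x - b_i)^2
rng = np.random.default_rng(1)
A, b = rng.normal(size=(200, 5)), rng.normal(size=200)
grads = [lambda x, a=a_i, bi=b_i: a * (a @ x - bi) for a_i, b_i in zip(A, b)]
print(svrg(grads, np.zeros(5)))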
In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to minibatching in parallel settings.

Hierarchical Variational Models Rajesh Ranganath . Dustin Tran Columbia University . David Blei Columbia Paper Abstract: Black box variational inference allows researchers to easily prototype and evaluate an array of models. Recent advances allow such algorithms to scale to high dimensions. However, a central question remains: how to specify an expressive variational distribution that maintains efficient computation? To address this, we develop hierarchical variational models (HVMs). HVMs augment a variational approximation with a prior on its parameters, which allows it to capture complex structure for both discrete and continuous latent variables. The algorithm we develop is black box, can be used for any HVM, and has the same computational efficiency as the original approximation. We study HVMs on a variety of deep discrete latent variable models. HVMs generalize other expressive variational distributions and maintain higher fidelity to the posterior.

The field of mobile health (mHealth) has the potential to yield new insights into health and behavior through the analysis of continuously recorded data from wearable health and activity sensors. In this paper, we present a hierarchical span-based conditional random field model for the key problem of jointly detecting discrete events in such sensor data streams and segmenting these events into high-level activity sessions. Our model includes higher-order cardinality factors and inter-event duration factors to capture domain-specific structure in the label space. We show that our model supports exact MAP inference in quadratic time via dynamic programming, which we leverage to perform learning in the structured support vector machine framework. We apply the model to the problems of smoking and eating detection using four real data sets. Our results show statistically significant improvements in segmentation performance relative to a hierarchical pairwise CRF.

Binary embeddings with structured hashed projections Anna Choromanska Courant Institute, NYU . Krzysztof Choromanski Google Research NYC . Mariusz Bojarski NVIDIA . Tony Jebara Columbia . Sanjiv Kumar . Yann LeCun Paper Abstract: We consider the hashing mechanism for constructing binary embeddings, which involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix where not all entries are independent random variables; instead, a fixed budget of randomness is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, provide a reduction in randomness usage (i.e., the number of required random values), and very often lead to computational speed-ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical grounding for the use of general structured matrices in the nonlinear setting.
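One concrete instance of such a structured projection is a circulant matrix built from a single random vector, applied in O(n log n) time via the FFT and followed by the sign mapping. The sketch below is illustrative (the random sign flips and the dimensions are our own choices) and is not necessarily the exact construction the paper analyzes.

import numpy as np

def circulant_sign_embedding(x, m, seed=0):
    # A circulant matrix is defined by its first column c (O(n) storage);
    # multiplying by it is a circular convolution, computed here via FFT.
    rng = np.random.default_rng(seed)
    n = len(x)
    c = rng.normal(size=n)               # first column of the circulant matrix
    D = rng.choice([-1.0, 1.0], size=n)  # random sign flips decorrelate coordinates
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(D * x)).real
    return np.sign(y[:m])                # nonlinear (sign) map -> binary embedding

print(circulant_sign_embedding(np.random.randn(16), m=8))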
In particular, these guarantees generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of an approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers, and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show how learning via structured hashed projections affects the performance of neural network and nearest-neighbor classifiers.

A Variational Analysis of Stochastic Gradient Algorithms Stephan Mandt Columbia University . Matthew Hoffman Adobe Research . David Blei Columbia Paper Abstract: Stochastic Gradient Descent (SGD) is an important algorithm in machine learning. With constant learning rates, it is a stochastic process that, after an initial phase of convergence, generates samples from a stationary distribution. We show that SGD with constant rates can be effectively used as an approximate posterior inference algorithm for probabilistic modeling. Specifically, we show how to adjust the tuning parameters of SGD so as to match the resulting stationary distribution to the posterior. This analysis rests on interpreting SGD as a continuous-time stochastic process and then minimizing the Kullback-Leibler divergence between its stationary distribution and the target posterior. (This is in the spirit of variational inference.) In more detail, we model SGD as a multivariate Ornstein-Uhlenbeck process and then use properties of this process to derive the optimal parameters. This theoretical framework also connects SGD to modern scalable inference algorithms; we analyze the recently proposed stochastic gradient Fisher scoring under this perspective. We demonstrate that SGD with properly chosen constant rates gives a new way to optimize hyperparameters in probabilistic models.

This paper proposes a new mechanism for sampling training instances for stochastic gradient descent (SGD) methods by exploiting any side information associated with the instances (e.g., class labels) to improve convergence. Previous methods have relied on sampling either from a distribution defined over training instances or from a static distribution fixed before training. This results in two problems: (a) any distribution that is set a priori is independent of how the optimization progresses, and (b) maintaining a distribution over individual instances could be infeasible in large-scale scenarios. In this paper, we exploit the side information associated with the instances to tackle both problems. More specifically, we maintain a distribution over classes (instead of individual instances) that is adaptively estimated during the course of optimization to give the maximum reduction in the variance of the gradient. Intuitively, we sample more from those regions in space that have a larger gradient contribution. Our experiments on highly multiclass datasets show that our proposal converges significantly faster than existing techniques.

Tensor regression has been shown to be advantageous in learning tasks with multi-directional relatedness. Given massive multiway data, traditional methods are often too slow to operate on or suffer from memory bottlenecks. In this paper, we introduce subsampled tensor projected gradient to solve the problem.
Our algorithm is impressively simple and efficient. It is built upon the projected gradient method with fast tensor power iterations, leveraging randomized sketching for further acceleration. Theoretical analysis shows that our algorithm converges to the correct solution in a fixed number of iterations. The memory requirement grows linearly with the size of the problem. We demonstrate superior empirical performance on both multi-linear multi-task learning and spatio-temporal applications.

This paper presents a novel distributed variational inference framework that unifies many parallel sparse Gaussian process regression (SGPR) models for scalable hyperparameter learning with big data. To achieve this, our framework exploits a structure of the correlated noise process model that represents the observation noises as a finite realization of a high-order Gaussian Markov random process. By varying the Markov order and covariance function for the noise process model, different variational SGPR models result. This consequently allows the correlation structure of the noise process model to be characterized for which a particular variational SGPR model is optimal. We empirically evaluate the predictive performance and scalability of the distributed variational SGPR models unified by our framework on two real-world datasets.

Online Stochastic Linear Optimization under One-bit Feedback Lijun Zhang Nanjing University . Tianbao Yang University of Iowa . Rong Jin Alibaba Group . Yichi Xiao Nanjing University . Zhi-Hua Zhou Paper Abstract: In this paper, we study a special bandit setting of online stochastic linear optimization, where only one bit of information is revealed to the learner at each round. This problem has found many applications, including online advertisement and online recommendation. We assume the binary feedback is a random variable generated from the logit model, and aim to minimize the regret defined by the unknown linear function. Although the existing method for generalized linear bandits can be applied to our problem, its high computational cost makes it impractical for real-world applications. To address this challenge, we develop an efficient online learning algorithm by exploiting particular structures of the observation model. Specifically, we adopt online Newton steps to estimate the unknown parameter and derive a tight confidence region based on the exponential concavity of the logistic loss. Our analysis shows that the proposed algorithm achieves a regret bound of O(d√T), which matches the optimal result of stochastic linear bandits.

We present an adaptive online gradient descent algorithm to solve online convex optimization problems with long-term constraints, which are constraints that need to be satisfied when accumulated over a finite number of rounds T, but can be violated in intermediate rounds. For some user-defined trade-off parameter β in (0, 1), the proposed algorithm achieves cumulative regret bounds of O(T^{max(β, 1-β)}) and O(T^{1-β/2}), respectively, for the loss and the constraint violations. Our results hold for convex losses, can handle arbitrary convex constraints, and rely on a single computationally efficient algorithm. Our contributions improve over the best known cumulative regret bounds of Mahdavi et al. (2012), which are respectively O(T^{1/2}) and O(T^{3/4}) for general convex domains, and respectively O(T^{2/3}) and O(T^{2/3}) when the domain is further restricted to be a polyhedral set. We supplement the analysis with experiments validating the performance of our algorithm in practice.
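For flavor, here is a hedged NumPy sketch in the spirit of the primal-dual baseline of Mahdavi et al. (2012), not the paper's adaptive algorithm: a dual variable accumulates constraint violations, so the constraint g(x) <= 0 holds on average over the T rounds even though individual rounds may violate it. The linear losses and the l1-ball constraint are toy choices.

import numpy as np

def ogd_long_term(T=2000, dim=5, eta=0.05, mu=0.05, seed=0):
    # Losses f_t(x) = <c_t, x>; long-term constraint g(x) = ||x||_1 - 1 <= 0.
    rng = np.random.default_rng(seed)
    x, lam, violation = np.zeros(dim), 0.0, 0.0
    for t in range(T):
        c_t = rng.normal(size=dim)           # round-t linear loss gradient
        g_val = np.abs(x).sum() - 1.0        # current constraint value
        x -= eta * (c_t + lam * np.sign(x))  # primal step on the Lagrangian
        lam = max(0.0, lam + mu * g_val)     # dual step accumulates violations
        violation += max(0.0, g_val)
    return x, violation / T

x, avg_violation = ogd_long_term()
print(avg_violation)  # small: the constraint holds on average over rounds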
Motivated by an application of eliciting users' preferences, we investigate the problem of learning hemimetrics, i.e., pairwise distances among a set of n items that satisfy triangle inequalities and non-negativity constraints. In our application, the (asymmetric) distances quantify private costs a user incurs when substituting one item for another. We aim to learn these distances (costs) by asking the users whether they are willing to switch from one item to another for a given incentive offer. Without exploiting structural constraints of the hemimetric polytope, learning the distances between each pair of items requires Theta(n^2) queries. We propose an active learning algorithm that substantially reduces this sample complexity by exploiting the structural constraints on the version space of hemimetrics. Our proposed algorithm achieves provably optimal sample complexity for various instances of the task. For example, when the items are embedded into K tight clusters, the sample complexity of our algorithm reduces to O(nK). Extensive experiments on a restaurant recommendation data set support the conclusions of our theoretical analysis.

We present an approach for learning simple algorithms such as copying, multi-digit addition and single-digit multiplication directly from examples. Our framework consists of a set of interfaces, accessed by a controller. Typical interfaces are 1-D tapes or 2-D grids that hold the input and output data. For the controller, we explore a range of neural network-based models which vary in their ability to abstract the underlying algorithm from training instances and generalize to test examples with many thousands of digits. The controller is trained using Q-learning with several enhancements, and we show that the bottleneck is in the capabilities of the controller rather than in the search incurred by Q-learning.

Learning Physical Intuition of Block Towers by Example Adam Lerer Facebook AI Research . Sam Gross Facebook AI Research . Rob Fergus Facebook AI Research Paper Abstract: Wooden blocks are a common toy for infants, allowing them to develop motor skills and gain intuition about the physical behavior of the world. In this paper, we explore the ability of deep feed-forward models to learn such intuitive physics. Using a 3D game engine, we create small towers of wooden blocks whose stability is randomized, and render them collapsing (or remaining upright). This data allows us to train large convolutional network models which can accurately predict the outcome, as well as estimate the trajectories of the blocks. The models are also able to generalize in two important ways: (i) to new physical scenarios, e.g., towers with an additional block, and (ii) to images of real wooden blocks, where they obtain a performance comparable to human subjects.

Structure Learning of Partitioned Markov Networks Song Liu The Inst. of Stats. Math. . Taiji Suzuki . Masashi Sugiyama University of Tokyo . Kenji Fukumizu The Institute of Statistical Mathematics Paper Abstract: We learn the structure of a Markov Network between two groups of random variables from joint observations. Since modelling and learning the full MN structure may be hard, learning the links between the two groups directly may be a preferable option. We introduce a novel concept called the partitioned ratio, whose factorization directly associates with the Markovian properties of random variables across the two groups.
A simple one-shot convex optimization procedure is proposed for learning the factorization of the partitioned ratio, and it is theoretically guaranteed to recover the correct inter-group structure under mild conditions. The performance of the proposed method is experimentally compared with state-of-the-art MN structure learning methods using ROC curves. Real applications on analyzing bipartisanship in the US Congress and pairwise DNA/time-series alignments are also reported.

This work focuses on the dynamic regret of online convex optimization, which compares the performance of online learning to a clairvoyant who knows the sequence of loss functions in advance and hence selects the minimizer of the loss function at each step. By assuming that the clairvoyant moves slowly (i.e., the minimizers change slowly), we present several improved variation-based upper bounds of the dynamic regret under the true and noisy gradient feedback, which are optimal in light of the presented lower bounds. The key to our analysis is to explore a regularity metric that measures the temporal changes in the clairvoyant's minimizers, to which we refer as path variation. Firstly, we present a general lower bound in terms of the path variation, and then show that under full information or gradient feedback we are able to achieve an optimal dynamic regret. Secondly, we present a lower bound with noisy gradient feedback and then show that we can achieve optimal dynamic regrets under a stochastic gradient feedback and two-point bandit feedback. Moreover, for a sequence of smooth loss functions that admit a small variation in the gradients, our dynamic regret under the two-point bandit feedback matches that achieved with full information.

Beyond CCA: Moment Matching for Multi-View Models Anastasia Podosinnikova INRIA - ENS . Francis Bach Inria . Simon Lacoste-Julien INRIA Paper Abstract: We introduce three novel semi-parametric extensions of probabilistic canonical correlation analysis with identifiability guarantees. We consider moment matching techniques for estimation in these models. For that, by drawing explicit links between the new models and a discrete version of independent component analysis (DICA), we first extend the DICA cumulant tensors to the new discrete version of CCA. By further using a close connection with independent component analysis, we introduce generalized covariance matrices, which can replace the cumulant tensors in the moment matching framework and, therefore, significantly improve sample complexity and simplify derivations and algorithms. As the tensor power method and orthogonal joint diagonalization are not applicable in the new setting, we use non-orthogonal joint diagonalization techniques for matching the cumulants. We demonstrate the performance of the proposed models and estimation techniques on experiments with both synthetic and real datasets.

We present two computationally inexpensive techniques for estimating the numerical rank of a matrix, combining powerful tools from computational linear algebra. These techniques exploit three key ingredients. The first is to approximate the projector onto the non-null invariant subspace of the matrix by using a polynomial filter. Two types of filters are discussed, one based on Hermite interpolation and the other based on Chebyshev expansions. The second ingredient employs stochastic trace estimators to compute the trace of this eigen-projector, which equals the desired rank of the matrix.
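The stochastic trace step just mentioned is easy to illustrate: if the polynomial filter approximates the eigen-projector P, then Hutchinson-style random probes estimate tr(P), which is the rank. The sketch below substitutes an exact projector for the polynomial filter; the probe count and all names are illustrative.

import numpy as np

def hutchinson_rank(apply_filter, n, n_probes=30, seed=0):
    # Estimate tr(P) by averaging v^T P v over random sign probes v.
    rng = np.random.default_rng(seed)
    probes = rng.choice([-1.0, 1.0], size=(n_probes, n))
    return float(np.mean([v @ apply_filter(v) for v in probes]))

# toy check: exact projector onto the top-3 eigenspace of a random PSD matrix
A = np.random.randn(50, 40)
U = np.linalg.svd(A @ A.T)[0][:, :3]
print(hutchinson_rank(lambda v: U @ (U.T @ v), n=50))  # close to 3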
In order to obtain a good filter, it is necessary to detect a gap between the eigenvalues that correspond to noise and the relevant eigenvalues that correspond to the non-null invariant subspace. The third ingredient of the proposed approaches exploits the idea of spectral density, popular in physics, and the Lanczos spectroscopic method to locate this gap.

Unsupervised Deep Embedding for Clustering Analysis Junyuan Xie University of Washington . Ross Girshick Facebook . Ali Farhadi University of Washington Paper Abstract: Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

Dimensionality reduction is a popular approach for dealing with high-dimensional data that leads to substantial computational savings. Random projections are a simple and effective method for universal dimensionality reduction with rigorous theoretical guarantees. In this paper, we theoretically study the problem of differentially private empirical risk minimization in the projected subspace (compressed domain). Empirical risk minimization (ERM) is a fundamental technique in statistical machine learning that forms the basis for various learning algorithms. Starting from the results of Chaudhuri et al. (NIPS 2009, JMLR 2011), there is a long line of work in designing differentially private algorithms for empirical risk minimization problems that operate in the original data space. We ask: is it possible to design differentially private algorithms with small excess risk given access to only projected data? In this paper, we answer this question in the affirmative: for the class of generalized linear functions, we obtain excess risk bounds that depend on the sample size n and the Gaussian width w(Θ) of the parameter space Θ that we optimize over, under both eps-differential privacy and (eps, delta)-differential privacy, given only the projected data and the projection matrix. Our strategy is based on adding noise for privacy in the projected subspace and then lifting the solution to the original space by using high-dimensional estimation techniques. A simple consequence of these results is that, for a large class of ERM problems, in the traditional setting (i.e., with access to the original data), under eps-differential privacy, we improve the worst-case risk bounds of Bassily et al. (FOCS 2014).

We consider the maximum likelihood parameter estimation problem for a generalized Thurstone choice model, where choices are from comparison sets of two or more items. We provide tight characterizations of the mean square error, as well as necessary and sufficient conditions for correct classification when each item belongs to one of two classes. These results provide insights into how the estimation accuracy depends on the choice of a generalized Thurstone choice model and the structure of comparison sets. We find that for a priori unbiased structures of comparisons, e.g.,
when comparison sets are drawn independently and uniformly at random, the number of observations needed to achieve a prescribed estimation accuracy depends on the choice of the generalized Thurstone choice model. For a broad set of generalized Thurstone choice models, which includes all popular instances used in practice, the estimation error is shown to be largely insensitive to the cardinality of comparison sets. On the other hand, we find that there exist generalized Thurstone choice models for which the estimation error decreases much faster with the cardinality of comparison sets.

Large-Margin Softmax Loss for Convolutional Neural Networks Weiyang Liu Peking University . Yandong Wen South China University of Technology . Zhiding Yu Carnegie Mellon University . Meng Yang Shenzhen University Paper Abstract: Cross-entropy loss together with softmax is arguably one of the most commonly used supervision components in convolutional neural networks (CNNs). Despite its simplicity, popularity and excellent performance, the component does not explicitly encourage discriminative learning of features. In this paper, we propose a generalized large-margin softmax (L-Softmax) loss which explicitly encourages intra-class compactness and inter-class separability between learned features. Moreover, L-Softmax not only can adjust the desired margin but also can avoid overfitting. We also show that the L-Softmax loss can be optimized by typical stochastic gradient descent. Extensive experiments on four benchmark datasets demonstrate that the deeply learned features with L-Softmax loss become more discriminative, hence significantly boosting performance on a variety of visual classification and verification tasks.

A Random Matrix Approach to Echo-State Neural Networks Romain Couillet CentraleSupelec . Gilles Wainrib ENS Ulm, Paris, France . Hafiz Tiomoko Ali CentraleSupelec, Gif-sur-Yvette, France . Harry Sevi ENS Lyon, Lyon, Paris Paper Abstract: Recurrent neural networks, especially in their linear version, have provided many qualitative insights on their performance under different configurations. This article provides, through a novel random matrix framework, the quantitative counterpart of these performance results, specifically in the case of echo-state networks. Beyond mere insights, our approach conveys a deeper understanding of the core mechanisms at play for both training and testing.

One-hot CNN (convolutional neural network) has been shown to be effective for text categorization (Johnson & Zhang, 2015). We view it as a special case of a general framework which jointly trains a linear model with a non-linear feature generator consisting of "text region embedding + pooling". Under this framework, we explore a more sophisticated region embedding method using Long Short-Term Memory (LSTM). LSTM can embed text regions of variable (and possibly large) sizes, whereas the region size needs to be fixed in a CNN. We seek effective and efficient use of LSTM for this purpose in the supervised and semi-supervised settings. The best results were obtained by combining region embeddings in the form of LSTM and convolution layers trained on unlabeled data. The results indicate that on this task, embeddings of text regions, which can convey complex concepts, are more useful than embeddings of single words in isolation. We report performances exceeding the previous best results on four benchmark datasets.
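A toy rendering of the "region embedding + pooling" view may help here (this is the one-hot CNN case, not the paper's LSTM variant): each size-k window of concatenated one-hot word vectors is mapped by one shared linear map, and the region embeddings are max-pooled into a fixed-length document vector. The vocabulary size, dimensions and names below are all illustrative.

import numpy as np

def region_embed_pool(doc_ids, vocab, k=3, dim=16, seed=0):
    # One shared linear map W embeds every size-k region (the convolution);
    # max pooling collapses the variable number of regions to one vector.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(k * vocab, dim))
    regions = []
    for i in range(len(doc_ids) - k + 1):
        onehot = np.zeros(k * vocab)
        for j, w in enumerate(doc_ids[i:i + k]):
            onehot[j * vocab + w] = 1.0              # concatenated one-hot region
        regions.append(np.maximum(W.T @ onehot, 0.0))  # ReLU region embedding
    return np.max(regions, axis=0)                   # max pooling over regions

print(region_embed_pool([4, 1, 7, 2, 9, 3], vocab=10).shape)  # (16,)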
Crowdsourcing systems are popular for solving large-scale labelling tasks with low-paid (or even non-paid) workers. We study the problem of recovering the true labels from noisy crowdsourced labels under the popular Dawid-Skene model. To address this inference problem, several algorithms have recently been proposed, but the best known guarantee is still significantly larger than the fundamental limit. We close this gap under a simple but canonical scenario where each worker is assigned at most two tasks. In particular, we introduce a tighter lower bound on the fundamental limit and prove that Belief Propagation (BP) exactly matches this lower bound. The guaranteed optimality of BP is the strongest in the sense that it is information-theoretically impossible for any other algorithm to correctly label a larger fraction of the tasks. In the general setting, when more than two tasks are assigned to each worker, we establish a dominance result for BP: it outperforms other existing algorithms with known provable guarantees. Experimental results suggest that BP is close to optimal for all regimes considered, while existing state-of-the-art algorithms exhibit suboptimal performance.

Learning control has become an appealing alternative to the derivation of control laws based on classic control theory. However, a major shortcoming of learning control is the lack of performance guarantees, which prevents its application in many real-world scenarios. As a step in this direction, we provide a stability analysis tool for controllers acting on dynamics represented by Gaussian processes (GPs). We consider arbitrary Markovian control policies and system dynamics given as (i) the mean of a GP, and (ii) the full GP distribution. For the first case, our tool finds a state space region where the closed-loop system is provably stable. In the second case, it is well known that infinite-horizon stability guarantees cannot exist. Instead, our tool analyzes finite-time stability. Empirical evaluations on simulated benchmark problems support our theoretical results.

Learning a classifier from private data distributed across multiple parties is an important problem that has many potential applications. How can we build an accurate and differentially private global classifier by combining locally trained classifiers from different parties, without access to any party's private data? We propose to transfer the knowledge of the local classifier ensemble by first creating labeled data from auxiliary unlabeled data, and then training a global differentially private classifier. We show that majority voting is too sensitive, and therefore propose a new risk weighted by class probabilities estimated from the ensemble. Relative to a non-private solution, our private solution has a generalization error bounded by O(ε^{-2} M^{-2}). This allows strong privacy without performance loss when the number of participating parties M is large, such as in crowdsensing applications. We demonstrate the performance of our framework with realistic tasks of activity recognition, network intrusion detection, and malicious URL detection.

Network Morphism Tao Wei University at Buffalo . Changhu Wang Microsoft Research . Yong Rui Microsoft Research . Chang Wen Chen Paper Abstract: We present a systematic study on how to morph a well-trained neural network into a new one so that its network function can be completely preserved. We define this as network morphism in this research.
After morphing a parent network, the child network is expected to inherit the knowledge from its parent network and also to have the potential to continue growing into a more powerful one with much shortened training time. The first requirement for this network morphism is its ability to handle diverse morphing types of networks, including changes of depth, width, kernel size, and even subnet. To meet this requirement, we first introduce the network morphism equations, and then develop novel morphing algorithms for all these morphing types for both classic and convolutional neural networks. The second requirement is its ability to deal with non-linearity in a network. We propose a family of parametric activation functions to facilitate the morphing of any continuous non-linear activation neurons. Experimental results on benchmark datasets and typical neural networks demonstrate the effectiveness of the proposed network morphism scheme.

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large models, and most approximations either require an expensive iterative procedure or make crude approximations to the curvature. We present Kronecker Factors for Convolution (KFC), a tractable approximation to the Fisher matrix for convolutional networks based on a structured probabilistic model for the distribution over backpropagated derivatives. Similarly to the recently proposed Kronecker-Factored Approximate Curvature (K-FAC), each block of the approximate Fisher matrix decomposes as the Kronecker product of small matrices, allowing for efficient inversion. KFC captures important curvature information while still yielding updates comparable in cost to stochastic gradient descent (SGD). We show that the updates are invariant to commonly used reparameterizations, such as centering of the activations. In our experiments, approximate natural gradient descent with KFC was able to train convolutional networks several times faster than carefully tuned SGD. Furthermore, it was able to train the networks in 10-20 times fewer iterations than SGD, suggesting its potential applicability in a distributed setting.

Budget-constrained optimal design of experiments is a classical problem in statistics. Although the optimal design literature is very mature, few efficient strategies are available when these design problems appear in the context of the sparse linear models commonly encountered in high-dimensional machine learning and statistics. In this work, we study experimental design for the setting where the underlying regression model is characterized by an l1-regularized linear function. We propose two novel strategies: the first is motivated geometrically, whereas the second is algebraic in nature. We obtain tractable algorithms for this problem, and our results also hold for a more general class of sparse linear models. We perform an extensive set of experiments, on benchmarks and a large multi-site neuroscience study, showing that the proposed models are effective in practice. The latter experiment suggests that these ideas may play a small role in informing enrollment strategies for similar scientific studies in the short-to-medium-term future.

Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs Anton Osokin . Jean-Baptiste Alayrac ENS . Isabella Lukasewitz INRIA .
Puneet Dokania INRIA and Ecole Centrale Paris . Simon Lacoste-Julien INRIA Paper Abstract: In this paper, we propose several improvements on the block-coordinate Frank-Wolfe (BCFW) algorithm from Lacoste-Julien et al. (2013), recently used to optimize the structured support vector machine (SSVM) objective in the context of structured prediction, though it has wider applications. The key intuition behind our improvements is that the estimates of block gaps maintained by BCFW reveal the block suboptimality, which can be used as an adaptive criterion. First, we sample objects at each iteration of BCFW in an adaptive non-uniform way via gap-based sampling. Second, we incorporate pairwise and away-step variants of Frank-Wolfe into the block-coordinate setting. Third, we cache oracle calls with a cache-hit criterion based on the block gaps. Fourth, we provide the first method to compute an approximate regularization path for SSVM. Finally, we provide an exhaustive empirical evaluation of all our methods on four structured prediction datasets.

Exact Exponent in Optimal Rates for Crowdsourcing Chao Gao Yale University . Yu Lu Yale University . Dengyong Zhou Microsoft Research Paper Abstract: Crowdsourcing has become a popular tool for labeling large datasets. This paper studies the optimal error rate for aggregating crowdsourced labels provided by a collection of amateur workers. Under the Dawid-Skene probabilistic model, we establish matching upper and lower bounds with an exact exponent mI(π), where m is the number of workers and I(π) is the average Chernoff information that characterizes the workers' collective ability. Such an exact characterization of the error exponent allows us to state a precise sample size requirement m ≥ (1/I(π)) log(1/ε) in order to achieve a misclassification error of at most ε. In addition, our results imply the optimality of various forms of EM algorithms given accurate initializers of the model parameters.

Unsupervised learning and supervised learning are key research topics in deep learning. However, as high-capacity supervised neural networks trained with a large amount of labels have achieved remarkable success in many computer vision tasks, the availability of large-scale labeled images has reduced the significance of unsupervised learning. Inspired by the recent trend toward revisiting the importance of unsupervised learning, we investigate joint supervised and unsupervised learning in a large-scale setting by augmenting existing neural networks with decoding pathways for reconstruction. First, we demonstrate that the intermediate activations of pretrained large-scale classification networks preserve almost all the information of input images except a portion of local spatial details. Then, by end-to-end training of the entire augmented architecture with the reconstructive objective, we show improvement of the network performance for supervised tasks. We evaluate several variants of autoencoders, including the recently proposed "what-where" autoencoder that uses the encoder pooling switches, to study the importance of the architecture design. Taking the 16-layer VGGNet trained under the ImageNet ILSVRC 2012 protocol as a strong baseline for image classification, our methods improve the validation-set accuracy by a noticeable margin.

Low-rank representation (LRR) has been a significant method for segmenting data that are generated from a union of subspaces.
It is also known that solving LRR is challenging in terms of time complexity and memory footprint, in that the size of the nuclear norm regularized matrix is n-by-n (where n is the number of samples). In this paper, we thereby develop a novel online implementation of LRR that reduces the memory cost from O(n^2) to O(pd), with p being the ambient dimension and d being some estimated rank (d < p ≪ n).

Quantizing deep convolutional networks (DCNs) to fixed point achieves a more than 20% reduction in the model size without any loss in accuracy on the CIFAR-10 benchmark. We also demonstrate that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model. In doing so, we report a new state-of-the-art fixed-point performance of 6.78% error rate on the CIFAR-10 benchmark.

Provable Algorithms for Inference in Topic Models Sanjeev Arora Princeton University . Rong Ge . Frederic Koehler Princeton University . Tengyu Ma Princeton University . Ankur Moitra Paper Abstract: Recently, there has been considerable progress on designing algorithms with provable guarantees, typically using linear algebraic methods, for parameter learning in latent variable models. Designing provable algorithms for inference has proved more difficult. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds demonstrating that for shorter documents it can be information-theoretically impossible to find the hidden topics. Finally, we give empirical results showing that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling.

This paper develops an approach for efficiently solving general convex optimization problems specified as disciplined convex programs (DCP), a common general-purpose modeling framework. Specifically, we develop an algorithm based upon fast epigraph projections, i.e., projections onto the epigraph of a convex function, an approach closely linked to proximal operator methods. We show that by using these operators, we can solve any disciplined convex program without transforming the problem to a standard cone form, as is done by current DCP libraries. We then develop a large library of efficient epigraph projection operators, mirroring and extending work on fast proximal algorithms, for many common convex functions. Finally, we evaluate the performance of the algorithm, and show that it often achieves order-of-magnitude speedups over existing general-purpose optimization solvers.

We study the fixed-design segmented regression problem: given noisy samples from a piecewise linear function f, we want to recover f up to a desired accuracy in mean-squared error. Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new sample-efficient, near-linear-time algorithms for the problem that, while not being minimax optimal, achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach.
Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of 2 to 4, while achieving speedups of three orders of magnitude.

Energetic Natural Gradient Descent Philip Thomas CMU . Bruno Castro da Silva . Christoph Dann Carnegie Mellon University . Emma Brunskill Paper Abstract: We propose a new class of algorithms for minimizing or maximizing functions of parametric probabilistic models. These new algorithms are natural gradient algorithms that leverage more information than prior methods by using a new metric tensor in place of the commonly used Fisher information matrix. This new metric tensor is derived by computing directions of steepest ascent where the distance between distributions is measured using an approximation of energy distance (as opposed to Kullback-Leibler divergence, which produces the Fisher information matrix), and so we refer to our new ascent direction as the energetic natural gradient.

Partition Functions from Rao-Blackwellized Tempered Sampling David Carlson Columbia University . Patrick Stinson Columbia University . Ari Pakman Columbia University . Liam Paninski Paper Abstract: Partition functions of probability distributions are important quantities for model evaluation and comparisons. We present a new method to compute partition functions of complex and multimodal distributions. Such distributions are often sampled using simulated tempering, which augments the target space with an auxiliary inverse temperature variable. Our method exploits the multinomial probability law of the inverse temperatures, and provides estimates of the partition function in terms of a simple quotient of Rao-Blackwellized marginal inverse temperature probability estimates, which are updated while sampling. We show that the method has interesting connections with several alternative popular methods, and offers some significant advantages. In particular, we empirically find that the new method provides more accurate estimates than Annealed Importance Sampling when calculating partition functions of large Restricted Boltzmann Machines (RBMs); moreover, the method is sufficiently accurate to track training and validation log-likelihoods during learning of RBMs, at minimal computational cost.

In this paper we address the identifiability and efficient learning problems of finite mixtures of Plackett-Luce models for rank data. We prove that for any k ≥ 2, the mixture of k Plackett-Luce models for no more than 2k-1 alternatives is non-identifiable, and that this bound is tight for k = 2. For generic identifiability, we prove that the mixture of k Plackett-Luce models over m alternatives is generically identifiable if k ≤ ⌊(m-2)/2⌋!. We also propose an efficient generalized method of moments (GMM) algorithm to learn the mixture of two Plackett-Luce models and show that the algorithm is consistent. Our experiments show that our GMM algorithm is significantly faster than the EMM algorithm by Gormley & Murphy (2008), while achieving competitive statistical efficiency.

The combinatorial explosion that plagues planning and reinforcement learning (RL) algorithms can be moderated using state abstraction. Prohibitively large task representations can be condensed such that essential information is preserved, and consequently, solutions are tractably computable. However, exact abstractions, which treat only fully identical situations as equivalent, fail to present opportunities for abstraction in environments where no two situations are exactly alike.
In this work, we investigate approximate state abstractions, which treat nearly identical situations as equivalent. We present theoretical guarantees on the quality of behaviors derived from four types of approximate abstractions. Additionally, we empirically demonstrate that approximate abstractions lead to a reduction in task complexity and bounded loss of optimality of behavior in a variety of environments.

Power of Ordered Hypothesis Testing Lihua Lei . William Fithian UC Berkeley, Department of Statistics Paper Abstract: Ordered testing procedures are multiple testing procedures that exploit a pre-specified ordering of the null hypotheses, from most to least promising. We analyze and compare the power of several recent proposals using the asymptotic framework of Li & Barber (2015). While accumulation tests including ForwardStop can be quite powerful when the ordering is very informative, they are asymptotically powerless when the ordering is weaker. By contrast, Selective SeqStep, proposed by Barber & Candes (2015), is much less sensitive to the quality of the ordering. We compare the power of these procedures in different regimes, concluding that Selective SeqStep dominates accumulation tests if either the ordering is weak or non-null hypotheses are sparse or weak. Motivated by our asymptotic analysis, we derive an improved version of Selective SeqStep which we call Adaptive SeqStep, analogous to Storey's improvement on the Benjamini-Hochberg procedure. We compare these methods using the GEO-Query data set analyzed by Li & Barber (2015) and find that Adaptive SeqStep has favorable performance for both good and bad prior orderings.

PHOG: Probabilistic Model for Code Pavol Bielik ETH Zurich . Veselin Raychev ETH Zurich . Martin Vechev ETH Zurich Paper Abstract: We introduce a new generative model for code called probabilistic higher order grammar (PHOG). PHOG generalizes probabilistic context free grammars (PCFGs) by allowing conditioning of a production rule beyond the parent non-terminal, thus capturing rich contexts relevant to programs. Even though PHOG is more powerful than a PCFG, it can be learned from data just as efficiently. We trained a PHOG model on a large JavaScript code corpus and show that it is more precise than existing models, while remaining similarly fast. As a result, PHOG can immediately benefit existing programming tools based on probabilistic models of code.

We consider the problem of online prediction in changing environments. In this framework the performance of a predictor is evaluated as the loss relative to an arbitrarily changing predictor, whose individual components come from a base class of predictors. Typical results in the literature consider different base classes (experts, linear predictors on the simplex, etc.) separately. Introducing an arbitrary mapping inside the mirror descent algorithm, we provide a framework that unifies and extends existing results. As an example, we prove new shifting regret bounds for matrix prediction problems.

Hyperparameter selection generally relies on running multiple full training trials, with selection based on validation set performance. We propose a gradient-based approach for locally adjusting hyperparameters during training of the model. Hyperparameters are adjusted so as to make the model parameter gradients, and hence updates, more advantageous for the validation cost.
We explore the approach for tuning regularization hyperparameters and find that, in experiments on MNIST, SVHN and CIFAR-10, the resulting regularization levels are within the optimal regions. The additional computational cost depends on how frequently the hyperparameters are trained, but the tested scheme adds only 30% computational overhead regardless of the model size. Since the method is significantly less computationally demanding than similar gradient-based approaches to hyperparameter optimization, and consistently finds good hyperparameter values, it can be a useful tool for training neural network models.

Many of the recent trajectory optimization algorithms alternate between local approximation of the dynamics and conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local quadratic time-dependent Q-function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, demonstrating improved performance in comparison to related trajectory optimization algorithms that linearize the dynamics.

Due to its numerous applications, rank aggregation has become a problem of major interest across many fields of the computer science literature. In the vast majority of situations, Kemeny consensus(es) are considered as the ideal solutions. It is however well known that their computation is NP-hard. Many contributions have thus established various results to apprehend this complexity. In this paper we introduce a practical method to predict, for a ranking and a dataset, how close the Kemeny consensus(es) are to this ranking. A major strength of this method is its generality: it does not require any assumption on the dataset nor on the ranking. Furthermore, it relies on a new geometric interpretation of Kemeny aggregation that, we believe, could lead to many other results.

Horizontally Scalable Submodular Maximization Mario Lucic ETH Zurich . Olivier Bachem ETH Zurich . Morteza Zadimoghaddam Google Research . Andreas Krause Paper Abstract: A variety of large-scale machine learning problems can be cast as instances of constrained submodular maximization. Existing approaches for distributed submodular maximization have a critical drawback: the capacity, i.e., the number of instances that can fit in memory, must grow with the data set size. In practice, while one can provision many machines, the capacity of each machine is limited by physical constraints. We propose a truly scalable approach for distributed submodular maximization under fixed capacity. The proposed framework applies to a broad class of algorithms and constraints and provides theoretical guarantees on the approximation factor for any available capacity. We empirically evaluate the proposed algorithm on a variety of data sets and demonstrate that it achieves performance competitive with the centralized greedy solution.

Group Equivariant Convolutional Networks Taco Cohen University of Amsterdam . Max Welling University of Amsterdam CIFAR Paper Abstract: We introduce Group equivariant Convolutional Neural Networks (G-CNNs), a natural generalization of convolutional neural networks that reduces sample complexity by exploiting symmetries.
The partition function is fundamental for probabilistic graphical models: it is required for inference, parameter estimation, and model selection. Evaluating this function corresponds to discrete integration, namely a weighted sum over an exponentially large set. This task quickly becomes intractable as the dimensionality of the problem increases. We propose an approximation scheme that, for any discrete graphical model whose parameter vector has bounded norm, estimates the partition function with arbitrarily small error. Our algorithm relies on a near minimax optimal polynomial approximation to the potential function and a Clenshaw-Curtis style quadrature. Furthermore, we show that this algorithm can be randomized to split the computation into a high-complexity part and a low-complexity part, where the latter may be carried out on small computational devices. Experiments confirm that the new randomized algorithm is highly accurate if the parameter norm is small, and is otherwise comparable to methods with unbounded error.

Correcting Forecasts with Multifactor Neural Attention
Matthew Riemer IBM . Aditya Vempaty IBM . Flavio Calmon IBM . Fenno Heath IBM . Richard Hull IBM . Elham Khabiri IBM
Paper Abstract: Automatic forecasting of time series data is a challenging problem in many industries. Current forecast models adopted by businesses do not provide adequate means for including data representing external factors that may have a significant impact on the time series, such as weather, national events, local events, social media trends, promotions, etc. This paper introduces a novel neural network attention mechanism that naturally incorporates data from multiple external sources without the feature engineering needed to get other techniques to work. We demonstrate empirically that the proposed model achieves superior performance for predicting the demand of 20 commodities across 107 stores of one of America's largest retailers when compared to other baseline models, including neural networks, linear models, certain kernel methods, Bayesian regression, and decision trees. Our method ultimately accounts for a 23.9% relative improvement as a result of the incorporation of external data sources, and provides an unprecedented level of descriptive ability for a neural network forecasting model.

Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. We consider the task of answering counterfactual questions such as, "Would this patient have lower blood sugar had she received a different medication?" We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art.
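The sketch below shows one way the "representation learning plus domain adaptation" recipe for counterfactual inference can be written down: a factual prediction loss plus a penalty on the discrepancy between treated and control units in representation space. The ReLU representation, the linear mean-discrepancy penalty, and the weight alpha are illustrative assumptions, not the paper's exact architecture.

import numpy as np

def linear_mmd(phi_t, phi_c):
    # Squared distance between group means in representation space --
    # a simple stand-in for the distributional discrepancy being penalized.
    return np.sum((phi_t.mean(axis=0) - phi_c.mean(axis=0)) ** 2)

def counterfactual_objective(A, w, X, t, y, alpha=1.0):
    phi = np.maximum(X @ A, 0.0)        # shared representation of covariates
    factual_loss = np.mean((phi @ w - y) ** 2)
    balance = linear_mmd(phi[t == 1], phi[t == 0])
    return factual_loss + alpha * balance

# Tiny synthetic example: covariates X, binary treatment t, outcome y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
t = (rng.random(200) < 0.4).astype(int)
y = X @ rng.normal(size=10) + 1.5 * t + 0.1 * rng.normal(size=200)
A, w = rng.normal(size=(10, 8)), rng.normal(size=8)
print(counterfactual_objective(A, w, X, t, y))

Minimizing this objective over (A, w) pushes the model to predict observed outcomes well while making treated and control groups look alike in the learned representation, which is what licenses using the same predictor for the unobserved counterfactual arm.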
Gaussian Processes (GPs) provide a general and analytically tractable way of modeling complex time-varying, nonparametric functions. The Automatic Bayesian Covariance Discovery (ABCD) system constructs natural-language descriptions of time-series data by treating unknown time-series data nonparametrically, using a GP with a composite covariance kernel function. Unfortunately, learning a composite covariance kernel from a single time-series data set often results in a less informative kernel that may not give qualitative, distinctive descriptions of the data. We address this challenge by proposing two relational kernel learning methods which can model multiple time-series data sets by finding common, shared causes of changes. We show that the relational kernel learning methods find more accurate models for regression problems on several real-world data sets: US stock data, US house price index data and currency exchange rate data.

We introduce a new approach for amortizing inference in directed graphical models by learning heuristic approximations to stochastic inverses, designed specifically for use as proposal distributions in sequential Monte Carlo methods. We describe a procedure for constructing and learning a structured neural network which represents an inverse factorization of the graphical model, resulting in a conditional density estimator that takes as input particular values of the observed random variables, and returns an approximation to the distribution of the latent variables. This recognition model can be learned offline, independent of any particular dataset, prior to performing inference. The output of these networks can be used as automatically-learned, high-quality proposal distributions to accelerate sequential Monte Carlo across a diverse range of problem settings.

Slice Sampling on Hamiltonian Trajectories
Benjamin Bloem-Reddy Columbia University . John Cunningham Columbia University
Paper Abstract: Hamiltonian Monte Carlo and slice sampling are amongst the most widely used and studied classes of Markov chain Monte Carlo samplers. We connect these two methods and present Hamiltonian slice sampling, which allows slice sampling to be carried out along Hamiltonian trajectories, or transformations thereof. Hamiltonian slice sampling clarifies a class of model priors that induce closed-form slice samplers. More pragmatically, inheriting properties of slice samplers, it offers advantages over Hamiltonian Monte Carlo, in that it has fewer tunable hyperparameters and does not require gradient information. We demonstrate the utility of Hamiltonian slice sampling out of the box on problems ranging from Gaussian process regression to Pitman-Yor based mixture models.

Noisy Activation Functions
Caglar Gulcehre . Marcin Moczulski . Misha Denil . Yoshua Bengio U. of Montreal
Paper Abstract: Common nonlinear activation functions used in neural networks can cause training difficulties due to the saturation behavior of the activation function, which may hide dependencies that are not visible to vanilla SGD (using first-order gradients only). Gating mechanisms that use softly saturating activation functions to emulate the discrete switching of digital logic circuits are good examples of this. We propose to exploit the injection of appropriate noise so that the gradients may flow easily, even if the noiseless application of the activation function would yield zero gradients. Large noise will dominate the noise-free gradient and allow stochastic gradient descent to explore more. By adding noise only to the problematic parts of the activation function, we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing: when the amount of noise is annealed down, it becomes easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions with noisy variants helps optimization in many contexts, yielding state-of-the-art or competitive results on different datasets and tasks, especially when training seems to be the most difficult, e.g. when curriculum learning is necessary to obtain good results.
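A minimal sketch of the noise-injection idea above, assuming a hard-tanh nonlinearity: noise is added only where the activation saturates, with magnitude tied to how far the pre-activation sits inside the saturated regime. The paper's actual scheme is more refined; noise_std here is an illustrative knob.

import numpy as np

rng = np.random.default_rng(0)

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

def noisy_hard_tanh(x, noise_std=0.3, training=True):
    if not training:
        return hard_tanh(x)               # deterministic at test time
    delta = np.abs(x - hard_tanh(x))      # zero unless saturated
    noise = noise_std * delta * rng.normal(size=np.shape(x))
    return hard_tanh(x) + noise

x = np.linspace(-3.0, 3.0, 7)
print(noisy_hard_tanh(x))  # stochastic only at the saturated ends

Because delta vanishes on the linear part of hard_tanh, the unit behaves normally where gradients already flow and becomes stochastic exactly where the noiseless gradient would be zero.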
PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification
Ian En-Hsu Yen University of Texas at Austin . Xiangru Huang UT Austin . Pradeep Ravikumar UT Austin . Kai Zhong ICES department, University of Texas at Austin . Inderjit Dhillon UT Austin
Paper Abstract: We consider multiclass and multilabel classification with an extremely large number of classes, of which only a few are relevant to each instance. In such a setting, standard methods whose training and prediction costs are linear in the number of classes become intractable. State-of-the-art methods thus aim to reduce the complexity by exploiting correlations between labels, under the assumption that the similarity between labels can be captured by structures such as a low-rank matrix or a balanced tree. However, as the diversity of labels increases in the feature space, these structural assumptions can be easily violated, which leads to degraded testing performance. In this work, we show that a margin-maximizing loss with an l1 penalty, in the case of Extreme Classification, yields an extremely sparse solution both in the primal and in the dual without sacrificing the expressive power of the predictor. We thus propose a Fully-Corrective Block-Coordinate Frank-Wolfe (FC-BCFW) algorithm that exploits both primal and dual sparsity to achieve a complexity sublinear in the number of primal and dual variables. A bi-stochastic search method is proposed to further improve the efficiency. In our experiments on both multiclass and multilabel problems, the proposed method achieves significantly higher accuracy than existing approaches to Extreme Classification, with very competitive training and prediction time.
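To see why a margin-maximizing loss with an l1 penalty yields sparse primal solutions, here is a small sketch that optimizes that kind of objective with plain proximal subgradient steps. This is deliberately not the paper's FC-BCFW algorithm; the step sizes, penalty weight, and synthetic data are illustrative.

import numpy as np

def multiclass_hinge_subgrad(W, x, y):
    # Subgradient of the max-margin (Crammer-Singer style) loss for one example.
    scores = W @ x
    margins = scores - scores[y] + 1.0
    margins[y] = 0.0
    j = int(np.argmax(margins))
    g = np.zeros_like(W)
    if margins[j] > 0:
        g[j] += x
        g[y] -= x
    return g

def soft_threshold(W, t):
    # Proximal step for the l1 penalty: this is what zeroes out weights.
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def train(X, y, n_classes, lam=0.01, eta=0.1, epochs=10):
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(epochs):
        for i in range(len(y)):
            W -= eta * multiclass_hinge_subgrad(W, X[i], y[i])
            W = soft_threshold(W, eta * lam)
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
W_true = rng.normal(size=(20, 50))
y = np.argmax(X @ W_true.T, axis=1)
W = train(X, y, n_classes=20)
print("nonzero weights:", np.count_nonzero(W), "of", W.size)

Even this crude solver leaves much of W exactly zero; the paper's contribution is an algorithm that exploits this primal sparsity, together with the dual sparsity of the hinge loss, to keep the per-iteration cost sublinear in the number of classes.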